Wednesday, August 06, 2008

I hate math

Agglodex (http://www.agglodex.com) is humming along, eating feeds and tagging entries. The storage class is happily keeping all the data. Now comes the hard part: using that data.

My first task is to calculate how similar two users are so I can display a list of similar users on each profile. To accomplish this I have a collection called storage.relations that links two users and has fields for their similarity. A cron job periodically loads a user and compares them to as many other users as it can, storing the results of the analysis in the collection. The similarities will be based on the two users' terms. Each time an entry appears in a user's feed, it is analyzed for significant terms, which are stored in storage.terms. Each time a term is identified, the interest.count for that user and term is incremented, along with the term.count, which counts how many times a term is used sitewide.

My first attempt was using Euclidean distance. This algorithm loops over the interests that two users share, takes the difference between their interest.count values, squares it, and sums all the squares. I then divide 1 by 1 plus the square root of that sum, which gives me a number between 0 and 1, where 1 is complete similarity. This worked okay.
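Here is a minimal sketch of that Euclidean score in Python. It assumes each user's interests have already been loaded into a plain dict mapping term to interest.count; the function name and the dict shape are made up for illustration and are not the actual storage class.

```python
from math import sqrt


def euclidean_similarity(a, b):
    """Similarity in (0, 1] based on Euclidean distance over shared terms.

    a and b are hypothetical dicts of {term: interest count}.
    """
    shared = set(a) & set(b)
    if not shared:
        # No terms in common: treat the users as not similar at all.
        return 0.0
    # Sum of squared differences between the two users' counts.
    sum_of_squares = sum((a[term] - b[term]) ** 2 for term in shared)
    # 1 / (1 + distance): identical counts give 1, large distances approach 0.
    return 1.0 / (1.0 + sqrt(sum_of_squares))
```

Note that only shared terms enter the sum, so two users who agree on a single term score 1 no matter how many other interests they have.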

The second attempt was using the Pearson coefficient. This one was more complicated, and yielded a stranger score from -1 to 1, where 1 means that the two users' interest.count values rise and fall together perfectly across the terms they share (the counts do not have to be identical, only linearly related).
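A rough sketch of the Pearson score, under the same assumption that each user is a dict of term to interest.count (the names here are illustrative only). Returning 0 when there are fewer than two shared terms, or when one user's counts do not vary, is a choice made for this sketch; the coefficient is undefined in those cases.

```python
from math import sqrt


def pearson_similarity(a, b):
    """Pearson correlation in [-1, 1] over the terms both users share.

    a and b are hypothetical dicts of {term: interest count}.
    """
    shared = sorted(set(a) & set(b))
    n = len(shared)
    if n < 2:
        # Correlation needs at least two points; 0 is an arbitrary fallback.
        return 0.0
    xs = [a[term] for term in shared]
    ys = [b[term] for term in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # One user's counts are all equal, so the correlation is undefined.
        return 0.0
    return cov / sqrt(var_x * var_y)
```

One consequence worth knowing: a user with counts 1, 2, 3 and another with counts 2, 4, 6 on the same terms correlate perfectly, even though none of their counts match.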

For my third attempt I am trying to use the Tanimoto coefficient, which is an extension of the Jaccard index, but I am failing miserably. The Jaccard index is a way of calculating similarity between two sets of binary data, like two questionnaires of yes/no questions. However, my 'interests' have a 'count' and are therefore more like vectors, I think. Does anyone know how I could apply the Tanimoto coefficient to my users?
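One possible way to do this, offered as a sketch and not a definitive answer, is the vector form of the Tanimoto coefficient: T(a, b) = (a · b) / (|a|² + |b|² - a · b), where a term a user does not have counts as 0. The code assumes each user is a dict of term to interest.count; the function name is hypothetical.

```python
def tanimoto_similarity(a, b):
    """Vector Tanimoto coefficient for non-negative count vectors.

    a and b are hypothetical dicts of {term: interest count}; a term missing
    from a dict is treated as a count of 0.
    """
    # Dot product: only terms present in both dicts contribute.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    # Squared lengths use every term each user has, shared or not.
    norm_a = sum(count * count for count in a.values())
    norm_b = sum(count * count for count in b.values())
    denominator = norm_a + norm_b - dot
    if denominator == 0:
        # Both users have no interests (or only zero counts).
        return 0.0
    return dot / denominator
```

When every count is 0 or 1 this reduces to the ordinary Jaccard index (shared terms divided by total distinct terms). Terms that only one user has add to the denominator but not the numerator, so unshared interests pull the score down.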

One problem I have with these algorithms is that they don't take into account all the terms that the users *don't* share. If two users each have hundreds of interests but happen to share an interest in "Google" or "iPhone", is what they have in common more important than what they don't?

I suck at math, so any help I can get here would be much appreciated! Thanks in advance!
