Optimized Method For Calculating Cosine Distance In Python

December 26, 2023 Post a Comment

I wrote a method to calculate the cosine distance between two arrays: def cosine_distance(a, b): if len(a) != len(b): return False numerator = 0 denoma = 0

Solution 1:

If you can use SciPy, you can use cosine from spatial.distance:

http://docs.scipy.org/doc/scipy/reference/spatial.distance.html

If you can't use SciPy, you could try to obtain a small speedup by rewriting your Python (EDIT: but it didn't work out like I thought it would, see below).

from itertools import izip
from math import sqrt

defcosine_distance(a, b):
    iflen(a) != len(b):
        raise ValueError, "a and b must be same length"
    numerator = sum(tup[0] * tup[1] for tup in izip(a,b))
    denoma = sum(avalue ** 2for avalue in a)
    denomb = sum(bvalue ** 2for bvalue in b)
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result

It is better to raise an exception when the lengths of a and b are mismatched.

By using generator expressions inside of calls to sum() you can calculate your values with most of the work being done by the C code inside of Python. This should be faster than using a for loop.

I haven't timed this so I can't guess how much faster it might be. But the SciPy code is almost certainly written in C or C++ and it should be about as fast as you can get.

If you are doing bioinformatics in Python, you really should be using SciPy anyway.

EDIT: Darius Bacon timed my code and found it slower. So I timed my code and... yes, it is slower. The lesson for all: when you are trying to speed things up, don't guess, measure.

I am baffled as to why my attempt to put more work on the C internals of Python is slower. I tried it for lists of length 1000 and it was still slower.

I can't spend any more time on trying to hack the Python cleverly. If you need more speed, I suggest you try SciPy.

EDIT: I just tested by hand, without timeit. I find that for short a and b, the old code is faster; for long a and b, the new code is faster; in both cases the difference is not large. (I'm now wondering if I can trust timeit on my Windows computer; I want to try this test again on Linux.) I wouldn't change working code to try to get it faster. And one more time I urge you to try SciPy. :-)

Solution 2:

(I originally thought) you're not going to speed it up a lot without breaking out to C (like numpy or scipy) or changing what you compute. But here's how I'd try that, anyway:

from itertools import imap
from math import sqrt
from operator import mul

defcosine_distance(a, b):
    assertlen(a) == len(b)
    return1 - (sum(imap(mul, a, b))
                / sqrt(sum(imap(mul, a, a))
                       * sum(imap(mul, b, b))))

It's roughly twice as fast in Python 2.6 with 500k-element arrays. (After changing map to imap, following Jarret Hardie.)

Here's a tweaked version of the original poster's revised code:

from itertools import izip

defcosine_distance(a, b):
    assertlen(a) == len(b)
    ab_sum, a_sum, b_sum = 0, 0, 0for ai, bi in izip(a, b):
        ab_sum += ai * bi
        a_sum += ai * ai
        b_sum += bi * bi
    return1 - ab_sum / sqrt(a_sum * b_sum)

It's ugly, but it does come out faster. . .

Edit: And try Psyco! It speeds up the final version by another factor of 4. How could I forget?

Solution 3:

No need to take abs() of a[i] and b[i] if you're squaring it.

Store a[i] and b[i] in temporary variables, to avoid doing the indexing more than once. Maybe the compiler can optimize this, but maybe not.

Check into the **2 operator. Is it simplifying it into a multiply, or is it using a general power function (log - multiply by 2 - antilog).

Don't do sqrt twice (though the cost of that is small). Do sqrt(denoma * denomb).

Solution 4:

Similar to Darius Bacon's answer, I've been toying with operator and itertools to produce a faster answer. The following seems to be 1/3 faster on a 500-item array according to timeit:

from math import sqrt
from itertools import imap
from operator import mul

defop_cosine(a, b):
    dot_prod = sum(imap(mul, a, b))
    a_veclen = sqrt(sum(i ** 2for i in a))
    b_veclen = sqrt(sum(i ** 2for i in b))

    return1 - dot_prod / (a_veclen * b_veclen)

Solution 5:

This is faster for arrays of around 1000+ elements.

from numpy import array
def cosine_distance(a, b):
    a=array(a)
    b=array(b)
    numerator=(a*b).sum()
    denoma=(a*a).sum()
    denomb=(b*b).sum()
    result=1- numerator /sqrt(denoma*denomb)
    returnresult

Python Tutorial for Beginners