
Pairwise Similarities With Numpy Without For Loop

I am trying to calculate similarities between each pair of rows using numpy. Would you please advise how this can be done without a for loop?

import numpy as np
x = np.array([[1, 2, 3], [4

Solution 1:

You can replace the for-loop using itertools, which may be more efficient (I'm assuming efficiency is your actual goal):

from itertools import product, starmap
# Evaluate the similarity function on every ordered pair of rows, then reshape into an n x n matrix.
it = starmap(my_similarity_func, product(x, x))
similarity_matrix = np.fromiter(it, float).reshape((len(x), len(x)))
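For a concrete, self-contained run, here is a minimal sketch; the question never shows my_similarity_func, so a plain dot product is assumed purely for illustration, and the truncated array is completed as a 2 by 3 matrix:

import numpy as np
from itertools import product, starmap

def my_similarity_func(a, b):
    # Placeholder similarity: plain dot product of two rows (an assumption).
    return np.dot(a, b)

# Assumed completion of the question's 2 x 3 array.
x = np.array([[1, 2, 3],
              [4, 5, 6]])

it = starmap(my_similarity_func, product(x, x))
similarity_matrix = np.fromiter(it, float).reshape((len(x), len(x)))
print(similarity_matrix)  # 2 x 2 matrix of row-pair similarities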

Solution 2:

A couple of options have been given for removing the for loop. Assuming this is due to concerns about efficiency, I've provided some benchmarks. Profiling this sort of thing depends heavily on what the function being called does and how large the array is. Timing several of the methods given here (using np.dot as the similarity function, with the setup sketched below) gives pretty similar results, with the for loop being surprisingly competitive.
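The full benchmark harness is not shown; a minimal sketch of the setup these timings presumably assume (np.dot as the similarity function, a nested-loop baseline for test_using_for_loop, and SciPy's distance module imported as dist) would be:

import numpy as np
from itertools import product, starmap
from scipy.spatial import distance as dist

# Assumed similarity function for this round of timings: a plain dot product.
my_similarity_func = np.dot

def test_using_for_loop(x):
    # Nested-loop baseline: fill an n x n matrix one row pair at a time.
    n = len(x)
    out = np.empty((n, n))
    for i, r1 in enumerate(x):
        for j, r2 in enumerate(x):
            out[i, j] = my_similarity_func(r1, r2)
    return out

# Small array from the question (assumed 2 x 3).
x = np.array([[1, 2, 3],
              [4, 5, 6]])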

%timeit tmp=test_using_for_loop(x)
5.88 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit array([[my_similarity_func(r1, r2) for r1 in x] for r2 in x])
6.54 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit it = starmap(my_similarity_func, product(x, x)); similarity_matrix = np.fromiter(it, float).reshape((len(x), len(x)))
5.34 µs ± 364 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit dist.cdist(x,x,metric=my_similarity_func)
15 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

On the other hand, the data given is rather small. In many applications it is typical to compute a similarity metric on hundreds or thousands of samples. And after all, why optimize for a 2 by 3 matrix? Using larger data

x = np.random.randn(3000, 150)

The results are

%timeit tmp=test_using_for_loop(x)
5.69 s ± 54.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit array([[my_similarity_func(r1, r2) for r1 in x] for r2 in x])
5.17 s ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit it = starmap(my_similarity_func, product(x, x)); similarity_matrix = np.fromiter(it, float).reshape((len(x), len(x)))
3.74 s ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit dist.cdist(x,x,metric=my_similarity_func)
8.08 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There's also the question brought up by coldspeed and several other commenters: would it be better to optimize the similarity function itself rather than how it is called? A custom similarity function will not be as well optimized as np.dot. So, using a purposely worst-case (and absolutely useless) similarity function:

def my_similarity_func(a, b):
    # Deliberately wasteful: a dot product plus extra square roots and sums.
    calc1 = a.dot(b)
    calc2 = np.sqrt(abs(np.sum(a) + np.sum(b)))
    calc3 = calc1**2 / calc2 + 1
    return calc3

What was a fairly substantial difference in performance almost disappears. The percentage difference between the itertools method and basic looping is around 5 or 6% (still larger than expected, but not by much):

%timeit tmp=test_using_for_loop(x)
1min 11s ± 2.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit array([[my_similarity_func(r1, r2) for r1 in x] for r2 in x])
1min 7s ± 468 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit it = starmap(my_similarity_func, product(x, x)); similarity_matrix = np.fromiter(it, float).reshape((len(x), len(x)))
1min 7s ± 322 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit dist.cdist(x,x,metric=my_similarity_func)
1min 8s ± 1.31 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In summary, there are several ways to remove the for loop, but performance-wise they will probably all be similar. If performance matters, it would be best to rewrite the similarity function in a way that can take advantage of broadcasting or other optimizations. Doing that to the worst-case similarity function here reduces the running time to a few hundred milliseconds:

%timeit x.dot(x.T)**2 / sqrt(abs(sum(x, 1)[:,None] + sum(x.T, 0))) + 1
128 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
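As a quick sanity check (a sketch; the %timeit line above presumably ran with numpy's sum and sqrt in the namespace, pylab-style, and the 50 x 20 array here is an arbitrary choice), the broadcast expression can be compared against the loop-built matrix:

import numpy as np

def my_similarity_func(a, b):
    calc1 = a.dot(b)
    calc2 = np.sqrt(abs(np.sum(a) + np.sum(b)))
    return calc1**2 / calc2 + 1

x = np.random.randn(50, 20)

# Broadcast version: all pairwise dot products at once, then the same scaling.
vectorized = x.dot(x.T)**2 / np.sqrt(abs(np.sum(x, 1)[:, None] + np.sum(x.T, 0))) + 1

looped = np.array([[my_similarity_func(r1, r2) for r1 in x] for r2 in x])
print(np.allclose(vectorized, looped))  # should print True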
