
How To Create Colormap Of Confidence Estimates For K-nearest Neighbor Classification

What I want: To display the results of my simple classification algorithm (see below) as a colormap in Python (the data is in 2D), where each class is assigned a color, and the confidence of the classification at each point determines the saturation of that color.

Solution 1:

So, you can calculate two numbers for each point in the 2D plane:

  • confidence (0 .. 1)
  • class (an integer)

One possibility is to calculate your own RGB map and show it with imshow. Like this:

import numpy as np
import matplotlib.pyplot as plt

# color vector with N x 3 colors, where N is the maximum number of classes and the colors are in RGB
mycolors = np.array([
  [ 0, 0, 1],
  [ 0, 1, 0],
  [ 1, 0, 1],
  [ 1, 1, 0],
  [ 0, 1, 1],
  [ 0, 0, 0],
  [ 0, .5, 1]])

# negate the colors
mycolors = 1 - mycolors 

# extents of the area
x0 = -2
x1 = 2
y0 = -2
y1 = 2

# grid over the area
X, Y = np.meshgrid(np.linspace(x0, x1, 1000), np.linspace(y0, y1, 1000))

# calculate the classification and probabilities
classes = classify_func(X, Y)
probabilities = prob_func(X, Y)

# create the basic color map by the class
img = mycolors[classes]

# fade the color by the probability (white for zero prob)
img *= probabilities[:,:,None]

# reverse the negative image back
img = 1 - img

# draw it
plt.imshow(img, extent=[x0,x1,y0,y1], origin='lower')
plt.axis('equal')

# save it
plt.savefig("mymap.png")

The trick of making negative colors is there just to make the maths a bit easier to understand. The code can of course be written much more densely.

I created two very simple functions to mimic the classification and probabilities:

def classify_func(X, Y):
    return np.round(abs(X+Y)).astype('int')

def prob_func(X, Y):
    return 1 - 2*abs(abs(X+Y) - classify_func(X, Y))

The former gives integer values from 0 to 4 for the given area, and the latter gives smoothly changing probabilities.

The result:

[Image: the resulting colormap]

If you do not like the way the colors fade towards zero probability, you may always create some non-linearity which is then applied when multiplying with the probabilities.
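For instance, a power-law (gamma-style) curve could be applied before the multiplication; the exponent 0.5 below is just an illustrative choice, not part of the original recipe:

# hypothetical non-linearity: probabilities ** 0.5 fades less aggressively towards zero
gamma = 0.5
img = mycolors[classes]
img *= (probabilities ** gamma)[:, :, None]
img = 1 - img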


Here the functions classify_func and prob_func are given two arrays as arguments: the first holds the X coordinates at which the values are to be calculated, and the second the Y coordinates. This works well if the underlying calculations are fully vectorized. With the code in the question this is not the case, as it only calculates single values.

In that case the code changes slightly:

x = np.linspace(x0, x1, 1000)
y = np.linspace(y0, y1, 1000)
classes = np.empty((len(y), len(x)), dtype='int')
probabilities = np.empty((len(y), len(x)))
for yi, yv in enumerate(y):
    for xi, xv in enumerate(x):
        classes[yi, xi], probabilities[yi, xi] = kNN((xv, yv), D)

Also, as your confidence estimates are not in the range 0..1, they need to be scaled:

probabilities -= np.amin(probabilities)
probabilities /= np.amax(probabilities)

After this is done, your map should look like this with extents -4,-4..4,4 (as per the color map: green=1, magenta=2, yellow=3):

[Image: kNN map]


To vectorize or not to vectorize - that is the question

This question pops up from time to time. There is a lot of information about vectorization on the web, but as a quick search did not reveal any short summaries, I'll give some thoughts here. This is quite a subjective matter, so everything here just represents my humble opinion. Other people may have different opinions.

There are three factors to consider:

  • performance
  • legibility
  • memory use

Usually (but not always) vectorization makes code faster, more difficult to understand, and more memory-hungry. Memory use is not usually a big problem, but with large arrays it is something to think about (hundreds of megabytes are usually ok, gigabytes are troublesome).

Trivial cases aside (element-wise simple operations, simple matrix operations), my approach is:

  • write the code without vectorization and check that it works
  • profile the code
  • vectorize the inner loops if needed and possible (1D vectorization)
  • create a 2D vectorization if it is simple

For example, a pixel-by-pixel image processing operation may lead to a situation where I end up with one-dimensional vectorizations (for each row). Then the inner loop (for each pixel) is fast, and the outer loop (for each row) does not really matter. The code may look much simpler if it does not try to be usable with all possible input dimensions.
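As a contrived sketch of that pattern (the per-pixel operation here is an arbitrary clipping transform, not anything from the answer above):

import numpy as np

img = np.random.rand(480, 640)   # a hypothetical grayscale image
out = np.empty_like(img)
for row in range(img.shape[0]):
    # the outer loop over rows stays explicit; the per-pixel inner loop is vectorized
    out[row] = np.clip(2.0 * img[row] - 0.5, 0.0, 1.0)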

I am such a lousy algorithmist that in more complex cases I like to verify my vectorized code against the non-vectorized versions. Hence I almost invariably first create the non-vectorized code before optimizing it at all.

Sometimes vectorization does not offer any performance benefit. For example, the handy function numpy.vectorize can be used to vectorize practically any function, but its documentation states:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

(This function could have been used in the code above, as well. I chose the loop version for legibility for people not very familiar with numpy.)
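Just for illustration, the double loop above could have been collapsed with numpy.vectorize roughly along these lines (kNN and D are the function and data from the question; the wrapper itself is an assumption):

# hypothetical sketch: wrapping the scalar kNN from the question with numpy.vectorize
knn_vec = np.vectorize(lambda xv, yv: kNN((xv, yv), D))
classes, probabilities = knn_vec(X, Y)   # X, Y from np.meshgrid as before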

Vectorization gives more performance only if the underlying vectorized functions are faster. They sometimes are, sometimes aren't. Only profiling and experience will tell. Also, it is not always necessary to vectorize everything. You may have an image processing algorithm which has both vectorized and pixel-by-pixel operations. In such cases numpy.vectorize is very useful.

I would try to vectorize the kNN search algorithm above at least to one dimension. There is no conditional code (it wouldn't be a show-stopper, but it would complicate things), and the algorithm is rather straightforward. The memory consumption will go up, but with one-dimensional vectorization it does not matter.
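As a rough illustration of what such a one-dimensional vectorization could look like, here is a sketch of a kNN classifier whose distance computation is vectorized over the training set; the (x, y, label) layout of D, the value of k, and the majority-vote confidence are assumptions, not taken from the question:

import numpy as np

def kNN_vectorized(query, D, k=5):
    # D is assumed to be an (n, 3) array of training samples: x, y, class label
    pts = D[:, :2]
    labels = D[:, 2].astype('int')
    # distances from the query point to all training points in one expression
    dist = np.sqrt(((pts - np.asarray(query)) ** 2).sum(axis=1))
    # class labels of the k nearest training points
    nearest = labels[np.argsort(dist)[:k]]
    # majority class; the fraction of neighbours voting for it serves as confidence
    counts = np.bincount(nearest)
    cls = counts.argmax()
    return cls, counts[cls] / k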

And it may happen that along the way you notice that an n-dimensional generalization is not much more complicated. Then do that, if memory allows.
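If memory does allow it, a fully vectorized version over the whole grid might look roughly like this; note the (n_grid × n_train) intermediate arrays, which are exactly the memory cost discussed above (the layout of D is again an assumption):

import numpy as np

def kNN_grid(X, Y, D, k=5):
    # fully vectorized over the whole grid; memory grows as n_grid * n_train
    grid = np.column_stack([X.ravel(), Y.ravel()])            # (n_grid, 2)
    pts, labels = D[:, :2], D[:, 2].astype('int')
    diff = grid[:, None, :] - pts[None, :, :]                 # (n_grid, n_train, 2)
    dist = (diff ** 2).sum(axis=-1)
    nearest = labels[np.argsort(dist, axis=1)[:, :k]]         # (n_grid, k)
    counts = np.apply_along_axis(np.bincount, 1, nearest, minlength=labels.max() + 1)
    cls = counts.argmax(axis=1)
    conf = counts[np.arange(len(cls)), cls] / k
    return cls.reshape(X.shape), conf.reshape(X.shape)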
