
Using Memmap Files For Batch Processing

I have a huge dataset on which I want to run PCA. I am limited by RAM and by the computational efficiency of PCA, so I switched to Incremental PCA. Dataset size: (140000, 3504).

Solution 1:

You have hit a problem with sklearn in a 32-bit environment. I presume you are using np.float16 because you're in a 32-bit environment and you need it to create the memmap object without numpy throwing errors.

In a 64-bit environment (tested with Python 3.3 64-bit on Windows), your code just works out of the box. So, if you have a 64-bit computer available, install 64-bit Python along with 64-bit numpy, scipy, and scikit-learn, and you are good to go.

Unfortunately, if you cannot do this, there is no easy fix. I have raised an issue on github here, but it is not easy to patch. The fundamental problem is that within the library, if your type is float16, a copy of the array to memory is triggered. The detail of this is below.

So, I hope you have access to a 64 bit environment with plenty of RAM. If not, you will have to split up your array yourself and batch process it, a rather larger task...
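For reference, here is a minimal sketch of what that manual batching could look like. It is not from the original answer: it assumes the memmap file already exists on disk, picks an arbitrary batch_size, and reuses the n_components=50 from Solution 2 below. Each batch is cast to float64 individually, so the full array is never copied into RAM at once.

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_samples, n_features = 140000, 3504
batch_size = 1000  # arbitrary choice; tune to the RAM you have available

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='r',
               shape=(n_samples, n_features))

ipca = IncrementalPCA(n_components=50)
for start in range(0, n_samples, batch_size):
    stop = min(start + batch_size, n_samples)
    # cast only this slice to float64; check_array then has nothing big to copy
    ipca.partial_fit(np.asarray(ut[start:stop], dtype=np.float64))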

N.B. It's really good to see you going to the source to diagnose your problem :) However, if you look at the line where the code fails (from the Traceback), you will see that the for batch in gen_batches loop that you found is never reached.


Detailed diagnosis:

The actual error generated by the OP's code:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(copy=False)
X_train = clf.fit_transform(ut)

is

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

The call to check_array (code link) uses dtype=np.float, but the original array has dtype=np.float16. Even though check_array() defaults to copy=False and passes this on to np.array(), the flag is ignored (as per the docs) because the requested dtype differs from the array's dtype; therefore np.array makes a copy.
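The same behaviour is easy to reproduce in isolation. This small illustration is not from the original answer; it uses np.asarray rather than np.array because the exact semantics of np.array's copy flag changed in later NumPy releases:

import numpy as np

a = np.zeros((4, 3), dtype=np.float16)

same_dtype = np.asarray(a, dtype=np.float16)  # dtype matches, no conversion needed
new_dtype = np.asarray(a, dtype=np.float64)   # dtype differs, so the data is copied

print(np.shares_memory(a, same_dtype))  # True: the original buffer is reused
print(np.shares_memory(a, new_dtype))   # False: a new float64 buffer was allocated

For a memmap-backed array, that new buffer lives in RAM, which is what triggers the MemoryError above.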

This could be solved in the IncrementalPCA code by ensuring that the dtype was preserved for arrays with dtype in (np.float16, np.float32, np.float64). However, when I tried that patch, it only pushed the MemoryError further along the chain of execution.

The same copying problem occurs when the code calls linalg.svd() in the main scipy code, and this time the error occurs during a call to gesdd(), a wrapped native LAPACK function. Thus, I do not think there is a way to patch this (at least not an easy way; it would at minimum require altering code in core scipy).

Solution 2:

Does the following alone trigger the crash?

X_train_mmap = np.memmap('my_array.mmap', dtype=np.float16,
                         mode='w+', shape=(n_samples, n_features))
clf = IncrementalPCA(n_components=50).fit(X_train_mmap)

If not, then you can use that model to transform (project) your data iteratively into a smaller array, processing it in batches:

from sklearn.utils import gen_batches

X_projected_mmap = np.memmap('my_result_array.mmap', dtype=np.float16,
                             mode='w+', shape=(n_samples, clf.n_components))
for batch in gen_batches(n_samples, clf.batch_size_):
    X_batch_projected = clf.transform(X_train_mmap[batch])
    X_projected_mmap[batch] = X_batch_projected

I have not tested that code but I hope that you get the idea.
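If you go this route, one small addition (not in the original answer) is to flush the result memmap to disk and reopen it read-only once the projection is finished:

# write pending changes to disk, then reopen the projection read-only
X_projected_mmap.flush()
X_projected = np.memmap('my_result_array.mmap', dtype=np.float16, mode='r',
                        shape=(n_samples, clf.n_components))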
