How To Impute Each Categorical Column In Numpy Array
Solution 1:
We could use SciPy's mode to get the most frequent value in each column. The leftover work would be to get the NaN indices and replace those entries in the input array with the mode values by indexing.
So, the implementation would look something like this -
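import numpy as np
from scipy.stats import mode
# Row and column indices of every NaN entry
R,C = np.where(np.isnan(x_nominal))
# Per-column mode; nan_policy='omit' keeps the NaNs out of the count
vals = mode(x_nominal,axis=0,nan_policy='omit')[0].ravel()
# Fill each NaN with the mode of its own column
x_nominal[R,C] = vals[C]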
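To make the indexing step concrete, here is a tiny worked example. The array contents and the hand-written vals are made up for illustration; in practice vals would come from the mode call.

```python
import numpy as np

# Toy data: categories encoded as floats, NaN marks a missing entry.
x_nominal = np.array([[1.0, 7.0],
                      [np.nan, 7.0],
                      [1.0, np.nan],
                      [2.0, 8.0]])

# Per-column fill values, written out by hand here (the column modes).
vals = np.array([1.0, 7.0])

# R and C hold the row and column index of every NaN.
R, C = np.where(np.isnan(x_nominal))

# vals[C] picks, for each missing cell, the fill value of its own column.
x_nominal[R, C] = vals[C]
```

The key point is that C doubles as an index into vals, so every missing cell is filled from its own column without an explicit loop.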
Please note that pandas' value_counts would choose the highest value when several categories/elements share the same highest count, i.e. in tie situations. With SciPy's mode, it would be the lowest one in such tie cases.
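If the tie rule matters for your data, it can be controlled explicitly. The following is a minimal NumPy-only sketch, not what either library does internally; the sample column and the names tie_low and tie_high are made up for illustration.

```python
import numpy as np

# 1.0 and 3.0 both appear twice, so they tie for the highest count.
col = np.array([3.0, 1.0, 3.0, 1.0, 2.0])

# np.unique returns the distinct values in sorted order with their counts.
values, counts = np.unique(col, return_counts=True)

# argmax returns the first maximum, so the smallest tied value wins.
tie_low = values[np.argmax(counts)]

# Scanning the reversed arrays makes the largest tied value win instead.
tie_high = values[::-1][np.argmax(counts[::-1])]
```

Here tie_low follows the "lowest one" rule and tie_high the "highest one" rule, so you can pick whichever convention you need.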
If you are dealing with a mixed dtype of strings and NaNs, I would suggest a few modifications, keeping the last step unchanged, to make it work -
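# Cast to fixed-width strings so NaN becomes the literal 'nan' (note: 'U3' keeps only 3 characters)
x_nominal_U3 = x_nominal.astype('U3')
R,C = np.where(x_nominal_U3=='nan')
vals = mode(x_nominal_U3,axis=0)[0].ravel()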
This throws a warning for the mode calculation: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored. But since we actually want to ignore NaNs for that mode calculation, we should be okay there.
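As far as I'm aware, newer SciPy releases restrict mode to numeric input, so the string version may fail there. In that case, here is a minimal NumPy-only sketch of the same idea. The helper name impute_string_columns is hypothetical, and it assumes that missing entries are NaN and that no real category is literally named 'nan'.

```python
import numpy as np

def impute_string_columns(x):
    """Hypothetical helper: fill NaN entries with each column's most frequent string."""
    out = np.array(x, dtype=object)   # work on a copy, leave the input untouched
    as_str = out.astype(str)          # NaN becomes the literal string 'nan'
    missing = as_str == 'nan'
    for j in range(out.shape[1]):
        present = as_str[~missing[:, j], j]
        if present.size == 0:
            continue                  # column is entirely missing, nothing to impute from
        values, counts = np.unique(present, return_counts=True)
        # Ties go to the first value in sorted order.
        out[missing[:, j], j] = values[np.argmax(counts)]
    return out

x = np.array([['a', 'x'],
              [np.nan, 'y'],
              ['a', np.nan],
              ['b', 'y']], dtype=object)
result = impute_string_columns(x)
```

Unlike the 'U3' cast, this keeps full-length category names, and the 'nan' placeholder is excluded before counting so it can never be picked as the mode.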