Best Model For Variable Selection With Big Data?
Solution 1:
I would recommend taking a closer look at the variance of your variables and keeping those with the largest spread (pandas.DataFrame.var()), then eliminating the variables that correlate most strongly with others (pandas.DataFrame.corr()). As a further step, I'd suggest trying any of the methods mentioned earlier.
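A minimal sketch of this variance-and-correlation filter in pandas, using made-up data and illustrative cutoffs (0.1 for variance, 0.95 for absolute correlation) that you would tune for your own dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is near-constant, "d" is almost a copy of "a"
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(0, 5.0, 100),
    "b": rng.normal(0, 0.01, 100),
    "c": rng.normal(0, 3.0, 100),
})
df["d"] = df["a"] * 2 + rng.normal(0, 0.1, 100)

# Step 1: keep only features whose variance exceeds a threshold
variances = df.var()
kept = variances[variances > 0.1].index.tolist()

# Step 2: among those, drop one feature of each highly correlated pair
corr = df[kept].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
selected = [c for c in kept if c not in to_drop]
print(selected)
```

Here "b" is removed by the variance filter and "d" by the correlation filter, leaving the informative, non-redundant columns.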
Solution 2:
1. Variant A: Feature Selection with scikit-learn
For feature selection, scikit-learn offers a number of different approaches:
https://scikit-learn.org/stable/modules/feature_selection.html
This page sums up the comments from above.
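As one example of those approaches, here is a short sketch using SelectKBest with an ANOVA F-test on synthetic classification data (the dataset and k=5 are illustrative assumptions, not part of the original answer):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data: 500 samples, 20 features, only 5 of them informative
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Keep the 5 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (500, 5)
```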
2. Variant B: Feature Selection with linear regression
You can also read off feature importance by running a linear regression on your data: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html. The attribute reg.coef_ gives you the coefficients for your features; the higher the absolute value, the more important the feature. For example, a coefficient of 0.8 marks a really important feature, while 0.00001 marks an unimportant one. (Note that coefficients are only comparable this way if the features are on the same scale, e.g. standardized.)
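A minimal sketch of that idea, on made-up data where the target depends strongly on one feature and barely on the others; the standardization step is my addition so that coefficient magnitudes are actually comparable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: y depends strongly on feature 0, negligibly on the rest
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 0.8 * X[:, 0] + 0.0001 * X[:, 1] + rng.normal(0, 0.05, 200)

# Standardize first so the coefficients live on a common scale
X_std = StandardScaler().fit_transform(X)
reg = LinearRegression().fit(X_std, y)
importance = np.abs(reg.coef_)
print(importance)  # feature 0 dominates
```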
3. Variant C: PCA (not for the binary case)
Why do you want to drop your variables entirely? I would recommend using PCA - Principal Component Analysis: https://en.wikipedia.org/wiki/Principal_component_analysis.
The basic concept is to transform your 2000 features into a smaller space (maybe 1000 or whatever) while still being mathematically useful.
Scikit-learn has a good package for it: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
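A small sketch of that package on random stand-in data (50 features instead of 2000, and a 90% explained-variance target, both illustrative choices): passing a float to n_components tells PCA to keep the smallest number of components that explain at least that fraction of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical wide data: 300 samples, 50 features (stand-in for 2000)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))

# Keep enough principal components to explain 90% of the variance
pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # fewer columns, most information retained
```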