Best Model For Variable Selection With Big Data?
Solution 1:
I would recommend taking a closer look at the variance of your variables to keep those with the largest spread (pandas.DataFrame.var()) and eliminating the variables that correlate most strongly with others (pandas.DataFrame.corr()). As a further step, I'd suggest applying any of the methods mentioned earlier.
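A minimal sketch of this variance-and-correlation filter, using a small random DataFrame as a stand-in for your data (the 0.1 variance threshold and 0.9 correlation cutoff are illustrative values, not recommendations):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
# Make "d" nearly a duplicate of "a" so the correlation filter has something to drop
df["d"] = df["a"] * 0.95 + rng.normal(scale=0.1, size=100)

# Step 1: keep only columns whose variance exceeds a threshold
variances = df.var()
kept_by_var = variances[variances > 0.1].index.tolist()

# Step 2: drop one column of each highly correlated pair (|corr| > 0.9)
corr = df[kept_by_var].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
selected = [c for c in kept_by_var if c not in to_drop]
print(selected)
```

Here "d" is filtered out because it correlates almost perfectly with "a", while the independent columns survive both steps.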
Solution 2:
1. Variant A: Feature Selection with scikit-learn
For feature selection, scikit-learn offers a lot of different approaches:
https://scikit-learn.org/stable/modules/feature_selection.html
That page sums up the comments from above.
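As one example from that module, univariate selection with SelectKBest keeps only the k highest-scoring features (the dataset here is synthetic, just to show the shape change):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data: 20 features, only 5 of them informative
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Score each feature against the target and keep the best 5
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)
print(X_new.shape)
```

selector.get_support() additionally tells you which of the original columns were kept.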
2. Variant B: Feature Selection with Linear Regression
You can also read off feature importance if you run a linear regression on your data: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html . The attribute reg.coef_ will give you the coefficients for your features; the higher the absolute value, the more important the feature. For example, 0.8 indicates a really important feature, whereas 0.00001 indicates an unimportant one.
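A short sketch of that idea, with a synthetic target built from known coefficients so the ranking is easy to verify (note that comparing coefficient magnitudes like this is only fair when the features are on comparable scales):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# The target depends strongly on feature 0 and barely on feature 2
y = 0.8 * X[:, 0] + 0.00001 * X[:, 2] + rng.normal(scale=0.01, size=500)

reg = LinearRegression().fit(X, y)
# Rank features by absolute coefficient, largest first
ranking = np.argsort(-np.abs(reg.coef_))
print(reg.coef_)
print(ranking)
```

Feature 0 comes out on top of the ranking, matching the 0.8 coefficient it was generated with.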
3. Variant C: PCA (not for the binary case)
Why do you want to kill your variables? I would recommend using PCA (principal component analysis): https://en.wikipedia.org/wiki/Principal_component_analysis.
The basic concept is to transform your 2000 features into a smaller space (maybe 1000 or whatever), while still retaining most of the information mathematically.
Scikit-learn has a good package for it: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
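A minimal sketch of that package, using a random matrix as a stand-in for the 2000-feature dataset and an arbitrary target of 10 components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))  # stand-in for a wide feature matrix

# Project onto the 10 directions of highest variance
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
# Fraction of total variance the kept components explain
print(pca.explained_variance_ratio_.sum())
```

Instead of a fixed count, n_components can also be a float such as 0.95, in which case PCA keeps as many components as needed to explain that fraction of the variance.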