
Evaluate Slope And Error For Specific Category For Statsmodels Ols Fit

I have a dataframe df with the following fields: weight, length, and animal. The first 2 are continuous variables, while animal is a categorical variable with the values cat, dog,

Solution 1:

Very brief background:

The general question here is: how does the prediction change if we change one of the explanatory variables, holding the other explanatory variables fixed or averaging over them?

In the nonlinear discrete models, there is a dedicated marginal-effects method, get_margeff, that calculates this, although it is not implemented for changes in categorical variables.

In the linear model, the prediction and the change in prediction are just linear functions of the estimated parameters, so we can (mis)use t_test to calculate the effect, its standard error, and a confidence interval for us.
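Why t_test works here: the change in prediction is c·β for the contrast vector c = x2 − x1, and the standard error of a linear combination of the parameters is sqrt(c Σ cᵀ), where Σ is the parameter covariance matrix. A minimal numpy sketch with made-up numbers (the params and cov values are illustrative, not from the fit below):

```python
import numpy as np

# hypothetical estimates: two parameters and their covariance matrix
params = np.array([1.0, 0.5])
cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])

# contrast vector: difference between two rows of the design matrix
c = np.array([0.0, 1.0])

effect = c @ params               # change in prediction, c.beta
se = np.sqrt(c @ cov @ c)        # delta-rule standard error
print(effect, se)                 # 0.5 0.3
```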

(Aside: More helper methods for statsmodels are in the works to make prediction and margin calculations like this easier; they will most likely be available later in the year.)

As a brief explanation of the following code:

  • I make up a similar example.
  • I define the explanatory variables for length = 1 or 2, for each animal type
  • Then, I calculate the difference in these explanatory variables
  • This defines linear combinations, or contrasts, of the parameters that can be used in t_test.

Finally, I compare with the result from predict to check that I didn't make any obvious mistakes. (I assume this is correct, but I wrote it pretty fast.)

import numpy as np
import pandas as pd

from statsmodels.regression.linear_model import OLS

np.random.seed(2)
nobs = 20
animal_names = np.array(['cat', 'dog', 'snake'])
animal_idx = np.random.randint(0, 3, size=nobs)  # random_integers was removed from numpy
animal = animal_names[animal_idx]
length = np.random.randn(nobs) + animal_idx
weight = np.random.randn(nobs) + animal_idx + length

data = pd.DataFrame(dict(length=length, weight=weight, animal=animal))

res = OLS.from_formula('weight ~ length * animal', data=data).fit()
print(res.summary())


data_predict1 = pd.DataFrame(dict(length=np.ones(3), weight=np.ones(3),
                                  animal=animal_names))

data_predict2 = pd.DataFrame(dict(length=2 * np.ones(3), weight=np.ones(3),
                                  animal=animal_names))

import patsy
x1 = patsy.dmatrix('length * animal', data_predict1)
x2 = patsy.dmatrix('length * animal', data_predict2)

tt = res.t_test(x2 - x1)
print(tt.summary(xname=animal_names.tolist()))

The result of the last print is

                             Test for Constraints                             
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
cat            1.0980      0.280      3.926      0.002         0.498     1.698
dog            0.9664      0.860      1.124      0.280        -0.878     2.811
snake          1.5930      0.428      3.720      0.002         0.675     2.511

We can verify the results by using predict and compare the difference in predicted weight if the length for a given animal type increases from 1 to 2:

>>> [res.predict({'length': 2, 'animal':[an]}) - res.predict({'length': 1, 'animal':[an]}) for an in animal_names]
[array([ 1.09801656]), array([ 0.96641455]), array([ 1.59301594])]
>>> tt.effect
array([ 1.09801656,  0.96641455,  1.59301594])
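The pieces of the constraint test are also available programmatically on the ContrastResults object returned by t_test, via the effect, sd, and conf_int() attributes. A self-contained sketch repeating the fit above (using randint in place of the removed random_integers):

```python
import numpy as np
import pandas as pd
import patsy
from statsmodels.regression.linear_model import OLS

np.random.seed(2)
nobs = 20
animal_names = np.array(['cat', 'dog', 'snake'])
animal_idx = np.random.randint(0, 3, size=nobs)
animal = animal_names[animal_idx]
length = np.random.randn(nobs) + animal_idx
weight = np.random.randn(nobs) + animal_idx + length
data = pd.DataFrame(dict(length=length, weight=weight, animal=animal))

res = OLS.from_formula('weight ~ length * animal', data=data).fit()

# design matrices at length = 1 and length = 2 for each animal type
d1 = pd.DataFrame(dict(length=np.ones(3), animal=animal_names))
d2 = pd.DataFrame(dict(length=2 * np.ones(3), animal=animal_names))
x1 = patsy.dmatrix('length * animal', d1)
x2 = patsy.dmatrix('length * animal', d2)

tt = res.t_test(np.asarray(x2) - np.asarray(x1))
print(tt.effect)      # point estimates of the per-animal slopes
print(tt.sd)          # their standard errors
print(tt.conf_int())  # 95% confidence intervals
```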

Note: the random seed was added to the code after these results were generated, so the numbers shown above may not replicate exactly.
