Numpy, Pandas: What Is The Fastest Way To Calculate Dataset Row Value Basing On Previous N Values?
I have a dataset and I want to enrich it. I need to calculate a new dataset column that is some function of the previous N rows of another column. As an example, given I want to ca
Solution 1:
Use rolling/moving window functions.
Sample DF:
In [46]: df = pd.DataFrame({'date': pd.date_range('2000-01-01', freq='D', periods=15), 'temp': np.random.rand(15)*20})

In [47]: df
Out[47]:
          date       temp
0   2000-01-01  17.246616
1   2000-01-02  18.228468
2   2000-01-03   6.245991
3   2000-01-04   8.890069
4   2000-01-05   6.837285
5   2000-01-06   1.555924
6   2000-01-07  18.641918
7   2000-01-08   6.308174
8   2000-01-09  13.601203
9   2000-01-10   6.482098
10  2000-01-11  15.711497
11  2000-01-12  18.690925
12  2000-01-13   2.493110
13  2000-01-14  17.626622
14  2000-01-15   6.982129
Answer:
In [48]: df['higher_3avg'] = df.rolling(3)['temp'].mean().diff().gt(0)

In [49]: df
Out[49]:
          date       temp  higher_3avg
0   2000-01-01  17.246616        False
1   2000-01-02  18.228468        False
2   2000-01-03   6.245991        False
3   2000-01-04   8.890069        False
4   2000-01-05   6.837285        False
5   2000-01-06   1.555924        False
6   2000-01-07  18.641918         True
7   2000-01-08   6.308174        False
8   2000-01-09  13.601203         True
9   2000-01-10   6.482098        False
10  2000-01-11  15.711497         True
11  2000-01-12  18.690925         True
12  2000-01-13   2.493110        False
13  2000-01-14  17.626622         True
14  2000-01-15   6.982129        False
Explanation:
In [50]: df.rolling(3)['temp'].mean()
Out[50]:
0           NaN
1           NaN
2     13.907025
3     11.121509
4      7.324448
5      5.761093
6      9.011709
7      8.835339
8     12.850431
9      8.797158
10    11.931599
11    13.628173
12    12.298511
13    12.936886
14     9.033954
Name: temp, dtype: float64
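The one-liner in the answer chains three steps: a rolling mean, a row-to-row difference, and a comparison with zero. A minimal sketch that splits them apart is shown here; the temperature values are made up for illustration and are not the random sample above, and the intermediate names `avg3` and `delta` are hypothetical.

```python
import pandas as pd

# Illustrative values only, not the random sample DF above.
df = pd.DataFrame({'temp': [10.0, 12.0, 14.0, 8.0, 9.0, 20.0]})

# Step 1: mean of the current row and the two rows before it.
# The first two rows are NaN because fewer than 3 values are available.
avg3 = df['temp'].rolling(3).mean()

# Step 2: change of the 3-row mean relative to the previous row.
delta = avg3.diff()

# Step 3: True where the 3-row mean increased.
# Comparisons against NaN evaluate to False, so the leading rows are False.
df['higher_3avg'] = delta.gt(0)
```

Because NaN comparisons return False, the first three rows are always False, which is why rows 0 to 2 in the output above cannot be True regardless of the data.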
Solution 2:
For huge data, NumPy solutions can be around 30x faster. From here:
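import numpy as np

def moving_average(a, n=3):
    # Running total; subtracting the total from n positions back leaves each window's sum.
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n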
In [419]: %timeit moving_average(df.values)
38.2 µs ± 1.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [420]: %timeit df.rolling(3).mean()
1.42 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
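Before relying on the faster version, it may be worth checking that both approaches agree. The sketch below assumes a plain 1-D float array (`temps` is made up for illustration) and compares the cumulative-sum moving average with the pandas rolling mean. Note that the NumPy result has no leading NaN entries, so it is `n - 1` elements shorter than the rolling output.

```python
import numpy as np
import pandas as pd

def moving_average(a, n=3):
    # Running total; subtracting the total from n positions back
    # leaves the sum of each n-element window.
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

# Illustrative 1-D float data (an assumption, not the sample DF above).
temps = np.array([10.0, 12.0, 14.0, 8.0, 9.0, 20.0])

fast = moving_average(temps, n=3)
slow = pd.Series(temps).rolling(3).mean().to_numpy()

# Skip the first n - 1 NaN entries of the pandas result before comparing.
same = np.allclose(fast, slow[2:])
```

To write the result back into a DataFrame column of the original length, the first `n - 1` positions would need to be padded, for example with NaN.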