Sum Large Pandas Dataframe Based On Smaller Date Ranges
I have a large pandas dataframe that has hourly data associated with it. I then want to parse that into 'monthly' data that sums the hourly data. However, the months aren't neces
Solution 1:
Use pd.merge_asof (only available in pandas 0.19 and later) in combination with query and groupby:
pd.merge_asof(df, month, left_on='date', right_on='start') \
.query('date <= end').groupby(['start', 'end']).num.sum().reset_index()
explanation
pd.merge_asof
From docs
For each row in the left DataFrame, we select the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key. Both DataFrames must be sorted by the key.
But this only takes the start date into account.
query
I take care of the end date with query, since I now conveniently have end in my dataframe after pd.merge_asof.
groupby
I trust this part is obvious.
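For anyone who wants to run the one-liner end to end, here is a minimal sketch. The month frame and its start/end boundaries are made up for illustration (the question does not show them), and num is set to 1 for every hour so that each sum is simply the number of hours in the range.

```python
import numpy as np
import pandas as pd

# Hourly data for Q1 2015; every value is 1, so a sum equals a count of hours.
dates = pd.date_range('2015-01-01 00:00', '2015-03-31 23:00', freq='60min')
df = pd.DataFrame({'date': dates.astype('datetime64[ns]'),
                   'num': np.ones(len(dates), dtype=int)})

# Hypothetical custom "months" that do not follow the calendar (assumed data).
month = pd.DataFrame({
    'start': pd.to_datetime(['2015-01-01', '2015-01-24', '2015-02-24']),
    'end': pd.to_datetime(['2015-01-23 23:00', '2015-02-23 23:00',
                           '2015-03-20 23:00']),
}).astype('datetime64[ns]')

# Step 1: attach to each hourly row the last range whose start <= date.
# Both frames are already sorted by their keys, as merge_asof requires.
merged = pd.merge_asof(df, month, left_on='date', right_on='start')

# Step 2: drop rows that fall after the end of the matched range.
# Step 3: sum per range.
result = (merged.query('date <= end')
                .groupby(['start', 'end'])['num'].sum()
                .reset_index())
print(result)
```

Hours after the last end (here, 21 March onward) still match the last start in step 1, which is exactly why the query filter is needed.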
Solution 2:
Maybe you can convert to a period and add a number of days
import numpy as np
import pandas as pd

# create data
dates = pd.Series(pd.date_range('1/1/2015 00:00','3/31/2015 23:45',freq='1H'))
nums = np.random.randint(0,100,dates.count())
df = pd.DataFrame({'date':dates, 'num':nums})
# offset days and then create period
df['periods'] = (df.date + pd.tseries.offsets.Day(23)).dt.to_period('M')
# group and sum
df.groupby('periods')['num'].sum()
Output
periods
2015-01 10051
2015-02 34229
2015-03 37311
2015-04 26655
You can then shift the dates back and make new columns
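One way the shifting back might look, as a rough sketch: subtract the same 23-day offset from each period's start and end timestamps. The column names start and end are my own choice, and num is set to 1 per hour here (instead of random values) so the sums are easy to check.

```python
import numpy as np
import pandas as pd

# Hourly data for Q1 2015; every value is 1, so a sum equals a count of hours.
dates = pd.Series(pd.date_range('2015-01-01 00:00', '2015-03-31 23:00',
                                freq='60min'))
df = pd.DataFrame({'date': dates, 'num': np.ones(len(dates), dtype=int)})

# Same idea as above: shift forward by 23 days, then bucket by calendar month.
offset = pd.Timedelta(days=23)
df['periods'] = (df['date'] + offset).dt.to_period('M')

sums = df.groupby('periods')['num'].sum().reset_index()

# Shift the period boundaries back by the same offset to recover the
# actual date range each bucket covers (assumed column names).
sums['start'] = sums['periods'].dt.start_time - offset
sums['end'] = sums['periods'].dt.end_time - offset
print(sums)
```

Note that the first and last buckets are cut off by the data itself, so their start and end columns describe the nominal range rather than the rows actually present.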