Groupby Pandas, Calculate Multiple Columns Based On Date Difference
I have a pandas dataframe shown below: CID RefID Date Group MID 100 1 1/01/2021 A 100 2 3/01/2021 A
Solution 1:
You could do something like this:
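import pandas as pd

def days_diff(sdf):
    # One result row per input row; days_diff starts as NaT, A as None.
    result = pd.DataFrame(
        {"days_diff": pd.NaT, "A": None}, index=sdf.index
    )
    start = sdf.at[sdf.index[0], "Date"]
    # Note: shift(1) exposes the MID value of the *previous* row.
    for index, day, next_MID_is_na in zip(
        sdf.index[1:], sdf.Date[1:], sdf.MID.shift(1).isna()[1:]
    ):
        diff = (day - start).days
        if diff <= 30 and next_MID_is_na:
            result.at[index, "days_diff"] = diff
        else:
            start = day
    # Every missing days_diff opens a new block, so the running count numbers the blocks.
    result.A = result.days_diff.isna().cumsum()
    return result

# group_keys=False keeps the original row index so the result aligns with df
# (recent pandas versions would otherwise prepend CID to the index).
df[["days_diff", "A"]] = (
    df[["CID", "Date", "MID"]].groupby("CID", group_keys=False).apply(days_diff)
)
df["B"] = df.RefID.where(df.A != df.A.shift(1)).ffill()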
Result for df created by
from io import StringIO
data = StringIO(
'''
CID RefID Date Group MID
100 1 1/01/2021 A
100 2 3/01/2021 A
100 3 4/01/2021 A 101
100 4 15/01/2021 A
100 5 18/01/2021 A
200 6 3/03/2021 B
200 7 4/04/2021 B
200 8 9/04/2021 B 102
200 9 25/04/2021 B
300 10 26/04/2021 C
300 11 27/05/2021 C
300 12 28/05/2021 C 103
''')
# sep=r"\s+" is equivalent to delim_whitespace=True, which is deprecated in recent pandas.
df = pd.read_csv(data, sep=r"\s+")
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")
is
    CID  RefID        Date Group    MID days_diff  A     B
0   100      1  2021-01-01     A    NaN       NaT  1   1.0
1   100      2  2021-01-03     A    NaN         2  1   1.0
2   100      3  2021-01-04     A  101.0         3  1   1.0
3   100      4  2021-01-15     A    NaN       NaT  2   4.0
4   100      5  2021-01-18     A    NaN         3  2   4.0
5   200      6  2021-03-03     B    NaN       NaT  1   6.0
6   200      7  2021-04-04     B    NaN       NaT  2   7.0
7   200      8  2021-04-09     B  102.0         5  2   7.0
8   200      9  2021-04-25     B    NaN       NaT  3   9.0
9   300     10  2021-04-26     C    NaN       NaT  1  10.0
10  300     11  2021-05-27     C    NaN       NaT  2  11.0
11  300     12  2021-05-28     C  103.0         1  2  11.0
A few explanations:
- The function days_diff produces a dataframe with the two columns days_diff and A. It is applied to the sub-dataframes of df obtained by grouping on column CID.
- First step: Initializing the result dataframe result (column days_diff filled with NaT, column A with None), and setting the starting value start for the day differences to the first day in the group.
- Afterwards essentially looping over the sub-dataframe after the first index, thereby grabbing the index, the value in column Date, and a boolean value next_MID_is_na. Despite its name, this flag signifies whether the value of the MID column in the previous row is NaN, because .shift(1) moves the values down by one row before .isna() is applied.
- In every step of the loop:
  - Calculation of the difference of the current day to the start day.
  - Checking the rules for the days_diff column:
    - If the difference of current and start day is <= 30 days and MID in the previous row is NaN -> day difference.
    - Otherwise -> reset of start to the current day.
- After finishing column days_diff, calculation of column A: result.days_diff.isna() is True (== 1) when days_diff is missing (NaT), False (== 0) otherwise. Therefore the cumulative sum (.cumsum()) gives the required result.
- After the groupby-apply to produce the columns days_diff and A, finally the calculation of column B: Selection of the RefID values where the values of A change (via .where(df.A != df.A.shift(1))), and then forward filling the remaining NaNs (via .ffill()).
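The three idioms the explanation relies on (.shift(1).isna(), .isna().cumsum(), and .where(...).ffill()) can be tried in isolation. The following is only a minimal sketch on small made-up series; the names mid, days, ref, block and b are hypothetical and not part of the answer above. The days and ref values mimic the first CID group of the result.

```python
import pandas as pd

nan = float("nan")

# Hypothetical toy values, chosen only to illustrate the idioms.
mid = pd.Series([10.0, nan, 30.0, nan])
days = pd.Series([nan, 2.0, 3.0, nan, 3.0])
ref = pd.Series([1, 2, 3, 4, 5])

# shift(1) moves values down one row, so each row sees the PREVIOUS row's value.
prev_is_na = mid.shift(1).isna()

# Each missing value starts a new block; the running sum numbers the blocks.
block = days.isna().cumsum()

# Keep ref only where the block number changes, then carry it forward.
b = ref.where(block != block.shift(1)).ffill()

print(prev_is_na.tolist())  # [True, False, True, False]
print(block.tolist())       # [1, 1, 1, 2, 2]
print(b.tolist())           # [1.0, 1.0, 1.0, 4.0, 4.0]
```

The float values in b arise because .where() introduces NaN into an integer series, which is also why column B in the result shows values such as 1.0 and 4.0.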