Speeding Up Pandas Code By Replacing Iterrows
Solution 1:
This works on the sample data. Does it work on your actual data?
import pandas as pd

# Sample data.
df = pd.DataFrame({
'InvoiceNo': ['Inv_001'] * 3 + ['Inv_002'] * 5,
'totalamt': [1720] * 3 + [1160] * 5,
'Item#': [260, 777, 888, 260, 777, 888, 999, 111],
'price': [1500, 100, 120, 700, 100, 120, 140, 100],
'qty': [1] * 8,
'MainCode': [0, 260, 260, 0, 260, 260, 260, 0],
'ProdTotal': [1500, 100, 120, 700, 100, 120, 140, 100]
})
subtotals = df[df['MainCode'].ne(0)].groupby(
['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
subtotals = subtotals.rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})
result = df[df['MainCode'].eq(0)]
result = result.merge(subtotals, on=['InvoiceNo', 'Item#'], how='left')
result['ProdTotal'] += result['ProdSubTotal'].fillna(0)
result['price'] = result.eval('ProdTotal / qty')
result = result.drop(columns=['ProdSubTotal'])
>>> result
InvoiceNo totalamt Item# price qty MainCode ProdTotal
0 Inv_001 1720 260 1720.0 1 0 1720.0
1 Inv_002 1160 260 1060.0 1 0 1060.0
2 Inv_002 1160 111 100.0 1 0 100.0
We first want to get the aggregate ProdTotal per InvoiceNo and MainCode (but only in the case where the MainCode is not equal to zero, .ne(0)):
subtotals = df[df['MainCode'].ne(0)].groupby(
['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
>>> subtotals
InvoiceNo MainCode ProdTotal
0 Inv_001 260 220
1 Inv_002 260 360
We then need the main rows from the original dataframe, so we keep only the rows where the MainCode equals zero, .eq(0).
result = df[df['MainCode'].eq(0)]
>>> result
InvoiceNo totalamt Item# price qty MainCode ProdTotal
0 Inv_001 1720 260 1500 1 0 1500
3 Inv_002 1160 260 700 1 0 700
7 Inv_002 1160 111 100 1 0 100
We want to join the subtotals to this result where the InvoiceNo matches and the Item# in result matches the MainCode in subtotals. One way to do this is to change the column names in subtotals and then perform a left merge:
subtotals = subtotals.rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})
result = result.merge(subtotals, on=['InvoiceNo', 'Item#'], how='left')
>>> result
InvoiceNo totalamt Item# price qty MainCode ProdTotal ProdSubTotal
0 Inv_001 1720 260 1500 1 0 1500 220.0
1 Inv_002 1160 260 700 1 0 700 360.0
2 Inv_002 1160 111 100 1 0 100 NaN
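If you are unsure whether the merge behaves the same way on your actual data, the following is a minimal sketch of the same left merge with two optional checks switched on. The validate and indicator arguments are existing options of DataFrame.merge; the names main_rows and checked are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Same sample data as in the answer.
df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3 + ['Inv_002'] * 5,
    'totalamt': [1720] * 3 + [1160] * 5,
    'Item#': [260, 777, 888, 260, 777, 888, 999, 111],
    'price': [1500, 100, 120, 700, 100, 120, 140, 100],
    'qty': [1] * 8,
    'MainCode': [0, 260, 260, 0, 260, 260, 260, 0],
    'ProdTotal': [1500, 100, 120, 700, 100, 120, 140, 100],
})

# Sub-item totals, renamed so they line up with the main rows' Item#.
subtotals = (
    df[df['MainCode'].ne(0)]
    .groupby(['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
    .rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})
)

# Main rows only (illustrative name).
main_rows = df[df['MainCode'].eq(0)]

checked = main_rows.merge(
    subtotals,
    on=['InvoiceNo', 'Item#'],
    how='left',
    validate='many_to_one',  # raises MergeError if subtotals has duplicate keys
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)
print(checked[['InvoiceNo', 'Item#', 'ProdSubTotal', '_merge']])
```

Rows marked left_only are main items with no sub-items, which is why ProdSubTotal is NaN for them and needs fillna(0) in the next step.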
Now we add the ProdSubTotal to the ProdTotal and drop the column.
result['ProdTotal'] += result['ProdSubTotal'].fillna(0)
result = result.drop(columns=['ProdSubTotal'])
>>> result
InvoiceNo totalamt Item# price qty MainCode ProdTotal
0 Inv_001 1720 260 1500 1 0 1720.0
1 Inv_002 1160 260 700 1 0 1060.0
2 Inv_002 1160 111 100 1 0 100.0
Finally, we recalculate the price given the qty and new ProdTotal.
result['price'] = result.eval('ProdTotal / qty')
>>> result
InvoiceNo totalamt Item# price qty MainCode ProdTotal
0 Inv_001 1720 260 1720.0 1 0 1720.0
1 Inv_002 1160 260 1060.0 1 0 1060.0
2 Inv_002 1160 111 100.0 1 0 100.0
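To answer the "does it work on your actual data?" question, one option is to wrap the steps above in a function and compare it against a slow row-by-row version. The sketch below does that. Both function names are hypothetical, and the iterrows loop is only an assumed reference for the intended logic (fold each sub-item's ProdTotal into the main row of the same invoice), not the code from the original question.

```python
import pandas as pd


def roll_up_subitems(df):
    """Vectorized version: fold sub-item totals into their main item row."""
    subtotals = (
        df[df['MainCode'].ne(0)]
        .groupby(['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
        .rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})
    )
    result = df[df['MainCode'].eq(0)].merge(
        subtotals, on=['InvoiceNo', 'Item#'], how='left')
    result['ProdTotal'] = result['ProdTotal'] + result['ProdSubTotal'].fillna(0)
    result['price'] = result['ProdTotal'] / result['qty']
    return result.drop(columns=['ProdSubTotal'])


def roll_up_subitems_loop(df):
    """Slow reference built on iterrows, meant only for cross-checking."""
    rows = []
    for _, row in df.iterrows():
        if row['MainCode'] != 0:
            continue  # sub-items are folded into their main row below
        # Sub-items of this main row: same invoice, MainCode == this Item#.
        subs = df[(df['InvoiceNo'] == row['InvoiceNo'])
                  & (df['MainCode'] == row['Item#'])]
        total = float(row['ProdTotal'] + subs['ProdTotal'].sum())
        out = row.to_dict()
        out['ProdTotal'] = total
        out['price'] = total / row['qty']
        rows.append(out)
    return pd.DataFrame(rows, columns=df.columns)


# Same sample data as in the answer.
df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3 + ['Inv_002'] * 5,
    'totalamt': [1720] * 3 + [1160] * 5,
    'Item#': [260, 777, 888, 260, 777, 888, 999, 111],
    'price': [1500, 100, 120, 700, 100, 120, 140, 100],
    'qty': [1] * 8,
    'MainCode': [0, 260, 260, 0, 260, 260, 260, 0],
    'ProdTotal': [1500, 100, 120, 700, 100, 120, 140, 100],
})

fast = roll_up_subitems(df)
slow = roll_up_subitems_loop(df)

# Raises AssertionError if the two versions disagree on any value.
pd.testing.assert_frame_equal(fast, slow, check_dtype=False)
print(fast)
```

Because the loop rescans the dataframe for every main row, it is best run on a small slice of the real data, for example a handful of invoices, before switching to the vectorized function for the full set.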
Solution 2:
Use a pandas merge. Split the data into two dataframes, one with invoice, total_amt, item#, price and qty, and another with invoice and maincode. Then do an inner join using the merge operation, after which you can sum the values of the relevant columns row-wise and drop the columns that are not required.
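The description above leaves the details open, so here is a minimal sketch of one possible reading of it, using the sample data from Solution 1. The names main, subs, joined and SubTotal are illustrative. The second frame also carries ProdTotal, which is an assumption needed so that there is something to sum.

```python
import pandas as pd

# Sample data from Solution 1.
df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3 + ['Inv_002'] * 5,
    'totalamt': [1720] * 3 + [1160] * 5,
    'Item#': [260, 777, 888, 260, 777, 888, 999, 111],
    'price': [1500, 100, 120, 700, 100, 120, 140, 100],
    'qty': [1] * 8,
    'MainCode': [0, 260, 260, 0, 260, 260, 260, 0],
    'ProdTotal': [1500, 100, 120, 700, 100, 120, 140, 100],
})

# Frame 1: the main items (MainCode == 0) with their own columns.
main = df.loc[df['MainCode'].eq(0),
              ['InvoiceNo', 'totalamt', 'Item#', 'price', 'qty', 'ProdTotal']]

# Frame 2: sub-item totals per invoice and main code.
subs = (
    df.loc[df['MainCode'].ne(0), ['InvoiceNo', 'MainCode', 'ProdTotal']]
    .groupby(['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
    .rename(columns={'ProdTotal': 'SubTotal'})
)

# Inner join: a main item's Item# must match a sub-item's MainCode.
joined = main.merge(subs, left_on=['InvoiceNo', 'Item#'],
                    right_on=['InvoiceNo', 'MainCode'], how='inner')

# Row-wise sum of the two total columns, then drop the helper columns.
joined['ProdTotal'] = joined[['ProdTotal', 'SubTotal']].sum(axis=1)
joined['price'] = joined['ProdTotal'] / joined['qty']
joined = joined.drop(columns=['SubTotal', 'MainCode'])
print(joined)
```

Note that with an inner join, main items that have no sub-items (Item# 111 on Inv_002 in the sample) drop out of the result. If those rows should be kept, a left join with the missing subtotal filled as zero, as in Solution 1, is the safer choice.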