
Speeding Up Pandas Code By Replacing Iterrows

I have a Dataframe like below +-----------+----------+-------+-------+-----+----------+-----------+ | InvoiceNo | totalamt | Item# | price | qty | MainCode | ProdTotal | +---------

Solution 1:

This works on the sample data. Does it work on your actual data?

import pandas as pd

# Sample data.
df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3 + ['Inv_002'] * 5,
    'totalamt': [1720] * 3 + [1160] * 5,
    'Item#': [260, 777, 888, 260, 777, 888, 999, 111],
    'price': [1500, 100, 120, 700, 100, 120, 140, 100],
    'qty': [1] * 8,
    'MainCode': [0, 260, 260, 0, 260, 260, 260, 0],
    'ProdTotal': [1500, 100, 120, 700, 100, 120, 140, 100]
})

subtotals = df[df['MainCode'].ne(0)].groupby(
    ['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
subtotals = subtotals.rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})

result = df[df['MainCode'].eq(0)]
result = result.merge(subtotals, on=['InvoiceNo', 'Item#'], how='left')
result['ProdTotal'] += result['ProdSubTotal'].fillna(0)
result['price'] = result.eval('ProdTotal / qty')
result = result.drop(columns=['ProdSubTotal'])

>>> result
  InvoiceNo  totalamt  Item#   price  qty  MainCode  ProdTotal
0   Inv_001      1720    260  1720.0    1         0     1720.0
1   Inv_002      1160    260  1060.0    1         0     1060.0
2   Inv_002      1160    111   100.0    1         0      100.0

We first want to get the aggregate ProdTotal per InvoiceNo and MainCode (but only in the case where the MainCode is not equal to zero, .ne(0)):

subtotals = df[df['MainCode'].ne(0)].groupby(
    ['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
>>> subtotals
  InvoiceNo  MainCode  ProdTotal
0   Inv_001       260        220
1   Inv_002       260        360

We then need to exclude these sub-item rows from the main dataframe, so we keep only the rows where the MainCode equals zero, .eq(0).

result = df[df['MainCode'].eq(0)]
>>> result
  InvoiceNo  totalamt  Item#  price  qty  MainCode  ProdTotal
0   Inv_001      1720    260   1500    1         0       1500
3   Inv_002      1160    260    700    1         0        700
7   Inv_002      1160    111    100    1         0        100

We want to join the subtotals to this result where the InvoiceNo matches and the Item# in result matches the MainCode in subtotals. One way to do this is to change the column names in subtotals and then perform a left merge, so that main items without any sub-items are kept:

subtotals = subtotals.rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})
result = result.merge(subtotals, on=['InvoiceNo', 'Item#'], how='left')
>>> result
  InvoiceNo  totalamt  Item#  price  qty  MainCode  ProdTotal  ProdSubTotal
0   Inv_001      1720    260   1500    1         0       1500         220.0
1   Inv_002      1160    260    700    1         0        700         360.0
2   Inv_002      1160    111    100    1         0        100           NaN

Now we add the ProdSubTotal to the ProdTotal and drop the column.

result['ProdTotal'] += result['ProdSubTotal'].fillna(0)
result = result.drop(columns=['ProdSubTotal'])
>>> result
  InvoiceNo  totalamt  Item#  price  qty  MainCode  ProdTotal
0   Inv_001      1720    260   1500    1         0     1720.0
1   Inv_002      1160    260    700    1         0     1060.0
2   Inv_002      1160    111    100    1         0      100.0

Finally, we recalculate the price given the qty and new ProdTotal.

result['price'] = result.eval('ProdTotal / qty')
>>> result
  InvoiceNo  totalamt  Item#   price  qty  MainCode  ProdTotal
0   Inv_001      1720    260  1720.0    1         0     1720.0
1   Inv_002      1160    260  1060.0    1         0     1060.0
2   Inv_002      1160    111   100.0    1         0      100.0
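Since the question is whether this holds on the real data, one way to check is to wrap the steps in a function and compare the rolled-up totals with totalamt. This is only a sketch: the function name roll_up_subitems is made up for illustration, and the check assumes that totalamt is meant to equal the sum of all line items on an invoice, which is true for the sample data but may not be for yours.

```python
import pandas as pd


def roll_up_subitems(df):
    """Fold sub-item ProdTotal values into their main item (hypothetical helper)."""
    is_main = df['MainCode'].eq(0)
    # Subtotal of the sub-items per invoice and parent item.
    subtotals = (
        df[~is_main]
        .groupby(['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
        .rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})
    )
    # Left merge keeps main items that have no sub-items.
    result = df[is_main].merge(subtotals, on=['InvoiceNo', 'Item#'], how='left')
    result['ProdTotal'] = result['ProdTotal'] + result['ProdSubTotal'].fillna(0)
    result['price'] = result['ProdTotal'] / result['qty']
    return result.drop(columns=['ProdSubTotal'])


df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3 + ['Inv_002'] * 5,
    'totalamt': [1720] * 3 + [1160] * 5,
    'Item#': [260, 777, 888, 260, 777, 888, 999, 111],
    'price': [1500, 100, 120, 700, 100, 120, 140, 100],
    'qty': [1] * 8,
    'MainCode': [0, 260, 260, 0, 260, 260, 260, 0],
    'ProdTotal': [1500, 100, 120, 700, 100, 120, 140, 100],
})

result = roll_up_subitems(df)

# Assumption: totalamt should equal the sum of the rolled-up ProdTotal per invoice.
check = result.groupby('InvoiceNo').agg(
    rolled=('ProdTotal', 'sum'),
    expected=('totalamt', 'first'),
)
mismatched = check[check['rolled'].ne(check['expected'])]
print(mismatched)  # an empty frame means every invoice adds up
```

Any invoice that shows up in mismatched is worth a look. A likely cause is a sub-item whose MainCode does not match any main Item# on the same invoice, because the left merge silently leaves such rows out.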

Solution 2:

Use a pandas merge. Split the data into two dataframes, one with the invoice, total amount, item number, price and qty, and another with the invoice and main code. Then do an inner join using merge, after which you can sum the values of the columns row-wise and drop the columns that are no longer required.
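The answer gives no code, so the following is only one possible reading of it, not the answerer's own implementation. It assumes the second dataframe also carries ProdTotal (otherwise there is nothing to sum) and that the sub-items are aggregated before the join; the names main, subs and SubTotal are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3 + ['Inv_002'] * 5,
    'totalamt': [1720] * 3 + [1160] * 5,
    'Item#': [260, 777, 888, 260, 777, 888, 999, 111],
    'price': [1500, 100, 120, 700, 100, 120, 140, 100],
    'qty': [1] * 8,
    'MainCode': [0, 260, 260, 0, 260, 260, 260, 0],
    'ProdTotal': [1500, 100, 120, 700, 100, 120, 140, 100],
})

# First dataframe: the main items (MainCode == 0).
main = df.loc[df['MainCode'].eq(0),
              ['InvoiceNo', 'totalamt', 'Item#', 'price', 'qty', 'ProdTotal']]

# Second dataframe: sub-items, reduced to one row per invoice and parent item.
subs = (
    df.loc[df['MainCode'].ne(0), ['InvoiceNo', 'MainCode', 'ProdTotal']]
    .groupby(['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
    .rename(columns={'ProdTotal': 'SubTotal'})
)

# Inner join: a main item matches when its Item# equals the sub-items' MainCode.
merged = main.merge(
    subs,
    left_on=['InvoiceNo', 'Item#'],
    right_on=['InvoiceNo', 'MainCode'],
    how='inner',
)

# Sum row-wise, recompute the price, and drop the helper columns.
merged['ProdTotal'] = merged[['ProdTotal', 'SubTotal']].sum(axis=1)
merged['price'] = merged['ProdTotal'] / merged['qty']
merged = merged.drop(columns=['MainCode', 'SubTotal'])
print(merged)
```

Note that with an inner join, main items that have no sub-items (Item# 111 on Inv_002 in the sample) drop out of the result. If those rows should be kept, how='left' is the safer choice, as in Solution 1.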
