
Iterate Through Df Rows Faster

I am trying to iterate through the rows of a pandas df to get data from one column of each row and use that data to add new columns. The code is listed below but it is VERY slow. Is

Solution 1:

It's hard to say exactly what you're trying to do. However, if you're looping through rows, chances are that there is a better way to do it.

For example, given a CSV file that looks like this:

Event_Start_Time,TPRev,Subtest
4/12/19 06:00,"this. string. has dots.. in it.",{'A_Dict':'maybe?'}
6/10/19 04:27,"another stri.ng wi.th d.ots.",{'A_Dict':'aVal'}

You may want to:

  1. Format Event_Start_Time as datetime.
  2. Get the week number from Event_Start_Time.
  3. Remove all the dots (.) from the strings in column TPRev.
  4. Expand a dictionary contained in Subtest to its own column.

Rather than looping through the rows, consider working by columns: write the operation once, as if for the first 'cell' of the column, and pandas applies it all the way down.
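To make the contrast concrete, here is a minimal sketch that computes the same new column twice, once with a row loop and once with a single column expression. The column name `value` and the doubling calculation are hypothetical, invented only for this illustration.

```python
import pandas as pd

# Hypothetical data: one numeric column, invented for this illustration.
df = pd.DataFrame({'value': [1, 2, 3, 4]})

# Row by row: iterrows() builds a Series for every row, which is slow.
looped = []
for _, row in df.iterrows():
    looped.append(row['value'] * 2)
df['doubled_loop'] = looped

# Column-wise: one expression applied to the whole column at once.
df['doubled_vec'] = df['value'] * 2

# Both approaches produce the same values.
print((df['doubled_loop'] == df['doubled_vec']).all())
```

On a frame this small the difference is invisible, but the loop's cost grows with every row while the column expression runs in optimized code.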

Code:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)

   Event_Start_Time   TPRev                             Subtest
0  4/12/19 06:00      this. string. has dots.. in it.   {'A_Dict':'maybe?'}
1  6/10/19 04:27      another stri.ng wi.th d.ots.      {'A_Dict':'aVal'}


# format 'Event_Start_Time' as datetime (day/month/year)
df['Event_Start_Time'] = pd.to_datetime(df['Event_Start_Time'], format='%d/%m/%y %H:%M')

# get the week number from 'Event_Start_Time'
df['Week_Number'] = df['Event_Start_Time'].dt.isocalendar().week

# replace all '.' (periods) in the 'TPRev' column
df['TPRev'] = df['TPRev'].str.replace('.', '', regex=False)

# get a dictionary string out of column 'Subtest' and put into a new column
df = pd.concat([df.drop(['Subtest'], axis=1), df['Subtest'].map(eval).apply(pd.Series)], axis=1)

print(df)

   Event_Start_Time      TPRev                       Week_Number    A_Dict
0  2019-12-04 06:00:00   this string has dots in it  49             maybe?
1  2019-10-06 04:27:00   another string with dots    40             aVal


print(df.info())

Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Event_Start_Time  2 non-null      datetime64[ns]
 1   TPRev             2 non-null      object
 2   Week_Number       2 non-null      UInt32
 3   A_Dict            2 non-null      object
dtypes: UInt32(1), datetime64[ns](1), object(2)

So you end up with a dataframe like this...

   Event_Start_Time      TPRev                       Week_Number    A_Dict
0  2019-12-04 06:00:00   this string has dots in it  49             maybe?
1  2019-10-06 04:27:00   another string with dots    40             aVal
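One caveat on the Subtest step: `eval` executes whatever code the string contains. If the column holds only plain dictionary literals, as in the sample data, `ast.literal_eval` from the standard library is a safer substitute. The sketch below is a swapped-in alternative, not the code used above, and it rebuilds a small frame by hand instead of reading data.csv so that it runs on its own.

```python
import ast
import pandas as pd

# Stand-in for the frame read from data.csv (only the relevant columns).
df = pd.DataFrame({
    'TPRev': ['this string has dots in it', 'another string with dots'],
    'Subtest': ["{'A_Dict':'maybe?'}", "{'A_Dict':'aVal'}"],
})

# Parse each string into a dict; literal_eval only accepts Python literals.
parsed = df['Subtest'].map(ast.literal_eval)

# Build the new columns from the list of dicts, keeping the original index.
expanded = pd.DataFrame(parsed.tolist(), index=df.index)

# Swap the raw 'Subtest' column for the expanded columns.
df = pd.concat([df.drop(columns='Subtest'), expanded], axis=1)
print(df.columns.tolist())
```

Building the frame from `parsed.tolist()` also avoids creating one `pd.Series` per row, which is what `.apply(pd.Series)` does.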

Obviously you'll probably want to do other things. Look at your data. Make a list of what you want to do to each column or what new columns you need. Don't worry about how right now, as chances are it's possible and has been done before; you just need to find the existing method.

You may write down something like "get the difference in days between the current row and the row beneath", etc. Finally, search out how to do the formatting or calculation you require. Break the problem down.
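As one possible way to do the "difference in days" item without a loop, `shift(-1)` lines each row up with the row beneath it so the subtraction happens on whole columns. The dates and the column name `Days_To_Next` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical, already-sorted timestamps invented for this example.
df = pd.DataFrame({
    'Event_Start_Time': pd.to_datetime(
        ['2019-10-06', '2019-10-09', '2019-12-04']
    )
})

# shift(-1) moves every value up one row, so each row sees the row beneath it.
next_time = df['Event_Start_Time'].shift(-1)

# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days.
df['Days_To_Next'] = (next_time - df['Event_Start_Time']).dt.days

print(df)
```

The last row has no row beneath it, so its difference is missing (NaN) and is not zero.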
