Sum Values Of Columns Starting With The Same String In Pandas Dataframe
I have a dataframe with about 100 columns that looks like this: Id Economics-1 English-107 English-2 History-3 Economics-zz Economics-2 \ 0 56 1 1
Solution 1:
I'd suggest that you do something different, which is to perform a transpose, groupby the prefix of the rows (your original columns), sum, and transpose again.
Consider the following:
df = pd.DataFrame({
'a_a': [1, 2, 3, 4],
'a_b': [2, 3, 4, 5],
'b_a': [1, 2, 3, 4],
'b_b': [2, 3, 4, 5],
})
Now
[s.split('_')[0] for s in df.T.index.values]
is the prefix of the columns. So
>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().Tab033155277399
does what you want.
In your case, make sure to split using the '-'
character.
Solution 2:
Using brilliant DSM's idea:
from __future__ import print_function
import pandas as pd
categories = set(['Economics', 'English', 'Histo', 'Literature'])
defcorrect_categories(cols):
return [cat for col in cols for cat in categories if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df)print(df.groupby(correct_categories(df.columns),axis=1).sum())
Output:
Economics English Histo Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
Here is another version, which takes care of "Histo/History" problematic..
from __future__ import print_function
import pandas as pd
#categories = set(['Economics', 'English', 'Histo', 'Literature'])## mapping: common starting pattern: desired name#
categories = {
'Histo': 'History',
'Economics': 'Economics',
'English': 'English',
'Literature': 'Literature'
}
defcorrect_categories(cols):
return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))#print(correct_categories(df.columns), len(correct_categories(df.columns)))#print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())
rslt = df.groupby(correct_categories(df.columns),axis=1).sum()
print(rslt)
print('History\n', rslt['History'])
Output:
EconomicsEnglishHistoryLiteratureId56112111100161100432011141110HistoryId56211060431141Name:History,dtype:int64
PS You may want to add missing categories to categories
map/dictionary
Solution 3:
You can use these to create sum of columns starting with specific name,
df['Economics']= df[list(df.filter(regex='Economics'))].sum(axis=1)
Post a Comment for "Sum Values Of Columns Starting With The Same String In Pandas Dataframe"