Sum Values Of Columns Starting With The Same String In Pandas Dataframe

I have a dataframe with about 100 columns that looks like this:

   Id  Economics-1  English-107  English-2  History-3  Economics-zz  Economics-2 \
0  56            1            1

Solution 1:

I'd suggest a different approach: transpose the dataframe, group the rows (your original columns) by their prefix, sum, and transpose back.

Consider the following:

df = pd.DataFrame({
        'a_a': [1, 2, 3, 4],
        'a_b': [2, 3, 4, 5],
        'b_a': [1, 2, 3, 4],
        'b_b': [2, 3, 4, 5],
    })

Now

[s.split('_')[0] for s in df.T.index.values]

gives the prefix of each column (here ['a', 'a', 'b', 'b']). So

>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
   a  b
0  3  3
1  5  5
2  7  7
3  9  9

does what you want.

In your case, make sure to split using the '-' character.
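As a minimal sketch of that adaptation, the snippet below applies the same transpose/groupby/transpose idea with '-' as the separator. The column names follow the question, but the values and the two Ids are made up for illustration, and the names prefixes and totals are my own.

```python
import pandas as pd

# Illustrative data only: column names follow the question, values are invented.
df = pd.DataFrame(
    {
        "Economics-1": [1, 0],
        "English-107": [1, 0],
        "English-2": [0, 1],
        "History-3": [2, 0],
        "Economics-zz": [0, 1],
        "Economics-2": [0, 0],
    },
    index=pd.Index([56, 11], name="Id"),
)

# The text before the first '-' is the group label for each column.
prefixes = [name.split("-")[0] for name in df.columns]

# Transpose so the columns become rows, group those rows by prefix,
# sum within each group, and transpose back.
totals = df.T.groupby(prefixes).sum().T
print(totals)
```

Note that the groups come back sorted by label, so the result columns are in alphabetical order rather than in the order of the original columns.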

Solution 2:

Using DSM's brilliant idea:

from __future__ import print_function

import pandas as pd

categories = set(['Economics', 'English', 'Histo', 'Literature'])

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]    

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')

#print(df)
print(df.groupby(correct_categories(df.columns), axis=1).sum())

Output:

    Economics  English  Histo  Literature
Id
56          1        1      2           1
11          1        0      0           1
6           1        1      0           0
43          2        0      1           1
14          1        1      1           0

Here is another version, which takes care of the "Histo"/"History" naming problem:

from __future__ import print_function

import pandas as pd

#categories = set(['Economics', 'English', 'Histo', 'Literature'])

#
# mapping: common starting pattern: desired name
#
categories = {
    'Histo': 'History',
    'Economics': 'Economics',
    'English': 'English',
    'Literature': 'Literature'
}

def correct_categories(cols):
    return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())

rslt = df.groupby(correct_categories(df.columns),axis=1).sum()
print(rslt)
print('History\n', rslt['History'])

Output:

    Economics  English  History  Literature
Id
56          1        1        2           1
11          1        0        0           1
6           1        1        0           0
43          2        0        1           1
14          1        1        1           0
History
 Id
56    2
11    0
6     0
43    1
14    1
Name: History, dtype: int64

PS: You may want to add any missing categories to the categories dictionary.
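One thing to watch for: the list comprehension in correct_categories silently skips any column that matches no category, so the label list can end up shorter than the column list. The sketch below is one possible way to guard against that, not part of the original answer. The helper category_for, the "Other" fallback label, and the sample data are all hypothetical. It also groups on the transposed frame instead of passing axis=1, since recent pandas releases warn that groupby(..., axis=1) is deprecated.

```python
import pandas as pd

# Same mapping as above: common starting pattern -> desired name.
categories = {
    "Histo": "History",
    "Economics": "Economics",
    "English": "English",
    "Literature": "Literature",
}

def category_for(col, fallback="Other"):
    """Return the category whose prefix starts `col`, or `fallback` (hypothetical label)."""
    for prefix, name in categories.items():
        if col.startswith(prefix):
            return name
    return fallback

# Invented sample data; "Physics-1" deliberately matches no category.
df = pd.DataFrame(
    {
        "Economics-1": [1, 1],
        "Histo-2": [2, 0],
        "History-3": [0, 1],
        "Physics-1": [5, 5],
    },
    index=pd.Index([56, 11], name="Id"),
)

# Exactly one label per column, so the label list always matches the columns.
labels = [category_for(col) for col in df.columns]

# Group the transposed frame by label instead of using axis=1.
rslt = df.T.groupby(labels).sum().T
print(rslt)
```

Unmatched columns then show up under "Other", which makes a missing category easy to spot instead of causing a length mismatch.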

Solution 3:

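You can also sum the columns whose names start with a specific string by selecting them with filter. Anchor the regex with ^ so that it matches only at the start of the column name:

df['Economics'] = df.filter(regex='^Economics').sum(axis=1)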
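To apply the same filter-based idea to several prefixes at once, a short loop is enough. This is only a sketch with invented data; the prefixes list and the totals name are assumptions, and it collects the sums in a separate frame so the original columns are left untouched.

```python
import re

import pandas as pd

# Invented sample data using the question's naming scheme.
df = pd.DataFrame(
    {
        "Economics-1": [1, 0],
        "Economics-2": [0, 2],
        "English-2": [1, 1],
        "History-3": [2, 0],
    },
    index=pd.Index([56, 11], name="Id"),
)

# Prefixes to total (assumed here; list whichever ones you need).
prefixes = ["Economics", "English"]

# For each prefix, keep only the columns that start with "<prefix>-"
# and sum them row by row.
totals = pd.DataFrame(
    {p: df.filter(regex="^" + re.escape(p) + "-").sum(axis=1) for p in prefixes}
)
print(totals)
```

Including the '-' in the pattern keeps a prefix from also matching a longer name that merely begins with the same letters.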
