
Remove Duplicates From Rows And Columns (cell) In A Dataframe, Python

I have two columns with a lot of duplicated items per cell in a dataframe. Something similar to this:

Index  x  y
1      1  ec, us, us, gbr, lst
2      5  ec, us, us, us
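For a reproducible setup, here is a minimal sketch of such a frame (column names and values taken from the example above):

import pandas as pd

df = pd.DataFrame({
    'x': [1, 5],
    'y': ['ec, us, us, gbr, lst', 'ec, us, us, us'],
})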

Solution 1:

Split on the delimiter, apply set to drop duplicates, then join back, i.e.

df['y'].str.split(', ').apply(set).str.join(', ')

0         us, ec, gbr, lst
1                   us, ec
2         us, ec, gbr, lst
3               us, ec, ir
4    us, lst, ec, gbr, chn
Name: y, dtype: object
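Note that this expression returns a new Series; to change the dataframe itself, assign it back (a small usage sketch):

df['y'] = df['y'].str.split(', ').apply(set).str.join(', ')

Because sets iterate in arbitrary order, the joined items may come out in any order from run to run, as the sample output above suggests.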

Update based on a comment:

# Strip the braces, whitespace, and literal 'nan' tokens with '',
# then split, apply set, join, and collapse any leftover commas
(df['y'].str.replace(r'nan|[{}\s]', '', regex=True)
        .str.split(',')
        .apply(set)
        .str.join(',')
        .str.strip(',')
        .str.replace(r',{2,}', ',', regex=True))
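As a rough illustration of what each step handles (the messy input shape here is an assumption, based on the comment the update responds to):

import pandas as pd

s = pd.Series(['{ec, nan, us, us}'])
cleaned = (s.str.replace(r'nan|[{}\s]', '', regex=True)  # -> 'ec,,us,us'
            .str.split(',')
            .apply(set)                                  # -> {'', 'ec', 'us'}
            .str.join(',')
            .str.strip(',')       # drop edge commas left by the empty item
            .str.replace(r',{2,}', ',', regex=True))  # collapse inner runs
print(cleaned[0])  # e.g. 'ec,us' (set order is arbitrary)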

Solution 2:

Try this:

df['y'] = df['y'].apply(lambda x: ', '.join(sorted(set(x.split(', ')))))
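The sorted() call makes the output deterministic (alphabetical) instead of depending on arbitrary set order. On the second sample row, for instance:

', '.join(sorted(set('ec, us, us, us'.split(', '))))  # -> 'ec, us'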

Solution 3:

If you don't care about item order, and assuming the data type of everything in column y is a string, you can use the following snippet:

df['y'] = df['y'].apply(lambda s: ', '.join(set(s.split(', '))))

The set() conversion is what removes duplicates. Note that sets do not preserve insertion order in any Python version; it is dicts that preserve insertion order (guaranteed since Python 3.7), so the item order produced here is arbitrary.
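If you do care about order, a common variant (not from the original answer) is to route the items through dict.fromkeys instead of set, which keeps the first occurrence of each item in insertion order:

df['y'] = df['y'].apply(lambda s: ', '.join(dict.fromkeys(s.split(', '))))

With this, 'ec, us, us, gbr, lst' becomes 'ec, us, gbr, lst' rather than some arbitrary permutation.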

Solution 4:

Use the apply method on the dataframe with axis=1 so the function runs once per row.

# change this function according to your needs;
# split first, otherwise set(row.y) would de-duplicate characters
def dedup(row):
    return list(set(row.y.split(', ')))

df['deduped'] = df.apply(dedup, axis=1)
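If a string is preferred over a list in the new column, the same idea can be joined back (an assumed variant, not from the original answer):

df['deduped'] = df.apply(lambda row: ', '.join(set(row.y.split(', '))), axis=1)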
