Merge Pandas Dataframe With Key Duplicates

I have 2 dataframes, both have a key column which could have duplicates, but the dataframes mostly have the same duplicated keys. I'd like to merge these dataframes on that key, bu

Solution 1:

faster again

%%cython
# using cython in jupyter notebook
# in another cell run `%load_ext Cython`
from collections import defaultdict
import numpy as np

def cg(x):
    # running count of how many times each key has been seen so far
    cnt = defaultdict(lambda: 0)

    for j in x.tolist():
        cnt[j] += 1
        yield cnt[j]


def fastcount(x):
    return [i for i in cg(x)]

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)

df1.merge(df2, how='outer').drop(columns='cc')

faster answer; not scalable (it builds a boolean matrix with one row per unique key and one column per row of data)

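import numpy as np

def fastcount(x):
    # inv[i] is the position of x[i] among the sorted unique keys
    unq, inv = np.unique(x, return_inverse=True)
    # m[k, i] is True where row i holds unique key k
    m = np.arange(len(unq))[:, None] == inv
    # running count along each key's row, kept only where that key occurs
    return (m.cumsum(1) * m).sum(0)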

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)

df1.merge(df2, how='outer').drop(columns='cc')
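
As a quick sanity check of the NumPy counter, the sketch below runs it on a made-up key column (the `keys` Series is an assumption for illustration, not data from the question) and compares it with `groupby(...).cumcount()`. The NumPy version counts from 1 and `cumcount` counts from 0; that offset should not matter for the merge as long as both frames use the same counter.

```python
import numpy as np
import pandas as pd

def fastcount(x):
    unq, inv = np.unique(x, return_inverse=True)
    m = np.arange(len(unq))[:, None] == inv
    return (m.cumsum(1) * m).sum(0)

# Hypothetical key column with repeated values.
keys = pd.Series(["a", "b", "a", "a", "b", "c"])

counts = fastcount(keys.values)
print(counts.tolist())  # [1, 1, 2, 3, 2, 1]

# groupby + cumcount gives the same numbering, starting at 0 instead of 1.
reference = keys.groupby(keys).cumcount() + 1
print((counts == reference.values).all())  # True
```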

old answer

df1['cc'] = df1.groupby('key').cumcount()
df2['cc'] = df2.groupby('key').cumcount()

df1.merge(df2, how='outer').drop(columns='cc')
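
To see what the helper counter column buys you, here is a minimal sketch on made-up data (the frames and the `A`/`B` columns are assumptions, not the asker's data). A plain merge on `key` pairs every duplicate on the left with every duplicate on the right. Merging on `key` plus a per-key occurrence counter pairs the first `a` with the first `a`, the second with the second, and so on.

```python
import pandas as pd

# Hypothetical sample frames; the question's real data is not shown.
df1 = pd.DataFrame({"key": ["a", "a", "b", "c"], "A": [1, 2, 3, 4]})
df2 = pd.DataFrame({"key": ["a", "a", "b", "d"], "B": [10, 20, 30, 40]})

# Plain merge: the two "a" rows on each side produce 2 x 2 = 4 rows.
plain = df1.merge(df2, on="key", how="outer")

# Number each occurrence of a key (0, 1, 2, ...) and merge on key + counter.
left = df1.assign(cc=df1.groupby("key").cumcount())
right = df2.assign(cc=df2.groupby("key").cumcount())
paired = left.merge(right, on=["key", "cc"], how="outer").drop(columns="cc")

print(len(plain))   # 7
print(len(paired))  # 5
```

Using `assign` here leaves the input frames untouched, which is only a stylistic choice; writing the `cc` column onto `df1` and `df2` directly, as in the answer, gives the same merged result.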

Solution 2:

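import pandas as pd

# use the key column as the index of both frames
df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)

# inner join on the index, then turn the index back into a 'key' column
merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
merged_df.reset_index('key', drop=False, inplace=True)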
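
It may help to check how this index-based join treats duplicated keys. The sketch below uses made-up frames (an assumption for illustration only): with a non-unique index, an inner join still pairs every left `a` with every right `a`, and keys present on only one side are dropped, so the result differs from the counter-based merge in Solution 1.

```python
import pandas as pd

# Hypothetical sample frames, indexed by the key column.
df1 = pd.DataFrame({"key": ["a", "a", "b", "c"], "A": [1, 2, 3, 4]}).set_index("key")
df2 = pd.DataFrame({"key": ["a", "a", "b", "d"], "B": [10, 20, 30, 40]}).set_index("key")

merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how="inner")
merged_df = merged_df.reset_index()

# Four "a" combinations plus one "b"; "c" and "d" have no match and are dropped.
print(len(merged_df))  # 5
```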