How To Do Fuzzy Match Merge To Match Based On A Few Columns
Solution one:
If your data is as clean as the example suggests (no typos in the names, only differences in capitalization), then you can do this:
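import pandas as pd

# Clean up the capitalization differences
df1["name"] = df1["name"].str.lower()
df2["name"] = df2["name"].str.lower()
# DataFrame.append was removed in pandas 2.0, so stack the rows with pd.concat
df_total = pd.concat([df1, df2], ignore_index=True)
# first() keeps the first non-null value of each column within each group
df_total = df_total.groupby(["store code", "name"]).first()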
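To see what this does, here is a minimal sketch on made-up data. The `city` and `phone` columns and all of the values are hypothetical, chosen only to show how rows from the two dataframes collapse into one row per store.

```python
import pandas as pd

# Hypothetical sample data: each frame knows a different attribute of the store
df1 = pd.DataFrame({
    "store code": [1, 2],
    "name": ["Alpha Mart", "beta foods"],
    "city": ["Austin", "Boston"],
})
df2 = pd.DataFrame({
    "store code": [1, 2],
    "name": ["alpha mart", "Beta Foods"],
    "phone": ["555-0100", "555-0101"],
})

df1["name"] = df1["name"].str.lower()
df2["name"] = df2["name"].str.lower()

# Stack the rows; columns missing from one frame are filled with NaN
df_total = pd.concat([df1, df2], ignore_index=True)
# first() keeps the first non-null value of each column per group
df_total = df_total.groupby(["store code", "name"]).first()
```

Each (store code, name) pair ends up as a single row carrying both `city` and `phone`, because `first()` skips the missing values contributed by the other dataframe.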
Solution two (if you have typos in the string values):
If there are typos in the names and you want to merge the rows by fuzzy matching, then follow these steps:
- We need these libraries to help us:
import pandas as pd
import networkx as nx
from fuzzywuzzy import fuzz
import itertools
from itertools import permutations
Let's lowercase the names first so that we are on the safe side:
df1["name"] = df1["name"].str.lower()
df2["name"] = df2["name"].str.lower()
Then let's start matching!
We need every combination of a name from df1 with a name from df2. We put these pairs in a dataframe so that we can score them with apply instead of writing an explicit for loop:
combs = list(itertools.product(df1["name"], df2["name"]))
combs = pd.DataFrame(combs)
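As a small illustration of what "all combinations" means here, the sketch below uses two short, hypothetical lists in place of `df1["name"]` and `df2["name"]`.

```python
import itertools

# Hypothetical name lists standing in for df1["name"] and df2["name"]
names1 = ["alpha mart", "beta foods"]
names2 = ["alfa mart", "beta food", "gamma"]

# Every name from the first list is paired with every name from the second
pairs = list(itertools.product(names1, names2))
n_pairs = len(pairs)  # equals len(names1) * len(names2)
```

The number of pairs is the product of the two lengths, so this step can become expensive in time and memory when both dataframes are large.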
Then we score each combination. The WRatio scorer will do just fine, but you can use your own custom function for matching:
combs['score'] = combs.apply(lambda x: fuzz.WRatio(x[0],x[1]), axis=1)
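If you do want a custom scorer, one possible sketch is shown here. It swaps in the standard library's `difflib.SequenceMatcher` rather than fuzzywuzzy, and the function name `simple_ratio` is hypothetical. Its scores will not be identical to `fuzz.WRatio`, so any threshold would need to be re-tuned.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Hypothetical custom scorer: similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Identical strings score 100; a near-miss scores higher than an unrelated name
same = simple_ratio("walmart", "walmart")
near = simple_ratio("walmart", "wallmart")
far = simple_ratio("walmart", "target")
```

It can be plugged into the scoring step in place of `fuzz.WRatio`, for example `combs.apply(lambda x: simple_ratio(x[0], x[1]), axis=1)`.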
Now, let's make a graph out of it. I used a minimum score of 90 as the criterion; you can use whichever threshold suits your data best:
threshold = 90
G_name = nx.from_pandas_edgelist(combs[combs['score'] >= threshold], 0, 1, create_using=nx.Graph)
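To make the graph step concrete, here is a minimal sketch that builds a graph from a few hypothetical name pairs (assumed to have already scored at or above the threshold) and lists its connected components.

```python
import networkx as nx

# Hypothetical pairs assumed to have scored at or above the threshold
matched_pairs = [
    ("alpha mart", "alfa mart"),
    ("alfa mart", "alpha-mart"),
    ("beta foods", "beta food"),
]

G = nx.Graph()
G.add_edges_from(matched_pairs)

# Each connected component is a set of names treated as the same entity
clusters = [set(c) for c in nx.connected_components(G)]
```

Note that grouping is transitive: "alpha mart" and "alpha-mart" land in the same cluster through "alfa mart" even though they were never paired directly. With a threshold that is too low, this chaining can merge names that are actually different.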
If two names meet the matching criterion, they become connected in our graph, so each connected cluster represents a single name. With this information we can build a dictionary that maps every variant of a name in our data to one unique spelling.
The next piece of code is a bit complex. In short, it creates a dataframe in which each row is one name and the columns hold its variations. It then melts the dataframe and builds a dictionary that has each variant of a name as a key and the unique representation of that name as the value. This dictionary lets us replace all variant names in the dataframes with a unique one so that the groupby can function correctly:
# Collect every cluster that contains more than one name
rows = []
for cluster in nx.connected_components(G_name):
    if len(cluster) != 1:
        rows.append(list(cluster))
# One row per cluster; column 0 holds the representative name
connected_names = pd.DataFrame(rows)
connected_names = connected_names\
    .reset_index(drop=True)\
    .melt(id_vars=0)\
    .drop('variable', axis=1)\
    .dropna()\
    .reset_index(drop=True)\
    .set_index('value')
# Map each variant (index) to its representative (column 0)
names_dict = connected_names.to_dict()[0]
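For comparison, the same kind of mapping can also be built with a plain loop, which may be easier to follow than the melt chain. This is only a sketch: the clusters are hypothetical, and the representative is picked alphabetically, which is an arbitrary choice that differs from the code above (there, the representative is whichever name happens to come first when the set is turned into a list, so it may vary between runs).

```python
# Hypothetical clusters, in the shape nx.connected_components would yield
clusters = [
    {"alpha mart", "alfa mart", "alpha-mart"},
    {"beta foods", "beta food"},
]

names_dict = {}
for cluster in clusters:
    # Pick a deterministic representative; sorted()[0] is an arbitrary choice
    representative = sorted(cluster)[0]
    for name in cluster:
        if name != representative:
            names_dict[name] = representative
```

Any other rule for choosing the representative (for example, the most frequent spelling) would work as long as it is applied consistently.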
Now we have the dictionary. All that remains is to replace the names and use the groupby method:
df1["name"] = df1["name"].replace(names_dict)
df2["name"] = df2["name"].replace(names_dict)
df_total = pd.concat([df1, df2], ignore_index=True)
df_total = df_total.groupby(["store code","name"]).first()
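The final step can be tried on its own with a small sketch. The mapping, the extra `city` and `phone` columns, and all values below are hypothetical. In practice `names_dict` would come from the clustering step.

```python
import pandas as pd

# Hypothetical mapping from name variants to one representative spelling
names_dict = {"alfa mart": "alpha mart", "beta food": "beta foods"}

df1 = pd.DataFrame({"store code": [1, 2],
                    "name": ["alpha mart", "beta foods"],
                    "city": ["Austin", "Boston"]})
df2 = pd.DataFrame({"store code": [1, 2],
                    "name": ["alfa mart", "beta food"],
                    "phone": ["555-0100", "555-0101"]})

# Series.replace with a dict swaps exact matches of each key for its value
df1["name"] = df1["name"].replace(names_dict)
df2["name"] = df2["name"].replace(names_dict)

df_total = pd.concat([df1, df2], ignore_index=True)
df_total = df_total.groupby(["store code", "name"]).first()
```

After the replacement both dataframes use the same spelling, so the misspelled rows from `df2` fall into the same groups as their counterparts in `df1`.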