How To Pick The Rows Which Contain All The Keywords?
I have 2 CSV files as below:
File-1
procedure          code
anand database     321-87
shiva network      321-123
jana audit         321-56
kalai recruitment  321-10
in file-1, each word in a row is
Solution 1:
If df is relatively small, you could use str.contains. First, build a pattern from df.
df
   procedure          code
0  anand database     321-87
1  shiva network      321-123
2  jana audit         321-56
3  kalai recruitment  321-10
p = df.procedure.str.split().str.join('.*?').str.cat(sep='|')
p
'anand.*?database|shiva.*?network|jana.*?audit|kalai.*?recruitment'
Now, pass it to str.contains on df2.procedure.
df2[df2.procedure.str.contains(p)]
   s.no  procedure
0  1     kalai has a recruitment group
1  2     shiva is the network person in my office
3  4     anand is the database here
5  6     jana is working in the audit team
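Note that joining the words with .*? only matches when they appear in the same order as in df. If the order should not matter, one possible variation is to give each keyword its own lookahead. The following is a minimal sketch using only the standard re module; the helper name build_pattern and the sample sentences are hypothetical and not part of the original question.

```python
import re

# Keyword rows as in the "procedure" column of File-1.
procedures = ["anand database", "shiva network", "jana audit", "kalai recruitment"]

def build_pattern(rows):
    """Return an alternation with one branch per row.

    Each branch is a chain of lookaheads, so every word of that row
    must occur somewhere in the text, in any order.
    """
    branches = []
    for row in rows:
        # re.escape guards against regex metacharacters inside keywords.
        branch = "".join(rf"(?=.*\b{re.escape(word)}\b)" for word in row.split())
        branches.append(branch)
    return "|".join(branches)

pattern = build_pattern(procedures)

# Hypothetical sentences, used only to illustrate the behaviour.
sentences = [
    "the database is maintained by anand",       # both words, reversed order
    "anand works in the network team",           # words from two different rows
    "shiva is the network person in my office",  # both words, original order
]

matches = [s for s in sentences if re.search(pattern, s)]
print(matches)
```

The resulting string can be passed to str.contains in the same way as p, i.e. df2[df2.procedure.str.contains(pattern)].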
Solution 2:
An alternative to regex is flashtext, which will be faster if you have a large number of keywords, e.g.:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(df['procedure'].str.split().sum())
df2[df2['procedure'].apply(keyword_processor.extract_keywords).str.len()>1]
   s.no  procedure
0  1     kalai has a recruitment group
1  2     shiva is the network person in my office
3  4     anand is the database here
5  6     jana is working in the audit team
To know more about this library and its speed you can check here
Further Reading :
There is a need for an easy interface via pandas; let's wait until it is done.
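One caveat with the .str.len()>1 filter is that it counts any two extracted keywords, even if they come from different rows of df. If every keyword of a single row must be present, a plain set comparison is one way to check that. This is a rough sketch in pure Python, not part of flashtext; the function name matching_rows and the sample sentences are assumptions made for illustration, and it splits on whitespace only, so punctuation attached to a word would prevent a match.

```python
def matching_rows(sentences, keyword_rows):
    """Keep sentences that contain every word of at least one keyword row."""
    # One set of lowercase words per keyword row, e.g. {"anand", "database"}.
    keyword_sets = [set(row.lower().split()) for row in keyword_rows]
    kept = []
    for sentence in sentences:
        words = set(sentence.lower().split())
        # A sentence qualifies if some keyword set is a subset of its words.
        if any(keywords <= words for keywords in keyword_sets):
            kept.append(sentence)
    return kept

# Keyword rows as in File-1; the sentences below are hypothetical.
procedures = ["anand database", "shiva network", "jana audit", "kalai recruitment"]
sentences = [
    "kalai has a recruitment group",
    "jana asked shiva about the office",  # two keywords, but from different rows
    "anand is the database here",
]

result = matching_rows(sentences, procedures)
print(result)
```

With a DataFrame, the same function could be applied to the values of df2['procedure'] and df['procedure'].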