
How To Pick The Rows Which Contains All The Keywords?

I have 2 csv files as below:

File-1

procedure          code
anand database     321-87
shiva network      321-123
jana audit         321-56
kalai recruitment  321-10

in file-1, each word in a row is

Solution 1:

If df is relatively small, you could use str.contains. First, build a pattern from df.

df

           procedure     code
0     anand database   321-87
1      shiva network  321-123
2         jana audit   321-56
3  kalai recruitment   321-10

p = df.procedure.str.split().str.join('.*?').str.cat(sep='|')

p
'anand.*?database|shiva.*?network|jana.*?audit|kalai.*?recruitment'

Now, pass it to str.contains on df2.procedure.

df2[df2.procedure.str.contains(p)]

   s.no                                 procedure
0     1             kalai has a recruitment group
1     2  shiva is the network person in my office
3     4                anand is the database here
5     6         jana is working in the audit team
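
Note that the pattern above only matches when the words appear in the same order as in df. If the order can vary, one option is to build the pattern from lookaheads instead. The following is a minimal sketch using only the standard re module on the sample data from the question; the helper name build_any_order_pattern and the test sentences are made up for illustration. The resulting pattern string should also be usable with str.contains, since it contains no capturing groups.

```python
import re

# Keyword rows copied from the sample File-1 in the question.
keyword_rows = ["anand database", "shiva network", "jana audit", "kalai recruitment"]

def build_any_order_pattern(rows):
    """Build one alternative per row; each alternative requires every
    word of that row to be present, in any order (hypothetical helper)."""
    alternatives = []
    for row in rows:
        # One lookahead per word: "somewhere ahead, this whole word occurs".
        lookaheads = "".join(
            rf"(?=.*\b{re.escape(word)}\b)" for word in row.split()
        )
        alternatives.append(f"(?:{lookaheads})")
    return "|".join(alternatives)

pattern = build_any_order_pattern(keyword_rows)

# Illustrative sentences, not taken from the original data.
sentences = [
    "kalai has a recruitment group",
    "the database here is run by anand",  # keywords in reversed order
    "anand works in the audit team",      # words from two different rows
]
matches = [s for s in sentences if re.search(pattern, s)]

# The order-sensitive pattern from Solution 1, for comparison.
ordered = "anand.*?database|shiva.*?network|jana.*?audit|kalai.*?recruitment"
ordered_matches = [s for s in sentences if re.search(ordered, s)]
```

Here matches keeps the first two sentences, while ordered_matches keeps only the first, because the second sentence has its keywords in reversed order.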

Solution 2:

An alternative to regex is flashtext, which will be faster if you have a large number of keywords:

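from flashtext import KeywordProcessor

# Register every individual word from df.procedure as a keyword
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(df['procedure'].str.split().sum())

# Keep the rows of df2 in which more than one keyword was found
df2[df2['procedure'].apply(keyword_processor.extract_keywords).str.len()>1]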

   s.no                                 procedure
0     1             kalai has a recruitment group
1     2  shiva is the network person in my office
3     4                anand is the database here
5     6         jana is working in the audit team
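
One caveat: the filter above counts the keywords found in each sentence, so a sentence containing two words from different rows of df (say, "anand" and "audit") would presumably pass as well. If every word must come from the same row, a set comparison is one way to check it. The sketch below uses only plain Python; the function name contains_full_row is hypothetical, and it assumes words are separated by whitespace with no attached punctuation.

```python
# Keyword rows copied from the sample File-1 in the question.
keyword_rows = ["anand database", "shiva network", "jana audit", "kalai recruitment"]

# One set of words per keyword row.
keyword_sets = [set(row.split()) for row in keyword_rows]

def contains_full_row(sentence):
    """Return True if all words of at least one keyword row occur in the
    sentence (hypothetical helper; assumes whitespace-separated words)."""
    words = set(sentence.lower().split())
    # `<=` on sets is the subset test: every keyword must be among the words.
    return any(keywords <= words for keywords in keyword_sets)

# Illustrative sentences; the last one mixes words from two different rows.
sentences = [
    "shiva is the network person in my office",
    "jana is working in the audit team",
    "anand works in the audit team",
]
kept = [s for s in sentences if contains_full_row(s)]
```

With pandas, the same function could be applied to a column, for example df2[df2['procedure'].apply(contains_full_row)].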

To learn more about this library and its speed, see the further reading below.

Further Reading :

  1. Docs

  2. Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.

An easy-to-use pandas interface for flashtext is still needed; let's wait until it's done.
