
Split Text In Cells And Create Additional Rows For The Tokens

Let's suppose that I have the following in a DataFrame in pandas:

id  text
1   I am the first document and I am very happy.
2   Here is the second document and it likes playing ten

Solution 1:

You can use something like:

def divide_chunks(l, n):
    # looping till length l
    for i in range(0, len(l), n):
        yield l[i:i + n]
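As a quick check of what the generator yields, here is a minimal sketch that applies it to the words of a single sentence. The sentence is the first row from the question; the variable names words and chunks are only illustrative.

```python
def divide_chunks(l, n):
    """Yield successive n-sized slices of the list l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

# Split one document into words, then group the words three at a time
words = "I am the first document and I am very happy.".split()
chunks = list(divide_chunks(words, 3))
print(chunks)
```

The last chunk is simply shorter when the word count is not a multiple of n.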

Then, using unnesting (a user-defined helper that expands list columns into rows, not a pandas built-in):

df['text_new']=df.text.apply(lambda x: list(divide_chunks(x.split(),3)))
df_new=unnesting(df,['text_new']).drop(columns='text')
df_new.text_new=df_new.text_new.apply(' '.join)
print(df_new)

              text_new  id
0             I am the   1
0   first document and   1
0            I am very   1
0               happy.   1
1          Here is the   2
1  second document and   2
1     it likes playing   2
1              tennis.   2
2          This is the   3
2   third document and   3
2        it looks very   3
2          good today.   3
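If the unnesting helper is not available, the same reshaping can probably be done with the built-in DataFrame.explode (pandas 0.25 or later). This is a swapped-in technique, not the answer's original method, and the sample DataFrame below is an assumption reconstructed from the outputs shown in the answers.

```python
import pandas as pd

def divide_chunks(l, n):
    """Yield successive n-sized slices of the list l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

# Sample data (assumed; rebuilt from the printed outputs)
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "I am the first document and I am very happy.",
        "Here is the second document and it likes playing tennis.",
        "This is the third document and it looks very good today.",
    ],
})

# Join each 3-word chunk back into a string: one list of strings per row
df["text_new"] = df["text"].apply(
    lambda x: [" ".join(chunk) for chunk in divide_chunks(x.split(), 3)]
)

# explode() emits one row per list element and repeats the original index
df_new = df.explode("text_new").drop(columns="text")
print(df_new)
```

Joining the words before exploding keeps every list element a plain string, so no separate join step is needed afterwards.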

EDIT:

m=(pd.DataFrame(df.text.apply(lambda x: list(divide_chunks(x.split(),3))).values.tolist())
.unstack().sort_index(level=1).apply(' '.join).reset_index(level=1))
m.columns=df.columns
print(m)

   id                 text
0   0             I am the
1   0   first document and
2   0            I am very
3   0               happy.
0   1          Here is the
1   1  second document and
2   1     it likes playing
3   1              tennis.
0   2          This is the
1   2   third document and
2   2        it looks very
3   2          good today.

Solution 2:

A self-contained solution, though possibly a little slower:

# Split every n words
n = 3

# in case id is not the index yet
df.set_index('id', inplace=True)

new_df = df.text.str.split(' ', expand=True).stack().reset_index()

new_df = (new_df.groupby(['id', new_df.level_1//n])[0]
                .apply(lambda x: ' '.join(x))
                .reset_index(level=1, drop=True)
         )

new_df is now a Series indexed by id:

id
1               I am the
1     first document and
1              I am very
1                 happy.
2            Here is the
2    second document and
2       it likes playing
2                tennis.
3            This is the
3     third document and
3          it looks very
3            good today.
Name: 0, dtype: object
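To get back to a two-column DataFrame like the one in the question, the Series can be renamed and its index reset. The sketch below reruns the whole approach end to end on assumed sample data (rebuilt from the outputs above). The names tokens, chunks and result are illustrative, and the explicit dropna() is an added precaution, since split(expand=True) pads shorter rows with missing values.

```python
import pandas as pd

n = 3  # words per chunk

# Sample data (assumed; rebuilt from the printed outputs)
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "I am the first document and I am very happy.",
        "Here is the second document and it likes playing tennis.",
        "This is the third document and it looks very good today.",
    ],
}).set_index("id")

# One row per word; columns are id, level_1 (word position) and 0 (the word)
tokens = df["text"].str.split(" ", expand=True).stack().dropna().reset_index()

# Group every n consecutive words of each document and join them
chunks = (tokens.groupby(["id", tokens["level_1"] // n])[0]
                .apply(" ".join)
                .reset_index(level=1, drop=True))

# Turn the id-indexed Series back into a DataFrame with id and text columns
result = chunks.rename("text").reset_index()
print(result)
```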
