Split Text In Cells And Create Additional Rows For The Tokens
Let's suppose that I have the following in a DataFrame in pandas:

id  text
1   I am the first document and I am very happy.
2   Here is the second document and it likes playing ten
Solution 1:
You can use something like:
def divide_chunks(l, n):
    # step through l in strides of n and yield each slice of up to n items
    for i in range(0, len(l), n):
        yield l[i:i + n]
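Then use unnesting (a user-defined helper, not part of pandas, that expands list-valued columns into one row per element):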
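# split each text into words, then group the words into chunks of 3
df['text_new'] = df.text.apply(lambda x: list(divide_chunks(x.split(), 3)))
# one row per chunk; drop the original text column
df_new = unnesting(df, ['text_new']).drop(columns='text')
# join the words of each chunk back into a string
df_new.text_new = df_new.text_new.apply(' '.join)
print(df_new)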
             text_new  id
0            I am the   1
0  first document and   1
0           I am very   1
0              happy.   1
1         Here is the   2
1  second document and  2
1    it likes playing   2
1             tennis.   2
2         This is the   3
2  third document and   3
2       it looks very   3
2         good today.   3
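The unnesting helper is not defined in this answer. On pandas 0.25 or later, the built-in DataFrame.explode can probably stand in for it. Here is a minimal sketch of that substitution (not the original helper); the sample DataFrame is an assumption, reconstructed from the output shown above:

```python
import pandas as pd

def divide_chunks(words, n):
    # yield successive slices of up to n items
    for i in range(0, len(words), n):
        yield words[i:i + n]

# Sample data reconstructed from the printed output (an assumption).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "I am the first document and I am very happy.",
        "Here is the second document and it likes playing tennis.",
        "This is the third document and it looks very good today.",
    ],
})

# Each cell becomes a list of 3-word strings.
chunks = df["text"].apply(
    lambda s: [" ".join(c) for c in divide_chunks(s.split(), 3)]
)

# explode turns every list element into its own row, repeating the id.
df_new = df.assign(text=chunks).explode("text").reset_index(drop=True)
print(df_new)
```

Unlike the version above, this keeps the column name text and renumbers the index from 0, so adjust to taste.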
EDIT:
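# build a frame with one column per chunk, then unstack it into one row per chunk
m = (pd.DataFrame(df.text.apply(lambda x: list(divide_chunks(x.split(), 3))).values.tolist())
       .unstack().sort_index(level=1).apply(' '.join).reset_index(level=1))
# note: the id column here holds the 0-based row position in df, not the original id values
m.columns = df.columns
print(m)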
   id                 text
0   0             I am the
1   0   first document and
2   0            I am very
3   0               happy.
0   1          Here is the
1   1  second document and
2   1     it likes playing
3   1              tennis.
0   2          This is the
1   2   third document and
2   2        it looks very
3   2          good today.
Solution 2:
A self-contained solution, though possibly a little slower:
# Split every n words
n = 3

# in case id is not the index yet
df.set_index('id', inplace=True)

# one row per word; level_1 is the word's position within its text
new_df = df.text.str.split(' ', expand=True).stack().reset_index()
# position // n puts every run of n consecutive words in the same group
new_df = (new_df.groupby(['id', new_df.level_1 // n])[0]
                .apply(lambda x: ' '.join(x))
                .reset_index(level=1, drop=True)
         )

new_df is a Series:
id
1               I am the
1     first document and
1              I am very
1                 happy.
2            Here is the
2    second document and
2       it likes playing
2                tennis.
3            This is the
3     third document and
3          it looks very
3            good today.
Name: 0, dtype: object
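The grouping trick in this solution is the integer division: word positions 0, 1, 2 map to chunk 0, positions 3, 4, 5 to chunk 1, and so on. Here is a rough sketch of the same idea, written with explode and cumcount instead of split(expand=True) and stack. It is a variant, not the code above, and the two-row sample data and the column names word and chunk are assumptions made for illustration:

```python
import pandas as pd

n = 3  # words per chunk

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "id": [1, 2],
    "text": [
        "I am the first document and I am very happy.",
        "Here is the second document and it likes playing tennis.",
    ],
})

# One row per word, keeping the id of the source document.
words = (df.assign(word=df["text"].str.split())
           .explode("word")
           .reset_index(drop=True))

# cumcount numbers the words within each id as 0, 1, 2, ...;
# integer division by n turns that position into a chunk number.
words["chunk"] = words.groupby("id").cumcount() // n

# Join the words of each (id, chunk) pair back into one string.
result = (words.groupby(["id", "chunk"])["word"]
               .agg(" ".join)
               .reset_index(level="chunk", drop=True))
print(result)
```

Note that str.split() with no argument splits on any run of whitespace, whereas split(' ') in the answer splits on single spaces only, so the two can differ on text with double spaces.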