
Python Pandas From Itemset To Dataframe

What is the more scalable way to go from an itemset list: itemset = [['a', 'b'], ['b', 'c', 'd'], ['a', 'c', 'd', 'e'], ['d'], ['a', 'b

Solution 1:

You can use get_dummies:

print (pd.DataFrame(itemset))
   0     1     2     3
0  a     b  None  None
1  b     c     d  None
2  a     c     d     e
3  d  None  None  None
4  a     b     c  None
5  a     b     c     d
df1 = (pd.get_dummies(pd.DataFrame(itemset), prefix='', prefix_sep='' ))
print (df1)
     a    b    d    b    c    c    d    d    e
0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0
2  1.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  1.0
3  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
4  1.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0
5  1.0  0.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0

# groupby(..., axis=1) is deprecated in recent pandas, so group the transposed frame instead
print (df1.T.groupby(level=0).sum().T.astype(int))
   a  b  c  d  e
0  1  1  0  0  0
1  0  1  1  1  0
2  1  0  1  1  1
3  0  0  0  1  0
4  1  1  1  0  0
5  1  1  1  1  0
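
For reference, the steps above can be put together as one self-contained sketch. The itemset literal is an assumption: it is reconstructed from the printed frame above, because the literal in the question is cut off. The names `dummies` and `onehot` are introduced here for illustration only. The sketch collapses the duplicate labels by grouping the transposed frame on its index, because `groupby(..., axis=1)` is deprecated in recent pandas releases.

```python
import pandas as pd

# Assumed input, reconstructed from the printed frame in this answer.
itemset = [['a', 'b'], ['b', 'c', 'd'], ['a', 'c', 'd', 'e'],
           ['d'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]

# One indicator column per (position, item) pair. With an empty prefix the
# same label can appear several times, once per position it occurs in.
dummies = pd.get_dummies(pd.DataFrame(itemset), prefix='', prefix_sep='')

# Merge the duplicate labels: transpose, group rows by label, sum, transpose back.
onehot = dummies.T.groupby(level=0).sum().T.astype(int)
print(onehot)
```

Depending on the pandas version, `get_dummies` may return floats, small integers, or booleans, so the final `astype(int)` is what makes the result uniform.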

Solution 2:

Here's an almost vectorized approach -

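import numpy as np
import pandas as pd

items = np.concatenate(itemset)
# Map single-character ASCII labels to column indices ('a' -> 0, 'b' -> 1, ...)
col_idx = items.astype('S1').view(np.uint8) - 97

lens = np.array([len(item) for item in itemset])
row_idx = np.repeat(np.arange(lens.size), lens)
# One column per letter up to the largest one seen
out = np.zeros((lens.size, col_idx.max() + 1), dtype=int)
out[row_idx, col_idx] = 1

# Assumes the labels are contiguous letters starting at 'a'
df = pd.DataFrame(out, columns=np.unique(items))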

The last line could be replaced by something like this, which may be more performant because it finds the unique values on the integer indices rather than on the strings -

df = pd.DataFrame(out,columns=items[np.unique(col_idx,return_index=True)[1]])
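
The ASCII arithmetic above only works for single lowercase letters. As a hedged variation, not part of the original answer, the column indices can come from `np.unique(..., return_inverse=True)` instead, which handles arbitrary string labels. The itemset literal is again an assumption, reconstructed from the output printed in Solution 1, and `labels` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Assumed input, reconstructed from the frame printed in Solution 1.
itemset = [['a', 'b'], ['b', 'c', 'd'], ['a', 'c', 'd', 'e'],
           ['d'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]

# Flatten all items into one 1-D array of labels.
items = np.concatenate(itemset)

# labels: sorted unique items; col_idx: position of each item within labels.
labels, col_idx = np.unique(items, return_inverse=True)

# Row index of each flattened item: row i is repeated len(itemset[i]) times.
lens = np.array([len(row) for row in itemset])
row_idx = np.repeat(np.arange(lens.size), lens)

# Fill the indicator matrix in one fancy-indexing assignment.
out = np.zeros((lens.size, labels.size), dtype=int)
out[row_idx, col_idx] = 1

df = pd.DataFrame(out, columns=labels)
print(df)
```

Here the matrix width comes from the number of distinct labels, so it does not depend on the labels being contiguous letters. The only Python-level loop left is the one computing the row lengths.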
