Skip to content Skip to sidebar Skip to footer

Separating Categories Within One Column In My Dataframe

I need to research something about what are the most cost efficient movie genres. My problem is that the genres are provided all within one string: This gives me about 300 differe

Solution 1:

Thanks to Yong Wang who suggested the get_dummies function within pandas. We can shorten the code significantly:

df = pd.DataFrame({
    'movie_id': range(5),
    'gernes': [
                'Action|Adventure|Fantasy|Sci-Fi',
                'Action|Adventure|Fantasy',
                'Action|Adventure|Thriller',
                'Action|Thriller',
                'Action|Adventure|Sci-Fi'
              ]
})  
dummies = df['gernes'].str.get_dummies(sep='|')
final = pd.concat([df, dummies], axis=1)

Result:

   movie_id                           gernes  Action  Adventure  Fantasy  Sci-Fi  Thriller
0         0  Action|Adventure|Fantasy|Sci-Fi       1          1        1       1         0
1         1         Action|Adventure|Fantasy       1          1        1       0         0
2         2        Action|Adventure|Thriller       1          1        0       0         1
3         3                  Action|Thriller       1          0        0       0         1
4         4          Action|Adventure|Sci-Fi       1          1        0       1         0

Original answer

One solution combining pandas and data preparation techniques from Machine Learning. Assuming you are on pandas v0.25 or later.

First, let's create a dataframe from your screenshot:

df = pd.DataFrame({
    'movie_id': range(5),
    'gernes': [
                'Action|Adventure|Fantasy|Sci-Fi',
                'Action|Adventure|Fantasy',
                'Action|Adventure|Thriller',
                'Action|Thriller',
                'Action|Adventure|Sci-Fi'
              ]
})

   movie_id                           gernes
0         0  Action|Adventure|Fantasy|Sci-Fi
1         1         Action|Adventure|Fantasy
2         2        Action|Adventure|Thriller
3         3                  Action|Thriller
4         4          Action|Adventure|Sci-Fi

A movie can belong to multiple gernes. What we want is to seperate those gernes through a process called one-hot encoding. We define the categories (Action, Adventure, Thriller, etc.) and mark each movie as belonging to each category or not:

from sklearn.preprocessing import OneHotEncoder

s = df['gernes'].str.split('|').explode()
encoder = OneHotEncoder()
encoded = encoder.fit_transform(s.values[:, None])
one_hot_df = pd.DataFrame(encoded.toarray(), columns=np.ravel(encoder.categories_), dtype='int') \
                .groupby(s.index) \
                .sum()

   Action  Adventure  Fantasy  Sci-Fi  Thriller
011110111100211001310001411010

What it means is that the first movie belongs to the Action, Adventure, Fantasy and Sci-Fi but not Thriller categories, the second movie belongs to Action, Adventure and Fantasy and so on. The final stop is to combine them together:

final = pd.concat([df, one_hot_df], axis=1)

   movie_id                           gernes  Action  Adventure  Fantasy  Sci-Fi  Thriller
0         0  Action|Adventure|Fantasy|Sci-Fi       1          1        1       1         0
1         1         Action|Adventure|Fantasy       1          1        1       0         0
2         2        Action|Adventure|Thriller       1          1        0       0         1
3         3                  Action|Thriller       1          0        0       0         1
4         4          Action|Adventure|Sci-Fi       1          1        0       1         0

Post a Comment for "Separating Categories Within One Column In My Dataframe"