Issue
I have a two different datasets:
users
:
+-------+---------+--------+
|user_id| movie_id|timestep|
+-------+---------+--------+
| 100 | 1000 |20200728|
| 101 | 1001 |20200727|
| 101 | 1002 |20200726|
+-------+---------+--------+
movies
:
+--------+---------+--------------------------+
|movie_id| title | genre |
+--------+---------+--------------------------+
| 1000 |Toy Story|Adventure|Animation|Chil..|
| 1001 | Jumanji |Adventure|Children|Fantasy|
| 1002 | Iron Man|Action|Adventure|Sci-Fi |
+--------+---------+--------------------------+
How to get dataset in the following format? So I can get user's taste profile, so I can compare different users by their similarity score?
+-------+---------+--------+---------+---------+-----+
|user_id| Action |Adventure|Animation|Children|Drama|
+-------+---------+--------+---------+---------+-----+
| 100 | 0 | 1 | 1 | 1 | 0 |
| 101 | 1 | 1 | 0 | 1 | 0 |
+-------+---------+---------+---------+--------+-----+
Solution
- Where
df
is the movies dataframe anddfu
is the users dataframe - The
'genre'
column needs to be split into a list withpandas.Series.str.split
, and then usingpandas.DataFrame.explode
, transform each element of the list into a row, replicating index values. pandas.merge
the two dataframes on'movie_id'
- Use
pandas.DataFrame.groupby
on'user_id'
and'genre'
and aggregate bycount
. - Shape
final
- Tested in
python 3.10
,pandas 1.4.3
import pandas as pd
# data
movies = {'movie_id': [1000, 1001, 1002],
'title': ['Toy Story', 'Jumanji', 'Iron Man'],
'genre': ['Adventure|Animation|Children', 'Adventure|Children|Fantasy', 'Action|Adventure|Sci-Fi']}
users = {'user_id': [100, 101, 101],
'movie_id': [1000, 1001, 1002],
'timestep': [20200728, 20200727, 20200726]}
# set up dataframes
df = pd.DataFrame(movies)
dfu = pd.DataFrame(users)
# split the genre column strings at '|' to make lists
df.genre = df.genre.str.split('|')
# explode the lists in genre
df = df.explode('genre', ignore_index=True)
# merge df with dfu
dfm = pd.merge(dfu, df, on='movie_id')
# groupby, count and unstack
final = dfm.groupby(['user_id', 'genre'])['genre'].count().unstack(level=1).fillna(0).astype(int)
# display(final)
genre Action Adventure Animation Children Fantasy Sci-Fi
user_id
100 0 1 1 1 0 0
101 1 2 0 1 1 1
Answered By - Trenton McKinney
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.