Issue
I have an original df with 4 columns: user (ID of the user visiting the website), month (month the user visited the website), year (year the user visited the website), and num_hits (number of times the user visited in that month of that year).
I want to plot, for each user and year, the month (x-axis) against num_hits (y-axis). I created a list of tuples as a column in pandas using:
df['tup'] = list(zip(df['month'], df['num_hits']))
df1 = df.groupby(['user', 'year'], as_index = False)['tup'].agg(list)
This is where I got stuck: I wanted to sort each list of tuples in the 'tup' column by its first element (the month) so that I could plot each list. My solution was to create a list of lists from the df and sort each one by the first element, like this:
df2 = df1['tup'].values.tolist()
for i in df2:
    i.sort(key=lambda x: x[0])
So then I could plot them using:
for i in range(len(df2)):
    plt.plot(*zip(*df2[i]))
But by doing this, I lost the user and year information that I wanted to keep in order to display it in the plot legend for the corresponding line. Is there any way of sorting the list of tuples in the pandas df and then plotting it directly with matplotlib, so that I can display the user and the year in the legend for each line? Thank you in advance.
Solution
The simplest solution is to not use tuples at all. You can create a pivot table with the user and year columns as the index, the month column as the columns, and the num_hits column as the values. Sorting the rows by month first ensures the columns are in the correct order. After transposing the dataframe, so that month is the index and user and year are the columns, you can simply call .plot(), which draws one line per (user, year) pair and labels it in the legend:
df.sort_values("month").pivot(index=["user", "year"], columns="month", values="num_hits").T.plot()
The one-liner can be broken up into stages, if you would prefer:
# create the pivot table
df1 = df.sort_values("month").pivot(index=["user", "year"], columns="month", values="num_hits")
# transpose
df2 = df1.T
# plot
df2.plot()
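To see why the legend ends up carrying the user and year, it can help to inspect the transposed frame before plotting. The following is only a sketch on a small made-up frame (the values are hypothetical, not the data from the question), and it assumes a pandas version where .pivot() accepts a list for index (1.1 or later, as far as I know):

```python
import pandas as pd

# Hypothetical two-user frame, used only to inspect the shape of the result
df = pd.DataFrame({
    "user": [1, 1, 2, 2],
    "year": [2020, 2020, 2021, 2021],
    "month": [2, 1, 2, 1],
    "num_hits": [20, 10, 40, 30],
})

# Same pivot-and-transpose as above
df2 = df.pivot(index=["user", "year"], columns="month", values="num_hits").T

print(df2.index.name)     # name of the row index, which .plot() uses for the x-axis
print(list(df2.columns))  # one (user, year) tuple per column, i.e. per plotted line
```

Each column of df2 is one line in the plot, and its (user, year) label is what .plot() puts in the legend.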
This is the data I used for testing. The months are deliberately in reverse order to start with, so the result can only be correct if the ordering is actually fixed:
import pandas as pd
import numpy as np
df = pd.DataFrame({"user": [1]*12*3 + [2]*12*3 + [3]*12*3 + [4]*12*3 + [5]*12*3,
                   "month": list(np.arange(12, 0, -1))*3*5,
                   "year": ([2019]*12 + [2020]*12 + [2021]*12)*5,
                   "num_hits": np.random.randint(0, 1000, 12*3*5)})
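If you want full control over the legend text, an alternative is to loop over the groups yourself and pass a label to matplotlib. This is a sketch of that variation rather than part of the pivot approach; the small frame and the label format are assumptions made for illustration, and the Agg backend is selected only so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical frame with months deliberately out of order
df = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 2],
    "year": [2020] * 6,
    "month": [3, 1, 2, 2, 3, 1],
    "num_hits": [30, 10, 20, 200, 300, 100],
})

fig, ax = plt.subplots()
# groupby keeps the (user, year) key next to each group's rows
for (user, year), group in df.groupby(["user", "year"]):
    group = group.sort_values("month")  # put the months in order for this line
    ax.plot(group["month"], group["num_hits"], label=f"user {user}, {year}")

ax.set_xlabel("month")
ax.set_ylabel("num_hits")
ax.legend()

# Collect the legend labels so they can be checked
labels = [line.get_label() for line in ax.get_lines()]
```

Because the group key travels with the data, no separate list of tuples is needed and nothing has to be matched up afterwards.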
Although it is not stated in the documentation from what I can see, .pivot() appears to sort the columns anyway, so you shouldn't even need to use .sort_values().
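One way to check this on your own pandas version is to pivot a small frame whose months are out of order, without calling .sort_values(), and look at the resulting column order. The frame below is made up for the check:

```python
import pandas as pd

# Hypothetical frame with months deliberately out of order
df = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 2],
    "year": [2020] * 6,
    "month": [3, 1, 2, 2, 3, 1],
    "num_hits": [30, 10, 20, 200, 300, 100],
})

# No sort_values() here: the column order comes from pivot alone
wide = df.pivot(index=["user", "year"], columns="month", values="num_hits")

month_order = list(wide.columns)
print(month_order)
```

If the printed months come out in ascending order, the explicit sort can be dropped; keeping it is harmless if you would rather not rely on undocumented behaviour.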
Answered By - Rawson