Issue
I have a large pandas dataframe, which is a log of user ids that login in a website:
id datetime
130 2018-05-17 19:46:18
133 2018-05-17 20:59:57
133 2018-05-17 21:54:01
142 2018-05-17 22:49:07
114 2018-05-17 23:02:34
136 2018-05-18 06:06:48
136 2018-05-18 12:21:38
180 2018-05-18 12:49:33
.......
120 2018-05-18 14:03:58
120 2018-05-18 15:28:36
How can I visualize the above pandas dataframe as a time series plot? For example I would like to represent the frequency of logins of each person id
as a line of a different color (note that I have about 400 ids
). Something like this plot (*)
:
]
I tried to:
from datetime import date
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
# set your data as df
# strip only YYYY-mm-dd part from original `datetime` column
df3.timestamp = df3.datetime.apply(lambda x: str(x)[:10])
df3.timestamp = df3.datetime.apply(lambda x: date(int(x[:4]), int(x[5:7]), int(x[8:10])))
# plot
plt.figure(figsize=(150,10))
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator())
plt.plot(df3.datetime[:800], df3.id[:800], '-')
plt.gcf().autofmt_xdate()
and
import matplotlib.dates as dates
df5 = df3.set_index('datetime')
df5.plot(x_compat=True)
plt.gca().xaxis.set_major_locator(dates.DayLocator())
plt.gca().xaxis.set_major_formatter(dates.DateFormatter('%d\n\n%a'))
plt.gca().invert_xaxis()
plt.gcf().autofmt_xdate(rotation=0, ha="center")
plt.figure(figsize=(150,10))
However, I got something like this:
]
Any idea of how to get a plot similar to (*)
?
Solution
I've played with your sample data a little so that one user logs in on three days. The problem in your attempt is that you are trying to "just plot" the logins. If you want to see the frequency of logins, you have to calculate that. So I read the data and use a proper DateTime index, then use groupby
followed by resample
to calculate the frequencies. I think with 400 users this might become a bit messy, but this will do a plot of the daily logins per user.
import pandas
import io
d = """id,datetime
130,2018-05-17T19:46:18
133,2018-05-17T20:59:57
133,2018-05-17T21:54:01
142,2018-05-17T22:49:07
114,2018-05-17T23:02:34
136,2018-05-18T06:06:48
136,2018-05-18T12:21:38
130,2018-05-18T12:49:33
120,2018-05-18T14:03:58
130,2018-05-19T15:28:36"""
# for the data aboce, this is a quick way to parse it
df = pandas.read_csv(io.StringIO(d), parse_dates=['datetime'], index_col='datetime')
# This method is more roundabout but is perhaps useful if you have other data
df = pandas.read_csv(io.StringIO(d))
df.datetime = pandas.to_datetime(df.datetime)
df = df.set_index('datetime')
# Plot daily logins per user id
r = df.groupby('id').resample('D').apply(len).unstack('id').plot()
Answered By - chthonicdaemon
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.