Issue
I am having difficulty creating a stacked bar chart time series from my Pandas dataframe (image below). I would like to have the 'Date' on the x axis, the 'Hours' on the y axis, and each bar to show the time spent with each group in 'Category'.
Do I need to use the pandas groupby function? The dataframe shown is only a sample; I have hundreds of rows of data from 2018 to 2020.
Solution
- There is an existing solution under the question "pandas - stacked bar chart with timeseries data".
- However, the OP of that question is not aggregating any data, so that solution doesn't work for this question.
- Use pandas.DataFrame.groupby on 'date' and 'group', while aggregating .sum on 'hours'.
  - The .dt accessor is used to extract only the .date component of the 'date' column.
  - Make certain the 'Date' column of your dataframe is properly formatted as a datetime dtype, with df.Date = pd.to_datetime(df.Date)
- The grouped dataframe, dfg, must be shaped into the correct form, which can be accomplished with pandas.DataFrame.pivot.
- The easiest way to stack a bar plot is with pandas.DataFrame.plot.bar, using the stacked parameter.
  - See pandas.DataFrame.plot for all the parameters.
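As a minimal sketch of the dtype step, the example below assumes a hypothetical frame whose 'Date' column arrives as strings. The names df_raw, 'Date', 'Category' and 'Hours' are illustrative and mirror the question rather than the test data used later. It converts the column with pd.to_datetime and then groups on the calendar date only.

```python
import pandas as pd

# hypothetical sample: 'Date' arrives as strings, as is common after reading a CSV
df_raw = pd.DataFrame({'Date': ['2020-11-24 09:30', '2020-11-24 14:00', '2020-11-25 08:15'],
                       'Category': ['A', 'B', 'A'],
                       'Hours': [2, 3, 1]})

# convert to a datetime dtype so the .dt accessor is available
df_raw['Date'] = pd.to_datetime(df_raw['Date'])

# .dt.date keeps only the calendar date, so rows from the same day share a key
day = df_raw['Date'].dt.date

# sum the hours per day and category
daily = df_raw.groupby([day, 'Category'])['Hours'].sum().reset_index()
print(daily)
```

Without the conversion, the .dt accessor raises an error on a string column, which is usually the first sign that the dtype is wrong.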
Imports and Data Transformation
import pandas as pd
import matplotlib.pyplot as plt
import random # for test data
import numpy as np # for test data
# setup dataframe with test data
np.random.seed(365)
random.seed(365)
rows = 1100
data = {'hours': np.random.randint(10, size=(rows)),
        'group': [random.choice(['A', 'B', 'C']) for _ in range(rows)],
        'date': pd.bdate_range('2020-11-24', freq='h', periods=rows).tolist()}
df = pd.DataFrame(data)
# display(df.head())
   hours group                date
0      2     C 2020-11-24 00:00:00
1      4     B 2020-11-24 01:00:00
2      1     C 2020-11-24 02:00:00
3      5     A 2020-11-24 03:00:00
4      2     B 2020-11-24 04:00:00
# use groupby on df
dfg = df.groupby([df.date.dt.date, 'group'])['hours'].sum().reset_index()
# pivot the dataframe into the correct format
dfp = dfg.pivot(index='date', columns='group', values='hours')
# display(dfp.head())
group        A   B   C
date
2020-11-24  49  25  29
2020-11-25  62  18  57
2020-11-26  42  77   4
2020-11-27  34  43  17
2020-11-28  28  53  23
- More succinctly, the groupby and pivot steps can be replaced with .pivot_table, which both reshapes and aggregates.
  - index=df.date.dt.date is used so the index doesn't include the time component, since the data for the entire day is being aggregated.
dfp = df.pivot_table(index=df.date.dt.date, columns='group', values='hours', aggfunc='sum')
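To check that the two routes agree, here is a small sketch on a hypothetical five-row frame (df_small is not part of the original answer). Group 'C' deliberately has no rows on the second day, which also shows how the fill_value parameter of pivot_table can replace the resulting gap with 0.

```python
import pandas as pd

# hypothetical miniature version of the test data; group 'C' has no rows on the second day
df_small = pd.DataFrame({
    'hours': [2, 4, 1, 5, 3],
    'group': ['A', 'B', 'C', 'A', 'B'],
    'date': pd.to_datetime(['2020-11-24 00:00', '2020-11-24 01:00', '2020-11-24 02:00',
                            '2020-11-25 00:00', '2020-11-25 01:00']),
})

day = df_small['date'].dt.date

# two-step route: aggregate, then reshape
two_step = (df_small.groupby([day, 'group'])['hours'].sum()
            .reset_index()
            .pivot(index='date', columns='group', values='hours'))

# one-step route: pivot_table reshapes and aggregates together
one_step = df_small.pivot_table(index=day, columns='group', values='hours', aggfunc='sum')

# compare the values, treating the missing day/group combination as 0 in both
same = (two_step.fillna(0).to_numpy() == one_step.fillna(0).to_numpy()).all()
print(same)

# fill_value=0 writes 0 instead of NaN where a group has no rows for a day
filled = df_small.pivot_table(index=day, columns='group', values='hours',
                              aggfunc='sum', fill_value=0)
print(filled)
```

Whether to fill the gaps is a judgment call: a 0 is a reasonable value for hours that were never logged, but it hides the difference between "no data" and "zero hours".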
Plot
# plot the pivoted dataframe
dfp.plot.bar(stacked=True, figsize=(10, 6), ylabel='Hours', xlabel='Date', title='Sum of Daily Category Hours')
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
- There will be a bar for each day, because that is how bar plot ticks work, so the plot could be very wide if there are many dates.
- Consider using pandas.DataFrame.plot.barh
dfp.plot.barh(stacked=True, figsize=(6, 10), title='Sum of Daily Category Hours')
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xlabel('Hours')
plt.ylabel('Date')
plt.show()
- The OP states there is data from 2018 to 2020, which means there could be over 700 days' worth of data, which translates to over 700 bars in the bar plot.
- A standard line plot might be the best option to properly visualize the data.
dfp.plot(figsize=(10, 6))
plt.show()
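Another option, not part of the original answer, is to aggregate by month instead of by day, which keeps the stacked bars readable over a multi-year range. The sketch below assumes hypothetical daily data from 2018 through 2020 (df_long is made up for illustration) and groups on a monthly period from .dt.to_period('M').

```python
import numpy as np
import pandas as pd

# hypothetical data spanning 2018 through 2020, one row per day
rng = np.random.default_rng(365)
dates = pd.date_range('2018-01-01', '2020-12-31', freq='D')
df_long = pd.DataFrame({'date': dates,
                        'group': rng.choice(['A', 'B', 'C'], size=len(dates)),
                        'hours': rng.integers(0, 10, size=len(dates))})

# group by calendar month instead of calendar day
month = df_long['date'].dt.to_period('M')
dfm = df_long.pivot_table(index=month, columns='group', values='hours',
                          aggfunc='sum', fill_value=0)

# one row per month, one column per group
print(dfm.shape)
```

Calling dfm.plot.bar(stacked=True) on the result draws one bar per month rather than one per day.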
Answered By - Trenton McKinney