Issue
I am really new in big data analysing. Let's say I have a big data with the following features. I want to visualise the the percentage of missing values (None values) of fuel parameters for every id in specific hour. I want to draw a chart that x-axis is the time series (time column), y-axis is the 'id' and the colour will indicate its missing fuel percentage. I grouped the data base on 'id' and 'hour'
I don't know how to visualise missing value in a good way for all ids. For example if the percentage of missing value fuel of specific id in specific hour is 100% then the colour in that specific time and for that 'id' can be gray. If percentage of missing value in fuel is 50%, the colour can be light green. If percentage of missing value in fuel is 0% then the colour can be dark green. The colour must be based to the percentage of missing value in fuel, after grouping based on id and time.
id time fuel
0 1 2022-02-26 19:08:33 100
2 1 2022-02-26 20:09:35 None
3 2 2022-02-26 21:09:35 70
4 3 2022-02-26 21:10:55 60
5 4 2022-02-26 21:10:55 None
6 5 2022-02-26 22:12:43 50
7 6 2022-02-26 23:10:50 None
So for example, in the following code I computed the percentage of the missing value for every hour for specific id:
df.set_index('ts').groupby(['id', pd.Grouper(freq='H')])['fuell'].apply(lambda x: x.isnull().mean() * 100)
Is there any solution?
Solution
Update: The heatmap now plots id
vs time
vs percentage of null fuel
. The original answer at the bottom was just id
vs time
vs fuel
.
I want something almost like a github style calendar.
To mimic the GitHub contribution matrix, reset the grouped null percentages into a dataframe and pivot
into 1 id
per row and 1 hour per column. Then use sns.heatmap
to color each cell based on percentage of null fuel
.
# convert to proper dtypes
df['time'] = pd.to_datetime(df['time'])
df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')
# compute null percentage per (id, hour)
nulls = (df.set_index('time')
.groupby(['id', pd.Grouper(freq='H')])['fuel']
.apply(lambda x: x.isnull().mean() * 100))
# pivot into id vs time matrix
matrix = (nulls.reset_index(name='null (%)')
.pivot(index='id', columns='time', values='null (%)')
.fillna(0))
# plot time series heatmap
sns.heatmap(matrix, square=True, vmin=0, vmax=100, cmap='magma_r', cbar_kws={'label': 'null (%)'},
linewidth=1, linecolor='lightgray', clip_on=False,
xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))
This is the original answer for visualizing id
by time
by fuel
:
- Pivot into an
id
vstime
matrix. Normallypivot
is fine, but since your real data contains duplicate indexes, usepivot_table
. resample
thetime
columns into hourly means.- Plot the time series matrix using
sns.heatmap
.
# convert to proper dtypes
df['time'] = pd.to_datetime(df['time'])
df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')
# pivot into id vs time matrix
matrix = df.pivot_table(index='id', columns='time', values='fuel', dropna=False)
# resample columns into hourly means
matrix = matrix.resample('H', axis=1).mean()
# plot time series heatmap
sns.heatmap(matrix, square=True, cmap='plasma_r', vmin=0, vmax=100, cbar_kws={'label': 'fuel (%)'},
linewidth=1, linecolor='lightgray', clip_on=False,
xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))
Answered By - tdy
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.