Monday, March 28, 2022

[FIXED] Visualise missing values in a time series heatmap

Labels: dataframe, heatmap, missing-data, pandas, python

Issue

I am new to big data analysis. Say I have a large dataset with the following features. I want to visualise the percentage of missing values (None values) of the fuel parameter for every id in each hour. I want to draw a chart where the x-axis is the time series (time column), the y-axis is the id, and the colour indicates the percentage of missing fuel values. I grouped the data based on 'id' and 'hour'.

I don't know a good way to visualise the missing values for all ids. For example, if 100% of the fuel values for a given id in a given hour are missing, the cell for that id and hour could be gray. If 50% are missing, it could be light green. If 0% are missing, it could be dark green. The colour must be based on the percentage of missing fuel values after grouping by id and time.

    id    time                   fuel
0   1     2022-02-26 19:08:33    100
2   1     2022-02-26 20:09:35    None
3   2     2022-02-26 21:09:35    70
4   3     2022-02-26 21:10:55    60
5   4     2022-02-26 21:10:55    None
6   5     2022-02-26 22:12:43    50
7   6     2022-02-26 23:10:50    None

For example, the following code computes the percentage of missing fuel values for every hour and every id:

df.set_index('time').groupby(['id', pd.Grouper(freq='H')])['fuel'].apply(lambda x: x.isnull().mean() * 100)

Is there any solution?

Solution

Update: The heatmap now plots id vs time vs percentage of null fuel. The original answer at the bottom was just id vs time vs fuel.

The asker clarified: "I want something almost like a GitHub-style calendar."

To mimic the GitHub contribution matrix, reset the grouped null percentages into a dataframe and pivot into 1 id per row and 1 hour per column. Then use sns.heatmap to color each cell based on percentage of null fuel.

import pandas as pd
import seaborn as sns

# convert to proper dtypes
df['time'] = pd.to_datetime(df['time'])
df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')

# compute null percentage per (id, hour)
nulls = (df.set_index('time')
           .groupby(['id', pd.Grouper(freq='H')])['fuel']
           .apply(lambda x: x.isnull().mean() * 100))

# pivot into id vs time matrix
matrix = (nulls.reset_index(name='null (%)')
               .pivot(index='id', columns='time', values='null (%)')
               .fillna(0))

# plot time series heatmap
sns.heatmap(matrix, square=True, vmin=0, vmax=100, cmap='magma_r', cbar_kws={'label': 'null (%)'},
            linewidth=1, linecolor='lightgray', clip_on=False,
            xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))
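
The question asks for specific colours (dark green at 0%, light green at 50%, gray at 100%), whereas the call above uses the built-in magma_r colormap. As a rough sketch, a custom colormap along those lines could be built with matplotlib's LinearSegmentedColormap.from_list. The name null_cmap and the exact named colours are assumptions chosen for illustration, not part of the original answer.

```python
from matplotlib.colors import LinearSegmentedColormap

# Hypothetical colormap with evenly spaced anchors at 0%, 50% and 100% null.
# The named colours are illustrative choices, not taken from the answer.
null_cmap = LinearSegmentedColormap.from_list(
    'null_cmap', ['darkgreen', 'lightgreen', 'gray'])

# The endpoints map to the first and last anchor colours.
low = null_cmap(0.0)    # RGBA tuple used for 0% null (dark green)
high = null_cmap(1.0)   # RGBA tuple used for 100% null (gray)
```

Passing cmap=null_cmap together with vmin=0 and vmax=100 to the sns.heatmap call above should then colour each cell on that scale.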

This is the original answer for visualizing id by time by fuel:

1. Pivot into an id vs time matrix. Normally pivot is fine, but since your real data contains duplicate indexes, use pivot_table.
2. Resample the time columns into hourly means.
3. Plot the time series matrix using sns.heatmap.
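
To illustrate the first step, here is a small sketch with made-up rows (not the asker's real data) in which one id has two readings at the same timestamp. DataFrame.pivot refuses to reshape such data, while pivot_table aggregates the duplicates, by default with the mean.

```python
import pandas as pd

# Made-up sample: id 1 has two readings at the same timestamp.
df = pd.DataFrame({
    'id': [1, 1, 2],
    'time': pd.to_datetime(['2022-02-26 19:00:00',
                            '2022-02-26 19:00:00',
                            '2022-02-26 20:00:00']),
    'fuel': [100.0, 80.0, 70.0],
})

# pivot cannot place two values in the same (id, time) cell.
try:
    df.pivot(index='id', columns='time', values='fuel')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates duplicates (the default aggfunc is the mean).
matrix = df.pivot_table(index='id', columns='time', values='fuel')
```

Here the two readings for id 1 collapse into a single cell holding their mean, which is why pivot_table is the safer choice when duplicates are possible.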

# convert to proper dtypes
df['time'] = pd.to_datetime(df['time'])
df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')

# pivot into id vs time matrix
matrix = df.pivot_table(index='id', columns='time', values='fuel', dropna=False)

# resample columns into hourly means
# (transpose instead of axis=1, which is deprecated in recent pandas)
matrix = matrix.T.resample('H').mean().T

# plot time series heatmap
sns.heatmap(matrix, square=True, cmap='plasma_r', vmin=0, vmax=100, cbar_kws={'label': 'fuel (%)'},
            linewidth=1, linecolor='lightgray', clip_on=False,
            xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))
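
A matrix built by pivoting only has columns for hours that actually occur in the data, so a gap in the recordings silently disappears from the x-axis. One possible way to keep the time axis continuous is to reindex the columns over a full hourly range, sketched below with a made-up matrix (the values and the name full_hours are assumptions). Cells without data stay NaN, which sns.heatmap masks automatically, so they remain distinguishable from a genuine 0%.

```python
import pandas as pd

# Made-up null-percentage matrix with no column for 20:00.
hours = pd.to_datetime(['2022-02-26 19:00:00', '2022-02-26 21:00:00'])
matrix = pd.DataFrame([[0.0, 100.0], [50.0, 0.0]],
                      index=pd.Index([1, 2], name='id'),
                      columns=pd.DatetimeIndex(hours, name='time'))

# Build every hour between the first and last existing column.
full_hours = pd.date_range(matrix.columns.min(), matrix.columns.max(),
                           freq=pd.Timedelta(hours=1), name='time')

# Reindex the columns; hours without data become all-NaN columns.
matrix = matrix.reindex(columns=full_hours)
```

Whether to fill those NaN cells with 0 afterwards is a design choice: filling makes the grid look complete, but it also makes "no data recorded" look the same as "nothing missing".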

Answered By - tdy

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
