Issue
I have 2 pandas dataframes, df1 and df2 which both have data from 2 different days between 21:00 and 8:00. The data should be 1 data point per minute, however there are there are missing values e.g.
location time Data
0 1 21:00:00 8
1 1 21:02:00 6
the data point for 21:01:00 does not exist. The missing data points occur at different times for each of the dataframes, so when I try to plot both of them on the same plot this happens:
If I plot them individually they're both correct. I think the horizontal red lines are caused by the time values that exist in the red dataframe but not in the blue dataframe.
Has anyone encountered this before? I want to plot both of them on the same axis, starting at 21:00 and finishing at 08:00.
Here is the code I'm using:
import pandas as pd
import plotly.express as px
df1 = pd.DataFrame({'location': 1,
'data': ['3', '4', '5'],
'time': [datetime.datetime(2022,7,16,21,0,0).time(),
datetime.datetime(2022,7,16,21,1,0).time(),
datetime.datetime(2022,7,16,21,3,0).time()]})
df2 = pd.DataFrame({'location': 2,
'data': ['8', '6', '7'],
'time': [datetime.datetime(2022,7,17,21,0,0).time(),
datetime.datetime(2022,7,17,21,2,0).time(),
datetime.datetime(2022,7,17,21,3,0).time()]})
df = pd.concat([df1,df2], axis=0)
fig = px.line(df, x="time", y="data", color='location')
fig.show()
Thanks!
Solution
- started by simulating data that has the features you describe. From 21:00 to 08:00 on different dates and with different randomly removed minutes
- now integrate this data. Have taken approach
- fill missing minutes by outer join to all minutes in each dataframe
- outer join the two data frames on time only
This gives a different struct data frame:
location_x | time_x | Data_x | t | location_y | time_y | Data_y | |
---|---|---|---|---|---|---|---|
0 | 1 | 2022-09-01 21:00:00 | 0 | 21:00:00 | 2 | 2022-09-04 21:00:00 | 1 |
1 | 1 | 2022-09-01 21:01:00 | 0.0302984 | 21:01:00 | 2 | 2022-09-04 21:01:00 | 0.999541 |
2 | 1 | 2022-09-01 21:02:00 | 0.060569 | 21:02:00 | 2 | 2022-09-04 21:02:00 | 0.998164 |
3 | 1 | 2022-09-01 21:03:00 | 0.0907839 | 21:03:00 | 2 | 2022-09-04 21:03:00 | 0.995871 |
4 | 1 | 2022-09-01 21:04:00 | 0.120916 | 21:04:00 | 2 | 2022-09-04 21:04:00 | nan |
This is then simple to generate a px.line()
figure from. Traces being Data_x and Data_y. Have used datetime column time_x for xaxis. This then works well as datetime and continuous axes are well integrated. Updated tickformat
so date part of axis is not displayed.
import pandas as pd
import numpy as np
import plotly.express as px
dr = pd.date_range("2022-09-01 21:00", "2022-09-02 08:00", freq="1Min")
# data to match question, two dataframes from 21:00 to 08:00, different dates with some holes
# with different dates
dfs = [
pd.DataFrame(
{
"location": np.full(len(dr), l),
"time": dr + pd.DateOffset(days=o),
"Data": f(np.linspace(0, 20, len(dr))),
}
)
.sample(frac=0.95)
.sort_index()
for l, o, f in zip([1, 2], [0, 3], [np.sin, np.cos])
]
df1 = dfs[0]
df2 = dfs[1]
# let's integrate the dataframes
# 1. fill the holes in each dataframe by doing an outer join to all times
# 2. outer join the two dataframes on just the time
df = pd.merge(
*[
pd.merge(
d,
pd.DataFrame(
{"time": pd.date_range(d["time"].min(), d["time"].max(), freq="1min")}
),
on="time",
how="outer",
)
.fillna({"location": l})
.assign(t=lambda d: d["time"].dt.time)
for d, l in zip([df1, df2], [1, 2])
],
on="t",
how="outer",
)
# finally generate plotly line chart using columns created by merging the data
# it's clearly observed there are gaps in both traces
px.line(
df.sort_values("time_x"), x="time_x", y=["Data_x", "Data_y"], hover_data=["time_y"]
).update_layout({"xaxis": {"tickformat": "%H:%M"}})
output
Answered By - Rob Raymond
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.