Issue
I would like to know if there is a way to calculate a sum over different dataframe rows when they have different rows' number due to missing timestamps. I have the following example:
import pandas as pd
data1 = {'produced': [19.7, 39.1, 86.4, 167.1]}
data2 = {'produced': [22.4, 95, 144.3, 300.2]}
data3 = {'produced': [15.1, 44.1, 80, 302.5]}
df1 = pd.DataFrame(data1, index = ['01/06/2021 09:35', '01/06/2021 09:40', '01/06/2021 09:45', '01/06/2021 09:50'])
df2 = pd.DataFrame(data2, index = ['01/06/2021 09:35', '01/06/2021 09:45', '01/06/2021 09:50', '01/06/2021 09:55'])
df3 = pd.DataFrame(data3, index = ['01/06/2021 09:35', '01/06/2021 09:40', '01/06/2021 09:45', '01/06/2021 09:55'])
As you can see, I have some dfs with distinct row numbers because there are a few "blank" values. What I would like to do is to calculate the sum, over rows, for each 5min. interval and simply skip the blanks. If I sum it over the rows as it is, I would be incorrectly summing different timestamps.
Also, wouldbe possible to store the missing timestamps in an array or something similar? Thank you in advance!
Solution
You can also merge the dataframes on indices and sum across rows. That way, you can see the dataframes that had a missing values.
df_total = df1.add_suffix("_df1")
for i, df in enumerate([df2, df3], start=2):
df = df.add_suffix("_df{}".format(i))
df_total = df_total.merge(df, right_index=True, left_index=True, how="outer")
df_total["sum"] = df_total.sum(axis=1)
print(df_total)
Output:
produced_df1 produced_df2 produced_df3 sum
01/06/2021 09:35 19.7 22.4 15.1 57.2
01/06/2021 09:40 39.1 NaN 44.1 83.2
01/06/2021 09:45 86.4 95.0 80.0 261.4
01/06/2021 09:50 167.1 144.3 NaN 311.4
01/06/2021 09:55 NaN 300.2 302.5 602.7
Answered By - Manlai A
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.