Issue
I have a dataset of the highest and lowest temperatures recorded on each day of the year, for the years 2005-2014. I want to create a graph where I plot the max and min temperatures for each day of the year over this period (so there will be only one max and one min temperature plotted for each day). I was able to create a df from the dataset with the absolute mins and maxes for each day. Here's the code for the max:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
# splitting 2005-2014 df dates into separate columns for easier analysis
weather_05_14['Year'] = weather_05_14['Date'].dt.strftime('%Y')
weather_05_14['Month'] = weather_05_14['Date'].dt.strftime('%m')
weather_05_14['Day'] = weather_05_14['Date'].dt.strftime('%d')
# extracting the min and max temperatures for each day, regardless of year
max_temps = weather_05_14.loc[weather_05_14.groupby(['Day', 'Month'], sort=False)
['Data_Value'].idxmax()][['Data_Value', 'Date']]
max_temps.rename(columns={'Data_Value': 'Max'}, inplace=True)
This is what the data frame looks like:
Now here's where my issue is. I want to plot this data in a line plot ordered by month/day, disregarding the year. My thought was that I could do this by changing the year to be the same for every data point (the year won't appear in the final graph anyway). This is what I did to try to accomplish that:
max_temps['Date'] = max_temps['Date'].apply(lambda x: x.replace(year=2005))
but I got this error:
ValueError: day is out of range for month
I have also tried taking the separate Day, Month, and Year columns that I used for the groupby, including them in the max_temps df, changing the year, and then combining them into a new column converted to a datetime object, but I get a similar error:
max_temps['Year'] = 2005
max_temps['New Date'] = pd.to_datetime(max_temps[['Year', 'Month', 'Day']])
Error: ValueError: cannot assemble the datetimes: day is out of range for month
I have also tried to ignore this issue and then plot with the pandas plot function like:
max_temps.plot(x=['Month', 'Day'], y=['Max'])
Which does work but then I don't get the full functionality of matplotlib (as far as I can tell anyway, I'm new to these libraries).
It gives me this graph:
This is close to the result I'm looking for, but I'd like to use matplotlib to do it.
I feel like I'm making the problem harder than it needs to be but I don't know how. If anyone has any advice or suggestions I would greatly appreciate it, thanks!
Solution
As @Jody Klymak pointed out, the reason
max_temps['Date'] = max_temps['Date'].apply(lambda x: x.replace(year=2005))
isn't working is that the 2005-2014 range includes leap years (2008 and 2012), so your full dataset probably contains February 29. When you set the year to 2005, pandas tries to create the date 2005-02-29, which does not exist, so it throws ValueError: day is out of range for month. You can fix this by choosing a leap year such as 2004 instead of 2005.
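To make the leap-year point concrete, here is a minimal sketch with made-up sample dates (the 2008-02-29 entry stands in for whatever leap-day row exists in the full dataset). It shows the replacement failing for 2005 and succeeding for 2004:

```python
import pandas as pd

# Hypothetical sample dates; 2008-02-29 stands in for the leap day in the real data.
dates = pd.Series(pd.to_datetime(['2005-01-01', '2008-02-29', '2012-12-31']))

# 2005 is not a leap year, so Feb 29 cannot be moved into it.
try:
    dates.apply(lambda x: x.replace(year=2005))
    failed = False
except ValueError as err:
    failed = True
    print(err)

# 2004 is a leap year, so every month/day combination is a valid date.
same_year = dates.apply(lambda x: x.replace(year=2004))
print(same_year.dt.strftime('%Y-%m-%d').tolist())
```

Any leap year would work as the placeholder, since the year is only there to give every row a valid date on a shared axis.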
My solution would be to disregard the year entirely and create a new column that holds the month and day in the format "01-01". Since the month comes first and both parts are zero-padded, sorting these strings puts them in chronological order regardless of the year.
Here's an example:
import pandas as pd
import matplotlib.pyplot as plt
max_temps = pd.DataFrame({
'Max': [15.6,13.9,13.3,10.6,12.8,18.9,21.7],
'Date': ['2005-01-01','2005-01-02','2005-01-03','2007-01-04','2007-01-05','2008-01-06','2008-01-07']
})
max_temps['Date'] = pd.to_datetime(max_temps['Date'])
# use string formatting to create a new column with Month-Day, e.g. "01-01"
max_temps['Month_Day'] = max_temps['Date'].dt.strftime('%m-%d')
plt.plot(max_temps['Month_Day'], max_temps['Max'])
plt.show()
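One thing to watch with the string approach: matplotlib draws string x-values in the order it receives them, so the rows need to be sorted by Month_Day first (the groupby in the question uses sort=False). The following is a rough sketch of that, using matplotlib's Axes interface directly. The sample rows and the tick spacing are made up for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample rows, deliberately out of calendar order.
max_temps = pd.DataFrame({
    'Max': [21.7, 15.6, 18.9, 13.9],
    'Date': pd.to_datetime(['2008-12-31', '2005-01-01', '2012-02-29', '2007-01-02']),
})
max_temps['Month_Day'] = max_temps['Date'].dt.strftime('%m-%d')

# Zero-padded "MM-DD" strings sort lexicographically in calendar order.
ordered = max_temps.sort_values('Month_Day').reset_index(drop=True)

fig, ax = plt.subplots()
ax.plot(ordered['Month_Day'], ordered['Max'])

# String x-values sit at positions 0, 1, 2, ..., so ticks can be thinned by position.
step = 2  # arbitrary for this tiny sample; a larger step suits a full year of days
ax.set_xticks(list(range(0, len(ordered), step)))
ax.set_xlabel('Month-Day')
ax.set_ylabel('Max temperature')
# Call plt.show() to display the figure.
```

Because this works on the Axes object, the rest of matplotlib (a second line for the minimums, ax.fill_between, legends) can be layered on in the usual way.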
Answered By - Derek O