Wednesday, October 12, 2022

[FIXED] How to replace blank values and data rearranging by date?

Labels: numpy, pandas, python

Issue

I have a CSV file containing values for 1988-2020. The first two columns contain the year and month (1988,1 ... 2020,12). The remaining columns contain the values for each day of that month. However, values are missing for some dates, and some months are missing entirely. The missing values need to be replaced by "NA" or "-99", and leap years must be taken into account. I need to convert the data into the following format, without skipping any date from 01-01-1988 to 31-12-2020.

Column1,Column2
"dd-mm-yyyy", "Value"

example data

Solution

Here I attempted to reshape your data into one row per day from 01-01-1988 to 31-12-2020, with the days of every missing month added and filled with NaN. I could have missed something, so don't hesitate to check it yourself and let me know if something is wrong:

import pandas as pd
import numpy as np

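# parse the first two columns (year, month) as a single date column
# note: combining columns through a nested parse_dates list is deprecated in recent pandas releases
df = pd.read_csv('/path_to_file/Data.csv', parse_dates=[[0, 1]], header=None)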

# rename columns
df.columns = range(0, 33)

# inspecting the df has shown that the row with index 2016-12-01 is invalid
# and the column 32 is empty  
df = df.drop(columns=32).set_index(0).drop(index='2016-12-01')

# fill missing data for days 1-28 (which exist in every month) with -99;
# days 29-31 keep their NaN so that non-existent dates can be dropped later
df.iloc[:, 0:28] = df.iloc[:, 0:28].fillna(value=-99, axis=0)

# reformat index
df.index = df.index.strftime('%m-%Y')

print(df)

Intermediate output:

           1     2     3     4     5     6     7     8     9    10  ...   22  \
0                                                                   ...        
01-1988   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.0  ...  0.0   
02-1988   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.0  ...  0.0   
03-1988   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.0  ...  0.0   
04-1988   0.0   0.0   0.0   0.0   0.0   0.0  36.1   0.0   0.0  0.0  ...  0.0   
01-1990   9.8  13.0   0.0   0.0   0.0   0.0  16.8  26.0  10.0  0.0  ...  0.0   
...       ...   ...   ...   ...   ...   ...   ...   ...   ...  ...  ...  ...   
08-2020   0.0   8.0  15.0   9.0   8.0  13.0   0.0  12.0   0.0  0.0  ...  0.0   
09-2020  49.0  15.0   7.0   9.0   5.0  22.0   0.0   0.0   5.0  3.0  ...  0.0   
10-2020   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.0  ...  0.0   
11-2020  32.0   4.0  26.0  20.0  10.0   0.0  10.0   0.0   0.0  0.0  ...  0.0   
12-2020   0.0   0.0   3.0   2.0   0.0   8.0   6.0  23.0   3.0  0.0  ...  0.0   

          23    24    25   26    27   28   29   30    31  
0                                                         
01-1988  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0   0.0  
02-1988  0.0   0.0   0.0  0.0   0.0  0.0  0.0  NaN   NaN  
03-1988  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0   0.0  
04-1988  0.0  35.3  48.0  0.0   4.0  0.0  0.0  0.0   NaN  
01-1990  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0   0.0  
...      ...   ...   ...  ...   ...  ...  ...  ...   ...  
08-2020  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0  36.0  
09-2020  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0   NaN  
10-2020  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0   0.0  
11-2020  0.0   0.0   0.0  0.0  32.0  2.0  5.0  0.0   NaN  
12-2020  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0   0.0  

[199 rows x 31 columns]
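
The next step drops every remaining NaN and assumes that what is left lines up exactly with the real calendar days of each month. That only holds if no genuine value is missing on days 29-31. A minimal sketch of a sanity check for this, using a hypothetical two-month frame (`wide`, `expected`, `actual` and `mismatch` are illustrative names, not part of the answer):

```python
import numpy as np
import pandas as pd

# Hypothetical two-month frame shaped like the intermediate output:
# one row per month, one column per day 1..31.
months = pd.PeriodIndex(['1988-02', '1988-04'], freq='M')
wide = pd.DataFrame(0.0, index=months, columns=range(1, 32))
wide.iloc[0, 29:] = np.nan   # February 1988 (leap year): days 30 and 31 do not exist
wide.iloc[1, 30] = np.nan    # April: day 31 does not exist

# Number of real calendar days in each month
expected = pd.Series(months.days_in_month, index=months)

# Number of non-NaN values in each month row
actual = wide.notna().sum(axis=1)

# Months where the two counts disagree would break the flatten step
mismatch = actual[actual != expected]
print(mismatch.empty)
```

If `mismatch` is not empty, the listed months contain a real gap on days 29-31 and need to be filled before flattening.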

# flatten df into 1-d array
flattened = df.to_numpy().flatten()

# expand months in df into days
valid_days = [pd.date_range(start=m, periods=pd.Period(m).days_in_month, freq='D')
              for m in df.index]

# create datetime index by days
valid_days = pd.to_datetime([item for sublist in valid_days 
                             for item in sublist]).strftime('%d-%m-%Y')

# create new DataFrame from flattened values and daily index;
# the NaNs removed here are the day slots (29-31) that do not exist in a month
df_days = pd.DataFrame(flattened[~np.isnan(flattened)], 
                       index=valid_days, 
                       columns=['value'])

# create the range of all days within the period
# (ISO-formatted dates avoid any day/month ambiguity when parsing)
total_days = pd.date_range('1988-01-01', '2020-12-31', freq='D').strftime('%d-%m-%Y')

# reindex df_days with all days within the period
df_days = df_days.reindex(total_days)

print(df_days)

Output:

          value
01-01-1988  0.0
02-01-1988  0.0
03-01-1988  0.0
04-01-1988  0.0
05-01-1988  0.0
    ...     ...
27-12-2020  0.0
28-12-2020  0.0
29-12-2020  0.0
30-12-2020  0.0
31-12-2020  0.0
12054 rows × 1 columns
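
The days of the missing months are still NaN at this point, while the question asks for "NA" or "-99". A small sketch of both options on a hypothetical five-day miniature of `df_days` (the sample values and the `date` column label are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_days: daily '%d-%m-%Y' index, gaps left as NaN
idx = pd.date_range('1988-02-27', '1988-03-02', freq='D').strftime('%d-%m-%Y')
df_days = pd.DataFrame({'value': [0.0, np.nan, 3.5, np.nan, 1.0]}, index=idx)

# Option 1: replace the gaps with the numeric sentinel -99
filled = df_days.fillna(-99)

# Option 2: keep NaN in memory and write it as the text "NA" on export
# (to_csv returns the CSV as a string when no path is given)
csv_text = df_days.to_csv(na_rep='NA', index_label='date')

print(filled)
print(csv_text)
```

Keeping NaN until export has the advantage that the column stays numeric and the gaps are not mistaken for real measurements in later calculations.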

I would be glad to know how to solve this in a much more elegant way.
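
One possible alternative, offered as a sketch rather than as the answer's method: reshape the table with `DataFrame.melt`, let `pd.to_datetime(..., errors='coerce')` turn impossible dates such as 30 February into NaT, and reindex against the full calendar. The column names `year`, `month`, `day` and the tiny sample frame are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical wide input: year, month, then one column per day 1..31.
# March 1988 is deliberately absent to mimic a missing month.
wide = pd.DataFrame(
    [[1988, 1] + [0.0] * 31,
     [1988, 2] + [1.0] * 29 + [np.nan, np.nan]],
    columns=['year', 'month'] + list(range(1, 32)),
)

# One row per (year, month, day) combination
tidy = wide.melt(id_vars=['year', 'month'], var_name='day', value_name='value')
tidy['day'] = tidy['day'].astype(int)

# Impossible dates (e.g. 30 and 31 February) become NaT and are dropped
tidy['date'] = pd.to_datetime(tidy[['year', 'month', 'day']], errors='coerce')
daily = tidy.dropna(subset=['date']).set_index('date')['value'].sort_index()

# Reindex on the full calendar; leap days are handled by date_range itself
full = pd.date_range('1988-01-01', '1988-03-31', freq='D')
daily = daily.reindex(full).fillna(-99)
daily.index = daily.index.strftime('%d-%m-%Y')

print(daily)
```

With the real file, `full` would span 1988-01-01 to 2020-12-31. Because validity is decided per date, this variant does not depend on gaps falling only on days 1-28.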

Answered By - n.shabankin

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
