Issue
I have a list of csvs, seven for every month, they all have the same structure. I'm trying to merge them into monthly datasets using concat monthly, then merge them so that I get fields like JAN_COUNT, FEB_COUNT, MAR_COUNT etc.. based on a id number. The first part of my code produces tables like I would expect. Each month is exported to a csv and then they are combined into a 2019.csv with all of JAN's fields follower by FEB's then MAR's etc... But even though I specify id to be the index there is still this recurring building of unnamed, unnamed_x, unnamed_x.1, etc... and when I try to calculate the totals for 2019 by summing the various fields in I get key errors saying that JAN_COUNT_CARGO doesn't exist. The dataframe has JAN_COUNT_CARGO_x and I'm not sure why where I am joining on the unique id column.
print("Importing modules")
import os, glob
import pandas as pd
from functools import reduce
from gc import collect
print('Modules imported... Setting variables')
opath = r'dir of csvs'
mlist = ['JAN', 'FEB'] #mlist = ['','JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
unique_gridid = 'id'
os.chdir(opath) #Set working directory
print('Variables set... Merging csv files into monthly datasets')
# Merge the splices of the gridded datasets
for m in mlist: #Loop through each month
print('Running merge for '+str(m)+' collecting csv files...')
csvlist = [i for i in glob.glob(str(m)+'2019_G*.{}'.format('csv'))] #List all the CSVs of each month
month_csv = pd.concat([pd.read_csv(csv) for csv in csvlist])
month_csv.set_index(unique_gridid)
month_csv.sort_values(by=['id'], ascending=True)
print('Saving '+str(m)+'2019.csv')
month_csv.to_csv(str(m)+'2019.csv', index=False)#Save the month dataframe to csv
print('Creating 2019.csv')
csvlist = [i for i in glob.glob('*2019.{}'.format('csv'))] #List all the CSVs for 2019
print('Gathering monthly datasets')
dflist = []
for csv in csvlist:
cdf = pd.read_csv(csv) # Read each csv into a dataframe
dflist.append(cdf) # Append the dataframe to a dataframe list
print('Merging the monthly into yearly dataset...')
ALL2019DF = reduce(lambda left,right: pd.merge(left,right,on=[str(unique_gridid)], how='outer'), dflist)
ALL2019DF.to_csv('2019.csv', index=unique_gridid)#Save the yearly dataframe to csv
ALL2019DF = pd.read_csv('2019.csv')
clist = ALL2019DF.columns.values.tolist()
ALL2019DF['2019_CARGO_COUNT'] =ALL2019DF[:[s for s in clist if "CARGO_COUNT" in s]].sum(axis=1)
ALL2019DF['2019_CARGO_VDAYS'] =ALL2019DF[:[s for s in clist if "CARGO_VDAYS" in s]].sum(axis=1)
ALL2019DF['2019_FISHING_COUNT'] =ALL2019DF[:[s for s in clist if "FISHING_COUNT" in s]].sum(axis=1)
ALL2019DF['2019_FISHING_VDAYS'] =ALL2019DF[:[s for s in clist if "FISHING_VDAYS" in s]].sum(axis=1)
ALL2019DF['2019_OTHER_COUNT'] =ALL2019DF[:[s for s in clist if "OTHER_COUNT" in s]].sum(axis=1)
ALL2019DF['2019_OTHER_VDAYS'] =ALL2019DF[:[s for s in clist if "OTHER_VDAYS" in s]].sum(axis=1)
ALL2019DF['2019_PASSENGER_COUNT'] =ALL2019DF[:[s for s in clist if "PASSENGER_COUNT" in s]].sum(axis=1)
ALL2019DF['2019_PASSENGER_VDAYS'] =ALL2019DF[:[s for s in clist if "PASSENGER_VDAYS" in s]].sum(axis=1)
ALL2019DF['2019_TANKER_COUNT'] =ALL2019DF[:[s for s in clist if "TANKER_COUNT" in s]].sum(axis=1)
ALL2019DF['2019_TANKER_VDAYS'] =ALL2019DF[:[s for s in clist if "TANKER_VDAYS" in s]].sum(axis=1)
ALL2019DF['2019_ALL_COUNT'] =ALL2019DF[:[s for s in clist if "ALL_COUNT" in s]].sum(axis=1)
ALL2019DF['2019_ALL_VDAYS'] =ALL2019DF[:[s for s in clist if "ALL_VDAYS" in s]].sum(axis=1)
print('Saving final 2019 csv with totals')
ALL2019DF.to_csv('2019.csv', index=unique_gridid)#Save the yearly dataframe to csv
print('Done')
This is suppose to join the data based on the id field ALL2019DF = reduce(lambda left,right: pd.merge(left,right,on=[str(unique_gridid)], how='outer'), dflist) but it doesn't seem to work. I use a similar method to build the seven monthly csvs MERGE_DF = reduce(lambda df1, df2: pd.merge(df1, df2, on=unique_gridid), dfList) What am I doing wrong?
Solution
with merge, when the left and the right DF have the same column-name, these are suffixed with "_x" for left, and "_y" for right.
You can overwrite it by changing the suffixes
suffixes=()
refer to documentation here pd.merge
suffixes: list-like, default is (“_x”, “_y”) A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.
Answered By - Naveed
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.