Wednesday, June 29, 2022

[FIXED] After a Pandas Dataframe merge headers gain an "_x"

June 29, 2022 pandas, python No comments

Issue

I have a list of csvs, seven for every month, they all have the same structure. I'm trying to merge them into monthly datasets using concat monthly, then merge them so that I get fields like JAN_COUNT, FEB_COUNT, MAR_COUNT etc.. based on a id number. The first part of my code produces tables like I would expect. Each month is exported to a csv and then they are combined into a 2019.csv with all of JAN's fields follower by FEB's then MAR's etc... But even though I specify id to be the index there is still this recurring building of unnamed, unnamed_x, unnamed_x.1, etc... and when I try to calculate the totals for 2019 by summing the various fields in I get key errors saying that JAN_COUNT_CARGO doesn't exist. The dataframe has JAN_COUNT_CARGO_x and I'm not sure why where I am joining on the unique id column.

print("Importing modules")
import os, glob
import pandas as pd
from functools import reduce
from gc import collect
print('Modules imported... Setting variables')

opath = r'dir of csvs'
mlist = ['JAN', 'FEB'] #mlist = ['','JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
unique_gridid = 'id'
os.chdir(opath) #Set working directory

print('Variables set... Merging csv files into monthly datasets')

# Merge the splices of the gridded datasets
for m in mlist: #Loop through each month
    print('Running merge for '+str(m)+' collecting csv files...')
    csvlist = [i for i in glob.glob(str(m)+'2019_G*.{}'.format('csv'))] #List all the CSVs of each month
    month_csv = pd.concat([pd.read_csv(csv) for csv in csvlist])
    month_csv.set_index(unique_gridid)
    month_csv.sort_values(by=['id'], ascending=True)
    print('Saving '+str(m)+'2019.csv')
    month_csv.to_csv(str(m)+'2019.csv', index=False)#Save the month dataframe to csv

print('Creating 2019.csv')
csvlist = [i for i in glob.glob('*2019.{}'.format('csv'))] #List all the CSVs for 2019

print('Gathering monthly datasets')
dflist = []
for csv in csvlist:
    cdf = pd.read_csv(csv)  # Read each csv into a dataframe
    dflist.append(cdf)  # Append the dataframe to a dataframe list

print('Merging the monthly into yearly dataset...')
ALL2019DF = reduce(lambda  left,right: pd.merge(left,right,on=[str(unique_gridid)], how='outer'), dflist)
ALL2019DF.to_csv('2019.csv', index=unique_gridid)#Save the yearly dataframe to csv
ALL2019DF = pd.read_csv('2019.csv')

clist = ALL2019DF.columns.values.tolist()

ALL2019DF['2019_CARGO_COUNT'] =ALL2019DF[:[s for s in clist if "CARGO_COUNT" in s]].sum(axis=1)
ALL2019DF['2019_CARGO_VDAYS'] =ALL2019DF[:[s for s in clist if "CARGO_VDAYS" in s]].sum(axis=1)
ALL2019DF['2019_FISHING_COUNT'] =ALL2019DF[:[s for s in clist if "FISHING_COUNT" in s]].sum(axis=1)
ALL2019DF['2019_FISHING_VDAYS'] =ALL2019DF[:[s for s in clist if "FISHING_VDAYS" in s]].sum(axis=1)
ALL2019DF['2019_OTHER_COUNT'] =ALL2019DF[:[s for s in clist if "OTHER_COUNT" in s]].sum(axis=1)
ALL2019DF['2019_OTHER_VDAYS'] =ALL2019DF[:[s for s in clist if "OTHER_VDAYS" in s]].sum(axis=1)
ALL2019DF['2019_PASSENGER_COUNT'] =ALL2019DF[:[s for s in clist if "PASSENGER_COUNT" in s]].sum(axis=1)
ALL2019DF['2019_PASSENGER_VDAYS'] =ALL2019DF[:[s for s in clist if "PASSENGER_VDAYS" in s]].sum(axis=1)
ALL2019DF['2019_TANKER_COUNT'] =ALL2019DF[:[s for s in clist if "TANKER_COUNT" in s]].sum(axis=1)
ALL2019DF['2019_TANKER_VDAYS'] =ALL2019DF[:[s for s in clist if "TANKER_VDAYS" in s]].sum(axis=1)
ALL2019DF['2019_ALL_COUNT'] =ALL2019DF[:[s for s in clist if "ALL_COUNT" in s]].sum(axis=1)
ALL2019DF['2019_ALL_VDAYS'] =ALL2019DF[:[s for s in clist if "ALL_VDAYS" in s]].sum(axis=1)

print('Saving final 2019 csv with totals')
ALL2019DF.to_csv('2019.csv', index=unique_gridid)#Save the yearly dataframe to csv
print('Done')

This is suppose to join the data based on the id field ALL2019DF = reduce(lambda left,right: pd.merge(left,right,on=[str(unique_gridid)], how='outer'), dflist) but it doesn't seem to work. I use a similar method to build the seven monthly csvs MERGE_DF = reduce(lambda df1, df2: pd.merge(df1, df2, on=unique_gridid), dfList) What am I doing wrong?

Solution

with merge, when the left and the right DF have the same column-name, these are suffixed with "_x" for left, and "_y" for right.

You can overwrite it by changing the suffixes

suffixes=()

refer to documentation here pd.merge

suffixes: list-like, default is (“_x”, “_y”) A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.

Answered By - Naveed

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, June 29, 2022

[FIXED] After a Pandas Dataframe merge headers gain an "_x"

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels