Issue
I've just started using Python, so I could do with some help.
I've merged the data in two Excel files using the following code:
# Import pandas library
import pandas as pd
#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")
#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')
#export new dataframe to excel
df.to_excel('WLM module data_test4.xlsx')
This does merge the data, but where dataframe 1 has multiple entries for a module, the df2 data is duplicated in the merged file so that it has the same number of entries. Here's an example:
So I want only one entry for the moderation of each module, whereas I have two at the moment (highlighted in red).
I also want to remove the additional columns "term_y", "semester_y", "credits_y" and "students_y" from the final output, as they just repeat data I already have in df1.
Thanks!
Solution
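I think what you want is duplicated(), which I picked up from these two Stack Overflow questions:
Pandas - Replace Duplicates with Nan and Keep Row & Replace duplicated values with a blank string
So what you want is this after your merge: df.loc[df['module_id'].duplicated(), 'module_id'] = pd.NA
duplicated() marks every repeat of a module_id after its first occurrence, and the line sets module_id to a missing value on those rows.
Please read both linked Stack Overflow examples to better understand how this works.
So the full code would look like this: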
import pandas as pd
#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")
#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')
df.loc[df['module_id'].duplicated(), 'module_id'] = pd.NA
#export new dataframe to excel
df.to_excel('WLM module data_test5-working.xlsx')
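The line above blanks module_id itself. If the aim is instead to show the moderation entry only once per module, the same duplicated() mask can be pointed at the columns that came from df2. This is only a minimal sketch on made-up data: the frames teaching and moderation and the columns lecturer and moderator are hypothetical stand-ins, and only module_id comes from the question.

```python
import pandas as pd

# Made-up stand-ins for the two spreadsheets (only module_id is from the question)
teaching = pd.DataFrame({
    "module_id": ["M1", "M1", "M2"],
    "lecturer": ["Ann", "Bob", "Cy"],   # hypothetical column
})
moderation = pd.DataFrame({
    "module_id": ["M1", "M2"],
    "moderator": ["Dee", "Eve"],        # hypothetical column
})

# M1 appears twice in teaching, so its moderation row is repeated by the merge
merged = teaching.merge(moderation, on="module_id", how="outer")

# True for the second and later rows that share a module_id
repeat = merged["module_id"].duplicated()

# Blank the moderation value on the repeated rows only
merged.loc[repeat, "moderator"] = pd.NA

print(merged)
```

With real data, the list of columns to blank would be whichever columns df2 contributed.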
There are many ways to drop columns too.
For lack of time, I've chosen to do this:
df.drop(df.columns[2], axis=1, inplace=True)
from https://www.stackvidhya.com/drop-column-in-pandas/
Change df.columns[2] to the position of the column you want to drop (positions start at 0, so df.columns[2] is the third column), since my working data was different to yours.*
Do this after the merge, so that the full code will look like this:
import pandas as pd
#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")
#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')
df.loc[df['module_id'].duplicated(), 'module_id'] = pd.NA
#drop the unwanted column before exporting (https://www.stackvidhya.com/drop-column-in-pandas/)
df.drop(df.columns[2], axis=1, inplace=True)
#export new dataframe to excel
df.to_excel('WLM module data_test6-working.xlsx')
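Since the question names the columns to remove ("term_y", "semester_y", "credits_y" and "students_y"), they can also be dropped by name rather than by position. Here is a rough sketch of two ways to do that, on made-up data: the values and the moderator column are hypothetical, and it assumes df2 really does repeat term, semester, credits and students from df1.

```python
import pandas as pd

# Made-up data; the shared columns mirror the ones named in the question
df1 = pd.DataFrame({
    "module_id": ["M1", "M2"],
    "term": [1, 2], "semester": [1, 1], "credits": [20, 10], "students": [30, 15],
})
df2 = pd.DataFrame({
    "module_id": ["M1", "M2"],
    "term": [1, 2], "semester": [1, 1], "credits": [20, 10], "students": [30, 15],
    "moderator": ["Dee", "Eve"],   # hypothetical column
})

df = df1.merge(df2, on="module_id", how="outer")

# Option 1: drop the repeated columns by name after the merge
repeats = ["term_y", "semester_y", "credits_y", "students_y"]
by_name = df.drop(columns=repeats)

# Option 2: leave the shared columns out of df2 before merging,
# so no _x/_y suffixes are created in the first place
shared = [c for c in df2.columns if c in df1.columns and c != "module_id"]
trimmed = df1.merge(df2.drop(columns=shared), on="module_id", how="outer")

print(by_name.columns.tolist())
print(trimmed.columns.tolist())
```

Option 1 keeps the "_x" suffix on the df1 columns, while option 2 keeps their original names.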
Hope I've helped.
I'm just very happy I got you somewhere with this, for both our sakes!
Happy you have a working answer.
And if you want to create a new df from the merged df, with the duplicates blanked and the columns dropped, you can do this:
new = df.drop(df.iloc[: , [1, 2, 7]], axis=1)
from Extracting specific selected columns to new DataFrame as a copy
So the full code would look something like this (please adjust the column numbers to your needs), which is what I wanted:
# Import pandas library
import pandas as pd
#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")
#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')
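#blank module_id on every repeat after the first
df.loc[df['module_id'].duplicated(), 'module_id'] = pd.NA
#new dataframe without the columns at positions 1, 2 and 7 (df itself is unchanged)
new = df.drop(df.iloc[: , [1, 2, 7]], axis=1)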
#new=pd.DataFrame(df.drop(df.columns[2], axis=1, inplace=True))
print(new)
#export new dataframe to excel
df.to_excel('WLM module data_test12.xlsx')
new.to_excel('WLM module data_test13.xlsx')
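The code above writes two files because drop without inplace=True returns a new dataframe and leaves df as it was. With inplace=True, df itself is changed and the call returns None, so its result should not be assigned to a variable. A small sketch on a made-up frame (the column names a, b and c are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Without inplace, drop returns a new DataFrame and df keeps all its columns
new = df.drop(columns=df.columns[[1]])
print(new.columns.tolist())

# With inplace=True, df itself is modified and the call returns None
result = df.drop(columns="c", inplace=True)
print(result)
print(df.columns.tolist())
```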
Note: *When I did mine above, I deliberately didn't have any headers in my columns, to try to make it as generic as possible, so I used iloc to specify the column number initially (your original question was not that descriptive or clear, but I got the point). I think you should include copyable draft data (not screenshots) next time, to make it easier for people and to incentivise experts on here to engage with the post, along with clearer whys and hows. SO isn't a free code-writing service, you know, but it was also hugely to my benefit to delve into this.
Answered By - David Wooley - AST