Issue
My input is a dataframe (made from this link https://www.timeanddate.com/time/zones/):
import pandas as pd

df = pd.DataFrame({'Abbreviation': ['ADT', 'ET', 'GMT'],
                   'Time zone name': ["Atlantic Daylight Time\nADST – Atlantic Daylight Saving Time\nAST – Atlantic Summer Time\nHAA – Heure Avancée de l'Atlantique (French)",
                                      'Eastern Time',
                                      'Greenwich Mean Time\nUTC – Coordinated Universal Time\nGT – Greenwich Time']})
Abbreviation | Time zone name |
---|---|
ADT | Atlantic Daylight Time ADST – Atlantic Daylight Saving Time AST – Atlantic Summer Time HAA – Heure Avancée de l'Atlantique (French) |
ET | Eastern Time |
GMT | Greenwich Mean Time UTC – Coordinated Universal Time GT – Greenwich Time |
Some time zones have equivalent abbreviations. For example, GMT has two equivalents, while others, such as ET, have none.
I'm trying to extract the equivalent time zones and use them as the second level of a MultiIndex.
My expected output is:
                                 Time zone name                                 Details
Abbreviation Equivalent
ADT          ADST        Atlantic Daylight Time           Atlantic Daylight Saving Time
             AST         Atlantic Daylight Time                    Atlantic Summer Time
             HAA         Atlantic Daylight Time  Heure Avancée de l'Atlantique (French)
ET           NaN                   Eastern Time                                     NaN
GMT          UTC            Greenwich Mean Time              Coordinated Universal Time
             GT             Greenwich Mean Time                          Greenwich Time
To get there I wrote the code below, but unfortunately the row for the time zone ET is missing:
first_split = df['Time zone name'].str.split('\n')
second_split = first_split.explode().str.split(' – ', expand=True)
df['Time zone name'] = first_split.str[0]
final = pd.concat([df, second_split], axis=1).rename(columns={0: 'Equivalent', 1: 'Details'})
final = final.dropna(subset='Details')
final = final.set_index(['Abbreviation', 'Equivalent'])
Can you help me fix my code? I'm open to any other ideas.
Solution
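dropna removes every row without Details, and the only row for ET is such a row. Instead, remove just the first row of each group that has more than one row. You can use Index.duplicated or Series.duplicated for that (the two options are alternatives, apply only one):
# Option 1: test the row labels left over from explode
final = final[final.index.duplicated() | ~final.index.duplicated(keep=False)]
# Option 2: test the Abbreviation column
final = final[final['Abbreviation'].duplicated() |
              ~final['Abbreviation'].duplicated(keep=False)]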
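To see what this mask selects, here is a minimal sketch on a bare index that mimics the row labels after explode (four rows for ADT, one for ET, three for GMT). The `idx` values are made up for illustration and are not taken from the real frame:

```python
import pandas as pd

# Hypothetical row labels after explode: 0 -> ADT, 1 -> ET, 2 -> GMT
idx = pd.Index([0, 0, 0, 0, 1, 2, 2, 2])

# True for every repeat of a label except its first occurrence
repeats = idx.duplicated()
# True for labels that occur exactly once
singles = ~idx.duplicated(keep=False)

keep = repeats | singles
print(keep.tolist())
# [False, True, True, True, True, False, True, True]
```

The first row of each multi-row group is dropped, while the lone row for label 1 is kept.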
Or use Series.notna with Index.map and Index.value_counts:
final = final[final['Details'].notna() |
              (final.index.map(final.index.value_counts()) == 1)]
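A small sketch of how this variant counts rows per label, again on a made-up index rather than the real frame. Index.map accepts the Series returned by value_counts and looks each label up in it:

```python
import pandas as pd

# Hypothetical row labels after explode: 0 -> ADT, 1 -> ET, 2 -> GMT
idx = pd.Index([0, 0, 0, 0, 1, 2, 2, 2])

# Number of rows sharing each label, repeated once per row
counts = idx.map(idx.value_counts())
print(counts.tolist())
# [4, 4, 4, 4, 1, 3, 3, 3]

# True only for labels that own a single row
single = counts == 1
print(single.tolist())
# [False, False, False, False, True, False, False, False]
```

Combined with `final['Details'].notna()`, this keeps rows that have details plus the single-row groups.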
Use any one of these filters in place of this line in your code (before the set_index call):
final = final.dropna(subset='Details')
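For reference, a sketch of your code with that line swapped out. Two things here are my assumptions rather than part of the original: DataFrame.join stands in for pd.concat(..., axis=1) (both align on the row labels), and an extra where step blanks Equivalent on rows without Details, because for ET the split leaves 'Eastern Time' in the first column instead of NaN:

```python
import pandas as pd

df = pd.DataFrame({'Abbreviation': ['ADT', 'ET', 'GMT'],
                   'Time zone name': ["Atlantic Daylight Time\nADST – Atlantic Daylight Saving Time\nAST – Atlantic Summer Time\nHAA – Heure Avancée de l'Atlantique (French)",
                                      'Eastern Time',
                                      'Greenwich Mean Time\nUTC – Coordinated Universal Time\nGT – Greenwich Time']})

first_split = df['Time zone name'].str.split('\n')
second_split = (first_split.explode()
                           .str.split(' – ', expand=True)
                           .rename(columns={0: 'Equivalent', 1: 'Details'}))
df['Time zone name'] = first_split.str[0]

# Assumption: join instead of concat; each df row is repeated per matching label
final = df.join(second_split)
# Assumption: rows without Details hold the main name in Equivalent, so blank it
final['Equivalent'] = final['Equivalent'].where(final['Details'].notna())
# Replaces dropna: drop only the first row of groups that have several rows
final = final[final.index.duplicated() | ~final.index.duplicated(keep=False)]
final = final.set_index(['Abbreviation', 'Equivalent'])
print(final)
```

Without the where step the ET row is kept, but its Equivalent level would read 'Eastern Time' rather than NaN.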
Another complete solution uses DataFrame.explode with Series.where, forward filling the missing values in the Time zone name column:
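final = (df.assign(**{'Time zone name': df['Time zone name'].str.split('\n')})
           .explode('Time zone name'))
# Raw string with a lazy first group, so the abbreviation has no trailing space;
# rows without a dash give NaN in both new columns
final[['Equivalent', 'Details']] = final['Time zone name'].str.extract(r'(.*?)\s*–\s*(.*)')
# Keep the main name only on rows without an equivalent, then forward fill it
final['Time zone name'] = (final['Time zone name'].where(final['Equivalent'].isna())
                                                  .ffill())
# Drop the first row of each multi-row group; single-row groups (ET) stay
final = final[final.index.duplicated() | ~final.index.duplicated(keep=False)]
final = final.set_index(['Abbreviation', 'Equivalent'])
print(final)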
                                 Time zone name  \
Abbreviation Equivalent
ADT          ADST        Atlantic Daylight Time
             AST         Atlantic Daylight Time
             HAA         Atlantic Daylight Time
ET           NaN                   Eastern Time
GMT          UTC            Greenwich Mean Time
             GT             Greenwich Mean Time

                                                        Details
Abbreviation Equivalent
ADT          ADST                 Atlantic Daylight Saving Time
             AST                           Atlantic Summer Time
             HAA         Heure Avancée de l'Atlantique (French)
ET           NaN                                            NaN
GMT          UTC                     Coordinated Universal Time
             GT                                  Greenwich Time
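One detail worth checking in the extract step: rows without a dash produce NaN in both columns, which is what lets ET survive with an empty Equivalent. The sketch below, on a made-up two-row Series, also compares a greedy first group `(.*)` with a lazy one `(.*?)`; the greedy form keeps the space before the dash in the captured abbreviation:

```python
import pandas as pd

# Hypothetical sample: one name without an equivalent, one "ABBR – name" row
s = pd.Series(['Eastern Time', 'UTC – Coordinated Universal Time'])

greedy = s.str.extract(r'(.*)\s*–\s*(.*)')
lazy = s.str.extract(r'(.*?)\s*–\s*(.*)')

print(repr(greedy.loc[1, 0]))    # 'UTC ' (trailing space)
print(repr(lazy.loc[1, 0]))      # 'UTC'
print(repr(lazy.loc[1, 1]))      # 'Coordinated Universal Time'
print(lazy.loc[0].isna().all())  # True: no dash, so both groups are NaN
```

The trailing space does not show in the printed frame, but it would matter when selecting rows by the Equivalent level.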
Answered By - jezrael