Issue
I have a large dataset where some of the rows are not placed in the right column. For example
data:
#DataFrame
data = {'Status': ['Active', 'Active', 'Active', 'Active','Active', 'Active'],
'Name': ['Tom', ' ', 'krish', ' ', 'Jack', 'Lisa'],
'Email': ['[email protected]', ' ', '[email protected]', ' ', '[email protected]', '[email protected]'],
'Name2': [' ', 'John', ' ', 'Tim', ' ', ' '],
'Email2':[' ', '[email protected]', ' ', '[email protected]', ' ', ' ']}
#Print DataFrame
df = pd.DataFrame(data)
df
In this case, the column 'Name' and 'email' has some empty spaces because they were placed in the a new column called 'name2' and 'email2'
I would like to see if I can fill the empty spaces with the actual name and email that were misplaced in different columns.
I was trying to do some research but I did not find any significant information or I do not know if this is possible.
actual dataset:
Status Name Email Name2 Email2
Active Tom [email protected]
Active Tim [email protected]
Active Tom [email protected]
Active Tim [email protected]
Expected result
Name2 Name Email2
Active Tom [email protected]
Active Tim [email protected]
Active Tom [email protected]
Active Tom [email protected]
Solution
- This option is faster than all other current answers, as per the
timeit
comparison at the bottom. - Select the appropriate columns into separate dataframes with Boolean indexing, combine the dataframes with
.concat
, and sort the index.- Boolean indexing is vectorized and much faster than using
.apply
.
- Boolean indexing is vectorized and much faster than using
- Empty cells are specified in the OP with
' '
, which is used indf.Name.ne(' ')
to skip rows with that value.- If the empty spaces are
np.nan
, then do~df.Name.isna()
instead ofdf.Name.ne(' ')
. - Use
df = df.replace('\\s+', '', regex=True)
if the empty strings are of different or indeterminate length, and change.ne(' ')
to.ne('')
.
- If the empty spaces are
.copy()
can be dropped if it doesn't result inSettingwithCopyWarning
, which it didn't during testing.
# select Status and the columns without numbers
df1 = df.loc[df.Name.ne(' '), ['Status', 'Name', 'Email']].copy()
# select Status and the columns with 2; rename the columns with 2
df2 = df.loc[df.Name2.ne(' '), ['Status', 'Name2', 'Email2']].copy().rename({'Name2': 'Name', 'Email2': 'Email'}, axis=1)
df = pd.concat([df1, df2]).sort_index()
# display(df)
Status Name Email
0 Active Tom [email protected]
1 Active John [email protected]
2 Active krish [email protected]
3 Active Tim [email protected]
4 Active Jack [email protected]
5 Active Lisa [email protected]
One-liner
# as a single line without creating separate objects for each set of columns
df = pd.concat([df.loc[df.Name.ne(' '), ['Status', 'Name', 'Email']],
(df.loc[df.Name2.ne(' '), ['Status', 'Name2', 'Email2']])
.rename({'Name2': 'Name', 'Email2': 'Email'}, axis=1)]).sort_index()
Comparison
# given df in the op, create a large dataset
df = pd.concat([df]*100000).reset_index(drop=True)
def trenton(df):
return pd.concat([df.loc[df.Name.ne(' '), ['Status', 'Name', 'Email']],
(df.loc[df.Name2.ne(' '), ['Status', 'Name2', 'Email2']])
.rename({'Name2': 'Name', 'Email2': 'Email'}, axis=1)]).sort_index()
def echo(df):
df["Name"] = (df["Name"] + df["Name2"]).str.strip()
df["Email"] = (df["Email"] + df["Email2"]).str.strip()
df = df.drop(["Name2", "Email2"], axis=1)
return df
def sierra(df):
df['Name'] = df.apply(lambda x: x[3] if x[1] == ' ' else x[1], axis=1)
df['Email'] = df.apply(lambda x: x[4] if x[2] == ' ' else x[2], axis=1)
df = df.drop(labels=['Name2', 'Email2'], axis=1)
return df
def BeRT2me(df):
df = df.applymap(str.strip)
df = df.replace('', np.nan)
df.Name = df.Name.fillna(df.Name2)
df.Email = df.Email.fillna(df.Email2)
df = df.drop(['Name2', 'Email2'], axis=1)
return df
timeit
%timeit trenton(df)
[out]:
101 ms ± 604 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit echo(dftest)
[out]:
390 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit sierra(df)
[out]:
6.11 s ± 64.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit BeRT2me(df)
[out]:
258 ms ± 4.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answered By - Trenton McKinney
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.