Sunday, September 18, 2022

[FIXED] Fill empty rows

September 18, 2022 dataframe, pandas, python No comments

Issue

I have a large dataset where some of the rows are not placed in the right column. For example

data:

#DataFrame
data = {'Status': ['Active', 'Active', 'Active', 'Active','Active', 'Active'],
        'Name': ['Tom', ' ', 'krish', ' ',  'Jack', 'Lisa'],
        'Email': ['[email protected]', ' ',  '[email protected]', ' ', '[email protected]', '[email protected]'],
        'Name2': [' ', 'John', ' ', 'Tim', ' ', ' '],
        'Email2':[' ', '[email protected]', ' ', '[email protected]', ' ', ' ']}

#Print DataFrame

df = pd.DataFrame(data)
df

In this case, the column 'Name' and 'email' has some empty spaces because they were placed in the a new column called 'name2' and 'email2'

I would like to see if I can fill the empty spaces with the actual name and email that were misplaced in different columns.

I was trying to do some research but I did not find any significant information or I do not know if this is possible.

actual dataset:

Status  Name  Email          Name2      Email2 
Active  Tom   [email protected]  
Active                       Tim        [email protected]
Active  Tom   [email protected] 
Active                       Tim        [email protected]

Expected result

Name2   Name  Email2 
Active  Tom   [email protected]  
Active  Tim   [email protected]
Active  Tom   [email protected] 
Active  Tom   [email protected]

Solution

This option is faster than all other current answers, as per the timeit comparison at the bottom.
Select the appropriate columns into separate dataframes with Boolean indexing, combine the dataframes with .concat, and sort the index.
- Boolean indexing is vectorized and much faster than using .apply.
Empty cells are specified in the OP with ' ', which is used in df.Name.ne(' ') to skip rows with that value.
- If the empty spaces are np.nan, then do ~df.Name.isna() instead of df.Name.ne(' ').
- Use df = df.replace('\\s+', '', regex=True) if the empty strings are of different or indeterminate length, and change .ne(' ') to .ne('').
.copy() can be dropped if it doesn't result in SettingwithCopyWarning, which it didn't during testing.

# select Status and the columns without numbers
df1 = df.loc[df.Name.ne(' '), ['Status', 'Name', 'Email']].copy()

# select Status and the columns with 2; rename the columns with 2
df2 = df.loc[df.Name2.ne(' '), ['Status', 'Name2', 'Email2']].copy().rename({'Name2': 'Name', 'Email2': 'Email'}, axis=1)

df = pd.concat([df1, df2]).sort_index()

# display(df)
   Status   Name           Email
0  Active    Tom  [email protected]
1  Active   John  [email protected]
2  Active  krish  [email protected]
3  Active    Tim  [email protected]
4  Active   Jack  [email protected]
5  Active   Lisa  [email protected]

One-liner

# as a single line without creating separate objects for each set of columns
df = pd.concat([df.loc[df.Name.ne(' '), ['Status', 'Name', 'Email']],
                (df.loc[df.Name2.ne(' '), ['Status', 'Name2', 'Email2']])
                .rename({'Name2': 'Name', 'Email2': 'Email'}, axis=1)]).sort_index()

Comparison

# given df in the op, create a large dataset
df = pd.concat([df]*100000).reset_index(drop=True)


def trenton(df):
    return pd.concat([df.loc[df.Name.ne(' '), ['Status', 'Name', 'Email']],
                      (df.loc[df.Name2.ne(' '), ['Status', 'Name2', 'Email2']])
                      .rename({'Name2': 'Name', 'Email2': 'Email'}, axis=1)]).sort_index()


def echo(df):
    df["Name"] = (df["Name"] + df["Name2"]).str.strip()
    df["Email"] = (df["Email"] + df["Email2"]).str.strip()
    df = df.drop(["Name2", "Email2"], axis=1)
    return df


def sierra(df):
    df['Name'] = df.apply(lambda x: x[3] if x[1] == ' ' else x[1], axis=1)
    df['Email'] = df.apply(lambda x: x[4] if x[2] == ' ' else x[2], axis=1)
    df = df.drop(labels=['Name2', 'Email2'], axis=1)
    return df


def BeRT2me(df):
    df = df.applymap(str.strip)
    df = df.replace('', np.nan)
    df.Name = df.Name.fillna(df.Name2)
    df.Email = df.Email.fillna(df.Email2)
    df = df.drop(['Name2', 'Email2'], axis=1)
    return df

`timeit`

%timeit trenton(df)
[out]:
101 ms ± 604 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit echo(dftest)
[out]:
390 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit sierra(df)
[out]:
6.11 s ± 64.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit BeRT2me(df)
[out]:
258 ms ± 4.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Answered By - Trenton McKinney

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Sunday, September 18, 2022

[FIXED] Fill empty rows

Issue

Solution

One-liner

Comparison

`timeit`

0 comments:

Post a Comment

Popular Posts

Labels