Tuesday, November 21, 2023

[FIXED] Generate one dataframe per element in a column and merge into current dataframe


Issue

Given the following (example) dataframe :

import pandas as pd
import pathlib
from pathlib import Path

cwd = Path('Path/to/somewhere')
df = pd.DataFrame(
    {
        'var1': [0, 5, 10, 15, 20, 25],
        'var2': ['A', 'B']*3,
        'var3': [1, 2]*3,
        'path_col': [cwd / 'a.dat', cwd / 'b.dat', cwd / 'c.dat', cwd / 'd.dat', cwd / 'e.dat', cwd / 'f.dat'],
    }
)

Each path in path_col points to a data file, and I have a function that converts such a file into a dataframe, e.g.:

def open_and_convert_to_df(filepath: pathlib.Path):
    # do things
    return pd.DataFrame(...)

data_df = pd.DataFrame(
    {
        'var4': [10, 20, 30],
        'var5': [100, 200, 300],
        'obs': [1000, 2000, 3000],
    }
)

I'd like to generate a data_df from each path in path_col and merge into df such that the final df looks like :

    var1 var2 var3 var4 var5 obs
0   0    A    1    10   100  1000
1   0    A    1    10   100  2000
2   0    A    1    10   100  3000
3   0    A    1    10   200  1000
4   0    A    1    10   200  2000
5   0    A    1    10   200  3000
6   0    A    1    10   300  1000
...
n-3 25   B    2    30   200  3000
n-2 25   B    2    30   300  1000
n-1 25   B    2    30   300  2000
n   25   B    2    30   300  3000

In other words, var1 to var3 of the first df index the data contained in path_col. Inside that data, var4 and var5 index obs. I'm trying to index obs by all five variables.

The best I've come up with so far is using the .map() method like so :

df['path_col'] = df['path_col'].map(open_and_convert_to_df)

I end up with the right dataframe in each path_col element, but I don't know the next steps needed to "un-nest" them and obtain the desired df.

Solution

Assuming you want some kind of join of each row with the output of the function, you could use concat:

out = df.join(pd.concat({k: open_and_convert_to_df(v)
                         for k,v in df['path_col'].items()}
                        ).droplevel(1))
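
To see why this works, here is a minimal runnable sketch. The body of open_and_convert_to_df is a hypothetical stand-in that ignores the file and returns the example frame from the question, and df is shortened to three rows; both are assumptions made purely for illustration.

```python
import pandas as pd
from pathlib import Path


def open_and_convert_to_df(filepath: Path) -> pd.DataFrame:
    # Hypothetical stand-in: the real function would read `filepath`.
    # Here every path yields the same three-row frame from the question.
    return pd.DataFrame(
        {
            'var4': [10, 20, 30],
            'var5': [100, 200, 300],
            'obs': [1000, 2000, 3000],
        }
    )


cwd = Path('Path/to/somewhere')
df = pd.DataFrame(
    {
        'var1': [0, 5, 10],
        'var2': ['A', 'B', 'A'],
        'var3': [1, 2, 1],
        'path_col': [cwd / 'a.dat', cwd / 'b.dat', cwd / 'c.dat'],
    }
)

# The dict keys are df's row labels. concat turns them into the outer level
# of a MultiIndex; the inner level is each returned frame's own index.
stacked = pd.concat(
    {k: open_and_convert_to_df(v) for k, v in df['path_col'].items()}
)

# Dropping the inner level leaves only df's row labels (each repeated once
# per data row), so join aligns on the index and repeats df's rows to match.
out = df.join(stacked.droplevel(1))

print(out.drop(columns='path_col'))
```

With this stand-in, each row of df appears once per row of the frame returned for its path, so three input rows become nine output rows. The original row label is kept as a repeated index; out.reset_index(drop=True) gives a fresh 0..n-1 index if that is preferred.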

Used input:

df = pd.DataFrame(
    {
        'var1': [0, 5, 10, 15, 20, 25],
        'var2': ['A', 'B']*3,
        'var3': [1, 2]*3,
        'path_col': [cwd / 'a.dat', cwd / 'b.dat', cwd / 'c.dat', cwd / 'd.dat', cwd / 'e.dat', cwd / 'f.dat'],
    }
)
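
If the end goal is to look up obs by var1 through var5, as the question describes, one possible follow-up is to move those columns into a MultiIndex. The sketch below is only an illustration: the small out frame is a hypothetical stand-in for the joined result, not real data.

```python
import pandas as pd

# Hypothetical stand-in for the joined result `out`
# (two source rows, two data rows each).
out = pd.DataFrame(
    {
        'var1': [0, 0, 5, 5],
        'var2': ['A', 'A', 'B', 'B'],
        'var3': [1, 1, 2, 2],
        'var4': [10, 20, 10, 20],
        'var5': [100, 200, 100, 200],
        'obs': [1000, 2000, 1000, 2000],
    },
    index=[0, 0, 1, 1],
)

# Use the five variables as index levels and keep obs as the values.
indexed = out.set_index(['var1', 'var2', 'var3', 'var4', 'var5'])['obs']

# A full key selects a single observation.
value = indexed.loc[(5, 'B', 2, 20, 200)]
print(value)
```

This assumes each combination of the five variables is unique; with duplicate keys, the same lookup would return a Series rather than a single value.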

Answered By - mozway

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
