Saturday, January 27, 2024

UDF over)

January 27, 2024 dataframe, pivot, python-3.x, python-polars, unpivot No comments

Issue

I am trying to pivot a table on 2 value columns, apply some user defined function on the pivotted dataframe and then unpivot (stack in pandas). In pandas, this would look like this:

import pandas as pd
import polars as pl
from polars import col, lit, selectors as cs

df = pl.DataFrame(
    {
        "foo": [1, 1, 2, 2, 3, 3],
        "bar": ["y", "x", "y", "x", "y", "x"],
        "baz1": [1, 2, None, 4, 5, None],
        "baz2": [1, None, 3, 4, None, 6]
    }
)
df
'''
┌─────┬─────┬──────┬──────┐
│ foo ┆ bar ┆ baz1 ┆ baz2 │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ str ┆ i64  ┆ i64  │
╞═════╪═════╪══════╪══════╡
│ 1   ┆ y   ┆ 1    ┆ 1    │
│ 1   ┆ x   ┆ 2    ┆ null │
│ 2   ┆ y   ┆ null ┆ 3    │
│ 2   ┆ x   ┆ 4    ┆ 4    │
│ 3   ┆ y   ┆ 5    ┆ null │
│ 3   ┆ x   ┆ null ┆ 6    │
└─────┴─────┴──────┴──────┘
'''

pd_df = df.to_pandas()

index_col = ['foo']
columns_col = ['bar']
values_col = ['baz1', 'baz2']

def pd_udf(df): # for example purposes, let's assume the function can be more complex
    return (
        df.ffill() * 3
    )

pd_res = (
    pd_df.groupby(index_col + columns_col).first() # or .set_index(index_col + column) for same result as no duplicates
    .unstack()
    .pipe(pd_udf)
    .stack()
    .reset_index()
    .sort_values(index_col + columns_col)
    .pipe(pl.from_pandas)
)
pd_res
'''
┌─────┬─────┬──────┬──────┐
│ foo ┆ bar ┆ baz1 ┆ baz2 │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ str ┆ f64  ┆ f64  │
╞═════╪═════╪══════╪══════╡
│ 1   ┆ x   ┆ 6.0  ┆ null │
│ 1   ┆ y   ┆ 3.0  ┆ 3.0  │
│ 2   ┆ x   ┆ 12.0 ┆ 12.0 │
│ 2   ┆ y   ┆ 3.0  ┆ 9.0  │
│ 3   ┆ x   ┆ 12.0 ┆ 18.0 │
│ 3   ┆ y   ┆ 15.0 ┆ 9.0  │
└─────┴─────┴──────┴──────┘
'''

I found 2 ways to achieve the same result, please let me know if there is a better way to achive the results I want to get.

With pivot, then melt, some code and another pivot

It is not the prettiest solution, the stack bit is achieved with quite a lot of manipulations due to the f'{value_name}_{column_name}_{column_value} format of the columns after the pivoting.

def pl_udf(df):
    return (
        df.with_columns(
            pl.exclude(index_col).forward_fill() * lit(3)
        )
    )

lazy_df_1 = (
    df
    .pivot(values = values_col,
            index = index_col,
            columns = columns_col
            )
    .lazy()
    .pipe(pl_udf)
    
    # pd stack bit
    .melt(id_vars = index_col)
    .select(
        col(index_col+['value']),
        col('variable').str.split('_').list.get(0).alias('temp'),
        col('variable').str.split('_').list.get(2).alias(columns_col[0]),
    )
    .collect()
    .pivot(values = 'value',
          index = index_col+columns_col,
          columns = 'temp')
    .lazy()
    .sort(index_col+columns_col)
)
lazy_df_1.collect()

With group_by and explode with a user-defined function applied to Series

The udf applied to series needs to convert the Series to a Dataframe to be able to leverage Dataframe methods. Again, I'm not sure that is the best solution.

def pl_udf_series(s):
    '''
    to apply to series directly
    '''
    return (
        s.to_frame() # to use with dataframe functions (my udf will use dataframe functions)
        .select(col(s.name).forward_fill() * lit(3))
        .to_series().to_list()
    )

lazy_df_2 = (
    df.lazy()
    .group_by(columns_col)
    .agg(
        col(index_col),
        col(values_col).map_elements(lambda x: pl_udf_series(x))  
    )
    .explode(columns=index_col + values_col)
    .sort(index_col+columns_col)
    .select(col(index_col+columns_col+values_col)) # reordering
)
lazy_df_2.collect()

Both implementations give the expected result:

pd_res.equals(lazy_df_1.collect())
# True
pd_res.equals(lazy_df_2.collect())
# True

In terms of performance:

pandas solution ~1000 micro-seconds
pivot, melt, pivot ~380 micro-seconds
group_by, explode ~450 micro-seconds (I have seen some cases where 2 is faster than 1)

Solution

Do you really need to pivot / unpivot your dataframe? You can apply functions within the group with over() method:

def func1(col): # for example purposes, let's assume the function can be more complex
    return col.forward_fill().over("bar") * lit(3)

res = df.with_columns(func1(pl.all().exclude(['foo', 'bar'])))
print(res.sort(['foo','bar']))

┌─────┬─────┬──────┬──────┐
│ foo ┆ bar ┆ baz1 ┆ baz2 │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ str ┆ i64  ┆ i64  │
╞═════╪═════╪══════╪══════╡
│ 1   ┆ x   ┆ 6    ┆ null │
│ 1   ┆ y   ┆ 3    ┆ 3    │
│ 2   ┆ x   ┆ 12   ┆ 12   │
│ 2   ┆ y   ┆ 3    ┆ 9    │
│ 3   ┆ x   ┆ 12   ┆ 18   │
│ 3   ┆ y   ┆ 15   ┆ 9    │
└─────┴─────┴──────┴──────┘

Answered By - Roman Pekar

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Saturday, January 27, 2024

[FIXED] Polars pivot and melt/un_pivot (reverse pivot) operation with multiple values (pandas stack / unstack altermative / UDF over)

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels