Monday, November 20, 2023

[FIXED] Is there a way to add a column of numpy random values to a polars dataframe while one column is an input to numpy.random?

Labels: numpy, python, python-polars

Issue

Let's say I have a dataframe that has a column named mean that I want to use as an input to a random number generator. I'm coming from R, where this is relatively easy to do in a pipeline:

library(dplyr)

tibble(alpha = rnorm(1000),
       beta = rnorm(1000)) %>%
  mutate(mean = alpha + beta) %>%
  bind_cols(random_output = rnorm(n = nrow(.), mean = .$mean, sd = 1))
#> # A tibble: 1,000 × 4
#>     alpha   beta    mean random_output
#>     <dbl>  <dbl>   <dbl>         <dbl>
#>  1  0.231 -0.243 -0.0125         0.551
#>  2  0.213  0.647  0.861          0.668
#>  3  0.824 -0.353  0.471          0.852
#>  4  0.665 -0.916 -0.252         -1.81 
#>  5 -0.850  0.384 -0.465         -3.90 
#>  6  0.721  0.679  1.40           2.54 
#>  7  1.46   0.857  2.32           2.14 
#>  8 -0.242 -0.431 -0.673         -0.820
#>  9  0.234  0.188  0.422         -0.662
#> 10 -0.494 -2.15  -2.65          -3.01 
#> # ℹ 990 more rows

Created on 2023-11-12 with reprex v2.0.2

In Python, I can create an intermediate dataframe, use it as input to np.random.normal(), and then bind the result to the dataframe, but this feels clunky. Is there a way to add the random_output column as part of the pipeline/chain?

import polars as pl
import numpy as np

# create a df
df = (
    pl.DataFrame(
        {
            "alpha": np.random.standard_normal(1000),
            "beta": np.random.standard_normal(1000)
        }
    )
    .with_columns(
        (pl.col("alpha") + pl.col("beta")).alias("mean")
    )
)

# create an intermediate object
sim_vals = np.random.normal(df.get_column("mean"))

# bind the simulated values to the original df
(
    df.with_columns(random_output = pl.lit(sim_vals))
)
#> shape: (1_000, 4)
┌───────────┬───────────┬───────────┬───────────────┐
│ alpha     ┆ beta      ┆ mean      ┆ random_output │
│ ---       ┆ ---       ┆ ---       ┆ ---           │
│ f64       ┆ f64       ┆ f64       ┆ f64           │
╞═══════════╪═══════════╪═══════════╪═══════════════╡
│ -1.380249 ┆ 1.531959  ┆ 0.15171   ┆ 0.938207      │
│ -0.332023 ┆ -0.108255 ┆ -0.440277 ┆ 0.081628      │
│ -0.718319 ┆ -0.612187 ┆ -1.330506 ┆ -1.286229     │
│ 0.22067   ┆ -0.497258 ┆ -0.276588 ┆ 0.908147      │
│ …         ┆ …         ┆ …         ┆ …             │
│ 0.299117  ┆ -0.371846 ┆ -0.072729 ┆ 0.592632      │
│ 0.789633  ┆ 0.95712   ┆ 1.746753  ┆ 2.954801      │
│ -0.264415 ┆ -0.761634 ┆ -1.026049 ┆ -1.369753     │
│ 1.893911  ┆ 1.554736  ┆ 3.448647  ┆ 5.192537      │
└───────────┴───────────┴───────────┴───────────────┘

Solution

There are four approaches that I can think of: two were mentioned in the comments, one is what I use, and the last I know exists but don't personally use.

First (get_column(col) or ['col']) reference

Pass df.get_column('mean') (or df['mean']) to np.random.normal. You can do this within a chain if you use pipe, for example:

df.with_columns(
    mean=pl.col('alpha') + pl.col('beta')
).pipe(lambda df: (
    df.with_columns(
        rando=pl.lit(np.random.normal(df['mean']))
    )
))

Second (map_batches)

Use map_batches as an expression:

df.with_columns(
    mean=pl.col('alpha') + pl.col('beta')
).with_columns(
    rando=pl.col('mean').map_batches(lambda col: pl.Series(np.random.normal(col)))
)

Third (numba)

This approach is faster than the previous two if you're going to do many randomizations, but it takes more setup (hence the caveat about many randomizations).

numba lets you create ufuncs, which are compiled functions that you can use directly inside an expression.

You can create this function, which just uses the default standard deviation:

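import numba as nb
import numpy as np

# generalized ufunc: fills res with one normal draw per entry of means (default sd of 1)
@nb.guvectorize([(nb.float64[:], nb.float64[:])], '(n)->(n)', nopython=True)
def rando(means, res):
    for i in range(len(means)):
        res[i] = np.random.normal(means[i])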

Then you can do:

df.with_columns(
    mean=pl.col('alpha') + pl.col('beta')
).with_columns(rand_nb=rando(pl.col('mean')))

Fourth (rust extension)

Unfortunately for this answer (and, I suppose, for me in general), I haven't dabbled in Rust programming, but there is an extension interface that lets you write functions in Rust and deploy them as expressions. The Polars documentation on expression plugins describes how to do that.

Performance

Using a 1M-row df, I get:

First method: 71.1 ms ± 8.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Second method: 70.7 ms ± 7.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Third method: 45.7 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

One thing to note is that the numba version is not faster unless you want a different mean for each row. With a constant mean, for instance:

df.with_columns(z=rando(pl.repeat(5,pl.count()))): 43.8 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df.with_columns(z=pl.Series(np.random.normal(5,1,df.shape[0]))): 39.6 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Answered By - Dean MacGregor

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
