Issue
Let's say I have a dataframe with a column named mean that I want to use as an input to a random number generator. Coming from R, this is relatively easy to do in a pipeline:
library(dplyr)
tibble(alpha = rnorm(1000),
       beta = rnorm(1000)) %>%
  mutate(mean = alpha + beta) %>%
  bind_cols(random_output = rnorm(n = nrow(.), mean = .$mean, sd = 1))
#> # A tibble: 1,000 × 4
#> alpha beta mean random_output
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.231 -0.243 -0.0125 0.551
#> 2 0.213 0.647 0.861 0.668
#> 3 0.824 -0.353 0.471 0.852
#> 4 0.665 -0.916 -0.252 -1.81
#> 5 -0.850 0.384 -0.465 -3.90
#> 6 0.721 0.679 1.40 2.54
#> 7 1.46 0.857 2.32 2.14
#> 8 -0.242 -0.431 -0.673 -0.820
#> 9 0.234 0.188 0.422 -0.662
#> 10 -0.494 -2.15 -2.65 -3.01
#> # ℹ 990 more rows
Created on 2023-11-12 with reprex v2.0.2
In Python, I can create an intermediate dataframe, use its mean column as input to np.random.normal(), and then bind the result back to the dataframe, but this feels clunky. Is there a way to add the random_output column as part of the pipeline/chain?
import polars as pl
import numpy as np
# create a df
df = (
    pl.DataFrame(
        {
            "alpha": np.random.standard_normal(1000),
            "beta": np.random.standard_normal(1000)
        }
    )
    .with_columns(
        (pl.col("alpha") + pl.col("beta")).alias("mean")
    )
)
# create an intermediate object
sim_vals = np.random.normal(df.get_column("mean"))
# bind the simulated values to the original df
(
    df.with_columns(random_output = pl.lit(sim_vals))
)
#> shape: (1_000, 4)
┌───────────┬───────────┬───────────┬───────────────┐
│ alpha ┆ beta ┆ mean ┆ random_output │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════╪═══════════╪═══════════╪═══════════════╡
│ -1.380249 ┆ 1.531959 ┆ 0.15171 ┆ 0.938207 │
│ -0.332023 ┆ -0.108255 ┆ -0.440277 ┆ 0.081628 │
│ -0.718319 ┆ -0.612187 ┆ -1.330506 ┆ -1.286229 │
│ 0.22067 ┆ -0.497258 ┆ -0.276588 ┆ 0.908147 │
│ … ┆ … ┆ … ┆ … │
│ 0.299117 ┆ -0.371846 ┆ -0.072729 ┆ 0.592632 │
│ 0.789633 ┆ 0.95712 ┆ 1.746753 ┆ 2.954801 │
│ -0.264415 ┆ -0.761634 ┆ -1.026049 ┆ -1.369753 │
│ 1.893911 ┆ 1.554736 ┆ 3.448647 ┆ 5.192537 │
└───────────┴───────────┴───────────┴───────────────┘
Solution
There are four approaches that I can think of: two were mentioned in the comments, one is the one I use, and the last I know exists but don't personally use.
First (get_column(col) or ['col'] reference)
Use df.get_column as a parameter of np.random.normal, which you can do in a chain if you use pipe. For example:
df.with_columns(
    mean=pl.col('alpha') + pl.col('beta')
).pipe(lambda df: (
    df.with_columns(
        rando=pl.lit(np.random.normal(df['mean']))
    )
))
Second (map_batches)
Use map_batches as an expression:
df.with_columns(
    mean=pl.col('alpha') + pl.col('beta')
).with_columns(
    rando=pl.col('mean').map_batches(lambda col: pl.Series(np.random.normal(col)))
)
Third (numba)
This approach is faster than the previous two if you're going to do many randomizations, but it takes more setup (hence the caveat about many randomizations).
numba lets you create ufuncs, which are compiled functions that you can use directly inside an expression.
You can create this function, which just uses the default standard deviation of 1:
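import numba as nb
import numpy as np

# generalized ufunc: takes an array of means, fills res with one draw per mean
@nb.guvectorize([(nb.float64[:], nb.float64[:])], '(n)->(n)', nopython=True)
def rando(means, res):
    for i in range(len(means)):
        res[i] = np.random.normal(means[i])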
Then you can do:
df.with_columns(
    mean=pl.col('alpha') + pl.col('beta')
).with_columns(rand_nb=rando(pl.col('mean')))
Fourth (rust extension)
Unfortunately for this answer (and, I suppose, for me in general), I haven't dabbled in Rust programming, but there's an extension interface that lets you write functions in Rust and deploy them as expressions. There is documentation on doing that.
Performance
Using a 1M row df I get...
First method: 71.1 ms ± 8.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Second method: 70.7 ms ± 7.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Third method: 45.7 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
One thing to note is that the numba approach is only faster when you want a different mean for each row. With a constant mean, for instance:
df.with_columns(z=rando(pl.repeat(5,pl.count()))): 43.8 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df.with_columns(z=pl.Series(np.random.normal(5,1,df.shape[0]))): 39.6 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Answered By - Dean MacGregor