Issue
Is there a way to vectorize a pandas DataFrame row operation that uses only data from previous rows, without Python-level iteration?
I'm trying to calculate quantiles for a given distribution on each row, but I want to avoid data leakage.
So, for each row, I want to calculate the quantiles for sequence[:current_row].
import pandas as pd
import numpy as np

np.random.seed(42)
test_df = pd.DataFrame({'column_1': np.random.random(10)})

# One column per decile, q1 (10%) through q9 (90%)
qtls = pd.DataFrame({f'q{j}': [] for j in range(1, 10)})

# For each row i, compute the deciles over all values up to and including row i
for i in range(1, len(test_df) + 1):
    deciles = np.quantile(test_df['column_1'].iloc[:i], np.arange(0.1, 1.0, 0.1))
    row = {f'q{j}': [v] for j, v in zip(range(1, 10), deciles)}
    qtls = pd.concat([qtls, pd.DataFrame(row)])

qtls = qtls.reset_index(drop=True)
test_df = pd.concat([test_df, qtls], axis=1)
test_df
That was the brute-force for-loop implementation; this is the expected result:
index | column_1 | q1 | q2 | q3 | q4 | q5 | q6 | q7 | q8 | q9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.374540 | 0.374540 | 0.374540 | 0.374540 | 0.374540 | 0.374540 | 0.374540 | 0.374540 | 0.374540 | 0.374540 |
1 | 0.950714 | 0.432158 | 0.489775 | 0.547392 | 0.605010 | 0.662627 | 0.720245 | 0.777862 | 0.835479 | 0.893097 |
2 | 0.731994 | 0.446031 | 0.517522 | 0.589012 | 0.660503 | 0.731994 | 0.775738 | 0.819482 | 0.863226 | 0.906970 |
3 | 0.598658 | 0.441776 | 0.509011 | 0.576247 | 0.625326 | 0.665326 | 0.705327 | 0.753866 | 0.819482 | 0.885098 |
4 | 0.156019 | 0.243427 | 0.330836 | 0.419364 | 0.509011 | 0.598658 | 0.651993 | 0.705327 | 0.775738 | 0.863226 |
5 | 0.155995 | 0.156007 | 0.156019 | 0.265279 | 0.374540 | 0.486599 | 0.598658 | 0.665326 | 0.731994 | 0.841354 |
6 | 0.058084 | 0.116830 | 0.155999 | 0.156014 | 0.243427 | 0.374540 | 0.509011 | 0.625326 | 0.705327 | 0.819482 |
7 | 0.866176 | 0.126621 | 0.156004 | 0.177871 | 0.330836 | 0.486599 | 0.625326 | 0.718660 | 0.812503 | 0.891538 |
8 | 0.601115 | 0.136412 | 0.156009 | 0.243427 | 0.419364 | 0.598658 | 0.600624 | 0.679642 | 0.785667 | 0.883084 |
9 | 0.708073 | 0.146203 | 0.156014 | 0.308984 | 0.509011 | 0.599887 | 0.643898 | 0.715249 | 0.758830 | 0.874630 |
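As a quick sanity check on the expected values, row 1's quantiles come from the first two samples only; q5, for instance, is their midpoint:

import numpy as np

# Deciles of just the first two samples (rows 0 and 1 of column_1)
np.quantile([0.374540, 0.950714], [0.1, 0.5, 0.9])
# array([0.432158, 0.662627, 0.893097])  # matches q1, q5, q9 in row 1 above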
How can I do it efficiently?
Wouldn't it be easy if there were a df['some_column'] = df['another_column'].cumquantile()?
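pandas has no built-in cumquantile, but its expanding-window API comes very close; a hypothetical helper (the name cumquantile is not a real pandas method) could be sketched as:

import pandas as pd

def cumquantile(s: pd.Series, q: float) -> pd.Series:
    # Growing window: row i sees values from rows 0..i, so no future data leaks in
    return s.expanding().quantile(q)

The solution below builds on exactly this expanding-window idea.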
Solution
I don't think you can really vectorize this, but you could simplify your code to:
import pandas as pd
import numpy as np

np.random.seed(0)  # note: a different seed than the question's 42, so the values differ
test_df = pd.DataFrame({'column_1': np.random.random(10)})

# expanding() produces growing windows (rows [:1], [:2], ...); one pass per decile
q = test_df['column_1'].expanding().quantile
out = test_df.join(pd.DataFrame({f'q{i}': q(x) for i, x in
                                 enumerate(np.arange(0.1, 1.0, 0.1), start=1)}))
which should nevertheless be faster, since it no longer runs concat inside a loop (repeated concat recopies the accumulated frame on every iteration); a small benchmark sketch follows the output below.
Output:
column_1 q1 q2 q3 q4 q5 q6 q7 q8 q9
0 0.548814 0.548814 0.548814 0.548814 0.548814 0.548814 0.548814 0.548814 0.548814 0.548814
1 0.715189 0.565451 0.582089 0.598726 0.615364 0.632001 0.648639 0.665277 0.681914 0.698552
2 0.602763 0.559603 0.570393 0.581183 0.591973 0.602763 0.625249 0.647734 0.670219 0.692704
3 0.544883 0.546062 0.547241 0.548420 0.559603 0.575788 0.591973 0.614006 0.647734 0.681462
4 0.423655 0.472146 0.520638 0.545669 0.547241 0.548814 0.570393 0.591973 0.625249 0.670219
5 0.645894 0.484269 0.544883 0.546848 0.548814 0.575788 0.602763 0.624329 0.645894 0.680542
6 0.437587 0.432014 0.459046 0.523424 0.546455 0.548814 0.581183 0.611390 0.637268 0.673612
7 0.891773 0.433407 0.480506 0.545276 0.548027 0.575788 0.611390 0.641581 0.687471 0.768164
8 0.963663 0.434801 0.501965 0.546455 0.559603 0.602763 0.637268 0.687471 0.785823 0.906151
9 0.383442 0.419633 0.434801 0.512694 0.547241 0.575788 0.620016 0.666683 0.750506 0.898962
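To confirm that the two approaches agree, and to get a feel for the speedup, here is a minimal benchmark sketch (the helper names build_loop and build_expanding are mine; actual timings depend on your machine and pandas version):

import time
import pandas as pd
import numpy as np

def build_loop(s):
    # Brute-force version from the question: concat one row of deciles per step
    qtls = pd.DataFrame({f'q{j}': [] for j in range(1, 10)})
    for i in range(1, len(s) + 1):
        deciles = np.quantile(s.iloc[:i], np.arange(0.1, 1.0, 0.1))
        row = {f'q{j}': [v] for j, v in zip(range(1, 10), deciles)}
        qtls = pd.concat([qtls, pd.DataFrame(row)])
    return qtls.reset_index(drop=True)

def build_expanding(s):
    # Expanding-window version from the answer: one pass per decile
    q = s.expanding().quantile
    return pd.DataFrame({f'q{i}': q(x) for i, x in
                         enumerate(np.arange(0.1, 1.0, 0.1), start=1)})

np.random.seed(0)
s = pd.Series(np.random.random(2000), name='column_1')

t0 = time.perf_counter()
a = build_loop(s)
t1 = time.perf_counter()
b = build_expanding(s)
t2 = time.perf_counter()
print(f'loop: {t1 - t0:.3f} s, expanding: {t2 - t1:.3f} s')

# Both constructions should produce the same numbers
assert np.allclose(a.to_numpy(), b.to_numpy())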
Answered By - mozway