Issue
I need to run a regression for each group, then pass the coefficients into the new column b. Here is my code:
Self-defined function:
def simplereg(g, y, x):
    try:
        xvar = sm.add_constant(g[x])
        yvar = g[y]
        model = sm.OLS(yvar, xvar, missing='drop').fit()
        b = model.params[x]
        return pd.Series([b*100]*len(g))
    except Exception as e:
        return pd.Series([np.NaN]*len(g))
Create sample data:
import pandas as pd
import numpy as np
# Setting the parameters
gvkeys = ['A', 'B', 'C', 'D'] # Possible values for gvkey
years = np.arange(2000, 2020) # Possible values for year
# Number of rows for each gvkey, ensuring 5-7 observations for each
num_rows_per_gvkey = np.random.randint(5, 8, size=len(gvkeys))
total_rows = sum(num_rows_per_gvkey)
# Creating the DataFrame
np.random.seed(0) # For reproducibility
df = pd.DataFrame({
    'gvkey': np.repeat(gvkeys, num_rows_per_gvkey),
    'year': np.random.choice(years, size=total_rows),
    'y': np.random.rand(total_rows),
    'x': np.random.rand(total_rows)
})
df.sort_values(by='year', ignore_index=True, inplace=True)  # make sure the code can handle data even when it is not sorted by gvkey
Run groupby code:
df['b'] = df.groupby('gvkey').apply(simplereg, y='y', x='x')
However, the code returns column 'b' with all NaN values. May I ask where the issue is and how to fix it?
Thank you
Solution
Is it a bad idea to catch all exceptions?
Obviously yes. In addition, you do not display any error message, so if your function returns NaN values, it's probably because your code is throwing an exception that you never see.
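One quick way to surface the hidden error while debugging is to print the full traceback inside the except block instead of silently returning NaN; a minimal sketch of that idea (the name simplereg_debug is just for illustration):

import traceback

def simplereg_debug(g, y, x):
    try:
        xvar = sm.add_constant(g[x])
        yvar = g[y]
        model = sm.OLS(yvar, xvar, missing='drop').fit()
        return pd.Series([model.params[x] * 100] * len(g), index=g.index)
    except Exception:
        traceback.print_exc()  # show the full stack trace instead of swallowing it
        return pd.Series([np.NaN] * len(g), index=g.index)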
Your code works well for me if I import statsmodels.api as sm
and make minor changes:
import statsmodels.api as sm

def simplereg(g, y, x):
    try:
        xvar = sm.add_constant(g[x])
        yvar = g[y]
        model = sm.OLS(yvar, xvar, missing='drop').fit()
        b = model.params[x]
        return pd.Series([b*100]*len(g), index=g.index)  # reindex here
    except Exception as e:
        print(e)  # at least, print the exception here
        return pd.Series([np.NaN]*len(g), index=g.index)  # reindex here
# drop the first group level (gvkey) to align indexes
df['b'] = df.groupby('gvkey').apply(simplereg, y='y', x='x').droplevel('gvkey')
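As an aside, if you prefer not to call droplevel, you can instead ask groupby not to add the group keys to the result index in the first place; a small sketch of that variant, assuming the fixed simplereg above:

# group_keys=False keeps the original row index, so no droplevel is needed
df['b'] = df.groupby('gvkey', group_keys=False).apply(simplereg, y='y', x='x')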
Output:
>>> df
gvkey year y x b
0 A 2000 0.799159 0.128926 -55.326856
1 B 2001 0.774234 0.253292 -68.351309
2 A 2003 0.461479 0.315428 -55.326856
3 A 2003 0.780529 0.363711 -55.326856
4 B 2004 0.521848 0.208877 -68.351309
5 C 2005 0.612096 0.656330 6.342994
6 D 2005 0.060225 0.096098 36.320231
7 B 2006 0.414662 0.161310 -68.351309
8 C 2006 0.456150 0.466311 6.342994
9 A 2007 0.118274 0.570197 -55.326856
10 C 2007 0.568434 0.244426 6.342994
11 C 2008 0.943748 0.196582 6.342994
12 D 2009 0.681820 0.368725 36.320231
13 B 2009 0.639921 0.438602 -68.351309
14 A 2012 0.870012 0.670638 -55.326856
15 B 2012 0.264556 0.653108 -68.351309
16 C 2013 0.616934 0.138183 6.342994
17 C 2014 0.018790 0.158970 6.342994
18 A 2015 0.978618 0.210383 -55.326856
19 D 2015 0.666767 0.976459 36.320231
20 D 2016 0.437032 0.097101 36.320231
21 C 2017 0.617635 0.110375 6.342994
22 B 2018 0.944669 0.102045 -68.351309
23 B 2019 0.143353 0.988374 -68.351309
24 D 2019 0.359508 0.820993 36.320231
25 D 2019 0.697631 0.837945 36.320231
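To sanity-check a single group's coefficient, a plain least-squares line fit with np.polyfit should give the same slope, up to the *100 scaling used in simplereg:

# check one group, e.g. gvkey 'A'
g = df[df['gvkey'] == 'A']
slope = np.polyfit(g['x'], g['y'], 1)[0]  # degree-1 fit returns [slope, intercept]
print(slope * 100)  # should match the value stored in column 'b' for gvkey 'A'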
Answered By - Corralien