Issue
I'm working with a DataFrame named na, where I kept only the rows with missing values. All of the missing values are in the d column:
genuine a b c d e f
23 True 171.94 103.89 103.45 NaN 3.25 112.79
75 True 171.60 103.85 103.91 NaN 2.56 113.27
210 True 172.03 103.97 103.86 NaN 3.07 112.65
539 False 172.07 103.74 103.76 NaN 3.09 112.41
642 True 172.14 104.06 103.96 NaN 3.24 113.07
780 True 172.41 103.95 103.79 NaN 3.13 113.41
798 True 171.96 103.84 103.62 NaN 3.01 114.44
I used scikit-learn's LinearRegression to train and test a model that predicts d values from the f column:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# data prep: keep only complete rows for training
df = df_data.dropna(axis=0).reset_index(drop=True)
X = np.array(df['f']).reshape(-1, 1)
y = np.array(df['d'])
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Training
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
# Evaluation (score is R^2 on the test set)
print(f"score: {lin_reg.score(X_test, y_test)}")
print(f"intercept: {lin_reg.intercept_}")
print(f"slope: {lin_reg.coef_}")
Then I used this model to predict the missing values:
# new dataframe with only the missing data as shown previously
na = df_data[df_data['d'].isnull()]
x_null = na['f'].values.reshape(-1,1)
y_null = lin_reg.predict(x_null)
y_null is returned as an array, so I don't know how to impute those predicted values into the na DataFrame, and then into df_data, to fill the missing values.
If I use na.fillna({'d': y_null}), it raises an error: "value" parameter must be a scalar, dict or Series, but you passed a "ndarray"
Moreover, I tried to use a lambda function but I didn't succeed.
I want to be sure that each predicted value in y_null goes to the right row in the d column. I assumed the y_null array is in the same order as the rows of na. Is that correct?
How to impute the predicted values instead of the NaN?
Solution
I finally found a way to do it (other code could probably be more efficient, but this works for now).
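# create a new DataFrame to store the predictions (default positional index 0..n-1)
df_null = pd.DataFrame(y_null, columns=['prevision'])
# reset the index on na; the original row labels are kept in an 'index' column
nan = na.copy().reset_index()
# add the prediction column to the new nan DataFrame (aligned by position)
df_prev = pd.concat([nan, df_null], axis=1)
# restore the original row labels as the index
df_prev = df_prev.set_index('index')
# fill the missing values; fillna aligns the Series on the index
df_ok = df_data.fillna({'d': df_prev['prevision']}).copy()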
Now I'm sure the added values share the same index as the rows they fill, so fillna() should place them correctly.
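A shorter variant of the same idea, as a minimal sketch: predict() returns one value per input row, in the order the rows were passed, so the array can be wrapped in a Series that reuses na.index and handed to fillna. The data below is synthetic and made up for illustration (it is not the dataset from the question), and the names known, predicted and the NaN positions are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for df_data (made-up values, not the original dataset)
rng = np.random.default_rng(0)
df_data = pd.DataFrame({"f": rng.normal(113.0, 0.5, size=50)})
df_data["d"] = 0.5 * df_data["f"] - 52.0 + rng.normal(0.0, 0.05, size=50)
df_data.loc[[3, 17, 42], "d"] = np.nan  # rows chosen arbitrarily for the demo

# Fit on the rows where d is known
known = df_data.dropna(subset=["d"])
lin_reg = LinearRegression()
lin_reg.fit(known["f"].to_numpy().reshape(-1, 1), known["d"].to_numpy())

# Predict for the rows where d is missing
na = df_data[df_data["d"].isnull()]
y_null = lin_reg.predict(na["f"].to_numpy().reshape(-1, 1))

# Attach the original row labels to the predictions, then fill by index alignment
predicted = pd.Series(y_null, index=na.index)
df_ok = df_data.copy()
df_ok["d"] = df_ok["d"].fillna(predicted)
```

Because the Series carries the labels from na.index, each prediction lands on the row it was computed from, and rows that already had a value are left unchanged. Writing the array directly with df_ok.loc[na.index, "d"] = y_null should give the same result.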
Answered By - Lilly_Co