Wednesday, October 26, 2022

[FIXED] Impute missing values with prediction from linear regression in a Pandas dataframe

Labels: linear-regression, pandas, python, scikit-learn

Issue

I'm working with this DataFrame named na, which contains only the rows with missing values. All of the missing values are in the d column.

        genuine     a       b   c       d       e       f
23      True    171.94  103.89  103.45  NaN     3.25    112.79
75      True    171.60  103.85  103.91  NaN     2.56    113.27
210     True    172.03  103.97  103.86  NaN     3.07    112.65
539     False   172.07  103.74  103.76  NaN     3.09    112.41
642     True    172.14  104.06  103.96  NaN     3.24    113.07
780     True    172.41  103.95  103.79  NaN     3.13    113.41
798     True    171.96  103.84  103.62  NaN     3.01    114.44

I used scikit-learn's LinearRegression to train and test a model that predicts d from the f column:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# data prep
df = df_data.dropna(axis=0).reset_index(drop=True)
X = np.array(df['f']).reshape(-1, 1)
y = np.array(df['d'])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
  
# Training
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Evaluation
print(f"score: {lin_reg.score(X_test, y_test)}")
print(f"intercept: {lin_reg.intercept_}")
print(f"slope: {lin_reg.coef_}")

Then I used this model to predict the missing values:

# new dataframe with only the missing data as shown previously
na = df_data[df_data['d'].isnull()]

x_null = na['f'].values.reshape(-1,1)
y_null = lin_reg.predict(x_null)

Now y_null is a NumPy array, and I don't know how to impute those predicted values into the na DataFrame and then into df_data to fill the missing values.

If I use na.fillna({'d': y_null}), it raises an error: "value" parameter must be a scalar, dict or Series, but you passed a "ndarray". I also tried a lambda function, without success.

I want to be sure that each predicted value in y_null goes to the right row of the d column. I assume the y_null array follows the row order of na. Is that correct?

How can I impute the predicted values in place of the NaN?

Solution

I finally found a way to do it. There may be more efficient code, but for now this works:

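import pandas as pd

# Create a new DataFrame holding the predictions (positional index 0..n-1)
df_null = pd.DataFrame(y_null, columns=['prevision'])

# Reset the index of na; its original row labels move into an 'index' column
nan = na.copy().reset_index()

# Add the prediction column to the new nan DataFrame (rows match by position)
df_prev = pd.concat([nan, df_null], axis=1)

# Restore the original row labels as the index
df_prev = df_prev.set_index('index')

# Fill the missing d values; fillna aligns the Series on the index
df_ok = df_data.fillna({'d': df_prev['prevision']}).copy()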

Now I'm sure the added values share the same index as the rows they belong to, so fillna() should put each prediction in the right row.
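A shorter route to the same result might be to attach the index of na directly to the predictions. This is only a sketch on a made-up toy df_data (the values and index labels are invented for illustration), and it relies on predict() returning one value per input row, in the same order as the rows it was given. The names known, mask, pred and df_ok2 are not from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for df_data (made-up values, non-contiguous index on purpose)
df_data = pd.DataFrame(
    {
        "d": [4.0, np.nan, 4.4, np.nan, 4.8, 5.0],
        "f": [112.0, 112.5, 113.0, 113.5, 114.0, 114.5],
    },
    index=[10, 23, 42, 75, 99, 210],
)

# Fit on the rows where d is known
known = df_data.dropna(subset=["d"])
lin_reg = LinearRegression().fit(known[["f"]], known["d"])

# Predict for the rows where d is missing
mask = df_data["d"].isna()
na = df_data[mask]
y_null = lin_reg.predict(na[["f"]])

# Option 1: wrap the array in a Series carrying na's index, then fillna
pred = pd.Series(y_null, index=na.index, name="d")
df_ok = df_data.copy()
df_ok["d"] = df_ok["d"].fillna(pred)

# Option 2: assign through the boolean mask on a copy
df_ok2 = df_data.copy()
df_ok2.loc[mask, "d"] = y_null
```

Both options leave df_data untouched and write each prediction to the row it was computed from, because the Series index (option 1) and the boolean mask (option 2) come from the same rows that were passed to predict().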

Answered By - Lilly_Co

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
