Issue
I have a SQL table with some null values here and there, and I want to drop them because RandomForest doesn't accept null values. Here is the code:
query = f"SELECT * FROM {HISTORICAL_TABLE}"
historical_data = pd.read_sql(query, conn)
# Split data before preprocessing
training_size = int(len(historical_data) * 0.6)
testing_size = len(historical_data) - training_size
historical_data = historical_data.assign(Next_Close=historical_data['Close'].shift(-1))
historical_data = historical_data.dropna()
train = historical_data.iloc[:training_size]
test = historical_data.iloc[training_size:]
features = [...]
X_train = train[features]
X_test = test[features]
y_train = train['Next_Close']
y_test = test['Next_Close']
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
When I use historical_data = historical_data.dropna(), the dataset ends up empty and I get:
ValueError: Found array with 0 sample(s) (shape=(0, 17)) while a minimum of 1 is required by StandardScaler.
When I don't use dropna, the random forest complains that it doesn't accept null values. I don't want to use fillna, so is there another way to drop NaN values without ending up with an empty dataset?
Solution
View the output of historical_data.info() to assess which columns have many NaN values, and .drop() those that are causing you trouble, for example a column that shows up with a "Non-Null Count" of zero.
The fix might look like this:
historical_data = historical_data.drop(columns=["foo", "bar"])
Then fit your revised dataset.
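As a rough illustration of that advice, here is a minimal sketch on a made-up frame. The column names "Close", "Volume" and "Empty" are hypothetical stand-ins, not the asker's actual schema. It finds the columns that are entirely null, drops them, and only then drops the remaining rows with NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for historical_data: "Empty" is entirely null,
# "Close" has a single missing value.
df = pd.DataFrame({
    "Close": [1.0, 2.0, np.nan, 4.0],
    "Volume": [10, 20, 30, 40],
    "Empty": [np.nan] * 4,
})

# Without the fix, every row contains a NaN (from "Empty"), so nothing survives.
rows_before_fix = len(df.dropna())

# Columns whose non-null count is zero.
all_null_cols = df.columns[df.isna().all()].tolist()

# Drop those columns first, then drop the rows that still contain NaN.
cleaned = df.drop(columns=all_null_cols).dropna()

print(all_null_cols, rows_before_fix, len(cleaned))
```

If you would rather not name the columns at all, df.dropna(axis=1, how="all") should remove every entirely-null column in one step.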
There are other approaches to coping with missing values besides using .fillna(). Scikit-learn offers many imputers.
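For instance, a sketch using scikit-learn's SimpleImputer might look like the following. The feature names "Open" and "High" and the choice of the median strategy are assumptions for illustration only. The imputer is fitted on the training split and then reused on the test split, mirroring how the scaler is used in the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical feature frames with a few scattered NaN values.
X_train = pd.DataFrame({"Open": [1.0, np.nan, 3.0], "High": [2.0, 4.0, np.nan]})
X_test = pd.DataFrame({"Open": [np.nan, 5.0], "High": [6.0, np.nan]})

# Learn per-column medians from the training data only.
imputer = SimpleImputer(strategy="median")
X_train_imputed = imputer.fit_transform(X_train)

# Fill the test data with the medians learned from training.
X_test_imputed = imputer.transform(X_test)

print(X_test_imputed)
```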
You mentioned: "only 11 null values except for 1 column with all null."
Please understand that by default .dropna() will discard all rows if one or more columns are entirely null.
If you want to fill in the column later, that's fine. Drop it now, fit the data, and then get around to adding and populating the column with non-null values.
Or use .fillna(value=0) on just that column. Now .dropna() will discard at most eleven rows.
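A small sketch of that second option, again on invented data (the column name "Placeholder" is hypothetical): fill only the all-null column, then let .dropna() remove the few rows that are still incomplete.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "Placeholder" is entirely null, "Close" has one NaN.
df = pd.DataFrame({
    "Close": [1.0, np.nan, 3.0, 4.0],
    "Placeholder": [np.nan] * 4,
})

# Fill only the entirely-null column; other columns keep their NaN values.
df["Placeholder"] = df["Placeholder"].fillna(value=0)

# Now only rows with a genuine gap elsewhere are discarded.
cleaned = df.dropna()

print(len(cleaned))
```

Note that a column filled with a single constant carries no information for the model, so dropping it is usually the simpler choice unless you need the column to exist later.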
Answered By - J_H