Monday, November 1, 2021

[FIXED] Feature Engineering: Scaling for different distributions

Labels: feature-engineering, feature-selection, pandas, python, scikit-learn

Issue

I am trying to understand the best way to scale my features, and how to use scikit-learn to fit the scalers and transform the dataset I want to predict on.

I have two groups of features.

The first group is normally distributed, so I just want to scale the values (positive values between 20 and 100) using MinMaxScaler.

The second group of features has outliers, so I believe RobustScaler will give better results.

My questions are:

1. Can I use multiple scalers on my dataset for a classification problem using a random forest (RF)?
2. Within scikit-learn, when I try to scale just one feature using RobustScaler on my training data, I get this error: ValueError: Expected 2D array, got 1D array instead. I am not sure how to read this error. Can I not scale just one feature?
3. If I use two scalers on my data, what is the best way to implement the feature engineering when I want to make predictions one row at a time? Do I just use transform?

Solution

Yes, you can use multiple scalers if you find it useful.
You can also scale a single feature. However, if you pass a single column as a 1D Series, as below, you will get an error:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "feature1": [1,2,3,4,5],
    "feature2": [100, 200, 300, 400, 500],
    "feature3": [200, 300, 400, 500, 600],
})

scaler = StandardScaler()

# df["feature1"] is a 1D Series, but scalers expect 2D input (rows x columns)
scaler.fit_transform(df["feature1"])

# output
ValueError: Expected 2D array, got 1D array instead:

For a single column, you need to reshape the input into a 2D array with one column:

scaler = StandardScaler()

scaler.fit_transform(df["feature1"].values.reshape(-1, 1))

# output
array([[-1.41421356],
       [-0.70710678],
       [ 0.        ],
       [ 0.70710678],
       [ 1.41421356]])
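
As an alternative to reshaping, selecting the column with double brackets keeps the data two-dimensional. The snippet below is a minimal sketch of that idea, reusing the feature1 column from the example above. The variable names scaled_2d and scaled_reshaped are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"feature1": [1, 2, 3, 4, 5]})

# Double brackets select a one-column DataFrame (2D), not a Series (1D).
scaled_2d = StandardScaler().fit_transform(df[["feature1"]])

# Same result as reshaping the underlying 1D array by hand.
scaled_reshaped = StandardScaler().fit_transform(
    df["feature1"].to_numpy().reshape(-1, 1)
)

print(scaled_2d.shape)                          # (5, 1)
print(np.allclose(scaled_2d, scaled_reshaped))  # True
```

Both forms give the scaler the (n_samples, n_features) shape it expects, so which one to use is mostly a matter of taste.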

To apply different scalers to different columns, you can branch the preprocessing using ColumnTransformer:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler


df = pd.DataFrame({
    "feature1": [1,2,3,4,5],
    "feature2": [100, 200, 300, 400, 500],
    "feature3": [200, 300, 400, 500, 600],
})

transformers = ColumnTransformer(
    transformers=[
        ("scaling1", MinMaxScaler(), ["feature1"]),
        ("scaling2", StandardScaler(), ["feature2", "feature3"])
    ]
)

# fit_transform returns a NumPy array, not a DataFrame
transformed = transformers.fit_transform(df)

transformed

# output
array([[ 0.        , -1.41421356, -1.41421356],
       [ 0.25      , -0.70710678, -0.70710678],
       [ 0.5       ,  0.        ,  0.        ],
       [ 0.75      ,  0.70710678,  0.70710678],
       [ 1.        ,  1.41421356,  1.41421356]])

If you would like, for example, to use the first scaler (scaling1) to inverse transform its column:

scaler_1 = transformers.named_transformers_["scaling1"]
scaler_1.inverse_transform(transformed[:, 0].reshape(-1, 1))

# output
array([[1.],
       [2.],
       [3.],
       [4.],
       [5.]])
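
For the third question (predicting one row at a time), the usual pattern is to fit the scalers once on the training data and then call only transform (or predict on a pipeline) for each new row. The following is a rough sketch under assumptions, not a definitive implementation: the training values, the label column, the step names and the random forest settings are all made up for illustration, and RobustScaler is used for the outlier-prone columns as proposed in the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical training data; "label" is an assumed target column.
train = pd.DataFrame({
    "feature1": [20, 40, 60, 80, 100],
    "feature2": [100, 200, 300, 400, 5000],  # contains an outlier
    "feature3": [200, 300, 400, 500, 6000],  # contains an outlier
    "label": [0, 0, 1, 1, 1],
})
feature_cols = ["feature1", "feature2", "feature3"]

preprocess = ColumnTransformer(
    transformers=[
        ("minmax", MinMaxScaler(), ["feature1"]),
        ("robust", RobustScaler(), ["feature2", "feature3"]),
    ]
)

model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# fit() learns the scaling parameters and trains the forest on training data only.
model.fit(train[feature_cols], train["label"])

# A single new row, kept as a one-row DataFrame so column names are preserved.
new_row = pd.DataFrame([{"feature1": 50, "feature2": 250, "feature3": 350}])

# Only transform() is applied to new data; the scalers are not refitted.
scaled_row = model.named_steps["preprocess"].transform(new_row)
print(scaled_row)  # [[ 0.375 -0.25  -0.25 ]]

# predict() runs the same fitted preprocessing before the classifier.
prediction = model.predict(new_row)
```

Wrapping the scalers and the classifier in one Pipeline means the same fitted preprocessing is applied at prediction time, which avoids accidentally refitting a scaler on a single row.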

Answered By - Pav3k

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
