Issue
My problem setup is as follows: Python 3.7, Pandas version 1.0.3, and sklearn version 0.22.1. I am applying a StandardScaler (to every column of a float matrix) as usual. However, the columns that I get out do not have a standard deviation of 1, while their mean values are (approximately) 0.
I am not sure what is going wrong here. I have checked whether the scaler got confused and standardised the rows instead, but that does not seem to be the case.
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
np.random.seed(1)
row_size = 5
n_obs = 100
X = pd.DataFrame(np.random.randint(0, 1000, n_obs).reshape((row_size, int(n_obs / row_size))))  # 5 rows x 20 columns
scaler = StandardScaler()
scaler.fit(X)
X_out = scaler.transform(X)
X_out = pd.DataFrame(X_out)
All columns have standard deviation 1.1180... as opposed to 1.
X_out[0].mean()
>>Out[2]: 4.4408920985006264e-17
X_out[0].std()
>>Out[3]: 1.1180339887498947
EDIT:
I have realised that as I increase row_size above, e.g. from 5 to 10 and 100, the standard deviation of the columns approaches 1. So maybe this has to do with the bias of the variance estimator getting smaller as n increases(?). However, it does not make sense that I can get unit variance by manually implementing (col[i] - col[i].mean()) / col[i].std(), but the StandardScaler struggles...
Solution
Numpy and Pandas use different definitions of standard deviation: numpy defaults to the biased estimator (ddof=0, divide by n), pandas to the unbiased one (ddof=1, divide by n-1). Sklearn uses the numpy definition, so scaler.transform(X).std(axis=0) gives 1s for every column. But then you wrap the standardized values X_out in a pandas DataFrame and ask pandas for the standard deviation of the same values, and pandas divides by n-1 instead of n. With only row_size = 5 observations per column, that inflates the result by sqrt(5/4) ≈ 1.1180, which is exactly what you observe.
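A quick way to see the two conventions side by side (illustrative, reusing X_out and the numpy import from the question's code):
X_out.std(ddof=0)              # numpy/sklearn convention (divide by n): 1.0 for every column
X_out.std()                    # pandas default (divide by n-1): 1.1180... = sqrt(5/4)
np.asarray(X_out).std(axis=0)  # the array sklearn actually produced: all 1.0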
In most cases you only care about all columns having the same spread, so the difference does not matter. But if you really want the unbiased standard deviation, you can't get it from sklearn's StandardScaler.
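A minimal sketch of the manual route the asker mentions, assuming the X from the question (the variable name X_unbiased is just illustrative):
X_unbiased = (X - X.mean()) / X.std()  # pandas defaults to ddof=1, i.e. the sample (unbiased) std
X_unbiased.std()                       # 1.0 for every column
X_unbiased.mean()                      # ~0 for every column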
Answered By - Niklas Mertsch