Issue
I have a dataset of weather parameters. All features are numerical, and the target is also continuous. I want to predict the amount of precipitation. My features look like this (Year, Month, and Day are just the MultiIndex of my DataFrame):
_, _, X, y = read_daily_data()
print(X)
                   MEANT         RH         WS          WD        CCT         MSLP       MAXT      MINT
Year Month Day
2014 1     1    4.494412  90.203694  16.615975  166.495278  59.916667  1014.029167   8.720245  0.310245
           2    5.978995  92.044333  20.621631  184.099628  63.875000  1008.670833   9.240245  3.530245
           3    6.586079  88.778159  22.263927  183.268500  50.108334  1013.070833  10.400246  2.340245
           4    6.358579  94.172092  15.272616  158.277724  66.666667  1007.625000   8.480246  4.600245
           5    4.995662  86.622807  16.897822  225.090521  59.383333  1010.754167   7.480245  0.440245
...                  ...        ...        ...         ...        ...          ...        ...       ...
2023 11    8    7.268995  82.063136  17.965620  202.643657  33.016667  1019.379167  12.380245  3.760245
           9    7.729829  82.235617  25.143419  196.132513  69.020834  1010.795833  10.380245  3.690246
           10   9.101078  76.940065  27.342357  228.518643  61.875000  1005.745833  10.670245  7.960245
           11   7.350245  82.186650  22.030794  242.243293  49.875000  1010.391667   8.660245  4.260245
           12   5.818162  93.582846  18.648649  181.010854  85.333333  1010.112500  11.230246  2.140245
[3603 rows x 8 columns]
And this is my target:
print(y)
Year Month Day
2014 1     1     1.4
           2     6.8
           3     0.8
           4    16.5
           5     5.5
                 ...
2023 11    8     0.0
           9     4.2
           10    9.3
           11    3.2
           12   14.0
Name: PT, Length: 3603, dtype: float64
I apply linear regression to my dataset:
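from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data (30% is used for training here)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=42)
# Standardize the features
std = StandardScaler()
std.fit(X)
X_train_std = std.transform(X_train)
# Fit a linear model with stochastic gradient descent
sgd_reg = SGDRegressor(random_state=42)
sgd_reg.fit(X_train_std, y_train)
# Score (R^2) on the held-out data
X_test_std = std.transform(X_test)
sgd_score = sgd_reg.score(X_test_std, y_test)
print(f"{sgd_score:.3f}")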
Which gives me this score:
0.385
Now, when I try to apply logistic regression:
from sklearn.linear_model import LogisticRegression
lgs_reg = LogisticRegression(random_state=42)
lgs_reg.fit(X_train_std, y_train)
I get this error:
ValueError: Unknown label type: continuous. Maybe you are trying to fit a classifier, which expects discrete classes on a regression target with continuous values.
I know that classification models like LogisticRegression predict by returning a continuous value, which is then thresholded to produce a class. If I implemented logistic regression myself (for example, using the sigmoid function), I could definitely pass in a continuous target. My question is: why doesn't scikit-learn accept this?
Also, please suggest a way to use classification models like logistic regression on my specific problem. One approach I understand is to discretize the continuous target into intervals. But after that, is the score from this model comparable with the score from linear regression?
I appreciate your help.
Solution
Your problem is really a regression problem, not a classification problem. Although LogisticRegression might seem like a regression algorithm from its name, it's really a classification algorithm (I know, confusing).
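To see where the error comes from, it can help to look at how scikit-learn categorizes a target before fitting a classifier. The sketch below is only an illustration: the `y` values are a handful of made-up precipitation amounts, `X` is random noise, and the bin edges (1.0 and 10.0) are arbitrary assumptions, not a recommendation for your data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.multiclass import type_of_target

# Made-up precipitation amounts standing in for the real target `y`.
y = np.array([1.4, 6.8, 0.8, 16.5, 5.5, 0.0, 4.2, 9.3, 3.2, 14.0])

# Float values that are not whole numbers are treated as a regression target.
print(type_of_target(y))  # continuous

# Hypothetical bin edges: class 0 is < 1.0, class 1 is 1.0 to < 10.0, class 2 is >= 10.0.
bin_edges = [1.0, 10.0]
y_binned = np.digitize(y, bin_edges)
print(y_binned)  # [1 1 0 2 1 0 1 1 1 2]
print(type_of_target(y_binned))  # multiclass

# Random features, only to show that the classifier now accepts the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(len(y), 3))
clf = LogisticRegression().fit(X, y_binned)
accuracy = clf.score(X, y_binned)  # fraction of correctly predicted classes
```

Note that `score` on a classifier returns accuracy, while `score` on a regressor returns R², so the two numbers measure different things and are not directly comparable.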
You should therefore use one of the regression algorithms available in scikit-learn. The scikit-learn documentation lists all available models, and you can choose any that is suitable for regression. Linear models are a good starting point, and HistGradientBoostingRegressor is usually a good contender.
Answered By - adrin