Issue
Problem statement: predict the weight of a courier package given that a customer places an order for certain items (e.g. boots, sneakers, etc.).
So the dataframe I have is made up of historical data, where the product_item_categories (e.g. boots, sneakers, etc.) make up the features and weight is my 'y' variable to be predicted. Each row of the dataframe consists of counts of how many items of each product_item_category the customer ordered.
Example: a customer orders 1 pair of boots and 1 pair of sneakers. The row looks like:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 y
1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2.94
One of the features is items_total, here x47 (how many items the customer ordered in total).
I have created a linear model using:
from sklearn import linear_model
regr_model = linear_model.LinearRegression()
After splitting the dataframe into training and test sets, I fit the model using regr_model.fit(x_train, y_train).
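For concreteness, here is a minimal runnable sketch of this setup. The data is a synthetic stand-in (the column names follow the layout above, but the values are invented for illustration):

import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the historical order data: 46 per-category
# counts, items_total (x47) as their row sum, and the package weight y.
rng = np.random.RandomState(0)
counts = rng.poisson(0.2, size=(500, 46))
df = pd.DataFrame(counts, columns=[f"x{i}" for i in range(1, 47)])
df["x47"] = df.sum(axis=1)                       # items_total
df["y"] = counts @ rng.uniform(0.1, 2.5, size=46) + rng.normal(0, 0.1, 500)

x_train, x_test, y_train, y_test = train_test_split(
    df.drop(columns="y"), df["y"], test_size=0.2, random_state=0)

regr_model = linear_model.LinearRegression()
regr_model.fit(x_train, y_train)

# Inspect the intercept, the R^2 score, and the per-feature coefficients.
print("intercept:", regr_model.intercept_)
print("score:", regr_model.score(x_test, y_test))
for name, coef in zip(x_train.columns, regr_model.coef_):
    print(name, coef)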
When I look at the coefficients, I get the following output (formatted to make more sense):
1 feature x1 6494532107.689080 (this is the items_total feature)
2 feature x2 (-6494532105.548431)
3 feature x3 (-6494532105.956598)
4 feature x4 (-6494532105.987348)
5 feature x5 (-6494532106.081478)
6 feature x6 (-6494532106.139558)
7 feature x7 (-6494532106.163167)
8 feature x8 (-6494532106.326231)
9 feature x9 (-6494532106.360985)
10 feature x10 (-6494532106.507434)
11 feature x11 (-6494532106.678183)
12 feature x12 (-6494532106.711108)
13 feature x13 (-6494532106.906321)
14 feature x14 (-6494532106.916800)
15 feature x15 (-6494532106.941691)
16 feature x16 (-6494532107.049221)
17 feature x17 (-6494532107.071664)
18 feature x18 (-6494532107.076819)
19 feature x19 (-6494532107.095350)
20 feature x20 (-6494532107.124458)
21 feature x21 (-6494532107.208526)
22 feature x22 (-6494532107.291896)
23 feature x23 (-6494532107.315606)
24 feature x24 (-6494532107.319578)
25 feature x25 (-6494532107.322818)
26 feature x26 (-6494532107.337678)
27 feature x27 (-6494532107.345344)
28 feature x28 (-6494532107.347136)
29 feature x29 (-6494532107.374278)
30 feature x30 (-6494532107.403748)
31 feature x31 (-6494532107.405770)
32 feature x32 (-6494532107.411852)
33 feature x33 (-6494532107.469144)
34 feature x34 (-6494532107.470899)
35 feature x35 (-6494532107.471970)
36 feature x36 (-6494532107.489899)
37 feature x37 (-6494532107.495930)
38 feature x38 (-6494532107.504712)
39 feature x39 (-6494532107.522346)
40 feature x40 (-6494532107.557917)
41 feature x41 (-6494532107.561793)
42 feature x42 (-6494532107.562286)
43 feature x43 (-6494532107.601017)
44 feature x44 (-6494532107.603461)
45 feature x45 (-6494532107.686674)
46 feature x46 (-6494532107.843128)
47 feature x47 (-6494532107.910987)
The intercept is 0.555702083558 and the model score is 0.79.
When I remove items_total, I get coefficients which make more sense:
1 feature x2 2.140582
2 feature x3 1.732328
3 feature x4 1.701661
4 feature x5 1.607465
5 feature x6 1.549196
6 feature x7 1.526227
7 feature x8 1.363067
8 feature x9 1.329225
9 feature x10 1.18109
10 feature x11 1.010639
11 feature x12 0.978123
12 feature x13 0.782569
13 feature x14 0.773164
14 feature x15 0.747479
15 feature x16 0.638743
16 feature x17 0.617082
17 feature x18 0.61257
18 feature x19 0.593665
19 feature x20 0.565309
20 feature x21 0.480105
21 feature x22 0.396592
22 feature x23 0.373675
23 feature x24 0.369643
24 feature x25 0.365989
25 feature x26 0.350971
26 feature x27 0.343381
27 feature x28 0.34158
28 feature x29 0.314405
29 feature x30 0.285344
30 feature x31 0.282827
31 feature x32 0.277007
32 feature x33 0.219727
33 feature x34 0.217814
34 feature x35 0.217466
35 feature x36 0.198526
36 feature x37 0.193277
37 feature x38 0.184332
38 feature x39 0.166745
39 feature x40 0.130655
40 feature x41 0.127573
41 feature x42 0.126665
42 feature x43 0.087371
43 feature x44 0.085545
44 feature x45 0.003045
45 feature x46 (-0.153778)
46 feature x47 (-0.221548)
The intercept and score of the model are the same. Can someone help me understand why the coefficients are so different when I remove the items_total column?
Solution
I think it's mostly a theoretical problem. It would be better to ask this on https://stats.stackexchange.com/ or https://datascience.stackexchange.com/
It's called multicollinearity.
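You can confirm it directly from the design matrix. A small sketch with hypothetical data in the asker's layout (per-category counts, with items_total as the last column):

import numpy as np

# Hypothetical feature matrix: three category-count columns plus
# items_total (the row sum) as the last column.
X = np.array([[1., 1., 0., 2.],
              [2., 0., 1., 3.],
              [0., 1., 1., 2.],
              [1., 0., 0., 1.],
              [0., 0., 2., 2.]])

# A rank below the number of columns signals an exact linear dependence.
print(np.linalg.matrix_rank(X), "of", X.shape[1])        # 3 of 4

# items_total is exactly the sum of the other counts in every row.
print(np.allclose(X[:, -1], X[:, :-1].sum(axis=1)))      # True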
I'll provide a better example to demonstrate the problem; this example is available on the Russian version of the Wikipedia page on multicollinearity:
Let's assume that you have the following features: x1, x2, x3, where x1 = x2 + x3.

So we have a model which looks like:

y = b0 + b1*x1 + b2*x2 + b3*x3

Let's add some arbitrary a to b1, and subtract a from b2 and b3:

y = b0 + (b1 + a)*x1 + (b2 - a)*x2 + (b3 - a)*x3
  = b0 + b1*x1 + b2*x2 + b3*x3 + a*(x1 - x2 - x3)
  = b0 + b1*x1 + b2*x2 + b3*x3        (because x1 - x2 - x3 = 0)
So we have achieved the same model after an arbitrary modification of the coefficients; that's the problem: the data cannot distinguish between the two coefficient vectors, so the fitted coefficients can become arbitrarily large, as in your output. Thus you should avoid such strong correlations between features (your last feature, items_total, is the exact sum of all the others).
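To see this end to end, here is a small self-contained sketch (synthetic data, not the asker's): with an exact-sum column the coefficient vector is not unique, the +a / -a shift from the algebra above leaves every prediction unchanged, and dropping the redundant column makes the solution unique again.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x2 = rng.randint(0, 3, size=200).astype(float)   # e.g. count of boots
x3 = rng.randint(0, 3, size=200).astype(float)   # e.g. count of sneakers
x1 = x2 + x3                                     # items_total: exact sum
X = np.column_stack([x1, x2, x3])
y = 1.5 * x2 + 0.8 * x3 + rng.normal(scale=0.1, size=200)

# The returned coefficients are only one of infinitely many solutions
# that fit equally well.
model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "score:", model.score(X, y))

# Shifting b1 by +a and b2, b3 by -a changes the coefficients
# arbitrarily without changing a single prediction, because
# a*(x1 - x2 - x3) == 0 for every row.
a = 1e6
shifted = model.coef_ + np.array([a, -a, -a])
print(np.allclose(X @ model.coef_, X @ shifted))          # True

# Dropping the redundant items_total column restores unique,
# interpretable coefficients at the same fit quality.
model2 = LinearRegression().fit(X[:, 1:], y)
print("coefficients without items_total:", model2.coef_)
print("score:", model2.score(X[:, 1:], y))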
Answered By - Ibraim Ganiev