Issue
I'm trying to do a linear regression, but I want to use my own data from a .txt file. My data is a table with 3 columns.
I would like to know how to adapt the following code, which is the example from http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
I changed the example code a bit and invented some data. Is this a correct way to do it, using X and Y like this? I would also like to know how, in the line x_train = x[:2], the [:2] influences my procedure. I didn't really understand this part.
from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score
#X has to be numpy array not list.
x = ([0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10])
y = [5, 3, 8, 3, 4, 5, 5, 7, 8, 9, 10]
x_train = x[:2]
x_test = x[2:]
y_train = y[:2]
y_test = y[2:]
regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)
#coefficient
print('Coefficients: \n', regr.coef_)
#the mean square error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
print('Variance score: %.2f' % r2_score(y_test, y_pred))
plt.scatter(x_test, y_test, color='black')
plt.plot(x_test, y_pred, color='blue', linewidth=3)
plt.axis([0, 20, 0, 20])
plt.show()
EDIT 1
With the help I received on this page, I tried to write some code to fit my own data, but I can't get a correct fit, so I'd be grateful if someone has time to help me a bit more or tell me what I'm doing wrong.
Here is the code I'm using, along with the plots I'm getting:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
data = pd.read_csv('data.txt')
#x = data[['col1','col2']]
x = data[['col1']]
y = data['col3']
#convert to array to fit the model
x=np.asarray(x)
y=np.asarray(y)
# define the KFolds
kf = KFold(n_splits=2)
#define the model
regr = linear_model.LinearRegression()
# use cross validation and return the r2 score for each Fold
#if you want to return other scores than r2, just change the scoring in cross_val_score
scores = cross_val_score(regr, x, y, cv=kf, scoring='r2')
print(scores)
for train_index, test_index in kf.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
plt.scatter(X_test, y_test)
plt.show()
Here is a picture of what my data look like and what I get for the TRAIN and TEST samples.
Then I did a fit procedure, but I'm not sure whether it's correct:
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print(y_pred)
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.show()
And I get a completely strange fit.
I don't understand why I get it, because when I did this using MINUIT my fit worked. So I'd appreciate any hints.
Why does the program apparently not use my data from "y" to build the TRAIN and TEST samples?
My data can be downloaded here: https://www.dropbox.com/sh/nbbsc0fqznkwxvt/AAD-u6lM4orJOGrgIyz0o8B9a?dl=0
Only col1 and col3 matter to me; col2 should be ignored. I want to fit this data and extract the parameters of the fit. I know it's a line that fits this data.
Solution
First of all, the main reason to split the data, using one part to train the model and another part to evaluate it, is to avoid overfitting. Usually, we use KFold or LOO (leave-one-out) to perform cross-validation.
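As a side note (not in the original answer), LOO simply makes every fold a single test sample; here is a minimal sketch using sklearn's LeaveOneOut, with artificial data analogous to the KFold example that follows:
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn import linear_model
# small artificial data set, purely for illustration
x = np.random.rand(10, 3)
y = np.arange(10)
loo = LeaveOneOut()
regr = linear_model.LinearRegression()
# one score per left-out sample; r2 is not defined for a single test point,
# so a per-sample error metric is used instead of 'r2'
scores = cross_val_score(regr, x, y, cv=loo, scoring='neg_mean_squared_error')
print(scores.mean())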
Here is an example using 30 samples, 3 variables and cross-validation with KFold.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
#create artificial data with 30 lines (samples) and 3 columns (variables)
x = np.random.rand(30,3)
#create the target variable y
y = range(30)
# convert the list to numpy array (this is needed for fit method of sklearn)
y = np.asarray(y)
# define the KFolds (3 folds in this example)
kf = KFold(n_splits=3)
#define the model
regr = linear_model.LinearRegression()
# use cross validation and return the r2 score for each Fold (here we have 3).
#if you want to return other scores than r2, just change the scoring in cross_val_score.
scores = cross_val_score(regr, x, y, cv=kf, scoring='r2')
print(scores)
Results:
Here you can see the r2 score of the model for each fold. So we split the data 3 times and used 3 different training sets to get these values. This is done automatically by sklearn inside the cross_val_score method.
array([-30.36184326, -0.4149778 , -28.89110233])
To understand what KFold does you can print the training and testing indices using:
for train_index, test_index in kf.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
Results:
('TRAIN:', array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29]), 'TEST:', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
('TRAIN:', array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29]), 'TEST:', array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]))
('TRAIN:', array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19]), 'TEST:', array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29]))
Now, you can see that for the 1st Fold we used the samples: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29.
Next, for the second fold we used the samples: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29.
Note: these numbers are the indices of the x data. E.g. 2 means the 3rd sample (line); in Python we count from 0. As you can see, we do not use the exact same data (samples) in each Fold.
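For completeness, here is a minimal sketch of what cross_val_score does under the hood, fitting and scoring the model on each fold by hand (it reuses the x, y, kf and regr defined above):
from sklearn.metrics import r2_score
for train_index, test_index in kf.split(x):
    X_train, X_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fit on the training fold, evaluate on the held-out fold
    regr.fit(X_train, y_train)
    print(r2_score(y_test, regr.predict(X_test)))
For the same x and y, this prints the same three r2 scores that cross_val_score returned above.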
Hope this helps.
EDIT 1
To answer your question about loading the txt data: let's suppose that you have a txt file with 3 columns. The first 2 columns are the features and the last column is the y (target).
In this case, you can do the following using pandas:
import pandas as pd
import numpy as np
data = pd.read_csv('data.txt')
x = data[['col1','col2']]
y = data['col3']
#convert to array to fit the model
x=np.asarray(x)
y=np.asarray(y)
The txt is here: https://ufile.io/eb5xl (choose slow download).
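One caveat (an assumption on my part, since the exact file layout isn't shown here): pd.read_csv expects comma-separated values by default, so if data.txt is whitespace- or tab-delimited, pass the separator explicitly, for example:
# read a whitespace-delimited txt file; adjust sep to match the actual file
data = pd.read_csv('data.txt', sep=r'\s+')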
EDIT 2
This is only for visualization purposes. I do not split the data. I use all the data to fit the model and then I predict on the same data. Then I plot the predicted values.
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
import matplotlib.pyplot as plt
data = pd.read_csv('data.txt')
x = data[['col1']]
y = data['col3']
#convert to array to fit the model
x=np.asarray(x)
y=np.asarray(y)
regr = linear_model.LinearRegression()
regr.fit(x, y)
y_predicted = regr.predict(x)
plt.scatter(x, y, color='black')
plt.plot(x, y_predicted, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Results:
It seems that the data don't follow a linear pattern. Other models should be used (e.g. an exponential fit).
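As one possible next step (not part of the original answer), a non-linear model could be fitted with scipy's curve_fit; the exponential form and the starting values below are only illustrative assumptions:
import numpy as np
from scipy.optimize import curve_fit
# assumed model: y = a * exp(b * x) + c
def exp_model(x, a, b, c):
    return a * np.exp(b * x) + c
# x was loaded above as an (n, 1) array, so flatten it for curve_fit
params, _ = curve_fit(exp_model, x.ravel(), y, p0=(1.0, 0.01, 0.0), maxfev=10000)
print(params)
The fitted parameters in params can then be used to plot the curve and judge the fit visually.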
Answered By - seralouk