Issue
I believe this code is causing my X and Y data to fall out of alignment, because their index numbers are different. Shouldn't they be the same, so the model knows which input corresponds to which label?
x_train, x_valid, y_train, y_valid = train_test_split(Normalise_Data(data), labels, test_size=0.2, shuffle=True)
This is my terminal output for input and labels from this function. Should the indexes not correspond?
x_train
Out[94]:
0 1 2 3 ... 4605 4606 4607 4608
114 0.999399 0.000000 0.000000 0.0 ... 0.000025 0.000048 0.000016 0.000038
44 0.995420 0.000000 0.000000 0.0 ... 0.000066 0.000103 0.000058 0.000040
160 0.999492 0.000000 0.000000 0.0 ... 0.000021 0.000024 0.000044 0.000028
293 0.999893 0.000000 0.000250 0.0 ... 0.000002 0.000007 0.000014 0.000003
129 0.999458 0.000885 0.000976 0.0 ... 0.000005 0.000034 0.000044 0.000048
.. ... ... ... ... ... ... ... ... ...
176 0.999750 0.000041 0.000000 0.0 ... 0.000032 0.000039 0.000034 0.000029
241 0.999832 0.000000 0.000000 0.0 ... 0.000005 0.000005 0.000017 0.000003
283 0.999927 0.000170 0.000094 0.0 ... 0.000007 0.000009 0.000010 0.000012
405 0.998595 0.000000 0.000000 0.0 ... 0.000051 0.000087 0.000019 0.000031
267 0.999899 0.000000 0.000254 0.0 ... 0.000011 0.000016 0.000015 0.000020
y_train
Out[95]:
567 0
44 0
884 0
1902 0
676 0
..
1003 0
1475 0
1826 0
302 1
1718 0
Name: prediction, Length: 427, dtype: int64
Solution
train_test_split will accept pd.DataFrame and pd.Series objects, but it doesn't use the indexes to decide what goes with what; it just goes by the order in which the rows are presented:
In [5]: X = pd.DataFrame(np.random.random((5,5)), index=list('ABCDE'))
In [6]: y = pd.Series(np.random.random(5), index=list('FGHIJ'))
In [7]: train_test_split(X, y)
Out[7]:
[ 0 1 2 3 4
A 0.353250 0.859230 0.055278 0.871435 0.827556
B 0.906734 0.244356 0.082618 0.614280 0.200890
E 0.285790 0.483524 0.206643 0.881300 0.085348,
0 1 2 3 4
D 0.437108 0.883394 0.468495 0.329983 0.685234
C 0.387929 0.889313 0.728260 0.049744 0.819579,
F 0.720916
G 0.072408
J 0.674973
dtype: float64,
I 0.452183
H 0.202770
dtype: float64]
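If you want to confirm the positional pairing for yourself, here is a minimal sketch. The toy data, the column name "feature" and the random_state value are made up for illustration; only train_test_split and the pandas calls are real. It maps the index labels of each training piece back to row positions and checks that the same positions were picked from X and from y:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: X and y deliberately carry different index labels.
X = pd.DataFrame({"feature": np.arange(5) * 10}, index=list("ABCDE"))
y = pd.Series(np.arange(5), index=list("FGHIJ"))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Translate the index labels of each split back into original row positions.
x_pos = X.index.get_indexer(X_train.index)
y_pos = y.index.get_indexer(y_train.index)

# The split took the same row positions from both inputs, in the same order,
# even though the labels ('A'.. vs 'F'..) never match.
same_positions = bool((x_pos == y_pos).all())
print(same_positions)
```

So mismatched index labels on their own are not a problem, as long as row i of the features and row i of the labels describe the same sample going in.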
You can fix this pretty easily by sorting both inputs by index before splitting, so that they are in the same order: pass Normalise_Data(data).sort_index() and labels.sort_index() to train_test_split.
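A rough sketch of that fix is below. It assumes the features and labels share the same index labels but are stored in different orders; the toy values and random_state are invented for illustration, and X stands in for the output of Normalise_Data(data). Note that sorting only lines the rows up when both objects really do carry the same set of labels:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: same index labels, stored in different orders.
# Each feature value is 10 * its label, so pairing is easy to verify.
X = pd.DataFrame({"feature": [30, 10, 40, 20, 0]}, index=[3, 1, 4, 2, 0])
y = pd.Series([0, 1, 2, 3, 4], index=[0, 1, 2, 3, 4])

# Sort both by index so that position i refers to the same sample in each.
X_sorted = X.sort_index()
y_sorted = y.sort_index()

X_train, X_valid, y_train, y_valid = train_test_split(
    X_sorted, y_sorted, test_size=0.2, shuffle=True, random_state=0
)

# After sorting, each split carries matching index labels on both sides.
aligned = X_train.index.equals(y_train.index) and X_valid.index.equals(y_valid.index)
print(aligned)
```

If the two objects have unrelated labels but are already in the same row order, sorting is not needed, since the split is positional anyway.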
Answered By - Randy