Issue
I have some code to validate a model, and I'd like to use each year in my data as a hold-out set. To do that, I am using sklearn's LeaveOneGroupOut:
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import LeaveOneGroupOut

log_loss_data = []
acc_data = []
# Assumes the folds come out from the latest year to the earliest
years = np.arange(df.year.min(), df.year.max() + 1)[::-1]
groups = df['year']
X = df[[__my_features__]]
y = df[__my_target__]
logo = LeaveOneGroupOut()
logo.get_n_splits(groups=groups)  # one split per distinct year
for year, (train_index, test_index) in zip(years, logo.split(X, y, groups)):
    print(f'Leaving out {year}...')
    X_train, X_test = X.iloc[train_index].copy(), X.iloc[test_index].copy()
    y_train, y_test = y.iloc[train_index].copy(), y.iloc[test_index].copy()
    model = LGBMClassifier()
    model.fit(X_train, y_train)
    # Probability of the positive class, aligned with X_test's index
    pred = pd.Series(model.predict_proba(X_test)[:, 1], index=X_test.index)
    log_loss_data.append(log_loss(y_test, pred))
    acc_data.append(accuracy_score(y_test, np.round(pred)))
When this is done, I have a list of log loss and accuracy scores, one per group. The code above assumes that the groups are iterated from greatest to least, but I am unsure whether this is the case. I'd like to associate my CV scores with their corresponding group year to see if there are any years (or groups of years/seasonality) that result in different or worse scores. In the docs, it appears as though there are only two methods, .get_n_splits() and .split(). I thought there was surely a way to access the group value in each CV iteration... Was I incorrect in this assumption?
EDIT: I did some testing and it turns out that numeric groups are likely iterated in order from least to greatest. To check this I created two different models. One used the earliest year in my data as a test set and the other used the latest. The respective scores for these models matched the first and last grouped cv iteration scores, respectively. While there is no official documentation (that I have come across) that confirms this, given this test I am quite confident that they are indeed iterated in order from least to greatest.
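A quicker way to run the same check, without fitting any models, is to read the group label off each test fold. The snippet below is a minimal sketch on made-up data (the `groups` array and the dummy `X` are illustrative, not from my dataset):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data: group labels deliberately out of order.
groups = np.array([2021, 2019, 2020, 2019, 2021, 2020])
X = np.zeros((len(groups), 1))  # dummy features, only the shape matters here

held_out = []
for train_index, test_index in LeaveOneGroupOut().split(X, groups=groups):
    # Every row in a test fold shares a single group label.
    fold_groups = np.unique(groups[test_index])
    assert len(fold_groups) == 1
    held_out.append(int(fold_groups[0]))

print(held_out)  # [2019, 2020, 2021]
```

The folds come out in ascending order even though the labels were shuffled in the input.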
Solution
Yes, as you've discovered, the splits happen in sorted order of the group identifiers.
In the source, you can see this: the group array is passed through numpy.unique, which returns the sorted unique values, and those are then looped over, each one serving as the test group for one split.
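If you would rather not depend on that ordering at all, one option is to read the held-out year directly from the test rows and key the scores by it. The following is only a sketch under assumptions: the synthetic `df`, the column names `feature_a`, `feature_b` and `target`, and the use of `LogisticRegression` in place of `LGBMClassifier` are stand-ins chosen so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic stand-in for the question's df (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": rng.integers(2015, 2020, size=500),  # years 2015..2019
    "feature_a": rng.normal(size=500),
    "feature_b": rng.normal(size=500),
})
df["target"] = (df["feature_a"] + rng.normal(size=500) > 0).astype(int)

X = df[["feature_a", "feature_b"]]
y = df["target"]
groups = df["year"]

scores = {}
for train_index, test_index in LeaveOneGroupOut().split(X, y, groups):
    # Read the held-out year off the test rows instead of assuming an order.
    year = int(groups.iloc[test_index].iloc[0])
    model = LogisticRegression().fit(X.iloc[train_index], y.iloc[train_index])
    proba = model.predict_proba(X.iloc[test_index])[:, 1]
    y_test = y.iloc[test_index]
    scores[year] = {
        "log_loss": log_loss(y_test, proba, labels=[0, 1]),
        "accuracy": accuracy_score(y_test, (proba >= 0.5).astype(int)),
    }

# One row per held-out year, ready for plotting or inspection.
scores_by_year = pd.DataFrame.from_dict(scores, orient="index").sort_index()
print(scores_by_year)
```

Since the loop follows numpy.unique, `zip(np.unique(groups), logo.split(X, y, groups))` should line up as well, but taking the year from the test fold keeps the bookkeeping independent of that implementation detail.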
Answered By - Ben Reiniger