Issue
I have two versions of a pipeline, one of which runs, one of which doesn't.
Version 1. This runs reasonably quickly: approximately 4 hours on a machine with 32G of RAM and 16 cores. In this version I do a differential methylation analysis outside of the pipeline to select several hundred variables from a set of more than 300K.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline

# df, cancerType and parameterGrid are defined elsewhere (not shown)
X_train, X_test, y_train, y_test = train_test_split(df, cancerType, test_size=0.2, random_state=42)
rfeFeatureSelection = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
randomForest = RandomForestClassifier(random_state=42)
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Create the pipeline with feature selection and model refinement
pipeline = Pipeline([
    ("featureSelection", rfeFeatureSelection),
    ("modelRefinement", randomForest),
])
search = GridSearchCV(
    pipeline,
    param_grid=parameterGrid,
    scoring='accuracy',
    cv=stratified_cv,
    verbose=2,
    n_jobs=-1,
    pre_dispatch='2*n_jobs',
    error_score='raise',
)
search.fit(X_train, y_train)
Version 2. I would prefer the differential methylation step to be done inside the cross-validation process, though, so that the initial selection of variables never sees the test or validation data. So I wrote a custom estimator that does the differential methylation analysis. The number of variables it returns sometimes differs by one or two, so I put this step in a FeatureUnion with the recursive feature elimination (RFE) step, following what's been done here. My pipeline now looks like this:
from sklearn.pipeline import FeatureUnion

# DifferentialMethylation is the custom estimator described above (not shown)
X_train, X_test, y_train, y_test = train_test_split(df, cancerType, test_size=0.2, random_state=42)
differentialMethylation = DifferentialMethylation(truthValues=y_train, name=name)
rfeFeatureSelection = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
randomForest = RandomForestClassifier(random_state=42)
combinedFeatures = FeatureUnion([
    ("differentialMethylation", differentialMethylation),
    ("rfeFeatureSelection", rfeFeatureSelection),
])
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Create the pipeline with combined feature selection and model refinement
pipeline = Pipeline([
    ("featureSelection", combinedFeatures),
    ("modelRefinement", randomForest),
])
search = GridSearchCV(
    pipeline,
    param_grid=parameterGrid,
    scoring='accuracy',
    cv=stratified_cv,
    verbose=2,
    n_jobs=-1,
    pre_dispatch='2*n_jobs',
    error_score='raise',
)
search.fit(X_train, y_train)
This code gets through the DifferentialMethylation step - I've got logging statements that report what's happening immediately before it passes data to the rfeFeatureSelection step. If I set the verbosity to 1 in rfeFeatureSelection, it definitely reaches rfeFeatureSelection, but it never exits. It will sit there happily outputting the following overnight without ever finishing.
[Parallel(n_jobs=-1)]: Done 18 tasks | elapsed: 0.1s
[Parallel(n_jobs=-1)]: Done 18 tasks | elapsed: 0.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 0.2s finished
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 0.2s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 16 concurrent workers.
So I am assuming I am doing something wrong with the FeatureUnion, but can't for the life of me figure out what.
What am I doing wrong?
Solution
I was under the impression that the steps of a FeatureUnion run in sequence. They do not: each transformer in a FeatureUnion is fitted independently on the full input (in parallel), and their outputs are concatenated - a point in the documentation that I had missed.
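The difference between the two arrangements can be checked on a small synthetic dataset. The following is a minimal sketch, not the original code: SelectKBest stands in for the custom DifferentialMethylation step (whose implementation is not shown), and the dataset sizes and estimator settings are arbitrary choices made so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion, Pipeline

# Small synthetic stand-in for the methylation data (sizes are arbitrary).
X, y = make_classification(n_samples=60, n_features=40, n_informative=5, random_state=0)


def make_prefilter():
    # Hypothetical stand-in for the custom DifferentialMethylation step.
    return SelectKBest(score_func=f_classif, k=10)


def make_rfe():
    return RFE(
        estimator=RandomForestClassifier(n_estimators=10, random_state=0),
        n_features_to_select=4,
    )


# FeatureUnion: both transformers receive the full X; outputs are concatenated.
union = FeatureUnion([("prefilter", make_prefilter()), ("rfe", make_rfe())])
union_out = union.fit_transform(X, y)

# Pipeline: the RFE step only receives the columns kept by the prefilter.
sequential = Pipeline([("prefilter", make_prefilter()), ("rfe", make_rfe())])
seq_out = sequential.fit_transform(X, y)

# support_ has one entry per feature that RFE was given at fit time.
union_rfe_inputs = len(dict(union.transformer_list)["rfe"].support_)
seq_rfe_inputs = len(sequential.named_steps["rfe"].support_)

print("FeatureUnion: RFE saw", union_rfe_inputs, "features; output shape", union_out.shape)
print("Pipeline:     RFE saw", seq_rfe_inputs, "features; output shape", seq_out.shape)
```

In the union, RFE is fitted on all 40 columns and the result has 10 + 4 = 14 columns; in the chained pipeline, RFE is fitted on only the 10 prefiltered columns and the result has 4. Chaining the two selectors as consecutive Pipeline steps is one way to get the sequential behaviour that was originally intended.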
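This meant that the Recursive Feature Elimination step was being run on the full set of more than 300K variables rather than on the few hundred selected by the differential methylation step. With RFE's defaults (step=1, stopping once half the features remain) it removes one feature per iteration and refits the random forest each time, so it was fitting vast numbers of random forests, leading to the excessively long run times, as pointed out by Ben Reiniger.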
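To get a feel for the scale, the number of estimator fits a single RFE run performs can be estimated with a few lines of arithmetic. This is a rough sketch based on RFE's documented defaults (n_features_to_select=None keeps half the features, step=1 removes one per iteration); the helper name rfe_fit_count is hypothetical and the feature counts are illustrative round numbers, not measurements from the actual data.

```python
import math


def rfe_fit_count(n_features, n_features_to_select=None, step=1):
    """Estimate how many times RFE fits its estimator in one fit() call."""
    if n_features_to_select is None:
        # Default: keep half of the features.
        n_features_to_select = n_features // 2
    if 0 < step < 1:
        # A fractional step is a proportion of the original feature count.
        step = int(max(1, step * n_features))
    to_remove = max(0, n_features - n_features_to_select)
    # One fit per elimination round, plus a final fit on the selected features.
    return math.ceil(to_remove / step) + 1


fits_on_full_input = rfe_fit_count(300_000)   # RFE fed all ~300K variables
fits_on_prefiltered = rfe_fit_count(300)      # RFE fed a few hundred variables

print(fits_on_full_input, fits_on_prefiltered)
```

Each of those fits is a 100-tree random forest, and GridSearchCV repeats the whole RFE run for every fold and every parameter combination, which is consistent with the search never finishing overnight.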
Answered By - Ben