Issue
I have a 4x Raspberry Pi 3 SLURM cluster with a shared NFS folder: 4 workers (the master is also a worker, but it uses only 3 of its 4 cores).
The cluster is working OK (I have run some parallel Python examples on it using mpiexec). Now I want to try a scikit-learn example, and some tutorials I saw were using Dask-jobqueue with SLURM.
My code looks something like this:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(job_extra=['--partition=picluster'],
                       queue='myqueue',
                       cores=4,
                       memory='1GB'
                       )
cluster.scale(4)  # the number of nodes to request
print(cluster.job_script())

client = Client(cluster)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.metrics import brier_score_loss
from sklearn.linear_model import LogisticRegression

# load the data from file
preds_trainval_file = './Predictions_TRAIN.csv'
outc_trainval_file = './Outcome_TRAIN.csv'
preds_test_file = './Predictions_TEST.csv'
outc_test_file = './Outcome_TEST.csv'
X_trainval = np.loadtxt(preds_trainval_file, delimiter=',')
y_trainval, _, _ = np.loadtxt(outc_trainval_file, delimiter=',', usecols=(0, 1, 2), unpack=True)
X_test = np.loadtxt(preds_test_file, delimiter=',')
y_test, _ = np.loadtxt(outc_test_file, delimiter=',', usecols=(0, 1), unpack=True)

# set up the classifier and the hyper-parameter grid
model = LogisticRegression(penalty='elasticnet', solver='saga', warm_start=True, max_iter=10000)
param_grid = {'l1_ratio': [0, 0.25, 0.5, 0.75, 1], 'C': [0.1, 0.25, 0.5, 0.75, 1, 1.25]}

# set up a 5-fold grid search on the train+val data
kfold = KFold(n_splits=5, shuffle=True)
grid_search = GridSearchCV(model, param_grid, cv=kfold, scoring='neg_brier_score', n_jobs=-1)

# run the cross-validated fits on the Dask workers instead of locally
import joblib
with joblib.parallel_backend('dask'):
    grid_search.fit(X_trainval, y_trainval)

# evaluate on the held-out test set
y_prob = grid_search.predict_proba(X_test)
print(brier_score_loss(y_test, y_prob[:, 0], pos_label=1))
From what I understand, this is a pretty standard setup for exploiting scikit-learn's built-in (joblib-based) parallelisation over a Dask cluster.
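As an aside, the same joblib-over-Dask pattern can be sanity-checked on a single machine with a LocalCluster before involving SLURM at all. A minimal sketch (the toy data, grid and worker counts here are illustrative, not part of the original script):

from dask.distributed import Client, LocalCluster
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# scheduler + workers on this machine only, no job queue involved
cluster = LocalCluster(n_workers=3, threads_per_worker=1)
client = Client(cluster)

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
grid = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1.0]}, cv=3, n_jobs=-1)

# same pattern as above: joblib ships the CV fits to the Dask workers
with joblib.parallel_backend('dask'):
    grid.fit(X, y)
print(grid.best_params_)

If this works but the SLURMCluster version does not, the problem lies in the job submission rather than in the scikit-learn/joblib side.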
When I run this script I get the following:
pi@node01:/clusterfs/Python_scripts/Expert_ensemble $ python3 ensemble_tests.py
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p myqueue
#SBATCH -n 1
#SBATCH --cpus-per-task=4
#SBATCH --mem=954M
#SBATCH -t 00:30:00
#SBATCH --partition=picluster
/usr/bin/python3 -m distributed.cli.dask_worker tcp://192.168.1.10:38817 --nthreads 1 --nprocs 4 --memory-limit 250.00MB --name dummy-name --nanny --death-timeout 60 --protocol tcp://
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /usr/lib/python3.7/asyncio/tasks.py:596> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 1\nCommand:\nsbatch /tmp/tmpz8a3jhys.sh\nstdout:\n\nstderr:\nsbatch: error: Memory specification can not be satisfied\nsbatch: error: Batch job submission failed: Requested node configuration is not available\n\n')>
Traceback (most recent call last):
File "/usr/lib/python3.7/asyncio/tasks.py", line 603, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/home/pi/.local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 71, in _
await self.start()
File "/home/pi/.local/lib/python3.7/site-packages/dask_jobqueue/core.py", line 324, in start
out = await self._submit_job(fn)
File "/home/pi/.local/lib/python3.7/site-packages/dask_jobqueue/core.py", line 307, in _submit_job
return self._call(shlex.split(self.submit_command) + [script_filename])
File "/home/pi/.local/lib/python3.7/site-packages/dask_jobqueue/core.py", line 407, in _call
"stderr:\n{}\n".format(proc.returncode, cmd_str, out, err)
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
sbatch /tmp/tmpz8a3jhys.sh
stdout:
stderr:
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x766f41b0>>, <Task finished coro=<SpecCluster._correct_state_internal() done, defined at /home/pi/.local/lib/python3.7/site-packages/distributed/deploy/spec.py:325> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 1\nCommand:\nsbatch /tmp/tmpc0ary0k1.sh\nstdout:\n\nstderr:\nsbatch: error: Memory specification can not be satisfied\nsbatch: error: Batch job submission failed: Requested node configuration is not available\n\n')>)
Traceback (most recent call last):
File "/home/pi/.local/lib/python3.7/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/home/pi/.local/lib/python3.7/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/home/pi/.local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 360, in _correct_state_internal
await w # for tornado gen.coroutine support
File "/home/pi/.local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 71, in _
await self.start()
File "/home/pi/.local/lib/python3.7/site-packages/dask_jobqueue/core.py", line 324, in start
out = await self._submit_job(fn)
File "/home/pi/.local/lib/python3.7/site-packages/dask_jobqueue/core.py", line 307, in _submit_job
return self._call(shlex.split(self.submit_command) + [script_filename])
File "/home/pi/.local/lib/python3.7/site-packages/dask_jobqueue/core.py", line 407, in _call
"stderr:\n{}\n".format(proc.returncode, cmd_str, out, err)
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
sbatch /tmp/tmpc0ary0k1.sh
stdout:
stderr:
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /usr/lib/python3.7/asyncio/tasks.py:596> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 1\nCommand:\nsbatch /tmp/tmp3sezvy1f.sh\nstdout:\n\nstderr:\nsbatch: error: Memory specification can not be satisfied\nsbatch: error: Batch job submission failed: Requested node configuration is not available\n\n')>
Traceback (most recent call last):
File "/usr/lib/python3.7/asyncio/tasks.py", line 603, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/home/pi/.local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 71, in _
await self.start()
File "/home/pi/.local/lib/python3.7/site-packages/dask_jobqueue/core.py", line 324, in start
out = await self._submit_job(fn)
File "/home/pi/.local/lib/python3.7/site-packages/dask_jobqueue/core.py", line 307, in _submit_job
return self._call(shlex.split(self.submit_command) + [script_filename])
File "/home/pi/.local/lib/python3.7/site-packages/dask_jobqueue/core.py", line 407, in _call
"stderr:\n{}\n".format(proc.returncode, cmd_str, out, err)
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
sbatch /tmp/tmp3sezvy1f.sh
stdout:
stderr:
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
I am not sure what I am doing wrong, whether it is in the SLURMCluster configuration or something else.
The output of sinfo is:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
picluster* up infinite 4 idle node[01-04]
And the output of scontrol show nodes is:
scontrol show nodes
NodeName=node01 Arch=armv7l CoresPerSocket=1
CPUAlloc=0 CPUTot=3 CPULoad=0.09
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.2.10 NodeHostName=node01 Version=18.08
OS=Linux 5.10.11-v7+ #1399 SMP Thu Jan 28 12:06:05 GMT 2021
RealMemory=1 AllocMem=0 FreeMem=800 Sockets=3 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=picluster
BootTime=2021-02-20T05:49:48 SlurmdStartTime=2021-02-20T05:50:03
CfgTRES=cpu=3,mem=1M,billing=3
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=node02 Arch=armv7l CoresPerSocket=1
CPUAlloc=0 CPUTot=4 CPULoad=0.27
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.2.11 NodeHostName=node02 Version=18.08
OS=Linux 5.10.11-v7+ #1399 SMP Thu Jan 28 12:06:05 GMT 2021
RealMemory=1 AllocMem=0 FreeMem=813 Sockets=4 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=picluster
BootTime=2021-02-20T05:49:37 SlurmdStartTime=2021-02-20T05:50:10
CfgTRES=cpu=4,mem=1M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=node03 Arch=armv7l CoresPerSocket=1
CPUAlloc=0 CPUTot=4 CPULoad=0.24
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.2.12 NodeHostName=node03 Version=18.08
OS=Linux 5.10.11-v7+ #1399 SMP Thu Jan 28 12:06:05 GMT 2021
RealMemory=1 AllocMem=0 FreeMem=821 Sockets=4 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=picluster
BootTime=2021-02-20T05:49:37 SlurmdStartTime=2021-02-20T05:50:09
CfgTRES=cpu=4,mem=1M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=node04 Arch=armv7l CoresPerSocket=1
CPUAlloc=0 CPUTot=4 CPULoad=0.14
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.2.13 NodeHostName=node04 Version=18.08
OS=Linux 5.10.11-v7+ #1399 SMP Thu Jan 28 12:06:05 GMT 2021
RealMemory=1 AllocMem=0 FreeMem=813 Sockets=4 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=picluster
BootTime=2021-02-20T05:49:40 SlurmdStartTime=2021-02-20T05:50:08
CfgTRES=cpu=4,mem=1M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
As you can see, only the master node (node01) has CPUTot=3; all the other nodes have the standard 4. But I also tested with the cluster reconfigured so that all nodes have the same CPUTot=4 and still got the same error when running the Python script. In addition, I tried requesting only 500MB of memory for each node in the cluster, but I still got the same error.
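(Side note, not part of the original post: a quick way to see how many CPUs and how much memory SLURM itself believes each node has, independent of what Dask requests, is an sinfo format string such as the one below, where %c prints the CPU count and %m the configured memory per node in MB.)

# CPUs and configured memory (MB) per node, as seen by SLURM
sinfo -N -o "%N %c %m"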
Any help appreciated.
Thanks
Solution
OK, so I found a solution. I am not sure what the problem was, but you can work around the memory issue by skipping the memory request in the generated job script via the header_skip option. So change
cluster = SLURMCluster( job_extra=['--partition=picluster'],
queue='myqueue',
cores=4,
memory='1GB'
)
to
cluster = SLURMCluster( header_skip=['--mem'],
queue='picluster',
cores=4,
memory='1GB'
)
After that, it seems to work fine, but I still don't fully understand what the problem was.
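One possible explanation (untested): the scontrol output above reports RealMemory=1 and CfgTRES=...,mem=1M on every node, i.e. SLURM is configured as if each node had only 1 MB of RAM, so any #SBATCH --mem request larger than that is rejected with "Memory specification can not be satisfied". If that is indeed the cause, an alternative to header_skip would be to set RealMemory for the nodes in slurm.conf (and restart slurmctld/slurmd), keeping the memory passed to SLURMCluster at or below that value. A sketch with illustrative values for 1GB Pi 3 boards, untested on this cluster:

# slurm.conf (illustrative values, adjust to your nodes)
NodeName=node01 CPUs=3 RealMemory=900 State=UNKNOWN
NodeName=node[02-04] CPUs=4 RealMemory=900 State=UNKNOWN

With something like that in place, a memory request such as memory='800MB' in SLURMCluster should generate a --mem line that fits within RealMemory and be accepted without skipping the header.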
Answered By - vzografos