Issue
I've noticed that installing Pandas and Numpy (it's dependency) in a Docker container using the base OS Alpine vs. CentOS or Debian takes much longer. I created a little test below to demonstrate the time difference. Aside from the few seconds Alpine takes to update and download the build dependencies to install Pandas and Numpy, why does the setup.py take around 70x more time than on Debian install?
Is there any way to speed up the install using Alpine as the base image or is there another base image of comparable size to Alpine that is better to use for packages like Pandas and Numpy?
Dockerfile.debian
FROM python:3.6.4-slim-jessie
RUN pip install pandas
Build Debian image with Pandas & Numpy:
[PandasDockerTest] time docker build -t debian-pandas -f Dockerfile.debian . --no-cache
Sending build context to Docker daemon 3.072kB
Step 1/2 : FROM python:3.6.4-slim-jessie
---> 43431c5410f3
Step 2/2 : RUN pip install pandas
---> Running in 2e4c030f8051
Collecting pandas
Downloading pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl (26.2MB)
Collecting numpy>=1.9.0 (from pandas)
Downloading numpy-1.14.1-cp36-cp36m-manylinux1_x86_64.whl (12.2MB)
Collecting pytz>=2011k (from pandas)
Downloading pytz-2018.3-py2.py3-none-any.whl (509kB)
Collecting python-dateutil>=2 (from pandas)
Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
Collecting six>=1.5 (from python-dateutil>=2->pandas)
Downloading six-1.11.0-py2.py3-none-any.whl
Installing collected packages: numpy, pytz, six, python-dateutil, pandas
Successfully installed numpy-1.14.1 pandas-0.22.0 python-dateutil-2.6.1 pytz-2018.3 six-1.11.0
Removing intermediate container 2e4c030f8051
---> a71e1c314897
Successfully built a71e1c314897
Successfully tagged debian-pandas:latest
docker build -t debian-pandas -f Dockerfile.debian . --no-cache 0.07s user 0.06s system 0% cpu 13.605 total
Dockerfile.alpine
FROM python:3.6.4-alpine3.7
RUN apk --update add --no-cache g++
RUN pip install pandas
Build Alpine image with Pandas & Numpy:
[PandasDockerTest] time docker build -t alpine-pandas -f Dockerfile.alpine . --no-cache
Sending build context to Docker daemon 16.9kB
Step 1/3 : FROM python:3.6.4-alpine3.7
---> 4b00a94b6f26
Step 2/3 : RUN apk --update add --no-cache g++
---> Running in 4b0c32551e3f
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/x86_64/APKINDEX.tar.gz
(1/17) Upgrading musl (1.1.18-r2 -> 1.1.18-r3)
(2/17) Installing libgcc (6.4.0-r5)
(3/17) Installing libstdc++ (6.4.0-r5)
(4/17) Installing binutils-libs (2.28-r3)
(5/17) Installing binutils (2.28-r3)
(6/17) Installing gmp (6.1.2-r1)
(7/17) Installing isl (0.18-r0)
(8/17) Installing libgomp (6.4.0-r5)
(9/17) Installing libatomic (6.4.0-r5)
(10/17) Installing pkgconf (1.3.10-r0)
(11/17) Installing mpfr3 (3.1.5-r1)
(12/17) Installing mpc1 (1.0.3-r1)
(13/17) Installing gcc (6.4.0-r5)
(14/17) Installing musl-dev (1.1.18-r3)
(15/17) Installing libc-dev (0.7.1-r0)
(16/17) Installing g++ (6.4.0-r5)
(17/17) Upgrading musl-utils (1.1.18-r2 -> 1.1.18-r3)
Executing busybox-1.27.2-r7.trigger
OK: 184 MiB in 50 packages
Removing intermediate container 4b0c32551e3f
---> be26c3bf4e42
Step 3/3 : RUN pip install pandas
---> Running in 36f6024e5e2d
Collecting pandas
Downloading pandas-0.22.0.tar.gz (11.3MB)
Collecting python-dateutil>=2 (from pandas)
Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
Collecting pytz>=2011k (from pandas)
Downloading pytz-2018.3-py2.py3-none-any.whl (509kB)
Collecting numpy>=1.9.0 (from pandas)
Downloading numpy-1.14.1.zip (4.9MB)
Collecting six>=1.5 (from python-dateutil>=2->pandas)
Downloading six-1.11.0-py2.py3-none-any.whl
Building wheels for collected packages: pandas, numpy
Running setup.py bdist_wheel for pandas: started
Running setup.py bdist_wheel for pandas: still running...
Running setup.py bdist_wheel for pandas: still running...
Running setup.py bdist_wheel for pandas: still running...
Running setup.py bdist_wheel for pandas: still running...
Running setup.py bdist_wheel for pandas: still running...
Running setup.py bdist_wheel for pandas: still running...
Running setup.py bdist_wheel for pandas: finished with status 'done'
Stored in directory: /root/.cache/pip/wheels/e8/ed/46/0596b51014f3cc49259e52dff9824e1c6fe352048a2656fc92
Running setup.py bdist_wheel for numpy: started
Running setup.py bdist_wheel for numpy: still running...
Running setup.py bdist_wheel for numpy: still running...
Running setup.py bdist_wheel for numpy: still running...
Running setup.py bdist_wheel for numpy: finished with status 'done'
Stored in directory: /root/.cache/pip/wheels/9d/cd/e1/4d418b16ea662e512349ef193ed9d9ff473af715110798c984
Successfully built pandas numpy
Installing collected packages: six, python-dateutil, pytz, numpy, pandas
Successfully installed numpy-1.14.1 pandas-0.22.0 python-dateutil-2.6.1 pytz-2018.3 six-1.11.0
Removing intermediate container 36f6024e5e2d
---> a93c59e6a106
Successfully built a93c59e6a106
Successfully tagged alpine-pandas:latest
docker build -t alpine-pandas -f Dockerfile.alpine . --no-cache 0.54s user 0.33s system 0% cpu 16:08.47 total
Solution
Debian based images use only python pip
to install packages with .whl
format:
Downloading pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl (26.2MB)
Downloading numpy-1.14.1-cp36-cp36m-manylinux1_x86_64.whl (12.2MB)
WHL format was developed as a quicker and more reliable method of installing Python software than re-building from source code every time. WHL files only have to be moved to the correct location on the target system to be installed, whereas a source distribution requires a build step before installation.
Wheel packages pandas
and numpy
are not supported in images based on Alpine platform. That's why when we install them using python pip
during the building process, we always compile them from the source files in alpine:
Downloading pandas-0.22.0.tar.gz (11.3MB)
Downloading numpy-1.14.1.zip (4.9MB)
and we can see the following inside container during the image building:
/ # ps aux
PID USER TIME COMMAND
1 root 0:00 /bin/sh -c pip install pandas
7 root 0:04 {pip} /usr/local/bin/python /usr/local/bin/pip install pandas
21 root 0:07 /usr/local/bin/python -c import setuptools, tokenize;__file__='/tmp/pip-build-en29h0ak/pandas/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n
496 root 0:00 sh
660 root 0:00 /bin/sh -c gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -DTHREAD_STACK_SIZE=0x100000 -fPIC -Ibuild/src.linux-x86_64-3.6/numpy/core/src/pri
661 root 0:00 gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -DTHREAD_STACK_SIZE=0x100000 -fPIC -Ibuild/src.linux-x86_64-3.6/numpy/core/src/private -Inump
662 root 0:00 /usr/libexec/gcc/x86_64-alpine-linux-musl/6.4.0/cc1 -quiet -I build/src.linux-x86_64-3.6/numpy/core/src/private -I numpy/core/include -I build/src.linux-x86_64-3.6/numpy/core/includ
663 root 0:00 ps aux
If we modify Dockerfile
a little:
FROM python:3.6.4-alpine3.7
RUN apk add --no-cache g++ wget
RUN wget https://pypi.python.org/packages/da/c6/0936bc5814b429fddb5d6252566fe73a3e40372e6ceaf87de3dec1326f28/pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl
RUN pip install pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl
we get the following error:
Step 4/4 : RUN pip install pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl
---> Running in 0faea63e2bda
pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl is not a supported wheel on this platform.
The command '/bin/sh -c pip install pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl' returned a non-zero code: 1
Unfortunately, the only way to install pandas
on an Alpine image is to wait until build finishes.
Of course if you want to use the Alpine image with pandas
in CI for example, the best way to do so is to compile it once, push it to any registry and use it as a base image for your needs.
EDIT:
If you want to use the Alpine image with pandas
you can pull my nickgryg/alpine-pandas docker image. It is a python image with pre-compiled pandas
on the Alpine platform. It should save your time.
Answered By - nickgryg
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.