FROM python:3.7-alpine3.8
RUN apk add --no-cache \
        build-base \
        cmake \
        bash \
        jemalloc-dev \
        boost-dev \
        autoconf \
        zlib-dev \
        flex \
        bison
RUN pip install --no-cache-dir six pytest numpy cython
RUN pip install --no-cache-dir pandas
ARG ARROW_VERSION=0.12.0
ARG ARROW_SHA1=2ede75769e12df972f0acdfddd53ab15d11e0ac2
ARG ARROW_BUILD_TYPE=release
ENV ARROW_HOME=/usr/local \
    PARQUET_HOME=/usr/local
# Download and build apache-arrow.
# NOTE: the sha1sum line below only prints the archive's hash; it does not
# compare it with ARROW_SHA1 (that would need sha1sum -c).
RUN mkdir /arrow \
    && apk add --no-cache curl \
    && curl -o /tmp/apache-arrow.tar.gz -SL https://github.com/apache/arrow/archive/apache-arrow-${ARROW_VERSION}.tar.gz \
    && echo "$ARROW_SHA1 *apache-arrow.tar.gz" | sha1sum /tmp/apache-arrow.tar.gz \
    && tar -xvf /tmp/apache-arrow.tar.gz -C /arrow --strip-components 1 \
    && mkdir -p /arrow/cpp/build \
    && cd /arrow/cpp/build \
    && cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
        -DCMAKE_INSTALL_LIBDIR=lib \
        -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
        -DARROW_PARQUET=on \
        -DARROW_PYTHON=on \
        -DARROW_PLASMA=on \
        -DARROW_BUILD_TESTS=OFF \
        .. \
    && make -j$(nproc) \
    && make install \
    && cd /arrow/python \
    && python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-parquet \
    && python setup.py install \
    && rm -rf /arrow /tmp/apache-arrow.tar.gz
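One thing to watch in the download step: piping the expected hash into sha1sum while also passing a file name makes the tool print the file's digest and ignore the piped text, so nothing is actually compared. A minimal sketch of a check that does fail on a mismatch, shown with sha256sum and a hypothetical helper named verify_sha256 (the helper is not part of the Dockerfile, and it assumes the expected hash really belongs to the downloaded archive):

```shell
#!/bin/sh
# verify_sha256 FILE EXPECTED_HEX
# Succeeds only if FILE's SHA-256 digest equals EXPECTED_HEX.
# (Hypothetical helper name, introduced here for illustration.)
verify_sha256() {
    file=$1
    expected=$2
    # With -c, sha256sum reads "<hash>  <path>" lines from stdin, recomputes
    # the digest of each path, and exits non-zero if any of them differs.
    printf '%s  %s\n' "$expected" "$file" | sha256sum -c >/dev/null 2>&1
}

# Example use (paths and variable names as in a typical download step):
#   verify_sha256 /tmp/apache-arrow.tar.gz "$ARROW_SHA256" || exit 1
```

Inside a RUN step the same idea fits on one line, for example echo "${ARROW_SHA256}  /tmp/apache-arrow.tar.gz" | sha256sum -c, with sha1sum -c as the SHA-1 equivalent.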
I can't make this work either. If someone knows what the underlying problem is, I will gladly put in some time and effort to try to make this work. Unfortunately, my knowledge about this is quite limited so far. I would really like to work with the Alpine base image, as it is a safe and small starting point. I am using python:3.11-alpine as a base.
I've finally managed to build pyarrow with Apache Arrow, but the resulting image is 3.5 GB and the build takes about 30 minutes. Here is the confirmed working Dockerfile:
FROM --platform=linux/amd64 python:3.12-alpine AS base
# Setup env
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONFAULTHANDLER=1
ENV ACCEPT_EULA=Y
RUN apk update && apk add --no-cache \
gcc \
g++ \
curl \
unixodbc-dev \
bash \
libffi-dev \
openssl-dev \
cargo \
musl-dev \
postgresql-dev \
cmake \
rust \
linux-headers \
libc-dev \
libgcc \
libstdc++ \
ca-certificates \
zlib-dev \
bzip2-dev \
xz-dev \
lz4-dev \
zstd-dev \
snappy-dev \
brotli-dev \
build-base \
autoconf \
boost-dev \
flex \
libxml2-dev \
libxslt-dev \
libjpeg-turbo-dev \
ninja \
git \
&& pip install --upgrade pip && pip install pipenv cython numpy
ARG ARROW_VERSION=17.0.0
ARG ARROW_SHA256=8379554d89f19f2c8db63620721cabade62541f47a4e706dfb0a401f05a713ef
ARG ARROW_BUILD_TYPE=release
ENV ARROW_HOME=/usr/local \
PARQUET_HOME=/usr/local
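# Download and unpack the Arrow sources
# NOTE: the sha256sum line only prints the archive's hash; it is not compared with ARROW_SHA256
RUN mkdir /arrow \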
&& wget -q https://github.com/apache/arrow/archive/apache-arrow-${ARROW_VERSION}.tar.gz -O /tmp/apache-arrow.tar.gz \
&& echo "${ARROW_SHA256} *apache-arrow.tar.gz" | sha256sum /tmp/apache-arrow.tar.gz \
&& tar -xvf /tmp/apache-arrow.tar.gz -C /arrow --strip-components 1
# Create the patch file for re2
RUN echo "diff --git a/util/pcre.h b/util/pcre.h" > /arrow/re2_patch.diff \
&& echo "index e69de29..b6f3e31 100644" >> /arrow/re2_patch.diff \
&& echo "--- a/util/pcre.h" >> /arrow/re2_patch.diff \
&& echo "+++ b/util/pcre.h" >> /arrow/re2_patch.diff \
&& echo "@@ -21,6 +21,7 @@" >> /arrow/re2_patch.diff \
&& echo " #include \"re2/filtered_re2.h\"" >> /arrow/re2_patch.diff \
&& echo " #include \"re2/pod_array.h\"" >> /arrow/re2_patch.diff \
&& echo " #include \"re2/stringpiece.h\"" >> /arrow/re2_patch.diff \
&& echo "+#include <cstdint>" >> /arrow/re2_patch.diff
# Configure the build using CMake
RUN cd /arrow/cpp \
&& cmake --preset ninja-release-python
# Pre-fetch dependencies without building
RUN cd /arrow/cpp \
&& cmake --build . --target re2_ep -- -j1 || true
# Apply the patch to re2 after the dependencies are fetched but before the build
RUN cd /arrow/cpp/re2_ep-prefix/src/re2_ep \
&& patch -p1 < /arrow/re2_patch.diff
# Continue with the build and install Apache Arrow
RUN cd /arrow/cpp \
&& cmake --build . --target install \
&& rm -rf /arrow /tmp/apache-arrow.tar.gz
COPY Pipfile .
COPY Pipfile.lock .
RUN PIPENV_VENV_IN_PROJECT=1 pipenv install --deploy
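# Final Stage
# NOTE: this stage starts FROM base, so it still contains the whole build toolchain, and the COPY below is redundant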
FROM base AS runtime
COPY --from=base /.venv /.venv
ENV PATH="/.venv/bin:$PATH"
WORKDIR /app
COPY src .
CMD ["python3", "main.py"]
I use pipenv to install dependencies, but you can customize the image for your needs. Either add
RUN pip install pyarrow
or change the final build step so that it also builds the Python bindings, as in the original Dockerfile above:
RUN cd /arrow/cpp \
&& cmake --build . --target install \
&& cd /arrow/python \
&& python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-parquet \
&& python setup.py install \
&& rm -rf /arrow /tmp/apache-arrow.tar.gz
There is also a bug with pcre.h (in the re2 sources that the Arrow build fetches), so I applied a patch inside the Docker image. Maybe it's not a bug, though; I reported it here: #43350
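The echo chain that assembles the re2 patch can also be written as a single printf call, which keeps one line of the diff per argument and avoids escaping the double quotes. A sketch that produces the same nine lines as the Dockerfile; the RE2_PATCH variable and its /tmp default are assumptions for illustration (the Dockerfile writes to /arrow/re2_patch.diff):

```shell
#!/bin/sh
# Where to write the patch; hypothetical variable, defaulting to a temp path.
RE2_PATCH=${RE2_PATCH:-/tmp/re2_patch.diff}

# '%s\n' is reused for every argument, so each argument becomes one line.
# Single quotes keep the embedded double quotes literal.
printf '%s\n' \
    'diff --git a/util/pcre.h b/util/pcre.h' \
    'index e69de29..b6f3e31 100644' \
    '--- a/util/pcre.h' \
    '+++ b/util/pcre.h' \
    '@@ -21,6 +21,7 @@' \
    ' #include "re2/filtered_re2.h"' \
    ' #include "re2/pod_array.h"' \
    ' #include "re2/stringpiece.h"' \
    '+#include <cstdint>' \
    > "$RE2_PATCH"
```

In the Dockerfile this would stand in for the body of the RUN step that creates the patch, with the output path set to /arrow/re2_patch.diff.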
I felt I had to reply here, because that Dockerfile helped me so much. Thanks!