Top 20 R packages:
- tidyverse (core ecosystem)
- dplyr (data manipulation)
- ggplot2 (visualization)
- readr (data import)
- tidyr (data tidying)
- stringr (string manipulation)
- lubridate (date/time handling)
- forcats (factor handling)
- purrr (functional programming)
- readxl (Excel files)
- haven (SPSS/SAS/Stata)
- DBI (database interface)
- broom (model tidying)
- rsample (resampling)
- caret/tidymodels (ML frameworks)
- shiny (web apps)
- rmarkdown (reproducible reports)
- knitr (dynamic reports)
- devtools/remotes (GitHub installs)
- glue (string interpolation)

Top 20 Python packages:
- pandas (data manipulation)
- numpy (numerical computing)
- matplotlib (plotting)
- seaborn (statistical viz)
- scikit-learn (ML)
- jupyter (interactive notebooks)
- ipython (enhanced shell)
- requests (HTTP)
- beautifulsoup4 (web scraping)
- sqlalchemy (database ORM)
- statsmodels (statistical modeling)
- plotly (interactive viz)
- dash (web apps)
- scipy (scientific computing)
- openpyxl (Excel files)
- pyarrow (fast I/O)
- fastparquet (Parquet support)
- black (code formatting)
- flake8 (linting)
- pytest (testing)

(Required for R/Python packages to install correctly)
libcairo2-dev # Cairo graphics (ggplot2, svglite)
libpango1.0-dev # Text rendering (ggplot2)
libgdk-pixbuf2.0-dev # Image handling (ggplot2)
libcurl4-openssl-dev # HTTP/SSL (curl, httr)
libssl-dev # Encryption (openssl)
libxml2-dev # XML parsing (xml2, rvest)
libfontconfig1-dev # Font handling (systemfonts)
libpq-dev # PostgreSQL (RPostgres)
zlib1g-dev # Compression (many packages)
libharfbuzz-dev # Text shaping (textshaping)
libfribidi-dev # Bidirectional text (textshaping)

R packages:
- Installed: r-tidyverse, r-caret, r-crayon, r-devtools, r-e1071, r-forecast, r-hexbin, r-htmltools, r-htmlwidgets, r-irkernel, r-nycflights13, r-randomforest, r-rcurl, r-rmarkdown, r-rodbc, r-rsqlite, r-shiny, r-tidymodels, r-base, r-arrow, r-aws.s3, r-catools, r-hdf5r, r-odbc, r-renv, r-sf, r-sparklyr
- Missing: lubridate (date/time handling), readxl (Excel files), haven (SPSS/SAS/Stata), DBI (database interface), broom (model tidying), rsample (resampling), knitr (dynamic reports), glue (string interpolation)

Python packages:
- Installed: pandas, numpy, matplotlib, scipy, scikit-learn, seaborn, sqlalchemy, statsmodels, beautifulsoup4, openpyxl, plotly, dash, pyarrow, altair, bokeh, bottleneck, cloudpickle, cython, dask, dill, h5py, ipympl, ipywidgets, jupyterlab-git, numba, numexpr, patsy, protobuf, pytables, scikit-image, sympy, widgetsnbextension, xlrd, pillow, pyyaml, joblib, s3fs, mkl, fire, graphviz, kubeflow-training, lxml, cryptography, jupyterlab, jupyter_contrib_nbextensions, jupyter-server-proxy
- Missing: jupyter (metapackage), ipython (enhanced shell), requests (HTTP), fastparquet (Parquet support), black (code formatting), flake8 (linting), pytest (testing)

System dependencies:
- Installed: libfreetype6-dev, libpng-dev, libjpeg-dev, libtiff-dev, libfreetype-dev, libfreetype6
- Missing: all 11 Ubuntu packages listed in Section 3
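
The installed-versus-missing comparison above amounts to a set difference between a top-20 list and the list of installed packages. Here is a minimal sketch for the Python side, assuming the package names are copied by hand from the lists in this document. The names `TOP_20_PYTHON`, `INSTALLED_PYTHON`, and `missing_packages` are illustrative and not part of any existing tooling.

```python
# Top-20 Python packages, copied from the list in this document.
TOP_20_PYTHON = [
    "pandas", "numpy", "matplotlib", "seaborn", "scikit-learn",
    "jupyter", "ipython", "requests", "beautifulsoup4", "sqlalchemy",
    "statsmodels", "plotly", "dash", "scipy", "openpyxl",
    "pyarrow", "fastparquet", "black", "flake8", "pytest",
]

# Python packages currently installed in the image, copied from this document.
INSTALLED_PYTHON = {
    "pandas", "numpy", "matplotlib", "scipy", "scikit-learn", "seaborn",
    "sqlalchemy", "statsmodels", "beautifulsoup4", "openpyxl", "plotly",
    "dash", "pyarrow", "altair", "bokeh", "bottleneck", "cloudpickle",
    "cython", "dask", "dill", "h5py", "ipympl", "ipywidgets",
    "jupyterlab-git", "numba", "numexpr", "patsy", "protobuf", "pytables",
    "scikit-image", "sympy", "widgetsnbextension", "xlrd", "pillow",
    "pyyaml", "joblib", "s3fs", "mkl", "fire", "graphviz",
    "kubeflow-training", "lxml", "cryptography", "jupyterlab",
    "jupyter_contrib_nbextensions", "jupyter-server-proxy",
}


def missing_packages(wanted, installed):
    """Return the wanted packages that are not installed, in their original order.

    Names are compared exactly (ignoring case), so "jupyterlab" does not
    count as providing "jupyter".
    """
    installed_lower = {name.lower() for name in installed}
    return [name for name in wanted if name.lower() not in installed_lower]


if __name__ == "__main__":
    print(missing_packages(TOP_20_PYTHON, INSTALLED_PYTHON))
```

With the lists as given, the result is jupyter, ipython, requests, fastparquet, black, flake8, and pytest. The same function could be applied to the R lists after stripping the `r-` prefix from the conda package names.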

Niche R packages:
- r-hexbin (hexagonal binning - specialized visualization)
- r-nycflights13 (example dataset - not production-ready)
- r-randomforest (redundant with caret/tidymodels)
- r-rcurl (legacy - modern alternatives in tidyverse)
- r-forecast (time series - specialized use case)
- r-htmltools/r-htmlwidgets (web components - niche for statisticians)
- r-hdf5r (HDF5 files - specialized scientific format)
- r-catools (utilities - covered by base R)

Niche Python packages:
- altair (visualization - redundant with plotly/seaborn)
- bokeh (visualization - redundant with plotly/dash)
- bottleneck (optimized numpy - marginal benefit)
- cloudpickle (serialization - niche use case)
- cython (C extensions - not needed by most statisticians)
- dask (parallel computing - overkill for most statistical work)
- dill (pickle alternative - redundant)
- h5py (HDF5 files - specialized format)
- numba (JIT compiler - not needed for statistical analysis)
- numexpr (fast evaluation - marginal benefit)
- patsy (statistical formulas - covered by statsmodels)
- protobuf (serialization - infrastructure dependency)
- pytables (HDF5 files - redundant with h5py)
- scikit-image (image processing - too specialized for statistical work)
- sympy (symbolic math - rarely used in statistical analysis)
- widgetsnbextension (Jupyter widgets - redundant with ipywidgets)
- facets (ML visualization - specialized)
- fire (CLI tools - not data science specific)
- graphviz (graph visualization - niche)
- kubeflow-training (Kubernetes - infrastructure dependency)
- lxml (XML parsing - covered by beautifulsoup4)
- cryptography (security - infrastructure dependency)

Other components:
- VSCode extensions (move to optional layer)
- Language servers (move to optional layer)
- MPI tools (openmpi-bin, libopenmpi-dev)
- Cloud tools (kubectl, azcli, argo)

RUN apt-get update && apt-get install -y --no-install-recommends \
    libcairo2-dev libpango1.0-dev libgdk-pixbuf2.0-dev \
    libcurl4-openssl-dev libssl-dev libxml2-dev \
    libfontconfig1-dev libpq-dev zlib1g-dev \
    libharfbuzz-dev libfribidi-dev \
    && rm -rf /var/lib/apt/lists/*

RUN mamba install -y -c conda-forge \
    r-lubridate r-readxl r-haven r-dbi r-broom \
    r-rsample r-knitr r-glue

RUN mamba install -y -c conda-forge \
    jupyter ipython requests fastparquet \
    black flake8 pytest

Remove from the base image:
- R packages: r-hexbin, r-nycflights13, r-randomforest, r-rcurl, r-forecast, r-htmltools, r-htmlwidgets, r-hdf5r, r-catools
- Python packages: altair, bokeh, bottleneck, cloudpickle, cython, dask, dill, h5py, numba, numexpr, patsy, protobuf, pytables, scikit-image, sympy, widgetsnbextension, facets, fire, graphviz, lxml
- Optional layers: VSCode, language servers, cloud tools
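
After the install commands above are applied, it may be worth checking that the newly added Python packages can actually be found by the interpreter. The sketch below is one way to do that using only the standard library. The `IMPORT_NAMES` mapping and the `unavailable` helper are hypothetical names introduced for illustration, and the mapping is needed because a package name and its import name can differ (for example, `ipython` is imported as `IPython`).

```python
import importlib.util

# Hypothetical mapping from conda package name to Python import name.
# It covers the Python packages added above except "jupyter", which is a
# metapackage and is not checked by import in this sketch.
IMPORT_NAMES = {
    "ipython": "IPython",
    "requests": "requests",
    "fastparquet": "fastparquet",
    "black": "black",
    "flake8": "flake8",
    "pytest": "pytest",
}


def unavailable(import_names):
    """Return, sorted, the package names whose import name cannot be located.

    find_spec() looks a module up without importing it, so the check stays
    fast even for heavy packages.
    """
    return sorted(
        package
        for package, module in import_names.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    not_found = unavailable(IMPORT_NAMES)
    if not_found:
        print("Not importable:", ", ".join(not_found))
    else:
        print("All checked packages are importable.")
```

A check like this could run as a final build step so that a missing package shows up when the image is built rather than when a user first needs it. It only confirms that a module can be located, not that it works.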

- Prioritize Core Statistical Packages:
  - Keep r-sf (geospatial statistics)
  - Keep r-sparklyr (big data integration)
  - Keep r-aws.s3 (cloud data access)
  - Keep r-renv (reproducibility)
  - Keep ODBC drivers (database connectivity)
- Enterprise Infrastructure:
  - Keep Oracle/Microsoft ODBC drivers
  - Keep Kubernetes tools (kubectl, argo)
  - Keep authentication tools (krb5-user, git-credential-manager)
- Statistical Workflows:
  - Keep r-tidymodels (modern statistical modeling)
  - Keep statsmodels (statistical testing)
  - Keep scikit-learn (machine learning)
- Reproducibility:
  - Keep r-rmarkdown and jupyter (reproducible documents)
  - Keep r-renv (R environment management)

- Base Image: Keep essential statistical packages
- Optional Layers:
  - cloud-tools: kubectl, azcli, argo
  - dev-tools: VSCode, language servers
  - specialized: dask, scikit-image, h5py
- Documentation: Clear instructions for adding optional layers
- Critical Gaps: Missing core packages like lubridate, readxl, haven, DBI (R) and requests, black, flake8 (Python)
- System Dependencies: Only 6 of 17 required Ubuntu packages are installed (the 11 listed in Section 3 are missing)
- Overhead: ~50% of installed packages are not in the top-20 lists
- Efficiency: Removing niche packages could reduce image size by ~400 MB
- Statistical Focus: Current environment has too many general-purpose data science packages and not enough specialized statistical tools
- Enterprise Ready: Good coverage of enterprise tools (ODBC, Kubernetes, authentication)
The environment is well-suited for enterprise statistical work but needs refinement to focus on core statistical packages while maintaining essential infrastructure components.
📢 Help Shape Our Data Science Environment!
Dear Colleagues,
We're optimizing our Kubeflow environment to better serve your needs. To create a truly useful base setup, we need your input on which packages matter most for your daily work.
Current State & Changes
Our environment already includes core statistical packages (tidyverse, pandas, scikit-learn), enterprise tools (ODBC, Kubernetes), and development environments (VSCode, JupyterLab, RStudio).
We're considering adding essentials like lubridate, haven, and requests, while potentially removing specialized tools like dask and h5py to eliminate bloat.
Here's Where You Come In
What's the package you always have to install when you use the zone? Tell us and maybe you aren't the only one needing it. Or maybe you are—in that case, you're on your own! (Kidding... mostly.)
Please share:
Our Goal
Create a lean base environment with 90% of what you need out-of-the-box, while allowing easy addition of specialized tools.
Reply with your department and package priorities. Your feedback directly shapes our Docker image configuration!
Thanks for helping build better tools for our statistical work.
The Zone Team