@bryanpaget
Last active August 21, 2025 20:34
analysis of packages

1. Top 20 R Packages for Data Science

  1. tidyverse (core ecosystem)
  2. dplyr (data manipulation)
  3. ggplot2 (visualization)
  4. readr (data import)
  5. tidyr (data tidying)
  6. stringr (string manipulation)
  7. lubridate (date/time handling)
  8. forcats (factor handling)
  9. purrr (functional programming)
  10. readxl (Excel files)
  11. haven (SPSS/SAS/Stata)
  12. DBI (database interface)
  13. broom (model tidying)
  14. rsample (resampling)
  15. caret/tidymodels (ML frameworks)
  16. shiny (web apps)
  17. rmarkdown (reproducible reports)
  18. knitr (dynamic reports)
  19. devtools/remotes (GitHub installs)
  20. glue (string interpolation)

2. Top 20 Python Packages for Data Science

  1. pandas (data manipulation)
  2. numpy (numerical computing)
  3. matplotlib (plotting)
  4. seaborn (statistical viz)
  5. scikit-learn (ML)
  6. jupyter (interactive notebooks)
  7. ipython (enhanced shell)
  8. requests (HTTP)
  9. beautifulsoup4 (web scraping)
  10. sqlalchemy (database ORM)
  11. statsmodels (statistical modeling)
  12. plotly (interactive viz)
  13. dash (web apps)
  14. scipy (scientific computing)
  15. openpyxl (Excel files)
  16. pyarrow (fast I/O)
  17. fastparquet (Parquet support)
  18. black (code formatting)
  19. flake8 (linting)
  20. pytest (testing)

3. Missing Ubuntu System Dependencies

(Required for R/Python packages to install correctly)

libcairo2-dev          # Cairo graphics (ggplot2, svglite)
libpango1.0-dev        # Text rendering (ggplot2)
libgdk-pixbuf2.0-dev   # Image handling (ggplot2)
libcurl4-openssl-dev   # HTTP/SSL (curl, httr)
libssl-dev             # Encryption (openssl)
libxml2-dev            # XML parsing (xml2, rvest)
libfontconfig1-dev     # Font handling (systemfonts)
libpq-dev              # PostgreSQL (RPostgres)
zlib1g-dev             # Compression (many packages)
libharfbuzz-dev        # Text shaping (textshaping)
libfribidi-dev         # Bidirectional text (textshaping)

4. Gap Analysis: Current vs. Required Packages

Installed R Packages (From all sources):

r-tidyverse, r-caret, r-crayon, r-devtools, r-e1071, r-forecast, 
r-hexbin, r-htmltools, r-htmlwidgets, r-irkernel, r-nycflights13, 
r-randomforest, r-rcurl, r-rmarkdown, r-rodbc, r-rsqlite, r-shiny, 
r-tidymodels, r-base, r-arrow, r-aws.s3, r-catools, r-hdf5r, 
r-odbc, r-renv, r-sf, r-sparklyr

Missing R Packages (Top 20 not installed):

  • lubridate (date/time handling)
  • readxl (Excel files)
  • haven (SPSS/SAS/Stata)
  • DBI (database interface)
  • broom (model tidying)
  • rsample (resampling)
  • knitr (dynamic reports)
  • glue (string interpolation)

Installed Python Packages (From all sources):

pandas, numpy, matplotlib, scipy, scikit-learn, seaborn, sqlalchemy, 
statsmodels, beautifulsoup4, openpyxl, plotly, dash, pyarrow, 
altair, bokeh, bottleneck, cloudpickle, cython, dask, dill, h5py, 
ipympl, ipywidgets, jupyterlab-git, numba, numexpr, patsy, protobuf, 
pytables, scikit-image, sympy, widgetsnbextension, xlrd, pillow, 
pyyaml, joblib, s3fs, mkl, fire, graphviz, kubeflow-training, lxml, 
cryptography, jupyterlab, jupyter_contrib_nbextensions, jupyter-server-proxy

Missing Python Packages (Top 20 not installed):

  • jupyter (metapackage)
  • ipython (enhanced shell)
  • requests (HTTP)
  • fastparquet (Parquet support)
  • black (code formatting)
  • flake8 (linting)
  • pytest (testing)
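
The missing-package list above is a plain set difference between the Section 2 list and the installed list. A minimal Python sketch of that comparison follows; both lists are copied from this analysis, while the names `TOP_20_PYTHON`, `INSTALLED_PYTHON`, and `missing_packages` are illustrative and not part of any existing tooling. Matching is by exact name, so `jupyterlab` does not count as `jupyter`.

```python
# Top 20 Python packages, copied from Section 2 of this analysis.
TOP_20_PYTHON = [
    "pandas", "numpy", "matplotlib", "seaborn", "scikit-learn",
    "jupyter", "ipython", "requests", "beautifulsoup4", "sqlalchemy",
    "statsmodels", "plotly", "dash", "scipy", "openpyxl",
    "pyarrow", "fastparquet", "black", "flake8", "pytest",
]

# Installed Python packages, copied from the list above (all sources).
INSTALLED_PYTHON = {
    "pandas", "numpy", "matplotlib", "scipy", "scikit-learn", "seaborn",
    "sqlalchemy", "statsmodels", "beautifulsoup4", "openpyxl", "plotly",
    "dash", "pyarrow", "altair", "bokeh", "bottleneck", "cloudpickle",
    "cython", "dask", "dill", "h5py", "ipympl", "ipywidgets",
    "jupyterlab-git", "numba", "numexpr", "patsy", "protobuf", "pytables",
    "scikit-image", "sympy", "widgetsnbextension", "xlrd", "pillow",
    "pyyaml", "joblib", "s3fs", "mkl", "fire", "graphviz",
    "kubeflow-training", "lxml", "cryptography", "jupyterlab",
    "jupyter_contrib_nbextensions", "jupyter-server-proxy",
}


def missing_packages(wanted, installed):
    """Return the wanted packages absent from installed, in wanted's order."""
    installed_lower = {name.lower() for name in installed}
    return [name for name in wanted if name.lower() not in installed_lower]


if __name__ == "__main__":
    for name in missing_packages(TOP_20_PYTHON, INSTALLED_PYTHON):
        print(name)
```

Keeping the wanted list ordered means the output follows the Section 2 ranking, which makes it easy to compare against the bullets above.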

System Dependencies (Partially installed):

  • Installed: libfreetype6-dev, libpng-dev, libjpeg-dev, libtiff-dev, libfreetype-dev, libfreetype6
  • Missing: All 11 Ubuntu packages listed in Section 3

5. Suggested Packages to Remove

R Packages (Not in Top 20):

  • r-hexbin (hexagonal binning - specialized visualization)
  • r-nycflights13 (example dataset - not production-ready)
  • r-randomforest (redundant with caret/tidymodels)
  • r-rcurl (legacy - modern alternatives in tidyverse)
  • r-forecast (time series - specialized use case)
  • r-htmltools/r-htmlwidgets (web components - niche for statisticians, but htmltools is a dependency of shiny and rmarkdown, so confirm it can go before removing it)
  • r-hdf5r (HDF5 files - specialized scientific format)
  • r-catools (utilities - covered by base R)

Python Packages (Not in Top 20):

  • altair (visualization - redundant with plotly/seaborn)
  • bokeh (visualization - redundant with plotly/dash)
  • bottleneck (optimized numpy - marginal benefit)
  • cloudpickle (serialization - niche use case)
  • cython (C extensions - not needed by most statisticians)
  • dask (parallel computing - overkill for most statistical work)
  • dill (pickle alternative - redundant)
  • h5py (HDF5 files - specialized format)
  • numba (JIT compiler - not needed for statistical analysis)
  • numexpr (fast evaluation - marginal benefit)
  • patsy (statistical formulas - required by statsmodels, so it can only go if statsmodels goes too)
  • protobuf (serialization - infrastructure dependency)
  • pytables (HDF5 files - redundant with h5py)
  • scikit-image (image processing - too specialized for statistical work)
  • sympy (symbolic math - rarely used in statistical analysis)
  • widgetsnbextension (Jupyter widgets - installed as a dependency of ipywidgets, so it cannot be removed on its own)
  • facets (ML visualization - specialized)
  • fire (CLI tools - not data science specific)
  • graphviz (graph visualization - niche)
  • kubeflow-training (Kubernetes - infrastructure dependency)
  • lxml (XML parsing - an optional parser backend for beautifulsoup4; confirm nothing relies on it before removing)
  • cryptography (security - infrastructure dependency)

Development Tools (Consider making optional):

  • VSCode extensions (move to optional layer)
  • Language servers (move to optional layer)
  • MPI tools (openmpi-bin, libopenmpi-dev)
  • Cloud tools (kubectl, azcli, argo)

6. Recommended Actions

Add Missing System Dependencies:

RUN apt-get update && apt-get install -y --no-install-recommends \
    libcairo2-dev libpango1.0-dev libgdk-pixbuf2.0-dev \
    libcurl4-openssl-dev libssl-dev libxml2-dev \
    libfontconfig1-dev libpq-dev zlib1g-dev \
    libharfbuzz-dev libfribidi-dev \
    && rm -rf /var/lib/apt/lists/*

Add Missing R Packages:

mamba install -c conda-forge \
    r-lubridate r-readxl r-haven r-dbi r-broom \
    r-rsample r-knitr r-glue

Add Missing Python Packages:

mamba install -c conda-forge \
    jupyter ipython requests fastparquet \
    black flake8 pytest

Consider Removing:

  • R packages: r-hexbin, r-nycflights13, r-randomforest, r-rcurl, r-forecast, r-htmltools, r-htmlwidgets, r-hdf5r, r-catools
  • Python packages: altair, bokeh, bottleneck, cloudpickle, cython, dask, dill, h5py, numba, numexpr, patsy, protobuf, pytables, scikit-image, sympy, widgetsnbextension, facets, fire, graphviz, lxml
  • Optional layers: VSCode, language servers, cloud tools

7. Context-Specific Recommendations

For National Statistical Organization:

  1. Prioritize Core Statistical Packages:

    • Keep r-sf (geospatial statistics)
    • Keep r-sparklyr (big data integration)
    • Keep r-aws.s3 (cloud data access)
    • Keep r-renv (reproducibility)
    • Keep ODBC drivers (database connectivity)
  2. Enterprise Infrastructure:

    • Keep Oracle/Microsoft ODBC drivers
    • Keep Kubernetes tools (kubectl, argo)
    • Keep authentication tools (krb5-user, git-credential-manager)
  3. Statistical Workflows:

    • Keep r-tidymodels (modern statistical modeling)
    • Keep statsmodels (statistical testing)
    • Keep scikit-learn (machine learning)
  4. Reproducibility:

    • Keep r-rmarkdown and jupyter (reproducible documents)
    • Keep r-renv (R environment management)

Image Optimization Strategy:

  1. Base Image: Keep essential statistical packages
  2. Optional Layers:
    • cloud-tools: kubectl, azcli, argo
    • dev-tools: VSCode, language servers
    • specialized: dask, scikit-image, h5py
  3. Documentation: Clear instructions for adding optional layers
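
The base-plus-optional-layers idea can be written down as a small lookup table. The following Python sketch is one hedged way to model it: the layer names and optional-layer contents come from the strategy above, but the base-layer contents are an assumption (this analysis does not enumerate them), and `resolve_image` is a hypothetical helper, not part of the actual image build.

```python
# Base-layer contents are an illustrative assumption; the analysis only says
# "essential statistical packages" without listing them.
BASE_LAYER = ["r-tidyverse", "pandas", "scikit-learn"]

# Optional layers and their contents, as listed in the strategy above.
OPTIONAL_LAYERS = {
    "cloud-tools": ["kubectl", "azcli", "argo"],
    "dev-tools": ["VSCode", "language servers"],
    "specialized": ["dask", "scikit-image", "h5py"],
}


def resolve_image(selected_layers):
    """Return the base packages followed by those of each selected layer."""
    unknown = [name for name in selected_layers if name not in OPTIONAL_LAYERS]
    if unknown:
        raise KeyError(f"unknown optional layer(s): {', '.join(unknown)}")
    packages = list(BASE_LAYER)
    for name in selected_layers:
        packages.extend(OPTIONAL_LAYERS[name])
    return packages


if __name__ == "__main__":
    # A user who needs only the specialized tools on top of the base image.
    print(resolve_image(["specialized"]))
```

Rejecting unknown layer names up front keeps a typo from silently producing a base-only image.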

8. Key Observations

  • Critical Gaps: Missing core packages like lubridate, readxl, haven, DBI (R) and requests, black, flake8 (Python)
  • System Dependencies: None of the 11 Ubuntu packages required in Section 3 are installed; the 6 system packages that are installed fall outside that list
  • Overhead: ~50% of installed packages aren't in top 20 lists
  • Efficiency: Removing niche packages could reduce image size by ~400MB
  • Statistical Focus: Current environment has too many general-purpose data science packages and not enough specialized statistical tools
  • Enterprise Ready: Good coverage of enterprise tools (ODBC, Kubernetes, authentication)

The environment is well-suited for enterprise statistical work but needs refinement to focus on core statistical packages while maintaining essential infrastructure components.

R Packages Status

All Top 20 R packages are installed:

  • tidyverse, dplyr, ggplot2, readr, tidyr, stringr, lubridate, forcats, purrr, readxl
  • haven, DBI, broom, rsample, caret/tidymodels, shiny, rmarkdown, knitr, devtools, glue

Python Packages Status

Most Top 20 Python packages are installed, but 3 are missing:

  • ✅ pandas, numpy, matplotlib, seaborn, scikit-learn, jupyter (components), ipython
  • ✅ requests, beautifulsoup4, sqlalchemy, statsmodels, plotly, dash, scipy, openpyxl
  • ✅ pyarrow, flake8
  • ❌ Missing: fastparquet, black, pytest

Ubuntu System Dependencies Status

Critical Gap: Only 2 of 11 required system dependencies are installed:

  • ✅ Installed: libssl-dev, zlib1g-dev
  • ❌ Missing:
    • libcairo2-dev (Cairo graphics - needed for ggplot2, svglite)
    • libpango1.0-dev (Text rendering - needed for ggplot2)
    • libgdk-pixbuf2.0-dev (Image handling - needed for ggplot2)
    • libcurl4-openssl-dev (HTTP/SSL - needed for curl, httr)
    • libxml2-dev (XML parsing - needed for xml2, rvest)
    • libfontconfig1-dev (Font handling - needed for systemfonts)
    • libpq-dev (PostgreSQL - needed for RPostgres)
    • libharfbuzz-dev (Text shaping - needed for textshaping)
    • libfribidi-dev (Bidirectional text - needed for textshaping)
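
A status check like the one above can be reproduced from `dpkg-query` output. The sketch below is a minimal Python version under stated assumptions: the required list is copied from Section 3, the function names are hypothetical, and the sample output in the demo is hand-written to mirror the finding above rather than captured from the real image. `query_dpkg` only works on a Debian or Ubuntu host, so the demo does not call it.

```python
import subprocess

# The 11 system packages from Section 3 of this analysis.
REQUIRED_DEB_PACKAGES = [
    "libcairo2-dev", "libpango1.0-dev", "libgdk-pixbuf2.0-dev",
    "libcurl4-openssl-dev", "libssl-dev", "libxml2-dev",
    "libfontconfig1-dev", "libpq-dev", "zlib1g-dev",
    "libharfbuzz-dev", "libfribidi-dev",
]

# One line per package: name, a tab, then the dpkg status string.
DPKG_FORMAT = "${Package}\t${Status}\n"


def query_dpkg():
    """Run dpkg-query on a Debian/Ubuntu host and return its raw output."""
    result = subprocess.run(
        ["dpkg-query", "--show", f"--showformat={DPKG_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def installed_from_dpkg(dpkg_output):
    """Return the names whose dpkg status is exactly 'install ok installed'."""
    installed = set()
    for line in dpkg_output.splitlines():
        name, _, status = line.partition("\t")
        if status.strip() == "install ok installed":
            installed.add(name.strip())
    return installed


def missing_system_packages(required, dpkg_output):
    """Return required packages that are not fully installed, in order."""
    installed = installed_from_dpkg(dpkg_output)
    return [name for name in required if name not in installed]


# Hand-written sample output (an assumption, not captured from the image).
SAMPLE_OUTPUT = (
    "libssl-dev\tinstall ok installed\n"
    "zlib1g-dev\tinstall ok installed\n"
    "libxml2-dev\tdeinstall ok config-files\n"
)

if __name__ == "__main__":
    for name in missing_system_packages(REQUIRED_DEB_PACKAGES, SAMPLE_OUTPUT):
        print(name)
```

On a real host, passing `query_dpkg()` in place of `SAMPLE_OUTPUT` would run the same check against the live package database. A package that was removed but left its configuration behind (status `deinstall ok config-files`) is treated as missing.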

Additional Observations

  1. Redundant Packages: You have many packages that were in our "suggested to remove" list:

    • R: r-hexbin, r-nycflights13, r-randomforest, r-rcurl, r-forecast, r-htmltools, r-htmlwidgets, r-hdf5r, r-catools
    • Python: altair, bokeh, bottleneck, cloudpickle, cython, dask, dill, h5py, numba, numexpr, patsy, protobuf, pytables, scikit-image, sympy, widgetsnbextension, fire, graphviz, lxml
  2. System Info: You're running Ubuntu 24.04.2 LTS (Noble), which is good.

Recommended Actions

1. Install Missing System Dependencies (Critical)

apt-get update && apt-get install -y --no-install-recommends \
    libcairo2-dev libpango1.0-dev libgdk-pixbuf2.0-dev \
    libcurl4-openssl-dev libxml2-dev libfontconfig1-dev \
    libpq-dev libharfbuzz-dev libfribidi-dev

2. Install Missing Python Packages

conda install -c conda-forge fastparquet black pytest
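
After the install, it may be worth confirming that the three packages are actually visible to the interpreter. A minimal sketch using the standard-library `importlib.metadata` module follows; the helper name `installed_versions` is hypothetical, and the check assumes the conda package names match the Python distribution names, which is an assumption to verify for other packages.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["fastparquet", "black", "pytest"])
    for name, version in report.items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run inside the target environment, this prints one line per package, so a failed or partial install shows up immediately.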

3. Consider Removing Redundant Packages (Optional)

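To reduce image size and complexity, consider removing packages not in the top 20 list. Check dependencies first: conda remove also removes any installed package that depends on the ones named, so removing patsy would take statsmodels with it, and removing r-htmltools would take r-shiny and r-rmarkdown.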

R packages to consider removing:

conda remove r-hexbin r-nycflights13 r-randomforest r-rcurl r-forecast r-htmltools r-htmlwidgets r-hdf5r r-catools

Python packages to consider removing:

conda remove altair bokeh bottleneck cloudpickle cython dask dill h5py numba numexpr patsy protobuf pytables scikit-image sympy widgetsnbextension fire graphviz lxml
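
Before running removals like these, it can help to check which packages you intend to keep still depend on a removal candidate. The Python sketch below shows the idea on a small, hand-written excerpt of dependency edges; the `REQUIRES` map is illustrative and incomplete, and `blocked_removals` is a hypothetical helper. In practice the edges should come from the environment itself, for example by reviewing the plan printed by `conda remove --dry-run`.

```python
def blocked_removals(candidates, requires):
    """Map each removal candidate to the kept packages that depend on it."""
    candidate_set = set(candidates)
    blocked = {}
    for package, dependencies in requires.items():
        if package in candidate_set:
            continue  # a package that is itself being removed blocks nothing
        for dependency in dependencies:
            if dependency in candidate_set:
                blocked.setdefault(dependency, []).append(package)
    return blocked


# Hand-written excerpt of dependency edges, for illustration only.
REQUIRES = {
    "statsmodels": ["numpy", "scipy", "pandas", "patsy"],
    "ipywidgets": ["widgetsnbextension"],
    "dask": ["cloudpickle"],
}

CANDIDATES = ["patsy", "widgetsnbextension", "cloudpickle", "dask"]

if __name__ == "__main__":
    for candidate, dependents in blocked_removals(CANDIDATES, REQUIRES).items():
        print(f"{candidate} is still needed by: {', '.join(dependents)}")
```

In this excerpt, patsy and widgetsnbextension are flagged because statsmodels and ipywidgets stay, while cloudpickle is not flagged because its only listed dependent, dask, is being removed as well.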

Summary

Your environment is quite comprehensive with most essential packages installed. The main gaps are:

  1. Critical: 9 missing Ubuntu system dependencies needed for proper graphics and data handling
  2. Important: 3 missing Python packages (fastparquet, black, pytest)
  3. Optional: Many redundant packages that could be removed to streamline the environment

The missing system dependencies are particularly important as they could cause issues with R visualization packages like ggplot2 when rendering to certain formats.

@bryanpaget

bryanpaget commented Aug 20, 2025

📢 Help Shape Our Data Science Environment!

Dear Colleagues,

We're optimizing our Kubeflow environment to better serve your needs. To create a truly useful base setup, we need your input on which packages matter most for your daily work.

Current State & Changes

Our environment already includes core statistical packages (tidyverse, pandas, scikit-learn), enterprise tools (ODBC, Kubernetes), and development environments (VSCode, JupyterLab, RStudio).

We're considering adding essentials like lubridate, haven, and requests, while potentially removing specialized tools like dask and h5py to eliminate bloat.

Here's Where You Come In

What's the package you always have to install when you use the zone? Tell us and maybe you aren't the only one needing it. Or maybe you are—in that case, you're on your own! (Kidding... mostly.)

Please share:

  1. Your top 5 "must-have" packages
  2. Specialized tools critical to your work
  3. Packages you rarely use

Our Goal

Create a lean base environment with 90% of what you need out-of-the-box, while allowing easy addition of specialized tools.

Reply with your department and package priorities. Your feedback directly shapes our Docker image configuration!

Thanks for helping build better tools for our statistical work.

The Zone Team

@bryanpaget

Start with system packages.

Then (maybe) do the shout-out.

@bryanpaget

Make Jira ticket for system packages.

Then ask stakeholders about the shout-out.

@bryanpaget

bryanpaget commented Aug 21, 2025

📢 Help Shape Our Data Science Environment!

Dear Zone Friends,

We're optimizing our Kubeflow environment to better serve your needs. To create a truly useful base setup, we need your input on which packages matter most for your daily work.

Current State & Upcoming Changes

Our environment already includes core statistical packages (tidyverse, pandas, scikit-learn), enterprise tools (ODBC, Kubernetes), and development environments (VSCode, JupyterLab, RStudio).

Critical improvements underway:

  • Adding missing system dependencies for proper graphics rendering and data handling (Cairo, Pango, XML libraries)
  • Installing essential Python packages: fastparquet (Parquet file handling), black (code formatting), pytest (testing)
  • Removing redundant packages to streamline the environment:
    • Python: dask, h5py, bokeh, altair, cloudpickle, and others
    • R: r-hexbin, r-nycflights13, r-randomforest, and others

Here's Where You Come In

What's the package you always have to install when you use The Zone?

Please share:

  1. Your top 5 "must-have" packages
  2. Specialized tools critical to your work
  3. Packages you rarely use (help us identify more removal candidates)

Our Goal

Create a lean base environment with 80-90% of what you need out-of-the-box, while ensuring:

  • Reliability: All graphics and data connections work properly
  • Efficiency: Faster startup times and smaller image size
  • Flexibility: Easily add specialized tools when needed

Your feedback directly shapes our Docker image configuration!

Thanks for helping build better tools for our statistical work.

The Zone Team
