Skip to content

Instantly share code, notes, and snippets.

@zeroasterisk
Created June 4, 2026 23:17
Show Gist options
  • Select an option

  • Save zeroasterisk/9d143b85403e29de0517d028bf889004 to your computer and use it in GitHub Desktop.

Select an option

Save zeroasterisk/9d143b85403e29de0517d028bf889004 to your computer and use it in GitHub Desktop.
PyPI downloads: is that spike real humans or a CI/CD fleet? (BigQuery method + SQL)

Is that PyPI download spike real humans — or a CI/CD fleet?

Raw PyPI download counts conflate humans with mirrors, CI/CD, and transitive dependencies. When a package "takes off" or two packages look like a migration, the naive read is usually wrong. This is a method (+ copy-paste SQL) to tell a human adoption wave from a single-image automation fleet, using the public BigQuery table bigquery-public-data.pypi.file_downloads.

The idea

details.installer.name ("pip"/"uv") is trivially spoofable, so don't trust it alone. Instead segment on the environment fingerprints a CI fleet can't easily diversify and measure how concentrated each becomes across the inflection date:

  • A genuine human influx stays diverse — many Python patch levels, OSes, distros, countries, installer versions.
  • A CI/CD fleet or mirror collapses each dimension onto a single value.

Score "collapse" with the Herfindahl index (HHI, 0→1) and share-of-lift (what fraction of the post-minus-pre delta one value owns). If ~99% of the new volume is one tuple — Linux / one-distro / one-glibc / one-kernel / one-python / one-installer-version / one-country — that's a container image, not a community.

Cost discipline (important)

The table is large and partitioned by timestamp, billed ~$6.25/TB to your project. Always: (1) filter DATE(timestamp) BETWEEN ..., (2) select only needed columns, (3) --dry_run before any billed run, (4) materialize one wide scan and slice the rest offline. A typical full analysis here is well under $1.

Step 1 — reusable base view

-- Replace PROJECT.DATASET and the package list.
CREATE OR REPLACE VIEW `PROJECT.DATASET.downloads_base` AS
SELECT
  DATE(timestamp) AS day, timestamp AS ts,
  file.project AS package, file.version AS version,
  file.type AS file_type, file.filename AS filename, country_code,
  COALESCE(details.installer.name,'unknown') AS installer_name,
  details.installer.version AS installer_version,
  details.python AS python_version,
  details.system.name AS system_name, details.system.release AS system_release,
  details.distro.name AS distro_name, details.distro.version AS distro_version,
  details.distro.libc.lib AS libc_lib, details.distro.libc.version AS libc_version,
  details.ci AS ci_flag,
  CASE WHEN details.installer.name IN ('pip','uv','poetry','pdm')
       THEN 'human' ELSE 'mirror_ci' END AS cls
FROM `bigquery-public-data.pypi.file_downloads`
WHERE file.project IN ('PKG_A','PKG_B');

Step 2 — concentration-diff table function (the whole test in one call)

CREATE OR REPLACE TABLE FUNCTION `PROJECT.DATASET.inflection_concentration`(
  pkg STRING, dim STRING,
  pre_start DATE, pre_end DATE, post_start DATE, post_end DATE
) AS (
  WITH base AS (
    SELECT
      CASE dim
        WHEN 'installer' THEN CONCAT(installer_name,' ',IFNULL(installer_version,''))
        WHEN 'python'    THEN python_version
        WHEN 'system'    THEN system_name
        WHEN 'distro'    THEN CONCAT(distro_name,' ',IFNULL(distro_version,''))
        WHEN 'libc'      THEN CONCAT(libc_lib,' ',IFNULL(libc_version,''))
        WHEN 'kernel'    THEN system_release
        WHEN 'country'   THEN country_code
        WHEN 'version'   THEN version
        WHEN 'filetype'  THEN file_type
        WHEN 'wheel_platform' THEN
          CASE WHEN CONTAINS_SUBSTR(filename,'manylinux') THEN 'manylinux'
               WHEN CONTAINS_SUBSTR(filename,'musllinux') THEN 'musllinux'
               WHEN CONTAINS_SUBSTR(filename,'macosx')    THEN 'macosx'
               WHEN CONTAINS_SUBSTR(filename,'win_amd64') THEN 'win_amd64'
               WHEN CONTAINS_SUBSTR(filename,'none-any')  THEN 'none-any'
               ELSE 'other' END
        ELSE NULL END AS value,
      CASE WHEN day BETWEEN pre_start  AND pre_end  THEN 1 ELSE 0 END AS in_pre,
      CASE WHEN day BETWEEN post_start AND post_end THEN 1 ELSE 0 END AS in_post
    FROM `PROJECT.DATASET.downloads_base`
    WHERE package = pkg AND cls = 'human' AND day BETWEEN pre_start AND post_end
  ),
  agg AS (SELECT value, SUM(in_pre) pre_n, SUM(in_post) post_n
          FROM base WHERE value IS NOT NULL GROUP BY value),
  tot AS (SELECT SUM(pre_n) pre_t, SUM(post_n) post_t,
                 SUM(GREATEST(post_n-pre_n,0)) lift_pos_t FROM agg)
  SELECT a.value, a.pre_n, a.post_n,
    ROUND(100*SAFE_DIVIDE(a.pre_n,  t.pre_t), 2)  AS pre_pct,
    ROUND(100*SAFE_DIVIDE(a.post_n, t.post_t), 2) AS post_pct,
    ROUND(100*SAFE_DIVIDE(a.post_n-a.pre_n, t.lift_pos_t), 2) AS lift_pct,
    ROUND((SELECT SUM(POW(SAFE_DIVIDE(pre_n,  t.pre_t),2))  FROM agg),4) AS hhi_pre,
    ROUND((SELECT SUM(POW(SAFE_DIVIDE(post_n, t.post_t),2)) FROM agg),4) AS hhi_post
  FROM agg a, tot t
);

Step 3 — run it per dimension

SELECT value, pre_pct, post_pct, lift_pct, hhi_pre, hhi_post
FROM `PROJECT.DATASET.inflection_concentration`(
  'YOUR_PACKAGE','python',
  DATE '2026-05-13', DATE '2026-05-19',   -- pre window
  DATE '2026-05-20', DATE '2026-05-26')   -- post (lift) window
ORDER BY post_pct DESC;

Sweep dim over: installer, python, system, distro, libc, kernel, country, version, filetype, wheel_platform.

Interpretation cheat-sheet

Signal Human adoption CI/CD or mirror
Python version HHI low, spread spikes; one patch owns the lift
Installer version many pip/uv versions one pinned version
OS / distro / libc / kernel mixed; macOS+Windows present collapses to one Linux image tuple
Country broad one country owns the lift
Package-version HHI on a rise concentrates on the new release falls (re-resolves many versions)
hour-of-day / weekday diurnal + weekend dip flat 24/7 or sharp cron spikes
file type some sdist + wheels 100% wheel, single platform tag
details.ci=true low higher — but many fleets omit it (weak)

A real-world example: a package's downloads nearly tripled overnight; the fingerprint of the new volume was ~99% Linux / Debian 13 / glibc 2.41 / kernel 4.4.0 / Python 3.11.13 / uv <one-version> / US, while package-version HHI fell. Verdict: one automated source, not adoption — despite a perfect-looking "pip" label.

details.ci is a floor, not a ceiling — the environment fingerprint is the reliable signal. Don't over-trust the self-reported CI flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment