Raw PyPI download counts conflate humans with mirrors, CI/CD, and transitive
dependencies. When a package "takes off" or two packages look like a migration,
the naive read is usually wrong. This is a method (+ copy-paste SQL) to tell a
human adoption wave from a single-image automation fleet, using the public
BigQuery table bigquery-public-data.pypi.file_downloads.
details.installer.name ("pip"/"uv") is trivially spoofable, so don't trust it
alone. Instead segment on the environment fingerprints a CI fleet can't easily
diversify and measure how concentrated each becomes across the inflection date:
- A genuine human influx stays diverse — many Python patch levels, OSes, distros, countries, installer versions.
- A CI/CD fleet or mirror collapses each dimension onto a single value.
Score "collapse" with the Herfindahl index (HHI, 0→1) and share-of-lift (what
fraction of the post-minus-pre delta one value owns). If ~99% of the new volume
is one tuple — Linux / one-distro / one-glibc / one-kernel / one-python / one-installer-version / one-country — that's a container image, not a community.
The table is large and partitioned by timestamp, billed ~$6.25/TB to your
project. Always: (1) filter DATE(timestamp) BETWEEN ..., (2) select only needed
columns, (3) --dry_run before any billed run, (4) materialize one wide scan and
slice the rest offline. A typical full analysis here is well under $1.
-- Replace PROJECT.DATASET and the package list.
CREATE OR REPLACE VIEW `PROJECT.DATASET.downloads_base` AS
SELECT
DATE(timestamp) AS day, timestamp AS ts,
file.project AS package, file.version AS version,
file.type AS file_type, file.filename AS filename, country_code,
COALESCE(details.installer.name,'unknown') AS installer_name,
details.installer.version AS installer_version,
details.python AS python_version,
details.system.name AS system_name, details.system.release AS system_release,
details.distro.name AS distro_name, details.distro.version AS distro_version,
details.distro.libc.lib AS libc_lib, details.distro.libc.version AS libc_version,
details.ci AS ci_flag,
CASE WHEN details.installer.name IN ('pip','uv','poetry','pdm')
THEN 'human' ELSE 'mirror_ci' END AS cls
FROM `bigquery-public-data.pypi.file_downloads`
WHERE file.project IN ('PKG_A','PKG_B');CREATE OR REPLACE TABLE FUNCTION `PROJECT.DATASET.inflection_concentration`(
pkg STRING, dim STRING,
pre_start DATE, pre_end DATE, post_start DATE, post_end DATE
) AS (
WITH base AS (
SELECT
CASE dim
WHEN 'installer' THEN CONCAT(installer_name,' ',IFNULL(installer_version,''))
WHEN 'python' THEN python_version
WHEN 'system' THEN system_name
WHEN 'distro' THEN CONCAT(distro_name,' ',IFNULL(distro_version,''))
WHEN 'libc' THEN CONCAT(libc_lib,' ',IFNULL(libc_version,''))
WHEN 'kernel' THEN system_release
WHEN 'country' THEN country_code
WHEN 'version' THEN version
WHEN 'filetype' THEN file_type
WHEN 'wheel_platform' THEN
CASE WHEN CONTAINS_SUBSTR(filename,'manylinux') THEN 'manylinux'
WHEN CONTAINS_SUBSTR(filename,'musllinux') THEN 'musllinux'
WHEN CONTAINS_SUBSTR(filename,'macosx') THEN 'macosx'
WHEN CONTAINS_SUBSTR(filename,'win_amd64') THEN 'win_amd64'
WHEN CONTAINS_SUBSTR(filename,'none-any') THEN 'none-any'
ELSE 'other' END
ELSE NULL END AS value,
CASE WHEN day BETWEEN pre_start AND pre_end THEN 1 ELSE 0 END AS in_pre,
CASE WHEN day BETWEEN post_start AND post_end THEN 1 ELSE 0 END AS in_post
FROM `PROJECT.DATASET.downloads_base`
WHERE package = pkg AND cls = 'human' AND day BETWEEN pre_start AND post_end
),
agg AS (SELECT value, SUM(in_pre) pre_n, SUM(in_post) post_n
FROM base WHERE value IS NOT NULL GROUP BY value),
tot AS (SELECT SUM(pre_n) pre_t, SUM(post_n) post_t,
SUM(GREATEST(post_n-pre_n,0)) lift_pos_t FROM agg)
SELECT a.value, a.pre_n, a.post_n,
ROUND(100*SAFE_DIVIDE(a.pre_n, t.pre_t), 2) AS pre_pct,
ROUND(100*SAFE_DIVIDE(a.post_n, t.post_t), 2) AS post_pct,
ROUND(100*SAFE_DIVIDE(a.post_n-a.pre_n, t.lift_pos_t), 2) AS lift_pct,
ROUND((SELECT SUM(POW(SAFE_DIVIDE(pre_n, t.pre_t),2)) FROM agg),4) AS hhi_pre,
ROUND((SELECT SUM(POW(SAFE_DIVIDE(post_n, t.post_t),2)) FROM agg),4) AS hhi_post
FROM agg a, tot t
);SELECT value, pre_pct, post_pct, lift_pct, hhi_pre, hhi_post
FROM `PROJECT.DATASET.inflection_concentration`(
'YOUR_PACKAGE','python',
DATE '2026-05-13', DATE '2026-05-19', -- pre window
DATE '2026-05-20', DATE '2026-05-26') -- post (lift) window
ORDER BY post_pct DESC;Sweep dim over: installer, python, system, distro, libc, kernel,
country, version, filetype, wheel_platform.
| Signal | Human adoption | CI/CD or mirror |
|---|---|---|
| Python version HHI | low, spread | spikes; one patch owns the lift |
| Installer version | many pip/uv versions | one pinned version |
| OS / distro / libc / kernel | mixed; macOS+Windows present | collapses to one Linux image tuple |
| Country | broad | one country owns the lift |
| Package-version HHI on a rise | concentrates on the new release | falls (re-resolves many versions) |
| hour-of-day / weekday | diurnal + weekend dip | flat 24/7 or sharp cron spikes |
| file type | some sdist + wheels | 100% wheel, single platform tag |
details.ci=true |
low | higher — but many fleets omit it (weak) |
A real-world example: a package's downloads nearly tripled overnight; the
fingerprint of the new volume was ~99% Linux / Debian 13 / glibc 2.41 / kernel 4.4.0 / Python 3.11.13 / uv <one-version> / US, while package-version HHI fell.
Verdict: one automated source, not adoption — despite a perfect-looking "pip" label.
details.ciis a floor, not a ceiling — the environment fingerprint is the reliable signal. Don't over-trust the self-reported CI flag.