@steren
Last active July 13, 2020 17:49
Extract constant Go regular expressions from GitHub
# Extracts constant Go regular expressions from GitHub
# using the BigQuery GitHub public dataset.
# Written in BigQuery legacy SQL (note the [project:dataset.table] syntax).
# To run on the entire GitHub corpus,
# remove the `sample_` prefix from the table names.
# Warning: on the full corpus this query processes ~2.2 TB of data,
# which exceeds the BigQuery free quota.
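SELECT
  # Strip the surrounding double quotes or backticks from the literal.
  REGEXP_EXTRACT(pattern, r'^[\"\`](.*)[\"\`]$') AS pattern,
  COUNT(*) AS cnt
FROM (
  SELECT
    # Capture the string literal passed to Compile, MustCompile, or MatchString.
    REGEXP_EXTRACT(content, r'.(?:(?:Must)?Compile|MatchString)\((\"[^\"]+\"|\`[^\`]+\`)') AS pattern
  FROM (
    SELECT
      id,
      # Split each file on "regexp" so that every use of the package is examined separately.
      SPLIT(content, "regexp") AS content
    FROM
      [bigquery-public-data:github_repos.sample_contents]
    WHERE
      REGEXP_MATCH(content, r'.(?:(?:Must)?Compile|MatchString)\((\"[^\"]+\"|\`[^\`]+\`)')) AS C
  JOIN (
    # Keep only Go source files.
    SELECT
      id
    FROM
      [bigquery-public-data:github_repos.sample_files]
    WHERE
      path LIKE '%.go'
    GROUP BY
      id) AS F
  ON
    C.id = F.id )
WHERE
  # REGEXP_EXTRACT returns NULL when nothing matches; drop those rows.
  pattern IS NOT NULL
GROUP BY
  pattern
ORDER BY
  cnt DESC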