@missinglink
Last active September 13, 2024 20:37
Scrape a list of Europa data portal catalogues & datasets
import got from 'got'
import fs from 'node:fs'
import path from 'node:path'

const API_BASE = 'https://data.europa.eu/api/hub/search'

// Make sure the output root exists before writing anything.
await fs.promises.mkdir('data', { recursive: true })

// Fetch the list of catalogue IDs.
const catalogues: string[] = await got.get(`${API_BASE}/catalogues`).json()

for (const catalogue of catalogues) {
  console.error(`-- ${catalogue} --`)

  // Fetch the dataset IDs that belong to this catalogue.
  const datasets: string[] = await got.get(`${API_BASE}/datasets?catalogue=${catalogue}`).json()
  if (!datasets.length) continue

  const datasetDir = path.resolve('data', catalogue)
  await fs.promises.mkdir(datasetDir, { recursive: true })

  for (const dataset of datasets) {
    const dataPath = path.resolve(datasetDir, `${dataset}.json`)

    // Skip datasets already saved by a previous run (non-empty file on disk).
    if (fs.existsSync(dataPath) && fs.statSync(dataPath).size > 0) continue

    // Typed with an optional `result` so that `data.result` type-checks (`object` has no such property).
    const data: { result?: unknown } = await got.get(`${API_BASE}/datasets/${dataset}`).json()
    if (!data?.result) continue

    await fs.promises.writeFile(dataPath, JSON.stringify(data.result, null, 2))
    console.error(`[write] ${catalogue}/${dataset}.json`)
  }
}
npm install got tsx
node --import=tsx scaper.ts
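Because the script skips any dataset whose JSON file already exists and is non-empty, it can be stopped and re-run to resume. A minimal sketch for checking progress between runs, assuming the `data/<catalogue>/<dataset>.json` layout the script writes; `countSavedDatasets` is a hypothetical helper, not part of the gist:

```typescript
import fs from 'node:fs'
import path from 'node:path'

// Hypothetical helper: counts the non-empty .json files in each
// catalogue directory directly under `root`.
function countSavedDatasets(root: string): Map<string, number> {
  const counts = new Map<string, number>()
  if (!fs.existsSync(root)) return counts

  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    if (!entry.isDirectory()) continue
    const dir = path.join(root, entry.name)

    // Same "already saved" test as the scraper: file exists and size > 0.
    const saved = fs.readdirSync(dir).filter(
      (file) => file.endsWith('.json') && fs.statSync(path.join(dir, file)).size > 0
    )
    counts.set(entry.name, saved.length)
  }
  return counts
}

// Print one line per catalogue directory found under ./data.
for (const [catalogue, count] of countSavedDatasets('data')) {
  console.log(`${catalogue}: ${count}`)
}
```

Empty files are not counted, which matches how the scraper treats them: a zero-byte file is fetched again on the next run.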