@geotheory
Last active November 4, 2024 14:11
GDELT Headlines

Usage

Pass any GDELT API URL generated with https://gdelt.github.io/ to this script. It requests the same query in GDELT's TimelineVolInfo mode (which returns up to 10 article headlines and links per day), saves the results as a searchable DT table in an HTML file, and opens that file in the browser.

# output to generic file which is overwritten by the next query
Rscript <path-to-file>/gdelt-headlines.R 'https://api.gdeltproject.org/api/v2/doc/doc?query=spoons%20sourcecountry:UK%20sourcelang:eng&mode=ArtList&maxrecords=75&format=html&timespan=3w'

# output to specific file
Rscript <path-to-file>/gdelt-headlines.R 'https://api.gdeltproject.org/api/v2/doc/doc?query=spoons%20sourcecountry:UK%20sourcelang:eng&mode=ArtList&maxrecords=75&format=html&timespan=3w' custom.html
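
Before fetching anything, the script rewrites the URL it is given: the `mode` parameter becomes `TimelineVolInfo`, `format` becomes `json`, `&timezoom=yes` is dropped, and spaces and double quotes are percent-encoded. The following is a rough shell sketch of that rewrite, useful for previewing the request URL. `gdelt_timeline_url` is a hypothetical helper that is not part of the gist, and it uses `sed` rather than the script's stringr calls, so treat it as an approximation.

```shell
# Hypothetical helper (not part of the gist): approximates the URL rewrite
# that gdelt-headlines.R applies before requesting the data.
gdelt_timeline_url() {
  url=$(printf '%s' "$1" | sed -E \
    -e 's/\\//g' \
    -e 's/&mode=[^&]*/\&mode=TimelineVolInfo/' \
    -e 's/&format=[^&]*/\&format=json/' \
    -e 's/&timezoom=yes//' \
    -e 's/ /%20/g' \
    -e 's/"/%22/g')
  # Append format=json when the original URL carried no format parameter
  case "$url" in
    *'&format='*) ;;
    *) url="${url}&format=json" ;;
  esac
  printf '%s\n' "$url"
}

gdelt_timeline_url 'https://api.gdeltproject.org/api/v2/doc/doc?query=spoons%20sourcecountry:UK%20sourcelang:eng&mode=ArtList&maxrecords=75&format=html&timespan=3w'
```

For the example URL, only the `mode` and `format` values change; the query, `maxrecords` and `timespan` parameters pass through untouched.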

Bash function to facilitate usage

gdelt-headlines() { Rscript <path-to-file>/gdelt-headlines.R "$@"; }
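
If you would rather get a usage hint than an R error when the URL is forgotten, a slightly stricter wrapper is one option. This is only a sketch: the function name `gdelt_headlines_checked` and the variable `GDELT_HEADLINES_SCRIPT` (assumed to hold the path to your copy of `gdelt-headlines.R`) are hypothetical and not part of the gist.

```shell
# Sketch of a stricter wrapper (hypothetical names, not part of the gist).
# GDELT_HEADLINES_SCRIPT is assumed to hold the path to gdelt-headlines.R.
gdelt_headlines_checked() {
  # Expect the API URL and, optionally, an output file name
  if [ "$#" -lt 1 ] || [ "$#" -gt 2 ]; then
    echo 'usage: gdelt_headlines_checked <gdelt-api-url> [output.html]' >&2
    return 2
  fi
  # "$@" forwards each argument unchanged, even if it contains spaces
  Rscript "$GDELT_HEADLINES_SCRIPT" "$@"
}
```

Remember to quote the URL when calling it, since an unquoted `&` would otherwise be interpreted by the shell.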
# gdelt-headlines.R
suppressPackageStartupMessages({
  # library() stops with an error if a package is missing; require() only warns
  library(DT)
  library(magrittr)
})
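# saveWidget(selfcontained = TRUE) needs pandoc; this path assumes a macOS RStudio install, so adjust it for your machine
Sys.setenv(RSTUDIO_PANDOC = "/Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/x86_64")
# Decode HTML entities (e.g. &amp;) in a headline by parsing it as a tiny HTML fragment
unescape_html = function(str) paste0("<x>", str, "</x>") |> xml2::read_html() |> xml2::xml_text()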
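args = commandArgs(TRUE)
if (length(args) < 1) stop('Usage: Rscript gdelt-headlines.R <gdelt-api-url> [output.html]')
# Optional second argument names the output file; a leading "~" is expanded to the home directory
fn = path.expand(if (length(args) > 1) args[2] else '<default path>/gdelt-report.html')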
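# message(args[1])
# Strip stray backslashes (e.g. from a shell-escaped URL)
u0 = args[1] |> stringr::str_remove_all(stringr::fixed('\\'))
# Force TimelineVolInfo mode and JSON output, and percent-encode spaces and quotes.
# '[^&]*' also matches when the parameter is the last one in the URL.
u = u0 |>
  stringr::str_replace('&mode=[^&]*', '&mode=TimelineVolInfo') |>
  stringr::str_replace('&format=[^&]*', '&format=json') |>
  stringr::str_replace_all(' ', '%20') |>
  stringr::str_replace_all('"', '%22') |>
  stringr::str_remove(stringr::fixed('&timezoom=yes'))
# Add either parameter if the original URL did not carry it
if (!stringr::str_detect(u, '&mode=')) u = paste0(u, '&mode=TimelineVolInfo')
if (!stringr::str_detect(u, '&format=')) u = paste0(u, '&format=json')
# message(u)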
j0 = readLines(u, warn = FALSE) |> jsonlite::fromJSON()
# Use the first query term as the report title (also works for single-term queries)
report_title = stringr::str_extract(u0, '(?<=[?]query=).*?(?=%20| |&|$)')
# message(report_title)
# One row per headline: drop empty time steps, then expand each step's top articles
d = tibble::as_tibble(j0$timeline$data[[1]]) |>
  dplyr::filter(value != 0) |>
  tidyr::unnest(toparts) |>
  dplyr::select(date, title, url) |>
  dplyr::mutate(date = stringr::str_remove(date, 'T.*') |> as.Date(format = '%Y%m%d')) |>
  dplyr::arrange(dplyr::desc(date)) |>
  dplyr::rename(headline = title, link = url) |>
  dplyr::mutate(
    headline = purrr::map_chr(headline, unescape_html),
    # Show only the domain (minus a leading "www.") as the link text
    link = as.character(glue::glue('<a href="{link}" target="_blank">{stringr::str_extract(link, "(?<=://).*?(?=/)") |> stringr::str_remove("^www[.]")}</a>'))
  )
report = DT::datatable(
  d, escape = 2, caption = paste0('GDELT headlines: ', report_title),
  options = list(iDisplayLength = 100),
  filter = list(position = 'top', clear = FALSE)
)
htmlwidgets::saveWidget(report, fn, selfcontained = TRUE)
message('Saved to: ', fn)
browseURL(fn)