-
-
Save seabbs/5a9a22950357d569c482535a12645c2c to your computer and use it in GitHub Desktop.
# Packages ----------------------------------------------------------------- | |
require(data.table, quietly = TRUE) | |
require(future, quietly = TRUE) | |
## Require for data and nowcasting | |
# require(EpiNow, quietly = TRUE) | |
# require(NCoVUtils, quietly = TRUE) | |
## Required for forecasting | |
# require(future.apply, quietly = TRUE) | |
# require(fable, quietly = TRUE) | |
# require(fabletools, quietly = TRUE) | |
# require(feasts, quietly = TRUE) | |
# require(urca, quietly = TRUE) | |
# Get cases --------------------------------------------------------------- | |
NCoVUtils::reset_cache() | |
cases <- NCoVUtils::get_ecdc_cases() | |
cases <- NCoVUtils::format_ecdc_data(cases) | |
cases <- data.table::setDT(cases)[!is.na(region)][, | |
`:=`(local = cases, imported = 0)][, cases := NULL] | |
cases <- data.table::melt(cases, measure.vars = c("local", "imported"), | |
variable.name = "import_status", | |
value.name = "confirm") | |
## Remove regions with data issues | |
cases <- cases[!region %in% c("Faroe Islands", "Sao Tome and Principe", "Tajikistan")] | |
cases <- cases[1:2000] | |
# Get linelist ------------------------------------------------------------ | |
linelist <- | |
data.table::fread("https://raw.githubusercontent.com/epiforecasts/NCoVUtils/master/data-raw/linelist.csv") | |
delays <- linelist[!is.na(date_onset_symptoms)][, | |
.(report_delay = as.numeric(lubridate::dmy(date_confirmation) - | |
as.Date(lubridate::dmy(date_onset_symptoms))))] | |
delays <- delays$report_delay | |
# Set up cores ----------------------------------------------------- | |
if (!interactive()){ | |
options(future.fork.enable = TRUE) | |
} | |
future::plan("multiprocess", | |
gc = TRUE, earlySignal = TRUE) | |
# Fit the reporting delay ------------------------------------------------- | |
delay_defs <- EpiNow::get_dist_def(delays, | |
bootstraps = 10, | |
samples = 50) | |
# Fit the incubation period ----------------------------------------------- | |
## Mean delay | |
exp(EpiNow::covid_incubation_period[1, ]$mean) | |
## Get incubation defs | |
incubation_defs <- EpiNow::lognorm_dist_def(mean = EpiNow::covid_incubation_period[1, ]$mean, | |
mean_sd = EpiNow::covid_incubation_period[1, ]$mean_sd, | |
sd = EpiNow::covid_incubation_period[1, ]$sd, | |
sd_sd = EpiNow::covid_incubation_period[1, ]$sd_sd, | |
max_value = 30, samples = 50) | |
# future::plan("multiprocess", workers = future::availableCores(), | |
# gc = TRUE, earlySignal = TRUE) | |
future::plan("sequential", earlySignal = TRUE, gc = TRUE) | |
# Run pipeline ---------------------------------------------------- | |
EpiNow::regional_rt_pipeline( | |
cases = cases, | |
delay_defs = delay_defs, | |
incubation_defs = incubation_defs, | |
target_folder = "national", | |
case_limit = 60, | |
horizon = 14, | |
nowcast_lag = 8, | |
approx_delay = TRUE, | |
report_forecast = TRUE, | |
forecast_model = function(...) { | |
EpiSoon::fable_model(model = fabletools::combination_model(fable::RW(y ~ drift()), fable::ETS(y), | |
fable::NAIVE(y), | |
cmbn_args = list(weights = "inv_var")), ...) | |
} | |
) | |
future::plan("sequential") | |
# Summarise results ------------------------------------------------------- | |
EpiNow::regional_summary(results_dir = "national", | |
summary_dir = "national-summary", | |
target_date = "latest", | |
region_scale = "Country", | |
csv_region_label = "country", | |
log_cases = TRUE) | |
There are two obvious solutions to this problem both of which make the code less usable:
-
Split each region into a separate Rscript call that is completely isolated. This would be the traditional HPC approach but is not ideal as it reduces the usability of the tooling for others and makes the infrastructure harder to reproduce. Potentially something like
callr
coule be used internally to reproduce this but this also seems clunky. -
Isolate the area of the code causing problems (likely but not definitely the graphing - not definitely as this makes no sense to scale with samples) and remove/run in a separate process. This is also clunky and requires the problem to be fully understood and identified.
There are also a few other options:
-
Nesting parallelization for efficiency. This is again pretty clunky but easy to implement. In theory a good solution (if a bit inefficient) and one that I have previously explored with no luck. As the code base has now changed quite a bit this may now be more possible (testing overnight).
-
Run with few cores so ram >> cores in a batch process of regions. Extremely clunky and hard to predict when failures will occur.
@seabbs Any news on this? I could try to find some R wizards here at our Bioinformatics Core group to have a look. Unfortunately i cant help much with R, never really used it. (Anyway, problem seems to be in C in the end right? Could you provide the valgrind output?)
Hi @adhusch - still a work in progress unfortunately though I think getting there. Any help with the code would be much appreciated at any time. Being new to valgrind I didn't find the output particularly helpful in identifying the code areas in R that were causing problems and I have not been able to find anything on how to do this - anything on this would be great.
New issue detailing progress: epiforecasts/covid#56
This runs the global nowcast/forecast on the first 2000 rows of data sequentially for 50 samples.
If you install
epiforecasts/EpiNow@review
you will get a version of the code that spits out memory used at each step helping to identify where the leak occurs (requirespryr
to be installed).My read on the situation is that it is caused by the saving of plots in
plot_pipeline
but this could be incorrect. To try and contain this I have moved all reporting functions into a clean environment that only depends on empty environment. This does not appear to help.valgrind
suggests there is a memory leak but I have not been able to use it to identify a line of R code as refers to the C code causing the issue.