DatasetAllocator
:
- Helps create the flattened representation of the timeseries tensor from the raw files
DatasetAllocator.record
:
1.
extract_timeseries_dataset.main
:
1.
TimeseriesDataset
:
- Time series tensor representation of the dataset with dimensions (num_patients, num_timesteps, num_codes)
- Each entry of the tensor stores the sum of a patient's code counts, binned into 2-week intervals
TimeseriesDataset.fact_counts / DatasetAllocator.fact_counts / fact_counts.npy
:
- Stores the actual count data in a flattened representation (i.e. it's a 1D array).
- It is created by iterating through the patients and saving the counts for each (timestep, code) observed (sorted by time)
- To reconstruct the counts, we would also need the fact_codes and fact_steps arrays.
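The flattening described above can be sketched roughly as follows. This is an illustration only, not the actual DatasetAllocator code: the dense toy tensor, the helper name flatten_facts, and the patient_offsets array are assumptions made for the example (the notes do not say how patient boundaries are stored).

```python
import numpy as np

def flatten_facts(dense, first_step=-51):
    """Flatten a dense (patients, timesteps, codes) tensor into 1D fact arrays.

    Hypothetical helper: only the three fact_* arrays are named in the notes;
    patient_offsets is an assumed way of marking where each patient starts.
    """
    fact_counts, fact_codes, fact_steps, offsets = [], [], [], [0]
    for patient in dense:
        # np.nonzero walks the 2D array in row-major order, so facts come out sorted by time.
        steps, codes = np.nonzero(patient)
        fact_counts.extend(patient[steps, codes])
        fact_codes.extend(codes)
        fact_steps.extend(steps + first_step)
        offsets.append(len(fact_counts))
    return (np.array(fact_counts), np.array(fact_codes),
            np.array(fact_steps), np.array(offsets))

# Toy data: 2 patients, 156 fortnights, 4 codes.
dense = np.zeros((2, 156, 4), dtype=np.int64)
dense[0, 51, 2] = 3   # patient 0, step 0 (index fortnight), code 2
dense[0, 60, 1] = 1   # patient 0, step 9, code 1
dense[1, 0, 3] = 5    # patient 1, step -51, code 3

fact_counts, fact_codes, fact_steps, patient_offsets = flatten_facts(dense)
```

All three fact arrays have the same length (one entry per observed (timestep, code) pair), which is why none of them can be interpreted without the other two.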
TimeseriesDataset.fact_codes / DatasetAllocator.fact_codes / fact_codes.npy
:
- Stores the fact codes corresponding to the counts in fact_counts array.
- It has values in [0, total_concepts] = [0, ~107k]. These values are mapped to original code names using the ConceptDefs.concept_idx method
- Shape is the same as fact_counts
TimeseriesDataset.fact_steps / DatasetAllocator.fact_steps / fact_steps.npy
:
- Stores the time steps corresponding to the counts in fact_counts array.
- It has values in [-2, 4] years = [-51,104] fortnights (0 corresponds to fortnight of first MDD diagnosis for the patient)
- Shape is the same as fact_counts
TimeseriesDataset.as_csr
:
- Create a sparse 2D matrix representing all patient facts w/ caching.
- Really should be a 3D matrix of (patient,code,time), but scipy.sparse only supports 2D, so instead it's (patient x time,code)
- Contiguous blocks of {rows_per_patient = 6 years = 156} rows represent neighboring fortnights for individual patients
- Uses the 1D arrays [fact_counts, fact_codes, fact_steps] to know which entry to populate with what
- Internally, it:
  1. Remaps step values from [-51, 104] to [0, 155]
  2. Offsets them to match the patient's block of rows in the 2D matrix
  3. Uses the offset steps and the codes as rows and cols for the sparse matrix
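The row/column arithmetic behind as_csr can be sketched as below. This is a guess at the indexing only, not the real method: the function name csr_coordinates and the fact_patients array (patient index of every fact) are assumptions, and a dense NumPy array stands in for the sparse matrix so the indexing can be checked without SciPy.

```python
import numpy as np

ROWS_PER_PATIENT = 156  # 6 years of fortnights, steps -51..104
FIRST_STEP = -51

def csr_coordinates(fact_steps, fact_codes, fact_patients):
    """Return (rows, cols) for a (patient x time, code) matrix.

    fact_patients is an assumed input; the notes do not say how the
    patient of each fact is tracked.
    """
    local_rows = fact_steps - FIRST_STEP                   # 1. [-51, 104] -> [0, 155]
    rows = fact_patients * ROWS_PER_PATIENT + local_rows   # 2. shift into the patient's block
    cols = fact_codes                                      # 3. codes are the columns
    return rows, cols

fact_counts = np.array([3, 1, 5])
fact_codes = np.array([2, 1, 3])
fact_steps = np.array([0, 9, -51])
fact_patients = np.array([0, 0, 1])

rows, cols = csr_coordinates(fact_steps, fact_codes, fact_patients)

# Dense stand-in for the sparse matrix, used here only to check the indexing.
n_patients, n_codes = 2, 4
matrix = np.zeros((n_patients * ROWS_PER_PATIENT, n_codes), dtype=np.int64)
np.add.at(matrix, (rows, cols), fact_counts)
```

With SciPy available, scipy.sparse.csr_matrix((fact_counts, (rows, cols)), shape=matrix.shape) would build the sparse equivalent from the same coordinates, and reshaping the dense version to (n_patients, 156, n_codes) recovers the 3D view.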
TimeseriesDataset.two_week_antidepressants
:
- Subsets the timeseries tensor to 1. only [2 years after first MDD diagnosis ~= 60 fortnights] and 2. only antidepressant indices
- Returns a 3D tensor of shape [n_patients, 60, n_antidepressant_codes]
TimeseriesDataset.three_month_antidepressants
:
- Same subset as above (60 fortnights + antidepressants) but the time window for counts is now [3 months = 6 fortnights] wide (resulting in 10 three-month time blocks in 2 years).
- Returns a 3D tensor of shape [n_patients, 10, n_antidepressant_codes]
TimeseriesDataset.six_month_antidepressants
:
- Same subset as above (60 fortnights + antidepressants) but the time window for counts is now [6 months = 12 fortnights] wide (resulting in 5 six-month time blocks in 2 years).
- Returns a 3D tensor of shape [n_patients, 5, n_antidepressant_codes]
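The three-month and six-month views are just the two-week tensor with neighboring fortnights summed together. A minimal sketch of that rebinning, assuming the blocks are contiguous and non-overlapping (the helper name rebin and the toy tensor are made up for illustration):

```python
import numpy as np

def rebin(two_week, fortnights_per_block):
    """Sum consecutive fortnights into wider blocks along the time axis."""
    n_patients, n_steps, n_codes = two_week.shape
    if n_steps % fortnights_per_block:
        raise ValueError("time axis must divide evenly into blocks")
    n_blocks = n_steps // fortnights_per_block
    # Split the time axis into (block, fortnight-within-block), then sum within each block.
    blocks = two_week.reshape(n_patients, n_blocks, fortnights_per_block, n_codes)
    return blocks.sum(axis=2)

# Toy stand-in for two_week_antidepressants: 3 patients, 60 fortnights, 2 codes.
two_week = np.ones((3, 60, 2), dtype=np.int64)
three_month = rebin(two_week, 6)    # shape (3, 10, 2)
six_month = rebin(two_week, 12)     # shape (3, 5, 2)
```

Total counts are preserved by the rebinning; only the time resolution changes.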
TimeseriesDataset.count_representation
:
- Subsets the timeseries tensor to [2 years before first MDD diagnosis = 51 fortnights] and returns total counts of codes for this period.
- Returns a 2D tensor of shape [n_patients, n_codes]
- I feel this is confusing notation
TimeseriesDataset.count_representation_6_mos_before
:
- Same as TimeseriesDataset.count_representation but now restricted to [6 months before first MDD diagnosis].
- Returns a 2D tensor of shape [n_patients, n_codes]
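Both count representations collapse the pre-diagnosis part of the time axis into a single total per (patient, code). A rough sketch on a dense toy tensor, assuming step 0 itself is excluded and that 6 months means 12 fortnights (neither is stated explicitly for these methods; counts_before_index is a hypothetical helper):

```python
import numpy as np

FIRST_STEP = -51  # earliest fortnight relative to the index date

def counts_before_index(tensor, n_fortnights):
    """Total counts per (patient, code) over the n_fortnights before step 0."""
    index_row = -FIRST_STEP                  # row that holds step 0
    start = index_row - n_fortnights
    return tensor[:, start:index_row, :].sum(axis=1)

tensor = np.zeros((2, 156, 3), dtype=np.int64)
tensor[0, 0, 1] = 2     # step -51: inside the 2-year window only
tensor[0, 45, 1] = 4    # step -6: inside both windows
tensor[0, 51, 1] = 7    # step 0: excluded from both windows (assumption)
tensor[1, 50, 2] = 1    # step -1: inside both windows

two_years_before = counts_before_index(tensor, 51)
six_months_before = counts_before_index(tensor, 12)  # assumes 6 months = 12 fortnights
```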
TimeseriesDataset._counts_and_dems_without_antidepressants
:
- Subsets the TimeseriesDataset.count_representation matrix by excluding the antidepressant codes, and appends patient demographic info to it.
- Returns a 2D tensor of shape [n_patients, n_codes - n_antidepressant_codes + n_dem_features]
TimeseriesDataset._counts_dems_and_ages_without_antidepressants
:
- Same as TimeseriesDataset._counts_and_dems_without_antidepressants but it also appends patient age as a feature
- Same shape as TimeseriesDataset._counts_and_dems_without_antidepressants except that it has 1 more col
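The column bookkeeping for these feature matrices can be sketched like this. The function name, the antidepressant column indices, and the demographic encoding are all hypothetical; only the resulting shapes follow the notes above.

```python
import numpy as np

def counts_and_dems_without_ads(counts, ad_indices, dems, ages=None):
    """Drop antidepressant columns, then append demographics (and optionally age)."""
    features = np.delete(counts, ad_indices, axis=1)
    blocks = [features, dems]
    if ages is not None:
        blocks.append(ages.reshape(-1, 1))   # age becomes one extra column
    return np.hstack(blocks)

counts = np.arange(12).reshape(2, 6)         # 2 patients, 6 codes
ad_indices = np.array([1, 4])                # hypothetical antidepressant columns
dems = np.array([[1, 0], [0, 1]])            # 2 hypothetical demographic features
ages = np.array([34, 58])

without_age = counts_and_dems_without_ads(counts, ad_indices, dems)        # (2, 6 - 2 + 2)
with_age = counts_and_dems_without_ads(counts, ad_indices, dems, ages)     # one more column
```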
TimeseriesDataset._counts_dems_and_ages_without_antidepressants_6_mos_before
:
- Analogous function to TimeseriesDataset._counts_dems_and_ages_without_antidepressants but uses TimeseriesDataset.count_representation_6_mos_before
TimeseriesDatasetExtended.two_week_non_AD_collapsed_codes
:
- Analogous to TimeseriesDataset.two_week_antidepressants except:
  - This excludes antidepressants.
  - Sums the counts of all the included codes for each time block
- Returns an (expanded) 3D tensor of shape [n_patients, 60, 1]
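Collapsing the non-antidepressant codes amounts to masking out the AD columns and summing over the code axis while keeping that axis at length 1. A small sketch, assuming a dense [n_patients, 60, n_codes] input and a made-up list of AD column indices (collapse_non_ad_codes is a hypothetical name):

```python
import numpy as np

def collapse_non_ad_codes(two_week, ad_indices):
    """Sum all non-antidepressant codes into a single channel per fortnight."""
    keep = np.ones(two_week.shape[2], dtype=bool)
    keep[ad_indices] = False
    # keepdims=True preserves the trailing axis, giving the "expanded" 3D shape.
    return two_week[:, :, keep].sum(axis=2, keepdims=True)

two_week = np.arange(2 * 60 * 4).reshape(2, 60, 4)
collapsed = collapse_non_ad_codes(two_week, ad_indices=[1, 2])
```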
TimeseriesDatasetExtended.two_week_psych_codes
:
- Analogous to TimeseriesDataset.two_week_antidepressants except that this includes psych codes using the ConceptDefs.is_psych_code filter.
- Returns a 3D tensor of shape [n_patients, 60, n_psych_codes]
TimeseriesDatasetExtended.build_outcomes
:
- This method defines several outcomes for each patient (all are 1D arrays of shape [n_patients]).
- The count data used is TimeseriesDataset.two_week_antidepressants, TimeseriesDatasetExtended.two_week_non_AD_collapsed_codes and TimeseriesDatasetExtended.two_week_psych_codes.
- These counts all correspond to 2 years from the index prescription. Also, a prediction_time_window_in_fortnights variable is defined to be [3 months = 6 fortnights] (call that variable W in the discussion below).
- Outcomes defined are:
  - switched_treatment: Identifies whether the AD treatment changed during the first W (i.e. 3 months). It does so by first finding timesteps where a unique treatment was provided (some code count > 0), and returns True if there is more than 1 such timestep.
  - antidepressant_afterwards: Identifies whether an AD treatment was provided to the patient in [W, 2W] (the 3 month to 6 month period). Does so by summing all counts at each step and checking if any timestep with count > 0 exists.
  - same_treatment_afterwards: Computes unique AD treatment timesteps in the [0, W] and [W, 2W] windows and checks if these two match. Note that this method doesn't check whether only one treatment is provided in the window.
  - remains_in_care: If any non-AD code exists in the [W, 2W] window, return True.
  - psych_afterwards: If any psych code exists in the [W, 2W] window, return True.
  - stable_treatment: Computed as (not switched_treatment) and (same_treatment_afterwards), i.e. the patient sticks to the same treatment in the [0, 2W] window.
  - dropped_treatment: Computed as (not antidepressant_afterwards) and (remains_in_care and (not psych_afterwards)), i.e. the patient didn't receive an AD or psych code but did receive a non-AD code in the [W, 2W] window (which means they dropped treatment while remaining in care).
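One possible reading of these outcome definitions, written out on toy tensors. This is a sketch under assumptions, not the actual build_outcomes code: it treats "unique treatment" as the set of distinct AD codes with a nonzero count in a window, uses half-open windows [0, W) and [W, 2W), and the toy inputs are invented.

```python
import numpy as np

W = 6  # prediction_time_window_in_fortnights (3 months)

def build_outcomes(ads, non_ad, psych, w=W):
    """Sketch of the outcome logic; inputs are (n_patients, 60, n) count tensors."""
    # Which AD codes appear at least once in [0, W) and in [W, 2W) (assumed reading).
    first = ads[:, :w, :].sum(axis=1) > 0
    second = ads[:, w:2 * w, :].sum(axis=1) > 0

    switched = first.sum(axis=1) > 1                  # more than one distinct AD code early on
    ad_afterwards = second.any(axis=1)
    same_afterwards = (first == second).all(axis=1)   # same set of AD codes in both windows
    remains_in_care = (non_ad[:, w:2 * w, :] > 0).any(axis=(1, 2))
    psych_afterwards = (psych[:, w:2 * w, :] > 0).any(axis=(1, 2))

    return {
        "switched_treatment": switched,
        "antidepressant_afterwards": ad_afterwards,
        "same_treatment_afterwards": same_afterwards,
        "remains_in_care": remains_in_care,
        "psych_afterwards": psych_afterwards,
        "stable_treatment": ~switched & same_afterwards,
        "dropped_treatment": ~ad_afterwards & (remains_in_care & ~psych_afterwards),
    }

# Toy data: 3 patients, 2 AD codes, 1 collapsed non-AD channel, 1 psych code.
ads = np.zeros((3, 60, 2), dtype=np.int64)
non_ad = np.zeros((3, 60, 1), dtype=np.int64)
psych = np.zeros((3, 60, 1), dtype=np.int64)

ads[0, 1, 0] = ads[0, 8, 0] = 1   # patient 0: same AD in both windows
non_ad[0, 7, 0] = 1
ads[1, 0, 0] = ads[1, 3, 1] = 1   # patient 1: two ADs early, none afterwards
non_ad[1, 9, 0] = 1               # ...but still seen for non-AD codes
ads[2, 2, 1] = 1                  # patient 2: one AD early, then no records at all

outcomes = build_outcomes(ads, non_ad, psych)
```

In this toy setup patient 0 comes out as stable_treatment, patient 1 as dropped_treatment, and patient 2 as neither, since they have no codes at all in the second window.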