A Comprehensive Framework for Sepsis-Associated Acute Kidney Injury (SA-AKI) Cohort Extraction and Staging from MIMIC-IV and eICU-CRD Databases
Sepsis-Associated Acute Kidney Injury (SA-AKI) represents a formidable clinical challenge within intensive care, profoundly increasing patient mortality and long-term morbidity.1 The accurate and reproducible identification of this specific patient cohort is a foundational step for developing effective predictive models and advancing clinical research in critical care. This report delineates a rigorous, ontology-driven methodology for defining SA-AKI patient cohorts within two prominent critical care databases: MIMIC-IV v3.1 and eICU-CRD. The framework integrates internationally recognized clinical criteria for sepsis and acute kidney injury with advanced data mapping techniques, ensuring high data fidelity and reproducibility across these heterogeneous electronic health record (EHR) systems. The ultimate output includes the SQL Data Definition Language (DDL) for a comprehensive staging table, designed to support binary mortality prediction, encompassing a rich set of static, dynamic, and intervention-related features.
Effective clinical research leveraging large-scale EHR databases necessitates a profound understanding of their unique structures, data provenance, and inherent complexities. This section provides an overview of the MIMIC-IV and eICU-CRD databases, highlighting their architectural principles, temporal data representations, and critical considerations for researchers.
MIMIC-IV v3.1 is a publicly available, de-identified clinical database sourced from the Beth Israel Deaconess Medical Center (BIDMC) in Boston, MA, spanning patient admissions from 2008 to 2022.1 This extensive dataset comprises records for over 364,000 unique patients, corresponding to more than 546,000 hospitalizations and nearly 95,000 unique Intensive Care Unit (ICU) stays.1 The database is organized into a modular structure, reflecting its data provenance. The
hosp module contains hospital-wide EHR data, including administrative details, billing codes, laboratory results, and pharmacy records that cover a patient's entire hospitalization. In contrast, the icu module provides high-resolution data from the in-ICU clinical information system (MetaVision), offering detailed, often minute-by-minute, physiological measurements, ventilator settings, and fluid intake/output records specific to an ICU stay.1
A crucial aspect of MIMIC-IV's design philosophy is its commitment to providing data "as is," meaning "data cleaning steps were not performed, to ensure the data reflects a real-world clinical dataset".1 This deliberate choice aims to preserve the ecological validity of the data, retaining the artifacts, inconsistencies, and potential errors inherent in routine clinical practice. For a task demanding "perfect calculations" and a "robust plan," this implies that SQL queries must be designed with resilience. This involves anticipating and handling data quality issues such as
NULL values, unexpected string formats, and out-of-range numerical data, often requiring explicit type casting and validation within the SQL itself, even if advanced imputation is deferred. The implication is that the data extraction pipeline must be robust enough to manage real-world data imperfections, ensuring that derived features are reliable despite the raw data's variability.
To comply with HIPAA Safe Harbor provisions, MIMIC-IV employs a thorough de-identification process. All dates and times for a given patient are shifted by a random offset into the future, preserving the internal consistency of a single patient's timeline (e.g., the duration between any two events for a patient remains accurate) while making direct temporal comparisons between different patients impossible.1 The
patients table provides anchor_age, anchor_year, and anchor_year_group to anchor a patient's timeline.1 Notably, the ages of all patients over 89 are obscured, with their
anchor_age recorded as 91. This means an anchor_age of 91 represents the entire group of patients aged 90 and above.1 For mortality prediction, especially in SA-AKI where advanced age is a significant risk factor, this age capping reduces granularity for the oldest demographic. The SQL queries and subsequent feature engineering must acknowledge this by treating the value '91' as a categorical representation of "90 years or older" rather than a precise continuous age, to avoid misinterpretation and ensure appropriate model handling. Free-text fields within the database are de-identified by replacing Protected Health Information (PHI) with three underscores ("___").1
The database is interconnected through a hierarchical system of three primary identifiers: subject_id, hadm_id, and stay_id.1 The
subject_id is a unique integer assigned to each individual patient, linking all data pertaining to a single person across their hospitalizations and care episodes. The hadm_id uniquely identifies each hospital admission, linking all events, diagnoses, and procedures associated with a single inpatient stay. A single subject_id can have multiple hadm_ids. The stay_id uniquely identifies a single, continuous stay within a specific care unit, primarily an ICU, linking high-frequency, granular data from the icu module tables (e.g., chartevents, inputevents) to a specific ICU admission. A single hadm_id can have multiple stay_ids if a patient was transferred between ICUs or readmitted to the ICU during the same hospitalization.1
The eICU Collaborative Research Database (eICU-CRD) v2.0 is a large, de-identified, multi-center critical care database containing granular data from over 200,000 admissions to ICUs across the United States.1 Data were collected between 2014 and 2015 from 208 hospitals and 335 distinct ICU wards participating in the Philips eICU tele-ICU program.1 This origin provides a rich, high-granularity view of real-world clinical practice, encompassing vital sign measurements, laboratory results, diagnoses, treatments, and medication documentation.1
The multi-center nature of eICU-CRD, while enhancing the external validity and generalizability of research findings by capturing a broader range of patient populations and clinical practices, also presents significant analytical challenges.1 The presence, frequency, and even the naming conventions for certain data points—such as specific laboratory tests in the
lab table or charted items in the nursecharting table—can vary substantially across different hospitals (identified by hospitalid) and ICU wards (wardid).1 This inherent heterogeneity means that direct string matching for
labname or drugname is often insufficient for comprehensive data extraction. Therefore, an ontology-driven approach is not merely a suggestion but a necessity for eICU data to overcome this variability. Furthermore, this variability suggests that hospital-level metadata (e.g., numbedscategory, teachingstatus, region from eicu.hospital) could serve as important covariates or predictive features in the final model, accounting for systemic differences in care or patient populations.1
The eICU-CRD database is organized around a hierarchy of unique identifiers: uniquepid, patienthealthsystemstayid, and patientunitstayid.1
patientunitstayid serves as the primary identifier for linking most clinical event data, uniquely identifying a single ICU stay.1 While most tables link to the central
patient table via patientunitstayid, complex clinical relationships (e.g., linking a specific microbiology culture result to a targeted antibiotic administration) are not enforced by explicit foreign key constraints. Researchers must infer these relationships by joining on patientunitstayid and carefully aligning temporal data using offset columns.1 The temporal data in eICU-CRD is represented using offsets in minutes, calculated relative to the moment of a patient's ICU admission (offset 0).1 Events occurring before ICU admission have negative offsets, while events during the ICU stay have positive offsets. The
unitdischargeoffset represents the patient's length of stay in the ICU in minutes.1 This offset system simplifies intra-ICU duration calculations but introduces complexity when analyzing events across multiple ICU stays for the same patient, as each
patientunitstayid has its own "clock." Reconstructing a patient's full hospital journey, if needed for features spanning beyond a single ICU stay, requires careful stitching of individual ICU stay timelines using various admission and discharge offsets from the patient table.1
Critical considerations for researchers using eICU-CRD include the impact of data interfaces and systematic missingness. The availability of any given data type is contingent on whether the corresponding care unit had the necessary "data interface" enabled.1 This means that missing data should not be naively interpreted as the absence of an event. For instance, if an ICU lacked an interface for its infusion pumps, the
infusiondrug table would contain no records for patients from that unit, even if those patients received continuous infusions. This systematic missingness can introduce significant bias if not addressed. Therefore, robust methodology requires identifying hospitals or units that consistently provide the data of interest and restricting analysis to this reliable subset or employing advanced statistical methods to handle non-random missingness.1 The SQL for cohort definition should include a mechanism to filter by
hospitalid if such limitations are identified.
Another crucial distinction in eICU-CRD is between ordered and administered treatments. The medication table primarily reflects medication orders, which do not guarantee administration. In contrast, the infusiondrug and intakeoutput tables contain data charted by nurses at the bedside, reflecting the actual administration of continuous infusions and other fluids.1 For studies involving continuous infusions,
infusiondrug and intakeoutput are more reliable sources for actual administration data.1 For example, when identifying vasopressor use, relying on the
infusiondrug table is medically more accurate for quantifying actual treatment compared to mere orders.
The programmatic extraction of clinical concepts, rather than relying on hardcoded values, is a critical requirement for ensuring medical accuracy and generalizability across heterogeneous datasets. This is achieved by leveraging a suite of ontologies:
- RxNorm: This terminology provides normalized names for generic and branded drugs, linking to various drug vocabularies.2 It is essential for identifying all forms of antibiotic and vasopressor medications.
- LOINC (Logical Observation Identifiers Names and Codes): LOINC standardizes laboratory test names and clinical observations.1 It is crucial for consistently identifying lab results like creatinine, bilirubin, and platelet count across different hospital systems.
- ATC (Anatomical Therapeutic Chemical Classification System): ATC classifies drugs based on their anatomical main group, therapeutic main group, pharmacological subgroup, chemical subgroup, and chemical substance.1 This hierarchical classification provides a broad and comprehensive method for identifying drug classes, such as all systemic antibacterials (J01) or specific vasopressors (e.g., C01CA).6
- HPO (Human Phenotype Ontology): HPO describes phenotypic abnormalities associated with human diseases.1 While less directly used for primary cohort definition in this context, it can be valuable for identifying subtle phenotypic expressions or for future feature enrichment related to disease severity and genetic predispositions.
- UMLS (Unified Medical Language System): UMLS serves as a comprehensive metathesaurus, integrating and linking numerous biomedical vocabularies, including RxNorm and SNOMED_CT, through Concept Unique Identifiers (CUIs) and semantic types.1 This central hub role is vital for achieving comprehensive mapping. By mapping local EHR terms (e.g.,
drugname, labname) to RxNorm or LOINC, and then using UMLS to bridge to ATC and SNOMED_CT via CUIs and relationships, a highly comprehensive and accurate list of relevant clinical concepts can be generated. This programmatic approach is superior to hardcoding specific values and enhances the generalizability and reproducibility of the cohort definition. - SNOMED_CT (Systematized Nomenclature of Medicine—Clinical Terms): SNOMED_CT is a comprehensive, multilingual clinical healthcare terminology that provides concepts, descriptions, and relationships, often mapped to other international standards like ICD.1 It is fundamental for identifying clinical conditions, diagnoses (e.g., AKI, ESRD, infection), and procedures with high specificity and consistency.
The interconnectedness of these ontologies, particularly through UMLS's role as a Metathesaurus, allows for a multi-faceted mapping strategy. For example, a drug name in MIMIC or eICU can be mapped to an RxNorm concept, which then links to an ATC code and a UMLS CUI, which in turn links to SNOMED_CT concepts. This ensures that no relevant drug or condition is missed due to variations in local terminology or data entry practices.
The precise and reproducible definition of the SA-AKI cohort is the cornerstone of any subsequent predictive modeling. This involves the meticulous application of the Sepsis-3 criteria for sepsis and the Kidney Disease: Improving Global Outcomes (KDIGO) criteria for Acute Kidney Injury (AKI), coupled with strict temporal linkage and exclusion criteria.
The Third International Consensus Definitions for Sepsis (Sepsis-3), published in 2016, redefined sepsis as "life-threatening organ dysfunction caused by a dysregulated host response to infection".1 This definition is operationalized by identifying patients with a suspected infection who concurrently exhibit an acute increase of 2 or more points in the Sequential Organ Failure Assessment (SOFA) score. The baseline SOFA score is assumed to be zero for patients without known pre-existing organ dysfunction.1
The anchor point for the Sepsis-3 definition is the presence of an infection. In retrospective database research, a robust and widely accepted proxy is the temporal association of antimicrobial therapy and the collection of microbiological cultures.1 This combination signifies that the clinical team suspected an infection and initiated treatment based on that suspicion. A suspected infection is defined for a given hospital admission (MIMIC
hadm_id) or ICU stay (eICU patientunitstayid) if an antibiotic was administered within a 72-hour window before or a 24-hour window after a microbiological culture was sampled. The infection_onset_time is designated as the earlier of these two events' timestamps or offsets.1
Antibiotic Administration:
For MIMIC-IV, antibiotic orders are identified from the hosp.prescriptions table or, for more granular administration data, the hosp.emar and hosp.emar_detail tables.1 For eICU-CRD, antibiotic orders are identified from the
medication table using the drugstartoffset.1 To avoid hardcoding specific drug names, a comprehensive list of antibiotics is programmatically identified using multiple ontologies:
-
RxNorm: Concepts with tty values such as 'IN' (Ingredient), 'PIN' (Precise Ingredient), 'SCD' (Semantic Clinical Drug), or 'SBD' (Semantic Branded Drug) 1 are queried from
rxnorm.rxnconso where the string representation (str) contains common antibiotic keywords (e.g., 'penicillin', 'cephalexin', 'vancomycin', 'ciprofloxacin', 'levofloxacin').15 Additionally,
rxnorm.rxnsty is used to find rxcuis classified with the semantic type 'Antibiotic'.1 Specific examples include Ampicillin (RxCUI 308212, 313800) 18, Vancomycin (RxCUI 2118449) 19, and Meropenem (RxCUI 29561).20 -
ATC: atc.who_atc_ddd is queried for atc_codes starting with 'J01', which represents "ANTIBACTERIALS FOR SYSTEMIC USE".1 This provides a broad, hierarchical classification of antibiotics.
-
UMLS: umls.mrsty is queried for cuis with the semantic type 'Antibiotic' or 'Clinical Drug'.1 Additionally,
umls.mrconso is searched for strings containing '%antibiotic%'.1 The UMLS Concept Unique Identifier (CUI) C0009450 represents "Infectious disease" 23, which can be used to link to related treatments and conditions.24 -
SNOMED_CT: snomed_ct.snomed_description is queried for terms containing '%antibiotic%'.1
snomed_ct.snomed_relationship is used to find concepts related to antibiotics via the 'Is a' relationship (type_id = 116680003).1For MIMIC-IV, specific icu.inputevents d_items.itemids such as 225798 for 'Vancomycin' are also included.27 For eICU-CRD, the
medication.drugname field is free-text and hospital-specific.1 This necessitates robust pattern matching using
LIKE statements for common antibiotic names (e.g., LIKE '%penicillin%', LIKE '%vancomycin%', LIKE '%cefepime%') 16 in conjunction with the ontology-derived lists.
Culture Collection:
In MIMIC-IV, microbiological culture events (e.g., blood, urine, sputum) are identified from the hosp.microbiologyevents table using chartdate or charttime.1 In eICU-CRD, culture events are identified from the
microlab table using culturetakenoffset.1
The SOFA score is a quantitative measure of organ dysfunction across six critical organ systems.1 An acute increase of 2 or more points from baseline is the cornerstone of the Sepsis-3 definition.1 For each 24-hour period of a patient's ICU stay, the worst value for each component is used to calculate the score.1
- Respiration (PaO2/FiO2 Ratio):
- MIMIC-IV: Arterial partial pressure of oxygen (PaO2) is extracted from hosp.labevents (itemid 50821) 1, and fraction of inspired oxygen (FiO2) from
icu.chartevents (itemid 223835).1 These values must be temporally aligned using their respective
charttime columns. LOINC codes for PaO2 (78354-8) 32 and FiO2 (19996-8, 3150-0) 33 are used for standardization. - eICU-CRD: PaO2 is extracted from the lab table (labname like '%pao2%'), and FiO2 from respiratoryCharting (respchartvaluelabel = 'FiO2') or apacheApsVar (fio2 column) for the first 24 hours. Events are aligned using their offset columns.1
- MIMIC-IV: Arterial partial pressure of oxygen (PaO2) is extracted from hosp.labevents (itemid 50821) 1, and fraction of inspired oxygen (FiO2) from
- Coagulation (Platelet Count):
- MIMIC-IV: Platelet count is extracted from hosp.labevents using itemid 51265.1 The LOINC code for Platelet Count is 777-3.35
- eICU-CRD: Platelet count is extracted from the lab table where labname like '%platelet%'.1
- Liver (Bilirubin):
- MIMIC-IV: Total bilirubin is extracted from hosp.labevents using itemid 50885.1 LOINC codes for Total Bilirubin include 1975-2 and 42719-5.36
- eICU-CRD: Total bilirubin is extracted from the lab table where labname like '%bilirubin%'.1
- Cardiovascular (Mean Arterial Pressure & Vasopressor Use):
- MIMIC-IV: Mean Arterial Pressure (MAP) is extracted from icu.chartevents (using itemids 220052, 220181, or 225312).1 Vasopressor administrations are identified from the
icu.inputevents table by mapping drug names to their corresponding itemids in icu.d_items (e.g., Norepinephrine itemid 221906).1 The LOINC code for MAP is commonly 8480-6. - eICU-CRD: MAP is extracted from vitalPeriodic (systemicmean) or vitalAperiodic (noninvasivemean). Vasopressor use is identified from the infusiondrug table where the drugname column contains keywords for common vasopressors (e.g., 'Norepinephrine', 'Epinephrine', 'Dopamine', 'Vasopressin').1 The distinction between ordered medications and actual administrations is critical here; the
infusiondrug table is the more accurate source for continuous infusions.1 - Ontology Mapping (Vasopressors): To ensure comprehensive identification, a multi-ontology approach is employed:
- RxNorm: rxnorm.rxnconso is queried for tty values ('IN', 'PIN', 'SCD', 'SBD') where str contains common vasopressor keywords (e.g., 'phenylephrine', 'norepinephrine', 'epinephrine', 'vasopressin', 'dopamine', 'dobutamine').41
rxnorm.rxnsty is also used to find rxcuis with sty = 'Vasopressor'.1 Examples include Vasopressin (RxCUI 569490, 11149) 45, Norepinephrine (RxCUI 7512) 49, Epinephrine (RxCUI 727316) 51, Dopamine (RxCUI 1743871) 53, Phenylephrine (RxCUI 1666372) 55, and Dobutamine (RxCUI 203121).57 - ATC: atc.who_atc_ddd is queried for atc_names containing '%norepinephrine%', '%epinephrine%', etc..1 Specific ATC codes include Norepinephrine (C01CA03) 8, Epinephrine (C01CA24) 60, Dopamine (C01CA04) 62, Vasopressin (H01BA01) 64, Phenylephrine (R01AA04, R01AB01) 66, and Dobutamine (C01CA07).68
- UMLS: umls.mrsty is queried for cuis with sty = 'Vasopressor'.1
umls.mrconso is searched for strings containing '%vasopressor%'.1 - SNOMED_CT: snomed_ct.snomed_description is queried for terms containing '%vasopressor%'.1
snomed_ct.snomed_relationship is used to find related concepts via the 'Is a' relationship (type_id = 116680003).1
- RxNorm: rxnorm.rxnconso is queried for tty values ('IN', 'PIN', 'SCD', 'SBD') where str contains common vasopressor keywords (e.g., 'phenylephrine', 'norepinephrine', 'epinephrine', 'vasopressin', 'dopamine', 'dobutamine').41
- MIMIC-IV: Mean Arterial Pressure (MAP) is extracted from icu.chartevents (using itemids 220052, 220181, or 225312).1 Vasopressor administrations are identified from the
- Central Nervous System (Glasgow Coma Scale - GCS):
- MIMIC-IV: The total GCS score is calculated by summing its three components: Eyes (itemid 220739), Verbal (itemid 223900), and Motor (itemid 223901), all found in icu.chartevents.1
- eICU-CRD: GCS is often recorded as a total score in nurseCharting (nursingchartcelltypevallabel = 'Glasgow coma score') or as its individual components in apacheApsVar (eyes, motor, verbal) for the first 24 hours.1
- Renal (Creatinine & Urine Output):
- MIMIC-IV: Serum creatinine is extracted from hosp.labevents (itemid 50912).1 Total urine output over a period is calculated by summing
value from icu.outputevents for relevant itemids like 226559, 226561.1 LOINC codes for Creatinine include 2161-8 and 2162-6 for urine 4, and common LOINC 2160-0 for serum/plasma creatinine. The LOINC code for Urine Output is 9192-6.79 - eICU-CRD: Serum creatinine is extracted from the lab table (labname = 'creatinine'). Urine output is extracted from the intakeOutput table where celllabel = 'Urine'.1
- MIMIC-IV: Serum creatinine is extracted from hosp.labevents (itemid 50912).1 Total urine output over a period is calculated by summing
The sepsis_onset_time is defined as the first timestamp (or offset) at which a patient satisfies both the criteria for a suspected infection (as defined in Section III.A.1) and demonstrates an acute increase in the total SOFA score of 2 or more points from a baseline of zero (as defined in Section III.A.2).1
The second pillar of the SA-AKI definition is the diagnosis of AKI according to the Kidney Disease: Improving Global Outcomes (KDIGO) 2012 clinical practice guideline.1 AKI is defined by the presence of any one of the following criteria 1:
- An increase in serum creatinine (SCr) by ≥0.3 mg/dL (≥26.5 μmol/L) within 48 hours.
- An increase in SCr to ≥1.5 times a baseline value, which is known or presumed to have occurred within the prior 7 days.
- A urine volume of less than 0.5 mL/kg/h for 6 hours.
Relevant ICD-9 codes for AKI include 584.5 (acute renal failure with tubular necrosis), 584.9 (acute renal failure, unspecified), and 586 (renal failure, unspecified).80 Corresponding ICD-10 codes are N17.0 to N17.9 (Acute renal failure) and N19 (Unspecified kidney failure).82
Ontology Mapping (AKI):
- UMLS: Concepts such as "acute kidney injury" or "acute renal failure" are identified, with CUIs like C0022660 ("Acute kidney injury").84
- SNOMED_CT: The concept ID 14669001 for "Acute kidney injury" 84 is used, and
snomed_ct.snomed_icd_map is leveraged to link to relevant ICD codes.
A significant practical challenge in applying the KDIGO criteria retrospectively is the frequent absence of a known, pre-illness baseline SCr value.1 The choice of baseline can dramatically affect cohort composition, as a poorly chosen baseline can lead to misclassification of AKI.1 A robust, hierarchical method for determining baseline SCr, reflecting clinical best practice, is implemented as follows 1:
- Recent Outpatient Value (Ideal): The median outpatient SCr value recorded in the 365 to 7 days prior to the current hospital admission. This represents the patient's stable, pre-illness renal function. In MIMIC-IV, this is queried from hosp.labevents where hadm_id is null but charttime falls in the correct window relative to the admission time.1 In eICU-CRD, this corresponds to
lab entries with a negative labresultoffset.1 - Admission Value: If no recent outpatient value is available, the first SCr measurement recorded upon hospital admission is used.
- Nadir In-Hospital Value: If an admission value is also unavailable, the lowest (nadir) SCr value measured within the first 48 hours of ICU admission can be used, as this may represent renal recovery after initial hydration.
- MDRD Estimation (Final Resort): As a final resort, if no measured values are available, a baseline SCr can be estimated by back-calculating from a presumed normal glomerular filtration rate (GFR) of 75 mL/min/1.73 m², using the Modification of Diet in Renal Disease (MDRD) equation and the patient's demographic data (age, sex, race). Any patient whose baseline is derived this way is flagged, as it represents an estimation rather than a direct measurement.1
The reliance on estimated GFR (MDRD) as a last resort introduces a potential source of misclassification and reduces the fidelity of the AKI diagnosis for those patients. This methodological choice is acknowledged and a flag is included in the SQL DDL to identify these cases, allowing for sensitivity analysis in subsequent modeling.
The aki_onset_time is defined as the earliest timestamp (or offset) at which any of the KDIGO criteria for SCr or Urine Output (UO) are met. This requires continuous monitoring of both lab values and hourly urine output calculations.1
To be classified as SA-AKI, the onset of AKI must be temporally linked to the onset of sepsis. Research consensus suggests that AKI developing within a narrow window around the sepsis diagnosis is most likely to be directly caused by the septic insult.1 For this framework, the "Early SA-AKI" phenotype is targeted: the
aki_onset_time must occur within the window of 48 hours before to 48 hours after the sepsis_onset_time.1 For MIMIC-IV, this involves
TIMESTAMP comparisons, while for eICU-CRD, it involves offset comparisons (±2880 minutes).1
The "definitional paradox" highlights that any patient with a documented infection who develops AKI automatically meets the criteria for sepsis.1 This means the SA-AKI cohort will be inherently heterogeneous in severity, ranging from mild, transient AKI (e.g., Stage 1, low urine output) to severe, persistent kidney failure.1 This inherent heterogeneity within the SA-AKI label necessitates additional features to capture clinical nuance, such as
aki_trigger, persistent_aki, and kdigo_stage_max_48hr 1, to improve mortality prediction by allowing the model to differentiate between various levels of SA-AKI severity.
Inclusion Criteria:
- Patient is an adult (Age ≥ 18 years).
- Patient meets Sepsis-3 criteria, establishing a sepsis_onset_time.
- Patient meets KDIGO criteria for AKI (any stage), establishing an aki_onset_time.
- Temporal Linkage: The aki_onset_time must occur within the window of 48 hours before to 48 hours after the sepsis_onset_time.1
Exclusion Criteria:
- Pre-existing End-Stage Renal Disease (ESRD) or Chronic Maintenance Dialysis: Patients with ESRD or those on chronic maintenance dialysis prior to admission are excluded, as they cannot develop a new AKI.1
- ICD-9 codes: '585.6', 'V45.1' (long-term dialysis), 'V56.x' (dialysis codes).1
- ICD-10 codes: 'N18.6' (End stage renal disease), 'Z99.2' (dependence on renal dialysis), 'Z49.x' (care involving dialysis).1
- Ontology Mapping (ESRD):
- UMLS: Concepts such as "end stage renal disease," "ESRD," or "chronic dialysis" are identified, including CUIs like C1561643 ("Chronic kidney disease") 88 and specific CUIs for ESRD.89
- SNOMED_CT: The concept ID 46177005 for "End-stage renal disease" 91 is used, and
snomed_ct.snomed_icd_map is leveraged to find associated ICD codes.
- Short ICU Stay: Patients with an ICU length of stay less than 24 hours are excluded, as this is insufficient time to reliably observe the development of AKI and gather necessary data.1 For MIMIC-IV, this is calculated from
intime and outtime in icu.icustays (los column). For eICU-CRD, it is calculated from unitdischargeoffset in the patient table. - Early Renal Replacement Therapy (RRT): Patients who receive RRT within the first 24 hours of ICU admission are excluded because the intervention fundamentally alters the natural course of AKI and its associated biomarkers, confounding the predictive modeling task.1
- eICU-CRD Specific Data Interface Limitations: Patients admitted to hospitals or units with known incomplete data interfaces for critical variables (e.g., laboratory results, infusion medications) should be excluded to prevent systematic bias.1 This means that missing data should not be assumed to be an absence of an event.1 An initial exploratory analysis is performed to identify and retain only those hospitals that consistently provide the data required for this study. The SQL for cohort definition includes a mechanism to filter by
hospitalid if such systematic limitations are identified.
Table 1 provides a summary of the logic required to define the SA-AKI cohort.
Table 1: SA-AKI Cohort Definition Logic (Expanded)
Criterion | Rationale | MIMIC-IV Implementation (Tables, Columns/itemids, Ontology Mapping) | eICU-CRD Implementation (Tables, Columns/labname pattern, Ontology Mapping) | Temporal Logic |
---|---|---|---|---|
Sepsis-3 | Defines sepsis as infection plus organ dysfunction. | |||
Suspected Infection | Proxy for infection onset based on clinical action. | hosp.prescriptions, hosp.emar, hosp.emar_detail, hosp.microbiologyevents. Antibiotic mapping via RxNorm (TTY IN, PIN, SCD, SBD; sty='Antibiotic'), ATC (J01%), UMLS (sty='Antibiotic'), SNOMED_CT (term LIKE '%antibiotic%'). | medication, microlab. Antibiotic mapping via RxNorm (TTY IN, PIN, SCD, SBD; sty='Antibiotic'), ATC (J01%), UMLS (sty='Antibiotic'), SNOMED_CT (term LIKE '%antibiotic%'), and drugname pattern matching. | Antibiotic within 72h before or 24h after culture. |
SOFA ≥2 | Quantifies acute organ dysfunction. | |||
- Respiration | PaO2/FiO2 ratio. | hosp.labevents (itemid: 50821), icu.chartevents (itemid: 223835). LOINC PaO2 (78354-8), FiO2 (19996-8, 3150-0). | lab (labname like '%pao2%'), respiratoryCharting (respchartvaluelabel='FiO2'), apacheApsVar (fio2). | Worst value in 24h period. |
- Coagulation | Platelet count. | hosp.labevents (itemid: 51265). LOINC Platelet Count (777-3). | lab (labname like '%platelet%'). | Worst value in 24h period. |
- Liver | Total bilirubin. | hosp.labevents (itemid: 50885). LOINC Bilirubin (1975-2, 42719-5). | lab (labname like '%bilirubin%'). | Worst value in 24h period. |
- Cardiovascular | MAP and vasopressor use. | icu.chartevents (MAP itemids: 220052, 220181, 225312), icu.inputevents (vasopressor itemids, e.g., 221906 for Norepinephrine). Vasopressor mapping via RxNorm (TTY IN, PIN, SCD, SBD; sty='Vasopressor'), ATC (C01CA%, H01BA%), UMLS (sty='Vasopressor'), SNOMED_CT (term LIKE '%vasopressor%'). | vitalPeriodic/vitalAperiodic (MAP columns), infusiondrug (drugname keywords for vasopressors). | Worst value in 24h period. |
- CNS | Glasgow Coma Scale. | icu.chartevents (GCS component itemids: 220739, 223900, 223901). | nurseCharting (nursingchartcelltypevallabel='Glasgow coma score'), apacheApsVar (eyes, motor, verbal). | Worst value in 24h period. |
- Renal | Creatinine and Urine Output. | hosp.labevents (itemid: 50912), icu.outputevents (Urine itemids: 226559, 226561), icu.chartevents (Weight itemid: 226512). LOINC Creatinine (2160-0, 2161-8, 2162-6), Urine Output (9192-6). | lab (labname='creatinine'), intakeOutput (celllabel='Urine'), patient (admissionweight). | Worst value in 24h period. |
KDIGO-AKI | Defines acute kidney injury. | |||
SCr Criteria | ≥0.3 mg/dL rise in 48h or ≥1.5x baseline in 7d. | hosp.labevents (itemid: 50912). Baseline SCr: hierarchical (outpatient, admission, nadir in-hospital, MDRD estimation). | lab (labname='creatinine'). Baseline SCr: hierarchical (negative offset, admission, nadir in-ICU, MDRD estimation). | Earliest time KDIGO criteria met. |
UO Criteria | <0.5 mL/kg/h for 6h. | icu.outputevents (Urine itemids), icu.chartevents (Weight itemid: 226512). | intakeOutput (celllabel='Urine'), patient (admissionweight). | Earliest time KDIGO criteria met. |
Temporal Link | Ensures AKI is associated with sepsis. | aki_onset_time within ±48h of sepsis_onset_time. | aki_onset_offset within ±2880 min of sepsis_onset_offset. | |
Exclusions | Create a clean, relevant cohort. | |||
ESRD | Remove patients with pre-existing terminal kidney failure. | hosp.diagnoses_icd (ICD9: '585.6', 'V45.1', 'V56.x'; ICD10: 'N18.6', 'Z99.2', 'Z49.x'). Ontology mapping for ESRD: UMLS, SNOMED_CT (46177005). | diagnosis (ICD9: '585.6', 'V45.1', 'V56.x'; ICD10: 'N18.6', 'Z99.2', 'Z49.x'). Ontology mapping for ESRD: UMLS, SNOMED_CT (46177005). | |
ICU Stay < 24h | Ensure sufficient data capture. | icu.icustays (los column). | patient (unitdischargeoffset column). | |
Early RRT | Remove patients whose AKI course is altered by early intervention. | icu.procedureevents (RRT itemids), mimic-code RRT concept. | treatment (treatmentstring like '%renal % dialysis%'), apacheApsVar (dialysis flag). | RRT within first 24h of ICU admission. |
eICU Data Interface Limitations | Prevent bias from systematic missingness. | N/A | Exclude hospitals with consistently incomplete data for critical variables (e.g., labs, infusions). |
Once the SA-AKI cohort is meticulously defined, the next critical step is to engineer a rich set of features that characterize each patient for mortality prediction. These features capture the patient's baseline state, dynamic physiological changes, and critical interventions, providing comprehensive context for the predictive model.
These fundamental covariates capture intrinsic patient characteristics and administrative details at the time of ICU admission.
- Demographics and Anthropometrics:
- Age: In MIMIC-IV, age is calculated using hosp.patients.anchor_age and hosp.patients.anchor_year relative to hosp.admissions.admittime.1 A critical aspect is the handling of
anchor_age 91, which represents all patients aged 90 and above.1 This value is treated as a censored variable or a separate category in the DDL, rather than a precise continuous age, to avoid misinterpretation and ensure appropriate model handling for this sensitive demographic. In eICU-CRD, age is extracted from
eicu.patient.age, which is a varchar field. This requires preprocessing to convert it to a numeric type, typically by replacing the string "> 89" with a value like 91.1 - Gender & Ethnicity: For MIMIC-IV, these are extracted directly from hosp.patients.gender and hosp.admissions.race. For eICU-CRD, they are extracted from eicu.patient.gender and eicu.patient.ethnicity.1
- Height & Weight: In MIMIC-IV, admission height (cm) and weight (kg) are typically found in icu.chartevents using itemids 226730 and 226512, respectively, with the first recorded value during the ICU stay being used.1 In eICU-CRD, these are extracted directly from the
admissionheight (cm) and admissionweight (kg) columns in the patient table.1
- Age: In MIMIC-IV, age is calculated using hosp.patients.anchor_age and hosp.patients.anchor_year relative to hosp.admissions.admittime.1 A critical aspect is the handling of
- Admission and Hospital-Level Information:
- Admission Source & Type: For MIMIC-IV, admission_location and admission_type from hosp.admissions provide the source (e.g., 'EMERGENCY ROOM', 'PHYSICIAN REFERRAL') and type (e.g., 'EW EMER.', 'URGENT') of admission.1 For eICU-CRD,
unitadmitsource from the patient table (e.g., 'Emergency Department', 'Operating Room') serves this purpose.1 - First Care Unit (MIMIC-IV): first_careunit from icu.icustays provides the specific type of ICU (e.g., 'Medical Intensive Care Unit (MICU)').1
- Hospital Characteristics (eICU-CRD): The multi-center nature of eICU-CRD makes hospital-level features uniquely valuable. These are extracted from the hospital table, linked via hospitalid, and include numbedscategory (hospital size), teachingstatus (boolean), and region (US geographical region).1 These features are crucial for accounting for site-specific variations in practice and outcomes, or even as predictive features themselves, given the inherent heterogeneity across hospitals in eICU-CRD.1
- Admission Source & Type: For MIMIC-IV, admission_location and admission_type from hosp.admissions provide the source (e.g., 'EMERGENCY ROOM', 'PHYSICIAN REFERRAL') and type (e.g., 'EW EMER.', 'URGENT') of admission.1 For eICU-CRD,
- Comorbidity Assessment via ICD Codes (Charlson Comorbidity Index):
A patient's burden of chronic disease is a powerful predictor of their ability to survive a critical illness. A standardized approach using a validated comorbidity index, such as the Charlson Comorbidity Index, is employed. This method provides a weighted score that summarizes the overall burden of chronic disease, creating a powerful and reproducible feature.1 The implementation involves extracting all International Classification of Diseases (ICD) codes associated with a patient's hospital stay and mapping them to the 17 conditions defined by the index.-
MIMIC-IV: All icd_code and icd_version (9 and 10) for each hadm_id are extracted from the hosp.diagnoses_icd table.1
-
eICU-CRD: All icd9code for each patientunitstayid are extracted from the diagnosis table. This field may contain multiple comma-separated codes that require careful parsing.1
Once extracted, these codes are mapped to the 17 Charlson comorbidities using established code sets (e.g., Deyo for ICD-9 and Quan for ICD-10).1 This process results in 17 binary features (e.g.,
comorb_mi, comorb_chf, comorb_cancer), where a value of 1 indicates the presence of the comorbidity. Table 2 provides the specific code mappings for key conditions relevant to the SA-AKI cohort.
-
Table 2: Comorbidity to ICD Code Mapping (Charlson Index, Deyo/Quan Adaptation)
Charlson Comorbidity | Weight | ICD-9 Codes (Deyo-like) | ICD-10 Codes (Quan-like) |
---|---|---|---|
Myocardial Infarction | 1 | 410.x, 412.x | I21.x, I22.x, I25.2 |
Congestive Heart Failure | 1 | 428.x | I50.x, I09.9, I11.0, I13.0, I13.2 |
Peripheral Vascular Disease | 1 | 443.9, 441.x, 785.4, V43.4 | I70.x, I71.x, I73.8, I73.9 |
Cerebrovascular Disease | 1 | 430.x-438.x | I60.x-I69.x, G45.x, G46.x |
Dementia | 1 | 290.x, 294.1, 331.2 | F00.x-F03.x, F05.1, G30.x, G31.1 |
Chronic Pulmonary Disease | 1 | 490.x-496.x, 500.x-505.x, 506.4 | J40.x-J47.x, J60.x-J67.x |
Connective Tissue Disease | 1 | 710.0, 710.1, 710.4, 714.x, 725.x | M05.x, M06.x, M31.5, M32.x-M34.x, M35.1, M35.3, M36.0 |
Peptic Ulcer Disease | 1 | 531.x-534.x | K25.x-K28.x |
Mild Liver Disease | 1 | 571.2, 571.4, 571.5, 571.6 | B18.x, K70.0, K70.3, K70.9, K71.7, K73.x, K74.x, K76.0 |
Diabetes (uncomplicated) | 1 | 250.0, 250.1, 250.2, 250.3, 250.7 | E10.0, E10.1, E10.9, E11.0, E11.1, E11.9 |
Diabetes (with complications) | 2 | 250.4, 250.5, 250.6 | E10.2-E10.5, E10.7, E11.2-E11.5, E11.7 |
Hemiplegia or Paraplegia | 2 | 344.1, 342.x | G81.x, G82.x, G04.1, G83.0-G83.4, G83.9 |
Renal Disease | 2 | 582.x, 583.x, 585.x, 586.x, V42.0, V45.1 | N18.x, N19.x, N03.2-N03.7, N05.2-N05.7, Z49.x, Z99.2 |
Any Malignancy | 2 | 140.x-172.x, 174.x-195.8, 200.x-208.x | C00.x-C26.x, C30.x-C41.x, C43.x, C45.x-C58.x, C60.x-C76.x, C81.x-C85.x, C88.x, C90.x-C97.x |
Moderate/Severe Liver Disease | 3 | 572.2-572.8, 456.0-456.2 | I85.0, I85.9, I86.4, K70.4, K71.1, K72.1, K72.9, K76.5-K76.7 |
Metastatic Solid Tumor | 6 | 196.x-199.x | C77.x-C80.x |
AIDS/HIV | 6 | 042.x | B20.x, B21.x, B22.x, B24.x |
The initial 24 hours of an ICU stay are a critical period, establishing a patient's physiological trajectory. Extracting and summarizing high-frequency time-series data from this window provides a rich set of features highly predictive of subsequent outcomes.1 A primary challenge is mapping clinically equivalent concepts across the two database schemas, as MIMIC-IV uses a standardized
itemid system, while eICU-CRD employs a combination of fixed columns and non-standardized, free-text labname fields.1 For each dynamic variable, the raw time-series data from the first 24 hours of the ICU stay are extracted and summarized into static features including minimum, maximum, mean, and standard deviation, capturing both central tendency and lability.
- Vital Signs (First 24 Hours): Core physiological measurements are continuously or frequently tracked in the ICU.
- MIMIC-IV: icu.chartevents is the primary source for high-resolution vital signs.1
- eICU-CRD: vitalPeriodic contains 5-minute median values for most vital signs, while vitalAperiodic contains intermittent measurements.1
- Features: Heart Rate, Systolic Blood Pressure, Diastolic Blood Pressure, Mean Arterial Pressure, Respiratory Rate, Temperature, SpO2.
- Laboratory Biomarkers (First 24 Hours): Laboratory results provide insights into various organ systems. The main challenge, especially in eICU-CRD, is the lack of standardization in test names. The lab.labname field is hospital-specific, meaning the same test (e.g., creatinine) may have dozens of different names (e.g., 'creatinine', 'CREAT', 'serum creat', 'Cr') across the database.1 A successful extraction pipeline must use pattern matching (e.g., SQL
LIKE statements or regular expressions) to aggregate all variations of a test into a single canonical feature.- MIMIC-IV: hosp.labevents is the primary source, with tests identified by itemid.1
- eICU-CRD: The lab table is the primary source. The apacheApsVar table also contains single "worst" values for many key labs in the first 24 hours and can supplement the lab table.1
- Features: Creatinine, Blood Urea Nitrogen (BUN), Anion Gap, Sodium, Potassium, Chloride, Bicarbonate (HCO3), Alanine Aminotransferase (ALT), Aspartate Aminotransferase (AST), Alkaline Phosphatase (Alk Phos), Total Bilirubin, Albumin, White Blood Cell Count (WBC), Platelet Count, Hemoglobin, Hematocrit, International Normalized Ratio (INR), Prothrombin Time (PT), Partial Thromboplastin Time (PTT), C-Reactive Protein (CRP), Lactate, Glucose, pH, PaO2, PaCO2.
Table 3 provides a comprehensive mapping of canonical clinical features to their specific implementations in both databases.
Table 3: Master Feature Mapping for Dynamic Variables (First 24 Hours)
Feature Category | Canonical Feature Name | MIMIC-IV Source (Table, itemid) | eICU-CRD Source (Table, Column/labname pattern) | LOINC Code |
---|---|---|---|---|
Vital Signs | Heart Rate | icu.chartevents (220045) | vitalPeriodic (heartrate) | 8867-4 |
Systolic Blood Pressure | icu.chartevents (220179, 220050) | vitalPeriodic (systemicsystolic), vitalAperiodic (noninvasivesystolic) | 8480-6 | |
Diastolic Blood Pressure | icu.chartevents (220180, 220051) | vitalPeriodic (systemicdiastolic), vitalAperiodic (noninvasivediastolic) | 8462-4 | |
Mean Arterial Pressure | icu.chartevents (220052, 220181, 225312) | vitalPeriodic (systemicmean), vitalAperiodic (noninvasivemean) | 8478-0 | |
Respiratory Rate | icu.chartevents (220210, 224690) | vitalPeriodic (respiration) | 9279-1 | |
Temperature | icu.chartevents (223762, 223761) | vitalPeriodic (temperature) | 8310-5 | |
SpO2 | icu.chartevents (220277) | vitalPeriodic (sao2) | 2708-6 | |
Renal Labs | Creatinine | hosp.labevents (50912) | lab (labname like '%creatinine%') | 2160-0 (Serum/Plasma) |
Blood Urea Nitrogen (BUN) | hosp.labevents (51006) | lab (labname like '%BUN%') | 3094-0, 6299-2 92 | |
Anion Gap | hosp.labevents (50868) | lab (labname like '%anion gap%') | 33037-3, 10466-1 94 | |
Sodium | hosp.labevents (50983) | lab (labname like '%sodium%') | 2947-0 96 | |
Potassium | hosp.labevents (50971) | lab (labname like '%potassium%') | 2823-3 (Serum/Plasma), 2829-0 (Urine) 97 | |
Chloride | hosp.labevents (50902) | lab (labname like '%chloride%') | 2075-0 98 | |
Bicarbonate (HCO3) | hosp.labevents (50882) | lab (labname like '%bicarb%', '%Total CO2%', '%HCO3%') | 1960-4, 69964-5 100 | |
Hepatic Labs | Alanine Aminotransferase (ALT) | hosp.labevents (50861) | lab (labname like '%ALT (SGPT)%') | 1743-4, 1742-6 102 |
Aspartate Aminotransferase (AST) | hosp.labevents (50878) | lab (labname like '%AST (SGOT)%') | 30239-8, 1920-8 104 | |
Alkaline Phosphatase (Alk Phos) | hosp.labevents (50863) | lab (labname like '%alkaline phos%') | 6768-6, 77141-0 106 | |
Total Bilirubin | hosp.labevents (50885) | lab (labname like '%total bilirubin%') | 1975-2, 42719-5 36 | |
Albumin | hosp.labevents (50862) | lab (labname like '%albumin%') | 1751-7 108 | |
Hematology | White Blood Cell Count (WBC) | hosp.labevents (51301) | lab (labname like '%WBC x 1000%', '%wbc%') | 6690-2 110 |
Platelet Count | hosp.labevents (51265) | lab (labname like '%platelets x 1000%') | 777-3 35 | |
Hemoglobin | hosp.labevents (51222) | lab (labname like '%Hgb%', '%hemoglobin%') | 718-7 111 | |
Hematocrit | hosp.labevents (51221) | lab (labname like '%Hct%', '%hematocrit%') | 4544-3 112 | |
Coagulation | International Normalized Ratio (INR) | hosp.labevents (51237) | lab (labname like '%INR%') | 6301-6 113 |
Prothrombin Time (PT) | hosp.labevents (51275) | lab (labname like '%PT%') | 5902-2 114 | |
Partial Thromboplastin Time (PTT) | hosp.labevents (51274) | lab (labname like '%PTT%', '%APTT%') | 50391-2 115 | |
Inflammatory | C-Reactive Protein (CRP) | hosp.labevents (50889) | lab (labname like '%CRP%') | 30522-7, 71426-1 5 |
Lactate | hosp.labevents (50813) | lab (labname like '%lactate%') | 29246-6 117 | |
Metabolic | Glucose | hosp.labevents (50931, 50809) | lab (labname like '%glucose%'), nurseCharting (nursingchartcelltypevallabel='glucose') | 2339-0 118 |
Blood Gas | pH | hosp.labevents (50820) | lab (labname like '%pH%', '%arterial pH%') | 2744-1 119 |
PaO2 | hosp.labevents (50821) | lab (labname like '%paO2%', '%arterial PaO2%') | 78354-8 32 | |
PaCO2 | hosp.labevents (50818) | lab (labname like '%paCO2%', '%arterial PaCO2%') | 19996-8, 52484-3 33 |
Identifying key clinical interventions is crucial for building a robust predictive model, as these interventions signal the severity of the patient's condition and the intensity of care received. Binary flags (presence/absence within the first 24 hours) and, where applicable, summary measures are extracted for critical interventions in the SA-AKI context.
- Mechanical Ventilation: Mechanical ventilation is a life-sustaining intervention for respiratory failure, a common organ dysfunction in sepsis.1 Its use is a key indicator of illness severity.
- MIMIC-IV: A patient is considered to be on mechanical ventilation if they have a relevant record in the icu.procedureevents table. Itemids such as 225792 ('Invasive Ventilation'), 224385 ('Intubation'), or 224386 ('Tracheostomy') are strong indicators.1 The duration of ventilation can be calculated from
starttime and endtime. - eICU-CRD: The respiratorycare table is the primary source. The presence of a non-null value in the ventstartoffset column is a direct indicator.1 Additional evidence can be gathered from the
airwaytype column (e.g., 'Oral ETT', 'Tracheostomy'). The respiratoryCharting table provides detailed time-stamped ventilator settings.
- MIMIC-IV: A patient is considered to be on mechanical ventilation if they have a relevant record in the icu.procedureevents table. Itemids such as 225792 ('Invasive Ventilation'), 224385 ('Intubation'), or 224386 ('Tracheostomy') are strong indicators.1 The duration of ventilation can be calculated from
- Vasopressor Support: The use of vasopressor medications to maintain blood pressure is a defining characteristic of septic shock and a critical component of the cardiovascular SOFA score.1
- MIMIC-IV: Vasopressor use is identified from the icu.inputevents table. This requires mapping common vasopressor medications (e.g., Norepinephrine, Epinephrine, Dopamine, Vasopressin, Phenylephrine) to their corresponding itemids in the icu.d_items dictionary (e.g., Norepinephrine itemid 221906).1 A binary feature indicates if any of these medications were administered during the first 24 hours.
- eICU-CRD: Vasopressor use is reliably identified from the infusiondrug table, which documents continuous drug infusions. This involves querying for rows where the drugname column contains keywords for vasopressors (e.g., 'Norepinephrine', 'Epinephrine', 'Dopamine', 'Vasopressin').1
- Renal Replacement Therapy (RRT): RRT, including hemodialysis and continuous renal replacement therapy (CRRT), is the primary intervention for severe AKI.1 Its initiation is a component of the KDIGO Stage 3 definition and a critical feature.
- MIMIC-IV: Identifying RRT requires querying multiple tables. icu.procedureevents may contain itemids for dialysis procedures (e.g., 225802 for 'Dialysis - CRRT', 225803 for 'Dialysis - CVVHD').1
icu.chartevents contains numerous itemids related to dialysis machine settings, and icu.inputevents will contain records for dialysate and replacement fluids. - eICU-CRD: RRT is most effectively identified from the treatment table, searching the treatmentstring column for hierarchical paths containing keywords like 'renal | dialysis', 'renal | CRRT', or 'renal | CVVH'.1 The presence of a dialysis flag in the
apacheApsVar table can also serve as a confirmatory data point for the first 24 hours.
- MIMIC-IV: Identifying RRT requires querying multiple tables. icu.procedureevents may contain itemids for dialysis procedures (e.g., 225802 for 'Dialysis - CRRT', 225803 for 'Dialysis - CVVHD').1
- Fluid Balance: Fluid balance—the net difference between fluid intake and output—is a critical therapeutic target and a powerful prognostic indicator in sepsis and AKI.1 The methodology for calculating fluid balance differs significantly between the two databases. A naive approach can lead to errors, particularly in eICU-CRD, where
intakeOutput columns (intaketotal, outputtotal, nettotal) represent cumulative totals that are often duplicated.1 In contrast, MIMIC-IV
inputevents and outputevents contain discrete, non-cumulative event volumes.1- MIMIC-IV: To calculate the 24-hour fluid balance, the amount column from icu.inputevents within the 24-hour window is summed, and the sum of the value column from icu.outputevents in the same window is subtracted.1
- eICU-CRD: To correctly calculate the 24-hour fluid balance, distinct values of the nettotal column at each unique intakeoutputoffset are selected. The 24-hour balance is then the difference between the maximum and minimum of these unique cumulative values within the desired 24-hour window.1
The preceding sections have detailed the logic for defining the SA-AKI cohort and engineering a rich set of common features from both MIMIC-IV and eICU-CRD. This final section provides the blueprint for assembling these components into the final staging tables.
The ultimate goal is to create two flat, analysis-ready tables—one for MIMIC-IV and one for eICU-CRD. In each table, every row will represent a unique ICU stay (stay_id for MIMIC-IV, patientunitstayid for eICU-CRD) from the SA-AKI cohort. Each column will represent one of the engineered features, along with the final outcome variable. The conceptual process for constructing this matrix involves a series of joins:
- Start with the Cohort: Begin with the final list of unique ICU stay identifiers that meet all inclusion and exclusion criteria as defined in Section III. This forms the backbone of the staging table.
- Join Static Features: Left-join the static and baseline features from Section IV.A (Demographics, Admission Information, and the 17 binary Comorbidity flags) to the cohort table using the appropriate stay identifier.
- Join Summarized Dynamic Features: Left-join the summary statistics (min, max, mean, std dev) for each vital sign and laboratory biomarker from the first 24 hours, as defined in Section IV.B.
- Join Intervention Features: Left-join the binary flags for the use of Mechanical Ventilation, Vasopressors, and RRT within the first 24 hours, as well as the calculated 24-hour Fluid Balance, as defined in Section IV.C.
- Join the Outcome Variable: Finally, join the target variable for the prediction task. For in-hospital mortality, this is the hospital_expire_flag from the mimic.admissions table in MIMIC-IV, or the hospitalDischargeStatus ('Expired' vs. 'Alive') from the eicu.patient table in eICU-CRD.1
The resulting tables will be wide-format data frames, perfectly suited for input into standard machine learning algorithms.
While the core request was to create the staging tables, a truly robust dataset anticipates the needs of the modeling phase. The heterogeneity of the SA-AKI syndrome, a concept known as the "definitional paradox," presents an opportunity for more nuanced feature engineering.1 The SA-AKI label encompasses a wide spectrum of severity, from a patient with a transient dip in urine output to one with complete anuria requiring dialysis.1 A model that can distinguish between these sub-phenotypes will likely be more accurate and interpretable. Therefore, it is highly recommended to create the following additional features to capture this clinical heterogeneity 1:
- aki_trigger: A categorical feature indicating what triggered the initial KDIGO diagnosis: 'Creatinine', 'Urine Output', or 'Both'. This captures the nature of the initial renal insult.
- persistent_aki_48hr_flag: A binary flag indicating if the patient's AKI is persistent. This is set to 1 if the patient still meets any KDIGO criteria 48 hours after the initial aki_onset_time, and 0 otherwise. Persistent AKI is a known marker for more severe injury and higher mortality risk.
- kdigo_stage_max_48hr: A numerical feature (1, 2, or 3) representing the maximum KDIGO stage the patient reached within the first 48 hours of AKI onset. This provides a graded measure of initial AKI severity.
Final Guidance for Subsequent Analysis:
- Handling Missing Values: As requested, this protocol does not impute missing values. The resulting feature matrix will exhibit missingness, an inherent property of real-world EHR data. For the subsequent modeling phase, robust methods such as multiple imputation by chained equations (MICE) or the use of tree-based models (e.g., LightGBM, XGBoost) that can intrinsically handle missing data are recommended.1
- Dimensionality Reduction: The feature set created by this protocol is intentionally large and comprehensive to provide maximum flexibility. It contains many variables that will be correlated (e.g., systolic, diastolic, and mean arterial pressure). This is by design. The user can now apply their preferred dimensionality reduction or feature selection techniques—such as Principal Component Analysis (PCA), recursive feature elimination, or LASSO regularization—to select the most informative and independent subset of features for their final model.1
By following this comprehensive protocol, a researcher can construct two parallel, high-quality, and clinically nuanced datasets from MIMIC-IV and eICU-CRD, providing a solid foundation for developing and validating a powerful mortality prediction model for the challenging Sepsis-Associated Acute Kidney Injury cohort.
The following SQL DDL statements define the schema for the SA-AKI patient cohort staging tables in both MIMIC-IV and eICU-CRD. These tables are designed to store the extracted and harmonized features as described in the report, suitable for mortality prediction.
SQL
CREATE TABLE IF NOT EXISTS mimic.sa_aki_cohort_staging_mimic (
-- Patient Identifiers
subject_id BIGINT NOT NULL,
hadm_id BIGINT NOT NULL,
stay_id BIGINT NOT NULL PRIMARY KEY,
\-- Cohort Definition Timestamps
sepsis\_onset\_time TIMESTAMP,
aki\_onset\_time TIMESTAMP,
\-- Outcome Variable
hospital\_expire\_flag BIGINT, \-- 0 \= Alive, 1 \= Expired
\-- Static and Baseline Features
age INTEGER,
gender VARCHAR(1),
ethnicity VARCHAR(80),
admission\_height NUMERIC, \-- cm
admission\_weight NUMERIC, \-- kg
admission\_type VARCHAR(50),
admission\_location VARCHAR(50),
first\_careunit VARCHAR(255),
baseline\_scr\_estimated\_flag BOOLEAN, \-- TRUE if baseline SCr was estimated via MDRD
\-- Comorbidity Flags (Charlson Comorbidity Index)
comorb\_mi BOOLEAN,
comorb\_chf BOOLEAN,
comorb\_pvd BOOLEAN,
comorb\_cvd BOOLEAN,
comorb\_dementia BOOLEAN,
comorb\_cpd BOOLEAN,
comorb\_ctd BOOLEAN,
comorb\_pud BOOLEAN,
comorb\_mild\_liver BOOLEAN,
comorb\_dm\_uncomp BOOLEAN,
comorb\_dm\_comp BOOLEAN,
comorb\_hemiplegia BOOLEAN,
comorb\_renal BOOLEAN,
comorb\_malignancy BOOLEAN,
comorb\_mod\_sev\_liver BOOLEAN,
comorb\_metastatic BOOLEAN,
comorb\_aids BOOLEAN,
\-- Dynamic Physiological and Laboratory Features (First 24 Hours)
\-- Vitals
heartrate\_min NUMERIC,
heartrate\_max NUMERIC,
heartrate\_mean NUMERIC,
heartrate\_std NUMERIC,
sysbp\_min NUMERIC,
sysbp\_max NUMERIC,
sysbp\_mean NUMERIC,
sysbp\_std NUMERIC,
diabp\_min NUMERIC,
diabp\_max NUMERIC,
diabp\_mean NUMERIC,
diabp\_std NUMERIC,
meanbp\_min NUMERIC,
meanbp\_max NUMERIC,
meanbp\_mean NUMERIC,
meanbp\_std NUMERIC,
resprate\_min NUMERIC,
resprate\_max NUMERIC,
resprate\_mean NUMERIC,
resprate\_std NUMERIC,
temperature\_min NUMERIC,
temperature\_max NUMERIC,
temperature\_mean NUMERIC,
temperature\_std NUMERIC,
spo2\_min NUMERIC,
spo2\_max NUMERIC,
spo2\_mean NUMERIC,
spo2\_std NUMERIC,
\-- Renal Labs
creatinine\_min NUMERIC,
creatinine\_max NUMERIC,
creatinine\_mean NUMERIC,
creatinine\_std NUMERIC,
bun\_min NUMERIC,
bun\_max NUMERIC,
bun\_mean NUMERIC,
bun\_std NUMERIC,
aniongap\_min NUMERIC,
aniongap\_max NUMERIC,
aniongap\_mean NUMERIC,
aniongap\_std NUMERIC,
sodium\_min NUMERIC,
sodium\_max NUMERIC,
sodium\_mean NUMERIC,
sodium\_std NUMERIC,
potassium\_min NUMERIC,
potassium\_max NUMERIC,
potassium\_mean NUMERIC,
potassium\_std NUMERIC,
chloride\_min NUMERIC,
chloride\_max NUMERIC,
chloride\_mean NUMERIC,
chloride\_std NUMERIC,
bicarbonate\_min NUMERIC,
bicarbonate\_max NUMERIC,
bicarbonate\_mean NUMERIC,
bicarbonate\_std NUMERIC,
\-- Hepatic Labs
alt\_min NUMERIC,
alt\_max NUMERIC,
alt\_mean NUMERIC,
alt\_std NUMERIC,
ast\_min NUMERIC,
ast\_max NUMERIC,
ast\_mean NUMERIC,
ast\_std NUMERIC,
alkphos\_min NUMERIC,
alkphos\_max NUMERIC,
alkphos\_mean NUMERIC,
alkphos\_std NUMERIC,
totalbilirubin\_min NUMERIC,
totalbilirubin\_max NUMERIC,
totalbilirubin\_mean NUMERIC,
totalbilirubin\_std NUMERIC,
albumin\_min NUMERIC,
albumin\_max NUMERIC,
albumin\_mean NUMERIC,
albumin\_std NUMERIC,
\-- Hematology
wbc\_min NUMERIC,
wbc\_max NUMERIC,
wbc\_mean NUMERIC,
wbc\_std NUMERIC,
plateletcount\_min NUMERIC,
plateletcount\_max NUMERIC,
plateletcount\_mean NUMERIC,
plateletcount\_std NUMERIC,
hemoglobin\_min NUMERIC,
hemoglobin\_max NUMERIC,
hemoglobin\_mean NUMERIC,
hemoglobin\_std NUMERIC,
hematocrit\_min NUMERIC,
hematocrit\_max NUMERIC,
hematocrit\_mean NUMERIC,
hematocrit\_std NUMERIC,
\-- Coagulation
inr\_min NUMERIC,
inr\_max NUMERIC,
inr\_mean NUMERIC,
inr\_std NUMERIC,
pt\_min NUMERIC,
pt\_max NUMERIC,
pt\_mean NUMERIC,
pt\_std NUMERIC,
ptt\_min NUMERIC,
ptt\_max NUMERIC,
ptt\_mean NUMERIC,
ptt\_std NUMERIC,
\-- Inflammatory / Metabolic
crp\_min NUMERIC,
crp\_max NUMERIC,
crp\_mean NUMERIC,
crp\_std NUMERIC,
lactate\_min NUMERIC,
lactate\_max NUMERIC,
lactate\_mean NUMERIC,
lactate\_std NUMERIC,
glucose\_min NUMERIC,
glucose\_max NUMERIC,
glucose\_mean NUMERIC,
glucose\_std NUMERIC,
ph\_min NUMERIC,
ph\_max NUMERIC,
ph\_mean NUMERIC,
ph\_std NUMERIC,
pao2\_min NUMERIC,
pao2\_max NUMERIC,
pao2\_mean NUMERIC,
pao2\_std NUMERIC,
paco2\_min NUMERIC,
paco2\_max NUMERIC,
paco2\_mean NUMERIC,
paco2\_std NUMERIC,
\-- Intervention Features (First 24 Hours)
mechanical\_ventilation\_24hr\_flag BOOLEAN,
vasopressor\_24hr\_flag BOOLEAN,
rrt\_24hr\_flag BOOLEAN,
fluid\_balance\_24hr\_ml NUMERIC,
\-- Additional SA-AKI Specific Features
aki\_trigger VARCHAR(20), \-- 'Creatinine', 'Urine Output', 'Both'
persistent\_aki\_48hr\_flag BOOLEAN,
kdigo\_stage\_max\_48hr INTEGER
);
SQL
CREATE TABLE IF NOT EXISTS eicu.sa_aki_cohort_staging_eicu (
-- Patient Identifiers
uniquepid VARCHAR(10) NOT NULL,
patienthealthsystemstayid INTEGER,
patientunitstayid INTEGER NOT NULL PRIMARY KEY,
\-- Cohort Definition Timestamps (Offsets in minutes from ICU admit)
sepsis\_onset\_offset INTEGER,
aki\_onset\_offset INTEGER,
\-- Outcome Variable
hospitaldischargestatus VARCHAR(10), \-- 'Alive' or 'Expired'
\-- Static and Baseline Features
age VARCHAR(10), \-- Retain as VARCHAR to handle '\> 89' then convert/handle in analysis
gender VARCHAR(25),
ethnicity VARCHAR(50),
admission\_height NUMERIC(10,2), \-- cm
admission\_weight NUMERIC(10,2), \-- kg
unitadmitsource VARCHAR(100),
numbedscategory VARCHAR(32),
teachingstatus BOOLEAN,
region VARCHAR(64),
baseline\_scr\_estimated\_flag BOOLEAN, \-- TRUE if baseline SCr was estimated via MDRD
\-- Comorbidity Flags (Charlson Comorbidity Index)
comorb\_mi BOOLEAN,
comorb\_chf BOOLEAN,
comorb\_pvd BOOLEAN,
comorb\_cvd BOOLEAN,
comorb\_dementia BOOLEAN,
comorb\_cpd BOOLEAN,
comorb\_ctd BOOLEAN,
comorb\_pud BOOLEAN,
comorb\_mild\_liver BOOLEAN,
comorb\_dm\_uncomp BOOLEAN,
comorb\_dm\_comp BOOLEAN,
comorb\_hemiplegia BOOLEAN,
comorb\_renal BOOLEAN,
comorb\_malignancy BOOLEAN,
comorb\_mod\_sev\_liver BOOLEAN,
comorb\_metastatic BOOLEAN,
comorb\_aids BOOLEAN,
\-- Dynamic Physiological and Laboratory Features (First 24 Hours)
\-- Vitals
heartrate\_min NUMERIC,
heartrate\_max NUMERIC,
heartrate\_mean NUMERIC,
heartrate\_std NUMERIC,
sysbp\_min NUMERIC,
sysbp\_max NUMERIC,
sysbp\_mean NUMERIC,
sysbp\_std NUMERIC,
diabp\_min NUMERIC,
diabp\_max NUMERIC,
diabp\_mean NUMERIC,
diabp\_std NUMERIC,
meanbp\_min NUMERIC,
meanbp\_max NUMERIC,
meanbp\_mean NUMERIC,
meanbp\_std NUMERIC,
resprate\_min NUMERIC,
resprate\_max NUMERIC,
resprate\_mean NUMERIC,
resprate\_std NUMERIC,
temperature\_min NUMERIC,
temperature\_max NUMERIC,
temperature\_mean NUMERIC,
temperature\_std NUMERIC,
spo2\_min NUMERIC,
spo2\_max NUMERIC,
spo2\_mean NUMERIC,
spo2\_std NUMERIC,
\-- Renal Labs
creatinine\_min NUMERIC,
creatinine\_max NUMERIC,
creatinine\_mean NUMERIC,
creatinine\_std NUMERIC,
bun\_min NUMERIC,
bun\_max NUMERIC,
bun\_mean NUMERIC,
bun\_std NUMERIC,
aniongap\_min NUMERIC,
aniongap\_max NUMERIC,
aniongap\_mean NUMERIC,
aniongap\_std NUMERIC,
sodium\_min NUMERIC,
sodium\_max NUMERIC,
sodium\_mean NUMERIC,
sodium\_std NUMERIC,
potassium\_min NUMERIC,
potassium\_max NUMERIC,
potassium\_mean NUMERIC,
potassium\_std NUMERIC,
chloride\_min NUMERIC,
chloride\_max NUMERIC,
chloride\_mean NUMERIC,
chloride\_std NUMERIC,
bicarbonate\_min NUMERIC,
bicarbonate\_max NUMERIC,
bicarbonate\_mean NUMERIC,
bicarbonate\_std NUMERIC,
\-- Hepatic Labs
alt\_min NUMERIC,
alt\_max NUMERIC,
alt\_mean NUMERIC,
alt\_std NUMERIC,
ast\_min NUMERIC,
ast\_max NUMERIC,
ast\_mean NUMERIC,
ast\_std NUMERIC,
alkphos\_min NUMERIC,
alkphos\_max NUMERIC,
alkphos\_mean NUMERIC,
alkphos\_std NUMERIC,
totalbilirubin\_min NUMERIC,
totalbilirubin\_max NUMERIC,
totalbilirubin\_mean NUMERIC,
totalbilirubin\_std NUMERIC,
albumin\_min NUMERIC,
albumin\_max NUMERIC,
albumin\_mean NUMERIC,
albumin\_std NUMERIC,
\-- Hematology
wbc\_min NUMERIC,
wbc\_max NUMERIC,
wbc\_mean NUMERIC,
wbc\_std NUMERIC,
plateletcount\_min NUMERIC,
plateletcount\_max NUMERIC,
plateletcount\_mean NUMERIC,
plateletcount\_std NUMERIC,
hemoglobin\_min NUMERIC,
hemoglobin\_max NUMERIC,
hemoglobin\_mean NUMERIC,
hemoglobin\_std NUMERIC,
hematocrit\_min NUMERIC,
hematocrit\_max NUMERIC,
hematocrit\_mean NUMERIC,
hematocrit\_std NUMERIC,
\-- Coagulation
inr\_min NUMERIC,
inr\_max NUMERIC,
inr\_mean NUMERIC,
inr\_std NUMERIC,
pt\_min NUMERIC,
pt\_max NUMERIC,
pt\_mean NUMERIC,
pt\_std NUMERIC,
ptt\_min NUMERIC,
ptt\_max NUMERIC,
ptt\_mean NUMERIC,
ptt\_std NUMERIC,
\-- Inflammatory / Metabolic
crp\_min NUMERIC,
crp\_max NUMERIC,
crp\_mean NUMERIC,
crp\_std NUMERIC,
lactate\_min NUMERIC,
lactate\_max NUMERIC,
lactate\_mean NUMERIC,
lactate\_std NUMERIC,
glucose\_min NUMERIC,
glucose\_max NUMERIC,
glucose\_mean NUMERIC,
glucose\_std NUMERIC,
ph\_min NUMERIC,
ph\_max NUMERIC,
ph\_mean NUMERIC,
ph\_std NUMERIC,
pao2\_min NUMERIC,
pao2\_max NUMERIC,
pao2\_mean NUMERIC,
pao2\_std NUMERIC,
paco2\_min NUMERIC,
paco2\_max NUMERIC,
paco2\_mean NUMERIC,
paco2\_std NUMERIC,
\-- Intervention Features (First 24 Hours)
mechanical\_ventilation\_24hr\_flag BOOLEAN,
vasopressor\_24hr\_flag BOOLEAN,
rrt\_24hr\_flag BOOLEAN,
fluid\_balance\_24hr\_ml NUMERIC,
\-- Additional SA-AKI Specific Features
aki\_trigger VARCHAR(20), \-- 'Creatinine', 'Urine Output', 'Both'
persistent\_aki\_48hr\_flag BOOLEAN,
kdigo\_stage\_max\_48hr INTEGER
);
```## A Comprehensive Framework for Sepsis-Associated Acute Kidney Injury (SA-AKI) Cohort Extraction and Staging from MIMIC-IV and eICU-CRD Databases
### I. Executive Summary
Sepsis-Associated Acute Kidney Injury (SA-AKI) represents a formidable clinical challenge within intensive care, profoundly increasing patient mortality and long-term morbidity.[1] The accurate and reproducible identification of this specific patient cohort is a foundational step for developing effective predictive models and advancing clinical research in critical care. This report delineates a rigorous, ontology-driven methodology for defining SA-AKI patient cohorts within two prominent critical care databases: MIMIC-IV v3.1 and eICU-CRD. The framework integrates internationally recognized clinical criteria for sepsis and acute kidney injury with advanced data mapping techniques, ensuring high data fidelity and reproducibility across these heterogeneous electronic health record (EHR) systems. The ultimate output includes the SQL Data Definition Language (DDL) for a comprehensive staging table, designed to support binary mortality prediction, encompassing a rich set of static, dynamic, and intervention-related features.
### II. Introduction to the Data Ecosystems
Effective clinical research leveraging large-scale EHR databases necessitates a profound understanding of their unique structures, data provenance, and inherent complexities. This section provides an overview of the MIMIC-IV and eICU-CRD databases, highlighting their architectural principles, temporal data representations, and critical considerations for researchers.
#### MIMIC-IV v3.1: A Single-Center, High-Resolution Database
MIMIC-IV v3.1 is a publicly available, de-identified clinical database sourced from the Beth Israel Deaconess Medical Center (BIDMC) in Boston, MA, spanning patient admissions from 2008 to 2022.[1] This extensive dataset comprises records for over 364,000 unique patients, corresponding to more than 546,000 hospitalizations and nearly 95,000 unique Intensive Care Unit (ICU) stays.[1] The database is organized into a modular structure, reflecting its data provenance. The `hosp` module contains hospital-wide EHR data, including administrative details, billing codes, laboratory results, and pharmacy records that cover a patient's entire hospitalization. In contrast, the `icu` module provides high-resolution data from the in-ICU clinical information system (MetaVision), offering detailed, often minute-by-minute, physiological measurements, ventilator settings, and fluid intake/output records specific to an ICU stay.[1]
A crucial aspect of MIMIC-IV's design philosophy is its commitment to providing data "as is," meaning "data cleaning steps were not performed, to ensure the data reflects a real-world clinical dataset".[1] This deliberate choice aims to preserve the ecological validity of the data, retaining the artifacts, inconsistencies, and potential errors inherent in routine clinical practice. For a task demanding "perfect calculations" and a "robust plan," this implies that SQL queries must be designed with resilience. The expectation of "perfect calculations" in the user query, when juxtaposed with MIMIC-IV's "data as is" philosophy, necessitates that any robust SQL implementation must explicitly handle potential data quality issues. This includes anticipating non-numeric values in numeric columns, managing missing data points, and accounting for unexpected string formats. This approach ensures the integrity of derived features and the accuracy of the cohort definition despite the raw data's inherent variability, requiring careful use of functions like `TRY_CAST` and `COALESCE` and robust filtering conditions within the SQL queries.
To comply with HIPAA Safe Harbor provisions, MIMIC-IV employs a thorough de-identification process. All dates and times for a given patient are shifted by a random offset into the future, preserving the internal consistency of a single patient's timeline (e.g., the duration between any two events for a patient remains accurate) while making direct temporal comparisons between different patients impossible.[1] The `patients` table provides `anchor_age`, `anchor_year`, and `anchor_year_group` to anchor a patient's timeline.[1] Notably, the ages of all patients over 89 are obscured, with their `anchor_age` recorded as 91. This means an `anchor_age` of 91 represents the entire group of patients aged 90 and above.[1] For mortality prediction, especially in SA-AKI where advanced age is a significant risk factor, this age capping reduces granularity for the oldest demographic. Age is a well-established strong predictor of mortality, and collapsing all elderly patients into a single '91' category results in a loss of potentially valuable predictive power within that specific age group. The SQL queries and subsequent feature engineering must acknowledge this limitation by treating the value '91' as a distinct categorical representation (e.g., "90 years or older") rather than a precise continuous age, to avoid misinterpretation and ensure appropriate model handling. Free-text fields within the database are de-identified by replacing Protected Health Information (PHI) with three underscores ("___").[1]
The database is interconnected through a hierarchical system of three primary identifiers: `subject_id`, `hadm_id`, and `stay_id`.[1] The `subject_id` is a unique integer assigned to each individual patient, linking all data pertaining to a single person across their hospitalizations and care episodes. The `hadm_id` uniquely identifies each hospital admission, linking all events, diagnoses, and procedures associated with a single inpatient stay. A single `subject_id` can have multiple `hadm_id`s. The `stay_id` uniquely identifies a single, continuous stay within a specific care unit, primarily an ICU, linking high-frequency, granular data from the `icu` module tables (e.g., `chartevents`, `inputevents`) to a specific ICU admission. A single `hadm_id` can have multiple `stay_id`s if a patient was transferred between ICUs or readmitted to the ICU during the same hospitalization.[1]
#### eICU-CRD: A Multi-Center, Granular Database
The eICU Collaborative Research Database (eICU-CRD) v2.0 is a large, de-identified, multi-center critical care database containing granular data from over 200,000 admissions to ICUs across the United States.[1] Data were collected between 2014 and 2015 from 208 hospitals and 335 distinct ICU wards participating in the Philips eICU tele-ICU program.[1] This origin provides a rich, high-granularity view of real-world clinical practice, encompassing vital sign measurements, laboratory results, diagnoses, treatments, and medication documentation.[1]
The multi-center nature of eICU-CRD, while enhancing the external validity and generalizability of research findings by capturing a broader range of patient populations and clinical practices, also presents significant analytical challenges.[1] The presence, frequency, and even the naming conventions for certain data points—such as specific laboratory tests in the `lab` table or charted items in the `nursecharting` table—can vary substantially across different hospitals (identified by `hospitalid`) and ICU wards (`wardid`).[1] This inherent heterogeneity means that direct string matching for `labname` or `drugname` is often insufficient for comprehensive data extraction. The eICU documentation explicitly states that naming conventions for data points can vary substantially across hospitals. This necessitates an ontology-driven approach, as direct string matching would lead to incomplete data capture. Therefore, an ontology-driven approach is not merely a suggestion but a necessity for eICU data to overcome this variability. Furthermore, this variability suggests that hospital-level metadata (e.g., `numbedscategory`, `teachingstatus`, `region` from `eicu.hospital`) could serve as important covariates or predictive features in the final model, accounting for systemic differences in care or patient populations.[1]
The eICU-CRD database is organized around a hierarchy of unique identifiers: `uniquepid`, `patienthealthsystemstayid`, and `patientunitstayid`.[1] `patientunitstayid` serves as the primary identifier for linking most clinical event data, uniquely identifying a single ICU stay.[1] While most tables link to the central `patient` table via `patientunitstayid`, complex clinical relationships (e.g., linking a specific microbiology culture result to a targeted antibiotic administration) are not enforced by explicit foreign key constraints. Researchers must infer these relationships by joining on `patientunitstayid` and carefully aligning temporal data using offset columns.[1] The temporal data in eICU-CRD is represented using offsets in minutes, calculated relative to the moment of a patient's ICU admission (offset 0).[1] Events occurring before ICU admission have negative offsets, while events during the ICU stay have positive offsets. The `unitdischargeoffset` represents the patient's length of stay in the ICU in minutes.[1] This offset system simplifies intra-ICU duration calculations but introduces complexity when analyzing events across multiple ICU stays for the same patient, as each `patientunitstayid` has its own "clock." The eICU documentation explicitly states that there is no single, master timeline for a patient's complete hospital journey, requiring researchers to piece together chronologies using various admission and discharge offsets from the `patient` table. This is a critical consideration for any feature that needs to span beyond a single ICU stay or requires a full hospital-level timeline, necessitating complex temporal joins and calculations.
Critical considerations for researchers using eICU-CRD include the impact of data interfaces and systematic missingness. The availability of any given data type is contingent on whether the corresponding care unit had the necessary "data interface" enabled.[1] This means that missing data should not be naively interpreted as the absence of an event. For instance, if an ICU lacked an interface for its infusion pumps, the `infusiondrug` table would contain no records for patients from that unit, even if those patients received continuous infusions. This principle, explicitly stated in the eICU documentation, means that a simple `LEFT JOIN` and assuming `NULL` implies absence is flawed. This systematic missingness can introduce significant bias if not addressed. Therefore, robust methodology requires identifying hospitals or units that consistently provide the data of interest and restricting analysis to this reliable subset or employing advanced statistical methods to handle non-random missingness.[1] The SQL for cohort definition should include a mechanism to filter by `hospitalid` if such systematic limitations are identified.
Another crucial distinction in eICU-CRD is between ordered and administered treatments. The `medication` table primarily reflects medication orders, which do not guarantee administration. In contrast, the `infusiondrug` and `intakeoutput` tables contain data charted by nurses at the bedside, reflecting the actual administration of continuous infusions and other fluids.[1] The eICU documentation explicitly highlights this distinction. For a critical intervention like vasopressors, relying on the `medication` table (orders) would introduce noise and misrepresent actual treatment. The `infusiondrug` table is the gold standard for continuous infusions. For studies involving continuous infusions, `infusiondrug` and `intakeoutput` are more reliable sources for actual administration data.[1]
#### Ontologies: The Foundation for Standardized Extraction
The programmatic extraction of clinical concepts, rather than relying on hardcoded values, is a critical requirement for ensuring medical accuracy and generalizability across heterogeneous datasets. This is achieved by leveraging a suite of ontologies:
* **RxNorm:** This terminology provides normalized names for generic and branded drugs, linking to various drug vocabularies.[2, 3] It is essential for identifying all forms of antibiotic and vasopressor medications.
* **LOINC (Logical Observation Identifiers Names and Codes):** LOINC standardizes laboratory test names and clinical observations.[1, 4, 5] It is crucial for consistently identifying lab results like creatinine, bilirubin, and platelet count across different hospital systems.
* **ATC (Anatomical Therapeutic Chemical Classification System):** ATC classifies drugs based on their anatomical main group, therapeutic main group, pharmacological subgroup, chemical subgroup, and chemical substance.[1, 6, 7] This hierarchical classification provides a broad and comprehensive method for identifying drug classes, such as all systemic antibacterials (J01) or specific vasopressors (e.g., C01CA).[6, 8]
* **HPO (Human Phenotype Ontology):** HPO describes phenotypic abnormalities associated with human diseases.[1] While less directly used for primary cohort definition in this context, it can be valuable for identifying subtle phenotypic expressions or for future feature enrichment related to disease severity and genetic predispositions.
* **UMLS (Unified Medical Language System):** UMLS serves as a comprehensive metathesaurus, integrating and linking numerous biomedical vocabularies, including RxNorm and SNOMED_CT, through Concept Unique Identifiers (CUIs) and semantic types.[1, 9, 10, 11] This central hub role is vital for achieving comprehensive mapping. The user's explicit instruction to use "all ontologies" underscores the importance of UMLS's role as a Metathesaurus. By mapping local EHR terms (e.g., `drugname`, `labname`) to RxNorm or LOINC, and then using UMLS to bridge to ATC and SNOMED_CT via CUIs and relationships, a highly comprehensive and accurate list of relevant clinical concepts (e.g., all antibiotics, all vasopressors, all creatinine tests) can be generated. This programmatic approach is superior to hardcoding specific values and enhances the generalizability and reproducibility of the cohort definition.
* **SNOMED_CT (Systematized Nomenclature of Medicine—Clinical Terms):** SNOMED_CT is a comprehensive, multilingual clinical healthcare terminology that provides concepts, descriptions, and relationships, often mapped to other international standards like ICD.[1, 12, 13] It is fundamental for identifying clinical conditions, diagnoses (e.g., AKI, ESRD, infection), and procedures with high specificity and consistency.
The interconnectedness of these ontologies, particularly through UMLS's role as a Metathesaurus, allows for a multi-faceted mapping strategy. For example, a drug name in MIMIC or eICU can be mapped to an RxNorm concept, which then links to an ATC code and a UMLS CUI, which in turn links to SNOMED_CT concepts. This ensures that no relevant drug or condition is missed due to variations in local terminology or data entry practices.
### III. Defining the Sepsis-Associated Acute Kidney Injury (SA-AKI) Cohort
The precise and reproducible definition of the SA-AKI cohort is the cornerstone of any subsequent predictive modeling. This involves the meticulous application of the Sepsis-3 criteria for sepsis and the Kidney Disease: Improving Global Outcomes (KDIGO) criteria for Acute Kidney Injury (AKI), coupled with strict temporal linkage and exclusion criteria.
#### A. Implementing the Sepsis-3 Definition
The Third International Consensus Definitions for Sepsis (Sepsis-3), published in 2016, redefined sepsis as "life-threatening organ dysfunction caused by a dysregulated host response to infection".[1, 1] This definition is operationalized by identifying patients with a suspected infection who concurrently exhibit an acute increase of 2 or more points in the Sequential Organ Failure Assessment (SOFA) score. The baseline SOFA score is assumed to be zero for patients without known pre-existing organ dysfunction.[1, 1]
#### 1. Identifying a Suspected Infection: The Temporal Link of Antibiotics and Cultures
The anchor point for the Sepsis-3 definition is the presence of an infection. In retrospective database research, a robust and widely accepted proxy is the temporal association of antimicrobial therapy and the collection of microbiological cultures.[1] This combination signifies that the clinical team suspected an infection and initiated treatment based on that suspicion. A suspected infection is defined for a given hospital admission (MIMIC `hadm_id`) or ICU stay (eICU `patientunitstayid`) if an antibiotic was administered within a 72-hour window *before* or a 24-hour window *after* a microbiological culture was sampled. The `infection_onset_time` is designated as the earlier of these two events' timestamps or offsets.[1]
**Antibiotic Administration:**
For MIMIC-IV, antibiotic orders are identified from the `hosp.prescriptions` table or, for more granular administration data, the `hosp.emar` and `hosp.emar_detail` tables.[1] For eICU-CRD, antibiotic orders are identified from the `medication` table using the `drugstartoffset`.[1] To avoid hardcoding specific drug names, a comprehensive list of antibiotics is programmatically identified using multiple ontologies:
* **RxNorm:** Concepts with `tty` values such as 'IN' (Ingredient), 'PIN' (Precise Ingredient), 'SCD' (Semantic Clinical Drug), or 'SBD' (Semantic Branded Drug) [1, 14] are queried from `rxnorm.rxnconso` where the string representation (`str`) contains common antibiotic keywords (e.g., 'penicillin', 'cephalexin', 'vancomycin', 'ciprofloxacin', 'levofloxacin').[15, 16] Additionally, `rxnorm.rxnsty` is used to find `rxcui`s classified with the semantic type 'Antibiotic'.[1, 17] Specific examples include Ampicillin (RxCUI 308212, 313800) [18], Vancomycin (RxCUI 2118449) [19], and Meropenem (RxCUI 29561).[20]
* **ATC:** `atc.who_atc_ddd` is queried for `atc_code`s starting with 'J01', which represents "ANTIBACTERIALS FOR SYSTEMIC USE".[1, 6, 7, 21] This provides a broad, hierarchical classification of antibiotics.
* **UMLS:** `umls.mrsty` is queried for `cui`s with the semantic type 'Antibiotic' or 'Clinical Drug'.[1, 9, 11, 22] Additionally, `umls.mrconso` is searched for strings containing '%antibiotic%'.[1] The UMLS Concept Unique Identifier (CUI) C0009450 represents "Infectious disease" [23], which can be used to link to related treatments and conditions.[24, 25]
* **SNOMED_CT:** `snomed_ct.snomed_description` is queried for terms containing '%antibiotic%'.[1] `snomed_ct.snomed_relationship` is used to find concepts related to antibiotics via the 'Is a' relationship (`type_id = 116680003`).[1, 12, 13, 26]
For MIMIC-IV, specific `icu.inputevents` `d_items.itemid`s such as 225798 for 'Vancomycin' are also included.[27] For eICU-CRD, the `medication.drugname` field is free-text and hospital-specific.[1, 28, 29, 30] This necessitates robust pattern matching using `LIKE` statements for common antibiotic names (e.g., `LIKE '%penicillin%'`, `LIKE '%vancomycin%'`, `LIKE '%cefepime%'`) [16] in conjunction with the ontology-derived lists.
**Culture Collection:**
In MIMIC-IV, microbiological culture events (e.g., blood, urine, sputum) are identified from the `hosp.microbiologyevents` table using `chartdate` or `charttime`.[1] In eICU-CRD, culture events are identified from the `microlab` table using `culturetakenoffset`.[1]
#### 2. Calculating the Sequential Organ Failure Assessment (SOFA) Score
The SOFA score is a quantitative measure of organ dysfunction across six critical organ systems.[1, 1] An acute increase of 2 or more points from baseline is the cornerstone of the Sepsis-3 definition.[1, 1] For each 24-hour period of a patient's ICU stay, the worst value for each component is used to calculate the score.[1]
* **Respiration (PaO2/FiO2 Ratio):**
* **MIMIC-IV:** Arterial partial pressure of oxygen (PaO2) is extracted from `hosp.labevents` (itemid 50821) [1, 31], and fraction of inspired oxygen (FiO2) from `icu.chartevents` (itemid 223835).[1, 31] These values must be temporally aligned using their respective `charttime` columns. LOINC codes for PaO2 (78354-8) [32] and FiO2 (19996-8, 3150-0) [33, 34] are used for standardization.
* **eICU-CRD:** PaO2 is extracted from the `lab` table (`labname` like '%pao2%'), and FiO2 from `respiratoryCharting` (`respchartvaluelabel = 'FiO2'`) or `apacheApsVar` (`fio2` column) for the first 24 hours. Events are aligned using their offset columns.[1]
* **Coagulation (Platelet Count):**
* **MIMIC-IV:** Platelet count is extracted from `hosp.labevents` using itemid 51265.[1] The LOINC code for Platelet Count is 777-3.[35]
* **eICU-CRD:** Platelet count is extracted from the `lab` table where `labname` like '%platelet%'.[1]
* **Liver (Bilirubin):**
* **MIMIC-IV:** Total bilirubin is extracted from `hosp.labevents` using itemid 50885.[1] LOINC codes for Total Bilirubin include 1975-2 and 42719-5.[36, 37]
* **eICU-CRD:** Total bilirubin is extracted from the `lab` table where `labname` like '%bilirubin%'.[1]
* **Cardiovascular (Mean Arterial Pressure & Vasopressor Use):**
* **MIMIC-IV:** Mean Arterial Pressure (MAP) is extracted from `icu.chartevents` (using itemids 220052, 220181, or 225312).[1] Vasopressor administrations are identified from the `icu.inputevents` table by mapping drug names to their corresponding `itemid`s in `icu.d_items` (e.g., Norepinephrine itemid 221906).[1, 31] The LOINC code for MAP is commonly 8480-6.
* **eICU-CRD:** MAP is extracted from `vitalPeriodic` (`systemicmean`) or `vitalAperiodic` (`noninvasivemean`). Vasopressor use is identified from the `infusiondrug` table where the `drugname` column contains keywords for common vasopressors (e.g., 'Norepinephrine', 'Epinephrine', 'Dopamine', 'Vasopressin').[1, 38, 39, 40] The distinction between ordered medications and actual administrations is critical here; the `infusiondrug` table is the more accurate source for continuous infusions.[1]
* **Ontology Mapping (Vasopressors):** To ensure comprehensive identification, a multi-ontology approach is employed:
* **RxNorm:** `rxnorm.rxnconso` is queried for `tty` values ('IN', 'PIN', 'SCD', 'SBD') where `str` contains common vasopressor keywords (e.g., 'phenylephrine', 'norepinephrine', 'epinephrine', 'vasopressin', 'dopamine', 'dobutamine').[41, 42, 43, 44] `rxnorm.rxnsty` is also used to find `rxcui`s with `sty = 'Vasopressor'`.[1] Examples include Vasopressin (RxCUI 569490, 11149) [45, 46, 47, 48], Norepinephrine (RxCUI 7512) [49, 50], Epinephrine (RxCUI 727316) [51, 52], Dopamine (RxCUI 1743871) [53, 54], Phenylephrine (RxCUI 1666372) [55, 56], and Dobutamine (RxCUI 203121).[57, 58]
* **ATC:** `atc.who_atc_ddd` is queried for `atc_name`s containing '%norepinephrine%', '%epinephrine%', etc..[1] Specific ATC codes include Norepinephrine (C01CA03) [8, 59], Epinephrine (C01CA24) [60, 61], Dopamine (C01CA04) [62, 63], Vasopressin (H01BA01) [64, 65], Phenylephrine (R01AA04, R01AB01) [66, 67], and Dobutamine (C01CA07).[68, 69]
* **UMLS:** `umls.mrsty` is queried for `cui`s with `sty = 'Vasopressor'`.[1] `umls.mrconso` is searched for strings containing '%vasopressor%'.[1]
* **SNOMED_CT:** `snomed_ct.snomed_description` is queried for terms containing '%vasopressor%'.[1] `snomed_ct.snomed_relationship` is used to find related concepts via the 'Is a' relationship (`type_id = 116680003`).[1, 70, 71, 72, 73, 74, 75, 76, 77]
* **Central Nervous System (Glasgow Coma Scale - GCS):**
* **MIMIC-IV:** The total GCS score is calculated by summing its three components: Eyes (itemid 220739), Verbal (itemid 223900), and Motor (itemid 223901), all found in `icu.chartevents`.[1]
* **eICU-CRD:** GCS is often recorded as a total score in `nurseCharting` (`nursingchartcelltypevallabel = 'Glasgow coma score'`) or as its individual components in `apacheApsVar` (`eyes`, `motor`, `verbal`) for the first 24 hours.[1]
* **Renal (Creatinine & Urine Output):**
* **MIMIC-IV:** Serum creatinine is extracted from `hosp.labevents` (itemid 50912).[1] Total urine output over a period is calculated by summing `value` from `icu.outputevents` for relevant itemids like 226559, 226561.[1] LOINC codes for Creatinine include 2161-8 and 2162-6 for urine [4, 78], and common LOINC 2160-0 for serum/plasma creatinine. The LOINC code for Urine Output is 9192-6.[79]
* **eICU-CRD:** Serum creatinine is extracted from the `lab` table (`labname = 'creatinine'`). Urine output is extracted from the `intakeOutput` table where `celllabel = 'Urine'`.[1]
#### 3. Defining Sepsis Onset
The `sepsis_onset_time` is defined as the first timestamp (or offset) at which a patient satisfies both the criteria for a suspected infection (as defined in Section III.A.1) and demonstrates an acute increase in the total SOFA score of 2 or more points from a baseline of zero (as defined in Section III.A.2).[1]
### B. Implementing the KDIGO-AKI Definition
The second pillar of the SA-AKI definition is the diagnosis of AKI according to the Kidney Disease: Improving Global Outcomes (KDIGO) 2012 clinical practice guideline.[1, 1] AKI is defined by the presence of any one of the following criteria [1, 1]:
1. An increase in serum creatinine (SCr) by
2. An increase in SCr to
3. A urine volume of less than
Relevant ICD-9 codes for AKI include 584.5 (acute renal failure with tubular necrosis), 584.9 (acute renal failure, unspecified), and 586 (renal failure, unspecified).[80, 81] Corresponding ICD-10 codes are N17.0 to N17.9 (Acute renal failure) and N19 (Unspecified kidney failure).[82, 83]
**Ontology Mapping (AKI):**
* **UMLS:** Concepts such as "acute kidney injury" or "acute renal failure" are identified, with CUIs like C0022660 ("Acute kidney injury").[84, 85, 86]
* **SNOMED_CT:** The concept ID 14669001 for "Acute kidney injury" [84, 87] is used, and `snomed_ct.snomed_icd_map` is leveraged to link to relevant ICD codes.
#### 1. The Baseline Creatinine Challenge
A significant practical challenge in applying the KDIGO criteria retrospectively is the frequent absence of a known, pre-illness baseline SCr value.[1] The choice of baseline can dramatically affect cohort composition, as a poorly chosen baseline can lead to misclassification of AKI.[1] The document explicitly states that "The choice of baseline can dramatically affect cohort composition" and that using MDRD estimation is a "final resort" that "represents an estimation rather than a direct measurement".[1] This indicates that this specific baseline method introduces uncertainty and should be noted as a potential source of variability or bias, and the SQL should include a flag to identify these cases. A robust, hierarchical method for determining baseline SCr, reflecting clinical best practice, is implemented as follows [1]:
1. **Recent Outpatient Value (Ideal):** The median outpatient SCr value recorded in the 365 to 7 days prior to the current hospital admission. This represents the patient's stable, pre-illness renal function. In MIMIC-IV, this is queried from `hosp.labevents` where `hadm_id` is null but `charttime` falls in the correct window relative to the admission time.[1] In eICU-CRD, this corresponds to `lab` entries with a negative `labresultoffset`.[1]
2. **Admission Value:** If no recent outpatient value is available, the first SCr measurement recorded upon hospital admission is used.
3. **Nadir In-Hospital Value:** If an admission value is also unavailable, the lowest (nadir) SCr value measured within the first 48 hours of ICU admission can be used, as this may represent renal recovery after initial hydration.
4. **MDRD Estimation (Final Resort):** As a final resort, if no measured values are available, a baseline SCr can be estimated by back-calculating from a presumed normal glomerular filtration rate (GFR) of 75 mL/min/1.73 m², using the Modification of Diet in Renal Disease (MDRD) equation and the patient's demographic data (age, sex, race). Any patient whose baseline is derived this way is flagged, as it represents an estimation rather than a direct measurement.[1]
#### 2. Defining AKI Onset
The `aki_onset_time` is defined as the earliest timestamp (or offset) at which any of the KDIGO criteria for SCr or Urine Output (UO) are met. This requires continuous monitoring of both lab values and hourly urine output calculations.[1]
### C. Temporal Linkage and Final Cohort Selection
To be classified as SA-AKI, the onset of AKI must be temporally linked to the onset of sepsis. Research consensus suggests that AKI developing within a narrow window around the sepsis diagnosis is most likely to be directly caused by the septic insult.[1, 1] For this framework, the "Early SA-AKI" phenotype is targeted: the `aki_onset_time` must occur within the window of 48 hours before to 48 hours after the `sepsis_onset_time`.[1, 1] For MIMIC-IV, this involves `TIMESTAMP` comparisons, while for eICU-CRD, it involves `offset` comparisons (±2880 minutes).[1]
The "definitional paradox" highlights that any patient with a documented infection who develops AKI automatically meets the criteria for sepsis.[1] The document explicitly discusses this paradox and how Stage 1 AKI can be triggered by minor changes, leading to a "vast spectrum of clinical severity" within the SA-AKI label.[1] This inherent heterogeneity within the SA-AKI label necessitates additional features to capture clinical nuance, such as `aki_trigger`, `persistent_aki`, and `kdigo_stage_max_48hr` [1], to improve mortality prediction by allowing the model to differentiate between various levels of SA-AKI severity.
**Inclusion Criteria:**
1. Patient is an adult (Age
2. Patient meets Sepsis-3 criteria, establishing a `sepsis_onset_time`.
3. Patient meets KDIGO criteria for AKI (any stage), establishing an `aki_onset_time`.
4. Temporal Linkage: The `aki_onset_time` must occur within the window of 48 hours before to 48 hours after the `sepsis_onset_time`.[1]
**Exclusion Criteria:**
1. **Pre-existing End-Stage Renal Disease (ESRD) or Chronic Maintenance Dialysis:** Patients with ESRD or those on chronic maintenance dialysis prior to admission are excluded, as they cannot develop a new AKI.[1]
* **ICD-9 codes:** '585.6', 'V45.1' (long-term dialysis), 'V56.x' (dialysis codes).[1, 81]
* **ICD-10 codes:** 'N18.6' (End stage renal disease), 'Z99.2' (dependence on renal dialysis), 'Z49.x' (care involving dialysis).[1, 83]
* **Ontology Mapping (ESRD):**
* **UMLS:** Concepts such as "end stage renal disease," "ESRD," or "chronic dialysis" are identified, including CUIs like C1561643 ("Chronic kidney disease") [88] and specific CUIs for ESRD.[89, 90]
* **SNOMED_CT:** The concept ID 46177005 for "End-stage renal disease" [91] is used, and `snomed_ct.snomed_icd_map` is leveraged to find associated ICD codes.
2. **Short ICU Stay:** Patients with an ICU length of stay less than 24 hours are excluded, as this is insufficient time to reliably observe the development of AKI and gather necessary data.[1] For MIMIC-IV, this is calculated from `intime` and `outtime` in `icu.icustays` (`los` column). For eICU-CRD, it is calculated from `unitdischargeoffset` in the `patient` table.
3. **Early Renal Replacement Therapy (RRT):** Patients who receive RRT within the first 24 hours of ICU admission are excluded because the intervention fundamentally alters the natural course of AKI and its associated biomarkers, confounding the predictive modeling task.[1]
4. **eICU-CRD Specific Data Interface Limitations:** Patients admitted to hospitals or units with known incomplete data interfaces for critical variables (e.g., laboratory results, infusion medications) should be excluded to prevent systematic bias.[1] The explicit exclusion for eICU's data interface limitations reinforces the need for pre-analysis data quality checks, especially for multi-center datasets. This is not merely about handling missing values, but about addressing *systematic* missingness, which can introduce selection bias if not properly accounted for. The SQL for cohort definition includes a mechanism to filter by `hospitalid` if such limitations are identified.
Table 1 provides a summary of the logic required to define the SA-AKI cohort.
**Table 1: SA-AKI Cohort Definition Logic (Expanded)**
Criterion | Rationale | MIMIC-IV Implementation (Tables, Columns/itemids, Ontology Mapping) | eICU-CRD Implementation (Tables, Columns/labname pattern, Ontology Mapping) | Temporal Logic |
---|---|---|---|---|
**Sepsis-3** | Defines sepsis as infection plus organ dysfunction. | |||
Suspected Infection | Proxy for infection onset based on clinical action. | `hosp.prescriptions`, `hosp.emar`, `hosp.emar_detail`, |
- Sepsis AKI Mortality Prediction Dataset Prep Base.pdf
- RxNORM - amoxicillin - Classes | NCBO BioPortal - Biomedical Ontology, accessed on June 18, 2025, https://purl.bioontology.org/ontology/RXNORM/723
- RxNorm Overview - National Library of Medicine, accessed on June 18, 2025, https://www.nlm.nih.gov/research/umls/rxnorm/overview.html
- LOINC 2161-8 Creatinine [Mass/volume] in Urine, accessed on June 18, 2025, https://loinc.org/2161-8
- LOINC 71426-1 C reactive protein [Mass/volume] in Blood by High sensitivity method, accessed on June 18, 2025, https://loinc.org/71426-1
- j01 antibacterials for systemic use - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=j01
- Antiinfectives For Systemic Use - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=J&showdescription=yes
- C01CA03 - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=C01CA03
- The Semantic Network - National Library of Medicine, accessed on June 18, 2025, https://www.nlm.nih.gov/research/umls/new_users/online_learning/OVR_003.html
- Unique Identifiers in the Metathesaurus - National Library of Medicine, accessed on June 18, 2025, https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html
- Antibiotic Concepts and Types - John Snow Labs, accessed on June 18, 2025, https://www.johnsnowlabs.com/marketplace/antibiotic-concepts-and-types/
- SAMPLE ANTIBIOTIC SENSITIVITY RESULT (SNOMED CT DM+D) - NHS Data Dictionary, accessed on June 18, 2025, https://archive.datadictionary.nhs.uk/DD%20Release%20June%202021/data_elements/sample_antibiotic_sensitivity_result__snomed_ct_dm_d_.html
- What is SNOMED CT, accessed on June 18, 2025, https://www.snomed.org/what-is-snomed-ct
- Appendix 5 - RxNorm Term Types (TTY) - National Library of Medicine, accessed on June 18, 2025, https://www.nlm.nih.gov/research/umls/rxnorm/docs/appendix5.html
- Sepsis Treatment: Antibiotics, Medication, and Surgery - Everyday Health, accessed on June 18, 2025, https://www.everydayhealth.com/sepsis/treatment/
- Antibiotics - EMCrit Project, accessed on June 18, 2025, https://emcrit.org/ibcc/antibiotics/
- UMLS - ATC - Statistics - National Library of Medicine, accessed on June 18, 2025, https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/ATC/stats.html
- Label: AMPICILLIN capsule - DailyMed, accessed on June 18, 2025, https://dailymed.nlm.nih.gov/dailymed/lookup.cfm?setid=7f8b1ab1-73c6-4cdf-a838-7a4554c62586
- 400 ML vancomycin 5 MG/ML Injection - RxNav, accessed on June 18, 2025, https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=2118449
- RxClass, accessed on June 18, 2025, https://mor.nlm.nih.gov/RxClass/search?query=Fever
- atcddd.fhi.no, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=j01ca#:~:text=J01%20ANTIBACTERIALS%20FOR%20SYSTEMIC%20USE,mode%20of%20action%20and%20chemistry.
- UMLS Semantic Network, accessed on June 18, 2025, https://uts.nlm.nih.gov/uts/umls/semantic-network/T200
- Infectious disease (Concept Id: C0009450) - NCBI, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/medgen/1057
- suitability of UMLS and SNOMED-CT for encoding outcome concepts - Oxford Academic, accessed on June 18, 2025, https://academic.oup.com/jamia/article/30/12/1895/7249289
- TUTORIAL # Customizing the UMLS Metathesaurus for Your Applications - National Library of Medicine, accessed on June 18, 2025, https://www.nlm.nih.gov/research/umls/presentations/customizing-AMIA-Nov2000.pdf
- Amoxicillin 500mg capsules (RX Farma) (product) - SNOMED CT, accessed on June 18, 2025, https://termbrowser.nhs.uk/?perspective=full&conceptId1=35591411000001103
- mimic-code/mimic-iii/notebooks/vancomycin-dosing.ipynb at main - GitHub, accessed on June 18, 2025, https://github.com/MIT-LCP/mimic-code/blob/main/mimic-iii/notebooks/vancomycin-dosing.ipynb
- FedWeight: Mitigating Covariate Shift of Federated Learning on Electronic Health Records Data through Patients Re-weighting - medRxiv, accessed on June 18, 2025, https://www.medrxiv.org/content/10.1101/2025.02.10.25322018v1.full.pdf
- (PDF) FedWeight: Mitigating Covariate Shift of Federated Learning on Electronic Health Records Data through Patients Re-weighting - ResearchGate, accessed on June 18, 2025, https://www.researchgate.net/publication/390786796_FedWeight_Mitigating_Covariate_Shift_of_Federated_Learning_on_Electronic_Health_Records_Data_through_Patients_Re-weighting
- medication - eICU Collaborative Research Database, accessed on June 18, 2025, https://eicu-crd.mit.edu/eicutables/medication/
- Itemids for a list of variables · Issue #540 · MIT-LCP/mimic-code - GitHub, accessed on June 18, 2025, MIT-LCP/mimic-code#540
- LOINC 78354-8 Arterial-alveolar oxygen Partial pressure ratio, accessed on June 18, 2025, https://loinc.org/78354-8
- LOINC 19996-8 Oxygen/Inspired gas Respiratory system --on ventilator, accessed on June 18, 2025, https://loinc.org/19996-8
- LOINC 3150-0 Inhaled oxygen concentration, accessed on June 18, 2025, https://loinc.org/3150-0
- www.labcorp.com, accessed on June 18, 2025, https://www.labcorp.com/tests/005249/platelet-count
- www.labcorp.com, accessed on June 18, 2025, https://www.labcorp.com/tests/001214/bilirubin-total-and-direct
- LOINC 42719-5 Bilirubin.total [Mass/volume] in Blood, accessed on June 18, 2025, https://loinc.org/42719-5
- infusionDrug - eICU Collaborative Research Database, accessed on June 18, 2025, https://eicu-crd.mit.edu/eicutables/infusiondrug/
- Early prediction of hemodynamic interventions in the intensive care unit using machine learning - PMC, accessed on June 18, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC8590869/
- Practice Patterns in the Initiation of Secondary Vasopressors and Adjunctive Corticosteroids during Septic Shock in the United States - PubMed Central, accessed on June 18, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC8641823/
- Inotropes and Vasopressors - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/books/NBK482411/
- Vasopressors: Types, Purpose and Side Effects - Cleveland Clinic, accessed on June 18, 2025, https://my.clevelandclinic.org/health/treatments/23208-vasopressors
- Inotropes and Vasopressors - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/sites/books/NBK482411/
- List of Vasopressors - Drugs.com, accessed on June 18, 2025, https://www.drugs.com/drug-class/vasopressors.html
- RxNORM - vasopressin (USP) 20 UNT/ML Injectable Solution [Vasostrict] - Classes, accessed on June 18, 2025, https://bioportal.bioontology.org/ontologies/RXNORM?p=classes&conceptid=1593738
- vasopressin (USP) 20 UNT/ML [Pitressin] - RxNav, accessed on June 18, 2025, https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=569490
- RxNORM - vasopressin (USP) - Classes | NCBO BioPortal - Biomedical Ontology, accessed on June 18, 2025, https://purl.bioontology.org/ontology/RXNORM/11149
- arginine vasopressin - Wikidata, accessed on June 18, 2025, https://www.wikidata.org/wiki/Q183011
- RxNORM - norepinephrine - Classes | NCBO BioPortal - Biomedical Ontology, accessed on June 18, 2025, https://bioportal.bioontology.org/ontologies/RXNORM?p=classes&conceptid=7512
- Norepinephrine - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/books/NBK537259/
- NDA019430 0.3 ML EPINEPHrine 0.5 MG/ML Auto-Injector - RxNav, accessed on June 18, 2025, https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=727316
- Epinephrine - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/books/NBK482160/
- 500 ML DOPamine hydrochloride 0.8 MG/ML Injection - RxNav, accessed on June 18, 2025, https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=1743871
- Urinary Dopamine Levels Can Predict the Avidity of Post-Therapy [131I]MIBG Scintigraphy in Unresectable or Metastatic Pheochromocytomas and Paragangliomas: A Preliminary Clinical Study - PubMed Central, accessed on June 18, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC11858449/
- RxNORM - 1 ML phenylephrine hydrochloride 10 MG/ML Injection - Classes, accessed on June 18, 2025, https://bioportal.bioontology.org/ontologies/RXNORM?p=classes&conceptid=1666372
- Phenylephrine - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/books/NBK534801/
- RxCUI 203121 - RxNav, accessed on June 18, 2025, https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=203121
- Dobutamine - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/sites/books/NBK470431/
- Noradrenaline (Norepinephrine) 0.08 mg/ml, solution for infusion - Summary of Product Characteristics (SmPC) - (emc) | 8776, accessed on June 18, 2025, https://www.medicines.org.uk/emc/product/8776/smpc
- EpiPen Adrenaline (Epinephrine) 0.3 mg Auto-Injector - Summary of Product Characteristics (SmPC) - (emc) | 4289, accessed on June 18, 2025, https://www.medicines.org.uk/emc/product/4289/smpc
- C01CA24 - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=C01CA24
- dopamine - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=C01CA04
- Dopamine Hydrochloride 40mg/ml Concentrate for Solution for Infusion - Summary of Product Characteristics (SmPC) - (emc) | 6619, accessed on June 18, 2025, https://www.medicines.org.uk/emc/product/6619/smpc
- Vasopressin and analogues - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=H01BA&showdescription=yes
- Vasopressin - Wikipedia, accessed on June 18, 2025, https://en.wikipedia.org/wiki/Vasopressin
- R01AA04 - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=R01AA04
- r01 nasal preparations - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?showdescription=yes&code=R01AB
- C01CA07 - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=C01CA07
- Dobutamine 5 mg/ml solution for infusion - Patient Information Leaflet (PIL) - (emc) | 6462, accessed on June 18, 2025, https://www.medicines.org.uk/emc/product/6462/pil
- SNOMED CT - Vasopressin deficiency - Classes | NCBO BioPortal - Biomedical Ontology, accessed on June 18, 2025, https://purl.bioontology.org/ontology/SNOMEDCT/45369008
- Vasopressin resistance - Classes | NCBO BioPortal - SNOMED CT, accessed on June 18, 2025, https://purl.bioontology.org/ontology/SNOMEDCT/111395007
- SNOMED CT - Allergy to norepinephrine - Classes | NCBO BioPortal, accessed on June 18, 2025, https://bioportal.bioontology.org/ontologies/SNOMEDCT?p=classes&conceptid=294029008
- Adrenaline (base) 25micrograms/ml (1 in 40000) topical solution (Special Order) 1 ml (product) - SNOMED CT Browser, accessed on June 18, 2025, https://termbrowser.nhs.uk/?perspective=full&conceptId1=33514611000001107
- Dopamine 400mg/10ml solution for infusion pre-filled syringes (product) - SNOMED CT, accessed on June 18, 2025, https://termbrowser.nhs.uk/?perspective=full&conceptId1=36071011000001104
- snomed international snomed ct license agreement - Athena, accessed on June 18, 2025, https://qaathena.odysseusinc.com/search-terms/terms/606832
- Phenylephrine 12mg capsules (product) - SNOMED CT, accessed on June 18, 2025, https://termbrowser.nhs.uk/?perspective=full&conceptId1=8528911000001104
- Dobutamine 100mg/100ml infusion bags (Special Order) 1 bag (product) - SNOMED CT, accessed on June 18, 2025, https://termbrowser.nhs.uk/?perspective=full&conceptId1=35060111000001103
- LOINC 2162-6 Creatinine [Mass/time] in 24 hour Urine, accessed on June 18, 2025, https://loinc.org/2162-6
- LOINC 9192-6 Urine output 24 hour, accessed on June 18, 2025, https://loinc.org/9192-6
- www.aapc.com, accessed on June 18, 2025, https://www.aapc.com/codes/icd9-codes/584.9#:~:text=ICD%2D9%20code%20584.9%20for,NEPHROSIS%20(580%2D589).
- Urine Culture Bacterial (Re-evaluation of Inclusion of Renal Failure in the List of ICD-9-CM Codes Covered) (CAG-00195N) - Decision Memo - CMS, accessed on June 18, 2025, https://www.cms.gov/medicare-coverage-database/view/ncacal-decision-memo.aspx?proposed=N&calid=100&keyword=Transplant&keywordType=starts&areaId=s24&docType=NCA,CAL,NCD,MEDCAC,TA,MCD,6,3,5,1,F,P&contractOption=all&sortBy=relevance&bc=1
- clinicalcodes.rss.mhs.man.ac.uk, accessed on June 18, 2025, https://clinicalcodes.rss.mhs.man.ac.uk/medcodes/article/30/codelist/res30-acute-kidney-injury/
- Renal failure (N17-N19) - ICD-10 Version:2019, accessed on June 18, 2025, https://icd.who.int/browse10/2019/en#/N17-N19
- SNOMED CT - Acute kidney injury - Classes | NCBO BioPortal - Biomedical Ontology, accessed on June 18, 2025, https://purl.bioontology.org/ontology/SNOMEDCT/14669001
- Mouse model of ischemic acute kidney injury: technical notes and tricks - PMC, accessed on June 18, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC3532486/
- Acute Kidney Injury - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/books/NBK441896/
- Acute kidney injury - Athena, accessed on June 18, 2025, https://qaathena.odysseusinc.com/search-terms/terms/197320
- Chronic kidney disease (Concept Id: C1561643) - NCBI, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/medgen/473458
- Development and external validation of multimodal postoperative acute kidney injury risk machine learning models - PMC, accessed on June 18, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC10746378/
- End-Stage Renal Disease - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/books/NBK499861/
- SNOMED CT - End-stage renal disease - Classes | NCBO BioPortal - Biomedical Ontology, accessed on June 18, 2025, https://purl.bioontology.org/ontology/SNOMEDCT/46177005
- LOINC 3094-0 Urea nitrogen [Mass/volume] in Serum or Plasma, accessed on June 18, 2025, https://loinc.org/3094-0
- LOINC 6299-2 Urea nitrogen [Mass/volume] in Blood, accessed on June 18, 2025, https://loinc.org/6299-2
- loinc.org, accessed on June 18, 2025, https://loinc.org/33037-3#:~:text=LOINC%2033037%2D3%20Anion%20gap%20in%20Serum%20or%20Plasma%20by%20calculation
- Anion gap in Serum or Plasma by Calculated.3Ions - 10466-1 - LOINC - Find-A-Code, accessed on June 18, 2025, https://www.findacode.com/loinc/10466-1--anion-gap-3-in-serum-or-plasma.html
- LOINC 2947-0 Sodium [Moles/volume] in Blood, accessed on June 18, 2025, https://loinc.org/2947-0
- loinc.org, accessed on June 18, 2025, https://loinc.org/2829-0#:~:text=LOINC%202829%2D0%20Potassium%20%5BMoles,time%5D%20in%2024%20hour%20Urine
- loinc.org, accessed on June 18, 2025, https://loinc.org/2075-0#:~:text=LOINC%202075%2D0%20Chloride%20%5BMoles,volume%5D%20in%20Serum%20or%20Plasma
- LOINC 2075-0 Chloride [Moles/volume] in Serum or Plasma, accessed on June 18, 2025, https://loinc.org/2075-0
- loinc.org, accessed on June 18, 2025, https://loinc.org/1960-4#:~:text=LOINC%201960%2D4%20Bicarbonate%20%5BMoles%2Fvolume%5D%20in%20Arterial%20blood
- LOINC 69964-5 Bicarbonate [Moles/volume] standard in Plasma, accessed on June 18, 2025, https://loinc.org/69964-5
- loinc.org, accessed on June 18, 2025, https://loinc.org/1743-4#:~:text=LOINC%201743%2D4%20Alanine%20aminotransferase,With%20P%2D5'%2DP
- LOINC 1743-4 Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma by With P-5, accessed on June 18, 2025, https://loinc.org/1743-4
- loinc.org, accessed on June 18, 2025, https://loinc.org/30239-8#:~:text=LOINC%2030239%2D8%20Aspartate%20aminotransferase,With%20P%2D5'%2DP
- LOINC 1920-8 Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma, accessed on June 18, 2025, https://loinc.org/1920-8
- loinc.org, accessed on June 18, 2025, https://loinc.org/6768-6#:~:text=LOINC%206768%2D6%20Alkaline%20phosphatase,volume%5D%20in%20Serum%20or%20Plasma
- LOINC 77141-0 Alkaline phosphatase [Enzymatic activity/volume] in Serum, Plasma or Blood, accessed on June 18, 2025, https://loinc.org/77141-0
- loinc.org, accessed on June 18, 2025, https://loinc.org/1751-7#:~:text=LOINC%201751%2D7%20Albumin%20%5BMass,volume%5D%20in%20Serum%20or%20Plasma
- LOINC 1751-7 Albumin [Mass/volume] in Serum or Plasma, accessed on June 18, 2025, https://loinc.org/1751-7
- Leukocytes [#/volume] in Blood by Automated count - 6690-2 - LOINC, accessed on June 18, 2025, https://www.findacode.com/loinc/6690-2--leukocytes-volume-in-blood-by-automated-count.html
- Hemoglobin [Mass/volume] in Blood - 718-7 - LOINC - Find-A-Code, accessed on June 18, 2025, https://www.findacode.com/loinc/718-7--hemoglobin-massvolume-in-blood.html
- loinc.org, accessed on June 18, 2025, https://loinc.org/4544-3#:~:text=LOINC%204544%2D3%20Hematocrit%20%5BVolume,of%20Blood%20by%20Automated%20count
- www.findacode.com, accessed on June 18, 2025, https://www.findacode.com/loinc/6301-6--inr-in-platelet-poor-plasma-by-coagulation-assay.html#:~:text=INR%20in%20Platelet%20poor%20plasma,assay%20%2D%206301%2D6%20%2D%20LOINC
- loinc.org, accessed on June 18, 2025, https://loinc.org/5902-2
- LOINC 50391-2 PTT circulating inhibitor [Presence] in Platelet poor plasma, accessed on June 18, 2025, https://loinc.org/50391-2
- loinc.org, accessed on June 18, 2025, https://loinc.org/30522-7#:~:text=LOINC%2030522%2D7%20C%20reactive,Plasma%20by%20High%20sensitivity%20method
- loinc.org, accessed on June 18, 2025, https://loinc.org/29246-6#:~:text=LOINC%2029246%2D6%20Lactate%20%5BMass%2Fvolume%5D%20in%20Body%20fluid
- loinc.org, accessed on June 18, 2025, https://loinc.org/2339-0#:~:text=LOINC%202339%2D0%20Glucose%20%5BMass%2Fvolume%5D%20in%20Blood
- LOINC 2744-1 pH of Arterial blood, accessed on June 18, 2025, https://loinc.org/2744-1
- LOINC Panel Details 52484-3 Arterial Blood Gases (ABGs), accessed on June 18, 2025, https://loinc.org/52484-3/panel