Skip to content

Instantly share code, notes, and snippets.

@stabgan
Created June 18, 2025 17:57
Show Gist options
  • Save stabgan/1132d658c50a6e26cae1a93271f7cc32 to your computer and use it in GitHub Desktop.
Save stabgan/1132d658c50a6e26cae1a93271f7cc32 to your computer and use it in GitHub Desktop.
Extract Sepsis Associated AKI cohort from MIMIC and eICU

A Comprehensive Framework for Sepsis-Associated Acute Kidney Injury (SA-AKI) Cohort Extraction and Staging from MIMIC-IV and eICU-CRD Databases

I. Executive Summary

Sepsis-Associated Acute Kidney Injury (SA-AKI) represents a formidable clinical challenge within intensive care, profoundly increasing patient mortality and long-term morbidity.1 The accurate and reproducible identification of this specific patient cohort is a foundational step for developing effective predictive models and advancing clinical research in critical care. This report delineates a rigorous, ontology-driven methodology for defining SA-AKI patient cohorts within two prominent critical care databases: MIMIC-IV v3.1 and eICU-CRD. The framework integrates internationally recognized clinical criteria for sepsis and acute kidney injury with advanced data mapping techniques, ensuring high data fidelity and reproducibility across these heterogeneous electronic health record (EHR) systems. The ultimate output includes the SQL Data Definition Language (DDL) for a comprehensive staging table, designed to support binary mortality prediction, encompassing a rich set of static, dynamic, and intervention-related features.

II. Introduction to the Data Ecosystems

Effective clinical research leveraging large-scale EHR databases necessitates a profound understanding of their unique structures, data provenance, and inherent complexities. This section provides an overview of the MIMIC-IV and eICU-CRD databases, highlighting their architectural principles, temporal data representations, and critical considerations for researchers.

MIMIC-IV v3.1: A Single-Center, High-Resolution Database

MIMIC-IV v3.1 is a publicly available, de-identified clinical database sourced from the Beth Israel Deaconess Medical Center (BIDMC) in Boston, MA, spanning patient admissions from 2008 to 2022.1 This extensive dataset comprises records for over 364,000 unique patients, corresponding to more than 546,000 hospitalizations and nearly 95,000 unique Intensive Care Unit (ICU) stays.1 The database is organized into a modular structure, reflecting its data provenance. The

hosp module contains hospital-wide EHR data, including administrative details, billing codes, laboratory results, and pharmacy records that cover a patient's entire hospitalization. In contrast, the icu module provides high-resolution data from the in-ICU clinical information system (MetaVision), offering detailed, often minute-by-minute, physiological measurements, ventilator settings, and fluid intake/output records specific to an ICU stay.1

A crucial aspect of MIMIC-IV's design philosophy is its commitment to providing data "as is," meaning "data cleaning steps were not performed, to ensure the data reflects a real-world clinical dataset".1 This deliberate choice aims to preserve the ecological validity of the data, retaining the artifacts, inconsistencies, and potential errors inherent in routine clinical practice. For a task demanding "perfect calculations" and a "robust plan," this implies that SQL queries must be designed with resilience. This involves anticipating and handling data quality issues such as

NULL values, unexpected string formats, and out-of-range numerical data, often requiring explicit type casting and validation within the SQL itself, even if advanced imputation is deferred. The implication is that the data extraction pipeline must be robust enough to manage real-world data imperfections, ensuring that derived features are reliable despite the raw data's variability.

To comply with HIPAA Safe Harbor provisions, MIMIC-IV employs a thorough de-identification process. All dates and times for a given patient are shifted by a random offset into the future, preserving the internal consistency of a single patient's timeline (e.g., the duration between any two events for a patient remains accurate) while making direct temporal comparisons between different patients impossible.1 The

patients table provides anchor_age, anchor_year, and anchor_year_group to anchor a patient's timeline.1 Notably, the ages of all patients over 89 are obscured, with their

anchor_age recorded as 91. This means an anchor_age of 91 represents the entire group of patients aged 90 and above.1 For mortality prediction, especially in SA-AKI where advanced age is a significant risk factor, this age capping reduces granularity for the oldest demographic. The SQL queries and subsequent feature engineering must acknowledge this by treating the value '91' as a categorical representation of "90 years or older" rather than a precise continuous age, to avoid misinterpretation and ensure appropriate model handling. Free-text fields within the database are de-identified by replacing Protected Health Information (PHI) with three underscores ("___").1

The database is interconnected through a hierarchical system of three primary identifiers: subject_id, hadm_id, and stay_id.1 The

subject_id is a unique integer assigned to each individual patient, linking all data pertaining to a single person across their hospitalizations and care episodes. The hadm_id uniquely identifies each hospital admission, linking all events, diagnoses, and procedures associated with a single inpatient stay. A single subject_id can have multiple hadm_ids. The stay_id uniquely identifies a single, continuous stay within a specific care unit, primarily an ICU, linking high-frequency, granular data from the icu module tables (e.g., chartevents, inputevents) to a specific ICU admission. A single hadm_id can have multiple stay_ids if a patient was transferred between ICUs or readmitted to the ICU during the same hospitalization.1

eICU-CRD: A Multi-Center, Granular Database

The eICU Collaborative Research Database (eICU-CRD) v2.0 is a large, de-identified, multi-center critical care database containing granular data from over 200,000 admissions to ICUs across the United States.1 Data were collected between 2014 and 2015 from 208 hospitals and 335 distinct ICU wards participating in the Philips eICU tele-ICU program.1 This origin provides a rich, high-granularity view of real-world clinical practice, encompassing vital sign measurements, laboratory results, diagnoses, treatments, and medication documentation.1

The multi-center nature of eICU-CRD, while enhancing the external validity and generalizability of research findings by capturing a broader range of patient populations and clinical practices, also presents significant analytical challenges.1 The presence, frequency, and even the naming conventions for certain data points—such as specific laboratory tests in the

lab table or charted items in the nursecharting table—can vary substantially across different hospitals (identified by hospitalid) and ICU wards (wardid).1 This inherent heterogeneity means that direct string matching for

labname or drugname is often insufficient for comprehensive data extraction. Therefore, an ontology-driven approach is not merely a suggestion but a necessity for eICU data to overcome this variability. Furthermore, this variability suggests that hospital-level metadata (e.g., numbedscategory, teachingstatus, region from eicu.hospital) could serve as important covariates or predictive features in the final model, accounting for systemic differences in care or patient populations.1

The eICU-CRD database is organized around a hierarchy of unique identifiers: uniquepid, patienthealthsystemstayid, and patientunitstayid.1

patientunitstayid serves as the primary identifier for linking most clinical event data, uniquely identifying a single ICU stay.1 While most tables link to the central

patient table via patientunitstayid, complex clinical relationships (e.g., linking a specific microbiology culture result to a targeted antibiotic administration) are not enforced by explicit foreign key constraints. Researchers must infer these relationships by joining on patientunitstayid and carefully aligning temporal data using offset columns.1 The temporal data in eICU-CRD is represented using offsets in minutes, calculated relative to the moment of a patient's ICU admission (offset 0).1 Events occurring before ICU admission have negative offsets, while events during the ICU stay have positive offsets. The

unitdischargeoffset represents the patient's length of stay in the ICU in minutes.1 This offset system simplifies intra-ICU duration calculations but introduces complexity when analyzing events across multiple ICU stays for the same patient, as each

patientunitstayid has its own "clock." Reconstructing a patient's full hospital journey, if needed for features spanning beyond a single ICU stay, requires careful stitching of individual ICU stay timelines using various admission and discharge offsets from the patient table.1

Critical considerations for researchers using eICU-CRD include the impact of data interfaces and systematic missingness. The availability of any given data type is contingent on whether the corresponding care unit had the necessary "data interface" enabled.1 This means that missing data should not be naively interpreted as the absence of an event. For instance, if an ICU lacked an interface for its infusion pumps, the

infusiondrug table would contain no records for patients from that unit, even if those patients received continuous infusions. This systematic missingness can introduce significant bias if not addressed. Therefore, robust methodology requires identifying hospitals or units that consistently provide the data of interest and restricting analysis to this reliable subset or employing advanced statistical methods to handle non-random missingness.1 The SQL for cohort definition should include a mechanism to filter by

hospitalid if such limitations are identified.

Another crucial distinction in eICU-CRD is between ordered and administered treatments. The medication table primarily reflects medication orders, which do not guarantee administration. In contrast, the infusiondrug and intakeoutput tables contain data charted by nurses at the bedside, reflecting the actual administration of continuous infusions and other fluids.1 For studies involving continuous infusions,

infusiondrug and intakeoutput are more reliable sources for actual administration data.1 For example, when identifying vasopressor use, relying on the

infusiondrug table is medically more accurate for quantifying actual treatment compared to mere orders.

Ontologies: The Foundation for Standardized Extraction

The programmatic extraction of clinical concepts, rather than relying on hardcoded values, is a critical requirement for ensuring medical accuracy and generalizability across heterogeneous datasets. This is achieved by leveraging a suite of ontologies:

  • RxNorm: This terminology provides normalized names for generic and branded drugs, linking to various drug vocabularies.2 It is essential for identifying all forms of antibiotic and vasopressor medications.
  • LOINC (Logical Observation Identifiers Names and Codes): LOINC standardizes laboratory test names and clinical observations.1 It is crucial for consistently identifying lab results like creatinine, bilirubin, and platelet count across different hospital systems.
  • ATC (Anatomical Therapeutic Chemical Classification System): ATC classifies drugs based on their anatomical main group, therapeutic main group, pharmacological subgroup, chemical subgroup, and chemical substance.1 This hierarchical classification provides a broad and comprehensive method for identifying drug classes, such as all systemic antibacterials (J01) or specific vasopressors (e.g., C01CA).6
  • HPO (Human Phenotype Ontology): HPO describes phenotypic abnormalities associated with human diseases.1 While less directly used for primary cohort definition in this context, it can be valuable for identifying subtle phenotypic expressions or for future feature enrichment related to disease severity and genetic predispositions.
  • UMLS (Unified Medical Language System): UMLS serves as a comprehensive metathesaurus, integrating and linking numerous biomedical vocabularies, including RxNorm and SNOMED_CT, through Concept Unique Identifiers (CUIs) and semantic types.1 This central hub role is vital for achieving comprehensive mapping. By mapping local EHR terms (e.g.,
    drugname, labname) to RxNorm or LOINC, and then using UMLS to bridge to ATC and SNOMED_CT via CUIs and relationships, a highly comprehensive and accurate list of relevant clinical concepts can be generated. This programmatic approach is superior to hardcoding specific values and enhances the generalizability and reproducibility of the cohort definition.
  • SNOMED_CT (Systematized Nomenclature of Medicine—Clinical Terms): SNOMED_CT is a comprehensive, multilingual clinical healthcare terminology that provides concepts, descriptions, and relationships, often mapped to other international standards like ICD.1 It is fundamental for identifying clinical conditions, diagnoses (e.g., AKI, ESRD, infection), and procedures with high specificity and consistency.

The interconnectedness of these ontologies, particularly through UMLS's role as a Metathesaurus, allows for a multi-faceted mapping strategy. For example, a drug name in MIMIC or eICU can be mapped to an RxNorm concept, which then links to an ATC code and a UMLS CUI, which in turn links to SNOMED_CT concepts. This ensures that no relevant drug or condition is missed due to variations in local terminology or data entry practices.

III. Defining the Sepsis-Associated Acute Kidney Injury (SA-AKI) Cohort

The precise and reproducible definition of the SA-AKI cohort is the cornerstone of any subsequent predictive modeling. This involves the meticulous application of the Sepsis-3 criteria for sepsis and the Kidney Disease: Improving Global Outcomes (KDIGO) criteria for Acute Kidney Injury (AKI), coupled with strict temporal linkage and exclusion criteria.

A. Implementing the Sepsis-3 Definition

The Third International Consensus Definitions for Sepsis (Sepsis-3), published in 2016, redefined sepsis as "life-threatening organ dysfunction caused by a dysregulated host response to infection".1 This definition is operationalized by identifying patients with a suspected infection who concurrently exhibit an acute increase of 2 or more points in the Sequential Organ Failure Assessment (SOFA) score. The baseline SOFA score is assumed to be zero for patients without known pre-existing organ dysfunction.1

1. Identifying a Suspected Infection: The Temporal Link of Antibiotics and Cultures

The anchor point for the Sepsis-3 definition is the presence of an infection. In retrospective database research, a robust and widely accepted proxy is the temporal association of antimicrobial therapy and the collection of microbiological cultures.1 This combination signifies that the clinical team suspected an infection and initiated treatment based on that suspicion. A suspected infection is defined for a given hospital admission (MIMIC

hadm_id) or ICU stay (eICU patientunitstayid) if an antibiotic was administered within a 72-hour window before or a 24-hour window after a microbiological culture was sampled. The infection_onset_time is designated as the earlier of these two events' timestamps or offsets.1

Antibiotic Administration:
For MIMIC-IV, antibiotic orders are identified from the hosp.prescriptions table or, for more granular administration data, the hosp.emar and hosp.emar_detail tables.1 For eICU-CRD, antibiotic orders are identified from the
medication table using the drugstartoffset.1 To avoid hardcoding specific drug names, a comprehensive list of antibiotics is programmatically identified using multiple ontologies:

  • RxNorm: Concepts with tty values such as 'IN' (Ingredient), 'PIN' (Precise Ingredient), 'SCD' (Semantic Clinical Drug), or 'SBD' (Semantic Branded Drug) 1 are queried from
    rxnorm.rxnconso where the string representation (str) contains common antibiotic keywords (e.g., 'penicillin', 'cephalexin', 'vancomycin', 'ciprofloxacin', 'levofloxacin').15 Additionally,
    rxnorm.rxnsty is used to find rxcuis classified with the semantic type 'Antibiotic'.1 Specific examples include Ampicillin (RxCUI 308212, 313800) 18, Vancomycin (RxCUI 2118449) 19, and Meropenem (RxCUI 29561).20

  • ATC: atc.who_atc_ddd is queried for atc_codes starting with 'J01', which represents "ANTIBACTERIALS FOR SYSTEMIC USE".1 This provides a broad, hierarchical classification of antibiotics.

  • UMLS: umls.mrsty is queried for cuis with the semantic type 'Antibiotic' or 'Clinical Drug'.1 Additionally,
    umls.mrconso is searched for strings containing '%antibiotic%'.1 The UMLS Concept Unique Identifier (CUI) C0009450 represents "Infectious disease" 23, which can be used to link to related treatments and conditions.24

  • SNOMED_CT: snomed_ct.snomed_description is queried for terms containing '%antibiotic%'.1
    snomed_ct.snomed_relationship is used to find concepts related to antibiotics via the 'Is a' relationship (type_id = 116680003).1

    For MIMIC-IV, specific icu.inputevents d_items.itemids such as 225798 for 'Vancomycin' are also included.27 For eICU-CRD, the
    medication.drugname field is free-text and hospital-specific.1 This necessitates robust pattern matching using
    LIKE statements for common antibiotic names (e.g., LIKE '%penicillin%', LIKE '%vancomycin%', LIKE '%cefepime%') 16 in conjunction with the ontology-derived lists.

Culture Collection:
In MIMIC-IV, microbiological culture events (e.g., blood, urine, sputum) are identified from the hosp.microbiologyevents table using chartdate or charttime.1 In eICU-CRD, culture events are identified from the
microlab table using culturetakenoffset.1

2. Calculating the Sequential Organ Failure Assessment (SOFA) Score

The SOFA score is a quantitative measure of organ dysfunction across six critical organ systems.1 An acute increase of 2 or more points from baseline is the cornerstone of the Sepsis-3 definition.1 For each 24-hour period of a patient's ICU stay, the worst value for each component is used to calculate the score.1

  • Respiration (PaO2/FiO2 Ratio):
    • MIMIC-IV: Arterial partial pressure of oxygen (PaO2) is extracted from hosp.labevents (itemid 50821) 1, and fraction of inspired oxygen (FiO2) from
      icu.chartevents (itemid 223835).1 These values must be temporally aligned using their respective
      charttime columns. LOINC codes for PaO2 (78354-8) 32 and FiO2 (19996-8, 3150-0) 33 are used for standardization.
    • eICU-CRD: PaO2 is extracted from the lab table (labname like '%pao2%'), and FiO2 from respiratoryCharting (respchartvaluelabel = 'FiO2') or apacheApsVar (fio2 column) for the first 24 hours. Events are aligned using their offset columns.1
  • Coagulation (Platelet Count):
    • MIMIC-IV: Platelet count is extracted from hosp.labevents using itemid 51265.1 The LOINC code for Platelet Count is 777-3.35
    • eICU-CRD: Platelet count is extracted from the lab table where labname like '%platelet%'.1
  • Liver (Bilirubin):
    • MIMIC-IV: Total bilirubin is extracted from hosp.labevents using itemid 50885.1 LOINC codes for Total Bilirubin include 1975-2 and 42719-5.36
    • eICU-CRD: Total bilirubin is extracted from the lab table where labname like '%bilirubin%'.1
  • Cardiovascular (Mean Arterial Pressure & Vasopressor Use):
    • MIMIC-IV: Mean Arterial Pressure (MAP) is extracted from icu.chartevents (using itemids 220052, 220181, or 225312).1 Vasopressor administrations are identified from the
      icu.inputevents table by mapping drug names to their corresponding itemids in icu.d_items (e.g., Norepinephrine itemid 221906).1 The LOINC code for MAP is commonly 8480-6.
    • eICU-CRD: MAP is extracted from vitalPeriodic (systemicmean) or vitalAperiodic (noninvasivemean). Vasopressor use is identified from the infusiondrug table where the drugname column contains keywords for common vasopressors (e.g., 'Norepinephrine', 'Epinephrine', 'Dopamine', 'Vasopressin').1 The distinction between ordered medications and actual administrations is critical here; the
      infusiondrug table is the more accurate source for continuous infusions.1
    • Ontology Mapping (Vasopressors): To ensure comprehensive identification, a multi-ontology approach is employed:
      • RxNorm: rxnorm.rxnconso is queried for tty values ('IN', 'PIN', 'SCD', 'SBD') where str contains common vasopressor keywords (e.g., 'phenylephrine', 'norepinephrine', 'epinephrine', 'vasopressin', 'dopamine', 'dobutamine').41
        rxnorm.rxnsty is also used to find rxcuis with sty = 'Vasopressor'.1 Examples include Vasopressin (RxCUI 569490, 11149) 45, Norepinephrine (RxCUI 7512) 49, Epinephrine (RxCUI 727316) 51, Dopamine (RxCUI 1743871) 53, Phenylephrine (RxCUI 1666372) 55, and Dobutamine (RxCUI 203121).57
      • ATC: atc.who_atc_ddd is queried for atc_names containing '%norepinephrine%', '%epinephrine%', etc..1 Specific ATC codes include Norepinephrine (C01CA03) 8, Epinephrine (C01CA24) 60, Dopamine (C01CA04) 62, Vasopressin (H01BA01) 64, Phenylephrine (R01AA04, R01AB01) 66, and Dobutamine (C01CA07).68
      • UMLS: umls.mrsty is queried for cuis with sty = 'Vasopressor'.1
        umls.mrconso is searched for strings containing '%vasopressor%'.1
      • SNOMED_CT: snomed_ct.snomed_description is queried for terms containing '%vasopressor%'.1
        snomed_ct.snomed_relationship is used to find related concepts via the 'Is a' relationship (type_id = 116680003).1
  • Central Nervous System (Glasgow Coma Scale - GCS):
    • MIMIC-IV: The total GCS score is calculated by summing its three components: Eyes (itemid 220739), Verbal (itemid 223900), and Motor (itemid 223901), all found in icu.chartevents.1
    • eICU-CRD: GCS is often recorded as a total score in nurseCharting (nursingchartcelltypevallabel = 'Glasgow coma score') or as its individual components in apacheApsVar (eyes, motor, verbal) for the first 24 hours.1
  • Renal (Creatinine & Urine Output):
    • MIMIC-IV: Serum creatinine is extracted from hosp.labevents (itemid 50912).1 Total urine output over a period is calculated by summing
      value from icu.outputevents for relevant itemids like 226559, 226561.1 LOINC codes for Creatinine include 2161-8 and 2162-6 for urine 4, and common LOINC 2160-0 for serum/plasma creatinine. The LOINC code for Urine Output is 9192-6.79
    • eICU-CRD: Serum creatinine is extracted from the lab table (labname = 'creatinine'). Urine output is extracted from the intakeOutput table where celllabel = 'Urine'.1

3. Defining Sepsis Onset

The sepsis_onset_time is defined as the first timestamp (or offset) at which a patient satisfies both the criteria for a suspected infection (as defined in Section III.A.1) and demonstrates an acute increase in the total SOFA score of 2 or more points from a baseline of zero (as defined in Section III.A.2).1

B. Implementing the KDIGO-AKI Definition

The second pillar of the SA-AKI definition is the diagnosis of AKI according to the Kidney Disease: Improving Global Outcomes (KDIGO) 2012 clinical practice guideline.1 AKI is defined by the presence of any one of the following criteria 1:

  1. An increase in serum creatinine (SCr) by ≥0.3 mg/dL (≥26.5 μmol/L) within 48 hours.
  2. An increase in SCr to ≥1.5 times a baseline value, which is known or presumed to have occurred within the prior 7 days.
  3. A urine volume of less than 0.5 mL/kg/h for 6 hours.

Relevant ICD-9 codes for AKI include 584.5 (acute renal failure with tubular necrosis), 584.9 (acute renal failure, unspecified), and 586 (renal failure, unspecified).80 Corresponding ICD-10 codes are N17.0 to N17.9 (Acute renal failure) and N19 (Unspecified kidney failure).82

Ontology Mapping (AKI):

  • UMLS: Concepts such as "acute kidney injury" or "acute renal failure" are identified, with CUIs like C0022660 ("Acute kidney injury").84
  • SNOMED_CT: The concept ID 14669001 for "Acute kidney injury" 84 is used, and
    snomed_ct.snomed_icd_map is leveraged to link to relevant ICD codes.

1. The Baseline Creatinine Challenge

A significant practical challenge in applying the KDIGO criteria retrospectively is the frequent absence of a known, pre-illness baseline SCr value.1 The choice of baseline can dramatically affect cohort composition, as a poorly chosen baseline can lead to misclassification of AKI.1 A robust, hierarchical method for determining baseline SCr, reflecting clinical best practice, is implemented as follows 1:

  1. Recent Outpatient Value (Ideal): The median outpatient SCr value recorded in the 365 to 7 days prior to the current hospital admission. This represents the patient's stable, pre-illness renal function. In MIMIC-IV, this is queried from hosp.labevents where hadm_id is null but charttime falls in the correct window relative to the admission time.1 In eICU-CRD, this corresponds to
    lab entries with a negative labresultoffset.1
  2. Admission Value: If no recent outpatient value is available, the first SCr measurement recorded upon hospital admission is used.
  3. Nadir In-Hospital Value: If an admission value is also unavailable, the lowest (nadir) SCr value measured within the first 48 hours of ICU admission can be used, as this may represent renal recovery after initial hydration.
  4. MDRD Estimation (Final Resort): As a final resort, if no measured values are available, a baseline SCr can be estimated by back-calculating from a presumed normal glomerular filtration rate (GFR) of 75 mL/min/1.73 m², using the Modification of Diet in Renal Disease (MDRD) equation and the patient's demographic data (age, sex, race). Any patient whose baseline is derived this way is flagged, as it represents an estimation rather than a direct measurement.1

The reliance on estimated GFR (MDRD) as a last resort introduces a potential source of misclassification and reduces the fidelity of the AKI diagnosis for those patients. This methodological choice is acknowledged and a flag is included in the SQL DDL to identify these cases, allowing for sensitivity analysis in subsequent modeling.

2. Defining AKI Onset

The aki_onset_time is defined as the earliest timestamp (or offset) at which any of the KDIGO criteria for SCr or Urine Output (UO) are met. This requires continuous monitoring of both lab values and hourly urine output calculations.1

C. Temporal Linkage and Final Cohort Selection

To be classified as SA-AKI, the onset of AKI must be temporally linked to the onset of sepsis. Research consensus suggests that AKI developing within a narrow window around the sepsis diagnosis is most likely to be directly caused by the septic insult.1 For this framework, the "Early SA-AKI" phenotype is targeted: the

aki_onset_time must occur within the window of 48 hours before to 48 hours after the sepsis_onset_time.1 For MIMIC-IV, this involves

TIMESTAMP comparisons, while for eICU-CRD, it involves offset comparisons (±2880 minutes).1

The "definitional paradox" highlights that any patient with a documented infection who develops AKI automatically meets the criteria for sepsis.1 This means the SA-AKI cohort will be inherently heterogeneous in severity, ranging from mild, transient AKI (e.g., Stage 1, low urine output) to severe, persistent kidney failure.1 This inherent heterogeneity within the SA-AKI label necessitates additional features to capture clinical nuance, such as

aki_trigger, persistent_aki, and kdigo_stage_max_48hr 1, to improve mortality prediction by allowing the model to differentiate between various levels of SA-AKI severity.

Inclusion Criteria:

  1. Patient is an adult (Age ≥ 18 years).
  2. Patient meets Sepsis-3 criteria, establishing a sepsis_onset_time.
  3. Patient meets KDIGO criteria for AKI (any stage), establishing an aki_onset_time.
  4. Temporal Linkage: The aki_onset_time must occur within the window of 48 hours before to 48 hours after the sepsis_onset_time.1

Exclusion Criteria:

  1. Pre-existing End-Stage Renal Disease (ESRD) or Chronic Maintenance Dialysis: Patients with ESRD or those on chronic maintenance dialysis prior to admission are excluded, as they cannot develop a new AKI.1
    • ICD-9 codes: '585.6', 'V45.1' (long-term dialysis), 'V56.x' (dialysis codes).1
    • ICD-10 codes: 'N18.6' (End stage renal disease), 'Z99.2' (dependence on renal dialysis), 'Z49.x' (care involving dialysis).1
    • Ontology Mapping (ESRD):
      • UMLS: Concepts such as "end stage renal disease," "ESRD," or "chronic dialysis" are identified, including CUIs like C1561643 ("Chronic kidney disease") 88 and specific CUIs for ESRD.89
      • SNOMED_CT: The concept ID 46177005 for "End-stage renal disease" 91 is used, and
        snomed_ct.snomed_icd_map is leveraged to find associated ICD codes.
  2. Short ICU Stay: Patients with an ICU length of stay less than 24 hours are excluded, as this is insufficient time to reliably observe the development of AKI and gather necessary data.1 For MIMIC-IV, this is calculated from
    intime and outtime in icu.icustays (los column). For eICU-CRD, it is calculated from unitdischargeoffset in the patient table.
  3. Early Renal Replacement Therapy (RRT): Patients who receive RRT within the first 24 hours of ICU admission are excluded because the intervention fundamentally alters the natural course of AKI and its associated biomarkers, confounding the predictive modeling task.1
  4. eICU-CRD Specific Data Interface Limitations: Patients admitted to hospitals or units with known incomplete data interfaces for critical variables (e.g., laboratory results, infusion medications) should be excluded to prevent systematic bias.1 This means that missing data should not be assumed to be an absence of an event.1 An initial exploratory analysis is performed to identify and retain only those hospitals that consistently provide the data required for this study. The SQL for cohort definition includes a mechanism to filter by
    hospitalid if such systematic limitations are identified.

Table 1 provides a summary of the logic required to define the SA-AKI cohort.

Table 1: SA-AKI Cohort Definition Logic (Expanded)

Criterion Rationale MIMIC-IV Implementation (Tables, Columns/itemids, Ontology Mapping) eICU-CRD Implementation (Tables, Columns/labname pattern, Ontology Mapping) Temporal Logic
Sepsis-3 Defines sepsis as infection plus organ dysfunction.
Suspected Infection Proxy for infection onset based on clinical action. hosp.prescriptions, hosp.emar, hosp.emar_detail, hosp.microbiologyevents. Antibiotic mapping via RxNorm (TTY IN, PIN, SCD, SBD; sty='Antibiotic'), ATC (J01%), UMLS (sty='Antibiotic'), SNOMED_CT (term LIKE '%antibiotic%'). medication, microlab. Antibiotic mapping via RxNorm (TTY IN, PIN, SCD, SBD; sty='Antibiotic'), ATC (J01%), UMLS (sty='Antibiotic'), SNOMED_CT (term LIKE '%antibiotic%'), and drugname pattern matching. Antibiotic within 72h before or 24h after culture.
SOFA ≥2 Quantifies acute organ dysfunction.
- Respiration PaO2/FiO2 ratio. hosp.labevents (itemid: 50821), icu.chartevents (itemid: 223835). LOINC PaO2 (78354-8), FiO2 (19996-8, 3150-0). lab (labname like '%pao2%'), respiratoryCharting (respchartvaluelabel='FiO2'), apacheApsVar (fio2). Worst value in 24h period.
- Coagulation Platelet count. hosp.labevents (itemid: 51265). LOINC Platelet Count (777-3). lab (labname like '%platelet%'). Worst value in 24h period.
- Liver Total bilirubin. hosp.labevents (itemid: 50885). LOINC Bilirubin (1975-2, 42719-5). lab (labname like '%bilirubin%'). Worst value in 24h period.
- Cardiovascular MAP and vasopressor use. icu.chartevents (MAP itemids: 220052, 220181, 225312), icu.inputevents (vasopressor itemids, e.g., 221906 for Norepinephrine). Vasopressor mapping via RxNorm (TTY IN, PIN, SCD, SBD; sty='Vasopressor'), ATC (C01CA%, H01BA%), UMLS (sty='Vasopressor'), SNOMED_CT (term LIKE '%vasopressor%'). vitalPeriodic/vitalAperiodic (MAP columns), infusiondrug (drugname keywords for vasopressors). Worst value in 24h period.
- CNS Glasgow Coma Scale. icu.chartevents (GCS component itemids: 220739, 223900, 223901). nurseCharting (nursingchartcelltypevallabel='Glasgow coma score'), apacheApsVar (eyes, motor, verbal). Worst value in 24h period.
- Renal Creatinine and Urine Output. hosp.labevents (itemid: 50912), icu.outputevents (Urine itemids: 226559, 226561), icu.chartevents (Weight itemid: 226512). LOINC Creatinine (2160-0, 2161-8, 2162-6), Urine Output (9192-6). lab (labname='creatinine'), intakeOutput (celllabel='Urine'), patient (admissionweight). Worst value in 24h period.
KDIGO-AKI Defines acute kidney injury.
SCr Criteria ≥0.3 mg/dL rise in 48h or ≥1.5x baseline in 7d. hosp.labevents (itemid: 50912). Baseline SCr: hierarchical (outpatient, admission, nadir in-hospital, MDRD estimation). lab (labname='creatinine'). Baseline SCr: hierarchical (negative offset, admission, nadir in-ICU, MDRD estimation). Earliest time KDIGO criteria met.
UO Criteria <0.5 mL/kg/h for 6h. icu.outputevents (Urine itemids), icu.chartevents (Weight itemid: 226512). intakeOutput (celllabel='Urine'), patient (admissionweight). Earliest time KDIGO criteria met.
Temporal Link Ensures AKI is associated with sepsis. aki_onset_time within ±48h of sepsis_onset_time. aki_onset_offset within ±2880 min of sepsis_onset_offset.
Exclusions Create a clean, relevant cohort.
ESRD Remove patients with pre-existing terminal kidney failure. hosp.diagnoses_icd (ICD9: '585.6', 'V45.1', 'V56.x'; ICD10: 'N18.6', 'Z99.2', 'Z49.x'). Ontology mapping for ESRD: UMLS, SNOMED_CT (46177005). diagnosis (ICD9: '585.6', 'V45.1', 'V56.x'; ICD10: 'N18.6', 'Z99.2', 'Z49.x'). Ontology mapping for ESRD: UMLS, SNOMED_CT (46177005).
ICU Stay < 24h Ensure sufficient data capture. icu.icustays (los column). patient (unitdischargeoffset column).
Early RRT Remove patients whose AKI course is altered by early intervention. icu.procedureevents (RRT itemids), mimic-code RRT concept. treatment (treatmentstring like '%renal % dialysis%'), apacheApsVar (dialysis flag). RRT within first 24h of ICU admission.
eICU Data Interface Limitations Prevent bias from systematic missingness. N/A Exclude hospitals with consistently incomplete data for critical variables (e.g., labs, infusions).

IV. Feature Engineering for Mortality Prediction

Once the SA-AKI cohort is meticulously defined, the next critical step is to engineer a rich set of features that characterize each patient for mortality prediction. These features capture the patient's baseline state, dynamic physiological changes, and critical interventions, providing comprehensive context for the predictive model.

A. Static and Baseline Features (at ICU Admission)

These fundamental covariates capture intrinsic patient characteristics and administrative details at the time of ICU admission.

  • Demographics and Anthropometrics:
    • Age: In MIMIC-IV, age is calculated using hosp.patients.anchor_age and hosp.patients.anchor_year relative to hosp.admissions.admittime.1 A critical aspect is the handling of
      anchor_age 91, which represents all patients aged 90 and above.1 This value is treated as a censored variable or a separate category in the DDL, rather than a precise continuous age, to avoid misinterpretation and ensure appropriate model handling for this sensitive demographic. In eICU-CRD, age is extracted from
      eicu.patient.age, which is a varchar field. This requires preprocessing to convert it to a numeric type, typically by replacing the string "> 89" with a value like 91.1
    • Gender & Ethnicity: For MIMIC-IV, these are extracted directly from hosp.patients.gender and hosp.admissions.race. For eICU-CRD, they are extracted from eicu.patient.gender and eicu.patient.ethnicity.1
    • Height & Weight: In MIMIC-IV, admission height (cm) and weight (kg) are typically found in icu.chartevents using itemids 226730 and 226512, respectively, with the first recorded value during the ICU stay being used.1 In eICU-CRD, these are extracted directly from the
      admissionheight (cm) and admissionweight (kg) columns in the patient table.1
  • Admission and Hospital-Level Information:
    • Admission Source & Type: For MIMIC-IV, admission_location and admission_type from hosp.admissions provide the source (e.g., 'EMERGENCY ROOM', 'PHYSICIAN REFERRAL') and type (e.g., 'EW EMER.', 'URGENT') of admission.1 For eICU-CRD,
      unitadmitsource from the patient table (e.g., 'Emergency Department', 'Operating Room') serves this purpose.1
    • First Care Unit (MIMIC-IV): first_careunit from icu.icustays provides the specific type of ICU (e.g., 'Medical Intensive Care Unit (MICU)').1
    • Hospital Characteristics (eICU-CRD): The multi-center nature of eICU-CRD makes hospital-level features uniquely valuable. These are extracted from the hospital table, linked via hospitalid, and include numbedscategory (hospital size), teachingstatus (boolean), and region (US geographical region).1 These features are crucial for accounting for site-specific variations in practice and outcomes, or even as predictive features themselves, given the inherent heterogeneity across hospitals in eICU-CRD.1
  • Comorbidity Assessment via ICD Codes (Charlson Comorbidity Index):
    A patient's burden of chronic disease is a powerful predictor of their ability to survive a critical illness. A standardized approach using a validated comorbidity index, such as the Charlson Comorbidity Index, is employed. This method provides a weighted score that summarizes the overall burden of chronic disease, creating a powerful and reproducible feature.1 The implementation involves extracting all International Classification of Diseases (ICD) codes associated with a patient's hospital stay and mapping them to the 17 conditions defined by the index.
    • MIMIC-IV: All icd_code and icd_version (9 and 10) for each hadm_id are extracted from the hosp.diagnoses_icd table.1

    • eICU-CRD: All icd9code for each patientunitstayid are extracted from the diagnosis table. This field may contain multiple comma-separated codes that require careful parsing.1

      Once extracted, these codes are mapped to the 17 Charlson comorbidities using established code sets (e.g., Deyo for ICD-9 and Quan for ICD-10).1 This process results in 17 binary features (e.g.,
      comorb_mi, comorb_chf, comorb_cancer), where a value of 1 indicates the presence of the comorbidity. Table 2 provides the specific code mappings for key conditions relevant to the SA-AKI cohort.

Table 2: Comorbidity to ICD Code Mapping (Charlson Index, Deyo/Quan Adaptation)

Charlson Comorbidity Weight ICD-9 Codes (Deyo-like) ICD-10 Codes (Quan-like)
Myocardial Infarction 1 410.x, 412.x I21.x, I22.x, I25.2
Congestive Heart Failure 1 428.x I50.x, I09.9, I11.0, I13.0, I13.2
Peripheral Vascular Disease 1 443.9, 441.x, 785.4, V43.4 I70.x, I71.x, I73.8, I73.9
Cerebrovascular Disease 1 430.x-438.x I60.x-I69.x, G45.x, G46.x
Dementia 1 290.x, 294.1, 331.2 F00.x-F03.x, F05.1, G30.x, G31.1
Chronic Pulmonary Disease 1 490.x-496.x, 500.x-505.x, 506.4 J40.x-J47.x, J60.x-J67.x
Connective Tissue Disease 1 710.0, 710.1, 710.4, 714.x, 725.x M05.x, M06.x, M31.5, M32.x-M34.x, M35.1, M35.3, M36.0
Peptic Ulcer Disease 1 531.x-534.x K25.x-K28.x
Mild Liver Disease 1 571.2, 571.4, 571.5, 571.6 B18.x, K70.0, K70.3, K70.9, K71.7, K73.x, K74.x, K76.0
Diabetes (uncomplicated) 1 250.0, 250.1, 250.2, 250.3, 250.7 E10.0, E10.1, E10.9, E11.0, E11.1, E11.9
Diabetes (with complications) 2 250.4, 250.5, 250.6 E10.2-E10.5, E10.7, E11.2-E11.5, E11.7
Hemiplegia or Paraplegia 2 344.1, 342.x G81.x, G82.x, G04.1, G83.0-G83.4, G83.9
Renal Disease 2 582.x, 583.x, 585.x, 586.x, V42.0, V45.1 N18.x, N19.x, N03.2-N03.7, N05.2-N05.7, Z49.x, Z99.2
Any Malignancy 2 140.x-172.x, 174.x-195.8, 200.x-208.x C00.x-C26.x, C30.x-C41.x, C43.x, C45.x-C58.x, C60.x-C76.x, C81.x-C85.x, C88.x, C90.x-C97.x
Moderate/Severe Liver Disease 3 572.2-572.8, 456.0-456.2 I85.0, I85.9, I86.4, K70.4, K71.1, K72.1, K72.9, K76.5-K76.7
Metastatic Solid Tumor 6 196.x-199.x C77.x-C80.x
AIDS/HIV 6 042.x B20.x, B21.x, B22.x, B24.x

B. Dynamic Physiological and Laboratory Features (First 24 Hours)

The initial 24 hours of an ICU stay are a critical period, establishing a patient's physiological trajectory. Extracting and summarizing high-frequency time-series data from this window provides a rich set of features highly predictive of subsequent outcomes.1 A primary challenge is mapping clinically equivalent concepts across the two database schemas, as MIMIC-IV uses a standardized

itemid system, while eICU-CRD employs a combination of fixed columns and non-standardized, free-text labname fields.1 For each dynamic variable, the raw time-series data from the first 24 hours of the ICU stay are extracted and summarized into static features including minimum, maximum, mean, and standard deviation, capturing both central tendency and lability.

  • Vital Signs (First 24 Hours): Core physiological measurements are continuously or frequently tracked in the ICU.
    • MIMIC-IV: icu.chartevents is the primary source for high-resolution vital signs.1
    • eICU-CRD: vitalPeriodic contains 5-minute median values for most vital signs, while vitalAperiodic contains intermittent measurements.1
    • Features: Heart Rate, Systolic Blood Pressure, Diastolic Blood Pressure, Mean Arterial Pressure, Respiratory Rate, Temperature, SpO2.
  • Laboratory Biomarkers (First 24 Hours): Laboratory results provide insights into various organ systems. The main challenge, especially in eICU-CRD, is the lack of standardization in test names. The lab.labname field is hospital-specific, meaning the same test (e.g., creatinine) may have dozens of different names (e.g., 'creatinine', 'CREAT', 'serum creat', 'Cr') across the database.1 A successful extraction pipeline must use pattern matching (e.g., SQL
    LIKE statements or regular expressions) to aggregate all variations of a test into a single canonical feature.
    • MIMIC-IV: hosp.labevents is the primary source, with tests identified by itemid.1
    • eICU-CRD: The lab table is the primary source. The apacheApsVar table also contains single "worst" values for many key labs in the first 24 hours and can supplement the lab table.1
    • Features: Creatinine, Blood Urea Nitrogen (BUN), Anion Gap, Sodium, Potassium, Chloride, Bicarbonate (HCO3), Alanine Aminotransferase (ALT), Aspartate Aminotransferase (AST), Alkaline Phosphatase (Alk Phos), Total Bilirubin, Albumin, White Blood Cell Count (WBC), Platelet Count, Hemoglobin, Hematocrit, International Normalized Ratio (INR), Prothrombin Time (PT), Partial Thromboplastin Time (PTT), C-Reactive Protein (CRP), Lactate, Glucose, pH, PaO2, PaCO2.

Table 3 provides a comprehensive mapping of canonical clinical features to their specific implementations in both databases.

Table 3: Master Feature Mapping for Dynamic Variables (First 24 Hours)

Feature Category Canonical Feature Name MIMIC-IV Source (Table, itemid) eICU-CRD Source (Table, Column/labname pattern) LOINC Code
Vital Signs Heart Rate icu.chartevents (220045) vitalPeriodic (heartrate) 8867-4
Systolic Blood Pressure icu.chartevents (220179, 220050) vitalPeriodic (systemicsystolic), vitalAperiodic (noninvasivesystolic) 8480-6
Diastolic Blood Pressure icu.chartevents (220180, 220051) vitalPeriodic (systemicdiastolic), vitalAperiodic (noninvasivediastolic) 8462-4
Mean Arterial Pressure icu.chartevents (220052, 220181, 225312) vitalPeriodic (systemicmean), vitalAperiodic (noninvasivemean) 8478-0
Respiratory Rate icu.chartevents (220210, 224690) vitalPeriodic (respiration) 9279-1
Temperature icu.chartevents (223762, 223761) vitalPeriodic (temperature) 8310-5
SpO2 icu.chartevents (220277) vitalPeriodic (sao2) 2708-6
Renal Labs Creatinine hosp.labevents (50912) lab (labname like '%creatinine%') 2160-0 (Serum/Plasma)
Blood Urea Nitrogen (BUN) hosp.labevents (51006) lab (labname like '%BUN%') 3094-0, 6299-2 92
Anion Gap hosp.labevents (50868) lab (labname like '%anion gap%') 33037-3, 10466-1 94
Sodium hosp.labevents (50983) lab (labname like '%sodium%') 2947-0 96
Potassium hosp.labevents (50971) lab (labname like '%potassium%') 2823-3 (Serum/Plasma), 2829-0 (Urine) 97
Chloride hosp.labevents (50902) lab (labname like '%chloride%') 2075-0 98
Bicarbonate (HCO3) hosp.labevents (50882) lab (labname like '%bicarb%', '%Total CO2%', '%HCO3%') 1960-4, 69964-5 100
Hepatic Labs Alanine Aminotransferase (ALT) hosp.labevents (50861) lab (labname like '%ALT (SGPT)%') 1743-4, 1742-6 102
Aspartate Aminotransferase (AST) hosp.labevents (50878) lab (labname like '%AST (SGOT)%') 30239-8, 1920-8 104
Alkaline Phosphatase (Alk Phos) hosp.labevents (50863) lab (labname like '%alkaline phos%') 6768-6, 77141-0 106
Total Bilirubin hosp.labevents (50885) lab (labname like '%total bilirubin%') 1975-2, 42719-5 36
Albumin hosp.labevents (50862) lab (labname like '%albumin%') 1751-7 108
Hematology White Blood Cell Count (WBC) hosp.labevents (51301) lab (labname like '%WBC x 1000%', '%wbc%') 6690-2 110
Platelet Count hosp.labevents (51265) lab (labname like '%platelets x 1000%') 777-3 35
Hemoglobin hosp.labevents (51222) lab (labname like '%Hgb%', '%hemoglobin%') 718-7 111
Hematocrit hosp.labevents (51221) lab (labname like '%Hct%', '%hematocrit%') 4544-3 112
Coagulation International Normalized Ratio (INR) hosp.labevents (51237) lab (labname like '%INR%') 6301-6 113
Prothrombin Time (PT) hosp.labevents (51275) lab (labname like '%PT%') 5902-2 114
Partial Thromboplastin Time (PTT) hosp.labevents (51274) lab (labname like '%PTT%', '%APTT%') 50391-2 115
Inflammatory C-Reactive Protein (CRP) hosp.labevents (50889) lab (labname like '%CRP%') 30522-7, 71426-1 5
Lactate hosp.labevents (50813) lab (labname like '%lactate%') 29246-6 117
Metabolic Glucose hosp.labevents (50931, 50809) lab (labname like '%glucose%'), nurseCharting (nursingchartcelltypevallabel='glucose') 2339-0 118
Blood Gas pH hosp.labevents (50820) lab (labname like '%pH%', '%arterial pH%') 2744-1 119
PaO2 hosp.labevents (50821) lab (labname like '%paO2%', '%arterial PaO2%') 78354-8 32
PaCO2 hosp.labevents (50818) lab (labname like '%paCO2%', '%arterial PaCO2%') 19996-8, 52484-3 33

C. Intervention and Treatment Feature Engineering

Identifying key clinical interventions is crucial for building a robust predictive model, as these interventions signal the severity of the patient's condition and the intensity of care received. Binary flags (presence/absence within the first 24 hours) and, where applicable, summary measures are extracted for critical interventions in the SA-AKI context.

  • Mechanical Ventilation: Mechanical ventilation is a life-sustaining intervention for respiratory failure, a common organ dysfunction in sepsis.1 Its use is a key indicator of illness severity.
    • MIMIC-IV: A patient is considered to be on mechanical ventilation if they have a relevant record in the icu.procedureevents table. Itemids such as 225792 ('Invasive Ventilation'), 224385 ('Intubation'), or 224386 ('Tracheostomy') are strong indicators.1 The duration of ventilation can be calculated from
      starttime and endtime.
    • eICU-CRD: The respiratorycare table is the primary source. The presence of a non-null value in the ventstartoffset column is a direct indicator.1 Additional evidence can be gathered from the
      airwaytype column (e.g., 'Oral ETT', 'Tracheostomy'). The respiratoryCharting table provides detailed time-stamped ventilator settings.
  • Vasopressor Support: The use of vasopressor medications to maintain blood pressure is a defining characteristic of septic shock and a critical component of the cardiovascular SOFA score.1
    • MIMIC-IV: Vasopressor use is identified from the icu.inputevents table. This requires mapping common vasopressor medications (e.g., Norepinephrine, Epinephrine, Dopamine, Vasopressin, Phenylephrine) to their corresponding itemids in the icu.d_items dictionary (e.g., Norepinephrine itemid 221906).1 A binary feature indicates if any of these medications were administered during the first 24 hours.
    • eICU-CRD: Vasopressor use is reliably identified from the infusiondrug table, which documents continuous drug infusions. This involves querying for rows where the drugname column contains keywords for vasopressors (e.g., 'Norepinephrine', 'Epinephrine', 'Dopamine', 'Vasopressin').1
  • Renal Replacement Therapy (RRT): RRT, including hemodialysis and continuous renal replacement therapy (CRRT), is the primary intervention for severe AKI.1 Its initiation is a component of the KDIGO Stage 3 definition and a critical feature.
    • MIMIC-IV: Identifying RRT requires querying multiple tables. icu.procedureevents may contain itemids for dialysis procedures (e.g., 225802 for 'Dialysis - CRRT', 225803 for 'Dialysis - CVVHD').1
      icu.chartevents contains numerous itemids related to dialysis machine settings, and icu.inputevents will contain records for dialysate and replacement fluids.
    • eICU-CRD: RRT is most effectively identified from the treatment table, searching the treatmentstring column for hierarchical paths containing keywords like 'renal | dialysis', 'renal | CRRT', or 'renal | CVVH'.1 The presence of a dialysis flag in the
      apacheApsVar table can also serve as a confirmatory data point for the first 24 hours.
  • Fluid Balance: Fluid balance—the net difference between fluid intake and output—is a critical therapeutic target and a powerful prognostic indicator in sepsis and AKI.1 The methodology for calculating fluid balance differs significantly between the two databases. A naive approach can lead to errors, particularly in eICU-CRD, where
    intakeOutput columns (intaketotal, outputtotal, nettotal) represent cumulative totals that are often duplicated.1 In contrast, MIMIC-IV
    inputevents and outputevents contain discrete, non-cumulative event volumes.1
    • MIMIC-IV: To calculate the 24-hour fluid balance, the amount column from icu.inputevents within the 24-hour window is summed, and the sum of the value column from icu.outputevents in the same window is subtracted.1
    • eICU-CRD: To correctly calculate the 24-hour fluid balance, distinct values of the nettotal column at each unique intakeoutputoffset are selected. The 24-hour balance is then the difference between the maximum and minimum of these unique cumulative values within the desired 24-hour window.1

V. Final Staging Table Construction and Recommendations

The preceding sections have detailed the logic for defining the SA-AKI cohort and engineering a rich set of common features from both MIMIC-IV and eICU-CRD. This final section provides the blueprint for assembling these components into the final staging tables.

A. Constructing the Final Feature Matrix

The ultimate goal is to create two flat, analysis-ready tables—one for MIMIC-IV and one for eICU-CRD. In each table, every row will represent a unique ICU stay (stay_id for MIMIC-IV, patientunitstayid for eICU-CRD) from the SA-AKI cohort. Each column will represent one of the engineered features, along with the final outcome variable. The conceptual process for constructing this matrix involves a series of joins:

  1. Start with the Cohort: Begin with the final list of unique ICU stay identifiers that meet all inclusion and exclusion criteria as defined in Section III. This forms the backbone of the staging table.
  2. Join Static Features: Left-join the static and baseline features from Section IV.A (Demographics, Admission Information, and the 17 binary Comorbidity flags) to the cohort table using the appropriate stay identifier.
  3. Join Summarized Dynamic Features: Left-join the summary statistics (min, max, mean, std dev) for each vital sign and laboratory biomarker from the first 24 hours, as defined in Section IV.B.
  4. Join Intervention Features: Left-join the binary flags for the use of Mechanical Ventilation, Vasopressors, and RRT within the first 24 hours, as well as the calculated 24-hour Fluid Balance, as defined in Section IV.C.
  5. Join the Outcome Variable: Finally, join the target variable for the prediction task. For in-hospital mortality, this is the hospital_expire_flag from the mimic.admissions table in MIMIC-IV, or the hospitalDischargeStatus ('Expired' vs. 'Alive') from the eicu.patient table in eICU-CRD.1

The resulting tables will be wide-format data frames, perfectly suited for input into standard machine learning algorithms.

B. Recommendations for Feature Representation and Next Steps

While the core request was to create the staging tables, a truly robust dataset anticipates the needs of the modeling phase. The heterogeneity of the SA-AKI syndrome, a concept known as the "definitional paradox," presents an opportunity for more nuanced feature engineering.1 The SA-AKI label encompasses a wide spectrum of severity, from a patient with a transient dip in urine output to one with complete anuria requiring dialysis.1 A model that can distinguish between these sub-phenotypes will likely be more accurate and interpretable. Therefore, it is highly recommended to create the following additional features to capture this clinical heterogeneity 1:

  • aki_trigger: A categorical feature indicating what triggered the initial KDIGO diagnosis: 'Creatinine', 'Urine Output', or 'Both'. This captures the nature of the initial renal insult.
  • persistent_aki_48hr_flag: A binary flag indicating if the patient's AKI is persistent. This is set to 1 if the patient still meets any KDIGO criteria 48 hours after the initial aki_onset_time, and 0 otherwise. Persistent AKI is a known marker for more severe injury and higher mortality risk.
  • kdigo_stage_max_48hr: A numerical feature (1, 2, or 3) representing the maximum KDIGO stage the patient reached within the first 48 hours of AKI onset. This provides a graded measure of initial AKI severity.

Final Guidance for Subsequent Analysis:

  • Handling Missing Values: As requested, this protocol does not impute missing values. The resulting feature matrix will exhibit missingness, an inherent property of real-world EHR data. For the subsequent modeling phase, robust methods such as multiple imputation by chained equations (MICE) or the use of tree-based models (e.g., LightGBM, XGBoost) that can intrinsically handle missing data are recommended.1
  • Dimensionality Reduction: The feature set created by this protocol is intentionally large and comprehensive to provide maximum flexibility. It contains many variables that will be correlated (e.g., systolic, diastolic, and mean arterial pressure). This is by design. The user can now apply their preferred dimensionality reduction or feature selection techniques—such as Principal Component Analysis (PCA), recursive feature elimination, or LASSO regularization—to select the most informative and independent subset of features for their final model.1

By following this comprehensive protocol, a researcher can construct two parallel, high-quality, and clinically nuanced datasets from MIMIC-IV and eICU-CRD, providing a solid foundation for developing and validating a powerful mortality prediction model for the challenging Sepsis-Associated Acute Kidney Injury cohort.

VI. SQL DDL for SA-AKI Patient Cohort Staging Tables

The following SQL DDL statements define the schema for the SA-AKI patient cohort staging tables in both MIMIC-IV and eICU-CRD. These tables are designed to store the extracted and harmonized features as described in the report, suitable for mortality prediction.

MIMIC-IV SA-AKI Staging Table DDL

SQL

CREATE TABLE IF NOT EXISTS mimic.sa_aki_cohort_staging_mimic (
-- Patient Identifiers
subject_id BIGINT NOT NULL,
hadm_id BIGINT NOT NULL,
stay_id BIGINT NOT NULL PRIMARY KEY,

\-- Cohort Definition Timestamps  
sepsis\_onset\_time TIMESTAMP,  
aki\_onset\_time TIMESTAMP,

\-- Outcome Variable  
hospital\_expire\_flag BIGINT, \-- 0 \= Alive, 1 \= Expired

\-- Static and Baseline Features  
age INTEGER,  
gender VARCHAR(1),  
ethnicity VARCHAR(80),  
admission\_height NUMERIC, \-- cm  
admission\_weight NUMERIC, \-- kg  
admission\_type VARCHAR(50),  
admission\_location VARCHAR(50),  
first\_careunit VARCHAR(255),  
baseline\_scr\_estimated\_flag BOOLEAN, \-- TRUE if baseline SCr was estimated via MDRD

\-- Comorbidity Flags (Charlson Comorbidity Index)  
comorb\_mi BOOLEAN,  
comorb\_chf BOOLEAN,  
comorb\_pvd BOOLEAN,  
comorb\_cvd BOOLEAN,  
comorb\_dementia BOOLEAN,  
comorb\_cpd BOOLEAN,  
comorb\_ctd BOOLEAN,  
comorb\_pud BOOLEAN,  
comorb\_mild\_liver BOOLEAN,  
comorb\_dm\_uncomp BOOLEAN,  
comorb\_dm\_comp BOOLEAN,  
comorb\_hemiplegia BOOLEAN,  
comorb\_renal BOOLEAN,  
comorb\_malignancy BOOLEAN,  
comorb\_mod\_sev\_liver BOOLEAN,  
comorb\_metastatic BOOLEAN,  
comorb\_aids BOOLEAN,

\-- Dynamic Physiological and Laboratory Features (First 24 Hours)  
\-- Vitals  
heartrate\_min NUMERIC,  
heartrate\_max NUMERIC,  
heartrate\_mean NUMERIC,  
heartrate\_std NUMERIC,  
sysbp\_min NUMERIC,  
sysbp\_max NUMERIC,  
sysbp\_mean NUMERIC,  
sysbp\_std NUMERIC,  
diabp\_min NUMERIC,  
diabp\_max NUMERIC,  
diabp\_mean NUMERIC,  
diabp\_std NUMERIC,  
meanbp\_min NUMERIC,  
meanbp\_max NUMERIC,  
meanbp\_mean NUMERIC,  
meanbp\_std NUMERIC,  
resprate\_min NUMERIC,  
resprate\_max NUMERIC,  
resprate\_mean NUMERIC,  
resprate\_std NUMERIC,  
temperature\_min NUMERIC,  
temperature\_max NUMERIC,  
temperature\_mean NUMERIC,  
temperature\_std NUMERIC,  
spo2\_min NUMERIC,  
spo2\_max NUMERIC,  
spo2\_mean NUMERIC,  
spo2\_std NUMERIC,

\-- Renal Labs  
creatinine\_min NUMERIC,  
creatinine\_max NUMERIC,  
creatinine\_mean NUMERIC,  
creatinine\_std NUMERIC,  
bun\_min NUMERIC,  
bun\_max NUMERIC,  
bun\_mean NUMERIC,  
bun\_std NUMERIC,  
aniongap\_min NUMERIC,  
aniongap\_max NUMERIC,  
aniongap\_mean NUMERIC,  
aniongap\_std NUMERIC,  
sodium\_min NUMERIC,  
sodium\_max NUMERIC,  
sodium\_mean NUMERIC,  
sodium\_std NUMERIC,  
potassium\_min NUMERIC,  
potassium\_max NUMERIC,  
potassium\_mean NUMERIC,  
potassium\_std NUMERIC,  
chloride\_min NUMERIC,  
chloride\_max NUMERIC,  
chloride\_mean NUMERIC,  
chloride\_std NUMERIC,  
bicarbonate\_min NUMERIC,  
bicarbonate\_max NUMERIC,  
bicarbonate\_mean NUMERIC,  
bicarbonate\_std NUMERIC,

\-- Hepatic Labs  
alt\_min NUMERIC,  
alt\_max NUMERIC,  
alt\_mean NUMERIC,  
alt\_std NUMERIC,  
ast\_min NUMERIC,  
ast\_max NUMERIC,  
ast\_mean NUMERIC,  
ast\_std NUMERIC,  
alkphos\_min NUMERIC,  
alkphos\_max NUMERIC,  
alkphos\_mean NUMERIC,  
alkphos\_std NUMERIC,  
totalbilirubin\_min NUMERIC,  
totalbilirubin\_max NUMERIC,  
totalbilirubin\_mean NUMERIC,  
totalbilirubin\_std NUMERIC,  
albumin\_min NUMERIC,  
albumin\_max NUMERIC,  
albumin\_mean NUMERIC,  
albumin\_std NUMERIC,

\-- Hematology  
wbc\_min NUMERIC,  
wbc\_max NUMERIC,  
wbc\_mean NUMERIC,  
wbc\_std NUMERIC,  
plateletcount\_min NUMERIC,  
plateletcount\_max NUMERIC,  
plateletcount\_mean NUMERIC,  
plateletcount\_std NUMERIC,  
hemoglobin\_min NUMERIC,  
hemoglobin\_max NUMERIC,  
hemoglobin\_mean NUMERIC,  
hemoglobin\_std NUMERIC,  
hematocrit\_min NUMERIC,  
hematocrit\_max NUMERIC,  
hematocrit\_mean NUMERIC,  
hematocrit\_std NUMERIC,

\-- Coagulation  
inr\_min NUMERIC,  
inr\_max NUMERIC,  
inr\_mean NUMERIC,  
inr\_std NUMERIC,  
pt\_min NUMERIC,  
pt\_max NUMERIC,  
pt\_mean NUMERIC,  
pt\_std NUMERIC,  
ptt\_min NUMERIC,  
ptt\_max NUMERIC,  
ptt\_mean NUMERIC,  
ptt\_std NUMERIC,

\-- Inflammatory / Metabolic  
crp\_min NUMERIC,  
crp\_max NUMERIC,  
crp\_mean NUMERIC,  
crp\_std NUMERIC,  
lactate\_min NUMERIC,  
lactate\_max NUMERIC,  
lactate\_mean NUMERIC,  
lactate\_std NUMERIC,  
glucose\_min NUMERIC,  
glucose\_max NUMERIC,  
glucose\_mean NUMERIC,  
glucose\_std NUMERIC,  
ph\_min NUMERIC,  
ph\_max NUMERIC,  
ph\_mean NUMERIC,  
ph\_std NUMERIC,  
pao2\_min NUMERIC,  
pao2\_max NUMERIC,  
pao2\_mean NUMERIC,  
pao2\_std NUMERIC,  
paco2\_min NUMERIC,  
paco2\_max NUMERIC,  
paco2\_mean NUMERIC,  
paco2\_std NUMERIC,

\-- Intervention Features (First 24 Hours)  
mechanical\_ventilation\_24hr\_flag BOOLEAN,  
vasopressor\_24hr\_flag BOOLEAN,  
rrt\_24hr\_flag BOOLEAN,  
fluid\_balance\_24hr\_ml NUMERIC,

\-- Additional SA-AKI Specific Features  
aki\_trigger VARCHAR(20), \-- 'Creatinine', 'Urine Output', 'Both'  
persistent\_aki\_48hr\_flag BOOLEAN,  
kdigo\_stage\_max\_48hr INTEGER  

);

eICU-CRD SA-AKI Staging Table DDL

SQL

CREATE TABLE IF NOT EXISTS eicu.sa_aki_cohort_staging_eicu (
-- Patient Identifiers
uniquepid VARCHAR(10) NOT NULL,
patienthealthsystemstayid INTEGER,
patientunitstayid INTEGER NOT NULL PRIMARY KEY,

\-- Cohort Definition Timestamps (Offsets in minutes from ICU admit)  
sepsis\_onset\_offset INTEGER,  
aki\_onset\_offset INTEGER,

\-- Outcome Variable  
hospitaldischargestatus VARCHAR(10), \-- 'Alive' or 'Expired'

\-- Static and Baseline Features  
age VARCHAR(10), \-- Retain as VARCHAR to handle '\> 89' then convert/handle in analysis  
gender VARCHAR(25),  
ethnicity VARCHAR(50),  
admission\_height NUMERIC(10,2), \-- cm  
admission\_weight NUMERIC(10,2), \-- kg  
unitadmitsource VARCHAR(100),  
numbedscategory VARCHAR(32),  
teachingstatus BOOLEAN,  
region VARCHAR(64),  
baseline\_scr\_estimated\_flag BOOLEAN, \-- TRUE if baseline SCr was estimated via MDRD

\-- Comorbidity Flags (Charlson Comorbidity Index)  
comorb\_mi BOOLEAN,  
comorb\_chf BOOLEAN,  
comorb\_pvd BOOLEAN,  
comorb\_cvd BOOLEAN,  
comorb\_dementia BOOLEAN,  
comorb\_cpd BOOLEAN,  
comorb\_ctd BOOLEAN,  
comorb\_pud BOOLEAN,  
comorb\_mild\_liver BOOLEAN,  
comorb\_dm\_uncomp BOOLEAN,  
comorb\_dm\_comp BOOLEAN,  
comorb\_hemiplegia BOOLEAN,  
comorb\_renal BOOLEAN,  
comorb\_malignancy BOOLEAN,  
comorb\_mod\_sev\_liver BOOLEAN,  
comorb\_metastatic BOOLEAN,  
comorb\_aids BOOLEAN,

\-- Dynamic Physiological and Laboratory Features (First 24 Hours)  
\-- Vitals  
heartrate\_min NUMERIC,  
heartrate\_max NUMERIC,  
heartrate\_mean NUMERIC,  
heartrate\_std NUMERIC,  
sysbp\_min NUMERIC,  
sysbp\_max NUMERIC,  
sysbp\_mean NUMERIC,  
sysbp\_std NUMERIC,  
diabp\_min NUMERIC,  
diabp\_max NUMERIC,  
diabp\_mean NUMERIC,  
diabp\_std NUMERIC,  
meanbp\_min NUMERIC,  
meanbp\_max NUMERIC,  
meanbp\_mean NUMERIC,  
meanbp\_std NUMERIC,  
resprate\_min NUMERIC,  
resprate\_max NUMERIC,  
resprate\_mean NUMERIC,  
resprate\_std NUMERIC,  
temperature\_min NUMERIC,  
temperature\_max NUMERIC,  
temperature\_mean NUMERIC,  
temperature\_std NUMERIC,  
spo2\_min NUMERIC,  
spo2\_max NUMERIC,  
spo2\_mean NUMERIC,  
spo2\_std NUMERIC,

\-- Renal Labs  
creatinine\_min NUMERIC,  
creatinine\_max NUMERIC,  
creatinine\_mean NUMERIC,  
creatinine\_std NUMERIC,  
bun\_min NUMERIC,  
bun\_max NUMERIC,  
bun\_mean NUMERIC,  
bun\_std NUMERIC,  
aniongap\_min NUMERIC,  
aniongap\_max NUMERIC,  
aniongap\_mean NUMERIC,  
aniongap\_std NUMERIC,  
sodium\_min NUMERIC,  
sodium\_max NUMERIC,  
sodium\_mean NUMERIC,  
sodium\_std NUMERIC,  
potassium\_min NUMERIC,  
potassium\_max NUMERIC,  
potassium\_mean NUMERIC,  
potassium\_std NUMERIC,  
chloride\_min NUMERIC,  
chloride\_max NUMERIC,  
chloride\_mean NUMERIC,  
chloride\_std NUMERIC,  
bicarbonate\_min NUMERIC,  
bicarbonate\_max NUMERIC,  
bicarbonate\_mean NUMERIC,  
bicarbonate\_std NUMERIC,

\-- Hepatic Labs  
alt\_min NUMERIC,  
alt\_max NUMERIC,  
alt\_mean NUMERIC,  
alt\_std NUMERIC,  
ast\_min NUMERIC,  
ast\_max NUMERIC,  
ast\_mean NUMERIC,  
ast\_std NUMERIC,  
alkphos\_min NUMERIC,  
alkphos\_max NUMERIC,  
alkphos\_mean NUMERIC,  
alkphos\_std NUMERIC,  
totalbilirubin\_min NUMERIC,  
totalbilirubin\_max NUMERIC,  
totalbilirubin\_mean NUMERIC,  
totalbilirubin\_std NUMERIC,  
albumin\_min NUMERIC,  
albumin\_max NUMERIC,  
albumin\_mean NUMERIC,  
albumin\_std NUMERIC,

\-- Hematology  
wbc\_min NUMERIC,  
wbc\_max NUMERIC,  
wbc\_mean NUMERIC,  
wbc\_std NUMERIC,  
plateletcount\_min NUMERIC,  
plateletcount\_max NUMERIC,  
plateletcount\_mean NUMERIC,  
plateletcount\_std NUMERIC,  
hemoglobin\_min NUMERIC,  
hemoglobin\_max NUMERIC,  
hemoglobin\_mean NUMERIC,  
hemoglobin\_std NUMERIC,  
hematocrit\_min NUMERIC,  
hematocrit\_max NUMERIC,  
hematocrit\_mean NUMERIC,  
hematocrit\_std NUMERIC,

\-- Coagulation  
inr\_min NUMERIC,  
inr\_max NUMERIC,  
inr\_mean NUMERIC,  
inr\_std NUMERIC,  
pt\_min NUMERIC,  
pt\_max NUMERIC,  
pt\_mean NUMERIC,  
pt\_std NUMERIC,  
ptt\_min NUMERIC,  
ptt\_max NUMERIC,  
ptt\_mean NUMERIC,  
ptt\_std NUMERIC,

\-- Inflammatory / Metabolic  
crp\_min NUMERIC,  
crp\_max NUMERIC,  
crp\_mean NUMERIC,  
crp\_std NUMERIC,  
lactate\_min NUMERIC,  
lactate\_max NUMERIC,  
lactate\_mean NUMERIC,  
lactate\_std NUMERIC,  
glucose\_min NUMERIC,  
glucose\_max NUMERIC,  
glucose\_mean NUMERIC,  
glucose\_std NUMERIC,  
ph\_min NUMERIC,  
ph\_max NUMERIC,  
ph\_mean NUMERIC,  
ph\_std NUMERIC,  
pao2\_min NUMERIC,  
pao2\_max NUMERIC,  
pao2\_mean NUMERIC,  
pao2\_std NUMERIC,  
paco2\_min NUMERIC,  
paco2\_max NUMERIC,  
paco2\_mean NUMERIC,  
paco2\_std NUMERIC,

\-- Intervention Features (First 24 Hours)  
mechanical\_ventilation\_24hr\_flag BOOLEAN,  
vasopressor\_24hr\_flag BOOLEAN,  
rrt\_24hr\_flag BOOLEAN,  
fluid\_balance\_24hr\_ml NUMERIC,

\-- Additional SA-AKI Specific Features  
aki\_trigger VARCHAR(20), \-- 'Creatinine', 'Urine Output', 'Both'  
persistent\_aki\_48hr\_flag BOOLEAN,  
kdigo\_stage\_max\_48hr INTEGER  

);
```## A Comprehensive Framework for Sepsis-Associated Acute Kidney Injury (SA-AKI) Cohort Extraction and Staging from MIMIC-IV and eICU-CRD Databases

### I. Executive Summary

Sepsis-Associated Acute Kidney Injury (SA-AKI) represents a formidable clinical challenge within intensive care, profoundly increasing patient mortality and long-term morbidity.[1] The accurate and reproducible identification of this specific patient cohort is a foundational step for developing effective predictive models and advancing clinical research in critical care. This report delineates a rigorous, ontology-driven methodology for defining SA-AKI patient cohorts within two prominent critical care databases: MIMIC-IV v3.1 and eICU-CRD. The framework integrates internationally recognized clinical criteria for sepsis and acute kidney injury with advanced data mapping techniques, ensuring high data fidelity and reproducibility across these heterogeneous electronic health record (EHR) systems. The ultimate output includes the SQL Data Definition Language (DDL) for a comprehensive staging table, designed to support binary mortality prediction, encompassing a rich set of static, dynamic, and intervention-related features.

### II. Introduction to the Data Ecosystems

Effective clinical research leveraging large-scale EHR databases necessitates a profound understanding of their unique structures, data provenance, and inherent complexities. This section provides an overview of the MIMIC-IV and eICU-CRD databases, highlighting their architectural principles, temporal data representations, and critical considerations for researchers.

#### MIMIC-IV v3.1: A Single-Center, High-Resolution Database

MIMIC-IV v3.1 is a publicly available, de-identified clinical database sourced from the Beth Israel Deaconess Medical Center (BIDMC) in Boston, MA, spanning patient admissions from 2008 to 2022.[1] This extensive dataset comprises records for over 364,000 unique patients, corresponding to more than 546,000 hospitalizations and nearly 95,000 unique Intensive Care Unit (ICU) stays.[1] The database is organized into a modular structure, reflecting its data provenance. The `hosp` module contains hospital-wide EHR data, including administrative details, billing codes, laboratory results, and pharmacy records that cover a patient's entire hospitalization. In contrast, the `icu` module provides high-resolution data from the in-ICU clinical information system (MetaVision), offering detailed, often minute-by-minute, physiological measurements, ventilator settings, and fluid intake/output records specific to an ICU stay.[1]

A crucial aspect of MIMIC-IV's design philosophy is its commitment to providing data "as is," meaning "data cleaning steps were not performed, to ensure the data reflects a real-world clinical dataset".[1] This deliberate choice aims to preserve the ecological validity of the data, retaining the artifacts, inconsistencies, and potential errors inherent in routine clinical practice. For a task demanding "perfect calculations" and a "robust plan," this implies that SQL queries must be designed with resilience. The expectation of "perfect calculations" in the user query, when juxtaposed with MIMIC-IV's "data as is" philosophy, necessitates that any robust SQL implementation must explicitly handle potential data quality issues. This includes anticipating non-numeric values in numeric columns, managing missing data points, and accounting for unexpected string formats. This approach ensures the integrity of derived features and the accuracy of the cohort definition despite the raw data's inherent variability, requiring careful use of functions like `TRY_CAST` and `COALESCE` and robust filtering conditions within the SQL queries.

To comply with HIPAA Safe Harbor provisions, MIMIC-IV employs a thorough de-identification process. All dates and times for a given patient are shifted by a random offset into the future, preserving the internal consistency of a single patient's timeline (e.g., the duration between any two events for a patient remains accurate) while making direct temporal comparisons between different patients impossible.[1] The `patients` table provides `anchor_age`, `anchor_year`, and `anchor_year_group` to anchor a patient's timeline.[1] Notably, the ages of all patients over 89 are obscured, with their `anchor_age` recorded as 91. This means an `anchor_age` of 91 represents the entire group of patients aged 90 and above.[1] For mortality prediction, especially in SA-AKI where advanced age is a significant risk factor, this age capping reduces granularity for the oldest demographic. Age is a well-established strong predictor of mortality, and collapsing all elderly patients into a single '91' category results in a loss of potentially valuable predictive power within that specific age group. The SQL queries and subsequent feature engineering must acknowledge this limitation by treating the value '91' as a distinct categorical representation (e.g., "90 years or older") rather than a precise continuous age, to avoid misinterpretation and ensure appropriate model handling. Free-text fields within the database are de-identified by replacing Protected Health Information (PHI) with three underscores ("___").[1]

The database is interconnected through a hierarchical system of three primary identifiers: `subject_id`, `hadm_id`, and `stay_id`.[1] The `subject_id` is a unique integer assigned to each individual patient, linking all data pertaining to a single person across their hospitalizations and care episodes. The `hadm_id` uniquely identifies each hospital admission, linking all events, diagnoses, and procedures associated with a single inpatient stay. A single `subject_id` can have multiple `hadm_id`s. The `stay_id` uniquely identifies a single, continuous stay within a specific care unit, primarily an ICU, linking high-frequency, granular data from the `icu` module tables (e.g., `chartevents`, `inputevents`) to a specific ICU admission. A single `hadm_id` can have multiple `stay_id`s if a patient was transferred between ICUs or readmitted to the ICU during the same hospitalization.[1]

#### eICU-CRD: A Multi-Center, Granular Database

The eICU Collaborative Research Database (eICU-CRD) v2.0 is a large, de-identified, multi-center critical care database containing granular data from over 200,000 admissions to ICUs across the United States.[1] Data were collected between 2014 and 2015 from 208 hospitals and 335 distinct ICU wards participating in the Philips eICU tele-ICU program.[1] This origin provides a rich, high-granularity view of real-world clinical practice, encompassing vital sign measurements, laboratory results, diagnoses, treatments, and medication documentation.[1]

The multi-center nature of eICU-CRD, while enhancing the external validity and generalizability of research findings by capturing a broader range of patient populations and clinical practices, also presents significant analytical challenges.[1] The presence, frequency, and even the naming conventions for certain data points—such as specific laboratory tests in the `lab` table or charted items in the `nursecharting` table—can vary substantially across different hospitals (identified by `hospitalid`) and ICU wards (`wardid`).[1] This inherent heterogeneity means that direct string matching for `labname` or `drugname` is often insufficient for comprehensive data extraction. The eICU documentation explicitly states that naming conventions for data points can vary substantially across hospitals. This necessitates an ontology-driven approach, as direct string matching would lead to incomplete data capture. Therefore, an ontology-driven approach is not merely a suggestion but a necessity for eICU data to overcome this variability. Furthermore, this variability suggests that hospital-level metadata (e.g., `numbedscategory`, `teachingstatus`, `region` from `eicu.hospital`) could serve as important covariates or predictive features in the final model, accounting for systemic differences in care or patient populations.[1]

The eICU-CRD database is organized around a hierarchy of unique identifiers: `uniquepid`, `patienthealthsystemstayid`, and `patientunitstayid`.[1] `patientunitstayid` serves as the primary identifier for linking most clinical event data, uniquely identifying a single ICU stay.[1] While most tables link to the central `patient` table via `patientunitstayid`, complex clinical relationships (e.g., linking a specific microbiology culture result to a targeted antibiotic administration) are not enforced by explicit foreign key constraints. Researchers must infer these relationships by joining on `patientunitstayid` and carefully aligning temporal data using offset columns.[1] The temporal data in eICU-CRD is represented using offsets in minutes, calculated relative to the moment of a patient's ICU admission (offset 0).[1] Events occurring before ICU admission have negative offsets, while events during the ICU stay have positive offsets. The `unitdischargeoffset` represents the patient's length of stay in the ICU in minutes.[1] This offset system simplifies intra-ICU duration calculations but introduces complexity when analyzing events across multiple ICU stays for the same patient, as each `patientunitstayid` has its own "clock." The eICU documentation explicitly states that there is no single, master timeline for a patient's complete hospital journey, requiring researchers to piece together chronologies using various admission and discharge offsets from the `patient` table. This is a critical consideration for any feature that needs to span beyond a single ICU stay or requires a full hospital-level timeline, necessitating complex temporal joins and calculations.

Critical considerations for researchers using eICU-CRD include the impact of data interfaces and systematic missingness. The availability of any given data type is contingent on whether the corresponding care unit had the necessary "data interface" enabled.[1] This means that missing data should not be naively interpreted as the absence of an event. For instance, if an ICU lacked an interface for its infusion pumps, the `infusiondrug` table would contain no records for patients from that unit, even if those patients received continuous infusions. This principle, explicitly stated in the eICU documentation, means that a simple `LEFT JOIN` and assuming `NULL` implies absence is flawed. This systematic missingness can introduce significant bias if not addressed. Therefore, robust methodology requires identifying hospitals or units that consistently provide the data of interest and restricting analysis to this reliable subset or employing advanced statistical methods to handle non-random missingness.[1] The SQL for cohort definition should include a mechanism to filter by `hospitalid` if such systematic limitations are identified.

Another crucial distinction in eICU-CRD is between ordered and administered treatments. The `medication` table primarily reflects medication orders, which do not guarantee administration. In contrast, the `infusiondrug` and `intakeoutput` tables contain data charted by nurses at the bedside, reflecting the actual administration of continuous infusions and other fluids.[1] The eICU documentation explicitly highlights this distinction. For a critical intervention like vasopressors, relying on the `medication` table (orders) would introduce noise and misrepresent actual treatment. The `infusiondrug` table is the gold standard for continuous infusions. For studies involving continuous infusions, `infusiondrug` and `intakeoutput` are more reliable sources for actual administration data.[1]

#### Ontologies: The Foundation for Standardized Extraction

The programmatic extraction of clinical concepts, rather than relying on hardcoded values, is a critical requirement for ensuring medical accuracy and generalizability across heterogeneous datasets. This is achieved by leveraging a suite of ontologies:

* **RxNorm:** This terminology provides normalized names for generic and branded drugs, linking to various drug vocabularies.[2, 3] It is essential for identifying all forms of antibiotic and vasopressor medications.
* **LOINC (Logical Observation Identifiers Names and Codes):** LOINC standardizes laboratory test names and clinical observations.[1, 4, 5] It is crucial for consistently identifying lab results like creatinine, bilirubin, and platelet count across different hospital systems.
* **ATC (Anatomical Therapeutic Chemical Classification System):** ATC classifies drugs based on their anatomical main group, therapeutic main group, pharmacological subgroup, chemical subgroup, and chemical substance.[1, 6, 7] This hierarchical classification provides a broad and comprehensive method for identifying drug classes, such as all systemic antibacterials (J01) or specific vasopressors (e.g., C01CA).[6, 8]
* **HPO (Human Phenotype Ontology):** HPO describes phenotypic abnormalities associated with human diseases.[1] While less directly used for primary cohort definition in this context, it can be valuable for identifying subtle phenotypic expressions or for future feature enrichment related to disease severity and genetic predispositions.
* **UMLS (Unified Medical Language System):** UMLS serves as a comprehensive metathesaurus, integrating and linking numerous biomedical vocabularies, including RxNorm and SNOMED_CT, through Concept Unique Identifiers (CUIs) and semantic types.[1, 9, 10, 11] This central hub role is vital for achieving comprehensive mapping. The user's explicit instruction to use "all ontologies" underscores the importance of UMLS's role as a Metathesaurus. By mapping local EHR terms (e.g., `drugname`, `labname`) to RxNorm or LOINC, and then using UMLS to bridge to ATC and SNOMED_CT via CUIs and relationships, a highly comprehensive and accurate list of relevant clinical concepts (e.g., all antibiotics, all vasopressors, all creatinine tests) can be generated. This programmatic approach is superior to hardcoding specific values and enhances the generalizability and reproducibility of the cohort definition.
* **SNOMED_CT (Systematized Nomenclature of Medicine—Clinical Terms):** SNOMED_CT is a comprehensive, multilingual clinical healthcare terminology that provides concepts, descriptions, and relationships, often mapped to other international standards like ICD.[1, 12, 13] It is fundamental for identifying clinical conditions, diagnoses (e.g., AKI, ESRD, infection), and procedures with high specificity and consistency.

The interconnectedness of these ontologies, particularly through UMLS's role as a Metathesaurus, allows for a multi-faceted mapping strategy. For example, a drug name in MIMIC or eICU can be mapped to an RxNorm concept, which then links to an ATC code and a UMLS CUI, which in turn links to SNOMED_CT concepts. This ensures that no relevant drug or condition is missed due to variations in local terminology or data entry practices.

### III. Defining the Sepsis-Associated Acute Kidney Injury (SA-AKI) Cohort

The precise and reproducible definition of the SA-AKI cohort is the cornerstone of any subsequent predictive modeling. This involves the meticulous application of the Sepsis-3 criteria for sepsis and the Kidney Disease: Improving Global Outcomes (KDIGO) criteria for Acute Kidney Injury (AKI), coupled with strict temporal linkage and exclusion criteria.

#### A. Implementing the Sepsis-3 Definition

The Third International Consensus Definitions for Sepsis (Sepsis-3), published in 2016, redefined sepsis as "life-threatening organ dysfunction caused by a dysregulated host response to infection".[1, 1] This definition is operationalized by identifying patients with a suspected infection who concurrently exhibit an acute increase of 2 or more points in the Sequential Organ Failure Assessment (SOFA) score. The baseline SOFA score is assumed to be zero for patients without known pre-existing organ dysfunction.[1, 1]

#### 1. Identifying a Suspected Infection: The Temporal Link of Antibiotics and Cultures

The anchor point for the Sepsis-3 definition is the presence of an infection. In retrospective database research, a robust and widely accepted proxy is the temporal association of antimicrobial therapy and the collection of microbiological cultures.[1] This combination signifies that the clinical team suspected an infection and initiated treatment based on that suspicion. A suspected infection is defined for a given hospital admission (MIMIC `hadm_id`) or ICU stay (eICU `patientunitstayid`) if an antibiotic was administered within a 72-hour window *before* or a 24-hour window *after* a microbiological culture was sampled. The `infection_onset_time` is designated as the earlier of these two events' timestamps or offsets.[1]

**Antibiotic Administration:**
For MIMIC-IV, antibiotic orders are identified from the `hosp.prescriptions` table or, for more granular administration data, the `hosp.emar` and `hosp.emar_detail` tables.[1] For eICU-CRD, antibiotic orders are identified from the `medication` table using the `drugstartoffset`.[1] To avoid hardcoding specific drug names, a comprehensive list of antibiotics is programmatically identified using multiple ontologies:
* **RxNorm:** Concepts with `tty` values such as 'IN' (Ingredient), 'PIN' (Precise Ingredient), 'SCD' (Semantic Clinical Drug), or 'SBD' (Semantic Branded Drug) [1, 14] are queried from `rxnorm.rxnconso` where the string representation (`str`) contains common antibiotic keywords (e.g., 'penicillin', 'cephalexin', 'vancomycin', 'ciprofloxacin', 'levofloxacin').[15, 16] Additionally, `rxnorm.rxnsty` is used to find `rxcui`s classified with the semantic type 'Antibiotic'.[1, 17] Specific examples include Ampicillin (RxCUI 308212, 313800) [18], Vancomycin (RxCUI 2118449) [19], and Meropenem (RxCUI 29561).[20]
* **ATC:** `atc.who_atc_ddd` is queried for `atc_code`s starting with 'J01', which represents "ANTIBACTERIALS FOR SYSTEMIC USE".[1, 6, 7, 21] This provides a broad, hierarchical classification of antibiotics.
* **UMLS:** `umls.mrsty` is queried for `cui`s with the semantic type 'Antibiotic' or 'Clinical Drug'.[1, 9, 11, 22] Additionally, `umls.mrconso` is searched for strings containing '%antibiotic%'.[1] The UMLS Concept Unique Identifier (CUI) C0009450 represents "Infectious disease" [23], which can be used to link to related treatments and conditions.[24, 25]
* **SNOMED_CT:** `snomed_ct.snomed_description` is queried for terms containing '%antibiotic%'.[1] `snomed_ct.snomed_relationship` is used to find concepts related to antibiotics via the 'Is a' relationship (`type_id = 116680003`).[1, 12, 13, 26]
For MIMIC-IV, specific `icu.inputevents` `d_items.itemid`s such as 225798 for 'Vancomycin' are also included.[27] For eICU-CRD, the `medication.drugname` field is free-text and hospital-specific.[1, 28, 29, 30] This necessitates robust pattern matching using `LIKE` statements for common antibiotic names (e.g., `LIKE '%penicillin%'`, `LIKE '%vancomycin%'`, `LIKE '%cefepime%'`) [16] in conjunction with the ontology-derived lists.

**Culture Collection:**
In MIMIC-IV, microbiological culture events (e.g., blood, urine, sputum) are identified from the `hosp.microbiologyevents` table using `chartdate` or `charttime`.[1] In eICU-CRD, culture events are identified from the `microlab` table using `culturetakenoffset`.[1]

#### 2. Calculating the Sequential Organ Failure Assessment (SOFA) Score

The SOFA score is a quantitative measure of organ dysfunction across six critical organ systems.[1, 1] An acute increase of 2 or more points from baseline is the cornerstone of the Sepsis-3 definition.[1, 1] For each 24-hour period of a patient's ICU stay, the worst value for each component is used to calculate the score.[1]

* **Respiration (PaO2/FiO2 Ratio):**
* **MIMIC-IV:** Arterial partial pressure of oxygen (PaO2) is extracted from `hosp.labevents` (itemid 50821) [1, 31], and fraction of inspired oxygen (FiO2) from `icu.chartevents` (itemid 223835).[1, 31] These values must be temporally aligned using their respective `charttime` columns. LOINC codes for PaO2 (78354-8) [32] and FiO2 (19996-8, 3150-0) [33, 34] are used for standardization.
* **eICU-CRD:** PaO2 is extracted from the `lab` table (`labname` like '%pao2%'), and FiO2 from `respiratoryCharting` (`respchartvaluelabel = 'FiO2'`) or `apacheApsVar` (`fio2` column) for the first 24 hours. Events are aligned using their offset columns.[1]
* **Coagulation (Platelet Count):**
* **MIMIC-IV:** Platelet count is extracted from `hosp.labevents` using itemid 51265.[1] The LOINC code for Platelet Count is 777-3.[35]
* **eICU-CRD:** Platelet count is extracted from the `lab` table where `labname` like '%platelet%'.[1]
* **Liver (Bilirubin):**
* **MIMIC-IV:** Total bilirubin is extracted from `hosp.labevents` using itemid 50885.[1] LOINC codes for Total Bilirubin include 1975-2 and 42719-5.[36, 37]
* **eICU-CRD:** Total bilirubin is extracted from the `lab` table where `labname` like '%bilirubin%'.[1]
* **Cardiovascular (Mean Arterial Pressure & Vasopressor Use):**
* **MIMIC-IV:** Mean Arterial Pressure (MAP) is extracted from `icu.chartevents` (using itemids 220052, 220181, or 225312).[1] Vasopressor administrations are identified from the `icu.inputevents` table by mapping drug names to their corresponding `itemid`s in `icu.d_items` (e.g., Norepinephrine itemid 221906).[1, 31] The LOINC code for MAP is commonly 8480-6.
* **eICU-CRD:** MAP is extracted from `vitalPeriodic` (`systemicmean`) or `vitalAperiodic` (`noninvasivemean`). Vasopressor use is identified from the `infusiondrug` table where the `drugname` column contains keywords for common vasopressors (e.g., 'Norepinephrine', 'Epinephrine', 'Dopamine', 'Vasopressin').[1, 38, 39, 40] The distinction between ordered medications and actual administrations is critical here; the `infusiondrug` table is the more accurate source for continuous infusions.[1]
* **Ontology Mapping (Vasopressors):** To ensure comprehensive identification, a multi-ontology approach is employed:
* **RxNorm:** `rxnorm.rxnconso` is queried for `tty` values ('IN', 'PIN', 'SCD', 'SBD') where `str` contains common vasopressor keywords (e.g., 'phenylephrine', 'norepinephrine', 'epinephrine', 'vasopressin', 'dopamine', 'dobutamine').[41, 42, 43, 44] `rxnorm.rxnsty` is also used to find `rxcui`s with `sty = 'Vasopressor'`.[1] Examples include Vasopressin (RxCUI 569490, 11149) [45, 46, 47, 48], Norepinephrine (RxCUI 7512) [49, 50], Epinephrine (RxCUI 727316) [51, 52], Dopamine (RxCUI 1743871) [53, 54], Phenylephrine (RxCUI 1666372) [55, 56], and Dobutamine (RxCUI 203121).[57, 58]
* **ATC:** `atc.who_atc_ddd` is queried for `atc_name`s containing '%norepinephrine%', '%epinephrine%', etc..[1] Specific ATC codes include Norepinephrine (C01CA03) [8, 59], Epinephrine (C01CA24) [60, 61], Dopamine (C01CA04) [62, 63], Vasopressin (H01BA01) [64, 65], Phenylephrine (R01AA04, R01AB01) [66, 67], and Dobutamine (C01CA07).[68, 69]
* **UMLS:** `umls.mrsty` is queried for `cui`s with `sty = 'Vasopressor'`.[1] `umls.mrconso` is searched for strings containing '%vasopressor%'.[1]
* **SNOMED_CT:** `snomed_ct.snomed_description` is queried for terms containing '%vasopressor%'.[1] `snomed_ct.snomed_relationship` is used to find related concepts via the 'Is a' relationship (`type_id = 116680003`).[1, 70, 71, 72, 73, 74, 75, 76, 77]
* **Central Nervous System (Glasgow Coma Scale - GCS):**
* **MIMIC-IV:** The total GCS score is calculated by summing its three components: Eyes (itemid 220739), Verbal (itemid 223900), and Motor (itemid 223901), all found in `icu.chartevents`.[1]
* **eICU-CRD:** GCS is often recorded as a total score in `nurseCharting` (`nursingchartcelltypevallabel = 'Glasgow coma score'`) or as its individual components in `apacheApsVar` (`eyes`, `motor`, `verbal`) for the first 24 hours.[1]
* **Renal (Creatinine & Urine Output):**
* **MIMIC-IV:** Serum creatinine is extracted from `hosp.labevents` (itemid 50912).[1] Total urine output over a period is calculated by summing `value` from `icu.outputevents` for relevant itemids like 226559, 226561.[1] LOINC codes for Creatinine include 2161-8 and 2162-6 for urine [4, 78], and common LOINC 2160-0 for serum/plasma creatinine. The LOINC code for Urine Output is 9192-6.[79]
* **eICU-CRD:** Serum creatinine is extracted from the `lab` table (`labname = 'creatinine'`). Urine output is extracted from the `intakeOutput` table where `celllabel = 'Urine'`.[1]

#### 3. Defining Sepsis Onset

The `sepsis_onset_time` is defined as the first timestamp (or offset) at which a patient satisfies both the criteria for a suspected infection (as defined in Section III.A.1) and demonstrates an acute increase in the total SOFA score of 2 or more points from a baseline of zero (as defined in Section III.A.2).[1]

### B. Implementing the KDIGO-AKI Definition

The second pillar of the SA-AKI definition is the diagnosis of AKI according to the Kidney Disease: Improving Global Outcomes (KDIGO) 2012 clinical practice guideline.[1, 1] AKI is defined by the presence of any one of the following criteria [1, 1]:
1. An increase in serum creatinine (SCr) by $\ge0.3~mg/dL$ $(\ge26.5~\mu mol/L)$ within 48 hours.
2. An increase in SCr to $\ge1.5$ times a baseline value, which is known or presumed to have occurred within the prior 7 days.
3. A urine volume of less than $0.5~mL/kg/h$ for 6 hours.

Relevant ICD-9 codes for AKI include 584.5 (acute renal failure with tubular necrosis), 584.9 (acute renal failure, unspecified), and 586 (renal failure, unspecified).[80, 81] Corresponding ICD-10 codes are N17.0 to N17.9 (Acute renal failure) and N19 (Unspecified kidney failure).[82, 83]

**Ontology Mapping (AKI):**
* **UMLS:** Concepts such as "acute kidney injury" or "acute renal failure" are identified, with CUIs like C0022660 ("Acute kidney injury").[84, 85, 86]
* **SNOMED_CT:** The concept ID 14669001 for "Acute kidney injury" [84, 87] is used, and `snomed_ct.snomed_icd_map` is leveraged to link to relevant ICD codes.

#### 1. The Baseline Creatinine Challenge

A significant practical challenge in applying the KDIGO criteria retrospectively is the frequent absence of a known, pre-illness baseline SCr value.[1] The choice of baseline can dramatically affect cohort composition, as a poorly chosen baseline can lead to misclassification of AKI.[1] The document explicitly states that "The choice of baseline can dramatically affect cohort composition" and that using MDRD estimation is a "final resort" that "represents an estimation rather than a direct measurement".[1] This indicates that this specific baseline method introduces uncertainty and should be noted as a potential source of variability or bias, and the SQL should include a flag to identify these cases. A robust, hierarchical method for determining baseline SCr, reflecting clinical best practice, is implemented as follows [1]:
1. **Recent Outpatient Value (Ideal):** The median outpatient SCr value recorded in the 365 to 7 days prior to the current hospital admission. This represents the patient's stable, pre-illness renal function. In MIMIC-IV, this is queried from `hosp.labevents` where `hadm_id` is null but `charttime` falls in the correct window relative to the admission time.[1] In eICU-CRD, this corresponds to `lab` entries with a negative `labresultoffset`.[1]
2. **Admission Value:** If no recent outpatient value is available, the first SCr measurement recorded upon hospital admission is used.
3. **Nadir In-Hospital Value:** If an admission value is also unavailable, the lowest (nadir) SCr value measured within the first 48 hours of ICU admission can be used, as this may represent renal recovery after initial hydration.
4. **MDRD Estimation (Final Resort):** As a final resort, if no measured values are available, a baseline SCr can be estimated by back-calculating from a presumed normal glomerular filtration rate (GFR) of 75 mL/min/1.73 m², using the Modification of Diet in Renal Disease (MDRD) equation and the patient's demographic data (age, sex, race). Any patient whose baseline is derived this way is flagged, as it represents an estimation rather than a direct measurement.[1]

#### 2. Defining AKI Onset

The `aki_onset_time` is defined as the earliest timestamp (or offset) at which any of the KDIGO criteria for SCr or Urine Output (UO) are met. This requires continuous monitoring of both lab values and hourly urine output calculations.[1]

### C. Temporal Linkage and Final Cohort Selection

To be classified as SA-AKI, the onset of AKI must be temporally linked to the onset of sepsis. Research consensus suggests that AKI developing within a narrow window around the sepsis diagnosis is most likely to be directly caused by the septic insult.[1, 1] For this framework, the "Early SA-AKI" phenotype is targeted: the `aki_onset_time` must occur within the window of 48 hours before to 48 hours after the `sepsis_onset_time`.[1, 1] For MIMIC-IV, this involves `TIMESTAMP` comparisons, while for eICU-CRD, it involves `offset` comparisons (±2880 minutes).[1]

The "definitional paradox" highlights that any patient with a documented infection who develops AKI automatically meets the criteria for sepsis.[1] The document explicitly discusses this paradox and how Stage 1 AKI can be triggered by minor changes, leading to a "vast spectrum of clinical severity" within the SA-AKI label.[1] This inherent heterogeneity within the SA-AKI label necessitates additional features to capture clinical nuance, such as `aki_trigger`, `persistent_aki`, and `kdigo_stage_max_48hr` [1], to improve mortality prediction by allowing the model to differentiate between various levels of SA-AKI severity.

**Inclusion Criteria:**
1. Patient is an adult (Age $\ge$ 18 years).
2. Patient meets Sepsis-3 criteria, establishing a `sepsis_onset_time`.
3. Patient meets KDIGO criteria for AKI (any stage), establishing an `aki_onset_time`.
4. Temporal Linkage: The `aki_onset_time` must occur within the window of 48 hours before to 48 hours after the `sepsis_onset_time`.[1]

**Exclusion Criteria:**
1. **Pre-existing End-Stage Renal Disease (ESRD) or Chronic Maintenance Dialysis:** Patients with ESRD or those on chronic maintenance dialysis prior to admission are excluded, as they cannot develop a new AKI.[1]
* **ICD-9 codes:** '585.6', 'V45.1' (long-term dialysis), 'V56.x' (dialysis codes).[1, 81]
* **ICD-10 codes:** 'N18.6' (End stage renal disease), 'Z99.2' (dependence on renal dialysis), 'Z49.x' (care involving dialysis).[1, 83]
* **Ontology Mapping (ESRD):**
* **UMLS:** Concepts such as "end stage renal disease," "ESRD," or "chronic dialysis" are identified, including CUIs like C1561643 ("Chronic kidney disease") [88] and specific CUIs for ESRD.[89, 90]
* **SNOMED_CT:** The concept ID 46177005 for "End-stage renal disease" [91] is used, and `snomed_ct.snomed_icd_map` is leveraged to find associated ICD codes.
2. **Short ICU Stay:** Patients with an ICU length of stay less than 24 hours are excluded, as this is insufficient time to reliably observe the development of AKI and gather necessary data.[1] For MIMIC-IV, this is calculated from `intime` and `outtime` in `icu.icustays` (`los` column). For eICU-CRD, it is calculated from `unitdischargeoffset` in the `patient` table.
3. **Early Renal Replacement Therapy (RRT):** Patients who receive RRT within the first 24 hours of ICU admission are excluded because the intervention fundamentally alters the natural course of AKI and its associated biomarkers, confounding the predictive modeling task.[1]
4. **eICU-CRD Specific Data Interface Limitations:** Patients admitted to hospitals or units with known incomplete data interfaces for critical variables (e.g., laboratory results, infusion medications) should be excluded to prevent systematic bias.[1] The explicit exclusion for eICU's data interface limitations reinforces the need for pre-analysis data quality checks, especially for multi-center datasets. This is not merely about handling missing values, but about addressing *systematic* missingness, which can introduce selection bias if not properly accounted for. The SQL for cohort definition includes a mechanism to filter by `hospitalid` if such limitations are identified.

Table 1 provides a summary of the logic required to define the SA-AKI cohort.

**Table 1: SA-AKI Cohort Definition Logic (Expanded)**

Criterion Rationale MIMIC-IV Implementation (Tables, Columns/itemids, Ontology Mapping) eICU-CRD Implementation (Tables, Columns/labname pattern, Ontology Mapping) Temporal Logic
**Sepsis-3** Defines sepsis as infection plus organ dysfunction.
Suspected Infection Proxy for infection onset based on clinical action. `hosp.prescriptions`, `hosp.emar`, `hosp.emar_detail`,

Works cited

  1. Sepsis AKI Mortality Prediction Dataset Prep Base.pdf
  2. RxNORM - amoxicillin - Classes | NCBO BioPortal - Biomedical Ontology, accessed on June 18, 2025, https://purl.bioontology.org/ontology/RXNORM/723
  3. RxNorm Overview - National Library of Medicine, accessed on June 18, 2025, https://www.nlm.nih.gov/research/umls/rxnorm/overview.html
  4. LOINC 2161-8 Creatinine [Mass/volume] in Urine, accessed on June 18, 2025, https://loinc.org/2161-8
  5. LOINC 71426-1 C reactive protein [Mass/volume] in Blood by High sensitivity method, accessed on June 18, 2025, https://loinc.org/71426-1
  6. j01 antibacterials for systemic use - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=j01
  7. Antiinfectives For Systemic Use - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=J&showdescription=yes
  8. C01CA03 - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=C01CA03
  9. The Semantic Network - National Library of Medicine, accessed on June 18, 2025, https://www.nlm.nih.gov/research/umls/new_users/online_learning/OVR_003.html
  10. Unique Identifiers in the Metathesaurus - National Library of Medicine, accessed on June 18, 2025, https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html
  11. Antibiotic Concepts and Types - John Snow Labs, accessed on June 18, 2025, https://www.johnsnowlabs.com/marketplace/antibiotic-concepts-and-types/
  12. SAMPLE ANTIBIOTIC SENSITIVITY RESULT (SNOMED CT DM+D) - NHS Data Dictionary, accessed on June 18, 2025, https://archive.datadictionary.nhs.uk/DD%20Release%20June%202021/data_elements/sample_antibiotic_sensitivity_result__snomed_ct_dm_d_.html
  13. What is SNOMED CT, accessed on June 18, 2025, https://www.snomed.org/what-is-snomed-ct
  14. Appendix 5 - RxNorm Term Types (TTY) - National Library of Medicine, accessed on June 18, 2025, https://www.nlm.nih.gov/research/umls/rxnorm/docs/appendix5.html
  15. Sepsis Treatment: Antibiotics, Medication, and Surgery - Everyday Health, accessed on June 18, 2025, https://www.everydayhealth.com/sepsis/treatment/
  16. Antibiotics - EMCrit Project, accessed on June 18, 2025, https://emcrit.org/ibcc/antibiotics/
  17. UMLS - ATC - Statistics - National Library of Medicine, accessed on June 18, 2025, https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/ATC/stats.html
  18. Label: AMPICILLIN capsule - DailyMed, accessed on June 18, 2025, https://dailymed.nlm.nih.gov/dailymed/lookup.cfm?setid=7f8b1ab1-73c6-4cdf-a838-7a4554c62586
  19. 400 ML vancomycin 5 MG/ML Injection - RxNav, accessed on June 18, 2025, https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=2118449
  20. RxClass, accessed on June 18, 2025, https://mor.nlm.nih.gov/RxClass/search?query=Fever
  21. atcddd.fhi.no, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=j01ca#:~:text=J01%20ANTIBACTERIALS%20FOR%20SYSTEMIC%20USE,mode%20of%20action%20and%20chemistry.
  22. UMLS Semantic Network, accessed on June 18, 2025, https://uts.nlm.nih.gov/uts/umls/semantic-network/T200
  23. Infectious disease (Concept Id: C0009450) - NCBI, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/medgen/1057
  24. suitability of UMLS and SNOMED-CT for encoding outcome concepts - Oxford Academic, accessed on June 18, 2025, https://academic.oup.com/jamia/article/30/12/1895/7249289
  25. TUTORIAL # Customizing the UMLS Metathesaurus for Your Applications - National Library of Medicine, accessed on June 18, 2025, https://www.nlm.nih.gov/research/umls/presentations/customizing-AMIA-Nov2000.pdf
  26. Amoxicillin 500mg capsules (RX Farma) (product) - SNOMED CT, accessed on June 18, 2025, https://termbrowser.nhs.uk/?perspective=full&conceptId1=35591411000001103
  27. mimic-code/mimic-iii/notebooks/vancomycin-dosing.ipynb at main - GitHub, accessed on June 18, 2025, https://github.com/MIT-LCP/mimic-code/blob/main/mimic-iii/notebooks/vancomycin-dosing.ipynb
  28. FedWeight: Mitigating Covariate Shift of Federated Learning on Electronic Health Records Data through Patients Re-weighting - medRxiv, accessed on June 18, 2025, https://www.medrxiv.org/content/10.1101/2025.02.10.25322018v1.full.pdf
  29. (PDF) FedWeight: Mitigating Covariate Shift of Federated Learning on Electronic Health Records Data through Patients Re-weighting - ResearchGate, accessed on June 18, 2025, https://www.researchgate.net/publication/390786796_FedWeight_Mitigating_Covariate_Shift_of_Federated_Learning_on_Electronic_Health_Records_Data_through_Patients_Re-weighting
  30. medication - eICU Collaborative Research Database, accessed on June 18, 2025, https://eicu-crd.mit.edu/eicutables/medication/
  31. Itemids for a list of variables · Issue #540 · MIT-LCP/mimic-code - GitHub, accessed on June 18, 2025, MIT-LCP/mimic-code#540
  32. LOINC 78354-8 Arterial-alveolar oxygen Partial pressure ratio, accessed on June 18, 2025, https://loinc.org/78354-8
  33. LOINC 19996-8 Oxygen/Inspired gas Respiratory system --on ventilator, accessed on June 18, 2025, https://loinc.org/19996-8
  34. LOINC 3150-0 Inhaled oxygen concentration, accessed on June 18, 2025, https://loinc.org/3150-0
  35. www.labcorp.com, accessed on June 18, 2025, https://www.labcorp.com/tests/005249/platelet-count
  36. www.labcorp.com, accessed on June 18, 2025, https://www.labcorp.com/tests/001214/bilirubin-total-and-direct
  37. LOINC 42719-5 Bilirubin.total [Mass/volume] in Blood, accessed on June 18, 2025, https://loinc.org/42719-5
  38. infusionDrug - eICU Collaborative Research Database, accessed on June 18, 2025, https://eicu-crd.mit.edu/eicutables/infusiondrug/
  39. Early prediction of hemodynamic interventions in the intensive care unit using machine learning - PMC, accessed on June 18, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC8590869/
  40. Practice Patterns in the Initiation of Secondary Vasopressors and Adjunctive Corticosteroids during Septic Shock in the United States - PubMed Central, accessed on June 18, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC8641823/
  41. Inotropes and Vasopressors - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/books/NBK482411/
  42. Vasopressors: Types, Purpose and Side Effects - Cleveland Clinic, accessed on June 18, 2025, https://my.clevelandclinic.org/health/treatments/23208-vasopressors
  43. Inotropes and Vasopressors - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/sites/books/NBK482411/
  44. List of Vasopressors - Drugs.com, accessed on June 18, 2025, https://www.drugs.com/drug-class/vasopressors.html
  45. RxNORM - vasopressin (USP) 20 UNT/ML Injectable Solution [Vasostrict] - Classes, accessed on June 18, 2025, https://bioportal.bioontology.org/ontologies/RXNORM?p=classes&conceptid=1593738
  46. vasopressin (USP) 20 UNT/ML [Pitressin] - RxNav, accessed on June 18, 2025, https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=569490
  47. RxNORM - vasopressin (USP) - Classes | NCBO BioPortal - Biomedical Ontology, accessed on June 18, 2025, https://purl.bioontology.org/ontology/RXNORM/11149
  48. arginine vasopressin - Wikidata, accessed on June 18, 2025, https://www.wikidata.org/wiki/Q183011
  49. RxNORM - norepinephrine - Classes | NCBO BioPortal - Biomedical Ontology, accessed on June 18, 2025, https://bioportal.bioontology.org/ontologies/RXNORM?p=classes&conceptid=7512
  50. Norepinephrine - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/books/NBK537259/
  51. NDA019430 0.3 ML EPINEPHrine 0.5 MG/ML Auto-Injector - RxNav, accessed on June 18, 2025, https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=727316
  52. Epinephrine - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/books/NBK482160/
  53. 500 ML DOPamine hydrochloride 0.8 MG/ML Injection - RxNav, accessed on June 18, 2025, https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=1743871
  54. Urinary Dopamine Levels Can Predict the Avidity of Post-Therapy [131I]MIBG Scintigraphy in Unresectable or Metastatic Pheochromocytomas and Paragangliomas: A Preliminary Clinical Study - PubMed Central, accessed on June 18, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC11858449/
  55. RxNORM - 1 ML phenylephrine hydrochloride 10 MG/ML Injection - Classes, accessed on June 18, 2025, https://bioportal.bioontology.org/ontologies/RXNORM?p=classes&conceptid=1666372
  56. Phenylephrine - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/books/NBK534801/
  57. RxCUI 203121 - RxNav, accessed on June 18, 2025, https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=203121
  58. Dobutamine - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/sites/books/NBK470431/
  59. Noradrenaline (Norepinephrine) 0.08 mg/ml, solution for infusion - Summary of Product Characteristics (SmPC) - (emc) | 8776, accessed on June 18, 2025, https://www.medicines.org.uk/emc/product/8776/smpc
  60. EpiPen Adrenaline (Epinephrine) 0.3 mg Auto-Injector - Summary of Product Characteristics (SmPC) - (emc) | 4289, accessed on June 18, 2025, https://www.medicines.org.uk/emc/product/4289/smpc
  61. C01CA24 - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=C01CA24
  62. dopamine - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=C01CA04
  63. Dopamine Hydrochloride 40mg/ml Concentrate for Solution for Infusion - Summary of Product Characteristics (SmPC) - (emc) | 6619, accessed on June 18, 2025, https://www.medicines.org.uk/emc/product/6619/smpc
  64. Vasopressin and analogues - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=H01BA&showdescription=yes
  65. Vasopressin - Wikipedia, accessed on June 18, 2025, https://en.wikipedia.org/wiki/Vasopressin
  66. R01AA04 - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=R01AA04
  67. r01 nasal preparations - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?showdescription=yes&code=R01AB
  68. C01CA07 - ATCDDD - ATC/DDD Index - FHI, accessed on June 18, 2025, https://atcddd.fhi.no/atc_ddd_index/?code=C01CA07
  69. Dobutamine 5 mg/ml solution for infusion - Patient Information Leaflet (PIL) - (emc) | 6462, accessed on June 18, 2025, https://www.medicines.org.uk/emc/product/6462/pil
  70. SNOMED CT - Vasopressin deficiency - Classes | NCBO BioPortal - Biomedical Ontology, accessed on June 18, 2025, https://purl.bioontology.org/ontology/SNOMEDCT/45369008
  71. Vasopressin resistance - Classes | NCBO BioPortal - SNOMED CT, accessed on June 18, 2025, https://purl.bioontology.org/ontology/SNOMEDCT/111395007
  72. SNOMED CT - Allergy to norepinephrine - Classes | NCBO BioPortal, accessed on June 18, 2025, https://bioportal.bioontology.org/ontologies/SNOMEDCT?p=classes&conceptid=294029008
  73. Adrenaline (base) 25micrograms/ml (1 in 40000) topical solution (Special Order) 1 ml (product) - SNOMED CT Browser, accessed on June 18, 2025, https://termbrowser.nhs.uk/?perspective=full&conceptId1=33514611000001107
  74. Dopamine 400mg/10ml solution for infusion pre-filled syringes (product) - SNOMED CT, accessed on June 18, 2025, https://termbrowser.nhs.uk/?perspective=full&conceptId1=36071011000001104
  75. snomed international snomed ct license agreement - Athena, accessed on June 18, 2025, https://qaathena.odysseusinc.com/search-terms/terms/606832
  76. Phenylephrine 12mg capsules (product) - SNOMED CT, accessed on June 18, 2025, https://termbrowser.nhs.uk/?perspective=full&conceptId1=8528911000001104
  77. Dobutamine 100mg/100ml infusion bags (Special Order) 1 bag (product) - SNOMED CT, accessed on June 18, 2025, https://termbrowser.nhs.uk/?perspective=full&conceptId1=35060111000001103
  78. LOINC 2162-6 Creatinine [Mass/time] in 24 hour Urine, accessed on June 18, 2025, https://loinc.org/2162-6
  79. LOINC 9192-6 Urine output 24 hour, accessed on June 18, 2025, https://loinc.org/9192-6
  80. www.aapc.com, accessed on June 18, 2025, https://www.aapc.com/codes/icd9-codes/584.9#:~:text=ICD%2D9%20code%20584.9%20for,NEPHROSIS%20(580%2D589).
  81. Urine Culture Bacterial (Re-evaluation of Inclusion of Renal Failure in the List of ICD-9-CM Codes Covered) (CAG-00195N) - Decision Memo - CMS, accessed on June 18, 2025, https://www.cms.gov/medicare-coverage-database/view/ncacal-decision-memo.aspx?proposed=N&calid=100&keyword=Transplant&keywordType=starts&areaId=s24&docType=NCA,CAL,NCD,MEDCAC,TA,MCD,6,3,5,1,F,P&contractOption=all&sortBy=relevance&bc=1
  82. clinicalcodes.rss.mhs.man.ac.uk, accessed on June 18, 2025, https://clinicalcodes.rss.mhs.man.ac.uk/medcodes/article/30/codelist/res30-acute-kidney-injury/
  83. Renal failure (N17-N19) - ICD-10 Version:2019, accessed on June 18, 2025, https://icd.who.int/browse10/2019/en#/N17-N19
  84. SNOMED CT - Acute kidney injury - Classes | NCBO BioPortal - Biomedical Ontology, accessed on June 18, 2025, https://purl.bioontology.org/ontology/SNOMEDCT/14669001
  85. Mouse model of ischemic acute kidney injury: technical notes and tricks - PMC, accessed on June 18, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC3532486/
  86. Acute Kidney Injury - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/books/NBK441896/
  87. Acute kidney injury - Athena, accessed on June 18, 2025, https://qaathena.odysseusinc.com/search-terms/terms/197320
  88. Chronic kidney disease (Concept Id: C1561643) - NCBI, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/medgen/473458
  89. Development and external validation of multimodal postoperative acute kidney injury risk machine learning models - PMC, accessed on June 18, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC10746378/
  90. End-Stage Renal Disease - StatPearls - NCBI Bookshelf, accessed on June 18, 2025, https://www.ncbi.nlm.nih.gov/books/NBK499861/
  91. SNOMED CT - End-stage renal disease - Classes | NCBO BioPortal - Biomedical Ontology, accessed on June 18, 2025, https://purl.bioontology.org/ontology/SNOMEDCT/46177005
  92. LOINC 3094-0 Urea nitrogen [Mass/volume] in Serum or Plasma, accessed on June 18, 2025, https://loinc.org/3094-0
  93. LOINC 6299-2 Urea nitrogen [Mass/volume] in Blood, accessed on June 18, 2025, https://loinc.org/6299-2
  94. loinc.org, accessed on June 18, 2025, https://loinc.org/33037-3#:~:text=LOINC%2033037%2D3%20Anion%20gap%20in%20Serum%20or%20Plasma%20by%20calculation
  95. Anion gap in Serum or Plasma by Calculated.3Ions - 10466-1 - LOINC - Find-A-Code, accessed on June 18, 2025, https://www.findacode.com/loinc/10466-1--anion-gap-3-in-serum-or-plasma.html
  96. LOINC 2947-0 Sodium [Moles/volume] in Blood, accessed on June 18, 2025, https://loinc.org/2947-0
  97. loinc.org, accessed on June 18, 2025, https://loinc.org/2829-0#:~:text=LOINC%202829%2D0%20Potassium%20%5BMoles,time%5D%20in%2024%20hour%20Urine
  98. loinc.org, accessed on June 18, 2025, https://loinc.org/2075-0#:~:text=LOINC%202075%2D0%20Chloride%20%5BMoles,volume%5D%20in%20Serum%20or%20Plasma
  99. LOINC 2075-0 Chloride [Moles/volume] in Serum or Plasma, accessed on June 18, 2025, https://loinc.org/2075-0
  100. loinc.org, accessed on June 18, 2025, https://loinc.org/1960-4#:~:text=LOINC%201960%2D4%20Bicarbonate%20%5BMoles%2Fvolume%5D%20in%20Arterial%20blood
  101. LOINC 69964-5 Bicarbonate [Moles/volume] standard in Plasma, accessed on June 18, 2025, https://loinc.org/69964-5
  102. loinc.org, accessed on June 18, 2025, https://loinc.org/1743-4#:~:text=LOINC%201743%2D4%20Alanine%20aminotransferase,With%20P%2D5'%2DP
  103. LOINC 1743-4 Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma by With P-5, accessed on June 18, 2025, https://loinc.org/1743-4
  104. loinc.org, accessed on June 18, 2025, https://loinc.org/30239-8#:~:text=LOINC%2030239%2D8%20Aspartate%20aminotransferase,With%20P%2D5'%2DP
  105. LOINC 1920-8 Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma, accessed on June 18, 2025, https://loinc.org/1920-8
  106. loinc.org, accessed on June 18, 2025, https://loinc.org/6768-6#:~:text=LOINC%206768%2D6%20Alkaline%20phosphatase,volume%5D%20in%20Serum%20or%20Plasma
  107. LOINC 77141-0 Alkaline phosphatase [Enzymatic activity/volume] in Serum, Plasma or Blood, accessed on June 18, 2025, https://loinc.org/77141-0
  108. loinc.org, accessed on June 18, 2025, https://loinc.org/1751-7#:~:text=LOINC%201751%2D7%20Albumin%20%5BMass,volume%5D%20in%20Serum%20or%20Plasma
  109. LOINC 1751-7 Albumin [Mass/volume] in Serum or Plasma, accessed on June 18, 2025, https://loinc.org/1751-7
  110. Leukocytes [#/volume] in Blood by Automated count - 6690-2 - LOINC, accessed on June 18, 2025, https://www.findacode.com/loinc/6690-2--leukocytes-volume-in-blood-by-automated-count.html
  111. Hemoglobin [Mass/volume] in Blood - 718-7 - LOINC - Find-A-Code, accessed on June 18, 2025, https://www.findacode.com/loinc/718-7--hemoglobin-massvolume-in-blood.html
  112. loinc.org, accessed on June 18, 2025, https://loinc.org/4544-3#:~:text=LOINC%204544%2D3%20Hematocrit%20%5BVolume,of%20Blood%20by%20Automated%20count
  113. www.findacode.com, accessed on June 18, 2025, https://www.findacode.com/loinc/6301-6--inr-in-platelet-poor-plasma-by-coagulation-assay.html#:~:text=INR%20in%20Platelet%20poor%20plasma,assay%20%2D%206301%2D6%20%2D%20LOINC
  114. loinc.org, accessed on June 18, 2025, https://loinc.org/5902-2
  115. LOINC 50391-2 PTT circulating inhibitor [Presence] in Platelet poor plasma, accessed on June 18, 2025, https://loinc.org/50391-2
  116. loinc.org, accessed on June 18, 2025, https://loinc.org/30522-7#:~:text=LOINC%2030522%2D7%20C%20reactive,Plasma%20by%20High%20sensitivity%20method
  117. loinc.org, accessed on June 18, 2025, https://loinc.org/29246-6#:~:text=LOINC%2029246%2D6%20Lactate%20%5BMass%2Fvolume%5D%20in%20Body%20fluid
  118. loinc.org, accessed on June 18, 2025, https://loinc.org/2339-0#:~:text=LOINC%202339%2D0%20Glucose%20%5BMass%2Fvolume%5D%20in%20Blood
  119. LOINC 2744-1 pH of Arterial blood, accessed on June 18, 2025, https://loinc.org/2744-1
  120. LOINC Panel Details 52484-3 Arterial Blood Gases (ABGs), accessed on June 18, 2025, https://loinc.org/52484-3/panel
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment