yamini · April 1, 2025 15:38
diff --git a/clinical-trials-data-v2-sdk.ipynb b/clinical-trials-data-v2-sdk.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Synthetic Clinical Trial Dataset Generator\n",
    "\n",
    "This notebook creates a synthetic dataset of clinical trial records with realistic PII (Personally Identifiable Information) for testing data protection and anonymization techniques.\n",
    "\n",
    "The dataset includes:\n",
    "- Trial information and study design\n",
    "- Participant demographics and health data (PII)\n",
    "- Investigator and coordinator information (PII)\n",
    "- Medical observations and notes with embedded PII\n",
    "- Adverse event reports with varying severity\n",
    "\n",
    "We'll use Gretel's Data Designer to create this fully synthetic dataset from scratch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "# Install required packages\n",
    "%pip install -U git+https://github.com/gretelai/gretel-python-client"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting up Data Designer\n",
    "\n",
    "First, we'll initialize the Gretel client and create a new Data Designer object. We'll use the `apache-2.0` model suite for this project."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gretel_client.navigator_client import Gretel\n",
    "\n",
    "# Initialize Gretel client - this will prompt for your API key\n",
    "gretel = Gretel(api_key=\"prompt\", endpoint=\"https://api.dev.gretel.ai\")\n",
    "\n",
    "aidd = gretel.data_designer.new(model_suite=\"apache-2.0\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting up Person Samplers\n",
    "\n",
    "We'll create person samplers to generate consistent personal information for different roles in the clinical trial process:\n",
    "- Participants (patients enrolled in the trial)\n",
    "- Investigators (doctors conducting the trial)\n",
    "- Study coordinators (staff managing the trial)\n",
    "- Sponsors (pharmaceutical company representatives)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 👩‍🚀 Person Attributes\n",
    "\n",
    "Each person sampler exposes the attributes below. Expression columns can reference them as `<sampler>.<field>`, for example `participant.first_name`:\n",
    "\n",
    "| Field Name | Type | Alias | Description |\n",
    "|------------|------|-------|-------------|\n",
    "| first_name | str | | Person's first name |\n",
    "| middle_name | str \\| None | | Person's middle name (optional) |\n",
    "| last_name | str | | Person's last name |\n",
    "| sex | SexT | | Person's sex (enum type) |\n",
    "| age | int | | Person's age |\n",
    "| postcode | str | zipcode | Postal/ZIP code |\n",
    "| street_number | int \\| str | | Street number (can be numeric or alphanumeric) |\n",
    "| street_name | str | | Name of the street |\n",
    "| unit | str | | Unit/apartment number |\n",
    "| city | str | | City name |\n",
    "| region | str \\| None | state | Region/state (optional) |\n",
    "| district | str \\| None | county | District/county (optional) |\n",
    "| country | str | | Country name |\n",
    "| ethnic_background | str \\| None | | Ethnic background (optional) |\n",
    "| marital_status | str \\| None | | Marital status (optional) |\n",
    "| education_level | str \\| None | | Education level (optional) |\n",
    "| bachelors_field | str \\| None | | Field of bachelor's degree (optional) |\n",
    "| occupation | str \\| None | | Occupation (optional) |\n",
    "| uuid | UUID | | Unique identifier |\n",
    "| locale | str | | Locale setting |\n",
    "| phone_number | PhoneNumber \\| None | | Generated phone number based on location (None for age < 18) |\n",
    "| email_address | EmailStr \\| None | | Generated email address (None for age < 18) |\n",
    "| birth_date | date | | Calculated birth date based on age |\n",
    "| national_id | str \\| None | | National ID (SSN for US locale) |\n",
    "| ssn | str \\| None | | Alias for national_id |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create person samplers for different roles, using en_GB locale\n",
    "aidd.with_person_samplers({\n",
    "    \"participant\": {\"locale\": \"en_GB\"},\n",
    "    \"investigator\": {\"locale\": \"en_GB\"},\n",
    "    \"coordinator\": {\"locale\": \"en_GB\"},\n",
    "    \"sponsor\": {\"locale\": \"en_GB\"}\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Trial Information\n",
    "\n",
    "Next, we'll create the basic trial information:\n",
    "- Study ID (unique identifier)\n",
    "- Trial phase and therapeutic area\n",
    "- Study design details\n",
    "- Start and end dates for the trial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Study identifiers\n",
    "aidd.add_column(\n",
    "    name=\"study_id\",\n",
    "    type=\"uuid\",\n",
    "    params={\"prefix\": \"CT-\", \"short_form\": True, \"uppercase\": True}\n",
    ")\n",
    "\n",
    "# Trial phase\n",
    "aidd.add_column(\n",
    "    name=\"trial_phase\",\n",
    "    type=\"category\",\n",
    "    params={\n",
    "        \"values\": [\"Phase I\", \"Phase II\", \"Phase III\", \"Phase IV\"],\n",
    "        \"weights\": [0.2, 0.3, 0.4, 0.1]\n",
    "    }\n",
    ")\n",
    "\n",
    "# Therapeutic area\n",
    "aidd.add_column(\n",
    "    name=\"therapeutic_area\",\n",
    "    type=\"category\",\n",
    "    params={\n",
    "        \"values\": [\"Oncology\", \"Cardiology\", \"Neurology\", \"Immunology\", \"Infectious Disease\"],\n",
    "        \"weights\": [0.3, 0.2, 0.2, 0.15, 0.15]\n",
    "    }\n",
    ")\n",
    "\n",
    "# Study design\n",
    "aidd.add_column(\n",
    "    name=\"study_design\",\n",
    "    type=\"subcategory\",\n",
    "    params={\n",
    "        \"category\": \"trial_phase\",\n",
    "        \"values\": {\n",
    "            \"Phase I\": [\"Single Arm\", \"Dose Escalation\", \"First-in-Human\", \"Safety Assessment\"],\n",
    "            \"Phase II\": [\"Randomized\", \"Double-Blind\", \"Proof of Concept\", \"Open-Label Extension\"],\n",
    "            \"Phase III\": [\"Randomized Controlled\", \"Double-Blind Placebo-Controlled\", \"Multi-Center\", \"Pivotal\"],\n",
    "            \"Phase IV\": [\"Post-Marketing Surveillance\", \"Real-World Evidence\", \"Long-Term Safety\", \"Expanded Access\"]\n",
    "        }\n",
    "    }\n",
    ")\n",
    "\n",
    "# Trial dates\n",
    "aidd.add_column(\n",
    "    name=\"trial_start_date\",\n",
    "    type=\"datetime\",\n",
    "    params={\"start\": \"2022-01-01\", \"end\": \"2023-06-30\"},\n",
    "    convert_to=\"%Y-%m-%d\"\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"trial_end_date\",\n",
    "    type=\"datetime\",\n",
    "    params={\"start\": \"2023-07-01\", \"end\": \"2024-12-31\"},\n",
    "    convert_to=\"%Y-%m-%d\"\n",
    ")"
   ]
  },
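  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `weights` lists above line up position by position with `values`. As a quick sanity check, the sketch below confirms that a weight list has one entry per value and sums to 1. `check_weights` is a hypothetical helper written for this notebook, not part of the Gretel SDK, and it assumes the weights are meant to be probabilities. Data Designer may or may not normalize them itself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# Hypothetical helper (not part of the Gretel SDK)\n",
    "def check_weights(values, weights):\n",
    "    \"\"\"Return True if there is one weight per value and the weights sum to 1.\"\"\"\n",
    "    return len(values) == len(weights) and math.isclose(sum(weights), 1.0)\n",
    "\n",
    "# Values and weights copied from the trial_phase column above\n",
    "phases = [\"Phase I\", \"Phase II\", \"Phase III\", \"Phase IV\"]\n",
    "phase_weights = [0.2, 0.3, 0.4, 0.1]\n",
    "print(check_weights(phases, phase_weights))"
   ]
  },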
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Participant Information\n",
    "\n",
    "Now we'll create fields for participant demographics and enrollment details:\n",
    "- Participant ID\n",
    "- Name, birth date, and email address drawn from the `participant` person sampler (PII)\n",
    "- Enrollment status and dates\n",
    "- Randomization assignment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Participant identifiers and information\n",
    "aidd.add_column(\n",
    "    name=\"participant_id\",\n",
    "    type=\"uuid\",\n",
    "    params={\"prefix\": \"PT-\", \"short_form\": True, \"uppercase\": True}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"participant_first_name\",\n",
    "    type=\"expression\",\n",
    "    params={\"expr\": \"participant.first_name\"}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"participant_last_name\",\n",
    "    type=\"expression\",\n",
    "    params={\"expr\": \"participant.last_name\"}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"participant_birth_date\",\n",
    "    type=\"expression\",\n",
    "    params={\"expr\": \"participant.birth_date\"}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"participant_email\",\n",
    "    type=\"expression\",\n",
    "    params={\"expr\": \"participant.email_address\"}\n",
    ")\n",
    "\n",
    "# Enrollment information\n",
    "aidd.add_column(\n",
    "    name=\"enrollment_date\",\n",
    "    type=\"timedelta\",\n",
    "    params={\n",
    "        \"dt_min\": 0,\n",
    "        \"dt_max\": 60,\n",
    "        \"reference_column_name\": \"trial_start_date\",\n",
    "        \"unit\": \"D\"\n",
    "    },\n",
    "    convert_to=\"%Y-%m-%d\"\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"participant_status\",\n",
    "    type=\"category\",\n",
    "    params={\n",
    "        \"values\": [\"Active\", \"Completed\", \"Withdrawn\", \"Lost to Follow-up\"],\n",
    "        \"weights\": [0.6, 0.2, 0.15, 0.05]\n",
    "    }\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"treatment_arm\",\n",
    "    type=\"category\",\n",
    "    params={\n",
    "        \"values\": [\"Treatment\", \"Placebo\", \"Standard of Care\"],\n",
    "        \"weights\": [0.5, 0.3, 0.2]\n",
    "    }\n",
    ")"
   ]
  },
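  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `timedelta` column above offsets `trial_start_date` by between `dt_min` and `dt_max` days. The plain-Python sketch below illustrates the idea only. `sample_enrollment_date` is a hypothetical function, and both the uniform integer draw and the inclusive upper bound are assumptions, since this notebook does not show how Data Designer samples the offset internally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "from datetime import date, timedelta\n",
    "\n",
    "# Illustrative only: mirrors dt_min=0, dt_max=60, unit=\"D\" with a uniform integer draw.\n",
    "# Treating dt_max as inclusive is an assumption.\n",
    "def sample_enrollment_date(trial_start, dt_min=0, dt_max=60, rng=random):\n",
    "    offset_days = rng.randint(dt_min, dt_max)\n",
    "    return trial_start + timedelta(days=offset_days)\n",
    "\n",
    "start = date(2022, 1, 1)\n",
    "enrolled = sample_enrollment_date(start, rng=random.Random(0))\n",
    "\n",
    "# The sampled date always falls within 60 days of the trial start\n",
    "print(start <= enrolled <= start + timedelta(days=60))\n",
    "print(enrolled.strftime(\"%Y-%m-%d\"))"
   ]
  },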
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Investigator and Staff Information\n",
    "\n",
    "Here we'll add information about the trial staff:\n",
    "- Investigator information (principal investigator)\n",
    "- Study coordinator details\n",
    "- Site information\n",
    "- Per-patient cost and participant compensation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Investigator information\n",
    "aidd.add_column(\n",
    "    name=\"investigator_first_name\",\n",
    "    type=\"expression\",\n",
    "    params={\"expr\": \"investigator.first_name\"}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"investigator_last_name\",\n",
    "    type=\"expression\",\n",
    "    params={\"expr\": \"investigator.last_name\"}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"investigator_id\",\n",
    "    type=\"uuid\",\n",
    "    params={\"prefix\": \"INV-\", \"short_form\": True, \"uppercase\": True}\n",
    ")\n",
    "\n",
    "# Study coordinator information\n",
    "aidd.add_column(\n",
    "    name=\"coordinator_first_name\",\n",
    "    type=\"expression\",\n",
    "    params={\"expr\": \"coordinator.first_name\"}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"coordinator_last_name\",\n",
    "    type=\"expression\",\n",
    "    params={\"expr\": \"coordinator.last_name\"}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"coordinator_email\",\n",
    "    type=\"expression\",\n",
    "    params={\"expr\": \"coordinator.email_address\"}\n",
    ")\n",
    "\n",
    "# Site information\n",
    "aidd.add_column(\n",
    "    name=\"site_id\",\n",
    "    type=\"category\",\n",
    "    params={\n",
    "        \"values\": [\"SITE-001\", \"SITE-002\", \"SITE-003\", \"SITE-004\", \"SITE-005\"]\n",
    "    }\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"site_location\",\n",
    "    type=\"category\",\n",
    "    params={\n",
    "        \"values\": [\"London\", \"Manchester\", \"Birmingham\", \"Edinburgh\", \"Cambridge\"]\n",
    "    }\n",
    ")\n",
    "\n",
    "# Study costs\n",
    "aidd.add_column(\n",
    "    name=\"per_patient_cost\",\n",
    "    type=\"gaussian\",\n",
    "    params={\"mean\": 15000, \"stddev\": 5000, \"min\": 5000}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"participant_compensation\",\n",
    "    type=\"gaussian\", \n",
    "    params={\"mean\": 500, \"stddev\": 200, \"min\": 100}\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clinical Measurements and Outcomes\n",
    "\n",
    "These columns will track the key clinical data collected during the trial:\n",
    "- Baseline and final measurements, plus the percent change derived from them\n",
    "- Dose level and dose frequency\n",
    "- Protocol compliance rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic clinical measurements\n",
    "aidd.add_column(\n",
    "    name=\"baseline_measurement\",\n",
    "    type=\"gaussian\",\n",
    "    params={\"mean\": 100, \"stddev\": 15},\n",
    "    convert_to=\"float\"\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"final_measurement\",\n",
    "    type=\"gaussian\",\n",
    "    params={\"mean\": 85, \"stddev\": 20},\n",
    "    convert_to=\"float\"\n",
    ")\n",
    "\n",
    "# Calculate percent change\n",
    "aidd.add_column(\n",
    "    name=\"percent_change\",\n",
    "    type=\"expression\",\n",
    "    params={\"expr\": \"(final_measurement - baseline_measurement) / baseline_measurement * 100\"}\n",
    ")\n",
    "\n",
    "# Dosing information\n",
    "aidd.add_column(\n",
    "    name=\"dose_level\",\n",
    "    type=\"category\",\n",
    "    params={\n",
    "        \"values\": [\"Low\", \"Medium\", \"High\", \"Placebo\"],\n",
    "        \"weights\": [0.3, 0.3, 0.2, 0.2]\n",
    "    }\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"dose_frequency\",\n",
    "    type=\"category\",\n",
    "    params={\n",
    "        \"values\": [\"Once daily\", \"Twice daily\", \"Weekly\", \"Biweekly\"],\n",
    "        \"weights\": [0.4, 0.3, 0.2, 0.1]\n",
    "    }\n",
    ")\n",
    "\n",
    "# Protocol compliance\n",
    "aidd.add_column(\n",
    "    name=\"compliance_rate\",\n",
    "    type=\"uniform\",\n",
    "    params={\"low\": 0.7, \"high\": 1.0}\n",
    ")"
   ]
  },
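  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `percent_change` expression above is the relative change from baseline, scaled to percent. As a worked example, the hypothetical function below applies the same formula to the two Gaussian means used above (100 and 85), which corresponds to a 15% decrease. The `baseline_measurement > 0` constraint added later in this notebook keeps the division well defined."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def percent_change(baseline, final):\n",
    "    \"\"\"Relative change from baseline, expressed in percent.\"\"\"\n",
    "    return (final - baseline) / baseline * 100\n",
    "\n",
    "# Example inputs: the means of the baseline and final Gaussian columns\n",
    "print(round(percent_change(100.0, 85.0), 1))"
   ]
  },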
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adverse Events Tracking\n",
    "\n",
    "Here we'll capture adverse events that occur during the clinical trial:\n",
    "- Adverse event presence and type\n",
    "- Severity and relatedness to treatment\n",
    "- Resolution status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Adverse event flags and details\n",
    "aidd.add_column(\n",
    "    name=\"has_adverse_event\",\n",
    "    type=\"bernoulli\",\n",
    "    params={\"p\": 0.3}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"adverse_event_type\",\n",
    "    type=\"category\",\n",
    "    params={\n",
    "        \"values\": [\"Headache\", \"Nausea\", \"Fatigue\", \"Rash\", \"Dizziness\", \"Pain at injection site\", \"Other\"],\n",
    "        \"weights\": [0.2, 0.15, 0.15, 0.1, 0.1, 0.2, 0.1]\n",
    "    },\n",
    "    conditional_params={\"has_adverse_event == 0\": {\"values\": [\"None\"]}}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"adverse_event_severity\",\n",
    "    type=\"category\",\n",
    "    params={\"values\": [\"Mild\", \"Moderate\", \"Severe\", \"Life-threatening\"]},\n",
    "    conditional_params={\"has_adverse_event == 0\": {\"values\": [\"NA\"]}}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"adverse_event_relatedness\",\n",
    "    type=\"category\",\n",
    "    params={\n",
    "        \"values\": [\"Unrelated\", \"Possibly related\", \"Probably related\", \"Definitely related\"],\n",
    "        \"weights\": [0.2, 0.4, 0.3, 0.1]\n",
    "    },\n",
    "    conditional_params={\"has_adverse_event == 0\": {\"values\": [\"NA\"]}}\n",
    ")\n",
    "\n",
    "aidd.add_column(\n",
    "    name=\"adverse_event_resolved\",\n",
    "    type=\"category\",\n",
    "    params={\"values\": [\"NA\"]},\n",
    "    conditional_params={\"has_adverse_event == 1\": {\"values\": [\"Yes\", \"No\"], \"weights\": [0.8, 0.2]}}\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Narrative text fields with style variations\n",
    "\n",
    "These fields will contain natural language text that incorporates PII elements.\n",
    "We'll use style seed categories to ensure diversity in the writing styles:\n",
    "\n",
    "1. Medical observations and notes\n",
    "2. Adverse event descriptions  \n",
    "3. Protocol deviation explanations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Documentation style category\n",
    "aidd.add_column(\n",
    "    name=\"documentation_style\",\n",
    "    type=\"category\",\n",
    "    params={\n",
    "        \"values\": [\"Formal and Technical\", \"Concise and Direct\", \"Detailed and Descriptive\"],\n",
    "        \"weights\": [0.4, 0.3, 0.3]\n",
    "    }\n",
    ")\n",
    "\n",
    "# Medical observations - varies based on documentation style\n",
    "aidd.add_column(\n",
    "    name=\"medical_observations\",\n",
    "    prompt=\"\"\"\n",
    "    {% if documentation_style == \"Formal and Technical\" %}\n",
    "    Write formal and technical medical observations for participant {{ participant_first_name }} {{ participant_last_name }} \n",
    "    (ID: {{ participant_id }}) in the clinical trial for {{ therapeutic_area }} (Study ID: {{ study_id }}).\n",
    "    \n",
    "    Include observations related to their enrollment in the {{ dose_level }} dose group with {{ dose_frequency }} administration.\n",
    "    Baseline measurement was {{ baseline_measurement }} and final measurement was {{ final_measurement }}, representing a \n",
    "    change of {{ percent_change }}%.\n",
    "    \n",
    "    Use proper medical terminology, maintain a highly formal tone, and structure the notes in a technical format with appropriate \n",
    "    sections and subsections. Include at least one reference to the site investigator, Dr. {{ investigator_last_name }}.\n",
    "    {% elif documentation_style == \"Concise and Direct\" %}\n",
    "    Write brief, direct medical observations for patient {{ participant_first_name }} {{ participant_last_name }} \n",
    "    ({{ participant_id }}) in {{ therapeutic_area }} trial {{ study_id }}.\n",
    "    \n",
    "    Note: {{ dose_level }} dose, {{ dose_frequency }}. Baseline: {{ baseline_measurement }}. Final: {{ final_measurement }}. \n",
    "    Change: {{ percent_change }}%.\n",
    "    \n",
    "    Keep notes extremely concise, using abbreviations where appropriate. Mention follow-up needs and reference \n",
    "    Dr. {{ investigator_last_name }} briefly.\n",
    "    {% else %}\n",
    "    Write detailed and descriptive medical observations for participant {{ participant_first_name }} {{ participant_last_name }}\n",
    "    enrolled in the {{ therapeutic_area }} clinical trial ({{ study_id }}).\n",
    "    \n",
    "    Provide a narrative description of their experience in the {{ dose_level }} dose group with {{ dose_frequency }} dosing.\n",
    "    Describe how their measurements changed from baseline ({{ baseline_measurement }}) to final ({{ final_measurement }}),\n",
    "    representing a {{ percent_change }}% change.\n",
    "    \n",
    "    Use a mix of technical terms and explanatory language. Include thorough descriptions of observed effects and subjective\n",
    "    patient reports. Mention interactions with the investigator, Dr. {{ investigator_first_name }} {{ investigator_last_name }}.\n",
    "    {% endif %}\n",
    "    \"\"\"\n",
    ")\n",
    "\n",
    "# Adverse event descriptions - conditional on having an adverse event\n",
    "aidd.add_column(\n",
    "    name=\"adverse_event_description\",\n",
    "    prompt=\"\"\"\n",
    "    {% if has_adverse_event == 1 %}\n",
    "    [INSTRUCTIONS: Write a brief clinical description (1-2 sentences only) of the adverse event. Use formal medical language. Do not include meta-commentary or explain what you're doing.]\\\n",
    "    {{adverse_event_type}}, {{adverse_event_severity}}. {{adverse_event_relatedness}} to study treatment. \n",
    "    {% if adverse_event_resolved == \"Yes\" %}Resolved.{% else %}Ongoing.{% endif %}\n",
    "    {% else %}\n",
    "    [INSTRUCTIONS: Output only the exact text \"No adverse events reported\" without any additional commentary.]\\\n",
    "    No adverse events reported.\\\n",
    "    {% endif %}\n",
    "    \"\"\"\n",
    ")\n",
    "\n",
    "# Protocol deviation description (if compliance is low)\n",
    "aidd.add_column(\n",
    "    name=\"protocol_deviation\",\n",
    "    prompt=\"\"\"\n",
    "    {% if compliance_rate < 0.85 %}\n",
    "    {% if documentation_style == \"Formal and Technical\" %}\n",
    "    [FORMAT INSTRUCTIONS: Write in a direct documentation style. Do not use phrases like \"it looks like\" or \"you've provided\". Begin with the protocol deviation details. Use formal terminology.]\n",
    "    \n",
    "    PROTOCOL DEVIATION REPORT\n",
    "    Study ID: {{ study_id }}\n",
    "    Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\n",
    "    Compliance Rate: {{ compliance_rate }}\n",
    "    \n",
    "    [Continue with formal description of the deviation, impact on data integrity, and corrective actions. Reference coordinator {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_last_name }}]\n",
    "    {% elif documentation_style == \"Concise and Direct\" %}\n",
    "    [FORMAT INSTRUCTIONS: Use only brief notes and bullet points. No introductions or explanations.]\n",
    "    \n",
    "    PROTOCOL DEVIATION - {{ participant_id }}\n",
    "    • Compliance: {{ compliance_rate }}\n",
    "    • Impact: [severity level]\n",
    "    • Actions: [list actions]\n",
    "    • Coordinator: {{ coordinator_first_name }} {{ coordinator_last_name }}\n",
    "    • PI: Dr. {{ investigator_last_name }}\n",
    "    {% else %}\n",
    "    [FORMAT INSTRUCTIONS: Write a narrative description. Begin directly with the deviation details. No meta-commentary.]\n",
    "    \n",
    "    During the {{ therapeutic_area }} study at {{ site_location }}, participant {{ participant_first_name }} {{ participant_last_name }} demonstrated a compliance rate of {{ compliance_rate }}, which constitutes a protocol deviation.\n",
    "    \n",
    "    [Continue with narrative about circumstances, discovery, impact, and team response. Include references to {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_first_name }} {{ investigator_last_name }}]\n",
    "    {% endif %}\n",
    "    {% else %}\n",
    "    [FORMAT INSTRUCTIONS: Write a simple direct statement. No meta-commentary or explanation.]\n",
    "    \n",
    "    PROTOCOL COMPLIANCE ASSESSMENT\n",
    "    Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\n",
    "    Finding: No protocol deviations. Compliance rate: {{ compliance_rate }}.\n",
    "    {% endif %}\n",
    "    \"\"\"\n",
    ")"
   ]
  },
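  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The prompts above use Jinja-style `{% if %}` blocks and `{{ column }}` placeholders, so each row gets a different prompt depending on its column values. The sketch below renders a shortened stand-in for the `adverse_event_description` prompt with the `jinja2` package to show how the branches resolve. That Data Designer renders prompts with `jinja2` itself is an assumption, and the sample values are made up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from jinja2 import Template\n",
    "\n",
    "# Shortened stand-in for the adverse_event_description prompt (illustration only)\n",
    "template = Template(\n",
    "    \"{% if has_adverse_event == 1 %}\"\n",
    "    \"{{ adverse_event_type }}, {{ adverse_event_severity }}.\"\n",
    "    \"{% else %}\"\n",
    "    \"No adverse events reported.\"\n",
    "    \"{% endif %}\"\n",
    ")\n",
    "\n",
    "# A row with an adverse event takes the first branch\n",
    "print(template.render(has_adverse_event=1, adverse_event_type=\"Headache\", adverse_event_severity=\"Mild\"))\n",
    "\n",
    "# A row without one takes the else branch\n",
    "print(template.render(has_adverse_event=0))"
   ]
  },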
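  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adding Constraints\n",
    "\n",
    "Finally, we'll add constraints to ensure our data is logically consistent:\n",
    "- The trial end date must fall after the trial start date\n",
    "- Enrollment must occur on or after the trial start date and before the trial end date\n",
    "- Baseline and final measurements must be positive\n",
    "\n",
    "A constraint on adverse event dates is left commented out because this notebook does not define an `adverse_event_date` column."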
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ensure appropriate date sequence\n",
    "aidd.add_constraint(\n",
    "    target_column=\"trial_end_date\",\n",
    "    type=\"column_inequality\",\n",
    "    params={\"operator\": \">\", \"rhs\": \"trial_start_date\"}\n",
    ")\n",
    "\n",
    "aidd.add_constraint(\n",
    "    target_column=\"enrollment_date\",\n",
    "    type=\"column_inequality\",\n",
    "    params={\"operator\": \">=\", \"rhs\": \"trial_start_date\"}\n",
    ")\n",
    "\n",
    "aidd.add_constraint(\n",
    "    target_column=\"enrollment_date\",\n",
    "    type=\"column_inequality\",\n",
    "    params={\"operator\": \"<\", \"rhs\": \"trial_end_date\"}\n",
    ")\n",
    "\n",
    "# If there's an adverse event, ensure date is after enrollment\n",
    "# aidd.add_constraint(\n",
    "#     target_column=\"adverse_event_date\",\n",
    "#     type=\"column_inequality\",\n",
    "#     params={\"operator\": \">\", \"rhs\": \"enrollment_date\"}\n",
    "# )\n",
    "\n",
    "# Ensure reasonable clinical measurements\n",
    "aidd.add_constraint(\n",
    "    target_column=\"baseline_measurement\",\n",
    "    type=\"scalar_inequality\",\n",
    "    params={\"operator\": \">\", \"rhs\": 0}\n",
    ")\n",
    "\n",
    "aidd.add_constraint(\n",
    "    target_column=\"final_measurement\",\n",
    "    type=\"scalar_inequality\",\n",
    "    params={\"operator\": \">\", \"rhs\": 0}\n",
    ")\n"
   ]
  },
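  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same rules can be re-checked on a generated DataFrame with pandas. The sketch below is a minimal, hypothetical check that runs on two hand-written rows standing in for generated data (they are not real output). Once the full dataset exists, the same expressions could be applied to `workflow_run.dataset.df`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hand-written rows standing in for generated data (not real output)\n",
    "df = pd.DataFrame({\n",
    "    \"trial_start_date\": [\"2022-03-01\", \"2023-01-15\"],\n",
    "    \"enrollment_date\": [\"2022-03-20\", \"2023-02-01\"],\n",
    "    \"trial_end_date\": [\"2023-09-30\", \"2024-06-30\"],\n",
    "    \"baseline_measurement\": [98.2, 110.5],\n",
    "    \"final_measurement\": [80.1, 95.0],\n",
    "})\n",
    "\n",
    "# Parse the %Y-%m-%d strings so the comparisons are on real dates\n",
    "dates = df[[\"trial_start_date\", \"enrollment_date\", \"trial_end_date\"]].apply(pd.to_datetime)\n",
    "\n",
    "# One boolean per row: True when every constraint holds\n",
    "row_ok = (\n",
    "    (dates[\"trial_end_date\"] > dates[\"trial_start_date\"])\n",
    "    & (dates[\"enrollment_date\"] >= dates[\"trial_start_date\"])\n",
    "    & (dates[\"enrollment_date\"] < dates[\"trial_end_date\"])\n",
    "    & (df[\"baseline_measurement\"] > 0)\n",
    "    & (df[\"final_measurement\"] > 0)\n",
    ")\n",
    "print(row_ok.all())"
   ]
  },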
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preview and Generate Dataset\n",
    "\n",
    "First, we'll preview a small sample to verify our configuration is working correctly.\n",
    "Then we'll generate the full dataset with the desired number of records."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Preview a few records\n",
    "preview = aidd.preview()\n",
    "preview.display_sample_record()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# More previews\n",
    "preview.display_sample_record()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define a common name for the workflow and file\n",
    "workflow_name = \"clinical-trial-data\"\n",
    "\n",
    "# Submit batch job\n",
    "workflow_run = aidd.create(\n",
    "    num_records=100,\n",
    "    workflow_run_name=workflow_name,\n",
    "    wait_for_completion=True\n",
    ")\n",
    "\n",
    "print(\"\\nGenerated dataset shape:\", workflow_run.dataset.df.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the first few rows of the generated dataset\n",
    "workflow_run.dataset.df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the dataset\n",
    "csv_filename = f\"{workflow_name}.csv\"\n",
    "workflow_run.dataset.df.to_csv(csv_filename, index=False)\n",
    "print(f\"Dataset with {len(workflow_run.dataset.df)} records saved to {csv_filename}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }