Created
April 1, 2025 15:38
dd-sdk-clinical-trials-data-yk
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Synthetic Clinical Trial Dataset Generator\n", | |
"\n", | |
"This notebook creates a synthetic dataset of clinical trial records with realistic PII (Personally Identifiable Information) for testing data protection and anonymization techniques.\n", | |
"\n", | |
"The dataset includes:\n", | |
"- Trial information and study design\n", | |
"- Participant demographics and health data (PII)\n", | |
"- Investigator and coordinator information (PII)\n", | |
"- Medical observations and notes with embedded PII\n", | |
"- Adverse event reports with varying severity\n", | |
"\n", | |
"We'll use Gretel's Data Designer to create this fully synthetic dataset from scratch." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%%capture\n", | |
"# Install required packages\n", | |
"%pip install -U git+https://github.com/gretelai/gretel-python-client" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Setting up Data Designer\n", | |
"\n", | |
"First, we'll initialize the Gretel client and create a new Data Designer object. We'll use the `apache-2.0` model suite for this project." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from gretel_client.navigator_client import Gretel\n", | |
"\n", | |
"# Initialize Gretel client - this will prompt for your API key\n", | |
"gretel = Gretel(api_key=\"prompt\", endpoint=\"https://api.dev.gretel.ai\")\n", | |
"\n", | |
"aidd = gretel.data_designer.new(model_suite=\"apache-2.0\")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Setting up Person Samplers\n", | |
"\n", | |
"We'll create person samplers to generate consistent personal information for different roles in the clinical trial process:\n", | |
"- Participants (patients enrolled in the trial)\n", | |
"- Investigators (doctors conducting the trial)\n", | |
"- Study coordinators (staff managing the trial)\n", | |
"- Sponsors (pharmaceutical company representatives)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 👩🚀 Person Attributes\n", | |
"\n", | |
"I'll create a markdown table without the default column:\n", | |
"\n", | |
"| Field Name | Type | Alias | Description |\n", | |
"|------------|------|-------|-------------|\n", | |
"| first_name | str | | Person's first name |\n", | |
"| middle_name | str \\| None | | Person's middle name (optional) |\n", | |
"| last_name | str | | Person's last name |\n", | |
"| sex | SexT | | Person's sex (enum type) |\n", | |
"| age | int | | Person's age |\n", | |
"| postcode | str | zipcode | Postal/ZIP code |\n", | |
"| street_number | int \\| str | | Street number (can be numeric or alphanumeric) |\n", | |
"| street_name | str | | Name of the street |\n", | |
"| unit | str | | Unit/apartment number |\n", | |
"| city | str | | City name |\n", | |
"| region | str \\| None | state | Region/state (optional) |\n", | |
"| district | str \\| None | county | District/county (optional) |\n", | |
"| country | str | | Country name |\n", | |
"| ethnic_background | str \\| None | | Ethnic background (optional) |\n", | |
"| marital_status | str \\| None | | Marital status (optional) |\n", | |
"| education_level | str \\| None | | Education level (optional) |\n", | |
"| bachelors_field | str \\| None | | Field of bachelor's degree (optional) |\n", | |
"| occupation | str \\| None | | Occupation (optional) |\n", | |
"| uuid | UUID | | Unique identifier |\n", | |
"| locale | str | | Locale setting |\n", | |
"| phone_number | PhoneNumber \\| None | | Generated phone number based on location (None for age < 18) |\n", | |
"| email_address | EmailStr \\| None | | Generated email address (None for age < 18) |\n", | |
"| birth_date | date | | Calculated birth date based on age |\n", | |
"| national_id | str \\| None | | National ID (SSN for US locale) |\n", | |
"| ssn | str \\| None | | Alias for national_id |" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Create person samplers for different roles, using en_GB locale\n", | |
"aidd.with_person_samplers({\n", | |
" \"participant\": {\"locale\": \"en_GB\"},\n", | |
" \"investigator\": {\"locale\": \"en_GB\"},\n", | |
" \"coordinator\": {\"locale\": \"en_GB\"},\n", | |
" \"sponsor\": {\"locale\": \"en_GB\"}\n", | |
"})" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Creating Trial Information\n", | |
"\n", | |
"Next, we'll create the basic trial information:\n", | |
"- Study ID (unique identifier)\n", | |
"- Trial phase and therapeutic area\n", | |
"- Study design details\n", | |
"- Start and end dates for the trial" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Study identifiers\n", | |
"aidd.add_column(\n", | |
" name=\"study_id\",\n", | |
" type=\"uuid\",\n", | |
" params={\"prefix\": \"CT-\", \"short_form\": True, \"uppercase\": True}\n", | |
")\n", | |
"\n", | |
"# Trial phase\n", | |
"aidd.add_column(\n", | |
" name=\"trial_phase\",\n", | |
" type=\"category\",\n", | |
" params={\n", | |
" \"values\": [\"Phase I\", \"Phase II\", \"Phase III\", \"Phase IV\"],\n", | |
" \"weights\": [0.2, 0.3, 0.4, 0.1]\n", | |
" }\n", | |
")\n", | |
"\n", | |
"# Therapeutic area\n", | |
"aidd.add_column(\n", | |
" name=\"therapeutic_area\",\n", | |
" type=\"category\",\n", | |
" params={\n", | |
" \"values\": [\"Oncology\", \"Cardiology\", \"Neurology\", \"Immunology\", \"Infectious Disease\"],\n", | |
" \"weights\": [0.3, 0.2, 0.2, 0.15, 0.15]\n", | |
" }\n", | |
")\n", | |
"\n", | |
"# Study design\n", | |
"aidd.add_column(\n", | |
" name=\"study_design\",\n", | |
" type=\"subcategory\",\n", | |
" params={\n", | |
" \"category\": \"trial_phase\",\n", | |
" \"values\": {\n", | |
" \"Phase I\": [\"Single Arm\", \"Dose Escalation\", \"First-in-Human\", \"Safety Assessment\"],\n", | |
" \"Phase II\": [\"Randomized\", \"Double-Blind\", \"Proof of Concept\", \"Open-Label Extension\"],\n", | |
" \"Phase III\": [\"Randomized Controlled\", \"Double-Blind Placebo-Controlled\", \"Multi-Center\", \"Pivotal\"],\n", | |
" \"Phase IV\": [\"Post-Marketing Surveillance\", \"Real-World Evidence\", \"Long-Term Safety\", \"Expanded Access\"]\n", | |
" }\n", | |
" }\n", | |
")\n", | |
"\n", | |
"# Trial dates\n", | |
"aidd.add_column(\n", | |
" name=\"trial_start_date\",\n", | |
" type=\"datetime\",\n", | |
" params={\"start\": \"2022-01-01\", \"end\": \"2023-06-30\"},\n", | |
" convert_to=\"%Y-%m-%d\"\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"trial_end_date\",\n", | |
" type=\"datetime\",\n", | |
" params={\"start\": \"2023-07-01\", \"end\": \"2024-12-31\"},\n", | |
" convert_to=\"%Y-%m-%d\"\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Participant Information\n", | |
"\n", | |
"Now we'll create fields for participant demographics and enrollment details:\n", | |
"- Participant ID and basic information\n", | |
"- Demographics (age, gender, etc.)\n", | |
"- Enrollment status and dates\n", | |
"- Randomization assignment" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Participant identifiers and information\n", | |
"aidd.add_column(\n", | |
" name=\"participant_id\",\n", | |
" type=\"uuid\",\n", | |
" params={\"prefix\": \"PT-\", \"short_form\": True, \"uppercase\": True}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"participant_first_name\",\n", | |
" type=\"expression\",\n", | |
" params={\"expr\": \"participant.first_name\"}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"participant_last_name\",\n", | |
" type=\"expression\",\n", | |
" params={\"expr\": \"participant.last_name\"}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"participant_birth_date\",\n", | |
" type=\"expression\",\n", | |
" params={\"expr\": \"participant.birth_date\"}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"participant_email\",\n", | |
" type=\"expression\",\n", | |
" params={\"expr\": \"participant.email_address\"}\n", | |
")\n", | |
"\n", | |
"# Enrollment information\n", | |
"aidd.add_column(\n", | |
" name=\"enrollment_date\",\n", | |
" type=\"timedelta\",\n", | |
" params={\n", | |
" \"dt_min\": 0,\n", | |
" \"dt_max\": 60,\n", | |
" \"reference_column_name\": \"trial_start_date\",\n", | |
" \"unit\": \"D\"\n", | |
" },\n", | |
" convert_to=\"%Y-%m-%d\"\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"participant_status\",\n", | |
" type=\"category\",\n", | |
" params={\n", | |
" \"values\": [\"Active\", \"Completed\", \"Withdrawn\", \"Lost to Follow-up\"],\n", | |
" \"weights\": [0.6, 0.2, 0.15, 0.05]\n", | |
" }\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"treatment_arm\",\n", | |
" type=\"category\",\n", | |
" params={\n", | |
" \"values\": [\"Treatment\", \"Placebo\", \"Standard of Care\"],\n", | |
" \"weights\": [0.5, 0.3, 0.2]\n", | |
" }\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Investigator and Staff Information\n", | |
"\n", | |
"Here we'll add information about the trial staff:\n", | |
"- Investigator information (principal investigator)\n", | |
"- Study coordinator details\n", | |
"- Site information" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Investigator information\n", | |
"aidd.add_column(\n", | |
" name=\"investigator_first_name\",\n", | |
" type=\"expression\",\n", | |
" params={\"expr\": \"investigator.first_name\"}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"investigator_last_name\",\n", | |
" type=\"expression\",\n", | |
" params={\"expr\": \"investigator.last_name\"}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"investigator_id\",\n", | |
" type=\"uuid\",\n", | |
" params={\"prefix\": \"INV-\", \"short_form\": True, \"uppercase\": True}\n", | |
")\n", | |
"\n", | |
"# Study coordinator information\n", | |
"aidd.add_column(\n", | |
" name=\"coordinator_first_name\",\n", | |
" type=\"expression\",\n", | |
" params={\"expr\": \"coordinator.first_name\"}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"coordinator_last_name\",\n", | |
" type=\"expression\",\n", | |
" params={\"expr\": \"coordinator.last_name\"}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"coordinator_email\",\n", | |
" type=\"expression\",\n", | |
" params={\"expr\": \"coordinator.email_address\"}\n", | |
")\n", | |
"\n", | |
"# Site information\n", | |
"aidd.add_column(\n", | |
" name=\"site_id\",\n", | |
" type=\"category\",\n", | |
" params={\n", | |
" \"values\": [\"SITE-001\", \"SITE-002\", \"SITE-003\", \"SITE-004\", \"SITE-005\"]\n", | |
" }\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"site_location\",\n", | |
" type=\"category\",\n", | |
" params={\n", | |
" \"values\": [\"London\", \"Manchester\", \"Birmingham\", \"Edinburgh\", \"Cambridge\"]\n", | |
" }\n", | |
")\n", | |
"\n", | |
"# Study costs\n", | |
"aidd.add_column(\n", | |
" name=\"per_patient_cost\",\n", | |
" type=\"gaussian\",\n", | |
" params={\"mean\": 15000, \"stddev\": 5000, \"min\": 5000}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"participant_compensation\",\n", | |
" type=\"gaussian\", \n", | |
" params={\"mean\": 500, \"stddev\": 200, \"min\": 100}\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Clinical Measurements and Outcomes\n", | |
"\n", | |
"These columns will track the key clinical data collected during the trial:\n", | |
"- Vital signs and lab values\n", | |
"- Efficacy measurements \n", | |
"- Dosing information" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Basic clinical measurements\n", | |
"aidd.add_column(\n", | |
" name=\"baseline_measurement\",\n", | |
" type=\"gaussian\",\n", | |
" params={\"mean\": 100, \"stddev\": 15},\n", | |
" convert_to=\"float\"\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"final_measurement\",\n", | |
" type=\"gaussian\",\n", | |
" params={\"mean\": 85, \"stddev\": 20},\n", | |
" convert_to=\"float\"\n", | |
")\n", | |
"\n", | |
"# Calculate percent change\n", | |
"aidd.add_column(\n", | |
" name=\"percent_change\",\n", | |
" type=\"expression\",\n", | |
" params={\"expr\": \"(final_measurement - baseline_measurement) / baseline_measurement * 100\"}\n", | |
")\n", | |
"\n", | |
"# Dosing information\n", | |
"aidd.add_column(\n", | |
" name=\"dose_level\",\n", | |
" type=\"category\",\n", | |
" params={\n", | |
" \"values\": [\"Low\", \"Medium\", \"High\", \"Placebo\"],\n", | |
" \"weights\": [0.3, 0.3, 0.2, 0.2]\n", | |
" }\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"dose_frequency\",\n", | |
" type=\"category\",\n", | |
" params={\n", | |
" \"values\": [\"Once daily\", \"Twice daily\", \"Weekly\", \"Biweekly\"],\n", | |
" \"weights\": [0.4, 0.3, 0.2, 0.1]\n", | |
" }\n", | |
")\n", | |
"\n", | |
"# Protocol compliance\n", | |
"aidd.add_column(\n", | |
" name=\"compliance_rate\",\n", | |
" type=\"uniform\",\n", | |
" params={\"low\": 0.7, \"high\": 1.0}\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Adverse Events Tracking\n", | |
"\n", | |
"Here we'll capture adverse events that occur during the clinical trial:\n", | |
"- Adverse event presence and type\n", | |
"- Severity and relatedness to treatment\n", | |
"- Dates and resolution" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Adverse event flags and details\n", | |
"aidd.add_column(\n", | |
" name=\"has_adverse_event\",\n", | |
" type=\"bernoulli\",\n", | |
" params={\"p\": 0.3}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"adverse_event_type\",\n", | |
" type=\"category\",\n", | |
" params={\n", | |
" \"values\": [\"Headache\", \"Nausea\", \"Fatigue\", \"Rash\", \"Dizziness\", \"Pain at injection site\", \"Other\"],\n", | |
" \"weights\": [0.2, 0.15, 0.15, 0.1, 0.1, 0.2, 0.1]\n", | |
" },\n", | |
" conditional_params={\"has_adverse_event == 0\": {\"values\": [\"None\"]}}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"adverse_event_severity\",\n", | |
" type=\"category\",\n", | |
" params={\"values\": [\"Mild\", \"Moderate\", \"Severe\", \"Life-threatening\"]},\n", | |
" conditional_params={\"has_adverse_event == 0\": {\"values\": [\"NA\"]}}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"adverse_event_relatedness\",\n", | |
" type=\"category\",\n", | |
" params={\n", | |
" \"values\": [\"Unrelated\", \"Possibly related\", \"Probably related\", \"Definitely related\"],\n", | |
" \"weights\": [0.2, 0.4, 0.3, 0.1]\n", | |
" },\n", | |
" conditional_params={\"has_adverse_event == 0\": {\"values\": [\"NA\"]}}\n", | |
")\n", | |
"\n", | |
"aidd.add_column(\n", | |
" name=\"adverse_event_resolved\",\n", | |
" type=\"category\",\n", | |
" params={\"values\": [\"NA\"]},\n", | |
" conditional_params={\"has_adverse_event == 1\": {\"values\": [\"Yes\", \"No\"], \"weights\": [0.8, 0.2]}}\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Narrative text fields with style variations\n", | |
"\n", | |
"These fields will contain natural language text that incorporates PII elements.\n", | |
"We'll use style seed categories to ensure diversity in the writing styles:\n", | |
"\n", | |
"1. Medical observations and notes\n", | |
"2. Adverse event descriptions \n", | |
"3. Protocol deviation explanations" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Documentation style category\n", | |
"aidd.add_column(\n", | |
" name=\"documentation_style\",\n", | |
" type=\"category\",\n", | |
" params={\n", | |
" \"values\": [\"Formal and Technical\", \"Concise and Direct\", \"Detailed and Descriptive\"],\n", | |
" \"weights\": [0.4, 0.3, 0.3]\n", | |
" }\n", | |
")\n", | |
"\n", | |
"# Medical observations - varies based on documentation style\n", | |
"aidd.add_column(\n", | |
" name=\"medical_observations\",\n", | |
" prompt=\"\"\"\n", | |
" {% if documentation_style == \"Formal and Technical\" %}\n", | |
" Write formal and technical medical observations for participant {{ participant_first_name }} {{ participant_last_name }} \n", | |
" (ID: {{ participant_id }}) in the clinical trial for {{ therapeutic_area }} (Study ID: {{ study_id }}).\n", | |
" \n", | |
" Include observations related to their enrollment in the {{ dose_level }} dose group with {{ dose_frequency }} administration.\n", | |
" Baseline measurement was {{ baseline_measurement }} and final measurement was {{ final_measurement }}, representing a \n", | |
" change of {{ percent_change }}%.\n", | |
" \n", | |
" Use proper medical terminology, maintain a highly formal tone, and structure the notes in a technical format with appropriate \n", | |
" sections and subsections. Include at least one reference to the site investigator, Dr. {{ investigator_last_name }}.\n", | |
" {% elif documentation_style == \"Concise and Direct\" %}\n", | |
" Write brief, direct medical observations for patient {{ participant_first_name }} {{ participant_last_name }} \n", | |
" ({{ participant_id }}) in {{ therapeutic_area }} trial {{ study_id }}.\n", | |
" \n", | |
" Note: {{ dose_level }} dose, {{ dose_frequency }}. Baseline: {{ baseline_measurement }}. Final: {{ final_measurement }}. \n", | |
" Change: {{ percent_change }}%.\n", | |
" \n", | |
" Keep notes extremely concise, using abbreviations where appropriate. Mention follow-up needs and reference \n", | |
" Dr. {{ investigator_last_name }} briefly.\n", | |
" {% else %}\n", | |
" Write detailed and descriptive medical observations for participant {{ participant_first_name }} {{ participant_last_name }}\n", | |
" enrolled in the {{ therapeutic_area }} clinical trial ({{ study_id }}).\n", | |
" \n", | |
" Provide a narrative description of their experience in the {{ dose_level }} dose group with {{ dose_frequency }} dosing.\n", | |
" Describe how their measurements changed from baseline ({{ baseline_measurement }}) to final ({{ final_measurement }}),\n", | |
" representing a {{ percent_change }}% change.\n", | |
" \n", | |
" Use a mix of technical terms and explanatory language. Include thorough descriptions of observed effects and subjective\n", | |
" patient reports. Mention interactions with the investigator, Dr. {{ investigator_first_name }} {{ investigator_last_name }}.\n", | |
" {% endif %}\n", | |
" \"\"\"\n", | |
")\n", | |
"\n", | |
"# Adverse event descriptions - conditional on having an adverse event\n", | |
"aidd.add_column(\n", | |
" name=\"adverse_event_description\",\n", | |
" prompt=\"\"\"\n", | |
" {% if has_adverse_event == 1 %}\n", | |
" [INSTRUCTIONS: Write a brief clinical description (1-2 sentences only) of the adverse event. Use formal medical language. Do not include meta-commentary or explain what you're doing.]\\\n", | |
" {{adverse_event_type}}, {{adverse_event_severity}}. {{adverse_event_relatedness}} to study treatment. \n", | |
" {% if adverse_event_resolved == \"Yes\" %}Resolved.{% else %}Ongoing.{% endif %}\n", | |
" {% else %}\n", | |
" [INSTRUCTIONS: Output only the exact text \"No adverse events reported\" without any additional commentary.]\\\n", | |
" No adverse events reported.\\\n", | |
" {% endif %}\n", | |
" \"\"\"\n", | |
")\n", | |
"\n", | |
"# Protocol deviation description (if compliance is low)\n", | |
"aidd.add_column(\n", | |
" name=\"protocol_deviation\",\n", | |
" prompt=\"\"\"\n", | |
" {% if compliance_rate < 0.85 %}\n", | |
" {% if documentation_style == \"Formal and Technical\" %}\n", | |
" [FORMAT INSTRUCTIONS: Write in a direct documentation style. Do not use phrases like \"it looks like\" or \"you've provided\". Begin with the protocol deviation details. Use formal terminology.]\n", | |
" \n", | |
" PROTOCOL DEVIATION REPORT\n", | |
" Study ID: {{ study_id }}\n", | |
" Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\n", | |
" Compliance Rate: {{ compliance_rate }}\n", | |
" \n", | |
" [Continue with formal description of the deviation, impact on data integrity, and corrective actions. Reference coordinator {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_last_name }}]\n", | |
" {% elif documentation_style == \"Concise and Direct\" %}\n", | |
" [FORMAT INSTRUCTIONS: Use only brief notes and bullet points. No introductions or explanations.]\n", | |
" \n", | |
" PROTOCOL DEVIATION - {{ participant_id }}\n", | |
" • Compliance: {{ compliance_rate }}\n", | |
" • Impact: [severity level]\n", | |
" • Actions: [list actions]\n", | |
" • Coordinator: {{ coordinator_first_name }} {{ coordinator_last_name }}\n", | |
" • PI: Dr. {{ investigator_last_name }}\n", | |
" {% else %}\n", | |
" [FORMAT INSTRUCTIONS: Write a narrative description. Begin directly with the deviation details. No meta-commentary.]\n", | |
" \n", | |
" During the {{ therapeutic_area }} study at {{ site_location }}, participant {{ participant_first_name }} {{ participant_last_name }} demonstrated a compliance rate of {{ compliance_rate }}, which constitutes a protocol deviation.\n", | |
" \n", | |
" [Continue with narrative about circumstances, discovery, impact, and team response. Include references to {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_first_name }} {{ investigator_last_name }}]\n", | |
" {% endif %}\n", | |
" {% else %}\n", | |
" [FORMAT INSTRUCTIONS: Write a simple direct statement. No meta-commentary or explanation.]\n", | |
" \n", | |
" PROTOCOL COMPLIANCE ASSESSMENT\n", | |
" Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\n", | |
" Finding: No protocol deviations. Compliance rate: {{ compliance_rate }}.\n", | |
" {% endif %}\n", | |
" \"\"\"\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Adding Constraints\n", | |
"\n", | |
"Finally, we'll add constraints to ensure our data is logically consistent:\n", | |
"- Trial dates must be in proper sequence\n", | |
"- Adverse event dates must occur after enrollment\n", | |
"- Measurement changes must be realistic" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Ensure appropriate date sequence\n", | |
"aidd.add_constraint(\n", | |
" target_column=\"trial_end_date\",\n", | |
" type=\"column_inequality\",\n", | |
" params={\"operator\": \">\", \"rhs\": \"trial_start_date\"}\n", | |
")\n", | |
"\n", | |
"aidd.add_constraint(\n", | |
" target_column=\"enrollment_date\",\n", | |
" type=\"column_inequality\",\n", | |
" params={\"operator\": \">=\", \"rhs\": \"trial_start_date\"}\n", | |
")\n", | |
"\n", | |
"aidd.add_constraint(\n", | |
" target_column=\"enrollment_date\",\n", | |
" type=\"column_inequality\",\n", | |
" params={\"operator\": \"<\", \"rhs\": \"trial_end_date\"}\n", | |
")\n", | |
"\n", | |
"# If there's an adverse event, ensure date is after enrollment\n", | |
"# aidd.add_constraint(\n", | |
"# target_column=\"adverse_event_date\",\n", | |
"# type=\"column_inequality\",\n", | |
"# params={\"operator\": \">\", \"rhs\": \"enrollment_date\"}\n", | |
"# )\n", | |
"\n", | |
"# Ensure reasonable clinical measurements\n", | |
"aidd.add_constraint(\n", | |
" target_column=\"baseline_measurement\",\n", | |
" type=\"scalar_inequality\",\n", | |
" params={\"operator\": \">\", \"rhs\": 0}\n", | |
")\n", | |
"\n", | |
"aidd.add_constraint(\n", | |
" target_column=\"final_measurement\",\n", | |
" type=\"scalar_inequality\",\n", | |
" params={\"operator\": \">\", \"rhs\": 0}\n", | |
")\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Preview and Generate Dataset\n", | |
"\n", | |
"First, we'll preview a small sample to verify our configuration is working correctly.\n", | |
"Then we'll generate the full dataset with the desired number of records." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Preview a few records\n", | |
"preview = aidd.preview()\n", | |
"preview.display_sample_record()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# More previews\n", | |
"preview.display_sample_record()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Define a common name for the workflow and file\n", | |
"workflow_name = \"clinical-trial-data\"\n", | |
"\n", | |
"# Submit batch job\n", | |
"workflow_run = aidd.create(\n", | |
" num_records=100,\n", | |
" workflow_run_name=workflow_name,\n", | |
" wait_for_completion=True\n", | |
")\n", | |
"\n", | |
"print(\"\\nGenerated dataset shape:\", workflow_run.dataset.df.shape)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Display the first few rows of the generated dataset\n", | |
"workflow_run.dataset.df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Save the dataset\n", | |
"csv_filename = f\"{workflow_name}.csv\" \n", | |
"workflow_run.dataset.df.to_csv(csv_filename, index=False)\n", | |
"print(f\"Dataset with {len(workflow_run.dataset.df)} records saved to {csv_filename}\")" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "base", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.11.5" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
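
The constraints and conditional parameters in the notebook can be spot-checked on the saved CSV after generation. The following is a minimal sketch using only the Python standard library, not part of the notebook or of the Gretel client. The helper names (`check_row`, `check_csv`, `is_true`) are hypothetical, and it assumes that dates are written as `YYYY-MM-DD` strings (per the `convert_to` settings) and that `has_adverse_event` is serialized as `1`/`0` or `True`/`False`.

```python
import csv
from datetime import date

# Placeholder values expected when has_adverse_event is 0, taken from the
# conditional_params used in the notebook.
NO_EVENT_VALUES = {
    "adverse_event_type": "None",
    "adverse_event_severity": "NA",
    "adverse_event_relatedness": "NA",
    "adverse_event_resolved": "NA",
}


def is_true(value):
    """Assumption: the flag is serialized as 1/0, 1.0/0.0, or True/False."""
    return str(value).strip().lower() in {"1", "1.0", "true"}


def check_row(row):
    """Return a list of problems found in one record (empty if none)."""
    problems = []

    # Date ordering: start < end, and start <= enrollment < end.
    start = date.fromisoformat(row["trial_start_date"])
    end = date.fromisoformat(row["trial_end_date"])
    enrolled = date.fromisoformat(row["enrollment_date"])
    if not end > start:
        problems.append("trial_end_date must be after trial_start_date")
    if not start <= enrolled < end:
        problems.append("enrollment_date is outside the trial window")

    # Measurements must be strictly positive.
    for column in ("baseline_measurement", "final_measurement"):
        if not float(row[column]) > 0:
            problems.append(f"{column} must be positive")

    # Adverse event columns must agree with the has_adverse_event flag.
    if is_true(row["has_adverse_event"]):
        if row["adverse_event_resolved"] not in {"Yes", "No"}:
            problems.append("adverse_event_resolved must be 'Yes' or 'No'")
    else:
        for column, expected in NO_EVENT_VALUES.items():
            if row[column] != expected:
                problems.append(
                    f"{column} should be {expected!r} without an adverse event"
                )
    return problems


def check_csv(path):
    """Map row index to its problems, for rows that fail any check."""
    report = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for index, row in enumerate(csv.DictReader(handle)):
            problems = check_row(row)
            if problems:
                report[index] = problems
    return report
```

Calling `check_csv("clinical-trial-data.csv")` on the file written by the last cell returns an empty dictionary when every record passes. The `csv` module is used here rather than a dataframe reader so that the literal strings `"None"` and `"NA"` are compared as written instead of being parsed as missing values.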