@yamini
Created April 1, 2025 15:38
dd-sdk-clinical-trials-data-yk
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Synthetic Clinical Trial Dataset Generator\n",
"\n",
"This notebook creates a synthetic dataset of clinical trial records with realistic PII (Personally Identifiable Information) for testing data protection and anonymization techniques.\n",
"\n",
"The dataset includes:\n",
"- Trial information and study design\n",
"- Participant demographics and health data (PII)\n",
"- Investigator and coordinator information (PII)\n",
"- Medical observations and notes with embedded PII\n",
"- Adverse event reports with varying severity\n",
"\n",
"We'll use Gretel's Data Designer to create this fully synthetic dataset from scratch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"# Install required packages\n",
"%pip install -U git+https://github.com/gretelai/gretel-python-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up Data Designer\n",
"\n",
"First, we'll initialize the Gretel client and create a new Data Designer object. We'll use the `apache-2.0` model suite for this project."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gretel_client.navigator_client import Gretel\n",
"\n",
"# Initialize Gretel client - this will prompt for your API key\n",
"gretel = Gretel(api_key=\"prompt\", endpoint=\"https://api.dev.gretel.ai\")\n",
"\n",
"aidd = gretel.data_designer.new(model_suite=\"apache-2.0\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up Person Samplers\n",
"\n",
"We'll create person samplers to generate consistent personal information for different roles in the clinical trial process:\n",
"- Participants (patients enrolled in the trial)\n",
"- Investigators (doctors conducting the trial)\n",
"- Study coordinators (staff managing the trial)\n",
"- Sponsors (pharmaceutical company representatives)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 👩‍🚀 Person Attributes\n",
"\n",
"Each person sampler exposes the attributes below. Later columns reference them in expressions such as `participant.first_name`:\n",
"\n",
"| Field Name | Type | Alias | Description |\n",
"|------------|------|-------|-------------|\n",
"| first_name | str | | Person's first name |\n",
"| middle_name | str \\| None | | Person's middle name (optional) |\n",
"| last_name | str | | Person's last name |\n",
"| sex | SexT | | Person's sex (enum type) |\n",
"| age | int | | Person's age |\n",
"| postcode | str | zipcode | Postal/ZIP code |\n",
"| street_number | int \\| str | | Street number (can be numeric or alphanumeric) |\n",
"| street_name | str | | Name of the street |\n",
"| unit | str | | Unit/apartment number |\n",
"| city | str | | City name |\n",
"| region | str \\| None | state | Region/state (optional) |\n",
"| district | str \\| None | county | District/county (optional) |\n",
"| country | str | | Country name |\n",
"| ethnic_background | str \\| None | | Ethnic background (optional) |\n",
"| marital_status | str \\| None | | Marital status (optional) |\n",
"| education_level | str \\| None | | Education level (optional) |\n",
"| bachelors_field | str \\| None | | Field of bachelor's degree (optional) |\n",
"| occupation | str \\| None | | Occupation (optional) |\n",
"| uuid | UUID | | Unique identifier |\n",
"| locale | str | | Locale setting |\n",
"| phone_number | PhoneNumber \\| None | | Generated phone number based on location (None for age < 18) |\n",
"| email_address | EmailStr \\| None | | Generated email address (None for age < 18) |\n",
"| birth_date | date | | Calculated birth date based on age |\n",
"| national_id | str \\| None | | National ID (SSN for US locale) |\n",
"| ssn | str \\| None | | Alias for national_id |"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create person samplers for different roles, using en_GB locale\n",
"aidd.with_person_samplers({\n",
" \"participant\": {\"locale\": \"en_GB\"},\n",
" \"investigator\": {\"locale\": \"en_GB\"},\n",
" \"coordinator\": {\"locale\": \"en_GB\"},\n",
" \"sponsor\": {\"locale\": \"en_GB\"}\n",
"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Trial Information\n",
"\n",
"Next, we'll create the basic trial information:\n",
"- Study ID (unique identifier)\n",
"- Trial phase and therapeutic area\n",
"- Study design details\n",
"- Start and end dates for the trial"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Study identifiers\n",
"aidd.add_column(\n",
" name=\"study_id\",\n",
" type=\"uuid\",\n",
" params={\"prefix\": \"CT-\", \"short_form\": True, \"uppercase\": True}\n",
")\n",
"\n",
"# Trial phase\n",
"aidd.add_column(\n",
" name=\"trial_phase\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"Phase I\", \"Phase II\", \"Phase III\", \"Phase IV\"],\n",
" \"weights\": [0.2, 0.3, 0.4, 0.1]\n",
" }\n",
")\n",
"\n",
"# Therapeutic area\n",
"aidd.add_column(\n",
" name=\"therapeutic_area\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"Oncology\", \"Cardiology\", \"Neurology\", \"Immunology\", \"Infectious Disease\"],\n",
" \"weights\": [0.3, 0.2, 0.2, 0.15, 0.15]\n",
" }\n",
")\n",
"\n",
"# Study design\n",
"aidd.add_column(\n",
" name=\"study_design\",\n",
" type=\"subcategory\",\n",
" params={\n",
" \"category\": \"trial_phase\",\n",
" \"values\": {\n",
" \"Phase I\": [\"Single Arm\", \"Dose Escalation\", \"First-in-Human\", \"Safety Assessment\"],\n",
" \"Phase II\": [\"Randomized\", \"Double-Blind\", \"Proof of Concept\", \"Open-Label Extension\"],\n",
" \"Phase III\": [\"Randomized Controlled\", \"Double-Blind Placebo-Controlled\", \"Multi-Center\", \"Pivotal\"],\n",
" \"Phase IV\": [\"Post-Marketing Surveillance\", \"Real-World Evidence\", \"Long-Term Safety\", \"Expanded Access\"]\n",
" }\n",
" }\n",
")\n",
"\n",
"# Trial dates\n",
"aidd.add_column(\n",
" name=\"trial_start_date\",\n",
" type=\"datetime\",\n",
" params={\"start\": \"2022-01-01\", \"end\": \"2023-06-30\"},\n",
" convert_to=\"%Y-%m-%d\"\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"trial_end_date\",\n",
" type=\"datetime\",\n",
" params={\"start\": \"2023-07-01\", \"end\": \"2024-12-31\"},\n",
" convert_to=\"%Y-%m-%d\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Participant Information\n",
"\n",
"Now we'll create fields for participant demographics and enrollment details:\n",
"- Participant ID\n",
"- Name, birth date, and email, drawn from the `participant` person sampler\n",
"- Enrollment status and dates\n",
"- Randomization assignment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Participant identifiers and information\n",
"aidd.add_column(\n",
" name=\"participant_id\",\n",
" type=\"uuid\",\n",
" params={\"prefix\": \"PT-\", \"short_form\": True, \"uppercase\": True}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"participant_first_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"participant.first_name\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"participant_last_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"participant.last_name\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"participant_birth_date\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"participant.birth_date\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"participant_email\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"participant.email_address\"}\n",
")\n",
"\n",
"# Enrollment information\n",
"aidd.add_column(\n",
" name=\"enrollment_date\",\n",
" type=\"timedelta\",\n",
" params={\n",
" \"dt_min\": 0,\n",
" \"dt_max\": 60,\n",
" \"reference_column_name\": \"trial_start_date\",\n",
" \"unit\": \"D\"\n",
" },\n",
" convert_to=\"%Y-%m-%d\"\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"participant_status\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"Active\", \"Completed\", \"Withdrawn\", \"Lost to Follow-up\"],\n",
" \"weights\": [0.6, 0.2, 0.15, 0.05]\n",
" }\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"treatment_arm\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"Treatment\", \"Placebo\", \"Standard of Care\"],\n",
" \"weights\": [0.5, 0.3, 0.2]\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Investigator and Staff Information\n",
"\n",
"Here we'll add information about the trial staff, sites, and costs:\n",
"- Investigator information (principal investigator)\n",
"- Study coordinator details\n",
"- Site information\n",
"- Per-patient cost and participant compensation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Investigator information\n",
"aidd.add_column(\n",
" name=\"investigator_first_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"investigator.first_name\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"investigator_last_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"investigator.last_name\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"investigator_id\",\n",
" type=\"uuid\",\n",
" params={\"prefix\": \"INV-\", \"short_form\": True, \"uppercase\": True}\n",
")\n",
"\n",
"# Study coordinator information\n",
"aidd.add_column(\n",
" name=\"coordinator_first_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"coordinator.first_name\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"coordinator_last_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"coordinator.last_name\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"coordinator_email\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"coordinator.email_address\"}\n",
")\n",
"\n",
"# Site information\n",
"aidd.add_column(\n",
" name=\"site_id\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"SITE-001\", \"SITE-002\", \"SITE-003\", \"SITE-004\", \"SITE-005\"]\n",
" }\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"site_location\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"London\", \"Manchester\", \"Birmingham\", \"Edinburgh\", \"Cambridge\"]\n",
" }\n",
")\n",
"\n",
"# Study costs\n",
"aidd.add_column(\n",
" name=\"per_patient_cost\",\n",
" type=\"gaussian\",\n",
" params={\"mean\": 15000, \"stddev\": 5000, \"min\": 5000}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"participant_compensation\",\n",
" type=\"gaussian\", \n",
" params={\"mean\": 500, \"stddev\": 200, \"min\": 100}\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clinical Measurements and Outcomes\n",
"\n",
"These columns will track the key clinical data collected during the trial:\n",
"- Baseline and final measurements, plus the percent change between them\n",
"- Dosing information (dose level and frequency)\n",
"- Protocol compliance rate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Basic clinical measurements\n",
"aidd.add_column(\n",
" name=\"baseline_measurement\",\n",
" type=\"gaussian\",\n",
" params={\"mean\": 100, \"stddev\": 15},\n",
" convert_to=\"float\"\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"final_measurement\",\n",
" type=\"gaussian\",\n",
" params={\"mean\": 85, \"stddev\": 20},\n",
" convert_to=\"float\"\n",
")\n",
"\n",
"# Calculate percent change\n",
"aidd.add_column(\n",
" name=\"percent_change\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"(final_measurement - baseline_measurement) / baseline_measurement * 100\"}\n",
")\n",
"\n",
"# Dosing information\n",
"aidd.add_column(\n",
" name=\"dose_level\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"Low\", \"Medium\", \"High\", \"Placebo\"],\n",
" \"weights\": [0.3, 0.3, 0.2, 0.2]\n",
" }\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"dose_frequency\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"Once daily\", \"Twice daily\", \"Weekly\", \"Biweekly\"],\n",
" \"weights\": [0.4, 0.3, 0.2, 0.1]\n",
" }\n",
")\n",
"\n",
"# Protocol compliance\n",
"aidd.add_column(\n",
" name=\"compliance_rate\",\n",
" type=\"uniform\",\n",
" params={\"low\": 0.7, \"high\": 1.0}\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adverse Events Tracking\n",
"\n",
"Here we'll capture adverse events that occur during the clinical trial:\n",
"- Adverse event presence and type\n",
"- Severity and relatedness to treatment\n",
"- Resolution status"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Adverse event flags and details\n",
"aidd.add_column(\n",
" name=\"has_adverse_event\",\n",
" type=\"bernoulli\",\n",
" params={\"p\": 0.3}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"adverse_event_type\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"Headache\", \"Nausea\", \"Fatigue\", \"Rash\", \"Dizziness\", \"Pain at injection site\", \"Other\"],\n",
" \"weights\": [0.2, 0.15, 0.15, 0.1, 0.1, 0.2, 0.1]\n",
" },\n",
" conditional_params={\"has_adverse_event == 0\": {\"values\": [\"None\"]}}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"adverse_event_severity\",\n",
" type=\"category\",\n",
" params={\"values\": [\"Mild\", \"Moderate\", \"Severe\", \"Life-threatening\"]},\n",
" conditional_params={\"has_adverse_event == 0\": {\"values\": [\"NA\"]}}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"adverse_event_relatedness\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"Unrelated\", \"Possibly related\", \"Probably related\", \"Definitely related\"],\n",
" \"weights\": [0.2, 0.4, 0.3, 0.1]\n",
" },\n",
" conditional_params={\"has_adverse_event == 0\": {\"values\": [\"NA\"]}}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"adverse_event_resolved\",\n",
" type=\"category\",\n",
" params={\"values\": [\"NA\"]},\n",
" conditional_params={\"has_adverse_event == 1\": {\"values\": [\"Yes\", \"No\"], \"weights\": [0.8, 0.2]}}\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Narrative text fields with style variations\n",
"\n",
"These fields will contain natural language text that incorporates PII elements.\n",
"We'll use style seed categories to ensure diversity in the writing styles:\n",
"\n",
"1. Medical observations and notes\n",
"2. Adverse event descriptions \n",
"3. Protocol deviation explanations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Documentation style category\n",
"aidd.add_column(\n",
" name=\"documentation_style\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"Formal and Technical\", \"Concise and Direct\", \"Detailed and Descriptive\"],\n",
" \"weights\": [0.4, 0.3, 0.3]\n",
" }\n",
")\n",
"\n",
"# Medical observations - varies based on documentation style\n",
"aidd.add_column(\n",
" name=\"medical_observations\",\n",
" prompt=\"\"\"\n",
" {% if documentation_style == \"Formal and Technical\" %}\n",
" Write formal and technical medical observations for participant {{ participant_first_name }} {{ participant_last_name }} \n",
" (ID: {{ participant_id }}) in the clinical trial for {{ therapeutic_area }} (Study ID: {{ study_id }}).\n",
" \n",
" Include observations related to their enrollment in the {{ dose_level }} dose group with {{ dose_frequency }} administration.\n",
" Baseline measurement was {{ baseline_measurement }} and final measurement was {{ final_measurement }}, representing a \n",
" change of {{ percent_change }}%.\n",
" \n",
" Use proper medical terminology, maintain a highly formal tone, and structure the notes in a technical format with appropriate \n",
" sections and subsections. Include at least one reference to the site investigator, Dr. {{ investigator_last_name }}.\n",
" {% elif documentation_style == \"Concise and Direct\" %}\n",
" Write brief, direct medical observations for patient {{ participant_first_name }} {{ participant_last_name }} \n",
" ({{ participant_id }}) in {{ therapeutic_area }} trial {{ study_id }}.\n",
" \n",
" Note: {{ dose_level }} dose, {{ dose_frequency }}. Baseline: {{ baseline_measurement }}. Final: {{ final_measurement }}. \n",
" Change: {{ percent_change }}%.\n",
" \n",
" Keep notes extremely concise, using abbreviations where appropriate. Mention follow-up needs and reference \n",
" Dr. {{ investigator_last_name }} briefly.\n",
" {% else %}\n",
" Write detailed and descriptive medical observations for participant {{ participant_first_name }} {{ participant_last_name }}\n",
" enrolled in the {{ therapeutic_area }} clinical trial ({{ study_id }}).\n",
" \n",
" Provide a narrative description of their experience in the {{ dose_level }} dose group with {{ dose_frequency }} dosing.\n",
" Describe how their measurements changed from baseline ({{ baseline_measurement }}) to final ({{ final_measurement }}),\n",
" representing a {{ percent_change }}% change.\n",
" \n",
" Use a mix of technical terms and explanatory language. Include thorough descriptions of observed effects and subjective\n",
" patient reports. Mention interactions with the investigator, Dr. {{ investigator_first_name }} {{ investigator_last_name }}.\n",
" {% endif %}\n",
" \"\"\"\n",
")\n",
"\n",
"# Adverse event descriptions - conditional on having an adverse event\n",
"aidd.add_column(\n",
" name=\"adverse_event_description\",\n",
" prompt=\"\"\"\n",
" {% if has_adverse_event == 1 %}\n",
" [INSTRUCTIONS: Write a brief clinical description (1-2 sentences only) of the adverse event. Use formal medical language. Do not include meta-commentary or explain what you're doing.]\\\n",
" {{adverse_event_type}}, {{adverse_event_severity}}. {{adverse_event_relatedness}} to study treatment. \n",
" {% if adverse_event_resolved == \"Yes\" %}Resolved.{% else %}Ongoing.{% endif %}\n",
" {% else %}\n",
" [INSTRUCTIONS: Output only the exact text \"No adverse events reported\" without any additional commentary.]\\\n",
" No adverse events reported.\\\n",
" {% endif %}\n",
" \"\"\"\n",
")\n",
"\n",
"# Protocol deviation description (if compliance is low)\n",
"aidd.add_column(\n",
" name=\"protocol_deviation\",\n",
" prompt=\"\"\"\n",
" {% if compliance_rate < 0.85 %}\n",
" {% if documentation_style == \"Formal and Technical\" %}\n",
" [FORMAT INSTRUCTIONS: Write in a direct documentation style. Do not use phrases like \"it looks like\" or \"you've provided\". Begin with the protocol deviation details. Use formal terminology.]\n",
" \n",
" PROTOCOL DEVIATION REPORT\n",
" Study ID: {{ study_id }}\n",
" Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\n",
" Compliance Rate: {{ compliance_rate }}\n",
" \n",
" [Continue with formal description of the deviation, impact on data integrity, and corrective actions. Reference coordinator {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_last_name }}]\n",
" {% elif documentation_style == \"Concise and Direct\" %}\n",
" [FORMAT INSTRUCTIONS: Use only brief notes and bullet points. No introductions or explanations.]\n",
" \n",
" PROTOCOL DEVIATION - {{ participant_id }}\n",
" • Compliance: {{ compliance_rate }}\n",
" • Impact: [severity level]\n",
" • Actions: [list actions]\n",
" • Coordinator: {{ coordinator_first_name }} {{ coordinator_last_name }}\n",
" • PI: Dr. {{ investigator_last_name }}\n",
" {% else %}\n",
" [FORMAT INSTRUCTIONS: Write a narrative description. Begin directly with the deviation details. No meta-commentary.]\n",
" \n",
" During the {{ therapeutic_area }} study at {{ site_location }}, participant {{ participant_first_name }} {{ participant_last_name }} demonstrated a compliance rate of {{ compliance_rate }}, which constitutes a protocol deviation.\n",
" \n",
" [Continue with narrative about circumstances, discovery, impact, and team response. Include references to {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_first_name }} {{ investigator_last_name }}]\n",
" {% endif %}\n",
" {% else %}\n",
" [FORMAT INSTRUCTIONS: Write a simple direct statement. No meta-commentary or explanation.]\n",
" \n",
" PROTOCOL COMPLIANCE ASSESSMENT\n",
" Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})\n",
" Finding: No protocol deviations. Compliance rate: {{ compliance_rate }}.\n",
" {% endif %}\n",
" \"\"\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adding Constraints\n",
"\n",
"Finally, we'll add constraints to ensure our data is logically consistent:\n",
"- The trial end date must fall after the trial start date\n",
"- The enrollment date must fall on or after the trial start date and before the trial end date\n",
"- Baseline and final measurements must be positive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Ensure appropriate date sequence\n",
"aidd.add_constraint(\n",
" target_column=\"trial_end_date\",\n",
" type=\"column_inequality\",\n",
" params={\"operator\": \">\", \"rhs\": \"trial_start_date\"}\n",
")\n",
"\n",
"aidd.add_constraint(\n",
" target_column=\"enrollment_date\",\n",
" type=\"column_inequality\",\n",
" params={\"operator\": \">=\", \"rhs\": \"trial_start_date\"}\n",
")\n",
"\n",
"aidd.add_constraint(\n",
" target_column=\"enrollment_date\",\n",
" type=\"column_inequality\",\n",
" params={\"operator\": \"<\", \"rhs\": \"trial_end_date\"}\n",
")\n",
"\n",
"# Disabled: this notebook defines no adverse_event_date column. If one is added,\n",
"# uncomment this constraint so the event date falls after enrollment.\n",
"# aidd.add_constraint(\n",
"# target_column=\"adverse_event_date\",\n",
"# type=\"column_inequality\",\n",
"# params={\"operator\": \">\", \"rhs\": \"enrollment_date\"}\n",
"# )\n",
"\n",
"# Ensure reasonable clinical measurements\n",
"aidd.add_constraint(\n",
" target_column=\"baseline_measurement\",\n",
" type=\"scalar_inequality\",\n",
" params={\"operator\": \">\", \"rhs\": 0}\n",
")\n",
"\n",
"aidd.add_constraint(\n",
" target_column=\"final_measurement\",\n",
" type=\"scalar_inequality\",\n",
" params={\"operator\": \">\", \"rhs\": 0}\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preview and Generate Dataset\n",
"\n",
"First, we'll preview a small sample to verify our configuration is working correctly.\n",
"Then we'll generate the full dataset with the desired number of records."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview a few records\n",
"preview = aidd.preview()\n",
"preview.display_sample_record()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Display another sample record from the same preview\n",
"preview.display_sample_record()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define a common name for the workflow and file\n",
"workflow_name = \"clinical-trial-data\"\n",
"\n",
"# Submit batch job\n",
"workflow_run = aidd.create(\n",
" num_records=100,\n",
" workflow_run_name=workflow_name,\n",
" wait_for_completion=True\n",
")\n",
"\n",
"print(\"\\nGenerated dataset shape:\", workflow_run.dataset.df.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Display the first few rows of the generated dataset\n",
"workflow_run.dataset.df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Save the dataset\n",
"csv_filename = f\"{workflow_name}.csv\" \n",
"workflow_run.dataset.df.to_csv(csv_filename, index=False)\n",
"print(f\"Dataset with {len(workflow_run.dataset.df)} records saved to {csv_filename}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}