Skip to content

Instantly share code, notes, and snippets.

@yamini
Created April 1, 2025 15:30
Show Gist options
  • Save yamini/ca43bde8d1f9a4c6d5b78dacae4d4807 to your computer and use it in GitHub Desktop.
Save yamini/ca43bde8d1f9a4c6d5b78dacae4d4807 to your computer and use it in GitHub Desktop.
dd-sdk-insurance-claims-data-yk
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Synthetic Insurance Claims Dataset Generator\n",
"\n",
"This notebook creates a synthetic dataset of insurance claims with realistic PII (Personally Identifiable Information) for testing data protection and anonymization techniques.\n",
"\n",
"The dataset includes:\n",
"- Policy and claim details\n",
"- Policyholder and claimant information (PII)\n",
"- Claim descriptions and adjuster notes with embedded PII\n",
"- Medical information for relevant claims\n",
"\n",
"We'll use Gretel's Data Designer to create this fully synthetic dataset from scratch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"# Install required packages\n",
"%pip install -U git+https://github.com/gretelai/gretel-python-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up Data Designer\n",
"\n",
"First, we'll initialize the Gretel client and create a new Data Designer object. We'll use the `apache-2.0` model suite for this project."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gretel_client.navigator_client import Gretel\n",
"\n",
"# Initialize Gretel client - this will prompt for your API key\n",
"gretel = Gretel(api_key=\"prompt\", endpoint=\"https://api.dev.gretel.ai\")\n",
"\n",
"aidd = gretel.data_designer.new(model_suite=\"apache-2.0\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Person Samplers\n",
"\n",
"We'll create person samplers to generate consistent personal information for different roles in the insurance claims process:\n",
"- Policyholders (primary insurance customers)\n",
"- Claimants (who may be different from policyholders)\n",
"- Adjusters (insurance company employees who evaluate claims)\n",
"- Physicians (for medical-related claims)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create person samplers for different roles, using en_GB locale since en_US with PGM is not supported in streaming mode\n",
"aidd.with_person_samplers({\n",
" \"policyholder\": {\"locale\": \"en_GB\"},\n",
" \"claimant\": {\"locale\": \"en_GB\"},\n",
" \"adjuster\": {\"locale\": \"en_GB\"},\n",
" \"physician\": {\"locale\": \"en_GB\"}\n",
"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Policy Information\n",
"\n",
"Next, we'll create the basic policy information:\n",
"- Policy number (unique identifier)\n",
"- Policy type (Auto, Home, Health, etc.)\n",
"- Coverage details (based on policy type)\n",
"- Policy start and end dates"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Policy identifiers\n",
"aidd.add_column(\n",
" name=\"policy_number\",\n",
" type=\"uuid\",\n",
" params={\"prefix\": \"POL-\", \"short_form\": True, \"uppercase\": True}\n",
")\n",
"\n",
"# Policy type\n",
"aidd.add_column(\n",
" name=\"policy_type\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"Auto\", \"Home\", \"Health\", \"Life\", \"Travel\"],\n",
" \"weights\": [0.4, 0.3, 0.15, 0.1, 0.05]\n",
" }\n",
")\n",
"\n",
"# Coverage types based on policy type\n",
"aidd.add_column(\n",
" name=\"coverage_type\",\n",
" type=\"subcategory\",\n",
" params={\n",
" \"category\": \"policy_type\",\n",
" \"values\": {\n",
" \"Auto\": [\"Liability\", \"Comprehensive\", \"Collision\", \"Uninsured Motorist\"],\n",
" \"Home\": [\"Dwelling\", \"Personal Property\", \"Liability\", \"Natural Disaster\"],\n",
" \"Health\": [\"Emergency Care\", \"Primary Care\", \"Specialist\", \"Prescription\"],\n",
" \"Life\": [\"Term\", \"Whole Life\", \"Universal Life\", \"Variable Life\"],\n",
" \"Travel\": [\"Trip Cancellation\", \"Medical Emergency\", \"Lost Baggage\", \"Flight Accident\"]\n",
" }\n",
" }\n",
")\n",
"\n",
"# Policy dates\n",
"aidd.add_column(\n",
" name=\"policy_start_date\",\n",
" type=\"datetime\",\n",
" params={\"start\": \"2022-01-01\", \"end\": \"2023-06-30\"},\n",
" convert_to=\"%Y-%m-%d\"\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"policy_end_date\",\n",
" type=\"datetime\",\n",
" params={\"start\": \"2023-07-01\", \"end\": \"2024-12-31\"},\n",
" convert_to=\"%Y-%m-%d\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Policyholder Information (PII)\n",
"\n",
"Now we'll add fields for the policyholder's personal information. This includes PII elements that would typically be subject to privacy regulations:\n",
"- First and last name\n",
"- Birth date\n",
"- Contact information (email)\n",
"\n",
"These fields use expressions to reference the person sampler we defined earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Policyholder personal information\n",
"aidd.add_column(\n",
" name=\"policyholder_first_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"policyholder.first_name\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"policyholder_last_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"policyholder.last_name\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"policyholder_birth_date\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"policyholder.birth_date\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"policyholder_email\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"policyholder.email_address\"}\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Claim Information\n",
"\n",
"Next, we'll create the core claim details:\n",
"- Claim ID (unique identifier)\n",
"- Dates (filing date, incident date)\n",
"- Claim status (in process, approved, denied, etc.)\n",
"- Financial information (amount claimed, amount approved)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Claim identifier\n",
"aidd.add_column(\n",
" name=\"claim_id\",\n",
" type=\"uuid\",\n",
" params={\"prefix\": \"CLM-\", \"short_form\": True, \"uppercase\": True}\n",
")\n",
"\n",
"# Claim dates\n",
"aidd.add_column(\n",
" name=\"incident_date\",\n",
" type=\"datetime\",\n",
" params={\"start\": \"2023-01-01\", \"end\": \"2023-12-31\"},\n",
" convert_to=\"%Y-%m-%d\"\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"filing_date\",\n",
" type=\"timedelta\",\n",
" params={\n",
" \"dt_min\": 1,\n",
" \"dt_max\": 30,\n",
" \"reference_column_name\": \"incident_date\",\n",
" \"unit\": \"D\"\n",
" },\n",
" convert_to=\"%Y-%m-%d\"\n",
")\n",
"\n",
"# Claim status\n",
"aidd.add_column(\n",
" name=\"claim_status\",\n",
" type=\"category\",\n",
" params={\n",
" \"values\": [\"Filed\", \"Under Review\", \"Additional Info Requested\", \"Approved\", \"Denied\", \"Appealed\"],\n",
" \"weights\": [0.15, 0.25, 0.15, 0.25, 0.15, 0.05]\n",
" }\n",
")\n",
"\n",
"# Financial information\n",
"aidd.add_column(\n",
" name=\"claim_amount\",\n",
" type=\"gaussian\",\n",
" params={\"mean\": 5000, \"stddev\": 2000, \"min\": 500}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"approved_percentage\",\n",
" type=\"uniform\",\n",
" params={\"low\": 0.0, \"high\": 1.0}\n",
")\n",
"\n",
"# Calculate approved amount based on percentage\n",
"aidd.add_column(\n",
" name=\"approved_amount\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"claim_amount * approved_percentage\"}\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Claimant Information\n",
"\n",
"In some cases, the claimant (person filing the claim) may be different from the policyholder. \n",
"We'll create fields to capture claimant information and their relationship to the policyholder:\n",
"- Flag indicating if claimant is the policyholder\n",
"- Claimant personal details (when different from policyholder)\n",
"- Relationship to policyholder"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Determine if claimant is the policyholder\n",
"aidd.add_column(\n",
" name=\"is_claimant_policyholder\",\n",
" type=\"bernoulli\",\n",
" params={\"p\": 0.7}\n",
")\n",
"\n",
"# Claimant personal information (when different from policyholder)\n",
"aidd.add_column(\n",
" name=\"claimant_first_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"claimant.first_name\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"claimant_last_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"claimant.last_name\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"claimant_birth_date\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"claimant.birth_date\"}\n",
")\n",
"\n",
"# Relationship to policyholder\n",
"aidd.add_column(\n",
" name=\"relationship_to_policyholder\",\n",
" type=\"category\",\n",
" params={\"values\": [\"Self\",\"Spouse\", \"Child\", \"Parent\", \"Sibling\", \"Other\"]},\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Claim Adjuster Information\n",
"\n",
"Insurance claims are typically handled by claim adjusters. We'll add information about \n",
"the adjuster assigned to each claim:\n",
"- Adjuster name\n",
"- Assignment date\n",
"- Contact information"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Adjuster information\n",
"aidd.add_column(\n",
" name=\"adjuster_first_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"adjuster.first_name\"}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"adjuster_last_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"adjuster.last_name\"}\n",
")\n",
"\n",
"# Adjuster assignment date\n",
"aidd.add_column(\n",
" name=\"adjuster_assignment_date\",\n",
" type=\"timedelta\",\n",
" params={\n",
" \"dt_min\": 0,\n",
" \"dt_max\": 5,\n",
" \"reference_column_name\": \"filing_date\",\n",
" \"unit\": \"D\"\n",
" },\n",
" convert_to=\"%Y-%m-%d\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Medical Information\n",
"\n",
"For health insurance claims and injury-related claims in other policy types, \n",
"we'll include medical information:\n",
"- Flag indicating if there's a medical component to the claim\n",
"- Medical claim details (when applicable)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Is there a medical component to this claim?\n",
"aidd.add_column(\n",
" name=\"has_medical_component\",\n",
" type=\"bernoulli\",\n",
" params={\"p\": 0.4}\n",
")\n",
"\n",
"# Physician information using conditional logic\n",
"aidd.add_column(\n",
" name=\"physician_first_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"physician.first_name\"},\n",
" conditional_params={\"has_medical_component == 0\": {\"expr\": \"'NA'\"}}\n",
")\n",
"\n",
"aidd.add_column(\n",
" name=\"physician_last_name\",\n",
" type=\"expression\",\n",
" params={\"expr\": \"physician.last_name\"},\n",
" conditional_params={\"has_medical_component == 0\": {\"expr\": \"'NA'\"}}\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Free Text Fields with PII References\n",
"\n",
"These fields will contain natural language text that incorporates PII elements from other fields.\n",
"This is particularly useful for testing PII detection and redaction within unstructured text:\n",
"\n",
"1. Incident Description - The policyholder/claimant's account of what happened\n",
"2. Adjuster Notes - The insurance adjuster's professional documentation\n",
"3. Medical Notes - For claims with a medical component\n",
"\n",
"The LLM will be prompted to include PII elements like names, dates, and contact information\n",
"within the narrative text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Incident description from policyholder/claimant\n",
"aidd.add_column(\n",
" name=\"incident_description\",\n",
" prompt=\"\"\"\n",
" Write a detailed description of an insurance incident for a {policy_type} insurance policy with {coverage_type} coverage.\n",
" \n",
" The policyholder is {policyholder_first_name} {policyholder_last_name} (email: {policyholder_email}).\n",
" \n",
" The incident occurred on {incident_date} and resulted in approximately ${claim_amount} in damages/expenses.\n",
" \n",
" Write this from the perspective of the person filing the claim. Include specific details that would be relevant \n",
" to processing this type of claim. Make it detailed but realistic, as if written by someone describing an actual incident.\n",
" \n",
" Reference the policyholder's contact information at least once in the narrative.\n",
" \"\"\"\n",
")\n",
"\n",
"# Adjuster notes\n",
"aidd.add_column(\n",
" name=\"adjuster_notes\",\n",
" prompt=\"\"\"\n",
" Write detailed insurance adjuster notes for claim {claim_id}.\n",
" \n",
" POLICY INFORMATION:\n",
" - Policy #: {policy_number}\n",
" - Type: {policy_type}, {coverage_type} coverage\n",
" - Policyholder: {policyholder_first_name} {policyholder_last_name}\n",
" \n",
" CLAIM DETAILS:\n",
" - Incident Date: {incident_date}\n",
" - Filing Date: {filing_date}\n",
" - Claimed Amount: ${claim_amount}\n",
" \n",
" As adjuster {adjuster_first_name} {adjuster_last_name}, write professional notes documenting:\n",
" 1. Initial contact with the policyholder\n",
" 2. Assessment of the claim based on the incident description\n",
" 3. Coverage determination under the policy\n",
" 4. Recommended next steps\n",
" \n",
" Include at least one mention of contacting the policyholder using their full name and email ({policyholder_email}).\n",
" Use a formal, professional tone typical of insurance documentation.\n",
" \"\"\"\n",
")\n",
"\n",
"# Medical notes (for claims with medical component)\n",
"aidd.add_column(\n",
" name=\"medical_notes\",\n",
" prompt=\"\"\"\n",
" {% if has_medical_component %}\\\n",
" Write medical notes related to insurance claim {{ claim_id }}.\n",
"\n",
" Patient: {{policyholder_first_name}} {{policyholder_last_name}}, DOB: {{policyholder_birth_date}}\n",
"\n",
" As Dr. {{physician_first_name}} {{physician_last_name}}, document:\n",
"\n",
" 1. Chief complaint\n",
" 2. Medical assessment\n",
" 3. Treatment recommendations\n",
" 4. Follow-up instructions\n",
"\n",
" Include appropriate medical terminology relevant to a {{policy_type}} insurance claim.\n",
" If this is for a Health policy, focus on the {{coverage_type}} aspects.\n",
" For other policy types, focus on injury assessment relevant to the incident.\n",
"\n",
" Use a professional medical documentation style that includes specific references \n",
" to the patient by name and birth date.\\\n",
" \n",
" The language should be natural and different from one physician to the next.\\\n",
" \n",
" Vary the length of the response. Keep some notes brief and others more detailed.\\\n",
" {% else -%}\\\n",
" Repeat the following: \"No medical claim\"\\\n",
" {% endif -%}\\\n",
" \"\"\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adding Constraints\n",
"\n",
"To ensure our data is logically consistent, we'll add some constraints:\n",
"- Incident date must be during the policy term\n",
"- Filing date must be after incident date"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Ensure incident date falls within policy period\n",
"aidd.add_constraint(\n",
" target_column=\"incident_date\",\n",
" type=\"column_inequality\",\n",
" params={\"operator\": \">=\", \"rhs\": \"policy_start_date\"}\n",
")\n",
"\n",
"aidd.add_constraint(\n",
" target_column=\"incident_date\", \n",
" type=\"column_inequality\",\n",
" params={\"operator\": \"<=\", \"rhs\": \"policy_end_date\"}\n",
")\n",
"\n",
"# Ensure filing date is after incident date\n",
"aidd.add_constraint(\n",
" target_column=\"filing_date\",\n",
" type=\"column_inequality\",\n",
" params={\"operator\": \">\", \"rhs\": \"incident_date\"}\n",
")\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preview and Generate Dataset\n",
"\n",
"First, we'll preview a small sample to verify our configuration is working correctly.\n",
"Then we'll generate the full dataset with the desired number of records."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview a few records\n",
"preview = aidd.preview()\n",
"preview.display_sample_record()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# More previews\n",
"preview.display_sample_record()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Generate the full dataset\n",
"workflow_run = aidd.create(\n",
" num_records=100, # Adjust number as needed. Max 100 for now\n",
" workflow_run_name=\"insurance_claims_with_pii_321\",\n",
" wait_for_completion=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Display the first few rows of the generated dataset\n",
"workflow_run.dataset.df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Save the dataset to CSV\n",
"workflow_run.dataset.df.to_csv(\"insurance_claims_with_pii.csv\", index=False)\n",
"print(f\"Dataset with {len(workflow_run.dataset.df)} records saved to insurance_claims_with_pii.csv\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment