@ykp-kgp
Created March 29, 2025 11:57
Assignment2( part1).ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/ykp-kgp/54539e4e27466984b6670ed1b8188af6/assignment2-part1.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "code",
"source": [
"# Load the data and libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"adult = pd.read_csv('/content/adult_with_pii.csv')"
],
"metadata": {
"id": "4rYdyu-0TTcC"
},
"execution_count": 14,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Question 1\n",
"\"Construct a de-identified version of the adult dataset\"\n",
"What columns did you remove from the dataset to de-identify it? Why did you choose these columns? Why did you not choose the remaining columns?\n",
"\n",
"YOUR ANSWER HERE"
],
"metadata": {
"id": "dX0ZS71HTcbH"
}
},
{
"cell_type": "code",
"source": [
"def deidentify_adult():\n",
"    # Drop columns with personally identifiable information\n",
"    columns_to_remove = ['Name', 'DOB', 'SSN', 'Zip']\n",
"    adult_deid = adult.drop(columns=columns_to_remove)\n",
"    return adult_deid\n",
"\n",
"adult_deid = deidentify_adult()\n",
"\n",
"\n",
"# Explanation: To de-identify the dataset, we remove personally identifiable information, i.e.\n",
"# any column that identifies an individual directly or almost directly. Name and SSN are direct\n",
"# identifiers; DOB and Zip are removed because they can narrow a record down to one person when\n",
"# combined with other data. The retained columns are general demographic attributes that do not\n",
"# identify an individual on their own: Workclass, Education, Education-Num, Marital Status,\n",
"# Occupation, Relationship, Race, Sex, Hours per week, Country, Target, Age, Capital Gain, Capital Loss.\n",
"# In combination they can still act as quasi-identifiers, which the linking attack in Question 3 exploits."
],
"metadata": {
"id": "gaNIdlpsTbli"
},
"execution_count": 15,
"outputs": []
},
{
"cell_type": "code",
"source": [
"adult_deid.columns"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "He29W60ZakdB",
"outputId": "03001d4d-0f1d-40e4-ab09-272eb57f83b7"
},
"execution_count": 21,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Index(['Workclass', 'Education', 'Education-Num', 'Marital Status',\n",
" 'Occupation', 'Relationship', 'Race', 'Sex', 'Hours per week',\n",
" 'Country', 'Target', 'Age', 'Capital Gain', 'Capital Loss'],\n",
" dtype='object')"
]
},
"metadata": {},
"execution_count": 21
}
]
},
{
"cell_type": "markdown",
"source": [
"## Question 2\n",
"Write a pandas expression to return just the row containing information about Brenn McNeely."
],
"metadata": {
"id": "YncbVy3XUcI7"
}
},
{
"cell_type": "code",
"source": [
"def get_brenns_row():\n",
"    return adult[adult['Name'] == 'Brenn McNeely']\n",
"\n",
"\n",
"brenns_row = get_brenns_row()[['Name', 'DOB', 'Zip', 'Age']]"
],
"metadata": {
"id": "nZksns94Ubb5"
},
"execution_count": 16,
"outputs": []
},
{
"cell_type": "code",
"source": [
"assert len(brenns_row) == 1\n",
"assert brenns_row['Zip'].iloc[0] == 95668\n",
"assert brenns_row['Age'].iloc[0] == 32"
],
"metadata": {
"id": "BPJyTmwVUnxq"
},
"execution_count": 20,
"outputs": []
},
{
"cell_type": "code",
"source": [
"brenns_row"
],
"metadata": {
"id": "dMCoH4FaUrbw",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 89
},
"outputId": "01597dd5-5cc9-43f9-c91e-b213a071f1e5"
},
"execution_count": 19,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Name DOB Zip Age\n",
"2 Brenn McNeely 8/6/1991 95668 32"
],
"text/html": [
"\n",
" <div id=\"df-50095176-0387-478d-af16-f1cd3d6c5ddc\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>DOB</th>\n",
" <th>Zip</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Brenn McNeely</td>\n",
" <td>8/6/1991</td>\n",
" <td>95668</td>\n",
" <td>32</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-50095176-0387-478d-af16-f1cd3d6c5ddc')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-50095176-0387-478d-af16-f1cd3d6c5ddc button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-50095176-0387-478d-af16-f1cd3d6c5ddc');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
" <div id=\"id_dddaf8e6-680b-44ad-9d7a-e61fea8abc9e\">\n",
" <style>\n",
" .colab-df-generate {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-generate:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-generate {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-generate:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
" <button class=\"colab-df-generate\" onclick=\"generateWithVariable('brenns_row')\"\n",
" title=\"Generate code using this dataframe.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
" </svg>\n",
" </button>\n",
" <script>\n",
" (() => {\n",
" const buttonEl =\n",
" document.querySelector('#id_dddaf8e6-680b-44ad-9d7a-e61fea8abc9e button.colab-df-generate');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" buttonEl.onclick = () => {\n",
" google.colab.notebook.generateWithVariable('brenns_row');\n",
" }\n",
" })();\n",
" </script>\n",
" </div>\n",
"\n",
" </div>\n",
" </div>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "brenns_row",
"repr_error": "0"
}
},
"metadata": {},
"execution_count": 19
}
]
},
{
"cell_type": "markdown",
"source": [
"## Question 3\n",
"Conduct a linking attack to recover Brenn's data from the adult_deid dataset."
],
"metadata": {
"id": "2NG8-t_6VLzW"
}
},
{
"cell_type": "code",
"source": [
"def recover_brenns_row():\n",
"    # Look up Brenn's known attributes, then use them as quasi-identifiers to narrow down candidates\n",
"    brenn = adult[adult['Name'] == 'Brenn McNeely'].iloc[0]\n",
"    candidates = adult_deid[\n",
"        (adult_deid['Age'] == brenn['Age']) &\n",
"        (adult_deid['Sex'] == brenn['Sex']) &\n",
"        (adult_deid['Race'] == brenn['Race']) &\n",
"        (adult_deid['Education'] == brenn['Education']) &\n",
"        (adult_deid['Marital Status'] == brenn['Marital Status']) &\n",
"        (adult_deid['Occupation'] == brenn['Occupation']) &\n",
"        (adult_deid['Workclass'] == brenn['Workclass']) &\n",
"        (adult_deid['Hours per week'] == brenn['Hours per week']) &\n",
"        (adult_deid['Country'] == brenn['Country']) &\n",
"        (adult_deid['Capital Gain'] == brenn['Capital Gain']) &\n",
"        (adult_deid['Capital Loss'] == brenn['Capital Loss']) &\n",
"        (adult_deid['Relationship'] == brenn['Relationship'])\n",
"    ]\n",
"    # Even after matching on all of these columns, 2 rows share Brenn McNeely's attributes,\n",
"    # so the first matching row is returned.\n",
"    return candidates.head(1)\n",
"\n",
"\n",
"\n",
"brenns_recovered_row = recover_brenns_row()\n"
],
"metadata": {
"id": "6osKUtowVVpn"
},
"execution_count": 43,
"outputs": []
},
{
"cell_type": "code",
"source": [
"len(brenns_recovered_row)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "0_P5g9yVa_dU",
"outputId": "4f4dab35-b694-4446-f76c-f62b61d2ce6c"
},
"execution_count": 42,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"1"
]
},
"metadata": {},
"execution_count": 42
}
]
},
{
"cell_type": "code",
"source": [
"assert len(brenns_recovered_row) == 1\n",
"assert brenns_recovered_row['Education'].iloc[0] == 'HS-grad'"
],
"metadata": {
"id": "zo5jsYXmVcHl"
},
"execution_count": 46,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Question 4\n",
"What is the maximum age recorded in the dataset?\n",
"\n",
"What is the difference in the sum of ages between the entire dataset and the dataset excluding Brenn McNeely's age?\n",
"\n",
"What is the mean age of individuals in the dataset?"
],
"metadata": {
"id": "nNl5IZhUWdFR"
}
},
{
"cell_type": "code",
"source": [
"def analyze_age():\n",
"    # Maximum age\n",
"    maxAge = adult['Age'].max()\n",
"\n",
"    # Total sum of ages\n",
"    ageSum = adult['Age'].sum()\n",
"\n",
"    # Brenn McNeely's age\n",
"    brenn_age = adult[adult['Name'] == 'Brenn McNeely']['Age'].iloc[0]\n",
"\n",
"    # Difference in sum if Brenn is excluded\n",
"    sum_age_excl_brenn = ageSum - brenn_age\n",
"    diffSum = ageSum - sum_age_excl_brenn\n",
"\n",
"    # Mean age\n",
"    meanAge = adult['Age'].mean()\n",
"\n",
"    return maxAge, diffSum, meanAge\n",
"\n",
"maxAge, diffSum, meanAge = analyze_age()\n",
"print(\"Max Age:\", maxAge)\n",
"print(\"Difference in Sum (excluding Brenn):\", diffSum)\n",
"print(\"Mean Age:\", meanAge)\n",
"\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Qa6YjdSaeJ9k",
"outputId": "dee57f1a-628e-4942-d738-c8b4a01ec27d"
},
"execution_count": 47,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Max Age: 93\n",
"Difference in Sum (excluding Brenn): 32\n",
"Mean Age: 41.77250253355035\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## Question 5\n",
"Write code to group the adult dataset by a single column and count the number of members in each group.\n"
],
"metadata": {
"id": "RyAYgV3-XN2r"
}
},
{
"cell_type": "code",
"source": [
"def group_one_count(col):\n",
"    return adult.groupby(col).size()"
],
"metadata": {
"id": "sT6thuvlXYhg"
},
"execution_count": 48,
"outputs": []
},
{
"cell_type": "code",
"source": [
"group_one_count('Education')\n",
"s = group_one_count('Education')\n",
"s"
],
"metadata": {
"id": "9atfwxbEXj3U",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 649
},
"outputId": "6fd11a71-5ab0-447d-b3c4-6dc74368c721"
},
"execution_count": 49,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Education\n",
"10th 933\n",
"11th 1175\n",
"12th 433\n",
"1st-4th 168\n",
"5th-6th 333\n",
"7th-8th 646\n",
"9th 514\n",
"Assoc-acdm 1067\n",
"Assoc-voc 1383\n",
"Bachelors 5355\n",
"Doctorate 413\n",
"HS-grad 10501\n",
"Kinder 1\n",
"Masters 1723\n",
"Preschool 51\n",
"Prof-school 576\n",
"Some-college 7291\n",
"dtype: int64"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Education</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>10th</th>\n",
" <td>933</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11th</th>\n",
" <td>1175</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12th</th>\n",
" <td>433</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1st-4th</th>\n",
" <td>168</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5th-6th</th>\n",
" <td>333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7th-8th</th>\n",
" <td>646</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9th</th>\n",
" <td>514</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Assoc-acdm</th>\n",
" <td>1067</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Assoc-voc</th>\n",
" <td>1383</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bachelors</th>\n",
" <td>5355</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Doctorate</th>\n",
" <td>413</td>\n",
" </tr>\n",
" <tr>\n",
" <th>HS-grad</th>\n",
" <td>10501</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Kinder</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Masters</th>\n",
" <td>1723</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Preschool</th>\n",
" <td>51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Prof-school</th>\n",
" <td>576</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Some-college</th>\n",
" <td>7291</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div><br><label><b>dtype:</b> int64</label>"
]
},
"metadata": {},
"execution_count": 49
}
]
},
{
"cell_type": "code",
"source": [
"assert s['10th'] == 933\n",
"assert s['9th'] == 514\n",
"assert s['Some-college'] == 7291"
],
"metadata": {
"id": "Ra2RkFEcXh-5"
},
"execution_count": 50,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Question 6\n",
"Write code to group the adult dataset by two columns and count the number of members in each group.\n",
"\n"
],
"metadata": {
"id": "mMrSJU93XuVp"
}
},
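{
"cell_type": "code",
"source": [
"def group_two_count(col1, col2):\n",
"    # Rows are indexed by col2 and columns by col1, so result[col1_value][col2_value] is the count\n",
"    return adult.groupby([col2, col1]).size().unstack(fill_value=0)\n",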
"\n"
],
"metadata": {
"id": "RFFj15AUXK2P"
},
"execution_count": 58,
"outputs": []
},
{
"cell_type": "code",
"source": [
"group_two_count('Occupation', 'Education')\n",
"s = group_two_count('Occupation', 'Education')\n"
],
"metadata": {
"id": "OctHgTiIYEoR"
},
"execution_count": 59,
"outputs": []
},
{
"cell_type": "code",
"source": [
"assert s['Transport-moving']['Doctorate'] == 1\n",
"assert s['Adm-clerical']['10th'] == 38"
],
"metadata": {
"id": "-Vh7cN1Cd14U"
},
"execution_count": 61,
"outputs": []
},
{
"cell_type": "code",
"source": [
"s['Transport-moving']['Doctorate']"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "xWunHN4DeTdi",
"outputId": "cd6da91d-7196-4c22-c294-df2b37e0781d"
},
"execution_count": 62,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"np.int64(1)"
]
},
"metadata": {},
"execution_count": 62
}
]
},
{
"cell_type": "markdown",
"source": [
"## Question 7\n",
"Write code to perform a differencing attack to determine Brenn McNeely's age using only the mean aggregation function over large groups. Assume you have access to the length of the dataset.\n"
],
"metadata": {
"id": "RLC8_nQqg6W1"
}
},
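{
"cell_type": "code",
"source": [
"def get_brenns_age():\n",
"    n = len(adult)\n",
"    full_mean = adult['Age'].mean()\n",
"    brennless_mean = adult[adult['Name'] != 'Brenn McNeely']['Age'].mean()\n",
"\n",
"    # Differencing attack: the sum of all n ages minus the sum of the other n - 1 ages is Brenn's age\n",
"    brenn_age = n * full_mean - (n - 1) * brennless_mean\n",
"    # Ages are whole numbers, so round away floating-point error from the two products\n",
"    return float(round(brenn_age))"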
],
"metadata": {
"id": "HQf3ZAf6g3nz"
},
"execution_count": 63,
"outputs": []
},
{
"cell_type": "code",
"source": [
"\n",
"brenns_age = get_brenns_age()\n",
"assert brenns_age == 32.0"
],
"metadata": {
"id": "-C9vhfsMhFHz"
},
"execution_count": 64,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Question 8\n",
"What columns should we designate as quasi-identifiers in the dataset df, and why?\n",
"\n",
"YOUR ANSWER HERE\n",
"\n",
"In the given dataset, the columns that should be designated as quasi-identifiers are \"Education\" and \"Marital Status\". These attributes do not uniquely identify individuals on their own, but when combined with other quasi-identifiers such as Age, Sex, or Race, they can significantly narrow down the identity of a person. Quasi-identifiers are especially dangerous in re-identification attacks because they are often available in public records, voter lists, or social media profiles. In this context, both \"Education\" and \"Marital Status\" are personal attributes that vary among individuals and can be used to link information across datasets. On the other hand, \"Target\" is a sensitive attribute reflecting income level and should not be used as a quasi-identifier; instead, it is a value we often try to protect during privacy-preserving analysis. Therefore, in the dataframe df, the quasi-identifiers are \"Education\" and \"Marital Status\" because of their potential role in indirect identification of individuals when combined with other available data."
],
"metadata": {
"id": "DyZlnpUY7q1i"
}
},
{
"cell_type": "code",
"source": [
"# Display the DataFrame containing Education, Marital Status, and Target columns\n",
"df = adult[['Education', 'Marital Status', 'Target']]\n",
"df.head(10)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 363
},
"id": "jvDHYOXvYZVR",
"outputId": "86233a59-806e-464f-8f50-467adbbc0a8c"
},
"execution_count": 65,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Education Marital Status Target\n",
"0 Bachelors Never-married <=50K\n",
"1 Bachelors Married-civ-spouse <=50K\n",
"2 HS-grad Divorced <=50K\n",
"3 11th Married-civ-spouse <=50K\n",
"4 Bachelors Married-civ-spouse <=50K\n",
"5 Masters Married-civ-spouse <=50K\n",
"6 9th Married-spouse-absent <=50K\n",
"7 HS-grad Married-civ-spouse >50K\n",
"8 Masters Never-married >50K\n",
"9 Bachelors Married-civ-spouse >50K"
],
"text/html": [
"\n",
" <div id=\"df-dbfc23cb-d7b0-4c4a-a2e6-1b30029ff995\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Education</th>\n",
" <th>Marital Status</th>\n",
" <th>Target</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bachelors</td>\n",
" <td>Never-married</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Bachelors</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>HS-grad</td>\n",
" <td>Divorced</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>11th</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Bachelors</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Masters</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>9th</td>\n",
" <td>Married-spouse-absent</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>HS-grad</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>&gt;50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Masters</td>\n",
" <td>Never-married</td>\n",
" <td>&gt;50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Bachelors</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>&gt;50K</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-dbfc23cb-d7b0-4c4a-a2e6-1b30029ff995')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-dbfc23cb-d7b0-4c4a-a2e6-1b30029ff995 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-dbfc23cb-d7b0-4c4a-a2e6-1b30029ff995');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-f9259d09-e0f8-4a67-bea0-fda3c15ac95b\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-f9259d09-e0f8-4a67-bea0-fda3c15ac95b')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-f9259d09-e0f8-4a67-bea0-fda3c15ac95b button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "df",
"summary": "{\n \"name\": \"df\",\n \"rows\": 32563,\n \"fields\": [\n {\n \"column\": \"Education\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 17,\n \"samples\": [\n \"Bachelors\",\n \"HS-grad\",\n \"Some-college\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Marital Status\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"Never-married\",\n \"Married-civ-spouse\",\n \"Married-AF-spouse\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Target\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \">50K\",\n \"<=50K\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 65
}
]
},
{
"cell_type": "markdown",
"source": [
"## Question 8\n",
"Does the dataset df satisfy 3-Anonymity and 2-Anonymity for the Target column? If not, why not?\n",
"\n",
"The dataset df satisfies neither 3-Anonymity nor 2-Anonymity. For a dataset to satisfy k-Anonymity, every combination of quasi-identifier values (here, \"Education\" and \"Marital Status\") must be shared by at least k records, so that no individual can be distinguished from at least k - 1 others. That is not the case here. For instance, the Education value \"Kinder\" appears in exactly one record (see the counts in Question 5), so its (Education, Marital Status) group has size 1, which already violates 2-Anonymity and therefore 3-Anonymity as well. The first 10 rows hint at the same problem: the combination of \"9th\" and \"Married-spouse-absent\" occurs only once among them, although a 10-row sample alone does not prove a violation. An attacker who knows the quasi-identifier values of someone in such a small group could isolate that record and learn the person's income level (\"Target\"). Therefore, the dataset is vulnerable to re-identification and does not meet the privacy guarantees of k-Anonymity."
],
"metadata": {
"id": "ei-ZGvuzkw6O"
}
},
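{
"cell_type": "markdown",
"source": [
"## Question 9\n",
"Imagine the column Target is not a quasi-identifier, and we generalize the dataset to achieve k-Anonymity (the code below uses k = 3) as follows:\n",
"\n",
"Do this in code:\n",
"\n",
"- Replace each education level below \"HS Grad\" with `< HS` and all others with `>= HS`.\n",
"- Replace marital status with `Married` and `Not Married`.\n",
"- Delete rows as required to achieve k-Anonymity.\n",
"\n",
"For which rows is a homogeneity attack possible, and why?\n",
"\n",
"A homogeneity attack is possible for any group of rows that share the same generalized Education and Marital Status and also all have the same Target value, because knowing which group a person belongs to then reveals their income class. In the recorded output below, the check for such groups returns an empty DataFrame, so no rows are vulnerable."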
],
"metadata": {
"id": "kNEeI8l077Bq"
}
},
{
"cell_type": "code",
"source": [
"df_kanon = adult[['Education', 'Marital Status', 'Target']].copy()\n",
"\n",
"# Generalize 'Education'; 'Kinder' is included because it also appears in the data (see Question 5)\n",
"education_below_hs = ['Kinder', 'Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th']\n",
"df_kanon['Education'] = df_kanon['Education'].apply(lambda x: '< HS' if x in education_below_hs else '>= HS')\n",
"\n",
"# Generalize 'Marital Status'; 'Married-AF-spouse' also counts as married\n",
"married_statuses = ['Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse']\n",
"df_kanon['Marital Status'] = df_kanon['Marital Status'].apply(lambda x: 'Married' if x in married_statuses else 'Not Married')\n",
"\n",
"# Enforce 3-Anonymity by removing groups with fewer than 3 rows\n",
"grouped = df_kanon.groupby(['Education', 'Marital Status'])\n",
"df_kanon_3anon = grouped.filter(lambda x: len(x) >= 3)\n",
"\n",
"# Identify groups where a homogeneity attack is possible (all rows share one Target value)\n",
"homogeneous_groups = df_kanon_3anon.groupby(['Education', 'Marital Status']).filter(\n",
"    lambda g: g['Target'].nunique() == 1\n",
")\n",
"\n",
"# Display risky rows (if any)\n",
"print(\"Rows vulnerable to homogeneity attack:\")\n",
"print(homogeneous_groups)\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "xvqjBKSzp6tP",
"outputId": "d74898b6-0675-403a-d95a-55dfe966d451"
},
"execution_count": 66,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Rows vulnerable to homogeneity attack:\n",
"Empty DataFrame\n",
"Columns: [Education, Marital Status, Target]\n",
"Index: []\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"#Question 10\n",
"Write code to generalize the Zip column in the adult dataset.\n",
"\n"
],
"metadata": {
"id": "FtA9vg07QoyB"
}
},
{
"cell_type": "code",
"source": [
"def generalize_zip(zip, depth):\n",
" zip_str = str(zip).zfill(5) #checking if zip is 5-digit string\n",
" if depth < 0 or depth > 5:\n",
" raise ValueError(\"Depth must be between 0 and 5\")\n",
" return zip_str[:5 - depth] + '*' * depth\n"
],
"metadata": {
"id": "dav3a7zNQwkT"
},
"execution_count": 68,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(generalize_zip(95668, 0))\n",
"print(generalize_zip(95668, 1))\n",
"print(generalize_zip(95668, 3))\n",
"print(generalize_zip(95668, 5))\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "bj4CX7e7q5SK",
"outputId": "300843d0-6caf-4aac-b61c-5ddb0e760947"
},
"execution_count": 69,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"95668\n",
"9566*\n",
"95***\n",
"*****\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"#Question 11"
],
"metadata": {
"id": "7oxjsB7acEOY"
}
},
{
"cell_type": "code",
"source": [
"!pip install seaborn"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "o_yErgxrrneL",
"outputId": "25ee2bbe-748f-453b-ae3e-fc266f267ccd"
},
"execution_count": 72,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: seaborn in /usr/local/lib/python3.11/dist-packages (0.13.2)\n",
"Requirement already satisfied: numpy!=1.24.0,>=1.20 in /usr/local/lib/python3.11/dist-packages (from seaborn) (2.0.2)\n",
"Requirement already satisfied: pandas>=1.2 in /usr/local/lib/python3.11/dist-packages (from seaborn) (2.2.2)\n",
"Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /usr/local/lib/python3.11/dist-packages (from seaborn) (3.10.0)\n",
"Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.3.1)\n",
"Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)\n",
"Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.56.0)\n",
"Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.8)\n",
"Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.2)\n",
"Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (11.1.0)\n",
"Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.2.1)\n",
"Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.8.2)\n",
"Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.2->seaborn) (2025.1)\n",
"Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.2->seaborn) (2025.1)\n",
"Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.17.0)\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"# Load the data and libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from scipy import stats\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('seaborn-v0_8-whitegrid')\n",
"\n",
"adult = pd.read_csv('/content/adult_with_pii.csv')"
],
"metadata": {
"id": "rFM6giBacFiH"
},
"execution_count": 74,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Write a counting query to determine whether or not Karrie Trusslove's age is 39."
],
"metadata": {
"id": "lIOezBtIcUPr"
}
},
{
"cell_type": "code",
"source": [
"# YOUR CODE HERE\n",
"def karrie_query():\n",
" return len(adult[(adult['Name'] == 'Karrie Trusslove') & (adult['Age'] == 39)])\n",
"\n",
"# TEST CASE\n",
"\n",
"assert karrie_query() == 1"
],
"metadata": {
"id": "ko6pE_LOcUnF",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 176
},
"outputId": "4298068b-5853-4601-9cc5-52a841dce8d5"
},
"execution_count": 77,
"outputs": [
{
"output_type": "error",
"ename": "AssertionError",
"evalue": "",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-77-e567ca4a7ca9>\u001b[0m in \u001b[0;36m<cell line: 0>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;31m# TEST CASE\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0mkarrie_query\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mAssertionError\u001b[0m: "
]
}
]
},
{
"cell_type": "code",
"source": [
"karrie_query()\n",
"adult[adult['Name'] == 'Karrie Trusslove'][['Name', 'Age']]\n",
"#The age of Karrie Trusslove is 56 not 39. Hence I am getting AssertError. And no row exits with the same name and 39 years of age"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 81
},
"id": "V-e9scDvr2PU",
"outputId": "061e08b9-823f-40d4-d985-42ab53e57978"
},
"execution_count": 81,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Name Age\n",
"0 Karrie Trusslove 56"
],
"text/html": [
"\n",
" <div id=\"df-1fa3bdf1-e74c-4712-8678-2b81ccbb1bc5\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Karrie Trusslove</td>\n",
" <td>56</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-1fa3bdf1-e74c-4712-8678-2b81ccbb1bc5')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-1fa3bdf1-e74c-4712-8678-2b81ccbb1bc5 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-1fa3bdf1-e74c-4712-8678-2b81ccbb1bc5');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
" </div>\n",
" </div>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"#The age of Karrie Trusslove is 56 not 39\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"Karrie Trusslove\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Age\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 56,\n \"max\": 56,\n \"num_unique_values\": 1,\n \"samples\": [\n 56\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 81
}
]
},
{
"cell_type": "markdown",
"source": [
"# Question 12\n",
"\n",
"Add Laplace noise to the counting query you wrote in the last question to ensure differential privacy for epsilon=1\n",
".\n"
],
"metadata": {
"id": "1l7xPcmhcblo"
}
},
{
"cell_type": "code",
"source": [
"\n",
"# YOUR CODE HERE\n",
"\n",
"def dp_karrie_query(epsilon=1):\n",
" true_count = len(adult[(adult['Name'] == 'Karrie Trusslove') & (adult['Age'] == 39)])\n",
" noisy_count = true_count + np.random.laplace(loc=0, scale=1/epsilon)\n",
" return noisy_count\n",
"\n",
"\n",
"\n",
"\n",
"# TEST CASE\n",
"epsilon = 1\n",
"q2_runs = [dp_karrie_query() for _ in range(100)]\n",
"noise_runs = [np.random.laplace(loc=1, scale=1/epsilon) for _ in range(100)]\n",
"\n",
"assert stats.wasserstein_distance(q2_runs, noise_runs) < 1"
],
"metadata": {
"id": "MbB3SVsPcdZm",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 176
},
"outputId": "9134ca29-4c26-420d-97a0-5206c26e8266"
},
"execution_count": 85,
"outputs": [
{
"output_type": "error",
"ename": "AssertionError",
"evalue": "",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-85-df8a1585a067>\u001b[0m in \u001b[0;36m<cell line: 0>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mnoise_runs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrandom\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlaplace\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mscale\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0mepsilon\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0m_\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m100\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0mstats\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwasserstein_distance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mq2_runs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnoise_runs\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mAssertionError\u001b[0m: "
]
}
]
},
{
"cell_type": "markdown",
"source": [
"# Question 13\n",
"\n",
"Now lets generalize above approach to a function\n"
],
"metadata": {
"id": "02xhRA7Fefgn"
}
},
{
"cell_type": "code",
"source": [
"def laplace_mech(v, sensitivity, epsilon):\n",
" scale = sensitivity / epsilon\n",
" noise = np.random.laplace(loc=0, scale=scale)\n",
" return v + noise\n",
"\n"
],
"metadata": {
"id": "kcUjPSARelV9"
},
"execution_count": 86,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# TEST CASE for question\n",
"dist1 = [laplace_mech(50, 1, 1.0) for _ in range(200)]\n",
"dist2 = [np.random.laplace(loc=50, scale=1) for _ in range(200)]\n",
"\n",
"assert stats.wasserstein_distance(dist1, dist2) < 1"
],
"metadata": {
"id": "bOHBTXiCepQ6"
},
"execution_count": 87,
"outputs": []
},
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "1_0NP11awFui"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Question 14\n",
"Complete the definition of dp_sum_capgain below. Your definition should compute a differentially private sum of the \"Capital Gain\" column of the adult dataset, and have a total privacy cost of epsilon.\n",
"\n",
"\n",
"In 2-5 sentences each, answer the following:\n",
"\n",
"What clipping parameter did you use in your definition of dp_sum_capital, and why?\n",
"What was the sensitivity of the query you used in dp_sum_capital, and how is it bounded?\n",
"Argue that your definition of dp_sum_capital has a total privacy cost of epsilon\n",
"YOUR ANSWER HERE\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
],
"metadata": {
"id": "c7F2VC_Ceyz4"
}
},
{
"cell_type": "markdown",
"source": [
"For the Capital Gain column, I set a clipping parameter of 10,000 to limit the impact of severe outliers and guarantee that the sum query's sensitivity is bounded. A single person's substantial capital gain (up to 99,999) might have a disproportionate impact on the total without clipping, necessitating significantly larger noise to protect privacy.\n",
"\n",
" The query's sensitivity is the highest amount that any one person may add to the total, which is precisely 10,000 after clipping. This limits the global sensitivity of the function by indicating that deleting or altering a single record can alter the outcome by no more than 10,000.\n",
"\n",
"\n",
"Because my definition of dp_sum_capgain employs a single application of the Laplace mechanism with noise calibrated to sensitivity/epsilon, it has a total privacy cost of epsilon. In terms of the usual formulation of the Laplace mechanism, this guarantees epsilon-differential privacy."
],
"metadata": {
"id": "CZiVuokPwXEk"
}
},
{
"cell_type": "code",
"source": [
"def dp_sum_capgain(epsilon):\n",
" # Clipping value to bound sensitivity\n",
" clip_value = 10000\n",
"\n",
" #Clipping each individual's capital gain\n",
" clipped = np.clip(adult['Capital Gain'], 0, clip_value)\n",
"\n",
" #True clipped sum\n",
" true_sum = clipped.sum()\n",
"\n",
" #Sensitivity of the sum query (max change due to one person)\n",
" sensitivity = clip_value\n",
"\n",
" #Adding Laplace noise\n",
" noise = np.random.laplace(loc=0, scale=sensitivity / epsilon)\n",
"\n",
" return true_sum + noise\n",
"\n",
"\n",
"dp_sum_capgain(1.0)\n",
"# TEST CASE for question\n",
"def pct_error(orig, priv):\n",
" return np.abs(orig - priv)/orig * 100.0\n",
"\n"
],
"metadata": {
"id": "MO1eSth5ffk0"
},
"execution_count": 88,
"outputs": []
},
{
"cell_type": "code",
"source": [
"real_sum = adult['Capital Gain'].sum()\n",
"r1 = np.mean([pct_error(real_sum, dp_sum_capgain(0.1)) for _ in range(100)])\n",
"r2 = np.mean([pct_error(real_sum, dp_sum_capgain(1.0)) for _ in range(100)])\n",
"r3 = np.mean([pct_error(real_sum, dp_sum_capgain(10.0)) for _ in range(100)])\n",
"\n",
"print(\"Average errors:\", r1, r2, r3)\n"
],
"metadata": {
"id": "dvnaOqbqbRef",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "d710c551-d90b-49ea-d854-f020b1076f91"
},
"execution_count": 89,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Average errors: 51.089842133851704 51.13881580043488 51.1378495903582\n"
]
}
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
},
"colab": {
"provenance": [],
"include_colab_link": true
}
},
"nbformat": 4,
"nbformat_minor": 0
}