@drorata
Created March 21, 2025 15:46
Referencing columns when using PySpark
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import findspark\n",
"\n",
"# Call init() before importing pyspark so that Spark is on sys.path.\n",
"findspark.init()\n",
"\n",
"from pyspark.sql import SparkSession\n",
"from pyspark.sql import functions as F\n",
"\n",
"\n",
"spark = (\n",
" SparkSession.builder.appName(\"TestApp\")\n",
" .config(\"spark.driver.host\", \"localhost\")\n",
" .getOrCreate()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Accessing Columns of a Dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = [(\"Alice\", 25, 3.14), (\"Bob\", 30, 2.71), (\"Charlie\", 35, 1.618)]\n",
"df = spark.createDataFrame(data, [\"Name\", \"Age\", \"A nice number\"])\n",
"df.toPandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The 🐼 Pandas' way"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.toPandas().Age"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.toPandas()[\"A nice number\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The ⚡️ Spark's Way"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All of the following calls are equivalent; each selects the `Name` column:\n",
"\n",
"```python\n",
"df.select(\"Name\")\n",
"df.select(df[\"Name\"])\n",
"df.select(df.Name)\n",
"df.select(F.col(\"Name\"))\n",
"```\n",
"\n",
"For example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.select(F.col(\"A nice number\")).toPandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Likewise, the following two calls produce the same result:\n",
"\n",
"```python\n",
"df.withColumn(\"foo\", df.Age * 2)\n",
"df.withColumn(\"foo\", df[\"Age\"] * 2)\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.withColumn(\"foo\", df.Age * 2).toPandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Am I defined??\n",
"\n",
"![alt text](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExYzI3aGtuOWVmNzlkenk4OG1kYXZpZjJjZmlkaWl6Ym9kNTh3dnhsYiZlcD12MV9naWZzX3RyZW5kaW5nJmN0PWc/7MDZS8zS1ixtJAUEul/giphy.gif \"I'm not defined!\")"
]
},
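{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This cell fails on purpose: \"foo\" * 2 is the plain Python string \"foofoo\",\n",
"# not a Column, so withColumn rejects it.\n",
"# 👇🏻👇🏻👇🏻👇🏻\n",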
"df.withColumn(\"foo\", df.Age * 2).withColumn(\"foo2\", \"foo\" * 2).toPandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.withColumn(\"foo\", df.Age * 2).withColumn(\"foo2\", F.col(\"foo\") * 2).toPandas()"
]
},
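{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df2 = df.withColumn(\"foo\", df.Age * 2)\n",
"# Alternative: df2.foo also works here, because df2 (unlike df) already has the \"foo\" column.\n",
"# df2.withColumn(\"foo2\", df2.foo * 2).toPandas()\n",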
"df2.withColumn(\"foo2\", F.col(\"foo\") * 2).toPandas()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
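
The failing `withColumn("foo2", "foo" * 2)` cell in the notebook above comes down to plain Python evaluation order: `"foo" * 2` is computed by Python before Spark sees anything. The sketch below illustrates this with a toy stand-in that needs no Spark installation. `ToyColumn` and `toy_col` are hypothetical names invented for this illustration; they only mimic the idea of operator overloading on a column expression and are not PySpark's actual `Column` implementation.

```python
# Toy stand-in for a column expression; NOT PySpark's real Column class.
class ToyColumn:
    def __init__(self, expr: str) -> None:
        self.expr = expr

    def __mul__(self, other: object) -> "ToyColumn":
        # Build a new expression instead of computing a value right away.
        return ToyColumn(f"({self.expr} * {other!r})")

    def __repr__(self) -> str:
        return f"ToyColumn<{self.expr}>"


def toy_col(name: str) -> ToyColumn:
    """Hypothetical analogue of F.col: wrap a column name in an expression."""
    return ToyColumn(name)


# Python evaluates both right-hand sides before any DataFrame method runs.
as_string = "foo" * 2           # string repetition
as_column = toy_col("foo") * 2  # overloaded *, yields an expression object

print(type(as_string).__name__, as_string)  # str foofoo
print(type(as_column).__name__, as_column)  # ToyColumn ToyColumn<(foo * 2)>
```

In the notebook, `F.col("foo") * 2` plays the role of `toy_col("foo") * 2`: it produces a column expression that Spark resolves against the DataFrame it is applied to, which is why it can refer to a column created one step earlier in the same chain.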