PySpark
Lecture 0: https://colab.research.google.com/drive/1crwg2yOosVEQATTlMfPe-FdiQKZF-gFN?usp=sharing
Lecture 1: https://drive.google.com/file/d/1_H5jhUut-DPjT5U0vmgCkg30DKZCa_-G/view?usp=sharing
Lecture 2: https://colab.research.google.com/drive/1DxeRBxRaqENX-HnxqSj-fexuRxBaTp9R?usp=sharing
Lecture 3: https://colab.research.google.com/drive/1fzTxGc0ttQeUV4D4VQf0zYDSOb6gDkoH?usp=sharing
Databricks notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6073444084857434/2270297475493747/7614095282524650/latest.html
What is a PySpark partition?
A:
- A way to split a large dataset into smaller datasets based on one or more partition keys.
- One of the main advantages of a PySpark DataFrame over a Pandas DataFrame: transformations on
partitioned data run faster because they execute in parallel, one task per partition.
PySpark supports partitioning in two ways:
- partitioning in memory (DataFrame)
- partitioning on disk (file system)
Partition in memory (see the sketch below):
- repartition():
used to increase or decrease the number of RDD/DataFrame partitions (always triggers a full shuffle)
- coalesce():
used only to decrease the number of partitions, and does so efficiently
(an optimized version of repartition(): coalesce() moves less data across partitions)
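A minimal sketch of the two calls; the DataFrame and the partition counts here are illustrative assumptions, not values from these notes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
df = spark.range(1_000_000)  # example DataFrame

df8 = df.repartition(8)   # full shuffle: can increase or decrease the partition count
df2 = df8.coalesce(2)     # merges existing partitions: can only decrease the count

print(df8.rdd.getNumPartitions())  # 8
print(df2.rdd.getNumPartitions())  # 2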
Partition on disk:
While writing a PySpark DataFrame back to disk, you can choose how to partition the data by column using
`partitionBy()` of pyspark.sql.DataFrameWriter.
This is similar to Hive's partitioning scheme.
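A hedged sketch of `partitionBy()`; the column names and output path are placeholder assumptions:

# Writes one sub-directory per distinct (dt, country) pair, Hive-style.
df = spark.createDataFrame(
    [("2024-01-01", "IN", 10), ("2024-01-01", "US", 20)],
    ["dt", "country", "amount"],
)
(df.write
   .partitionBy("dt", "country")  # layout: /tmp/sales/dt=.../country=.../part-*.parquet
   .mode("overwrite")
   .parquet("/tmp/sales"))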
HDFS (Hadoop Distributed File System):
- A distributed file system provided by the Apache Hadoop project.
- A key component of the Hadoop ecosystem, designed to store large amounts of data reliably across multiple machines.
Overview of HDFS:
1. Distributed Storage:
HDFS stores data across a cluster of machines in a distributed manner.
Files are divided into blocks, and each block is replicated across multiple nodes in the cluster for fault tolerance.
2. Fault Tolerance:
HDFS provides fault tolerance by replicating data across multiple nodes in the cluster.
If a node fails, the data can still be accessed from other nodes where it is replicated.
3. Scalability:
HDFS is designed to scale horizontally to accommodate large amounts of data.
New nodes can be added to the cluster as needed to increase storage capacity and processing power.
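PySpark reads and writes HDFS paths through the same DataFrame API; a minimal sketch, assuming a placeholder namenode host/port and file paths:

# "namenode:9000" and the paths are assumptions; substitute your cluster's values.
df = spark.read.csv("hdfs://namenode:9000/data/input.csv", header=True)
df.write.mode("overwrite").parquet("hdfs://namenode:9000/data/output")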
PySpark cache explained: https://sparkbyexamples.com/pyspark/pyspark-cache-explained/
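A small caching sketch in the spirit of the linked article; the DataFrame is illustrative:

df = spark.range(100).filter("id % 2 = 0")
df.cache()      # lazy: marks df for in-memory storage
df.count()      # first action materializes the cache
df.count()      # subsequent actions read from the cache
df.unpersist()  # releases the cached data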
Common aggregate functions; "someCol" is a placeholder column name:
from pyspark.sql.functions import countDistinct, avg, stddev
df.select(countDistinct("someCol").alias("myAlias"))
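avg() and stddev() follow the same pattern:
df.select(avg("someCol").alias("mean"), stddev("someCol").alias("std")).show()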