PySpark
Lecture 0: https://colab.research.google.com/drive/1crwg2yOosVEQATTlMfPe-FdiQKZF-gFN?usp=sharing
Lecture 1: https://drive.google.com/file/d/1_H5jhUut-DPjT5U0vmgCkg30DKZCa_-G/view?usp=sharing
Lecture 2: https://colab.research.google.com/drive/1DxeRBxRaqENX-HnxqSj-fexuRxBaTp9R?usp=sharing
Lecture 3: https://colab.research.google.com/drive/1fzTxGc0ttQeUV4D4VQf0zYDSOb6gDkoH?usp=sharing
Databricks notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6073444084857434/2270297475493747/7614095282524650/latest.html
What is a PySpark partition?
A:
- A way to split a large dataset into smaller datasets based on one or more partition keys.
- One of the main advantages of a PySpark DataFrame over a Pandas DataFrame: transformations on
partitioned data run faster because they execute in parallel, one task per partition.
PySpark supports partitioning in two ways:
- partitioning in memory (DataFrame)
- partitioning on disk (file system)
Partition in memory (see the sketch below):
- repartition():
used to increase or decrease the number of RDD/DataFrame partitions (always triggers a full shuffle)
- coalesce():
used only to decrease the number of partitions, and does so efficiently
(an optimized version of repartition(): coalesce() moves less data across partitions)
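A minimal sketch of the two calls; the DataFrame and the partition counts here are illustrative assumptions, not values from these notes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
df = spark.range(1_000_000)  # example DataFrame

df8 = df.repartition(8)   # full shuffle: can increase or decrease the partition count
df2 = df8.coalesce(2)     # merges existing partitions: can only decrease the count

print(df8.rdd.getNumPartitions())  # 8
print(df2.rdd.getNumPartitions())  # 2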
Partition on disk:
While writing a PySpark DataFrame back to disk, you can choose how to partition the data by column using
`partitionBy()` of pyspark.sql.DataFrameWriter.
This is similar to Hive's partitioning scheme.
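A hedged sketch of `partitionBy()`; the column names and output path are placeholder assumptions:

# Writes one sub-directory per distinct (dt, country) pair, Hive-style.
df = spark.createDataFrame(
    [("2024-01-01", "IN", 10), ("2024-01-01", "US", 20)],
    ["dt", "country", "amount"],
)
(df.write
   .partitionBy("dt", "country")  # layout: /tmp/sales/dt=.../country=.../part-*.parquet
   .mode("overwrite")
   .parquet("/tmp/sales"))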
HDFS (Hadoop Distributed File System):
- A distributed file system provided by the Apache Hadoop project.
- A key component of the Hadoop ecosystem, designed to store large amounts of data reliably across multiple machines.
Overview of HDFS:
1. Distributed Storage:
HDFS stores data across a cluster of machines in a distributed manner.
Files are divided into blocks, and each block is replicated across multiple nodes in the cluster for fault tolerance.
2. Fault Tolerance:
HDFS provides fault tolerance by replicating data across multiple nodes in the cluster.
If a node fails, the data can still be accessed from other nodes where it is replicated.
3. Scalability:
HDFS is designed to scale horizontally to accommodate large amounts of data.
New nodes can be added to the cluster as needed to increase storage capacity and processing power.
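PySpark reads and writes HDFS paths through the same DataFrame API; a minimal sketch, assuming a placeholder namenode host/port and file paths:

# "namenode:9000" and the paths are assumptions; substitute your cluster's values.
df = spark.read.csv("hdfs://namenode:9000/data/input.csv", header=True)
df.write.mode("overwrite").parquet("hdfs://namenode:9000/data/output")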
PySpark cache explained: https://sparkbyexamples.com/pyspark/pyspark-cache-explained/
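A small caching sketch in the spirit of the linked article; the DataFrame is illustrative:

df = spark.range(100).filter("id % 2 = 0")
df.cache()      # lazy: marks df for in-memory storage
df.count()      # first action materializes the cache
df.count()      # subsequent actions read from the cache
df.unpersist()  # releases the cached data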
Common aggregate functions; "someCol" is a placeholder column name:
from pyspark.sql.functions import countDistinct, avg, stddev
df.select(countDistinct("someCol").alias("myAlias"))
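avg() and stddev() follow the same pattern:
df.select(avg("someCol").alias("mean"), stddev("someCol").alias("std")).show()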