https://stackoverflow.com/questions/78186018/fuzzy-logic-to-match-the-records-in-a-dataframe/78188853#78188853
I have taken inspiration from this blog post to write the following code:
https://leons.im/posts/a-python-implementation-of-simhash-algorithm/
The `cluster_names` function clusters the strings in the list based on the `cluster_threshold` value. You can tweak this value to get good results. You can also play around with `shingling_width` in `name_to_features`: you can create features of width 2, 3, 4, 5 and so on, and concatenate them together.
Once you are satisfied with your clusters, you can then do further `fuzzywuzzy` matching (the library has been renamed `thefuzz`) to find more exact matches.
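For reference, here is a minimal, self-contained sketch of that idea (not the gist's exact code): `name_to_features` builds character shingles, a basic SimHash hashes them, and a greedy `cluster_names` merges names whose hashes are within `cluster_threshold` Hamming distance. The threshold and sample data below are made up.

```python
import hashlib

def name_to_features(name, shingling_width=3):
    # Character shingles of the given width; widths 2-5 can be concatenated for richer features.
    s = name.lower().strip()
    return [s[i:i + shingling_width] for i in range(max(1, len(s) - shingling_width + 1))]

def simhash(features, bits=64):
    # Classic SimHash: every feature hash votes +1/-1 on each bit position.
    v = [0] * bits
    for f in features:
        h = int(hashlib.md5(f.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

def cluster_names(names, cluster_threshold=12, shingling_width=3):
    # Greedy clustering: a name joins the first cluster whose representative
    # hash is within `cluster_threshold` Hamming distance of its own simhash.
    clusters = []  # list of (representative_hash, [names])
    for name in names:
        h = simhash(name_to_features(name, shingling_width))
        for rep, members in clusters:
            if hamming_distance(rep, h) <= cluster_threshold:
                members.append(name)
                break
        else:
            clusters.append((h, [name]))
    return [members for _, members in clusters]

print(cluster_names(["Acme Corp", "ACME Corporation", "Globex Inc", "Globex Incorporated"]))
```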
https://stackoverflow.com/questions/78162865/handling-column-breaks-in-pipe-delimited-file/78182964#78182964
You can use `dask` for this preprocessing task. The following code processes the 50 GB file in blocks of 500 MB and writes the output out in 5 partitions. Everything is a delayed/lazy operation, just like in Spark. Let me know how it goes. You may have to remove the header line from the data and then supply the header in your Spark dataframe.
Install dask with:
`pip install dask[complete]`
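A rough sketch of a pipeline with that shape is shown below; the file name, separator handling, and the cleanup step are placeholders to adapt to your data.

```python
import dask.dataframe as dd

# Read the pipe-delimited file lazily in ~500 MB blocks; nothing runs yet.
ddf = dd.read_csv(
    "big_input.txt",      # hypothetical path to the ~50 GB file
    sep="|",
    blocksize="500MB",
    dtype=str,            # keep everything as strings while preprocessing
)

# ...apply the row/column fix-ups you need here, e.g. with ddf.map_partitions(...)...

# Repartition and write out; the whole pipeline executes at this point.
ddf.repartition(npartitions=5).to_csv("cleaned_part_*.csv", sep="|", index=False)
```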
I have adapted the following Jupyter notebook to show how Spark can do video processing at scale.
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1969271421694072/3760413548916830/5612335034456173/latest.html
You need to install the following Python libraries in your conda environment. Also make sure you have the native ffmpeg library installed:
`pip install ffmpeg-python`
`pip install face-recognition`
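As a rough sketch of what the distributed part can look like (the video paths and the one-frame-per-second sampling are my assumptions, not necessarily what the notebook does):

```python
import ffmpeg                      # from ffmpeg-python
import face_recognition
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("video-face-detection").getOrCreate()

def faces_per_second(video_path):
    # Decode one frame per second into raw RGB bytes with ffmpeg, then run
    # face detection on each frame with face_recognition.
    info = ffmpeg.probe(video_path)
    stream = next(s for s in info["streams"] if s["codec_type"] == "video")
    w, h = int(stream["width"]), int(stream["height"])
    raw, _ = (
        ffmpeg.input(video_path)
        .filter("fps", fps=1)
        .output("pipe:", format="rawvideo", pix_fmt="rgb24")
        .run(capture_stdout=True, quiet=True)
    )
    frames = np.frombuffer(raw, np.uint8).reshape([-1, h, w, 3])
    return [(video_path, i, len(face_recognition.face_locations(frame)))
            for i, frame in enumerate(frames)]

# Hypothetical list of videos on storage that every executor can read.
video_paths = ["/mnt/videos/clip1.mp4", "/mnt/videos/clip2.mp4"]

results = (
    spark.sparkContext
    .parallelize(video_paths, len(video_paths))   # one partition per video
    .flatMap(faces_per_second)
    .collect()
)
print(results)    # [(path, second_index, n_faces), ...]
```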
Question: https://stackoverflow.com/questions/78080522/md5-hash-of-huge-files-using-pyspark/
Here is a workflow that can help you achieve this.
Since this is one large file of 2 TB, you first need to split it into smaller chunks of, say, 1 GB.
The reason for splitting is explained here:
https://community.databricks.com/t5/community-discussions/very-large-binary-files-ingestion-error-when-using-binaryfile/td-p/47440
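A hedged sketch of the hashing step after the split (e.g. with `split -b 1G`) and upload to a location the cluster can read; note that this yields one MD5 per chunk, which is not the same as the MD5 of the original 2 TB file.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Each ~1 GB piece becomes one row, with its bytes in the `content` column.
chunk_hashes = (
    spark.read.format("binaryFile")
    .load("s3://my-bucket/bigfile-chunks/chunk_*")   # hypothetical location
    .select("path", F.md5("content").alias("chunk_md5"))
    .orderBy("path")    # names produced by `split` sort in original file order
)
chunk_hashes.show(truncate=False)
```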
https://stackoverflow.com/questions/78052071/pyspark-count-over-a-window-with-reset/78060131#78060131
I modified my answer from https://stackoverflow.com/a/78056548/3238085 to fit this problem setup.
import sys
from pyspark.sql import Window
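Beyond those imports, here is a generic sketch of the usual "count with reset" pattern, not the gist's exact code: the data and the reset flag are made up, a running sum of the flag defines blocks, and counting within each block restarts the counter.

```python
import sys
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("u1", 1, 0), ("u1", 2, 0), ("u1", 3, 1), ("u1", 4, 0), ("u1", 5, 1)],
    ["user", "ts", "reset"],
)

# Running sum of the reset flag up to the current row gives a block id that
# increases at every reset row.
running = Window.partitionBy("user").orderBy("ts").rowsBetween(-sys.maxsize, 0)

result = (
    df.withColumn("block", F.sum("reset").over(running))
      # Counting rows within (user, block) restarts the counter after each reset.
      .withColumn("count_with_reset",
                  F.row_number().over(Window.partitionBy("user", "block").orderBy("ts")))
)
result.show()
```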
https://stackoverflow.com/questions/78050162/pyspark-group-by-date-range/
I used the following answer as inspiration for the code below.
Basically, clever use of a complex accumulator function allows the grouping index to be computed properly.
https://stackoverflow.com/a/64957835/3238085
import sys
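Beyond that import, here is a sketch of the accumulator idea on made-up data, not the gist's exact code: per `id`, the sorted dates are folded with `aggregate`, and a new group index starts whenever a date falls more than 30 days after the current group's first (anchor) date.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", "2024-01-01"), ("a", "2024-01-20"), ("a", "2024-02-15"), ("a", "2024-02-20")],
    ["id", "dt"],
).withColumn("dt", F.to_date("dt"))

# The accumulator struct carries the current anchor date, the current group
# index, and the list of group indexes assigned so far.
assign_groups = """
aggregate(
  dates,
  named_struct('anchor', cast(null as date), 'g', -1, 'gs', cast(array() as array<int>)),
  (acc, x) -> if(acc.anchor is null or datediff(x, acc.anchor) > 30,
                 named_struct('anchor', x, 'g', acc.g + 1,
                              'gs', concat(acc.gs, array(acc.g + 1))),
                 named_struct('anchor', acc.anchor, 'g', acc.g,
                              'gs', concat(acc.gs, array(acc.g)))),
  acc -> acc.gs
)
"""

result = (
    df.groupBy("id")
      .agg(F.sort_array(F.collect_list("dt")).alias("dates"))
      .withColumn("grp", F.expr(assign_groups))
      .withColumn("zipped", F.explode(F.arrays_zip("dates", "grp")))
      .select("id", F.col("zipped.dates").alias("dt"),
              F.col("zipped.grp").alias("group_index"))
)
result.show()
```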
https://stackoverflow.com/questions/77914457/unzipping-multiple-files-from-1-zip-files-using-emr/
Porting my previous answer from a tarred gzipped archive to a zipped archive actually wasn't that difficult.
Important point to keep in mind:
Repartition the RDD with `numPartitionsProvided` set to a suitably large number so that all your executors are utilized.
`ZipFileReader.scala`
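The gist itself is Scala (`ZipFileReader.scala`); purely as an illustration of the idea in PySpark (paths and partition count are placeholders): each zip is read as one binary record, the RDD is repartitioned so every executor gets work, and the entries are expanded on the executors.

```python
import io
import zipfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
numPartitionsProvided = 200          # tune so all executors get work

def expand_zip(path_and_bytes):
    # Expand one zip archive into (entry_path, entry_text) records.
    path, data = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if not name.endswith("/"):
                yield (f"{path}/{name}",
                       zf.read(name).decode("utf-8", errors="replace"))

records = (
    spark.sparkContext
    .binaryFiles("s3://my-bucket/zips/*.zip")    # hypothetical location
    .repartition(numPartitionsProvided)          # spread the zips across executors
    .flatMap(expand_zip)
)
print(records.take(2))
```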
https://stackoverflow.com/questions/77914457/unzipping-multiple-files-from-1-zip-files-using-emr/
The following is a solution in Scala. I had to do this before at my job, so I am extracting the relevant bits here.
A few important points to keep in mind:
If possible in your workflow, create a tar.gz of your files instead of a zip, because I have only tried it with that format.
Secondly, repartition the RDD with `numPartitionsProvided` set to a suitably large number so that all your executors are utilized.
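Again, only as a PySpark illustration of the idea (the actual solution is Scala): with tar.gz archives the structure is the same, with `tarfile` doing the extraction instead of `zipfile`.

```python
import io
import tarfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
numPartitionsProvided = 200          # tune so all executors get work

def expand_tar_gz(path_and_bytes):
    # Expand one tar.gz archive into (entry_path, entry_text) records.
    path, data = path_and_bytes
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tf:
        for member in tf.getmembers():
            if member.isfile():
                yield (f"{path}/{member.name}",
                       tf.extractfile(member).read().decode("utf-8", errors="replace"))

records = (
    spark.sparkContext
    .binaryFiles("s3://my-bucket/archives/*.tar.gz")   # hypothetical location
    .repartition(numPartitionsProvided)
    .flatMap(expand_tar_gz)
)
```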
The fractional seconds in your timestamp (".71910") have five digits. Spark expects up to three digits (milliseconds) for fractional seconds, and having more than three can cause a parsing error.
Here's the modified code, which works.
import sys
from pyspark import SparkContext, SQLContext
from pyspark.sql import functions as F
import dateutil.parser
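Given the `dateutil` import above, the fix presumably parses the timestamp inside a UDF; here is a minimal, self-contained sketch of that route on a made-up value (the gist's actual code may differ).

```python
import dateutil.parser
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

@F.udf(returnType=T.TimestampType())
def parse_ts(s):
    # dateutil happily accepts five fractional digits and normalizes them to microseconds.
    return dateutil.parser.parse(s) if s else None

df = spark.createDataFrame([("2024-03-01 10:15:30.71910",)], ["raw_ts"])
df.withColumn("ts", parse_ts("raw_ts")).show(truncate=False)
```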
I am adapting my previous answer from here:
https://gist.github.com/dineshdharme/7c13dcde72e42fdd3ec47d1ad40f6177
The GraphFrames jar (242 KB) can be found at this location:
https://mvnrepository.com/artifact/graphframes/graphframes/0.8.1-spark3.0-s_2.12
Requirements:
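Once those requirements are in place, a minimal GraphFrames run looks roughly like the sketch below (a simple connected-components example, which may well differ from what the gist actually does). It assumes the jar/package above is already available to the cluster, e.g. passed via `--jars` or `--packages` at submit time.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()
# connectedComponents() requires a checkpoint directory.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

vertices = spark.createDataFrame([("a",), ("b",), ("c",), ("d",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

g = GraphFrame(vertices, edges)
g.connectedComponents().show()
```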