https://stackoverflow.com/questions/78186018/fuzzy-logic-to-match-the-records-in-a-dataframe/78188853#78188853
I have taken inspiration from this blog post to write the following code:
https://leons.im/posts/a-python-implementation-of-simhash-algorithm/
The `cluster_names` function clusters the strings in the list based on the `cluster_threshold` value. You can tweak this value to get good results. You can also play around with `shingling_width` in `name_to_features`: you can create features of width 2, 3, 4, 5 and so on, and concatenate them together.
Once you are satisfied with your clusters, you can then do further `fuzzywuzzy` matching (the library has been renamed `thefuzz`) to find more exact matches.
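For reference, here is a minimal, self-contained sketch of that idea (not the gist's exact code): `name_to_features` builds character shingles, a basic SimHash hashes them, and a greedy `cluster_names` merges names whose hashes are within `cluster_threshold` Hamming distance. The threshold and sample data below are made up.

```python
import hashlib

def name_to_features(name, shingling_width=3):
    # Character shingles of the given width; widths 2-5 can be concatenated for richer features.
    s = name.lower().strip()
    return [s[i:i + shingling_width] for i in range(max(1, len(s) - shingling_width + 1))]

def simhash(features, bits=64):
    # Classic SimHash: every feature hash votes +1/-1 on each bit position.
    v = [0] * bits
    for f in features:
        h = int(hashlib.md5(f.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

def cluster_names(names, cluster_threshold=12, shingling_width=3):
    # Greedy clustering: a name joins the first cluster whose representative
    # hash is within `cluster_threshold` Hamming distance of its own simhash.
    clusters = []  # list of (representative_hash, [names])
    for name in names:
        h = simhash(name_to_features(name, shingling_width))
        for rep, members in clusters:
            if hamming_distance(rep, h) <= cluster_threshold:
                members.append(name)
                break
        else:
            clusters.append((h, [name]))
    return [members for _, members in clusters]

print(cluster_names(["Acme Corp", "ACME Corporation", "Globex Inc", "Globex Incorporated"]))
```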
https://stackoverflow.com/questions/78162865/handling-column-breaks-in-pipe-delimited-file/78182964#78182964
You can use `dask` for this preprocessing task. The following code processes the 50 GB file in blocks of 500 MB and writes the output out in 5 partitions. Everything is a delayed/lazy operation, just like in Spark. Let me know how it goes. You may have to remove the header line from the data and then supply the header in your Spark dataframe.
Install dask with:
`pip install dask[complete]`
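A rough sketch of a pipeline with that shape is shown below; the file name, separator handling, and the cleanup step are placeholders to adapt to your data.

```python
import dask.dataframe as dd

# Read the pipe-delimited file lazily in ~500 MB blocks; nothing runs yet.
ddf = dd.read_csv(
    "big_input.txt",      # hypothetical path to the ~50 GB file
    sep="|",
    blocksize="500MB",
    dtype=str,            # keep everything as strings while preprocessing
)

# ...apply the row/column fix-ups you need here, e.g. with ddf.map_partitions(...)...

# Repartition and write out; the whole pipeline executes at this point.
ddf.repartition(npartitions=5).to_csv("cleaned_part_*.csv", sep="|", index=False)
```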
I have adapted the following Jupyter notebook to show how Spark can do video processing at scale.
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1969271421694072/3760413548916830/5612335034456173/latest.html
You need to install the following Python libraries in your conda environment. Also make sure you have the native ffmpeg library installed:
`pip install ffmpeg-python`
`pip install face-recognition`
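As a rough sketch of what the distributed part can look like (the video paths and the one-frame-per-second sampling are my assumptions, not necessarily what the notebook does):

```python
import ffmpeg                      # from ffmpeg-python
import face_recognition
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("video-face-detection").getOrCreate()

def faces_per_second(video_path):
    # Decode one frame per second into raw RGB bytes with ffmpeg, then run
    # face detection on each frame with face_recognition.
    info = ffmpeg.probe(video_path)
    stream = next(s for s in info["streams"] if s["codec_type"] == "video")
    w, h = int(stream["width"]), int(stream["height"])
    raw, _ = (
        ffmpeg.input(video_path)
        .filter("fps", fps=1)
        .output("pipe:", format="rawvideo", pix_fmt="rgb24")
        .run(capture_stdout=True, quiet=True)
    )
    frames = np.frombuffer(raw, np.uint8).reshape([-1, h, w, 3])
    return [(video_path, i, len(face_recognition.face_locations(frame)))
            for i, frame in enumerate(frames)]

# Hypothetical list of videos on storage that every executor can read.
video_paths = ["/mnt/videos/clip1.mp4", "/mnt/videos/clip2.mp4"]

results = (
    spark.sparkContext
    .parallelize(video_paths, len(video_paths))   # one partition per video
    .flatMap(faces_per_second)
    .collect()
)
print(results)    # [(path, second_index, n_faces), ...]
```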
Question: https://stackoverflow.com/questions/78080522/md5-hash-of-huge-files-using-pyspark/
Here is a workflow that can help you achieve this.
Since this is one large file of 2 TB, you first need to split it into smaller chunks of, say, 1 GB.
The reason for splitting is explained here:
https://community.databricks.com/t5/community-discussions/very-large-binary-files-ingestion-error-when-using-binaryfile/td-p/47440
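A hedged sketch of the hashing step after the split (e.g. with `split -b 1G`) and upload to a location the cluster can read; note that this yields one MD5 per chunk, which is not the same as the MD5 of the original 2 TB file.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Each ~1 GB piece becomes one row, with its bytes in the `content` column.
chunk_hashes = (
    spark.read.format("binaryFile")
    .load("s3://my-bucket/bigfile-chunks/chunk_*")   # hypothetical location
    .select("path", F.md5("content").alias("chunk_md5"))
    .orderBy("path")    # names produced by `split` sort in original file order
)
chunk_hashes.show(truncate=False)
```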
https://stackoverflow.com/questions/78052071/pyspark-count-over-a-window-with-reset/78060131#78060131
I modified my answer from https://stackoverflow.com/a/78056548/3238085 to fit this problem setup.
import sys
from pyspark.sql import Window
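Beyond those imports, here is a generic sketch of the usual "count with reset" pattern, not the gist's exact code: the data and the reset flag are made up, a running sum of the flag defines blocks, and counting within each block restarts the counter.

```python
import sys
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("u1", 1, 0), ("u1", 2, 0), ("u1", 3, 1), ("u1", 4, 0), ("u1", 5, 1)],
    ["user", "ts", "reset"],
)

# Running sum of the reset flag up to the current row gives a block id that
# increases at every reset row.
running = Window.partitionBy("user").orderBy("ts").rowsBetween(-sys.maxsize, 0)

result = (
    df.withColumn("block", F.sum("reset").over(running))
      # Counting rows within (user, block) restarts the counter after each reset.
      .withColumn("count_with_reset",
                  F.row_number().over(Window.partitionBy("user", "block").orderBy("ts")))
)
result.show()
```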
https://stackoverflow.com/questions/78050162/pyspark-group-by-date-range/
I used the following answer as inspiration for the code below.
Basically, clever use of a complex accumulator function allows the grouping index to be computed properly.
https://stackoverflow.com/a/64957835/3238085
import sys
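Beyond that import, here is a sketch of the accumulator idea on made-up data, not the gist's exact code: per `id`, the sorted dates are folded with `aggregate`, and a new group index starts whenever a date falls more than 30 days after the current group's first (anchor) date.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", "2024-01-01"), ("a", "2024-01-20"), ("a", "2024-02-15"), ("a", "2024-02-20")],
    ["id", "dt"],
).withColumn("dt", F.to_date("dt"))

# The accumulator struct carries the current anchor date, the current group
# index, and the list of group indexes assigned so far.
assign_groups = """
aggregate(
  dates,
  named_struct('anchor', cast(null as date), 'g', -1, 'gs', cast(array() as array<int>)),
  (acc, x) -> if(acc.anchor is null or datediff(x, acc.anchor) > 30,
                 named_struct('anchor', x, 'g', acc.g + 1,
                              'gs', concat(acc.gs, array(acc.g + 1))),
                 named_struct('anchor', acc.anchor, 'g', acc.g,
                              'gs', concat(acc.gs, array(acc.g)))),
  acc -> acc.gs
)
"""

result = (
    df.groupBy("id")
      .agg(F.sort_array(F.collect_list("dt")).alias("dates"))
      .withColumn("grp", F.expr(assign_groups))
      .withColumn("zipped", F.explode(F.arrays_zip("dates", "grp")))
      .select("id", F.col("zipped.dates").alias("dt"),
              F.col("zipped.grp").alias("group_index"))
)
result.show()
```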
https://stackoverflow.com/questions/77914457/unzipping-multiple-files-from-1-zip-files-using-emr/
Porting my previous answer from a tarred gzipped archive to a zipped archive actually wasn't that difficult.
Important point to keep in mind:
Repartition the RDD with `numPartitionsProvided` set to a suitably large number so that all your executors are utilized.
`ZipFileReader.scala`
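The gist itself is Scala (`ZipFileReader.scala`); purely as an illustration of the idea in PySpark (paths and partition count are placeholders): each zip is read as one binary record, the RDD is repartitioned so every executor gets work, and the entries are expanded on the executors.

```python
import io
import zipfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
numPartitionsProvided = 200          # tune so all executors get work

def expand_zip(path_and_bytes):
    # Expand one zip archive into (entry_path, entry_text) records.
    path, data = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if not name.endswith("/"):
                yield (f"{path}/{name}",
                       zf.read(name).decode("utf-8", errors="replace"))

records = (
    spark.sparkContext
    .binaryFiles("s3://my-bucket/zips/*.zip")    # hypothetical location
    .repartition(numPartitionsProvided)          # spread the zips across executors
    .flatMap(expand_zip)
)
print(records.take(2))
```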
https://stackoverflow.com/questions/77914457/unzipping-multiple-files-from-1-zip-files-using-emr/
The following is a solution in Scala. I had to do this before at my job, so I am extracting the relevant bits here.
A few important points to keep in mind:
If possible in your workflow, create a tar.gz of your files instead of a zip, because I have only tried it with that format.
Secondly, repartition the RDD with `numPartitionsProvided` set to a suitably large number so that all your executors are utilized.
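Again, only as a PySpark illustration of the idea (the actual solution is Scala): with tar.gz archives the structure is the same, with `tarfile` doing the extraction instead of `zipfile`.

```python
import io
import tarfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
numPartitionsProvided = 200          # tune so all executors get work

def expand_tar_gz(path_and_bytes):
    # Expand one tar.gz archive into (entry_path, entry_text) records.
    path, data = path_and_bytes
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tf:
        for member in tf.getmembers():
            if member.isfile():
                yield (f"{path}/{member.name}",
                       tf.extractfile(member).read().decode("utf-8", errors="replace"))

records = (
    spark.sparkContext
    .binaryFiles("s3://my-bucket/archives/*.tar.gz")   # hypothetical location
    .repartition(numPartitionsProvided)
    .flatMap(expand_tar_gz)
)
```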
The fractional seconds in your timestamp (".71910") have five digits. Spark expects up to three digits (milliseconds) for fractional seconds, and having more than three can cause a parsing error.
Here's the modified code, which works.
import sys
from pyspark import SparkContext, SQLContext
from pyspark.sql import functions as F
import dateutil.parser
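Given the `dateutil` import above, the fix presumably parses the timestamp inside a UDF; here is a minimal, self-contained sketch of that route on a made-up value (the gist's actual code may differ).

```python
import dateutil.parser
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

@F.udf(returnType=T.TimestampType())
def parse_ts(s):
    # dateutil happily accepts five fractional digits and normalizes them to microseconds.
    return dateutil.parser.parse(s) if s else None

df = spark.createDataFrame([("2024-03-01 10:15:30.71910",)], ["raw_ts"])
df.withColumn("ts", parse_ts("raw_ts")).show(truncate=False)
```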
I am adapting my previous answer from here:
https://gist.github.com/dineshdharme/7c13dcde72e42fdd3ec47d1ad40f6177
The GraphFrames jar (242 KB) can be found at this location:
https://mvnrepository.com/artifact/graphframes/graphframes/0.8.1-spark3.0-s_2.12
Requirements:
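Once those requirements are in place, a minimal GraphFrames run looks roughly like the sketch below (a simple connected-components example, which may well differ from what the gist actually does). It assumes the jar/package above is already available to the cluster, e.g. passed via `--jars` or `--packages` at submit time.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()
# connectedComponents() requires a checkpoint directory.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

vertices = spark.createDataFrame([("a",), ("b",), ("c",), ("d",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

g = GraphFrame(vertices, edges)
g.connectedComponents().show()
```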