Romain Beaumont rom1504
rom1504 / dalle_mega_prompts.json
Last active May 14, 2022 15:04
[
"t-shirt, size M",
"flower dress, size M",
"a t-shirt of an avocado",
"a rainbow hat",
"white snow covered mountain under blue sky during daytime",
"aerial view of the beach during daytime",
"aerial view of the beach at night",
"double rainbow over a lake",
"a beautiful sunset at a beach with a shell on the shore",
rom1504 / monitor_efa_aws.py
Last active July 25, 2022 17:59
from glob import glob
from os.path import expanduser
import time
import datetime

def get_read_bytes():
    # sum the cumulative RDMA read-byte counters across all InfiniBand/EFA devices
    return sum(
        int(open(f"{p}/ports/1/hw_counters/rdma_read_bytes", "r").read().strip())
        for p in glob("/sys/class/infiniband/*")
    )

home = expanduser("~")
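The preview cuts off before the monitoring loop, but `get_read_bytes` returns a cumulative byte counter, so throughput is just the delta between two samples divided by the sampling interval. A minimal sketch of that idea (the `throughput` helper and the fake counter are illustrative, not from the gist):

```python
import time

def throughput(read_counter, interval=1.0):
    """Sample a cumulative byte counter twice and return bytes per second.
    `read_counter` is any zero-argument callable, e.g. get_read_bytes above."""
    start = read_counter()
    time.sleep(interval)
    return (read_counter() - start) / interval

# illustrate with a fake counter that advances by 1000 bytes between samples
samples = iter([0, 1000])
rate = throughput(lambda: next(samples), interval=0.1)
print(rate)  # 10000.0 bytes/s
```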
rom1504 / upload_to_hf.md
Last active February 5, 2023 10:19
rom1504 / fsspec_sync.py
Last active May 13, 2022 07:38
fsspec sync
from multiprocessing.pool import ThreadPool
import fsspec
from tqdm import tqdm
import sys

# command-line parameters (their exact meaning is cut off in this preview)
m = int(sys.argv[1])
t = int(sys.argv[2])
path_s3 = "s3://laion5b/data/laion1B-nolang"
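The preview ends before the copy logic, but the imports suggest a ThreadPool mapped over a file listing. A hedged sketch of that parallel-copy pattern, using local temp directories in place of `path_s3` so it runs anywhere (the real gist presumably drives fsspec against s3, with `m` and `t` sharding the work):

```python
from multiprocessing.pool import ThreadPool
import os
import shutil
import tempfile

src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
with open(os.path.join(src, "part0.parquet"), "wb") as f:
    f.write(b"data")

def copy_one(name):
    # each thread copies one file; I/O-bound work is a good fit for threads
    shutil.copy(os.path.join(src, name), os.path.join(dst, name))

with ThreadPool(4) as pool:
    pool.map(copy_one, os.listdir(src))

print(sorted(os.listdir(dst)))  # ['part0.parquet']
```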
rom1504 / domain_stats.py
Last active March 13, 2022 17:44
pyspark domain udf
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from urllib.parse import urlparse
from pyspark.sql.functions import udf
import os
import fire

def main(input_folder, output_folder):
    spark = (
        SparkSession.builder.config("spark.driver.memory", "16G")
        .master("local[16]")
        .appName("spark-stats")
        .getOrCreate()
    )
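The preview stops at the session setup, but the description ("pyspark domain udf") and the `urlparse` import indicate the core logic extracts a URL's domain. A minimal sketch of that extraction (the function name is mine, not from the gist); wrapping it with `udf(get_domain)` would register it for use in Spark:

```python
from urllib.parse import urlparse

def get_domain(url: str) -> str:
    # netloc is the host part of the URL, e.g. "example.com"
    return urlparse(url).netloc

domain = get_domain("https://example.com/images/cat.jpg")
print(domain)  # example.com
```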
rom1504 / stoppable_process.py
Created January 9, 2022 02:50
from multiprocessing import Process, Queue
import queue
import time

class StoppableProcess(Process):
    def __init__(self, lol):
        super().__init__()
        self.lol = lol
        self.q = Queue()
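The preview ends inside `__init__`, but the Process/Queue pair suggests the usual pattern: the worker polls its queue and exits when a stop message arrives. A hedged completion — `run`, `stop`, and the stop sentinel are my guesses at the truncated remainder:

```python
from multiprocessing import Process, Queue
import queue

class StoppableProcess(Process):
    def __init__(self, lol):
        super().__init__()
        self.lol = lol
        self.q = Queue()

    def stop(self):
        # ask the worker to exit at its next queue check
        self.q.put("stop")

    def run(self):
        while True:
            try:
                if self.q.get(timeout=0.1) == "stop":
                    return
            except queue.Empty:
                pass  # no stop request yet; do a unit of work here

p = StoppableProcess(lol=None)
p.start()
p.stop()
p.join(timeout=5)
print(p.is_alive())
```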
rom1504 / parquet_to_tfrecord_pyspark.py
Last active July 21, 2021 23:03
# I advise running this in an interactive environment (python shell, jupyter, ...) to understand each step
from pyspark.sql import SparkSession

# Let's get tfrecord and rapids (rapids is not necessary; remove all mentions of it if you don't want it)
# wget https://search.maven.org/remotecontent?filepath=com/linkedin/sparktfrecord/spark-tfrecord_2.12/0.3.2/spark-tfrecord_2.12-0.3.2.jar -O tfrecord.jar
# wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/21.06.0/rapids-4-spark_2.12-21.06.0.jar -O rapids.jar
# wget https://repo1.maven.org/maven2/ai/rapids/cudf/21.06.1/cudf-21.06.1-cuda11.jar -O cudf.jar

# create the spark session with the tfrecord and rapids plugins, some basic options, and a local executor
spark = (
    SparkSession.builder.config("spark.jars", "tfrecord.jar,rapids.jar,cudf.jar")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.incompatibleOps.enabled", "true")
    .config("spark.driver.memory", "16G")
    .master("local[16]")
    .appName("spark-stats")
    .getOrCreate()
)
rom1504 / a_download_cah_from_theeye.md
Last active August 1, 2021 20:57
cah_download_from_theeye.py

This is about downloading http://the-eye.eu/eleuther_staging/cah/, a large dataset of image/text pairs filtered from Common Crawl.

  1. Run get_links.sh; this produces a to_aria.txt file containing all the URLs to download and where to put them.
  2. Run download.sh; it uses aria2c to download the files quickly (takes about 1h).

Note: if you only want one type of file, change the grep 'csv\|txt\|pkl\|tfrecord' part.
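The same filtering can be sketched in Python: a stricter extension check in place of grep's substring match (the URL list here is hypothetical; the real list lives in to_aria.txt):

```python
# keep only the file types of interest, mirroring grep 'csv\|txt\|pkl\|tfrecord'
WANTED = {"csv", "txt", "pkl", "tfrecord"}

def keep(url: str) -> bool:
    # match on the file extension rather than grep's substring match
    return url.rsplit(".", 1)[-1] in WANTED

urls = [
    "http://the-eye.eu/eleuther_staging/cah/part0.csv",
    "http://the-eye.eu/eleuther_staging/cah/preview.jpg",
]
filtered = [u for u in urls if keep(u)]
print(filtered)  # ['http://the-eye.eu/eleuther_staging/cah/part0.csv']
```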

rom1504 / a_cah_to_parquet_pyspark.md
Last active August 10, 2021 19:46
cah_stats_spark.py
rom1504 / a_cah_stats_dask.md
Last active July 20, 2021 19:19
cah_stats_dask