Romain Beaumont rom1504

rom1504 / README.md
Created January 26, 2025 20:49
join wikivg and mcdata proto and add descriptions

Generated by talking with ChatGPT.

I instantly hit rate limits, so I'm not sure it really works, but something like this should.

rom1504 / wind_turbines_world_energy_plan.md
Last active December 15, 2024 11:24
Wind turbines to replace world energy production

Global Wind Energy Transition Report: 2024-2045

Introduction

This report evaluates the feasibility and cost of transitioning to offshore wind energy to meet global energy consumption. The focus is on installing 19.66 TW of offshore wind capacity to match the estimated global energy consumption of 620 EJ in 2023, with the transition starting in 2024.
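As a sanity check, the 19.66 TW capacity target follows directly from converting 620 EJ per year into average power (a quick sketch, assuming a 365-day year):

```python
# Convert annual energy consumption (EJ) to average power (TW).
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s
annual_energy_ej = 620  # estimated global energy consumption in 2023

avg_power_tw = annual_energy_ej * 1e18 / SECONDS_PER_YEAR / 1e12
print(f"{avg_power_tw:.2f} TW")  # prints "19.66 TW"
```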


Key Findings

Energy and Turbine Requirements

rom1504 / video_platform_filter.md
Last active October 20, 2023 15:51
Filtering url to keep only video platforms links

End goal: a function that keeps only interesting video-platform links. Having this would enable collecting billions of such links via cc2dataset

This function is a binary classifier. To evaluate it, we need links that naturally occur in Common Crawl. Criteria:

  • links not containing a video that can be downloaded by yt-dlp should be discarded
  • "Bad" links (e.g. porn) should be discarded in the vast majority of cases
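A minimal sketch of such a binary classifier, assuming a hypothetical hostname allowlist (the platforms listed here are only examples; a real filter would cover far more hosts and also carry a denylist for "bad" links):

```python
from urllib.parse import urlparse

# Hypothetical allowlist; a real classifier would cover many more
# platforms and pair this with a denylist for unwanted content.
VIDEO_HOSTS = {"youtube.com", "vimeo.com", "dailymotion.com"}

def is_video_platform_link(url: str) -> bool:
    """Return True if the URL points at a known video platform."""
    host = urlparse(url).netloc.lower()
    # Strip an optional "www." prefix before matching.
    if host.startswith("www."):
        host = host[4:]
    return host in VIDEO_HOSTS
```

Under the eval criteria above, the links this function keeps should overwhelmingly be ones yt-dlp can download.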

To collect this eval set we can:

rom1504 / Streaming.py
Last active August 7, 2023 02:02
Count tar. Generated by gpt4
"""
Can you improve it to avoid reading the whole tar file to count the number of samples?
"""
import json
import concurrent.futures
import tarfile
import fsspec
import io
rom1504 / bucket_dedup.py
Created February 19, 2023 21:47
bucket_dedup.py
"""
This is a deduplication method using pyspark.
input: table with id and 2 columns that contain float values
2 items are considered the same if the float values are equal with a threshold of 0.05
algo: multiply the 2 columns by 1/0.05, resulting in 2 longs. Then use pyspark to perform exact dedup using these 2 columns
Pyspark does distributed sort then linear dedup, so this scales to 100B
"""
rom1504 / does_it_freeze.py
Last active August 7, 2023 02:03
does_it_freeze.py
import wandb
import os
import numpy as np
import time
from os import listdir
import uuid
import sys
path = "/fsx/home-rom1504/"
rom1504 / spark_session_aws.py
Last active June 26, 2023 21:40
spark_session_aws.py
from pyspark.sql import SparkSession
import os
import sys
from pyspark import SparkContext
from pyspark.sql.functions import rand
import random
import math
import time
import boto3
rom1504 / spark_on_ssh.md
Last active August 7, 2023 02:03
spark_on_ssh.py
rom1504 / phash.py
Created December 1, 2022 17:58
phash.py
import numpy as np
from scipy.fftpack import dct

def hash_algo(pil_img, size=10):
    """
    Get perceptual hash of the input image.
    Args:
        pil_img: PIL image that corresponds to the image.