generated by talking with chatgpt
I instantly hit rate limits, so I'm not sure it really works, but something like that should work
This report evaluates the feasibility and cost of transitioning to offshore wind energy to meet global energy consumption. The focus is on installing 19.66 TW of offshore wind capacity to match the estimated global energy consumption of 620 EJ in 2023, with the transition starting in 2024.
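The 19.66 TW figure is simply the 620 EJ annual consumption expressed as continuous average power (capacity-factor considerations aside); a quick check, assuming a 365-day year:

import json

# 620 EJ delivered over one year, expressed as continuous average power
energy_j = 620e18                      # 620 EJ in joules
seconds_per_year = 365 * 24 * 3600     # 31,536,000 s
print(energy_j / seconds_per_year / 1e12)  # ~19.66 TW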
End goal: have a function that keeps only interesting video platform links. Having this would enable collecting billions of such links via cc2dataset
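A minimal sketch of what such a function could look like, assuming a hand-picked allowlist of platforms (the pattern and function name below are illustrative, not the actual implementation):

import re

# Hypothetical allowlist of video page URL shapes; extend as needed.
VIDEO_LINK_PATTERN = re.compile(
    r"https?://(?:www\.)?"
    r"(?:youtube\.com/watch\?|youtu\.be/|vimeo\.com/\d+|dailymotion\.com/video/)"
)

def is_interesting_video_link(url: str) -> bool:
    """Binary classifier: True if the URL looks like a video page on a kept platform."""
    return VIDEO_LINK_PATTERN.match(url) is not None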
This function is a binary classifier. To evaluate it, we need links that naturally occur in Common Crawl. Criteria:
To collect this eval set, we can:
""" | |
Can you improve it to avoid reading the whole tar file to count the number of samples? | |
""" | |
import json | |
import concurrent.futures | |
import tarfile | |
import fsspec | |
import io |
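A minimal sketch of one way to answer that question, assuming uncompressed tars on a filesystem fsspec can seek (e.g. s3), and assuming one .json file per sample as in webdataset shards. Because the file object is seekable, tarfile reads only each 512-byte member header and seeks past the contents instead of downloading them:

import tarfile
import fsspec

def count_samples(tar_url, extension=".json"):
    # fsspec returns a seekable file-like object, so tarfile can skip over
    # file contents by seeking rather than reading the whole archive.
    with fsspec.open(tar_url, "rb") as f:
        with tarfile.open(fileobj=f, mode="r:") as tf:
            return sum(1 for member in tf if member.name.endswith(extension))

The same function can then be fanned out over many shards with concurrent.futures, which is presumably what the imports above are for.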
""" | |
This is a deduplication method using pyspark. | |
input: table with id and 2 columns that contain float values | |
2 items are considered the same if the float values are equal with a threshold of 0.05 | |
algo: multiply the 2 columns by 1/0.05, resulting in 2 longs. Then use pyspark to perform exact dedup using these 2 columns | |
Pyspark does distributed sort then linear dedup, so this scales to 100B | |
""" |
import wandb
import os
import numpy as np
import time
from os import listdir
import uuid
import sys

path = "/fsx/home-rom1504/"
from pyspark.sql import SparkSession
import os
import sys
from pyspark import SparkContext
from pyspark.sql.functions import rand
import random
import math
import time
import boto3
See https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83 for a step-by-step guide on setting up the spark jars
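For reference, jars are typically attached when building the session; a minimal sketch with illustrative paths (the exact jars and versions depend on your setup, see the gist above; hadoop-aws plus the aws sdk bundle are the usual pair for s3a access):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark_with_jars")
    # Illustrative local paths; point these at the jars you actually downloaded.
    .config("spark.jars", "/path/to/hadoop-aws-3.3.1.jar,/path/to/aws-java-sdk-bundle-1.11.901.jar")
    .getOrCreate()
)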
import numpy as np
from scipy.fftpack import dct

def hash_algo(pil_img, size=10):
    """
    Get perceptual hash of the input image.
    Args:
        pil_img: PIL image to hash; the hash contains size * size bits.
    """
    # Standard pHash recipe: grayscale, resize, 2D DCT, threshold at the median.
    img = pil_img.convert("L").resize((size * 4, size * 4))
    pixels = np.asarray(img, dtype=np.float64)
    dct_coeffs = dct(dct(pixels, axis=0), axis=1)  # 2D DCT
    low_freq = dct_coeffs[:size, :size]  # keep the low-frequency block
    return (low_freq > np.median(low_freq)).flatten()
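Example usage, assuming Pillow is installed and a local image exists (the path is illustrative):

from PIL import Image

img = Image.open("example.jpg")
bits = hash_algo(img)  # boolean array of size * size entries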
Steps:
(you can get https://huggingface.co/datasets/laion/laion-coco/resolve/main/part-00000-2256f782-126f-4dc6-b9c6-e6757637749d-c000.snappy.parquet as an example parquet)
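A minimal sketch for fetching and inspecting that example parquet (assumes pandas with fsspec/pyarrow installed so the URL can be read directly):

import pandas as pd

url = (
    "https://huggingface.co/datasets/laion/laion-coco/resolve/main/"
    "part-00000-2256f782-126f-4dc6-b9c6-e6757637749d-c000.snappy.parquet"
)
df = pd.read_parquet(url)  # fsspec handles the https read
print(len(df), list(df.columns))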