Produced using gpt4o https://gist.github.com/rom1504/9400a2213dd5459def72cb030c0e9d28
Transcript of 西安方言词典, the Xi'an dialect dictionary
import os
import time
import json
import base64
from openai import OpenAI
from pdf2image import convert_from_path
from tqdm import tqdm

# --- Configuration ---
KEY_FILE = os.path.expanduser("~/chinese_pdf_key")
Generated by talking with ChatGPT. I instantly hit rate limits, so I'm not sure it really works, but something like this should work.
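For context, a minimal sketch of the page-by-page transcription loop the snippet above is setting up; the prompt, model name, DPI, and sleep-based pacing are assumptions, not the original script.

import io

def transcribe_page(client, image):
    # Encode the rendered page as a base64 PNG for the vision API.
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the Chinese dictionary text on this page."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def transcribe_pdf(pdf_path, out_path):
    client = OpenAI(api_key=open(KEY_FILE).read().strip())
    pages = convert_from_path(pdf_path, dpi=200)
    with open(out_path, "w") as f:
        for page in tqdm(pages):
            f.write(transcribe_page(client, page) + "\n")
            time.sleep(1)  # crude pacing to stay under rate limits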
This report evaluates the feasibility and cost of transitioning to offshore wind energy to meet global energy demand. The focus is on installing 19.66 TW of offshore wind capacity to match the estimated 620 EJ of global energy consumption in 2023, with the transition starting in 2024.
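As a sanity check on that figure (assuming the 620 EJ is spread evenly over a 365-day year, i.e. an average power rather than nameplate capacity):

# 620 EJ per year expressed as a constant average power.
energy_joules = 620e18                  # 620 EJ
seconds_per_year = 365 * 24 * 3600      # 31,536,000 s
average_power_tw = energy_joules / seconds_per_year / 1e12
print(round(average_power_tw, 2))       # 19.66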
End goal: a function that keeps only interesting video platform links. Having this would make it possible to collect billions of such links via cc2dataset.
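A hypothetical minimal sketch of what such a filter could look like; the platform list and URL patterns are illustrative, not the actual criteria.

import re

# Illustrative list of video platforms; the real criteria would come from the eval set.
VIDEO_URL_PATTERNS = [
    r"youtube\.com/watch",
    r"youtu\.be/",
    r"vimeo\.com/\d+",
    r"dailymotion\.com/video/",
]
VIDEO_URL_RE = re.compile("|".join(VIDEO_URL_PATTERNS))

def is_interesting_video_link(url: str) -> bool:
    # Binary classifier: True if the link points at a video page on a known platform.
    return VIDEO_URL_RE.search(url) is not None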
This function is a binary classifier. To evaluate it, we need links that naturally occur in Common Crawl. Criteria:
To collect this eval set we can:
""" | |
Can you improve it to avoid reading the whole tar file to count the number of samples? | |
""" | |
import json | |
import concurrent.futures | |
import tarfile | |
import fsspec | |
import io |
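One possible answer, sketched under the assumption that the tar lives on storage supporting range requests (e.g. HTTP or S3) and that each sample has one metadata file (a webdataset-style layout): open it through a seekable fsspec file and let tarfile seek past the member data, so only the headers are actually fetched.

def count_samples(tar_url, extension=".json"):
    # Count one sample per metadata file without extracting anything;
    # tarfile reads each 512-byte header and seeks over the member contents.
    with fsspec.open(tar_url, "rb") as f:
        with tarfile.open(fileobj=f, mode="r:") as tf:
            return sum(1 for member in tf if member.name.endswith(extension))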
""" | |
This is a deduplication method using pyspark. | |
input: table with id and 2 columns that contain float values | |
2 items are considered the same if the float values are equal with a threshold of 0.05 | |
algo: multiply the 2 columns by 1/0.05, resulting in 2 longs. Then use pyspark to perform exact dedup using these 2 columns | |
Pyspark does distributed sort then linear dedup, so this scales to 100B | |
""" |
import wandb
import os
import numpy as np
import time
from os import listdir
import uuid
import sys

path = "/fsx/home-rom1504/"
from pyspark.sql import SparkSession
import os
import sys
from pyspark import SparkContext
from pyspark.sql.functions import rand
import random
import math
import time
import boto3
See https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83 for a step-by-step guide on Spark jars.
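For reference, a typical way to pull in the S3A connector jars when building the session; the package versions are illustrative and must match the installed Spark/Hadoop build.

spark = (
    SparkSession.builder
    .appName("example")
    # Fetch the hadoop-aws jar and its AWS SDK dependency at session start.
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262",
    )
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)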