Produced using gpt4o https://gist.github.com/rom1504/9400a2213dd5459def72cb030c0e9d28
Transcript of 西安方言词典, the Xi'an dialect dictionary
import os
import time
import json
import base64
from openai import OpenAI
from pdf2image import convert_from_path
from tqdm import tqdm

# --- Configuration ---
KEY_FILE = os.path.expanduser("~/chinese_pdf_key")
Generated by talking with ChatGPT. I instantly hit rate limits, so I'm not sure it really works, but something like this should work.
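For context, a minimal sketch of the page-by-page transcription loop the snippet above is setting up; the prompt, model name, DPI, and sleep-based pacing are assumptions, not the original script.

import io

def transcribe_page(client, image):
    # Encode the rendered page as a base64 PNG for the vision API.
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the Chinese dictionary text on this page."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def transcribe_pdf(pdf_path, out_path):
    client = OpenAI(api_key=open(KEY_FILE).read().strip())
    pages = convert_from_path(pdf_path, dpi=200)
    with open(out_path, "w") as f:
        for page in tqdm(pages):
            f.write(transcribe_page(client, page) + "\n")
            time.sleep(1)  # crude pacing to stay under rate limits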
This report evaluates the feasibility and cost of transitioning to offshore wind energy to meet global energy demand. The focus is on installing 19.66 TW of offshore wind capacity to match the estimated 620 EJ of global energy consumption in 2023, with the transition starting in 2024.
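As a sanity check on that figure (assuming the 620 EJ is spread evenly over a 365-day year, i.e. an average power rather than nameplate capacity):

# 620 EJ per year expressed as a constant average power.
energy_joules = 620e18                  # 620 EJ
seconds_per_year = 365 * 24 * 3600      # 31,536,000 s
average_power_tw = energy_joules / seconds_per_year / 1e12
print(round(average_power_tw, 2))       # 19.66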
End goal: a function that keeps only interesting video platform links. Having this would make it possible to collect billions of such links via cc2dataset.
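A hypothetical minimal sketch of what such a filter could look like; the platform list and URL patterns are illustrative, not the actual criteria.

import re

# Illustrative list of video platforms; the real criteria would come from the eval set.
VIDEO_URL_PATTERNS = [
    r"youtube\.com/watch",
    r"youtu\.be/",
    r"vimeo\.com/\d+",
    r"dailymotion\.com/video/",
]
VIDEO_URL_RE = re.compile("|".join(VIDEO_URL_PATTERNS))

def is_interesting_video_link(url: str) -> bool:
    # Binary classifier: True if the link points at a video page on a known platform.
    return VIDEO_URL_RE.search(url) is not None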
This function is a binary classifier. To evaluate it, we need links that naturally occur in Common Crawl. Criteria:
To collect this eval set we can:
""" | |
Can you improve it to avoid reading the whole tar file to count the number of samples? | |
""" | |
import json | |
import concurrent.futures | |
import tarfile | |
import fsspec | |
import io |
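One possible answer, sketched under the assumption that the tar lives on storage supporting range requests (e.g. HTTP or S3) and that each sample has one metadata file (a webdataset-style layout): open it through a seekable fsspec file and let tarfile seek past the member data, so only the headers are actually fetched.

def count_samples(tar_url, extension=".json"):
    # Count one sample per metadata file without extracting anything;
    # tarfile reads each 512-byte header and seeks over the member contents.
    with fsspec.open(tar_url, "rb") as f:
        with tarfile.open(fileobj=f, mode="r:") as tf:
            return sum(1 for member in tf if member.name.endswith(extension))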
""" | |
This is a deduplication method using pyspark. | |
input: table with id and 2 columns that contain float values | |
2 items are considered the same if the float values are equal with a threshold of 0.05 | |
algo: multiply the 2 columns by 1/0.05, resulting in 2 longs. Then use pyspark to perform exact dedup using these 2 columns | |
Pyspark does distributed sort then linear dedup, so this scales to 100B | |
""" |
import wandb
import os
import numpy as np
import time
from os import listdir
import uuid
import sys

path = "/fsx/home-rom1504/"
from pyspark.sql import SparkSession
import os
import sys
from pyspark import SparkContext
from pyspark.sql.functions import rand
import random
import math
import time
import boto3
See https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83 for a step-by-step guide on Spark jars.
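For reference, a typical way to pull in the S3A connector jars when building the session; the package versions are illustrative and must match the installed Spark/Hadoop build.

spark = (
    SparkSession.builder
    .appName("example")
    # Fetch the hadoop-aws jar and its AWS SDK dependency at session start.
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262",
    )
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)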