@SinclairCoder
Created January 3, 2025 04:02
cc_requirements
html5lib
bs4
html2text
libzim
pyahocorasick
datatrove
xxhash
trafilatura
resiliparse
fasttext
prettytable
fasteners
tldextract
faust-cchardet
python-magic
warcio
numpy==1.26.3 # pinned to avoid a ValueError raised from fasttext
tokenizers
nltk
langdetect
pyarrow
psutil
# Pre-commit requirements
pre-commit==3.7.1
black==23.3.0
isort==5.13.2
orjson
spacy
pandas
jieba
tiktoken
# After installing, download the NLTK data (run in Python):
# import nltk
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# nltk.download('punkt_tab')
# Then download the spaCy models (run in a shell):
# python -m spacy download zh_core_web_sm
# python -m spacy download en_core_web_sm
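
The commented-out steps above could be collected into one post-install script. This is a minimal sketch, not part of the original gist: the names `NLTK_RESOURCES`, `SPACY_MODELS`, `spacy_download_command`, and `download_all` are hypothetical, and it assumes `nltk` and `spacy` are already installed from this requirements file and that network access is available.

```python
import subprocess
import sys

# Resource and model names are copied from the comments in the requirements file.
NLTK_RESOURCES = [
    "punkt",
    "averaged_perceptron_tagger",
    "wordnet",
    "omw-1.4",
    "punkt_tab",
]
SPACY_MODELS = ["zh_core_web_sm", "en_core_web_sm"]


def spacy_download_command(model):
    """Build the `python -m spacy download <model>` command for the current interpreter."""
    return [sys.executable, "-m", "spacy", "download", model]


def download_all():
    """Download every NLTK resource and spaCy model listed above."""
    # Imported lazily so this module can be loaded before nltk is installed.
    import nltk

    for resource in NLTK_RESOURCES:
        nltk.download(resource)
    for model in SPACY_MODELS:
        # check=True raises CalledProcessError if a download fails.
        subprocess.run(spacy_download_command(model), check=True)
```

Calling `download_all()` once after `pip install -r` on this file would run all seven downloads. Using `sys.executable` keeps the spaCy models in the same environment as the installed packages.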