Hyunjoong Kim lovit

@lovit
lovit / huggingface_konlpy_usage.md
Created August 27, 2020 22:29
Huggingface tokenizers / transformers + KoNLPy.md
import huggingface_konlpy

KoNLPy as pre-tokenizer

@lovit
lovit / huggingface_tokenizers_usage.md
Created August 27, 2020 22:28
Hugging Face tokenizers usage
import tokenizers
tokenizers.__version__
@lovit
lovit / huggingface_konlpy.md
Last active November 20, 2024 18:00
huggingface + KoNLPy

Huggingface

  • Provides a variety of NLP packages; three of them are particularly useful for training language models.

| package | note |
| --- | --- |
| transformers | Provides Transformer-based (masked) language model algorithms and pretrained models. |
| tokenizers | Provides functionality for training and using tokenizers that work with transformers. Distributed as a package separate from transformers. |
| nlp | Provides datasets and evaluation metrics. |
@lovit
lovit / umap_supervised_embedding.md
Created December 10, 2019 14:27
UMAP supervised embedding example
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=50000, n_features=200, n_informative=5, 
    n_redundant=0, n_clusters_per_class=10, weights=[0.80],
    flip_y=0.05, class_sep=3.5, random_state=42
)

# standard normalization: (x - mean) / std
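The preview ends at this comment, so the gist's own normalization code is not shown here. As a minimal sketch of the formula `(x - mean) / std` applied per feature column, assuming only the Python standard library (the helper name `standardize_columns` is hypothetical and not from the gist, which may well use NumPy or scikit-learn instead):

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Rescale every column to zero mean and unit variance: (x - mean) / std."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation; a constant column has std 0,
    # so fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]


# Tiny illustrative input (not the gist's data): 3 samples, 2 features.
rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
scaled = standardize_columns(rows)
```

After scaling, the first column has mean 0 and standard deviation 1, while the constant second column maps to all zeros.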
@lovit
lovit / scraped_news_corpus_statistics.md
Created September 16, 2019 07:38
scraped news corpus statistics

Corpus summary

  • begin date = 2014-01-01
  • end date = 2019-08-16
  • num docs = 51016505
  • num sents = 413471083

Yearly summary

| year | begin date | end date | num docs | num sents |
| --- | --- | --- | --- | --- |
@lovit
lovit / soynlp_noun_tokenizer_usage.md
Created April 30, 2019 05:32
soynlp Noun Tokenizer usage

In the current version (0.0.491), the code has not been cleaned up yet, so the argument names of the init function may change.

This tutorial is an example that uses the dataset from github.com/lovit/textmining-dataset.

import soynlp
from soynlp.utils import DoublespaceLineCorpus
from soynlp.noun import LRNounExtractor_v2
from lovit_textmining_dataset.navernews_10days import get_news_paths
@lovit
lovit / bokeh_show_image.py
Last active April 9, 2019 17:24
Bokeh (1.0.4) image show example
# replace matplotlib.pyplot.imshow(img)
import numpy as np
from bokeh.plotting import figure, show, output_notebook
output_notebook()
N = 500
x = np.linspace(0, 10, N)

tmux cheatsheet

As configured in my dotfiles.

start new:

tmux

start new with session name: