woodong woodongk

🎯
Focusing
  • Samsung Electronics, Samsung Research
  • Seoul, Korea
@woodongk
woodongk / text_preprocessing.py
Last active July 10, 2022 07:35
Korean-Text-Preprocessing in Python
import re
from konlpy.tag import Mecab
from khaiii import KhaiiiApi


def remove_brackets(string, left_paren_type, right_paren_type):
    '''Remove brackets (parentheses) and their contents from a string.

    Args:
        left_paren_type: opening bracket, e.g. '[' or '('
        right_paren_type: closing bracket, e.g. ']' or ')'
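The preview above cuts off before the function body. As a minimal sketch of the behaviour the docstring describes, assuming single-character delimiters and no nested brackets, the removal could be done with one `re.sub` call. The name `remove_brackets_sketch` is hypothetical, and this is not the gist's actual implementation.

```python
import re


def remove_brackets_sketch(string, left_paren_type, right_paren_type):
    """Remove every bracketed span, delimiters included (hypothetical sketch).

    Assumes single-character delimiters and no nesting.
    """
    left = re.escape(left_paren_type)
    right = re.escape(right_paren_type)
    # Opening delimiter, then any run of characters that are not the
    # closing delimiter, then the closing delimiter itself.
    pattern = f"{left}[^{right}]*{right}"
    return re.sub(pattern, "", string)


cleaned = remove_brackets_sketch("a(b)c", "(", ")")
print(cleaned)  # prints "ac"
```

Surrounding whitespace is left untouched, so a caller that wants tidy output may still need to collapse repeated spaces afterwards.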
woodongk / get_outlier.py
Last active April 4, 2020 13:47
Detecting outliers in a DataFrame using the IQR
# Source: "파이썬을 이용한 머신러닝, 딥러닝 실전 개발 입문" (Introduction to Practical Machine Learning and Deep Learning Development with Python)
import numpy as np


def get_outlier(df=None, column=None, weight=1.5):
    '''Take a DataFrame and the name of the column to check for outliers.

    Multiply the IQR by `weight` (1.5 by default), use the result to find
    outliers, and return the index of the rows that contain them.
    '''
    column_x = df[column]
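The rest of the function is not shown in the preview. The IQR rule the docstring describes can be sketched as follows. The name `get_outlier_sketch` is hypothetical, and the sketch assumes the column holds numeric values. It returns positional indices from NumPy rather than a pandas index, so it may differ from what the original returns.

```python
import numpy as np


def get_outlier_sketch(df, column, weight=1.5):
    """Return positions of values outside the IQR fences (hypothetical sketch).

    `df` can be anything indexable by column name, such as a dict of lists
    or a pandas DataFrame.
    """
    values = np.asarray(df[column], dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    iqr_weight = (q75 - q25) * weight
    lowest = q25 - iqr_weight    # lower fence
    highest = q75 + iqr_weight   # upper fence
    return np.where((values < lowest) | (values > highest))[0]


data = {"amount": [1, 2, 3, 4, 100]}
outlier_positions = get_outlier_sketch(data, "amount")
print(outlier_positions)  # prints [4]
```

Here the quartiles are 2 and 4, so the fences sit at -1 and 7 and only the value 100 falls outside them.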
woodongk / crawling_naver_news_comments.py
Last active April 6, 2020 23:28
Scraping comments from Naver News
# Source: https://wikidocs.net/61221
from selenium import webdriver
import time


def get_comments(URL, imp_time=5, delay_time=0.1):
    # Web driver
    driver = webdriver.Chrome('/usr/local/bin/chromedriver')  # chromedriver
    driver.implicitly_wait(imp_time)
    driver.get(URL)
woodongk / word_cloud.py
Last active May 26, 2020 04:46
Making a word cloud
import numpy as np


def generate_circular_wordcloud(strings):
    """Return a circle-shaped word cloud.

    Example:
        strings (str): "기억 니은 디귿 기억 기억"
        strings (dict): {"기억": 30, "니은": 10, "디귿": 1}
    """
    # Circular mask
    x, y = np.ogrid[:1000, :1000]
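The preview stops right after the grid is created. One common way to turn such a grid into a circular mask is sketched below. The helper name `circular_mask`, the `radius_ratio` parameter, and the choice of center and radius are all assumptions made for illustration, not values taken from the gist.

```python
import numpy as np


def circular_mask(size=1000, radius_ratio=0.45):
    """Build a square mask: 0 inside a centered circle, 255 outside.

    Hypothetical helper; center and radius are illustrative choices.
    """
    center = size // 2
    radius = int(size * radius_ratio)
    # ogrid yields a column vector and a row vector that broadcast
    # into a full size-by-size grid of coordinates.
    x, y = np.ogrid[:size, :size]
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    return 255 * outside.astype(int)


mask = circular_mask()
print(mask.shape)
```

An array like this can be passed as the `mask` argument of `wordcloud.WordCloud`, which treats pure-white (255) entries as masked out and draws words only in the remaining area.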
woodongk / markdown.md
Last active May 14, 2020 08:10 — forked from ihoneymon/how-to-write-by-markdown.md
How to use Markdown

[Common] How to write Markdown

1. About Markdown

1.1. What is Markdown?

Markdown is a text-based markup language created by John Gruber in 2004. It is easy to read and write, and it can be converted to HTML. Its syntax is very simple, built from special symbols and plain characters, so content can be written quickly on the web and understood at a glance. Markdown recently came into the spotlight thanks to GitHub (https://github.com). README.md, the file that records information about a GitHub repository, was the first Markdown document that anyone using GitHub encountered. As its strengths became apparent (installation instructions, source code explanations, issues, and similar notes can be recorded simply and remain readable), Markdown gradually spread to many other places.

1.2. Pros and cons of Markdown

1.2.1. Pros

woodongk / count_ngram.py
Created May 26, 2020 04:48
Corpus n-gram counter
from collections import Counter
from itertools import chain


def ngram_count(docs_tokenized, n, n_display=50):
    '''
    Args:
        docs_tokenized: 2D list of tokens (one list per document)
            e.g. [['문재인', '원전', '국민', '혈세', '물어내', '문재인', '대통령', '물어내'],
                  ['전쟁', '제일', '먼저', '아가리', '대통령', '특수', '부대', '실미'],
        n: which n-gram to count, e.g. unigram: 1, bigram: 2
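The function body is cut off in the preview. A minimal sketch of counting n-grams over a tokenized corpus with `Counter` and `chain` is given below. The name `ngram_count_sketch` is hypothetical, and reading `n_display` as "how many of the most frequent n-grams to return" is an assumption, since the preview does not document it.

```python
from collections import Counter
from itertools import chain


def ngram_count_sketch(docs_tokenized, n, n_display=50):
    """Count n-grams across all documents (hypothetical sketch).

    Returns the `n_display` most frequent n-grams as (ngram, count) pairs.
    """
    def ngrams(tokens):
        # Zip n shifted views of the token list to form n-tuples;
        # n-grams never span two documents.
        return zip(*(tokens[i:] for i in range(n)))

    counts = Counter(chain.from_iterable(ngrams(doc) for doc in docs_tokenized))
    return counts.most_common(n_display)


docs = [["a", "b", "a", "b"], ["b", "a"]]
top_bigrams = ngram_count_sketch(docs, n=2)
print(top_bigrams)
```

Each n-gram is represented as a tuple of tokens, so unigrams come back as 1-tuples such as `("a",)`.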
woodongk / I'm a night 🦉
Last active October 29, 2020 00:07
I'm a night 🦉
🌞 Morning 33 commits █▎░░░░░░░░░░░░░░░░░░░ 6.4%
🌆 Daytime 165 commits ██████▋░░░░░░░░░░░░░░ 32.1%
🌃 Evening 180 commits ███████▎░░░░░░░░░░░░░ 35.0%
🌙 Night 136 commits █████▌░░░░░░░░░░░░░░░ 26.5%