@eiichi-worker
Last active October 9, 2017 16:40

[Blogged] A Hive query that excludes seemingly unnecessary strings when doing morphological analysis

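SELECT *, word
FROM {table}
LATERAL VIEW explode(tokenize_ja({text_column})) t AS word
WHERE 1=1
AND word not rlike '^[a-zA-Z0-9]{1}$' -- exclude: a single ASCII letter or digit
AND word not rlike "^[!-9@_]*$" -- exclude: ASCII digits and symbols only (the range ! to 9, plus @ and _)
AND word not rlike "^[〇一二三四五六七八九]*$" -- exclude: kanji numerals only
AND word not rlike "^[\u3041-\u3096\u30A1-\u30FA]{1}$" -- exclude: a single hiragana or katakana character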
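
To see what each filter removes, the same conditions can be tried on a handful of hand-written tokens. This is only a minimal sketch: the inline `array(...)` of sample tokens and the alias `s` are assumptions made for illustration and stand in for the output of `tokenize_ja`. It also assumes a Hive version that accepts a `SELECT` without a `FROM` clause in a subquery.

```sql
-- Hypothetical sanity check: sample tokens are supplied inline,
-- so neither tokenize_ja nor a real table is needed.
SELECT word
FROM (
  SELECT array('a', '7', '2017', '@_', '三', '二〇一七',
               'あ', 'ア', 'Hive', '形態素', '解析', 'クエリ') AS words
) s
LATERAL VIEW explode(s.words) t AS word
WHERE 1=1
AND word not rlike '^[a-zA-Z0-9]{1}$'  -- drops 'a' and '7'
AND word not rlike "^[!-9@_]*$"        -- drops '2017' and '@_'
AND word not rlike "^[〇一二三四五六七八九]*$"  -- drops '三' and '二〇一七'
AND word not rlike "^[\u3041-\u3096\u30A1-\u30FA]{1}$"  -- drops 'あ' and 'ア'
-- Tokens that should remain: Hive, 形態素, 解析, クエリ
```

Because the second and third patterns use `*` rather than `+`, they also match an empty string, so empty tokens would be filtered out as well.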

References