Last active
November 25, 2015 14:48
# -*- coding: utf-8 -*-
import re

from natto import MeCab

# Substrings matching this pattern are passed to MeCab as boundary
# constraints, so each match should come back as a single token.
patt = re.compile(r'''(?x)
    # ticker symbols, e.g. 1099.T or 1079 JP (half- or full-width)
    [0-9\uFF10-\uFF19]{2,}(\s|\.)[A-Z\uFF21-\uFF3A]{1,2} |
    # short-form contractions
    [\'\u2019](d|ll|m|s|re|ve) |
    # short-form negatives (preserved in their entirety)
    n[\'\u2019]t |
    # hyphen- and forward-slash-delimited words
    ([A-Z]+(\-|\/))+[A-Z]+ |
    # double quotes and curly single/double quotes as standalone tokens
    [\"\u2018\u2019\u201C\u201D]
    ''', re.IGNORECASE | re.MULTILINE)

txt = """
Wouldn't you know? I've had it! This's a fine how-do-you-do, isn't it, Jor-El!
She said "That'll be the day, uh-huh, that I die."
1099.T は銘柄コードの例です。
注意:
今季はやはり 1079 JP および AAPL も気になる存在である。
P/E の観点からいうと、双方は「ばちぐぅ~!」という感じです。
GOOGおよびAAPL、または1234.JPか1099 JPも何れも買いである。
An example with fancy double quotes: \u201CThe Sun Doesn't Rise Always, Don't You know?\u201D.
An example with fancy single quotes: What an \u2018exquisite\u2019 bouquet!
"""

nm = MeCab()
for n in nm.parse(txt, boundary_constraints=patt, as_nodes=True):
    print(n.surface)
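To see which substrings the pattern would pin as tokens without installing MeCab, the regex can be run on its own. The following is a minimal sketch using only the standard `re` module; `boundary_tokens` and `sample` are hypothetical names introduced for illustration and are not part of natto-py. The apostrophe classes are written here without a `|` inside the brackets, because within a character class `|` is a literal pipe rather than alternation.

```python
import re

# The five alternatives: tickers, contractions, negatives,
# hyphen/slash-delimited words, and quotation marks.
patt = re.compile(r'''(?x)
    [0-9\uFF10-\uFF19]{2,}(\s|\.)[A-Z\uFF21-\uFF3A]{1,2} |
    [\'\u2019](d|ll|m|s|re|ve) |
    n[\'\u2019]t |
    ([A-Z]+(\-|\/))+[A-Z]+ |
    [\"\u2018\u2019\u201C\u201D]
    ''', re.IGNORECASE | re.MULTILINE)


def boundary_tokens(text):
    """Return every substring of text that the pattern matches, in order."""
    return [m.group(0) for m in patt.finditer(text)]


sample = "Wouldn't you know? It's a fine how-do-you-do, 1099.T and 1079 JP."
print(boundary_tokens(sample))
# prints ["n't", "'s", 'how-do-you-do', '1099.T', '1079 JP']
```

Everything the list does not contain is left for MeCab to segment as usual; only the matched spans are constrained.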