def getFeature(node):
    # Take the first comma-separated field of the MeCab feature string
    unicodedFeature = unicode(node.feature.split(",")[0], "utf-8")
    return unicodedFeature

#node = tagger.parseToNode(sentence)
def wakati(self, sentence, tagger=None):
    """
    Split a sentence into words and return it with ", " inserted between the words.
    @param sentence: the sentence to split (assumed to be utf-8)
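Flow:
1. Cut in the simplest way, looking only at the characters on either side (hiragana, kanji, symbols, etc.)
    -> hiragana::kanji
    -> symbol::non-symbol
    -> hiragana::digit
2. Use mecab and look at the part of speech of the preceding token
    -> * :: particle
3. Use CaboCha (?) to split by phrase structure
4. Apply machine learning
    -> naive bayes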
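
Step 1 of the flow can be sketched roughly as below. This is only an illustration, not the code from the gist: `char_type` and `split_by_char_type` are hypothetical names, the character classes come from the standard `unicodedata` module, and each `A::B` pair is assumed to mean "cut when a character of type A is followed by one of type B".

```python
import unicodedata


def char_type(ch):
    """Coarsely classify one character (hypothetical helper)."""
    if ch.isdigit():
        return "digit"
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    # Unicode general categories P* (punctuation) and S* (symbols)
    if unicodedata.category(ch)[0] in ("P", "S"):
        return "symbol"
    return "other"


def split_by_char_type(text):
    """Cut only where the two neighbouring character types call for it."""
    if not text:
        return []
    chunks = [text[0]]
    for prev, cur in zip(text, text[1:]):
        p, c = char_type(prev), char_type(cur)
        cut = (
            (p == "hiragana" and c in ("kanji", "digit"))  # hiragana::kanji, hiragana::digit
            or (p == "symbol" and c != "symbol")            # symbol::non-symbol
        )
        if cut:
            chunks.append(cur)
        else:
            chunks[-1] += cur
    return chunks


print(" | ".join(split_by_char_type("あいつは1424年に死ぬ。")))
```

Steps 2 to 4 need a morphological analyzer or a trained model, so they are left out of this sketch.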
# -*- coding: utf-8 -*-
import sys

def split(text):
    """
    Input:
        text = "aaabbbccc"
    Output:
A new git branch is cut from whichever branch is currently checked out, so always create branches from development.
If development does not exist locally:
```
git checkout -b development origin/development
```
This creates development locally. The -b option creates the branch and checks it out in one step.
When you want to temporarily set aside changes that you have already added:
u"「あいつは1424年に死ぬだろう。」ただしそれはあいつの責任じゃないことは誰もが知っていた。そう、1mmの虫でさえ。。。"
Current naive title splitter =>
「あいつは | 1424年に | 死ぬだろう。 | 」ただしそれはあいつの | 責任じゃないことは | 誰もが | 知っていた。 | そう、 | 1mmの | 虫でさえ。。
* Things to fix
1. Should cut between ただし and それは
2. Should not cut inside a run of consecutive symbols
def get_break_points(self, sentence):
    if len(sentence) <= 0:
        return []
    else:
        #encoded_sentence = sentence.encode('utf-8')
        break_points = []
~ Join ~
1. * + particle
2. adjective + noun
3. * + auxiliary verb
So the cut positions are:
1. after a particle
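
The three join rules above could be applied to an already tagged token list along these lines. This is a sketch under assumptions: `join_tokens` is a hypothetical name, the input is a list of `(surface, pos)` pairs, and the English part-of-speech labels are placeholders (real MeCab output uses Japanese labels such as 助詞).

```python
def join_tokens(tokens):
    """Glue tokens into chunks using the three join rules.

    tokens: list of (surface, pos) pairs; pos labels are placeholders.
    """
    chunks = []
    prev_pos = None
    for surface, pos in tokens:
        attach = bool(chunks) and (
            pos in ("particle", "auxiliary_verb")            # rules 1 and 3: * + particle / auxiliary verb
            or (prev_pos == "adjective" and pos == "noun")   # rule 2: adjective + noun
        )
        if attach:
            chunks[-1] += surface
        else:
            chunks.append(surface)
        prev_pos = pos
    return chunks


tokens = [
    ("赤い", "adjective"),
    ("花", "noun"),
    ("が", "particle"),
    ("咲い", "verb"),
    ("た", "auxiliary_verb"),
]
print(join_tokens(tokens))
```

Anything that is not joined starts a new chunk, so a cut falls after a particle whenever the next token is not itself joined by one of the rules.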
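Must: finish the line-break feature
Approach: morphological analysis + character type => a few specific strong rules; in other words, run several passes.
Issues: 1) A symbol + symbol sequence is sometimes treated as a single noun.
           => 1. Cut finely for now and fix it up afterwards
              2. Do not cut at risky places; cut them later with a separate rule
        2) Tighten the priority order of the rules and consolidate patterns that behave the same way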
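separate stage:
1. Cut after a particle or an adverb
2. Cut after a symbol
3. Cut when the chunk gets long
regulate stage:
1. If a cut falls before an opening bracket, join the pieces back together
2. Also join runs of consecutive symbols and runs of consecutive particles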
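
A possible shape for the two stages is sketched below. It is not the gist's implementation: `separate`, `regulate`, and `render` are hypothetical names, the part-of-speech labels are English placeholders, and `max_len=10` is an arbitrary threshold for "long". The opening-bracket rule is left out because the note does not say which side the bracket should end up on.

```python
CUT_AFTER = {"particle", "adverb", "symbol"}   # separate rules 1 and 2
MERGE_RUNS = {"particle", "symbol"}            # regulate rule 2


def separate(tokens, max_len=10):
    """First pass: cut after particles, adverbs, symbols, or when a chunk gets long."""
    chunks, current, length = [], [], 0
    for surface, pos in tokens:
        current.append((surface, pos))
        length += len(surface)
        if pos in CUT_AFTER or length >= max_len:  # max_len is an assumed threshold
            chunks.append(current)
            current, length = [], 0
    if current:
        chunks.append(current)
    return chunks


def regulate(chunks):
    """Second pass: undo cuts that fall inside a run of symbols or of particles."""
    merged = []
    for chunk in chunks:
        if merged:
            left_pos = merged[-1][-1][1]   # POS of the last token before the cut
            right_pos = chunk[0][1]        # POS of the first token after the cut
            if left_pos == right_pos and left_pos in MERGE_RUNS:
                merged[-1] = merged[-1] + chunk
                continue
        merged.append(list(chunk))
    return merged


def render(chunks):
    """Turn chunks of (surface, pos) pairs back into strings."""
    return ["".join(surface for surface, _ in chunk) for chunk in chunks]


tokens = [
    ("それ", "noun"), ("に", "particle"), ("は", "particle"),
    ("驚い", "verb"), ("た", "auxiliary_verb"),
    ("。", "symbol"), ("。", "symbol"),
]
print(render(regulate(separate(tokens))))
```

Keeping the cut pass and the repair pass separate matches the "cut finely first, fix afterwards" idea: each pass stays a short list of rules, and their order makes the priority explicit.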
def json_articles(self):
    """
    For checking behavior on multiple inputs
    """
    f = open("sample_title.txt", "r")
    lines = f.readlines()
    cleaner = title_cleaner.TitleCleaner()
    articles = []