@kisabaka
Created May 3, 2011 13:37
Some text processing tasks are easier to read when using continuation and ... "pipe"
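import pprint
import nltk
from pipe import Pipe, select, as_list

# Each stage is a Pipe, so stages can be chained with the `|` operator.
# Split raw text into a list of sentences.
stokenize = Pipe(nltk.sent_tokenize)
# Split every sentence into word tokens.
wtokenize = Pipe(lambda sentences: sentences | select(nltk.word_tokenize))
# Attach a part-of-speech tag to every token of every sentence.
tag = Pipe(lambda sentences: sentences | select(nltk.pos_tag))
# Named-entity chunking stage (defined here, but not used in the pipeline below).
chunk = Pipe(lambda sentences: sentences | select(nltk.ne_chunk))

text = ("A Wiki is a website which is editable over the web by it's users. "
        "This allows information to be more rapidly updated than traditional websites.")
# select() is lazy; as_list forces evaluation so that every tagged sentence gets printed.
text | stokenize | wtokenize | tag | select(pprint.pprint) | as_list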
'''The output should be something like this:
[('A', 'DT'),
 ('Wiki', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('website', 'JJ'),
 ('which', 'WDT'),
 ('is', 'VBZ'),
 ('editable', 'JJ'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('web', 'NN'),
 ('by', 'IN'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('users', 'NNS'),
 ('.', '.')]
[('This', 'DT'),
 ('allows', 'VBZ'),
 ('information', 'NN'),
 ('to', 'TO'),
 ('be', 'VB'),
 ('more', 'JJR'),
 ('rapidly', 'RB'),
 ('updated', 'VBN'),
 ('than', 'IN'),
 ('traditional', 'JJ'),
 ('websites', 'NNS'),
 ('.', '.')]
[None, None]
'''
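
For anyone wondering how `text | stokenize | wtokenize | ...` can work at all, here is a minimal sketch of the idea, assuming the chaining relies on Python's reflected-or hook `__ror__`. The `Pipe` class below is a simplified stand-in written for illustration, not the pipe library's actual source, and `split_sentences` / `split_words` are hypothetical toy stages that replace the NLTK tokenizers so the snippet runs without any dependencies.

```python
class Pipe:
    """Simplified stand-in for pipe.Pipe (illustration only, not the library's code)."""

    def __init__(self, function):
        self.function = function

    def __ror__(self, other):
        # `value | stage` lands here because str, list and generators
        # do not define `|` themselves, so Python tries the right operand.
        return self.function(other)

    def __call__(self, *args, **kwargs):
        # Calling a stage with arguments binds them and returns a new stage.
        return Pipe(lambda iterable: self.function(iterable, *args, **kwargs))


@Pipe
def select(iterable, transform):
    # Lazily apply `transform` to every item.
    return (transform(item) for item in iterable)


# Materialise a lazy iterable into a list.
as_list = Pipe(list)

# Hypothetical toy stages standing in for nltk.sent_tokenize / nltk.word_tokenize.
split_sentences = Pipe(lambda text: [s.strip() for s in text.split(".") if s.strip()])
split_words = Pipe(lambda sentences: sentences | select(str.split))

result = "A wiki is editable. It updates quickly." | split_sentences | split_words | as_list
print(result)
# [['A', 'wiki', 'is', 'editable'], ['It', 'updates', 'quickly']]
```

The point of the design is that each stage reads left to right in the order the data flows, instead of being nested inside-out as in `list(map(str.split, split(text)))`.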