##Text classification assigns a label to a document; a common case is
##binary classification, e.g. spam vs. not spam.
import nltk
import random
from nltk.corpus import movie_reviews
import pickle

## Create a list of (features, label) tuples
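A minimal sketch of building that list, assuming nltk is installed and the movie_reviews corpus has been fetched with nltk.download("movie_reviews"); the find_features helper is illustrative, not part of the original:

```python
import random
from nltk.corpus import movie_reviews

def find_features(document_words, word_features):
    # A feature dict maps each candidate word to whether it
    # appears in the document.
    words = set(document_words)
    return {w: (w in words) for w in word_features}

# Pure-Python demonstration of the feature shape:
print(find_features(["great", "movie"], ["great", "awful"]))
# {'great': True, 'awful': False}

# Building the (words, label) tuples requires the corpus data.
try:
    documents = [(list(movie_reviews.words(fid)), cat)
                 for cat in movie_reviews.categories()
                 for fid in movie_reviews.fileids(cat)]
    random.shuffle(documents)
    print(len(documents))  # 2000 reviews, labelled 'pos'/'neg'
except LookupError:
    print("movie_reviews corpus not downloaded; run nltk.download() first")
```

Each tuple pairs a review's word list with its label, ready for a classifier such as nltk.NaiveBayesClassifier.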
from nltk.corpus import wordnet

syns = wordnet.synsets("program")
print(syns[0].name())
#plan.n.01
print(syns[0].lemmas()[0].name())
#plan
print(syns[0].definition())
#a series of steps to be carried out or goals to be accomplished
print(syns[0].examples())
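Beyond inspecting a single synset, the lemmas can be walked to collect synonyms and antonyms. A sketch, assuming the wordnet corpus has been downloaded via nltk.download("wordnet"):

```python
from nltk.corpus import wordnet

# Collect synonyms and antonyms for "good" across all of its synsets.
synonyms, antonyms = set(), set()
try:
    for syn in wordnet.synsets("good"):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
            # Antonyms are attached to lemmas, not to synsets.
            for ant in lemma.antonyms():
                antonyms.add(ant.name())
    print(sorted(synonyms)[:5])
    print(sorted(antonyms))
except LookupError:
    print("wordnet corpus not downloaded; run nltk.download() first")
```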
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
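The tokenized sentences are typically fed to nltk.pos_tag next. A sketch under the assumption that the punkt and averaged_perceptron_tagger resources have been downloaded; the tag_sentences helper is hypothetical, not part of the original:

```python
import nltk

def tag_sentences(sentences):
    # Tokenize each sentence into words, then tag each word with
    # a Penn Treebank part-of-speech tag.
    return [nltk.pos_tag(nltk.word_tokenize(s)) for s in sentences]

try:
    tagged = tag_sentences(["The quick brown fox jumps over the lazy dog."])
    print(tagged[0])  # list of (word, tag) pairs
except LookupError:
    print("tagger/tokenizer resources not downloaded; run nltk.download() first")
```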
##POS tagging labels each word in a sentence as a noun, adjective, verb, etc.
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
##PunktSentenceTokenizer is an unsupervised, trainable sentence tokenizer:
##it learns sentence boundaries from raw text, so you can train it on
##any body of text you like.
##Creating the training and testing data
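The unsupervised training step can be seen in isolation: Punkt learns abbreviation and boundary statistics from raw text, so a plain string is enough to train it. A small sketch (a corpus this tiny gives unreliable statistics and is for illustration only; no NLTK data downloads are needed here):

```python
from nltk.tokenize import PunktSentenceTokenizer

# Train Punkt on a tiny raw-text sample so it can learn, e.g.,
# that "Mr." and "Dr." do not end sentences.
train_text = ("Mr. Smith went to Washington. He spoke to Dr. Jones. "
              "They discussed the U.S. economy at length.")
tokenizer = PunktSentenceTokenizer(train_text)

# Apply the trained tokenizer to unseen text.
print(tokenizer.tokenize("Dr. Jones agreed. The talks went well."))
```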
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
##Tokenizing - splitting a body of text into sentences and words.
##Part-of-speech tagging
##Corpus - a body of text, singular; corpora is the plural.
##Example: a collection of medical journals.
##Lexicon - words and their meanings.
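These imports are typically combined to strip stopwords from tokenized text. A sketch, assuming the stopwords and punkt resources have been downloaded via nltk.download(); the filter_stopwords helper is illustrative, not part of the original:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def filter_stopwords(words, stop_words):
    # Keep only tokens not in the stopword set (case-insensitive).
    return [w for w in words if w.lower() not in stop_words]

# Pure-Python demonstration with a hand-made stopword set:
print(filter_stopwords(["This", "is", "a", "sample"], {"this", "is", "a"}))
# ['sample']

# With NLTK's English stopword list (requires downloaded data):
try:
    stop_words = set(stopwords.words("english"))
    sentence = "This is a sample sentence, showing off stop word filtration."
    print(filter_stopwords(word_tokenize(sentence), stop_words))
except LookupError:
    print("NLTK data not downloaded; run nltk.download() first")
```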