@language-engineering
Created October 23, 2012 12:31
from corpus_readers import AmazonReviewCorpusReader
arcr = AmazonReviewCorpusReader()
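# Format the positive and negative reviews of each category (dvd, book, kitchen,
# electronics) ready for the Naive Bayes (NB) classifier.
# Note: format_data and split_data are not defined in this snippet; they are
# assumed to have been defined earlier in the exercise.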
dvd_pos_formatted = format_data(arcr.positive().category("dvd").documents(),"pos")
dvd_neg_formatted = format_data(arcr.negative().category("dvd").documents(),"neg")
dvd_formatted = dvd_pos_formatted + dvd_neg_formatted
book_pos_formatted = format_data(arcr.positive().category("book").documents(),"pos")
book_neg_formatted = format_data(arcr.negative().category("book").documents(),"neg")
book_formatted = book_pos_formatted + book_neg_formatted
kitchen_pos_formatted = format_data(arcr.positive().category("kitchen").documents(),"pos")
kitchen_neg_formatted = format_data(arcr.negative().category("kitchen").documents(),"neg")
kitchen_formatted = kitchen_pos_formatted + kitchen_neg_formatted
electronics_pos_formatted = format_data(arcr.positive().category("electronics").documents(),"pos")
electronics_neg_formatted = format_data(arcr.negative().category("electronics").documents(),"neg")
electronics_formatted = electronics_pos_formatted + electronics_neg_formatted
# Split the formatted electronics data into a training and a test portion, to avoid testing on training data
train_electronics_formatted, test_electronics_formatted = split_data(electronics_formatted)
# Three alternative combinations of source (training) and target (test) domain
source_1 = train_electronics_formatted
target_1 = test_electronics_formatted
source_2 = dvd_formatted + book_formatted + kitchen_formatted
target_2 = test_electronics_formatted
source_3 = dvd_formatted + book_formatted + kitchen_formatted + train_electronics_formatted
target_3 = test_electronics_formatted
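# Notice that source_1 and source_3 contain electronics reviews, and so does every target.
# This is why the electronics data had to be split: the electronics reviews used for
# training must not overlap with those used for testing.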
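# The script above calls format_data and split_data without defining them.
# What follows is a minimal sketch of what such helpers might look like, not
# the original implementations. Assumptions: each document is an iterable of
# token strings, the classifier expects a list of (feature dict, label) pairs,
# and the ratio and seed parameters are hypothetical additions for illustration.

```python
import random


def format_data(documents, label):
    """Hypothetical helper: turn each document into a (features, label) pair.

    Assumes each document is an iterable of token strings. Each token becomes
    a boolean "present" feature. A list is returned so that results can be
    concatenated with +, as the script does.
    """
    return [({token: True for token in doc}, label) for doc in documents]


def split_data(data, ratio=0.8, seed=0):
    """Hypothetical helper: shuffle a copy of data and split it in two.

    Returns (train, test), with roughly ratio of the items in train. The
    default ratio and the fixed seed are assumptions, chosen so that the
    split is reproducible.
    """
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]
```

# Shuffling before the split matters when the input is a concatenation such as
# electronics_pos_formatted + electronics_neg_formatted: without it, the test
# portion would consist mostly or entirely of "neg" examples.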