@rocketraman
Last active November 15, 2016 22:51
Conversion of convert-to-stanford-classifier.csh (http://nlp.stanford.edu/wiki/Software/Classifier/20_Newsgroups) to Bourne shell (sh)
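#!/bin/sh
# Convert the 20 Newsgroups "bydate" splits into one tab-separated line per
# article for the Stanford Classifier: class<TAB>article<TAB>text.

# The 20 Newsgroups documents are traditional 8-bit text, not UTF-8.
LC_ALL=en_US.ISO-8859-1
export LC_ALL

for dataset in 20news-bydate-train 20news-bydate-test; do
    output="$dataset-stanford-classifier-iso-8859-1.txt"
    rm -f "$output"
    for newsgroup in "$dataset"/*; do
        for file in "$newsgroup"/*; do
            # Paths look like dataset/newsgroup/article.
            cls=$(echo "$file" | cut -d "/" -f 2)
            article=$(echo "$file" | cut -d "/" -f 3)
            # Pass the values as arguments so that a "%" or "\" in a name
            # is not interpreted as part of the printf format.
            printf '%s\t%s\t' "$cls" "$article" >> "$output"
            # Flatten newlines, carriage returns, and tabs to spaces (one
            # replacement character per input character, since padding a
            # short second set is not portable), then drop 0xFF bytes.
            tr '\n\r\t' '   ' < "$file" | tr -d '\377' >> "$output"
            printf '\n' >> "$output"
        done
    done
    # Re-encode the finished ISO-8859-1 file as UTF-8.
    iconv -f iso-8859-1 -t utf-8 < "$output" > "$dataset-stanford-classifier.txt"
done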
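
To try the per-article step without downloading the corpus, the sketch below runs the same flattening on a single synthetic article and counts the tab-separated fields in the result. The `flatten_article` function, the `demo-train` directory, and the sample article are hypothetical and exist only for this illustration. The sketch also assumes `mktemp -d` is available and uses `LC_ALL=C` rather than `en_US.ISO-8859-1`, since the ISO-8859-1 locale may not be installed on every system.

```shell
#!/bin/sh
# Illustration only: flatten one synthetic article the way the conversion
# loop does. Names and data here are hypothetical, not part of the gist.
LC_ALL=C   # assumption: byte-oriented locale, used so tr handles raw bytes
export LC_ALL

tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/demo-train/sci.space"
# A fake article containing a CR, newlines, and a tab.
printf 'Subject: test\r\n\tindented line\nlast line\n' \
    > "$tmpdir/demo-train/sci.space/1001"

# flatten_article FILE: print "class<TAB>article<TAB>text" on a single line.
flatten_article() {
    file=$1
    cls=$(basename "$(dirname "$file")")   # parent directory is the class
    article=$(basename "$file")            # file name is the article id
    printf '%s\t%s\t' "$cls" "$article"
    tr '\n\r\t' '   ' < "$file" | tr -d '\377'
    printf '\n'
}

result=$(flatten_article "$tmpdir/demo-train/sci.space/1001")
# Count tab-separated fields; tabs inside the article were replaced by spaces.
fields=$(printf '%s\n' "$result" | awk -F '\t' '{ print NF }')
echo "$fields"

rm -rf "$tmpdir"
```

The field count should be 3, because every tab inside the article body has been turned into a space and only the two separators written by `printf` remain. Unlike the script above, the sketch takes the class and article name with `basename` and `dirname` instead of `cut`, so it also works when the dataset sits under a longer path.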