@frendhisaido
Created July 24, 2012 14:58
TF-IDF
2012-04-02T06:52:32Z||oprator berpengalaman telkomsel selain excel indosat saya mmbutuhkan operator marketing yang ckp handal berpengalaman
2012-04-02T07:12:42Z||rt pakai indosat internet broom bisa internetan cepat tanpa putus didukung dengan
2012-04-02T07:00:12Z||pakai indosat internet broom bisa internetan cepat tanpa putus didukung dengan jaringan 5g dijamin internetan wusshhhhh
2012-04-02T07:03:31Z||edit jalur akses internet indosat gunakan proxy ip add 195 189 142 132 port ip 80 yang lain biarkan seperti aslinya
2012-04-02T06:56:49Z||haha <makian> oprator berpengalaman telkomsel selain indosat bth operator marketing yang pengalaman
2012-04-02T07:22:10Z||rt pakai indosat internet broom bisa internetan cepat tanpa putus didukung dengan jaringan 5g dijamin internetan wusshhhhh
2012-04-02T07:32:05Z||di atmajaya abisnya mirip sangat aula indosat gambarnya tadi haha
2012-04-02T07:28:44Z||saya cinta karo indosat mergo terpaksa
2012-04-02T09:43:10Z||pan sarua indosat mnh haha dibawain ngan peje hela ngke hayu wk
2012-04-02T11:24:16Z||euw indosat should fix their bad connection
2012-04-02T12:57:58Z||gadeliv deliv acan eleuh eleuh indosat tahun meni geleuh
2012-04-02T12:54:53Z||rt indosat good cute ads reasonable price but can even pakai single call so hope they have fire insurance
2012-04-02T12:52:18Z||indosat good cute ads reasonable price but can even pakai single call so hope they have fire insurance
2012-04-02T15:07:35Z||2rts ktupat aya bsi ek aya 550 e63 hde kneh brow msi hp indosat wii hyong symbian pguh hp masuk kneh n0
2012-04-02T15:16:57Z||senyumlicik penghianat yaps beralih ke indosat maybe it better than
2012-04-02T15:51:08Z||giliran lancar teman teman saya pada tidur terimakasih indosat
2012-04-02T16:18:37Z||disappointing with indosat internet connection slow it has been like this week
2012-04-02T08:01:02Z||pc laptop handphone barang impor operator selular indosat xl telkomsel milik asing qatar singapur malaysia
2012-04-02T12:58:14Z||pakai sarung tangan ngerakit kabel22 pasang petasan otw gedung indosat kedipin mata 2kali bom duarr tetap gdlv3
2012-04-02T08:11:36Z||rt pakai indosat internet broom bisa internetan cepat tanpa putus didukung dengan jaringan 5g
2012-04-02T12:54:27Z||they selling unlimited that really limit our call hello where have ylki they still alive indosat good cute
2012-04-02T13:04:54Z||indosat good cute ads reasonable price but can even pakai single call so hope they have fire insurance
2012-04-02T14:41:12Z||reservation southeast asia official phone number 62 856 2121 666 indosat official blackberry
2012-04-02T12:49:19Z||indosat good cute ads reasonable price but can even pakai single call so hope they have fire insurance
2012-04-02T15:53:15Z||sama2 giliran lancar teman teman saya pada tidur terimakasih indosat
2012-04-02T13:00:46Z||ah jan sikak oq gra2 lali pngaturane njuk seg indosat ra keno gawe bka fb
2012-04-02T13:27:06Z||perbulannya berapa ini min pakai indosat internet broom bisa internetan cepat tanpa putus didukung dengan jaringan 5g
2012-04-02T14:09:59Z||guess indosat android worst combination former were slow til now latter eating enormous bytes
2012-04-02T15:15:07Z||penghianat yaps beralih ke indosat maybe it better than
2012-04-02T15:22:11Z||adele old friends why so shy me indosat why so bad
2012-04-02T11:25:13Z||walaupun hujan deras gini sinyal 3g indosat dirumah saya tetap kuat
2012-04-02T12:54:57Z||iklan indosat eneg tiru2 genkisudo huek
2012-04-02T14:11:55Z||tetap saja indosat abaaaaal haha
2012-04-02T13:53:53Z||rt apa definisi sukses menurut teman teman pakai indosat mobile
2012-04-02T14:09:17Z||haha tidak-ada kerjaan waktu ngerjain operator indosat
2012-04-02T17:22:28Z||dang saat mau buka koran ternyata ada iklan indosat haha suka shock begitu saya
2012-04-02T16:44:54Z||euweuh ka urg nte geus diaktifkeun can rhie zoel zul aya telepon ti indosat jang ngaaktfkeun kartu prabayar tea
2012-04-02T16:36:48Z||people who complain about indosat services like who complain about getting aids whore they knew had aids
2012-04-02T10:30:52Z||asik puas internetan pakai indosat internet broom gas pool ngebuut
oprator=2.9444389791664403; yang=2.5649493574615367; saya=1.791759469228055; marketing=2.9444389791664403; selain=2.9444389791664403; telkomsel=2.5649493574615367; berpengalaman=5.8888779583328805; operator=2.1972245773362196; cepat=1.9459101490553132; broom=1.791759469228055; dengan=1.9459101490553132; didukung=1.9459101490553132; rt=1.9459101490553132; bisa=1.9459101490553132; pakai=1.0986122886681098; internetan=1.791759469228055; internet=1.3862943611198906; tanpa=1.9459101490553132; putus=1.9459101490553132; broom=1.791759469228055; dengan=1.9459101490553132; didukung=1.9459101490553132; wusshhhhh=2.9444389791664403; internetan=3.58351893845611; cepat=1.9459101490553132; 5g=2.1972245773362196; dijamin=2.9444389791664403; bisa=1.9459101490553132; pakai=1.0986122886681098; internet=1.3862943611198906; jaringan=2.1972245773362196; tanpa=1.9459101490553132; putus=1.9459101490553132; yang=2.5649493574615367; internet=1.3862943611198906; oprator=2.9444389791664403; yang=2.5649493574615367; marketing=2.9444389791664403; selain=2.9444389791664403; haha=1.791759469228055; telkomsel=2.5649493574615367; berpengalaman=2.9444389791664403; operator=2.1972245773362196; broom=1.791759469228055; dengan=1.9459101490553132; didukung=1.9459101490553132; wusshhhhh=2.9444389791664403; internetan=3.58351893845611; cepat=1.9459101490553132; 5g=2.1972245773362196; dijamin=2.9444389791664403; rt=1.9459101490553132; bisa=1.9459101490553132; pakai=1.0986122886681098; internet=1.3862943611198906; jaringan=2.1972245773362196; putus=1.9459101490553132; tanpa=1.9459101490553132; haha=1.791759469228055; saya=1.791759469228055; haha=1.791759469228055; connection=2.9444389791664403; bad=2.9444389791664403; insurance=2.1972245773362196; they=1.791759469228055; call=1.9459101490553132; but=2.1972245773362196; single=2.1972245773362196; can=1.9459101490553132; have=1.9459101490553132; so=1.9459101490553132; good=1.9459101490553132; cute=1.9459101490553132; fire=2.1972245773362196; 
reasonable=2.1972245773362196; price=2.1972245773362196; even=2.1972245773362196; rt=1.9459101490553132; pakai=1.0986122886681098; ads=2.1972245773362196; hope=2.1972245773362196; insurance=2.1972245773362196; they=1.791759469228055; call=1.9459101490553132; but=2.1972245773362196; single=2.1972245773362196; can=1.9459101490553132; have=1.9459101490553132; so=1.9459101490553132; good=1.9459101490553132; cute=1.9459101490553132; fire=2.1972245773362196; reasonable=2.1972245773362196; price=2.1972245773362196; even=2.1972245773362196; pakai=1.0986122886681098; ads=2.1972245773362196; hope=2.1972245773362196; aya=5.8888779583328805; penghianat=2.9444389791664403; it=2.5649493574615367; maybe=2.9444389791664403; ke=2.9444389791664403; yaps=2.9444389791664403; better=2.9444389791664403; beralih=2.9444389791664403; than=2.9444389791664403; lancar=2.9444389791664403; saya=1.791759469228055; tidur=2.9444389791664403; giliran=2.9444389791664403; teman=5.1298987149230735; terimakasih=2.9444389791664403; pada=2.9444389791664403; connection=2.9444389791664403; it=2.5649493574615367; slow=2.9444389791664403; like=2.9444389791664403; internet=1.3862943611198906; telkomsel=2.5649493574615367; operator=2.1972245773362196; pakai=1.0986122886681098; tetap=2.5649493574615367; broom=1.791759469228055; dengan=1.9459101490553132; didukung=1.9459101490553132; internetan=1.791759469228055; cepat=1.9459101490553132; 5g=2.1972245773362196; rt=1.9459101490553132; bisa=1.9459101490553132; pakai=1.0986122886681098; internet=1.3862943611198906; tanpa=1.9459101490553132; putus=1.9459101490553132; jaringan=2.1972245773362196; call=1.9459101490553132; they=3.58351893845611; have=1.9459101490553132; good=1.9459101490553132; cute=1.9459101490553132; insurance=2.1972245773362196; they=1.791759469228055; call=1.9459101490553132; but=2.1972245773362196; single=2.1972245773362196; can=1.9459101490553132; have=1.9459101490553132; so=1.9459101490553132; good=1.9459101490553132; cute=1.9459101490553132; 
fire=2.1972245773362196; reasonable=2.1972245773362196; price=2.1972245773362196; even=2.1972245773362196; pakai=1.0986122886681098; ads=2.1972245773362196; hope=2.1972245773362196; insurance=2.1972245773362196; they=1.791759469228055; call=1.9459101490553132; but=2.1972245773362196; single=2.1972245773362196; can=1.9459101490553132; have=1.9459101490553132; so=1.9459101490553132; good=1.9459101490553132; cute=1.9459101490553132; fire=2.1972245773362196; reasonable=2.1972245773362196; price=2.1972245773362196; even=2.1972245773362196; pakai=1.0986122886681098; ads=2.1972245773362196; hope=2.1972245773362196; lancar=2.9444389791664403; saya=1.791759469228055; tidur=2.9444389791664403; giliran=2.9444389791664403; teman=5.1298987149230735; terimakasih=2.9444389791664403; pada=2.9444389791664403; cepat=1.9459101490553132; 5g=2.1972245773362196; broom=1.791759469228055; dengan=1.9459101490553132; didukung=1.9459101490553132; bisa=1.9459101490553132; pakai=1.0986122886681098; internetan=1.791759469228055; internet=1.3862943611198906; putus=1.9459101490553132; tanpa=1.9459101490553132; jaringan=2.1972245773362196; slow=2.9444389791664403; penghianat=2.9444389791664403; it=2.5649493574615367; maybe=2.9444389791664403; ke=2.9444389791664403; yaps=2.9444389791664403; better=2.9444389791664403; beralih=2.9444389791664403; than=2.9444389791664403; so=3.8918202981106265; bad=2.9444389791664403; saya=1.791759469228055; tetap=2.5649493574615367; iklan=2.9444389791664403; haha=1.791759469228055; tetap=2.5649493574615367; rt=1.9459101490553132; teman=5.1298987149230735; pakai=1.0986122886681098; haha=1.791759469228055; operator=2.1972245773362196; iklan=2.9444389791664403; saya=1.791759469228055; haha=1.791759469228055; can=1.9459101490553132; aya=2.9444389791664403; they=1.791759469228055; like=2.9444389791664403; broom=1.791759469228055; internetan=1.791759469228055; pakai=1.0986122886681098; internet=1.3862943611198906;
pakai=1.0986122886681096, df=12
internet=1.3862943611198906, df=8
broom=1.7917594692280547, df=6
haha=1.7917594692280547, df=6
saya=1.7917594692280547, df=6
bisa=1.945910149055313, df=5
call=1.945910149055313, df=5
can=1.945910149055313, df=5
cepat=1.945910149055313, df=5
cute=1.945910149055313, df=5
dengan=1.945910149055313, df=5
didukung=1.945910149055313, df=5
good=1.945910149055313, df=5
have=1.945910149055313, df=5
putus=1.945910149055313, df=5
rt=1.945910149055313, df=5
tanpa=1.945910149055313, df=5
they=2.0903860474327307, df=6
5g=2.1972245773362196, df=4
ads=2.1972245773362196, df=4
but=2.1972245773362196, df=4
even=2.1972245773362196, df=4
fire=2.1972245773362196, df=4
hope=2.1972245773362196, df=4
insurance=2.1972245773362196, df=4
jaringan=2.1972245773362196, df=4
operator=2.1972245773362196, df=4
price=2.1972245773362196, df=4
reasonable=2.1972245773362196, df=4
single=2.1972245773362196, df=4
so=2.335092178866376, df=5
internetan=2.3890126256374065, df=6
it=2.5649493574615367, df=3
telkomsel=2.5649493574615367, df=3
tetap=2.5649493574615367, df=3
yang=2.5649493574615367, df=3
bad=2.9444389791664403, df=2
beralih=2.9444389791664403, df=2
better=2.9444389791664403, df=2
connection=2.9444389791664403, df=2
dijamin=2.9444389791664403, df=2
giliran=2.9444389791664403, df=2
iklan=2.9444389791664403, df=2
ke=2.9444389791664403, df=2
lancar=2.9444389791664403, df=2
like=2.9444389791664403, df=2
marketing=2.9444389791664403, df=2
maybe=2.9444389791664403, df=2
oprator=2.9444389791664403, df=2
pada=2.9444389791664403, df=2
penghianat=2.9444389791664403, df=2
selain=2.9444389791664403, df=2
slow=2.9444389791664403, df=2
terimakasih=2.9444389791664403, df=2
than=2.9444389791664403, df=2
tidur=2.9444389791664403, df=2
wusshhhhh=2.9444389791664403, df=2
yaps=2.9444389791664403, df=2
aya=4.41665846874966, df=2
berpengalaman=4.41665846874966, df=2
teman=5.1298987149230735, df=3
package dataConvert;
import java.io.*;
import java.util.*;
import java.util.Map.Entry;
/**
 * Program that computes TF-IDF.
 *
 * @author frendhisaidodanaro
 */
public class procTFIDF {
    // List used to check for stop words.
    private ArrayList<String> alExtStopWords = new ArrayList<String>();

    // Returns the entries of a map sorted by value (ascending).
    static <K, V extends Comparable<? super V>> SortedSet<Map.Entry<K, V>> entriesSortedByValues(Map<K, V> map) {
        SortedSet<Map.Entry<K, V>> sortedEntries = new TreeSet<Map.Entry<K, V>>(
            new Comparator<Map.Entry<K, V>>() {
                @Override
                public int compare(Map.Entry<K, V> e1, Map.Entry<K, V> e2) {
                    int res = e1.getValue().compareTo(e2.getValue());
                    // Never return 0, so entries with equal values are not dropped as duplicates.
                    return res != 0 ? res : 1;
                }
            }
        );
        sortedEntries.addAll(map.entrySet());
        return sortedEntries;
    }

    // Snippet from the program edu.upi.cs.tweetmining.TFIDF that loads the stop words into the list alExtStopWords.
    private void loadExtStopWords(String inputExtStopWords) {
        try {
            FileInputStream fstream = new FileInputStream(inputExtStopWords);
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine;
            // Every line read from the file is stored as one stop-word entry.
            while ((strLine = br.readLine()) != null) {
                alExtStopWords.add(strLine);
            }
            br.close();
            in.close();
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }

    public void process(String fileInput, String extStopWord, boolean denganStat) {
        String namaFile = fileInput.substring(0, fileInput.indexOf("."));
        int totalTerms = 0;
        int totalDoc;
        // Load the stop words into alExtStopWords.
        loadExtStopWords(extStopWord);

        // arrTweets: term frequencies per tweet; arrTFIDF: TF-IDF values per tweet;
        // docFreq: document frequency per term; tfIDF: average TF-IDF per term.
        ArrayList<HashMap<String, Integer>> arrTweets = new ArrayList<HashMap<String, Integer>>();
        ArrayList<HashMap<String, Double>> arrTFIDF = new ArrayList<HashMap<String, Double>>();
        HashMap<String, Integer> docFreq = new HashMap<String, Integer>();
        TreeMap<String, Double> tfIDF = new TreeMap<String, Double>();
        try {
            FileInputStream fstream = new FileInputStream(fileInput);
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            System.out.println("Reading " + fileInput);

            // COMPUTE TERM FREQUENCY
            // Read the input file and count the tf of every term on each line.
            String strLine;
            Integer tfreq;
            while ((strLine = br.readLine()) != null) {
                HashMap<String, Integer> termFreq = new HashMap<String, Integer>();
                // Skip the 22-character timestamp prefix, e.g. "2012-04-02T06:52:32Z||".
                String docn = strLine.substring(22, strLine.length());
                Scanner sc = new Scanner(docn);
                while (sc.hasNext()) {
                    String term = sc.next();
                    if (!term.equalsIgnoreCase("indosat")) { // Skip the keyword "indosat", since it appears in every tweet.
                        tfreq = termFreq.get(term); // Current count, or null if the term is new.
                        termFreq.put(term, (tfreq == null) ? 1 : tfreq + 1); // No value yet: store 1; otherwise increment.
                        totalTerms++;
                    }
                }
                sc.close();
                arrTweets.add(termFreq); // Store this tweet's termFreq.
            }
            br.close();
            // Finished reading the dataset.
            // arrTweets holds one termFreq HashMap per document/tweet, mapping each term to its tf.

            // COMPUTE DOCUMENT FREQUENCY
            // Iterate over arrTweets and count the number of documents that contain each term.
            // For example, docFreq.put("awan", 7) means the term "awan" was found in 7 documents/tweets.
            for (HashMap<String, Integer> perTweet : arrTweets) {
                for (String eachW : perTweet.keySet()) {
                    if (alExtStopWords.contains(eachW)) { // Stop words get DF = 0.
                        docFreq.put(eachW, 0);
                    } else {
                        Integer dfreq = docFreq.get(eachW);
                        docFreq.put(eachW, (dfreq == null) ? 1 : dfreq + 1);
                    }
                }
            }
            // Finished computing the DF of every term.
            // docFreq maps each term (key) to its document frequency (value).

            // COMPUTE IDF AND TF-IDF
            // Iterate over arrTweets once more to compute the IDF and, in the same pass, TF*IDF.
            // For each document the TF*IDF of every term is stored in the HashMap valTFIDF,
            // and each valTFIDF is then collected in arrTFIDF.
            Double idf, tfidf;
            totalDoc = arrTweets.size();
            for (HashMap<String, Integer> perTweet : arrTweets) {
                HashMap<String, Double> valTFIDF = new HashMap<String, Double>();
                for (String aTerm : perTweet.keySet()) {
                    Integer dfreq = docFreq.get(aTerm); // DF of the term being processed
                    if (dfreq > 1) {
                        Integer cfreq = perTweet.get(aTerm); // tf of aTerm in this tweet
                        // Cast to double: integer division would truncate the ratio (e.g. 39/12 -> 3) before the log.
                        idf = Math.log((double) totalDoc / dfreq);
                        tfidf = cfreq * idf;
                        valTFIDF.put(aTerm, tfidf);
                        //System.out.println("TFIDF("+aTerm+")= "+cfreq+" * "+"log("+totalDoc+"/"+dfreq+") = "+ tfidf+" , ");
                    }
                }
                arrTFIDF.add(valTFIDF); // One tweet done: store its valTFIDF in arrTFIDF.
            }
            // Finished computing IDF and TF*IDF.
            // arrTFIDF holds, per document, the valTFIDF map with the TF-IDF of each term.

            // Write the TF*IDF results to the output file namafile_tfidf.txt (opened in append mode).
            BufferedWriter writeTFIDF = new BufferedWriter(new FileWriter((namaFile + "_tfidf.txt"), true));
            for (HashMap<String, Double> perTweet : arrTFIDF) {
                //System.out.println(perTweet.toString());
                for (String aTerm : perTweet.keySet()) {
                    Double valTFIDF = perTweet.get(aTerm);
                    writeTFIDF.write(aTerm + "=" + valTFIDF + "; ");
                    //System.out.print(aTerm+"="+valTFIDF+"; ");
                }
                //System.out.println("__");
                //writeTFIDF.newLine();
            }
            writeTFIDF.close();

            // Compute the average TF-IDF weight of each term, if denganStat == true.
            if (denganStat) {
                for (String word : docFreq.keySet()) {
                    Integer dfreq = docFreq.get(word);
                    if (dfreq > 1) { // Only terms that occur in more than one document.
                        //System.out.println("Collecting term: "+word+" df= "+dfreq);
                        Double tfIDFstat = 0.0; // Accumulator for the term's TF-IDF values.
                        int cc = 0; // Number of documents that contain the term.
                        for (HashMap<String, Double> val : arrTFIDF) {
                            if (val.containsKey(word)) {
                                cc++;
                                tfIDFstat = tfIDFstat + val.get(word); // Accumulate the term's TF-IDF over all documents.
                            }
                        }
                        //System.out.println("Counted="+cc+" tfIDFstats="+tfIDFstat);
                        Double tfIDFtot = tfIDFstat / cc; // COMPUTE THE AVERAGE
                        //System.out.println("tfidf("+word+")="+tfIDFtot);
                        tfIDF.put(word, tfIDFtot); // Store in the TreeMap tfIDF.
                    }
                }
                // Write the averages to the output file namafile_tfidf_stat.txt (opened in append mode).
                BufferedWriter writeStat = new BufferedWriter(new FileWriter((namaFile + "_tfidf_stat.txt"), true));
                for (Iterator<Entry<String, Double>> it = entriesSortedByValues(tfIDF).iterator(); it.hasNext();) {
                    Entry<String, Double> entry = it.next();
                    String oneWord = entry.getKey();
                    Double oneValue = entry.getValue();
                    Integer dfreq = docFreq.get(oneWord);
                    //System.out.println("tdidf("+oneWord+")= "+oneValue);
                    writeStat.write(oneWord + "=" + oneValue + ", df=" + dfreq);
                    writeStat.newLine();
                }
                writeStat.close();
            }
        } catch (Exception e) {
            System.out.println(e.toString());
        }
        System.out.println("Unique terms: " + docFreq.size());
        System.out.println("Number of documents: " + arrTweets.size());
        System.out.println("Total terms: " + totalTerms);
    }

    public static void main(String[] a) {
        procTFIDF pt = new procTFIDF();
        pt.process("negatif_2012.txt", "catatan_stopwords_ekstensif.txt", true);
    }
}
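
As a rough illustration of the weight computed in `process` (tf multiplied by the natural log of totalDoc/dfreq), the sketch below evaluates it for a single term using figures from the sample data: 39 tweets and df=12 for "pakai". `IdfSketch` and its two methods are hypothetical names that are not part of the gist. The two variants differ only in whether the ratio is divided as integers or as floating-point numbers, which changes the result.

```java
// Hypothetical helper class for illustration; not part of the original gist.
class IdfSketch {
    // Ratio computed in integer arithmetic: totalDoc / df is truncated before the log.
    static double tfidfTruncated(int tf, int totalDoc, int df) {
        return tf * Math.log(totalDoc / df);
    }

    // Ratio computed in floating point: no truncation.
    static double tfidfExact(int tf, int totalDoc, int df) {
        return tf * Math.log((double) totalDoc / df);
    }

    public static void main(String[] args) {
        int totalDoc = 39; // tweets in the sample input
        int df = 12;       // document frequency listed for "pakai"
        System.out.println("integer division:        " + tfidfTruncated(1, totalDoc, df));
        System.out.println("floating-point division: " + tfidfExact(1, totalDoc, df));
    }
}
```

With integer division the ratio 39/12 becomes 3, so the weight is ln(3); with floating-point division it is ln(3.25), which is slightly larger.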
@rifinda
rifinda commented Dec 26, 2019

Excuse me, I would like to ask: what is the code below used for? And is it required?

// Snippet from the program edu.upi.cs.tweetmining.TFIDF that loads the stop words into the list alExtStopWords
private void loadExtStopWords(String inputExtStopWords) {

     try {
            FileInputStream fstream = new FileInputStream(inputExtStopWords);
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine;
            int cc=0;
            while ((strLine = br.readLine()) != null)   {
               alExtStopWords.add(strLine);
            }
            br.close();
            in.close();
        }catch (Exception e) {
            System.out.println(e.toString());
        }
 }

Thank you in advance, I would appreciate an explanation :)

@frendhisaido
Author

That part of the code only fills the stop-word list (ArrayList<String> alExtStopWords) from a file.
If the stop words are simply hard-coded, that part of the code is probably not needed.
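
A minimal sketch of the hard-coded alternative, assuming a small fixed list is enough. The class name `HardcodedStopWords` and the four entries are placeholders for illustration and are not taken from the actual stop-word file.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical replacement for loadExtStopWords; not part of the original gist.
class HardcodedStopWords {
    // Placeholder entries only; the real list comes from the stop-word file.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("dan", "atau", "itu", "ini"));

    // Same role as alExtStopWords.contains(term) in the gist.
    static boolean isStopWord(String term) {
        return STOP_WORDS.contains(term);
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("dan"));     // a listed stop word
        System.out.println(isStopWord("indosat")); // not in the list
    }
}
```

A `HashSet` is used here instead of an `ArrayList` because membership checks do not need to scan the whole list.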

This stop-word list is used later, when DF is computed. Stop words do not get a document frequency:
https://gist.github.com/frendhisaido/3170455#file-proctfidf-java-L109
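
The DF step can be sketched on its own as follows. This is a simplified version that assumes each document is already a set of distinct terms; `DocFreqSketch` and `documentFrequency` are hypothetical names. As in the gist, a stop word is recorded with df = 0, so a later `df > 1` check filters it out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration; not part of the original gist.
class DocFreqSketch {
    // Counts in how many documents each term occurs; stop words are pinned to 0.
    static Map<String, Integer> documentFrequency(List<Set<String>> docs, Set<String> stopWords) {
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                if (stopWords.contains(term)) {
                    df.put(term, 0); // stop word: never counted
                } else {
                    Integer old = df.get(term);
                    df.put(term, old == null ? 1 : old + 1);
                }
            }
        }
        return df;
    }
}
```

For two documents {pakai, dan} and {pakai, internet} with "dan" as the only stop word, this gives pakai=2, internet=1 and dan=0.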

Sorry, this was 8 years ago, so I do not remember the details exactly,
but as far as I recall, stop words did not need to be counted for TF-IDF because they (probably) carry no sentiment value.
So stop words are skipped to keep them from having much influence on the classification.

Reference from my lecturer's blog: https://yudiwbs.wordpress.com/2008/07/23/stop-words-untuk-bahasa-indonesia/
