@henriquelalves
Created May 6, 2018 00:52
ML class 17
Assignment 3:
Headlines -> Vectors -> Groups
\_________ ________/
          v
   the complicated
        part
Baseline idea:
- Bag of Words (BoW)
  1. Throw all words in a bag and count how often each one appears.
     -> Careful with useless words ('the', 'of'...), variations of words
        ('banana' vs. 'bananas'), and case sensitivity.
  2. Create a dictionary with the most frequent words.
  3. Transform the training set into vectors (dimension = size of the dictionary).
  4. Normalization!
     -> TF-IDF (Term Frequency / Inverse Document Frequency)
        e.g. <0, 1, 1, 0, 0> => <0/2, 1/2, 1/2, 0/2, 0/2>
        (or divide by the sum over the whole dataset)
- N-Grams:
  Same as Bag of Words, but using n-tuples of consecutive words.
- N-Char-Grams:
  Using sequences of characters instead of whole words.
(A code sketch of all three featurizers follows this list.)
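A minimal sketch of these featurizers using scikit-learn (my library choice, not necessarily what the class used); the toy headlines are invented for illustration:

  from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

  headlines = ["Banana prices rise again",          # toy data, invented
               "The price of bananas falls"]

  # Steps 1-3: count words, drop stop words ('the', 'of'...), lowercase them.
  # (Word variations like 'banana'/'bananas' would need a stemmer on top.)
  bow = CountVectorizer(stop_words="english", lowercase=True)
  X_bow = bow.fit_transform(headlines)      # sparse matrix, dim = dictionary size

  # Step 4: TF-IDF weighting instead of raw counts.
  tfidf = TfidfVectorizer(stop_words="english")
  X_tfidf = tfidf.fit_transform(headlines)

  # N-Grams: pairs of consecutive words instead of single words.
  bigrams = CountVectorizer(ngram_range=(2, 2))
  X_bi = bigrams.fit_transform(headlines)

  # N-Char-Grams: sequences of 3 characters instead of whole words.
  char_grams = CountVectorizer(analyzer="char", ngram_range=(3, 3))
  X_char = char_grams.fit_transform(headlines)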
Recommended reading:
- Authorship Attribution (TIFS paper, 2017, Rocha).
You will probably need dimensionality reduction (very sparse vectors!).
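One hedged sketch of such a reduction, continuing from X_tfidf above: TruncatedSVD is a PCA-like transform that works directly on sparse matrices (plain PCA would require densifying them first).

  from sklearn.decomposition import TruncatedSVD

  # 100 is an arbitrary choice; n_components must be below the vocabulary size.
  svd = TruncatedSVD(n_components=100)
  X_reduced = svd.fit_transform(X_tfidf)    # dense (n_headlines, 100) matrix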
Debugging a learning Algorithm:
High error? Things to try:
1 - More data.
2 - Features -> Selection?
             -> Reduction? (mathematical transformation to reduce dimensionality: PCA, etc.)
3 - Regularization.
4 - Alpha, Lambda (learning rate and regularization strength; a Lambda sweep is sketched after this list).
5 - Additional features.
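A rough sketch of the Lambda part of steps 3-4: sweep the regularization strength on a validation split and keep the best value. In scikit-learn's LogisticRegression, C behaves like 1/Lambda; X and y are assumed to exist, e.g. from the BoW sketch above.

  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
  best_C, best_score = None, -1.0
  for C in [0.01, 0.1, 1.0, 10.0]:            # larger C = weaker regularization
      clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
      score = clf.score(X_val, y_val)         # validation accuracy
      if score > best_score:
          best_C, best_score = C, score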
_______________________
= Lots of Algorithms => ML Diagnostics
- Choosing by accuracy alone might not be wise (it is measured over just one dataset).
  -> Better to have a series of results showing the accuracy differences between algorithms.
  -> Use statistical information about the algorithms as a basis for comparison.
To solve this, divide the dataset:
60% Train
20% Validation
20% Test (SACRED DATA - USE IT ONLY TO GET THE GENERALIZATION ERROR)
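One way to get the 60/20/20 split, as a sketch (two calls to scikit-learn's train_test_split; X, y assumed):

  from sklearn.model_selection import train_test_split

  # First cut 60% off for training, then split the remaining 40% in half.
  X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4,
                                                      random_state=0)
  X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                  random_state=0)
  # Touch X_test/y_test exactly once, to report the generalization error.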
Bias (Underfitting) X Variance (Overfitting)
If the algorithm doesn't behave as expected, what helps?
  + Data           -> fixes high variance
  Select features  -> fixes high variance
  Add features     -> fixes high bias
  Add complexity   -> fixes high bias
  Decrease Lambda  -> fixes high bias
  Increase Lambda  -> fixes high variance
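A rough way to tell the two cases apart is a learning curve, sketched below with scikit-learn's learning_curve (X_train, y_train assumed from the split above): training and validation error converging at a high value suggests high bias; a large, persistent gap between them suggests high variance.

  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import learning_curve

  sizes, train_scores, val_scores = learning_curve(
      LogisticRegression(max_iter=1000), X_train, y_train,
      train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
  train_err = 1 - train_scores.mean(axis=1)   # error on the training folds
  val_err = 1 - val_scores.mean(axis=1)       # error on the validation folds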
Ways of comparing algorithms:
- Classification accuracy or Error (effectiveness)
- Time spent (efficiency)
- Comprehensibility of the model (interpretability)
- Storage
- Model complexity
- Battery/Power requirements
Sampling strategies on the data:
- Hold-out
- Random sampling
- Cross-validation
- Bootstrapping
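Sketches of two of these with scikit-learn (hold-out is the 60/20/20 split shown earlier; X, y assumed):

  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score
  from sklearn.utils import resample

  # Cross-validation: k accuracy estimates instead of a single one.
  scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

  # Bootstrapping: sample n examples with replacement; repeating this gives
  # a distribution of scores for statistical comparison between algorithms.
  X_boot, y_boot = resample(X, y, replace=True, n_samples=len(y))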