ML class 17
Assignment 3:
    Headlines -> Vectors -> Groups
     \___________ ____________/
                 v
          the complicated
                part
Baseline idea (a code sketch of these steps follows the list):
  - Bag of Words (BoW)
    1. Throw all words in a bag and start counting the words that appear.
       -> careful with useless words ('the', 'of', ...), variations of words ('banana', 'bananas') and case sensitivity.
    2. Create a dictionary with the most frequent words.
    3. Transform the training set into vectors (dimension = size of the dictionary).
    4. Normalization!
       -> TF-IDF (Term Frequency / Inverse Document Frequency)
          e.g. <0, 1, 1, 0, 0> => <0/2, 1/2, 1/2, 0/2, 0/2>
          (or divide by the whole dataset sum)
  - N-Grams:
    Same as Bag of Words, but using n-tuples of consecutive words.
  - N-Char-Grams:
    Using sequences of characters instead of whole words.
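
A minimal sketch of this baseline with scikit-learn (assumed available; the toy headlines and parameter values are illustrative, not the assignment data):

    # Bag of Words, TF-IDF and n-gram variants with scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    headlines = [
        "Banana prices rise again",
        "The price of bananas falls",
        "New ML course announced",
    ]

    # 1-2. Bag of Words: lowercase, drop common English stop words ('the', 'of', ...)
    #      and keep only the most frequent terms as the dictionary.
    bow = CountVectorizer(lowercase=True, stop_words="english", max_features=1000)
    X_counts = bow.fit_transform(headlines)      # sparse matrix: docs x dictionary

    # 3-4. TF-IDF handles the normalization step (term frequency weighted by
    #      inverse document frequency) in one go.
    tfidf = TfidfVectorizer(lowercase=True, stop_words="english", max_features=1000)
    X_tfidf = tfidf.fit_transform(headlines)

    # N-Grams: pairs of consecutive words instead of single words.
    bigrams = TfidfVectorizer(ngram_range=(2, 2), stop_words="english")

    # N-Char-Grams: sequences of 3 to 5 characters instead of whole words.
    char_grams = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))

    print(bow.get_feature_names_out())           # the learned dictionary
    print(X_tfidf.shape)                         # (n_headlines, dictionary size)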
Recommended reading:
  - Authorship Attribution (Rocha, TIFS paper, 2017).
Will probably need dimensionality reduction (very sparse vectors!); a sketch follows below.
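
One common reduction for sparse TF-IDF matrices is truncated SVD; a hedged sketch, assuming X_tfidf is the full training matrix (with far more features than components), not the toy example above:

    # Dimensionality reduction for sparse TF-IDF vectors.
    # TruncatedSVD works directly on sparse input (plain PCA would need a dense matrix).
    from sklearn.decomposition import TruncatedSVD

    # n_components is an illustrative choice; it must be smaller than the dictionary size.
    svd = TruncatedSVD(n_components=100)
    X_reduced = svd.fit_transform(X_tfidf)       # dense matrix: docs x 100
    print(svd.explained_variance_ratio_.sum())   # variance kept by the 100 components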
Debugging a learning algorithm:
  High error?
    1 - More data.
    2 - Features -> Selection?
                 -> Reduction? (mathematical transformation to reduce dimensionality: PCA, etc.)
    3 - Regularization
    4 - Alpha, Lambda (learning rate and regularization strength)
    5 - Additional features
_______________________
= Lots of algorithms => ML diagnostics
  - Choosing by accuracy alone might not be wise (it is a single number over a single dataset).
    -> Better to have a series of results showing accuracy differences between algorithms.
    -> Use statistical information about the algorithms as the basis for comparison.
To solve this, divide the dataset:
  60% Train
  20% Validation
  20% Test (SACRED DATA - USE IT ONLY TO ESTIMATE THE GENERALIZATION ERROR)
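
A minimal sketch of the 60/20/20 split with scikit-learn, where X and y stand for the vectorized headlines and their group labels (placeholder names, not defined in these notes):

    from sklearn.model_selection import train_test_split

    # First cut: 60% train, 40% held out.
    X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)

    # Second cut: split the held-out 40% in half -> 20% validation, 20% test.
    X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

    # Fit on train, tune on validation; touch X_test only once, at the very end,
    # to estimate the generalization error.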
Bias (underfitting) vs. Variance (overfitting)
If the algorithm doesn't behave as expected, what helps? (a diagnosis sketch follows the table)
  More data        -> fixes high variance
  Select features  -> fixes high variance
  Add features     -> fixes high bias
  Add complexity   -> fixes high bias
  Decrease lambda  -> fixes high bias
  Increase lambda  -> fixes high variance
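
A rough sketch of how to tell which side (high bias or high variance) you are on, by comparing training and validation error; the classifier and the thresholds are arbitrary assumptions, and in scikit-learn's LogisticRegression the parameter C acts roughly as 1/lambda:

    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)
    train_err = 1.0 - clf.score(X_train, y_train)
    val_err = 1.0 - clf.score(X_val, y_val)

    if train_err > 0.2 and val_err > 0.2:        # both errors high -> underfitting
        print("High bias: add features/complexity or decrease lambda (increase C)")
    elif val_err - train_err > 0.1:              # big train/validation gap -> overfitting
        print("High variance: more data, fewer features, or increase lambda (decrease C)")
    else:
        print("Looks balanced")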
Forms of comparing algorithms:
  - Classification accuracy or error (effectiveness)
  - Time spent (efficiency)
  - Comprehensibility of the model (interpretability)
  - Storage
  - Model complexity
  - Battery/power requirements
Sampling the data (sketch of the last two below):
  - Hold-out
  - Random sampling
  - Cross-validation
  - Bootstrapping
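
A sketch tying back to the earlier point that a single accuracy number is a weak basis for comparison: cross-validation yields a series of accuracies per algorithm, and bootstrapping resamples the training set with replacement (the two classifiers are arbitrary examples, not prescribed by the notes):

    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.utils import resample

    # Cross-validation: 5 accuracy scores per algorithm, so mean and spread can be compared.
    for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                      ("naive bayes", MultinomialNB())]:
        scores = cross_val_score(clf, X_train, y_train, cv=5)
        print(name, scores.mean(), scores.std())

    # Bootstrapping: one resample of the training set, drawn with replacement.
    X_boot, y_boot = resample(X_train, y_train, replace=True, random_state=0)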