@henriquelalves
Created May 6, 2018 00:52
ML class 17
Assignment 3:
Headlines -> Vectors -> Groups
\_________ ________/
          v
   the complicated
        part
Baseline idea:
- Bag of Words (BoW)
  1. Throw all words in a bag and count how often each one appears.
     -> Careful with useless words ('the', 'of'...), variations of words
        ('banana' vs. 'bananas'), and case sensitivity.
  2. Create a dictionary with the most frequent words.
  3. Transform the training set into vectors (dimension = size of the dictionary).
  4. Normalization!
     -> TF-IDF (Term Frequency / Inverse Document Frequency)
        e.g. <0, 1, 1, 0, 0> => <0/2, 1/2, 1/2, 0/2, 0/2>
        (or divide by the sum over the whole dataset)
- N-Grams:
  Same as Bag of Words, but using n-tuples of consecutive words.
- N-Char-Grams:
  Using sequences of characters instead of whole words.
(A code sketch of all three featurizers follows this list.)
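A minimal sketch of these featurizers using scikit-learn (my library choice, not necessarily what the class used); the toy headlines are invented for illustration:

  from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

  headlines = ["Banana prices rise again",          # toy data, invented
               "The price of bananas falls"]

  # Steps 1-3: count words, drop stop words ('the', 'of'...), lowercase them.
  # (Word variations like 'banana'/'bananas' would need a stemmer on top.)
  bow = CountVectorizer(stop_words="english", lowercase=True)
  X_bow = bow.fit_transform(headlines)      # sparse matrix, dim = dictionary size

  # Step 4: TF-IDF weighting instead of raw counts.
  tfidf = TfidfVectorizer(stop_words="english")
  X_tfidf = tfidf.fit_transform(headlines)

  # N-Grams: pairs of consecutive words instead of single words.
  bigrams = CountVectorizer(ngram_range=(2, 2))
  X_bi = bigrams.fit_transform(headlines)

  # N-Char-Grams: sequences of 3 characters instead of whole words.
  char_grams = CountVectorizer(analyzer="char", ngram_range=(3, 3))
  X_char = char_grams.fit_transform(headlines)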
Recommended reading:
- Authorship Attribution (TIFS paper, 2017, Rocha).
You will probably need dimensionality reduction (very sparse vectors!).
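One hedged sketch of such a reduction, continuing from X_tfidf above: TruncatedSVD is a PCA-like transform that works directly on sparse matrices (plain PCA would require densifying them first).

  from sklearn.decomposition import TruncatedSVD

  # 100 is an arbitrary choice; n_components must be below the vocabulary size.
  svd = TruncatedSVD(n_components=100)
  X_reduced = svd.fit_transform(X_tfidf)    # dense (n_headlines, 100) matrix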
Debugging a learning Algorithm:
High error? Things to try:
1 - More data.
2 - Features -> Selection?
             -> Reduction? (mathematical transformation to reduce dimensionality: PCA, etc.)
3 - Regularization.
4 - Alpha, Lambda (learning rate and regularization strength; a Lambda sweep is sketched after this list).
5 - Additional features.
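A rough sketch of the Lambda part of steps 3-4: sweep the regularization strength on a validation split and keep the best value. In scikit-learn's LogisticRegression, C behaves like 1/Lambda; X and y are assumed to exist, e.g. from the BoW sketch above.

  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
  best_C, best_score = None, -1.0
  for C in [0.01, 0.1, 1.0, 10.0]:            # larger C = weaker regularization
      clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
      score = clf.score(X_val, y_val)         # validation accuracy
      if score > best_score:
          best_C, best_score = C, score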
_______________________
= Lots of Algorithms => ML Diagnostics
- Choosing by accuracy alone might not be wise (it is measured over just one dataset).
  -> Better to have a series of results showing the accuracy differences between algorithms.
  -> Use statistical information about the algorithms as a basis for comparison.
To solve this, divide the dataset:
60% Train
20% Validation
20% Test (SACRED DATA - USE IT ONLY TO GET THE GENERALIZATION ERROR)
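One way to get the 60/20/20 split, as a sketch (two calls to scikit-learn's train_test_split; X, y assumed):

  from sklearn.model_selection import train_test_split

  # First cut 60% off for training, then split the remaining 40% in half.
  X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4,
                                                      random_state=0)
  X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                  random_state=0)
  # Touch X_test/y_test exactly once, to report the generalization error.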
Bias (Underfitting) X Variance (Overfitting)
If the algorithm doesn't behave as expected, what helps?
  + Data           -> fixes high variance
  Select features  -> fixes high variance
  Add features     -> fixes high bias
  Add complexity   -> fixes high bias
  Decrease Lambda  -> fixes high bias
  Increase Lambda  -> fixes high variance
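A rough way to tell the two cases apart is a learning curve, sketched below with scikit-learn's learning_curve (X_train, y_train assumed from the split above): training and validation error converging at a high value suggests high bias; a large, persistent gap between them suggests high variance.

  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import learning_curve

  sizes, train_scores, val_scores = learning_curve(
      LogisticRegression(max_iter=1000), X_train, y_train,
      train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
  train_err = 1 - train_scores.mean(axis=1)   # error on the training folds
  val_err = 1 - val_scores.mean(axis=1)       # error on the validation folds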
Ways of comparing algorithms:
- Classification accuracy or Error (effectiveness)
- Time spent (efficiency)
- Comprehensibility of the model (interpretability)
- Storage
- Model complexity
- Battery/Power requirements
Sampling strategies on the data:
- Hold-out
- Random sampling
- Cross-validation
- Bootstrapping
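Sketches of two of these with scikit-learn (hold-out is the 60/20/20 split shown earlier; X, y assumed):

  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score
  from sklearn.utils import resample

  # Cross-validation: k accuracy estimates instead of a single one.
  scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

  # Bootstrapping: sample n examples with replacement; repeating this gives
  # a distribution of scores for statistical comparison between algorithms.
  X_boot, y_boot = resample(X, y, replace=True, n_samples=len(y))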