Presentation by John Burt, Ph.D @ PSU Business Accelerator on 25 Feb 2018 @ 1pm
- CV = cross-validation
- Done mainly using the sklearn library
- CountVectorizer: represents each document as a vector of word-occurrence counts
- Text data > Feature engineering: TfidfVectorizer > Classifier: SGDClassifier > Hyper-parameter tuning: GridSearchCV
- Start with the default parameters
- Specify the params for GridSearchCV to check. GridSearchCV takes an estimator, but you can also pass it other things such as a normalizer or other data pre-processor
- Run GridSearchCV: pass it training data and target data
- GridSearchCV then outputs the best score and best set of params
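The steps above can be sketched as follows; the dataset and the particular param values here are made up for illustration, not the talk's settings:

```python
# Hedged sketch of the GridSearchCV workflow: define a param grid,
# fit on training data, and read off the best score and params.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Params for GridSearchCV to check (keys must match SGDClassifier args)
param_grid = {
    "penalty": ["l2", "l1"],
    "alpha": [1e-4, 1e-3],
}

grid = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3)
grid.fit(X, y)                 # pass training data and target data

print(grid.best_score_)        # best cross-validated accuracy found
print(grid.best_params_)       # best set of params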
- Generate an accuracy heatmap with penalty on the X axis and number of iterations on the Y axis
- A Pipeline is another estimator that can be passed to GridSearchCV
- The pipeline consists of the two objects defined earlier: TfidfVectorizer and SGDClassifier
- Instead of passing the classifier object alone, the pipeline is passed into GridSearchCV
- The pipeline handles the vectorization step internally, so vectorizer params can be tuned as well
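A sketch of the pipeline approach; the toy corpus and param values are my own illustration:

```python
# Hedged sketch: tuning vectorizer and classifier together by passing
# a Pipeline (not the bare classifier) into GridSearchCV.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

docs = ["good movie", "great film", "bad movie", "awful film"] * 10
labels = [1, 1, 0, 0] * 10

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Param names are prefixed with the pipeline step name plus "__",
# so both vectorizer and classifier params land in one grid
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-4, 1e-3],
}

grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(docs, labels)         # raw text goes in; the pipeline vectorizes it
print(grid.best_params_)
```

Note the `step__param` naming convention: that is how GridSearchCV routes each grid entry to the right pipeline step.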
- No tuning w/ default params: 93.5% acc
- Classifier tuning: 94.3%
- Vectorizer + classifier tuning: 95.3%
- Can't just tune everything blindly; experiment with tuning methods and with which params to tune for better results
- Workflow: find models > test > settle on a model > tune parameters
- Get the example code working, then try different params: both TfidfVectorizer and SGDClassifier have lots of params!
- Implement GridSearchCV to optimize your classifier for last session's Wikipedia toxicity data