Makes a compressed representation of data without any human intervention
Cons: Poor compression, and poor generalization to other datasets
Pros: Dimensionality reduction and image denoising
A simple autoencoder
Just compresses data, for example images from the MNIST database.
- Also need a corresponding decoder to reconstruct the image back.
Using TensorFlow to create a simple autoencoder
Convolutional Autoencoder
Why is it better?
Encoder goes from a larger image to a smaller image (using max pooling layers etc)
For decoding- use transpose convolutions, upsampling.
The checkerboard effect with transpose convolution
Making the network learn how to denoise images by providing input as noisy images and output as noise-free images
All Gates
Learn Gate
Takes the short term memory and the event and combines it; and then keeps the important part of it.
Lesson 10- Recurrent Neural Networks
The motivation- ordered sequences!
Better with time series data
Eg. stock price, we have to do supervised learning with sequences
Natural language Processing
Language Translation
Vanilla supervised learners
Make no assumption about the structure of the input data, but for sequences the order matters!
Images and video have structure, relations within image.
Modelling sequential data
What is an ordered sequence?
Indexing values by timestamp (the order in which they appeared)
A product of some underlying process/processes
Eg. temperature/stock prices
Model the sequence recursively
Model future values based on the past
Simple recursive examples
Example: odd numbers, something that can be expressed as a function of its predecessors
The seed is the first element(s), e.g. for Fibonacci it is 1 and 1.
The order is the number of previous elements an element depends upon, e.g. for the odd number sequence it is 1, for Fibonacci it is 2.
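A minimal sketch of the seed and order idea, assuming a sequence is fully described by its seed values plus a rule over the last few elements. The helper name `recursive_sequence` is made up for illustration and is not from the course.

```python
from itertools import islice

def recursive_sequence(seed, step):
    """Yield a sequence defined by its seed and a rule over the last len(seed) values.

    The order of the sequence is len(seed): how many previous
    elements each new element depends on.
    """
    window = list(seed)
    for value in window:
        yield value
    while True:
        nxt = step(*window)
        yield nxt
        window = window[1:] + [nxt]  # slide the window forward by one

# Odd numbers: order 1, seed [1], each element is the previous one plus 2.
odds = list(islice(recursive_sequence([1], lambda a: a + 2), 6))

# Fibonacci: order 2, seed [1, 1], each element is the sum of the previous two.
fib = list(islice(recursive_sequence([1, 1], lambda a, b: a + b), 8))

print(odds)  # [1, 3, 5, 7, 9, 11]
print(fib)   # [1, 1, 2, 3, 5, 8, 13, 21]
```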
Thinking about recursivity
The unfolded view vs folded view vs graph view.
Driving a recursive sequence
Savings account example.
Injecting recursivity into a learner, the lazy way
Learning the function to describe a sequence
Learn weights of a parameterized function by fitting; take the least squares cost function.
Regression! Windowing our data points based on input-output pairs.
Shows how to do this in Keras.
There can be more than one way to describe a sequence..
Applies this to a real financial dataset
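The windowing step mentioned above can be sketched without any framework. This is an illustrative helper (the name `window_series` is an assumption, not the course's function): each input is a run of `window_size` consecutive values and the output is the value that follows.

```python
def window_series(series, window_size):
    """Turn a sequence into (inputs, outputs) pairs for supervised regression."""
    inputs, outputs = [], []
    for start in range(len(series) - window_size):
        inputs.append(series[start:start + window_size])
        outputs.append(series[start + window_size])
    return inputs, outputs

X, y = window_series([1, 3, 5, 7, 9, 11], window_size=2)
print(X)  # [[1, 3], [3, 5], [5, 7], [7, 9]]
print(y)  # [5, 7, 9, 11]
```

The resulting pairs can be handed to any regressor; the window size plays the role of the order of the recursion.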
Injecting recursivity into a learner, the proper way
Failures of the FNN approach
We assumed no structure and just went pair by pair, but *there's a dependence between the input-output pairs*; they are not IID
Basic RNN approach
Force consecutive dependency!
Hidden states - the h variables
Use the least squares loss again, albeit now including the hidden variables as well.
RNNs and memory
RNNs go much farther back in time to take into account the previous values; whereas FNN just depends on the immediately previous value.
Every level contains a complete history, or in other words, has memory.
Technical issues such as vanishing/exploding gradients exist
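A toy illustration of the hidden state, assuming a scalar recurrence h_t = tanh(w_x * x_t + w_h * h_{t-1}). The weights here are arbitrary numbers chosen for the example, not learned values.

```python
import math

def run_rnn(inputs, w_x, w_h, h0=0.0):
    """Scalar recurrence h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# Both sequences end with the same two inputs but start differently.
# The final hidden states differ, which is the sense in which h carries memory.
a = run_rnn([1.0, 0.0, 0.5], w_x=0.8, w_h=0.5)
b = run_rnn([-1.0, 0.0, 0.5], w_x=0.8, w_h=0.5)
print(a[-1], b[-1])
```

Because each step multiplies the previous state by w_h and squashes it, the influence of early inputs shrinks over time, which hints at the vanishing gradient issue.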
Lesson 11 : LSTMs- Long Short term Memory Networks
RNN vs LSTM
Use previous information- the animal NatGeo example.
RNNs generally store only short term memory due to vanishing gradients; but LSTMs keep track of both long term memory and short term memory. (GHAJINI)
Combine both forms of memory using 4 gates- forget gate, remember gate, learn gate and use gate - these gates are used to update both long and short term memories.
About all gates: using example of NatGeo science and nature show
Learn Gate
Joins the short term memory and the event; and forgets the un-important part--> ignore factor
Forget Gate
The forget factor
Remember Gate
Combines the long term memory from the forget gate and short term memory from the learn gate and SIMPLY ADD THEM; and generate the new long term memory.
Use Gate
Takes whatever is useful from both long term and short term memories; and generate the new short term memory.
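A rough scalar sketch of how the four gates described above could fit together. It follows the gate names used in these notes, not a library API. As a loud simplification, all gates share one weighted sum and differ only in their bias; a real LSTM learns separate weight matrices per gate, and the parameter names in `p` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(ltm, stm, event, p):
    """One step with the notes' gate names; p holds hypothetical scalar weights."""
    # Simplification: one shared combination of short term memory and event.
    combined = p["w_stm"] * stm + p["w_event"] * event

    # Learn gate: candidate information times an ignore factor in (0, 1).
    candidate = math.tanh(combined + p["b_learn"])
    ignore = sigmoid(combined + p["b_ignore"])
    learned = candidate * ignore

    # Forget gate: scale the old long term memory by a forget factor in (0, 1).
    forget = sigmoid(combined + p["b_forget"])
    kept = ltm * forget

    # Remember gate: simply add the two to get the new long term memory.
    new_ltm = kept + learned

    # Use gate: take what is useful from both memories as the new short term memory.
    new_stm = math.tanh(kept) * sigmoid(combined + p["b_use"])
    return new_ltm, new_stm

params = {"w_stm": 0.5, "w_event": 1.0, "b_learn": 0.0,
          "b_ignore": 0.0, "b_forget": 1.0, "b_use": 0.0}
ltm, stm = 0.0, 0.0
for event in [1.0, 0.5, -0.3]:
    ltm, stm = lstm_step(ltm, stm, event, params)
print(ltm, stm)
```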
There are many other architectures for dealing with the
Lesson 12 : Implementing RNNs and LSTMs
Begins with a review of RNNs and LSTMs
RNNs: Google translate improvement example, and the need for RNNs.
Route the output from the previous hidden layer back into the hidden layer.
LSTMs: Begins with the need due to vanishing or exploding gradients
Talks about the four gates.
Character wise RNN
Learn text one character at a time, and produce text one character at a time.
Get a probability distribution for the next character.
Sequence batching
Splitting sequences into batches of some lengths
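One possible way to do the batching, sketched in plain Python. The name `get_batches` and the exact layout (one contiguous chunk per row, targets shifted by one position) are assumptions for illustration; the course notebook may arrange things differently.

```python
def get_batches(seq, batch_size, num_steps):
    """Yield (inputs, targets) batches; targets are inputs shifted by one position."""
    per_row = (len(seq) - 1) // batch_size  # leave one item for the last target
    per_row -= per_row % num_steps          # keep only full windows
    # Each row is one contiguous chunk, plus one extra item for the shifted target.
    rows = [seq[r * per_row:r * per_row + per_row + 1] for r in range(batch_size)]
    for start in range(0, per_row, num_steps):
        x = [row[start:start + num_steps] for row in rows]
        y = [row[start + 1:start + num_steps + 1] for row in rows]
        yield x, y

batches = list(get_batches(list(range(21)), batch_size=2, num_steps=5))
for x, y in batches:
    print(x, y)
```

With 21 items, 2 sequences per batch and 5 steps, this gives two batches; each target row is the input row moved one character ahead.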
Building a character wise RNN - Anna KaRNNa
Builds LSTMs using Tensorflow
Hyperparameters
Here are the hyperparameters for the network.
batch_size - Number of sequences running through the network in one pass.
num_steps - Number of characters in the sequence the network is trained on. Larger is better typically, the network will learn more long range dependencies. But it takes longer to train. 100 is typically a good number here.
lstm_size - The number of units in the hidden layers.
num_layers - Number of hidden LSTM layers to use
learning_rate - Learning rate for training
keep_prob - The dropout keep probability when training. If your network is overfitting, try decreasing this.
Lesson 13: Hyperparameters in RNNs
How to tune hyperparameters, no magic numbers, they depend on the dataset.
Two categories
Optimizer hyperparameters - learning rate, minibatch size, number of epochs
Model hyperparameters - number of layers/units
The learning rate is the MOST IMPORTANT hyperparameter
The learning rate takes us closer to the least error
Compares choosing too big vs too small of a learning rate.
Learning rate controls all weights versus various error curves
Learning rate decay
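As a small illustration of learning rate decay, here is an exponential schedule. The specific form and the numbers are assumptions chosen for the example; other schedules (step decay, linear decay) are equally valid.

```python
def exponential_decay(initial_rate, decay_rate, decay_steps, step):
    """Learning rate after `step` updates: initial_rate * decay_rate ** (step / decay_steps)."""
    return initial_rate * decay_rate ** (step / decay_steps)

# Halve the learning rate every 1000 steps, starting from 0.1.
rates = [exponential_decay(0.1, 0.5, 1000, step) for step in (0, 1000, 2000)]
print(rates)
```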
Minibatch size
Online training (batch size=1) vs batch training (batch size=all the examples)
Small minibatches have noise, that prevents being stuck in local minima, so that is preferred.
Generally, 32 to 256 are good starting values.
Number of iterations/epochs
Use Early stopping, to stop when the validation error stops decreasing.
Number of hidden units
More units = more prone to overfitting.
Generally more the better, but too large may lead to overfitting.
RNNs hyperparameters
No clear winner between GRUs and LSTMs- try both and test
Lesson 14: Sentiment analysis with RNNs
Sentiment analysis with RNNs - movie reviews in this case
Word2Vec- words are encoded as integers, then mapped to vectors (word embeddings)
Embedding lookup layer- why do we need to do this?!THINK!
Dropout wrappers for dropout regularization
tf.nn.dynamic_rnn
Project: Recurrent Neural Network Projects
Windowing out the sequence into input/output pairs.
Write a very simple RNN sequence with LSTM.
How can we train a machine learning model to generate text automatically, character-by-character? By showing the model many training examples so it can learn a pattern between input and output.
It is a multiclass classification problem!
Lesson 16: Generative Adversarial networks - Ian Goodfellow!
Uses of GAN - to generate data
Mostly done in the field of images.
Example a description of a bird is used to generate images matching that description.
Imitation Learning
How GANs work?
Generative models
Generate images by running noise through a differentiable function (the generator network).
The training process is different from supervised learning: we show the model a bunch of images and ask it to generate similar ones.
Uses a discriminator to classify images as real or fake
Generator tries to fool the discriminator by generating fake images, and is thereby forced to get better at making realistic images.
Generator vs Discriminator- a game between these two
Discusses a bit about game theory.
Equilibrium in the GAN game - has two different players with two different costs
Generator and discriminator compete against each other.
Not always that you find the equilibrium
Tips to train GANs
Need to run two optimization algorithms at once: one for the generator loss and one for the discriminator loss
For large images, we use convolutional networks.
Different from CNNs as in CNNs we go from a larger image to a smaller image; but in GANs we go from a small feature to large images.
A project to build a GAN.
Generator which generates data and discriminator discriminates (acts as police) to call real as real and fake as fake.
tf.variable_scope and tf.trainable_variables
Calculating losses is tricky.
Label smoothing
Shows how GANs learn progressively epoch by epoch..
Lesson 17: Deep Convolution GANs
DC-GAN
Transposed Convolution - You upsample the image here
The idea is that, instead of just normalizing the inputs to the network, we normalize the inputs to layers within the network. It's called "batch" normalization because during training, we normalize each layer's inputs by using the mean and variance of the values in the current mini-batch.
Has several benefits, converges faster; can train with higher learning rates.
Batch normalization is a technique for improving the performance and stability of neural networks. The idea is
to normalize the layer inputs such that they have a mean of zero and variance of one, much like how we standardize
the inputs to networks. Batch normalization is necessary to make DCGANs work.
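The normalization step described above, sketched for a single feature across one mini-batch. `gamma` and `beta` stand in for the learnable scale and shift; at inference time frameworks typically use running averages instead of batch statistics, which this sketch leaves out.

```python
def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # eps avoids division by zero when the batch has no variance.
    return [gamma * (v - mean) / (var + eps) ** 0.5 + beta for v in values]

out = batch_norm([2.0, 4.0, 6.0, 8.0])
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out, out_mean, out_var)
```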
Project: DC Gan: Generate street signs.
Changing the generator and discriminator to be convolutional networks
Use transposed convolution, after transposed convolution, do batch normalization
For each of these layers, the general scheme is convolution > batch norm > leaky ReLU.
GANs are VERY SENSITIVE to hyperparameters.
Lesson 18: Semisupervised learning
An application of GANs
Use GANs to improve classification models.
People vs Deep Learning: people receive a lot of unlabeled data, but deep learning usually receives only labeled data.
Has both labeled and unlabeled data; e.g. can leverage the internet to get huge amounts of unlabeled data.
Train both generator and discriminator, then the generator is thrown away and the discriminator is used as a classifier
Feature matching
Notebook on GANs- street view numbers
Turn discriminator into classifier
Three sources, labeled images, unlabeled real images, fake/imaginary images.
Generator is a normal DCGan, Discriminator is a multi class classifier now
More regularization, because there are fewer labeled examples. Also leaky ReLU to allow gradients to pass through.
Feature Matching - Make sure that feature values in the test are similar to the ones generated by the generator.
Add Loss functions for both supervised and unsupervised.
Moment Matching
Lesson 20: Intro To Computer Vision
What is visual perception?
Role in AI: For example for a self driving car to see and react
In recognizing persons from images etc.
Medical images
Emotional intelligence
Computer Vision pipeline
Affectiva demo
Lesson 21: Intro to Natural Language Processing
Why is it difficult for computers to understand us?
because human language does not have a fixed structure.
Structured languages have a fixed grammar, and a parser gives up if something falls outside that grammar.
Human discourse is unstructured and complex.
Needs context.
We implicitly apply our knowledge of the physical world.
Applications of NLP
Chat bots
Challenges in NLP
Maintaining context
Lesson 22: Intro to Voice User Interfaces (VUI)
VUI Overview and pipeline
Acoustic Model, Language Model and Accent Model
VUI Applications
Speaking is faster and less distracting than typing
Alexa Demo
Alexa Skills
Computer Vision
Project : Mimic me
Use of Affectiva's API.
Task to recognize a mood and put an emoji next to it.
Image Representation and Analysis
Pre-processing
Why?
To correct images and remove noise
Enhance parts of image that are important
How
RGB to grayscale
Reduces storage size needed
Color is often not required to detect objects and interpret an image
But color is sometimes important, for example when you have to distinguish between yellow and white lane lines.
Images as functions
Color Thresholds
Used to select an area of interest.
OpenCV reads images as BGR, so always convert to RGB
The blue screen coding exercise
Color Spaces
Different color spaces- HSV for example; sometimes HSV is better to segment out objects (e.g. in the pink balloons case!)
Separate out channels - RGB/HSV
Geometric Transforms
Move pixels based on a mathematical formula, to change the perspective of an image.
Eg. scanning and aligning text.
cvtColor
Transforming Text
Straightening text out from a business card
Map from original image to warped image using geometric transformations
Guesstimate the coordinates, get the transform and apply it
Filters in images
Edge detection filters- high pass filters
Noise removal filters- low pass filters
High frequency components vs low frequency components
High pass filter example- emphasizes edges, where the intensity changes very quickly (high frequency)
Kernel MUST sum to zero (WHY?!)
Kernel and weights
How to create your own filter, using openCV and use inbuilt filters. Eg. Sobel Filter (Ah memories!)
Importance of setting thresholds
High pass filters can enhance noise, so should do low pass before doing edge detection.
Low pass filters
Noise such as speckles, no useful information; for eg. edge detection filters AMPLIFY noise.
Low pass filters use average, and should be normalized so that sum is one.
Gaussian blur - the most commonly used low pass filter.
First pass through gaussian blur (remove noise), then do edge detection.
Canny edge detector! (memories!)
Non maximal suppression (Remember Bobick!)
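A small pure-Python sketch of why the kernel sums matter, assuming a grayscale image stored as a list of rows. A high pass kernel whose weights sum to zero gives no response on a flat region, while a low pass kernel whose weights sum to one preserves overall brightness. The helper `apply_kernel` is illustrative and stands in for what a library filter call would do.

```python
def apply_kernel(image, kernel):
    """Slide a square kernel over a grayscale image (list of rows), no padding."""
    k = len(kernel)
    out = []
    for r in range(len(image) - k + 1):
        row = []
        for c in range(len(image[0]) - k + 1):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # high pass, weights sum to 0
box_blur = [[1 / 9] * 3 for _ in range(3)]      # low pass, weights sum to 1

flat = [[10] * 4 for _ in range(4)]             # constant brightness
edge = [[0, 0, 10, 10] for _ in range(4)]       # vertical edge

flat_response = apply_kernel(flat, sobel_x)
edge_response = apply_kernel(edge, sobel_x)
blurred = apply_kernel(flat, box_blur)
print(flat_response)  # [[0, 0], [0, 0]]
print(edge_response)  # [[40, 40], [40, 40]]
print(blurred)        # every value is about 10
```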
Image segmentation
To segment an image into areas
Image contouring
Useful to see connected objects and segment objects.
Done on a B&W image, after thresholding.
Contour features
Area, perimeter, orientation (based on the fitted ellipse)
Hough transform
Line detection (BOBICK! CV assignment 1)
Convert into Hough space (m and b coordinates).
Better to convert into polar coordinates
cv2 houghlines parameters
K-means clustering
Unsupervised method to break an image into segments (clusters).
Feature Extraction and Object Recognition
What is a feature?
A feature is a measurable piece of data in an image
Should be consistent across different scales, lighting etc.
Should be repeatable- very important
Types of features
Edges, corners and blobs
Corners- the most repeatable
Are more unique than other features.
Corner detector
Calculate gradient magnitude and direction
Corner has a big variation in direction and magnitude of the gradient
Dilation and erosion (AH MEMORIES!!)
Morphological operations
Remember closing and opening
Feature Vectors
Look at the direction of gradients
Eg. divide into grid and see directions of gradients.
HOG (Histogram of oriented gradients)
Use binning to separate pixels.
Orientation and magnitude of gradients - get via Sobel, then place the data into a histogram (after dividing into cells)
Should be scale and rotation invariant
HOG is also referred to as a type of feature descriptor, which is a simplified representation of an image that is made up of extracted features (that highlight important parts in an image) and that discards extraneous information. In this case the features represent the image gradient -- its magnitude and directions, which describe the shapes and patterns of intensity that make up the image.
Talks about how to create a HOG feature vector
Block normalization
Object recognition
Positive vs negative examples
Extract features and feed to training algorithm (supervised learning)
Eg. use an SVM classifier
Haar cascades
Haar features
Detect rectangular patterns like edges etc. (good for faces).
Haar cascades classify and reject non-face (irrelevant) data
Fast enough to process in real time.
Motion/Video
Video is a sequence of image frames!
Optical flow (BOBICK!)
For motion and tracking analysis
Assumptions- Pixel values do not change from frame to frame, and neighboring pixels have similar motion
Heuristic - some additional info that makes brute force act in a more intelligent manner
A* search
Tic tac toe quiz question
Every board config is a node
Pruning the search tree and adversarial search; anticipate changes in the environment
Monty Hall problem
Probability theory
What is intelligence?
Defining intelligence? Should not be based on our perception, but should be defined in the context of a task.
Definition of agent, environment and state
Perception, Action and Cognition. Reactive or behavior based agents.
Classifying AI problems
Based on environment- stochastic-deterministic; adversarial; partially observable etc.
Classifying different problems such as poker, driving on the road etc.
Rational behavior and bounded optimality
Lesson 5: Applying AI to Sudoku
Constraint Propagation and Search, simple concepts applied well
Setting up the board- Defining boxes, units and peers
Representing the sudoku puzzle as a string/dictionary
Strategy 1 : Elimination- eliminate values that cannot be there based on constraints
Strategy 2: Only choice- since every unit must contain each digit exactly once, there can be a case where only one box in a unit can take a given digit
Combine the two strategies: The concept of constraint propagation
Need to stop when the board is solved
Don't go further if there is no progress, or the solver is stalled
Recursively apply the two strategies from above
Does NOT work on hard sudoku puzzles! Which brings us to the third strategy..
Strategy 3: SEARCH
Pick boxes with fewest options- then branch out - use DFS
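A compact sketch of the board setup and the elimination strategy, assuming the puzzle is an 81-character string with '.' for empty boxes and that boxes are labelled A1 to I9. The names (`boxes`, `unit_list`, `peers`, `grid_values`, `eliminate`) are illustrative and may not match the project's own helpers; only strategy 1 is shown.

```python
rows, cols = "ABCDEFGHI", "123456789"

def cross(a, b):
    return [s + t for s in a for t in b]

boxes = cross(rows, cols)
unit_list = ([cross(r, cols) for r in rows] +
             [cross(rows, c) for c in cols] +
             [cross(rs, cs) for rs in ("ABC", "DEF", "GHI")
              for cs in ("123", "456", "789")])
# Peers of a box: every other box sharing a row, column or 3x3 square with it.
peers = {b: set(sum((u for u in unit_list if b in u), [])) - {b} for b in boxes}

def grid_values(grid):
    """Map each box to its digit, or to all nine candidates if it is empty ('.')."""
    return {b: (ch if ch in cols else cols) for b, ch in zip(boxes, grid)}

def eliminate(values):
    """Strategy 1: remove every solved box's digit from the candidates of its peers."""
    solved = [b for b in boxes if len(values[b]) == 1]
    for b in solved:
        for p in peers[b]:
            values[p] = values[p].replace(values[b], "")
    return values

grid = ("..3.2.6.." "9..3.5..1" "..18.64.." "..81.29.." "7.......8"
        "..67.82.." "..26.95.." "8..2.3..9" "..5.1.3..")
values = eliminate(grid_values(grid))
print(values["A1"])  # candidates left for the top-left box
```

Constraint propagation would repeat `eliminate` together with only-choice until nothing changes, and search would then branch on the box with the fewest remaining candidates.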
Lesson 6: Basics of Anaconda
PROJECT: SUDOKU
Lesson 8: Introduction to Game Playing
The game of isolation
Building a game tree - build a tree based on choices of move at every step
Early detection and telling the computer not to lose early, which leads us to...
The minimax algorithm
Computer tries to maximize its score, and the opponent tries to minimize it
Propagate score up the tree
Finding branches where the computer can win
Branching factor and the number of nodes one needs to visit
Depth Limited Search
The average branching factor - just try it out and see the average branching factor for a particular board setting. Even so, due to the exponential nature of the game, there are too many branches!
Need a way for the computer to choose a move quickly; given the limited processing power
The evaluation function, in this example it is comparing the nodes based on the maximum number of moves a node can have; propagating bottom-up
Quiescent Search - after which level the worst branches do not change; i.e. become quiescent
Lesson 9: Advanced Game Playing
Iterative Deepening- return the best answer found within the given time constraint; i.e. how deep along the tree you can go.
Number of nodes needed to explore based on the branching factor
Varying the branching factor as you progress along the game
Horizon effect
Using different evaluation functions and find the best one out
Alpha beta pruning algorithm
Pruning the number of nodes to look at using the parameters alpha and beta, reduces the search space
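A sketch of alpha-beta pruning on a nested-list game tree with numeric leaves. The optional `visited` list is an addition for illustration only, so that the pruning can be observed; it is not part of the algorithm.

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax value with alpha-beta pruning; `visited` records evaluated leaves."""
    if not isinstance(node, list):
        if visited is not None:
            visited.append(node)
        return node
    best = -math.inf if maximizing else math.inf
    for child in node:
        score = alphabeta(child, not maximizing, alpha, beta, visited)
        if maximizing:
            best = max(best, score)
            alpha = max(alpha, best)
        else:
            best = min(best, score)
            beta = min(beta, best)
        if alpha >= beta:
            break  # the remaining children cannot change the result
    return best

leaves = []
value = alphabeta([[3, 5], [2, 9], [0, 7]], True, visited=leaves)
print(value)   # 3
print(leaves)  # [3, 5, 2, 0]
```

The leaves 9 and 7 are never evaluated: once a branch is known to be worth at most 2 (or 0), it cannot beat the 3 already secured.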
Solving 5 by 5 isolation, use the symmetry of the board to see similar move
3 Player isolation game
Minimax doesn't work anymore. We have triplets at each level and choose values most suitable to a particular player, i.e. where its score can be maximized.
Alpha beta pruning for a 3 player isolation
Deep pruning is not possible- can do immediate and shallow pruning
Paper by Korf
Probabilistic Games
Sloppy isolation
EXPECTIMAX function - pruning in a probabilistic sense
Lesson 11: Search
What is a problem?
Examples: Route Finding
Definition of a problem
an initial state,
a set of actions in a particular state
a result
Goaltest, i.e. if this state is the GOAL or not (GOALLLLL)
Path Cost - implemented as a Step Cost function
Example Route Finding
The state space is the entire area
Frontier- the farthest states explored so far; it separates the explored region from the unexplored region
Tree Search algorithms
Breadth first search
Keeping track of explored (visited) states!
Notes about termination only when you find the best path
Uniform Cost Search
Taking into account the cost
Continue to search till you find a better path; stop once you can't improve on it any more.
Depth First Search is not optimal in this context
Then why would you use depth first search at all?! - Due to less storage requirements!
Depth First Search is also not complete
About uniform cost- need more knowledge to get to the goal faster
For example, in the route finding problem, having an estimate of the distance to the goal would help.
The A star algorithm!
Minimizing f- the sum of g and h
Minimizing g keeps the path short, and minimizing h keeps the search focused on the goal. Minimizing both components at the same time.
KEEP EXPANDING until all paths have been explored!
A star finds the best path if the heuristic function is never more than the true cost.
Should not overestimate
Is optimistic
Is admissible
Why does the optimistic function 'h' work?
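A short A* sketch on a made-up weighted graph, assuming the heuristic `h` is given as a lookup table and never overestimates the true remaining cost. The graph, node names and heuristic values are hypothetical and chosen only to show f = g + h in action.

```python
import heapq

def a_star(graph, h, start, goal):
    """Return (cost, path) of a cheapest path, expanding nodes in order of f = g + h."""
    frontier = [(h[start], 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # a cheaper route to this node was already found
        for neighbor, cost in graph[node].items():
            new_g = g + cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                heapq.heappush(frontier,
                               (new_g + h[neighbor], new_g, neighbor, path + [neighbor]))
    return None

# Hypothetical graph; h never exceeds the true remaining cost (S: 5, A: 4, B: 2).
graph = {"S": {"A": 1, "B": 4}, "A": {"B": 2, "G": 6}, "B": {"G": 2}, "G": {}}
h = {"S": 4, "A": 3, "B": 2, "G": 0}
cost, path = a_star(graph, h, "S", "G")
print(cost, path)  # 5 ['S', 'A', 'B', 'G']
```

Note that the direct edge A to G (cost 6) is put on the frontier but never chosen, because a cheaper route through B is popped first.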
State Spaces
The Vacuum Example
The number of states is TOO DAMN HIGH!
Sliding Blocks Example
Finding an appropriate heuristic function
What is an admissible heuristic function? Comparing two good heuristic functions
Can we automatically come up with a heuristic function? Can come up with heuristic by defining the problem in words.
Generating a relaxed problem
Problems with search
Constraints
Domain must be fully observable
Domain must be known
Must be discrete
Deterministic
Static
Notes on implementation
Nodes and Paths
PACMAN project
VERY NICE AND DIFFICULT!
Implementing all search algos on PACMAN
Lesson 12: Simulated annealing
Solving a problem by adding some simple intelligence
Travelling Salesman Problem
NP Hard
N Queens Problem
Arrange in a way to have no more attacks
The heuristic function for N Queens-iterate and reduce number of attacks with each move
Local Minima - you get stuck! (But it is still solvable)
Hill Climbing
Fewer dimensions- Local maximum problem
Random Restart!- take maximum of all local peaks
Tabu Search algo
Step Size- too small vs too large
Start with a large step size, then reduce it to make sure to reach the minima
Simulated Annealing
Introduction to physical annealing
Heating and cooling to get out of local minima
Iterate to find a better position
Start with higher randomness, gradually reducing randomness ( Vary T from very large to very small)
GUARANTEED to converge to the global maximum, provided the temperature is lowered slowly enough!
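A bare-bones simulated annealing sketch for maximization. The geometric cooling schedule, the default temperatures, and the toy objective `bumpy` are all assumptions made for this example; the lab uses its own problem and schedule.

```python
import math
import random

def simulated_annealing(score, start, neighbor,
                        t_start=10.0, t_end=1e-3, cooling=0.995, seed=0):
    """Maximize `score`; worse moves are accepted with probability exp(delta / T)."""
    rng = random.Random(seed)
    current = best = start
    t = t_start
    while t > t_end:
        candidate = neighbor(current, rng)
        delta = score(candidate) - score(current)
        # Always accept improvements; accept worse moves more rarely as T falls.
        if delta > 0 or rng.random() < math.exp(delta / t):
            current = candidate
        if score(current) > score(best):
            best = current
        t *= cooling  # geometric cooling schedule (an assumption)
    return best

def bumpy(x):
    # Hypothetical objective with many local maxima on the integers 0..99.
    return -((x - 70) ** 2) / 100 + 3 * math.sin(x)

def step(x, rng):
    # Move one position left or right, staying inside 0..99.
    return min(99, max(0, x + rng.choice([-1, 1])))

result = simulated_annealing(bumpy, start=5, neighbor=step)
print(result, bumpy(result))
```

The function returns the best state seen along the way; a run this short only illustrates the mechanics and says nothing about the convergence guarantee.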
Local Beam Search - keeps track of K particles
Genetic Algorithms
Survival of the fittest - breeding and mutation.
Crossover - children get good aspects of their parents through natural selection.
What if a Critical piece gets eliminated?- Solved by having more randomness
Without mutation, we might NEVER reach the goal!
Simulated annealing lab
Implement simulated annealing
Nice assignment! All functions were kind of challenging.
Constraint Satisfaction
Constraint Graph
Map Coloring Constraint Problem
Constraints can be unary, binary or have even more variables.
Backtracking Search
Improving efficiency, use the least constraining value
Use the most constrained variables - solve more constraints sooner; or the minimum remaining values
Forward Checking - maintain a map of all possible values for a variable
Arc consistency
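A sketch of backtracking search on the map coloring problem, using the minimum remaining values idea to pick the next variable. It assumes the usual Australian states and adjacency; least constraining value ordering and arc consistency are left out to keep it short.

```python
def backtrack(assignment, variables, domains, neighbors):
    """Return a complete consistent assignment, or None if there is none."""
    if len(assignment) == len(variables):
        return assignment

    def legal(var):
        # Colors not already used by an assigned neighbor.
        return [c for c in domains[var]
                if all(assignment.get(n) != c for n in neighbors[var])]

    # Most constrained variable first: fewest legal values remaining.
    var = min((v for v in variables if v not in assignment),
              key=lambda v: len(legal(v)))
    for color in legal(var):
        result = backtrack({**assignment, var: color}, variables, domains, neighbors)
        if result is not None:
            return result
    return None  # dead end: undo and try another value higher up

neighbors = {
    "WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"], "Q": ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"], "T": [],
}
variables = list(neighbors)
domains = {v: ["red", "green", "blue"] for v in variables}
solution = backtrack({}, variables, domains, neighbors)
print(solution)
```

Tasmania has no neighbors, so it is an independent subproblem and any color works for it.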
Structured CSPs
Break into independent subproblems - the Tasmania example
Challenge Question - TWO + TWO = FOUR crypto question
Constraint Satisfaction Lab
Logic and Reasoning
Propositional Logic
Representing events as True or False, or relation between them.
Truth Tables and the symbols used
Valid, Satisfiable and Unsatisfiable
Limitations
Can only handle true and false; not probability
Cannot talk about objects or relationships between them
First order logic
Comparison with propositional logic and probability theory
Represents relationships between objects; more complex models
Defining the models; objects, functions and constants
Talk about syntax
sentences, terms, quantifiers
Representing the vacuum world as first order logic
Questions for First Order Logic: Practice. Representing English statements as First Order Logic. - NICE QUESTIONS
Planning
Just planning is NOT enough, need feedback to rightly execute and finish the task.
Environment is stochastic, and there are other agents too ; can't know this info beforehand
Partial observability
Some unknown
BELIEF states instead of WORLD states
Example with vacuum cleaner, what if the vacuum sensors break down?!
Successful plans!
Mathematical formulation for successful plans
Tracking the predict-update cycle
Describing in terms of variables
Belief state space
Sensorless vacuum example!
Conformant plans - where we do not know everything about the world, but we will still reach our goal!
Partially observable vacuum example
Act-observe cycle
Actions increase uncertainty, and observations bring it down!
Can't guarantee ALWAYS!
Infinite sequences!
Classical Planning
Assign all values to K boolean variables - State Space
World state- complete assignment
Belief state- complete assignment or partial assignment
Actions and preconditions
Example of the fly action schema*
Progression state space search vs Regression state space search
Regression starts from the goal
Progression starts from the initial state
When is it better to search backwards vs forwards?
Plan Space Search
Search through plans
Forward search is the MOST POPULAR
Importance of heuristics
Situation Calculus
Successor state axiom
Lesson 16:Probability - SEBASTIAN THRUN IS BACK!
Intro to Probability and Bayes Network
A network of reasons- the car won't start example
Car won't start- battery won't start/battery won't charge- and so on and so forth in reasons.
Come up with a sort of a dependency graph for various variables
16 variables in this structure, so 2^16 values
Specify..observe..compute
Assumption that every event is discrete/binary
Probability concepts
Complementary and joint probability
Concept of dependence and conditional probability
Total Probability- CANNOT NEGATE the conditional variable!
Some quizzes on these concepts.
Bayes Rule!
Likelihood, prior and marginal likelihood
Lesson 17: Bayes Networks
A and B - A is not observable, B is observable
Diagnostic Reasoning
Computing Bayes Rule
The denominator (total probability for B is HARD to compute)
So we just use the unnormalized terms, and then adding them to get the normalized version
Nice way to calculate probabilities in the quizzes!
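The unnormalized trick can be sketched for a binary hidden variable A and an observed B. The probabilities below are illustrative numbers, not values taken from these notes.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return (P(A|B), P(not A|B)) by normalizing the two unnormalized terms."""
    unnorm_a = p_b_given_a * prior_a
    unnorm_not_a = p_b_given_not_a * (1 - prior_a)
    total = unnorm_a + unnorm_not_a  # this sum is P(B), so no separate computation is needed
    return unnorm_a / total, unnorm_not_a / total

# Illustrative numbers: P(A) = 0.01, P(B|A) = 0.9, P(B|not A) = 0.2.
p_a, p_not_a = posterior(prior_a=0.01, p_b_given_a=0.9, p_b_given_not_a=0.2)
print(round(p_a, 4), round(p_not_a, 4))  # 0.0435 0.9565
```

Even with a strong likelihood, the small prior keeps the posterior for A low, which is the usual surprise in diagnostic reasoning.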
Conditionally independent
Absolute independence does not imply conditional independence, and vice versa; neither can be deduced from the other.
Confounding case
Two causes affect an observable variable
Explaining away effect
If an effect can be caused by multiple causes, then observing that one of those causes is true can explain away the effect, making the other causes less likely.
DIFFICULT QUESTIONS IN QUIZ on this effect!
Defining Bayes Networks
A graph explaining probability relationships between various event
Joint probability is defined by factoring in conditional relationships etc.
A node with K inputs requires 2^k variables (parameters) to define
Bayes networks reduce the number of params needed by quite a lot! So very useful
D-separation
Any two variables are independent if they are not linked by just unknown variables
Two independent variables affecting a common variable; if we know about that variable, then these variables become dependent, the explaining away effect
Lesson 18: Inference in Bayes Nets
Evidence variables, Hidden Variables and Query Variables
Output is a joint probability distribution over the query variables, given the evidence variables
Which query variable is the most likely?!
Can also go in opposite direction- reverse evidence variables and query variables.
Inference by Enumeration
Enumerate over all the hidden variables
Speeding up enumeration - Maximize independence (determine through the bayes network)
Causal direction- easier to do inference when the graph goes from causes to effects
Variable elimination
Divide into smaller parts..enumerate..then combine
Join factors to form larger factors and then eliminate variables
Approximate Inference and Sampling
Estimating by sampling and performing experiments
Advantage- no complex computations; simulation does not need conditional probability tables
Sprinkler Rain example
With an infinite number of samples, we approach the true probabilities
Rejection Sampling - only keep the samples that match the scenario we want to compute
Can end up rejecting a lot of samples...for eg. Burglaries and earthquakes are very infrequent
Likelihood weighting - add a probabilistic weight to each sample, according to the probability of the conditions
Does not solve all our problems though...so Gibbs Sampling - takes all the evidence into account - MCMC, samples depend on each other
Monty Hall Problem Example
Learning more about a door changes probability
Monty Hall Letter
Lesson 19: Hidden Markov Models
Pattern Recognition through time
Dolphin communication problem
'Delta Frequency'
Time warping- should not matter if a whistle is quick or drawn out longer in time
Dynamic Time Warping
Matching two signals sample-wise
Try to keep to the diagonal as much as possible
Could get false positives- matching signals that are not actually similar
Bound how much we can deviate- The Sakoe Chiba bound
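A sketch of dynamic time warping with an optional band around the diagonal, in the spirit of the Sakoe Chiba bound. The parameter name `band` and the toy signals are assumptions for illustration.

```python
def dtw_distance(a, b, band=None):
    """Cumulative DTW cost between two sequences; `band` limits |i - j|."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue  # outside the allowed band around the diagonal
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

quick = [1, 3, 1]            # a short whistle
slow = [1, 1, 3, 3, 1, 1]    # the same shape, drawn out in time

print(dtw_distance(quick, slow))          # 0.0
print(dtw_distance(quick, slow, band=1))  # inf
print(dtw_distance(quick, slow, band=3))  # 0.0
```

Without a bound the two whistles match perfectly; a tight band forbids the warping needed and the match is rejected, which is how the bound cuts down on false positives.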
Hidden Markov Models
Pattern recognition through time
Representing markov models
Self transition
Application: Sign Language Recognition
HMM for "I" vs "We"
Viterbi Trellis
Eliminating by the constraints
Many options in the middle
The Viterbi Path
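A compact Viterbi sketch that fills the trellis column by column and returns the most likely state path. The two-state model below is a generic textbook-style toy with made-up probabilities; it is not the sign language model from the lesson.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # Each trellis cell stores the best probability of reaching a state, and its path.
    trellis = [{s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, path = max(
                ((trellis[-1][prev][0] * trans_p[prev][s], trellis[-1][prev][1])
                 for prev in states),
                key=lambda item: item[0])
            column[s] = (prob * emit_p[s][obs], path + [s])
        trellis.append(column)
    return max(trellis[-1].values(), key=lambda item: item[0])

# Toy model with illustrative probabilities.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

prob, path = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path)  # ['Healthy', 'Healthy', 'Fever']
print(prob)  # about 0.01512
```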
Theory on HMMs and Phrase recognition- LOST due to Github error
Context Training
Using context in phrases- eg. combine models for I and need - Coarticulation
What is a neural network? Housing price prediction model.
Neural networks and Supervised Learning; and types of neural networks-
Structured Data vs Unstructured Data
Why is deep learning taking off?
Because of Scale! (more and more data)
NNs performance generally increases with more data
Faster Computation
Week 2: Logistic Regression as a Neural Network
Binary Classification
Logistic Regression
Loss Function and the Cost function- The benefits of choosing a convex function for a loss function.
Gradient Descent and finding the minima
A refresher on derivatives
Computation Graph,
derivatives with computation graph- excellent video! - Chain rule
Gradient descent using logistic regression- minimizing the loss function.
Updating the weights using the backward propagation step.
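A minimal logistic regression trained by gradient descent, with one feature and explicit loops so that the forward and backward steps are visible. The data, learning rate and epoch count are arbitrary choices for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.5, epochs=2000):
    """Fit w and b by gradient descent on the average cross-entropy cost."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        # Forward propagation: predictions a = sigmoid(w * x + b).
        preds = [sigmoid(w * x + b) for x in xs]
        # Backward propagation: dJ/dw and dJ/db for the cross-entropy cost.
        dw = sum((a - y) * x for a, y, x in zip(preds, ys, xs)) / m
        db = sum(a - y for a, y in zip(preds, ys)) / m
        # Gradient descent update.
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
print(w, b, sigmoid(w * 2.0 + b), sigmoid(w * -2.0 + b))
```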
Vectorization
Removing for loops- to improve the run time. Eg. np.dot to get the dot product.
Try to avoid for loops when you can. Many functions in numpy to do so!
A logistic regression without any for loop
Doing the backward and forward propagation steps without any for loops, using numpy
Broadcasting in python/numpy
how python/numpy treats arrays of different sizes.
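A short illustration of both ideas, assuming NumPy is installed. The first part replaces an explicit loop with `np.dot`; the second uses broadcasting to divide a (3, 4) matrix by its (1, 4) column sums. The numbers in `A` are only example data.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
x = rng.standard_normal(1000)

# Explicit loop versus a single vectorized call.
loop_result = 0.0
for i in range(len(w)):
    loop_result += w[i] * x[i]
vector_result = np.dot(w, x)

# Broadcasting: the (1, 4) row of column sums is stretched across all 3 rows.
A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])
percent = 100 * A / A.sum(axis=0, keepdims=True)

print(np.isclose(loop_result, vector_result))
print(percent.round(1))
```

Using `keepdims=True` keeps the sum as shape (1, 4) rather than (4,), which makes the intended broadcast explicit.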
PROJECT: Logistic Regression Model to recognize cats
Preprocessing steps
Use assertions for size and shape of numpy arrays
Nice assignment!- implementing a NN yourself from scratch.
Key Takeaways from the assignment
Preprocessing the dataset is important.
You implemented each function separately: initialize(), propagate(), optimize(). Then you built a model().
Tuning the learning rate (which is an example of a "hyperparameter") can make a big difference to the algorithm.
You will see more examples of this later in this course!
A discussion (optional exercise) on the importance of choosing a good learning rate!
Different learning rates give different costs and thus different predictions results.
If the learning rate is too large (0.01), the cost may oscillate up and down. It may even diverge (though in this example, using 0.01 still eventually ends up at a good value for the cost).
A lower cost doesn't mean a better model. You have to check if there is possibly overfitting. It happens when the training accuracy is a lot higher than the test accuracy.
In deep learning, we usually recommend that you:
Choose the learning rate that better minimizes the cost function.
If your model overfits, use other techniques to reduce overfitting. (We'll talk about this in later videos.)
Week 3: Shallow Neural Network
Overview of neural networks, comparison with logistic regression.
Neural networks with a single hidden layer.
Introduction to hidden layer.
Superscript notations etc.
Computation using a neural network
Logistic Regression multiple times
Vectorization
Vectorization across multiple examples
Justification for the implementation
Activation functions
Hyperbolic tan ( tanh )- why is this better?
Why use tanh rather than sigmoid for the hidden layers? (Andrew Ng says tanh is almost always superior to sigmoid; the exception is the output layer of a binary classifier, where sigmoid is used)
The weights of a neural network should be initialized to random values (WHY NOT ZERO? What's the problem?)- Symmetry Breaking Problem
Project: Planar data classification with one hidden layer
Logistic regression does not do well because the data is not linearly separable
Reminder: The general methodology to build a Neural Network is to:
1. Define the neural network structure ( # of input units, # of hidden units, etc).
2. Initialize the model's parameters
3. Loop:
- Implement forward propagation
- Compute loss
- Implement backward propagation to get the gradients
- Update parameters (gradient descent)
BE CAREFUL WITH THE SIZE OF THE MATRICES!!
The importance of a good converging learning rate
The larger models (with more hidden units) are able to fit the training set better,
until eventually the largest models overfit the data.
The best hidden layer size seems to be around n_h = 5. Indeed, a value around here
seems to fit the data well without also incurring noticeable overfitting.
You will also learn later about regularization, which lets you use very
large models (such as n_h = 50) without much overfitting.
Week 4: Deep Neural Network
What is a deep neural network? Notations etc.
Forward Propagation
FOCUS on MATRIX DIMENSIONS- Working through the matrix dimensions of a deep neural network. Think about the dimensions of the weight matrix and the bias vector at every step.
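One way to work through the dimensions is to tabulate them per layer. The sketch below assumes the course convention W[l]: (n_l, n_(l-1)), b[l]: (n_l, 1), Z[l] and A[l]: (n_l, m); the helper name layer_shapes and the example layer sizes are hypothetical.

```python
def layer_shapes(layer_dims, m):
    """Return the expected shapes for every layer of an L-layer network.

    layer_dims = [n_x, n_1, ..., n_L], m = number of examples.
    """
    shapes = []
    for l in range(1, len(layer_dims)):
        n_l, n_prev = layer_dims[l], layer_dims[l - 1]
        shapes.append({
            "layer": l,
            "W": (n_l, n_prev),   # weight matrix
            "b": (n_l, 1),        # bias column, broadcast across the m examples
            "Z": (n_l, m),        # pre-activation
            "A": (n_l, m),        # activation
        })
    return shapes

shapes = layer_shapes([12288, 20, 7, 5, 1], m=209)
for s in shapes:
    print(s)
```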
Why need a DEEP network? Great video!
Circuit Theory and Deep Learning
Building blocks of deep learning networks - going through a backward and forward propagation, layer by layer
Hyperparameters vs Parameters- deep learning is an empirical process, wash-rinse-repeat.
Is there a relation between the brain and deep learning? (Spoiler Alert: Not a whole lot)
Project: Building your deep neural network
* Implementing a L layer neural network from scratch- both backward and forward
Project: Using above project to detect cats vs non cats
* ALWAYS resize all images to the same size before feeding to the network.
* 2 Layer vs L Layer network, try different values of L
* **Vectorization helps a LOT with the speed**
* Causes of mis-prediction
Course 2: Improving Deep Neural Networks, Hyperparameter tuning, regularization
Week 1: Practical Aspects of Deep Learning
Setting up your train/dev/test sets
It is an iterative LEARNING process!
Why need a test/valid sets?
Bias variance tradeoff
Parameters to analyze bias and variance (Overfitting vs underfitting) - See the error!
Basic recipe for machine learning/deep learning?
Question 1: Does the model have a high bias? See the train set performance.
Question 2: Does the model have a high variance? See the dev/validation set performance.
Rinse and repeat
Through deep learning, it has been possible to somehow reduce bias variance tradeoff, i.e. you can bring one down without affecting the other
Regularization
L2 regularization, L1 regularization, lambda is the regularization parameter, Frobenius norm for w
L2 regularization is also called weight decay (Why?, Remember HDP!)
Why does regularization prevent overfitting?
Dropout regularization - Keep a hidden unit for training with some probability; Inverted Dropout
The intuition behind dropout- great video! Because a node knows that an input (feature) can go away randomly, it spreads out weights across features.
Can change the probability of drop out (keep_prob), by layers, For example, layers with more nodes can have a higher probability of dropout. Drawback- The loss function is a bit undefined here, so hard to debug if it is monotonically decreasing with epochs
Other regularization methods - Data Augmentation, Early Stopping
Orthogonalization - Think of one problem at a time, machine learning funda by Andrew Ng.
Normalization
Normalize the mean and the variance of the input features: subtract the mean and make the variance of every feature 1. Why normalize?
Vanishing and Exploding gradients problem!
Gradient checking- to check your implementation
Only use for debugging, not for training; does not work with dropout.
Assignments: Initialization
Comparing three types of initialization for weights, zeros initialization vs random vs He initialization
Zero initialization is VERY BAD! The cost function does not even go down with iterations. Why?
In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, and you might as well be training a neural network with $n^{[l]} = 1$ for every layer, and the network is no more powerful than a linear classifier such as logistic regression.
What you should remember:
The weights $W^{[l]}$ should be initialized randomly to break symmetry.
It is however okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly.
Random initialization- good, but not great
The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when $\log(a^{[3]}) = \log(0)$, the loss goes to infinity.
Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm.
If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.
He initialization
Based on a paper, it does great!
WHAT TO REMEMBER
What you should remember from this notebook:
Different initializations lead to different results
Random initialization is used to break symmetry and make sure different hidden units can learn different things
Don't initialize to values that are too large
He initialization works well for networks with ReLU activations.
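A sketch of He initialization, assuming the usual formulation where each weight is drawn from a standard normal and scaled by sqrt(2 / fan_in). The function name initialize_he, the seed and the layer sizes are illustrative, not taken from the assignment.

```python
import numpy as np

def initialize_he(layer_dims, seed=3):
    """He initialization: W[l] ~ N(0, 2 / n_(l-1)), b[l] = 0."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        params["W" + str(l)] = (
            rng.standard_normal((layer_dims[l], fan_in)) * np.sqrt(2.0 / fan_in)
        )
        # Zero biases are fine: the random weights already break symmetry.
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_he([1000, 500, 1])
w1_std = params["W1"].std()   # expected to be close to sqrt(2 / 1000)
```

The scaling keeps the weights from being "too large", which is the failure mode described for plain random initialization above.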
Assignments: Regularization
Project to see where the french goalkeeper should kick to reach his team's players.
Implement regularization- overfitting reduces and test set accuracy goes up after regularization
What you should remember -- the implications of L2-regularization on:
The cost computation:
A regularization term is added to the cost
The backpropagation function:
There are extra terms in the gradients with respect to weight matrices
Weights end up smaller ("weight decay"):
Weights are pushed to smaller values.
Implement dropout
Apply a mask with probabilities to the activations in forward propagation and to the gradients in backpropagation, and divide by the probabilities to scale the result.
What you should remember about dropout:
Dropout is a regularization technique.
You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
Apply dropout both during forward and backward propagation.
During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
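The points above can be sketched for a single layer as follows. The helper names dropout_forward and dropout_backward are hypothetical; the same mask D is reused in both directions and both are divided by keep_prob.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout: zero out units, then rescale by 1 / keep_prob."""
    D = rng.random(A.shape) < keep_prob     # True where the unit is kept
    A_drop = A * D / keep_prob
    return A_drop, D

def dropout_backward(dA, D, keep_prob):
    """Apply the same mask and the same scaling to the gradient."""
    return dA * D / keep_prob

rng = np.random.default_rng(1)
A = np.ones((50, 2000))
A_drop, D = dropout_forward(A, keep_prob=0.8, rng=rng)
mean_after = A_drop.mean()                  # stays close to the original mean of 1
```

At test time neither function is called; the activations are used as they are.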
Assignments: Gradient Checking
Small exercise to check and verify the gradient calculation.
What you should remember from this notebook:
Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
Gradient checking is slow, so we don't run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.
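A minimal sketch of the check, assuming a flat parameter vector and a two-sided difference. The cost function here is a toy cubic, and bad_grad is deliberately wrong so that the check has something to catch; none of these names come from the assignment.

```python
import numpy as np

def gradient_check(cost_fn, grad_fn, theta, epsilon=1e-7):
    """Relative difference between the analytic and the numerical gradient."""
    grad = grad_fn(theta)
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += epsilon
        minus[i] -= epsilon
        # Two-sided difference: (J(theta + eps) - J(theta - eps)) / (2 * eps)
        grad_approx[i] = (cost_fn(plus) - cost_fn(minus)) / (2 * epsilon)
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator

cost = lambda t: np.sum(t ** 3)
good_grad = lambda t: 3 * t ** 2
bad_grad = lambda t: 2 * t ** 2        # deliberately buggy "backprop"

theta = np.array([1.0, -2.0, 0.5])
diff_ok = gradient_check(cost, good_grad, theta)
diff_bad = gradient_check(cost, bad_grad, theta)
```

A very small difference suggests the analytic gradient is right; a large one (as with bad_grad) points to a bug.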
Week 2: Optimization Algorithms
Used to speed up neural networks and make them practical
Mini-batch gradient descent
The term epochs
The loss function does not necessarily decrease on every iteration, as it does with normal (batch) gradient descent
Choosing the batch size- The extreme case of stochastic gradient descent vs batch gradient descent
What is a small batch size? How to choose a batch size?
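A sketch of shuffling and partitioning into mini-batches, assuming one example per column. The function name random_mini_batches, the seed and the default batch size of 64 are assumptions; the last batch is simply smaller when m is not a multiple of the batch size.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the columns of X and Y together, then cut them into batches."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]      # same permutation for both
    return [
        (X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
        for k in range(0, m, batch_size)
    ]

# Toy data (made up): 148 examples, so the batches hold 64, 64 and 20 columns.
X = np.arange(2 * 148, dtype=float).reshape(2, 148)
Y = np.arange(148).reshape(1, 148)
batches = random_mini_batches(X, Y, batch_size=64)
```

Setting batch_size=1 gives stochastic gradient descent and batch_size=m gives batch gradient descent, the two extremes mentioned above.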
Exponentially weighted averages
Approximately how many data points are taken into account, with respect to the value of beta? (Roughly 1 / (1 - beta).)
How to compute?
Bias correction!- Important during the initial phase of learning
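A small sketch of the exponentially weighted average with and without bias correction, assuming the update v = beta * v + (1 - beta) * x and the correction v / (1 - beta^t). The constant input series is made up so that the effect of the correction is easy to see.

```python
def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a sequence, starting from v = 0."""
    v = 0.0
    out = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        # Without correction the early averages are biased towards 0.
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

temps = [10.0] * 50                       # made-up constant series
raw = ewa(temps, bias_correction=False)   # starts far below 10
corrected = ewa(temps)                    # close to 10 from the first step
window = 1 / (1 - 0.9)                    # roughly how many points are averaged
```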
Mismatched train/test distribution- maybe training set comes from cat pictures on web pages, however you test on low resolution pics uploaded by people
Gradient Descent with Momentum
A new hyperparameter, Beta
Optimization Algorithms
RMSProp
Adam (Adaptive Moment Estimation) - RMSProp + Gradient Descent with momentum
* Learning Rate Decay
Why to decay learning rate(alpha)?
Decay Formulas in terms of epochs- Exponential Decay
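Two of the decay schedules can be sketched as below, assuming the forms alpha = alpha0 / (1 + decay_rate * epoch) and alpha = alpha0 * base^epoch. The function names and the example values are illustrative.

```python
def inverse_decay(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch)"""
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(alpha0, base, epoch):
    """alpha = alpha0 * base ** epoch, with 0 < base < 1"""
    return alpha0 * base ** epoch

inv_schedule = [inverse_decay(0.2, 1.0, e) for e in range(5)]
exp_schedule = [exponential_decay(0.2, 0.95, e) for e in range(5)]
```

Both start at alpha0 and shrink with every epoch, so the steps are large early on and smaller near the minimum.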
* The problem of local optima!
Saddle points, plateaus
In a high dimensional place, it is very unlikely to get stuck in a local optima (Why? REMEMBER HDP!)
Solving the problem through better initialization - by constraining the mean and the variance
Numerical approximation of gradients- Two side difference vs one side difference
Project: Trying out different optimization algorithms, mini-batch gradient descent
Stochastic Gradient Descent vs Batch Gradient Descent
Shuffling and partitioning to get batches, the size of the batch (power of 2)
The larger the momentum β is, the smoother the update because the more we take the past gradients into
account. But if β is too big, it could also smooth out the updates too much.
Implement Adam yourself, implement correction formulas for s and v params.
Observe the loss decay with/without momentum- Adam is VERY GOOD!
Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations you see in the cost come from the fact that some minibatches are more difficult than others for the optimization algorithm.
Adam on the other hand, clearly outperforms mini-batch gradient descent and Momentum. If you run the model for more epochs on this simple dataset, all three methods will lead to very good results. However, you've seen that Adam converges a lot faster.
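A sketch of one Adam update with the bias-correction formulas for v and s, assuming the commonly used defaults beta1 = 0.9, beta2 = 0.999 and epsilon = 1e-8. The function name adam_step and the toy objective (the sum of squared weights) are assumptions for illustration.

```python
import numpy as np

def adam_step(w, grad, v, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    v = beta1 * v + (1 - beta1) * grad           # momentum term
    s = beta2 * s + (1 - beta2) * grad ** 2      # RMSProp term
    v_hat = v / (1 - beta1 ** t)                 # bias correction for v
    s_hat = s / (1 - beta2 ** t)                 # bias correction for s
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Toy objective: J(w) = sum(w ** 2), so the gradient is 2 * w.
w = np.array([5.0, -3.0])
v, s = np.zeros_like(w), np.zeros_like(w)
w_after_first = None
for t in range(1, 2001):
    w, v, s = adam_step(w, 2 * w, v, s, t)
    if t == 1:
        w_after_first = w.copy()
```

On the first step the corrected ratio is about the sign of the gradient, so each weight moves by roughly lr regardless of the gradient's scale.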
A shit ton of hyperparameters! Rank them by importance.
Instead of a grid search for best hyperparameters, do a random search because you do not know what the most important hyperparameter is, and want to try out a lot more values.
Picking an appropriate scale is important - Eg. a log scale for the learning rate.
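A sketch of sampling a learning rate on a log scale, assuming a search range of 1e-4 to 1 (the range and the helper name sample_log_uniform are illustrative). The exponent is sampled uniformly, so each decade gets about the same number of trials.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Sample so that the exponent, not the value, is uniform."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent

rng = random.Random(0)
samples = [sample_log_uniform(1e-4, 1.0, rng) for _ in range(1000)]
# About half of the samples land below 1e-2, the midpoint on the log scale.
share_below_1e2 = sum(x < 1e-2 for x in samples) / len(samples)
```

Sampling uniformly between 1e-4 and 1 instead would put about 90% of the trials between 0.1 and 1.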
Hyperparameter tuning in practice: babysitting one model (Panda) vs training multiple models in parallel (Caviar). Depends on how much computational resources you have.
Batch Normalization
Also normalizing the activations of the hidden units. Implementing this with more parameters to tune the mean and variance of your hidden layer activations.
Fitting batch norm into a deep neural network. Convert Z to a normalized Z, then apply activation. Additional parameters are added to apply normalization at every layer.
Working with mini batches.
Why does Batch Norm Work? IMPORTANT VIDEO!- Eg. if you have a model that detects cats vs non cats on black cats; and now you want to use the same model for colored cats. Covariate shift. Even though the input values change, the mean and variance remains the same. It limits the amount by which the earlier layers' outputs change. Allows each layer to learn by itself independently of the earlier layers.
Also has a slight regularization effect.
Batch norm at test time. Estimate the mu and sigma-squared by estimating exponentially weighted averages across all batches.
Multi Class Classification
Softmax regression, a generalization of logistic regression. Suppose we have n classes: the output layer is an n x 1 layer, giving an n x 1 vector with the probabilities that the input belongs to each of the n classes.
Training a softmax classifier - hardmax vs softmax. Defining the loss function.
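A sketch of softmax, hardmax and the cross-entropy loss, assuming one example per column and one-hot labels. Subtracting the column maximum before exponentiating is a common numerical-stability trick that does not change the result; the scores in Z are made up.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (n_classes, m)."""
    Z_shift = Z - Z.max(axis=0, keepdims=True)   # stability; same output
    E = np.exp(Z_shift)
    return E / E.sum(axis=0, keepdims=True)

def cross_entropy(A, Y):
    """Mean over examples of -sum_j y_j * log(a_j), with one-hot Y."""
    return -np.mean(np.sum(Y * np.log(A), axis=0))

Z = np.array([[5.0, 0.0],
              [2.0, 2.0],
              [-1.0, 1.0],
              [3.0, -1.0]])                      # 4 classes, 2 examples
A = softmax(Z)                                   # soft probabilities
hardmax = (A == A.max(axis=0, keepdims=True)).astype(int)   # 1 at the max only
loss = cross_entropy(A, hardmax)                 # labels set to the predictions
```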
Introduction to Programming Frameworks
How to choose your framework? Ease of programming, running speed and truly open.
Using TensorFlow
Assignments: Tensor Flow to detect signs: multivariate classification
remember to initialize your variables, create a session and run the operations inside the session.
Placeholders- just define the shape now, value later. Defining simple operations, and getting results using session.run()
Pass placeholder values using feed_dict
One Hot Encoding
SIGNS dataset: Normalize and flatten the image dataset. Why use 'None' in the placeholder? Using Xavier initialization to init the parameters.
Running on mini batches, with optimizer defined by TensorFlow.
Think about the session as a block of code to train the model. Each time you run the session on a minibatch, it trains the parameters. In total you have run the session a large number of times (1500 epochs) until you obtained well trained parameters.
What you should remember:
Tensorflow is a programming framework used in deep learning
The two main object classes in tensorflow are Tensors and Operators.
When you code in tensorflow you have to take the following steps:
Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...)
Create a session
Initialize the session
Run the session to execute the graph
You can execute the graph multiple times as you've seen in model()
The backpropagation and optimization is automatically done when running the session on the "optimizer" object.
Course 3: Structuring your machine learning projects
Week 1: Introduction to ML Strategy
Strategies to analyze a problem and coming up with ideas that one should try to improve the performance.
Orthogonalization - Adjust one knob to adjust one parameter, to solve one problem - The TV knob analogy and the car analogy.
Chain of assumptions in Machine Learning and different knobs to say improve performance on train/dev set.
Andrew Ng does not recommend early stopping, as it is a knob that affects multiple things at once.
Setting up your goal
Set a SINGLE NUMBER for metrics- precision and recall- but these are two numbers, and you ideally need one number. ENTER THE F1 score!
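A small sketch of collapsing precision and recall into one number with the F1 score, the harmonic mean of the two. The two classifiers and their precision/recall values are made-up examples.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per classifier; values are illustrative.
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}
scores = {name: f1_score(p, r) for name, (p, r) in classifiers.items()}
best = max(scores, key=scores.get)
```

B has the better precision and A the better recall, so neither number alone picks a winner; the single F1 number does.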
Satisficing and optimizing metrics - metrics that satisfy, for example time
As a general rule of thumb, out of N metrics, pick one to be optimizing and (N-1) to be satisficing.
How to set up your test set and dev sets
Dev and test set should come from the same distribution.
Size of the test and dev set.
Test set should be good enough to give you confidence.
When to change your metrics/dev set/test set
Halfway through you solving the problem, metrics might change based on goals. Defining a new evaluation metric, to tell which algorithm is better for your problem.
Orthogonalization- Defining the metric is one step, and doing well on it is another step.
If a metric that says you are doing well on your dev/test, but does not reflect well on your application; CHANGE THE METRIC!
Comparing to human level performance
Bayes optimal error- the best theoretically possible error, there is no way to surpass this in terms of performance.
Progress tends to be fast while your algorithm is doing worse than human level performance, and slows down once it gets past that level.
Avoidable Bias
Think of human error as an estimate of Bayes error, as a baseline (esp. in computer vision tasks). The gap between the training error and this baseline is the avoidable bias; keep improving the fit until the training error gets close to the baseline.
Understanding human level performance
How to define it? The medical image classification example. Reduce bias or reduce variance?
Difference between human baseline and training error = measure of (avoidable) bias; and difference between training error and dev error = measure of variance
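The arithmetic behind the "reduce bias or reduce variance?" question can be sketched as follows, using human-level error as the stand-in for Bayes error. The function name diagnose and the error values (in percent) are illustrative.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable bias, variance, which one to work on first)."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

# Same train/dev errors, two different human baselines (values in percent).
case_1 = diagnose(human_error=1.0, train_error=8.0, dev_error=10.0)
case_2 = diagnose(human_error=7.5, train_error=8.0, dev_error=10.0)
```

The train and dev errors are identical in both cases; only the baseline changes what is worth fixing.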
* Surpassing human level error
- Sometimes ambiguity as to whether improve bias, or improve variance.
- Examples where ML kicks humans' ass: online advertising, product recos, logistics, loan approvals. Humans are great at computer vision tasks. Some speech recognition systems can surpass humans.
Improving your model's performance
Assumptions - fit training set well (low avoidable bias)
Generalizes pretty well to dev/test set.
Error analysis
Manually examine the mistakes the model is making; manually making some notes. Find mis-predicted labels, and prioritize based on where you can improve the most.
Mislabeled examples
What to do with incorrectly labeled training examples. Deep learning algos are robust to random errors, but not to systematic errors
For incorrectly labeled dev/test set examples, add a column called 'incorrectly labeled' to the error analysis, and decide if it makes sense to spend time fixing the incorrect labels. It depends on how much of the overall error they contribute.
Correcting labels: Apply same principles to both dev and test sets
Iterating on your algorithm
Build a first system quickly, and then analyse what to do next- Iterate.
Do not overthink initially, and just get a quick and dirty first solution going.
Mismatched training and dev/test set
The cat example- 200,000 images from web crawling, 10,000 from the data from mobile cameras (low quality)
Set your dev/test sets to come from the distribution you want your application to do well on.
Analyzing bias and variance on different distributions of training and dev set. The concept of training-dev sets!
Data mismatch error - The new problem of data mismatch! How to solve the data mismatch problem. Manually analyzing the difference. Eg.: artificial data synthesis- add random noise to clean data
Transfer Learning
Use a model used to identify cats, and apply to identifying X-ray scans. pre training and fine tuning
When does transfer learning make sense? When you have pre-learnt on a lot of data and don't have much data for the new problem.
Multi-task learning
Do multiple tasks at one time; demonstrated with the autonomous driving car example. Detecting if an image has a stop sign, has a human, has another car etc.
Should be done on tasks that share low-level features. Works better if the amount of data is similar, per task. Knowledge of one task should help all the other tasks.
End to end deep learning
Replacing a whole pipeline of feature engineering, extracting features with one neural network.
Needs a lot more data than traditional pipelines.
The turnstile problem example- breaking into steps since you have much more data from the steps than the end-to-end problem.
Can simplify the problem, but does not always work. Think about the amount of data!
Whether to use end-to-end deep learning
PROS: Just lets the data speak; rather than human perceptions. Don't need to hand-design the features.
CONS: Need a LOT of data. Sometimes not available for the entire step. Excludes hand-designed components/features.
Course 4: Convolutional Neural Networks
Week 1: Intro to CNNs
Convolution operation
How to detect edges, defining the filter/kernel. How to do convolution. How does edge detection work really, with convolution?
Horizontal and vertical edge detection. Dark to white edges and white to dark edges. Sobel filter, Scharr filter. Can possibly learn the coefficients of your kernel through deep learning; rather than hand-pick a kernel.
Padding
Why is it needed? Because the image shrinks! Valid convolutions and same convolutions.
How to calculate the padding size?
Even-dimensioned kernels are rarely used.
Strided Convolutions
General formula for dimensions of the output image; for an input image of size n by n, kernel of size f * f, padding = p and stride = s; the dimension of the output image is floor( ((n+2p-f)/s) + 1 )
The filter must lie fully in the image when convolving
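The output-size formula above, written as a small helper. The function names are hypothetical; same_padding assumes an odd filter size and stride 1, where p = (f - 1) / 2 keeps the output the same size as the input.

```python
def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1 for an n x n input and an f x f filter."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding for a 'same' convolution with stride 1 and odd f."""
    return (f - 1) // 2

valid_out = conv_output_size(6, 3)                       # valid convolution
strided_out = conv_output_size(7, 3, s=2)                # strided convolution
same_out = conv_output_size(6, 3, p=same_padding(3))     # same convolution
```

The floor comes from the rule that the filter must lie fully inside the (padded) image: positions where it would hang over the edge are dropped.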
Cross correlation vs convolution: We are not doing the mirroring step (as done in maths). What we are essentially doing is cross-correlation, and calling it convolution.
Convolution over volumes- 3D Images
Number of channels in kernel must be equal to the number of channels in the image.
Find multiple types of edges by convolving with multiple filters, each suited to finding a different kind of edge. The output dimension then becomes (n-f+1) * (n-f+1) * (number of convolution filters used)
Building one layer of a CNN
Add bias and non-linearity to the convolution result; analogy with the standard forward propagation
Calculate the total number of params in a layer- the coefficients of the filters and, DO NOT FORGET, THE BIAS!
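A quick sketch of the count, assuming f x f filters that span all input channels and one bias per filter. The helper name conv_layer_params is hypothetical.

```python
def conv_layer_params(f, n_c_prev, n_filters):
    """(f * f * n_c_prev weights + 1 bias) per filter, times the filter count."""
    return (f * f * n_c_prev + 1) * n_filters

# Ten 3 x 3 filters over a 3-channel input.
n_params = conv_layer_params(f=3, n_c_prev=3, n_filters=10)
```

The count does not depend on the height or width of the input image, which is a large part of why convolutional layers stay small.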
Naming conventions- formula in terms of this layer's filters and previous layer's inputs
A simple example
The depth keeps increasing, while you reduce the height and width at each layer (Remember Udacity!). Andrew calls depth as the number of channels
Other layers: Pooling and Fully Connected
Pooling layer - eg. find the max in a sub-region of an image (Max Pooling). There is nothing to learn, has a fixed set of parameters (stride and size of kernel)
Average pooling layer; max pooling is used more often than average pooling
Combining all of these together and one example based on LeNet-5
Remember pooling layer has no parameters. As a convention, count only the layers that have weights (parameters)
At the end, flatten and feed into the fully connected layer
Why convolution? - Great video!
Reduces the number of param much more than fully connected layers - Parameter Sharing- a feature detector (such as edge detector) useful in one part of the image is probably useful in another part of the image.
Sparse Connections - one pixel is only connected to its neighbors, and not to everyone else (and does not need to be!)
Project: Step by step convolution model
Implement a CNN yourself
Implement padding, convolution, forward pass etc. from scratch
Nice tidy implementation of a single layer!
Optional exercises on back propagation
Implement ConvNet using TensorFlow
Initialize placeholders, weights etc.
Remember the tensor flow sessions! How to run etc.
Week 2: Looking at case studies
Learn from others, why reinvent the wheel?
LeNet -5 : the architecture; used for digit recognition.
AlexNet: bigger, more parameters; better than LeNet as it used ReLU. Uses a layer called local response normalization. Has a lot of hyperparameters.
VGG 16: A simpler network; although large. Has 16 layers with weights.
ResNets: Residual block- applying a shortcut as opposed to the main path. Skip connections
Plain network vs residual block networks. In practice, the deeper a plain network gets, the more the training error can go up.
Why do Resnets work so well?
If you make a plain network deeper, it can hurt training error on the training set. Not with ResNets though.
Because ResNets can learn the identity function much more easily. Therefore adding extra layers does not hurt performance, and might even help performance!
Residual layers easily learn the identity function.
Uses a lot of same convolution; as it preserves the dimensions.
A 1 by 1 convolution
What is it? Convolving with a 1 by 1 by d filter. Why is it useful! It multiplies a number across the depth and then applies a ReLu activation.
It is like having a fully connected network with depth. Also called Network in Network architecture.
Helps you shrink the depth/ the number of channels!
Inception Network Motivation and Inception Networks
Why not take ALL filters, and ALL types of layers. Just stack all the various outputs (Keep the same convolution)
Do them all; but huge computation cost!
Use a 1 by 1 convolution to reduce depth (volume) and reduce the amount of multiplications (reduce the computation cost)
Use if you want to TRY THEM ALL!
Padding with max pooling layer, a weird thing...
How to combine: Just concatenate the blocks along the channel (depth)! Height and width are kept the same.
Inception network has a lot of inception blocks. Also has a side branch layer to make predictions; tends to have a regularizing effect.
Inception network's name actually comes from the movie inception.
Practical Advice on Using other networks
How to use open source implementations. A common way to go about in computer vision is to take a known network, and use transfer learning
Use something that has already been done before! Rather than starting from scratch, why reinvent the wheel. Freeze the earlier layers**; pre-compute the earlier layers' activation and just apply softmax on that.
If you have a larger training set, you freeze fewer earlier layers. The more data you have, the more layers you can train.
Data Augmentation for computer vision. Just can't get enough of data for computer vision.
Techniques used are random cropping, mirroring.
Color shifting.
Advanced- PCA color augmentation
Implement distortions during training. Have a thread for distortions, other for training. Distortion can also have hyperparameters.
Computer vision and deep learning
ML problems fall on a spectrum from 'little data' to 'lots of data'. Lots of data means simpler algorithms and less hand-engineering.
Computer vision has relied on hand engineering a lot.
Tips for doing well on benchmarks
Ensembling - average the outputs of multiple neural networks
Multi crop at test time, the 10 crop technique
Optional keras exercise
Keras is a higher level of abstraction than tensor flow. The happy faces project.
To remember:
What we would like you to remember from this assignment:
Keras is a tool we recommend for rapid prototyping. It allows you to quickly try out different model architectures. Are there any applications of deep learning to your daily life that you'd like to implement using Keras?
Remember how to code a model in Keras and the four steps leading to the evaluation of your model on the test set. Create->Compile->Fit/Train->Evaluate/Test.
ResNets
Deep networks can learn complex functions, however not always the best choice. Remember vanishing gradients!
you can stack on additional ResNet blocks with little risk of harming training set performance. (There is also some evidence that the ease of learning an identity function--even more than skip connections helping with vanishing gradients--accounts for ResNets' remarkable performance.)
ResNets identity block; convolutional block
Using the blocks above to build a DEEP Resnet! Layer naming ne heecha bana diya!
TO REMEMBER:
What you should remember:
Very deep "plain" networks don't work in practice because they are hard to train due to vanishing gradients.
The skip-connections help to address the Vanishing Gradient problem. They also make it easy for a ResNet block to learn an identity function.
There are two main types of blocks: The identity block and the convolutional block.
Very deep Residual Networks are built by stacking these blocks together.
Week 3: Object Detection Algorithms
Object localization- Not only classifying an image with an object, but also localizing (bounding box) the object. Object detection- detect the objects in an image that has many objects.
Have the neural network output the bounding box in the form of four numbers! The output label is now a vector, with values being (is there an object); four values corresponding to the bounding box and also the type of the object. If there is no object, we don't care about anything else other than the fact y_object_exists = 0. If no object, all the other values are "don't care".
Calculating the loss function based on the two cases: (1) Object Exists (2) Does not exists
Landmark detection- just give the (X,Y) coordinates- a landmark is one point with a (x,y) coordinates.
Sliding windows for object detection - take cropped inputs (of cars for example) and train a NN to output 1/0. In sliding window, you slide/stride a window across the whole image and then have it classify for every such section of the image.
Then do this for a bigger region...rinse..repeat.
HUGE COMPUTATIONAL COST. And ConvNets' complexity time adds to the problem of computational cost.
An efficient implementation using convolution
Replace the fully connected layer with a convolution layer. Implement fully connected layers as convolutional layers.
The benefit of this is that a lot of computations get shared between sliding windows. Instead of running forward propagation independently for each window, can run it for all of them together.
WARNING!! The bounding boxes are not very accurate in this implementation: ENTER YOLO!
YOLO algorithm
Apply the localization algorithm to nine grid cells in an image; assign every grid a vector label. Total volume = Number of grids multiplied by the target vector for each of the boxes. Could use finer/coarser grids.
Caution! An object might appear in more than one grid cell, will address this later.
It is only a single computation, is an efficient algorithm and runs fast!
How to tell if your object detection algorithm is working well
Intersection over union (IoU) function to calculate the efficacy of the bounding boxes. If IoU > 0.5; then it is considered good.
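A sketch of IoU for two axis-aligned boxes, assuming each box is given by its corners as (x1, y1, x2, y2) with x1 < x2 and y1 < y2. This box format is an assumption; the notes do not fix one.

```python
def iou(box1, box2):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    # Width or height is clamped to 0 when the boxes do not overlap.
    inter = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0

overlap = iou((2, 1, 4, 3), (1, 2, 3, 4))     # intersection 1, union 7
disjoint = iou((0, 0, 1, 1), (5, 5, 6, 6))
identical = iou((0, 0, 2, 2), (0, 0, 2, 2))
```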
Non-maximal suppression
The problem of multiple detections for the same object.
All the ones with high overlap will get suppressed.
Non-maximal suppression algorithm.
Repeatedly pick boxes with high object probability, and eliminate boxes with high IoU with this one.
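The loop described above can be sketched in plain Python, assuming boxes in (x1, y1, x2, y2) corner format and an IoU threshold of 0.5. The names nms and _iou, the threshold and the example boxes are illustrative.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    w = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    h = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive non-max suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                  # highest remaining score
        keep.append(best)
        # Drop every remaining box that overlaps the picked one too much.
        order = [i for i in order if _iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The first two boxes overlap heavily, so only the higher-scoring one is kept; the third box is far away and survives.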
Anchor Boxes
Use anchor boxes of different shapes, so that one grid cell can detect more than one object. Each object is now assigned to the (grid cell, anchor box) pair that has the highest IoU with the object.
Helps your algorithm specialize better.
Can use k-means to cluster into types of anchor boxes! (neat!)
The generalized YOLO algorithm, combining anchor boxes and non-max suppression into the algorithm
Region Proposals: R-CNN - propose regions via segmentation. Different algorithms to propose regions.
Assignment: Autonomous Driving- Car Detection using YOLO
Need to collect images: done via a car-mounted camera. YOLO - just one look (you only look once)
YOLO: If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Find box scores; apply non-max suppression.
Summary for YOLO:
Input image (608, 608, 3)
The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
Each cell in a 19x19 grid over the input image gives 425 numbers.
425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in lecture.
85 = 5 + 80, where 5 is because $(p_c, b_x, b_y, b_h, b_w)$ has 5 numbers, and 80 is the number of classes we'd like to detect
You then select only few boxes based on:
Score-thresholding: throw away boxes that have detected a class with a score less than the threshold
Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes
This gives you YOLO's final output.
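The shapes in the summary can be traced with a random stand-in for the network output. This sketch assumes the last axis is ordered as (pc, bx, by, bh, bw, 80 class probabilities) and uses an arbitrary score threshold of 0.6; the random values are placeholders, not real predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
output = rng.random((19, 19, 5, 85))        # stand-in for the CNN output

flat = output.reshape(19, 19, 5 * 85)       # flatten the last two dimensions

pc = output[..., 0:1]                       # (19, 19, 5, 1) objectness
box = output[..., 1:5]                      # (19, 19, 5, 4) box coordinates
class_probs = output[..., 5:]               # (19, 19, 5, 80) class probabilities

scores = pc * class_probs                   # per-class score for every box
best_class = scores.argmax(axis=-1)         # (19, 19, 5)
best_score = scores.max(axis=-1)            # (19, 19, 5)

mask = best_score >= 0.6                    # score thresholding
kept_boxes = box[mask]                      # (n_kept, 4), input to non-max suppression
```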
What you should remember:
YOLO is a state-of-the-art object detection model that is fast and accurate
It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
The encoding can be seen as a grid where each of the 19x19 cells contains information about 5 boxes.
You filter through all the boxes using non-max suppression. Specifically:
Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes
Intersection over Union (IoU) thresholding to eliminate overlapping boxes
Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as lot of computation, we used previously trained model parameters in this exercise. If you wish, you can also try fine-tuning the YOLO model with your own dataset, though this would be a fairly non-trivial exercise.
Week 4: Face Recognition
Face verification vs face recognition
One shot learning- you need to perform well with just one image of the person. Learn from just one example. We compute a similarity function for images.
Use a siamese network architecture.
Learn a function such that the distance between the encodings of the same person's images is small, and between the encodings of different persons' images is large.
*MISSED SOME NOTES HERE
Neural Style Transfer
Content cost function - choose a layer (neither too shallow nor too deep), and then compare the activations caused by the two images. If the activations are similar, it implies that the images have similar content.
Style Cost, how correlated are the activations across different channels? How often do high level features such as texture occur together.
Choose a layer and see how correlated are the activations between different channels.
Degree of correlation is a measure of style; how similar is the style of the generated image with the style image.
Generate a style matrix; a (number of channels) * (number of channels) matrix; see how correlated different channels are. Make pairs of every channel with the other to get this matrix's values.
Compute the style matrix for both the images- cost function is the norm (difference) between the two style matrices.
Then combine the cost function across all layers
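A rough sketch of the style matrix and the single-layer style cost described above, in plain Python. Each channel is assumed to be already flattened over all spatial positions, and the normalizing constant used in the course is left out; both function names are made up for this example.

```python
def gram_matrix(channels):
    """Style (Gram) matrix of one layer.

    channels: list of n_C lists, each holding one channel's activations
    flattened over all spatial positions.
    Returns an n_C x n_C matrix whose (i, j) entry is the unnormalized
    correlation between channel i and channel j.
    """
    n = len(channels)
    return [[sum(a * b for a, b in zip(channels[i], channels[j]))
             for j in range(n)]
            for i in range(n)]


def layer_style_cost(style_channels, generated_channels):
    """Squared difference between the two style matrices (no normalization)."""
    gs = gram_matrix(style_channels)
    gg = gram_matrix(generated_channels)
    return sum((s - g) ** 2
               for row_s, row_g in zip(gs, gg)
               for s, g in zip(row_s, row_g))
```

The cost across several layers would then be a weighted sum of `layer_style_cost` over the chosen layers.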
Generalization to 2D and 3D images.
Convolution for a 1D image.
3 Dimensional Data- convolve with a 3D filter
Assignment: Neural Style Transfer Art Generation
Most of the algorithms you've studied optimize a cost function to get a set of parameter values. In Neural Style Transfer, you'll optimize a cost function to get pixel values!
We would like the "generated" image G to have similar content as the input image C. Suppose you have chosen some layer's activations to represent the content of an image. In practice, you'll get the most visually pleasing results if you choose a layer in the middle of the network--neither too shallow nor too deep.
What you should remember about computing the cost function
What you should remember:
The content cost takes a hidden layer activation of the neural network, and measures how different a(C) and a(G) are.
When we minimize the content cost later, this will help make sure G has similar content as C.
Computing the style function
Calculating the Gram matrix for a single layer
Then merging for multiple layers; using lambdas
What you should remember:
The style of an image can be represented using the Gram matrix of a hidden layer's activations. However, we get even better results combining this representation from multiple different layers. This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
Minimizing the style cost will cause the image G to follow the style of the image S.
What you should remember:
The total cost is a linear combination of the content cost Jcontent(C,G) and the style cost Jstyle(S,G)
α and β are hyperparameters that control the relative weighting between content and style
CONCLUSION
What you should remember:
Neural Style Transfer is an algorithm that given a content image C and a style image S can generate an artistic image
It uses representations (hidden layer activations) based on a pretrained ConvNet.
The content cost function is computed using one hidden layer's activations.
The style cost function for one layer is computed using the Gram matrix of that layer's activations. The overall style cost function is obtained using several hidden layers.
Optimizing the total cost function results in synthesizing new images.
Assignment: Face Recognition for the happy house
Face Verification - "is this the claimed person?". For example, at some airports, you can pass through customs by letting a system scan your passport and then verifying that you (the person carrying the passport) are the correct person. A mobile phone that unlocks using your face is also using face verification. This is a 1:1 matching problem.
Face Recognition - "who is this person?". For example, the video lecture showed a face recognition video (https://www.youtube.com/watch?v=wr4rx0Spihs) of Baidu employees entering the office without needing to otherwise identify themselves. This is a 1:K matching problem.
Implement FaceNet
Encode an image into a 128 dimensional vector
implement the triplet loss function
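A minimal sketch of the triplet loss for a single (anchor, positive, negative) triplet of encodings, in plain Python. The margin value alpha=0.2 and the helper names are assumptions for illustration; in the assignment the loss is summed over a batch of triplets.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """max(||A - P||^2 - ||A - N||^2 + alpha, 0) for one triplet.

    The loss is zero once the negative is at least `alpha` farther from
    the anchor (in squared distance) than the positive is.
    """
    return max(squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + alpha, 0.0)
```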
What you should remember:
Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
The triplet loss is an effective loss function for training a neural network to learn an encod
Course 5: Sequence Models
Week 1: Recurrent neural networks
Why sequence models are useful- speech recognition, translation, music generation etc.
Named Entity Recognition example
Given a sentence, find the words that correspond to names.
Talks about notations etc.- how to represent individual words- make a vocabulary/dictionary of all the words
One hot encoding the words
It is a supervised learning problem.
Why not use a standard neural network?!
Inputs and outputs can be different lengths- you can have sentences of different lengths (different words)
Does not share features learned across different positions. (Kinda similar to convolutional neural network)
What is a recurrent neural network?
You want things learnt in one part to be used in other parts..
Learning from one time step to the other, passing along the activation
Y3 comes not only from X3, but also from X2 and X1
The case for bidirectional recurrent network; versus a unidirectional neural network.
Explains forward propagation
Backpropagation through time
Loss defined for a single word
Compute the total loss by summing the loss per word in time
Different types of RNNs
Input length and output length can be different
Many to many RNNs
Sentiment classification- many to one RNNs
One to many RNNs - generate Music
Machine translation- many to many, but of different lengths!
Sequence generation and machine translation
'Pair' vs 'pear'
A language model tells the probability of a sentence existing; speech recognition uses this to choose between similar sounding sentences.
Tells the probability of a sequence of words existing
See the probability of a word existing in a particular position
Sample novel sequences
Keep sampling until you have hit EOS
Character level language model vs Word level language model
Don't have to worry about unknown words in a character level model.
Character level sequences are much longer!
Vanishing gradients
Hard to propagate information along the sentence - the farther the word, the lesser the influence
For exploding gradients, use gradient clipping
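A small sketch of element-wise gradient clipping, in plain Python. The dictionary-of-lists layout and the parameter name `max_value` are assumptions for illustration; real code would clip arrays in place.

```python
def clip_gradients(gradients, max_value):
    """Clip every gradient entry to the range [-max_value, max_value].

    gradients: dict mapping a parameter name to a flat list of gradient values.
    """
    return {name: [max(-max_value, min(max_value, g)) for g in values]
            for name, values in gradients.items()}
```

For example, `clip_gradients({"Wax": [12.0, -0.5, -30.0]}, 10.0)` leaves the small entry alone and caps the other two at 10.0 and -10.0.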
Gated Recurrent units
To solve the problems of vanishing gradients
Memory cell, to preserve the information
Memorize the value such as singular/plural; and the gate (Gamma) to see if you need to update the value or not
Can use different bits to remember different things, such as plural/talking about food etc.
Long Short-Term Memory
LSTMs - Has three gates: update gate, forget gate and output gate
LSTM is the historically more proven default choice; GRUs are simpler and cheaper to compute
Bidirectional RNNs
Take info from both earlier and later in the sequence
Has a backward recurrent layer, in addition to the forward recurrent layer
Deep RNNs
Stacking a single layer we have learnt so far one over the other.
Because of the temporal dimension, these are less deep than traditional neural networks.
Assignment : Building a recurrent neural network Step by Step
Describes how LSTM can be used to solve the vanishing gradients problem
Assignment: The Dinosaur problem
Clipping of gradients and why to do it
Assignment: Improvise a jazz solo
Similar to the dinosaur model, except in Keras
Here's what you should remember:
A sequence model can be used to generate musical values, which are then post-processed into midi music.
Fairly similar models can be used to generate dinosaur names or to generate music, with the major difference being the input fed to the model.
In Keras, sequence generation involves defining layers with shared weights, which are then repeated for the different time steps 1, …, Tx.
Week 2: Natural Language Processing & Word Embeddings
Introduction to word embeddings
How to represent words in a way that is good for learning relations?
Featurized representation
Features such as Gender, 'Royal', Age etc.
Take a vector of features
Helps find words that are closely related
Eg. apple and orange are closer to each other than apple and 'man'
Visualizing word embeddings
Using word embeddings
Can analyze a lot of unlabeled text to decipher less common words
Download word embeddings from a large text corpus
Transfer embedding to a smaller training set
Continue to fine tune the word embeddings
Similar to face encoding
Properties of word embeddings
INTERESTING! How to find analogies, eg. if man is to woman, king is to what?
The difference would come up in the subtraction; a single property would stand out
Define the similarity function, we use cosine similarity
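A toy sketch of the analogy search with cosine similarity, in plain Python. The 2-dimensional embeddings (roughly "gender" and "royal") are invented for this example, and `complete_analogy` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """cos(angle) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def complete_analogy(a, b, c, embeddings):
    """Find d such that a is to b as c is to d.

    Picks the word whose difference (e_d - e_c) is most similar to (e_b - e_a).
    """
    target = [y - x for x, y in zip(embeddings[a], embeddings[b])]
    best_word, best_sim = None, -2.0
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        diff = [y - x for x, y in zip(embeddings[c], vec)]
        if not any(diff):
            continue  # identical to c, the direction is undefined
        sim = cosine_similarity(target, diff)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


# Invented toy embeddings: [gender, royal]
toy_embeddings = {
    "man": [-1.0, 0.0],
    "woman": [1.0, 0.0],
    "king": [-1.0, 1.0],
    "queen": [1.0, 1.0],
    "apple": [0.0, -1.0],
}
```

With these toy vectors, `complete_analogy("man", "woman", "king", toy_embeddings)` picks "queen", because only the gender component changes in both pairs.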
Embedding Matrix
A (number of words * dimensions) matrix
Learning Word Embeddings
Take all the embedded vectors and put it into a neural layer followed by a softmax activation
One hyperparameter is the history, i.e. how many previous words you look at - what context do you want to use for learning the word?
Word2Vec Model
Randomly pick the context word and the target word (within some window of the context word)
Hierarchical softmax classifier, like a tree that splits into groups such as (first 5000 words) etc.
In general more common words are at the top of the tree, and less common ones at the bottom
Helps in speeding up the algorithm
How to sample the context word
Don't take it uniformly, else you will always get words like a, then, the etc.
In general softmax is the blocking part, computationally expensive
Negative Sampling
Determine if two words are a context and target pair
Orange and juice are a pair, orange and king are not
Make a table of positive and negative examples; for every positive example, you have K negative examples
We don't train on all the words in the vocabulary, but only on K+1 of them based on your table from above.
How to select the negative words, according to what distribution?
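One commonly used answer to the question above is the heuristic from the original word2vec negative sampling work: sample a word in proportion to its frequency raised to the power 3/4, which sits between the uniform and the raw frequency distribution. A small sketch in plain Python; the function name and the count dictionary are assumptions for illustration.

```python
def negative_sampling_distribution(word_counts, power=0.75):
    """P(w) proportional to count(w) ** power.

    power=1 would be the raw frequency distribution,
    power=0 would be uniform over the vocabulary.
    """
    weights = {w: c ** power for w, c in word_counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}
```

With counts of 16 for "the" and 1 for "orange", the raw frequencies are 16/17 and 1/17, while the smoothed probabilities are 8/9 and 1/9, so rare words get picked a little more often.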
GloVe word vectors algo
Very Simple: Global Vectors for Word Representation
Count how many times two words appear in close proximity
Sentiment classification
Challenge is sometimes not having a huge training set.
Average the word vectors and feed to softmax
Use RNN for classification, a many to one architecture
Debiasing word embeddings - SJW stuff!
First find the direction that corresponds to the bias we are trying to solve (eg. Gender Bias)
Remove bias by projecting the words onto the direction orthogonal to the bias we want to remove
Equalize pairs such as grandfather and grandmother; for example, the distance from babysitter to grandfather should equal the distance from babysitter to grandmother
Assignment: Debiasing
Cosine similarity
Cosine similarity is a good way to compare similarity between pairs of word vectors. (Though L2 distance works too.)
For NLP applications, using a pre-trained set of word vectors from the internet is often a good way to get started.
Assignment: Emojify
Adding emojis to sentences based on emotion
Emojifier V2 using LSTMs in KERAS
What you should remember:
If you have an NLP task where the training set is small, using word embeddings can help your algorithm significantly. Word embeddings allow your model to work on words in the test set that may not even have appeared in your training set.
Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
To use mini-batches, the sequences need to be padded so that all the examples in a mini-batch have the same length.
An Embedding() layer can be initialized with pretrained values. These values can be either fixed or trained further on your dataset. If however your labeled dataset is small, it's usually not worth trying to train a large pre-trained set of embeddings.
LSTM() has a flag called return_sequences to decide if you would like to return every hidden state or only the last one.
You can use Dropout() right after LSTM() to regularize your network.
Week 3: Sequence to sequence architectures
Sequence to sequence models
Language translation for example
Image captioning, caption an image
Picking the most likely model
Machine Translation Model
Split into a model encoding the sentence; and then a language model.
Calculate the probability of an English sentence conditioned on a French sentence.
DONT DO RANDOM! - Find the sentence that maximizes the conditional probability
BEAM search
Beam Width - maintain a list of the best three words (for example) in a probabilistic sense.
After the first word, you maintain a list of conditional probabilities of say two words together. You hardwire the previous word output into the next. You do it for all the three contenders- then find the top three across all.
And you therefore continue, fragment by fragment.
If beam width = 1, it essentially becomes greedy search,
Refinements to BEAM search
Dealing with numerical underflow - multiplying many small probabilities might result in underflow, so we sum the logs instead!
Also tends to favor shorter translations, due to multiplying by numbers less than one over and over again. Normalizing by the number of words reduces the penalty for longer translations.
Take the top sentences and compute the score- pick the highest!
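A compact sketch of beam search with summed log-probabilities and length normalization, as described above. The `toy_model` table, the `<EOS>` token and the choice to move finished sentences out of the beam are assumptions for illustration; a real decoder would call the trained network instead of a lookup table.

```python
import math


def beam_search(next_log_probs, beam_width, max_len, eos="<EOS>"):
    """next_log_probs(prefix) returns {token: log P(token | prefix)}."""
    beams = [((), 0.0)]  # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        expanded = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                expanded.append((tokens + (token,), score + logp))
        # Keep only the best `beam_width` candidates across all beams.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in expanded[:beam_width]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # sentences cut off at max_len
    # Pick the highest length-normalized log-probability.
    return max(finished, key=lambda item: item[1] / len(item[0]))[0]


def toy_model(prefix):
    """Hypothetical next-word probabilities, just for the example."""
    table = {
        (): {"jane": 0.6, "in": 0.4},
        ("jane",): {"visits": 0.5, "is": 0.5},
        ("in",): {"september": 0.9, "<EOS>": 0.1},
    }
    probs = table.get(prefix, {"<EOS>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}
```

With `beam_width=1` this behaves like greedy search and commits to "jane" (total probability 0.3), while `beam_width=2` keeps "in" alive and finds "in september" (total probability 0.36).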
Choosing B
If B is large, you take in a lot of possibilities, but more computation power.
if B is small, then you are taking in less context, but is quicker to run.
Beam Search Error Analysis
How to analyse where the error lies, is it the network or the Beam Search algo?
Switch into two cases, and you can find who is at fault exactly.
Find such cases, and do an error analysis for all faulty examples, and ascribe the error to either of the two.
Bleu Score- to decide between multiple good answers for a translation.
Stands for 'Bilingual Evaluation Understudy'
Modified Precision - give each word credit only up to the maximum number of times it appears in any one of the human provided reference translations.
Look at pairs of words- bigrams - how many times do the bigrams appear?
We do this for unigrams, bigrams, n-grams..
Combined Bleu Score - basically an average over the scores for unigrams, bigrams, n-grams...
Brevity penalty - if you output very short translations, adjust the score by penalizing them
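A small sketch of the modified n-gram precision and the brevity penalty, in plain Python. Sentences are assumed to be already split into tokens, and the function names are made up for this example; it does not combine the scores into a full Bleu score.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram count divided by the total candidate n-gram count."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    ref_counts = [Counter(ngrams(ref, n)) for ref in references]
    clipped = 0
    for gram, count in cand_counts.items():
        # Credit each n-gram at most as often as it appears in one reference.
        max_ref = max(rc[gram] for rc in ref_counts)
        clipped += min(count, max_ref)
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate_len, reference_len):
    """1 if the output is longer than the reference, else exp(1 - ref / output)."""
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)
```

For the references "the cat is on the mat" and "there is a cat on the mat", the degenerate output "the the the the the the the" gets a unigram precision of 2/7 instead of 7/7, which is the point of the clipping.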
Attention Model Intuition
A human does not memorize the entire sentence, and then translates it; this is what the encoder architecture is doing.
So it does badly on longish sentences; instead, you work on one part of the sentence at a time.
A set of attention weights - how much attention should you give to words when determining the translation.
Implementation details
At every step, you decide how much context weight to give to the other words.
You input Context vectors at each time step.
Calculate factors for getting the attention weights using a small neural network.
TAKES A LOT OF TIME TO RUN THOUGH!- Is Quadratic
- You can apply this idea to image captioning as well, just pay attention to parts of the picture.
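A minimal sketch of one attention step, in plain Python: a softmax turns per-position scores into attention weights that sum to 1, and the context vector is the weighted sum of the encoder states. The scores are passed in directly here; in the notes above they come from a small neural network. Function names are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores, encoder_states):
    """Return (attention weights, context) for one output time step.

    scores: one unnormalized score per input position.
    encoder_states: one activation vector per input position.
    """
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[k] for a, h in zip(alphas, encoder_states))
               for k in range(dim)]
    return alphas, context
```

Doing this for every output step against every input step is where the quadratic cost comes from.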
Speech recognition problem
First you generate a spectrogram of the speech data and then run recognition
Initially speech was broken into phonemes; but now deep learning is showing that phonemes are not required, also because much larger audio datasets are available for training.
CTC cost for speech recognition
You collapse repeated characters not separated by a blank.
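A tiny sketch of the CTC collapsing rule, in plain Python. Using "_" as the blank symbol and the name `ctc_collapse` are assumptions for illustration.

```python
from itertools import groupby


def ctc_collapse(output, blank="_"):
    """Merge runs of repeated characters, then drop the blank symbol."""
    merged = (ch for ch, _ in groupby(output))
    return "".join(ch for ch in merged if ch != blank)
```

For example, `ctc_collapse("ttt_h_eee___ ___qqq__")` gives `"the q"`, while a blank between two equal characters keeps both of them: `ctc_collapse("a_a")` gives `"aa"`.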
Trigger word detection- TRIGGERED!
Hey Siri, Okay Google etc.
Just binarize the target label - the training set can be imbalanced, since the labels are skewed towards 0.
To solve this, you might output more 1s in continuation.
CONCLUSION AND THANK YOU!
Assignment: Using a machine translation model to convert dates to human readable dates
Implement an attention model
Here's what you should remember from this notebook:
Machine translation models can be used to map from one sequence to another. They are useful not just for translating human languages (like French->English) but also for tasks like date format translation.
An attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output.
A network using an attention mechanism can translate from inputs of length Tx to outputs of length Ty, where Tx and Ty can be different.
You can visualize attention weights α⟨t,t′⟩ to see what the network is paying attention to while generating each output.
FINAL ASSIGNMENT - Trigger Word Detection
Converting raw audio to spectrograms
Use a conv layer to convert spectrogram to features
We use unidirectional instead of bidirectional; because we want to detect the word asap (and not wait for the whole sentence!)
Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
An end-to-end deep learning approach can be used to build a very effective trigger word detection system.
Overfitting (High variance) -- (Killing a fly with a bazooka) vs Underfitting (High Bias) (killing a godzilla with a flyswatter)
Cross validation, K fold cross validation
Learning curves
Lesson 9: A short summary - putting it all together
Grid search
Practice Project - Bag of words concept
Part 3
Lesson 2: Introduction to regression
Finding the best fit using calculus - Minimizing the sum of square error
Find best order of polynomial - Polynomial regression
Types of errors in training data
Cross Validation in Regression
Lesson 3: More regression
Parametric regression
Non Parametric Regression -- Instance based methods -- K Nearest neighbour vs Kernel Regression
Lesson 4: Regressions in sklearn
Continuous supervised learning (how does it differ from what you have learnt thus far?)
Continuous (Generally some sort of ordering) vs discrete classifier (No ordering ,even though they might be numbers)
Slope and Intercept
skLearn Practice - R square metric
Errors in Linear Regression - best model minimizes the sum of squared errors (Why Square and Not Absolute?). What is the problem with the sum of squared errors?
Benefits of R-square over Least Squares
Classification vs Regression: Differentiate based on output-- Chunk Number 34 -- Classification gives discrete labels (yes or no); but regression gives a concrete number from a continuous model
Lesson 5: Decision Trees
Classification vs Regression
Classification Learning Concepts- Hypothesis, Target Concept etc.
Decision Tree Introduction: How to decide which trees are better
Analogy with the 20 questions car games
Best Attributes
Decision Tree Expressiveness
Space complexity of decision trees: how many decision trees are possible?
ID3 algorithm - What does the best attribute mean? (Information Gain) - Formula for entropy
Biases in ID3
Can you repeat an attribute? For continuous values, you can ask a different question. For discrete, no attribute should be repeated
Dealing with overfitting
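A short sketch of the entropy formula and the information gain that ID3 uses to pick the best attribute, in plain Python. Labels and attribute values are passed as parallel lists; the function names are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """H = -sum over classes of p * log2(p)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

An attribute that separates the classes perfectly gains the full entropy of the labels, and one that leaves every group as mixed as before gains nothing; ID3 splits on the attribute with the highest gain.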
Lesson 6: More decision trees
Multiple linear questions (Think of it as multiple linear questions)
Coding decision trees - Tuning parameters
Regression using decision trees
Data Impurity/Entropy - min_split criteria; tuning skLearn
Lesson 7: Neural Networks
Perceptron Units- Representing basic boolean operations using perceptron units
Perceptron Training -
(1) Perceptron Rule - Works if the dataset is Linearly Separable (The Halting problem) -- half plane, half space
(2) Gradient Descent (For non linear separability) - Sigmoid function - avoiding local minima!
Comparison of the two approaches
Back Propagation- Neural Networks
Restriction Bias, Preference Bias, Occam's Razor
Lesson 8: Support Vector Machines - The Math behind it
Best line is consistent to the training data, while committing to it the least
Derivation for the best line- maximizing the margin
Solving the best line for SVM - Quadratic Programming Problem -- Zero/Non zero alphas for vectors (input data)
Only a few points matter; the one close to the decision boundary- those are our SUPPORT VECTORS!
Linearly married - Kernel Trick
Domain Knowledge is introduced via Kernel Trick - The Mercer Condition
Lesson 10 - SVMs in Practice
Lesson 11 - Instance Based Learning
K Nearest neighbors
Intro
Classification vs Regression
Running times of various algos (Learning vs Querying)
Eager vs Lazy Learners
Different Distance Metrics - IMPORTANT TO HAVE NICE DOMAIN KNOWLEDGE!
KNN Preference Bias - Locality, Smoothness and importance of features
Curse of Dimensionality - Number of data points with respect to the dimensionality of your feature space
Increasing weight on the ones getting wrong, and reducing weight on the ones right ; in a particular iteration. Combining how to get the final hypothesis.
Boosting and overfitting - Error vs confidence
Project 2- Charity ML
Data preprocessing - Normalization, scikit minMaxScaler
Data preprocessing - OneHotEncoding for categorical values
F Beta Score
Grid Search
Feature Importance
Part 4 - Unsupervised Learning
Clustering
Trying to guess the data's structure when a data does not come with labels
SkLearn's KMeans - The number of clusters is a VERY important parameter
Limitations: Result not always the same; the problems of local minima
A local hill climbing algorithm
More Clustering
Single Linkage Clustering- Inter Cluster Distance - Big O Running Time
Soft Clustering- A point belongs to a cluster 'probabilistically'
Maximum Likelihood Gaussian
Expectation Maximization
Properties of Clustering - Richness, scale-invariance, consistency - IMPOSSIBILITY THEOREM- Cannot have All three!
Properties of EM (Can get stuck-local optima)
Clustering Mini Project
Feature Scaling
Giving equal weightage to all features.
Feature Scaling Formula
SKLearn MinMaxScaler
What algorithms would be affected?
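The feature scaling formula referred to above is x' = (x - x_min) / (x_max - x_min), which maps every feature to the range [0, 1]. A minimal plain-Python sketch; `min_max_scale` is a made-up name, and sklearn's MinMaxScaler does the same thing per column.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, the scale is undefined")
    return [(v - lo) / (hi - lo) for v in values]
```

For example, weights of 115, 140 and 175 become 0.0, about 0.417 and 1.0.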
Feature Selection
Important for 'Knowledge Discovery' and 'Curse of Dimensionality'
Feature Selection Algorithms: Filtering and Wrapping - Tradeoffs between the two
Filtering: Use something as a decision tree to use information gain and get the subset of the most important features- then these features are passed into another learner
Wrapping: Ways to do wrapping - Searching, forward and backward search - WHAT FEATURES ARE IMPORTANT?
Feature Relevance, Relevance (Strong vs Weak) vs usefulness
Relevance measures effect on the Bayes Optimal Classifier
Usefulness for a feature is defined for a particular algorithm
PCA - focuses on shifting and rotating only - for eg. y = sin (x) will be a 2D system; PCA just does translation and rotation.
The center of the coordinate moves to the center of the data.
Importance of the new axis
Measurable vs Latent Features
Composite Features- Principal Component is NOT regression!
How to get the Principal Component? Maximal Variance! - find the dimension with the maximum spread, which also minimizes the information loss
Feature Transformation- PCs can be used as independent features, i.e. they do not overlap in terms of information with each other
When to use PCA? Eg.: Eigenfaces
PCA Mini Project: Eigenfaces
Observation: A lot of PCs can lead to overfitting.
Feature Transformation
Transform a set of features into smaller, more compact features while retaining as much info as possible.
Why? An example of the google search problem- Problems such as polysemy and synonymy
Independent Component Analysis- ICA looks for statistical independence!
PCA vs ICA- very different! Whereas PCA finds global stuff such as eigenfaces, ICA finds more distinct features such as 'nose', 'eyes' etc.
Alternatives: RCA (Random Components Analysis) - deals with curse of dimensionality and LDA (Linear Discriminant Analysis) - cares about the labels
Unsupervised learning project
Good way to find relevant features: try making a feature a label and predicting it from other features.--find R2 score and see if you can model a feature using the others.
Think of a world where actions are uncertain with some probabilities
The parameters/variables in a Markov Decision Process: states, models, action, rewards
Markovian Property- The next state only depends on the current state
Delayed Rewards - What was the action that led to the ultimate reward? - Temporal Credit Assignment Problem
Rewards - the hot sand beach walking to the water analogy
Sequences of rewards- Infinite horizons, Utility of Sequences; Relationship between rewards and utilities
Optimal Policy- The policy that maximizes the expected rewards. Reward in a state is not the same as the utility for the state - Reward is short term gratification, while utility is long term gratification
The Bellman Equation - How to solve it? Value Iteration
Finding Policies (pi)- Policy Iterations
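A small sketch of value iteration for the Bellman equation U(s) = R(s) + gamma * max over a of sum over s' of T(s, a, s') * U(s'), in plain Python. The two-state MDP at the bottom, the discount of 0.9 and all the names are assumptions for illustration.

```python
def value_iteration(states, actions, transition, reward, gamma=0.9, tol=1e-6):
    """Repeat the Bellman update until the utilities stop changing.

    transition(s, a) -> list of (probability, next_state)
    reward(s) -> immediate reward for being in state s
    """
    utility = {s: 0.0 for s in states}
    while True:
        new_utility = {}
        for s in states:
            best = max(sum(p * utility[s2] for p, s2 in transition(s, a))
                       for a in actions)
            new_utility[s] = reward(s) + gamma * best
        delta = max(abs(new_utility[s] - utility[s]) for s in states)
        utility = new_utility
        if delta < tol:
            return utility


# Hypothetical two-state world: "stay" keeps the state, "move" switches it.
def toy_transition(state, action):
    other = "B" if state == "A" else "A"
    return [(1.0, state if action == "stay" else other)]


def toy_reward(state):
    return 1.0 if state == "B" else 0.0


toy_utilities = value_iteration(["A", "B"], ["stay", "move"],
                                toy_transition, toy_reward)
```

Here the utility of B converges to about 10 (reward 1 forever, discounted by 0.9) and the utility of A to about 9, even though the immediate reward in A is 0, which illustrates the reward versus utility distinction above.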
Reinforcement Learning
Managing vs Learning, Modeler and Simulator
Three approaches to reinforcement learning - Policy Search, Value Function Based, Model Based
MiniMax theorem: Minimax is same as Maximin in the 2-player zero-sum game; i.e. maximizing the min is same as minimizing the max. Find the value of the game.
Von Neumann's Theorem - Relevant on non deterministic games of perfect information as well
Then instead of perfect information, we go to hidden information - Minimax theorem fails!!!
Mixed strategy vs pure strategy; Center Game
Non zero sum game - Prisoner's dilemma
Nash equilibrium - Playing the game multiple times; for a n repeated game, solution is n repeated N.E.
More Game Theory
Stochastic Games and Multi Agent Reinforcement Learning
Zero Sum Stochastic Games and General Sum Stochastic Games - Nash Q Algorithm
Reinforcement Learning Project: Smart Cab
PyGame
Selecting state space
Exploration vs Exploitation
Other times the agent learns a suboptimal policy because it first explores an action which is sub-optimal, but does yield positive rewards, and then repeatedly exploits that action. Later it may randomly explore the optimal policy, but at that point the suboptimal policy will have a higher value in the q-table.
For example, this might be "going forward at a green light" instead of following the waypoint at a green light. We will get some reward for simply moving on green, regardless of the waypoint, but it's not optimal. However it will be regularly exploited until exploration occurs again. During the exploitation period, it will build up a significant lead on the optimal policy.
Deep Learning
Lesson 1: More Deep Learning
Juanito is playing; he has to divide points, and he is going to draw a line; how is he going to do it?
Linear boundaries for dividing data points. Then generalized for higher dimensions/features.
Perceptrons in terms of nodes (Neural Networks); perceptrons as logical operators
Perceptron Trick, Learning Rate. Start Random and then try fitting the line iteratively to correctly predict the mis-predicted points.
Error function- log-loss error function- When can you use Gradient descent?
Discrete vs Continuous Predictions - Sigmoid Function; Softmax Function, One Hot Encoding
Maximum Likelihood; Cross Entropy; Multi Class Cross Entropy
Minimizing the error function given by cross entropy formula. Gradient Descent.
Similarities and Comparison between Perceptron and Gradient Descent; A correctly classified point asks the separation line to go away; and a misclassified point asks the line to come closer.. (think about it, makes sense!)
Non Linear Models; Combining multiple perceptrons; Hidden Layers, Multi class classification
Feedforward and training Neural Networks; Backpropagation
Keras, Student Admissions Mini Project
Description of batch size etc:
* one epoch = one forward pass and one backward pass of all the training examples
* batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
* number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
Example: if you have 1000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.
Training optimization- Stochastic Gradient Descent and Batch Gradient Descent; How to choose the decay of the learning rate; Overfitting vs Underfitting (the exam analogy); and how it is applicable in the neural network setting
Model Complexity Graph- Complexity generally increases with the increasing number of epochs- Early Stopping
Regularization and Overfitting- Punish high coefficients to avoid them!
Dropout- Avoid dominance of one part of the neural network to let some of the other weaker parts train- Vanishing Gradient - try other Activation Functions, like hyperbolic tan, Relu etc.
The problem of local minima! Try random restart or go with more momentum
Choosing the best model- split into validation sets!
CNNs vs MLPs (Multi layer perceptrons)-
When do MLPs fail? -.-
MLPs use a lot of params (Sparsely (locally) connected vs fully connected layer)
Throwing away 2D neighborhood information (such as in an image!) due to flattening
Color coding
Convolutional layers
Convolutional windows- Use multiple filters to detect multiple patterns
Activation Maps
Color Images!
Stride and Padding
Convolutional Layers in Keras
You are strongly encouraged to add a ReLU activation function to every convolutional layer in your networks.
Formula for number of parameters in a convolutional layer and formulas for shape of a convolutional layer
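A sketch of the formulas mentioned above, in plain Python: a convolutional layer with K filters of size F x F over an input of depth D_in has K*F*F*D_in weights plus K biases, and the output height/width depends on the stride S and the padding mode. The function names are made up; the "same"/"valid" behaviour follows the usual Keras convention.

```python
import math


def conv_param_count(filters, kernel_size, depth_in):
    """K * F * F * D_in weights plus one bias per filter."""
    return filters * kernel_size * kernel_size * depth_in + filters


def conv_output_size(size_in, kernel_size, stride, padding):
    """Output height (or width) of a convolutional layer."""
    if padding == "same":
        return math.ceil(size_in / stride)
    if padding == "valid":
        return math.ceil((size_in - kernel_size + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")
```

For example, 16 filters of size 2 x 2 on an RGB input give 16*2*2*3 + 16 = 208 parameters, and the output depth is always the number of filters.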
Pooling Layers
Used for dimensionality reduction and avoiding overfitting.
Take feature maps as input
Max Pooling Layer, Global Average Pooling Layer
Think as a stack of pancakes!
CNNs for image classification
Resizing the images. Aim is to decrease the width and the height of the image, while increasing the depth of the image. Use max pooling layers to reduce dimensionality, i.e. reduce height and width.
A connected layer at the very end.
When constructing a network for classification, the final layer in the network should be a Dense layer with a softmax activation function. The number of nodes in the final layer should equal the total number of classes in the dataset.
Scale invariance, Translation Invariance and Rotation invariance. Add random images with a bit of rotation, translation etc to the dataset. Augment using ImageDataGenerator. Note the use of steps_per_epoch, fit_generator and flow in the fit command.
Groundbreaking CNN architectures, eg. ResNet, VGG etc.
Transfer Learning- Using a pre-trained neural network to solve a new problem, i.e. a different dataset.
The initial layers detect more common patterns such as circles, shapes etc; so they can be kept. Then you just train the final layers.
Here is a generalized overview of what the convolutional neural network does:
the first layer will detect edges in the image
the second layer will detect shapes
the third convolutional layer detects higher level features
CNN Dog Recognition project
Haar Wavelet Face Detection, using ResNet50 for dog detection
Practical Aspects of Deep Learning - have variables with zero mean and equal variance - Badly conditioned vs well conditioned - well conditioned makes optimization easier (numerically)
Measuring performance- Have classifiers generalize, not memorize. That's why you use validation sets!
The problem of scaling the gradient descent - take a random set of training data, compute the gradient for it - do this many times! Stochastic Gradient Descent - Exponential Decay
Small Exercise
Nice exercise to see how much RAM space you need.