sgoyal1012/AI Nano- Deep Learning.md

Last active May 19, 2020 17:08


Machine Learning Gist


AI Nano- Deep Learning.md

Lesson 8- Intro to tensorflow

Lesson 9- Autoencoders

What is an autoencoder?
- Builds a compressed representation of the data without any human intervention
- Cons: poor at compression and at generalizing to other datasets
- Pros: dimensionality reduction and image denoising
A simple autoencoder
- Just compresses data, for example images from the MNIST database.
- Also needs a corresponding decoder to reconstruct the image.
- Using TensorFlow to create a simple autoencoder
Convolutional Autoencoder
- Why is it better?
- Encoder goes from a larger image to a smaller image (using max pooling layers etc)
- For decoding- use transpose convolutions, upsampling.
  - The checker board effect with transpose convolution
- Making the network learn how to denoise images by providing input as noisy images and output as noise-free images
All Gates
- Learn Gate
  - Takes the short term memory and the event and combines it; and then keeps the important part of it.

Lesson 10- Recurrent Neural Networks

The motivation- ordered sequences!
- Better with time series data
  - Eg. stock price, we have to do supervised learning with sequences
- Natural language Processing
- Language Translation
Vanilla supervised learners
- Make no assumptions about the structure of the input data, yet for sequences the order matters!
- Images and video have structure, relations within image.
Modelling sequential data
- What is an ordered sequence?
  - Indexing values by timestamp (the order in which they appeared)
  - A product of some underlying process/processes
    - Eg. temperature/stock prices
- Model the sequence recursively
  - Model future values based on the past
Simple recursive examples
- Example: odd numbers, or anything that can be expressed as a function of its predecessors
- The seed is the first element(s), e.g. for Fibonacci it is 1 and 1.
- The order is the number of previous elements an element depends upon, e.g. 1 for the odd number sequence and 2 for Fibonacci.
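
A minimal sketch of the seed/order idea above. The helper name `generate` and its signature are made up for illustration; they are not from the course.

```python
def generate(seed, step, n):
    """Return the first n terms of a recursive sequence.

    seed: list of starting values; its length is the order of the sequence.
    step: function mapping the last `order` terms to the next term.
    """
    order = len(seed)
    terms = list(seed)
    while len(terms) < n:
        terms.append(step(terms[-order:]))
    return terms[:n]

# Odd numbers: order 1, seed [1], each term is the previous one plus 2.
odds = generate([1], lambda prev: prev[0] + 2, 6)

# Fibonacci: order 2, seed [1, 1], each term is the sum of the previous two.
fib = generate([1, 1], lambda prev: prev[0] + prev[1], 8)

print(odds)  # [1, 3, 5, 7, 9, 11]
print(fib)   # [1, 1, 2, 3, 5, 8, 13, 21]
```
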
Thinking about recursivity
- The unfolded view vs folded view vs graph view.
Driving a recursive sequence
- Savings account example.
Injecting recursivity into a learner, the lazy way
- Learning the function to describe a sequence
  - Learn the weights of a parameterized function by fitting; take the least squares cost function.
    - Regression! Windowing our data points based on input-output pairs.
    - Shows how to do this in Keras.
    - There can be more than one way to describe a sequence..
    - Applies this to a real financial dataset
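
The windowing step above can be sketched in plain Python. The function name `window_series` is an assumption for illustration; the course does the equivalent before fitting a Keras model.

```python
def window_series(series, window_size):
    """Split a sequence into (input window, next value) pairs."""
    inputs, outputs = [], []
    for i in range(len(series) - window_size):
        inputs.append(series[i:i + window_size])   # the previous window_size values
        outputs.append(series[i + window_size])    # the value to predict
    return inputs, outputs

X, y = window_series([1, 3, 5, 7, 9, 11], window_size=2)
print(X)  # [[1, 3], [3, 5], [5, 7], [7, 9]]
print(y)  # [5, 7, 9, 11]
```

Each row of `X` with the matching entry of `y` is one training example for an ordinary regression model.
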
Injecting recursivity into a learner, the proper way
- Failures of the FNN approach
  - We assumed no structure and just went pair by pair, but *there is a dependence between the input-output pairs*; they are not IID
- Basic RNN approach
  - Force consecutive dependency!
  - Hidden states - the h variables
  - Use the least squares loss again, albeit now including the hidden variables as well.
RNNs and memory
- RNNs go much farther back in time to take previous values into account, whereas the FNN depends only on the immediately preceding value.
- Every level contains a complete history, or in other words, has memory.
Technical issues such as vanishing/exploding gradients exist
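
A rough sketch of how the hidden state carries memory, using a scalar vanilla RNN. The weights `w_x`, `w_h` and `b` are hypothetical values chosen by hand, not learned ones.

```python
import math

def rnn_forward(xs, w_x, w_h, b, h0=0.0):
    """Run a scalar vanilla RNN and return the hidden state after each step."""
    h = h0
    states = []
    for x in xs:
        # The new hidden state depends on the current input AND the previous hidden state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

with_pulse = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
without_pulse = rnn_forward([0.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)

print(with_pulse)     # still non-zero at the last step, although the last two inputs are 0
print(without_pulse)  # [0.0, 0.0, 0.0]
```

The influence of the first input shrinks at every step, which is a small-scale picture of why gradients through many steps can vanish.
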

Lesson 11 : LSTMs- Long Short term Memory Networks

RNN vs LSTM
- Use previous information- the animal NatGeo example.
- RNNs generally store only short term memory due to vanishing gradients, but LSTMs keep track of both long term and short term memory. (GHAJINI)
- Combine both forms of memory using 4 gates - forget gate, remember gate, learn gate and use gate - these gates are used to update both long and short term memories.
About all gates: using example of NatGeo science and nature show
- Learn Gate
  - Joins the short term memory and the event; and forgets the un-important part--> ignore factor
- Forget Gate
  - The forget factor
- Remember Gate
  - Combines the long term memory from the forget gate and short term memory from the learn gate and SIMPLY ADD THEM; and generate the new long term memory.
- Use Gate
  - Takes whatever is useful from both long term and short term memories; and generate the new short term memory.
There are many other architectures for handling the ...
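
A scalar sketch of one LSTM step, organised by the four gates described above. It follows the standard LSTM equations, so details may differ from the course's exact formulation; the weight dictionary `w` and its keys are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(ltm, stm, event, w):
    """One step of a scalar LSTM cell.

    w maps a gate name to hypothetical weights (w_stm, w_event, bias).
    Returns (new long term memory, new short term memory).
    """
    def lin(name):
        w_s, w_e, bias = w[name]
        return w_s * stm + w_e * event + bias

    # Learn gate: combine short term memory and event, then apply the ignore factor.
    learned = math.tanh(lin("learn")) * sigmoid(lin("ignore"))
    # Forget gate: scale the long term memory by the forget factor.
    kept = ltm * sigmoid(lin("forget"))
    # Remember gate: simply add the two to get the new long term memory.
    new_ltm = kept + learned
    # Use gate: squash the new long term memory and filter it into the new short term memory.
    new_stm = math.tanh(new_ltm) * sigmoid(lin("use"))
    return new_ltm, new_stm

weights = {name: (0.0, 0.0, 0.0) for name in ("learn", "ignore", "forget", "use")}
new_ltm, new_stm = lstm_step(ltm=2.0, stm=0.0, event=1.0, w=weights)
print(new_ltm, new_stm)
```

With all-zero weights every sigmoid is 0.5, so half of the long term memory is kept and nothing new is learned.
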

Lesson 12 : Implementing RNNs and LSTMs

Begins with a review of RNNs and LSTMs
- RNNs: Google translate improvement example, and the need for RNNs.
  - Route the output from the previous hidden layer back into the hidden layer.
- LSTMs: Begins with the need due to vanishing or exploding gradients
  - Talks about the four gates.

Character wise RNN

Learn text one character at a time, and produce text one character at a time.
- Get a probability distribution for the next character.
Sequence batching
- Splitting sequences into batches of some lengths
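
A plain-Python sketch of sequence batching, assuming the text is already encoded as a list of integers. The function `get_batches` is illustrative; the course notebook does the same thing with array reshaping. Targets are the inputs shifted by one character.

```python
def get_batches(encoded, batch_size, num_steps):
    """Yield (inputs, targets) batches; each holds batch_size rows of num_steps items."""
    per_batch = batch_size * num_steps
    # Reserve one trailing item so the last input always has a target.
    n_batches = (len(encoded) - 1) // per_batch
    row_len = n_batches * num_steps   # each row is one long contiguous slice of the text
    for start in range(0, row_len, num_steps):
        xs, ys = [], []
        for row in range(batch_size):
            offset = row * row_len + start
            xs.append(encoded[offset:offset + num_steps])
            ys.append(encoded[offset + 1:offset + num_steps + 1])
        yield xs, ys

batches = list(get_batches(list(range(13)), batch_size=2, num_steps=3))
for xs, ys in batches:
    print(xs, ys)
```
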

Building a character wise RNN - Anna KaRNNa

Builds LSTMs using Tensorflow

Hyperparameters
Here are the hyperparameters for the network.
- batch_size - Number of sequences running through the network in one pass.
- num_steps - Number of characters in the sequence the network is trained on. Larger is typically better, since the network will learn more long range dependencies, but it takes longer to train. 100 is typically a good number here.
- lstm_size - The number of units in the hidden layers.
- num_layers - Number of hidden LSTM layers to use.
- learning_rate - Learning rate for training.
- keep_prob - The dropout keep probability when training. If your network is overfitting, try decreasing this.

Lesson 13: Hyperparameters in RNNs
- How to tune hyperparameters, no magic numbers, they depend on the dataset.
- Two categories
  - Optimizer hyperparameters - learning rate, minibatch size, number of epochs
  - Model hyperparameters - number of layers/units
- The learning rate is the MOST IMPORTANT hyperparameter
  - The learning rate takes us closer to the least error
  - Compares choosing too big vs too small of a learning rate.
  - Learning rate controls all weights versus various error curves
  - Learning rate decay
- Minibatch size
  - Online training (batch size=1) vs batch training (batch size=all the examples)
  - Small minibatches have noise, that prevents being stuck in local minima, so that is preferred.
  - Generally, 32 to 256 are good starting values.
- Number of iterations/epochs
  - Use Early stopping, to stop when the validation error stops decreasing.
- Number of hidden units
  - More units = more prone to overfitting.
  - Generally more the better, but too large may lead to overfitting.
- RNNs hyperparameters
  - No clear winner between GRUs and LSTMs - try both and test
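
A toy illustration of the learning rate discussion above, minimising f(w) = w**2 with plain gradient descent. The function and the three rates are made up for the example.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimise f(w) = w**2 (gradient 2*w) and return the final w."""
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

too_small = gradient_descent(0.001)   # barely moves away from the start
reasonable = gradient_descent(0.1)    # approaches the minimum at 0
too_large = gradient_descent(1.1)     # overshoots further on every step

print(too_small, reasonable, too_large)
```
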

Lesson 14: Sentiment analysis with RNNs

Sentiment analysis with RNNs - movie reviews in this case
- Word2Vec - words are encoded as integers and then mapped to vectors (word embeddings)
- Embedding lookup layer- why do we need to do this?!THINK!
- Dropout wrappers for dropout regularization
- tf.nn.dynamic_rnn

Project: Recurrent Neural Network Projects

Windowing out the sequence into input/output pairs.
Write a very simple RNN sequence with LSTM.
How can we train a machine learning model to generate text automatically, character-by-character? By showing the model many training examples so it can learn a pattern between input and output.
It is a multiclass classification problem!
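
A small sketch of the multiclass view: the network's raw scores for each character are turned into a probability distribution with softmax. The three-character vocabulary and the scores are invented for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["a", "n", "r"]
scores = [2.0, 1.0, 0.1]   # made-up network outputs for the next character
probs = softmax(scores)
predicted = vocab[probs.index(max(probs))]

print(probs, predicted)
```
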

Lesson 16: Generative Adversarial networks - Ian Goodfellow!

Uses of GAN - to generate data
- Mostly done in the field of images.
- Example a description of a bird is used to generate images matching that description.
- Imitation Learning
How GANs work?
- Generative models
  - Generate images by running noise through a differentiable function (a neural network).
  - The training process is different from supervised learning: we show the model a bunch of images and ask it to generate similar images.
  - Uses a discriminator to classify images as real or fake
  - The generator tries to fool the discriminator with fake images and is forced to get better at making realistic images.
Generator vs Discriminator- a game between these two
- Discusses a bit about game theory.
- Equilibrium in the GAN game - has two different players with two different costs
  - Generator and discriminator compete against each other.
  - Not always that you find the equilibrium
Tips to train GANs
- Need to run two optimizers - one for the generator loss and one for the discriminator loss
- For large images, we use convolutional networks.
- Different from CNNs as in CNNs we go from a larger image to a smaller image; but in GANs we go from a small feature to large images.
- A project to build a GAN.
  - Generator which generates data and discriminator discriminates (acts as police) to call real as real and fake as fake.
  - tf.variable_scope and tf.trainable_variables
  - Calculating losses is tricky.
  - Label smoothing
  - Shows how GANs learn progressively epoch by epoch..
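
A hedged sketch of the two GAN losses with label smoothing, written on plain probabilities rather than TensorFlow logits. The function names and the smoothing value of 0.1 are assumptions for illustration.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one predicted probability and one target label."""
    eps = 1e-12  # avoids log(0)
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))

def mean(values):
    return sum(values) / len(values)

def gan_losses(d_real, d_fake, smooth=0.1):
    """d_real / d_fake: discriminator probabilities on real and generated images."""
    # Discriminator: real images are labelled (1 - smooth) instead of 1, fakes are labelled 0.
    d_loss = mean([bce(p, 1.0 - smooth) for p in d_real]) + mean([bce(p, 0.0) for p in d_fake])
    # Generator: wants the discriminator to call its fakes real (label 1).
    g_loss = mean([bce(p, 1.0) for p in d_fake])
    return d_loss, g_loss

# Discriminator is winning: fakes get low probabilities, so the generator loss is high.
_, g_loss_caught = gan_losses(d_real=[0.9, 0.9], d_fake=[0.1, 0.1])
# Discriminator is fooled: fakes get high probabilities, so the generator loss is low.
_, g_loss_fooling = gan_losses(d_real=[0.9, 0.9], d_fake=[0.9, 0.9])

print(g_loss_caught, g_loss_fooling)
```
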

Lesson 17: Deep Convolution GANs

DC-GAN
- Transposed Convolution - You upsample the image here
- Batch Normalization
  - https://github.com/udacity/deep-learning/blob/master/batch-norm/Batch_Normalization_Lesson.ipynb
  - The idea is that, instead of just normalizing the inputs to the network, we normalize the inputs to layers within the network. It's called "batch" normalization because during training, we normalize each layer's inputs by using the mean and variance of the values in the current mini-batch.
  - Has several benefits, converges faster; can train with higher learning rates.
```
Batch normalization is a technique for improving the performance and stability of neural networks. The idea is 
to normalize the layer inputs such that they have a mean of zero and variance of one, much like how we standardize
the inputs to networks. Batch normalization is necessary to make DCGANs work.
```
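
A minimal sketch of the normalisation step for a single feature across one mini-batch. `gamma` and `beta` stand for the learnable scale and shift; here they are left at hypothetical default values.

```python
def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # eps keeps the division stable when the variance is close to zero.
    return [gamma * (v - mean) / (var + eps) ** 0.5 + beta for v in values]

normalised = batch_norm([2.0, 4.0, 6.0, 8.0])
new_mean = sum(normalised) / len(normalised)
new_var = sum((v - new_mean) ** 2 for v in normalised) / len(normalised)

print(normalised, new_mean, new_var)  # mean close to 0, variance close to 1
```
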

Project: DC Gan: Generate street signs.

Changing the generator and discriminator to be convolutional networks
- Use transposed convolutions, each followed by batch normalization
- For each of these layers, the general scheme is convolution > batch norm > leaky ReLU.
- GANs are VERY SENSITIVE to hyperparameters.

Lesson 18: Semisupervised learning

An application of GANs
- Use a GAN to improve the classification performance of models.
- People vs deep learning: people receive a lot of unlabeled data, but deep learning usually receives only labeled data.
- Has both labeled and unlabeled data; e.g. can leverage the internet to get huge amounts of unlabeled data.
- Train both the generator and the discriminator; then the generator is thrown away and the discriminator is used as a classifier
- Feature matching

Notebook on GANs - street view numbers

Turn discriminator into classifier
- Three sources, labeled images, unlabeled real images, fake/imaginary images.
- Generator is a normal DCGAN; the discriminator is now a multi-class classifier
- More regularization, because there are fewer labeled examples. Also leaky ReLU to allow gradients to pass through.
- Feature Matching - Make sure that feature values in the test are similar to the ones generated by the generator.
- Add Loss functions for both supervised and unsupervised.
- Moment Matching

Lesson 20: Intro To Computer Vision

What is visual perception?
Role in AI: For example for a self driving car to see and react
- In recognizing persons from images etc.
- Medical images
Emotional intelligence
Computer Vision pipeline
Affectiva demo

Lesson 21: Intro to Natural Language Processing

Why is it difficult for computers to understand us?
- because human language does not have a fixed structure.
- Structured languages have a fixed grammar, and a parser gives up if something is outside that grammar.
- Human discourse is unstructured and complex.
- Needs context.
  - We implicitly apply our knowledge of the physical world.
Applications of NLP
- Chat bots
Challenges in NLP
- Maintaining a context

Lesson 22: Intro to Voice User Interfaces (VUI)

VUI Overview and pipeline
- Acoustic Model, Language Model and Accent Model
VUI Applications
- Speaking is faster and less distracting than typing
Alexa Demo
- Alexa Skills

Computer Vision

Project : Mimic me

Use of Affectiva's API.
Task to recognize a mood and put an emoji next to it.

Image Representation and Analysis

Pre-processing
- Why?
  - To correct images and remove noise
  - Enhance parts of image that are important
- How
  - RGB to grayscale
    - Reduces the storage size needed
    - Color is often not required to detect objects and interpret an image
    - But color is sometimes important, for example when you have to distinguish between yellow and white lane lines.
Images as functions
Color Thresholds
- Used to select an area of interest.
OpenCV reads images as BGR, so always convert to RGB
- The blue screen coding exercise
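
A library-free sketch of the two ideas above: swapping BGR to RGB and selecting an area with a color threshold. With OpenCV itself the usual calls would be `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)` and `cv2.inRange`; the helper names and the blue range below are assumptions.

```python
def bgr_to_rgb(image):
    """Reverse the channel order of every pixel (image is rows of [b, g, r] lists)."""
    return [[pixel[::-1] for pixel in row] for row in image]

def blue_mask(rgb_image, lower=(0, 0, 200), upper=(50, 50, 255)):
    """Return True where a pixel falls inside the blue range (a color threshold)."""
    return [[all(lo <= c <= hi for c, lo, hi in zip(pixel, lower, upper))
             for pixel in row] for row in rgb_image]

bgr = [[[255, 0, 0], [0, 0, 255]]]   # one row: a pure blue and a pure red pixel in BGR order
rgb = bgr_to_rgb(bgr)
mask = blue_mask(rgb)

print(rgb)   # [[[0, 0, 255], [255, 0, 0]]]
print(mask)  # [[True, False]]
```
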
Color Spaces
- Different color spaces - HSV for example; sometimes HSV is better for segmenting out objects (e.g. in the pink balloons case!)
- Separate out channels - RGB/HSV
Geometric Transforms
- Move pixels based on a mathematical formula, to change the perspective of an image.
  - Eg. scanning and aligning text.
- cvtColor
Transforming Text
- Straightening text out from a business card
- Map from original image to warped image using geometric transformations
- Guesstimate the coordinates, get the transform and apply it
Filters in images
- Edge detection filters- high pass filters
- Noise removal filters- low pass filters
- High frequency components vs low frequency components
- High pass filter example- emphasizes edges, where the intensity changes very quickly (high frequency)
  - Kernel MUST sum to zero (WHY?!)
  - Kernel and weights
  - How to create your own filter using OpenCV, and how to use built-in filters, e.g. the Sobel filter (Ah memories!)
  - Importance of setting thresholds
  - High pass filters can enhance noise, so should do low pass before doing edge detection.
- Low pass filters
  - Noise such as speckles, no useful information; for eg. edge detection filters AMPLIFY noise.
  - Low pass filters use average, and should be normalized so that sum is one.
- Gaussian blur - the most commonly used low pass filter.
  - First pass through gaussian blur (remove noise), then do edge detection.
- Canny edge detector! (memories!)
  - Non maximal suppression (Remember Bobick!)
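
A small pure-Python sketch of why the kernel sums matter. A high pass kernel that sums to zero gives no response on a flat region, and a low pass kernel that sums to one preserves brightness. The helper `apply_kernel` is illustrative and only handles 3x3 kernels on interior pixels.

```python
def apply_kernel(image, kernel):
    """Apply a 3x3 kernel to the interior pixels of a grayscale image (list of rows)."""
    out = []
    for r in range(1, len(image) - 1):
        row = []
        for c in range(1, len(image[0]) - 1):
            row.append(sum(kernel[i][j] * image[r - 1 + i][c - 1 + j]
                           for i in range(3) for j in range(3)))
        out.append(row)
    return out

sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # high pass: weights sum to 0
box_blur = [[1 / 9] * 3 for _ in range(3)]       # low pass: weights sum to 1

flat = [[10] * 4 for _ in range(4)]              # no intensity change anywhere
edge = [[0, 0, 10, 10] for _ in range(4)]        # vertical edge in the middle

print(apply_kernel(flat, sobel_x))   # [[0, 0], [0, 0]]: no edge, no response
print(apply_kernel(edge, sobel_x))   # [[40, 40], [40, 40]]: strong response at the edge
print(apply_kernel(flat, box_blur))  # values stay at (approximately) 10
```
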

Image segmentation

To segment an image into areas
Image contouring
- Useful to see connected objects and segment objects.
- Done on a B&W image, after thresholding.
- Contour features
  - Area, perimeter, orientation (based on the fitted ellipse)
Hough transform
- Line detection (BOBICK! CV assignment 1)
- Convert into hough space (m and b coordinates).
- Better to convert into polar coordinates
- cv2 houghlines parameters
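
A sketch of why polar coordinates are the better choice: a vertical line has no finite slope `m`, but in polar form every point on it maps to the same `(rho, theta)`. The function name `rho_for` is an assumption. For reference, `cv2.HoughLines` takes the rho and theta resolutions plus an accumulator threshold.

```python
import math

def rho_for(point, theta):
    """rho = x*cos(theta) + y*sin(theta) for the line through `point` with normal angle theta."""
    x, y = point
    return x * math.cos(theta) + y * math.sin(theta)

# Three points on the vertical line x = 3; the slope would be infinite in (m, b) space.
points = [(3, 0), (3, 5), (3, -2)]
theta = 0.0   # the normal of a vertical line points along the x axis
rhos = [rho_for(p, theta) for p in points]

print(rhos)  # [3.0, 3.0, 3.0]: all three points vote for the same (rho, theta) cell
```
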
K-means clustering
- Unsupervised method to break an image into segments.

Feature Extraction and Object Recognition

What is a feature?
- A feature is a measurable piece of data in an image
- Should be consistent across different scales, lighting etc.
- Should be repeatable- very important
Types of features
- Edges, corners and blobs
- Corners- best repeatable
  - Are more unique than other features.
Corner detector
- Calculate gradient magnitude and direction
  - A corner has a big variation in the direction and magnitude of the gradient
Dilation and erosion (AH MEMORIES!!)
- Morphological operations
- Remember closing and opening
Feature Vectors
- Look at the direction of gradients
  - Eg. divide into grid and see directions of gradients.
HOG (Histogram of oriented gradients)
- Use binning to separate pixels.
- Orientation and magnitude of gradients - get via Sobel, then place the data into a histogram (after dividing into cells)
- Should be scale and rotation invariant
- HOG is also referred to as a type of feature descriptor, which is a simplified representation of an image that is made up of extracted features (that highlight important parts in an image) and that discards extraneous information. In this case the features represent the image gradient -- its magnitude and directions, which describe the shapes and patterns of intensity that make up the image.
- Talks about how to create a HOG feature vector
- Block normalization
Object recognition
- Positive vs negative examples
- Extract features and feed to training algorithm (supervised learning)
- Eg. use an SVM classifier
Haar cascades
- Haar features
  - Detect rectangular patterns like edges etc. (good for faces).
- Haar cascade classifiers quickly reject non-face (irrelevant) data
  - Fast enough to process in real time.
Motion/Video
- Video is a sequence of image frames!
- Optical flow (BOBICK!)
  - For motion and tracking analysis
  - Assumptions- Pixel values do not change from frame to frame, and neighboring pixels have similar motion

Final Capstone Project: CV Face Detection


Artifical Intelligence Nanodegree.md

Lesson 4: Introduction to AI

Examples of problems to use AI?
- Navigation problem: planning a path
- Heuristic - some additional info that makes brute force act in a more intelligent manner
- A* search
Tic tac toe quiz question
- Every board config is a node
- Pruning the search tree and adversarial search; anticipate to changes in the environment
Monty hall problem
- Probability theory
What is intelligence?
- Defining intelligence? Should not be based on our perception, but should be defined in the context of a task.
- Definition of agent, environment and state
- Perception, Action and Cognition. Reactive or behavior based agents.
Classifying AI problems
- Based on environment- stochastic-deterministic; adversarial; partially observable etc.
- Classifying different problems such as poker, driving on the road etc.
- Rational behavior and bounded optimality

Lesson 5: Applying AI to Sudoku

Constraint Propagation and Search, simple concepts applied well
Setting up the board- Defining boxes, units and peers
- Representing the sudoku puzzle as a string/dictionary
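
A sketch of the board setup described above, in the style commonly used for this puzzle. The names (`cross`, `unitlist`, `peers`, `grid_values`) are assumptions and may not match the project code exactly.

```python
rows, cols = "ABCDEFGHI", "123456789"

def cross(a, b):
    """All concatenations of a letter in a with a letter in b."""
    return [s + t for s in a for t in b]

boxes = cross(rows, cols)                                  # 'A1' ... 'I9'
row_units = [cross(r, cols) for r in rows]
col_units = [cross(rows, c) for c in cols]
square_units = [cross(rs, cs) for rs in ("ABC", "DEF", "GHI")
                for cs in ("123", "456", "789")]
unitlist = row_units + col_units + square_units

# units[box]: the three units a box belongs to; peers[box]: every other box in those units.
units = {box: [u for u in unitlist if box in u] for box in boxes}
peers = {box: set(sum(units[box], [])) - {box} for box in boxes}

def grid_values(grid):
    """Map an 81-character string to {box: digits}, with '.' expanded to all digits."""
    return {box: (ch if ch in cols else cols) for box, ch in zip(boxes, grid)}

print(len(boxes), len(unitlist), len(peers["A1"]))  # 81 27 20
```
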
Strategy 1 : Elimination- eliminate values that cannot be there based on constraints
Strategy 2: Only choice - since every unit must contain each digit exactly once, if only one box in a unit can take a certain digit, that box must be assigned that digit
Combine the two strategies: The concept of constraint propagation
- Need to stop when the board is solved
- Don't go further if there is no progress, i.e. the solver is stalled
- Recursively apply the two strategies from above
- Does NOT work on hard sudoku puzzles! Which brings us to the third strategy..
Strategy 3: SEARCH
- Pick boxes with fewest options- then branch out - use DFS

Lesson 6: Basics of Anaconda

PROJECT: SUDOKU

Lesson 8: Introduction to Game Playing

The game of isolation
- Building a game tree - build a tree based on choices of move at every step
- Early detection and telling the computer not to lose early, which leads us to...
The minimax algorithm
- Computer tries to maximize its score, and the opponent tries to minimize it
- Propagate score up the tree
- Finding branches where the computer can win
- Branching factor and the number of nodes one needs to visit
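
A minimal minimax sketch on a hand-made game tree, where nested lists are decision points and numbers are leaf scores. The tree is invented for illustration.

```python
def minimax(node, maximizing=True):
    """Score a game tree given as nested lists; leaves are numeric scores."""
    if not isinstance(node, list):
        return node
    scores = [minimax(child, not maximizing) for child in node]
    # The computer maximizes its score; the opponent minimizes it.
    return max(scores) if maximizing else min(scores)

tree = [[3, 5], [2, 9], [0, 7]]
best = minimax(tree)

print(best)  # 3: the opponent would answer the three moves with 3, 2 and 0
```
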
Depth Limited Search
- The average branching factor - just try it out and see the average branching factor for a particular board setting. Even with this estimate, due to the exponential nature of the game tree, there are too many branches!
- Need a way for the computer to choose a move quickly; given the limited processing power
- The evaluation function, in this example it is comparing the nodes based on the maximum number of moves a node can have; propagating bottom-up
- Quiescent Search - after which level the worst branches do not change; i.e. become quiescent

Lesson 9: Advanced Game Playing

Iterative Deepening - return the best answer found within the given time constraint, i.e. go as deep along the tree as you can.
- Number of nodes needed to explore based on the branching factor
- Varying the branching factor as you progress along the game
- Horizon effect
Using different evaluation functions and find the best one out
Alpha beta pruning algorithm
- Pruning the number of nodes to look at using the parameters alpha and beta, reduces the search space
- Solving 5 by 5 isolation, use the symmetry of the board to see similar move
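
A sketch of alpha-beta pruning on a small hand-made tree (nested lists with numeric leaves). The `visited` list is an addition for illustration, so the skipped leaves are visible.

```python
def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True, visited=None):
    """Minimax value of a nested-list game tree, skipping branches that cannot matter."""
    if visited is None:
        visited = []
    if not isinstance(node, list):
        visited.append(node)   # record every leaf that is actually evaluated
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, visited))
            alpha = max(alpha, value)
            if alpha >= beta:
                break          # the minimizer above will never allow this branch
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True, visited))
        beta = min(beta, value)
        if alpha >= beta:
            break              # the maximizer above already has a better option
    return value

tree = [[3, 5], [2, 9], [0, 7]]
visited = []
best = alphabeta(tree, visited=visited)

print(best)     # 3, the same answer as plain minimax
print(visited)  # [3, 5, 2, 0]: the leaves 9 and 7 were never evaluated
```
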
3 Player isolation game
- Minimax doesn't work anymore. We have triplets at each level and choose values most suitable to a particular player, i.e. where its score can be maximized.
- Alpha beta pruning for a 3 player isolation
- Deep pruning is not possible- can do immediate and shallow pruning
- Paper by Korf
Probabilistic Games
- Sloppy isolation
- EXPECTIMAX function - pruning in a probabilistic sense

Lesson 11: Search

What is a problem?
- Examples: Route Finding
- Definition of a problem
  - an initial state,
  - a set of actions in a particular state
  - a result
  - Goaltest, i.e. if this state is the GOAL or not (GOALLLLL)
  - Path Cost - implemented as a Step Cost function
Example Route Finding
- The state space is the entire area
- The frontier (the farthest states reached so far), the unexplored region, and the explored region
Tree Search algorithms
- Breadth first search
- Keeping track of explored (visited) states!
- Notes about termination only when you find the best path
Uniform Cost Search
- Taking into account the cost
- Continue to search until you find a better path, and stop when you can't improve on it any more.
- Depth First Search is not optimal in this context
  - Then why would you use depth first search at all?! - Due to less storage requirements!
  - Depth First Search is also not complete
- About uniform cost- need more knowledge to get to the goal faster
  - For example, in the route finding problem, having an estimate of the distance to the goal would help.
The A star algorithm!
- Minimizing f- the sum of g and h
- Minimizing g keeps the path short and minimizing h keeps the search focused on the goal; A star minimizes both components at the same time.
- KEEP EXPANDING until all paths have been explored!
- A star finds the best path if the heuristic function is never greater than the true cost.
  - Should not overestimate
  - Is optimistic
  - Is admissible
  - Why does the optimistic function 'h' work?
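
A compact A star sketch using `heapq`. The graph and the heuristic values are invented; `h` never overestimates the true remaining cost, so it is admissible. Setting every `h` value to 0 would turn this into uniform cost search.

```python
import heapq

def a_star(graph, h, start, goal):
    """Return (cost, path) of the cheapest path; graph[node] = {neighbor: step_cost}."""
    frontier = [(h[start], 0, start, [start])]   # entries are (f = g + h, g, node, path)
    best_g = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path                       # goal test happens when the node is expanded
        if node in best_g and best_g[node] <= g:
            continue                             # already expanded via a path at least as cheap
        best_g[node] = g
        for neighbor, cost in graph[node].items():
            new_g = g + cost
            heapq.heappush(frontier, (new_g + h[neighbor], new_g, neighbor, path + [neighbor]))
    return None

graph = {"S": {"A": 1, "B": 4}, "A": {"B": 2, "G": 6}, "B": {"G": 2}, "G": {}}
h = {"S": 4, "A": 3, "B": 2, "G": 0}
cost, path = a_star(graph, h, "S", "G")

print(cost, path)  # 5 ['S', 'A', 'B', 'G']
```
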
State Spaces
- The Vacuum Example
- The number of states is TOO DAMN HIGH!
Sliding Blocks Example
- Finding an appropriate heuristic function
- What is an admissible heuristic function? Comparing two good heuristic functions
- Can we automatically come up with a heuristic function? Can come up with heuristic by defining the problem in words.
- Generating a relaxed problem
Problems with search
- Constraints
  - Domain must be fully observable
  - Domain must be known
  - Must be discrete
  - Deterministic
  - Static
Notes on implementation
- Nodes and Paths
PACMAN project
VERY NICE AND DIFFICULT!
- Implementing all search algos on PACMAN

Lesson 12: Simulated annealing

Solving a problem by adding some simple intelligence

Travelling Salesman Problem
- NP Hard
N Queens Problem
- Arrange in a way to have no more attacks
- The heuristic function for N Queens-iterate and reduce number of attacks with each move
- Local Minima - you get stuck! (But it is still solvable)
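
A sketch of the N Queens heuristic: count the pairs of queens that attack each other. The board encoding (one queen per column, `queens[c]` is its row) is an assumption for illustration.

```python
def count_attacks(queens):
    """queens[c] is the row of the queen in column c; count attacking pairs."""
    attacks = 0
    for c1 in range(len(queens)):
        for c2 in range(c1 + 1, len(queens)):
            same_row = queens[c1] == queens[c2]
            same_diagonal = abs(queens[c1] - queens[c2]) == c2 - c1
            if same_row or same_diagonal:
                attacks += 1
    return attacks

print(count_attacks([0, 1, 2, 3]))  # 6: every pair shares the main diagonal
print(count_attacks([1, 3, 0, 2]))  # 0: a solution for 4 queens
```
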
Hill Climbing
- Less dimensions- Local maximum problem
- Random Restart!- take maximum of all local peaks
- Tabu Search algo
- Step Size- too small vs too large
- Start with a large step size, then reduce it to make sure to reach the minima
Simulated Annealing
- Introduction to physical annealing
- Heating and cooling to get out of local minima
- Iterate to find a better position
- Start with higher randomness, gradually reducing randomness ( Vary T from very large to very small)
- GUARANTEED to converge to the global maximum, provided the temperature is lowered slowly enough!
- Local Beam Search - keeps track of K particles
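
A rough sketch of simulated annealing for a maximisation problem. The cooling schedule, its parameters and the toy score function are all assumptions; the key part is that worse moves are accepted with probability exp(delta / T), which shrinks as T falls.

```python
import math
import random

def accept_probability(delta, temperature):
    """Probability of accepting a move that changes the score by delta (maximising)."""
    if delta >= 0:
        return 1.0
    return math.exp(delta / temperature)

def anneal(score, neighbor, state, t_start=10.0, t_end=0.01, cooling=0.95, seed=0):
    """Return the best state seen while cooling from t_start down to t_end."""
    rng = random.Random(seed)
    best = state
    t = t_start
    while t > t_end:
        candidate = neighbor(state, rng)
        if rng.random() < accept_probability(score(candidate) - score(state), t):
            state = candidate
        if score(state) > score(best):
            best = state
        t *= cooling   # gradually reduce the randomness
    return best

score = lambda x: -(x - 3) ** 2                    # toy landscape with its peak at x = 3
neighbor = lambda x, rng: x + rng.choice([-1, 1])  # step left or right
result = anneal(score, neighbor, state=0)

print(result, score(result))
```
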
Genetic Algorithms
- Survival of the fittest - breeding and mutation.
- Crossover - children get good aspects of their parents through natural selection.
- What if a Critical piece gets eliminated?- Solved by having more randomness
- Without mutation, we might NEVER reach the goal!
Simulated annealing lab
- Implement simulated annealing
- Nice assignment! All functions were kind of challenging.

Constraint Satisfaction

Constraint Graph
Map Coloring Constraint Problem
Constraints can be unary, binary or have even more variables.
Backtracking Search
- Improving efficiency: use the least constraining value
- Use the most constrained variables - solve more constraints sooner; or the minimum remaining values
- Forward Checking - maintain a map of all possible values for a variable
- Arc consistency
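
A sketch of backtracking search on the map coloring problem, choosing the most constrained variable first (minimum remaining values). The adjacency table is the usual Australia example, written out by hand.

```python
neighbors = {
    "WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"], "SA": ["WA", "NT", "Q", "NSW", "V"],
    "Q": ["NT", "SA", "NSW"], "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"], "T": [],
}
colors = ["red", "green", "blue"]

def consistent(region, color, assignment):
    """A color is allowed if no already-colored neighbor uses it."""
    return all(assignment.get(n) != color for n in neighbors[region])

def backtrack(assignment):
    if len(assignment) == len(neighbors):
        return assignment
    # Minimum remaining values: pick the unassigned region with the fewest legal colors.
    unassigned = [r for r in neighbors if r not in assignment]
    region = min(unassigned, key=lambda r: sum(consistent(r, c, assignment) for c in colors))
    for color in colors:
        if consistent(region, color, assignment):
            result = backtrack({**assignment, region: color})
            if result is not None:
                return result
    return None   # dead end: undo and try another color higher up

solution = backtrack({})
print(solution)
```
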
Structured CSPs
- Break into independent subproblems - the Tasmania example
Challenge Question - TWO + TWO = FOUR crypto question
Constraint Satisfaction Lab
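
The TWO + TWO = FOUR challenge above can be brute-forced as a sanity check. This is a plain enumeration over digit assignments, not the constraint propagation approach from the lesson.

```python
from itertools import permutations

def solve_two_two_four():
    """Return every digit assignment (as a dict) that satisfies TWO + TWO = FOUR."""
    solutions = []
    # Each letter gets a different digit, which permutations() guarantees.
    for t, w, o, f, u, r in permutations(range(10), 6):
        if t == 0 or f == 0:          # no leading zeros
            continue
        two = 100 * t + 10 * w + o
        four = 1000 * f + 100 * o + 10 * u + r
        if two + two == four:
            solutions.append({"T": t, "W": w, "O": o, "F": f, "U": u, "R": r})
    return solutions

solutions = solve_two_two_four()
print(solutions[0])
```

One of the assignments it finds is 734 + 734 = 1468.
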

Logic and Reasoning

Propositional Logic
- Representing events as True or False, or relation between them.
- Truth Tables and the symbols used
- Valid, Satisfiable and Unsatisfiable
- Limitations
  - Can only handle true and false; not probability
  - Cannot talk about objects or relationships between them
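
A small sketch of checking valid, satisfiable and unsatisfiable by enumerating the truth table. Sentences are written as Python boolean functions; the helper `classify` is made up for illustration.

```python
from itertools import product

def classify(sentence, n_vars):
    """Classify a boolean function of n_vars arguments by enumerating its truth table."""
    results = [sentence(*values) for values in product([True, False], repeat=n_vars)]
    if all(results):
        return "valid"            # true in every row (and therefore also satisfiable)
    if any(results):
        return "satisfiable"      # true in at least one row
    return "unsatisfiable"        # false in every row

print(classify(lambda p: p or not p, 1))     # valid
print(classify(lambda p, q: p and q, 2))     # satisfiable
print(classify(lambda p: p and not p, 1))    # unsatisfiable
```
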
First order logic
- Comparison with propositional logic and probability theory
- Represents relationships between objects; more complex models
- Defining the models; objects, functions and constants
- Talk about syntax
  - sentences, terms, quantifiers
- Representing the vacuum world as first order logic
- Questions for First Order Logic: Practice. Representing English statements as First Order Logic. - NICE QUESTIONS

Planning

Just planning is NOT enough, need feedback to rightly execute and finish the task.
- Environment is stochastic, and there are other agents too ; can't know this info beforehand
- Partial observability
- Some unknown
- BELIEF states instead of WORLD states
- Example with vacuum cleaner, what if the vacuum sensors break down?!
- Successful plans!
Mathematical formulation for successful plans
- Tracking the predict-update cycle
  - Describing in terms of variables
Belief state space
- Sensorless vacuum example!
- Conformant plans - where we do not know everything about the world, but we will still reach our goal!
Partially observable vacuum example
- Act-observe cycle
- Actions increase uncertainty, and observations bring it down!
- Can't guarantee ALWAYS!
  - Infinite sequences!
Classical Planning
- Assign all values to K boolean variables - State Space
- World state- complete assignment
- Belief state- complete assignment or partial assignment
- Actions and preconditions
  - Example of the fly action schema
Progression state space search vs Regression state space search
- Regression starts from the goal
- Progression starts from the initial state
- When is it better to search backwards vs forwards?
Plan Space Search
- Search through plans
Forward search is the MOST POPULAR
Importance of heuristics
Situation Calculus
- Successor state axiom

Lesson 16: Probability - SEBASTIAN THRUN IS BACK!

Intro to Probability and Bayes Network
- A network of reasons - the car won't start example
- Car won't start - battery won't start/battery won't charge - and so on through the chain of reasons.
- Come up with a sort of a dependency graph for various variables
  - 16 variables in this structure, so 2^16 values
- Specify..observe..compute
- Assumption that every event is discrete/binary
Probability concepts
- Complementary and joint probability
- Concept of dependence and conditional probability
- Total Probability- CANNOT NEGATE the conditional variable!
- Some quizzes on these concepts.
Bayes Rule!
- Likelihood, prior and marginal likelihood

Lesson 17: Bayes Networks

A and B - A is not observable, B is observable
- Diagnostic Reasoning
Computing Bayes Rule
- The denominator (total probability for B) is HARD to compute
- So we just compute the unnormalized terms, and then divide by their sum to get the normalized version
- Nice way to calculate probabilities in the quizzes!
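
A sketch of that normalization trick for a binary hidden variable A and an observation B. The numbers (1% prior, 90% true positive rate, 20% false positive rate) are made up for illustration.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A | B) via unnormalised terms, without computing P(B) up front."""
    unnorm_a = p_b_given_a * prior_a                 # P(B | A) P(A)
    unnorm_not_a = p_b_given_not_a * (1 - prior_a)   # P(B | not A) P(not A)
    # The sum of the two unnormalised terms is exactly the total probability P(B).
    return unnorm_a / (unnorm_a + unnorm_not_a)

p = posterior(0.01, 0.9, 0.2)
print(p)  # about 0.043: a positive observation still leaves A unlikely
```
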
Conditionally independent
- Absolute independence does not imply conditional independence, and vice versa; neither can be deduced from the other.
Confounding case
- Two causes affect an observable variable
Explaining away effect
- If an effect has multiple possible causes, observing that one of those causes is true explains away the effect and makes the other causes less likely.
- DIFFICULT QUESTIONS IN QUIZ on this effect!
Defining Bayes Networks
- A graph explaining probability relationships between various events
- Joint probability is defined by factoring in conditional relationships etc.
- A node with K inputs requires 2^k variables (parameters) to define
- Bayes networks reduce the number of params needed by quite a lot! So very useful
D-separation
- Any two variables are independent if they are not linked by just unknown variables
- Two independent variables affecting a variable; and if we know about that variable, then these variables become dependent, the explaining away effect

Lesson 18: Inference in Bayes Nets

Evidence variables, Hidden Variables and Query Variables
Output is a joint probability distribution over the query variables, given the evidence variables
Which query variable is the most likely?!
- Can also go in opposite direction- reverse evidence variables and query variables.
Inference by Enumeration
- Enumerate over all the hidden variables
- Speeding up enumeration - Maximize independence (determine through the bayes network)
- Causal direction - inference is easier when the graph goes from causes to effects
Variable elimination
- Divide into smaller parts..enumerate..then combine
- Join factors to form larger factors and then eliminate variables
Approximate Inference and Sampling
- Estimating by sampling and performing experiments
- Advantage - no complex computations; simulation does not need conditional probability tables
- Sprinkler Rain example
- With an infinite number of samples, we approach the true probabilities
Rejection Sampling - only keep the samples that match the scenario we want to compute
- Can end up rejecting a lot of samples...for eg. Burglaries and earthquakes are very infrequent
- Likelihood weighting - add a probabilistic weight to each sample, according to the probability of the conditions
- Does not solve all our problems though...so Gibbs Sampling - takes all the evidence into account - MCMC, samples depend on each other
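
A sketch comparing exact inference by enumeration with rejection sampling on a tiny sprinkler/rain network. All probabilities are invented; the query is P(Rain | WetGrass).

```python
import random

P_RAIN, P_SPRINKLER = 0.3, 0.4
# P(wet grass | rain, sprinkler)
P_WET = {(True, True): 0.99, (True, False): 0.9, (False, True): 0.8, (False, False): 0.05}

def exact_rain_given_wet():
    """Inference by enumeration: sum out the hidden Sprinkler variable."""
    joint = {}
    for rain in (True, False):
        total = 0.0
        for sprinkler in (True, False):
            p = (P_RAIN if rain else 1 - P_RAIN) * (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER)
            total += p * P_WET[(rain, sprinkler)]
        joint[rain] = total            # P(rain, wet)
    return joint[True] / (joint[True] + joint[False])

def sampled_rain_given_wet(n_samples=20000, seed=0):
    """Rejection sampling: simulate the network, keep only samples where the grass is wet."""
    rng = random.Random(seed)
    kept = rainy = 0
    for _ in range(n_samples):
        rain = rng.random() < P_RAIN
        sprinkler = rng.random() < P_SPRINKLER
        wet = rng.random() < P_WET[(rain, sprinkler)]
        if wet:                        # samples that contradict the evidence are rejected
            kept += 1
            rainy += rain
    return rainy / kept

exact = exact_rain_given_wet()
estimate = sampled_rain_given_wet()
print(exact, estimate)  # the estimate approaches the exact value as n_samples grows
```

If the evidence were rare, most samples would be rejected, which is the weakness that likelihood weighting addresses.
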
Monty Hall Problem Example
- Learning more about a door changes probability
- Monty Hall Letter

Lesson 19: Hidden Markov Models

Pattern Recognition through time
- Dolphin communication problem
- 'Delta Frequency'
- Time warping- should not matter if a whistle is quick or drawn out longer in time
Dynamic Time Warping
- Matching two signals sample-wise
- Try to keep to the diagonal as much as possible
- Could get false positives- matching signals that are not actually similar
- Bound how much we can deviate- The Sakoe Chiba bound
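
A sketch of dynamic time warping with an optional Sakoe-Chiba style band. The signals are toy lists of numbers and the `band` parameter name is an assumption.

```python
def dtw_distance(a, b, band=None):
    """Dynamic time warping cost between two sequences; band limits |i - j|."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue   # too far from the diagonal: this alignment is not allowed
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: repeat a sample of a, repeat a sample of b, move diagonally.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

quick = [1, 3, 1]
drawn_out = [1, 1, 3, 3, 1, 1]

print(dtw_distance(quick, drawn_out))          # 0.0: same whistle, different speed
print(dtw_distance(quick, drawn_out, band=1))  # inf: the band forbids that much warping
```
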
Hidden Markov Models
- Pattern recognition through time
- Representing markov models
- Self transition
- Application: Sign Language Recognition
  - HMM for "I" vs "We"
- Viterbi Trellis
  - Eliminating by the constraints
  - Many options in the middle
  - The Viterbi Path
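
A sketch of the Viterbi algorithm on a small textbook-style HMM (hidden health states, observed symptoms). The model and its probabilities are not from the course; they are a common illustrative example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # trellis[t][state] = (best probability of reaching state at time t, previous state)
    trellis = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((trellis[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        trellis.append(column)
    # Trace the Viterbi path backwards from the most probable final state.
    state = max(states, key=lambda s: trellis[-1][s][0])
    best_prob = trellis[-1][state][0]
    path = [state]
    for column in reversed(trellis[1:]):
        state = column[state][1]
        path.append(state)
    return best_prob, path[::-1]

states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3}, "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

prob, path = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, prob)  # ['Healthy', 'Healthy', 'Fever'] with probability about 0.01512
```
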
Theory on HMMs and Phrase recognition- LOST due to Github error
Context Training
- Using context in phrases- eg. combine models for I and need - Coarticulation
Statistical Grammar
- Record fraction of words occurring together
State Tying
Segmentally Boosted HMMs
Using HMMs to generate data


Coursera- Deep learning Specialization.md

Course 1 : Neural Networks and Deep Learning

Week 1: Introduction to Deep learning

What is a neural network? Housing price prediction model.
Neural networks and Supervised Learning; and types of neural networks-
- Structured Data vs Unstructured Data
Why is deep learning taking off?
- Because of Scale! (more and more data)
- NNs performance generally increases with more data
- Faster Computation

Week 2: Logistic Regression as a Neural Network

Binary Classification
Logistic Regression
Loss Function and the Cost function- The benefits of choosing a convex function for a loss function.
Gradient Descent and finding the minima
A refresher on derivatives
Computation Graph,
- derivatives with computation graph- excellent video! - Chain rule
Gradient descent using logistic regression- minimizing the loss function.
- Updating the weights using the backward propagation step.
Vectorization
- Removing for loops- to improve the run time. Eg. np.dot to get the dot product.
- Try to avoid for loops when you can. Many functions in numpy to do so!
- A logistic regression without any for loop
- Doing the backward and forward propagation steps without any for loops, using numpy
Broadcasting in python/numpy
- how python/numpy treats arrays of different sizes.
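A minimal sketch of a loop-free forward and backward pass for logistic regression, relying on `np.dot` and broadcasting. The array shapes and the exact signature of `propagate` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def propagate(w, b, X, Y):
    """Vectorized cost and gradients for logistic regression.

    Assumed shapes: X is (n_features, m), Y is (1, m), w is (n_features, 1).
    """
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)          # scalar b is broadcast over all m columns
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dw = np.dot(X, (A - Y).T) / m            # (n_features, 1), no loop over examples
    db = np.sum(A - Y) / m
    return cost, dw, db

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])              # 2 features, 3 examples
Y = np.array([[1.0, 0.0, 1.0]])
w = np.zeros((2, 1))
cost, dw, db = propagate(w, 0.0, X, Y)
print(cost, dw.shape, db)
```

With all-zero weights every prediction is 0.5, so the cost equals log(2), which is a handy sanity check.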

PROJECT: Logistic Regression Model to recognize cats

Preprocessing steps
Use assertions for size and shape of numpy arrays
Nice assignment!- implementing a NN yourself from scratch.
Key Takeaways from the assignment

Preprocessing the dataset is important.
You implemented each function separately: initialize(), propagate(), optimize(). Then you built a model().
Tuning the learning rate (which is an example of a "hyperparameter") can make a big difference to the algorithm. 
You will see more examples of this later in this course!

A discussion (optional exercise) on the importance of choosing a good learning rate!

Different learning rates give different costs and thus different predictions results.
If the learning rate is too large (0.01), the cost may oscillate up and down. It may even diverge (though in this example, using 0.01 still eventually ends up at a good value for the cost).
A lower cost doesn't mean a better model. You have to check if there is possibly overfitting. It happens when the training accuracy is a lot higher than the test accuracy.
In deep learning, we usually recommend that you:
Choose the learning rate that better minimizes the cost function.
If your model overfits, use other techniques to reduce overfitting. (We'll talk about this in later videos.)

Week 3: Shallow Neural Network

Overview of neural networks, comparison with logistic regression.
Neural networks with a single hidden layer.
- Introduction to hidden layer.
- Superscript notations etc.
Computation using a neural network
- Logistic Regression multiple times
- Vectorization
Vectorization across multiple examples
- Justification for the implementation
Activation functions
- Hyperbolic tan ( tanh )- why is this better?
- Why use sigmoid for the activation layer? (Andrew Ng says tanh is almost always superior to sigmoid; the exception is the output layer, where sigmoid is used for binary classification)
- What is this ReLU everyone talks about? Leaky ReLU; a network with ReLU learns faster
Why do you need an activation function? IMPORTANT
- Derivatives of activation functions
Gradient descent for neural networks
Intuition behind backpropagation
The weights of a neural network should be initialized to random values (WHY NOT ZERO? What's the problem?)- Symmetry Breaking Problem

Project: Planar data classification with one hidden layer

Logistic Regression does not do well because the data is not linearly separable

Reminder: The general methodology to build a Neural Network is to:
1. Define the neural network structure ( # of input units,  # of hidden units, etc). 
2. Initialize the model's parameters
3. Loop:
  - Implement forward propagation
  - Compute loss
  - Implement backward propagation to get the gradients
  - Update parameters (gradient descent)

BE CAREFUL WITH THE SIZES OF THE MATRICES!!
The importance of a good converging learning rate

The larger models (with more hidden units) are able to fit the training set better, 
until eventually the largest models overfit the data.
The best hidden layer size seems to be around n_h = 5. Indeed, a value around here
seems to fit the data well without also incurring noticeable overfitting.
You will also learn later about regularization, which lets you use very 
large models (such as n_h = 50) without much overfitting.

Week 4: Deep Neural Network

What is a deep neural network? Notations etc.
Forward Propagation
FOCUS on MATRIX DIMENSIONS- Working through the matrix dimensions of a deep neural network. Think about the dimensions of the weight matrix and the bias vector at every step.
Why need a DEEP network? Great video!
- Circuit Theory and Deep Learning
Building blocks of deep learning networks - going through a backward and forward propagation, layer by layer
Hyperparameters vs Parameters- deep learning is an empirical process, wash-rinse-repeat.
Is there a relation between the brain and deep learning? (Spoiler Alert: Not a whole lot)

Project: Building your deep neural network

* Implementing a L layer neural network from scratch- both backward and forward

Project: Using above project to detect cats vs non cats

* ALWAYS resize all images to the same size before feeding to the network.
* 2 Layer vs L Layer network, try different values of L
* **Vectorization helps a LOT with the speed**
* Causes of mis-prediction

Course 2: Improving Deep Neural Networks, Hyperparameter tuning, regularization

Week 1: Practical Aspects of Deep Learning

Setting up your train/dev/test sets
- It is an iterative LEARNING process!
- Why need a test/valid sets?
Bias variance tradeoff
- Parameters to analyze bias and variance (Overfitting vs underfitting) - See the error!
Basic recipe for machine learning/deep learning?
- Question 1: Does the model have a high bias? See the train set performance.
- Question 2: Does the model have a high variance? See the dev/validation set performance.
- Rinse and repeat
- Through deep learning, it has been possible to somehow reduce bias variance tradeoff, i.e. you can bring one down without affecting the other
Regularization
- L2 regularization, L1 regularization, lambda is the regularization parameter, Frobenius norm for w
- L2 regularization is also called weight decay (Why?, Remember HDP!)
- Why does regularization prevent overfitting?
- Dropout regularization - Keep a hidden unit for training with some probability; Inverted Dropout
- The intuition behind dropout- great video! Because a node knows that an input (feature) can go away randomly, it spreads out weights across features.
- Can change the probability of drop out (keep_prob), by layers, For example, layers with more nodes can have a higher probability of dropout. Drawback- The loss function is a bit undefined here, so hard to debug if it is monotonically decreasing with epochs
- Other regularization methods - Data Augmentation, Early Stopping
- Orthogonalization - Think of one problem at a time, machine learning funda by Andrew Ng.
Normalization
- Normalize the mean and the variance of the input features. Make the variance of all features 1. Why normalize?
Vanishing and Exploding gradients problem!
Gradient checking- to check your implementation
- Only use for debugging, not for training; does not work with dropout.

Assignments: Initialization

Comparing three types of initialization for weights: zeros initialization vs random vs He initialization. Zero initialization is VERY BAD! The cost function does not even go down with iterations. Why?

In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, and you might as well be training a neural network with $n^{[l]} = 1$ for every layer, and the network is no more powerful than a linear classifier such as logistic regression.
What you should remember:
The weights $W^{[l]}$ should be initialized randomly to break symmetry.
It is however okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly.

Random initialization- good, but not great

The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when $\log(a^{[3]}) = \log(0)$, the loss goes to infinity.
Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm.
If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.

He initialization
- Based on a paper, it does great!
WHAT TO REMEMBER

What you should remember from this notebook:
Different initializations lead to different results
Random initialization is used to break symmetry and make sure different hidden units can learn different things
Don't initialize to values that are too large
He initialization works well for networks with ReLU activations.
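A minimal sketch of He initialization: random weights scaled by sqrt(2 / fan_in), with biases left at zero. The function name `initialize_he`, the parameter dictionary layout and the layer sizes are assumptions for illustration, not the assignment's exact code.

```python
import numpy as np

def initialize_he(layer_dims, seed=0):
    """Return {'W1', 'b1', ...} for layer sizes such as [n_x, n_h, n_y]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        # Random weights break symmetry; the sqrt(2 / fan_in) factor keeps them small.
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], fan_in)) * np.sqrt(2.0 / fan_in)
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))   # zero biases are fine
    return params

params = initialize_he([500, 400, 1])
print(params["W1"].shape, params["W1"].std(), np.sqrt(2.0 / 500))
```

The printed standard deviation should sit close to sqrt(2 / 500), which is the "not too large" scale the notes ask for.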

Assignments: Regularization

Project to see where the French goalkeeper should kick the ball so that his team's players can reach it.

Implement regularization- overfitting reduces and test set accuracy goes up after regularization

What you should remember -- the implications of L2-regularization on:
The cost computation:
A regularization term is added to the cost
The backpropagation function:
There are extra terms in the gradients with respect to weight matrices
Weights end up smaller ("weight decay"):
Weights are pushed to smaller values.

Implement dropout

Apply a mask with probabilities to the activations and in backpropagation, and divide by the keep probability to scale the result.

What you should remember about dropout:
Dropout is a regularization technique.
You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
Apply dropout both during forward and backward propagation.
During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
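A short sketch of inverted dropout as described above: mask the activations, divide by `keep_prob`, and reuse the same mask in the backward pass. The helper names `dropout_forward` and `dropout_backward` are hypothetical.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Training-time only: randomly shut down units and rescale the survivors."""
    mask = rng.random(A.shape) < keep_prob   # True for units that are kept
    return A * mask / keep_prob, mask

def dropout_backward(dA, mask, keep_prob):
    """Apply the same mask and scaling to the incoming gradient."""
    return dA * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((1000, 1000))
A_drop, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
print(A.mean(), A_drop.mean())   # the expected value is preserved
```

At test time neither function is called; the activations are used as they are.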

Assignments: Gradient Checking

Small exercise to check and verify the gradient calculation.

What you should remember from this notebook:
Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
Gradient checking is slow, so we don't run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.
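A minimal gradient-checking sketch using the two-sided difference and a relative-difference score. The toy function `f`, its analytic gradient, and the helper names are assumptions for illustration; in the assignment the function would be the network's cost.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-7):
    """Two-sided difference approximation of df/dtheta, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus.flat[i] += eps
        minus.flat[i] -= eps
        grad.flat[i] = (f(plus) - f(minus)) / (2 * eps)
    return grad

def relative_difference(grad, grad_approx):
    num = np.linalg.norm(grad - grad_approx)
    den = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return num / den

f = lambda theta: np.sum(theta ** 3)
theta = np.array([1.0, -2.0, 3.0])
correct_grad = 3 * theta ** 2          # what backprop should return
buggy_grad = 2 * theta ** 2            # a deliberately wrong "backprop" result

approx = numerical_gradient(f, theta)
print(relative_difference(correct_grad, approx))   # tiny
print(relative_difference(buggy_grad, approx))     # clearly not tiny
```

The loop calls `f` twice per parameter, which is why this is kept for debugging rather than training.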

Week 2: Optimization Algorithms

Used to speed up neural networks and make them practical
Mini-batch gradient descent
- The term epochs
- The loss function does not necessarily decrease at every step, as it does with normal (batch) gradient descent
- Choosing the batch size- The extreme case of stochastic gradient descent vs batch gradient descent
- What is a small batch size? How to choose a batch size?
Exponentially weighted averages
- Approximately how many data points are taken into account, with respect to the value of beta? (roughly 1/(1 - beta); the lecture writes epsilon = 1 - beta)
- How to compute?
- Bias correction!- Important during the initial phase of learning
- Mismatched train/test distribution- maybe training set comes from cat pictures on web pages, however you test on low resolution pics uploaded by people
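A small sketch of the exponentially weighted average and its bias correction from the notes above. The function name `ewa` and the constant test series are illustrative assumptions.

```python
def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * x_t."""
    v = 0.0
    out = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        # Dividing by (1 - beta^t) removes the bias from starting at v = 0.
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

series = [10.0] * 20
raw = ewa(series, bias_correction=False)
corrected = ewa(series)
print(raw[:3])        # starts far below 10 and creeps up
print(corrected[:3])  # already at about 10 from the first step
```

The correction factor tends to 1 as `t` grows, so it only matters during the initial phase.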
Gradient Descent with Momentum
- A new hyperparameter, Beta
Optimization Algorithms
- RMSProp
- Adam (Adaptive Moment Estimation) - RMSProp + Gradient Descent with momentum
Learning Rate Decay
- Why decay the learning rate (alpha)?
- Decay formulas in terms of epochs - Exponential Decay
The problem of local optima!
- Saddle points, plateaus
- In a high dimensional space, it is very unlikely to get stuck in a local optimum (Why? REMEMBER HDP!)
- Solving the problem through better initialization - by constraining the mean and the variance
- Numerical approximation of gradients- Two side difference vs one side difference

Project: Trying out different optimization algorithms, mini-batch gradient descent

Stochastic Gradient Descent vs Batch Gradient Descent
Shuffling and partitioning to get batches, the size of the batch (power of 2)

The larger the momentum $\beta$ is, the smoother the update because the more we take the past gradients into
account. But if $\beta$ is too big, it could also smooth out the updates too much.

Implement Adam yourself, implement correction formulas for s and v params.
Observe the loss decay with/without momentum- Adam is VERY GOOD!
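A sketch of a single Adam update with the bias-correction formulas for `v` (momentum term) and `s` (RMSProp term). The function name `adam_step`, the default hyperparameters and the toy objective f(w) = w^2 are assumptions for illustration, not the assignment's code.

```python
import numpy as np

def adam_step(w, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    v = beta1 * v + (1 - beta1) * grad          # momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2     # RMSProp-style average of squared gradients
    v_hat = v / (1 - beta1 ** t)                # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Toy problem: minimize f(w) = w^2, whose gradient is 2w.
w, v, s = 5.0, 0.0, 0.0
first_w, _, _ = adam_step(w, 2 * w, v, s, t=1, lr=0.1)
for t in range(1, 1001):
    w, v, s = adam_step(w, 2 * w, v, s, t, lr=0.1)
print(first_w, w)
```

Thanks to the bias correction, the very first step has a size of about `lr` regardless of how large the gradient is.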

Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations you see in the cost come from the fact that some minibatches are more difficult than others for the optimization algorithm.
Adam on the other hand, clearly outperforms mini-batch gradient descent and Momentum. If you run the model for more epochs on this simple dataset, all three methods will lead to very good results. However, you've seen that Adam converges a lot faster.

Week 3: Hyperparameter Tuning, Batch Normalization, Multi-class classification

A shit ton of hyperparameters! Rank them by importance.
- Instead of a grid search for best hyperparameters, do a random search because you do not know what the most important hyperparameter is, and want to try out a lot more values.
- Picking an appropriate scale is important - Eg. a log scale for the learning rate.
- Hyperparameter tuning in practice: Babysitting one model (Panda) vs training multiple models in parallel (Caviar). Depends on how much computational resources you have.
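A short sketch of random search on a log scale for the learning rate: sample the exponent uniformly, then raise 10 to that power. The range 1e-4 to 1 and the helper name `sample_learning_rate` are assumptions for the example.

```python
import numpy as np

def sample_learning_rate(rng, low=1e-4, high=1.0):
    """Draw a learning rate uniformly on a log scale between low and high."""
    exponent = rng.uniform(np.log10(low), np.log10(high))
    return 10.0 ** exponent

rng = np.random.default_rng(0)
samples = np.array([sample_learning_rate(rng) for _ in range(10000)])
print(np.mean(samples < 1e-2))   # about half the samples fall below 0.01
```

Sampling uniformly on a linear scale instead would put roughly 99% of the values between 0.01 and 1, leaving the small learning rates almost unexplored.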
Batch Normalization
- Also normalizing the activations of the hidden units. Implementing this with more parameters to tune the mean and variance of your hidden layer activations.
- Fitting batch norm into a deep neural network. Convert Z to a normalized Z, then apply activation. Additional parameters are added to apply normalization at every layer.
- Working with mini batches.
- Why does Batch Norm Work? IMPORTANT VIDEO!- Eg. if you have a model that detects cats vs non cats on black cats; and now you want to use the same model for colored cats. Covariate shift. Even though the input values change, the mean and variance remains the same. It limits the amount by which the earlier layers' outputs change. Allows each layer to learn by itself independently of the earlier layers.
- Also has a slight regularization effect.
- Batch norm at test time. Estimate the mu and sigma-squared by estimating exponentially weighted averages across all batches.
Multi Class Classification
- Softmax regression, a generalization of logistic regression. Suppose we have n classes; the output layer is an n x 1 layer, giving an n x 1 vector with the probability that the input belongs to each of the n classes.
- Training a softmax classifier - hardmax vs softmax. Defining the loss function.
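A minimal sketch of softmax, hardmax and the cross-entropy loss for one-hot labels. The column-per-example layout and the function names are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax; z has shape (n_classes, m)."""
    shifted = z - np.max(z, axis=0, keepdims=True)   # for numerical stability
    e = np.exp(shifted)
    return e / np.sum(e, axis=0, keepdims=True)

def hardmax(z):
    """1 for the largest entry in each column, 0 elsewhere."""
    out = np.zeros_like(z)
    out[np.argmax(z, axis=0), np.arange(z.shape[1])] = 1.0
    return out

def cross_entropy(a, y_onehot):
    """Average of -sum_k y_k * log(a_k) over the m examples."""
    return float(-np.mean(np.sum(y_onehot * np.log(a), axis=0)))

z = np.array([[5.0, 0.0],
              [2.0, 0.0],
              [-1.0, 0.0]])          # 3 classes, 2 examples
y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
a = softmax(z)
print(a.sum(axis=0), hardmax(z)[:, 0], cross_entropy(a, y))
```

Softmax keeps a probability for every class, while hardmax only keeps the winner, which is why softmax is the one used for training.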
Introduction to Programming Frameworks
- How to choose your framework? Ease of programming, running speed and truly open.
- Using TensorFlow

Assignments: Tensor Flow to detect signs: multivariate classification

remember to initialize your variables, create a session and run the operations inside the session.

Placeholders- just define the shape now, value later. Defining simple operations, and getting results using session.run()
Pass placeholder values using feed_dict
One Hot Encoding
SIGNS dataset: Normalize and flatten the image dataset. Why use 'None' in the placeholder? Using Xavier initialization to init the parameters.
Running on mini batches, with optimizer defined by TensorFlow.
Think about the session as a block of code to train the model. Each time you run the session on a minibatch, it trains the parameters. In total you have run the session a large number of times (1500 epochs) until you obtained well trained parameters.

What you should remember:
Tensorflow is a programming framework used in deep learning
The two main object classes in tensorflow are Tensors and Operators.
When you code in tensorflow you have to take the following steps:
Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...)
Create a session
Initialize the session
Run the session to execute the graph
You can execute the graph multiple times as you've seen in model()
The backpropagation and optimization is automatically done when running the session on the "optimizer" object.

Course 3: Structuring your machine learning projects

Week 1: Introduction to ML Strategy

Strategies to analyze a problem and coming up with ideas that one should try to improve the performance.
Orthogonalization - Adjust one knob to adjust one parameter, to solve one problem - The TV knob analogy and the car analogy.
- Chain of assumptions in Machine Learning and different knobs to say improve performance on train/dev set.
- Andrew Ng does not recommend early stopping, as it is a knob that affects multiple things at once.
Setting up your goal
- Set a SINGLE NUMBER for metrics- precision and recall- but these are two numbers, and you ideally need one number. ENTER THE F1 score!
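A tiny sketch of the F1 score as the harmonic mean of precision and recall. The two classifiers and their numbers are illustrative values, not results from the course.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifiers: A has better recall, B has better precision.
f1_a = f1_score(0.95, 0.90)
f1_b = f1_score(0.98, 0.85)
print(f1_a, f1_b)
```

With a single number per classifier, picking between A and B no longer requires weighing two metrics by eye.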
Satisficing and optimizing metrics - metrics that satisfy, for example time
- As a general rule of thumb, out of N metrics, pick one to be optimizing and (N-1) to be satisficing.
How to set up your test set and dev sets
- Dev and test set should come from the same distribution.
- Size of the test and dev set.
- Test set should be good enough to give you confidence.
When to change your metrics/dev set/test set
- Halfway through you solving the problem, metrics might change based on goals. Defining a new evaluation metric, to tell which algorithm is better for your problem.
- Orthogonalization- Defining the metric is one step, and doing well on it is another step.
- If a metric that says you are doing well on your dev/test, but does not reflect well on your application; CHANGE THE METRIC!
Comparing to human level performance
- Bayes optimal error - the best theoretically possible error; there is no way to surpass this in terms of performance.
- Progress tends to be fast while your algorithm is doing worse than human level performance, and slows down once it surpasses it.
Avoidable Bias
- Think of human error as an estimate of Bayes error, as a baseline (esp. in computer vision tasks). The gap between this baseline and the training error is the avoidable bias; keep training until the training error comes down toward the baseline.
Understanding human level performance
- How to define it? The medical image classification example. Reduce bias or reduce variance? Difference between human baseline and training error = measure of (avoidable) bias; and difference between training error and dev error = measure of variance

Surpassing human level error
- Sometimes it is ambiguous whether to improve bias or to improve variance.
- Examples where ML kicks humans' ass: online advertising, product recos, logistics, loan approvals. Humans are great at computer vision tasks. Some speech recognition systems can surpass humans.

Improving your model's performance
- Assumptions - fit training set well (low avoidable bias)
- Generalizes pretty well to dev/test set.
Error analysis
- Manually examine the mistakes the model is making; manually making some notes. Find mis-predicted labels, and prioritize based on where you can improve the most.
Mislabeled examples
- What to do with incorrectly labeled training examples. Deep learning algos are robust to random errors, but not to systematic errors
- For incorrectly labeled dev/test set examples, have a column called 'incorrectly labeled'; and analyze if it makes sense to spend time fixing the incorrect labels. Depends on how much of the overall error they account for.
- Correcting labels: Apply same principles to both dev and test sets
Iterating on your algorithm
- Build a first system quickly, and then analyse what to do next- Iterate.
- Do not overthink initially, and just get a quick and dirty first solution going.
Mismatched training and dev/test set
- The cat example- 200,000 images from web crawling, 10,000 from the data from mobile cameras (low quality)
- Set your dev/test sets to come from the distribution you want your application to do well on.
- Analyzing bias and variance on different distributions of training and dev set. The concept of training-dev sets!
- Data mismatch error - The new problem of data mismatch! How to solve the data mismatch problem. Manually analyzing the difference. Eg.: artificial data synthesis - add random noise to clean data
Transfer Learning
- Use a model used to identify cats, and apply to identifying X-ray scans. pre training and fine tuning
- When does transfer learning make sense? When you have pre-learnt on a lot of data and don't have much data for the new problem.
Multi-task learning
- Do multiple tasks at one time; demonstrated with the autonomous driving car example. Detecting if an image has a stop sign, has a human, has another car etc.
- Should be done on tasks that share low level features. Works better if the amount of data is similar, per task. Knowledge of one task should help all the other tasks.
End to end deep learning
- Replacing a whole pipeline of feature engineering, extracting features with one neural network.
- Needs a lot more data than traditional pipelines.
- The turnstile problem example- breaking into steps since you have much more data from the steps than the end-to-end problem.
- Can simplify the problem, but does not always work. Think about the amount of data!
Whether to use end-to-end deep learning
- PROS: Just lets the data speak; rather than human perceptions. Don't need to hand-design the features.
- CONS: Need a LOT of data. Sometimes not available for the entire step. Excludes hand-designed components/features.

Course 4: Convolutional Neural Networks

Week 1: Intro to CNNs

Convolution operation
- How to detect edges, defining the filter/kernel. How to do convolution. How does edge detection work really, with convolution?
- Horizontal and vertical edge detection. Dark to white edges and white to dark edges. Sobel filter, Scharr filter. Can possibly learn the coefficients of your kernel through deep learning; rather than hand-pick a kernel.
Padding
- Why is it needed? Because the image shrinks! Valid convolutions and same convolutions.
- How to calculate the padding size?
- Even-sized kernels are rarely used.
Strided Convolutions
- General formula for dimensions of the output image; for an input image of size n by n, kernel of size f * f, padding = p and stride = s; the dimension of the output image is floor( ((n+2p-f)/s) + 1 )
- The filter must lie fully in the image when convolving
- Cross correlation vs convolution: We are not doing the mirroring step (as done in maths). What we are essentially doing is cross-correlation, and calling it convolution.
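A quick sketch of the output-size formula above, floor((n + 2p - f) / s) + 1, using integer division. The helper name `conv_output_size` and the example sizes are assumptions for illustration.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output height/width for an n x n input, f x f kernel, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))          # valid convolution: 4
print(conv_output_size(6, 3, p=1))     # same convolution with p = (f - 1) / 2: 6
print(conv_output_size(7, 3, s=2))     # strided convolution: 3
```

The floor comes from the rule that the filter must lie fully inside the (padded) image.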
Convolution over volumes- 3D Images
- Number of channels in kernel must be equal to the number of channels in the image.
- Finding multiple types of edges using multiple convolutions with different filters, each suited to find a different kind of edge. So the output dimension becomes (n-f+1) * (n-f+1) * (number of convolution filters used)
Building one layer of a CNN
- Add bias and non-linearity to the convolution result; analogy with the standard forward propagation
- Calculate the total number of params in a layer - the coefficients of the filters, and DO NOT FORGET THE BIAS!
- Naming conventions- formula in terms of this layer's filters and previous layer's inputs
A simple example
- The depth keeps increasing, while you reduce the height and width at each layer (Remember Udacity!). Andrew calls depth as the number of channels
Other layers: Pooling and Fully Connected
- Pooling layer - eg. find the max in a sub-region of an image (Max Pooling). There is nothing to learn, has a fixed set of parameters (stride and size of kernel)
- Average pooling layer; max pooling is used more than average pooling
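A minimal sketch of max pooling on a single 2-D channel, showing that the layer only has hyperparameters (window size `f`, stride `s`) and nothing to learn. The function name `max_pool_2d` and the toy input are assumptions.

```python
import numpy as np

def max_pool_2d(x, f=2, s=2):
    """Max pooling over a 2-D array with an f x f window and stride s."""
    h, w = x.shape
    out_h = (h - f) // s + 1
    out_w = (w - f) // s + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = window.max()          # keep the strongest response per region
    return out

x = np.arange(16).reshape(4, 4)
pooled = max_pool_2d(x)
print(pooled)
```

Replacing `window.max()` with `window.mean()` (and a float output array) would give average pooling.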
Combining all of these together and one example based on LeNet-5
- Remember pooling layer has no parameters. As a convention, count only the layers that have weights (parameters)
- At the end, flatten and feed into the fully connected layer
Why convolution? - Great video!
- Reduces the number of param much more than fully connected layers - Parameter Sharing- a feature detector (such as edge detector) useful in one part of the image is probably useful in another part of the image.
- Sparse Connections - one pixel is only connected to its neighbors, and not to everyone else (and does not need to be!)

Project: Step by step convolution model

Implement a CNN yourself
- Implement padding, convolution, forward pass etc from scratch
- Nice tidy implementation of a single layer!
- Optional exercises on back propagation
Implement ConvNet using TensorFlow
- Initialize placeholders, weights etc.
- Remember the tensor flow sessions! How to run etc.

Week 2: Looking at case studies

Learn from others, why reinvent the wheel?
- LeNet -5 : the architecture; used for digit recognition.
- AlexNet: bigger, more parameters; better than LeNet as it used ReLU. Uses a layer called local response normalization. Has a lot of hyperparameters.
- VGG 16: A simpler network; although large. Has 16 layers with weights.
- ResNets: Residual block- applying a shortcut as opposed to the main path. Skip connections
- Plain network vs residual block networks. In practice, the deeper a plain network gets, the more the training error can go up.
Why do Resnets work so well?
- If you make a plain network deeper, it can hurt training error on the training set. Not with ResNets though.
- Because ResNets can learn the identity function much more easily. Therefore adding extra layers does not hurt performance, and might even help performance!
- Residual layers easily learn the identity function.
- Uses a lot of same convolution; as it preserves the dimensions.
A 1 by 1 convolution
- What is it? Convolving with a 1 by 1 by d filter. Why is it useful? At each position it takes the element-wise product with the values across the depth, sums them, and then applies a ReLU activation.
- It is like having a fully connected network with depth. Also called Network in Network architecture.
- Helps you shrink the depth/ the number of channels!
Inception Network Motivation and Inception Networks
- Why not take ALL filters, and ALL types of layers. Just stack all the various outputs (Keep the same convolution)
- Do them all; but huge computation cost!
- Use a 1 by 1 convolution to reduce depth (volume) and reduce the amount of multiplications (reduce the computation cost)
- Use if you want to TRY THEM ALL!
- Padding with max pooling layer, a weird thing...
- How to combine: Just concatenate the blocks along the channel (depth)! Height and width are kept the same.
- Inception network has a lot of inception blocks. Also has a side branch layer to make predictions; tends to have a regularizing effect.
- Inception network's name actually comes from the movie inception.
Practical Advice on Using other networks
- How to use open source implementations. A common way to go about in computer vision is to take a known network, and use transfer learning
- Use something that has already been done before! Rather than starting from scratch, why reinvent the wheel. Freeze the earlier layers; pre-compute the earlier layers' activations and just train a softmax on top of them.
- If you have a larger training set, you freeze fewer earlier layers. The more data you have, the more layers you can train.
- Data Augmentation for computer vision. Just can't get enough of data for computer vision.
  - Techniques used are random cropping, mirroring.
  - Color shifting.
  - Advanced- PCA color augmentation
  - Implement distortions during training. Have a thread for distortions, other for training. Distortion can also have hyperparameters.
Computer vision and deep learning
- ML problems fall in the spectrum from 'little data' to 'lots of data'. Lots of data means simpler algorithms and less hand-engineering.
- Computer vision has relied on hand engineering a lot.
- Tips for doing well on benchmarks
  - Ensembling - average the outputs of multiple neural networks
  - Multi crop at test time, the 10 crop technique

Optional keras exercise

Keras is a higher level of abstraction than tensor flow. The happy faces project. To remember:

What we would like you to remember from this assignment:
Keras is a tool we recommend for rapid prototyping. It allows you to quickly try out different model architectures. Are there any applications of deep learning to your daily life that you'd like to implement using Keras?
Remember how to code a model in Keras and the four steps leading to the evaluation of your model on the test set. Create->Compile->Fit/Train->Evaluate/Test.

ResNets

Deep networks can learn complex functions, however not always the best choice. Remember vanishing gradients!
you can stack on additional ResNet blocks with little risk of harming training set performance. (There is also some evidence that the ease of learning an identity function--even more than skip connections helping with vanishing gradients--accounts for ResNets' remarkable performance.)
ResNets identity block; convolutional block
Using the blocks above to build a DEEP ResNet! The layer naming was a real pain!

TO REMEMBER:

What you should remember:
Very deep "plain" networks don't work in practice because they are hard to train due to vanishing gradients.
The skip-connections help to address the Vanishing Gradient problem. They also make it easy for a ResNet block to learn an identity function.
There are two main type of blocks: The identity block and the convolutional block.
Very deep Residual Networks are built by stacking these blocks together.

Week 3: Object Detection Algorithms

Object localization - Not only classifying an image with an object, but also localizing (bounding box) the object. Object detection - detect the objects in an image that has many objects.
Have the neural network output the bounding box in the form of four numbers! The output label is now a vector, with values being (is there an object); four values corresponding to the bounding box and also the type of the object. If there is no object, we don't care about anything else other than the fact y_object_exists = 0. If no object, all the other values are don't-cares.
Calculating the loss function based on the two cases: (1) Object exists (2) Object does not exist
Landmark detection- just give the (X,Y) coordinates- a landmark is one point with a (x,y) coordinates.
Sliding windows for object detection - take cropped inputs (of cars for example) and train a NN to output 1/0. In sliding window, you slide/stride a window across the whole image and then have it classify for every such section of the image.
- Then do this for a bigger region...rinse..repeat.
- HUGE COMPUTATIONAL COST. And ConvNets' complexity time adds to the problem of computational cost.
An efficient implementation using convolution
- Replace the fully connected layer with a convolution layer. Implement fully connected layers as convolutional layers.
- The benefit of this is that a lot of computations get shared between sliding windows. Instead of running forward propagation independently for each window, you can run it for all of them together.
- WARNING!! The bounding boxes are not very accurate in this implementation: STEP IN YOLO!
- YOLO algorithm
  - Apply the localization algorithm to nine grid cells in an image; assign every grid a vector label. Total volume = Number of grids multiplied by the target vector for each of the boxes. Could use finer/coarser grids.
  - Warning! An object might appear in more than one grid cell, will address this later.
  - It is only a single computation, is an efficient algorithm and runs fast!
How to tell if your object detection algorithm is working well
- Intersection over union (IoU) function to calculate the efficacy of the bounding boxes. If IoU > 0.5; then it is considered good.
Non maximal suppression
- The problem of multiple detections for the same object.
- All the ones with high overlap will get suppressed.
- Non maximal suppression algorithm.
  - Repeatedly pick boxes with high object probability, and eliminate boxes with high IoU with this one.
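
The loop above can be sketched as follows. This is a greedy version under stated assumptions: boxes are (x1, y1, x2, y2) tuples, `scores` holds the object probability of each box, and the 0.5 IoU threshold is only a default picked for illustration.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive suppression."""
    # Candidate indices, highest score first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # most confident remaining box
        keep.append(best)
        # Drop every remaining box that overlaps too much with it.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # the second box overlaps the first and is dropped
```
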
Anchor Boxes
- Use anchor boxes of different shapes, so that objects of different shapes have different boxes to be assigned to. Each object is now assigned to the (grid cell, anchor box) pair that has the highest IoU with the object.
- Helps your algorithm specialize better.
- Can use k means to cluster into types of anchor boxes! (neat!)
The generalized YOLO algorithm: combining anchor boxes and non-max suppression into the algorithm
Region Proposals: R-CNN - propose regions via segmentation. Different algorithms to propose regions.

Assignment: Autonomous Driving- Car Detection using YOLO

Need to collect images: Done via a car mounted camera. YOLO - you only look once.
YOLO: If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Find box scores; apply non-max suppression.

Summary for YOLO:
Input image (608, 608, 3)
The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
Each cell in a 19x19 grid over the input image gives 425 numbers.
425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in lecture.
85 = 5 + 80 where 5 is because (pc, bx, by, bh, bw) has 5 numbers, and 80 is the number of classes we'd like to detect
You then select only few boxes based on:
Score-thresholding: throw away boxes that have detected a class with a score less than the threshold
Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes
This gives you YOLO's final output.
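
A rough sketch of the shape arithmetic and the score-thresholding step for a single box, in plain Python. The layout of the 85 numbers follows the summary above; the helper names and the 0.6 threshold are assumptions made for illustration.

```python
GRID, ANCHORS, CLASSES = 19, 5, 80
BOX_LEN = 5 + CLASSES            # (pc, bx, by, bh, bw) + 80 class probabilities

print(ANCHORS * BOX_LEN)         # numbers per grid cell
print(GRID * GRID * ANCHORS)     # candidate boxes per image


def best_class_score(box):
    """Return (class index, score) where score = pc * class probability."""
    pc = box[0]
    class_probs = box[5:]
    scores = [pc * p for p in class_probs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def keep_box(box, threshold=0.6):
    """Score-thresholding: keep the box only if its best class score is high enough."""
    _, score = best_class_score(box)
    return score >= threshold


# Hypothetical box: confident detection (pc = 0.9) of class 3 with probability 0.8.
box = [0.9, 0.5, 0.5, 0.2, 0.2] + [0.0] * CLASSES
box[5 + 3] = 0.8
print(best_class_score(box), keep_box(box))
```

Boxes that pass this filter would then go through non-max suppression.
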

What you should remember:
YOLO is a state-of-the-art object detection model that is fast and accurate
It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
The encoding can be seen as a grid where each of the 19x19 cells contains information about 5 boxes.
You filter through all the boxes using non-max suppression. Specifically:
Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes
Intersection over Union (IoU) thresholding to eliminate overlapping boxes
Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as lot of computation, we used previously trained model parameters in this exercise. If you wish, you can also try fine-tuning the YOLO model with your own dataset, though this would be a fairly non-trivial exercise.

Week 4: Face Recognition

Face verification vs face recognition
One shot learning- you need to perform well with just one image of the person. Learn from just one example. We compute a similarity function for images.
- Use a siamese network architecture.
  - Learn a function such that encodings of same person's images is small; and of different persons' is large.
*MISSED SOME NOTES HERE

Neural Style Transfer
- Content cost function - choose a layer (neither too shallow nor too deep); and then compare the activations caused by the two images. If the activations are similar, it implies that the images have a similar content.
- Style Cost, how correlated are the activations across different channels? How often do high level features such as texture occur together.
  - Choose a layer and see how correlated are the activations between different channels.
  - Degree of correlation is a measure of style; how similar is the style of the generated image with the style image.
  - Generate a style matrix; a (number of channels) * (number of channels) matrix; see how correlated different channels are. Make pairs of every channel with the other to get this matrix's values.
  - Compute the style matrix for both the images- cost function is the norm (difference) between the two style matrices.
  - Then combine the cost function across all layers
- Generalization to 1D and 3D data.
  - Convolution for a 1D image.
  - 3 Dimensional Data- convolve with a 3D filter

Assignment: Neural Style Transfer Art Generation

Most of the algorithms you've studied optimize a cost function to get a set of parameter values. In Neural Style Transfer, you'll optimize a cost function to get pixel values!
We would like the "generated" image G to have similar content as the input image C. Suppose you have chosen some layer's activations to represent the content of an image. In practice, you'll get the most visually pleasing results if you choose a layer in the middle of the network--neither too shallow nor too deep.
What you should remember about computing the cost function

What you should remember:
The content cost takes a hidden layer activation of the neural network, and measures how different a(C) and a(G) are.
When we minimize the content cost later, this will help make sure G has similar content as C.

Computing the style function
- Calculating the Gram matrix for a single layer
- Then merging for multiple layers; using lambdas

What you should remember:
The style of an image can be represented using the Gram matrix of a hidden layer's activations. However, we get even better results combining this representation from multiple different layers. This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
Minimizing the style cost will cause the image G to follow the style of the image S.
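
A small sketch of the Gram (style) matrix and the style cost for one layer, in plain Python. It assumes the layer's activations have already been unrolled into one flat list per channel; the function names and the 1 / (4 * n_C^2 * (n_H * n_W)^2) normalization are assumptions for illustration.

```python
def gram_matrix(channels):
    """channels: list of n_C lists, each holding the n_H * n_W activations of one channel."""
    n = len(channels)
    # Entry (i, j) is the dot product of channel i with channel j.
    return [[sum(x * y for x, y in zip(channels[i], channels[j])) for j in range(n)]
            for i in range(n)]


def layer_style_cost(style_channels, generated_channels):
    """Squared difference between the two Gram matrices, normalized."""
    gs = gram_matrix(style_channels)
    gg = gram_matrix(generated_channels)
    n_c = len(style_channels)
    n_hw = len(style_channels[0])
    sq = sum((gs[i][j] - gg[i][j]) ** 2 for i in range(n_c) for j in range(n_c))
    return sq / (4 * n_c ** 2 * n_hw ** 2)


print(gram_matrix([[1, 2], [3, 4]]))
```

The multi-layer style cost would then be a weighted sum of `layer_style_cost` over the chosen layers.
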

What you should remember:
The total cost is a linear combination of the content cost Jcontent(C,G) and the style cost Jstyle(S,G)
α and β are hyperparameters that control the relative weighting between content and style

CONCLUSION

What you should remember:
Neural Style Transfer is an algorithm that given a content image C and a style image S can generate an artistic image
It uses representations (hidden layer activations) based on a pretrained ConvNet.
The content cost function is computed using one hidden layer's activations.
The style cost function for one layer is computed using the Gram matrix of that layer's activations. The overall style cost function is obtained using several hidden layers.
Optimizing the total cost function results in synthesizing new images.

Assignment: Face Recognition for the happy house

Face Verification - "is this the claimed person?". For example, at some airports, you can pass through customs by letting a system scan your passport and then verifying that you (the person carrying the passport) are the correct person. A mobile phone that unlocks using your face is also using face verification. This is a 1:1 matching problem.
Face Recognition - "who is this person?". For example, the video lecture showed a face recognition video (https://www.youtube.com/watch?v=wr4rx0Spihs) of Baidu employees entering the office without needing to otherwise identify themselves. This is a 1:K matching problem.

Implement FaceNet
- Encode an image into a 128 dimensional vector
- implement the triplet loss function
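
A minimal sketch of the triplet loss for a single (anchor, positive, negative) triple of encodings, in plain Python. The margin `alpha=0.2` and the helper names are assumptions for illustration; real encodings would be 128-dimensional rather than the 2D toy vectors used here.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """max(d(A, P) - d(A, N) + alpha, 0): zero once N is farther than P by the margin."""
    pos = squared_distance(anchor, positive)
    neg = squared_distance(anchor, negative)
    return max(pos - neg + alpha, 0.0)


# Negative is far away: the constraint is satisfied, loss is zero.
print(triplet_loss([0, 0], [0, 1], [3, 0]))
# Negative is as close as the positive: the margin is violated, loss is positive.
print(triplet_loss([0, 0], [0, 1], [1, 0]))
```
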

What you should remember: Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem. The triplet loss is an effective loss function for training a neural network to learn an encod

Course 5: Sequence Models

Week 1: Recurrent neural networks

Why sequence models are useful- speech recognition, translation, music generation etc.
Named Entity Recognition example
- Given a sentence, find the words that correspond to names.
- Talks about notations etc.- how to represent individual words- make a vocabulary/dictionary of all the words
  - One hot encoding the words
- It is a supervised learning problem.
Why not use a standard neural network?!
- Inputs and outputs can be different lengths- you can have sentences of different lengths (different words)
- Does not share features learned across different positions. (Kinda similar to convolutional neural network)
What is a recurrent neural network?
- You want things learnt in one part to be used in other parts..
- Learning from one time step to the other, passing along the activation
  - Y3 comes not only from X3, but also from X2 and X1
- The case for bidirectional recurrent network; versus a unidirectional neural network.
- Explains forward propagation
Backpropagation through time
- Loss defined for a single word
- Compute the total loss by summing the loss per word in time
Different types of RNNs
- Input length and output length can be different
  - Many to many RNNs
  - Sentiment classification- many to one RNNs
  - One to many RNNs - generate Music
  - Machine translation- many to many, but of different lengths!
Sequence generation and machine translation
- 'Pair' vs 'pear'
- A language model tells the probability of a sentence existing, which lets a speech recognizer pick the likelier of two similar-sounding sentences.
- Tells the probability of a sequence of words existing
- See the probability of a word existing in a particular position
Sample novel sequences
- Keep sampling until you have hit EOS
- Character level language model vs Word level language model
  - Don't have to worry about unknown words in character level.
  - Character language models are much longer!
Vanishing gradients
- Hard to propagate information along the sentence - the farther the word, the lesser the influence
- For exploding gradients, use gradient clipping
Gated Recurrent units
- To solve the problems of vanishing gradients
- Memory cell, to preserve the information
- Memorize the value such as singular/plural; and the gate (Gamma) to see if you need to update the value or not
- Can use different bits to remember different things, such as plural/talking about food etc.
Long Short-Term Memory
- LSTMs - Has three gates: update gate, forget gate and output gate
- LSTM is the preferred choice over GRUs
Bidirectional RNNs
- Take info from both earlier and later in the sequence
- Has a backward recurrent layer, in addition to the forward recurrent layer
Deep RNNs
- Stacking a single layer we have learnt so far one over the other.
  - Because of the temporal dimension, these are less deep than traditional neural networks.

Assignment : Building a recurrent neural network Step by Step

Describes how LSTM can be used to solve the vanishing gradients problem

Assignment: The Dinosaur problem

Clipping of gradients and why to do it
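
A minimal sketch of element-wise gradient clipping in plain Python: every gradient value is forced into [-max_value, max_value] so that one exploding step cannot blow up the parameters. The dictionary layout and the names `clip_gradients` and `dW` are hypothetical.

```python
def clip_gradients(gradients, max_value):
    """gradients: dict mapping a parameter name to a flat list of gradient values."""
    return {
        name: [max(-max_value, min(max_value, g)) for g in values]
        for name, values in gradients.items()
    }


grads = {"dW": [12.0, -0.5, -30.0]}
print(clip_gradients(grads, 10.0))
```
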

Assignment: Improvise a jazz solo

Similar to the dinosaur model, except in Keras

Here's what you should remember:
A sequence model can be used to generate musical values, which are then post-processed into midi music.
Fairly similar models can be used to generate dinosaur names or to generate music, with the major difference being the input fed to the model.
In Keras, sequence generation involves defining layers with shared weights, which are then repeated for the different time steps 1, …, Tx.

Week 2: Natural Language Processing & Word Embeddings

Introduction to word embeddings
- How to represent words in a way that is good for learning relations?
- Featurized representation
  - Features such as Gender, 'Royal', Age etc.
  - Take a vector of features
    - Helps find words that are closely related
    - Eg. apple and orange are closer to each other than apple and 'man'
    - Visualizing word embeddings
Using word embeddings
- Can analyze a lot of unlabeled text to decipher less common words
  - Download word embeddings from a large text corpus
  - Transfer embedding to a smaller training set
  - Continue to fine tune the word embeddings
- Similar to face encoding
Properties of word embeddings
- INTERESTING! How to find analogies, eg. if man is to woman, king is to what?
- The difference would come up in the subtraction; a single property would stand out
- Define the similarity function, we use cosine similarity
Embedding Matrix
- A (number of words * dimensions) matrix
Learning Word Embeddings
- Take all the embedded vectors and put it into a neural layer followed by a softmax activation
  - One hyperparameter is the history of how many words before you want to learn - what context do you want to learn the word?
- Word2Vec Model
  - Randomly pick the context word and the target word (within some window of the context word)
  - Hierarchical softmax classifier, like a tree that splits into groups such as (first 5000 words) etc.
    - In general more common words are at the top of the tree, and less common ones at the bottom
    - Helps in speeding up the algorithm
  - How to sample the context word
    - Don't take it uniformly, else you will always get words like a, then, the etc.
  - In general softmax is the blocking part, computationally expensive
- Negative Sampling
  - Determine if two words are a context and target pair
    - Orange and juice are a pair, orange and king are not
    - Make a table of positive and negative examples; for every positive example, you have K negative examples
    - We don't train on all the words in the corpus, but only K+1 of them based on your table from above.
    - How to select the negative words, according to what distribution?
- GloVe word vectors algo
  - Very Simple: Global Vectors for Word Representation
  - Count how many times two words appear in close proximity
- Sentiment classification
  - Challenge is sometimes not having a huge training set.
  - Average the word vectors and feed to softmax
    - Use RNN for classification, a many to one architecture
  - Debiasing word embeddings - SJW stuff!
    - First find the direction that corresponds to the bias we are trying to solve (eg. Gender Bias)
    - Remove bias by projecting the words onto the subspace orthogonal to the bias direction we want to remove
    - Equalize pairs such as grandfather and grandmother; for example, the distance from babysitter should be equal for grandfather and grandmother
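
The "remove bias" step above can be sketched as a projection: subtract from the word vector its component along the bias direction. This is a plain Python sketch; the names `neutralize`, `e` (word vector) and `g` (bias direction) are illustrative, and the 2D vectors are toy values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neutralize(e, g):
    """Remove from e its component along the bias direction g."""
    scale = dot(e, g) / dot(g, g)          # length of the projection of e onto g
    return [ei - scale * gi for ei, gi in zip(e, g)]


e = [3.0, 4.0]
g = [1.0, 0.0]                              # toy bias direction
e_debiased = neutralize(e, g)
print(e_debiased, dot(e_debiased, g))       # nothing left along g
```
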

Assignment: Debiasing

Cosine similarity

Cosine similarity is a good way to compare similarity between pairs of word vectors. (Though L2 distance works too.)
For NLP applications, using a pre-trained set of word vectors from the internet is often a good way to get started.
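
A minimal sketch of cosine similarity in plain Python: the dot product divided by the product of the two vector norms. The toy vectors below are made up; real word vectors would come from a pre-trained set.

```python
import math


def cosine_similarity(u, v):
    """cos(theta) between u and v: 1 for same direction, 0 for orthogonal, -1 for opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite
```
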

Assignment : Emojify

Adding emojis to sentences based on emotion
Emojifier V2 using LSTMs in KERAS

What you should remember:
If you have an NLP task where the training set is small, using word embeddings can help your algorithm significantly. Word embeddings allow your model to work on words in the test set that may not even have appeared in your training set.
Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
To use mini-batches, the sequences need to be padded so that all the examples in a mini-batch have the same length.
An Embedding() layer can be initialized with pretrained values. These values can be either fixed or trained further on your dataset. If however your labeled dataset is small, it's usually not worth trying to train a large pre-trained set of embeddings.
LSTM() has a flag called return_sequences to decide if you would like to return every hidden state or only the last one.
You can use Dropout() right after LSTM() to regularize your network.

Week 3: Sequence to sequence architectures

Sequence to sequence models
- Language translation for example
- Image captioning, caption an image
Picking the most likely model
- Machine Translation Model
  - Split into a model encoding the sentence; and then a language model.
  - Calculate the probability of an English sentence conditioned on a French sentence.
  - DON'T SAMPLE AT RANDOM! - Find the sentence that maximizes the conditional probability
BEAM search
- Beam Width - maintain a list of the best three words (for example) in a probabilistic sense.
- After the first word, you maintain a list of conditional probabilities of say two words together. You hardwire the previous word output into the next. You do it for all the three contenders- then find the top three across all.
- And you therefore continue, fragment by fragment.
- If beam width = 1, it essentially becomes greedy search.
Refinements to BEAM search
- Dealing with numerical underflow- so we take the log!- because multiplying small numbers might result in underflow.
- Also tends to favor shorter translations, because every extra word multiplies in another probability less than 1. Normalize by the number of words, which reduces the penalty for longer translations.
- Take the top sentences and compute the score- pick the highest!
- Choosing B
  - If B is large, you consider a lot of possibilities, but need more computation power.
  - If B is small, you consider fewer possibilities, but it is quicker to run.
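
A small sketch of the two refinements (log probabilities and length normalization) in plain Python. The exponent `alpha=0.7` is an assumed heuristic value (0 means no normalization, 1 means full normalization), and the per-word probabilities are made up.

```python
import math


def normalized_log_prob(token_probs, alpha=0.7):
    """Sum of log probabilities divided by (number of words) ** alpha."""
    log_sum = sum(math.log(p) for p in token_probs)   # logs avoid numerical underflow
    return log_sum / (len(token_probs) ** alpha)


short = [0.5, 0.5]             # 2 words, product 0.25
long = [0.6, 0.6, 0.6, 0.6]    # 4 words, product 0.1296

# Without normalization the shorter sentence wins...
print(normalized_log_prob(short, alpha=0.0), normalized_log_prob(long, alpha=0.0))
# ...with normalization the longer sentence, whose words are individually likelier, wins.
print(normalized_log_prob(short), normalized_log_prob(long))
```
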
Beam Search Error Analysis
- How to analyse where the error lies, is it the network or the Beam Search algo?
  - Switch into two cases, and you can find who is at fault exactly.
  - Find such cases, and do an error analysis for all faulty examples, and ascribe the error to either of the two.
Bleu Score- to decide between multiple good answers for a translation.
- Stands for 'Bilingual Evaluation Understudy'
- Modified Precision - count how many times each word of the output appears in the human provided reference translations, clipping the count at the maximum number of times it appears in any one reference.
- Look at pairs of words- bigrams - how many times do the bigrams appear?
- We do this for unigrams, bigrams, n-grams..
- Combined Bleu Score- basically average for unigrams, bigrams, n-grams...
- Brevity penalty- if you output short translations, the score is adjusted by penalizing them
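
A sketch of modified (clipped) n-gram precision in plain Python. The function names are illustrative; the combined score and the brevity penalty are not included here.

```python
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram count of the candidate divided by its total n-gram count."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision("the the the the the the the".split(), refs, 1))  # 2 / 7
print(modified_precision("the cat the cat on the mat".split(), refs, 2))   # 4 / 6
```

Clipping is what stops the degenerate "the the the ..." output from getting a perfect unigram score.
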
Attention Model Intuition
- A human does not memorize the entire sentence and then translate it; but this is what the encoder-decoder architecture is doing.
- So it does badly on longish sentences; instead, you work on one part of the sentence at a time.
- A set of attention weights - how much attention should you give to words when determining the translation.
- Implementation details
  - At every step, you decide how much context weight to give to the other words.
  - You input Context vectors at each time step.
  - Calculate factors for getting the attention weights using a small neural network.
  - TAKES A LOT OF TIME TO RUN THOUGH!- Is Quadratic - You can apply this idea to image captioning as well, just pay attention to parts of the picture.
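
A minimal sketch of one attention step in plain Python: the scores ("energies") for each input position are turned into attention weights with a softmax, and the context vector is the weighted sum of the encoder states. In the real model the energies come from a small neural network; here they are just given as numbers, and all names are illustrative.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)                                  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(energies, encoder_states):
    """Return (attention weights, weighted sum of the encoder states)."""
    weights = softmax(energies)
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return weights, context


# Two input positions with equal energy: each gets half of the attention.
weights, context = context_vector([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, context)
```

Doing this for every output step against every input step is where the quadratic cost comes from.
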
Speech recognition problem
- First you generate a spectrogram of the speech data and then run recognition
- Initially speech was broken into phonemes; but now deep learning is showing that phonemes are not required. Also because of much larger audio datasets available for training.
- CTC cost for speech recognition
  - You collapse repeated characters not separated by a blank.
- Trigger word detection- TRIGGERED!
  - Hey Siri, Okay Google etc.
  - Just binarize the target label - the labels might be imbalanced (skewed towards 0s).
    - To solve this, you might output more 1s in continuation.
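
The CTC collapsing rule mentioned above (collapse repeated characters not separated by a blank, then drop the blanks) can be sketched in plain Python. Using "_" as the blank symbol and the name `ctc_collapse` are choices made for this example.

```python
def ctc_collapse(sequence, blank="_"):
    """Collapse repeats that are not separated by a blank, then remove blanks."""
    out = []
    prev = None
    for ch in sequence:
        # Emit a character only when it differs from the previous frame and is not blank.
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)


print(ctc_collapse("ttt_h_eee___ ___qqq__"))
print(ctc_collapse("hh_e_ll_lo"))  # the blank between the two l's keeps them separate
```
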

CONCLUSION AND THANK YOU!

Assignment: Using a machine translation model to convert dates to human readable dates

Implement an attention model

Here's what you should remember from this notebook:
Machine translation models can be used to map from one sequence to another. They are useful not just for translating human languages (like French->English) but also for tasks like date format translation.
An attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output.
A network using an attention mechanism can translate from inputs of length Tx to outputs of length Ty, where Tx and Ty can be different.
You can visualize attention weights α⟨t,t′⟩ to see what the network is paying attention to while generating each output.

FINAL ASSIGNMENT - Trigger Word Detection

Converting raw audio to spectrograms
Use a conv layer to convert spectrogram to features
We use unidirectional instead of bidirectional; because we want to detect the word asap (and not wait for the whole sentence!)

Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
An end-to-end deep learning approach can be used to build a very effective trigger word detection system.


Udacity Nanodegree.md

Part 1

Choosing the right estimator

A Great PPT!

https://docs.google.com/presentation/d/1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k/preview?imm_mid=0f9b7e&cmp=em-data-na-na-newsltr_20171213&slide=id.g22aaaf9c33_0_76

The Golden Question

Choosing the right estimator, a cheatsheet from scikit: http://scikit-learn.org/stable/tutorial/machine_learning_map/
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/

Lesson 1

Refresher on machine learning: bite sized -- https://classroom.udacity.com/nanodegrees/nd009/parts/1d267043-f968-4853-9128-56f88f519d46
Visualizing ML, an intro: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

Lesson 5 : Training Models

Statistics Refresher - In Extracurricular

Mean, Median, Quartiles, IQR etc., Variability, Standard deviations and distributions

Training Models- small intro to numpy

Lesson 6 : Testing Models

Splitting data into testing and training
GOLDEN RULE: Never use training data for testing (duh..)

Lesson 7: Evaluation Metrics

Confusion matrix: True positive, false positive
Accuracy (Why might it be bad sometimes?)
Precision and Recall (https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg)
F1 Score (Harmonic Mean), F- beta Score
ROC (Receiver Operating Characteristic) Curve, Regression Metrics (R2 score)
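
A minimal sketch of the F-beta score in plain Python; F1 is the special case beta = 1 (the harmonic mean of precision and recall). The function name and the zero-division guard are choices made for this example.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


print(f_beta(0.5, 0.5))            # F1 of equal precision and recall
print(f_beta(1.0, 0.5, beta=1.0))  # F1
print(f_beta(1.0, 0.5, beta=2.0))  # F2 is lower here because recall is the weak side
```
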

Lesson 8: Detecting Errors

Overfitting (High variance) -- (Killing a fly with a bazooka) vs Underfitting (High Bias) (killing a godzilla with a flyswatter)
Cross validation, K fold cross validation
Learning curves

Lesson 9: A short summary - putting it all together

Grid search

Practice Project - Bag of words concept

Part 3

Lesson 2: Introduction to regression

Finding the best fit using calculus - Minimizing the sum of square error
Find best order of polynomial - Polynomial regression
Types of errors in training data
Cross Validation in Regression

Lesson 3: More regression

Parametric regression
Non Parametric Regression -- Instance based methods -- K Nearest neighbour vs Kernel Regression

Lesson 4: Regressions in sklearn

Continuous supervised learning (how does it differ from what you have learnt thus far?)
Continuous (Generally some sort of ordering) vs discrete classifier (No ordering ,even though they might be numbers)
Slope and Intercept
skLearn Practice - R square metric
Errors in Linear Regression- best model minimizes the sum of squared errors (Why Square and Not Absolute?). What is the problem with the sum of squared errors?
Benefits of R-square over Least Squares
Classification vs Regression: Differentiate based on output-- Chunk Number 34 -- Classification gives discrete labels (yes or no); but regression gives a concrete number from a continuous model

Lesson 5: Decision Trees

Classification vs Regression
Classification Learning Concepts- Hypothesis, Target Concept etc.
Decision Tree Introduction: How to decide which trees are better

Analogy with the 20 questions car games

Best Attributes
Decision Tree Expressiveness
Space complexity of decision trees: how many decision trees are possible?
ID3 algorithm - What does the best attribute mean? (Information Gain) - Formula for entropy
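
A short sketch of entropy and information gain in plain Python, the quantities ID3 uses to pick the best attribute. The function names and the toy labels are made up for illustration; `groups` is assumed to be the partition of `labels` produced by splitting on one attribute.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy before the split minus the weighted entropy of the groups after it."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
print(entropy(labels))                                               # evenly mixed: 1 bit
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))      # perfect split
print(information_gain(labels, [["yes", "no"], ["yes", "no"]]))      # useless split
```
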

Biases in ID3
Can you repeat an attribute? For continuous values, you can ask a different question. For discrete, no attribute should be repeated
Dealing with overfitting

Lesson 6: More decision trees

Multiple linear questions (Think of it as multiple linear questions)
Coding decision trees - Tuning parameters
Regression using decision trees
Data Impurity/Entropy - min_split criteria; tuning skLearn

Lesson 7: Neural Networks

Perceptron Units- Representing basic boolean operations using perceptron units
Perceptron Training -

(1) Perceptron Rule - Works if the dataset is Linearly Separable (The Halting problem) -- half plane, half space

(2) Gradient Descent (For non linear separability) - Sigmoid function- avoiding local minima!

Comparison of the two approaches
Back Propagation- Neural Networks
Restriction Bias, Preference Bias, Occam's Razor

Lesson 8: Support Vector Machines - The Math behind it

Best line is consistent to the training data, while committing to it the least

Derivation for the best line- maximizing the margin
Solving the best line for SVM - Quadratic Programming Problem -- Zero/Non zero alphas for vectors (input data)

Only a few points matter; the one close to the decision boundary- those are our SUPPORT VECTORS!

Linearly married - Kernel Trick. Domain knowledge is introduced via the Kernel Trick- The Mercer Condition

Lesson 10 - SVMs in Practice

Lesson 11 - Instance Based Learning

K Nearest neighbors

Intro
Classification vs Regression
Running times of various algos (Learning vs Querying)
Eager vs Lazy Learners
Different Distance Metrics - IMPORTANT TO HAVE NICE DOMAIN KNOWLEDGE!
KNN Preference Bias - Locality, Smoothness and importance of features
Curse of Dimensionality - Number of data points with respect to the dimensionality of your feature space
Locally weighted regression

Lesson 13: Bayesian Learning

Basic Refresher Video : https://www.youtube.com/watch?v=xw6utjoyMi4

Bayes Rule - Derivation using Probability Chain Rule , Prior - domain knowledge, Priors Matter A LOT!

Many times, we do not have the Prior of Data, but is not needed, as we just need the Maximum among all hypothesis

Maximum Likelihood vs Maximum A Posteriori

With noisy data- General Gaussian Derivation-- Comes up to minimizing sum of squared errors - mind blown!
Minimum Description Length - Entropy! - Minimizing error (mis classification) and getting simplest model
Finding best hypothesis vs finding best label

Lesson 14: Bayesian Inference

Joint Distribution
Conditional Independence
Belief Networks/Bayesian Network
Joint Distribution and Sampling - Conditional Probability
Inference Rules

Lesson 16: Ensemble Bagging and Boosting

Ensemble Classification : Combine simple rules to make a complex rule to classify
Ensemble Bagging - How is better? Relates to avoid overfitting
Ensemble Boosting - Weighted error rate - weak learning - Boosting Code
Increasing weight on the ones getting wrong, and reducing weight on the ones right ; in a particular iteration. Combining how to get the final hypothesis.
Boosting and overfitting - Error vs confidence

Project 2- Charity ML

Data preprocessing - Normalization, scikit minMaxScaler
Data preprocessing - OneHotEncoding for categorical values
F Beta Score
Grid Search
Feature Importance

Part 4 - Unsupervised Learning

Clustering

Trying to guess the data's structure when a data does not come with labels
KMeans and Outliers: https://stats.stackexchange.com/questions/214362/trouble-in-understanding-outliers-influence-on-k-means
KMeans - Assign a center and optimize
Visualization to understand: https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
SkLearn's KMeans - The number of clusters is a VERY important parameter
Limitations: Result not always the same; the problem of local minima. A local hill climbing algorithm

More Clustering

Single Linkage Clustering- Inter Cluster Distance - Big O Running Time
Soft Clustering- A point belongs to a cluster 'probabilistically'

Maximum Likelihood Gaussian

Expectation Maximization
Properties of Clustering - Richness, scale-invariance, consistency - IMPOSSIBILITY THEOREM- Cannot have All three!

Properties of EM (Can get stuck-local optima)

Clustering Mini Project

Feature Scaling

Feature Scaling

Giving equal weightage to all features.
Feature Scaling Formula
SKLearn MinMaxScaler
What algorithms would be affected?
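
A minimal sketch of the min-max feature scaling formula, x' = (x - min) / (max - min), in plain Python. Returning zeros when all values are equal is a choice made for this example; the input numbers are made up.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the formula would divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([115, 140, 175]))
```
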

Feature Selection

Important for 'Knowledge Discovery' and 'Curse of Dimensionality'
Feature Selection Algorithms: Filtering and Wrapping- Tradeoffs between the two
Filtering: Use something as a decision tree to use information gain and get the subset of the most important features- then these features are passed into another learner
Wrapping: Ways to do wrapping - Searching, forward and backward search - WHAT FEATURES ARE IMPORTANT?
Feature Relevance, Relevance (Strong vs Weak) vs usefulness Relevance measures effect on the Bayes Optimal Classifier
Usefulness for a feature is defined for a particular algorithm

Principal Component Analysis

A great link to visualize PCA: http://setosa.io/ev/principal-component-analysis/
PCA - focuses on shifting and rotating only - for eg. y = sin (x) will be a 2D system; PCA just does translation and rotation. The center of the coordinate moves to the center of the data.
Importance of the new axis
Measurable vs Latent Features
Composite Features- Principal Component is NOT regression!
How to get the Principal Component? Maximal Variance! - find the dimension with the maximum spread, which also minimizes the information loss
Feature Transformation- PCs can be used as independent features, i.e. they do not overlap in terms of information with each other
When to use PCA? Eg.: Eigenfaces

PCA Mini Project: Eigenfaces

Observation: A lot of PCs can lead to overfitting.

Feature Transformation

Transform a set of features into smaller, more compact features while retaining as much info as possible.
Why? An example of the google search problem- Problems such as polysemy and synonymy
Independent Component Analysis- ICA looks for statistical independence!
Cocktail Party problem: http://research.ics.aalto.fi/ica/cocktail/cocktail_en.cgi
PCA vs ICA- very different! Whereas PCA finds global stuff such as eigenfaces, ICA finds more distinct features such as 'nose', 'eyes' etc.
Alternatives: RCA (Random Component Analysis) - deals with curse of dimensionality and LDA (Linear Discriminant Analysis) - cares about the labels

Unsupervised learning project

Good way to find relevant features: try making a feature a label and predicting it from other features.--find R2 score and see if you can model a feature using the others.
Box-Cox transformation
Outlier Detection- Tukey's method
Outliers: To Drop, or not to drop?
Silhouette's coefficient for the effectiveness for k means: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
GMM (soft clustering) vs KMeans (hard clustering)
PCA in layman terms: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

Part 5 - Reinforcement Learning

Markov Decision Process

Think of a world where actions are uncertain with some probabilities

The parameters/variables in a Markov Decision Process: states, models, action, rewards
Markovian Property- The next state only depends on the current state
Delayed Rewards - What was the action that led to the ultimate reward? - Temporal Credit Assignment Problem
Rewards - the hot sand beach walking to the water analogy
Sequences of rewards- Infinite horizons, Utility of Sequences; Relationship between rewards and utilities
Optimal Policy- The policy that maximizes the expected rewards. Reward in a state is not the same as the utility for the state - Reward is short term gratification, while utility is long term gratification
The BellMan Equation - How to solve it? Value Iteration
Finding Policies (pi)- Policy Iterations

Reinforcement Learning

Managing vs Learning, Modeler and Simulator
Three approaches to reinforcement learning - Policy Search, Value Function Based, Model Based
Focus on Value Function- The Q function- Q Learning- VERY GOOD EXAMPLE: http://mnemstudio.org/path-finding-q-learning-tutorial.htm- Incremental Learning
VERY NICE SLIDES ON RL: http://home.deib.polimi.it/restelli/MyWebSite/pdf/rl5.pdf
Q-Learning: How to choose actions?- Local Min Problem- Simulated Annealing
Epsilon Greedy Exploration- Exploration vs Exploitation Dilemma, a tradeoff between two things
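
A compact sketch of the tabular Q-learning update and epsilon-greedy action selection in plain Python. The Q-table is assumed to be a dict keyed by (state, action); the learning rate `alpha`, discount `gamma`, and the state and action names are made-up illustration values.

```python
import random


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Incremental update: move Q(s, a) towards reward + gamma * max_a' Q(s', a')."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """With probability epsilon explore (random action), otherwise exploit the best known one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))


actions = ["forward", "left"]
q = {}
print(q_update(q, "s1", "forward", 1.0, "s2", actions))
print(epsilon_greedy(q, "s1", actions, epsilon=0.0))  # pure exploitation
```

Decaying `epsilon` over time is one common way to shift from exploration to exploitation.
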

Game Theory

The mathematics of conflicts of interest - Game theory has multiple agents (as opposed to the single agent in the above cases); Game Tree
Matrix form of the game, writing strategies of agents against one another and placing rewards.
Material on 2-player zero-sum games: http://www.cs.cmu.edu/~./awm/tutorials/gametheory.html
MiniMax theorem: Minimax is same as Maximin in the 2-player zero-sum game; i.e. maximizing the min is same as minimizing the max. Find the value of the game.
Von Neumann's Theorem - Relevant for non-deterministic games of perfect information as well
Then instead of perfect information, we go to hidden information - Minimax no longer equals maximin for pure strategies!!!
Mixed strategy vs pure strategy; Center Game
Non zero sum game - Prisoner's dilemma
Nash equilibrium - Playing the game multiple times; for an n-times repeated game, the solution is the N.E. repeated n times.
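A small sketch of the matrix form and the minimax/maximin comparison, restricted to pure strategies. Payoffs are for the row player in a zero-sum game; both example matrices are made up, the second being the standard matching-pennies payoff.

```python
def maximin(payoffs):
    # Row player picks the row whose worst case is best.
    return max(min(row) for row in payoffs)


def minimax(payoffs):
    # Column player picks the column whose best case (for the row player) is worst.
    return min(max(col) for col in zip(*payoffs))


saddle_game = [[3, 5],
               [2, 1]]
matching_pennies = [[1, -1],
                    [-1, 1]]

print(maximin(saddle_game), minimax(saddle_game))            # 3 3
print(maximin(matching_pennies), minimax(matching_pennies))  # -1 1
```

In the first game both numbers agree, so 3 is the value of the game. In the second there is no pure-strategy saddle point, which is where mixed strategies come in.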

More Game Theory

Stochastic Games and Multi Agent Reinforcement Learning
Zero Sum Stochastic Games and General Sum Stochastic Games - Nash Q Algorithm

Reinforcement Learning Project: Smart Cab

PyGame
Selecting state space
Exploration vs Exploitation

Other times the agent learns a suboptimal policy because it first explores an action which is sub-optimal, but does yield positive rewards, and then repeatedly exploits that action. Later it may randomly explore the optimal policy, but at that point the suboptimal policy will have a higher value in the q-table.
For example, this might be "going forward at a green light" instead of following the waypoint at a green light. We will get some reward for simply moving on green, regardless of the waypoint, but it's not optimal. However it will be regularly exploited until exploration occurs again. During the exploitation period, it will build up a significant lead on the optimal policy.

Deep Learning

Lesson 1: More Deep Learning

Juanito is playing; he has to divide the points, and he is going to draw a line; how is he going to do it?
Linear boundaries for dividing data points. Then generalized for higher dimensions/features.
Perceptrons in terms of nodes (Neural Networks); perceptrons as logical operators
Perceptron Trick, Learning Rate. Start Random and then try fitting the line iteratively to correctly predict the mis-predicted points.
Error function- log-loss error function- When can you use Gradient descent?
Discrete vs Continuous Predictions- Sigmoid Function; Softmax Function, One Hot Encoding
Maximum Likelihood; Cross Entropy; Multi Class Cross Entropy
Minimizing the error function given by cross entropy formula. Gradient Descent.
Similarities and Comparison between Perceptron and Gradient Descent; A correctly classified point asks the separation line to go away; and a misclassified point asks the line to come closer.. (think about it, makes sense!)
Non Linear Models; Combining multiple perceptrons; Hidden Layers, Multi class classification
Feedforward and training Neural Networks; Backpropagation
Keras, Student Admissions Mini Project
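The perceptron trick from the list above, sketched on the AND gate (a perceptron as a logical operator). This is a hedged illustration: the notes say to start from a random line, but the sketch starts from zero weights so the run is reproducible, and the learning rate of 1.0 is an arbitrary choice.

```python
def predict(weights, bias, point):
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else 0


def perceptron_train(points, labels, learning_rate=1.0, epochs=100):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for point, label in zip(points, labels):
            if predict(weights, bias, point) != label:
                errors += 1
                # Misclassified point: move the line towards it.
                # Add the point if it should be positive, subtract otherwise.
                sign = 1 if label == 1 else -1
                weights = [w + sign * learning_rate * x for w, x in zip(weights, point)]
                bias += sign * learning_rate
        if errors == 0:  # every point is on the correct side
            break
    return weights, bias


# AND gate: only (1, 1) is labelled 1.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = perceptron_train(points, labels)
predictions = [predict(weights, bias, p) for p in points]
print(predictions)  # [0, 0, 0, 1]
```

Correctly classified points leave the line untouched here; only mistakes trigger an update, which is the main difference from the gradient descent version.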

Description of batch size etc:

* one epoch = one forward pass and one backward pass of all the training examples
* batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
* number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
Example: if you have 1000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.
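The arithmetic above as a tiny helper. The function name is made up, and rounding up assumes the last, smaller batch is still processed.

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    # One iteration = one forward + backward pass over one batch.
    return math.ceil(num_examples / batch_size)


print(iterations_per_epoch(1000, 500))  # 2
print(iterations_per_epoch(1000, 300))  # 4 (three full batches plus one of 100)
```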

Training optimization- Stochastic Gradient Descent and Batch Gradient Descent; How to choose the decay of the learning rate; Overfitting vs Underfitting (the exam analogy); and how it is applicable in the neural network setting
Model Complexity Graph- Complexity generally increases with the increasing number of epochs- Early Stopping
- Regularization and Overfitting- Punish high coefficients to avoid them!
Dropout- Avoid dominance of one part of the neural network to let some of the other weaker parts train- Vanishing Gradient - try other Activation Functions, like hyperbolic tan, Relu etc.
The problem of local minima! Try random restart or go with more momentum
A blog on the optimizers available in Keras: http://ruder.io/optimizing-gradient-descent/index.html#rmsprop
Mini Project: IMDB
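A rough sketch of early stopping with a patience counter, working on a list of validation losses. The loss values are invented and the patience of 2 epochs is an assumption; real frameworks offer this as a callback.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, best_loss) seen before training would stop."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:  # no improvement for `patience` epochs
                break
    return best_epoch, best_loss


# Hypothetical validation losses: they improve, then start rising (overfitting).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.40]
stopped = early_stopping_epoch(val_losses)
print(stopped)  # (3, 0.5)
```

The last value (0.40) is never reached: a small patience saves time but can stop before a later improvement.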

Convolutional Neural Networks (CNN)

Applications of CNNs- Eg. Image Classification, Text Classification, Pictionary etc.
The MNIST project: recognizing digits from images
- One Hot Encoding, Flattening image matrices to vectors, Vanishing Gradient Problem
- A great link on activation functions: http://cs231n.github.io/neural-networks-1/#actfun; Categorical cross entropy loss as a loss function
- Choosing the best model- split into validation sets!
CNNs vs MLPs (Multi layer perceptrons)- When do MLPs fail? -.-
- MLPs use a lot of params (Sparsely (locally) connected vs fully connected layer)
- Throwing away 2D neighborhood information (such as in an image!) due to flattening
- Color coding
Convolutional layers
- Convolutional windows- Use multiple filters to detect multiple patterns
- Activation Maps
- Color Images!
- Stride and Padding
Convolutional Layers in Keras
- You are strongly encouraged to add a ReLU activation function to every convolutional layer in your networks.
- Formula for number of parameters in a convolutional layer and formulas for shape of a convolutional layer
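The formulas mentioned above, written out as a sketch. It assumes square filters, one bias per filter, and the usual "same"/"valid" padding conventions; the function names are made up.

```python
import math


def conv_param_count(filters, kernel_size, depth_in):
    # Each filter has kernel_size * kernel_size * depth_in weights plus one bias.
    return filters * (kernel_size * kernel_size * depth_in + 1)


def conv_output_size(size_in, kernel_size, stride, padding):
    # Height (or width) of the output; the depth equals the number of filters.
    if padding == "same":
        return math.ceil(size_in / stride)
    if padding == "valid":
        return math.ceil((size_in - kernel_size + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")


print(conv_param_count(filters=32, kernel_size=3, depth_in=3))  # 896
print(conv_output_size(32, kernel_size=3, stride=1, padding="valid"))  # 30
print(conv_output_size(32, kernel_size=3, stride=2, padding="same"))   # 16
```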
Pooling Layers
- Used for dimensionality reduction and avoiding overfitting.
- Take feature maps as input
- Max Pooling Layer, Global Average Pooling Layer
- Think as a stack of pancakes!
- CNNs for image classification
- Resizing the images. Aim is to decrease the width and the height of the image, while increasing the depth of the image. Use max pooling layers to reduce dimensionality, i.e. reduce height and width.
- A connected layer at the very end.
```
When constructing a network for classification, the final layer in the network should be a Dense layer with a softmax activation function. The number of nodes in the final layer should equal the total number of classes in the dataset.
```
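A minimal sketch of what the pooling layers above compute on a single feature map, in plain Python. The 4x4 feature map is made up; a 2x2 window with stride 2 is assumed for max pooling.

```python
def max_pool_2d(feature_map, pool=2, stride=2):
    rows = (len(feature_map) - pool) // stride + 1
    cols = (len(feature_map[0]) - pool) // stride + 1
    # Keep only the largest value inside each window.
    return [
        [
            max(feature_map[r * stride + i][c * stride + j]
                for i in range(pool) for j in range(pool))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def global_average_pool(feature_map):
    # Collapse the whole map to a single number.
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [7, 2, 9, 4],
    [0, 1, 3, 5],
]
pooled = max_pool_2d(feature_map)
print(pooled)                            # [[6, 2], [7, 9]]
print(global_average_pool(feature_map))  # 3.125
```

Height and width are halved while the number of feature maps (the depth) is unchanged, since each map in the stack is pooled independently.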
The CIFAR 10 image database project
Keras Cheat Sheet- https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Keras_Cheat_Sheet_Python.pdf
Image Augmentation
- Scale invariance, Translation Invariance and Rotation invariance. Add random images with a bit of rotation, translation etc to the dataset. Augment with ImageDataGenerator. Note the use of steps_per_epoch, fit_generator and flow when fitting the model.
Groundbreaking CNN architectures, eg. ResNet, VGG etc.
Transfer Learning- Using a pre-trained neural network to solve a new problem, i.e. a different dataset.
- The initial layers detect more common patterns such as circles, shapes etc; so they can be kept. Then you just train the final layers.
```
Here is a generalized overview of what the convolutional neural network does:
  the first layer will detect edges in the image
  the second layer will detect shapes
  the third convolutional layer detects higher level features
```

CNN Dog Recognition project

Haar cascade face detection, using ResNet50 for dog detection
Classifying breeds is a difficult problem.
Additional links-
- http://cs231n.github.io/
- http://cs231n.github.io/transfer-learning/

Deep Learning Extracurricular

Lesson 3: Intro to TensorFlow

Brings different communities, such as speech recognition and computer vision, together with a common set of tools to solve their problems.
Intro to TensorFlow constants and sessions; placeholder and feed_dict
Supervised Classification
- Training a logistic classifier- weights, bias etc.
Some coding quizzes on TensorFlow placeholder, softmax etc.
Activation functions- Relu; softmax; implementing cross entropy
Practical Aspects of Deep Learning - normalize variables to have zero mean and equal variance - Badly conditioned vs well conditioned- well conditioned makes optimization easier (numerically)
Measuring performance- Have classifiers generalize, not memorize. That's why you use validation sets!
The problem of scaling gradient descent- take a random subset of the training data, compute the gradient on it- do this many times! Stochastic Gradient Descent- Exponential Decay
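Softmax and cross entropy from the list above, sketched without TensorFlow so the arithmetic is visible. The logits are an arbitrary example; subtracting the maximum logit is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    shift = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot):
    # Only the true class contributes: -log(probability assigned to it).
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]

loss_correct = cross_entropy(probs, [1, 0, 0])  # true class is the likeliest one
loss_wrong = cross_entropy(probs, [0, 0, 1])    # true class got low probability
print(loss_correct < loss_wrong)  # True
```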

Small Exercise

Nice exercise to see how much RAM space you need.

Lesson 4: Intro to Neural Networks

Luis is back!!