Personal Notes on Data Science and ML

1. Important Links

A Medium publication sharing concepts, ideas, and codes.

https://towardsdatascience.com/

Pandas DataFrame - Playing with CSV Files

https://towardsdatascience.com/pandas-dataframe-playing-with-csv-files-944225d19ff

Machine Learning | An Introduction

https://towardsdatascience.com/machine-learning-an-introduction-23b84d51e6d0

The Art of Effective Visualization of Multi-dimensional Data

https://towardsdatascience.com/the-art-of-effective-visualization-of-multi-dimensional-data-6c7202990c57

2. Best TED Talks for Data Science

  1. How not to be ignorant about the world
  2. The Beauty of Data Visualization
  3. Three ways to spot a bad statistic
  4. Big Data is better data
  5. The human insight missing from Big Data
  6. Your company’s data could end world hunger
  7. Who Controls the World

3. Bayesian Statistics and Naive Bayes Classifier Refresher

Bayes Theorem

Bayes' Theorem is a formula that tells us how to update the probability of a hypothesis when a given event occurs: P(A|B) = P(B|A) · P(A) / P(B).
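
A quick numeric sketch of this update rule (the prevalence and test-accuracy numbers below are made up for illustration):

```python
# Sketch of Bayes' Theorem with illustrative (made-up) numbers.
# Hypothesis A: a patient has a disease; Event B: the test is positive.
p_a = 0.01              # prior P(A): disease prevalence
p_b_given_a = 0.95      # likelihood P(B|A): test sensitivity
p_b_given_not_a = 0.05  # false-positive rate P(B|not A)

# Total probability of the event B
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # ~0.161: the updated probability after seeing the event
```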

Feature scaling needs to be done when the algorithms are based on Euclidean distance (see the sketch after the examples below).

Examples:

4. Algorithms based on Euclidean Distance:

K-nearest neighbors (classification) or K-means (clustering)

5. Algorithms not based on Euclidean Distance:

Decision Tree Classification
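
A minimal sketch of the contrast, assuming scikit-learn and a made-up toy dataset: the distance-based KNN gets its features scaled so no single feature dominates the distance, while the tree-based model does not need scaling.

```python
# Scaling matters for Euclidean-distance algorithms (KNN), not for trees.
# Assumes scikit-learn; the toy data below is made up for illustration.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[1, 20000], [2, 30000], [3, 25000], [4, 80000]]  # features on very different scales
y = [0, 0, 0, 1]

# Distance-based model: scale features before computing Euclidean distances
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)

# Tree-based model: splits on thresholds, so scaling is not required
tree = DecisionTreeClassifier().fit(X, y)
```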

6. What are the things to know in ML to complete the Titanic challenge on kaggle?

https://www.ahmedbesbes.com/blog/kaggle-titanic-competition
https://www.kaggle.com/anaskad/step-by-step-solving-titanic-problem
https://www.quora.com/How-does-one-solve-the-titanic-problem-in-Kaggle
https://www.kaggle.com/startupsci/titanic-data-science-solutions

7. Metrics for Summarizing Model Quality

7.1 MAE (Mean Absolute Error)

  • The average magnitude of the errors in a set of predictions, without considering their direction.
  • It's the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight.

MAE = (1/n) Σⱼ₌₁ⁿ |yⱼ − ŷⱼ|
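
A small sketch computing MAE by hand and with scikit-learn's mean_absolute_error (the true and predicted values are made up):

```python
# MAE on made-up predictions, first by hand and then with scikit-learn.
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae_manual = np.mean(np.abs(y_true - y_pred))   # (1/n) * sum of |y_j - y_hat_j|
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)  # both 0.75
```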

8. Overfitting & Underfitting

  • Overfitting is a phenomenon where a model matches the training data almost perfectly but does poorly on validation and other new data.
  • Underfitting is when a model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data.

A model is said to have high bias when its structure is too simple to describe the underlying patterns in the data.

Relation With Overfitting And Underfitting

Analyzing models of increasing flexibility (e.g., polynomial fits of degree 0, 1, and 9), it's clear that:

  • A model with low variance and low bias is the ideal model (the degree-1 model).
  • A model with low bias and high variance is an overfitting model (the degree-9 model).
  • A model with high bias and low variance is usually an underfitting model (the degree-0 model).
  • A model with high bias and high variance is the worst-case scenario, as it produces the greatest possible prediction error.

As a general rule, the more flexible a model is, the higher its variance and the lower its bias. The less flexible a model is, the lower its variance and the higher its bias.
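
A rough sketch of this trade-off, assuming scikit-learn and synthetic quadratic data: a low-degree fit underfits (high bias), a very high-degree fit overfits (high variance), and the gap between training and validation error shows which is happening.

```python
# Training vs. validation error as model flexibility grows.
# Assumes scikit-learn; the quadratic toy data is made up for illustration.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=60)   # quadratic signal plus noise

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for degree in (1, 2, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    val_mae = mean_absolute_error(y_val, model.predict(X_val))
    # High train and validation error -> underfitting (high bias);
    # low train but much higher validation error -> overfitting (high variance).
    print(f"degree {degree}: train MAE {train_mae:.2f}, val MAE {val_mae:.2f}")
```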

9. Decision Tree

A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.
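
A short sketch of tuning this trade-off, assuming scikit-learn and a synthetic regression dataset standing in for real house prices; max_leaf_nodes controls how deep and leafy the tree can grow.

```python
# Tuning tree size to balance underfitting and overfitting.
# Assumes scikit-learn; synthetic data stands in for a housing dataset.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for max_leaf_nodes in (5, 50, 500, 5000):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    val_mae = mean_absolute_error(y_val, model.predict(X_val))
    # Too few leaves -> shallow tree that underfits;
    # too many leaves -> deep tree that overfits the training samples.
    print(max_leaf_nodes, round(val_mae, 1))
```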

10. Random Forest

  • The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree.
  • It generally has much better predictive accuracy than a single decision tree and it works well with default parameters.
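
A minimal comparison sketch under the same assumptions (scikit-learn, synthetic data in place of a real dataset); the forest's averaged prediction usually gives a lower validation error than the single tree, even with default parameters.

```python
# A random forest averages many trees and usually beats a single decision tree.
# Assumes scikit-learn; synthetic data is used only for illustration.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
forest = RandomForestRegressor(random_state=0).fit(X_train, y_train)  # default parameters

print("single tree MAE:", round(mean_absolute_error(y_val, tree.predict(X_val)), 1))
print("random forest MAE:", round(mean_absolute_error(y_val, forest.predict(X_val)), 1))
```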