@Rubix982
Last active June 22, 2021 16:46
Just the final AI Lab Notes

AI Lab Preparation

K Nearest Neighbors

  1. Code,
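from math import sqrt

# calculate the Euclidean distance between two vectors
# (the last element of each row is assumed to be the class label, so it is skipped)
def euclidean_distance(row1, row2):
 distance = 0.0
 for i in range(len(row1)-1):
  distance += (row1[i] - row2[i])**2
 return sqrt(distance)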

# List of the data ...
dataset = []

# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
 distances = list()
 for train_row in train:
  dist = euclidean_distance(test_row, train_row)
  distances.append((train_row, dist))
 distances.sort(key=lambda tup: tup[1])
 neighbors = list()
 for i in range(num_neighbors):
  neighbors.append(distances[i][0])
 return neighbors
  2. Advantages as "NNE",

    1. No Training - KNN is called a lazy learner (instance-based learning). It does not learn anything in the training period and does not derive any discriminative function from the training data. In other words, there is no training period: it stores the training dataset and uses it only when making real-time predictions. This makes KNN much faster to set up than algorithms that require training, e.g. SVM or Linear Regression; the work is deferred to prediction time instead.
    2. New Data - Since KNN requires no training before making predictions, new data can be added seamlessly without impacting the accuracy of the algorithm.
    3. Easy Implementation - KNN is very easy to implement. Only two parameters are required: the value of K and the distance function (e.g. Euclidean or Manhattan).
  3. Disadvantages as "LHFS",

    1. Large Datasets - Does not work well with large datasets: the cost of calculating the distance between the new point and every existing point is huge, which degrades the performance of the algorithm.
    2. High Dimensionality Problems - Does not work well with high-dimensional data: with a large number of dimensions, distances between points become less informative and more expensive to compute.
    3. Feature Scaling - Needs feature scaling: we need to scale features (standardization or normalization) before applying KNN to any dataset. Otherwise features with large numeric ranges dominate the distance and KNN may generate wrong predictions.
    4. Sensitivity - Sensitive to noisy data, missing values and outliers: we need to impute missing values and remove outliers manually.
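
The feature scaling point above can be illustrated with a small sketch. It assumes rows of the form `[feature..., label]` as in the code above; the helper names `min_max_scale` and `predict_classification` and the toy height/income data are made up for illustration and are not part of the original notes. On this data, the unscaled vote is driven almost entirely by income, while the scaled vote uses both features.

```python
from collections import Counter
from math import sqrt

def min_max_scale(rows):
    """Rescale every feature column to [0, 1]; the last column (label) is kept as-is."""
    n_features = len(rows[0]) - 1
    lows = [min(r[i] for r in rows) for i in range(n_features)]
    highs = [max(r[i] for r in rows) for i in range(n_features)]
    scaled = []
    for r in rows:
        feats = [
            (r[i] - lows[i]) / (highs[i] - lows[i]) if highs[i] > lows[i] else 0.0
            for i in range(n_features)
        ]
        scaled.append(feats + [r[-1]])
    return scaled

def euclidean_distance(row1, row2):
    # Compare features only; the last element of each row is the label
    return sqrt(sum((a - b) ** 2 for a, b in zip(row1[:-1], row2[:-1])))

def predict_classification(train, test_row, num_neighbors):
    """Majority vote among the num_neighbors closest training rows."""
    neighbors = sorted(train, key=lambda row: euclidean_distance(test_row, row))
    labels = [row[-1] for row in neighbors[:num_neighbors]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical toy data: [height_cm, income, label]; the last row is the query
data = [
    [150, 30000, 0],
    [152, 90000, 0],
    [155, 60000, 0],
    [180, 32000, 1],
    [182, 88000, 1],
    [185, 61000, 1],
    [153, 62000, None],
]

unscaled_prediction = predict_classification(data[:-1], data[-1], 3)

# Scale the query together with the training rows so both share one range
scaled = min_max_scale(data)
scaled_prediction = predict_classification(scaled[:-1], scaled[-1], 3)

print(unscaled_prediction, scaled_prediction)
```

The query has a height typical of class 0, but without scaling two of its three nearest neighbors are class 1 rows that merely have a similar income.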

K Means

  1. Code
import numpy as np

def euclidean_distance(X, centroids):
    # Pairwise distances: result[i, j] is the distance from point i to centroid j
    diff = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def kmeans(X,k=3,max_iterations=100):
    '''
    X: multidimensional data
    k: number of clusters
    max_iterations: number of repetitions before clusters are established
    
    Steps:
    1. Convert data to numpy array
    2. Pick indices of k random points without replacement
    3. Find class (P) of each data point using euclidean distance
    4. Stop when max_iterations is reached or the P matrix doesn't change
    
    Return:
    array: containing class of each data point
    '''
    X = np.asarray(X, dtype=float)
    idx = np.random.choice(len(X), k, replace=False)
    centroids = X[idx, :]
    P = np.argmin(euclidean_distance(X, centroids), axis=1)
    for _ in range(max_iterations):
        # Assumes no cluster becomes empty (an empty cluster would give a NaN mean)
        centroids = np.vstack([X[P==i,:].mean(axis=0) for i in range(k)])
        tmp = np.argmin(euclidean_distance(X, centroids), axis=1)
        if np.array_equal(P, tmp):
            break
        P = tmp
    return P
  2. Advantages as "ESCWNC",
    1. Easy Implementation - Relatively simple to implement
    2. Scalable - Scales to large data sets
    3. Convergence - Guarantees convergence, although possibly to a local optimum rather than the best clustering
    4. Warm Start - Can warm-start the positions of centroids
    5. New Data Adaption - Easily adapts to new examples
    6. Clusters With Different Sizes - Can be generalized to clusters of different shapes and sizes, such as elliptical clusters.
  3. Disadvantages as "KOVID",
    1. k - Choosing k manually
    2. Outliers - Clustering outliers: centroids can be dragged by outliers, or outliers can get their own cluster
    3. Varying Sizes - Clustering data of varying sizes and density (plain k-means struggles unless it is generalized)
    4. Initial Values - Being dependent on initial values
    5. Dimensionality - Scaling with number of dimensions
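
Two of the disadvantages above (choosing k manually and depending on initial values) are often handled by running k-means several times and comparing the within-cluster sum of squares, sometimes called inertia. The following is a minimal sketch under that assumption; the names `kmeans_once` and `best_of_restarts`, the restart count and the synthetic two-blob data are all illustrative choices, not part of the original notes.

```python
import numpy as np

def pairwise_distances(X, centroids):
    # result[i, j] is the Euclidean distance from point i to centroid j
    return np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))

def kmeans_once(X, k, rng, max_iterations=100):
    """One k-means run from a random start; returns (labels, inertia)."""
    centroids = X[rng.choice(len(X), k, replace=False)]
    labels = np.argmin(pairwise_distances(X, centroids), axis=1)
    for _ in range(max_iterations):
        # Keep the previous centroid if a cluster happens to be empty
        centroids = np.vstack([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        new_labels = np.argmin(pairwise_distances(X, centroids), axis=1)
        if np.array_equal(labels, new_labels):
            break
        labels = new_labels
    # Sum of squared distances from each point to its assigned centroid
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return labels, inertia

def best_of_restarts(X, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times and keep the run with the lowest inertia."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])

# Synthetic data: two well-separated 2-D blobs of 50 points each
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(0.0, 0.5, (50, 2)),
    data_rng.normal(5.0, 0.5, (50, 2)),
])

inertias = {k: best_of_restarts(X, k)[1] for k in (1, 2, 3, 4)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this the inertia drops sharply from k=1 to k=2 and only slowly afterwards, which is the "elbow" people usually look for when picking k by hand.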

Decision Tree

  1. Code
  2. Advantages as "LNNME",
    1. Less Effort - Compared to other algorithms, decision trees require less effort for data preparation during pre-processing
    2. No Normalization - A decision tree does not require normalization of data
    3. No Scaling Required - A decision tree does not require scaling of data either
    4. Missing Values Do Not Affect - Missing values in the data do NOT affect the process of building a decision tree to any considerable extent
    5. Easy To Explain Results - A decision tree model is very intuitive and easy to explain to technical teams as well as stakeholders.
  3. Disadvantages as "ICMEC",
    1. Instability - A small change in the data can cause a large change in the structure of the decision tree, causing instability
    2. Complex Calculations - The calculations for a decision tree can sometimes become far more complex than for other algorithms
    3. More Time - A decision tree often takes more time to train
    4. Expensive Training - Decision tree training is relatively expensive because of the added complexity and time
    5. Continuous Values Prediction - A decision tree is often a poor fit for regression and for predicting continuous values, since it can only output a fixed value per leaf.
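
The Code item for this section is empty in the notes. As a hedged sketch only, the core step of one common approach (CART-style splitting on the Gini index) could look like the following. It assumes rows of the form `[feature..., label]`, like the KNN and Naive Bayes code in these notes; the names `gini_index`, `split_rows` and `get_best_split` and the toy dataset are illustrative, and a full tree would apply `get_best_split` recursively to each resulting group.

```python
def gini_index(groups, classes):
    """Weighted Gini impurity of a candidate split (0.0 means both groups are pure)."""
    n_instances = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        if not group:
            continue  # an empty group contributes nothing
        labels = [row[-1] for row in group]
        score = sum((labels.count(c) / len(group)) ** 2 for c in classes)
        gini += (1.0 - score) * (len(group) / n_instances)
    return gini

def split_rows(index, value, dataset):
    """Split rows on one feature: left has row[index] < value, right has the rest."""
    left = [row for row in dataset if row[index] < value]
    right = [row for row in dataset if row[index] >= value]
    return left, right

def get_best_split(dataset):
    """Try every feature/value pair seen in the data and keep the lowest Gini."""
    classes = list(set(row[-1] for row in dataset))
    best = {"index": None, "value": None, "gini": float("inf"), "groups": None}
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = split_rows(index, row[index], dataset)
            gini = gini_index(groups, classes)
            if gini < best["gini"]:
                best = {"index": index, "value": row[index], "gini": gini, "groups": groups}
    return best

# Hypothetical toy data: [x1, x2, label]
dataset = [
    [2.7, 1.0, 0],
    [1.3, 3.1, 0],
    [3.6, 2.2, 0],
    [7.5, 3.0, 1],
    [9.0, 1.4, 1],
    [7.4, 2.1, 1],
]

split = get_best_split(dataset)
print(split["index"], split["value"], split["gini"])
```

No scaling or normalization is applied here, which matches the advantages listed above: the split only compares values within a single feature.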

Bayesian Network (Naive Bayes)

  1. Code
# Example of calculating class probabilities
from math import sqrt
from math import pi
from math import exp
 
# Split the dataset by class values, returns a dictionary
def separate_by_class(dataset):
 separated = dict()
 for i in range(len(dataset)):
  vector = dataset[i]
  class_value = vector[-1]
  if (class_value not in separated):
   separated[class_value] = list()
  separated[class_value].append(vector)
 return separated
 
# Calculate the mean of a list of numbers
def mean(numbers):
 return sum(numbers)/float(len(numbers))
 
# Calculate the standard deviation of a list of numbers
def stdev(numbers):
 avg = mean(numbers)
 variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
 return sqrt(variance)
 
# Calculate the mean, stdev and count for each column in a dataset
def summarize_dataset(dataset):
 summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
 del(summaries[-1])
 return summaries
 
# Split dataset by class then calculate statistics for each column
def summarize_by_class(dataset):
 separated = separate_by_class(dataset)
 summaries = dict()
 for class_value, rows in separated.items():
  summaries[class_value] = summarize_dataset(rows)
 return summaries
 
# Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
 exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
 return (1 / (sqrt(2 * pi) * stdev)) * exponent
 
# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
 total_rows = sum([summaries[label][0][2] for label in summaries])
 probabilities = dict()
 for class_value, class_summaries in summaries.items():
  probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
  for i in range(len(class_summaries)):
   mean, stdev, _ = class_summaries[i]
   probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
 return probabilities
  2. Advantages as "ERER",
    1. Extensible - Bayesian Networks are more extensible than other networks and learning methods. Adding a new piece to the network requires only a few probabilities and a few edges in the graph, so it is an excellent fit for adding a new piece of data to an existing probabilistic model
    2. Readable Results - The graph of a Bayesian Network is useful. It is readable to both computers and humans; both can interpret the information, unlike some networks such as neural networks, which humans can’t read
    3. Efficient Calculation - Complex probability problems can be computed efficiently
    4. Relationships - The network shows whether a relationship between variables is present or absent, and what type of relationship it is
  3. Disadvantages as "CCCS",
    1. Computationally expensive - E.g. structure learning is NP-hard in general, and approximate structure learning is costly too
    2. Cause Effect Relationship - A BN encodes only directional relationships between random variables, not bi-directional ones, so it does not depict variables that are merely correlated. A BN also does not provide any guarantee that its edges depict a true cause and effect relationship (clarification credited to Dmitry Lunin)
    3. Cycle Networks - A BN is a DAG. If the data was generated from a model where at least 3 variables are correlated with each other (a cyclic relationship), then Bayesian networks (BNs) will not be able to model this relationship
    4. Sophisticated - One of the most important issues with BNs is that some of the sophisticated scoring functions require reliable priors in order to find a structure closer to the original model
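
The code in this section stops at per-class probabilities. A prediction step is usually just "pick the class with the highest score", sketched below in compact, self-contained form. This is an illustration under assumptions rather than the notes' own method: it uses `statistics.mean` and `statistics.stdev` in place of the hand-written helpers, sums log probabilities instead of multiplying raw ones (to avoid underflow), and the `predict` name and toy dataset are made up.

```python
from math import log, pi, sqrt
from statistics import mean, stdev

def summarize_by_class(dataset):
    """Map each class label to a list of (mean, stdev, count), one per feature column."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return {
        label: [(mean(col), stdev(col), len(col)) for col in list(zip(*rows))[:-1]]
        for label, rows in separated.items()
    }

def gaussian_log_pdf(x, mu, sigma):
    # Log of the Gaussian density, computed directly to avoid log(0) on underflow
    return -((x - mu) ** 2) / (2 * sigma ** 2) - log(sqrt(2 * pi) * sigma)

def predict(summaries, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    total_rows = sum(stats[0][2] for stats in summaries.values())
    best_label, best_score = None, float("-inf")
    for label, class_summaries in summaries.items():
        score = log(class_summaries[0][2] / total_rows)  # log prior
        for i, (mu, sigma, _) in enumerate(class_summaries):
            score += gaussian_log_pdf(row[i], mu, sigma)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: [x1, x2, label]; each class needs at least two rows for stdev
dataset = [
    [3.39, 2.33, 0], [3.11, 1.78, 0], [1.34, 3.37, 0], [3.58, 4.68, 0], [2.28, 2.87, 0],
    [7.42, 4.70, 1], [5.75, 3.53, 1], [9.17, 2.51, 1], [7.79, 3.42, 1], [7.94, 0.79, 1],
]

summaries = summarize_by_class(dataset)
low_x1_prediction = predict(summaries, [3.0, 2.5])
high_x1_prediction = predict(summaries, [8.0, 3.0])
print(low_x1_prediction, high_x1_prediction)
```

Working in log space gives the same ranking of classes as multiplying the probabilities, because the logarithm is monotonic.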