Forgotten things from data mining and machine learning. A bit rough and lengthy. Submission suggested a plain text file... I feel sorry for the reader. Models and explanations are best accompanied with diagrams.
Scenario:
1. Explain the difference between (a) supervised and (b) unsupervised machine learning (i.), and (ii.) give examples of when a machine learning specialist would use each. Additionally, provide a description of algorithms one might use in either or both of these machine learning categories.
The major difference between supervised and unsupervised machine learning is that in supervised learning the datasets are already labelled (each example carries a known discrete class or continuous value for the variable the analyst wants to predict), whereas in unsupervised learning the data are unlabelled and the aim is to discover interesting structure or inferences within them.
(a.) Supervised machine learning and algorithms
(i.) Supervised machine learning refers to predictive modelling from known (labelled) training data (e.g. a set of tuples with input variables and a known output/label), with the aim of predicting the outputs for subsequent new datasets: classification for categorical variables and regression for continuous ones.
Regression models and decision trees are common examples of supervised machine learning algorithms, used for both classification and regression tasks.
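A tiny sketch of the supervised setting using simple linear regression: labelled (x, y) pairs train a model, which then predicts a new, unseen input. The data values here are illustrative, not from any real dataset.

```python
# Tiny supervised-learning sketch: fit a 1-D linear regression by least
# squares on labelled (x, y) training pairs (illustrative data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # labels: roughly y = 2x

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope and intercept from the closed-form least-squares solution.
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

print(round(slope, 2), round(intercept, 2))
print(round(slope * 5.0 + intercept, 2))  # prediction for a new input
```

The same train-then-predict pattern carries over to classification, where the labels are discrete classes rather than continuous values.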
(ii.) Decision trees can be fundamentally described as hierarchical decision rules that recursively partition a data set, choosing at each split the feature that yields the highest information gain (i.e. the most homogeneous subsets). The resulting rules can then be used to classify and predict real-world data, on the assumption that it follows the same distribution as the training data.
Constraints: underfitting; overfitting (bespoke split rules tailored too closely to a training set that does not reflect the real data); dependence on sample size and data quality.
Simple example of decision trees for classifying flowers based on sepal and petal dimensions (tested in Weka): the famous Fisher Iris dataset. When visualised across combinations of different variables, there are intuitively clear clusters between the differently labelled data (species setosa, versicolor, virginica). These dimensions can then be crystallised into binary decision rules (e.g. length > 0.5 cm) that sort the flowers into nicely homogeneous classes.
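The Iris example above was run in Weka; a roughly equivalent sketch in Python, assuming scikit-learn is available, shows the learned binary threshold rules directly:

```python
# Sketch of the Iris decision-tree example using scikit-learn
# (the original example used Weka; sklearn here is an assumption).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# The learned tree is a small set of binary threshold rules on
# petal/sepal dimensions, printed in if/else form.
print(export_text(clf, feature_names=iris.feature_names))
```

Even at depth 2 the printed rules separate the three species quite cleanly, which matches the intuition that the clusters are visually distinct.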
(b.) Unsupervised machine learning and algorithms
(i.) Unsupervised machine learning refers to techniques applied where there is no labelled outcome (in contrast to supervised learning); the objective is instead to discover interesting structure or inferences in the data.
(ii.) Common algorithms include K-means clustering and association rule mining.
K-means clustering and association rule learning (Apriori)
The K-means algorithm consists of 4 basic steps:
1. Initialisation
Randomly place k initial centroids as the centres of the clusters (the initialisation can be tweaked).
2. Cluster assignment
Each data point is assigned to its nearest centroid, based on the distance between the point and each cluster centre.
3. Re-compute centroids
Re-compute each centroid as the average (mean) of all the points assigned to its cluster.
4. Repeat steps 2 and 3 until the centroids are stable (i.e. the averages change by less than a specified margin). Once the centroids stop moving, the clustering is complete.
A popular algorithm for association rule and frequent itemset mining is the Apriori algorithm.
Famous market basket analysis case study found beer and diapers were supermarket items often bought together (associated).
Candidate rules are evaluated with support and confidence metrics.
Support is based on the incidence rate of an associated item combination across all transactions; confidence is the proportion of transactions containing the antecedent that also contain the consequent.
These metrics are useful for generating rules from high-confidence, frequent itemsets.
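Support and confidence are simple ratios, sketched here on toy transactions that echo the beer-and-diapers example (the item names and data are illustrative):

```python
# Sketch of support and confidence on toy market-basket transactions.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"diapers", "bread"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Proportion of transactions with the antecedent that also contain
    # the consequent: support(A ∪ C) / support(A).
    return support(antecedent | consequent) / support(antecedent)

print(support({"beer", "diapers"}))       # 2 of 4 transactions
print(confidence({"diapers"}, {"beer"}))  # 2 of the 3 diaper baskets
```

A rule like diapers → beer would be kept if both values clear chosen minimum-support and minimum-confidence thresholds.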
Apriori is a bottom-up approach that progressively extends frequent itemsets one item at a time, pruning candidates using the property that if an itemset is frequent, all of its subsets must also be frequent.
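A minimal sketch of that level-wise generation, assuming the same toy transactions style as above (data and threshold are illustrative):

```python
# Minimal Apriori sketch: grow frequent itemsets level by level,
# pruning candidates whose subsets are not all frequent.
from itertools import combinations

def apriori(transactions, min_support=0.5):
    n = len(transactions)
    def is_frequent(itemset):
        return sum(itemset <= t for t in transactions) / n >= min_support
    # Level 1: frequent single items.
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    frequent = list(level)
    k = 2
    while level:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets...
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # ...kept only if every (k-1)-subset is frequent (Apriori pruning)
        # and the candidate itself meets the support threshold.
        level = [c for c in candidates
                 if all(frozenset(s) in set(frequent)
                        for s in combinations(c, k - 1))
                 and is_frequent(c)]
        frequent += level
        k += 1
    return frequent

tx = [{"beer", "diapers"}, {"beer", "diapers", "chips"},
      {"beer", "bread"}, {"diapers", "bread"}]
print(apriori(tx))
```

The bottom-up structure means no itemset is ever counted unless all of its subsets already passed the support threshold, which is what keeps the search tractable.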
2. Why is the AWS Cloud a great enabler for machine learning and deep learning, and what does it enable that you can’t easily do using other approaches?
(a.) The AWS Cloud allows data scientists and application developers to easily access machine learning and deep learning technologies through its frameworks and interfaces.
(b.) The flexible infrastructure and usage-based cost model.
WIP