- to build a classification system using the k-nearest neighbors algorithm.
- regression: predicting a number,like the value of a stock tomorrow, or how much a user will enjoy a movie.
- the use cases and limitations of k-nearest neighbors.
k-nearest neighbors (KNN)
- you get new fruit to calsify
- you look at its THREE nearest neighbours
- more neighbours are orange so this is probably an orange
you’re trying to classify something, you might want to try KNN first.
/*
│ ┌────┐
│ │JO │
│ │ │
│ ┌─────┐ │ │
│ ┌──────┐ │ J │ └────┘
│ │ Pr │ │ │
│ │ │ └─────┘
│ │ │ ┌──G─┐
│ └──────┘ │ │
│ ┌────┐ ┌───┐ └────┘
│ │ │ │ │
│ ┌───┐ │L │ │Ch │
│ │JC │ └────┘ │ │
│ │ │ └───┘ ┌H─┐
│ └───┘ ┌──┐ └──┘
│ │ │
│ ┌─────┐ │ │
│ │ │ └──┘
└──────────────└─────┘──────────────────────
*/
- plot every user on graph
- users are plotted by similarity, users with similar taste are plotted closer together
- you want to recommend for Priyank,
- Justin, JC, Joey, Lance, & Chris all have similiar taste in movie whatever movie they like, Priyank will like too
if you have this graph, a recommendations system is easy. if jsutin like a movie,recommend it to Priyanka.
1. JUSTIN WACHES a MOVIE 2. He like it ---> ★★★★★ ---> 3. recommend it to Priyanka
You graphed the users by similarity. How do you figure out how similar two users are?
When users sign up for Netflix, have them rate some categories of movies based on how much they like those categories. For each user, you now have a set of ratings!
Prinaka JUSTIN MORPHEUS
COMEDY 3 4 2
ACTION 4 3 5
DRAMA 4 5 1
HORROR 1 2 3
ROMANCE 4 5 1
Priyank (3, 4, 4, 1, 4)
\[
\sqrt{(a_1 - a_2)^2 + (b_1 - b_2)^2 + (c_1 - c_2)^2 + (d_1 - d_2)^2 + (e_1 - e_2)^2}
\]
It just involves a set of five numbers instead of a set of two numbers.
The distance formula is flexible: you could have a set of a _million_ numbers and still use the same old distance formula to find the distance. Maybe you're wondering, "What does **distance** mean when you have five numbers?" The distance tells you how similar those sets of numbers are.
\[
\sqrt{(3 - 4)^2 + (4 - 3)^2 + (4 - 5)^2 + (1 - 1)^2 + (4 - 5)^2}
\]
\[
= \sqrt{1 + 1 + 1 + 0 + 1}
\]
\[
= \sqrt{4}
\]
\[
= 2
\]
Here’s the distance between Priyanka and Justin.
Priyanka and Morpheus are 24 apart
JUSTIN 5
JC 4
JOEY 4
LANCE 5
CHRIS 3
average of their ratings and get 4.2 stars. That’s called regression.
These are the two basic things you’ll do with KNN—classification and regression:
- Classification = categorization into a group
- Regression = predicting a response (like a number)
Cosine similarity measures the similarity between two vectors of an inner product space.
When you’re working with KNN, it’s really important to pick the right features to compare against. Picking the right features means
- Features that directly correlate to the movies you’re trying to recommend
- Features that don’t have a bias (for example, if you ask the users to only rate comedy movies, that doesn’t tell you whether they like action movies)
Machine learning is all about making your computer more intelligent.
OCR stands for optical character recognition. It means you can take a photo of a page of text, and your computer will automatically read the text for you. Google uses OCR to digitize books. How does OCR work? For example, consider this number
You can use KNN for this:
- Go through a lot of images of numbers, and extract features of those numbers.
- When you get a new image, extract the features of that image, and see what its nearest neighbors are!
OCR algorithms measure lines, points, and curves.
The first step of OCR, where you go through images of numbers and extract features, is called training. Most machine-learning algorithms have a training step: before your computer can do the task, it must be trained
Spam filters use another simple algorithm called the Naive Bayes classifier.
you train your Naive Bayes classifier on some data.
Subject Spam?
"RESET YOUR PASSWORD" Not Spam
"YOU HAVE WON 1 MILLION DOLLARS" Spam
"SEND ME YOUR PASSWORD" Spam
"NIGERIAN PRINCE SENDS YOU 10 MILLION DOLLARS" Spam
"HAPPY BIRTHDAY" Not Spam
For example, in this very simple model, the word million only appears in spam emails.
Naive Bayes figures out the probability that something is likely to be spam. It has applications similar to KNN.
Challenge of Predictive Features: Selecting reliable features to predict stock market movements is difficult; simple patterns like "it went up yesterday, so it’ll go up today" or "stocks fall in May" are often ineffective.
Complexity of Variables: The stock market is influenced by numerous, unpredictable variables, making it hard to isolate factors that consistently forecast future performance.
Limitations of Historical Data: Past stock performance doesn’t guarantee future trends, complicating accurate predictions in a dynamic, multifactorial environment.
-
10 algorithms and why they’re useful.
-
You get pointers on what to read next, depending on what your interests are.
______________DAVID______________
/ \
______ADIT______ ______MANNING______
/ \
__MAGGIE__ __MIKE__
the nodes to its left are smaller in value, and the nodes to the right are larger in value.
Suppose you’re searching for Maggie. You start at the root node.
David -> Maggie comes after David, so go toward the right. MANNING -> Maggie comes before Manning, so go to the left.
You found Maggie! It’s almost like running a binary search! Searching for an element in a binary search tree takes O(log n) time on average and O(n) time in the worst case
| Operation | Array | Binary Search Tree |
|-----------|--------------|--------------------|
| Search | O(log n) | O(log n) |
| Insert | O(n) | O(log n) |
| Delete | O(n) | O(log n) |
you don’t get random access “Give me the fifth element of this tree.” Those performance times are also on average and rely on the tree being balanced.
/ \
__1__ __4__
/ \
__4__ __7__
\
__20__
\
__40__
in databases or more-advanced data structures,
- B-trees // mongodb used B-tree indexes
- Red-black trees
- Heaps
- Splay trees
we're building a hash table where each key is a word, and each value is a list of pages where that word appears.
This structure, known as an inverted index, allows us to efficiently look up pages that contain a specific word.
A hash that maps words to places where they appear. This data structure is called an inverted index
For instance, if a user searches for "hi," the hash table quickly shows the pages (e.g., A and B) where it appears.
Similarly, searching for "there" returns pages like A and C. Inverted indexes are widely used in search engines because they enable fast retrieval of information based on keywords.
algorithm that breaks down complex signals into their fundamental components
given a smoothie, can tell you the individual ingredients, or given a song, can separate it into its component frequencies.
This decomposition is useful in various applications: it enables you to enhance certain sounds in audio (like boosting bass) or compress files by discarding less important components.
MP3 and JPG formats rely on Fourier transforms to compress audio and images. Beyond media, it's also used in fields like earthquake prediction and DNA analysis due to its signal-processing capabilities
To make your algorithms faster, you need to change them to run in parallel across all the cores at once!
Parallel algorithms are hard to design. And it’s also hard to make sure they work correctly and to figure out what type of speed boost you’ll see
- Overhead of managing the parallelism
- Load balancing
The MapReduce algorithm is a popular distributed algorithm. e.g Apache Hadoop.
When working with massive data sets or large-scale tasks, distributed algorithms like MapReduce are essential. For instance, if you have a database with billions of rows, traditional SQL databases like MySQL may struggle. Using MapReduce on a distributed system like Hadoop can handle it effectively.
MapReduce breaks down tasks using two key functions:
Map: Processes each data chunk in parallel. Reduce: Aggregates the results.
By distributing jobs across many machines, MapReduce allows for faster data processing, making it ideal for big data tasks that would be too slow on a single machin
let arr1 = [1,2,3,4,5]
let arr2 = arr1.map(( x ) => x * 2 ) // 1 ,3, 6,8, 10
// Doubling an element is pretty fast.
let arr1 = # A list of URLs
let arr2 = map(download_page, arr1)
If you have a list of URLs to download, fetching each one individually can take hours for large lists.
With the map function in MapReduce, you can distribute the work across multiple machines.
For example, using 100 machines allows you to download 100 pages at once, speeding up the process significantly.
This parallel processing is the essence of the "map" function in MapReduce, where tasks are spread across many workers for faster completion.
reduce, you transform an array to a single item.
suppose you’re Google, and you’re crawling web pages. You only want to crawl a web page if you haven’t crawled it already. So you need to figure out whether this page has been crawled before.
suppose you’re running bit.ly, which is a URL shortener. You don’t want to redirect users to malicious websites. You have a set of URLs that are considered malicious. Now you need to figure out whether you’re redirecting the user to a URL in that set.
All of these examples have the same problem. You have a very large set.
hash needs to be huge. Google indexes trillions of web pages. If this hash has all the URLs that Google has indexed, it will take up a lot of space.
Reddit and bit.ly have the same space problem. When you have so much data, you need to get creative!
Bloom filters are probabilistic data structures that offer a space-efficient way to check if an item (like a URL) has been seen before. They provide a "probably correct" answer:
False positives are possible, meaning the filter might incorrectly say a URL has been crawled. False negatives are not possible, meaning if the filter says a URL hasn't been crawled, it definitely hasn't.
Bloom filters are useful when space is limited and an exact answer isn't necessary. They’re ideal for applications like URL checking or detecting potentially malicious sites, where a small chance of error is acceptable, but a guarantee of no false negatives is crucial.
HyperLogLog is a probabilistic algorithm used to approximate the number of unique elements in a large set, such as counting unique searches or items viewed.
HyperLogLog uses much less space while providing an estimate that’s very close to the actual number.
It’s ideal for large datasets where exact counts aren’t necessary, and approximate answers are acceptable, making it a space-efficient choice for big data applications.
Another hash function is a secure hash algorithm (SHA) function. Given a string, SHA gives you a hash for that string.
SHA is a hash function. It generates a hash, which is just a short string. The hash function for hash tables went from string to array index, whereas SHA goes from string to string
SHA generates a different hash for every string.
SHA to tell whether two files are the same.
You want to check whether your friend has the same large file. You don’t have to try to email them your large file. Instead, you can both calculate the SHA hash and compare it.
use SHA-2 or SHA-3. The gold standard for password-hashing functions is currently bcrypt
it’s locality insensitive. Suppose you have a string, and you generate a hash for it.
dog -> cd6357
If you change just one character of the string and regenerate the hash, it’s totally different!
dot -> e392a
you want a locality-sensitive hash function. That’s where Simhash comes
in. If you make a small change to a string, Simhash generates
a hash that’s only a little different.
This
allows you to compare hashes and see how similar two strings are, which is pretty useful!
- Google uses Simhash to detect duplicates while crawling the web.
- A teacher could use Simhash to see whether a student was copying an essay from the web.
- Scribd allows users to upload documents or books to share with others. But Scribd doesn’t want users uploading copyrighted content! The site could use Simhash to check whether an upload is similar to a Harry Potter book and, if so, reject it automatically.
Simhash is useful when you want to check for similar items.
Diffie-Hellman is a cryptographic method that uses two keys:
Public key: This key can be shared openly with anyone. You can post it online or send it to others.
Private key: This key is kept secret and is used to decrypt messages encrypted with the public key.
When someone wants to send you a secure message, they encrypt it with your public key. Since only your private key can decrypt the message, only you can read it, ensuring confidentiality
The Diffie-Hellman algorithm is still used in practice, along with its successor, RSA
Linear programming (LP) is a mathematical technique used to find the best outcome (such as maximum profit or minimum cost) in a mathematical model with linear relationships. It involves optimizing a linear objective function, subject to a set of linear constraints (equations or inequalities)