
@jamesmurdza
Created August 31, 2024 12:39

00:00:02 hey everyone today we're going to be doing cnns or convolutional neural networks

doing cnns or convolutional neural networks, which are neural network models designed for processing structured grid data like images

00:00:06 cnns are neural networks that can be used to classify or detect images

neural networks that can be used to classify or detect images, and that's about all i know, so i'll pass it over. yeah, just like simple neural networks.

00:00:24 the idea for cnns was inspired by biological networks, similar to simple neural networks

neural networks were inspired by biological networks, and the same applies to cnns. the human visual system and cnns are quite similar.

00:00:40 the human visual system is similar to that of other mammals, enabling extensive research

other mammals are used because we can do experiments on animals without legal hurdles, giving researchers great depth of knowledge.

00:00:57 researchers have a wealth of knowledge about the visual system, especially the visual cortex

researchers have a great depth of knowledge about the visual system, particularly the visual cortex.

00:01:03 the visual cortex plays a crucial role in vision and helps recognize objects in layers

researchers in the 60s or 70s found that when humans look at something, the brain's visual cortex first identifies low-level features like edges before combining them to form the whole object. these findings were later used to simulate similar phenomena in computers.

00:02:00 lower levels detect low-level features like edges, while higher levels combine these features

the object is recognized in layers, starting with the lower levels.

00:02:20 cnns work similarly by identifying features in layers and combining them to recognize objects

look at the low-level features like edges and boundaries and then combine them into more complex structures, like recognizing a person. a cnn does this on a computer. without cnns, we could have taken a 32x32 pixel image and flattened the entire picture.

00:02:54 a cnn can transform an image into a vector and feed it to a fully connected neural network

into one long vector and feed the pixels directly to a fully connected layer.

00:03:00 fully connected layers require that all features connect to each neuron, complicating the model

a neural network is called fully connected because every feature is connected to each neuron. for example, one neuron in the first layer connects to all 32 * 32 pixels. this approach, however, comes with several difficulties.

00:03:33 high-resolution images pose challenges for fully connected layers due to a large number of weights

an image with a resolution of 1000x1000 pixels has too many pixels to process at once in a fully connected layer due to the high number of trainable parameters. this approach also limits the network's ability to recognize objects, like a cat, if its position in the image changes. for example, if the cat is in the upper-left corner, the network only learns to identify it in that specific position.
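To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch in Python; the hidden-layer width of 1000 neurons is an assumption for illustration, not a number from the lecture.

```python
# Rough weight counts for a single fully connected layer: every input pixel
# connects to every neuron, so the count is (pixels x neurons).
pixels_small = 32 * 32        # 1,024 inputs for a 32x32 grayscale image
pixels_large = 1000 * 1000    # 1,000,000 inputs for a 1000x1000 image
hidden_neurons = 1000         # assumed hidden-layer width, for illustration only

print(pixels_small * hidden_neurons)  # 1,024,000 weights
print(pixels_large * hidden_neurons)  # 1,000,000,000 weights -- impractical to train well
```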

00:05:06 cnns face translational invariance issues, where the position of an object in the image matters

the neural network identifies a new picture of the same cat as a completely different image because the cat appears in a different region, like the bottom right corner. despite it being the same cat, the network relies on raw pixel intensity values without recognizing it's the same object in a different position. thus, it struggles to label it correctly.

00:06:00 cnns aim to identify features regardless of their spatial position in an image

we need a new network to recognize a cat in an image regardless of its position. the network should isolate cat features and ignore spatial information. traditional neural networks struggle with this due to their fully connected layers. a CNN addresses this issue by being translationally invariant, meaning it can identify the cat even if its position within the image changes.

00:08:05 a cnn filter can perform convolutions by passing over an image and detecting features

just a 6x6 image, with pixel intensity values from 0 to 255. half the image is brighter (intensity 10) and the other half is dark (intensity 0). it illustrates how images use pixel intensities, where 255 is fully bright and 0 is fully black.

00:09:03 filters are smaller matrices that scan over an image, performing pointwise multiplication

we have a filter, a 3x3 matrix, used to compute a convolution on an image. you place the filter on the image, starting at the upper left corner, perform pointwise multiplication, sum the results, and that sum becomes the first pixel in the output. then you shift the filter one place at a time, repeating the process to get the next output pixel.
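As a rough illustration of that sliding-window procedure (not code from the lecture), here is a minimal NumPy sketch; the 6x6 image reuses the 10/0 values from the example above, and the averaging filter is just a placeholder.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding), multiply pointwise,
    and sum to produce each output pixel."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# 6x6 image from the example above: left half brighter (10), right half dark (0)
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)
kernel = np.ones((3, 3)) / 9.0        # a simple averaging filter, used only as a placeholder
print(convolve2d(image, kernel))      # a 4x4 output: the feature map
```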

00:11:00 the result of a convolution is a feature map, which highlights important features in the image

shift it, generate the output, and you've performed convolution. in signal processing, convolution involves using a filter to modify signals, producing a new output. think of low pass filters as allowing only lower frequencies to pass through, blocking higher ones. do you get how convolutions work in image processing?

00:14:15 the convolution filter used can detect specific features, such as vertical edges

a filter, like a vertical edge detector, identifies vertical edges in an image by highlighting the edges and darkening the rest. by altering the filter's values or rotating it, different features can be detected, such as horizontal edges or transitions from black to white. thus, filters can perform various tasks to highlight specific features in images.
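For instance, the classic 1/0/-1 vertical edge kernel (a common textbook choice; the lecture's exact filter values aren't given) responds strongly where brightness changes from left to right:

```python
import numpy as np
from scipy.signal import correlate2d

# A common vertical edge detector; CNN "convolution" is cross-correlation (no kernel flip)
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])

# Same 6x6 image as before: left half bright (10), right half dark (0)
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)

feature_map = correlate2d(image, vertical_edge, mode="valid")
print(feature_map)
# Columns covering the bright-to-dark transition come out as 30, flat regions as 0,
# so the vertical edge is highlighted; rotating the kernel would detect horizontal edges.
```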

00:17:28 cnn filters can detect various features by altering their composition and rotation

the convolution result can have negative numbers, but in machine learning, we apply a nonlinearity function like ReLU to turn these negative numbers into zero before feeding the data to the next layer. instead of manually designing filters like edge detectors, we let the model learn and adjust these filter weights itself through the training process using gradient descent, creating necessary features automatically.
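A tiny sketch of that step (illustrative values, not from the lecture): ReLU simply zeroes out the negatives in a feature map before it is passed on.

```python
import numpy as np

def relu(x):
    # negative values become 0, positive values pass through unchanged
    return np.maximum(x, 0)

feature_map = np.array([[30, -30],
                        [ 0,  15]])
print(relu(feature_map))  # [[30  0]
                          #  [ 0 15]]
```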

00:20:51 the process of creating filters is automated by the model through training, known as gradient descent

convolutions in deep learning help extract features from images through filters, creating feature maps. a single filter can detect simple patterns, while multiple filters can identify complex structures. by stacking layers, more abstract features can be recognized. padding is used to maintain image dimensions and ensure edge features aren't lost. this process aids in correctly classifying objects within images regardless of their position.

00:27:10 padding is used to prevent edge information from fading away during convolution

stride refers to how much you shift the filter when you apply convolution on an image. a stride of one means moving the filter one pixel at a time, while a stride of two or three means shifting it by two or three pixels respectively.

00:27:40 the stride defines how much the filter shifts during convolution, influencing feature map size

stride refers to how much you shift the filter when computing the convolution. increasing the stride reduces the output dimension of the feature map, making it smaller. finally, pooling is introduced as the last concept to understand the entire convolution process before building cnn architectures.
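The usual output-size relation (quoted again later in these notes) makes the effect of stride and padding easy to check; the 6x6 image and 3x3 filter below are just for illustration.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3, p=0, s=1))  # 4 -- stride 1, no padding
print(conv_output_size(6, 3, p=0, s=2))  # 2 -- stride 2 shrinks the feature map faster
print(conv_output_size(6, 3, p=1, s=1))  # 6 -- padding of 1 keeps the original size
```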

00:28:51 pooling, such as max pooling, reduces the size of feature maps by taking maximum values from regions

pooling reduces the feature map size to manage output features better, and max pooling is a common method for this. max pooling takes groups of pixels, like 0, 30, 0, and 30, and replaces them with the maximum value (30 in this case), effectively reducing the feature map and highlighting important features. shifting and repeating this process drops some information but emphasizes the remaining features.
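A minimal sketch of 2x2 max pooling using the 0/30 values mentioned above (the surrounding zeros are made up for illustration):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep only the largest value in each 2x2 block."""
    h, w = feature_map.shape
    pooled = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            pooled[i // 2, j // 2] = feature_map[i:i + 2, j:j + 2].max()
    return pooled

fm = np.array([[0, 30, 0, 0],
               [0, 30, 0, 0],
               [0,  0, 0, 0],
               [0,  0, 0, 0]])
print(max_pool_2x2(fm))  # [[30. 0.]
                         #  [ 0. 0.]] -- the 0, 30, 0, 30 block collapses to 30
```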

00:30:34 pooling highlights important features and further compresses the data, aiding in cnn performance

pooling in cnn compresses feature space, keeping only the most intensified features, which can enhance performance. typically, pooling follows each convolution layer, and you can experiment with various hyperparameters like the number of layers, filters, and filter sizes to optimize your cnn architecture. the feature maps from multiple filters remain separate after each convolution layer.

00:33:22 after convolutions and pooling, the feature maps can be flattened and fed into a fully connected layer

applying a convolution layer with 16 filters to an image will produce an output reduced in size, determined by the formula ((n + 2p - F) / s) + 1, where n is the input size, p is the padding, F is the filter size, and s is the stride. After convolution, for example, a 150x150 image convolved with 16 filters would result in 16 feature maps. To visualize what the model focuses on, after processing, each feature map can highlight different parts of the image, showcasing the intermediate representations helpful for debugging and understanding the model's behavior.
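Plugging in the numbers from that example, and assuming a 3x3 filter with stride 1 and no padding (the lecture doesn't spell out every setting), the formula gives (150 + 0 - 3)/1 + 1 = 148, which TensorFlow confirms:

```python
import tensorflow as tf

# ((n + 2p - F) / s) + 1 = ((150 + 2*0 - 3) / 1) + 1 = 148
layer = tf.keras.layers.Conv2D(16, (3, 3), activation="relu")
print(layer.compute_output_shape((None, 150, 150, 3)))  # (None, 148, 148, 16): 16 feature maps
```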

01:00:00 intermediate representations in convolve layers help visualize what features the cnn detects


00:48:00 cnn models can be built using libraries like TensorFlow, with layers for convolution, pooling, and dense connections


00:47:55 data preprocessing steps are crucial, such as resizing and normalizing images for consistent input

we're processing the images to be passed directly to model.fit. if you run into issues, let me know. let's move to the fun part: building the model. we use a sequential model with layers forming a convolutional neural network, starting with a convolutional layer followed by max pooling and a flatten layer. the flatten layer converts feature maps into a single vector. we have 16 feature maps, each slightly reduced by the convolution (148x148 from a 150x150 input with a 3x3 filter and no padding) and then halved by max pooling to 74x74, ultimately flattened and fed into the neural network. the convolution part has fewer trainable parameters compared to the neural network.
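A minimal sketch of the pipeline described here, written with TensorFlow/Keras. The resize-to-150x150 and rescale-to-[0, 1] preprocessing and the 16 conv filters follow the notes; the "cats_and_dogs/..." directory paths, batch size, and dense-layer widths are assumptions.

```python
import tensorflow as tf

# Preprocessing along the lines described: resize to 150x150 and rescale to [0, 1].
# The directory paths and batch size are placeholders, not from the lecture.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "cats_and_dogs/train", image_size=(150, 150), batch_size=32, label_mode="binary")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "cats_and_dogs/validation", image_size=(150, 150), batch_size=32, label_mode="binary")
train_ds = train_ds.map(lambda x, y: (x / 255.0, y))
val_ds = val_ds.map(lambda x, y: (x / 255.0, y))

# Sequential model as described: conv layer with 16 filters, max pooling, flatten, dense layers.
# The dense-layer widths are assumptions; the final sigmoid gives one cat-vs-dog probability.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(150, 150, 3)),
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu", name="conv1"),  # 16 maps, 148x148
    tf.keras.layers.MaxPooling2D((2, 2)),                                 # halved to 74x74
    tf.keras.layers.Flatten(),                                            # 74*74*16 = 87,616 values
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.summary()  # most trainable parameters sit in the dense part, not the convolution part
```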

00:55:00 training involves adjusting weights using the adam optimizer and binary cross-entropy for loss measurement

setting up the model here and using binary cross entropy as the loss function; training takes about five minutes since there aren't too many epochs. run it yourself; it just takes a few minutes. validation accuracy is 69%, and testing it on a dog image at home, the model mistakenly identified it as a cat. with around 70% accuracy, it's important to test with a variety of images. try it out on your own projects.
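Continuing the sketch above: compiling with the Adam optimizer and binary cross-entropy, then training. The epoch count is an assumption; the lecture only says training finishes in about five minutes.

```python
# Reuses the model, train_ds, and val_ds defined in the previous sketch.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(train_ds, validation_data=val_ds, epochs=15)  # epoch count is an assumption
print(history.history["val_accuracy"][-1])  # around 0.69 in the run described above
```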

00:58:00 validation and testing on new images are essential to evaluate model accuracy

you can train a model on internet data and then test it with real-world images for a fun project. after training, there's some extra code to visualize intermediate representations, which shows what parts of an image the model focuses on. for example, a model might mistake a dog for a cat because it focuses on the fur around the face rather than other distinctive features.

01:01:30 testing with new images reveals how well the cnn generalizes beyond the training dataset


01:00:00 intermediate visualizations can reveal why the cnn makes certain classifications, indicating areas of focus

we have 16 little images, one for each filter, showing the intermediate representations of a dog. interestingly, it doesn't focus much on the tongue but rather on the line around its furry face and the fur around its neck.
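A sketch of how those per-filter images can be produced: build a second model that returns the first conv layer's output and plot its 16 channels. The layer name refers to the Sequential sketch above; "dog.jpg" is a placeholder path, not from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# Model that maps the input image to the first conv layer's activations (16 feature maps).
activation_model = tf.keras.Model(inputs=model.input,
                                  outputs=model.get_layer("conv1").output)

img = tf.keras.utils.load_img("dog.jpg", target_size=(150, 150))   # placeholder image path
x = tf.keras.utils.img_to_array(img)[np.newaxis] / 255.0           # shape (1, 150, 150, 3)
feature_maps = activation_model.predict(x)                         # shape (1, 148, 148, 16)

# One small image per filter, showing where each filter responds on the dog photo.
fig, axes = plt.subplots(4, 4, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(feature_maps[0, :, :, i], cmap="viridis")
    ax.axis("off")
plt.show()
```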
