Computer Vision Reference


Frameworks

Image processing libraries

Framework Image Video C/C++ Python
OpenCV ✅ ✅ ✅ ✅
Pillow ✅ ✅
Scikit-image ✅ ✅

Image Augmentation

Library Basic Transforms Keypoints Bounding Boxes Segmentation
Torchvision ✅
imgaug ✅ ✅ ✅ ✅
albumentations ✅ ✅ ✅ ✅

Low-level Deep Learning Frameworks

Framework Creator Python C/C++ R Java Javascript
PyTorch Facebook ✅ ✅
TensorFlow Google ✅ ✅ ✅ ✅
Caffe UC Berkeley ✅ ✅
Darknet Joseph Redmon ✅
MXNet Apache ✅ ✅ ✅ ✅

PyTorch

PyTorch is a Python package that provides two high-level features:
  • Tensor computation (like NumPy) with strong GPU acceleration
  • Deep neural networks built on a tape-based autograd system

TensorFlow

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications.

Caffe

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by Berkeley AI Research (BAIR) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license.

Darknet

Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

MXNet

A truly open source deep learning framework suited for flexible research prototyping and production.

High-level frameworks

Framework Creator
Torchvision Facebook ✅
PyTorch Lightning CILVR ✅
Keras Google ✅
GluonCV DMLC ✅ ✅
MediaPipe Google
Detectron2 Facebook ✅
MMCV OpenMMLab ✅

Torchvision

PyTorch Lightning

Keras

GluonCV

MediaPipe

Detectron2

MMCV

Tasks

📷 Image classification

Image classification refers to the task of extracting information classes from a multiband raster image.

Models

Model Paper Published
AlexNet One weird trick for parallelizing convolutional neural networks Apr-2014
GoogLeNet Going Deeper with Convolutions Sep-2014
VGG Very Deep Convolutional Networks for Large-Scale Image Recognition Sep-2014
InceptionV3 Rethinking the Inception Architecture for Computer Vision Dec-2015
ResNet Deep Residual Learning for Image Recognition Dec-2015
SqueezeNet SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size Feb-2016
Wide ResNet Wide Residual Networks May-2016
DenseNet Densely Connected Convolutional Networks Aug-2016
ResNeXt Aggregated Residual Transformations for Deep Neural Networks Nov-2016
DarknetV1 YOLO9000: Better, Faster, Stronger Dec-2016
MobileNetV2 MobileNetV2: Inverted Residuals and Linear Bottlenecks Jan-2018
DarknetV2 YOLOv3: An Incremental Improvement Apr-2018
MNASNet MnasNet: Platform-Aware Neural Architecture Search for Mobile Jul-2018
MobileNetV3 Searching for MobileNetV3 May-2019
EfficientNet EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks May-2019
ViT An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Oct-2020
Swin Swin Transformer: Hierarchical Vision Transformer using Shifted Windows Mar-2021

Pretrained models

Benchmark

Family Network Year #P (M) * Acc@1
AlexNet AlexNet 2014 61 56.522
VGG VGG-11 (BN) 2014 133 70.370
VGG VGG-13 (BN) 2014 133 71.586
VGG VGG-16 (BN) 2014 138 73.360
VGG VGG-19 (BN) 2014 144 74.218
GoogLeNet GoogLeNet 2014 13 69.778
ResNet ResNet-18 2015 12 69.758
ResNet ResNet-34 2015 22 73.314
ResNet ResNet-50 2015 26 76.130
ResNet ResNet-101 2015 45 77.374
ResNet ResNet-152 2015 60 78.312
InceptionV3 Inception-V3 2015 27 77.294
SqueezeNet SqueezeNet 1.0 2016 1 58.092
SqueezeNet SqueezeNet 1.1 2016 1 58.178
DenseNet DenseNet-121 2016 8 74.434
DenseNet DenseNet-161 2016 29 77.138
DenseNet DenseNet-169 2016 14 75.600
DenseNet DenseNet-201 2016 20 76.896
DarknetV1 Darknet-19 2016 ? 72.9
DarknetV2 Darknet-53 2018 ? 77.2
Wide ResNet Wide ResNet-50-2 2016 69 78.468
Wide ResNet Wide ResNet-101-2 2016 127 78.848
MobileNet MobileNet-v2 2018 4 71.878
MobileNet MobileNet-v3-small 2019 3 67.668
MobileNet MobileNet-v3-large 2019 5 74.042
MNASNet MNASNet 0.5 2018 2 67.734
MNASNet MNASNet 0.75 2018 3 ??
MNASNet MNASNet 1.0 2018 4 73.456
MNASNet MNASNet 1.3 2018 6 ??
EfficientNet EfficientNet B0 2019 5 77.692
EfficientNet EfficientNet B1 2019 8 78.642
EfficientNet EfficientNet B2 2019 9 80.608
EfficientNet EfficientNet B3 2019 12 82.008
EfficientNet EfficientNet B4 2019 19 83.384
EfficientNet EfficientNet B5 2019 30 83.444
EfficientNet EfficientNet B6 2019 43 84.008
EfficientNet EfficientNet B7 2019 66 84.122
Swin Swin-S 2021 49 83.21
Swin Swin-B 2021 87 85.16
Swin Swin-L 2021 196 86.24
ViT ViT-L/14 2020 307 87.46
ViT ViT-H/14 2020 632 88.55
  • Acc@1: Top-1 Accuracy on ImageNet at 224x224 resolution
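
As a rough illustration of what the Acc@1 column measures, here is a minimal pure-Python sketch of top-k accuracy. The function name `top_k_accuracy` and the toy scores are made up for this example; real benchmarks evaluate on the full ImageNet validation set.

```python
def top_k_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest scores."""
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Hypothetical class scores for 4 images over 3 classes.
scores = [
    [0.10, 0.70, 0.20],
    [0.80, 0.10, 0.10],
    [0.25, 0.35, 0.40],
    [0.20, 0.50, 0.30],
]
labels = [1, 0, 1, 2]

acc1 = top_k_accuracy(scores, labels, k=1)  # top-1 accuracy (Acc@1)
acc2 = top_k_accuracy(scores, labels, k=2)  # top-2 accuracy
print(acc1, acc2)
```

With `k=1` a prediction only counts when the single highest-scoring class is the true label, which is why Acc@1 is the strictest of the top-k metrics.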

🔍 Object detection

Object detection is the task of locating one or more objects in an image and reporting a bounding box for each of them.

Models

Model Paper Published
R-CNN Rich feature hierarchies for accurate object detection and semantic segmentation Oct-2013
Fast R-CNN Fast R-CNN Sep-2015
Faster R-CNN Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Jan-2016
YOLOv1 You Only Look Once: Unified, Real-Time Object Detection Jun-2015
YOLOv2 YOLO9000: Better, Faster, Stronger Dec-2016
SSD SSD: Single Shot MultiBox Detector Dec-2015
YOLOv3 YOLOv3: An Incremental Improvement Apr-2018
RetinaNet Focal Loss for Dense Object Detection Feb-2018
MaskRCNN Mask R-CNN Mar-2017
Cascade R-CNN Cascade R-CNN: Delving into High Quality Object Detection Dec-2017
DETR End-to-End Object Detection with Transformers May-2020

Pretrained models

Benchmark

Family Network Backbone Year * AP
R-CNN R-CNN ?? 2013 -
Fast R-CNN Fast R-CNN VGG16 2015 19.7
Faster R-CNN Faster R-CNN VGG16 2016 21.9
YOLOv1 YOLO V1 * GoogLeNet 2015 -
YOLOv2 YOLO V2 Darknet-19 2016 21.6
SSD SSD300 VGG16 2016 23.2
SSD SSD500 VGG16 2016 26.8
Cascade R-CNN Cascade-R-CNN-100 ResNet-101 2017 42.8
YOLOv3 YOLO V3 Darknet-53 2018 33.0
RetinaNet RetinaNet-ResNet-101 ResNet-101-FPN 2017 39.1
RetinaNet RetinaNet-ResNeXt-101 ResNeXt-101-FPN 2017 40.8
MaskRCNN MaskRCNN R-101-FPN ResNet-101-FPN 2018 40.8
MaskRCNN MaskRCNN X-101-64x4d-FPN ResNeXt-101-64x4d 2018 42.7
DETR DETR-DC5 ResNet-50 + DC 2020 43.3
DETR DETR-DC5-R101 ResNet-101 + DC 2020 44.9
  • AP: COCO-style average precision, averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05 (AP@[.5:.05:.95]), on COCO test-dev
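
The IoU thresholds behind this AP metric can be sketched in a few lines of Python. This covers only the matching criterion (is a predicted box close enough to a ground-truth box at a given threshold?). The full COCO metric also integrates precision over recall and averages over classes, which is omitted here. The `(x1, y1, x2, y2)` box format and the sample boxes are assumptions chosen for the example.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Intersection is empty when the boxes do not overlap on either axis.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds 0.50, 0.55, ..., 0.95.
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

pred = (10, 10, 50, 50)   # hypothetical predicted box
truth = (10, 10, 50, 41)  # hypothetical ground-truth box
iou = box_iou(pred, truth)

# Thresholds at which this prediction would count as a true positive.
matched_at = [t for t in iou_thresholds if iou >= t]
print(iou, matched_at)
```

A detector with loosely fitting boxes can score well at the 0.50 threshold and still lose most of its matches at 0.75 and above, which is why the averaged metric is lower than AP at IoU 0.50 alone.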

👥 Image Segmentation

Image Segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects.

Models

Model Paper Published
U-Net U-Net: Convolutional Networks for Biomedical Image Segmentation May-2015
FCN Fully Convolutional Networks for Semantic Segmentation Nov-2014
FCN Improving Fully Convolution Network for Semantic Segmentation Nov-2016
DeepLabV3 Rethinking Atrous Convolution for Semantic Image Segmentation Jun-2017
PSPNet Pyramid Scene Parsing Network Dec-2016
UPerNet Unified Perceptual Parsing for Scene Understanding Jul-2018
DANet Dual Attention Network for Scene Segmentation Sep-2018

Pretrained models

Benchmark

Family Backbone Year * mIoU
UNet UNet-S5-D16 2016 69.10
FCN ResNet-18-D8 2017 71.11
FCN ResNet-50-D8 2017 73.61
FCN ResNet-101-D8 2017 76.80
DeepLabV3 ResNet-50-D8 2017 77.85
PSPNet ResNet-101-D8 2016 78.34
UPerNet ResNet-50 2018 78.19
UPerNet ResNet-101 2018 79.40
DANet ResNet-50-D8 2018 79.34
DANet ResNet-101-D8 2018 80.41
  • mIoU: Mean IoU on CityScapes at 512x1024 resolution
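
For reference, a minimal sketch of how mean IoU can be computed from per-pixel class labels. It assumes the prediction and the ground truth are flattened into equally long lists of integer class ids, and it skips classes that appear in neither. Benchmark implementations typically accumulate the counts over the whole dataset and handle an "ignore" label, which this toy version does not.

```python
def mean_iou(pred, target, num_classes):
    """Mean over classes of |pred AND target| / |pred OR target|."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical flattened label maps for a 6-pixel image with 3 classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

miou = mean_iou(pred, target, num_classes=3)
print(miou)
```

Here the per-class IoUs are 1/3, 2/3 and 1/2, so every class contributes equally to the mean regardless of how many pixels it covers.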

📌 Landmark/Keypoint Extraction

Landmark/Keypoint Extraction is the process of determining the spatial keypoints of an object in an image (e.g., pose keypoints).

Methods

  1. Cascade
  2. Heatmap methods
    1. Top-down heatmap
    2. Bottom-up heatmap
    3. Multi-Scale High-Resolution Networks
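
Heatmap methods predict one score map per keypoint and read the keypoint location off the map. A minimal sketch of that decoding step, assuming a single 2D heatmap stored as a list of rows and a hypothetical output stride of 4 between heatmap and image coordinates (real models differ in stride and often add sub-pixel refinement):

```python
def decode_heatmap(heatmap):
    """Return (x, y, score) of the highest-scoring cell of a 2D heatmap."""
    best = (0, 0, heatmap[0][0])
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best[2]:
                best = (x, y, value)
    return best


# Hypothetical 3x4 heatmap for one keypoint.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.0],
]

x, y, score = decode_heatmap(heatmap)

# Map heatmap coordinates back to image coordinates (stride is an assumption).
stride = 4
img_x, img_y = x * stride, y * stride
print(img_x, img_y, score)
```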

Pretrained models

Pose detection

Papers:

Models

Model Paper Published
DeepPose DeepPose: Human Pose Estimation via Deep Neural Networks Dec-2013
CPM Convolutional Pose Machines Jan-2016
RSN Learning Delicate Local Representations for Multi-Person Pose Estimation Mar-2020
HRNet Deep High-Resolution Representation Learning for Visual Recognition Aug-2019

Benchmark

Family Method Backbone Year * AP
DeepPose Cascade ResNet-50 2014 52.6
DeepPose Cascade ResNet-101 2014 56.0
DeepPose Cascade ResNet-152 2014 58.3
CPM Top-down heatmap ? 2016 62.3
ResNetV1 Top-down Heatmap ResNetV1D-50 2019 72.2
ResNetV1 Top-down Heatmap ResNetV1D-100 2019 73.1
ResNetV1 Top-down Heatmap ResNet-152 2019 73.7
VGG Top-down Heatmap VGG-16 2015 69.8
MobileNetV2 Top-down Heatmap MobileNetV2 2018 64.6
RSN Top-down Heatmap ResNet-18 2020 70.4
RSN Top-down Heatmap 3x ResNet-50 2020 75.0
HRNet Multi-Scale High-Resolution Networks HRNet-w48 2019 75.6
  • AP: Average precision on COCO-2017 at 256x192 resolution

📏 Metric Learning / Few-Shot Learning

Methods

  1. Siamese Networks
  2. Meta-Learning

Models

Paper Backbone R@1 Published
Siamese Neural Networks for One-shot Image Recognition Custom Aug-2016
Hardness-Aware Deep Metric Learning GoogLeNet 43.6 Mar-2019
Local Similarity-Aware Deep Feature Embedding GoogLeNet 58.3 Oct-2016
Hard-Aware Deeply Cascaded Embedding GoogLeNet 60.7 Nov-2016
Sampling Matters in Deep Embedding Learning ResNet-50 63.9 Jun-2017
SoftTriple Loss: Deep Metric Learning Without Triplet Sampling GoogLeNet 65.4 Sep-2019
Calibrated neighborhood aware confidence measure for deep metric learning ?? 74.9 Jun-2020
A Closer Look at Few-shot Classification Conv4 60.5 Jan-2020
Negative Margin Matters: Understanding Margin in Few-shot Classification ResNet-18 72.7 Mar-2020
Prototypical Networks for Few-shot Learning GoogLeNet 54.6 Jun-2016
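
The table does not say which dataset or protocol the R@1 column refers to. Assuming it denotes Recall@1 as commonly used for retrieval (a query counts as correct when its nearest neighbour in embedding space has the same label), a minimal pure-Python sketch looks like this. The embeddings and labels are toy values invented for the example.

```python
def recall_at_1(embeddings, labels):
    """Fraction of samples whose nearest other sample shares their label."""
    if len(embeddings) < 2 or len(embeddings) != len(labels):
        raise ValueError("need at least two embeddings and one label each")
    hits = 0
    for i, query in enumerate(embeddings):
        best_j, best_dist = None, float("inf")
        for j, other in enumerate(embeddings):
            if i == j:
                continue  # a sample may not retrieve itself
            # Squared Euclidean distance is enough for ranking neighbours.
            dist = sum((q - o) ** 2 for q, o in zip(query, other))
            if dist < best_dist:
                best_j, best_dist = j, dist
        hits += labels[best_j] == labels[i]
    return hits / len(embeddings)


# Hypothetical 2D embeddings; the last point lies close to the wrong class.
embeddings = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.2, 0.1)]
labels = [0, 0, 1, 1, 1]

r1 = recall_at_1(embeddings, labels)
print(r1)
```

Four of the five queries retrieve a neighbour of the same class, so the sketch reports 0.8, which would appear as 80.0 in a table that uses percentages.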

Pretrained models

Datasets

MNIST

A large database of handwritten digits

CIFAR-10

Labeled subsets of the 80 million tiny images dataset

CIFAR-100

Labeled subsets of the 80 million tiny images dataset

  • 🏠 Main page: https://www.cs.toronto.edu/~kriz/cifar.html
  • #Images: RGB-32x32
    • Train: 50k
    • Test: 10k
  • Super-classes (20, grouping the 100 fine-grained classes): aquatic mammals, fish, flowers, food containers, fruit and vegetables, household electrical devices, household furniture, insects, large carnivores, large man-made outdoor things, large natural outdoor scenes, large omnivores and herbivores, medium-sized mammals, non-insect invertebrates, people, reptiles, small mammals, trees, vehicles 1, vehicles 2
  • Tasks:

Caltech 101

Pictures of objects belonging to 101 categories

CelebA

Large-scale face attributes dataset

  • 🏠 Main page: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
  • Images:
    • Train: 200k
    • Test: ??
  • Tasks:
  • Classes:
    • Face attributes (40): 5_o_Clock_Shadow, Arched_Eyebrows, Attractive, Bags_Under_Eyes, Bald, Bangs, Big_Lips, Big_Nose, Black_Hair, Blond_Hair, Blurry, Brown_Hair, Bushy_Eyebrows, Chubby, Double_Chin, Eyeglasses, Goatee, Gray_Hair, Heavy_Makeup, High_Cheekbones, Male, Mouth_Slightly_Open, Mustache, Narrow_Eyes, No_Beard, Oval_Face, Pale_Skin, Pointy_Nose, Receding_Hairline, Rosy_Cheeks, Sideburns, Smiling, Straight_Hair, Wavy_Hair, Wearing_Earrings, Wearing_Hat, Wearing_Lipstick, Wearing_Necklace, Wearing_Necktie, Young
    • Landmarks: (5)
      • Left-eye, Right-eye, Nose, Left-Mouth, Right-Mouth

WiderFace

A face detection benchmark dataset with a high degree of variability in scale, pose and occlusion

LFW

A database of face photographs designed for studying the problem of unconstrained face recognition

  • 🏠 Main page: http://vis-www.cs.umass.edu/lfw/
  • Images: 13k
  • Tasks:
  • Classes:
    • Face attributes (73): Male, Asian, White, Black, Baby, Child, Youth, Middle_Aged, Senior, Black_Hair, Blond_Hair, Brown_Hair, Bald, No_Eyewear, Eyeglasses, Sunglasses, Mustache, Smiling, Frowning, Chubby, Blurry, Harsh_Lighting, Flash, Soft_Lighting, Outdoor, Curly_Hair, Wavy_Hair, Straight_Hair, Receding_Hairline, Bangs, Sideburns, Fully_Visible_Forehead, Partially_Visible_Forehead, Obstructed_Forehead, Bushy_Eyebrows, Arched_Eyebrows, Narrow_Eyes, Eyes_Open, Big_Nose, Pointy_Nose, Big_Lips, Mouth_Closed, Mouth_Slightly_Open, Mouth_Wide_Open, Teeth_Not_Visible, No_Beard, Goatee, Round_Jaw, Double_Chin, Wearing_Hat, Oval_Face, Square_Face, Round_Face, Color_Photo, Posed_Photo, Attractive_Man, Attractive_Woman, Indian, Gray_Hair, Bags_Under_Eyes, Heavy_Makeup, Rosy_Cheeks, Shiny_Skin, Pale_Skin, 5_o'_Clock_Shadow, Strong_Nose-Mouth_Lines, Wearing_Lipstick, Flushed_Face, High_Cheekbones, Brown_Eyes, Wearing_Earrings, Wearing_Necktie, Wearing_Necklace

CelebAMask-HQ

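Large-scale face image dataset of 30,000 high-resolution images selected from CelebA, each with segmentation masks of facial attributes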

ImageNet

An image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images

COCO

A large-scale object detection, segmentation, and captioning dataset.

CityScapes

A large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations of 5,000 frames in addition to a larger set of 20,000 weakly annotated frames.

  • 🏠 Main page: https://www.cityscapes-dataset.com/
  • Images:
    • Total: 25k
    • Val:
  • Classes:
    • Flat: road, sidewalk, parking, rail track
    • Human: person, rider
    • Vehicle: car, truck, bus, on rails, motorcycle, bicycle, caravan, trailer
    • Construction: building, wall, fence, guard rail, bridge, tunnel
    • Object: pole, pole group, traffic sign, traffic light
    • Nature: vegetation, terrain
    • Sky: sky
    • Void: ground, dynamic, static
  • Tasks:

Pascal VOC

  • 🏠 Main page: http://host.robots.ox.ac.uk/pascal/VOC/
  • Images:
    • Total: 11.5k
    • Object Detection: 11.5k
    • Image Segmentation: 6.9k
  • Classes:
    • Person: person
    • Animal: bird, cat, cow, dog, horse, sheep
    • Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
    • Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
  • Tasks:

CUB-200-2011

An extended version of the CUB-200 dataset, with roughly double the number of images per class and new part location annotations

Material

📚 Books

📄 Papers

📙 Blogs
