Retraining (i.e. https://www.tensorflow.org/versions/r0.11/how_tos/image_retraining/index.html ) doesn't really go into the nuances of what types of labels you should choose based on your model.

Since InceptionV3 is trained on an object recognition task, and the penultimate layer (pool_3) contains a 2048-length vector description that somehow encodes various 'objectness' traits, it seems far better to:

train for labels that tend toward objectness (lamp, lampshade, chandelier, standing lamp, desk lamp)

than to train for labels that tend toward abstract image features like composition: chaotic, patterned, symmetric, asymmetric, mirrored, circular, diagonal, natural (photographic), synthetic.

If I were interested in the latter labelling (i.e. meta-features), is it more sensible to:

climb up the "network of objectness" and look further back from the pool_3 layer for a point to re-train from,

or

accept that your princess is in a different castle, and just get a model more suited to what you are interested in?

I imagine at some point the ideas of composition and content have to converge, since one is presumably training on raster images, and images made without any "intent" toward composition in fact have one by their very nature.

Barking up not just the wrong tree, but at a car?
It depends what you're ultimately trying to do.
If you just want to classify new object types in images, and those objects are likely to consist of the same or similar types of features as the original training data (edges, curves, corners, circles, boxes, gradients etc. in various orientations), then re-training just the final fully connected layers could well work. But you may also need to train the top conv layers too.
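For instance, a rough sketch of that "retrain just the final layers" idea using tf.keras, rather than the retrain.py script from the tutorial (the class count and training data are placeholders, not anything from the tutorial):

```python
# Rough transfer-learning sketch: freeze the pretrained conv layers and train
# only a new classification head on your own labels.
import tensorflow as tf

NUM_CLASSES = 5  # placeholder: e.g. lamp, lampshade, chandelier, standing lamp, desk lamp

base = tf.keras.applications.InceptionV3(
    weights='imagenet', include_top=False, pooling='avg')  # 2048-d 'pool_3'-style vector
base.trainable = False  # keep the pretrained feature extractors fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),  # the new head
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=10)  # supply your own labelled data
```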
You can think of a typical CNN as two main parts: the conv layers as (automatic) 'feature extractors' (kernel filters), and the final fully connected layers as a normal MLP which acts on the features extracted by the conv layers. The initial full training trains both the MLP and the feature extractors at the same time. Once the feature extractors know what features to extract, you can retrain just the MLP bit. But it's worth remembering that the feature extractors form a hierarchy, so you may need to train some of the higher level conv layers too. One nice example I'd heard (can't remember where) was that the earliest layers might detect some curves and lines in an image. The higher layers realise that some of those curves are arranged so that they form circles, and other higher layers realise that the lines are arranged to form a large box. Even higher layers realise that the circles and the box are arranged in a way that resembles a bus. Higher layers still put that bus on a street, etc.
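If the higher-level extractors do need to adapt, one option (continuing the sketch above; the layer cutoff is arbitrary, purely for illustration) is to unfreeze just the top conv layers and fine-tune them with a low learning rate:

```python
# Unfreeze only the top few conv layers so the higher-level 'feature extractors'
# can adapt, while the early, generic layers stay fixed.
base.trainable = True
for layer in base.layers[:-30]:  # 30 is an arbitrary cutoff for illustration
    layer.trainable = False

model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # low LR so the pretrained weights aren't wrecked
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=5)
```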
If you do train from scratch, within reasonable boundaries, these architectures are likely to learn whatever features you need, provided you have the training data (and I'm talking about just images obviously). In short you are unlikely to have to change the architecture to learn the features you want. If you do start tweaking architectures, you are potentially entering a world of pain.
However, the pool layers (or a stride > 1 for the conv layers) are there to add spatial invariance (which helps classification), i.e. to remove (or minimise) translation info. So I don't think you'll get much composition info, at least not from the dense layers. You'll only get information that a lamp is in the picture, not exactly where it is. You'll probably need to look deeper (into the conv layers) to get that info, if it is there at all. For object detection tasks (i.e. finding where an object is in an image) a slightly different architecture is used (e.g. Region-based Convolutional Neural Networks), and DeepMind use convnets without pool layers at all for their Atari vision system, since the higher order logic needs to know exactly where things are.
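A toy illustration (just a 1D sketch, not anything Inception-specific) of why the pooled/dense layers lose the positional info:

```python
# After aggressive pooling, two inputs that differ only by a small shift can
# produce identical outputs, so 'where the lamp is' gets thrown away even
# though 'there is a lamp' survives.
import numpy as np

def max_pool_1d(x, size):
    return np.array([x[i:i + size].max() for i in range(0, len(x), size)])

a = np.array([0, 0, 1, 0, 0, 0, 0, 0], dtype=float)  # a 'feature' at position 2
b = np.array([0, 1, 0, 0, 0, 0, 0, 0], dtype=float)  # the same feature shifted by one
print(max_pool_1d(a, 4))  # [1. 0.]
print(max_pool_1d(b, 4))  # [1. 0.]  -> indistinguishable after pooling
```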
If you're looking for various methods of labelling images on things like chaotic, patterned, symmetric etc., instead of training a network on those features you might have more luck (i.e. it's much quicker to implement and test) comparing your image to various target images. This could be with the L2 norm (Euclidean distance) or cosine distance (dot product) between the vectors from various layers. Or maybe even something a bit more complex, like comparing the Gram matrices of the two images (which is what is used to determine the 'style' of an image in the neural style transfer paper http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf )
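Roughly, that distance-based comparison could look like this (the layer name and the feature-extraction step are assumptions for illustration, not part of the retraining tutorial):

```python
# Compare images by their feature vectors instead of training a classifier:
# L2 / cosine distance on a layer's activations, plus a Gram matrix as a crude
# 'style' summary in the spirit of the Gatys et al. paper.
import numpy as np

def l2_distance(a, b):
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def gram_matrix(features):
    # features: (height, width, channels) activations from some conv layer
    f = features.reshape(-1, features.shape[-1])  # (H*W, C)
    return f.T @ f / f.shape[0]                   # (C, C) channel correlations

# e.g. grab activations with a truncated model (the layer name is an assumption):
# feats = tf.keras.Model(base.input, base.get_layer('mixed7').output).predict(imgs)
# then compare gram_matrix(feats_a) vs gram_matrix(feats_b) with l2_distance.
```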
hope that helps. also check out http://cs231n.github.io/transfer-learning/