Created November 19, 2022 19:10
Asking justinpinkney how stable-diffusion image variations works (i.e. how to finetune SD to condition on CLIP image embeddings)
Screenshots in comments below. Original Discord thread: https://canary.discord.com/channels/930499730843250783/950914873800396841/1026450454068084798
justinpinkney, in response to:
Hello! I'm the person behind the image variations model. It's actually a little different to how you mentioned. I used the final output of the CLIP image encoder (e.g. 1x768) after the projection into the shared latent space. So it's fine tuning to accept a single token. But there is no problem in tuning the model to accept longer or shorter contexts. Just fine tune the cross attention layers and it works ok
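Aside, for illustration: a minimal sketch of the conditioning path described above, not justinpinkney's actual training code. It assumes the Hugging Face `transformers` CLIP vision tower (`CLIPVisionModelWithProjection`, whose projection dim for ViT-L/14 is 768) and the `attn2` naming that the LDM/diffusers UNets use for their cross-attention layers.

```python
# Sketch: use the CLIP image encoder's *projected* embedding (1x768) as a
# one-token cross-attention context, and fine-tune only the cross-attention
# layers. Checkpoint names and the "attn2" filter are assumptions.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-L/14: projection dim is 768, matching SD's text-conditioning width.
clip = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.png").convert("RGB")  # hypothetical input image
pixels = processor(images=image, return_tensors="pt").pixel_values.to(device)

with torch.no_grad():
    # image_embeds is the output *after* the projection into the shared
    # text/image latent space: shape [1, 768]
    e = clip(pixel_values=pixels).image_embeds

# treat it as a sequence of one token: [1, 1, 768]
cond = e.unsqueeze(1)

# conceptually, this replaces the usual [1, 77, 768] text conditioning:
# noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=cond)

# fine-tune only the cross-attention layers ("attn2" in LDM/diffusers UNets),
# leaving the rest of the UNet frozen:
# for name, p in unet.named_parameters():
#     p.requires_grad = ("attn2" in name)
```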
mahouko:
oh, then I've completely misunderstood
https://github.com/justinpinkney/stable-diffusion/blob/3a64aae085e511356cb935a3bf5a47d216a258a8/ldm/modules/encoders/modules.py#L179-L183
so it doesn't go 768->77 to make it the same length as a text embedding; it's doing something else.
okay, so usually the embedding size returned by `model.get_learned_conditioning()` (or rather from `FrozenCLIPEmbedder#transformer#text_model#embedding_forward`, aka `CLIPTextTransformer#embedding_forward`) would be `[1, 77, 768]`, or if you're doing multi-cond guidance, then `[n, 77, 768]`

are you saying that you just used the unchanged output of `CLIPImageEncoder`… and that it's just a single token… like this? `[1, 1, 768]`

okay, so something like this? `e.shape` is: `[1, 768]`

it kinda looks like you didn't make any changes to the model at all… so it already accepts a `[1, 768]` embedding just as happily as it accepts a `[1, 77, 768]` embedding? how does that work?

oh, I missed the `e = e.unsqueeze(1)`, so the shape of the embedding is actually `[1, 1, 768]`

so… looks like the embedding of a single token

what was the `self.projection = torch.nn.Linear(768, 768)` for? why unfreeze the image embedder?

and why is fine-tuning required for cross attention to excel at 1-token sequences? does that mean that its experience on 77-token sequences isn't enough to solve a subset of that problem?
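For reference, a rough sketch of the shape of the embedder being discussed, reconstructed only from the snippets quoted in this thread (the `torch.nn.Linear(768, 768)` projection and the `e.unsqueeze(1)`); the class and method names below are placeholders, not the actual code behind the link.

```python
# Sketch of an image-conditioning embedder, reconstructed from the quoted
# snippets only: projected CLIP image embedding -> learnable 768->768 map ->
# one-token sequence. The exact ordering/freezing in the linked code may differ.
import torch
import torch.nn as nn


class ImageEmbedderSketch(nn.Module):
    def __init__(self, clip_image_encoder: nn.Module):
        super().__init__()
        self.encoder = clip_image_encoder            # assumed to return [batch, 768]
        self.projection = torch.nn.Linear(768, 768)  # the layer mahouko asks about

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        e = self.encoder(image)  # projected CLIP image embedding: [batch, 768]
        e = self.projection(e)   # extra learnable 768 -> 768 map
        e = e.unsqueeze(1)       # -> [batch, 1, 768]: a one-token "context"
        return e
```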