
@Birch-san
Created November 19, 2022 19:10
Asking justinpinkney how stable-diffusion image variations works (i.e. how to finetune SD to condition on CLIP image embeddings)
Screenshots in comments below
https://canary.discord.com/channels/930499730843250783/950914873800396841/1026450454068084798
@Birch-san (Author)

[Two screenshots of the Discord conversation; transcribed in the next comment for accessibility.]

I didn't receive a response to my questions, so I can't confirm whether my understanding is correct.

@Birch-san (Author)

Text for accessibility:

justinpinkney, in response to:

CLIP image embeddings are a wider vector (so feels similar to the problem of trying to use multiple token embeddings)

Hello! I'm the person behind the image variations model. It's actually a little different to how you mentioned. I used the final output of the CLIP image encoder (e.g. 1x768) after the projection into the shared latent space. So it's fine tuning to accept a single token. But there is no problem in tuning the model to accept longer or shorter contexts. Just fine tune the cross attention layers and it works ok
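
(Editorial aside on the context-length point: a toy sketch of my own, not code from either repo. Cross-attention consumes the conditioning only as keys/values, so the context sequence length is free to be 77 tokens or 1; only the embedding width of 768 has to match the projection weights.)

import torch
from torch import nn

d = 768
to_q = nn.Linear(d, d, bias=False)
to_k = nn.Linear(d, d, bias=False)
to_v = nn.Linear(d, d, bias=False)

def cross_attend(x, context):
    # x:       [batch, query_tokens, 768]   (flattened UNet spatial features)
    # context: [batch, context_tokens, 768] (context_tokens can be 77 or 1)
    q, k, v = to_q(x), to_k(context), to_v(context)
    attn = (q @ k.transpose(-1, -2) / d**0.5).softmax(dim=-1)
    return attn @ v  # [batch, query_tokens, 768] regardless of context length

x = torch.randn(1, 4096, d)            # e.g. 64x64 latent positions
text_cond = torch.randn(1, 77, d)      # standard CLIP text conditioning
image_cond = torch.randn(1, 1, d)      # a single projected CLIP image embedding
print(cross_attend(x, text_cond).shape)   # torch.Size([1, 4096, 768])
print(cross_attend(x, image_cond).shape)  # torch.Size([1, 4096, 768])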

mahouko:

oh, then I've completely misunderstood
https://github.com/justinpinkney/stable-diffusion/blob/3a64aae085e511356cb935a3bf5a47d216a258a8/ldm/modules/encoders/modules.py#L179-L183
so it doesn't go 768->77 to make it the same length as a text embedding; it's doing something else.
okay, so usually the embedding size returned by model.get_learned_conditioning() (or rather from FrozenCLIPEmbedder#transformer#text_model#embedding_forward, aka CLIPTextTransformer#embedding_forward) would be
[1, 77, 768]
or if you're doing multi-cond guidance, then
[n, 77, 768]
are you saying that you just used the unchanged output of CLIPImageEncoder… and that it's just a single token… like this?
[1, 1, 768]

okay, so something like this?

from PIL import Image
import clip
import kornia
import torch
from torch import tensor
from torchvision import transforms

device = torch.device('cpu')
model, _ = clip.load(name='ViT-L/14', device=device, jit=False)
# CLIP's own normalization constants
mean = tensor([0.48145466, 0.4578275, 0.40821073], device=device)
std = tensor([0.26862954, 0.26130258, 0.27577711], device=device)
tforms = transforms.Compose([
   transforms.Resize(224),
   transforms.CenterCrop((224, 224)),
   transforms.ToTensor(),
])
i = Image.open('/Users/birch/git/stable-diffusion/outputs/txt2img-samples/samples0/00021.png').convert("RGB")
inp = tforms(i).unsqueeze(0)
inp = inp*2-1  # rescale [0, 1] -> [-1, 1], the range the embedder expects as input

# mirror the embedder's preprocessing: resize, back to [0, 1], then CLIP-normalize
x = kornia.geometry.resize(inp, (224, 224),
                           interpolation='bicubic', align_corners=True,
                           antialias=False)
x = (x + 1.) / 2.
# renormalize according to clip
x = kornia.enhance.normalize(x, mean, std)

e = model.encode_image(x)

e.shape is:
[1, 768]

it kinda looks like you didn't make any changes to the model at all… so it already accepts a [1, 768] embedding just as happily as it accepts a [1, 77, 768] embedding? how does that work?

oh, I missed the e = e.unsqueeze(1)
so the shape of the embedding is actually [1, 1, 768]
so… looks like the embedding of a single token
what was the self.projection = torch.nn.Linear(768, 768) for? why unfreeze the image embedder?
and why is fine-tuning required for cross attention to excel at 1-token sequences? does that mean that its experience on 77-token sequences isn't enough to solve a subset of that problem?
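
(For reference, a rough sketch of how I understand the linked FrozenClipImageEmbedder lines: CLIP-encode the image to [1, 768], project it through the Linear(768, 768), then unsqueeze to a one-token [1, 1, 768] sequence that stands in for the usual [1, 77, 768] text conditioning. Class and variable names here are illustrative, not copied from the repo.)

import torch
from torch import nn

class ImageEmbedderSketch(nn.Module):
    def __init__(self, clip_model):
        super().__init__()
        self.model = clip_model                # the CLIP model loaded above
        self.projection = nn.Linear(768, 768)  # the trainable projection asked about above

    def forward(self, x):
        e = self.model.encode_image(x)         # [batch, 768]
        e = self.projection(e.float())         # still [batch, 768]
        return e.unsqueeze(1)                  # [batch, 1, 768]: a one-token "sequence"

# cond = ImageEmbedderSketch(model)(x)  # [1, 1, 768], a drop-in for the [1, 77, 768] text conditioning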
