Hi, in this post I'll explain how I created this gif:
There's an old meme variant where Winnie the Pooh is eating recursion, and I love it. My goal was to recreate it using locally runnable AI models and free software.

I specifically worked with free (as in libre) software. These are all GPL licensed 👍
These run well on my 12GB VRAM video card:
- Qwen-Image - The latest open-weight image generation model as of this writing, and best in class for written text, which is needed here.
- Wan 2.2 A14B (I2V) - Pretty much the SOTA open-weight video generation model; it takes a starting frame and generates a video from it. Both the quality and the prompt adherence are very good.
Initially I tried just using the Text-To-Video variant of Wan, but there were multiple problems:
- The text was unreadable to the point that the joke was lost.
- Low aesthetics - I just didn't like the look, but judge it yourself:

Instead, I decided to create a good enough initial frame, and then use an Image-To-Video model to continue that.
- I used the default Comfy template for Qwen-Image, with the Lightning LoRA.
- Here is the prompt that worked after a few iterations:
An anthropomorphic tiger and an anthropomorphic bear sitting in a wooden cabin. The bear has yellow fur and wears red shirt. The tiger has orange fur with black stripes. The bear is holding a huge honeypot. The tiger is shouting in a comics bubble: "That's not honey, you're eating recursion!". 3D CGI movie aesthetic.
I'm not a prompt master, so let me know if you can improve on this ^ For example, I couldn't get the bear's fur to come out yellow.
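By the way, if ComfyUI isn't your thing, roughly the same first frame can also be generated with diffusers. This is only a sketch, assuming a recent diffusers build that ships Qwen-Image support; the resolution and sampler settings are guesses, and the full bf16 model needs offloading because it's far bigger than 12GB of VRAM (unlike my Lightning-LoRA ComfyUI setup):

```python
import torch
from diffusers import DiffusionPipeline

# Assumes a diffusers version with Qwen-Image support
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # the full model does not fit in 12GB of VRAM

prompt = (
    'An anthropomorphic tiger and an anthropomorphic bear sitting in a wooden cabin. '
    'The bear has yellow fur and wears red shirt. The tiger has orange fur with black '
    'stripes. The bear is holding a huge honeypot. The tiger is shouting in a comics '
    'bubble: "That\'s not honey, you\'re eating recursion!". 3D CGI movie aesthetic.'
)

image = pipe(
    prompt=prompt,
    width=1664, height=928,        # guess at a native 16:9 resolution
    num_inference_steps=50,
    true_cfg_scale=4.0,            # classifier-free guidance with a real negative prompt
    negative_prompt=" ",
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("first_frame.png")
```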
Qwen-Image almost aced the text, but you can notice the word "recurrsion" is mistyped. It makes the same mistake with every seed, which makes me believe it's an issue with the tokenizer - probably splitting the word up into something like "recur(r)" and "sion".
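If you want to poke at that theory, you can peek at how the text encoder's tokenizer splits the word with transformers. Qwen-Image's text encoder comes from the Qwen2.5-VL family; the exact checkpoint below is just an assumption, swap in whichever one your workflow actually loads:

```python
from transformers import AutoTokenizer

# Stand-in checkpoint from the Qwen2.5-VL family that Qwen-Image's text encoder is based on
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

for word in ["recursion", "recurrsion"]:
    print(word, "->", tok.tokenize(word))
```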
A good friend suggested I put "purr" in the negative prompt, reasoning that the model fails here because tigers/cats are associated with purring 🐱. I'm really not sure whether this theory holds, but it solved the problem consistently! In the end I decided to use the original purring version for the lulz.

- Once again I started off from ComfyUI's template for Wan I2V, with the Lightning LoRAs.
- The first problem was getting the camera motion right. Even though Wan 2.2 is supposed to be good at understanding camera motion prompts, I had to use a custom LoRA (Camera Tilt-Up Overhead) to achieve it.
- To ease the compositing effort I added a part to the prompt asking for "solid green" in the honeypot. It wasn't flawless (there is still some noise in there), but it's good enough.
And this is the last frame. I wanted a looping video, which means I needed a fully green last frame.

I needed to extend the video so that the camera continues zooming in on the pot until the whole frame is green.
For this purpose the same Wan 2.2 I2V model can be used. After many attempts, though, the model just wouldn't zoom into the pot far enough to end on a fully green frame. So my idea was to use a different conditioning node, WanFirstLastFrameVideo, which accepts a last frame too. Here's the solution that worked well:
- The first frame is the last frame of the previously generated video
- (but... after some experimenting I cut off a few of those last frames so the camera motion is smoother between the two parts. This was easy using the VHS custom node)
- And for the last frame I just used Paint to fill a green image 🟩 (a scripted alternative is sketched right after this list)
- Here is the full prompt for this 2nd part:
The pot is filled with green. The camera slowly speeds up and moves forward fast until the green fully covers the frame and everything else disappears beyond the frame. At the end it just shows a solid green frame.
This might be a bit overkill but I made up the prompt before coming up with the last frame idea.
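If you'd rather not open Paint, the green frame is a one-liner with Pillow - a quick sketch, where the 1280x720 size is just an assumption (match your video's resolution):

```python
from PIL import Image

# Solid green frame to feed WanFirstLastFrameVideo as the last frame
Image.new("RGB", (1280, 720), (0, 255, 0)).save("green_last_frame.png")
```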

Time to open Blender. Also, disclaimer: it's my first time using the compositor and the sequencer, but it was super easy. Blender's GUI has progressed a lot in recent years.
In the Compositing workspace I just added a Movie Clip node and connected a Keying node with default settings, eyedropping the green color from the video.
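Since this graph is only three nodes, here's roughly the same setup scripted with Blender's Python API - a minimal sketch, where the clip path is a placeholder and the hardcoded pure green is only an approximation of whatever the eyedropper actually picks up:

```python
import bpy

scene = bpy.context.scene
scene.use_nodes = True          # enable the compositor node tree for this scene
tree = scene.node_tree
tree.nodes.clear()

# Load the generated video (placeholder path, relative to the .blend file)
clip = bpy.data.movieclips.load("//wan_output.mp4")
clip_node = tree.nodes.new("CompositorNodeMovieClip")
clip_node.clip = clip

# Keying node with default settings; key color approximated as pure green
keying = tree.nodes.new("CompositorNodeKeying")
keying.inputs["Key Color"].default_value = (0.0, 1.0, 0.0, 1.0)

composite = tree.nodes.new("CompositorNodeComposite")

tree.links.new(clip_node.outputs["Image"], keying.inputs["Image"])
tree.links.new(keying.outputs["Image"], composite.inputs["Image"])

# Render to a format with alpha so the keyed-out area stays transparent
scene.render.image_settings.file_format = "PNG"
scene.render.image_settings.color_mode = "RGBA"
```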
This was a bit of a headache as a first-time user of the Video Editing workspace: even though I managed to set up the sequence easily, Blender kept crashing once I tried to render it. The problem was that scenes were referencing each other in a recursive way. I'm not sure why this had to be so complicated, but my final solution uses 3 scenes:
- Scene #1 to do the above compositing that creates the transparent clip.
- Scene #2 takes Scene #1's camera input and puts it in the sequencer; that's it.
- Scene #3 takes Scene #2's sequencer input and that's where I do the actual sequencing:

As you can see there are two channels where the same video is played, offset by a few seconds. I also had to add a static image at the beginning of the transition, so that the text doesn't disappear before it's uncovered behind the honeypot's edge. Overall it took some moving around to get the loop perfect, but it was way easier than expected.

Finally I rendered the output video and then converted it to a GIF with ffmpeg. Send me your ideas, cheers! -bezo
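PS: for the GIF conversion, the usual two-pass palette trick gives much nicer colors than a naive convert. A sketch (wrapped in Python here; filenames, fps and size are placeholders):

```python
import subprocess

# One ffmpeg call: palettegen builds an optimized 256-color palette from the
# video, paletteuse then applies it while converting to GIF.
subprocess.run([
    "ffmpeg", "-i", "render.mp4",
    "-vf", "fps=15,scale=480:-1:flags=lanczos,"
           "split[a][b];[a]palettegen[p];[b][p]paletteuse",
    "-loop", "0",        # 0 = loop forever
    "pooh_recursion.gif",
], check=True)
```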