Objective

Informal (vibes-based) evaluation of the following vision-language-model captioners:

Florence-2-base-ft
CogVLM2
BLIP-2
MoonDream2
Share-Captioner
Florence-2-SD3-Captioner

TODO?

InternVL
BLIP-3
LLaVA-3 Next
llava-next 8b (is this the same as the previous one?)

Methodology

Human reference

I picked three sample images from Megalith-10m and tried writing some basic captions myself.

1. a drone photograph of a light-grey house surrounded by a neatly-trimmed green lawn in the middle of a deciduous forest on a sunny, blue-skied day. source
2. a closeup top-view photograph of a person's grey-and-white right shoe stepping on wet brown leaves. source
3. a shallow-focus photograph of justin trudeau putting on a black cloth mask with an embroidered red maple leaf at an indoor press conference. source

I graded my own captions, marking correct bits of description with ✅ and incorrect ones with ❌:

1. (12✅) a drone✅ photograph✅ of a light-grey✅ house✅ surrounded✅ by a neatly-trimmed✅ green✅ lawn✅ in the middle of a deciduous✅ forest✅ on a sunny✅, blue-skied✅ day.
2. (12✅) a closeup✅ top-view✅ photograph✅ of a person's✅ grey✅-and-white✅ right✅ shoe✅ stepping✅ on wet✅ brown✅ leaves✅.
3. (12✅, 1❌) a shallow-focus✅ photograph✅ of justin trudeau✅ putting on❌ a black✅ cloth✅ mask✅ with an embroidered✅ red✅ maple✅ leaf✅ at an indoor✅ press conference✅.
- In retrospect, "putting on" is ungrounded; he could be taking the mask off.

My average score across the three images was 12.0✅, 0.3❌.

Next, I assessed each captioner in turn, using whatever endpoint I could find and playing around a bit to find good "caption-style" settings. I didn't add or deduct any points for filler, just for specific falsifiable details.
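
The tallying itself is simple enough to sketch in a few lines. This is a minimal sketch, assuming each graded caption is kept as a plain string with ✅/❌ written directly after the graded phrase and an optional leading "(N✅, M❌)" summary; the names `tally` and `average` are just illustrative, not part of any library.

```python
import re

CORRECT, INCORRECT = "✅", "❌"

# Optional leading summary such as "(12✅, 1❌)". It restates the marks that
# follow, so it is stripped before counting to avoid double-counting.
SUMMARY_PREFIX = re.compile(r"^\s*\(\s*\d+\s*✅(?:\s*,\s*\d+\s*❌)?\s*\)\s*")


def tally(annotated):
    """Return (correct, incorrect) mark counts for one annotated caption."""
    body = SUMMARY_PREFIX.sub("", annotated)
    return body.count(CORRECT), body.count(INCORRECT)


def average(tallies):
    """Average a list of (correct, incorrect) pairs across images."""
    n = len(tallies)
    return sum(c for c, _ in tallies) / n, sum(w for _, w in tallies) / n


# The three graded human-reference captions from above.
human = [
    "(12✅) a drone✅ photograph✅ of a light-grey✅ house✅ surrounded✅ by a "
    "neatly-trimmed✅ green✅ lawn✅ in the middle of a deciduous✅ forest✅ on a "
    "sunny✅, blue-skied✅ day.",
    "(12✅) a closeup✅ top-view✅ photograph✅ of a person's✅ grey✅-and-white✅ "
    "right✅ shoe✅ stepping✅ on wet✅ brown✅ leaves✅.",
    "(12✅, 1❌) a shallow-focus✅ photograph✅ of justin trudeau✅ putting on❌ a "
    "black✅ cloth✅ mask✅ with an embroidered✅ red✅ maple✅ leaf✅ at an indoor✅ "
    "press conference✅.",
]

correct, incorrect = average([tally(c) for c in human])
print(f"{correct:.2f}✅, {incorrect:.2f}❌")
```

Filler words carry no mark, so they never affect the counts, which matches the "no points added or deducted for filler" rule.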

Florence-2-base-ft

https://huggingface.co/spaces/gokaygokay/Florence-2

I used the "More-detailed-caption" task.

(16✅, 2❌) This is an aerial view✅ of a house✅. The house is white❌. The roof of the house is gray✅. The sky is blue✅ with white✅ clouds✅. The trees✅ are green✅. The ground is covered✅ with grass✅. The grass is green✅. There is a fence✅ by the trees✅. The fence is made of wood❌. There are bushes✅ by the fence✅. The bushes are dark green✅
- I marked "white" wrong for the house, since it's clearly grey or tan with white trim.
- I marked "made of wood" wrong for the fence, since I'm pretty sure it's metal (or at least not obviously wood).
(12✅) The leaves✅ on the ground✅ are brown✅. There is a person✅ wearing✅ a gray✅ shoe✅. The shoe has white✅ laces✅ on it. The person is wearing✅ black✅ pants✅.
(10✅, 3❌) A man✅ is wearing❌ a black✅ mask✅. There is a red✅ flower❌ on the mask✅. The man has dark brown✅ hair✅. There are two✅ red❌ flags✅ behind✅ the man.
- I marked "wearing" wrong since he's clearly donning / doffing the mask.
- Calling the leaf a flower is obviously wrong.
- I marked "red" incorrect for the flags since the visible portion of the Canadian flags is both red and white.

Florence-2-base-ft's average score was 12.7✅, 1.7❌

CogVLM2

https://huggingface.co/spaces/THUDM/CogVLM-CogAgent

I used the prompt "Write a detailed alt-text caption for this image." and selected "Use English model".

(21✅, 1❌) The image showcases a bird's-eye view❌ of a residential area✅ nestled✅ amidst lush✅ greenery✅. A two-story✅ house✅ with a gray✅ roof✅ stands prominently✅ in the foreground, surrounded by a well-maintained✅ lawn✅. Adjacent✅ to the house is a driveway✅ leading to a detached✅ garage✅. The backdrop is dominated by a dense✅ forest✅, and the sky overhead✅ is dotted✅ with fluffy white✅ clouds✅.
- I marked "bird's-eye view" as incorrect since I believe that would imply we're looking straight down at the ground, which isn't the case here.
(12✅) A close-up✅ view captures a single✅, light-colored✅ sneaker✅ with white✅ laces✅ stepping onto✅ a dense✅ carpet✅ of wet✅, brown✅ autumn leaves✅.
(18✅) A man✅ in a suit✅ and tie✅ is adjusting✅ a navy blue✅ face mask✅, which features a red✅ maple✅ leaf✅ emblem✅, against✅ his face✅. He stands✅ in front of a backdrop✅ with the Canadian✅ flag✅, suggesting a formal or official✅ setting in Canada✅.
- In my own caption I called the mask "black", but I think "navy blue" is also reasonable: it's a very dark blue that reads as close to black in this lighting.

CogVLM2's average score was 17.0✅, 0.3❌

(I initially thought I was grading CogVLM1, but I tried http://36.103.203.44:7861, which lists itself as CogVLM2-Chat-En, and got the exact same responses, so I think the original Space is actually serving CogVLM2.)

BLIP-2

https://huggingface.co/spaces/hysts/BLIP2

I tried both VQA and Captioning modes and couldn't get high-detail captions out of either (possibly an issue with the sampling settings, I'm not sure), so I just left it on Captioning.

(3✅) an aerial view✅ of a house✅ in the woods✅
(3✅,1❌) a person✅ standing✅ on a pile❌ of leaves✅
- This is clearly a shallow layer of leaves on top of a hard surface, not a pile.
(4✅,1❌) a man✅ in a suit✅ and tie✅ wearing❌ a face mask✅
- Not wearing! CogVLM2 said "adjusting", which is acceptable.

BLIP-2's average score was 3.3✅, 0.7❌

MoonDream2

https://huggingface.co/spaces/vikhyatk/moondream2

I used the prompt "Caption this image, in extreme detail." which seemed to give better caption-style responses than the default "describe this image" prompt.

(19✅, 1❌) An aerial view✅ of a serene residential area✅ reveals a two-story✅ house✅ with a gray✅ roof✅ and a white✅ garage✅, nestled✅ amidst a lush green✅ lawn✅ and a variety of trees✅. A small✅ shed❌ is visible on the left✅ side of the image. The sky✅ above is a clear blue✅, dotted✅ with fluffy white✅ clouds✅.
- I docked a point since the caption makes it sound like the garage is attached to the house (with a separate shed). It's possible that Moondream2's "shed" refers to the sliver of rooftop visible in the very bottom-right, but that looks like a proper house to me so I'm not counting it.
(13✅, 1❌) A gray✅ sneaker✅ with white✅ laces✅ stands out against a backdrop✅ of fallen✅ autumn leaves✅ in various shades of brown✅ and orange✅. The sneaker is positioned in the center✅ of the image, with the person's✅ foot✅ visible❌ on the right✅ side. The leaves are scattered around the person's feet, creating a sense of depth and perspective.
- I docked a point since this caption implies that the person is not actually wearing the sneaker (and is instead positioning their foot to the right of an unworn sneaker).
(15✅) A man✅ in a dark✅ suit✅ and tie✅ adjusts✅ a black✅ face mask✅ with a red✅ maple✅ leaf✅, standing✅ in front✅ of a red✅ and white✅ flag✅.
- It kinda sounds like he's using the leaf to adjust the mask, but we'll let it go.

MoonDream2's average score was 15.7✅, 0.7❌

Share-Captioner

https://huggingface.co/spaces/Lin-Chen/Share-Captioner

Overall, these captions capture an acute sense of nonspecific waffling, possibly in order to pad out a word count of some kind.

(16✅, 11❌) The image presents an expansive aerial✅ view of a tranquil suburban✅ neighborhood❌. Dominating the scene is a two-story✅ house✅, its gray✅ roof✅ contrasting with the white❌ walls. The house is nestled✅ amidst lush greenery✅, with trees and shrubbery enveloping❌ it, providing a sense of privacy and seclusion. To the left✅ of the house, a small✅ shed✅ stands, perhaps a storage unit for garden tools or outdoor equipment. On the right❌, a larger❌ garage❌ can be seen, possibly housing a vehicle or serving as a workshop. The neighborhood itself is dotted❌ with several❌ other houses❌, their rooftops❌ visible in the distance❌, suggesting a close-knit community. Above, the sky✅ is a clear blue✅, with fluffy white✅ clouds✅ scattered✅ across it, adding to the serene ambiance of the scene. Overall, the image captures a peaceful day in a suburban neighborhood, where houses coexist harmoniously with nature.
- Lots of penalties here. The walls aren't white (the trim is), the shrubbery is not enveloping the house, there's no garage on the right, and there are not other houses in the background.
(18✅,1❌) The image captures a moment of tranquility in an autumn setting. At the center✅ of the frame, a person's✅ foot✅, clad✅ in a gray✅ sneaker✅ with white✅ laces✅, is seen stepping✅ on a bed✅ of fallen✅ leaves✅. The leaves, a mix of orange✅ and brown✅ hues, are scattered✅ across the ground✅, hinting at the arrival of fall. The perspective of the image is from above✅, providing a bird's eye view of the scene. This angle accentuates the vastness of the leaf-strewn ground and the solitary journey of the foot forward. There are no discernible texts or other objects in the image✅. The relative position of the foot to the leaves suggests movement, as if the person is walking through this sea of fallen leaves. The image does not❌ provide any additional details that could be used to identify the person or the location.
- There are definitely additional details that could have been mentioned here (the leaves are mostly serrated and round, the person is wearing striped black socks / black pants, the laces are double-knotted...) so I'm deducting a point for saying there aren't any.
(20✅) In the image, a man✅ is the central✅ figure, standing✅ against a backdrop✅ of a wooden✅ wall✅ adorned with Canadian✅ flags✅. He is dressed in a formal suit✅ and tie✅, adding a professional touch to the scene. His attention is focused✅ on a black✅ face mask✅ that he is adjusting✅ with both✅ hands✅. The mask is notable for its red✅ maple✅ leaf✅ design, a symbol often associated with Canada. The man's action of adjusting the mask suggests a moment of preparation or attention to detail. The Canadian flags in the background add a patriotic element to the image, reinforcing the context of the man's attire. The overall composition of the image suggests a formal or professional setting, possibly related to Canadian politics✅ or business.

Share-Captioner's average score was 18.0✅, 4.0❌

Florence-2-SD3-Captioner

https://huggingface.co/spaces/bdsqlsz/Florence-2-SD3-Captioner

(15✅, 2❌) a large✅ white❌ house✅ with a gray✅ roof✅ nestled in the middle✅ of a green✅ field✅. The house is surrounded by trees✅ and shrubs✅, and there is a hill✅ in the background. The sky is blue✅ with some clouds✅ scattered✅ across it. On the left side is well-manicured and there are a few❌ trees scattered throughout the landscape. The image appears to be taken from a high vantage point✅, looking down✅ on the house and the surrounding area.
- The house isn't white.
- There are a ton of trees, not just a few.
- I'm not sure what "On the left side is well-manicured" is supposed to mean.
(13✅, 6❌) a person's✅ foot✅ standing✅ on a bed✅ of fallen✅ leaves✅. The leaves are in various shades of brown✅ and orange✅, indicating that it is autumn. The person is wearing✅ a pair❌ of grey✅ sneakers✅ with white✅ laces✅. The background is blurred❌, but it appears to be a wooded❌ area. The image is taken from a low angle❌, looking up❌ at the person's feet❌.
- Not a pair of sneakers, just one.
- The background is not blurred.
- No trees are visible, so "wooded" is ungrounded.
- The camera angle is wrong: this is a top-down view, not a low angle.
- The viewing direction is wrong: the camera looks down at the foot, not up.
- Only one foot is visible.
(22✅, 1❌) a man✅ wearing a blue❌ face mask✅ with a red✅ maple✅ leaf✅ on it✅. He is holding✅ the mask up to his face✅ with both✅ hands✅ and appears to be adjusting✅ it with his fingers✅. The man has dark✅ hair✅ and is wearing✅ a suit✅ and tie✅. The background is blurred✅, but it seems like he is in a room✅ with a Canadian✅ flag✅ visible behind✅ him.
- The mask is near-black (very dark blue), so just saying "blue face mask" is misleading.

Florence-2-SD3-Captioner's average score was 16.7✅, 3.0❌
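
As a sanity check on the per-model averages, here is a minimal sketch that recomputes them from the per-image tallies reported in each section. The `TALLIES` table and the `averages` helper are illustrative names; the numbers are transcribed from the parenthesized counts above, and the output is printed to two decimals so that rounding choices stay visible.

```python
# Per-image (correct, incorrect) tallies, transcribed from the sections above.
TALLIES = {
    "Human reference": [(12, 0), (12, 0), (12, 1)],
    "Florence-2-base-ft": [(16, 2), (12, 0), (10, 3)],
    "CogVLM2": [(21, 1), (12, 0), (18, 0)],
    "BLIP-2": [(3, 0), (3, 1), (4, 1)],
    "MoonDream2": [(19, 1), (13, 1), (15, 0)],
    "Share-Captioner": [(16, 11), (18, 1), (20, 0)],
    "Florence-2-SD3-Captioner": [(15, 2), (13, 6), (22, 1)],
}


def averages(tallies):
    """Map each captioner to its (mean correct, mean incorrect) per image."""
    result = {}
    for name, per_image in tallies.items():
        n = len(per_image)
        mean_correct = sum(c for c, _ in per_image) / n
        mean_incorrect = sum(w for _, w in per_image) / n
        result[name] = (mean_correct, mean_incorrect)
    return result


for name, (mean_correct, mean_incorrect) in averages(TALLIES).items():
    print(f"{name:26s} {mean_correct:5.2f}✅ {mean_incorrect:5.2f}❌")
```

With only three images per captioner, a single bad caption (like Share-Captioner's first image) moves the average a lot, so these numbers are best read alongside the per-image notes rather than on their own.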

Conclusion

My preliminary vibes are:

CogVLM2 (quite detailed, waffles a bit) >
Moondream2 (less detailed but also a bit less waffling) >
Florence-2-base-ft (okay, makes some stuff up) >
Florence-2-SD3-Captioner (detailed with minimal yapping, but hallucinated a lot on my second image) >
Share-Captioner (detailed, but tons of yapping and hallucinated a lot on my first image) >
BLIP-2 (very low-detail and still some mistakes)

madebyollin/informal_evaluation_of_vlm_captioners.md

madebyollin commented Jul 10, 2024
