Informal (vibes-based) evaluation of the following vision-language-model captioners:
- Florence-2-base-ft
- CogVLM2
- BLIP-2
- MoonDream2
- Share-Captioner
- Florence-2-SD3-Captioner
TODO?
- InternVL
- BLIP-3
- LLaVA-3 Next
- llava-next 8b (is this the same as the previous one?)
I picked three sample images from Megalith-10m and tried writing some basic captions myself.
- a drone photograph of a light-grey house surrounded by a neatly-trimmed green lawn in the middle of a deciduous forest on a sunny, blue-skied day. source
- a closeup top-view photograph of a person's grey-and-white right shoe stepping on wet brown leaves. source
- a shallow-focus photograph of justin trudeau putting on a black cloth mask with an embroidered red maple leaf at an indoor press conference. source
I graded my own captions, marking correct bits of description with ✅ and incorrect ones with ❌.
My average score across the three images was 12.0✅, 0.3❌.
Next, I assessed each captioner in turn, using whatever endpoint I could find and playing around a bit to find good "caption-style" settings. I didn't add or deduct any points for filler, just for specific falsifiable details.
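Once a caption is annotated, the tallying is mechanical. A minimal sketch, assuming the marks are always the literal ✅ and ❌ characters; `tally` is a hypothetical helper name, not part of any captioner's tooling:

```python
def tally(annotated_caption: str) -> tuple[int, int]:
    """Return (correct, incorrect) counts for a caption annotated with ✅ / ❌."""
    correct = annotated_caption.count("✅")
    incorrect = annotated_caption.count("❌")
    return correct, incorrect


# Example: a short annotated caption in the style used throughout this post.
print(tally("a person✅ standing✅ on a pile❌ of leaves✅"))  # (3, 1)
```

Filler words carry no mark, so they neither add to nor subtract from either count.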
https://huggingface.co/spaces/gokaygokay/Florence-2
- (16✅, 2❌) This is an aerial view✅ of a house✅. The house is white❌. The roof of the house is gray✅. The sky is blue✅ with white✅ clouds✅. The trees✅ are green✅. The ground is covered✅ with grass✅. The grass is green✅. There is a fence✅ by the trees✅. The fence is made of wood❌. There are bushes✅ by the fence✅. The bushes are dark green✅
- I marked "white" wrong for the house, since it's clearly grey or tan with white trim.
- I marked "made of wood" wrong for the fence, since I'm pretty sure it's metal (or at least not obviously wood)
- (12✅) The leaves✅ on the ground✅ are brown✅. There is a person✅ wearing✅ a gray✅ shoe✅. The shoe has white✅ laces✅ on it. The person is wearing✅ black✅ pants✅.
- (10✅, 3❌) A man✅ is wearing❌ a black✅ mask✅. There is a red✅ flower❌ on the mask✅. The man has dark brown✅ hair✅. There are two✅ red❌ flags✅ behind✅ the man.
- I marked "wearing" wrong since he's clearly donning / doffing the mask
- Calling the leaf a flower is obviously wrong
- I marked "red" incorrect for the flags since the visible portion of the Canadian flags is both red and white.
Florence-2's average score was 12.7✅, 1.7❌
https://huggingface.co/spaces/THUDM/CogVLM-CogAgent
- (21✅, 1❌) The image showcases a bird's-eye view❌ of a residential area✅ nestled✅ amidst lush✅ greenery✅. A two-story✅ house✅ with a gray✅ roof✅ stands prominently✅ in the foreground, surrounded by a well-maintained✅ lawn✅. Adjacent✅ to the house is a driveway✅ leading to a detached✅ garage✅. The backdrop is dominated by a dense✅ forest✅, and the sky overhead✅ is dotted✅ with fluffy white✅ clouds✅.
- I marked "bird's-eye view" as incorrect since I believe this would imply we're looking directly towards the ground
- (12✅) A close-up✅ view captures a single✅, light-colored✅ sneaker✅ with white✅ laces✅ stepping onto✅ a dense✅ carpet✅ of wet✅, brown✅ autumn leaves✅.
- (18✅) A man✅ in a suit✅ and tie✅ is adjusting✅ a navy blue✅ face mask✅, which features a red✅ maple✅ leaf✅ emblem✅, against✅ his face✅. He stands✅ in front of a backdrop✅ with the Canadian✅ flag✅, suggesting a formal or official✅ setting in Canada✅.
- In my own caption I called the mask "black", but I think "navy blue" is also reasonable - it's a very dark blue that's close to black in this lighting.
CogVLM2's average score was 17.0✅, 0.3❌
(I initially thought I was grading CogVLM1, but I tried http://36.103.203.44:7861 which lists itself as CogVLM2-Chat-En and got the exact same responses, so I think the original space is actually CogVLM2)
https://huggingface.co/spaces/hysts/BLIP2
- (3✅) an aerial view✅ of a house✅ in the woods✅
- (3✅,1❌) a person✅ standing✅ on a pile❌ of leaves✅
- This is clearly a shallow layer of leaves on top of a hard surface, not a pile
- (4✅,1❌) a man✅ in a suit✅ and tie✅ wearing❌ a face mask✅
- Not wearing! CogVLM said "adjusting" which is acceptable.
BLIP-2's average score was 3.3✅, 0.7❌
https://huggingface.co/spaces/vikhyatk/moondream2
- (19✅, 1❌) An aerial view✅ of a serene residential area✅ reveals a two-story✅ house✅ with a gray✅ roof✅ and a white✅ garage✅, nestled✅ amidst a lush green✅ lawn✅ and a variety of trees✅. A small✅ shed❌ is visible on the left✅ side of the image. The sky✅ above is a clear blue✅, dotted✅ with fluffy white✅ clouds✅.
- I docked a point since the caption makes it sound like the garage is attached to the house (with a separate shed). It's possible that Moondream2's "shed" refers to the sliver of rooftop visible in the very bottom-right, but that looks like a proper house to me so I'm not counting it.
- (13✅, 1❌) A gray✅ sneaker✅ with white✅ laces✅ stands out against a backdrop✅ of fallen✅ autumn leaves✅ in various shades of brown✅ and orange✅. The sneaker is positioned in the center✅ of the image, with the person's✅ foot✅ visible❌ on the right✅ side. The leaves are scattered around the person's feet, creating a sense of depth and perspective.
- I docked a point since this caption implies that the person is not actually wearing the sneaker (and is instead positioning their foot to the right of the unworn sneaker).
- (15✅) A man✅ in a dark✅ suit✅ and tie✅ adjusts✅ a black✅ face mask✅ with a red✅ maple✅ leaf✅, standing✅ in front✅ of a red✅ and white✅ flag✅.
- It kinda sounds like he's using the leaf to adjust the mask, but we'll let it go.
Moondream2's average score was 15.7✅, 0.7❌
https://huggingface.co/spaces/Lin-Chen/Share-Captioner
- (16✅, 11❌) The image presents an expansive aerial✅ view of a tranquil suburban✅ neighborhood❌. Dominating the scene is a two-story✅ house✅, its gray✅ roof✅ contrasting with the white❌ walls. The house is nestled✅ amidst lush greenery✅, with trees and shrubbery enveloping❌ it, providing a sense of privacy and seclusion. To the left✅ of the house, a small✅ shed✅ stands, perhaps a storage unit for garden tools or outdoor equipment. On the right❌, a larger❌ garage❌ can be seen, possibly housing a vehicle or serving as a workshop. The neighborhood itself is dotted❌ with several❌ other houses❌, their rooftops❌ visible in the distance❌, suggesting a close-knit community. Above, the sky✅ is a clear blue✅, with fluffy white✅ clouds✅ scattered✅ across it, adding to the serene ambiance of the scene. Overall, the image captures a peaceful day in a suburban neighborhood, where houses coexist harmoniously with nature.
- Lots of penalties here. The walls aren't white (the trim is), the shrubbery is not enveloping the house, there's no garage on the right, and there are not other houses in the background.
- (18✅,1❌) The image captures a moment of tranquility in an autumn setting. At the center✅ of the frame, a person's✅ foot✅, clad✅ in a gray✅ sneaker✅ with white✅ laces✅, is seen stepping✅ on a bed✅ of fallen✅ leaves✅. The leaves, a mix of orange✅ and brown✅ hues, are scattered✅ across the ground✅, hinting at the arrival of fall. The perspective of the image is from above✅, providing a bird's eye view of the scene. This angle accentuates the vastness of the leaf-strewn ground and the solitary journey of the foot forward. There are no discernible texts or other objects in the image✅. The relative position of the foot to the leaves suggests movement, as if the person is walking through this sea of fallen leaves. The image does not❌ provide any additional details that could be used to identify the person or the location.
- There are definitely additional details that could have been mentioned here (the leaves are mostly serrated and round, the person is wearing striped black socks / black pants, the laces are double-knotted...) so I'm deducting a point for saying there aren't any.
- (20✅) In the image, a man✅ is the central✅ figure, standing✅ against a backdrop✅ of a wooden✅ wall✅ adorned with Canadian✅ flags✅. He is dressed in a formal suit✅ and tie✅, adding a professional touch to the scene. His attention is focused✅ on a black✅ face mask✅ that he is adjusting✅ with both✅ hands✅. The mask is notable for its red✅ maple✅ leaf✅ design, a symbol often associated with Canada. The man's action of adjusting the mask suggests a moment of preparation or attention to detail. The Canadian flags in the background add a patriotic element to the image, reinforcing the context of the man's attire. The overall composition of the image suggests a formal or professional setting, possibly related to Canadian politics✅ or business.
Share-Captioner's average score was 18.0✅, 4.0❌
https://huggingface.co/spaces/bdsqlsz/Florence-2-SD3-Captioner
- (15✅, 2❌) a large✅ white❌ house✅ with a gray✅ roof✅ nestled in the middle✅ of a green✅ field✅. The house is surrounded by trees✅ and shrubs✅, and there is a hill✅ in the background. The sky is blue✅ with some clouds✅ scattered✅ across it. On the left side is well-manicured and there are a few❌ trees scattered throughout the landscape. The image appears to be taken from a high vantage point✅, looking down✅ on the house and the surrounding area.
- The house isn't white.
- There are a ton of trees, not just a few.
- I'm not sure what "On the left side is well-manicured" is supposed to mean
- (13✅, 6❌) a person's✅ foot✅ standing✅ on a bed✅ of fallen✅ leaves✅. The leaves are in various shades of brown✅ and orange✅, indicating that it is autumn. The person is wearing✅ a pair❌ of grey✅ sneakers✅ with white✅ laces✅. The background is blurred❌, but it appears to be a wooded❌ area. The image is taken from a low angle❌, looking up❌ at the person's feet❌.
- Not a pair of sneakers, just one.
- The background is not blurred
- No trees are visible
- "Low angle" is wrong: the photo is taken from above
- "Looking up" is wrong for the same reason: the camera is looking down at the shoe
- Only one foot is visible
- (22✅, 1❌) a man✅ wearing a blue❌ face mask✅ with a red✅ maple✅ leaf✅ on it✅. He is holding✅ the mask up to his face✅ with both✅ hands✅ and appears to be adjusting✅ it with his fingers✅. The man has dark✅ hair✅ and is wearing✅ a suit✅ and tie✅. The background is blurred✅, but it seems like he is in a room✅ with a Canadian✅ flag✅ visible behind✅ him.
- The mask is near black (very dark blue), just saying "blue face mask" is misleading
Florence-2-SD3-Captioner's average score was 16.7✅, 3.0❌
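The per-captioner averages are just the per-image tallies divided by three. Here is a minimal sketch that recomputes them from the (✅, ❌) counts quoted in parentheses above; the `SCORES` table and `average` helper are names introduced here for illustration. Averages are rounded to one decimal place, so the last digit can differ from a truncated figure.

```python
# Per-image (correct, incorrect) tallies copied from the sections above,
# in image order: house, shoe, press conference.
SCORES = {
    "Florence-2": [(16, 2), (12, 0), (10, 3)],
    "CogVLM2": [(21, 1), (12, 0), (18, 0)],
    "BLIP-2": [(3, 0), (3, 1), (4, 1)],
    "Moondream2": [(19, 1), (13, 1), (15, 0)],
    "Share-Captioner": [(16, 11), (18, 1), (20, 0)],
    "Florence-2-SD3-Captioner": [(15, 2), (13, 6), (22, 1)],
}


def average(tallies):
    """Mean correct and incorrect counts per image, rounded to one decimal."""
    n = len(tallies)
    correct = sum(c for c, _ in tallies) / n
    incorrect = sum(i for _, i in tallies) / n
    return round(correct, 1), round(incorrect, 1)


for name, tallies in SCORES.items():
    correct, incorrect = average(tallies)
    print(f"{name}: {correct}✅, {incorrect}❌")
```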
My preliminary vibes are:
- CogVLM2 (quite detailed, waffles a bit) >
- Moondream2 (less detailed but also a bit less waffling) >
- Florence-2 (okay, makes some stuff up) >
- Florence-2-SD3-Captioner (detailed with minimal yapping, but hallucinated a lot on my second image) >
- Share-Captioner (detailed, but tons of yapping and hallucinated a lot on my first image) >
- BLIP-2 (very low-detail and still some mistakes)
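For a rough numeric cross-check of that ordering, one option is the fraction of graded details that were correct, ✅ / (✅ + ❌), summed over all three images. This is only a sketch: the ratio is a metric I'm assuming here for illustration (it is not how the list above was produced), and `TOTALS` and `precision` are hypothetical names. It ignores how much detail a caption contains and how much filler surrounds it.

```python
# Total (correct, incorrect) counts per captioner, summed over the three images.
TOTALS = {
    "CogVLM2": (51, 1),
    "Moondream2": (47, 2),
    "Florence-2": (38, 5),
    "Florence-2-SD3-Captioner": (50, 9),
    "Share-Captioner": (54, 12),
    "BLIP-2": (10, 2),
}


def precision(correct: int, incorrect: int) -> float:
    """Share of graded details that were marked correct."""
    return correct / (correct + incorrect)


# Sort captioners from most to least precise.
ranked = sorted(TOTALS, key=lambda name: precision(*TOTALS[name]), reverse=True)

for name in ranked:
    correct, incorrect = TOTALS[name]
    print(f"{name}: {precision(correct, incorrect):.2f} ({correct}✅, {incorrect}❌)")
```

By this ratio alone BLIP-2 edges out Share-Captioner, since it says very little but gets most of it right; that is one reason the vibes ranking also weighs level of detail.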
To be finished up later but:
InternVL2-8B
https://huggingface.co/OpenGVLab/InternVL2-8B
The image is an aerial view of a residential area surrounded by lush greenery. The scene is dominated by a large, dense forest in the background, with various shades of green indicating a healthy, thriving ecosystem. The sky above is mostly clear with a few scattered clouds, suggesting a pleasant, sunny day.
In the foreground, there are several houses with well-maintained lawns. The houses appear to be single-family homes, each with a distinct roof and exterior design. One house, in particular, stands out due to its prominent position and size. It has a gray roof and a white exterior, with a small porch area that is partially covered by a blue tarp or awning. This house is surrounded by a neatly trimmed lawn and a small fence.
To the left of this house, there is another house with a similar gray roof and a white exterior. This house also has a well-kept lawn and appears to be part of a larger property. The surrounding area is mostly green, with a mix of grassy lawns and trees, indicating a suburban or rural setting.
The overall impression is one of tranquility and natural beauty, with the houses nestled comfortably within the verdant landscape. The image captures a serene moment in a residential area where nature and human habitation coexist harmoniously.
The image captures a close-up view of a person's lower leg and foot, which is stepping on a bed of fallen leaves. The person is wearing a light gray sneaker with white laces. The sneaker appears to be made of a canvas or suede material, and it has a classic, low-top design. The laces are neatly tied, and the sneaker looks well-worn, suggesting it has been used frequently.
The person is wearing black pants, and only the lower part of the pants is visible. The pants seem to be made of a smooth, possibly cotton or synthetic fabric. The person's foot is slightly raised, indicating that they are in the process of taking a step.
The ground is covered with a thick layer of fallen leaves, predominantly in shades of brown, ranging from light tan to deep chocolate brown. The leaves are of various shapes and sizes, with some being broad and flat, while others are more elongated and curled. The leaves appear to be dry and brittle, typical of autumn foliage.
The scene is set in a natural environment, likely a forest or a park, as suggested by the abundance of leaves and the absence of any man-made structures. The lighting in the image is natural, with soft, diffused light filtering through the leaves, creating a warm and earthy atmosphere. The overall mood of the image is serene and peaceful, evoking a sense of being in nature.
The image depicts a man in a formal setting, likely a press conference or official event. He is wearing a dark suit and a white dress shirt, with a dark tie. The man has curly, dark hair and is in the process of adjusting a navy blue surgical mask over his face. The mask features a small, red maple leaf pin on the left side, which is a symbol often associated with Canada.
His hands are visible, with one hand holding the mask's ear loops and the other adjusting the mask to cover his face. The man's expression is serious, and he appears to be focused on the task of putting on the mask. The background is blurred, but it includes a red and white flag, which is likely the Canadian flag, indicating that the event is taking place in Canada or is related to Canadian affairs. The lighting in the image is bright, highlighting the man's face and the details of his attire.