Computer vision has progressed rapidly in recent years, driven largely by advances in deep learning. Traditional approaches, however, often rely on monolithic neural networks that attempt to solve complex tasks end-to-end. This paper explores an alternative paradigm: vertical agents with swarm intelligence. The approach decomposes a complex vision problem into smaller, specialized subtasks, each handled by an independent "agent" (often a smaller, focused neural network). The agents then collaborate, following principles of swarm intelligence observed in nature, to reach a collective understanding of the visual input. This architecture offers potential advantages in robustness, adaptability, transparency, and scalability, particularly in dynamic and complex environments.
This paper provides a comprehensive o