- Edge VLA (EVLA)
- Vision-Language-Latent-Action (ViLLA)
- Vision-Language Model (VLM)
- Mixture of Experts (MoE)
- Latent Action Model (LAM)
- 2025/03/20 Humanoids, VLAs, and the Race to Build Robots That Can Do Anything
- 2025/03/12 Gemini Robotics brings AI into the physical world
- 2025/03/10 AgiBot GO-1: The Evolution of Generalist Embodied Foundation Model from VLA to ViLLA
- 2025/03/05 How Vision-Language-Action Models Are Powering Humanoid Robots
- 2025/02/20 Helix: A Vision-Language-Action Model for Generalist Humanoid Control
- 2025/02/04 π0 and π0-FAST: Vision-Language-Action Models for General Robot Control
- 2024/10/31 Frontiers of Robotics Intelligence: A Review of Vision and Language Models for Robots
- 2024/09/10 Efficient Vision-Language-Action Models
- 2024/07/12 Mobility VLA: Advanced Multimodal Robot Navigation
- 2024/03/12 Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots
- 2024/12/11 π0: A Vision-Language-Action Flow Model for General Robot Control (Summary)
- 2024/02/01 Trends in Robot AI Technology Based on Large Language Models
- Foundation Models Meet Embodied Agents 🔥🔥
- Eachannn/VLArena
- KAIST Lecture: Vision-Language Action Models
- OpenFlamingo - An open-source framework for training large multimodal models
- Habitat 3.0
- Google DeepMind: Gemini Robotics / VLA model
- Microsoft: Embodied AI
- 2024/10/09 Vision Language Models | Multimodality, Image Captioning, Text-to-Image | Advantages of VLMs
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
- MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
- IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models
- VLA Model-Expert Collaboration for Bi-directional Manipulation Learning
- MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation - CVPR 2025
- DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping
- OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
- JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
- CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models
- OpenVLA: An Open-Source Vision-Language-Action Model
- CognitiveDrone: A Vision-Language-Action Model & Benchmark for UAV Cognitive Reasoning
- RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour
- CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
- NaVILA: Legged Robot Vision-Language-Action Model for Navigation
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
- LAPA: Latent Action Pretraining from Videos
- PointVLA: Injecting the 3D World into Vision-Language-Action Models
- Habitat Synthetic Scenes Dataset (HSSD): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation
- OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
- RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation
- VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation
- 2025 VLA Model-Expert Collaboration for Bi-directional Manipulation Learning
- 2025 MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation
- 2025 OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
- 2025 Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding
- 2025 UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
- 2025 Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
- Shakti-VLM-1B, MolmoE-1B, InternVL2-1B, SmolVLM-2.25B, MiniCPM-V-2.0-2.8B, Qwen-2VL-2B, InternVL2-2B, Qwen-2.5VL-3B
- 2025 PointVLA: Injecting the 3D World into Vision-Language-Action Models
- 2025 Generative Artificial Intelligence in Robotic Manipulation: A Survey
- 2025 Survey on Vision-Language-Action Models
- 2025 SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning
- 2025 RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour
- 2025 CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs
- 2024 CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
- 2024 OpenVLA: An Open-Source Vision-Language-Action Model
- 2024 EdgeVLA: Efficient Vision-Language-Action Models
- 2024 TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
- 2024 3D-VLA: A 3D Vision-Language-Action Generative World Model
- 2024 RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation
- 2024 QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
- 2024 A Survey on Vision-Language-Action Models for Embodied AI
- 2024 Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks
- 2024 Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations
- 2024 A Survey of Language-Based Communication in Robotics
- 2024 Latent Action Pretraining from Videos
- 2024 DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
- 2024 RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback
- 2024 Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs
- https://github.com/hzjian123/VLArena - Closed-loop evaluation for end-to-end VLM autonomous driving agent
- https://github.com/Stanford-ILIAD/openvla-mini - OpenVLA: An open-source vision-language-action model for robotic manipulation (see the minimal inference sketch after this list)
- https://github.com/kscalelabs/evla - EdgeVLA: An open-source edge vision-language-action model for robotics.
- https://github.com/yueen-ma/Awesome-VLA - A Survey on Vision-Language-Action Models for Embodied AI
- https://github.com/jingyi0000/VLM_survey - Collection of AWESOME vision-language models for vision tasks
- https://github.com/eric-ai-lab/awesome-vision-language-navigation - A curated list for vision-and-language navigation. ACL 2022 paper "Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions"
- https://github.com/yueyang130/DeeR-VLA - Official code of paper "DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution"
- https://github.com/NVIDIA/Isaac-GR00T/ - NVIDIA Isaac GR00T N1 is the world's first open foundation model for generalized humanoid robot reasoning and skills
- https://github.com/Psi-Robot/DexGraspVLA - DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping
- https://github.com/jonyzhang2023/awesome-embodied-vla-va-vln - vision-language-action(VLA), vision-language-navigation(VLN), vision-action(VA) models
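For quick orientation, the sketch below shows how an OpenVLA-style checkpoint is typically queried through Hugging Face `transformers`: image + language instruction in, a single un-normalized end-effector action out. This is a minimal sketch, not an official snippet; the `openvla/openvla-7b` checkpoint id, the `bridge_orig` un-normalization key, and the dummy camera image are assumptions to verify against the OpenVLA repository's README.

```python
# Minimal sketch: querying an OpenVLA checkpoint for a single 7-DoF action.
# Assumptions (verify against the OpenVLA repo): checkpoint id "openvla/openvla-7b",
# un-normalization key "bridge_orig", and a CUDA device with bfloat16 support.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"  # assumed Hugging Face checkpoint id

# Processor and model classes are supplied by the checkpoint via trust_remote_code.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# Dummy RGB frame standing in for a third-person or wrist camera image.
image = Image.new("RGB", (224, 224), color=(127, 127, 127))
instruction = "pick up the red block"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

# Preprocess, run the model, and un-normalize the action for the target robot setup.
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # end-effector delta pose + gripper command
```

The other repos above (EdgeVLA, DexGraspVLA, DeeR-VLA, Isaac GR00T) ship their own loading and inference scripts; consult each README rather than assuming this interface carries over.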