- 2024/10/09 LLaVA-Llama: An All-Around Text Processor
- 2024/09/18 Multimodal RAG: Chat with Videos and the Future of AI Interaction
- 2024/09/05 Introducing LLaVA V1.5 7B on GroqCloud - note: Qwen2-VL and MiniCPM now perform better
- 2024/07/30 LLaVA Multimodal Image Search
- 2024/07/17 LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
- 2024/06/18 LLaVA: Large Language and Vision Assistant
- 2024/06/12 Large Language and Vision Assistant (LLaVA) — v1.6 vs. v1.5
- 2024/06/06 LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
- 2024/05/27 Tutorial: Using Ollama, LLaVA and Gravio to Build a Local Visual Question and Answer AI Assistant
- 2024/04/30 LLaVA-NeXT: A Strong Zero-shot Video Understanding Model
- 2024/04/08 LLaVA - New Standards in AI Accuracy
- 2024/04/01 Interacting with the Open Source Model LLaVA 1.5 on Paperspace Gradient
- 2024/03/31 Introduction to LLaVA: A Multimodal AI Model
- 2024/03/29 OMG-LLaVA: AI Model Integrating Multi-Level Visual Reasoning for Enhanced Scene Understanding
- 2024/02/09 How to Fine-Tune LLaVA on a Custom Dataset
- 2024/02/01 LLaVA 1.5 vs. 1.6
- 2024/01/27 Introducing LLaVA: The Fusion of Visual and Linguistic Intelligence in AI with code
- 2023/12/11 Understanding LLaVA: Large Language and Vision Assistant
- 2023/12/10 Unlocking Multimodal AI: LLaVA and LLaVA-1.5's Evolution in Language and Vision Fusion
- 2023/11/27 Exploring LLaVA-1.5 Technology: A Comprehensive Overview
- 2023/11/17 A Comprehensive First Look at LLaVA-1.5 Technology
- 2023/10/17 LLaVA, LLaVA-1.5, and LLaVA-NeXT(1.6) Explained
- 2024/07/30 Generating captions (descriptions) or table text data with LLaVA 13B
- 2024/07/25 LLaVA & LLaVA 1.5
- 2024/07/01 OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
- 2024/05/03 PLLaVA: a project that extends the vision-language model LLaVA to video
- 2024/02/01 MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
- Multi Modal Transformers
- MarkTechPost: Computer Vision
- What is LLaVA?
- Ollama: llava
- LlamaIndex: LlaVa Demo with LlamaIndex
- vLLM: Llava Example
- Building Next-Gen Multimodal Foundation Models for General-Purpose Assistants
- NVIDIA Jetson AI Lab
- NVIDIA: NeVA (LLaVA)
- LLaVA vs. BakLLaVA
- BakLLaVA - BakLLaVA is an LMM developed by LAION, Ontocord, and Skunkworks AI. BakLLaVA uses a Mistral 7B base augmented with the LLaVA 1.5 architecture.
- LLaVA Blog - LLaVA-NeXT, LLaVA-OneVision, LLaVA-Video
- LLaVa Model Guide
- ROCm blogs: Multimodal (Visual and Language) understanding with LLaVA-NeXT
- LLaVA-Critic: Learning to Evaluate Multimodal Models
- Video Instruction Tuning with Synthetic Data - LLaVA-Video
- MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
- LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning - NeurIPS 2023 (Oral)
- LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models
- LLaVA-OneVision - Easy Visual Task Transfer
- Yo'LLaVA: Your Personalized Language and Vision Assistant
- Spatial VLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
- OpenVLA: An Open-Source Vision-Language-Action Model - Prismatic-7B VLM, Open X-Embodiment (OpenX) dataset
- Llava Hugging Face - LLaVa-NeXT, LLaVa-1.5, ViP-LLaVA, LLaVa-NeXT-Video, LLaVa-Interleave
- LLaVA WebGPU - A private and powerful multimodal AI chatbot that runs locally in your browser.
- LLaVA-Onevision
- LLaVA-Next-Interleave
- Video Language Models - Video LLaVA
- SpaceVLMs - LLaVA, MobileVLM
- bczhou/TinyLLaVA-1.5B
- lmms-lab: LLaVA-OneVision
- deepinfra: llava-hf/llava-1.5-7b-hf - LLaVA is a multimodal model that combines a vision encoder with a language model; a minimal inference sketch for this checkpoint follows this list.
- remyxai/SpaceLLaVA - SpaceLLaVA uses LoRA to fine-tune LLaVA on a dataset designed with VQASynth to enhance spatial reasoning, as in SpatialVLM (a hedged LoRA setup sketch follows this list)
- remyxai/SpaceLLaVA-lite - SpaceLLaVA-lite fine-tunes MobileVLM on a dataset designed with VQASynth to enhance spatial reasoning as in SpatialVLM
- Fifth Civil Defender - 5CD
- lamm-mit/Cephalo-Llava-v1.6-Mistral-vision-8b-alpha
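The Hugging Face checkpoints listed above (e.g., llava-hf/llava-1.5-7b-hf) can be queried with the Transformers `LlavaForConditionalGeneration` API. A minimal sketch, assuming that checkpoint, a placeholder image URL, and the LLaVA-1.5 `USER: <image> ... ASSISTANT:` prompt template:

```python
# Minimal inference sketch for llava-hf/llava-1.5-7b-hf via Transformers.
# The image URL and prompt are placeholders; device_map="auto" assumes
# accelerate is installed and enough GPU memory is available.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; this URL is a placeholder.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

# LLaVA-1.5 chat template expected by this checkpoint.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(outputs[0], skip_special_tokens=True))
```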
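The SpaceLLaVA entries above describe LoRA fine-tuning of LLaVA on a VQASynth-built spatial-reasoning dataset. The sketch below shows how such adapters can be attached with PEFT; the rank, alpha, and target-module choices are illustrative assumptions, not the published SpaceLLaVA recipe.

```python
# Hedged sketch: attaching LoRA adapters to a LLaVA checkpoint with PEFT,
# in the spirit of the SpaceLLaVA entries above. Hyperparameters and
# target modules are illustrative, not the authors' recipe.
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LM attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
# Training on a VQASynth-style spatial-reasoning dataset would then proceed
# with a standard SFT loop (not shown).
```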
Papers: CatalyzeX for LLaVA
- 2024 LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
- 2024 LLaVA-Critic: Learning to Evaluate Multimodal Models
- 2024 Leveraging vision-language models for fair facial attribute classification
- 2024 TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
- 2024 LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
- 2024 CarLLaVA: Vision language models for camera-only closed-loop driving
- 2024 LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!
- 2024 HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
- 2024 SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities - CVPR 2024, Google DeepMind
- 2024 LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description
- 2024 LLaVA-OneVision: Easy Visual Task Transfer
- 2024 LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
- 2024 Yo'LLaVA: Your Personalized Language and Vision Assistant
- 2024 LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
- 2024 AgEval: A Benchmark for Zero-Shot and Few-Shot Plant Stress Phenotyping with Multimodal LLMs
- 2024 Power-LLaVA: Large Language and Vision Assistant for Power Transmission Line Inspection
- 2024 INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
- 2024 Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
- 2024 An Introduction to Vision-Language Modeling 🔥🔥
- 2024 PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
- 2024 OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
- 2024 MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
- 2024 SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models - Apple
- 2024 TinyLLaVA: A Framework of Small-scale Large Multimodal Models
- 2024 ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts - CVPR 2024
- 2023 Visual Instruction Tuning
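Visual Instruction Tuning (and the fine-tuning guides listed earlier) trains LLaVA on image-grounded conversations. The record below sketches the commonly used LLaVA-style training JSON; the field values are hypothetical, and the exact expectations are documented in the haotian-liu/LLaVA repository.

```python
# Sketch of a LLaVA-style visual-instruction-tuning record. Field values are
# hypothetical; the "<image>" token marks where visual features are inserted.
import json

record = {
    "id": "000001",                    # unique sample id (hypothetical)
    "image": "images/000001.jpg",      # path relative to the image folder
    "conversations": [
        {"from": "human", "value": "<image>\nWhat objects are on the table?"},
        {"from": "gpt", "value": "A laptop, a coffee mug, and a notebook."},
        {"from": "human", "value": "Is the mug to the left of the laptop?"},
        {"from": "gpt", "value": "Yes, the mug sits just left of the laptop."},
    ],
}

# The training set is a single JSON list of such records.
with open("custom_train.json", "w") as f:
    json.dump([record], f, indent=2)
```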
- https://github.com/haotian-liu/LLaVA - [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA), built towards GPT-4V level capabilities and beyond.
- https://github.com/tabtoyou/KoLLaVA - KoLLaVA: Korean Large Language-and-Vision Assistant (feat.LLaVA)
- https://github.com/LLaVA-VL/LLaVA-NeXT - LLaVA-NeXT: Open Large Multimodal Models
- https://github.com/TinyLLaVA/TinyLLaVA_Factory - A Framework of Small-scale Large Multimodal Models
- https://github.com/remyxai/VQASynth - Compose multimodal datasets
- https://github.com/hasanar1f/HiRED - HiRED strategically drops visual tokens in the image-encoding stage to improve inference efficiency for high-resolution vision-language models (e.g., LLaVA-NeXT) under a fixed token budget (a toy illustration of the idea follows this list).
- https://github.com/FreedomIntelligence/LongLLaVA - LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
- https://github.com/fangyuan-ksgk/Mini-LLaVA - A minimal implementation of LLaVA-style VLM with interleaved image & text & video processing ability
- https://github.com/NiuTrans/Vision-LLM-Alignment - This repository contains code for SFT, RLHF, and DPO training of vision-based LLMs, including the LLaVA models and the Llama-3.2-Vision models.
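The HiRED repository above drops visual tokens under a fixed budget using attention-derived importance scores. The toy sketch below illustrates that idea by keeping only the top-k patch tokens; it is a conceptual illustration, not the authors' implementation, which operates inside the LLaVA-NeXT encoding pipeline.

```python
# Toy illustration of attention-guided visual-token dropping under a fixed
# token budget (HiRED-style idea). Shapes and scores are synthetic.
import torch

def drop_visual_tokens(visual_tokens: torch.Tensor,
                       cls_attention: torch.Tensor,
                       budget: int) -> torch.Tensor:
    """visual_tokens: (num_patches, hidden_dim) patch embeddings
    cls_attention:  (num_patches,) attention each patch receives from the
                    [CLS] token, used here as an importance score
    budget:         number of visual tokens to keep"""
    keep = torch.topk(cls_attention, k=min(budget, cls_attention.numel())).indices
    keep, _ = torch.sort(keep)  # preserve the original spatial order
    return visual_tokens[keep]

# Example: 576 CLIP patch tokens reduced to a 144-token budget.
tokens = torch.randn(576, 1024)
scores = torch.rand(576)
reduced = drop_visual_tokens(tokens, scores, budget=144)
print(reduced.shape)  # torch.Size([144, 1024])
```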