masta-g3 · June 1, 2025 23:31 · drcuiu · Feb 1, 2025
diff --git a/llm_papers.txt b/llm_papers.txt
 Cedille: A large autoregressive French language model
 The Wisdom of Hindsight Makes Language Models Better Instruction Followers
 ChatGPT: A Study on its Utility for Ubiquitous Software Engineering Tasks
 Query2doc: Query Expansion with Large Language Models
 The Internal State of an LLM Knows When its Lying
 Structured information extraction from complex scientific text with fine-tuned large language models
 TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models
 Large Language Models Encode Clinical Knowledge
 PoET: A generative model of protein families as sequences-of-sequences
 Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
 Prompt Sapper: LLM-Empowered Software Engineering Infrastructure for AI-Native Services
 SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
 Modeling Protein Using Large-scale Pretrain Language Model
 A Watermark for Large Language Models
 GPT is becoming a Turing machine: Here are some ways to program it
 Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
 Large Language Models are Zero-Shot Reasoners
 From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
 How is ChatGPT's behavior changing over time?
 Meta-Transformer: A Unified Framework for Multimodal Learning
 Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
 Getting More out of Large Language Models for Proofs
 Teaching Small Language Models to Reason
 Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
 Learning to Retrieve In-Context Examples for Large Language Models
 LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
 Context-Aware Abbreviation Expansion Using Large Language Models
 Focused Transformer: Contrastive Training for Context Scaling
 Flash normalization: fast RMSNorm for LLMs
 MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
 Long-range Language Modeling with Self-retrieval
 Alexa, play with robot: Introducing the First Alexa Prize SimBot Challenge on Embodied AI
 Towards Generalist Biomedical AI
 Shortcut Learning of Large Language Models in Natural Language Understanding
 Quantifying Memorization Across Neural Language Models
 LMFlow: An Extensible Toolkit for Finetuning and Inference of Large Foundation Models
 Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models
 Copy Is All You Need
 Automatic Chain of Thought Prompting in Large Language Models
 Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models
 Decomposed Prompting: A Modular Approach for Solving Complex Tasks
 Evaluating the Text-to-SQL Capabilities of Large Language Models
 On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models
 BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
 Are Emergent Abilities of Large Language Models a Mirage?
 Enhancing Network Management Using Code Generated by Large Language Models
 Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks
 ThinkSum: Probabilistic reasoning over sets using large language models
 On the Tool Manipulation Capability of Open-source Large Language Models
 Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
 WavJourney: Compositional Audio Creation with Large Language Models
 ChatGPT, Can You Generate Solutions for my Coding Exercises? An Evaluation on its Effectiveness in an undergraduate Java Programming Course
 Secrets of RLHF in Large Language Models Part I: PPO
 ProgPrompt: Generating Situated Robot Task Plans using Large Language Models
 One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning
 Prototypical Fine-tuning: Towards Robust Performance Under Varying Data Sizes
 Challenges and Applications of Large Language Models
 SPOT: Knowledge-Enhanced Language Representations for Information Extraction
 Kosmos-2: Grounding Multimodal Large Language Models to the World
 Deep Language Networks: Joint Prompt Training of Stacked LLMs using Variational Inference
 SKILL: Structured Knowledge Infusion for Large Language Models
 Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
 Understanding Social Reasoning in Language Models with Language Models
 The Science of Detecting LLM-Generated Texts
 CausalLM is not optimal for in-context learning
 Questioning the Survey Responses of Large Language Models
 Extending Context Window of Large Language Models via Positional Interpolation
 ChatGPT and a New Academic Reality: Artificial Intelligence-Written Research Papers and the Ethics of the Large Language Models in Scholarly Publishing
 Probing Factually Grounded Content Transfer with Factual Ablation
 Teach LLMs to Personalize -- An Approach inspired by Writing Education
 Pre-Trained Large Language Models for Industrial Control
 WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences
 LongNet: Scaling Transformers to 1,000,000,000 Tokens
 Self-Alignment with Instruction Backtranslation
 Guiding Pretraining in Reinforcement Learning with Large Language Models
 Large Language Models are Zero-Shot Rankers for Recommender Systems
 Model evaluation for extreme risks
 Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks
 SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL
 A Simple and Effective Pruning Approach for Large Language Models
 Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors
 Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback
 Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
 TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT
 VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models
 LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models
 PromptChainer: Chaining Large Language Model Prompts through Visual Programming
 PIPPA: A Partially Synthetic Conversational Dataset
 Let's Verify Step by Step
 Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics
 SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
 Large Language Models Are Reasoning Teachers
 GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models
 Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence
 Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations
 Connecting Neural Response measurements & Computational Models of language: a non-comprehensive guide
 Accelerating LLM Inference with Staged Speculative Decoding
 Large Language Models for Supply Chain Optimization
 Do Large Language Models know what humans know?
 Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction
 Faithful Chain-of-Thought Reasoning
 AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts
 Superposition of many models into one
 Learning to Model the World with Language
 SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
 Unifying Large Language Models and Knowledge Graphs: A Roadmap
 RAVEN: In-Context Learning with Retrieval Augmented Encoder-Decoder Language Models
 QLoRA: Efficient Finetuning of Quantized LLMs
 Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
 Co-Writing with Opinionated Language Models Affects Users' Views
 Language models show human-like content effects on reasoning
 Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
 Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code
 OpenAGI: When LLM Meets Domain Experts
 Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
 Bring Your Own Data! Self-Supervised Evaluation for Large Language Models
 Beyond Generating Code: Evaluating GPT on a Data Visualization Course
 Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
 UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition
 LLM-Rec: Personalized Recommendation via Prompting Large Language Models
 Studying Large Language Model Generalization with Influence Functions
 Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change)
 From Sparse to Soft Mixtures of Experts
 Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization
 INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation
 Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models
 Large Language Model Guided Tree-of-Thought
 Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
 LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
 When Geometric Deep Learning Meets Pretrained Protein Language Models
 Beyond Black Box AI-Generated Plagiarism Detection: From Sentence to Document Level
 Language models are weak learners
 How Many Demonstrations Do You Need for In-context Learning?
 Direct Preference Optimization: Your Language Model is Secretly a Reward Model
 TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
 Gorilla: Large Language Model Connected with Massive APIs
 Automatic Generation of Programming Exercises and Code Explanations using Large Language Models
 Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models
 Interactive Fashion Content Generation Using LLMs and Latent Diffusion Models
 WebArena: A Realistic Web Environment for Building Autonomous Agents
 Language Models can Solve Computer Tasks
 ChatGPT Is on the Horizon: Could a Large Language Model Be All We Need for Intelligent Transportation?
 Reprompting: Automated Chain-of-Thought Prompt Inference Through Gibbs Sampling
 Invariant Language Modeling
 Solving Quantitative Reasoning Problems with Language Models
 Personality Traits in Large Language Models
 Prompting Large Language Models with Speech Recognition Abilities
 Selective Annotation Makes Language Models Better Few-Shot Learners
 Using Captum to Explain Generative Language Models
 Fine-Tuning Language Models with Just Forward Passes
 In-context Autoencoder for Context Compression in a Large Language Model
 Entity Projection via Machine Translation for Cross-Lingual NER
 OctoPack: Instruction Tuning Code Large Language Models
 AlpaGasus: Training A Better Alpaca with Fewer Data
 Large Language Models Are Human-Level Prompt Engineers
 DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
 CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction
 WizardCoder: Empowering Code Large Language Models with Evol-Instruct
 Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning
 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
 Identifying Mentions of Pain in Mental Health Records Text: A Natural Language Processing Approach
 Large Language Models Can Self-Improve
 Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks
 Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
 More Agents Is All You Need
 Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
 Teaching Algorithmic Reasoning via In-context Learning
 SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning
 BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
 The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python
 KI-BERT: Infusing Knowledge Context for Better Language and Domain Understanding
 Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
 Tree of Thoughts: Deliberate Problem Solving with Large Language Models
 Automatic Evaluation of Attribution by Large Language Models
 Generative Agents: Interactive Simulacra of Human Behavior
 ALERT: Adapting Language Models to Reasoning Tasks
 How does the pre-training objective affect what large language models learn about linguistic properties?
 PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback
 LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models
 From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
 Open-Source Large Language Models Outperform Crowd Workers and Approach ChatGPT in Text-Annotation Tasks
 Causal Reasoning and Large Language Models: Opening a New Frontier for Causality
 FLIRT: Feedback Loop In-context Red Teaming
 News Summarization and Evaluation in the Era of GPT-3
 Galactica: A Large Language Model for Science
 Towards Reasoning in Large Language Models: A Survey
 Chain-Of-Thought Prompting Under Streaming Batch: A Case Study
 Shepherd: A Critic for Language Model Generation
 Emergent autonomous scientific research capabilities of large language models
 Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
 Social Simulacra: Creating Populated Prototypes for Social Computing Systems
 HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
 LLMs as Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines with LLMs
 A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
 Universal and Transferable Adversarial Attacks on Aligned Language Models
 CodeGen2: Lessons for Training LLMs on Programming and Natural Languages
 Complexity-Based Prompting for Multi-Step Reasoning
 The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
 FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
 Scaling TransNormer to 175 Billion Parameters
 CodeTF: One-stop Transformer Library for State-of-the-art Code LLM
 A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation
 Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
 Learning ASR pathways: A sparse multilingual ASR model
 Stay on topic with Classifier-Free Guidance
 Constitutional AI: Harmlessness from AI Feedback
 Causal-Discovery Performance of ChatGPT in the context of Neuropathic Pain Diagnosis
 Teaching Arithmetic to Small Transformers
 Demystifying GPT Self-Repair for Code Generation
 Performance of ChatGPT on USMLE: Unlocking the Potential of Large Language Models for AI-Assisted Medical Education
 Link-Context Learning for Multimodal LLMs
 Large Language Models Perform Diagnostic Reasoning
 InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback
 AgentBench: Evaluating LLMs as Agents
 Xmodel-LM Technical Report
 Simple synthetic data reduces sycophancy in large language models
 Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
 ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
 Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Large Language Models
 Re-visiting Automated Topic Model Evaluation with Large Language Models
 Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
 Adaptive Test Generation Using a Large Language Model
 Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Learning
 SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
 PaLM: Scaling Language Modeling with Pathways
 Teaching Large Language Models to Self-Debug
 Building Cooperative Embodied Agents Modularly with Large Language Models
 Urdu text in natural scene images: a new dataset and preliminary text detection
 LIMA: Less Is More for Alignment
 Leveraging Large Language Models for Topic Classification in the Domain of Public Affairs
 GPT-NER: Named Entity Recognition via Large Language Models
 Say What You Mean! Large Language Models Speak Too Positively about Negative Commonsense Knowledge
 Code as Policies: Language Model Programs for Embodied Control
 Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
 From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models
 Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models
 Inspecting and Editing Knowledge Representations in Language Models
 TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents
 Large language models effectively leverage document-level context for literary translation, but critical errors persist
 Med-Flamingo: a Multimodal Medical Few-shot Learner
 CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks
 Jigsaw: Large Language Models meet Program Synthesis
 Large Language Models Struggle to Learn Long-Tail Knowledge
 Llama 2: Open Foundation and Fine-Tuned Chat Models
 Textbooks Are All You Need
 Crowd Score: A Method for the Evaluation of Jokes using Large Language Model AI Voters as Judges
 CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis
 Orca: Progressive Learning from Complex Explanation Traces of GPT-4
 Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models
 Three Bricks to Consolidate Watermarks for Large Language Models
 The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation
 FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
 One-shot Machine Teaching: Cost Very Few Examples to Converge Faster
 Theory of Mind May Have Spontaneously Emerged in Large Language Models
 Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
 Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
 Tiny LVLM-eHub: Early Multimodal Experiments with Bard
 Language Is Not All You Need: Aligning Perception with Language Models
 Mind's Eye: Grounded Language Model Reasoning through Simulation
 StarCoder: may the source be with you!
 Self-Critique Prompting with Large Language Models for Inductive Instructions
 PaLM 2 Technical Report
 Repository-Level Prompt Generation for Large Language Models of Code
 L-Eval: Instituting Standardized Evaluation for Long Context Language Models
 Measuring and Narrowing the Compositionality Gap in Language Models
 Differentially Private Fine-tuning of Language Models
 A Latent Space Theory for Emergent Abilities in Large Language Models
 Reflexion: Language Agents with Verbal Reinforcement Learning
 Ambient Adventures: Teaching ChatGPT on Developing Complex Stories
 LEACE: Perfect linear concept erasure in closed form
 Machine Psychology: Investigating Emergent Capabilities and Behavior in Large Language Models Using Psychological Methods
 A PhD Student's Perspective on Research in NLP in the Era of Very Large Language Models
 Voyager: An Open-Ended Embodied Agent with Large Language Models
 FinGPT: Open-Source Financial Large Language Models
 Block Belief Propagation for Parameter Learning in Markov Random Fields
 Lost in the Middle: How Language Models Use Long Contexts
 Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks
 Ada-Ranker: A Data Distribution Adaptive Ranking Paradigm for Sequential Recommendation
 Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data
 BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents
 Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
 The Hydra Effect: Emergent Self-repair in Language Model Computations
 Educational data augmentation in physics education research using ChatGPT
 PolyLM: An Open Source Polyglot Large Language Model
 Towards Expert-Level Medical Question Answering with Large Language Models
 Is GPT-4 a Good Data Analyst?
 Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
 Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions
 ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models
 Seeing ChatGPT Through Students' Eyes: An Analysis of TikTok Data
 LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond
 ReAct: Synergizing Reasoning and Acting in Language Models
 Augmenting Language Models with Long-Term Memory
 BloombergGPT: A Large Language Model for Finance
 A Systematic Evaluation of Large Language Models of Code
 GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models
 Robot Task Planning and Situation Handling in Open Worlds
 Large Language Models are Competitive Near Cold-start Recommenders for Language- and Item-based Preferences
 Emergent Abilities of Large Language Models
 Can Large Language Models design a Robot?
 KoLA: Carefully Benchmarking World Knowledge of Large Language Models
 Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding
 DarkBERT: A Language Model for the Dark Side of the Internet
 Measuring Faithfulness in Chain-of-Thought Reasoning
 Retentive Network: A Successor to Transformer for Large Language Models
 Dissociating language and thought in large language models: a cognitive perspective
 Large Language Models are Better Reasoners with Self-Verification
 Can large language models reason about medical questions?
 Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
 ARB: Advanced Reasoning Benchmark for Large Language Models
 Rethinking with Retrieval: Faithful Large Language Model Inference
 A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models
 Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning
 Explainable Verbal Reasoner Plus (EVR+): A Natural Language Reasoning Framework that Supports Diverse Compositional Reasoning
 Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners
 Large Language Models as Corporate Lobbyists
 MetaGPT: Meta Programming for Multi-Agent Collaborative Framework
 Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation
 OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
 Talking About Large Language Models
 Platypus: Quick, Cheap, and Powerful Refinement of LLMs
 Large Language Models Can Be Easily Distracted by Irrelevant Context
 Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration
 OpenICL: An Open-Source Framework for In-context Learning
 Emergence of Maps in the Memories of Blind Navigation Agents
 PMC-LLaMA: Further Finetuning LLaMA on Medical Papers
 DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
 LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
 UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation
 Learning to Reason and Memorize with Self-Notes
 ChemCrow: Augmenting large-language models with chemistry tools
 Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
 Learning to Compress Prompts with Gist Tokens
 Unlimiformer: Long-Range Transformers with Unlimited Length Input
 StructGPT: A General Framework for Large Language Model to Reason over Structured Data
 ChatGPT: Applications, Opportunities, and Threats
 Memory Augmented Large Language Models are Computationally Universal
 PaLM-E: An Embodied Multimodal Language Model
 M2T: Masking Transformers Twice for Faster Decoding
 Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
 A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models
 DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature
 Auditing large language models: a three-layered approach
 Language models in molecular discovery
 Offsite-Tuning: Transfer Learning without Full Model
 MusicLM: Generating Music From Text
 Context-faithful Prompting for Large Language Models
 SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
 Hungry Hungry Hippos: Towards Language Modeling with State Space Models
 Halo: Estimation and Reduction of Hallucinations in Open-Source Weak Large Language Models
 The Costly Dilemma: Generalization, Evaluation and Cost-Optimal Deployment of Large Language Models
 GPTutor: a ChatGPT-powered programming tool for code explanation
 Larger language models do in-context learning differently
 MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
 Minding Language Models' (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker
 ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge
 Multimodal Chain-of-Thought Reasoning in Language Models
 Recitation-Augmented Language Models
 Hyena Hierarchy: Towards Larger Convolutional Language Models
 Eight Things to Know about Large Language Models
 PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing
 A Survey on Model Compression for Large Language Models
 Active Retrieval Augmented Generation
 Toolformer: Language Models Can Teach Themselves to Use Tools
 Evaluating Verifiability in Generative Search Engines
 Augmented Language Models: a Survey
 Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness
 Giraffe: Adventures in Expanding Context Lengths in LLMs
 LLM As DBA
 Scaling Transformer to 1M tokens and beyond with RMT
 TidyBot: Personalized Robot Assistance with Large Language Models
 Exploring the Intersection of Large Language Models and Agent-Based Modeling via Prompt Engineering
 Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability
 Active Prompting with Chain-of-Thought for Large Language Models
 A Categorical Archive of ChatGPT Failures
 Artificial muses: Generative Artificial Intelligence Chatbots Have Risen to Human-Level Creativity
 Better Language Models of Code through Self-Improvement
 DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents
 The Capacity for Moral Self-Correction in Large Language Models
 Poisoning Language Models During Instruction Tuning
 Prompt2Model: Generating Deployable Models from Natural Language Instructions
 Data Selection for Language Models via Importance Resampling
 Enabling Conversational Interaction with Mobile UI using Large Language Models
 Evidence of Meaning in Language Models Trained on Programs
 Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
 Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
 Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
 Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models
 Symbol tuning improves in-context learning in language models
 REPLUG: Retrieval-Augmented Black-Box Language Models
 Why do Nearest Neighbor Language Models Work?
 Prismer: A Vision-Language Model with An Ensemble of Experts
 AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
 Self-evolving Agents with reflective and memory-augmented abilities
 CALYPSO: LLMs as Dungeon Masters' Assistants
 Mind your Language (Model): Fact-Checking LLMs and their Role in NLP Research and Practice
 Code Llama: Open Foundation Models for Code
 Ground Manipulator Primitive Tasks to Executable Actions using Large Language Models
 Faithful to Whom? Questioning Interpretability Measures in NLP
 Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis
 Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts
 How Good Are Large Language Models at Out-of-Distribution Detection?
 Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
 Can Large Language Models Find And Fix Vulnerable Software?
 Large Language Models for Software Engineering: A Systematic Literature Review
 Informed Named Entity Recognition Decoding for Generative Language Models
 Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities
 Simple is Better and Large is Not Enough: Towards Ensembling of Foundational Language Models
 Better Zero-Shot Reasoning with Role-Play Prompting
 Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning
 Are ChatGPT and GPT-4 Good Poker Players? -- A Pre-Flop Analysis
 A Survey on Large Language Model based Autonomous Agents
 Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions
 Anonymity at Risk? Assessing Re-Identification Capabilities of Large Language Models
 AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
 Evaluating ChatGPT and GPT-4 for Visual Programming
 Through the Lens of Core Competency: Survey on Evaluation of Large Language Models
 D4: Improving LLM Pretraining via Document De-Duplication and Diversification
 Cabrita: closing the gap for foreign languages
 GPT-in-the-Loop: Adaptive Decision-Making for Multiagent Systems
 ProAgent: Building Proactive Cooperative AI with Large Language Models
 Instruction Position Matters in Sequence Generation with Large Language Models
 Knowledge-Enhanced Multi-Label Few-Shot Product Attribute-Value Extraction
 SeamlessM4T-Massively Multilingual & Multimodal Machine Translation
 LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models
 Large Language Model as Autonomous Decision Maker
 Large Language Models as Superpositions of Cultural Perspectives
 Activation Addition: Steering Language Models Without Optimization
 Enhancing Recommender Systems with Large Language Model Reasoning Graphs
 GPTEval: A Survey on Assessments of ChatGPT and GPT-4
 An Empirical Study on Challenging Math Problem Solving with GPT-4
 Forward-Backward Reasoning in Large Language Models for Verification
 Language as Reality: A Co-Creative Storytelling Game Experience in 1001 Nights using Generative AI
 Dynamic Planning with a LLM
 "Guinea Pig Trials" Utilizing GPT: A Novel Smart Agent-Based Modeling Approach for Studying Firm Competition and Collusion
 Tryage: Real-time, intelligent Routing of User Prompts to Large Language Models
 Bridging the Gap: Deciphering Tabular Data Using Large Language Model
 The Pile: An 800GB Dataset of Diverse Text for Language Modeling
 Prompting Is Programming: A Query Language for Large Language Models
 EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models
 Knowledge Graph Prompting for Multi-Document Question Answering
 GPT detectors are biased against non-native English writers
 GradientCoin: A Peer-to-Peer Decentralized Large Language Models
 RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models
 IncreLoRA: Incremental Parameter Allocation Method for Parameter-Efficient Fine-tuning
 Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models
 Time Travel in LLMs: Tracing Data Contamination in Large Language Models
 Can Language Models Learn to Listen?
 Detecting The Corruption Of Online Questionnaires By Artificial Intelligence
 Towards an Understanding of Large Language Models in Software Engineering Tasks
 YaRN: Efficient Context Window Extension of Large Language Models
 An Examination of the Compositionality of Large Generative Vision-Language Models
 Company Similarity using Large Language Models
 LLM4TS: Two-Stage Fine-Tuning for Time-Series Forecasting with Pre-Trained LLMs
 Instruction Tuning for Large Language Models: A Survey
 Language to Rewards for Robotic Skill Synthesis
 Is There Any Social Principle for LLM-Based Agents?
 A Study on Robustness and Reliability of Large Language Model Code Generation
 Leveraging Large Language Models for Pre-trained Recommender Systems
 Mind vs. Mouth: On Measuring Re-judge Inconsistency of Social Bias in Large Language Models
 LLaSM: Large Language and Speech Model
 SpikingBERT: Distilling BERT to Train Spiking Language Models Using Implicit Differentiation
 DiagGPT: An LLM-based Chatbot with Automatic Topic Management for Task-Oriented Dialogue
 FoodGPT: A Large Language Model in Food Testing Domain with Incremental Pre-training and Knowledge Graph Prompt
 ChatEDA: A Large Language Model Powered Autonomous Agent for EDA
 AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework
 Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
 Pretraining on the Test Set Is All You Need
 The AI Revolution in Education: Will AI Replace or Assist Teachers in Higher Education?
 Reinforced Self-Training (ReST) for Language Modeling
 Fast Inference from Transformers via Speculative Decoding
 LoRA: Low-Rank Adaptation of Large Language Models
 Catalyst Property Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models
 AI Deception: A Survey of Examples, Risks, and Potential Solutions
 RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
 Towards Applying Powerful Large AI Models in Classroom Teaching: Opportunities, Challenges and Prospects
 ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation
 FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
 Blockwise Parallel Decoding for Deep Autoregressive Models
 Assigning AI: Seven Approaches for Students, with Prompts
 Conformal Prediction with Large Language Models for Multi-Choice Question Answering
 Attention: Marginal Probability is All You Need?
 Exploring Large Language Models' Cognitive Moral Development through Defining Issues Test
 Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
 MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records
 Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
 Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
 OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
 XGen-7B Technical Report
 LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
 Can Programming Languages Boost Each Other via Instruction Tuning?
 The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
 Efficient RLHF: Reducing the Memory Usage of PPO
 Universal Self-adaptive Prompting
 ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models
 Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
 One Wide Feedforward is All You Need
 Better Zero-Shot Reasoning with Self-Adaptive Prompting
 BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge
 DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
 Graph of Thoughts: Solving Elaborate Problems with Large Language Models
 Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
 AnomalyGPT: Detecting Industrial Anomalies using Large Vision-Language Models
 Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
 SoTaNa: The Open-Source Software Development Assistant
 GPT Can Solve Mathematical Problems Without a Calculator
 Physically Grounded Vision-Language Models for Robotic Manipulation
 FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
 FLM-101B: An Open LLM and How to Train It with $100K Budget
 LaMDA: Language Models for Dialog Applications
 LMDX: Language Model-based Document Information Extraction and Localization
 Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers
 Do Multilingual Language Models Think Better in English?
 The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute
 TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild
 Textbooks Are All You Need II: phi-1.5 technical report
 Replacing softmax with ReLU in Vision Transformers
 Investigating Answerability of LLMs for Long-Form Question Answering
 Vector Search with OpenAI Embeddings: Lucene Is All You Need
 The Rise and Potential of Large Language Model Based Agents: A Survey
 Cure the headache of Transformers via Collinear Constrained Attention
 Uncovering mesa-optimization algorithms in Transformers
 Large Language Models for Compiler Optimization
 CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
 Chain-of-Verification Reduces Hallucination in Large Language Models
 AstroLLaMA: Towards Specialized Foundation Models in Astronomy
 [WIP] Jailbreak Paradox: The Achilles' Heel of LLMs
 Compositional Foundation Models for Hierarchical Planning
 AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors in Agents
 Sparse Autoencoders Find Highly Interpretable Features in Language Models
 DreamLLM: Synergistic Multimodal Comprehension and Creation
 Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT)
 Improving Language Models with Advantage-based Offline Policy Gradients
 Improving Factuality and Reasoning in Language Models through Multiagent Debate
 From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
 BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model
 Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models
 Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?
 Multimodal Foundation Models: From Specialists to General-Purpose Assistants
 Boolformer: Symbolic Regression of Logic Functions with Transformers
 Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?
 No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
 TP-Aware Dequantization
 LASER: LLM Agent with State-Space Exploration for Web Navigation
 An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
 Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
 MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
 Baichuan 2: Open Large-scale Language Models
 Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
 Efficient Benchmarking (of Language Models)
 Context is Environment
 Analyzing Transformer Dynamics as Movement through Embedding Space
 DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs
 RMT: Retentive Networks Meet Vision Transformers
 Stack-and-Delay: a new codebook pattern for music generation
 Neurons in Large Language Models: Dead, N-gram, Positional
 Large Language Model for Science: A Study on P vs. NP
 LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
 Data Augmentation for Spoken Language Understanding via Pretrained Language Models
 Petals: Collaborative Inference and Fine-tuning of Large Models
 Scaling Laws for Sparsely-Connected Foundation Models
 Kosmos-2.5: A Multimodal Literate Model
 PDFTriage: Question Answering over Long, Structured Documents
 Statistical Rejection Sampling Improves Preference Optimization
 Stabilizing RLHF through Advantage Model and Selective Rehearsal
 MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
 Leveraging Contextual Information for Effective Entity Salience Detection
 NExT-GPT: Any-to-Any Multimodal LLM
 Are Emergent Abilities in Large Language Models just In-Context Learning?
 RACE: Large-scale ReAding Comprehension Dataset From Examinations
 Large-Scale Automatic Audiobook Creation
 Recovering from Privacy-Preserving Masking with Large Language Models
 Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts
 Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations
 Scaling Clinical Trial Matching Using Large Language Models: A Case Study in Oncology
 What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning
 RAIN: Your Language Models Can Align Themselves without Finetuning
 When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale
 Hypothesis Search: Inductive Reasoning with Language Models
 Agents: An Open-source Framework for Autonomous Language Agents
 A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models
 Gated recurrent neural networks discover attention
 Contrastive Decoding Improves Reasoning in Large Language Models
 Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts
 FIAT: Fusing learning paradigms with Instruction-Accelerated Tuning
 LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
 Adapting Large Language Models via Reading Comprehension
 DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention
 MindAgent: Emergent Gaming Interaction
 Graph Neural Prompting with Large Language Models
 Sparks of Artificial General Intelligence: Early experiments with GPT-4
 AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
 Efficient Post-training Quantization with FP8 Formats
 Taken out of context: On measuring situational awareness in LLMs
 Jointly Training Large Autoregressive Multimodal Models
 The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
 Curriculum Learning with Adam: The Devil Is in the Wrong Details
 OWL: A Large Language Model for IT Operations
 Faith and Fate: Limits of Transformers on Compositionality
 CodePlan: Repository-level Coding using LLMs and Planning
 GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
 Efficient Memory Management for Large Language Model Serving with PagedAttention
 Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models
 Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
 SCREWS: A Modular Framework for Reasoning with Revisions
 Transformer models: an introduction and catalog
 Small-scale proxies for large-scale Transformer training instabilities
 Effective Long-Context Scaling of Foundation Models
 VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
 Qwen Technical Report
 Attention Approximates Sparse Distributed Memory
 Calibrating LLM-Based Evaluator
 Ambiguity-Aware In-Context Learning with Large Language Models
 GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
 Vision Transformers Need Registers
 Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic
 Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
 AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models
 QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
 Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition
 Evaluating Cognitive Maps and Planning in Large Language Models with CogEval
 Language Modeling Is Compression
 MentalLLaMA: Interpretable Mental Health Analysis on Social Media with Large Language Models
 Aligning Large Multimodal Models with Factually Augmented RLHF
 Large Language Models as Optimizers
 SlimPajama-DC: Understanding Data Combinations for LLM Training
 Finite Scalar Quantization: VQ-VAE Made Simple
 Physics of Language Models: Part 3.2, Knowledge Manipulation
 Efficient Streaming Language Models with Attention Sinks
 The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
 Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
 LLM-grounded Video Diffusion Models
 Enable Language Models to Implicitly Learn Self-Improvement From Data
 Emergent Analogical Reasoning in Large Language Models
 RA-DIT: Retrieval-Augmented Dual Instruction Tuning
 Think Before You Speak: Explicitly Generating Implicit Commonsense Knowledge for Response Generation
 Large Language Models Cannot Self-Correct Reasoning Yet
 SmartPlay : A Benchmark for LLMs as Intelligent Agents
 Language Models Represent Space and Time
 Retrieval meets Long Context Large Language Models
 Borges and AI
 Can large language models provide useful feedback on research papers? A large-scale empirical analysis
 Ring Attention with Blockwise Transformers for Near-Infinite Context
 Can Language Models be Instructed to Protect Personal Information?
 QuIP: 2-Bit Quantization of Large Language Models With Guarantees
 Who's Harry Potter? Approximate Unlearning in LLMs
 Low-Resource Languages Jailbreak GPT-4
 DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
 Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning
 EcoAssistant: Using LLM Assistant More Affordably and Accurately
 How FaR Are Large Language Models From Agents with Theory-of-Mind?
 MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning
 Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation
 FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
 HeaP: Hierarchical Policies for Web Actions using LLMs
 A Long Way to Go: Investigating Length Correlations in RLHF
 Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation
 Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors
 Think before you speak: Training Language Models With Pause Tokens
 Mistral 7B
 SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
 Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity
 Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading
 Reliable, Reproducible, and Really Fast Leaderboards with Evalica
 RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation
 Large Language Models can Learn Rules
 Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency
 Large Language Models Are Zero-Shot Time Series Forecasters
 Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
 Learning Interactive Real-World Simulators
 FireAct: Toward Language Agent Fine-tuning
 InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining
 Text Embeddings Reveal (Almost) As Much As Text
 EIPE-text: Evaluation-Guided Iterative Plan Extraction for Long-Form Narrative Text Generation
 A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
 Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models
 Lemur: Harmonizing Natural Language and Code for Language Agents
 LangNav: Language as a Perceptual Representation for Navigation
 The LAMBADA dataset: Word prediction requiring a broad discourse context
 Octopus: Embodied Vision-Language Programmer from Environmental Feedback
 Toward Joint Language Modeling for Speech Units and Text
 MemGPT: Towards LLMs as Operating Systems
 A Zero-Shot Language Agent for Computer Control with Structured Reflection
 LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
 Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training
 CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules
 The Consensus Game: Language Model Generation via Equilibrium Search
 Table-GPT: Table-tuned GPT for Diverse Table Tasks
 PaLI-3 Vision Language Models: Smaller, Faster, Stronger
 MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
 Arbitrary Length Generalization for Addition
 "I'm fully who I am": Towards Centering Transgender and Non-Binary Voices to Measure Biases in Open Language Generation
 Deep Learning Scaling is Predictable, Empirically
 MLQA: Evaluating Cross-lingual Extractive Question Answering
 OpenAssistant Conversations -- Democratizing Large Language Model Alignment
 Intersectional Bias in Hate Speech and Abusive Language Datasets
 BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
 Reducing malicious use of synthetic media research: Considerations and potential release practices for machine learning
 AI Ethics Issues in Real World: Evidence from AI Incident Database
 Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models
 BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT
 Measuring Mathematical Problem Solving With the MATH Dataset
 Can Machines Learn Morality? The Delphi Experiment
 BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
 UNKs Everywhere: Adapting Multilingual Language Models to New Scripts
 AndroidEnv: A Reinforcement Learning Platform for Android
 Demoting Racial Bias in Hate Speech Detection
 Social Bias Frames: Reasoning about Social and Power Implications of Language
 Characterising Bias in Compressed Models
 Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
 Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
 Towards Robust Toxic Content Classification
 The Challenge of Value Alignment: from Fairer Algorithms to AI Safety
 Towards Continual Knowledge Learning of Language Models
 The Pushshift Reddit Dataset
 Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs
 Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation
 Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
 Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling
 Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack
 Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems
 What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus
 One Epoch Is All You Need
 Conversing by Reading: Contentful Neural Conversation with On-demand Machine Reading
 Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango
 Wav2Letter: an End-to-End ConvNet-based Speech Recognition System
 Plug and Play Language Models: A Simple Approach to Controlled Text Generation
 NewsQA: A Machine Comprehension Dataset
 AmbiPun: Generating Humorous Puns with Ambiguous Context
 Deal or No Deal? End-to-End Learning for Negotiation Dialogues
 Competition-Level Code Generation with AlphaCode
 STaR: Bootstrapping Reasoning With Reasoning
 Efficient Neural Architecture Search via Parameter Sharing
 Recursively Summarizing Books with Human Feedback
 Habitat: A Platform for Embodied AI Research
 Generate & Rank: A Multi-task Framework for Math Word Problems
 Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity
 Mitigating Statistical Bias within Differentially Private Synthetic Data
 The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
 RecGPT: Generative Pre-training for Text-based Recommendation
 TruthfulQA: Measuring How Models Mimic Human Falsehoods
 An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models
 Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
 Controlling Style in Generated Dialogue
 QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation
 Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search
 Universal and Independent: Multilingual Probing Framework for Exhaustive Model Interpretation and Evaluation
 DeBERTa: Decoding-enhanced BERT with Disentangled Attention
 Societal Biases in Language Generation: Progress and Challenges
 Counterfactual Fairness in Text Classification through Robustness
 Open-Domain Conversational Agents: Current Progress, Open Problems, and Future Directions
 Deep Double Descent: Where Bigger Models and More Data Hurt
 Neural Generation Meets Real People: Towards Emotionally Engaging Mixed-Initiative Conversations
 InCoder: A Generative Model for Code Infilling and Synthesis
 Back to the Future: On Potential Histories in NLP
 Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization
 Sharp Minima Can Generalize For Deep Nets
 Self-attention Does Not Need $O(n^2)$ Memory
 Measuring the Carbon Intensity of AI in Cloud Instances
 SocialIQA: Commonsense Reasoning about Social Interactions
 Generating Long Sequences with Sparse Transformers
 Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
 QAmeleon: Multilingual QA with Only 5 Examples
 CTRL: A Conditional Transformer Language Model for Controllable Generation
 Hi, my name is Martha: Using names to measure and mitigate bias in generative dialogue models
 Generating Fake Cyber Threat Intelligence Using Transformer-Based Models
 Impact of Pretraining Term Frequencies on Few-Shot Reasoning
 Is neural language acquisition similar to natural? A chronological probing study
 Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent
 Buffer Overflow in Mixture of Experts
 OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
 Bag of Tricks for Efficient Text Classification
 Automatic Detection of Machine Generated Text: A Critical Survey
 Adversarial Training for Large Neural Language Models
 Diffsound: Discrete Diffusion Model for Text-to-sound Generation
 TALM: Tool Augmented Language Models
 Training Language Models with Language Feedback
 Toxicity in Multilingual Machine Translation at Scale
 PEER: A Collaborative Language Model
 On the Multilingual Capabilities of Very Large-Scale English Language Models
 LLaMA: Open and Efficient Foundation Language Models
 SECure: A Social and Environmental Certificate for AI Systems
 Gaussian Error Linear Units (GELUs)
 RoFormer: Enhanced Transformer with Rotary Position Embedding
 Measuring Massive Multitask Language Understanding
 ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension
 To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-assisted Decision-making
 Leveraging QA Datasets to Improve Generative Data Augmentation
 Decoupled Weight Decay Regularization
 A Distributional Approach to Controlled Text Generation
 Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
 The Turking Test: Can Language Models Understand Instructions?
 Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
 DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
 Language Models (Mostly) Know What They Know
 Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
 Towards Understanding and Mitigating Social Biases in Language Models
 Discovering and Categorising Language Biases in Reddit
 Reducing Sentiment Bias in Language Models via Counterfactual Evaluation
 Training Verifiers to Solve Math Word Problems
 The Curse of Recursion: Training on Generated Data Makes Models Forget
 Compositional Semantic Parsing with Large Language Models
 Transforming Question Answering Datasets Into Natural Language Inference Datasets
 Bringing the People Back In: Contesting Benchmark Machine Learning Datasets
 The Values Encoded in Machine Learning Research
 InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning
 Semantically-Aligned Equation Generation for Solving and Reasoning Math Word Problems
 Ethical and social risks of harm from Language Models
 SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
 Understanding HTML with Large Language Models
 ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
 AudioLM: a Language Modeling Approach to Audio Generation
 Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding
 Behavior Cloned Transformers are Neurosymbolic Reasoners
 Adversarial Attacks and Defenses in Images, Graphs and Text: A Review
 CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models
 Thou shalt not hate: Countering Online Hate Speech
 SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)
 Participation is not a Design Fix for Machine Learning
 Retrieval Augmentation Reduces Hallucination in Conversation
 Advancing the State of the Art in Open Domain Dialog Systems through the Alexa Prize
 How Many Data Samples is an Additional Instruction Worth?
 Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims
 Crosslingual Generalization through Multitask Finetuning
 The Curious Case of Neural Text Degeneration
 UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
 VinaLLaMA: LLaMA-based Vietnamese Foundation Model
 Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
 Evaluating the Social Impact of Generative AI Systems in Systems and Society
 SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
 Towards A Rigorous Science of Interpretable Machine Learning
 An Analysis of the Automatic Bug Fixing Performance of ChatGPT
 Investigating Failures of Automatic Translation in the Case of Unambiguous Gender
 Chat as Expected: Learning to Manipulate Black-box Neural Dialogue Models
 Defending Against Neural Fake News
 Analyzing Dynamic Adversarial Training Data in the Limit
 Criticality in Formal Languages and Statistical Physics
 Generating Wikipedia by Summarizing Long Sequences
 Gender Bias in Contextualized Word Embeddings
 MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
 Deep Generative Dual Memory Network for Continual Learning
 ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
 Persistent Anti-Muslim Bias in Large Language Models
 Mirages: On Anthropomorphism in Dialogue Systems
 Deep Learning for Symbolic Mathematics
 Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
 A Survey On Universal Adversarial Attack
 Atlas: Few-shot Learning with Retrieval Augmented Language Models
 StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding
 Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning
 A framework for the extraction of Deep Neural Networks by leveraging public data
 Recipes for building an open-domain chatbot
 Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
 Measuring the Effects of Data Parallelism on Neural Network Training
 ChatGPT Makes Medicine Easy to Swallow: An Exploratory Case Study on Simplified Radiology Reports
 Kosmos-G: Generating Images in Context with Multimodal Large Language Models
 X-SQL: reinforce schema representation with context
 Constructing Datasets for Multi-hop Reading Comprehension Across Documents
 FastText.zip: Compressing text classification models
 The State and Fate of Linguistic Diversity and Inclusion in the NLP World
 A General Language Assistant as a Laboratory for Alignment
 Learning Natural Language Inference using Bidirectional LSTM model and Inner-Attention
 Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly
 Transformer tricks: Precomputing the first layer
 MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
 Human-in-the-Loop for Data Collection: a Multi-Target Counter Narrative Dataset to Fight Online Hate Speech
 Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model
 Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving
 Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection
 Deep Learning Based Text Classification: A Comprehensive Review
 Automated Hate Speech Detection and the Problem of Offensive Language
 Multi-Dimensional Gender Bias Classification
 Extracting Training Data from Large Language Models
 ProsocialDialog: A Prosocial Backbone for Conversational Agents
 Cross-Task Generalization via Natural Language Crowdsourcing Instructions
 SPLADE-v3: New baselines for SPLADE
 Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection
 FlowQA: Grasping Flow in History for Conversational Machine Comprehension
 Recent Advances towards Safe, Responsible, and Moral Dialogue Systems: A Survey
 Improving alignment of dialogue agents via targeted human judgements
 Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing
 DateLogicQA: Benchmarking Temporal Biases in Large Language Models
 Explanation in Artificial Intelligence: Insights from the Social Sciences
 RoBERTa: A Robustly Optimized BERT Pretraining Approach
 Revealing Persona Biases in Dialogue Systems
 GeDi: Generative Discriminator Guided Sequence Generation
 Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech
 Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
 UL2: Unifying Language Learning Paradigms
 Self-Instruct: Aligning Language Models with Self-Generated Instructions
 Evaluating the Underlying Gender Bias in Contextualized Word Embeddings
 Does Gender Matter? Towards Fairness in Dialogue Systems
 Energy and Policy Considerations for Deep Learning in NLP
 Tools Fail: Detecting Silent Errors in Faulty Tools
 The False Promise of Imitating Proprietary LLMs
 Directional Bias Amplification
 Hierarchical Text-Conditional Image Generation with CLIP Latents
 How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
 ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons
 Task-aware Retrieval with Instructions
 Do Prompt-Based Models Really Understand the Meaning of their Prompts?
 Reading Wikipedia to Answer Open-Domain Questions
 Supervising Model Attention with Human Explanations for Robust Natural Language Inference
 Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis
 Latent Retrieval for Weakly Supervised Open Domain Question Answering
 Teaching language models to support answers with verified quotes
 TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
 MasakhaNER: Named Entity Recognition for African Languages
 Predicting the Type and Target of Offensive Posts in Social Media
 Learning to Model Editing Processes
 MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
 Rainier: Reinforced Knowledge Introspector for Commonsense Question Answering
 Zero-Shot Fine-Grained Style Transfer: Leveraging Distributed Continuous Style Representations to Transfer To Unseen Styles
 Quantifying the Carbon Emissions of Machine Learning
 Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping
 Chasing Carbon: The Elusive Environmental Footprint of Computing
 Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion
 Distilling Reasoning Capabilities into Smaller Language Models
 Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning
 Scaling Language Models: Methods, Analysis & Insights from Training Gopher
 CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks
 WebGPT: Browser-assisted question-answering with human feedback
 Making Large Language Models Better Reasoners with Step-Aware Verifier
 Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books
 SGPT: GPT Sentence Embeddings for Semantic Search
 Prompt-and-Rerank: A Method for Zero-Shot and Few-Shot Arbitrary Textual Style Transfer with Small Language Models
 Building a Conversational Agent Overnight with Dialogue Self-Play
 ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks
 Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
 A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection
 Neural Machine Translation of Rare Words with Subword Units
 ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
 Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation
 Tokenisation is NP-Complete
 TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
 CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
 Know What You Don't Know: Unanswerable Questions for SQuAD
 Longformer: The Long-Document Transformer
 Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
 A Constructive Prediction of the Generalization Error Across Scales
 Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a Distribution of Human-like Biases
 KERMIT: Generative Insertion-Based Modeling for Sequences
 mGPT: Few-Shot Learners Go Multilingual
 The Natural Language Decathlon: Multitask Learning as Question Answering
 A Crowd-based Evaluation of Abuse Response Strategies in Conversational Agents
 A Survey of Race, Racism, and Anti-Racism in NLP
 Unraveling the Hidden Environmental Impacts of AI Solutions for Environment
 SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding
 DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
 Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering
 Hyperbolic Image-Text Representations
 Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey
 RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
 Pretraining Language Models with Human Preferences
 Racial Disparity in Natural Language Processing: A Case Study of Social Media African-American English
 MTEB: Massive Text Embedding Benchmark
 Interscript: A dataset for interactive learning of scripts through error feedback
 Looped Transformers as Programmable Computers
 Inner Monologue: Embodied Reasoning through Planning with Language Models
 No Language Left Behind: Scaling Human-Centered Machine Translation
 Collaborative Storytelling with Large-scale Neural Language Models
 Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
 CONDAQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation
 Recipes for Safety in Open-domain Chatbots
 Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations
 Pre-Trained Language Models for Interactive Decision-Making
 Can Large Language Models Really Improve by Self-critiquing Their Own Plans?
 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 Formal Algorithms for Transformers
 An Emulator for Fine-Tuning Large Language Models using Small Language Models
 Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
 Democratizing Reasoning Ability: Tailored Learning from Large Language Model
 HellaSwag: Can a Machine Really Finish Your Sentence?
 Teaching Language Models to Self-Improve through Interactive Demonstrations
 Ranking LLM-Generated Loop Invariants for Program Verification
 Approximating Two-Layer Feedforward Networks for Efficient Transformers
 Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets
 When can transformers reason with abstract symbols?
 HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
 Language Models are Few-shot Multilingual Learners
 Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
 AutoMix: Automatically Mixing Language Models
 Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models
 Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
 Pre-trained Summarization Distillation
 TEQ: Trainable Equivalent Transformation for Quantization of LLMs
 Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning
 Improving Large Language Model Fine-tuning for Solving Math Problems
 Language Models are General-Purpose Interfaces
 Llemma: An Open Language Model For Mathematics
 Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners
 Gender Bias in Machine Translation
 Towards a Human-like Open-Domain Chatbot
 NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
 A Network-based End-to-End Trainable Task-oriented Dialogue System
 Safe RLHF: Safe Reinforcement Learning from Human Feedback
 Cloze-driven Pretraining of Self-attention Networks
 Universal Language Model Fine-tuning for Text Classification
 OPT: Open Pre-trained Transformer Language Models
 Towards Zero-Label Language Learning
 GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems
 MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models
 Learning and Leveraging Verifiers to Improve Planning Capabilities of Pre-trained Language Models
 Fine-tuned Language Models are Continual Learners
 3D-GPT: Procedural 3D Modeling with Large Language Models
 PAL: Program-aided Language Models
 Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning
 Large Language Models for Software Engineering: Survey and Open Problems
 Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots
 Self-critiquing models for assisting human evaluators
 Towards Understanding Sycophancy in Language Models
 SALMONN: Towards Generic Hearing Abilities for Large Language Models
 Finetuned Language Models Are Zero-Shot Learners
 Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
 ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search
 Generating Sequences by Learning to Self-Correct
 The Depth-to-Width Interplay in Self-Attention
 Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning
 Internet-augmented language models through few-shot prompting for open-domain question answering
 GLM-130B: An Open Bilingual Pre-trained Model
 Three scenarios for continual learning
 Eureka: Human-Level Reward Design via Coding Large Language Models
 GPT-NeoX-20B: An Open-Source Autoregressive Language Model
 An Explanation of In-context Learning as Implicit Bayesian Inference
 AgentTuning: Enabling Generalized Agent Abilities for LLMs
 Snapshot Ensembles: Train 1, get M for free
 Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
 On the Planning Abilities of Large Language Models -- A Critical Investigation
 Efficient Estimation of Word Representations in Vector Space
 Visualizing the Loss Landscape of Neural Nets
 Contrastive Preference Learning: Learning from Human Feedback without RL
 High-Resolution Image Synthesis with Latent Diffusion Models
 I love your chain mail! Making knights smile in a fantasy game world: Open-domain goal-oriented dialogue agents
 H2O Open Ecosystem for State-of-the-art Large Language Models
 Calibrate Before Use: Improving Few-Shot Performance of Language Models
 All-in-One Image-Grounded Conversational Agents
 Interactive Task Planning with Language Models
 Can AI-Generated Text be Reliably Detected?
 BitNet: Scaling 1-bit Transformers for Large Language Models
 Scaling Laws for Neural Language Models
 Self-Refine: Iterative Refinement with Self-Feedback
 Adversarial Environment Generation for Learning to Navigate the Web
 Cross-Lingual Language Model Meta-Pretraining
 Creative Robot Tool Use with Large Language Models
 Simple and Effective Multi-Paragraph Reading Comprehension
 Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
 VeRA: Vector-based Random Matrix Adaptation
 Open-Ended Learning Leads to Generally Capable Agents
 Exploring the Boundaries of GPT-4 in Radiology
 Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs
 High-Dimensional Continuous Control Using Generalized Advantage Estimation
 Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
 CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
 Eliciting Human Preferences with Language Models
 One-Shot Learning from a Demonstration with Hierarchical Latent Language
 OpenAgents: An Open Platform for Language Agents in the Wild
 Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
 Specific versus General Principles for Constitutional AI
 mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
 MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
 Task2Vec: Task Embedding for Meta-Learning
 Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams
 Tuna: Instruction Tuning using Feedback from Large Language Models
 In-Context Pretraining: Language Modeling Beyond Document Boundaries
 Self-Consistency Improves Chain of Thought Reasoning in Language Models
 Transcending Scaling Laws with 0.1% Extra Compute
 InstructExcel: A Benchmark for Natural Language Instruction in Excel
 Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing
 Exploring the Role of Task Transferability in Large-Scale Multi-Task Learning
 A Downsampled Variant of ImageNet as an Alternative to the CIFAR datasets
 Understanding Retrieval Augmentation for Long-Form Question Answering
 A Neural Conversational Model
 Exploring the Limits of Language Modeling
 Scaling Instruction-Finetuned Language Models
 Learning Performance-Improving Code Edits
 Training Compute-Optimal Large Language Models
 Instruction Tuning with GPT-4
 Holistic Evaluation of Language Models
 Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
 Large Language Models as Analogical Reasoners
 Negative Training for Neural Dialogue Response Generation
 On the Opportunities and Risks of Foundation Models
 Dissecting In-Context Learning of Translations in GPTs
 Carbon Emissions and Large Neural Network Training
 Faithful Reasoning Using Large Language Models
 Detecting Pretraining Data from Large Language Models
 Motif: Intrinsic Motivation from Artificial Intelligence Feedback
 Unified Language Model Pre-training for Natural Language Understanding and Generation
 BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
 Predictability and Surprise in Large Generative Models
 Alignment of Language Agents
 Zephyr: Direct Distillation of LM Alignment
 Binding Language Models in Symbolic Languages
 Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
 The Evolved Transformer
 Detecting Hate Speech with GPT-3
 Learning to summarize from human feedback
 Efficient Large Scale Language Modeling with Mixtures of Experts
 Jailbreaking Black Box Large Language Models in Twenty Queries
 How do Language Models Bind Entities in Context?
 Program Synthesis with Large Language Models
 Challenges in Detoxifying Language Models
 A Deep Reinforced Model for Abstractive Summarization
 Moral Foundations of Large Language Models
 Training Production Language Models without Memorizing User Data
 A Deep Reinforcement Learning Chatbot
 RT-1: Robotics Transformer for Real-World Control at Scale
 Entity Tracking in Language Models
 KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval
 Controlled Decoding from Language Models
 QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
 FP8-LM: Training FP8 Large Language Models
 The Perils & Promises of Fact-checking with Large Language Models
 Imitation versus Innovation: What children can do that large language and language-and-vision models cannot (yet)?
 Unsolved Problems in ML Safety
 Woodpecker: Hallucination Correction for Multimodal Large Language Models
 A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications
 Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
 Data-Centric Financial Large Language Models
 CodeFusion: A Pre-trained Diffusion Model for Code Generation
 TRAMS: Training-free Memory Selection for Long-range Language Modeling
 Personas as a Way to Model Truthfulness in Language Models
 PockEngine: Sparse and Efficient Fine-tuning in a Pocket
 LLM-FP4: 4-Bit Floating-Point Quantized Transformers
 CLEX: Continuous Length Extrapolation for Large Language Models
 ALCUNA: Large Language Models Meet New Knowledge
 JudgeLM: Fine-tuned Large Language Models are Scalable Judges
 Large Language Models as Generalizable Policies for Embodied Tasks
 How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers
 ControlLLM: Augment Language Models with Tools by Searching on Graphs
 Linear Representations of Sentiment in Large Language Models
 LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
 The Generative AI Paradox: "What It Can Create, It May Not Understand"
 Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
 MM-VID: Advancing Video Understanding with GPT-4V(ision)
 ChatCoder: Chat-based Refine Requirement Improves LLMs' Code Generation
 Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V
 LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
 ChipNeMo: Domain-Adapted LLMs for Chip Design
 What's In My Big Data?
 Multitasking Models are Robust to Structural Failure: A Neural Model for Bilingual Cognitive Reserve
 Idempotent Generative Network
 Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning
 Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive Learning for Code Generation
 Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models
 Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?
 TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise
 NEFTune: Noisy Embeddings Improve Instruction Finetuning
 The Impact of Depth and Width on Transformer Language Model Generalization
 FlashDecoding++: Faster Large Language Model Inference on GPUs
 Skywork: A More Open Bilingual Foundation Model
 GRIM: GRaph-based Interactive narrative visualization for gaMes
 LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
 Does GPT-4 Pass the Turing Test?
 Text Rendering Strategies for Pixel Language Models
 Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
 Learning From Mistakes Makes LLM Better Reasoner
 AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning
 Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation
 Ultra-Long Sequence Distributed Transformer
 Ziya2: Data-centric Learning is All LLMs Need
 GLaMM: Pixel Grounding Large Multimodal Model
 mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
 On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
 Unveiling Safety Vulnerabilities of Large Language Models
 Prompt Cache: Modular Attention Reuse for Low-Latency Inference
 Levels of AGI: Operationalizing Progress on the Path to AGI
 u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
 Neural MMO 2.0: A Massively Multi-task Addition to Massively Multi-agent Learning
 Co-training and Co-distillation for Quality Improvement and Compression of Language Models
 CogVLM: Visual Expert for Pretrained Language Models
 Tailoring Self-Rationalizers with Multi-Reward Distillation
 NExT-Chat: An LMM for Chat, Detection and Segmentation
 The Efficiency Misnomer
 PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion
 Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs
 Training Dynamics of Contextual N-Grams in Language Models
 LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
 Large Language Models Understand and Can be Enhanced by Emotional Stimuli
 Gzip versus bag-of-words for text classification
 TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
 GPT4All: An Ecosystem of Open Source Compressed Language Models
 Evaluating Large Language Models: A Comprehensive Survey
 Leveraging Large Language Models for Automated Proof Synthesis in Rust
 GPTScore: Evaluate as You Desire
 CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
 S-LoRA: Serving Thousands of Concurrent LoRA Adapters
 Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency
 Finding Neurons in a Haystack: Case Studies with Sparse Probing
 Simple and Controllable Music Generation
 Can LLMs Follow Simple Rules?
 Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM
 Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models
 MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning
 Memory Augmented Language Models through Mixture of Word Experts
 Language Models can be Logical Solvers
 JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
 ADaPT: As-Needed Decomposition and Planning with Language Models
 FinGPT: Large Generative Models for a Small Language
 Simplifying Transformer Blocks
 Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs
 Prompt Engineering a Prompt Engineer
 A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
 Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves
 Accelerating Large Language Model Decoding with Speculative Sampling
 Alternating Updates for Efficient Transformers
 White-Box Transformers via Sparse Rate Reduction
 ChatAnything: Facetime Chat with LLM-Enhanced Personas
 Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data
 The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4
 LayoutPrompter: Awaken the Design Ability of Large Language Models
 Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations
 GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
 To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
 Story-to-Motion: Synthesizing Infinite and Controllable Character Animation from Long Text
 SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
 Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
 Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer
 Trusted Source Alignment in Large Language Models
 UNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations
 MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
 Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?
 Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
 Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure
 Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
 MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
 The ART of LLM Refinement: Ask, Refine, and Trust
 Fine-tuning Language Models for Factuality
 A Survey on Language Models for Code
 DiLoCo: Distributed Low-Communication Training of Language Models
 ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks
 Fusion-Eval: Integrating Evaluators with LLMs
 PEARL: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers
 SiRA: Sparse Mixture of Low Rank Adaptation
 Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives
 Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation
 UT5: Pretraining Non autoregressive T5 with unrolled denoising
 Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models
 Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying
 Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models
 Contrastive Chain-of-Thought Prompting
 Learning to Filter Context for Retrieval-Augmented Generation
 Large Language Models for Automated Open-domain Scientific Hypotheses Discovery
 M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
 System 2 Attention (is something you might need too)
 GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration
 Language Models are Multilingual Chain-of-Thought Reasoners
 ProAgent: From Robotic Process Automation to Agentic Process Automation
 Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
 Exponentially Faster Language Modelling
 Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2
 ToolTalk: Evaluating Tool-Usage in a Conversational Setting
 Testing Language Model Agents Safely in the Wild
 AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort
 MultiLoRA: Democratizing LoRA for Better Multi-Task Learning
 Orca 2: Teaching Small Language Models How to Reason
 Distilling and Retrieving Generalizable Knowledge for Robot Manipulation via Language Corrections
 On Leakage of Code Generation Evaluation Datasets
 GPQA: A Graduate-Level Google-Proof Q&A Benchmark
 Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
 GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
 SelfEval: Leveraging the discriminative nature of generative models for evaluation
 TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems
 UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework
 LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores
 Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
 HiPPO: Recurrent Memory with Optimal Polynomial Projections
 Transformer Memory as a Differentiable Search Index
 ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
 DeiT III: Revenge of the ViT
 Scaling Vision Transformers to 22 Billion Parameters
 On Calibration of Modern Neural Networks
 A* Search Without Expansions: Learning Heuristic Functions with Deep Q-Networks
 MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
 Attention Is All You Need
 Acceleration via Fractal Learning Rate Schedules
 Transformers learn in-context by gradient descent
 HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
 Toy Models of Superposition
 SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
 Unified Scaling Laws for Routed Language Models
 CLIPPO: Image-and-Language Understanding from Pixels Only
 Task-Specific Skill Localization in Fine-tuned Language Models
 Discovering Latent Knowledge in Language Models Without Supervision
 OCR-free Document Understanding Transformer
 Language Models are Few-Shot Learners
 Progress measures for grokking via mechanistic interpretability
 Learning Transferable Visual Models From Natural Language Supervision
 Zero-Shot Text-to-Image Generation
 Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
 muNet: Evolving Pretrained Deep Neural Networks into Scalable Auto-tuning Multitask Systems
 Language Models as Agent Models
 Learning Models of Individual Behavior in Chess
 Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
 Ask Me Anything: A simple strategy for prompting language models
 Training language models to follow instructions with human feedback
 Sequence to Sequence Learning with Neural Networks
 SegGPT: Segmenting Everything In Context
 A data-driven approach for learning to control computers
 Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation
 Unifying Vision, Text, and Layout for Universal Document Processing
 Memorizing Transformers
 GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling
 Beyond Memorization: Violating Privacy Via Inference with Large Language Models
 A Succinct Summary of Reinforcement Learning
 Symbolic Discovery of Optimization Algorithms
 Confronting Reward Model Overoptimization with Constrained RLHF
 Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
 A Cookbook of Self-Supervised Learning
 Training Language Models with Language Feedback at Scale
 Answering Questions by Meta-Reasoning over Multiple Chains of Thought
 G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
 SemDeDup: Data-efficient learning at web-scale through semantic deduplication
 Adversarial Examples for Evaluating Reading Comprehension Systems
 Counterfactual Interventions Reveal the Causal Effect of Relative Clause Representations on Agreement Prediction
 Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP
 LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
 ImageBind: One Embedding Space To Bind Them All
 Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks
 Scaling Data-Constrained Language Models
 Efficient LLM Inference on CPUs
 Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
 Efficiently Scaling Transformer Inference
 One Model To Learn Them All
 Brain decoding: toward real-time reconstruction of visual perception
 GLU Variants Improve Transformer
 Vision Transformers with Mixed-Resolution Tokenization
 HyperNetworks
 InRank: Incremental Low-Rank Learning
 Text-to-Image Diffusion Models are Zero-Shot Classifiers
 CoBIT: A Contrastive Bi-directional Image-Text Generation Model
 MAGVLT: Masked Generative Vision-and-Language Transformer
 DINOv2: Learning Robust Visual Features without Supervision
 What learning algorithm is in-context learning? Investigations with linear models
 Any-to-Any Generation via Composable Diffusion
 GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
 Shortformer: Better Language Modeling using Shorter Inputs
 Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?
 Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity
 Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
 PaLI: A Jointly-Scaled Multilingual Language-Image Model
 The alignment problem from a deep learning perspective
 GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
 Jailbreaking is Best Solved by Definition
 Multimodal Analogical Reasoning over Knowledge Graphs
 Segment Everything Everywhere All at Once
 DocPrompting: Generating Code by Retrieving the Docs
 Emergent Tool Use From Multi-Agent Autocurricula
 Root Mean Square Layer Normalization
 TeCH: Text-guided Reconstruction of Lifelike Clothed Humans
 Efficient Training of Language Models to Fill in the Middle
 AI for Mathematics: A Cognitive Science Perspective
 AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators
 Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
 Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
 The First Room-Temperature Ambient-Pressure Superconductor
 Segment Anything
 Less is More: Parameter-Free Text Classification with Gzip
 Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions
 A Generalist Agent
 Meet in the Middle: A New Pre-training Paradigm
 Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations
 Can Humans Do Less-Than-One-Shot Learning?
 Diffusion-LM Improves Controllable Text Generation
 SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking
 Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
 Text-to-3D using Gaussian Splatting
 Precise Zero-Shot Dense Retrieval without Relevance Labels
 Brainformers: Trading Simplicity for Efficiency
 DETRs Beat YOLOs on Real-time Object Detection
 OtterHD: A High-Resolution Multi-modality Model
 Rethinking the Role of Token Retrieval in Multi-Vector Retrieval
 ConvNets Match Vision Transformers at Scale
 Domain Specific Question Answering Over Knowledge Graphs Using Logical Programming and Large Language Models
 Scaling Robot Learning with Semantically Imagined Experience
 Do LLMs exhibit human-like response biases? A case study in survey design
 READ: Recurrent Adaptation of Large Transformers
 Benchmarking Neural Network Training Algorithms
 Automatic Gradient Descent: Deep Learning without Hyperparameters
 Layer Normalization
 An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
 Implicit Representations of Meaning in Neural Language Models
 Calibrated Chaos: Variance Between Runs of Neural Network Training is Harmless and Inevitable
 SqueezeLLM: Dense-and-Sparse Quantization
 Optimisation & Generalisation in Networks of Neurons
 Co-Writing Screenplays and Theatre Scripts with Language Models: An Evaluation by Industry Professionals
 Transformers as Recognizers of Formal Languages: A Survey on Expressivity
 The effectiveness of MAE pre-pretraining for billion-scale pretraining
 Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks
 Decoupled Context Processing for Context Augmented Language Modeling
 Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
 The Transient Nature of Emergent In-Context Learning in Transformers
 Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning
 Matryoshka Diffusion Models
 Show Your Work: Scratchpads for Intermediate Computation with Language Models
 Beyond neural scaling laws: beating power law scaling via data pruning
 Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
 Going Deeper with Convolutions
 TimeGPT-1
 Capabilities of GPT-4 on Medical Challenge Problems
 Training Large Language Models Efficiently with Sparsity and Dataflow
 Optimal Policies Tend to Seek Power
 A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
 Thinking Like Transformers
 Why think step by step? Reasoning emerges from the locality of experience
 Mixture-of-Experts with Expert Choice Routing
 GPT-4 Technical Report
 Scaling Expert Language Models with Unsupervised Domain Discovery
 End-to-End Spatio-Temporal Action Localisation with Video Transformers
 Mass-Editing Memory in a Transformer
 Erasing Concepts from Diffusion Models
 Physics of Language Models: Part 1, Context-Free Grammar
 Flamingo: a Visual Language Model for Few-Shot Learning
 Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs
 Semantic Tokenizer for Enhanced Natural Language Processing
 On Limitations of the Transformer Architecture
 A Survey of Large Language Models
 Affordances from Human Videos as a Versatile Representation for Robotics
 DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
 Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
 Conditioning Predictive Models: Risks and Strategies
 Implicit Chain of Thought Reasoning via Knowledge Distillation
 Scaling Laws for Transfer
 Risks from Learned Optimization in Advanced Machine Learning Systems
 SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
 Bayesian Optimization of Catalysts With In-context Learning
 Teach LLMs to Phish: Stealing Private Information from Language Models
 LLMatic: Neural Architecture Search via Large Language Models and Quality Diversity Optimization
 Knowledge Graphs
 Language Modelling with Pixels
 FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization
 Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning
 Chinchilla Scaling: A replication attempt
 Retrofitting Word Vectors to Semantic Lexicons
 CoLT5: Faster Long-Range Transformers with Conditional Computation
 Deep contextualized word representations
 Boosted Prompt Ensembles for Large Language Models
 Recurrent Memory Transformer
 Multitask Prompted Training Enables Zero-Shot Task Generalization
 TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
 Monarch: Expressive Structured Matrices for Efficient and Accurate Training
 On the Turing Completeness of Modern Neural Network Architectures
 Generalized Out-of-Distribution Detection: A Survey
 AugGPT: Leveraging ChatGPT for Text Data Augmentation
 Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
 SLiC-HF: Sequence Likelihood Calibration with Human Feedback
 StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
 Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold
 Human-Timescale Adaptation in an Open-Ended Task Space
 Sigmoid Loss for Language Image Pre-Training
 OpenScene: 3D Scene Understanding with Open Vocabularies
 Nougat: Neural Optical Understanding for Academic Documents
 SoundStorm: Efficient Parallel Audio Generation
 Text and Code Embeddings by Contrastive Pre-Training
 Fine-Tuning Language Models from Human Preferences
 ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
 Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models
 Effective Theory of Transformers at Initialization
 ST-MoE: Designing Stable and Transferable Sparse Expert Models
 An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
 Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
 Natural Selection Favors AIs over Humans
 ART: Automatic multi-step reasoning and tool-use for large language models
 Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection
 Symbolic Knowledge Distillation: from General Language Models to Commonsense Models
 Visual Instruction Tuning
 Efficiently Modeling Long Sequences with Structured State Spaces
 Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges
 Mastering Diverse Domains through World Models
 Simplified State Space Layers for Sequence Modeling
 Offline RL for Natural Language Generation with Implicit Language Q Learning
 Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
 Deduplicating Training Data Mitigates Privacy Risks in Language Models
 Self-supervised Learning: Generative or Contrastive
 Towards Automated Circuit Discovery for Mechanistic Interpretability
 Neural Story Planning
 DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training
 Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements
 Dota 2 with Large Scale Deep Reinforcement Learning
 Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
 AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
 The Matrix Calculus You Need For Deep Learning
 ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models
 DeepNet: Scaling Transformers to 1,000 Layers
 SparseFormer: Sparse Visual Recognition via Limited Latent Tokens
 Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
 LLMs cannot find reasoning errors, but can correct them!
 Pretraining Without Attention
 Large language models are not zero-shot communicators
 Semi-supervised Sequence Learning
 Improving language models by retrieving from trillions of tokens
 Synthetic Data from Diffusion Models Improves ImageNet Classification
 Level Generation Through Large Language Models
 How Does Generative Retrieval Scale to Millions of Passages?
 State Spaces Aren't Enough: Machine Translation Needs Attention
 Data Distributional Properties Drive Emergent In-Context Learning in Transformers
 Evaluating Large Language Models Trained on Code
 Injecting structural hints: Using language models to study inductive biases in language learning
 The case for 4-bit precision: k-bit Inference Scaling Laws
 Divide-or-Conquer? Which Part Should You Distill Your LLM?
 Downstream Datasets Make Surprisingly Good Pretraining Corpora
 ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark
 Fast Transformer Decoding: One Write-Head is All You Need
 NOIR: Neural Signal Operated Intelligent Robots for Everyday Activities
 Towards Deep Learning Models Resistant to Adversarial Attacks
 A Practical Deep Learning-Based Acoustic Side Channel Attack on Keyboards
 Predicting Grokking Long Before it Happens: A look into the loss landscape of models which grok
 Large Language Models as General Pattern Machines
 Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models
 Fast and forward stable randomized algorithms for linear least-squares problems
 Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
 Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models
 Twist Decoding: Diverse Generators Guide Each Other
 Monolith: Real Time Recommendation System With Collisionless Embedding Table
 On-Device Training Under 256KB Memory
 Meta-Learning in Neural Networks: A Survey
 The Linear Representation Hypothesis and the Geometry of Large Language Models
 The Power of Scale for Parameter-Efficient Prompt Tuning
 LongForm: Optimizing Instruction Tuning for Long Text Generation with Corpus Extraction
 Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention
 Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers
 GLM: General Language Model Pretraining with Autoregressive Blank Infilling
 Human Preference Score: Better Aligning Text-to-Image Models with Human Preference
 Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
 Spreading vectors for similarity search
 REFINER: Reasoning Feedback on Intermediate Representations
 Learning to Learn Faster from Human Feedback with Language Model Predictive Control
 Low-code LLM: Visual Programming over LLMs
 Decoding speech perception from non-invasive brain recordings
 Towards Agile Text Classifiers for Everyone
 Cramming: Training a Language Model on a Single GPU in One Day
 Text-to-Table: A New Way of Information Extraction
 TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP
 WizardLM: Empowering Large Language Models to Follow Complex Instructions
 Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
 ViperGPT: Visual Inference via Python Execution for Reasoning
 Spatial-Language Attention Policies for Efficient Robot Learning
 Improved Baselines with Visual Instruction Tuning
 Decision Transformer: Reinforcement Learning via Sequence Modeling
 What Algorithms can Transformers Learn? A Study in Length Generalization
 Tracking Everything Everywhere All at Once
 Bad Global Minima Exist and SGD Can Reach Them
 Directly Fine-Tuning Diffusion Models on Differentiable Rewards
 Fine-Tuning LLaMA for Multi-Stage Text Retrieval
 MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
 EVA-CLIP: Improved Training Techniques for CLIP at Scale
 Optimizing Memory Mapping Using Deep Reinforcement Learning
 A General Theoretical Paradigm to Understand Learning from Human Preferences
 Beyond Words: A Comprehensive Survey of Sentence Representations
 Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
 Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
 Adding Gradient Noise Improves Learning for Very Deep Networks
 Positional Description Matters for Transformers Arithmetic
 ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?
 Calibrated Language Models Must Hallucinate
 Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
 Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement
 Online Decision Transformer
 Benchmarking Large Language Models for News Summarization
 Overthinking the Truth: Understanding how Language Models Process False Demonstrations
 Scalable Extraction of Training Data from (Production) Language Models
 White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
 Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
 Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization
 ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization
 Visual In-Context Prompting
 PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
 GAIA: a benchmark for General AI Assistants
 More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory
 Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia
 Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text
 Chain-of-Thought Reasoning is a Policy Improvement Operator
 Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
 Thinking Fast and Slow in Large Language Models
 Towards Accurate Differential Diagnosis with Large Language Models
 Mamba: Linear-Time Sequence Modeling with Selective State Spaces
 Vanishing Gradients in Reinforcement Finetuning of Language Models
 The History and Risks of Reinforcement Learning and Human Feedback
 Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning
 Video Language Planning
 Thread of Thought Unraveling Chaotic Contexts
 PaSS: Parallel Speculative Sampling
 SeaLLMs -- Large Language Models for Southeast Asia
 LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
 Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models
 An LLM Compiler for Parallel Function Calling
 CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
 WinoGrande: An Adversarial Winograd Schema Challenge at Scale
 Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
 Magicoder: Source Code Is All You Need
 SILC: Improving Vision Language Pretraining with Self-Distillation
 MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
 RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
 TextGenSHAP: Scalable Post-hoc Explanations in Text Generation with Long Documents
 An Early Evaluation of GPT-4V(ision)
 Farzi Data: Autoregressive Data Distillation
 Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models
 One Embedder, Any Task: Instruction-Finetuned Text Embeddings
 Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
 Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
 Towards a Unified View of Parameter-Efficient Transfer Learning
 Beyond Surface: Probing LLaMA Across Scales and Layers
 TiC-CLIP: Continual Training of CLIP Models
 GPT4Point: A Unified Framework for Point-Language Understanding and Generation
 GOAT: GO to Any Thing
 Nash Learning from Human Feedback
 Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
 Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency
 Axiomatic Preference Modeling for Longform Question Answering
 FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling
 Efficient Monotonic Multihead Attention
 ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings
 Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
 Are LLMs Useful in the Poorest Schools? theTeacherAI in Sierra Leone
 De-Diffusion Makes Text a Strong Cross-Modal Interface
 Dolphins: Multimodal Language Model for Driving
 MetaDreamer: Efficient Text-to-3D Creation With Disentangling Geometry and Texture
 Efficient Transformer Knowledge Distillation: A Performance Review
 GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs
 Using Large Language Models to Accelerate Communication for Users with Severe Motor Impairments
 Instruction-tuning Aligns LLMs to the Human Brain
 Large Language Model Alignment: A Survey
 Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
 RoboVQA: Multimodal Long-Horizon Reasoning for Robotics
 Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models
 GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs
 Instruction-Following Evaluation for Large Language Models
 ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
 Pre-Training to Learn in Context
 Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
 Large Language Models for Mathematicians
 WhisBERT: Multimodal Text-Audio Language Modeling on 100M Words
 Language Model Inversion
 Training Chain-of-Thought via Latent-Variable Inference
 The Quantization Model of Neural Scaling
 Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses
 TinyGSM: achieving >80% on GSM8k with small language models
 Context Tuning for Retrieval Augmented Generation
 Order Matters in the Presence of Dataset Imbalance for Multilingual Learning
 TigerBot: An Open Multilingual Multitask LLM
 PromptBench: A Unified Library for Evaluation of Large Language Models
 Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions
 Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
 Challenges with unsupervised LLM knowledge discovery
 A Survey of Large Language Models in Medicine: Principles, Applications, and Challenges
 The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
 Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
 Honeybee: Locality-enhanced Projector for Multimodal LLM
 VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
 ProTIP: Progressive Tool Retrieval Improves Planning
 Catwalk: A Unified Language Model Evaluation Framework for Many Datasets
 Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models
 Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding
 FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection
 Unlocking Anticipatory Text Generation: A Constrained Approach for Faithful Decoding with Large Language Models
 SparQ Attention: Bandwidth-Efficient LLM Inference
 Silkie: Preference Distillation for Large Visual Language Models
 Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
 Algorithmic Collusion by Large Language Models
 Mathematical Language Models: A Survey
 Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
 FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
 Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes
 Pixel Aligned Language Models
 PathFinder: Guided Search over Multi-Step Reasoning Paths
 Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
 Vision-Language Models as a Source of Rewards
 Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
 From Text to Motion: Grounding GPT-4 in a Humanoid Robot "Alter3"
 Language-Informed Visual Concept Learning
 Evaluation of Large Language Models for Decision Making in Autonomous Driving
 ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
 Extending Context Window of Large Language Models via Semantic Compression
 A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
 Formal Aspects of Language Modeling
 Large Language Models on Graphs: A Comprehensive Survey
 Merlin:Empowering Multimodal LLMs with Foresight Minds
 The Efficiency Spectrum of Large Language Models: An Algorithmic Survey
 "I Want It That Way": Enabling Interactive Decision Support Using Large Language Models and Constraint Programming
 Generating Illustrated Instructions
 Alignment for Honesty
 Paloma: A Benchmark for Evaluating Language Model Fit
 Self-Evaluation Improves Selective Generation in Large Language Models
 Nomic Embed: Training a Reproducible Long Context Text Embedder
 Rejuvenating image-GPT as Strong Visual Representation Learners
 Object Recognition as Next Token Prediction
 Foundation Models in Robotics: Applications, Challenges, and the Future
 Distributed Inference and Fine-tuning of Large Language Models Over The Internet
 LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
 Data Management For Large Language Models: A Survey
 AtP*: An efficient and scalable method for localizing LLM behaviour to components
 Knowledge Distillation of Large Language Models
 Faithful Persona-based Conversational Dataset Generation with Large Language Models
 RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!
 ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
 EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
 Localized Symbolic Knowledge Distillation for Visual Commonsense Models
 Weight subcloning: direct initialization of transformers using larger pretrained ones
 Segment and Caption Anything
 Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation
 Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
 Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
 OneLLM: One Framework to Align All Modalities with Language
 Steering Llama 2 via Contrastive Activation Addition
 VILA: On Pre-training for Visual Language Models
 TIP: Text-Driven Image Processing with Semantic and Restoration Instructions
 HyperAttention: Long-context Attention in Near-Linear Time
 LLM360: Towards Fully Transparent Open-Source LLMs
 Efficient Transformers with Dynamic Token Pooling
 GIVT: Generative Infinite-Vocabulary Transformers
 Modeling Context in Referring Expressions
 The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes
 A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
 Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
 Text-Conditioned Resampler For Long Form Video Understanding
 Gemini: A Family of Highly Capable Multimodal Models
 LLMs are Not Just Next Token Predictors
 LLM in a flash: Efficient Large Language Model Inference with Limited Memory
 Cascade Speculative Drafting for Even Faster LLM Inference
 G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
 VideoPoet: A Large Language Model for Zero-Shot Video Generation
 HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models
 AppAgent: Multimodal Agents as Smartphone Users
 Time is Encoded in the Weights of Finetuned Language Models
 Generative Multimodal Models are In-Context Learners
 Cached Transformers: Improving Transformers with Differentiable Memory Cache
 Mini-GPTs: Efficient Large Language Models through Contextual Pruning
 PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
 An In-depth Look at Gemini's Language Abilities
 Retrieval-Augmented Generation for Large Language Models: A Survey
 Intriguing Properties of Quantization at Scale
 Parrot Captions Teach CLIP to Spot Text
 Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math
 Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning
 YAYI 2: Multilingual Open-Source Large Language Models
 Reasons to Reject? Aligning Language Models with Judgments
 Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
 LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding
 Parameter Efficient Tuning Allows Scalable Personalization of LLMs for Text Entry: A Case Study on Abbreviation Expansion
 Exploiting Novel GPT-4 APIs
 InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
 VCoder: Versatile Vision Encoders for Multimodal Large Language Models
 PreCog: Exploring the Relation between Memorization and Performance in Pre-trained Language Models
 MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
 LLM4VG: Large Language Models Evaluation for Video Grounding
 Shai: A large language model for asset management
 WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation
 LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment
 Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4
 Supervised Knowledge Makes Large Language Models Better In-context Learners
 SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling
 Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases
 The LLM Surgeon
 Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
 MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices
 TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
 Task Contamination: Language Models May Not Be Few-Shot Anymore
 PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training
 Learning Vision from Models Rivals Learning Vision from Data
 TinyLlama: An Open-Source Small Language Model
 Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models
 PanGu-$π$: Enhancing Language Model Architectures via Nonlinearity Compensation
 Making Large Language Models A Better Foundation For Dense Retrieval
 LARP: Language-Agent Role Play for Open-World Games
 A Survey of Reasoning with Foundation Models
 From Google Gemini to OpenAI Q* (Q-Star): A Survey of Reshaping the Generative Artificial Intelligence (AI) Research Landscape
 Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
 Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks
 Towards the Law of Capacity Gap in Distilling Language Models
 At Which Training Stage Does Code Data Help LLMs Reasoning?
 Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
 Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery
 STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition
 The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
 Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
 LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
 A Comprehensive Study of Knowledge Editing for Large Language Models
 VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM
 Orion-14B: Open-source Multilingual Large Language Models
 LLaMA Beyond English: An Empirical Study on Language Capability Transfer
 DocLLM: A layout-aware generative language model for multimodal document understanding
 COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
 If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents
 Patchscope: A Unifying Framework for Inspecting Hidden Representations of Language Models
 Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models
 Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
 GeoGalactica: A Scientific Large Language Model in Geoscience
 Improving Text Embeddings with Large Language Models
 Boosting Large Language Model for Speech Synthesis: An Empirical Study
 TrustLLM: Trustworthiness in Large Language Models
 Unicron: Economizing Self-Healing LLM Training at Scale
 MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining
 Proving Test Set Contamination in Black Box Language Models
 LLaMA Pro: Progressive LLaMA with Block Expansion
 LLM Augmented LLMs: Expanding Capabilities through Composition
 LLaVA-$φ$: Efficient Multi-Modal Assistant with Small Language Model
 ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers
 Understanding LLMs: A Comprehensive Overview from Training to Inference
 Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
 A Vision Check-up for Language Models
 MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
 Multilingual Instruction Tuning With Just a Pinch of Multilinguality
 WordArt Designer API: User-Driven Artistic Typography Synthesis with Large Language Models on ModelScope
 GPT-4V(ision) is a Generalist Web Agent, if Grounded
 V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
 Mind2Web: Towards a Generalist Agent for the Web
 DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
 DocGraphLM: Documental Graph Language Model for Information Extraction
 Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
 TOFU: A Task of Fictitious Unlearning for LLMs
 Transformers are Multi-State RNNs
 Secrets of RLHF in Large Language Models Part II: Reward Modeling
 DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
 Tuning LLMs with Contrastive Alignment Instructions for Machine Translation in Unseen, Low-resource Languages
 A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
 Towards Conversational Diagnostic AI
 Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
 Efficient LLM inference solution on Intel GPU
 I am a Strange Dataset: Metalinguistic Tests for Language Models
 Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk
 Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language Models
 The Impact of Reasoning Step Length on Large Language Models
 Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
 Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding
 Mixtral of Experts
 ChatQA: Building GPT-4 Level Conversational QA Models
 TeleChat Technical Report
 DiarizationLM: Speaker Diarization Post-Processing with Large Language Models
 Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon
 AST-T5: Structure-Aware Pretraining for Code Generation and Understanding
 Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach
 MaLA-500: Massive Language Adaptation of Large Language Models
 The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
 Theory of Mind abilities of Large Language Models in Human-Robot Interaction : An Illusion?
 State of What Art? A Call for Multi-Prompt LLM Evaluation
 Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
 Compressing Context to Enhance Inference Efficiency of Large Language Models
 Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? A Study on Several Typical Tasks
 VMamba: Visual State Space Model
 DiffusionGPT: LLM-Driven Text-to-Image Generation System
 Self-Rewarding Language Models
 Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
 Asynchronous Local-SGD Training for Language Modeling
 ReFT: Reasoning with Reinforced Fine-Tuning
 SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
 DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
 Tuning Language Models by Proxy
 Scalable Pre-training of Large Autoregressive Image Models
 Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation
 Extending LLMs' Context Window with 100 Samples
 E^2-LLM: Efficient and Extreme Length Extension of Large Language Models
 SPADE: Synthesizing Assertions for Large Language Model Pipelines
 Foundations of Vector Retrieval
 Rambler: Supporting Writing With Speech via LLM-Assisted Gist Manipulation
 Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
 Evaluating the Moral Beliefs Encoded in LLMs
 Boosting Theory-of-Mind Performance in Large Language Models via Prompting
 MambaByte: Token-free Selective State Space Model
 RakutenAI-7B: Extending Large Language Models for Japanese
 MM-LLMs: Recent Advances in MultiModal Large Language Models
 AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
 Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding
 Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study
 Small Language Model Meets with Reinforced Vision Vocabulary
 WARM: On the Benefits of Weight Averaged Reward Models
 In-Context Learning for Extreme Multi-Label Classification
 SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
 Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text
 CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
 Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs
 Time-LLM: Time Series Forecasting by Reprogramming Large Language Models
 UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion
 What Are Tools Anyway? A Survey from the Language Model Perspective
 ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
 SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection
 Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment
 CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation
 RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
 Mission: Impossible Language Models
 Benchmarking LLMs via Uncertainty Quantification
 BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
 Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering
 DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
 H2O-Danube-1.8B Technical Report
 FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
 CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion
 Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI
 WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
 Representation Engineering: A Top-Down Approach to AI Transparency
 LongAlign: A Recipe for Long Context Alignment of Large Language Models
 Scavenging Hyena: Distilling Transformers into Long Convolution Models
 Efficient Tool Use with Chain-of-Abstraction Reasoning
 YOLO-World: Real-Time Open-Vocabulary Object Detection
 Weaver: Foundation Models for Creative Writing
 Weak-to-Strong Jailbreaking on Large Language Models
 Transfer Learning for Text Diffusion Models
 StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis
 T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
 InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
 Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
 Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
 MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
 EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
 RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture
 Watermarking Makes Language Models Radioactive
 From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities
 SliceGPT: Compress Large Language Models by Deleting Rows and Columns
 Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support
 Generative Expressive Robot Behaviors using Large Language Models
 Efficient Exploration for LLMs
 Can Large Language Models Understand Context?
 SymbolicAI: A framework for logic-based approaches combining generative models and solvers
 Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization?
 OLMo: Accelerating the Science of Language Models
 Tree Prompting: Efficient Task Adaptation without Fine-Tuning
 CroissantLLM: A Truly Bilingual French-English Language Model
 Health-LLM: Personalized Retrieval-Augmented Disease Prediction Model
 Transforming and Combining Rewards for Aligning Large Language Models
 EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models
 Scaling Laws for Downstream Task Performance of Large Language Models
 Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
 Seven Failure Points When Engineering a Retrieval Augmented Generation System
 EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
 Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks
 CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations
 Multi-line AI-assisted Code Authoring
 Self-Discover: Large Language Models Self-Compose Reasoning Structures
 DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
 Training-Free Consistent Text-to-Image Generation
 Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
 Shortened LLaMA: A Simple Depth Pruning for Large Language Models
 Rethinking Optimization and Architecture for Tiny Language Models
 LiPO: Listwise Preference Optimization through Learning-to-Rank
 BlackMamba: Mixture of Experts for State-Space Models
 Rethinking Interpretability in the Era of Large Language Models
 OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
 TravelPlanner: A Benchmark for Real-World Planning with Language Agents
 K-Level Reasoning with Large Language Models
 StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
 PokéLLMon: A Human-Parity Agent for Pokémon Battles with Large Language Models
 Specialized Language Models with Cheap Inference from Limited Domain Data
 Repeat After Me: Transformers are Better than State Space Models at Copying
 A Survey on Hallucination in Large Vision-Language Models
 Corrective Retrieval Augmented Generation
 A Comprehensive Survey of Compression Algorithms for Language Models
 Leveraging Large Language Models for NLG Evaluation: A Survey
 The Power of Noise: Redefining Retrieval for RAG Systems
 AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
 Red Teaming Visual Language Models
 Knowledge Fusion of Large Language Models
 A Survey of Resource-efficient LLM and Multimodal Foundation Models
 Lexinvariant Language Models
 Noise2Music: Text-conditioned Music Generation with Diffusion Models
 Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery
 Mathematical Capabilities of ChatGPT
 AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation
 Large Language Models for Mathematical Reasoning: Progresses and Challenges
 SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
 Driving Everywhere with Large Language Model Policy Adaptation
 WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
 SpiRit-LM: Interleaved Spoken and Written Language Model
 Multilingual E5 Text Embeddings: A Technical Report
 In-Context Principle Learning from Mistakes
 Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains
 Hydragen: High-Throughput LLM Inference with Shared Prefixes
 CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay
 Fast Timing-Conditioned Latent Audio Diffusion
 Direct Language Model Alignment from Online AI Feedback
 Grandmaster-Level Chess Without Search
 Fine-Tuned Language Models Generate Stable Inorganic Materials as Text
 BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
 Tandem Transformers for Inference Efficient LLMs
 World Model on Million-Length Video And Language With RingAttention
 Lumos : Empowering Multimodal LLMs with Scene Text Recognition
 Suppressing Pink Elephants with Direct Principle Feedback
 Policy Improvement using Language Feedback Models
 PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
 LokiLM: Technical Report
 Scaling Laws for Fine-Grained Mixture of Experts
 Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
 Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
 AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts
 Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping
 OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
 ODIN: Disentangled Reward Mitigates Hacking in RLHF
 GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting
 A Tale of Tails: Model Collapse as a Change of Scaling Laws
 Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
 Generative Representational Instruction Tuning
 ChemLLM: A Chemical Large Language Model
 Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
 InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning
 DeAL: Decoding-time Alignment for Large Language Models
 Badllama 3: removing safety finetuning from Llama 3 in minutes
 ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
 SubGen: Token Generation in Sublinear Time and Memory
 Keyframer: Empowering Animation Design using Large Language Models
 Large Language Model for Table Processing: A Survey
 AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls
 Approaching Human-Level Forecasting with Language Models
 A phase transition between positional and semantic learning in a solvable model of dot-product attention
 Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning
 LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
 Large Language Model based Multi-Agents: A Survey of Progress and Challenges
 Premise Order Matters in Reasoning with Large Language Models
 Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment
 Chain-of-Thought Reasoning Without Prompting
 BitDelta: Your Fine-Tune May Only Be Worth One Bit
 OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
 Data Engineering for Scaling Language Models to 128K Context
 DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
 A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
 How to Train Data-Efficient LLMs
 L3GO: Language Agents with Chain-of-3D-Thoughts for Generating Unconventional Objects
 Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers
 GhostWriter: Augmenting Collaborative Human-AI Writing Experiences Through Personalization and Agency
 Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents
 Arrows of Time for Large Language Models
 Coercing LLMs to do and reveal (almost) anything
 LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
 Speculative Streaming: Fast LLM Inference without Auxiliary Models
 Ouroboros: Speculative Decoding with Large Model Enhanced Drafting
 User-LLM: Efficient LLM Contextualization with User Embeddings
 BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models
 TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
 How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
 Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
 Instruction-tuned Language Models are Better Knowledge Learners
 The FinBen: An Holistic Financial Benchmark for Large Language Models
 AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
 The boundary of neural network trainability is fractal
 Reformatted Alignment
 Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
 LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration
 OneBit: Towards Extremely Low-bit Large Language Models
 CoLLaVO: Crayon Large Language and Vision mOdel
 FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models
 GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements
 RLVF: Learning from Verbal Feedback without Overgeneralization
 In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss
 Linear Transformers with Learnable Kernel Functions are Better In-Context Models
 Efficient Guided Generation for Large Language Models
 SPAR: Personalized Content-Based Recommendation via Long Engagement Attention
 LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
 Large Language Models as Zero-shot Dialogue State Tracker through Function Calling
 DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
 LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing
 Generative Language Modeling for Automated Theorem Proving
 Automated Unit Test Improvement using Large Language Models at Meta
 LLM Agents can Autonomously Hack Websites
 Large Language Models: A Survey
 In-Context Retrieval-Augmented Language Models
 Consolidating Attention Features for Multi-view Image Editing
 LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models
 OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
 Scaling Up LLM Reviews for Google Ads Content Moderation
 Subobject-level Image Tokenization
 TinyLLaVA: A Framework of Small-scale Large Multimodal Models
 Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
 CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
 LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons
 EvoPrompting: Language Models for Code-Level Neural Architecture Search
 Goal Driven Discovery of Distributional Differences via Language Descriptions
 ChatMusician: Understanding and Generating Music Intrinsically with LLM
 GPTVQ: The Blessing of Dimensionality for LLM Quantization
 FuseChat: Knowledge Fusion of Chat Models
 MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
 AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning
 API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs
 ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
 Large Language Models for Data Annotation: A Survey
 LoRA+: Efficient Low Rank Adaptation of Large Models
 When is Tree Search Useful for LLM Planning? It Depends on the Discriminator
 The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
 Towards Optimal Learning of Language Models
 Evaluating Very Long-Term Conversational Memory of LLM Agents
 Training-Free Long-Context Scaling of Large Language Models
 Disentangled 3D Scene Generation with Layout Learning
 Do Large Language Models Latently Perform Multi-Hop Reasoning?
 Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
 Nemotron-4 15B Technical Report
 InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
 StructLM: Towards Building Generalist Models for Structured Knowledge Grounding
 Towards Open-ended Visual Quality Comparison
 When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
 MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
 Orca-Math: Unlocking the potential of SLMs in Grade School Math
 Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
 MOSAIC: A Modular System for Assistive and Interactive Cooking
 Priority Sampling of Large Language Models for Compilers
 Simple linear attention language models balance the recall-throughput tradeoff
 API Is Enough: Conformal Prediction for Large Language Models Without Logit-Access
 Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models
 Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
 StarCoder 2 and The Stack v2: The Next Generation
 Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
 ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
 Simulacra as Conscious Exotica
 Both Matter: Enhancing the Emotional Intelligence of Large Language Models without Compromising the General Intelligence
 Enhancing Vision-Language Pre-training with Rich Supervisions
 MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
 EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
 Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
 Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
 PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval
 Large Language Models(LLMs) on Tabular Data: Prediction, Generation, and Understanding -- A Survey
 Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?
 Tree-Planner: Efficient Close-loop Task Planning with Large Language Models
 Emergent and Predictable Memorization in Large Language Models
 Design2Code: How Far Are We From Automating Front-End Engineering?
 Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
 MathScale: Scaling Instruction Tuning for Mathematical Reasoning
 Empowering Large Language Model Agents through Action Learning
 Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
 RT-H: Action Hierarchies Using Language
 DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models
 Resonance RoPE: Improving Context Length Generalization of Large Language Models
 Datasets for Large Language Models: A Comprehensive Survey
 INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models
 Do Efficient Transformers Really Save Computation?
 MathPrompter: Mathematical Reasoning using Large Language Models
 A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
 Can Large Language Models Reason and Plan?
 Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
 LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error
 Common 7B Language Models Already Possess Strong Math Capabilities
 Yi: Open Foundation Models by 01.AI
 Teaching Large Language Models to Reason with Reinforcement Learning
 SaulLM-7B: A pioneering Large Language Model for Law
 Online Adaptation of Language Models with a Memory of Amortized Contexts
 Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
 Learning to Decode Collaboratively with Multiple Language Models
 ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
 GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
 The Unreasonable Effectiveness of Eccentric Automatic Prompts
 A Survey on Evaluation of Large Language Models
 The pitfalls of next-token prediction
 Stealing Part of a Production Language Model
 Algorithmic progress in language models
 Thinking Tokens for Language Modeling
 Is Cosine-Similarity of Embeddings Really About Similarity?
 ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
 Can't Remember Details in Long Documents? You Need Some R&R
 KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents
 Retrieval-Augmented Generation for AI-Generated Content: A Survey
 LLM Task Interference: An Initial Study on the Impact of Task-Switch in Conversational History
 3D-VLA: A 3D Vision-Language-Action Generative World Model
 Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
 MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
 GPT on a Quantum Computer
 VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding
 GiT: Towards Generalist Vision Transformer through Universal Language Interface
 BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
 Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
 Social Skill Training with Large Language Models
 StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control
 Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
 Veagle: Advancements in Multimodal Representation Learning
 Simple and Scalable Strategies to Continually Pre-train Large Language Models
 SOTOPIA-$π$: Interactive Learning of Socially Intelligent Language Agents
 Language models scale reliably with over-training and on downstream tasks
 Gemma: Open Models Based on Gemini Research and Technology
 LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
 On the Societal Impact of Open Foundation Models
 Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
 Chronos: Learning the Language of Time Series
 Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
 ORPO: Monolithic Preference Optimization without Reference Model
 MoAI: Mixture of All Intelligence for Large Language and Vision Models
 An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
 Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
 DeepSeek-VL: Towards Real-World Vision-Language Understanding
 How Far Are We from Intelligent Visual Deductive Reasoning?
 Small Models are Valuable Plug-ins for Large Language Models
 Backtracing: Retrieving the Cause of the Query
 MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
 VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
 Learning to Generate Better Than Your LLM
 Meta-in-context learning in large language models
 LERF: Language Embedded Radiance Fields
 Eliciting Latent Predictions from Transformers with the Tuned Lens
 FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
 Resurrecting Recurrent Neural Networks for Long Sequences
 An Overview on Language Models: Recent Developments and Outlook
 A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library
 A Survey of Evaluation Metrics Used for NLG Systems
 SummEval: Re-evaluating Summarization Evaluation
 The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
 LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
 CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences
 LLMR: Real-time Prompting of Interactive Worlds using Large Language Models
 Logits of API-Protected LLMs Leak Proprietary Information
 Knowledge Conflicts for LLMs: A Survey
 Revolutionizing Mental Health Care through LangChain: A Journey with a Large Language Model
 Will GPT-4 Run DOOM?
 RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation
 Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models
 Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
 Large language models surpass human experts in predicting neuroscience results
 Reliable, Adaptable, and Attributable Language Models with Retrieval
 You Need to Pay Better Attention
 RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval
 Stable LM 2 1.6B Technical Report
 DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation
 A Survey on Data Selection for Language Models
 PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
 Repetition Improves Language Model Embeddings
 How Transformers Learn Causal Structure with Gradient Descent
 Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models
 Analysing The Impact of Sequence Composition on Language Model Pre-Training
 Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models
 LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
 ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models
 Bayesian Reward Models for LLM Alignment
 KMMLU: Measuring Massive Multitask Language Understanding in Korean
 Dissecting Human and LLM Preferences
 Exploring Value Biases: How LLMs Deviate Towards the Ideal
 Do Llamas Work in English? On the Latent Language of Multilingual Transformers
 RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models
 Why are Sensitive Functions Hard for Transformers?
 Agents Need Not Know Their Purpose
 Copyright Traps for Large Language Models
 DoRA: Weight-Decomposed Low-Rank Adaptation
 SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks
 Rethinking Machine Unlearning for Large Language Models
 Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
 Improving Black-box Robustness with In-Context Rewriting
 Secret Collusion Among Generative AI Agents
 Natural Language Reinforcement Learning
 Universal Neural Functionals
 QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
 LESS: Selecting Influential Data for Targeted Instruction Tuning
 Building Your Own Product Copilot: Challenges, Opportunities, and Needs
 ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs
 KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
 Continual Learning for Large Language Models: A Survey
 Towards Efficient and Exact Optimization of Language Model Alignment
 HyperZ$\cdot$Z$\cdot$W Operator Connects Slow-Fast Networks for Full Context Interaction
 OMPGPT: A Generative Pre-trained Transformer Model for OpenMP
 NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness
 APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding
 Spike No More: Stabilizing the Pre-training of Large Language Models
 Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
 Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to Self-attention
 Zoology: Measuring and Improving Recall in Efficient Language Models
 GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer
 Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
 LoBaSS: Gauging Learnability in Supervised Fine-tuning Data
 Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective
 Instruction Tuning with Human Curriculum
 MatFormer: Nested Transformer for Elastic Inference
 Ada-Instruct: Adapting Instruction Generators for Complex Reasoning
 xVal: A Continuous Number Encoding for Large Language Models
 Decoding In-Context Learning: Neuroscience-inspired Analysis of Representations in Large Language Models
 Human Feedback is not Gold Standard
 DeWave: Discrete EEG Waves Encoding for Brain Dynamics to Text Translation
 Headless Language Models: Learning without Predicting with Contrastive Weight Tying
 HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models
 Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning
 Do language models plan ahead for future tokens?
 CAME: Confidence-guided Adaptive Memory Efficient Optimization
 Improving Language Plasticity via Pretraining with Active Forgetting
 AdANNS: A Framework for Adaptive Semantic Search
 Strategic Reasoning with Language Models
 MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies
 Sparse is Enough in Scaling Transformers
 Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback
 A Theory on Adam Instability in Large-Scale Machine Learning
 Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning
 Are Language Models Worse than Humans at Following Prompts? It's Complicated
 PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition
 Transformer Language Models without Positional Encodings Still Learn Positional Information
 Sequence Parallelism: Long Sequence Training from System Perspective
 Bio-inspired Structure Identification in Language Embeddings
 Transformers without Tears: Improving the Normalization of Self-Attention
 Neural Text Generation with Unlikelihood Training
 MASS: Masked Sequence to Sequence Pre-training for Language Generation
 QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
 mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
 Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models
 Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
 TnT-LLM: Text Mining at Scale with Large Language Models
 Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
 Larimar: Large Language Models with Episodic Memory Control
 LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
 VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
 MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data
 PERL: Parameter Efficient Reinforcement Learning from Human Feedback
 Isotropic3D: Image-to-3D Generation Based on a Single CLIP Embedding
 Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
 RAFT: Adapting Language Model to Domain Specific RAG
 Recurrent Drafter for Fast Speculative Decoding in Large Language Models
 Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations
 Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models
 Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews
 Language Agents as Optimizable Graphs
 Comparative Study of Large Language Model Architectures on Frontier
 Optimizing Distributed Training on Frontier for Large Language Models
 Striped Attention: Faster Ring Attention for Causal Transformers
 Block-Recurrent Transformers
 Addressing Some Limitations of Transformers with Feedback Memory
 Reverse Training to Nurse the Reversal Curse
 Evaluating Frontier Models for Dangerous Capabilities
 SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model
 When Do We Not Need Larger Vision Models?
 LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
 Towards 3D Molecule-Text Interpretation in Language Models
 VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
 Mixture of Soft Prompts for Controllable Data Generation
 HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
 Evolutionary Optimization of Model Merging Recipes
 Semiparametric Token-Sequence Co-Supervision
 MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
 On Learning to Summarize with Large Language Models as References
 Scalable Prompt Generation for Semi-supervised Learning with Language Models
 From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
 MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
 MyVLM: Personalizing VLMs for User-Specific Queries
 Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
 Recourse for reclamation: Chatting with generative language models
 On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial
 DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
 The MiniPile Challenge for Data-Efficient Language Models
 OmniNet: Omnidirectional Representations from Transformers
 Arcee's MergeKit: A Toolkit for Merging Large Language Models
 FinLlama: Financial Sentiment Classification for Algorithmic Trading Applications
 Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces
 The Case for Co-Designing Model Architectures with Hardware
 The Unreasonable Ineffectiveness of the Deeper Layers
 Improving Text-to-Image Consistency via Automatic Prompt Optimization
 InternLM2 Technical Report
 AIOS: LLM Agent Operating System
 Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
 InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
 Can large language models explore in-context?
 SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series
 FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
 BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text
 AllHands: Ask Me Anything on Large-scale Verbatim Feedback via Large Language Models
 LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
 VidLA: Video-Language Alignment at Scale
 Compiler generated feedback for Large Language Models
 sDPO: Don't Use Your Data All at Once
 Polaris: A Safety-focused LLM Constellation Architecture for Healthcare
 RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners
 LLM4Decompile: Decompiling Binary Code with Large Language Models
 Getting the most out of your tokenizer for pre-training and domain adaptation
 How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese
 Wider and Deeper LLM Networks are Fairer LLM Evaluators
 Editing Large Language Models: Problems, Methods, and Opportunities
 Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
 Long-form factuality in large language models
 Towards a World-English Language Model for On-Device Virtual Assistants
 LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
 MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling
 STaR-GATE: Teaching Language Models to Ask Clarifying Questions
 Trusting Your Evidence: Hallucinate Less with Context-aware Decoding
 LITA: Language Instructed Temporal-Localization Assistant
 TextCraftor: Your Text Encoder Can be Image Quality Controller
 Mechanistic Design and Scaling of Hybrid Architectures
 Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines
 SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
 Blockwise Parallel Transformer for Large Context Models
 Large Language Models Can Be Strong Differentially Private Learners
 Head-wise Shareable Attention for Large Language Models
 Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models
 ReALM: Reference Resolution As Language Modeling
 Gecko: Versatile Text Embeddings Distilled from Large Language Models
 Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs
 Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer
 Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning
 DiJiang: Efficient Large Language Models through Compact Kernelization
 Jamba: A Hybrid Transformer-Mamba Language Model
 Localizing Paragraph Memorization in Language Models
 The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
 Group Preference Optimization: Few-Shot Alignment of Large Language Models
 Communicative Agents for Software Development
 Preference Ranking Optimization for Human Alignment
 The CRINGE Loss: Learning what language not to model
 MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
 Attribute First, then Generate: Locally-attributable Grounded Text Generation
 Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models
 FABLES: Evaluating faithfulness and content selection in book-length summarization
 Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
 WavLLM: Towards Robust and Adaptive Speech Large Language Model
 MaGRITTe: Manipulative and Generative 3D Realization from Image, Topview and Text
 ST-LLM: Large Language Models Are Effective Temporal Learners
 Advancing LLM Reasoning Generalists with Preference Trees
 Best Practices and Lessons Learned on Synthetic Data for Language Models
 Long-context LLMs Struggle with Long In-context Learning
 HyperCLOVA X Technical Report
 Poro 34B and the Blessing of Multilinguality
 Octopus v2: On-device language model for super agent
 Are large language models superhuman chemists?
 LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
 A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course
 ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline
 Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models
 Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
 Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers
 Auxiliary task demands mask the capabilities of smaller language models
 Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
 Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity
 Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
 Data Interpreter: An LLM Agent For Data Science
 AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
 Training LLMs over Neurally Compressed Text
 Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
 ReFT: Representation Finetuning for Language Models
 CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
 MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
 Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
 RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
 LVLM-Intrepret: An Interpretability Tool for Large Vision-Language Models
 Noise-Aware Training of Layout-Aware Language Models
 AI and the Problem of Knowledge Collapse
 Learning to Plan and Generate Text with Citations
 The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models
 An Incomplete Loop: Deductive, Inductive, and Abductive Learning in Large Language Models
 ALOHa: A New Measure for Hallucination in Captioning Models
 Efficient Multi-Vector Dense Retrieval Using Bit Vectors
 Prompts As Programs: A Structure-Aware Approach to Efficient Compile-Time Prompt Optimization
 Iterative Forward Tuning Boosts In-context Learning in Language Models
 Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model
 No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
 CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues
 Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
 Stream of Search (SoS): Learning to Search in Language
 Large Product Key Memory for Pretrained Language Models
 Large Memory Layers with Product Keys
 BRAVE: Broadening the visual encoding of vision-language models
 Adapting LLaMA Decoder to Vision Transformer
 RULER: What's the Real Context Size of Your Long-Context Language Models?
 Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models
 InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
 Reconstructing Hand-Held Objects in 3D
 MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
 MuPT: A Generative Symbolic Music Pretrained Transformer
 OmniFusion Technical Report
 LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
 CodecLM: Aligning Language Models with Tailored Synthetic Data
 SambaLingo: Teaching Large Language Models New Languages
 MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
 Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
 MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
 Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
 Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
 Koala: Key frame-conditioned long video-LLM
 PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models
 Understanding Emergent Abilities of Language Models from the Loss Perspective
 Enhancing Formal Theorem Proving: A Comprehensive Dataset for Training AI Models on Coq Code
 Making Large Language Models Better Data Creators
 On Surgical Fine-tuning for Language Encoders
 AdaLomo: Low-memory Optimization with Adaptive Learning Rate
 FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets
 Embedding Democratic Values into Social Media AIs via Societal Objective Functions
 Large Language Models as Commonsense Knowledge for Large-Scale Task Planning
 Less is More: Selective Layer Finetuning with SubTuning
 Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning
 AdaVAE: Exploring Adaptive GPT-2s in Variational Auto-Encoders for Language Modeling
 Cut the CARP: Fishing for zero-shot story evaluation
 LLoCO: Learning Long Contexts Offline
 Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
 OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
 Rho-1: Not All Tokens Are What You Need
 RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
 Audio Dialogues: Dialogues dataset for audio and music understanding
 From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples
 JetMoE: Reaching Llama2 Performance with 0.1M Dollars
 Tackling Polysemanticity with Neuron Embeddings
 WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
 Entity-Level Sentiment Analysis (ELSA): An exploratory task survey
 ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models
 Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs
 Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
 Mechanics of Next Token Prediction with Self-Attention
 Scaling Laws of RoPE-based Extrapolation
 Pre-training Small Base LMs with Fewer Tokens
 Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
 Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck
 THOUGHTSCULPT: Reasoning with Intermediate Revision and Search
 Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers
 Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data
 Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought
 Toward a Theory of Tokenization in LLMs
 Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
 Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
 Learn Your Reference Model for Real Good Alignment
 Large Language Models are as persuasive as humans, but why? About the cognitive effort and moral-emotional language of LLM arguments
 TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
 TransformerFAM: Feedback attention is working memory
 On Speculative Decoding for Multimodal Large Language Models
 Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
 Generative Disco: Text-to-Video Generation for Music Visualization
 Self-playing Adversarial Language Game Enhances LLM Reasoning
 Compression Represents Intelligence Linearly
 The Illusion of State in State-Space Models
 ChatGPT Can Predict the Future when it Tells Stories Set in the Future About the Past
 A Thorough Examination of Decoding Methods in the Era of LLMs
 A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
 What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
 Should You Mask 15% in Masked Language Modeling?
 Finetuning Pretrained Transformers into RNNs
 BLINK: Multimodal Large Language Models Can See but Not Perceive
 Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
 Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment
 Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
 OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data
 TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
 MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation
 When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes
 Fewer Truncations Improve Language Modeling
 Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection
 An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
 Many-Shot In-Context Learning
 Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
 Exploring the landscape of large language models: Foundations, techniques, and challenges
 Automated Social Science: Language Models as Scientist and Subjects
 Language Models Still Struggle to Zero-shot Reason about Time Series
 Stepwise Alignment for Constrained Language Model Policy Optimization
 MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents
 Language Imbalance Can Boost Cross-lingual Generalisation
 Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge
 Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
 LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency
 TextSquare: Scaling up Text-Centric Visual Instruction Tuning
 Large Language Models are Few-Shot Health Learners
 How Far Can We Go with Practical Function-Level Program Repair?
 AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
 The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey
 A Survey on Retrieval-Augmented Text Generation for Large Language Models
 A RAG Method for Source Code Inquiry Tailored to Long-Context LLMs
 How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior
 State Space Model for New-Generation Network Alternative to Transformers: A Survey
 LLM In-Context Recall is Prompt Dependent
 Reducing hallucination in structured outputs via Retrieval-Augmented Generation
 Towards Large Language Models as Copilots for Theorem Proving in Lean
 Characterizing LLM Abstention Behavior in Science QA with Context Perturbations
 From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
 Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
 Aligning language models with human preferences
 Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding
 Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
 Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation
 RAR-b: Reasoning as Retrieval Benchmark
 Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
 Deep Reinforcement Learning with a Natural Language Action Space
 Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
 How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
 FlowMind: Automatic Workflow Generation with LLMs
 XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
 DataComp: In search of the next generation of multimodal datasets
 Stable and low-precision training for large-scale vision-language models
 Multi-Head Mixture-of-Experts
 Transformers Can Represent $n$-gram Language Models
 Pegasus-v1 Technical Report
 Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs
 OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
 SnapKV: LLM Knows What You are Looking for Before Generation
 SpaceByte: Towards Deleting Tokenization from Large Language Modeling
 A Survey on Self-Evolution of Large Language Models
 Retrieval Head Mechanistically Explains Long-Context Factuality
 Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels
 SPLATE: Sparse Late Interaction Retrieval
 VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models
 AgentKit: Flow Engineering with Graphs, not Coding
 Rethinking LLM Memorization through the Lens of Adversarial Compression
 What's the Magic Word? A Control Theory of LLM Prompting
 Adapting Language Models to Compress Contexts
 Investigating the Role of Feed-Forward Networks in Transformers Using Parallel Attention and Feed-Forward Net Design
 LMentry: A Language Model Benchmark of Elementary Language Tasks
 ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning
 Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Perfect Reasoners
 Graph Machine Learning in the Era of Large Language Models (LLMs)
 NExT: Teaching Large Language Models to Reason about Code Execution
 "If the Machine Is As Good As Me, Then What Use Am I?" -- How the Use of ChatGPT Changes Young Professionals' Perception of Productivity and Accomplishment
 Can Language Models Solve Olympiad Programming?
 Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
 CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models
 How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
 IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages
 Make Your LLM Fully Utilize the Context
 Weak-to-Strong Extrapolation Expedites Alignment
 SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
 Continual Learning of Large Language Models: A Comprehensive Survey
 Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding
 Tele-FLM Technical Report
 TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
 List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
 Let's Think Dot by Dot: Hidden Computation in Transformer Language Models
 MoDE: CLIP Data Experts via Clustering
 Universal Adversarial Triggers Are Not Universal
 The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models
 Improving Dictionary Learning with Gated Sparse Autoencoders
 BASS: Batched Attention-optimized Speculative Sampling
 CharacterFactory: Sampling Consistent Characters with GANs for Diffusion Models
 CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
 Image Segmentation Using Text and Image Prompts
 Holistic Safety and Responsibility Evaluations of Advanced AI Models
 WangLab at MEDIQA-CORR 2024: Optimized LLM-based Programs for Medical Error Detection and Correction
 NORMAD: A Benchmark for Measuring the Cultural Adaptability of Large Language Models
 Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation
 Efficient Continual Pre-training for Building Domain Specific Large Language Models
 DeLighT: Deep and Light-weight Transformer
 Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically
 GeckOpt: LLM System Efficiency via Intent-Based Tool Selection
 Better Synthetic Data by Retrieving and Transforming Existing Datasets
 Relational Graph Convolutional Networks for Sentiment Analysis
 Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data
 Foundational Challenges in Assuring Alignment and Safety of Large Language Models
 Nyonic Technical Report
 LLM Evaluators Recognize and Favor Their Own Generations
 PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
 A Survey of Generative Search and Recommendation in the Era of Large Language Models
 AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
 A Primer on the Inner Workings of Transformer-based Language Models
 U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF
 zkLLM: Zero Knowledge Proofs for Large Language Models
 A Survey on the Memory Mechanism of Large Language Model based Agents
 Large Language Model Agent as a Mechanical Designer
 Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs
 Near to Mid-term Risks and Opportunities of Open Source Generative AI
 Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
 Benchmarking Mobile Device Control Agents across Diverse Configurations
 Evaluating Large Language Models on Time Series Feature Understanding: A Comprehensive Taxonomy and Benchmark
 Assessing The Potential Of Mid-Sized Language Models For Clinical QA
 Conformal Prediction for Natural Language Processing: A Survey
 Dual Modalities of Text: Visual and Textual Generative Pre-training
 AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback
 Predicting Emergent Abilities with Infinite Resolution Evaluation
 Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
 Hallucination of Multimodal Large Language Models: A Survey
 Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
 Benchmarking Benchmark Leakage in Large Language Models
 Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations
 ChuXin: 1.6B Technical Report
 Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
 PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval
 LEGENT: Open Platform for Embodied Agents
 From Persona to Personalization: A Survey on Role-Playing Language Agents
 CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing Experiments
 BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
 Autonomous LLM-driven research from data to human-verifiable research papers
 Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo
 Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare
 Semantic Routing for Enhanced Performance of LLM-Assisted Intent-Based 5G Core Network Management and Orchestration
 RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
 Beyond Words: A Mathematical Framework for Interpreting Large Language Models
 BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers
 Ranked List Truncation for Large Language Model-based Re-Ranking
 Building a Large Japanese Web Corpus for Large Language Models
 STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases
 Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
 DOCCI: Descriptions of Connected and Contrasting Images
 Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
 ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training
 Better & Faster Large Language Models via Multi-token Prediction
 When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively
 Extending Llama-3's Context Ten-Fold Overnight
 Octopus v4: Graph of language models
 Revenge of the Fallen? Recurrent Models Match Transformers at Predicting Human Language Comprehension Metrics
 ChatGPTest: opportunities and cautionary tales of utilizing AI for questionnaire pretesting
 How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library
 Faster Convergence for Transformer Fine-tuning with Line Search Methods
 Linear Transformers Are Secretly Fast Weight Programmers
 FLAME: Factuality-Aware Alignment for Large Language Models
 NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment
 In-Context Learning Creates Task Vectors
 WildChat: 1M ChatGPT Interaction Logs in the Wild
 "In-Context Learning" or: How I learned to stop worrying and love "Applied Information Retrieval"
 LLM-AD: Large Language Model based Audio Description System
 PLAID SHIRTTT for Large-Scale Streaming Dense Retrieval
 LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
 Self-Play Preference Optimization for Language Model Alignment
 Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3
 A Careful Examination of Large Language Model Performance on Grade School Arithmetic
 Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
 Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models
 Automatic Creative Selection with Cross-Modal Matching
 Harmonic LLMs are Trustworthy
 On Training a Neural Network to Explain Binaries
 In-Context Learning with Long-Context Models: An In-Depth Exploration
 Transferring Troubles: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning
 Aligning LLM Agents by Learning Latent Preference from User Edits
 How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis
 Neural Networks Learn Statistics of Increasing Complexity
 Emerging Properties in Self-Supervised Vision Transformers
 Advancing Multimodal Medical Capabilities of Gemini
 "I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust
 D2PO: Discriminator-Guided DPO with Response Evaluation Models
 Controllable Text Generation in the Instruction-Tuning Era
 MANTIS: Interleaved Multi-Image Instruction Tuning
 A Philosophical Introduction to Language Models - Part II: The Way Forward
 RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing
 How do Large Language Models Handle Multilingualism?
 FinBERT: Financial Sentiment Analysis with Pre-trained Language Models
 Modeling Emotions and Ethics with Large Language Models
 Structured Chemistry Reasoning with Large Language Models
 Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
 To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO
 Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
 MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
 Verification and Refinement of Natural Language Explanations through LLM-Symbolic Theorem Proving
 Characterising the Creative Process in Humans and Large Language Models
 ECC Analyzer: Extract Trading Signal from Earnings Conference Calls using Large Language Model for Stock Performance Prediction
 Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering
 AlphaMath Almost Zero: process Supervision without process
 MAmmoTH2: Scaling Instructions from the Web
 Is Flash Attention Stable?
 ImageInWords: Unlocking Hyper-Detailed Image Descriptions
 What matters when building vision-language models?
 The AI Review Lottery: Widespread AI-Assisted Peer Reviews Boost Paper Scores and Acceptance Rates
 Understanding LLMs Requires More Than Statistical Generalization
 Efficient and Economic Large Language Model Inference with Attention Offloading
 A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law
 Large Language Models are Inconsistent and Biased Evaluators
 101 Billion Arabic Words Dataset
 What is Sentiment Meant to Mean to Language Models?
 GPT-4 passes most of the 297 written Polish Board Certification Examinations
 Text Quality-Based Pruning for Efficient Training of Language Models
 Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
 On the Evaluation of Machine-Generated Reports
 Automatic Programming: Large Language Models and Beyond
 Long-Term Human Trajectory Prediction using 3D Dynamic Scene Graphs
 Multi-hop Question Answering over Knowledge Graphs using Large Language Models
 Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
 Parallel Structures in Pre-training Data Yield In-Context Learning
 BooookScore: A systematic exploration of book-length summarization in the era of LLMs
 Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
 Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders
 Position Paper: Leveraging Foundational Models for Black-Box Optimization: Benefits, Challenges, and Future Directions
 Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
 Beyond Helpfulness and Harmlessness: Eliciting Diverse Behaviors from Large Language Models with Persona In-Context Learning
 InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
 ReZero is All You Need: Fast Convergence at Large Depth
 QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
 NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts
 A Transformer with Stack Attention
 xLSTM: Extended Long Short-Term Memory
 Toward In-Context Teaching: Adapting Examples to Students' Misconceptions
 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
 The Silicone Ceiling: Auditing GPT's Race and Gender Biases in Hiring
 Parameter-Efficient Fine-Tuning with Discrete Fourier Transform
 Granite Code Models: A Family of Open Foundation Models for Code Intelligence
 FlashBack:Efficient Retrieval-Augmented Language Modeling for Long Context Inference
 Sketch Then Generate: Providing Incremental User Feedback and Guiding LLM Code Generation through Language-Oriented Code Sketches
 Assemblage: Automatic Binary Dataset Construction for Machine Learning
 Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application
 Modeling Caption Diversity in Contrastive Vision-Language Pretraining
 CLLMs: Consistency Large Language Models
 You Only Cache Once: Decoder-Decoder Architectures for Language Models
 VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context
 From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control
 Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
 Chain of Thoughtlessness: An Analysis of CoT in Planning
 LLMs Can Patch Up Missing Relevance Judgments in Evaluation
 Robust Implementation of Retrieval-Augmented Generation on Edge-based Computing-in-Memory Architectures
 vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
 How Susceptible are Large Language Models to Ideological Manipulation?
 CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
 Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
 Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
 Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models
 Can We Use Large Language Models to Fill Relevance Judgment Holes?
 Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias
 Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
 Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics
 The Dark Side of Dataset Scaling: Evaluating Racial Classification in Multimodal Models
 PoPE: Legendre Orthogonal Polynomials Based Position Encoding for Large Language Models
 Automating the Enterprise with Foundation Models
 Enhancing Q-Learning with Large Language Model Heuristics
 Can Nuanced Language Lead to More Actionable Insights? Exploring the Role of Generative AI in Analytical Narrative Structure
 Language Modeling Using Tensor Trains
 PropertyGPT: LLM-driven Formal Verification of Smart Contracts through Retrieval-Augmented Property Generation
 Semantic Scaling: Bayesian Ideal Point Estimates with Large Language Models
 HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis
 One vs. Many: Comprehending Accurate Information from Multiple Erroneous and Inconsistent AI Generations
 Large Language Models (LLMs) as Agents for Augmented Democracy
 Scaling Laws for Forgetting When Fine-Tuning Large Language Models
 GROVE: A Retrieval-augmented Complex Story Generation Framework with A Forest of Evidence
 Natural Language Processing RELIES on Linguistics
 Probing Multimodal LLMs as World Models for Driving
 AttacKG+: Boosting Attack Knowledge Graph Construction with Large Language Models
 Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation
 A Causal Explainable Guardrails for Large Language Models
 In-Context Symbolic Regression: Leveraging Language Models for Function Discovery
 Plan of Thoughts: Heuristic-Guided Problem Solving with Large Language Models
 Value Augmented Sampling for Language Model Alignment and Personalization
 Akal Badi ya Bias: An Exploratory Study of Gender Bias in Hindi Language Technology
 A Survey on RAG Meets LLMs: Towards Retrieval-Augmented Large Language Models
 Transforming the Bootstrap: Using Transformers to Compute Scattering Amplitudes in Planar N = 4 Super Yang-Mills Theory
 Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers
 Which Nigerian-Pidgin does Generative AI speak?: Issues about Representativeness and Bias for Multilingual and Low Resource Languages
 Sub-goal Distillation: A Method to Improve Small Language Agents
 MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
 Linearizing Large Language Models
 Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval
 LMD3: Language Model Data Density Dependence
 State-Free Inference of State-Space Models: The Transfer Function Approach
 Generative AI as a metacognitive agent: A comparative mixed-method study with human participants on ICF-mimicking exam performance
 Masked Structural Growth for 2x Faster Language Model Pre-training
 Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
 A Generalist Learner for Multifaceted Medical Image Interpretation
 The Platonic Representation Hypothesis
 AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
 A Systematic Investigation of Distilling Large Language Models into Cross-Encoders for Passage Re-ranking
 Zero-Shot Tokenizer Transfer
 RLHF Workflow: From Reward Modeling to Online RLHF
 LogoMotion: Visually Grounded Code Generation for Content-Aware Animation
 SUTRA: Scalable Multilingual Language Model Architecture
 ERAGent: Enhancing Retrieval-Augmented Language Models with Improved Accuracy, Efficiency, and Personalization
 Large Language Models as Planning Domain Generators
 Explaining Text Similarity in Transformer Models
 The Hidden Pitfalls of the Cosine Similarity Loss
 Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool Planning
 Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
 Exposing Attention Glitches with Flip-Flop Language Modeling
 CodeT5+: Open Code Large Language Models for Code Understanding and Generation
 CinePile: A Long Video Question Answering Dataset and Benchmark
 Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
 Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
 Enhancing Gender-Inclusive Machine Translation with Neomorphemes and Large Language Models
 Understanding the performance gap between online and offline alignment algorithms
 SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
 SpeechVerse: A Large-scale Generalizable Audio Language Model
 Compositional Text-to-Image Generation with Dense Blob Representations
 Benchmarking Retrieval-Augmented Large Language Models in Biomedical NLP: Application, Robustness, and Self-Awareness
 People cannot distinguish GPT-4 from a human in a Turing test
 LLM-Augmented Agent-Based Modelling for Social Simulations: Challenges and Opportunities
 What Can Natural Language Processing Do for Peer Review?
 Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment
 Improving Transformers with Dynamically Composable Multi-Head Attention
 Word2World: Generating Stories and Worlds through Large Language Models
 Ask Again, Then Fail: Large Language Models' Vacillations in Judgement
 ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models
 Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
 Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis
 Is the Pope Catholic? Yes, the Pope is Catholic. Generative Evaluation of Intent Resolution in LLMs
 Characterizing the Accuracy - Efficiency Trade-off of Low-rank Decomposition in Language Models
 Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models
 Measuring Implicit Bias in Explicitly Unbiased Large Language Models
 UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
 Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
 SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection
 Chameleon: Mixed-Modal Early-Fusion Foundation Models
 Many-Shot In-Context Learning in Multimodal Foundation Models
 Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model
 LoRA Learns Less and Forgets Less
 Using ChatGPT for Thematic Analysis
 Are Large Pre-Trained Language Models Leaking Your Personal Information?
 Designing and Evaluating Dialogue LLMs for Co-Creative Improvised Theatre
 HMT: Hierarchical Memory Transformer for Long Context Language Processing
 Air Gap: Protecting Privacy-Conscious Conversational Agents
 Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models
 LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages
 MarkLLM: An Open-Source Toolkit for LLM Watermarking
 "They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
 Towards Uncertainty-Aware Language Agent
 Observational Scaling Laws and the Predictability of Language Model Performance
 Layer-Condensed KV Cache for Efficient Inference of Large Language Models
 Inducing Group Fairness in LLM-Based Decisions
 CELA: Cost-Efficient Language Model Alignment for CTR Prediction
 RDRec: Rationale Distillation for LLM-based Recommendation
 A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers
 INDUS: Effective and Efficient Language Models for Scientific Applications
 Dynamic data sampler for cross-language transfer learning in large language models
 Grounded 3D-LLM with Referent Tokens
 PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition
 Span-Aggregatable, Contextualized Word Embeddings for Effective Phrase Mining
 MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization
 WavCraft: Audio Editing and Generation with Large Language Models
 Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives
 Transformers learn to implement preconditioned gradient descent for in-context learning
 BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting
 MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
 Imp: Highly Capable Large Multimodal Models for Mobile Devices
 Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
 Towards Modular LLMs by Building and Reusing a Library of LoRAs
 OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
 Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks
 (Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts
 Latent State Estimation Helps UI Agents to Reason
 Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
 Large Language Models Meet NLP: A Survey
 PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference
 Blind Baselines Beat Membership Inference Attacks for Foundation Models
 Your Transformer is Secretly Linear
 Can AI Relate: Testing Large Language Model Response for Mental Health Support
 Increasing the LLM Accuracy for Question Answering: Ontologies to the Rescue!
 Large Language Models are Biased Reinforcement Learners
 ActiveLLM: Large Language Model-based Active Learning for Textual Few-Shot Scenarios
 SynDy: Synthetic Dynamic Dataset Generation Framework for Misinformation Tasks
 Keep It Private: Unsupervised Privatization of Online Text
 Generative AI and Large Language Models for Cyber Security: All Insights You Need
 Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents
 Are Large Language Models Moral Hypocrites? A Study Based on Moral Foundations
 Leveraging Reinforcement Learning and Large Language Models for Code Optimization
 Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models
 Large Language Models Are Not Robust Multiple Choice Selectors
 Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
 Not All Language Model Features Are Linear
 DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
 Dense Connector for MLLMs
 A Nurse is Blue and Elephant is Rugby: Cross Domain Alignment in Large Language Models Reveal Human-like Patterns
 Bitune: Bidirectional Instruction-Tuning
 Lessons from the Trenches on Reproducible Evaluation of Language Models
 Multi-turn Reinforcement Learning from Preference Human Feedback
 Base of RoPE Bounds Context Length
 Top-Down Partitioning for Efficient List-Wise Ranking
 Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast
 xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token
 Agent Planning with World Knowledge Model
 AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
 Distributed Speculative Inference of Large Language Models
 Babysit A Language Model From Scratch: Interactive Language Learning by Trials and Demonstrations
 RAGE Against the Machine: Retrieval-Augmented LLM Explanations
 Efficient Multimodal Large Language Models: A Survey
 Natural Language Can Help Bridge the Sim2Real Gap
 FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research
 A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models
 On the Brittle Foundations of ReAct Prompting for Agentic Large Language Models
 Infinite Limits of Multi-head Transformer Dynamics
 News Recommendation with Category Description by a Large Language Model
 Evaluation of the Programming Skills of Large Language Models
 AI-Assisted Assessment of Coding Practices in Modern Code Review
 LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery
 Super Tiny Language Models
 RE-Adapt: Reverse Engineered Adaptation of Large Language Models
 CoachLM: Automatic Instruction Revisions Improve the Data Quality in LLM Instruction Tuning
 "According to ...": Prompting Language Models Improves Quoting from Pre-Training Data
 Instruction Tuning With Loss Over Instructions
 GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
 ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
 Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
 Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
 Luban: Building Open-Ended Creative Agents via Autonomous Embodied Verification
 V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
 SignLLM: Sign Languages Production Large Language Models
 Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
 Are Long-LLMs A Necessity For Long-Context Tasks?
 iVideoGPT: Interactive VideoGPTs are Scalable World Models
 Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition
 Extracting Prompts by Inverting LLM Outputs
 Data movement limits to frontier model training
 Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
 Aya 23: Open Weight Releases to Further Multilingual Progress
 AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct
 OLAPH: Improving Factuality in Biomedical Long-form Question Answering
 Tailoring Vaccine Messaging with Common-Ground Opinions
 Efficient Adversarial Training in LLMs with Continuous Attacks
 AGRaME: Any-Granularity Ranking with Multi-Vector Embeddings
 Neural Scaling Laws for Embodied AI
 Evaluating AI-generated code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust
 The AI Community Building the Future? A Quantitative Analysis of Development Activity on Hugging Face Hub
 Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models
 G-DIG: Towards Gradient-based DIverse and hiGh-quality Instruction Data Selection for Machine Translation
 "The Death of Wikipedia?" -- Exploring the Impact of ChatGPT on Wikipedia Engagement
 Let Me Do It For You: Towards LLM Empowered Recommendation via Tool Learning
 Eliciting Latent Knowledge from Quirky Language Models
 GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
 Matryoshka Multimodal Models
 NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
 Transformers Can Do Arithmetic with the Right Embeddings
 Trans-LoRA: towards data-free Transferable Parameter Efficient Finetuning
 An Introduction to Vision-Language Modeling
 Generation and human-expert evaluation of interesting research ideas using knowledge graphs and large language models
 Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?
 Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective
 Zamba: A Compact 7B SSM Hybrid Model
 A Survey on LLM Inference-Time Self-Improvement
 LoGAH: Predicting 774-Million-Parameter Transformers using Graph HyperNetworks with 1/100 Parameters
 MoEUT: Mixture-of-Experts Universal Transformers
 DAGER: Exact Gradient Inversion for Large Language Models
 HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models
 The Impact of Positional Encoding on Length Generalization in Transformers
 BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks
 Phase Transitions in the Output Distribution of Large Language Models
 Crafting Interpretable Embeddings by Asking LLMs Questions
 gzip Predicts Data-dependent Scaling Laws
 Spectral Editing of Activations for Large Language Model Alignment
 Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment
 Learning to Reason via Program Generation, Emulation, and Search
 Hacc-Man: An Arcade Game for Jailbreaking LLMs
 CLARINET: Augmenting Language Models to Ask Clarification Questions for Retrieval
 FinTextQA: A Dataset for Long-form Financial Question Answering
 On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks
 Don't Forget to Connect! Improving RAG with Graph-based Reranking
 Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
 LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models
 Faithful Logical Reasoning via Symbolic Chain-of-Thought
 2BP: 2-Stage Backpropagation
 Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
 VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections
 Fine-tuning Large Language Models with Sequential Instructions
 Evaluating the Factual Consistency of Large Language Models Through News Summarization
 Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
 Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs
 Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
 MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
 Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
 Robust Preference Optimization through Reward Model Distillation
 Jina CLIP: Your CLIP Model Is Also Your Text Retriever
 Matryoshka Query Transformer for Large Vision-Language Models
 Are You Sure? Rank Them Again: Repeated Ranking For Better Preference Datasets
 Offline Regularised Reinforcement Learning for Large Language Models Alignment
 LLMs achieve adult human performance on higher-order theory of mind tasks
 On the Role of Attention Masks and LayerNorm in Transformers
 OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning
 Language Models Trained to do Arithmetic Predict Human Risky and Intertemporal Choice
 On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization
 SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
 Xwin-LM: Strong and Scalable Alignment Practice for LLMs
 GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning
 Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts
 Enhancing Large Vision Language Models with Self-Training on Image Comprehension
 Preference Learning Algorithms Do Not Learn Preference Rankings
 MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions
 Contextual Position Encoding: Learning to Count What's Important
 Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
 Linking In-context Learning in Transformers to Human Episodic Memory
 HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
 Bayesian Online Natural Gradient (BONG)
 Data Augmentation Vision Transformer for Fine-grained Image Classification
 MotionLLM: Understanding Human Behaviors from Human Motions and Videos
 Don't drop your samples! Coherence-aware training benefits Conditional diffusion
 Large Language Models Can Self-Improve At Web Agent Tasks
 Group Robust Preference Optimization in Reward-free RLHF
 Evaluating Large Language Model Biases in Persona-Steered Generation
 Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads
 Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
 Is In-Context Learning Sufficient for Instruction Following in LLMs?
 Aligning to Thousands of Preferences via System Message Generalization
 DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
 Generating Query Recommendations via LLMs
 Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding
 Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning
 Position: Foundation Agents as the Paradigm Shift for Decision Making
 PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression
 A Survey on Vision-Language-Action Models for Embodied AI
 Large Language Models Can Self-Correct with Minimal Effort
 Language Models with Conformal Factuality Guarantees
 Prompt Optimization with Human Feedback
 GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction
 RealitySummary: On-Demand Mixed Reality Document Enhancement using Large Language Models
 Prompt Optimization with EASE? Efficient Ordering-aware Automated Selection of Exemplars
 Certifiably Robust RAG against Retrieval Corruption
 Want To Reduce Labeling Cost? GPT-3 Can Help
 Embedding-Aligned Language Models
 Qiskit Code Assistant: Training LLMs for generating Quantum Computing Code
 A Survey of Multimodal Large Language Model from A Data-centric Perspective
 Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
 LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models
 CHIQ: Contextual History Enhancement for Improving Query Rewriting in Conversational Search
 SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales
 Large Language Models are Zero-Shot Next Location Predictors
 There and Back Again: The AI Alignment Paradox
 Expanded Gating Ranges Improve Activation Functions
 Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
 The Geometry of Categorical and Hierarchical Concepts in Large Language Models
 Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
 SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought
 Grokfast: Accelerated Grokking by Amplifying Slow Gradients
 Stress-Testing Capability Elicitation With Password-Locked Models
 Knowledge Circuits in Pretrained Transformers
 Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
 Learning the Language of Protein Structure
 Zyda: A 1.3T Dataset for Open Language Modeling
 SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model
 MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
 Towards Scalable Automated Alignment of LLMs: A Survey
 Pretrained Hybrids with MAD Skills
 Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
 BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling
 Controlling Large Language Model Agents with Entropic Activation Steering
 A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs' Humour Alignment with Comedians
 Transfer Q Star: Principled Decoding for LLM Alignment
 Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation
 From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
 ChatDB: Augmenting LLMs with Databases as Their Symbolic Memory
 To Believe or Not to Believe Your LLM
 Scalable MatMul-free Language Modeling
 Meta-Designing Quantum Experiments with Language Models
 Extended Mind Transformers
 Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
 LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback
 How to Understand Whole Software Repository?
 When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs
 Automated Focused Feedback Generation for Scientific Writing Assistance
 PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM
 CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
 LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters
 Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
 LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
 PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs
 Item-Language Model for Conversational Recommendation
 Block Transformer: Global-to-Local Language Modeling for Fast Inference
 Parrot: Multilingual Visual Instruction Tuning
 Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data
 Teams of LLM Agents can Exploit Zero-Day Vulnerabilities
 A Study of Optimizations for Fine-tuning Large Language Models
 Bridging Mini-Batch and Asymptotic Analysis in Contrastive Learning: From InfoNCE to Kernel-Based Losses
 The Impossibility of Fair LLMs
 Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
 AgentGym: Evolving Large Language Model-based Agents across Diverse Environments
 Are We Done with MMLU?
 ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
 Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training
 QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
 Pre-trained Large Language Models Use Fourier Features to Compute Addition
 CLMASP: Coupling Large Language Models with Answer Set Programming for Robotic Task Planning
 PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs
 Chain of Agents: Large Language Models Collaborating on Long-Context Tasks
 DiffUHaul: A Training-Free Method for Object Dragging in Images
 Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model
 ABodyBuilder3: Improved and scalable antibody structure predictions
 A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models
 DsDm: Model-Aware Dataset Selection with Datamodels
 Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
 Improving Alignment and Robustness with Short Circuiting
 Semantically Diverse Language Generation for Uncertainty Estimation in Language Models
 Matching Anything by Segmenting Anything
 What Do Language Models Learn in Context? The Structured Task Hypothesis
 Scaling and evaluating sparse autoencoders
 Verbalized Machine Learning: Revisiting Machine Learning with Language Models
 Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
 Iteration Head: A Mechanistic Study of Chain-of-Thought
 Incremental Comprehension of Garden-Path Sentences by Large Language Models: Semantic Interpretation, Syntactic Re-Analysis, and Attention
 Does your data spark joy? Performance gains from domain upsampling at the end of training
 WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
 CRAG -- Comprehensive RAG Benchmark
 Mixture-of-Agents Enhances Large Language Model Capabilities
 Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
 MAIRA-2: Grounded Radiology Report Generation
 Proofread: Fixes All Errors with One Tap
 NATURAL PLAN: Benchmarking LLMs on Natural Language Planning
 Large Language Model Confidence Estimation via Black-Box Access
 Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
 Towards a Personal Health Large Language Model
 Tx-LLM: A Large Language Model for Therapeutics
 ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
 Unified Text-to-Image Generation and Retrieval
 VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
 BERTs are Generative In-Context Learners
 Is Free Self-Alignment Possible?
 TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools
 CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
 Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters
 Creativity Has Left the Chat: The Price of Debiasing Language Models
 UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor
 Can Language Models Serve as Text-Based World Simulators?
 How Far Can Transformers Reason? The Locality Barrier and Inductive Scratchpad
 Contrastive learning of T cell receptor representations
 Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching
 Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
 MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
 MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering
 The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
 OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models
 Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
 A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks
 On the Reliability of Watermarks for Large Language Models
 A Survey of Diffusion Models in Natural Language Processing
 Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
 Learning to Grow Pretrained Models for Efficient Transformer Training
 An Image is Worth 32 Tokens for Reconstruction and Generation
 Simple and Effective Masked Diffusion Language Models
 Instant 3D Human Avatar Generation using Image Diffusion Models
 TextGrad: Automatic "Differentiation" via Text
 Spectrum: Targeted Training on Signal to Noise Ratio
 VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
 Multimodal Belief Prediction
 McEval: Massively Multilingual Code Evaluation
 Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
 Merging Improves Self-Critique Against Jailbreak Attacks
 Confabulation: The Surprising Value of Large Language Model Hallucinations
 The Prompt Report: A Systematic Survey of Prompting Techniques
 Improve Mathematical Reasoning in Language Models by Automated Process Supervision
 MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering
 Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
 Parallelizing Linear Transformers with the Delta Rule over Sequence Length
 LLM Dataset Inference: Did you train on my dataset?
 Towards Lifelong Learning of Large Language Models: A Survey
 PowerInfer-2: Fast Large Language Model Inference on a Smartphone
 LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages
 Attention as a Hypernetwork
 ConStat: Performance-Based Contamination Detection in Large Language Models
 What If We Recaption Billions of Web Images with LLaMA-3?
 Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
 Discovering Preference Optimization Algorithms with and for Large Language Models
 Large Language Models Must Be Taught to Know What They Don't Know
 An Empirical Study of Mamba-based Language Models
 Collective Constitutional AI: Aligning a Language Model with Public Input
 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
 Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
 Explore the Limits of Omni-modal Pretraining at Scale
 Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
 Large Language Model Unlearning via Embedding-Corrupted Prompts
 Grounding Multimodal Large Language Models in Actions
 BertaQA: How Much Do Language Models Know About Local Culture?
 VCR: Visual Caption Restoration
 Hibou: A Family of Foundational Vision Transformers for Pathology
 Repurposing Language Models into Embedding Models: Finding the Compute-Optimal Recipe
 Luna: An Evaluation Foundation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost
 Improving Retrieval for RAG based Question Answering Models on Financial Documents
 On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
 MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
 Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
 MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations
 Optimised Grouped-Query Attention Mechanism for Transformers
 Transformers meet Neural Algorithmic Reasoners
 MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
 Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
 OpenVLA: An Open-Source Vision-Language-Action Model
 ReMI: A Dataset for Reasoning with Multiple Images
 Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
 EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts
 Investigating the translation capabilities of Large Language Models trained on parallel data only
 Multi-Agent Software Development through Cross-Team Collaboration
 BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
 mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
 UnO: Unsupervised Occupancy Fields for Perception and Forecasting
 HelpSteer2: Open-source dataset for training top-performing reward models
 Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs
 Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus
 CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
 Real2Code: Reconstruct Articulated Objects via Code Generation
 DafnyBench: A Benchmark for Formal Software Verification
 Estimating the Hallucination Rate of Generative AI
 RWKV-CLIP: A Robust Vision-Language Representation Learner
 CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
 Early Weight Averaging meets High Learning Rates for LLM Pre-training
 Text Embeddings by Weakly-Supervised Contrastive Pre-training
 Promptagator: Few-shot Dense Retrieval From 8 Examples
 RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder
 InPars: Data Augmentation for Information Retrieval using Large Language Models
 Reconciling Kaplan and Chinchilla Scaling Laws
 Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation
 Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
 SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals
 Cycles of Thought: Measuring LLM Confidence through Stable Explanations
 From Tarzan to Tolkien: Controlling the Language Proficiency Level of LLMs for Content Generation
 Can't Hide Behind the API: Stealing Black-Box Commercial Embedding Models
 Are you still on track!? Catching LLM Task Drift with Activations
 Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement
 UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback
 Quantifying Variance in Evaluation Benchmarks
 Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs
 Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
 BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
 Evaluation of Large Language Models: STEM education and Gender Stereotypes
 Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation
 Mixture-of-Subspaces in Low-Rank Adaptation
 ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
 CliBench: Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses, Procedures, Lab Tests Orders and Prescriptions
 GEB-1.3B: Open Lightweight Large Language Model
 Rapport-Driven Virtual Agent: Rapport Building Dialogue Strategy for Improving User Experience at First Meeting
 A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery
 Large language model validity via enhanced conformal prediction methods
 Decoding the Diversity: A Review of the Indic AI Research Landscape
 Advancing High Resolution Vision-Language Models in Biomedicine
 Bayesian Statistical Modeling with Predictors from LLMs
 Self-Supervised Speech Representations are More Phonetic than Semantic
 OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
 Needle In A Multimodal Haystack
 mDPO: Conditional Preference Optimization for Multimodal Large Language Models
 Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%
 DataComp-LM: In search of the next generation of training sets for language models
 Transcoders Beat Sparse Autoencoders for Interpretability
 Set-Based Prompting: Provably Solving the Language Model Order Dependency Problem
 The Curse of Popularity: Popular Entities have Catastrophic Side Effects when Deleting Knowledge from Language Models
 Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models
 Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning
 MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
 Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
 Language Modeling with Editable External Knowledge
 WPO: Enhancing RLHF with Weighted Preference Optimization
 VideoLLM-online: Online Video Large Language Model for Streaming Video
 How Do Large Language Models Acquire Factual Knowledge During Pretraining?
 Task Me Anything
 Refusal in Language Models Is Mediated by a Single Direction
 DB-GPT-Hub: Towards Open Benchmarking Text-to-SQL Empowered by Large Language Models
 Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis
 GUICourse: From General Vision Language Models to Versatile GUI Agents
 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
 In-Context Editing: Learning Knowledge from Self-Induced Distributions
 WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
 THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation
 Breaking the Attention Bottleneck
 STAR: SocioTechnical Approach to Red Teaming Language Models
 GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents
 HiddenTables & PyQTax: A Cooperative Game and Dataset For TableQA to Ensure Scale and Data Privacy Across a Myriad of Taxonomies
 CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training
 AudioPaLM: A Large Language Model That Can Speak and Listen
 ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
 RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
 MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators
 ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation
 Full Parameter Fine-tuning for Large Language Models with Limited Resources
 Improving Multi-Agent Debate with Sparse Communication Topology
 Meta Reasoning for Large Language Models
 A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression
 Unifying Multimodal Retrieval via Document Screenshot Embedding
 Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning
 Deep Bayesian Active Learning for Preference Modeling in Large Language Models
 OLMES: A Standard for Language Model Evaluations
 Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement
 What Are the Odds? Language Models Are Capable of Probabilistic Reasoning
 From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries
 OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
 Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
 Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages
 Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models
 News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-lingual News Recommendation
 Open-Source Web Service with Morphological Dictionary-Supplemented Deep Learning for Morphosyntactic Analysis of Czech
 Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models
 JEN-1 DreamStyler: Customized Musical Concept Learning via Pivotal Parameters Tuning
 VoCo-LLaMA: Towards Vision Compression with Large Language Models
 TroL: Traversal of Layers for Large Language and Vision Models
 BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
 Statistical Uncertainty in Word Embeddings: GloVe-V
 Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks
 Large Scale Transfer Learning for Tabular Data via Language Modeling
 Transcoders Find Interpretable LLM Feature Circuits
 DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
 RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
 Tokenization Falling Short: The Curse of Tokenization
 Can LLM be a Personalized Judge?
 NAST: Noise Aware Speech Tokenization for Speech Language Models
 Bootstrapping Language Models with DPO Implicit Rewards
 The Impact of Initialization on LoRA Finetuning Dynamics
 StatBot.Swiss: Bilingual Open Data Exploration in Natural Language
 Adversarial Attacks on Multimodal Agents
 Estimating Knowledge in Large Language Models Without Generating a Single Token
 Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning
 Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models
 Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways
 From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
 A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges
 Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
 Long Code Arena: a Set of Benchmarks for Long-Context Code Models
 Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
 Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs
 MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
 Instruction Pre-Training: Language Models are Supervised Multitask Learners
 LLMatDesign: Autonomous Materials Discovery with Large Language Models
 Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
 AgentReview: Exploring Peer Review Dynamics with LLM Agents
 $τ$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
 Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
 Are LLMs Naturally Good at Synthetic Tabular Data Generation?
 DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
 Measuring memorization in RLHF for code completion
 Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
 Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models
 Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance
 garak: A Framework for Security Probing Large Language Models
 Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities
 GenQA: Generating Millions of Instructions from a Handful of Prompts
 Transferring Knowledge from Large Foundation Models to Small Downstream Models
 NYU CTF Dataset: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
 Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
 Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
 Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
 DeciMamba: Exploring the Length Extrapolation Potential of Mamba
 Evidence of a log scaling law for political persuasion with large language models
 LiveMind: Low-latency Large Language Models with Simultaneous Inference
 Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning
 Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation
 Improving Visual Commonsense in Language Models via Multiple Image Generation
 Nicer Than Humans: How do Large Language Models Behave in the Prisoner's Dilemma?
 Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
 PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers
 Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level
 HARE: HumAn pRiors, a key to small language model Efficiency
 Delving into ChatGPT usage in academic writing through excess vocabulary
 A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems
 Interpretability of Language Models via Task Spaces
 Surface Form Competition: Why the Highest Probability Answer Isn't Always Right
 Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
 CodeRAG-Bench: Can Retrieval Augment Code Generation?
 A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models
 Large Language Models are Null-Shot Learners
 SGLang: Efficient Execution of Structured Language Model Programs
 LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
 Reward Steering with Evolutionary Heuristics for Decoding-time Alignment
 Retrieve-Plan-Generation: An Iterative Planning and Answering Framework for Knowledge-Intensive LLM Generation
 Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models
 How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions
 Learning to Retrieve Iteratively for In-Context Learning
 Jailbreaking as a Reward Misspecification Problem
 Information Guided Regularization for Fine-tuning Language Models
 Unlocking the Global Synergies in Low-Rank Adapters
 Towards Retrieval Augmented Generation over Large Video Libraries
 DiPEx: Dispersing Prompt Expansion for Class-Agnostic Object Detection
 Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework
 RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation
 Exploring Design Choices for Building Language-Specific LLMs
 ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights
 RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold
 African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification
 Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task
 Data Contamination Can Cross Language Barriers
 Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
 Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models
 Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report
 Probing the Decision Boundaries of In-context Learning in Large Language Models
 CancerLLM: A Large Language Model in Cancer Domain
 CarLLaVA: Vision language models for camera-only closed-loop driving
 Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
 DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
 Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
 VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
 BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
 OATH-Frames: Characterizing Online Attitudes Towards Homelessness with LLM Assistants
 Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
 Long Context Transfer from Language to Vision
 Efficient Continual Pre-training by Mitigating the Stability Gap
 VDebugger: Harnessing Execution Feedback for Debugging Visual Programs
 Sparse High Rank Adapters
 Flow of Reasoning: Efficient Training of LLM Policy with Divergent Thinking
 What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages
 WARP: On the Benefits of Weight Averaged Rewarded Policies
 Scaling Laws for Linear Complexity Language Models
 LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
 Preference Tuning For Toxicity Mitigation Generalizes Across Languages
 FIRST: Faster Improved Listwise Reranking with Single Token Decoding
 InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context
 Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers
 AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models
 Confidence Regulation Neurons in Language Models
 Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization
 Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
 Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models
 video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
 How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics
 Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations
 Hallucination is Inevitable: An Innate Limitation of Large Language Models
 EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
 From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
 Steering Without Side Effects: Improving Post-Deployment Control of Language Models
 Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
 Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network
 PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
 MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate
 PostMark: A Robust Blackbox Watermark for Large Language Models
 Can LLMs Learn Macroeconomic Narratives from Social Media?
 Embodied Instruction Following in Unknown Environments
 MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
 Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon
 Data curation via joint example selection further accelerates multimodal learning
 From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
 Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
 LongIns: A Challenging Long-context Instruction-based Exam for LLMs
 Multi-property Steering of Large Language Models with Dynamic Activation Composition
 The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
 Benchmarking Mental State Representations in Language Models
 Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
 Delving into the Utilisation of ChatGPT in Scientific Publications in Astronomy
 How to Compute the Probability of a Word
 Unlocking Continual Learning Abilities in Language Models
 Large Language Models Assume People are More Rational than We Really are
 Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track
 Finding Transformer Circuits with Edge Pruning
 A mathematical perspective on Transformers
 Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
 On the Transformations across Reward Model, Parameter Update, and In-Context Prompt
 LLMs' Classification Performance is Overclaimed
 Cross-Modality Safety Alignment
 Bridging Law and Data: Augmenting Reasoning via a Semi-Structured Dataset with IRAC methodology
 Preference Distillation for Personalized Generative Recommendation
 DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents
 Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG
 Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
 MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
 Associative Recurrent Memory Transformer
 Symbolic Learning Enables Self-Evolving Agents
 CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
 WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
 WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
 From Rewriting to Remembering: Common Ground for Conversational QA Models
 Adversarial Search Engine Optimization for Large Language Models
 A Closer Look into Mixture-of-Experts in Large Language Models
 Multimodal foundation world models for generalist embodied agents
 Do they mean 'us'? Interpreting Referring Expressions in Intergroup Bias
 MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
 Efficacy of Language Model Self-Play in Non-Zero-Sum Games
 Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
 Large Language Models are Interpretable Learners
 Are Language Models Actually Useful for Time Series Forecasting?
 CAVE: Controllable Authorship Verification Explanations
 Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers
 EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records
 One Thousand and One Pairs: A "novel" challenge for long-context language models
 Breaking the Frame: Image Retrieval by Visual Overlap Prediction
 CaT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans
 Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
 GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models
 EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms
 Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation
 A Benchmark for Learning to Translate a New Language from One Grammar Book
 OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
 Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
 Aligning Teacher with Student Preferences for Tailored Training Data Generation
 Simulating Classroom Education with LLM-Empowered Agents
 SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
 Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models
 Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
 LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users
 MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
 Can LLMs Learn by Teaching? A Preliminary Study
 The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models
 Is Programming by Example solved by LLMs?
 Suri: Multi-constraint Instruction Following for Long-form Text Generation
 Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?
 LiveBench: A Challenging, Contamination-Free LLM Benchmark
 From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data
 VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation
 Revealing Fine-Grained Values and Opinions in Large Language Models
 T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
 Manipulate-Anything: Automating Real-World Robots using Vision-Language Models
 MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
 Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation
 ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models
 ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs
 News Deja Vu: Connecting Past and Present with Semantic Search
 Contrastive Entity Coreference and Disambiguation for Historical Texts
 SAIL: Self-Improving Efficient Online Alignment of Large Language Models
 AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models
 Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets
 BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
 Reasoning or Simply Next Token Prediction? A Benchmark for Stress-Testing Large Language Models
 Macroeconomic Forecasting with Large Language Models
 Self-Retrieval: Building an Information Retrieval System with One Large Language Model
 Cognitive Architectures for Language Agents
 Adaptable Logical Control for Large Language Models
 The Factorization Curse: Which Tokens You Predict Underlie the Reversal Curse and More
 DistiLRR: Transferring Code Repair for Low-Resource Programming Languages
 A Critical Study of What Code-LLMs (Do Not) Learn
 "Is ChatGPT a Better Explainer than My Professor?": Evaluating the Explanation Capabilities of LLMs in Conversation Compared to a Human Baseline
 Efficient Evolutionary Search Over Chemical Space with Large Language Models
 Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
 Understanding and Mitigating Language Confusion in LLMs
 LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
 Scaling Synthetic Data Creation with 1,000,000,000 Personas
 ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning
 The Remarkable Robustness of LLMs: Stages of Inference?
 HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
 Following Length Constraints in Instructions
 AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation
 Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
 Molecular Facts: Desiderata for Decontextualization in LLM Fact Verification
 EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
 Direct Preference Knowledge Distillation for Large Language Models
 Investigating How Large Language Models Leverage Internal Knowledge to Perform Complex Reasoning
 Monitoring Latent World States in Language Models with Propositional Probes
 RouteLLM: Learning to Route LLMs with Preference Data
 LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
 DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph
 RaTEScore: A Metric for Radiology Report Generation
 PhyloLM: Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks
 Flora: Low-Rank Adapters Are Secretly Gradient Compressors
 Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
 Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
 Scaling Laws for Fact Memorization of Large Language Models
 Less is More: Accurate Speech Recognition & Translation without Web-Scale Data
 RegMix: Data Mixture as Regression for Language Model Pre-training
 LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives
 DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging
 ColPali: Efficient Document Retrieval with Vision Language Models
 Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
 Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
 We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
 Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER
 MIRAI: Evaluating LLM Agents for Event Forecasting
 Searching for Best Practices in Retrieval-Augmented Generation
 $\text{Memory}^3$: Language Modeling with Explicit Memory
 Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation
 BERGEN: A Benchmarking Library for Retrieval-Augmented Generation
 M2QA: Multi-domain Multilingual Question Answering
 Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
 Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs
 Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
 MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
 Brevity is the soul of wit: Pruning long files for code generation
 The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention
 From RAG to RICHES: Retrieval Interlaced with Sequence Generation
 LiteSearch: Efficacious Tree Search for LLM
 Detection and Measurement of Syntactic Templates in Generated Text
 Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
 OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
 Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models
 UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
 T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
 Compressing Search with Language Models
 Combinatorial Reasoning: Selecting Reasons in Generative AI Pipelines via Combinatorial Optimization
 ProgressGym: Alignment with a Millennium of Moral Progress
 The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models
 Changing Answer Order Can Decrease MMLU Accuracy
 MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
 Understanding Alignment in Multimodal LLMs: A Comprehensive Study
 ValueScope: Unveiling Implicit Norms and Values via Return Potential Model of Social Interactions
 Why does in-context learning fail sometimes? Evaluating in-context learning on open and closed questions
 To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models
 μ-Bench: A Vision-Language Benchmark for Microscopy Understanding
 A Review of Large Language Models and Autonomous Agents in Chemistry
 Agentless: Demystifying LLM-based Software Engineering Agents
 Eliminating Position Bias of Language Models: A Mechanistic Approach
 Resolving Discrepancies in Compute-Optimal Scaling of Language Models
 OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset
 FLoRA: Low-Rank Core Space for N-dimension
 InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
 TokenPacker: Efficient Visual Projector for Multimodal LLM
 Investigating Decoder-only Large Language Models for Speech-to-text Translation
 Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models
 Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
 Evaluating Human Alignment and Model Faithfulness of LLM Rationale
 Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
 On the Limitations of Fine-tuned Judge Models for LLM Evaluation
 Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
 LeanDojo: Theorem Proving with Retrieval-Augmented Language Models
 On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
 Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning
 Tweetorial Hooks: Generative AI Tools to Motivate Science on Social Media
 A Solvable Model of Neural Scaling Laws
 Hopfield Networks is All You Need
 Improving Transformer Models by Reordering their Sublayers
 A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses
 Prompt Stability Scoring for Text Annotation with Large Language Models
 Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application
 AI-native Memory: A Pathway from LLMs Towards AGI
 Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
 From Efficient Multimodal Models to World Models: A Survey
 Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments
 LLMs can learn self-restraint through iterative self-reflection
 ReGround: Improving Textual and Spatial Grounding at No Cost
 EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria
 Large language models can accurately predict searcher preferences
 Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering
 Large Language Models Enable Few-Shot Clustering
 LM vs LM: Detecting Factual Errors via Cross Examination
 Perspectives on Large Language Models for Relevance Judgment
 Human-like Summarization Evaluation with ChatGPT
 ChatGPT as a Factual Inconsistency Evaluator for Text Summarization
 Self-Evaluation as a Defense Against Adversarial Attacks on LLMs
 How Does Quantization Affect Multilingual LLMs?
 Are Large Language Models Consistent over Value-laden Questions?
 RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs
 Tree Search for Language Model Agents
 Towards Compositionality in Concept Learning
 Unified Auto-Encoding with Masked Diffusion
 GraphEdit: Large Language Models for Graph Structure Learning
 Meta Large Language Model Compiler: Foundation Models of Compiler Optimization
 LLM-Select: Feature Selection with Large Language Models
 Improving Reward Models with Synthetic Critiques
 JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
 An Interactive Multi-modal Query Answering System with Retrieval-Augmented Large Language Models
 Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
 Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
 On scalable oversight with weak LLMs judging strong LLMs
 Fast Forwarding Low-Rank Training
 Learning to (Learn at Test Time): RNNs with Expressive Hidden States
 XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models
 AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents
 Mixture of A Million Experts
 DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning
 FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
 LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
 Anthropocentric bias and the possibility of artificial cognition
 AgentInstruct: Toward Generative Teaching with Agentic Flows
 HEMM: Holistic Evaluation of Multimodal Foundation Models
 Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
 52B to 1T: Lessons Learned via Tele-FLM Series
 Reasoning in Large Language Models: A Geometric Perspective
 RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs
 Predicting vs. Acting: A Trade-off Between World Modeling & Agent Modeling
 Synthetic Multimodal Question Generation
 Unveiling Encoder-Free Vision-Language Models
 $\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens
 Distilling System 2 into System 1
 LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages
 RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
 Granular Privacy Control for Geolocation with Vision Language Models
 VRSD: Rethinking Similarity and Diversity for Retrieval in Large Language Models
 Zero-shot Persuasive Chatbots with LLM-Generated Strategies and Information Retrieval
 Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course
 FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation
 Multi-Object Hallucination in Vision-Language Models
 ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
 Merge, Ensemble, and Cooperate! A Survey on Collaborative Strategies in the Era of Large Language Models
 From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty
 PAS: Data-Efficient Plug-and-Play Prompt Augmentation System
 An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models
 InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct
 LLMBox: A Comprehensive Library for Large Language Models
 Training Task Experts through Retrieval Based Distillation
 Language Models Encode Collaborative Signals in Recommendation
 ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models
 When LLMs Play the Telephone Game: Cumulative Changes and Attractors in Iterated Cultural Transmissions
 LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking
 Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction
 MeMemo: On-device Retrieval Augmentation for Private and Personalized Text Generation
 Machine Unlearning Fails to Remove Data Poisoning Attacks
 BeHonest: Benchmarking Honesty in Large Language Models
 Emu: Generative Pretraining in Multimodality
 Enabling Large Language Models to Generate Text with Citations
 Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities
 Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
 Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
 Vision language models are blind
 Composable Interventions for Language Models
 A Single Transformer for Scalable Vision-Language Modeling
 MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension
 Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs
 Decoding-Time Language Model Alignment with Multiple Objectives
 WebCanvas: Benchmarking Web Agents in Online Environments
 SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
 Visual representations in the human brain are aligned with large language models
 LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
 RAG vs. Long Context: Examining Frontier Large Language Models for Environmental Review Document Comprehension
 Inference Performance Optimization for Large Language Models on CPUs
 LETS-C: Leveraging Language Embedding for Time Series Classification
 Just read twice: closing the recall gap for recurrent language models
 How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions
 TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts
 CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging
 Knowledge Composition using Task Vectors with Learned Anisotropic Scaling
 MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
 Forcing Diffuse Distributions out of Language Models
 Evaluating LLMs at Detecting Errors in LLM Responses
 LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models
 R-Tuning: Instructing Large Language Models to Say 'I Don't Know'
 Label Supervised LLaMA Finetuning
 Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
 PentestGPT: An LLM-empowered Automatic Penetration Testing Tool
 Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses
 LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models
 Review-LLM: Harnessing Large Language Models for Personalized Review Generation
 Do Vision and Language Models Share Concepts? A Vector Space Alignment Study
 MAVIS: Mathematical Visual Instruction Tuning
 Automata-based constraints for language model decoding
 GTA: A Benchmark for General Tool Agents
 SEED-Story: Multimodal Long Story Generation with Large Language Model
 The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
 Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On
 PersonaRAG: Enhancing Retrieval-Augmented Generation Systems with User-Centric Agents
 DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
 Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
 DiscoveryBench: Towards Data-Driven Discovery with Large Language Models
 MiniCache: KV Cache Compression in Depth Dimension for Large Language Models
 Genomic Language Models: Opportunities and Challenges
 Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
 Self-Recognition in Language Models
 Deconstructing What Makes a Good Optimizer for Language Models
 Teaching Transformers Causal Reasoning through Axiomatic Training
 Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
 CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation
 ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context
 Why are Visually-Grounded Language Models Bad at Image Classification?
 LoQT: Low Rank Adapters for Quantized Training
 Metron: Holistic Performance Evaluation Framework for LLM Inference Systems
 Lynx: An Open Source Hallucination Evaluation Model
 Mitigating Catastrophic Forgetting in Language Transfer via Model Merging
 LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
 Human-like Episodic Memory for Infinite Context LLMs
 MUSCLE: A Model Update Strategy for Compatible LLM Evolution
 H2O-Danube3 Technical Report
 Context Embeddings for Efficient Answer Generation in RAG
 SpreadsheetLLM: Encoding Spreadsheets for Large Language Models
 RoboMorph: Evolving Robot Morphology using Large Language Models
 SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
 Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
 New Desiderata for Direct Preference Optimization
 Characterizing Prompt Compression Methods for Long Context Inference
 Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency
 Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
 MUSE: Machine Unlearning Six-Way Evaluation for Language Models
 Accuracy is Not All You Need
 AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models
 Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs
 Universal Neurons in GPT2 Language Models
 Agent Instructs Large Language Models to be General Zero-Shot Reasoners
 Qwen2 Technical Report
 The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
 LAB-Bench: Measuring Capabilities of Language Models for Biology Research
 Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
 MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models
 Representing Rule-based Chatbots with Transformers
 Learning to Refuse: Towards Mitigating Privacy Risks in LLMs
 Benchmarking Language Model Creativity: A Case Study on Code Generation
 Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules
 Spontaneous Reward Hacking in Iterative Self-Refinement
 From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients
 Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
 LLM Circuit Analyses Are Consistent Across Training and Scale
 Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation
 NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?
 Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development
 Fast Matrix Multiplications for Lookup Table-Quantized LLMs
 Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together
 Bridging the Gap Between Information Seeking and Product Search Systems: Q&A Recommendation for E-commerce
 When is the consistent prediction likely to be a correct prediction?
 Transformer tricks: Removing weights for skipless transformers
 Transformers represent belief state geometry in their residual stream
 Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step
 How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition
 #InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models
 A Preliminary Study of the Intrinsic Relationship between Complexity and Alignment
 SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
 OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
 Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models
 A Survey on LoRA of Large Language Models
 No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations
 LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
 Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
 Patch-Level Training for Large Language Models
 E5-V: Universal Embeddings with Multimodal Large Language Models
 Case2Code: Learning Inductive Reasoning with Synthetic Data
 AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
 Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models
 Splatfacto-W: A Nerfstudio Implementation of Gaussian Splatting for Unconstrained Photo Collections
 The Art of Saying No: Contextual Noncompliance in Language Models
 EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
 NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
 GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression
 Practical Unlearning for Large Language Models
 Does Refusal Training in LLMs Generalize to the Past Tense?
 Automatic Prompt Optimization with "Gradient Descent" and Beam Search
 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
 CodeV: Empowering LLMs for Verilog Generation through Multi-Level Summarization
 Understanding Reference Policies in Direct Preference Optimization
 Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation
 PM-LLM-Benchmark: Evaluating Large Language Models on Process Mining Tasks
 Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
 Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study
 Attention Overflow: Language Model Input Blur during Long-Context Missing Items Recommendation
 Weak-to-Strong Reasoning
 Direct-Inverse Prompting: Analyzing LLMs' Discriminative Capacity in Self-Improving Generation
 Benchmarking Vision Language Models for Cultural Understanding
 DebUnc: Mitigating Hallucinations in Large Language Model Agent Communication with Uncertainty Estimations
 Discovering Bias in Latent Space: An Unsupervised Debiasing Approach
 A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks
 DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems
 Scaling Granite Code Models to 128K Context
 BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
 VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
 Understanding Counting in Small Transformers: The Interplay between Attention and Feed-Forward Layers
 Sibyl: Simple yet Effective Agent Framework for Complex Real-world Reasoning
 Lean-STaR: Learning to Interleave Thinking and Proving
 GAVEL: Generating Games Via Evolution and Language Models
 Transformer Layers as Painters
 AUITestAgent: Automatic Requirements Oriented GUI Function Testing
 Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
 Training on the Test Task Confounds Evaluation and Emergence
 The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing
 PaliGemma: A versatile 3B VLM for transfer
 A Survey on Mixture of Experts
 Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning
 Consent in Crisis: The Rapid Decline of the AI Data Commons
 ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
 Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
 The Vision of Autonomic Computing: Can LLMs Make It a Reality?
 EVLM: An Efficient Vision-Language Model for Visual Understanding
 LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
 Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle
 Internal Consistency and Self-Feedback in Large Language Models: A Survey
 Qalam: A Multimodal LLM for Arabic Optical Character and Handwriting Recognition
 SciCode: A Research Coding Benchmark Curated by Scientists
 VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
 Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning
 VideoGameBunny: Towards vision assistants for video games
 GET-Zero: Graph Embodiment Transformer for Zero-shot Embodiment Generalization
 NNsight and NDIF: Democratizing Access to Foundation Model Internals
 Fractal Patterns May Illuminate the Success of Next-Token Prediction
 WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
 NV-Retriever: Improving text embedding models with effective hard-negative mining
 Efficient Retrieval with Learned Similarities
 Knowledge Mechanisms in Large Language Models: A Survey and Perspective
 Gated Linear Attention Transformers with Hardware-Efficient Training
 SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM
 Discrete Flow Matching
 MIBench: Evaluating Multimodal Large Language Models over Multiple Images
 BOND: Aligning LLMs with Best-of-N Distillation
 Foundational Models Defining a New Era in Vision: A Survey and Outlook
 Shared Imagination: LLMs Hallucinate Alike
 Aligning Large Language Models with Human: A Survey
 Compact Language Models via Pruning and Knowledge Distillation
 Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization
 CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis
 Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings
 IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
 To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability
 Demystifying Chains, Trees, and Graphs of Thoughts
 INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
 The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
 PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
 Testing Occupational Gender Bias in Language Models: Towards Robust Measurement and Zero-Shot Debiasing
 PERSONA: A Reproducible Testbed for Pluralistic Alignment
 Scalify: scale propagation for efficient low-precision LLM training
 Reinforced Prompt Personalization for Recommendation with Large Language Models
 OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
 DDK: Distilling Domain Knowledge for Efficient Large Language Models
 Course-Correction: Safety Alignment Using Synthetic Preferences
 Longhorn: State Space Models are Amortized Online Learners
 u-$μ$P: The Unit-Scaled Maximal Update Parametrization
 Recursive Introspection: Teaching Language Model Agents How to Self-Improve
 Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
 Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
 Fluent Student-Teacher Redteaming
 Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data?
 Efficient Inference of Vision Instruction-Following Models with Elastic Cache
 Very Large-Scale Multi-Agent Simulation in AgentScope
 AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
 $VILA^2$: VILA Augmented VILA
 AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
 Do Generative AI Models Output Harm while Representing Non-Western Cultures: Evidence from A Community-Centered Approach
 Visual Haystacks: Answering Harder Questions About Sets of Images
 Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic
 Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
 Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data
 ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
 Prover-Verifier Games improve legibility of LLM outputs
 Exploring Advanced Large Language Models with LLMsuite
 Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques
 The Fine-Tuning Paradox: Boosting Translation Quality Without Sacrificing LLM Abilities
 Vectoring Languages
 Fine-grained Analysis of In-context Linear Estimation: Data, Architecture, and Beyond
 LoRA-Pro: Are Low-Rank Adapters Properly Optimized?
 RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation
 RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering
 The Art of Refusal: A Survey of Abstention in Large Language Models
 SALMON: Self-Alignment with Instructable Reward Models
 Small Molecule Optimization with Large Language Models
 Generation Constraint Scaling Can Mitigate Hallucination
 A Survey on Employing Large Language Models for Text-to-SQL Tasks
 Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
 Prompt Injection Attacks on Large Language Models in Oncology
 MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
 Theia: Distilling Diverse Vision Foundation Models for Robot Learning
 Diffusion Feedback Helps CLIP See Better
 Sentiment Analysis of Lithuanian Online Reviews Using Large Language Models
 VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks
 SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages
 Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
 SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain
 Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models
 Integrating Large Language Models into a Tri-Modal Architecture for Automated Depression Classification
 MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
 PersonaGym: Evaluating Persona Agents and LLMs
 MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training
 When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
 Transformers need glasses! Information over-squashing in language tasks
 ThinK: Thinner Key Cache by Query-Driven Pruning
 Meltemi: The first open Large Language Model for Greek
 Adapting Safe-for-Work Classifier for Malaysian Language Text: Enhancing Alignment in LLM-Ops Framework
 Machine Unlearning in Generative AI: A Survey
 A Large Encoder-Decoder Family of Foundation Models For Chemical Language
 Bringing AI Participation Down to Scale: A Comment on Open AIs Democratic Inputs to AI Project
 AI-Assisted Generation of Difficult Math Questions
 Matryoshka-Adaptor: Unsupervised and Supervised Tuning for Smaller Embedding Dimensions
 Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models
 Demystifying Verbatim Memorization in Large Language Models
 Can LLMs be Fooled? Investigating Vulnerabilities in LLMs
 Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
 The Llama 3 Herd of Models
 ShieldGemma: Generative AI Content Moderation Based on Gemma
 MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
 Adaptive Retrieval-Augmented Generation for Conversational Systems
 Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
 PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
 Latxa: An Open Language Model and Evaluation Suite for Basque
 Improving Retrieval Augmented Language Model with Self-Reasoning
 Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
 Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
 Data Contamination Report from the 2024 CONDA Shared Task
 Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning
 Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack
 Are LLMs classical or nonmonotonic reasoners? Lessons from generics
 MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
 AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation
 Tamper-Resistant Safeguards for Open-Weight LLMs
 Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model
 Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning
 OmniParser for Pure Vision Based GUI Agent
 Finch: Prompt-guided Key-Value Cache Compression
 Gemma 2: Improving Open Language Models at a Practical Size
 Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs
 Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
 An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
 Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning
 Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions
 On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs
 A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
 $\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs
 Apple Intelligence Foundation Language Models
 Multi-group Uncertainty Quantification for Long-form Text Generation
 MaskInversion: Localized Embeddings via Optimization of Explainability Maps
 Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning
 Transformers are Universal In-context Learners
 RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
 In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation
 Leveraging LLM Reasoning Enhances Personalized Recommender Systems
 Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost
 MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
 A Survey of Mamba
 Jailbreaking Text-to-Image Models with LLM-Based Agents
 Learning Effective Representations for Retrieval Using Self-Distillation with Adaptive Relevance Margins
 MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training
 Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks
 Generative Retrieval with Preference Optimization for E-commerce Search
 The Geometry of Queries: Query-Based Innovations in Retrieval-Augmented Generation
 Improving Retrieval in Sponsored Search by Leveraging Query Context Signals
 GRAD-SUM: Leveraging Gradient Summarization for Optimal Prompt Engineering
 Crafting the Path: Robust Query Rewriting for Information Retrieval
 Harnessing Large Language Models for Multimodal Product Bundling
 RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems
 All Roads Lead to Rome: Unveiling the Trajectory of Recommender Systems Across the LLM Era
 Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems
 Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
 Faux Polyglot: A Study on Information Disparity in Multilingual Large Language Models
 Vortex under Ripplet: An Empirical Study of RAG-enabled Applications
 MemoCRS: Memory-enhanced Sequential Conversational Recommender Systems with Large Language Models
 Neurocache: Efficient Vector Retrieval for Long-range Language Modeling
 Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I
 AdaCQR: Enhancing Query Reformulation for Conversational Search via Sparse and Dense Retrieval Alignment
 Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation
 Retrieval-augmented generation in multilingual settings
 Optimization of Retrieval-Augmented Generation Context with Outlier Detection
 "Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models
 Retrieval-style In-Context Learning for Few-shot Hierarchical Text Classification
 LumberChunker: Long-Form Narrative Document Segmentation
 Entropy-Based Decoding for Retrieval-Augmented Large Language Models
 Improving Zero-shot LLM Re-Ranker with Risk Minimization
 A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens
 D2LLM: Decomposed and Distilled Large Language Models for Semantic Search
 Retrieval Augmented Zero-Shot Text Classification
 APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking
 StackRAG Agent: Improving Developer Answers with Retrieval-Augmented Generation
 PromptDSI: Prompt-based Rehearsal-free Instance-wise Incremental Learning for Document Retrieval
 RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation
 Unified Active Retrieval for Retrieval Augmented Generation
 LLM-enhanced Reranking in Recommender Systems
 Intermediate Distillation: Data-Efficient Distillation from Black-Box LLMs for Information Retrieval
 CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG
 The Impact of Quantization on Retrieval-Augmented Generation: An Analysis of Small LLMs
 Improving LLMs for Recommendation with Out-Of-Vocabulary Tokens
 A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks
 Supportiveness-based Knowledge Rewriting for Retrieval-augmented Language Modeling
 Blowfish: Topological and statistical signatures for quantifying ambiguity in semantic search
 Async Learned User Embeddings for Ads Delivery Optimization
 Machine Against the RAG: Jamming Retrieval-Augmented Generation with Blocker Documents
 RE-RAG: Improving Open-Domain QA Performance and Interpretability with Relevance Estimator in Retrieval-Augmented Generation
 MrRank: Improving Question Answering Retrieval System through Multi-Result Ranking Model
 Evaluating the External and Parametric Knowledge Fusion of Large Language Models
 DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation
 Generative Explore-Exploit: Training-free Optimization of Generative Recommender Systems using LLM Optimizers
 RAG Does Not Work for Enterprises
 One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models
 Voice Jailbreak Attacks Against GPT-4o
 CtrlA: Adaptive Retrieval-Augmented Generation via Probe-Guided Control
 DeeperImpact: Optimizing Sparse Learned Index Structures
 Empowering Large Language Models to Set up a Knowledge Retrieval Indexer via Self-Learning
 Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration
 Unlocking Multi-View Insights in Knowledge-Dense Retrieval-Augmented Generation
 RAEE: A Training-Free Retrieval-Augmented Early Exiting Framework for Efficient Inference
 RaFe: Ranking Feedback Improves Query Rewriting for RAG
 Question-Based Retrieval using Atomic Units for Enterprise RAG
 SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
 Words Blending Boxes. Obfuscating Queries in Information Retrieval using Differential Privacy
 Redefining Information Retrieval of Structured Database via Large Language Models
 Contextualization with SPLADE for High Recall Retrieval
 Lifelong Knowledge Editing for LLMs with Retrieval-Augmented Continuous Prompt Learning
 Comparative Analysis of Retrieval Systems in the Real World
 Semi-Parametric Retrieval via Binary Token Index
 Efficient and Responsible Adaptation of Large Language Models for Robust Top-k Recommendations
 GRAMMAR: Grounded and Modular Methodology for Assessment of Closed-Domain Retrieval-Augmented Language Model
 Retrieval-Oriented Knowledge for Click-Through Rate Prediction
 Leveraging Large Language Models for Multimodal Search
 Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
 From Matching to Generation: A Survey on Generative Information Retrieval
 Retrieval Augmented Generation for Domain-specific Question Answering
 Planning Ahead in Generative Retrieval: Guiding Autoregressive Generation through Simultaneous Decoding
 Tree of Reviews: A Tree-based Dynamic Iterative Retrieval Framework for Multi-hop Question Answering
 CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models
 Dubo-SQL: Diverse Retrieval-Augmented Generation and Fine Tuning for Text-to-SQL
 Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers
 Consolidating Ranking and Relevance Predictions of Large Language Models through Post-Processing
 Recall-Augmented Ranking: Enhancing Click-Through Rate Prediction Accuracy with Cross-Stage Data
 The Elephant in the Room: Rethinking the Usage of Pre-trained Language Model in Sequential Recommendation
 Efficient Prompting Methods for Large Language Models: A Survey
 Enhancing Question Answering for Enterprise Knowledge Bases using Large Language Models
 PMG: Personalized Multimodal Generation with Large Language Models
 RecGPT: Generative Personalized Prompts for Sequential Recommendation via ChatGPT Training Paradigm
 Taxonomy and Analysis of Sensitive User Queries in Generative AI Search
 Generative Information Retrieval Evaluation
 End-to-end training of Multimodal Model and ranking Model
 Event-enhanced Retrieval in Real-time Search
 Optimization Methods for Personalizing Large Language Models through Retrieval Augmentation
 Q-PEFT: Query-dependent Parameter Efficient Fine-tuning for Text Reranking with Large Language Models
 CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems
 Digital Forgetting in Large Language Models: A Survey of Unlearning Methods
 Improving Retrieval Augmented Open-Domain Question-Answering with Vectorized Contexts
 Dissecting Paraphrases: The Impact of Prompt Syntax and supplementary Information on Knowledge Retrieval from Pretrained Language Models
 Where to Move Next: Zero-shot Generalization of LLMs for Next POI Recommendation
 Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems
 Shallow Cross-Encoders for Low-Latency Retrieval
 Retrieval-Enhanced Knowledge Editing for Multi-Hop Question Answering in Language Models
 Generate then Retrieve: Conversational Response Retrieval Using LLMs as Answer and Query Generators
 Are Large Language Models Good at Utility Judgments?
 SelfIE: Self-Interpretation of Large Language Model Embeddings
 Make Large Language Model a Better Ranker
 Boosting Conversational Question Answering with Fine-Grained Retrieval-Augmentation and Self-Check
 CoLLEGe: Concept Embedding Generation for Large Language Models
 Evidence-Driven Retrieval Augmented Response Generation for Online Misinformation
 JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning
 Improving the Robustness of Dense Retrievers Against Typos via Multi-Positive Contrastive Learning
 Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases
 Investigating the performance of Retrieval-Augmented Generation and fine-tuning for the development of AI-driven knowledge-based systems
 RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback
 ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval
 RecAI: Leveraging Large Language Models for Next-Generation Recommender Systems
 PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design
 Chaining text-to-image and large language model: A novel approach for generating personalized e-commerce banners
 LocalRQA: From Generating Data to Locally Training, Testing, and Deploying Retrieval-Augmented QA Systems
 An Interpretable Ensemble of Graph and Language Models for Improving Search Relevance in E-Commerce
 LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction
 Embedding-based search in JetBrains IDEs
 RAM-EHR: Retrieval Augmentation Meets Clinical Predictions on Electronic Health Records
 Utilizing BERT for Information Retrieval: Survey, Applications, Resources, and Challenges
 ChatDiet: Empowering Personalized Nutrition-Oriented Food Recommender Chatbots through an LLM-Augmented Framework
 Meta-Task Prompting Elicits Embeddings from Large Language Models
 The First Place Solution of WSDM Cup 2024: Leveraging Large Language Models for Conversational Multi-Doc QA
 Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation
 Corpus-Steered Query Expansion with Large Language Models
 REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering
 The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)
 Large Language Model Augmented Exercise Retrieval for Personalized Language Learning
 ESE: Espresso Sentence Embeddings
 ARL2: Aligning Retrievers for Black-box Large Language Models via Self-guided Adaptive Relevance Labeling
 Self-DC: When to retrieve and When to generate? Self Divide-and-Conquer for Compositional Unknown Questions
 Retrieval Helps or Hurts? A Deeper Dive into the Efficacy of Retrieval Augmentation to Language Models
 Are ELECTRA's Sentence Embeddings Beyond Repair? The Case of Semantic Textual Similarity
 Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge
 ARKS: Active Retrieval in Knowledge Soup for Code Generation
 Explain then Rank: Scale Calibration of Neural Rankers Using Natural Language Explanations from Large Language Models
 BIDER: Bridging Knowledge Inconsistency for Efficient Retrieval-Augmented LLMs via Key Supporting Evidence
 Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs
 TriSampler: A Better Negative Sampling Principle for Dense Retrieval
 EcoRank: Budget-Constrained Text Re-ranking Using Large Language Models
 Retrieve Only When It Needs: Adaptive Retrieval Augmentation for Hallucination Mitigation in Large Language Models
 Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
 PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers
 Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
 G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering
 T-RAG: Lessons from the LLM Trenches
 Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models
 REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models
 Chained Tuning Leads to Biased Forgetting
 Non-autoregressive Generative Models for Reranking Recommendation
 History, Development, and Principles of Large Language Models-An Introductory Survey
 Multimodal Query Suggestion with Multi-Agent Reinforcement Learning from Human Feedback
 Leveraging LLMs for Unsupervised Dense Retriever Ranking
 RA-Rec: An Efficient ID Representation Alignment Framework for LLM-based Recommendation
 Retrieve to Explain: Evidence-driven Predictions with Language Models
 C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models
 Locally-Adaptive Quantization for Streaming Vector Search
 HiQA: A Hierarchical Contextual Augmentation RAG for Massive Documents QA
 When Large Language Models Meet Vector Databases: A Survey
 Data-efficient Fine-tuning for LLM-based Recommendation
 CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models
 Re3val: Reinforced and Reranked Generative Retrieval
 Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models
 Generative Dense Retrieval: Memory Can Be a Burden
 The Chronicles of RAG: The Retriever, the Chunk and the Generator
 Curator: Efficient Indexing for Multi-Tenant Vector Databases
 Bridging the Preference Gap between Retrievers and LLMs
 InRanker: Distilled Rankers for Zero-shot Information Retrieval
 Prompting Large Language Models for Recommender Systems: A Comprehensive Framework and Empirical Analysis
 ChatGPT for Conversational Recommendation: Refining Recommendations by Reprompting with Feedback
 Unsupervised hard Negative Augmentation for contrastive learning
 Scaling Down, LiTting Up: Efficient Zero-Shot Listwise Reranking with Seq2seq Encoder-Decoder Models
 RecRanker: Instruction Tuning Large Language Model as Ranker for Top-k Recommendation
 Large Language Models are Not Stable Recommender Systems
 ESPN: Memory-Efficient Multi-Vector Information Retrieval
 Unlocking the Potential of Large Language Models for Explainable Recommendations
 Preliminary Study on Incremental Learning for Large Language Model-based Recommender Systems
 Agent4Ranking: Semantic Robust Ranking via Personalized Query Rewriting Using Multi-agent LLM
 Dense X Retrieval: What Retrieval Granularity Should We Use?
 Some things are more CRINGE than others: Iterative Preference Optimization with the Pairwise Cringe Loss
 End-to-End Retrieval with Learned Dense and Sparse Representations Using Lucene
 IAG: Induction-Augmented Generation Framework for Answering Reasoning Questions
 ControlRec: Bridging the Semantic Gap between Language Model and Personalized Recommendation
 RecExplainer: Aligning Large Language Models for Explaining Recommendation Models
 Golden-Retriever: High-Fidelity Agentic Retrieval Augmented Generation for Industrial Knowledge Base
 Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders
 On Retrieval Augmentation and the Limitations of Language Model Training
 ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
 Text Retrieval with Multi-Stage Re-Ranking Models
 LLatrieval: LLM-Verified Retrieval for Verifiable Generation
 CoverBench: A Challenging Benchmark for Complex Claim Verification
 Knowledge-Augmented Large Language Models for Personalized Contextual Query Suggestion
 Exploring Fine-tuning ChatGPT for News Recommendation
 Self-Taught Evaluators
 RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation
 The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
 Mixture of Experts with Mixture of Precisions for Tuning Quality of Service
 The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models
 KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
 Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
 MooER: LLM-based Speech Recognition and Translation Models from Moore Threads
 Language Model Can Listen While Speaking
 Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
 Mini-Monkey: Alleviate the Sawtooth Effect by Multi-Scale Adaptive Cropping
 MiniCPM-V: A GPT-4V Level MLLM on Your Phone
 What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
 The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models
 Can LLMs predict the convergence of Stochastic Gradient Descent?
 The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines
 LLaVA-OneVision: Easy Visual Task Transfer
 The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design
 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
 Fact Finder -- Enhancing Domain Expertise of Large Language Models by Incorporating Knowledge Graphs
 A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighborhood Search
 Leveraging Inter-Chunk Interactions for Enhanced Retrieval in Large Language Model-Based Question Answering
 From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
 Generative Retrieval with Few-shot Indexing
 Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval
 SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
 Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer
 StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation
 Synthesizing Text-to-SQL Data from Weak and Strong LLMs
 Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models
 CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases
 NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time
 Comparative Analysis of Open-Source Language Models in Summarizing Medical Text Data
 EXAONE 3.0 7.8B Instruction Tuned Language Model
 Sketch-Guided Constrained Decoding for Boosting Blackbox Large Language Models without Logit Access
 WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models
 Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
 Learning Task Decomposition to Assist Humans in Competitive Programming
 Better Alignment with Instruction Back-and-Forth Translation
 Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
 Deeploy: Enabling Energy-Efficient Deployment of Small Language Models On Heterogeneous Microcontrollers
 LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection
 Lifelong Personalized Low-Rank Adaptation of Large Language Models for Recommendation
 ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning
 GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
 Diffusion Guided Language Modeling
 Conversational Prompt Engineering
 Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP
 Pairing Clustered Inverted Indexes with kNN Graphs for Fast Approximate Retrieval over Learned Sparse Representations
 Enhancing Robustness of Retrieval-Augmented Language Models with In-Context Learning
 EfficientRAG: Efficient Retriever for Multi-Hop Question Answering
 Pairwise Judgment Formulation for Semantic Embedding Model in Web Search
 DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
 PhiloBERTA: A Transformer-Based Cross-Lingual Analysis of Greek and Latin Lexicons
 Interpreting Attention Layer Outputs with Sparse Autoencoders
 Fine-tuning language models to find agreement among humans with diverse preferences
 VITA: Towards Open-Source Interactive Omni Multimodal LLM
 A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?
 MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
 Rag and Roll: An End-to-End Evaluation of Indirect Prompt Manipulations in LLM-based Application Frameworks
 Early Exit Strategies for Approximate k-NN Search in Dense Retrieval
 HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction
 Relevance Filtering for Embedding-based Retrieval
 OpenResearcher: Unleashing AI for Accelerated Scientific Research
 Enhancing Relevance of Embedding-based Retrieval at Walmart
 mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
 Natural Language Outlines for Code: Literate Programming in the LLM Era
 ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
 Affective Computing in the Era of Large Language Models: A Survey from the NLP Perspective
 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data
 The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
 VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
 Review-driven Personalized Preference Reasoning with Large Language Models for Recommendation
 PhysBERT: A Text Embedding Model for Physics Scientific Literature
 Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
 Med42-v2: A Suite of Clinical LLMs
 Your Context Is Not an Array: Unveiling Random Access Limitations in Transformers
 PERSOMA: PERsonalized SOft ProMpt Adapter Architecture for Personalized Language Prompting
 LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
 The Earth is Flat because...: Investigating LLMs' Belief towards Misinformation via Persuasive Conversation
 Layerwise Recurrent Router for Mixture-of-Experts
 Prompt Tuning as User Inherent Profile Inference Machine
 Large Language Model Agent in Financial Trading: A Survey
 Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
 Design Proteins Using Large Language Models: Enhancements and Comparative Analyses
 Hermes 3 Technical Report
 FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data
 WeKnow-RAG: An Adaptive Approach for Retrieval-Augmented Generation Integrating Web Search and Knowledge Graphs
 Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM
 InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning
 Aquila2 Technical Report
 Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents
 Hierarchical Structured Neural Network for Retrieval
 BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search
 Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
 How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
 MALADE: Orchestration of LLM-powered Agents with Retrieval Augmented Generation for Pharmacovigilance
 Can Large Language Models Understand Symbolic Graphics Programs?
 BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
 DaRec: A Disentangled Alignment Framework for Large Language Model and Recommender System
 Mamba Retriever: Utilizing Mamba for Effective and Efficient Dense Retrieval
 Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability
 Post-Training Sparse Attention with Double Sparsity
 Large language models can be zero-shot anomaly detectors for time series?
 The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community
 I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm
 FuseChat: Knowledge Fusion of Chat Models
 Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models
 Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
 NL2OR: Solve Complex Operations Research Problems Using Natural Language Inputs
 Towards Robust and Cost-Efficient Knowledge Unlearning for Large Language Models
 Min P Sampling: Balancing Creativity and Coherence at High Temperature
 LLM Stability: A detailed analysis with some surprises
 xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
 A Survey on Benchmarks of Multimodal Large Language Models
 Where is the signal in tokenization space?
 JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
 W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering
 Cropper: Vision-Language Model for Image Cropping through In-Context Learning
 Fine-tuning Large Language Models with Human-inspired Learning Strategies in Medical Question Answering
 Can Large Language Models Reason? A Characterization via 3-SAT
 Large language models can consistently generate high-quality content for election disinformation operations
 LongVILA: Scaling Long-Context Visual Language Models for Long Videos
 Meta Knowledge for Retrieval Augmented Large Language Models
 Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges
 Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models
 Graph Retrieval-Augmented Generation: A Survey
 Patched MOA: optimizing inference for diverse software development tasks
 Patched RTC: evaluating LLMs for diverse software development tasks
 InstructCoder: Instruction Tuning Large Language Models for Code Editing
 To Code, or Not To Code? Exploring Impact of Code in Pre-training
 Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model
 HMoE: Heterogeneous Mixture of Experts for Language Modeling
 Synergistic Approach for Simultaneous Optimization of Monolingual, Cross-lingual, and Multilingual Information Retrieval
 Analysis of Plan-based Retrieval for Grounded Text Generation
 NeCo: Improving DINOv2's spatial representations in 19 GPU hours with Patch Neighbor Consistency
 Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
 Goldfish: Monolingual Language Models for 350 Languages
 BLADE: Benchmarking Language Model Agents for Data-Driven Science
 TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
 Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation
 Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs
 See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses
 LLM Pruning and Distillation in Practice: The Minitron Approach
 Critique-out-Loud Reward Models
 FocusLLM: Scaling LLM's Context by Parallel Decoding
 First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models
 StructuredRAG: JSON Response Formatting with Large Language Models
 MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
 RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation
 UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
 Mistral-SPLADE: LLMs for better Learned Sparse Retrieval
 CTP-LLM: Clinical Trial Phase Transition Prediction Using Large Language Models
 Backward-Compatible Aligned Representations via an Orthogonal Transformation Layer
 Great Memory, Shallow Reasoning: Limits of $k$NN-LMs
 Unboxing Occupational Bias: Grounded Debiasing LLMs with U.S. Labor Data
 Flexora: Flexible Low Rank Adaptation for Large Language Models
 Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information
 Controllable Text Generation for Large Language Models: A Survey
 Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
 Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
 SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs
 Drama Engine: A Framework for Narrative Agents
 Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search
 Large Language Models as Foundations for Next-Gen Dense Retrieval: A Comprehensive Empirical Assessment
 Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese
 SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models
 ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM
 Automating Thought of Search: A Journey Towards Soundness and Completeness
 Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
 Evidence-backed Fact Checking using RAG and Few-Shot In-Context Learning with LLMs
 Matmul or No Matmul in the Era of 1-bit LLMs
 Cross-Modal Safety Alignment: Is textual unlearning all you need?
 Unlocking the Potential of Large Language Models for Clinical Text Anonymization: A Comparative Study
 Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution
 QUB-Cirdan at "Discharge Me!": Zero shot discharge letter generation by open-source LLM
 Exploring Backdoor Attacks against Large Language Model-based Decision Making
 Phantom: General Trigger Attacks on Retrieval Augmented Language Generation
 Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
 Visual Perception by Large Language Model's Weights
 S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs
 Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization
 PostDoc: Generating Poster from a Long Multimodal Document Using Deep Submodular Optimization
 Nadine: An LLM-driven Intelligent Social Robot with Affective Capabilities and Human-like Memory
 Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning CodeLLMs
 InstructionCP: A fast approach to transfer Large Language Models into target language
 KNOW: A Real-World Ontology for Knowledge Capture with Large Language Models
 InterPreT: Interactive Predicate Learning from Language Feedback for Generalizable Task Planning
 Two Optimizers Are Better Than One: LLM Catalyst Empowers Gradient-Based Optimization for Prompt Tuning
 One-Shot Safety Alignment for Large Language Models via Optimal Dualization
 Are Large Language Models Chameleons?
 Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding
 Towards Next-Generation Urban Decision Support Systems through AI-Powered Construction of Scientific Ontology using Large Language Models -- A Case in Optimizing Intermodal Freight Transportation
 VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
 Learning from Litigation: Graphs and LLMs for Retrieval and Reasoning in eDiscovery
 Can Graph Learning Improve Task Planning?
 MEMoE: Enhancing Model Editing with Mixture of Experts Adaptors
 Towards Faithful Chain-of-Thought: Large Language Models are Bridging Reasoners
 Language Generation with Strictly Proper Scoring Rules
 Compressing Large Language Models using Low Rank and Low Precision Decomposition
 Video Enriched Retrieval Augmented Generation Using Aligned Video Captions
 Mechanistic Interpretability of Binary and Ternary Transformers
 Enhanced Robot Arm at the Edge with NLP and Vision Systems
 Generative Query Reformulation Using Ensemble Prompting, Document Fusion, and Relevance Feedback
 HEART-felt Narratives: Tracing Empathy and Narrative Style in Personal Stories with LLMs
 Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model
 THREAD: Thinking Deeper with Recursive Spawning
 Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching
 Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention
 LLM-Assisted Static Analysis for Detecting Security Vulnerabilities
 CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs
 Autoformalizing Euclidean Geometry
 LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding
 Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization
 MotionLLM: Multimodal Motion-Language Learning with Large Language Models
 Exploring the LLM Journey from Cognition to Expression with Linear Representations
 A Large Language Model-based multi-agent manufacturing system for intelligent shopfloor
 TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing
 Laurel: Generating Dafny Assertions Using Large Language Models
 LLMs for User Interest Exploration in Large-scale Recommendation Systems
 Devil's Advocate: Anticipatory Reflection for LLM Agents
 SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs
 Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models
 Mechanism Design for LLM Fine-tuning with Multiple Reward Models
 FastQuery: Communication-efficient Embedding Table Query for Private LLM Inference
 A statistical framework for weak-to-strong generalization
 No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
 GeneAgent: Self-verification Language Agent for Gene Set Knowledge Discovery using Domain Databases
 Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection
 C3LLM: Conditional Multimodal Content Generation Using Large Language Models
 Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting
 Finetuning Large Language Model for Personalized Ranking
 Towards Completeness-Oriented Tool Retrieval for Large Language Models
 Keypoint-based Progressive Chain-of-Thought Distillation for LLMs
 SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
 Semantic Importance-Aware Communications with Semantic Correction Using Large Language Models
 Claim Verification in the Age of Large Language Models: A Survey
 Streaming Long Video Understanding with Large Language Models
 Your Large Language Models Are Leaving Fingerprints
 WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response
 What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
 Why Not Transform Chat Large Language Models to Non-English?
 TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-Alignment
 LOGIN: A Large Language Model Consulted Graph Neural Network Training Framework
 Sunnie: An Anthropomorphic LLM-Based Conversational Agent for Mental Well-Being Activity Recommendation
 CG-FedLLM: How to Compress Gradients in Federated Fine-tuning for Large Language Models
 DSTI at LLMs4OL 2024 Task A: Intrinsic versus extrinsic knowledge for type classification
 How to set AdamW's weight decay as you scale model and dataset size
 Safety Alignment for Vision Language Models
 ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation
 Large Language Models are Effective Priors for Causal Graph Discovery
 HighwayLLM: Decision-Making and Navigation in Highway Driving with RL-Informed Language Model
 WaterPool: A Watermark Mitigating Trade-offs among Imperceptibility, Efficacy and Robustness
 LIRE: listwise reward enhancement for preference alignment
 Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction
 TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models
 RoundTable: Leveraging Dynamic Schema and Contextual Autocomplete for Enhanced Query Precision in Tabular Question Answering
 VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
 Lusifer: LLM-based User SImulated Feedback Environment for online Recommender systems
 AdpQ: A Zero-shot Calibration Free Adaptive Post Training Quantization Method for LLMs
 Large Language Models (LLMs) Assisted Wireless Network Deployment in Urban Settings
 Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance
 Towards Evaluating and Building Versatile Large Language Models for Medicine
 LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction
 RuleAlign: Making Large Language Models Better Physicians with Diagnostic Rule Alignment
 Pruning By Explaining Revisited: Optimizing Attribution Methods to Prune CNNs and Transformers
 MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
 MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
 Domain-specific long text classification from sparse relevant information
 DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation
 Instruct-DeBERTa: A Hybrid Approach for Aspect-based Sentiment Analysis on Textual Reviews
 Insights from Benchmarking Frontier Language Models on Web App Code Generation
 Can LLM be a Good Path Planner based on Prompt Engineering? Mitigating the Hallucination for Path Planning
 Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
 Semantic Alignment for Multimodal Large Language Models
 Memory-Efficient LLM Training with Online Subspace Descent
 A Survey of Hallucination in Large Foundation Models
 MEDCO: Medical Education Copilots Based on A Multi-Agent Framework
 Customizing Language Models with Instance-wise LoRA for Sequential Recommendation
 Towards Realistic Synthetic User-Generated Content: A Scaffolding Approach to Generating Online Discussions
 SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
 The Mamba in the Llama: Distilling and Accelerating Hybrid Models
 Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
 MobileQuant: Mobile-friendly Quantization for On-device Language Models
 LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs
 LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!
 Efficient Detection of Toxic Prompts in Large Language Models
 Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
 Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time
 A Web-Based Solution for Federated Learning with LLM-Based Automation
 NanoFlow: Towards Optimal Large Language Model Serving Throughput
 A Survey of Large Language Models for European Languages
 HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
 Challenges and Responses in the Practice of Large Language Models
 PEDAL: Enhancing Greedy Decoding with Large Language Models using Diverse Exemplars
 Inverse Scaling: When Bigger Isn't Better
 Generative Verifiers: Reward Modeling as Next-Token Prediction
 Project SHADOW: Symbolic Higher-order Associative Deductive reasoning On Wikidata using LM probing
 BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline
 DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding
 MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce
 Writing in the Margins: Better Inference Pattern for Long Context Retrieval
 Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning
 PAT: Pruning-Aware Tuning for Large Language Models
 Text2SQL is Not Enough: Unifying AI and Databases with TAG
 Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations
 Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express
 Agentic Retrieval-Augmented Generation for Time Series Analysis
 LLM-3D Print: Large Language Models To Monitor and Control 3D Printing
 A Law of Next-Token Prediction in Large Language Models
 Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
 WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration
 Efficient LLM Scheduling by Learning to Rank
 Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models
 Decentralized LLM Inference over Edge Networks with Energy Harvesting
 LLM-Based Multi-Hop Question Answering with Knowledge Graph Integration in Evolving Environments
 LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
 Knowledge Navigator: LLM-guided Browsing Framework for Exploratory Search in Scientific Literature
 Geometry of Lightning Self-Attention: Identifiability and Dimension
 LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
 Conan-embedding: General Text Embedding with More and Better Negative Samples
 Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
 ReMamba: Equip Mamba with Effective Long-Sequence Modeling
 Awes, Laws, and Flaws From Today's LLM Research
 Persuasion Games using Large Language Models
 Can Unconfident LLM Annotations Be Used for Confident Conclusions?
 Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
 Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
 Transformers Meet ACT-R: Repeat-Aware and Sequential Listening Session Recommendation
 A Survey on Evaluating Large Language Models in Code Generation Tasks
 Law of Vision Representation in MLLMs
 SynDL: A Large-Scale Synthetic Test Collection
 Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models
 StyleRemix: Interpretable Authorship Obfuscation via Distillation and Perturbation of Style Elements
 Understanding the User: An Intent-Based Ranking Dataset
 Iterative Graph Alignment
 Icing on the Cake: Automatic Code Summarization at Ericsson
 Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts
 LLMs generate structurally realistic social networks but overestimate political homophily
 Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems
 Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models
 Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
 Multi-Party Chat: Conversational Agents in Group Settings with Humans and Models
 MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents
 LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation
 GIFT-SW: Gaussian noise Injected Fine-Tuning of Salient Weights for LLMs
 InkubaLM: A small language model for low-resource African languages
 SurveySum: A Dataset for Summarizing Multiple Scientific Articles into a Survey Section
 CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization
 Automatic Differential Diagnosis using Transformer-Based Multi-Label Sequence Classification
 SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
 AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems
 CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation
 MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models
 Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
 MemLong: Memory-Augmented Retrieval for Long Text Modeling
 BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
 Cross-Modal Learning for Chemistry Property Prediction: Large Language Models Meet Graph Machine Learning
 SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
 CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models
 Selective Preference Optimization via Token-Level Reward Function Estimation
 Impact of ChatGPT on the writing style of condensed matter physicists
 WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback
 Pandora's Box or Aladdin's Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models
 ImageBind-LLM: Multi-modality Instruction Tuning
 Transformers as Support Vector Machines
 LLM-GAN: Construct Generative Adversarial Network Through Large Language Models For Explainable Fake News Detection
 RACONTEUR: A Knowledgeable, Insightful, and Portable LLM-Powered Shell Command Explainer
 OLMoE: Open Mixture-of-Experts Language Models
 BEAVER: An Enterprise Benchmark for Text-to-SQL
 Foundations of Large Language Model Compression -- Part 1: Weight Quantization
 Contemporary Model Compression on Large Language Models Inference
 rerankers: A Lightweight Python Library to Unify Ranking Methods
 FuzzCoder: Byte-level Fuzzing Test via Large Language Model
 LUK: Empowering Log Understanding with Expert Knowledge from Large Language Models
 Focus Agent: LLM-Powered Virtual Focus Group
 A Fresh Take on Stale Embeddings: Improving Dense Retriever Training with Corrector Networks
 AgentRE: An Agent-Based Framework for Navigating Complex Information Landscapes in Relation Extraction
 In Defense of RAG in the Era of Long-Context Language Models
 Laser: Parameter-Efficient LLM Bi-Tuning for Sequential Recommendation with Collaborative Information
 LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
 ProGRes: Prompted Generative Rescoring on ASR n-Best
 Augmented Reality without Borders: Achieving Precise Localization Without Maps
 Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
 CogVLM2: Visual Language Models for Image and Video Understanding
 Mamba or Transformer for Time Series Forecasting? Mixture of Universals (MoU) Is All You Need
 In-Context Imitation Learning via Next-Token Prediction
 A Practitioner's Guide to Continual Multimodal Pretraining
 LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
 LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
 Configurable Foundation Models: Building LLMs from a Modular Perspective
 Towards a Unified View of Preference Learning for Large Language Models: A Survey
 A Comparative Study of Pre-training and Self-training
 Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?
 RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models
 Diversify-verify-adapt: Efficient and Robust Retrieval-Augmented Ambiguous Question Answering
 NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval
 WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild
 Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining
 Unforgettable Generalization in Language Models
 CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
 GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI
 Imitating Language via Scalable Inverse Reinforcement Learning
 Statically Contextualizing Large Language Models with Typed Holes
 ContextCite: Attributing Model Generation to Context
 TinyAgent: Function Calling at the Edge
 The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts
 PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action
 Ruri: Japanese General Text Embeddings
 On-Device Language Models: A Comprehensive Review
 Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text
 Cog-GA: A Large Language Models-based Generative Agent for Vision-Language Navigation in Continuous Environments
 Building Math Agents with Multi-Turn Iterative Preference Learning
 Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
 Attention Heads of Large Language Models: A Survey
 Planning In Natural Language Improves LLM Search For Code Generation
 On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization
 From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents
 Extracting Paragraphs from LLM Token Activations
 xLAM: A Family of Large Action Models to Empower AI Agent Systems
 Large Language Model-Based Agents for Software Engineering: A Survey
 SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration
 Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
 CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities
 Evolution of Social Norms in LLM Agents using Natural Language
 A Static Evaluation of Code Completion by Large Language Models
 Universal Transformers
 Hardware Acceleration of LLMs: A comprehensive survey and comparison
 Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Translation
 The Compressor-Retriever Architecture for Language Model OS
 A Learnable Agent Collaboration Network Framework for Personalized Multimodal AI Search Engine
 A Survey for Large Language Models in Biomedicine
 Watermarking Techniques for Large Language Models: A Survey
 Genetic Approach to Mitigate Hallucination in Generative IR
 Theory, Analysis, and Best Practices for Sigmoid Self-Attention
 VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
 RLPF: Reinforcement Learning from Prediction Feedback for User Summarization with LLMs
 Learning vs Retrieval: The Role of In-Context Examples in Regression with LLMs
 Advancing Automated Knowledge Transfer in Evolutionary Multitasking via Large Language Models
 An overview of domain-specific foundation model: key technologies, applications and challenges
 Flexible and Effective Mixing of Large Language Models into a Mixture of Domain Experts
 GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding
 RETAIN: Interactive Tool for Regression Testing Guided LLM Migration
 How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data
 MoRe Fine-Tuning with 10x Fewer Parameters
 Entropic Distribution Matching in Supervised Fine-tuning of LLMs: Less Overfitting and Better Diversity
 Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
 AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model
 Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models
 MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
 TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish
 Benchmarking Chinese Knowledge Rectification in Large Language Models
 A System and Benchmark for LLM-based Q&A on Heterogeneous Data
 MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery
 CauseJudger: Identifying the Cause with LLMs for Abductive Logical Reasoning
 Tele-LLMs: A Series of Specialized Large Language Models for Telecommunications
 OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs
 Achieving Peak Performance for Large Language Models: A Systematic Review
 Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
 Improving Pretraining Data Using Perplexity Correlations
 LLMs Will Always Hallucinate, and We Need to Live With This
 Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance
 How Does Code Pretraining Affect Language Model Task Performance?
 Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation
 A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More
 Radiology-Llama2: Best-in-Class Large Language Model for Radiology
 Synthetic continued pretraining
 Agent Workflow Memory
 Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering
 STORE: Streamlining Semantic Tokenization and Generative Recommendation with A Single LLM
 What is the Role of Small Models in the LLM Era: A Survey
 LLaMA-Omni: Seamless Speech Interaction with Large Language Models
 GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
 Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?
 Length Desensitization in Directed Preference Optimization
 STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning
 Can Large Language Models Unlock Novel Scientific Research Ideas?
 SongCreator: Lyrics-based Universal Song Generation
 Self-Harmonized Chain of Thought
 SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
 AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge
 MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications
 PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
 Generative Hierarchical Materials Search
 Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
 What Makes a Maze Look Like a Maze?
 DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?
 Retrieval Augmented Thought Process for Private Data Handling in Healthcare
 Dense Reward for Free in Reinforcement Learning from Human Feedback
 Enhancing Q&A Text Retrieval with Ranking Models: Benchmarking, fine-tuning and deploying Rerankers for RAG
 Evidence from fMRI Supports a Two-Phase Abstraction Process in Language Models
 Reinforcement Learning from Reflective Feedback (RLRF): Aligning and Improving LLMs via Fine-Grained Self-Reflection
 Large Language Models are Pattern Matchers: Editing Semi-Structured and Structured Documents with ChatGPT
 Representation Tuning
 E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning
 DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models
 Alleviating Hallucinations in Large Language Models with Scepticism Modeling
 SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning
 Harmonic Reasoning in Large Language Models
 STLM Engineering Report: Dropout
 Towards Automated Machine Learning Research
 Optimization Hyper-parameter Laws for Large Language Models
 Residual Stream Analysis with Multi-Layer SAEs
 LAST: Language Model Aware Speech Tokenization
 A Fused Large Language Model for Predicting Startup Success
 Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers
 Accelerating Large Language Model Training with Hybrid GPU-based Compression
 Training on the Benchmark Is Not All You Need
 From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning
 LanguaShrink: Reducing Token Overhead with Psycholinguistics
 EPO: Hierarchical LLM Agents with Environment Preference Optimization
 Atari-GPT: Investigating the Capabilities of Multimodal Large Language Models as Low-Level Policies for Atari Games
 Harmonized Speculative Sampling
 Why transformers are obviously good models of language
 SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models
 How transformers learn structured data: insights from hierarchical filtering
 Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data
 SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection
 Search-Based LLMs for Code Optimization
 Memorization In In-Context Learning
 Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in?
 AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference
 Demystifying the Communication Characteristics for Distributed Transformer Models
 In-Context Learning with Representations: Contextual Generalization of Trained Transformers
 Performance Law of Large Language Models
 Importance Weighting Can Help Large Language Models Self-Improve
 Acquiring Bidirectionality via Large and Small Language Models
 Extracting Sentence Embeddings from Pretrained Transformer Models
 Instruct Large Language Models to Generate Scientific Literature Survey Step by Step
 LLMs can Schedule
 A Unified Framework for Model Editing
 AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies
 Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data
 Animate, or Inanimate, That is the Question for Large Language Models
 Generalisation First, Memorisation Second? Memorisation Localisation for Natural Language Classification Tasks
 How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression
 Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
 From Words to Worth: Newborn Article Impact Prediction with LLM
 Is Child-Directed Speech Effective Training Data for Language Models?
 Automated Theorem Provers Help Improve Large Language Model Reasoning
 SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models
 Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages
 Cross-layer Attention Sharing for Large Language Models
 STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs
 Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer
 Reconsidering Token Embeddings with the Definitions for Pre-trained Language Models
 On the Resilience of Multi-Agent Systems with Malicious Agents
 Disentangling Dense Embeddings with Sparse Autoencoders
 SentenceVAE: Enable Next-sentence Prediction for Large Language Models with Faster Speed, Higher Accuracy and Longer Context
 PMoE: Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning
 Adaptive Pre-training Data Detection for Large Language Models via Surprising Tokens
 Entropy, Thermodynamics and the Geometrization of the Language Model
 MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning
 CultureVo: The Serious Game of Utilizing Gen AI for Enhancing Cultural Intelligence
 ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2
 LLMs' Understanding of Natural Language Revealed
 Mixture of Modular Experts: Distilling Knowledge from a Multilingual Teacher into Specialized Modular Language Models
 Do Language Models Have a Critical Period for Language Acquisition?
 Understanding Memorisation in LLMs: Dynamics, Influencing Factors, and Implications
 Towards Effective and Efficient Continual Pre-training of Large Language Models
 Climbing the Complexity Ladder with Expressive Attention
 Towards More Accurate Prediction of Human Empathy and Emotion in Text and Multi-turn Conversations by Combining Advanced NLP, Transformers-based Networks, and Linguistic Methodologies
 I Could've Asked That: Reformulating Unanswerable Questions
 Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment
 On the Design and Analysis of LLM-Based Algorithms
 Latent Causal Probing: A Formal Perspective on Probing with Causal Models of Data
 A mathematical framework of intelligence and consciousness based on Riemannian Geometry
 Enhancing Training Efficiency Using Packing with Flash Attention
 Banishing LLM Hallucinations Requires Rethinking Generalization
 OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser
 Multi-Meta-RAG: Improving RAG for Multi-Hop Queries using Database Filtering with LLM-Extracted Metadata
 A Notion of Complexity for Theory of Mind via Discrete World Models
 Tree Cross Attention
 Sentence Bottleneck Autoencoders from Transformer Language Models
 Neural Machine Translation without Embeddings
 Agents in Software Engineering: Survey, Landscape, and Vision
 Emerging Reliance Behaviors in Human-AI Text Generation: Hallucinations, Data Quality Assessment, and Cognitive Forcing Functions
 Programming Refusal with Conditional Activation Steering
 AIPO: Improving Training Objective for Iterative Preference Optimization
 Your Weak LLM is Secretly a Strong Teacher for Alignment
 Mutual Theory of Mind in Human-AI Collaboration: An Empirical Study with LLM-driven AI Agents in a Real-time Shared Workspace Task
 Fusing Dynamics Equation: A Social Opinions Prediction Algorithm with LLM-based Agents
 CPL: Critical Planning Step Learning Boosts LLM Generalization in Reasoning Tasks
 LLM Critics Help Catch LLM Bugs
 Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning
 Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models
 Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training
 Reasoning with Language Model is Planning with World Model
 RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
 Context-aware Code Segmentation for C-to-Rust Translation using Large Language Models
 Causal Language Modeling Can Elicit Search and Reasoning Capabilities on Logic Puzzles
 ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
 BERT Rediscovers the Classical NLP Pipeline
 AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents
 Assessing Adversarial Robustness of Large Language Models: An Empirical Study
 Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs
 LLM as BT-Planner: Leveraging LLMs for Behavior Tree Generation in Robot Task Planning
 Instigating Cooperation among LLM Agents Using Adaptive Information Modulation
 Large Language Model Enhanced Hard Sample Identification for Denoising Recommendation
 beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems
 ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code
 Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots
 From Text to Emoji: How PEFT-Driven Personality Manipulation Unleashes the Emoji Potential in LLMs
 jina-embeddings-v3: Multilingual Embeddings With Task LoRA
 Trustworthiness in Retrieval-Augmented Generation Systems: A Survey
 On the Diagram of Thought
 CROSS-JEM: Accurate and Efficient Cross-encoders for Short-text Ranking Tasks
 Unleash LLMs Potential for Recommendation by Coordinating Twin-Tower Dynamic Semantic Token Generator
 HyPA-RAG: A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy Applications
 Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding
 Explaining Datasets in Words: Statistical Models with Natural Language Parameters
 AudioBERT: Audio Knowledge Augmented Language Model
 Policy Filtration in RLHF to Fine-Tune LLM for Code Generation
 Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models
 Qwen2.5-Coder Technical Report
 To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
 A Controlled Study on Long Context Extension and Generalization in LLMs
 GRIN: GRadient-INformed MoE
 LLMs + Persona-Plug = Personalized LLMs
 Human-like Affective Cognition in Foundation Models
 Designing Interfaces for Multimodal Vector Search Applications
 Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation
 A Framework for Ranking Content Providers Using Prompt Engineering and Self-Attention Network
 Scaling FP8 training to trillion-token LLMs
 NVLM: Open Frontier-Class Multimodal LLMs
 LLM-Agent-UMF: LLM-based Agent Unified Modeling Framework for Seamless Integration of Multi Active/Passive Core-Agents
 Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement
 Towards Time Series Reasoning with LLMs
 Learning Spatially-Aware Language and Audio Embedding
 THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models
 LOLA -- An Open-Source Massively Multilingual Large Language Model
 Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
 SuperCoder2.0: Technical Report on Exploring the feasibility of LLMs as Autonomous Programmer
 Semformer: Transformer Language Models with Semantic Planning
 Embedding Geometries of Contrastive Language-Image Pre-Training
 Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models
 A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
 Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
 Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models
 On the limits of agency in agent-based models
 Schrodinger's Memory: Large Language Models
 Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison
 RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation
 What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing
 LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study
 Stable Language Model Pre-training by Reducing Embedding Variability
 Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
 The Expressive Power of Transformers with Chain of Thought
 Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
 MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions
 Revealing the Challenge of Detecting Character Knowledge Errors in LLM Role-Playing
 Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
 Training Language Models to Self-Correct via Reinforcement Learning
 Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
 Enhancing E-commerce Product Title Translation with Retrieval-Augmented Generation and Large Language Models
 Language Models Learn to Mislead Humans via RLHF
 Assessing the Zero-Shot Capabilities of LLMs for Action Evaluation in RL
 MEXMA: Token-level objectives improve sentence representations
 Text2Traj2Text: Learning-by-Synthesis Framework for Contextual Captioning of Human Movement Trajectories
 Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries
 InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
 MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning
 BERT-VBD: Vietnamese Multi-Document Summarization Framework
 Measuring Human and AI Values based on Generative Psychometrics with Large Language Models
 RoMath: A Mathematical Reasoning Benchmark in Romanian
 Compressing LLMs: The Truth is Rarely Pure and Never Simple
 CLAIR-A: Leveraging Large Language Models to Judge Audio Captions
 MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
 Knowledge-Based Domain-Oriented Data Augmentation for Enhancing Unsupervised Sentence Embedding
 HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling
 AI Suggestions Homogenize Writing Toward Western Styles and Diminish Cultural Nuances
 Retrieval-Augmented Test Generation: How Far Are We?
 Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning
 RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues
 Should RAG Chatbots Forget Unimportant Conversations? Exploring Importance and Forgetting with Psychological Insights
 Linear Recency Bias During Training Improves Transformers' Fit to Reading Times
 Linguistic Minimal Pairs Elicit Linguistic Similarity in Large Language Models
 Making Large Language Models into World Models with Precondition and Effect Knowledge
 Linguini: A benchmark for language-agnostic linguistic reasoning
 Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
 Dual-Layer Training and Decoding of Large Language Model with Simultaneously Thinking and Speaking
 SLIMER-IT: Zero-Shot NER on Italian Language
 Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization
 Adaptive Large Language Models By Layerwise Attention Shortcuts
 Rediscovering the Latent Dimensions of Personality with Large Language Models as Trait Descriptors
 MindScape Study: Integrating LLM and Behavioral Sensing for Personalized AI-Driven Journaling Experiences
 Language Models "Grok" to Copy
 Autoregressive + Chain of Thought $\simeq$ Recurrent: Recurrence's Role in Language Models' Computability and a Revisit of Recurrent Transformer
 Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?
 What You Say = What You Want? Teaching Humans to Articulate Requirements for LLMs
 When Context Leads but Parametric Memory Follows in Large Language Models
 SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
 Mixture of Diverse Size Experts
 Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent
 Semi-Supervised Reward Modeling via Iterative Self-Training
 Spectral Filters, Dark Signals, and Attention Sinks
 Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers
 Enhancing Fault Localization Through Ordered Code Analysis with LLM Agents and Self-Reflection
 ChainBuddy: An AI Agent System for Generating LLM Pipelines
 ShizishanGPT: An Agricultural Large Language Model Integrating Tools and Resources
 Alternate Preference Optimization for Unlearning Factual Knowledge in Large Language Models
 RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion
 RRM: Robust Reward Model Training Mitigates Reward Hacking
 AutoVerus: Automated Proof Generation for Rust Code
 LLM Surgery: Efficient Knowledge Unlearning and Editing in Large Language Models
 Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey
 LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
 Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts
 TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning
 Jailbreaking Large Language Models with Symbolic Mathematics
 Hackphyr: A Local Fine-Tuned LLM Agent for Network Security Environments
 An adapted large language model facilitates multiple medical tasks in diabetes care
 KTO: Model Alignment as Prospect Theoretic Optimization
 Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs
 Towards Understanding Grokking: An Effective Theory of Representation Learning
 What Makes Good In-Context Examples for GPT-$3$?
 A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?
 Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping
 Learning from Contrastive Prompts: Automated Optimization and Adaptation
 Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond
 Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs
 Phantom of Latent for Large Language and Vision Models
 Target-Aware Language Modeling via Granular Data Sampling
 Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
 A Case Study of Web App Coding with OpenAI Reasoning Models
 DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency
 Robust Training Objectives Improve Embedding-based Retrieval in Industrial Recommendation Systems
 Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking
 LLM-Assisted Visual Analytics: Opportunities and Challenges
 Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling
 Instruction Following without Instruction Tuning
 OmniBench: Towards The Future of Universal Omni-Language Models
 Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely
 A Survey on the Honesty of Large Language Models
 HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
 Merging LoRAs like Playing LEGO: Pushing the Modularity of LoRA to Extremes Through Rank-Wise Clustering
 MOSS: Enabling Code-Driven Evolution and Context Management for AI Agents
 Making Text Embedders Few-Shot Learners
 BioAgents: Democratizing Bioinformatics Analysis with Multi-Agent Systems
 Lighter And Better: Towards Flexible Context Adaptation For Retrieval Augmented Generation
 Learning When to Retrieve, What to Rewrite, and How to Respond in Conversational QA
 EuroLLM: Multilingual Language Models for Europe
 Small Language Models: Survey, Measurements, and Insights
 Reward-Robust RLHF in LLMs
 Planning in the Dark: LLM-Symbolic Planning Pipeline without Experts
 Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning
 Block-Attention for Low-Latency RAG
 Federated Large Language Models: Current Progress and Future Directions
 Visual Prompting in Multimodal Large Language Models: A Survey
 Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents
 Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
 VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
 Adaptive Self-Supervised Learning Strategies for Dynamic On-Device LLM Personalization
 DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling
 Tell Me What You Don't Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing
 Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval
 LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ
 Context-Enhanced LLM-Based Framework for Automatic Test Refactoring
 MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks
 RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems
 A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms
 Disentangling Questions from Query Generation for Task-Adaptive Retrieval
 Boosting Healthcare LLMs Through Retrieved Context
 FineZip : Pushing the Limits of Large Language Models for Practical Lossless Text Compression
 Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference
 HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale
 INT-FlashAttention: Enabling Flash Attention for INT8 Quantization
 NoTeeline: Supporting Real-Time Notetaking from Keypoints with Large Language Models
 A Comprehensive Survey of Bias in LLMs: Current Landscape and Future Directions
 Bone: Block Affine Transformation as Parameter Efficient Fine-tuning Methods for Large Language Models
 EgoLM: Multi-Modal Language Model of Egocentric Motions
 EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
 BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search
 Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
 MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
 Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
 Looped Transformers for Length Generalization
 Automatic Instruction Evolving for Large Language Models
 Towards More Relevant Product Search Ranking Via Large Language Models: An Empirical Study
 Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA)
 Infer Human's Intentions Before Following Natural Language Instructions
 The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends
 VectorSearch: Enhancing Document Retrieval with Semantic Embeddings and Optimized Search
 ISO: Overlap of Computation and Communication within Seqenence For LLM Inference
 Here's Charlie! Realising the Semantic Web vision of Agents in the age of LLMs
 Multi-language Unit Test Generation using LLMs
 CLUE: Concept-Level Uncertainty Estimation for Large Language Models
 Hallucination Detection in LLMs: Fast and Memory-Efficient Finetuned Models
 Alignment-Aware Model Extraction Attacks on Large Language Models
 Creating a Gen-AI based Track and Trace Assistant MVP (SuperTracy) for PostNL
 Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs
 Hypothesizing Missing Causal Variables with LLMs
 Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs
 Membership Inference Attacks Against In-Context Learning
 Deploying a Retrieval based Response Model for Task Oriented Dialogues
 Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
 Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction
 FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment
 Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
 Large Language Models Can Understanding Depth from Monocular Images
 Addition is All You Need for Energy-efficient Language Models
 TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
 MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
 UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models
 SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
 Can Models Learn Skill Composition from Examples?
 Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
 Hyper-Connections
 Visual Question Decomposition on Multimodal Large Language Models
 DiaSynth -- Synthetic Dialogue Generation Framework
 On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation
 LML: Language Model Learning a Dataset for Data-Augmented Prediction
 Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models
 From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
 Emu3: Next-Token Prediction is All You Need
 Learning the Latent Rules of a Game from Data: A Chess Story
 Cottention: Linear Transformers With Cosine Attention
 Do We Need Domain-Specific Embedding Models? An Empirical Investigation
 Data Analysis in the Era of Generative AI
 Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization
 VickreyFeedback: Cost-efficient Data Construction for Reinforcement Learning from Human Feedback
 SciDFM: A Large Language Model with Mixture-of-Experts for Science
 Generative Retrieval Meets Multi-Graded Relevance
 CurricuLLM: Automatic Task Curricula Design for Learning Complex Robot Skills using Large Language Models
 An Adversarial Perspective on Machine Unlearning for AI Safety
 Modulated Intervention Preference Optimization (MIPO): Keep the Easy, Refine the Difficult
 HDFlow: Enhancing LLM Complex Problem-Solving with Hybrid Thinking and Dynamic Workflows
 MSI-Agent: Incorporating Multi-Scale Insight into Embodied Agents for Superior Planning and Decision-Making
 A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
 Natural Language Processing Methods for the Study of Protein-Ligand Interactions
 Solving math word problems with process- and outcome-based feedback
 Ingest-And-Ground: Dispelling Hallucinations from Continually-Pretrained LLMs with RAG
 Law of the Weakest Link: Cross Capabilities of Large Language Models
 One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
 Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
 LoRA Dropout as a Sparsity Regularizer for Overfitting Control
 Don't Transform the Code, Code the Transforms: Towards Precise Code Rewriting using LLMs
 Embodied-RAG: General non-parametric Embodied Memory for Retrieval and Generation
 RATIONALYST: Pre-training Process-Supervision for Improving Reasoning
 ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation
 Closed-loop Long-horizon Robotic Planning via Equilibrium Sequence Modeling
 HelpSteer2-Preference: Complementing Ratings with Preferences
 From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
 Quantifying Generalization Complexity for Large Language Models
 Not All LLM Reasoners Are Created Equal
 LEOPARD : A Vision Language Model For Text-Rich Multi-Image Tasks
 Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis
 FactAlign: Long-form Factuality Alignment of Large Language Models
 InfiniPot: Infinite Context Processing on Memory-Constrained LLMs
 E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
 BordIRlines: A Dataset for Evaluating Cross-lingual Retrieval-Augmented Generation
 Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
 Contrastive Localized Language-Image Pre-Training
 Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
 Large Language Models as Markov Chains
 Distilling an End-to-End Voice Assistant Without Instruction Training Data
 MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation
 General Preference Modeling with Preference Representations for Aligning Language Models
 L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?
 Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
 OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
 FlashMask: Efficient and Rich Mask Extension of FlashAttention
 Unleashing the Power of Large Language Models in Zero-shot Relation Extraction via Self-Prompting
 Understanding the Human-LLM Dynamic: A Literature Survey of LLM Use in Programming Tasks
 KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head
 Understanding Higher-Order Correlations Among Semantic Components in Embeddings
 Calibrating Language Models with Adaptive Temperature Scaling
 On the Inductive Bias of Stacking Towards Improving Reasoning
 Training Language Models to Win Debates with Self-Play Improves Judge Accuracy
 Intelligence at the Edge of Chaos
 Contextual Document Embeddings
 RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
 Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning
 SciPrompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics
 Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models
 VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment
 AutoTrain: No-code training for state-of-the-art models
 Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models
 Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
 The Perfect Blend: Redefining RLHF with Mixture of Judges
 How Much Can RAG Help the Reasoning of LLM?
 ENTP: Encoder-only Next Token Prediction
 Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models
 On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability
 A General Framework for Producing Interpretable Semantic Text Embeddings
 Showing LLM-Generated Code Selectively Based on Confidence of LLMs
 Autoregressive Large Language Models are Computationally Universal
 Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise
 Intrinsic Evaluation of RAG Systems for Deep-Logic Questions
 Erasing Conceptual Knowledge from Language Models
 Selective Attention Improves Transformer
 GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs
 ARB-LLM: Alternating Refined Binarizations for Large Language Models
 Horizon-Length Prediction: Advancing Fill-in-the-Middle Capabilities for Code Generation with Lookahead Planning
 In-context Learning in Presence of Spurious Correlations
 AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
 ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model
 CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs
 TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
 Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation
 HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly
 OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training
 Efficient $1$-bit tensor approximations
 When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
 Differential Transformer
 LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
 DEPT: Decoupled Embeddings for Pre-training Language Models
 Fast State Restoration in LLM Serving with HCache
 TLDR: Token-Level Detective Reward Model for Large Vision Language Models
 Reward-RAG: Enhancing RAG with Reward Driven Supervision
 Named Clinical Entity Recognition Benchmark
 MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs
 LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning
 LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
 GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
 ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
 Why Do We Need Weight Decay in Modern Deep Learning?
 SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
 Algorithmic Capabilities of Random Transformers
 Inference Scaling for Long-Context Retrieval Augmented Generation
 Preference Optimization as Probabilistic Inference
 Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
 RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
 $\textbf{Only-IF}$:Revealing the Decisive Effect of Instruction Diversity on Generalization
 LongGenBench: Long-context Generation Benchmark
 Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?
 nGPT: Normalized Transformer with Representation Learning on the Hypersphere
 VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
 ToolGen: Unified Tool Retrieval and Calling via Generation
 MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
 A generative framework to bridge data-driven models and scientific theories in language neuroscience
 Hyper-multi-step: The Truth Behind Difficult Long-context Tasks
 Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
 Learning How Hard to Think: Input-Adaptive Allocation of LM Computation
 Steering Large Language Models between Code Execution and Textual Reasoning
 TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
 Seeker: Enhancing Exception Handling in Code with LLM-based Multi-Agent Approach
 DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search
 Archon: An Architecture Search Framework for Inference-Time Techniques
 Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes
 Data Selection via Optimal Control for Language Models
 Upcycling Large Language Models into Mixture of Experts
 Temporal Reasoning Transfer from Text to Video
 TRACE: Temporal Grounding Video LLM via Causal Event Modeling
 MM-Ego: Towards Building Egocentric Multimodal LLMs
 One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
 Can Transformers Reason Logically? A Study in SAT Solving
 Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
 Personalized Visual Instruction Tuning
 MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
 Pixtral 12B
 Self-Boosting Large Language Models with Synthetic Preference Data
 Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA
 Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders
 Multimodal Situational Safety
 AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
 Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning
 CursorCore: Assist Programming through Aligning Anything
 TinyEmo: Scaling down Emotional Reasoning via Metric Projection
 MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders
 ING-VP: MLLMs cannot Play Easy Vision-based Games Yet
 Falcon Mamba: The First Competitive Attention-free 7B Language Model
 GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
 Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
 Does Spatial Cognition Emerge in Frontier Models?
 Round and Round We Go! What makes Rotary Positional Encodings useful?
 Large Language Model Enhanced Text-to-SQL Generation: A Survey
 Tracking Universal Features Through Fine-Tuning and Model Merging
 Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG
 Exploring the Meaningfulness of Nearest Neighbor Search in High-Dimensional Space
 SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks
 Response Tuning: Aligning Large Language Models without Instruction
 Collective Critics for Creative Story Generation
 LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
 MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
 Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System
 Emergent properties with repeated examples
 StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization
 PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs
 MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
 MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
 Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning
 SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe
 Intriguing Properties of Large Language and Vision Models
 Benchmarking Agentic Workflow Generation
 GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
 Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition
 Think Twice: A Human-like Two-stage Conversational Agent for Emotional Response Generation
 WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents
 Vector-ICL: In-context Learning with Continuous Vector Representations
 Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
 LLM Cascade with Multi-Objective Optimal Consideration
 No Free Lunch: Retrieval-Augmented Generation Undermines Fairness in LLMs, Even for Vigilant Users
 The Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks
 LLMs Are In-Context Reinforcement Learners
 Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models
 Accelerated Preference Optimization for Large Language Model Alignment
 How to Train Long-Context Language Models (Effectively)
 GraphIC: A Graph-Based In-Context Example Retrieval Model for Multi-Step Reasoning
 SimpleStrat: Diversifying Language Model Generation with Stratification
 Mentor-KD: Making Small Language Models Better Multi-step Reasoners
 SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights
 Science is Exploration: Computational Frontiers for Conceptual Metaphor Theory
 Baichuan-Omni Technical Report
 KV Prediction for Improved Time to First Token
 Do You Know What You Are Talking About? Characterizing Query-Knowledge Relevance For Reliable Retrieval Augmented Generation
 Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
 Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity
 GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment
 Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
 Benign Overfitting in Single-Head Attention
 DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
 I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy
 PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness
 The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models
 MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More
 RL, but don't do anything I wouldn't do
 From Tokens to Words: On the Inner Lexicon of LLMs
 Neuron-Level Sequential Editing for Large Language Models
 Mixture of Attentions For Speculative Decoding
 Integrating Natural Language Prompting Tasks in Introductory Programming Courses
 Benign or Not-Benign Overfitting in Token Selection of Attention Mechanism
 Causal Inference with Large Language Model: A Survey
 MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
 MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
 SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI
 DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
 PeerArg: Argumentative Peer Review with LLMs
 LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
 Thinking LLMs: General Instruction Following with Thought Generation
 Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts
 VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
 ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models
 Toward General Instruction-Following Alignment for Retrieval-Augmented Generation
 Rethinking Data Selection at Scale: Random Selection is Almost All You Need
 The Same But Different: Structural Similarities and Differences in Multilingual Language Modeling
 Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
 Tree of Problems: Improving structured problem solving with compositionality
 TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
 Think While You Generate: Discrete Diffusion with Planned Denoising
 Strong Model Collapse
 Fundamental Limitations on Subquadratic Alternatives to Transformers
 On The Computational Complexity of Self-Attention
 Primer: Searching for Efficient Transformers for Language Modeling
 NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models
 Agent-as-a-Judge: Evaluate Agents with Agents
 Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free
 Empirical Study of Mutual Reinforcement Effect and Application in Few-shot Text Classification Tasks via Prompt
 LLM$\times$MapReduce: Simplified Long-Sequence Processing using Large Language Models
 What Matters in Transformers? Not All Attention is Needed
 Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance
 A Hitchhiker's Guide to Scaling Law Estimation
 How Numerical Precision Affects Mathematical Reasoning Capabilities of LLMs
 Survey and Evaluation of Converging Architecture in LLMs based on Footsteps of Operations
 Agentic Information Retrieval
 In-Context Learning Enables Robot Action Prediction in LLMs
 Exploring Model Kinship for Merging Large Language Models
 Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL
 BenTo: Benchmark Task Reduction with In-Context Transferability
 Revealing the Barriers of Language Agents in Planning
 ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
 Prompt Compression for Large Language Models: A Survey
 Model Balancing Helps Low-data Training and Fine-tuning
 The Moral Case for Using Language Model Agents for Recommendation
 OMCAT: Omni Context Aware Transformer
 FLARE: Faithful Logic-Aided Reasoning and Exploration
 Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
 Persistent Topological Features in Large Language Models
 Modeling Future Conversation Turns to Teach LLMs to Ask Clarifying Questions
 Large Language Model Evaluation via Matrix Nuclear-Norm
 ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains
 Taming Overconfidence in LLMs: Reward Calibration in RLHF
 Parameter-Efficient Fine-Tuning of State Space Models
 How Do Multilingual Models Remember? Investigating Multilingual Factual Recall Mechanisms
 Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
 DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities
 LightRAG: Simple and Fast Retrieval-Augmented Generation
 Large Language Model-Based Evolutionary Optimizer: Reasoning with elitism
 $γ-$MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
 Can MLLMs Understand the Deep Implication Behind Chinese Images?
 Retrospective Learning from Interactions
 A Unified View of Delta Parameter Editing in Post-Trained Large-Scale Models
 AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
 Harnessing Webpage UIs for Text-Rich Visual Understanding
 Looking Inward: Language Models Can Learn About Themselves by Introspection
 PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment
 Improving Multi-modal Large Language Model through Boosting Vision Capabilities
 Persistent Pre-Training Poisoning of LLMs
 A Comparative Study on Reasoning Patterns of OpenAI's o1 Model
 LoLDU: Low-Rank Adaptation via Lower-Diag-Upper Decomposition for Parameter-Efficient Fine-Tuning
 Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
 Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems
 SBI-RAG: Enhancing Math Word Problem Solving for Students through Schema-Based Instruction and Retrieval-Augmented Generation
 Roadmap towards Superhuman Speech Understanding using Large Language Models
 Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
 A Little Human Data Goes A Long Way
 AERO: Softmax-Only LLMs for Efficient Private Inference
 Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging
 Improving Instruction-Following in Language Models through Activation Steering
 JudgeBench: A Benchmark for Evaluating LLM-based Judges
 From Commands to Prompts: LLM-based Semantic File System for AIOS
 MoH: Multi-Head Attention as Mixture-of-Head Attention
 When Attention Sink Emerges in Language Models: An Empirical View
 Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key
 FlatQuant: Flatness Matters for LLM Quantization
 MedMobile: A mobile-sized language model with expert-level clinical capabilities
 Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models
 Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
 SPEER: Sentence-Level Planning of Long Clinical Summaries via Embedded Entity Retrieval
 SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction
 Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
 TopoLM: brain-like spatio-functional organization in a topographic language model
 Global Lyapunov functions: a long-standing open problem in mathematics, with symbolic transformers
 Are AI Detectors Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts
 GenEOL: Harnessing the Generative Power of LLMs for Training-Free Sentence Embeddings
 Teaching Models to Balance Resisting and Accepting Persuasion
 Do LLMs "know" internally when they follow instructions?
 CaTs and DAGs: Integrating Directed Acyclic Graphs with Transformers and Fully-Connected Neural Networks for Causally Constrained Predictions
 Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning
 UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
 Goal Inference from Open-Ended Dialog
 A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
 SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
 Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
 Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media
 Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
 Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
 SymNoise: Advancing Language Model Fine-tuning with Symmetric Noise
 Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
 CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
 Pre-training Distillation for Large Language Models: A Design Space Exploration
 Improve Vision Language Model Chain-of-thought Reasoning
 RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
 Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
 Baichuan Alignment Technical Report
 SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
 Decomposing The Dark Matter of Sparse Autoencoders
 Sparse Universal Transformer
 Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens
 Diverging Preferences: When do Annotators Disagree and do Models Know?
 Do LLMs estimate uncertainty well in instruction-following?
 Large Language Models Are Overparameterized Text Encoders
 Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
 Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
 CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy
 Generative Reward Models
 Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception
 Content Enhanced BERT-based Text-to-SQL Generation
 DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing
 SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments
 Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
 Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning
 RMB: Comprehensively Benchmarking Reward Models in LLM Alignment
 Cascade Reward Sampling for Efficient Decoding-Time Alignment
 Lemur: Log Parsing with Entropy Sampling and Chain-of-Thought Merging
 Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models
 Data Agnostic RoBERTa-based Natural Language to SQL Query Generation
 Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation
 Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement
 Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training
 Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
 In-context learning and Occam's razor
 Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers
 Zero-shot Model-based Reinforcement Learning using Large Language Models
 SMART: Self-learning Meta-strategy Agent for Reasoning Tasks
 Mesa-Extrapolation: A Weave Position Encoding Method for Enhanced Extrapolation in LLMs
 Transformers are Efficient Compilers, Provably
 LongReward: Improving Long-context Large Language Models with AI Feedback
 Automatically Interpreting Millions of Features in Large Language Models
 You can remove GPT2's LayerNorm by fine-tuning
 An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
 MiniPLM: Knowledge Distillation for Pre-Training Language Models
 Value Residual Learning For Alleviating Attention Concentration In Transformers
 LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging
 Aligning Large Language Models via Self-Steering Optimization
 Math Neurosurgery: Isolating Language Models' Math Reasoning Abilities Using Only Forward Passes
 Beyond Retrieval: Generating Narratives in Conversational Recommender Systems
 Bridging Search and Recommendation in Generative Retrieval: Does One Task Help the Other?
 STAR: A Simple Training-free Approach for Recommendations using Large Language Models
 SouLLMate: An Application Enhancing Diverse Mental Health Support with Adaptive LLMs, Prompt Engineering, and RAG Techniques
 EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search
 Improving Pinterest Search Relevance Using Large Language Models
 Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
 Pyramid Vector Quantization for LLMs
 TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
 LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering
 Stick-breaking Attention
 SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains
 Scaling Diffusion Language Models via Adaptation from Autoregressive Models
 OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
 Frontiers in Intelligent Colonoscopy
 Fine-Tuning Large Language Models to Appropriately Abstain with Semantic Entropy
 LLM-based Optimization of Compound AI Systems: A Survey
 Improving Parallel Program Performance Through DSL-Driven Code Generation with LLM Optimizers
 M-RewardBench: Evaluating Reward Models in Multilingual Settings
 MedINST: Meta Dataset of Biomedical Instructions
 ALTA: Compiler-Based Analysis of Transformers
 SmartRAG: Jointly Learn RAG-Related Tasks From the Environment Feedback
 LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
 Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
 Should We Really Edit Language Models? On the Evaluation of Edited Language Models
 Why Does the Effective Context Length of LLMs Fall Short?
 RRADistill: Distilling LLMs' Passage Ranking Ability for Document Re-Ranking of Long-Tail Queries in a Search Engine
 Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch
 LOGO -- Long cOntext aliGnment via efficient preference Optimization
 CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models
 Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
 Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits
 ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference
 Language Models are Symbolic Learners in Arithmetic
 Balancing Label Quantity and Quality for Scalable Elicitation
 The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
 AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline
 SpinQuant: LLM quantization with learned rotations
 WAFFLE: Multi-Modal Model for Automated Front-End Development
 AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning
 FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs
 Taipan: Efficient and Expressive State Space Language Models with Selective Attention
 Can Knowledge Editing Really Correct Hallucinations?
 When "A Helpful Assistant" Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models
 Distill Visual Chart Reasoning Ability from LLMs to MLLMs
 A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
 ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
 Provably Robust Watermarks for Open-Source Language Models
 DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations
 Rethinking Softmax: Self-Attention with Polynomial Activations
 SIKeD: Self-guided Iterative Knowledge Distillation for mathematical reasoning
 Understanding Players as if They Are Talking to the Game in a Customized Language: A Pilot Study
 The Nature of Mathematical Modeling and Probabilistic Optimization Engineering in Generative AI
 Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
 ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment
 Future Token Prediction -- Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction
 Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements
 Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
 DreamLIP: Language-Image Pre-training with Long Captions
 Inductive Biases and Variable Creation in Self-Attention Mechanisms
 An LLM Agent for Automatic Geospatial Data Analysis
 EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning
 Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
 Long Term Memory: The Foundation of AI Self-Evolution
 From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization
 SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs
 LeanAgent: Lifelong Learning for Formal Theorem Proving
 Little Giants: Synthesizing High-Quality Embedding Data at Scale
 Beyond position: how rotary embeddings shape representations and memory in autoregressive transformers
 A Survey of Conversational Search
 Explaining Graph Neural Networks with Large Language Models: A Counterfactual Perspective for Molecular Property Prediction
 How LLMs Aid in UML Modeling: An Exploratory Study with Novice Analysts
 Teach Multimodal LLMs to Comprehend Electrocardiographic Images
 Knowledge Graph Enhanced Language Agents for Recommendation
 AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios
 COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
 Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning
 VisionCoder: Empowering Multi-Agent Auto-Programming for Image Processing with Hybrid LLMs
 MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
 Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
 Counting Ability of Large Language Models and Impact of Tokenization
 Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design
 Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
 Reflection-Bench: probing AI intelligence with reflection
 PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles
 Analysing the Residual Stream of Language Models Under Knowledge Conflicts
 CoqPilot, a plugin for LLM-based generation of proofs
 Measuring memorization through probabilistic discoverable extraction
 Computational Bottlenecks of Training Small-scale Large Language Models
 Mixture of Parrots: Experts improve memorization more than reasoning
 M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation
 Beyond Autoregression: Fast LLMs via Self-Distillation Through Time
 A Survey of Small Language Models
 HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation
 Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
 LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation
 KD-LoRA: A Hybrid Approach to Efficient Fine-Tuning with LoRA and Knowledge Distillation
 MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
 Plan$\times$RAG: Planning-guided Retrieval Augmented Generation
 Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment
 Language Models And A Second Opinion Use Case: The Pocket Professional
 Fast Best-of-N Decoding via Speculative Rejection
 UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers
 RARe: Retrieval Augmented Retrieval with In-Context Examples
 Towards Next-Generation LLM-based Recommender Systems: A Survey and Beyond
 Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation
 Dialog2Flow: Pre-training Soft-Contrastive Action-Driven Sentence Embeddings for Automatic Dialog Flow Extraction
 Large Language Models Reflect the Ideology of their Creators
 A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration
 A Survey on Data Synthesis and Augmentation for Large Language Models
 ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
 SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization
 EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation
 Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training
 Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
 Understanding Synthetic Context Extension via Retrieval Heads
 Matryoshka: Learning to Drive Black-Box LLMs with LLMs
 The Geometry of Concepts: Sparse Autoencoder Feature Structure
 Attacking Vision-Language Computer Agents via Pop-ups
 OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization
 Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
 CLEAR: Character Unlearning in Textual and Visual Modalities
 Aligning Audio-Visual Joint Representations with an Agentic Workflow
 Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval
 TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
 CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation
 Distinguishing Ignorance from Error in LLM Hallucinations
 Learning and Unlearning of Fabricated Knowledge in Language Models
 Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
 On the Role of Depth and Looping for In-Context Learning with Task Diversity
 Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'
 Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges
 Zero-Shot Dense Retrieval with Embeddings from Relevance Feedback
 Accelerating Direct Preference Optimization with Prefix Sharing
 AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels
 QTIP: Quantization with Trellises and Incoherence Processing
 EMMA: End-to-End Multimodal Model for Autonomous Driving
 SciPIP: An LLM-based Scientific Paper Idea Proposer
 Zipfian Whitening
 On Memorization of Large Language Models in Logical Reasoning
 Stealing User Prompts from Mixture of Experts
 Toxicity of the Commons: Curating Open-Source Pre-Training Data
 RuleRAG: Rule-guided retrieval-augmented generation with language models for question answering
 UFT: Unifying Fine-Tuning of SFT and RLHF/DPO/UNA through a Generalized Implicit Reward Function
 SelfCodeAlign: Self-Alignment for Code Generation
 Constraint Back-translation Improves Complex Instruction Following of Large Language Models
 Nearest Neighbor Normalization Improves Multimodal Retrieval
 Language Models can Self-Lengthen to Generate Long Texts
 Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models
 Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts
 Weight decay induces low-rank attention layers
 What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
 Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists
 Toward Understanding In-context vs. In-weight Learning
 Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks
 BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
 GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
 AAAR-1.0: Assessing AI's Potential to Assist Research
 Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective
 Failure Modes of LLMs for Causal Reasoning on Narratives
 Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
 NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks
 Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses
 $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources
 Physics in Next-token Prediction
 "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
 WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
 Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
 PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
 GPT or BERT: why not both?
 Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
 SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF
 Thinking Forward and Backward: Effective Backward Planning with Large Language Models
 Context Parallelism for Scalable Million-Token Inference
 RAGViz: Diagnose and Visualize Retrieval-Augmented Generation
 DynaSaur: Large Language Agents Beyond Predefined Actions
 Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks
 LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding
 LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
 Survey of Cultural Awareness in Language Models: Text and Beyond
 LLM-KT: A Versatile Framework for Knowledge Transfer from Large Language Models to Collaborative Filtering
 Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models
 E2E-AFG: An End-to-End Model with Adaptive Filtering for Retrieval-Augmented Generation
 Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
 GRS-QA -- Graph Reasoning-Structured Question Answering Dataset
 BitNet a4.8: 4-bit Activations for 1-bit LLMs
 Beyond Utility: Evaluating LLM as Recommender
 Rationale-Guided Retrieval Augmented Generation for Medical Question Answering
 Personalization of Large Language Models: A Survey
 AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
 How Does Critical Batch Size Scale in Pre-training?
 Scaling Optimal LR Across Token Horizons
 Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
 Not All Memories are Created Equal: Learning to Forget by Expiring
 Inference Optimal VLMs Need Only One Visual Token but Larger Models
 HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
 Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
 Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge
 MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
 DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
 Sample-Efficient Alignment for LLMs
 LLaMo: Large Language Model-based Molecular Graph Assistant
 Controlling Language and Diffusion Models by Transporting Activations
 A Scalable Communication Protocol for Networks of Large Language Models
 ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
 Lightning IR: Straightforward Fine-tuning and Inference of Transformer-based Language Models for Information Retrieval
 Wave Network: An Ultra-Small Language Model
 Model Equality Testing: Which Model Is This API Serving?
 A linguistic analysis of undesirable outcomes in the era of generative AI
 Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
 Long Context RAG Performance of Large Language Models
 LASER: Attention with Exponential Transformation
 Photon: Federated LLM Pre-Training
 How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis
 M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
 Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?
 MambaPEFT: Exploring Parameter-Efficient Fine-Tuning for Mamba
 MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue
 SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models
 Can LLMs make trade-offs involving stipulated pain and pleasure states?
 Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers
 Formal Theorem Proving by Rewarding LLMs to Decompose Proofs Hierarchically
 Teaching Models to Improve on Tape
 Evolving Alignment via Asymmetric Self-Play
 Scaling LLM Inference with Optimized Sample Compute Allocation
 Self-Consistency Preference Optimization
 Tiny Transformers Excel at Sentence Compression
 Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
 Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
 From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
 A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness
 What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
 LoRA vs Full Fine-tuning: An Illusion of Equivalence
 Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
 OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
 RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval
 Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model
 LSHBloom: Memory-efficient, Extreme-scale Document Deduplication
 Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?
 VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
 Towards Reliable Alignment: Uncertainty-aware RLHF
 Abrupt Learning in Transformers: A Case Study on Matrix Completion
 MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression
 O1 Replication Journey: A Strategic Progress Report -- Part 1
 KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing
 Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition
 Methods of improving LLM training stability
 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
 CartesianMoE: Boosting Knowledge Sharing among Experts via Cartesian Product Routing in Mixture-of-Experts
 Generalized Probabilistic Attention Mechanism in Transformers
 Economic Anthropology in the Era of Generative Artificial Intelligence
 Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation
 MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts
 A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference
 MoDification: Mixture of Depths Made Easy
 Speciesism in Natural Language Processing Research
 Reducing the Transformer Architecture to a Minimum
 MoR: Mixture of Ranks for Low-Rank Adaptation Tuning
 Metacognitive Monitoring: A Human Ability Beyond Generative Artificial Intelligence
 Hypothesis Testing the Circuit Hypothesis in LLMs
 FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
 Theoretical Analysis of Hierarchical Language Recognition and Generation by Transformers without Positional Encoding
 Conformity in Large Language Models
 Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing
 A Case for AI Consciousness: Language Agents and Global Workspace Theory
 Local and Global Decoding in Text Generation
 SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
 Geometric Signatures of Compositionality Across a Language Model's Lifetime
 Is Parameter Collision Hindering Continual Learning in LLMs?
 Reverse Modeling in Large Language Models
 On the Proper Treatment of Tokenization in Psycholinguistics
 Post-edits Are Preferences Too
 Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization
 Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity
 EmbedLLM: Learning Compact Representations of Large Language Models
 Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1
 Mitigating Memorization In Language Models
 House of Cards: Massive Weights in LLMs
 U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models
 Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models
 Investigating the Synergistic Effects of Dropout and Residual Connections on Language Model Training
 RisingBALLER: A player is a token, a match is a sentence, A path towards a foundational model for football players data analytics
 MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards
 Self-Updatable Large Language Models with Parameter Integration
 Are LLMs Aware that Some Questions are not Open-ended?
 Vision Language Models See What You Want but not What You See
 A Looming Replication Crisis in Evaluating Behavior in Language Models? Evidence and Solutions
 1 Trillion Token (1TT) Platform: A Novel Framework for Efficient Data Sharing and Compensation in Large Language Models
 Can LLMs Really Learn to Translate a Low-Resource Language from One Grammar Book?
 M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
 Analyzing The Language of Visual Tokens
 The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
 Model merging with SVD to tie the Knots
 Best Practices for Distilling Large Language Models into BERT for Web Search Ranking
 Interpretable Language Modeling via Induction-head Ngram Models
 Unlearning in- vs. out-of-distribution data in LLMs under gradient-based method
 GUI Agents with Foundation Models: A Comprehensive Survey
 Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
 DELIFT: Data Efficient Language model Instruction Fine Tuning
 Aioli: A Unified Optimization Framework for Language Model Data Mixing
 LBPE: Long-token-first Tokenization to Improve Large Language Models
 Balancing Pipeline Parallelism with Vocabulary Parallelism
 Fox-1 Technical Report
 STAND-Guard: A Small Task-Adaptive Content Moderation Model
 Alopex: A Computational Framework for Enabling On-Device Function Calls with LLMs
 CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement
 Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
 FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs?
 Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry
 Performance-Guided LLM Knowledge Distillation for Efficient Text Classification at Scale
 Towards Interpreting Language Models: A Case Study in Multi-Hop Reasoning
 LLMs as Research Tools: A Large Scale Survey of Researchers' Usage and Perceptions
 Scattered Forest Search: Smarter Code Space Exploration with LLMs
 Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study
 RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models
 An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking
 ZipNN: Lossless Compression for AI Models
 LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
 Efficient Constant-Space Multi-Vector Retrieval
 FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
 Number Cookbook: Number Understanding of Language Models and How to Improve It
 Mixtures of In-Context Learners
 Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications
 Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
 Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models
 AFlow: Automating Agentic Workflow Generation
 Recycled Attention: Efficient inference for long-context language models
 Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
 Counterfactual Generation from Language Models
 Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
 Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models
 IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization
 CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM
 Does your LLM truly unlearn? An embarrassingly simple approach to recover unlearned knowledge
 Game-theoretic LLM: Agent Workflow for Negotiation Games
 Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction
 NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts
 GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models
 More Expressive Attention with Negative Weights
 Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation
 LLM-Neo: Parameter Efficient Knowledge Distillation for Large Language Models
 End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering
 Learning Code Preference via Synthetic Evolution
 Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation
 Scaling Laws for Precision
 Trustful LLMs: Customizing and Grounding Text Generation with Knowledge Bases and Dual Decoders
 RedCode: Risky Code Execution and Generation Benchmark for Code Agents
 Likelihood as a Performance Gauge for Retrieval-Augmented Generation
 Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows
 Entropy Controllable Direct Preference Optimization
 SecEncoder: Logs are All You Need in Security
 Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
 Toward Optimal Search and Retrieval for RAG
 The Super Weight in Large Language Models
 Multi-Modal Forecaster: Jointly Predicting Time Series and Textual Data
 What Should Baby Models Read? Exploring Sample-Efficient Data Composition on Model Performance
 Sufficient Context: A New Lens on Retrieval Augmented Generation Systems
 Towards Low-bit Communication for Tensor Parallel LLM Inference
 What Do Learning Dynamics Reveal About Generalization in LLM Reasoning?
 The Crucial Role of Samplers in Online Direct Preference Optimization
 SetLexSem Challenge: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models
 Stronger Models are NOT Stronger Teachers for Instruction Tuning
 Hardware and Software Platform Inference
 Direct Preference Optimization Using Sparse Feature-Level Constraints
 An Empirical Study on LLM-based Agents for Automated Bug Fixing
 CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
 Can sparse autoencoders be used to decompose and interpret steering vectors?
 Dynamic Subset Tuning: Expanding the Operational Range of Parameter-Efficient Training for Large Language Models
 Large Language Models Can Self-Improve in Long-context Reasoning
 Language Models as Causal Effect Generators
 Natural Language Reinforcement Learning
 Model Stealing for Any Low-Rank Language Model
 Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs
 XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL
 Controllable Context Sensitivity and the Knob Behind It
 LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
 Pie: Pooling CPU Memory for LLM Inference
 Cut Your Losses in Large-Vocabulary Language Models
 LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs
 Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models
 A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look
 ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?
 Squeezed Attention: Accelerating Long Context Length LLM Inference
 Hermes: A Large Language Model Framework on the Journey to Autonomous Networks
 Number it: Temporal Grounding Videos like Flipping Manga
 The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
 Benchmarking Distributional Alignment of Large Language Models
 Towards Edge General Intelligence via Large Language Models: Opportunities and Challenges
 LLaVA-o1: Let Vision Language Models Reason Step-by-Step
 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
 Xmodel-1.5: An 1B-scale Multilingual LLM
 MARS: Unleashing the Power of Variance Reduction for Training Large Models
 Generative Agent Simulations of 1,000 People
 Hidden Persuaders: LLMs' Political Leaning and Their Influence on Voters
 Drowning in Documents: Consequences of Scaling Reranker Inference
 Top-$nσ$: Not All Logits Are You Need
 Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
 Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities
 Closing the Curious Case of Neural Text Degeneration
 Fine-tuning Happens in Tiny Subspaces: Exploring Intrinsic Task-specific Subspaces of Pre-trained Language Models
 Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
 LLäMmlein: Compact and Competitive German-Only Language Models from Scratch
 Analyzing Pokémon and Mario Streamers' Twitch Chat with LLM-based User Embeddings
 BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
 Evaluating the role of 'Constitutions' for learning from AI feedback
 SlimLM: An Efficient Small Language Model for On-Device Document Assistance
 Adaptive Decoding via Latent Preference Optimization
 Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
 MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
 Bi-Mamba: Towards Accurate 1-Bit State Space Models
 BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
 FedCoLLM: A Parameter-Efficient Federated Co-tuning Framework for Large and Small Language Models
 Steering Language Model Refusal with Sparse Autoencoders
 VersaTune: Fine-Tuning Multi-Ability LLMs Efficiently
 MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
 Beyond Human-Like Processing: Large Language Models Perform Equivalently on Forward and Backward Scientific Text
 SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
 Does Prompt Formatting Have Any Impact on LLM Performance?
 KuaiFormer: Transformer-Based Retrieval at Kuaishou
 Empowering Meta-Analysis: Leveraging Large Language Models for Scientific Synthesis
 Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages
 Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search
 UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
 RedPajama: an Open Dataset for Training Large Language Models
 Building Trust: Foundations of Security, Safety and Transparency in AI
 A Taxonomy of AgentOps for Enabling Observability of Foundation Model based Agents
 When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
 On the Way to LLM Personalization: Learning to Remember User Conversations
 VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
 MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers
 Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues
 SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
 Refusal in LLMs is an Affine Function
 Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
 FreeAL: Towards Human-Free Active Learning in the Era of Large Language Models
 ORID: Organ-Regional Information Driven Framework for Radiology Report Generation
 Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
 Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
 OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
 Hymba: A Hybrid-head Architecture for Small Language Models
 Are Large Language Models Memorizing Bug Benchmarks?
 Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training
 Ultra-Sparse Memory Network
 Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
 Sparse Upcycling: Inference Inefficient Finetuning
 ChatGPT in Research and Education: Exploring Benefits and Threats
 Merging in a Bottle: Differentiable Adaptive Merging (DAM) and the Path from Averaging to Automation
 Scaling Laws for Reward Model Overoptimization
 Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
 Patience Is The Key to Large Language Model Reasoning
 Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
 Auto-Regressive Next-Token Predictors are Universal Learners
 Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models
 A Reproducibility and Generalizability Study of Large Language Models for Query Generation
 Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
 Disentangling Memory and Reasoning Ability in Large Language Models
 RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
 XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
 DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
 Evaluating the Robustness of Analogical Reasoning in Large Language Models
 TÜLU 3: Pushing Frontiers in Open Language Model Post-Training
 One to rule them all: natural language to bind communication, perception and action
 Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
 Understanding LLM Embeddings for Regression
 BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
 A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
 Conversational Medical AI: Ready for Practice
 From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
 MH-MoE: Multi-Head Mixture-of-Experts
 O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
 From CISC to RISC: language-model guided assembly transpilation
 Predicting Emergent Capabilities by Finetuning
 LLMs Do Not Think Step-by-step In Implicit Reasoning
 Knowledge Transfer Across Modalities with Natural Language Supervision
 The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz
 Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
 Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
 SketchAgent: Language-Driven Sequential Sketch Generation
 I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token
 Pushing the Limits of Large Language Model Quantization via the Linearity Theorem
 VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
 BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving
 2D Matryoshka Training for Information Retrieval
 Star Attention: Efficient LLM Inference over Long Sequences
 Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
 Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
 Self-Generated Critiques Boost Reward Modeling for Language Models
 All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
 Multi-modal Retrieval Augmented Multi-modal Generation: A Benchmark, Evaluate Metrics and Strong Baselines
 Low-Rank Correction for Quantized LLMs
 SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis
 A Survey on LLM-as-a-Judge
 From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars
 From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set
 The Extractive-Abstractive Spectrum: Uncovering Verifiability Trade-offs in LLM Generations
 MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
 MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts
 LlaMaVAE: Guiding Large Language Model Generation via Continuous Latent Sentence Spaces
 Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers
 Boundless Socratic Learning with Language Games
 Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding
 ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
 Large Language Model-Brained GUI Agents: A Survey
 Training and Evaluating Language Models with Template-based Data Generation
 Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
 MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation
 HyperSeg: Towards Universal Visual Segmentation with Large Language Model
 LoLCATs: On Low-Rank Linearizing of Large Language Models
 AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text
 o1-Coder: an o1 Replication for Coding
 VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
 Llama Guard 3-1B-INT4: Compact and Efficient Safeguard for Human-AI Conversations
 HARec: Hyperbolic Graph-LLM Alignment for Exploration and Exploitation in Recommender Systems
 LEADRE: Multi-Faceted Knowledge Enhanced LLM Empowered Display Advertisement Recommender System
 FastRAG: Retrieval Augmented Generation for Semi-structured Data
 Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
 AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset
 AIGS: Generating Science from AI-Powered Automated Falsification
 FinRobot: AI Agent for Equity Research and Valuation with Large Language Models
 Two are better than one: Context window extension with multi-grained self-injection
 Towards Knowledge Checking in Retrieval-augmented Generation: A Representation Perspective
 Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat
 Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis
 On Domain-Specific Post-Training for Multimodal Large Language Models
 ColBERT-serve: Efficient Multi-Stage Memory-Mapped Scoring
 Reverse Thinking Makes LLMs Stronger Reasoners
 Advanced System Integration: Analyzing OpenAPI Chunking for Retrieval-Augmented Generation
 Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems
 LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification
 Does Representation Matter? Exploring Intermediate Layers in Large Language Models
 TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with Scalable Context and Symbolic Extension
 Zero-Indexing Internet Search Augmented Generation for Large Language Models
 Towards Understanding Retrieval Accuracy and Prompt Quality in RAG Systems
 Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models
 Puzzle: Distillation-Based NAS for Inference-Optimized LLMs
 Mars-PO: Multi-Agent Reasoning System Preference Optimization
 ICLERB: In-Context Learning Embedding and Reranker Benchmark
 MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for Tabular Applications
 Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS
 Leveraging Retrieval-Augmented Generation for University Knowledge Retrieval
 Yi-Lightning Technical Report
 Sneaking Syntax into Transformer Language Models with Tree Regularization
 T-REG: Preference Optimization with Token-Level Reward Regularization
 Gradient Localization Improves Lifelong Pretraining of Language Models
 AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
 OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
 Free Process Rewards without Process Labels
 X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
 The Evolution of RWKV: Advancements in Efficient Language Modeling
 VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
 [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
 LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant
 MBA-RAG: a Bandit Approach for Adaptive Retrieval-Augmented Generation through Question Complexity
 COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection
 LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
 CPRM: A LLM-based Continual Pre-training Framework for Relevance Modeling in Commercial Search
 Ponder & Press: Advancing Visual GUI Agent towards General Computer Control
 Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input
 Language Models Encode Numbers Using Digit Representations in Base 10
 VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
 VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
 Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting
 DynRank: Improving Passage Retrieval with Dynamic Zero-Shot Prompting Based on Question Classification
 Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection
 Baba Is AI: Break the Rules to Beat the Benchmark
 Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
 VLSBench: Unveiling Visual Leakage in Multimodal Safety
 INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
 A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models
 GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
 STAR: Synthesis of Tailored Architectures
 Best-of-N Jailbreaking
 PaliGemma 2: A Family of Versatile VLMs for Transfer
 Beyond Questions: Leveraging ColBERT for Keyphrase Search
 RedStone: Curating General, Code, Math, and QA Data for Large Language Models
 A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for accelerating Large VLMs
 FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness
 AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
 A dynamic parallel method for performance optimization on hybrid CPUs
 Weighted-Reward Preference Optimization for Implicit Model Fusion
 Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
 Does Few-Shot Learning Help LLM Performance in Code Synthesis?
 RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models
 Time-Reversal Provides Unsupervised Feedback to LLMs
 Self-Improvement in Language Models: The Sharpening Mechanism
 Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs"
 MALT: Improving Reasoning with Multi-Agent LLM Training
 Generating a Low-code Complete Workflow via Task Decomposition and RAG
 Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis
 Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
 U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
 QA-TOOLBOX: Conversational Question-Answering for process task guidance in manufacturing
 Explainable CTR Prediction via LLM Reasoning
 Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
 VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
 VisionZip: Longer is Better but Not Necessary in Vision Language Models
 Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
 Retrieval-Augmented Machine Translation with Unstructured Knowledge
 Densing Law of LLMs
 Monet: Mixture of Monosemantic Experts for Transformers
 Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement
 A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios
 Domain-specific Question Answering with Hybrid Search
 Evaluating Language Models as Synthetic Data Generators
 Theoretical limitations of multi-layer Transformer
 VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
 Personalized Multimodal Large Language Models: A Survey
 Optimal Memorization Capacity of Transformers
 NVILA: Efficient Frontier Visual Language Models
 p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
 Discriminative Fine-tuning of LVLMs
 Challenges in Trustworthy Human Evaluation of Chatbots
 ALMA: Alignment with Minimal Annotation
 Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
 KV Shifting Attention Enhances Language Modeling
 Establishing Task Scaling Laws via Compute-Efficient Model Ladders
 Evolutionary Pre-Prompt Optimization for Mathematical Reasoning
 Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges
 Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
 APOLLO: SGD-like Memory, AdamW-level Performance
 CompCap: Improving Multimodal Large Language Models with Composite Captions
 MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
 In Tree Structure Should Sentence Be Generated
 BEExformer: A Fast Inferencing Transformer Architecture via Binarization with Multiple Early Exits
 ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented Argumentation with LLM Judges
 LinVT: Empower Your Image-level Large Language Model to Understand Videos
 Transformers Can Navigate Mazes With Multi-Step Prediction
 Gated Delta Networks: Improving Mamba2 with Delta Rule
 Frontier Models are Capable of In-context Scheming
 Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
 DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling
 EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
 Transformers Struggle to Learn to Search
 The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation
 Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
 Mixture-of-PageRanks: Replacing Long-Context with Real-Time, Sparse GraphRAG
 Flex Attention: A Programming Model for Generating Optimized Attention Kernels
 Training Large Language Models to Reason in a Continuous Latent Space
 ProcessBench: Identifying Process Errors in Mathematical Reasoning
 Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
 If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs
 Robust Multi-bit Text Watermark with LLM-based Paraphrasers
 An Evolved Universal Transformer Memory
 Mixture of Hidden-Dimensions Transformer
 Granite Guardian
 Adapting to Non-Stationary Environments: Multi-Armed Bandit Enhanced Retrieval-Augmented Generation on Knowledge Graphs
 Automatic Database Configuration Debugging using Retrieval-Augmented Language Models
 Large Language Models are Biased Because They Are Large Language Models
 Maya: An Instruction Finetuned Multilingual Multimodal Model
 Fully Open Source Moxin-7B Technical Report
 ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
 Evaluating and Aligning CodeLLMs on Human Preference
 Pre-trained Language Models Return Distinguishable Probability Distributions to Unfaithfully Hallucinated Texts
 Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
 Tool Learning with Foundation Models
 POINTS1.5: Building a Vision-Language Model towards Real World Applications
 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
 Is Personality Prediction Possible Based on Reddit Comments?
 Contextualized Counterspeech: Strategies for Adaptation, Personalization, and Evaluation
 Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation
 HARP: Hesitation-Aware Reframing in Transformer Inference Pass
 Chimera: Improving Generalist Model with Domain-Specific Experts
 StreamChat: Chatting with Streaming Video
 KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models
 The BrowserGym Ecosystem for Web Agent Research
 Advancing Single- and Multi-task Text Classification through Large Language Model Fine-tuning
 jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
 From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
 Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning
 Phi-4 Technical Report
 Large Concept Models: Language Modeling in a Sentence Representation Space
 Test-Time Alignment via Hypothesis Reweighting
 LatentQA: Teaching LLMs to Decode Activations Into Natural Language
 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
 The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
 Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
 OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
 JuStRank: Benchmarking LLM Judges for System Ranking
 RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
 VisionArena: 230K Real World User-VLM Conversations with Preference Labels
 Semantic Retrieval at Walmart
 Fine-Tuning Language Models with Advantage-Induced Policy Alignment
 LIBER: Lifelong User Behavior Modeling Based on Large Language Models
 FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers
 Understanding World or Predicting Future? A Comprehensive Survey of World Models
 Improving training time and GPU utilization in geo-distributed language model training
 A Survey of Financial AI: Architectures, Advances and Open Challenges
 Gini Coefficient as a Unified Metric for Evaluating Many-versus-Many Similarity in Vector Spaces
 Large Language Models as Neurolinguistic Subjects: Identifying Internal Representations for Form and Meaning
 Regress, Don't Guess -- A Regression-like Loss on Number Tokens for Language Models
 Adapting Language Models via Token Translation
 Length-Induced Embedding Collapse in Transformer-based Models
 L3Ms -- Lagrange Large Language Models
 Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
 BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data
 Subspace Optimization for Large Language Models with Convergence Guarantees
 Liger Kernel: Efficient Triton Kernels for LLM Training
 Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs)
 Defining Knowledge: Bridging Epistemology and Large Language Models
 LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management
 'Simulacrum of Stories': Examining Large Language Models as Qualitative Research Participants
 Self-attention as an attractor network: transient memories without backpropagation
 AI Can Be Cognitively Biased: An Exploratory Study on Threshold Priming in LLM-Based Batch Relevance Assessment
 A Large Language Model and Denoising Diffusion Framework for Targeted Design of Microstructures with Commands in Natural Language
 Guided Profile Generation Improves Personalization with LLMs
 Is Contrasting All You Need? Contrastive Learning for the Detection and Attribution of AI-generated Text
 Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNorm
 Personality Alignment of Large Language Models
 RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
 The Hitchhiker's Guide to Human Alignment with *PO
 Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models
 Establishing Knowledge Preference in Language Models
 Do LLMs have Consistent Values?
 The Foundations of Tokenization: Statistical and Computational Concerns
 Curriculum Learning for Small Code Language Models
 Optimized Multi-Token Joint Decoding with Auxiliary Model for LLM Inference
 FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
 A Survey of Controllable Learning: Methods and Applications in Information Retrieval
 LLM Internal States Reveal Hallucination Risk Faced With a Query
 Efficient Sparse Attention needs Adaptive Token Release
 LLM Uncertainty Quantification through Directional Entailment Graph and Claim Level Response Augmentation
 AutoPal: Autonomous Adaptation to Users for Personal AI Companionship
 It's Morphing Time: Unleashing the Potential of Multiple LLMs via Multi-objective Optimization
 LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods
 Mental Modeling of Reinforcement Learning Agents by Language Models
 SimSMoE: Solving Representational Collapse via Similarity Measure
 Hybrid Alignment Training for Large Language Models
 Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective
 LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation
 Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling
 Adaptive Token Biaser: Knowledge Editing via Biasing Key Entities
 Abstraction-of-Thought Makes Language Models Better Reasoners
 Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters
 Promises, Outlooks and Challenges of Diffusion Language Modeling
 CodeGemma: Open Code Models Based on Gemma
 What Kinds of Tokens Benefit from Distant Text? An Analysis on Long Context Language Modeling
 A Survey on Human Preference Learning for Large Language Models
 Exploring the Zero-Shot Capabilities of LLMs Handling Multiple Problems at once
 SoK: Decentralized AI (DeAI)
 Meaningless is better: hashing bias-inducing words in LLM prompts improves performance in logical reasoning and statistical learning
 The Two-Hop Curse: LLMs trained on A->B, B->C fail to learn A-->C
 Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability
 The Zamba2 Suite: Technical Report
 Comparative Analysis of Pooling Mechanisms in LLMs: A Sentiment Analysis Perspective
 Planning-Driven Programming: A Large Language Model Programming Workflow
 Logic Augmented Generation
 Orca: Enhancing Role-Playing Abilities of Large Language Models by Integrating Personality Traits
 On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse
 Theoretical Analysis of Byte-Pair Encoding
 Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
 LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models
 Efficient Federated Finetuning of Tiny Transformers with Resource-Constrained Devices
 Warmstarting for Scaling Language Models
 TreeCoders: Trees of Transformers
 Token2Wave
 Enhancing Transformer Training Efficiency with Dynamic Dropout
 DroidSpeak: Enhancing Cross-LLM Communication
 Ask, and it shall be given: Turing completeness of prompting
 Can Language Models Learn to Skip Steps?
 Unlocking the Theory Behind Scaling 1-Bit Neural Networks
 Thought Space Explorer: Navigating and Expanding Thought Space for Large Language Model Reasoning
 All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling
 Social Science Meets LLMs: How Reliable Are Large Language Models in Social Simulations?
 Moral Agency in Silico: Exploring Free Will in Large Language Models
 Personas with Attitudes: Controlling LLMs for Diverse Data Annotation
 Towards Infinite-Long Prefix in Transformer
 Glider: Global and Local Instruction-Driven Expert Router
 Llama SLayer 8B: Shallow Layers Hold the Key to Knowledge Injection
 An Implementation of Werewolf Agent That does not Truly Trust LLMs
 ReAttention: Training-Free Infinite Context with Finite Attention Scope
 Collective Innovation in Groups of Large Language Models
 Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
 See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding
 LEGO: Language Model Building Blocks
 LMLPA: Language Model Linguistic Personality Assessment
 Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
 Unifying Economic and Language Models for Enhanced Sentiment Analysis of the Oil Market
 Light-Weight Fault Tolerant Attention for Large Language Model Training
 Understanding Likelihood Over-optimisation in Direct Alignment Algorithms
 Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only
 Personality Differences Drive Conversational Dynamics: A High-Dimensional NLP Approach
 A Unified Approach to Routing and Cascading for LLMs
 Evaluating Language Model Character Traits
 LoRTA: Low Rank Tensor Adaptation of Large Language Models
 Using Prompts to Guide Large Language Models in Imitating a Real Person's Language Style
 Kiss up, Kick down: Exploring Behavioral Changes in Multi-modal Large Language Models with Assigned Visual Personas
 Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models
 Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language
 On The Adaptation of Unlimiformer for Decoder-Only Transformers
 Lines of Thought in Large Language Models
 PersonalLLM: Tailoring LLMs to Individual Preferences
 Characterizing stable regions in the residual stream of LLMs
 Counterfactual Token Generation in Large Language Models
 Forking Paths in Neural Text Generation
 LLM Echo Chamber: personalized and automated disinformation
 HLB: Benchmarking LLMs' Humanlikeness in Language Use
 Supervised Fine-Tuning Achieve Rapid Task Adaption Via Alternating Attention Head Activation Patterns
 Backtracking Improves Generation Safety
 Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape
 Investigating Layer Importance in Large Language Models
 Uncovering Latent Chain of Thought Vectors in Language Models
 Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings
 Spin glass model of in-context learning
 Scaling Embedding Layers in Language Models
 Advancing Prompt Learning through an External Layer
 Effects of Scale on Language Model Robustness
 Relating the Seemingly Unrelated: Principled Understanding of Generalization for Generative Models in Arithmetic Reasoning Tasks
 Financial Statement Analysis with Large Language Models
 AI TrackMate: Finally, Someone Who Will Give Your Music More Than Just "Sounds Great!"
 Graph-Structured Speculative Decoding
 On the Benefits of Rank in Attention Layers
 Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability
 Supporting the Digital Autonomy of Elders Through LLM Assistance
 Psychometric Alignment: Capturing Human Knowledge Distributions via Language Models
 Dissecting Multiplication in Transformers: Insights into LLMs
 Open Artificial Knowledge
 Combining Constraint Programming Reasoning with Large Language Model Predictions
 Transformer-based Single-Cell Language Model: A Survey
 Compressed models are NOT miniature versions of large models
 Beyond KV Caching: Shared Attention for Efficient LLMs
 Harnessing the Power of Artificial Intelligence to Vitalize Endangered Indigenous Languages: Technologies and Experiences
 Struct-X: Enhancing Large Language Models Reasoning with Structured Data
 The Better Angels of Machine Personality: How Personality Relates to LLM Safety
 Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs
 Large Language Models as Misleading Assistants in Conversation
 Apollo: An Exploration of Video Understanding in Large Multimodal Models
 SCBench: A KV Cache-Centric Analysis of Long-Context Methods
 DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
 Large Action Models: From Inception to Implementation
 On Implications of Scaling Laws on Feature Superposition
 Hey, That's My Model! Introducing Chain & Hash, An LLM Fingerprinting Technique
 Weighted Grouped Query Attention in Transformers
 MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs
 BiasScanner: Automatic Detection and Classification of News Bias to Strengthen Democracy
 MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
 Self-Evolving GPT: A Lifelong Autonomous Experiential Learner
 Real-Time Anomaly Detection and Reactive Planning with Large Language Models
 On the Universal Truthfulness Hyperplane Inside LLMs
 A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training
 Bucket Pre-training is All You Need
 Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning
 SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training
 Virtual Personas for Language Models via an Anthology of Backstories
 Towards Understanding Multi-Task Learning (Generalization) of LLMs via Detecting and Exploring Task-Specific Neurons
 Optimal Decision Making Through Scenario Simulations Using Large Language Models
 Using Grammar Masking to Ensure Syntactic Validity in LLM-based Modeling Tasks
 Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
 Stephanie: Step-by-Step Dialogues for Mimicking Human Interactions in Social Conversations
 Improving Self Consistency in LLMs through Probabilistic Tokenization
 Over the Edge of Chaos? Excess Complexity as a Roadblock to Artificial General Intelligence
 The Mysterious Case of Neuron 1512: Injectable Realignment Architectures Reveal Internal Characteristics of Meta's Llama 2 Model
 Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models
 Raw Text is All you Need: Knowledge-intensive Multi-turn Instruction Tuning for Large Language Model
 Large Language Models as Evaluators for Scientific Synthesis
 Efficient Training of Language Models with Compact and Consistent Next Token Distributions
 Learning to Reduce: Towards Improving Performance of Large Language Models on Structured Data
 Large Language Model Enhanced Knowledge Representation Learning: A Survey
 Generative Monoculture in Large Language Models
 Black Big Boxes: Do Language Models Hide a Theory of Adjective Order?
 DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models
 Dynamic Universal Approximation Theory: The Basic Theory for Transformer-based Large Language Models
 Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs
 LEMoE: Advanced Mixture of Experts Adaptor for Lifelong Model Editing of Large Language Models
 Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model
 Debate-to-Write: A Persona-Driven Multi-Agent Framework for Diverse Argument Generation
 Mixture of In-Context Experts Enhance LLMs' Long Context Awareness
 The Rise of Artificial Intelligence in Educational Measurement: Opportunities and Ethical Challenges
 MammothModa: Multi-Modal Large Language Model
 Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective
 Native Design Bias: Studying the Impact of English Nativeness on Language Model Performance
 Understanding and Mitigating Tokenization Bias in Language Models
 Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
 Large Vocabulary Size Improves Large Language Models
 The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
 Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging
 What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
 Does Cross-Cultural Alignment Change the Commonsense Morality of Language Models?
 FastMem: Fast Memorization of Prompt Improves Context Awareness of Large Language Models
 Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models
 Distributed Rule Vectors is A Key Mechanism in Large Language Models' In-Context Learning
 Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration
 Large Language Models have Intrinsic Self-Correction Ability
 Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation
 Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics
 Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions
 Modeling Human Subjectivity in LLMs Using Explicit and Implicit Human Factors in Personas
 Ranking LLMs by compression
 SPL: A Socratic Playground for Learning Powered by Large Language Model
 Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning
 Elliptical Attention
 In-Context Former: Lightning-fast Compressing Context for Large Language Model
 AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
 Locating and Extracting Relational Concepts in Large Language Models
 Amphista: Bi-directional Multi-head Decoding for Accelerating LLM Inference
 Synergizing Foundation Models and Federated Learning: A Survey
 Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?
 What Makes Two Language Models Think Alike?
 P-Tailor: Customizing Personality Traits for Language Models via Mixture of Specialized LoRA Experts
 Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
 Is persona enough for personality? Using ChatGPT to reconstruct an agent's latent personality from simple descriptions
 LLMs Are Prone to Fallacies in Causal Inference
 Compact Proofs of Model Performance via Mechanistic Interpretability
 Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models
 MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic
 A Survey on Large Language Model-based Agents for Statistics and Data Science
 Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts
 A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners
 Multilingual Large Language Models and Curse of Multilinguality
 Concentrate Attention: Towards Domain-Generalizable Prompt Optimization for Language Models
 Personalized Pieces: Efficient Personalized Large Language Models through Collaborative Efforts
 3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding
 Federated Learning driven Large Language Models for Swarm Intelligence: A Survey
 Developing Safe and Responsible Large Language Model: Can We Balance Bias Reduction and Language Understanding in Large Language Models?
 Cofca: A Step-Wise Counterfactual Multi-hop QA benchmark
 Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
 Compacter: Efficient Low-Rank Hypercomplex Adapter Layers
 SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
 SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs
 BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
 Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining
 AdvPrefix: An Objective for Nuanced LLM Jailbreaks
 VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation
 CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
 Llama 3 Meets MoE: Efficient Upcycling
 Memory Layers at Scale
 GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers
 RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
 Memory Transformer
 Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture
 SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
 Whisper-GPT: A Hybrid Representation Audio Large Language Model
 Smaller Language Models Are Better Instruction Evolvers
 Investigating Mixture of Experts in Dense Retrieval
 Let your LLM generate a few tokens and you will reduce the need for retrieval
 RecSys Arena: Pair-wise Recommender System Evaluation with Large Language Models
 No More Adam: Learning Rate Scaling at Initialization is All You Need
 Entropy-Regularized Process Reward Model
 SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator
 The Open Source Advantage in Large Language Models (LLMs)
 GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
 Understanding Knowledge Hijack Mechanism in In-context Learning through Associative Memory
 Superhuman performance of a large language model on the reasoning tasks of a physician
 Reinforcement Learning Enhanced LLMs: A Survey
 Byte Latent Transformer: Patches Scale Better Than Tokens
 TaylorShift: Shifting the Complexity of Self-Attention from Squared to Linear (and Back) using Taylor-Softmax
 Continual Pre-Training of Large Language Models: How to (re)warm your model?
 Are Your LLMs Capable of Stable Reasoning?
 An Agentic Approach to Automatic Creation of P&ID Diagrams from Natural Language Descriptions
 Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
 AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
 Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
 OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
 EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation
 When to Speak, When to Abstain: Contrastive Decoding with Abstention
 RAG Playground: A Framework for Systematic Evaluation of Retrieval Strategies and Prompt Engineering in RAG Systems
 Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
 Mastering Board Games by External and Internal Planning with Language Models
 Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework
 VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
 Cultural Evolution of Cooperation among LLM Agents
 Legommenders: A Comprehensive Content-Based Recommendation Library with LLM Support
 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
 GUI Agents: A Survey
 RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
 LMUnit: Fine-grained Evaluation with Natural Language Unit Tests
 Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
 MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
 Alignment faking in large language models
 CAD-Recode: Reverse Engineering CAD Code from Point Clouds
 Prompting Strategies for Enabling Large Language Models to Infer Causation from Correlation
 LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
 Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
 AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
 Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
 EscapeBench: Pushing Language Models to Think Outside the Box
 FastVLM: Efficient Vision Encoding for Vision Language Models
 LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
 Progressive Multimodal Reasoning via Active Retrieval
 MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
 Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings
 Sliding Windows Are Not the End: Exploring Full Ranking with Long-Context Large Language Models
 Qwen2.5 Technical Report
 AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
 ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
 How to Synthesize Text Data without Model Collapse?
 TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation
 A Closer Look at the Limitations of Instruction Tuning
 Rethinking Uncertainty Estimation in Natural Language Generation
 Knowledge Injection via Prompt Distillation
 HashAttention: Semantic Sparsity for Faster Inference
 Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective
 SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction
 BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models
 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges
 Do Large Language Models Defend Inferentialist Semantics?: On the Logical Expressivism and Anti-Representationalism of LLMs
 Large Language Model Enhanced Recommender Systems: Taxonomy, Trend, Application and Future
 Explainable Procedural Mistake Detection
 SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
 HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
 Offline Reinforcement Learning for LLM Multi-Step Reasoning
 Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks
 XRAG: eXamining the Core -- Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation
 Fietje: An open, efficient LLM for Dutch
 SimGRAG: Leveraging Similar Subgraphs for Knowledge Graphs Driven Retrieval-Augmented Generation
 MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
 PruneVid: Visual Token Pruning for Efficient Video Large Language Models
 LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps
 HREF: Human Response-Guided Evaluation of Instruction Following in Language Models
 Multi-LLM Text Summarization
 Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models
 Diving into Self-Evolving Training for Multimodal Reasoning
 B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
 Revisiting In-Context Learning with Long Context Language Models
 NILE: Internal Consistency Alignment in Large Language Models
 Formal Mathematical Reasoning: A New Frontier in AI
 A Systematic Examination of Preference Learning through the Lens of Instruction-Following
 WebLLM: A High-Performance In-Browser LLM Inference Engine
 Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration
 MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula
 Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage
 Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining
 Associative memory inspires improvements for in-context learning using a novel attention residual stream architecture
 RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
 Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization
 Deliberation in Latent Space via Differentiable Cache Augmentation
 DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought
 Efficient fine-tuning methodology of text embedding models for information retrieval: contrastive learning penalty (clp)
 The HalluRAG Dataset: Detecting Closed-Domain Hallucinations in RAG Applications Using an LLM's Internal States
 GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
 OpenAI o1 System Card
 Large Language Model Can Be a Foundation for Hidden Rationale-Based Retrieval
 LearnLM: Improving Gemini for Learning
 Outcome-Refining Process Supervision for Code Generation
 PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World
 OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning
 Agent-SafetyBench: Evaluating the Safety of LLM Agents
 ResearchTown: Simulator of Human Research Community
 Rate of Model Collapse in Recursive Training
 Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective
 YuLan-Mini: An Open Data-efficient Language Model
 A Survey of Query Optimization in Large Language Models
 SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval
 Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation
 Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
 In Case You Missed It: ARC 'Challenge' Is Not That Challenging
 Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning
 Improving Factuality with Explicit Working Memory
 Efficient Long Context Language Model Retrieval with Compression
 Token-Budget-Aware LLM Reasoning
 Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
 Fooling LLM graders into giving better grades through neural activity guided adversarial prompting
 Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
 A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression
 Gradient Weight-normalized Low-rank Projection for Efficient LLM Training
 Instruction Fine-Tuning: Does Prompt Loss Matter?
 Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
 RecLM: Recommendation Instruction Tuning
 Multi-matrix Factorization Attention
 Jasper and Stella: distillation of SOTA embedding models
 HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
 Alpha Excel Benchmark
 HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
 Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
 Facilitating large language model Russian adaptation with Learned Embedding Propagation
 Training Software Engineering Agents and Verifiers with SWE-Gym
 Efficiently Serving LLM Reasoning Programs with Certaindex
 GASLITEing the Retrieval: Exploring Vulnerabilities in Dense Embedding-based Search
 On the Compositional Generalization of Multimodal LLMs for Medical Imaging
 OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System
 InfAlign: Inference-aware language model alignment
 Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
 Dynamic Skill Adaptation for Large Language Models
 Long-Range Tasks Using Short-Context LLMs: Incremental Reasoning With Structured Memories
 CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era
 Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
 In-context Continual Learning Assisted by an External Continual Learner
 LLM2: Let Large Language Models Harness System 2 Reasoning
 Precise Length Control in Large Language Models
 Can LLMs Convert Graphs to Text-Attributed Graphs?
 Using Generative AI and Multi-Agents to Provide Automatic Feedback
 Agents Are Not Enough
 Smoothie: Label Free Language Model Routing
 Xmodel-2 Technical Report
 LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
 HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving
 OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios
 ProgCo: Program Helps Self-Correction of Large Language Models
 CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
 A3: Android Agent Arena for Mobile GUI Agents
 Dynamic Scaling of Unit Tests for Code Reward Modeling
 KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model
 VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
 MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
 MLLM-as-a-Judge for Image Safety without Human Labeling
 MapQaTor: A System for Efficient Annotation of Map Query Datasets
 Are Vision-Language Models Truly Understanding Multi-vision Sensor?
 BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation
 TrustRAG: Enhancing Robustness and Trustworthiness in RAG
 Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding
 RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
 FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
 Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
 Titans: Learning to Memorize at Test Time
 Unifying Specialized Visual Encoders for Video Language Models
 IGC: Integrating a Gated Calculator into an LLM to Solve Arithmetic Tasks Reliably and Efficiently
 Low-Rank Adaptation for Foundation Models: A Comprehensive Review
 Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning
 Can LLMs Design Good Questions Based on Context?
 2 OLMo 2 Furious
 RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
 LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
 VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
 Enhancing Human-Like Responses in Large Language Models
 Metadata Conditioning Accelerates Language Model Pre-training
 Cold-Start Recommendation towards the Era of Large Language Models (LLMs): A Comprehensive Survey and Roadmap
 Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
 QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture
 SDPO: Segment-Level Direct Preference Optimization for Social Agents
 Predicting the Performance of Black-box LLMs through Self-Queries
 Many of Your DPOs are Secretly One: Attempting Unification Through Mutual Information
 BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery
 Reinforcing Thinking through Reasoning-Enhanced Reward Models
 ICLR: In-Context Learning of Representations
 BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
 Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
 HALO: Hadamard-Assisted Lossless Optimization for Efficient Low-Precision LLM Training and Fine-Tuning
 ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
 Test-time Computing: from System-1 Thinking to System-2 Thinking
 Scaling Laws for Floating Point Quantization Training
 Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey
 Personalized Graph-Based Retrieval for Large Language Models
 Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
 Understanding How CodeLLMs (Mis)Predict Types with Activation Steering
 Instruction-Following Pruning for Large Language Models
 GeAR: Generation Augmented Retrieval
 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
 LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
 MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
 PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
 DeepSeek-V3 Technical Report
 Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
 REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
 Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
 Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers
 OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
 Entropy-Guided Attention for Private LLMs
 Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
 InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
 rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
 Agent Laboratory: Using LLM Agents as Research Assistants
 URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
 LLM4SR: A Survey on Large Language Models for Scientific Research
 Repository Structure-Aware Training Makes SLMs Better Issue Resolver
 DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization
 EpiCoder: Encompassing Diversity and Complexity in Code Generation
 Multi-task retriever fine-tuning for domain-specific and efficient RAG
 Reasoning-Enhanced Self-Training for Long-Form Personalized Text Generation
 Who Does the Giant Number Pile Like Best: Analyzing Fairness in Hiring Contexts
 Search-o1: Agentic Search-Enhanced Large Reasoning Models
 ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
 Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
 SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution
 Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
 Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models
 Concept Boundary Vectors
 A Survey of RWKV
 Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning
 Large-scale Group Brainstorming using Conversational Swarm Intelligence (CSI) versus Traditional Chat
 Large Language Model is Secretly a Protein Sequence Optimizer
 Experience of Training a 1.7B-Parameter LLaMa Model From Scratch
 Frontier AI systems have surpassed the self-replicating red line
 Large Language Models show both individual and collective creativity comparable to humans
 Does Self-Attention Need Separate Weights in Transformers?
 VideoRAG: Retrieval-Augmented Generation over Video Corpus
 A Survey on Large Language Models with some Insights on their Capabilities and Limitations
 SUGAR: Leveraging Contextual Confidence for Smarter Retrieval
 On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena
 LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
 Valley2: Exploring Multimodal Models with Scalable Vision-Language Design
 Small Language Models (SLMs) Can Still Pack a Punch: A survey
 Enabling Scalable Oversight via Self-Evolving Critic
 Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
 Infecting Generative AI With Viruses
 OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
 The Future of AI: Exploring the Potential of Large Concept Models
 Demystifying Domain-adaptive Post-training for Financial LLMs
 Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
 WebWalker: Benchmarking LLMs in Web Traversal
 Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
 Enhancing Retrieval-Augmented Generation: A Study of Best Practices
 Foundations of Large Language Models
 The Lessons of Developing Process Reward Models in Mathematical Reasoning
 ListConRanker: A Contrastive Text Reranker with Listwise Encoding
 SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
 MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation
 Towards Best Practices for Open Datasets for LLM Training
 ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning
 O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
 Tensor Product Attention Is All You Need
 MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
 Transformer²: Self-adaptive LLMs
 Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
 Amortizing intractable inference in large language models
 PokerBench: Training Large Language Models to become Professional Poker Players
 MiniMax-01: Scaling Foundation Models with Lightning Attention
 HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
 OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training
 A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following
 Potential and Perils of Large Language Models as Judges of Unstructured Textual Data
 Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
 Adaptive Semantic Prompt Caching with VectorQ
 How GPT learns layer by layer
 Utility-inspired Reward Transformations Improve Reinforcement Learning Training of Language Models
 Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
 Enhancing Automated Interpretability with Output-Centric Feature Descriptions
 From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large Language Models
 A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
 Fast Inference of Mixture-of-Experts Language Models with Offloading
 In-situ graph reasoning and knowledge expansion using Graph-PReFLexOR
 Entailed Between the Lines: Incorporating Implication into NLI
 Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
 OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
 Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
 Exploring the Inquiry-Diagnosis Relationship with Advanced Patient Simulators
 Guiding Retrieval using LLM-based Listwise Rankers
 Aligning Instruction Tuning with Pre-training
 Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
 RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
 The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models
 Vision-Language Models Do Not Understand Negation
 Task Vectors in In-Context Learning: Emergence, Formation, and Benefit
 CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval
 ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario
 PaSa: An LLM Agent for Comprehensive Academic Paper Search
 Evolving Deeper LLM Thinking
 Bridging Language Barriers in Healthcare: A Study on Arabic LLMs
 SEAL: Entangled White-box Watermarks on Low-Rank Adaptation
 Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
 LatteReview: A Multi-Agent Framework for Systematic Review Automation Using Large Language Models
 MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
 Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
 VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
 Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
 Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
 Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
 Reasoning Language Models: A Blueprint
 Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments
 Computational Protein Science in the Era of Large Language Models (LLMs)
 FRAG: A Flexible Modular Framework for Retrieval-Augmented Generation based on Knowledge Graphs
 SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs
 PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play
 Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
 Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
 4bit-Quantization in Vector-Embedding for RAG
 AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs
 MSTS: A Multimodal Safety Test Suite for Vision-Language Models
 InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
 Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model
 Panoramic Interests: Stylistic-Content Aware Personalized Headline Generation
 Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective
 The Geometry of Tokens in Internal Representations of Large Language Models
 Autonomy-of-Experts Models
 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
 FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
 Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
 Kimi k1.5: Scaling Reinforcement Learning with LLMs
 O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
 Optimizing Pretraining Data Mixtures with LLM-Estimated Utility
 Tell me about yourself: LLMs are aware of their learned behaviors
 IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
 Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament
 Distillation Quantification for Large Language Models
 FOCUS: First Order Concentrated Updating Scheme
 Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
 FLAME: A small language model for spreadsheet formulas
 Temporal Preference Optimization for Long-Form Video Understanding
 Parameter-Efficient Fine-Tuning for Foundation Models
 Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
 Can Large Language Models Understand Preferences in Personalized Recommendation?
 RAG-Reward: Optimizing RAG with Reward Modeling and RLHF
 Analyzing Continuous Semantic Shifts with Diachronic Word Similarity Matrices
 Debate Helps Weak-to-Strong Generalization
 Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
 Hallucinations Can Improve Large Language Models in Drug Discovery
 EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
 Do LLMs Provide Consistent Answers to Health-Related Questions across Languages?
 Control LLM: Controlled Evolution for Intelligence Retention in LLM
 Bilinear MLPs enable weight-based mechanistic interpretability
 A Survey on Memory-Efficient Large-Scale Model Training in AI for Science
 Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection
 Computing Optimization-Based Prompt Injections Against Closed-Weights Models By Misusing a Fine-Tuning API
 RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques
 Chain-of-Retrieval Augmented Generation
 OstQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting
 A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models
 Chat3GPP: An Open-Source Retrieval-Augmented Generation Framework for 3GPP Documents
 Redundancy Principles for MLLMs Benchmarks
 RL + Transformer = A General-Purpose Problem Solver
 GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing
 Question Answering on Patient Medical Records with Private Fine-Tuned LLMs
 Return of the Encoder: Maximizing Parameter Efficiency for SLMs
 Provence: efficient and robust context pruning for retrieval-augmented generation
 ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
 OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas
 Qwen2.5-1M Technical Report
 GaussMark: A Practical Approach for Structural Watermarking of Language Models
 Baichuan-Omni-1.5 Technical Report
 ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval
 Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning
 CodeMonkeys: Scaling Test-Time Compute for Software Engineering
 Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies
 SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
 Optimizing Large Language Model Training Using FP4 Quantization
 Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
 Open Problems in Mechanistic Interpretability
 Low-Rank Adapters Meet Neural Architecture Search for LLM Compression
 You Do Not Fully Utilize Transformer's Representation Capacity
 Training Dynamics of In-Context Learning in Linear Attention
 IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding
 StagFormer: Time Staggering Transformer Decoding for Running Layers In Parallel
 Self-reflecting Large Language Models: A Hegelian Dialectical Approach
 Histoires Morales: A French Dataset for Assessing Moral Alignment
 TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models
 Can Transformers Learn Full Bayesian Inference in Context?
 Sparse Autoencoders Trained on the Same Data Learn Different Features
 FBQuant: FeedBack Quantization for Large Language Models
 RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations
 DeepFlow: Serverless Large Language Model Serving at Scale
 Can LLM Generate Regression Tests for Software Commits?
 Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models
 WARP: An Efficient Engine for Multi-Vector Retrieval
 Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
 Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
 Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
 Atla Selene Mini: A General Purpose Evaluation Model
 Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts
 Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
 GuardReasoner: Towards Reasoning-based LLM Safeguards
 LLMs can see and hear without any training
 Large Language Models Think Too Fast To Explore Effectively
 AdditiveLLM: Large Language Models Predict Defects in Additive Manufacturing
 Actions Speak Louder than Words: Agent Decisions Reveal Implicit Biases in Language Models
 People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text
 Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch
 WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
 o3-mini vs DeepSeek-R1: Which One is Safer?
 MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
 Beyond Turn-taking: Introducing Text-based Overlap into Human-LLM Interactions
 s1: Simple test-time scaling
 Trading Inference-Time Compute for Adversarial Robustness
 R.I.P.: Better Models by Survival of the Fittest Prompts
 Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
 Improving Your Model Ranking on Chatbot Arena by Vote Rigging
 Sparse Autoencoders Can Interpret Randomly Initialized Transformers
 LLM-AutoDiff: Auto-Differentiate Any LLM Workflow
 Propositional Interpretability in Artificial Intelligence
 Function Vectors in Large Language Models
 Do LLMs Strategically Reveal, Conceal, and Infer Information? A Theoretical and Empirical Analysis in The Chameleon Game
 Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models
 Reward-Guided Speculative Decoding for Efficient LLM Reasoning
 mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval
 Efficient Reasoning with Hidden Thinking
 Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
 Unraveling the Capabilities of Language Models in News Summarization
 Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models
 Diverse Preference Optimization
 Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization
 LLMs Can Plan Only If We Tell Them
 Scalable-Softmax Is Superior for Attention
 An introduction to graphical tensor notation for mechanistic interpretability
 PixelWorld: Towards Perceiving Everything as Pixels
 The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
 ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
 Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities
 Towards Safe and Honest AI Agents with Neural Self-Other Overlap
 Lifelong Sequential Knowledge Editing without Model Degradation
 Preference Leakage: A Contamination Problem in LLM-as-a-judge
 Process Reinforcement through Implicit Rewards
 GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation
 DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
 ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
 The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
 FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
 HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions
 RankFlow: A Multi-Role Collaborative Reranking Workflow Utilizing Large Language Models
 AIN: The Arabic INclusive Large Multimodal Model
 Querying Databases with Function Calling
 PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models
 The Differences Between Direct Alignment Algorithms are a Blur
 Almost Surely Safe Alignment of Large Language Models at Inference-Time
 Beyond Limited Data: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving
 Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment
 SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
 Serving Long-Context LLMs at the Mobile Edge: Test-Time Reinforcement Learning-based Model Caching and Inference Offloading
 AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
 MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
 Learning to Generate Unit Tests for Automated Debugging
 Language Models Prefer What They Know: Relative Confidence Estimation via Confidence Preferences
 Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
 LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information
 RandLoRA: Full-rank parameter-efficient fine-tuning of large models
 Fundamental limits of learning in sequence multi-index models and deep attention networks: High-dimensional asymptotics and sharp thresholds
 Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping
 QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search
 Learning the RoPEs: Better 2D and 3D Position Encodings with STRING
 Can LLMs Maintain Fundamental Abilities under KV Cache Compression?
 ACECODER: Acing Coder RL via Automated Test-Case Synthesis
 Harmonic Loss Trains Interpretable AI Models
 Converting MLPs into Polynomials in Closed Form
 Language Models Use Trigonometry to Do Addition
 MMTEB: Massive Multilingual Text Embedding Benchmark
 Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?
 Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification
 TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs
 BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving
 SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
 LIMO: Less is More for Reasoning
 Demystifying Long Chain-of-Thought Reasoning in LLMs
 Intent Representation Learning with Large Language Model for Recommendation
 Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning
 Jailbreaking with Universal Multi-Prompts
 Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
 Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
 Wavelet-based Positional Representation for Long Context
 A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
 ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
 Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
 TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets
 Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense
 Federated Sketching LoRA: On-Device Collaborative Fine-Tuning of Large Language Models
 Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models
 Deriving Activation Functions Using Integration
 FlowLLM: Flow Matching for Material Generation with Large Language Models as Base Distributions
 Large Language Model Guided Self-Debugging Code Generation
 OmniRL: In-Context Reinforcement Learning by Large-Scale Meta-Training in Randomized Worlds
 On Teacher Hacking in Language Model Distillation
 Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation
 Activation-Informed Merging of Large Language Models
 HackerRank-ASTRA: Evaluating Correctness & Consistency of Large Language Models on cross-domain multi-file project problems
 Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
 Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions
 Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization
 PILAF: Optimal Human Preference Sampling for Reward Modeling
 MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion
 MRAMG-Bench: A BeyondText Benchmark for Multimodal Retrieval-Augmented Multimodal Generation
 UltraIF: Advancing Instruction Following from the Wild
 BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation
 Syntriever: How to Train Your Retriever with Synthetic Data from LLMs
 Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
 Great Models Think Alike and this Undermines AI Oversight
 ChartCitor: Multi-Agent Framework for Fine-Grained Chart Visual Attribution
 PlotGen: Multi-Agent LLM-based Scientific Data Visualization via Multimodal Feedback
 Enhancing Code Generation for Low-Resource Languages: No Silver Bullet
 Memorize and Rank: Elevating Large Language Models for Clinical Diagnosis Prediction
 LLM Alignment as Retriever Optimization: An Information Retrieval Perspective
 Partially Rewriting a Transformer in Natural Language
 MedRAX: Medical Reasoning Agent for Chest X-ray
 Do Large Language Model Benchmarks Test Reliability?
 Loss Functions and Operators Generated by f-Divergences
 xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking
 QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
 Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
 DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
 Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs
 QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
 Generating Symbolic World Models via Test-time Scaling of Large Language Models
 ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning
 Linear Correlation in LM's Compositional Generalization and Hallucination
 When One LLM Drools, Multi-LLM Collaboration Rules
 CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference
 MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot
 Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models
 MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf
 CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance
 Multi-agent Architecture Search via Agentic Supernet
 Advancing Reasoning in Large Language Models: Promising Methods and Approaches
 CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning
 ALU: Agentic LLM Unlearning
 Training Language Models to Reason Efficiently
 Sparse Autoencoders for Hypothesis Generation
 It's All in The [MASK]: Simple Instruction-Tuning Enables BERT-like Masked Language Models As Generative Classifiers
 SPARC: Subspace-Aware Prompt Adaptation for Robust Continual Learning in LLMs
 EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
 Matryoshka Quantization
 Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
 ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
 Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
 Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM
 LM2: Large Memory Models
 Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
 APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding
 Hypencoder: Hypernetworks for Information Retrieval
 The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering
 Leveraging the true depth of LLMs
 Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies
 Develop AI Agents for System Engineering in Factorio
 AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
 Augmenting Self-attention with Persistent Memory
 SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
 MetaChain: A Fully-Automated and Zero-Code Framework for LLM Agents
 Auditing Prompt Caching in Language Model APIs
 FLAG-Trader: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading
 Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving
 Scaling Pre-training to One Hundred Billion Data for Vision Language Models
 O1 Embedder: Let Retrievers Think Before Action
 Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
 LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
 CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
 TWICE: What Advantages Can Low-Resource Domain-Specific Embedding Model Bring? - A Case Study on Korea Financial Texts
 Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey
 Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests
 Knowledge Graph-Guided Retrieval Augmented Generation
 Competitive Programming with Large Reasoning Models
 Solving the Content Gap in Roblox Game Recommendations: LLM-Based Profile Generation and Reranking
 DeepCrossAttention: Supercharging Transformer Residual Connections
 Towards Internet-Scale Training For Agents
 Optimizing Temperature for Language Models with Multi-Sample Inference
 On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
 Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
 Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE
 Confidence Improves Self-Consistency in LLMs
 Teaching Language Models to Critique via Reinforcement Learning
 Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning
 FactIR: A Real-World Zero-shot Open-Domain Retrieval Benchmark for Fact-Checking
 The Curse of Depth in Large Language Models
 CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging
 Graph-Based Vector Search: An Experimental Evaluation of the State-of-the-Art
 DarwinLM: Evolutionary Structured Pruning of Large Language Models
 Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension
 PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference
 Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
 DPO-Shift: Shifting the Distribution of Direct Preference Optimization
 Bag of Tricks for Inference-time Computation of LLM Reasoning
 Gemstones: A Model Suite for Multi-Faceted Scaling Laws
 Expect the Unexpected: FailSafe Long Context QA for Finance
 Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models
 Distillation Scaling Laws
 Transfer Learning of Tabular Data by Finetuning Large Language Models
 CoS: Chain-of-Shot Prompting for Long Video Understanding
 LLM Pretraining with Continuous Concepts
 Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance
 TransMLA: Multi-head Latent Attention Is All You Need
 Automated Capability Discovery via Model Self-Exploration
 Harnessing Language's Fractal Geometry with Recursive Inference Scaling
 When More is Less: Understanding Chain-of-Thought Length in LLMs
 Learning Conformal Abstention Policies for Adaptive Risk Management in Large Language and Vision-Language Models
 NoLiMa: Long-Context Evaluation Beyond Literal Matching
 mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
 The MoE-Empowered Edge LLMs Deployment: Architecture, Challenges, and Opportunities
 MetaSC: Test-Time Safety Specification Optimization for Language Models
 LLM Modules: Knowledge Transfer from a Large to a Small Model using Enhanced Cross-Attention
 BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
 Language Models Can Teach Themselves to Program Better
 Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning
 Mediator: Memory-efficient LLM Merging with Less Parameter Conflicts and Uncertainty Based Routing
 Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment
 PDE-Controller: LLMs for Autoformalization and Reasoning of PDEs
 Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
 Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers
 The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models
 SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
 CoT-Valve: Length-Compressible Chain-of-Thought Tuning
 EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
 KET-RAG: A Cost-Efficient Multi-Granular Indexing Framework for Graph-RAG
 Logical Reasoning in Large Language Models: A Survey
 An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging
 Typhoon T1: An Open Thai Reasoning Model
 Diversity Enhances an LLM's Performance in RAG and Long-context Task
 The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding
 InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
 MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
 Exploring the Potential of Encoder-free Architectures in 3D LMMs
 SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models
 CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
 EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges
 Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
 Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges
 Escaping Collapse: The Strength of Weak Data for Large Language Model Training
 CopySpec: Accelerating LLMs with Speculative Copy-and-Paste Without Compromising Quality
 Human-LLM Coevolution: Evidence from Academic Writing
 GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?
 Mastering the Craft of Data Synthesis for CodeLLMs
 Towards Semantic Versioning of Open Pre-trained Language Model Releases on Hugging Face
 How Green are Neural Language Models? Analyzing Energy Consumption in Text Summarization Fine-tuning
 Enabling Autoregressive Models to Fill In Masked Tokens
 GENERator: A Long-Context Generative Genomic Foundation Model
 Token-Hungry, Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH
 ToolFactory: Automating Tool Generation by Leveraging LLM to Understand REST API Documentations
 MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
 Large Language Diffusion Models
 Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages
 V2V-LLM: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multi-Modal Large Language Models
 SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
 LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing
 KGGen: Extracting Knowledge Graphs from Plain Text with Language Models
 Diverse Inference and Verification for Advanced Reasoning
 FoNE: Precise Single-Token Number Embeddings via Fourier Features
 ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
 Jailbreaking to Jailbreak
 Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models
 We Can't Understand AI Using our Existing Vocabulary
 HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
 Scaling Test-Time Compute Without Verification or RL is Suboptimal
 PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
 Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity
 LIMR: Less is More for RL Scaling
 Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
 video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
 Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance
 Large Language Models and Mathematical Reasoning Failures
 SAFE-SQL: Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to-SQL
 Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
 ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability
 System Message Generation for User Preferences using Open-Source Models
 Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest
 How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
 ReLearn: Unlearning via Learning for Large Language Models
 The Mirage of Model Editing: Revisiting Evaluation in the Wild
 Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
 SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
 Dyve: Thinking Fast and Slow for Dynamic Process Verification
 Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems
 CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation
 KernelBench: Can LLMs Write Efficient GPU Kernels?
 SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?
 One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs
 Data Valuation using Neural Networks for Efficient Instruction Fine-Tuning
 Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking
 CRANE: Reasoning with constrained LLM generation
 Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
 Can a Single Model Master Both Multi-turn Conversations and Tool Use? CALM: A Unified Conversational Agentic Language Model
 The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
 Self-Supervised Prompt Optimization
 Self-Data Distillation for Recovering Quality in Pruned Large Language Models
 HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation
 Rethinking Diverse Human Preference Learning through Principal Component Analysis
 Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
 HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
 Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge
 Autellix: An Efficient Serving Engine for LLM Agents as General Programs
 Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
 AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence
 Qwen2.5-VL Technical Report
 LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
 Craw4LLM: Efficient Web Crawling for LLM Pretraining
 Thinking Preference Optimization
 Magma: A Foundation Model for Multimodal AI Agents
 NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
 Text2World: Benchmarking Large Language Models for Symbolic World Model Generation
 Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
 Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks
 LLM-Powered Proactive Data Systems
 Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options
 Soundwave: Less is More for Speech-Text Alignment in LLMs
 PAFT: Prompt-Agnostic Fine-Tuning
 Concise Reasoning via Reinforcement Learning
 Baichuan-M1: Pushing the Medical Capability of Large Language Models
 Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research
 The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
 Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking
 SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
 Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
 MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections
 Small Models Struggle to Learn from Strong Reasoners
 Scaling Autonomous Agents via Automatic Reward Modeling And Planning
 Atom of Thoughts for Markov LLM Test-Time Scaling
 OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning
 FinMTEB: Finance Massive Text Embedding Benchmark
 Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages
 Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey
 HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
 RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision
 Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning
 GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
 TrustRAG: An Information Assistant with Retrieval Augmented Generation
 REFIND: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models
 Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
 Learning to Reason at the Frontier of Learnability
 Presumed Cultural Identity: How Names Shape LLM Responses
 InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
 StepTool: Enhancing Multi-Step Tool Usage in LLMs through Step-Grained Reinforcement Learning
 Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark
 TESS 2: A Large-Scale Generalist Diffusion Language Model
 Judging the Judges: A Collection of LLM-Generated Relevance Judgements
 From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions
 Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval
 REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation
 AIDE: AI-Driven Exploration in the Space of Code
 MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching
 Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
 LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
 SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
 Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
 On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems
 SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
 AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO
 Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
 MLGym: A New Framework and Benchmark for Advancing AI Research Agents
 S*: Test Time Scaling for Code Generation
 On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective
 PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
 Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information
 Interpretable Text Embeddings and Text Similarity Explanation: A Primer
 ETS: Efficient Tree Search for Inference-Time Scaling
 From RAG to Memory: Non-Parametric Continual Learning for Large Language Models
 How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
 Unstructured Evidence Attribution for Long Context Query Focused Summarization
 Reasoning LLMs are Wandering Solution Explorers
 Collaborative Retrieval for Large Language Model-based Conversational Recommender Systems
 S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
 How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
 LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
 InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
 CLIPPER: Compression enables long-context synthetic data generation
 LLM-based User Profile Management for Recommender System
 Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models
 Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data
 The Imitation Game According To Turing
 You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense
 Trojan Detection Through Pattern Recognition for Large Language Models
 How to Get Your LLM to Generate Challenging Problems for Evaluation
 Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images
 Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
 RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
 Diversity-driven Data Selection for Language Model Tuning through Sparse Autoencoder
 H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking
 The underlying structures of self-attention: symmetry, directionality, and emergent dynamics in Transformer training
 Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
 VERT: Verified Equivalent Rust Transpilation with Large Language Models as Few-Shot Learners
 Large language models and (non-)linguistic recursion
 MathConstruct: Challenging LLM Reasoning with Constructive Proofs
 Enhancing Reasoning to Adapt Large Language Models for Domain-Specific Applications
 Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification
 Automated Hypothesis Validation with Agentic Sequential Falsifications
 The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer
 LightThinker: Thinking Step-by-Step Compression
 PIP-KAG: Mitigating Knowledge Conflicts in Knowledge-Augmented Generation via Parametric Pruning
 UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning
 More for Keys, Less for Values: Adaptive KV Cache Quantization
 SIFT: Grounding LLM Reasoning in Contexts via Stickers
 Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence
 PathRAG: Pruning Graph-based Retrieval Augmented Generation with Relational Paths
 SurveyX: Academic Survey Automation via Large Language Models
 StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following
 Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model
 LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
 KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
 MoBA: Mixture of Block Attention for Long-Context LLMs
 Evaluating Multimodal Generative AI with Korean Educational Standards
 Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
 Beyond No: Quantifying AI Over-Refusal and Emotional Attachment Boundaries
 Large Language Models for Cryptocurrency Transaction Analysis: A Bitcoin Case Study
 mStyleDistance: Multilingual Style Embeddings and their Evaluation
 Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis
 MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
 Benchmarking LLMs for Political Science: A United Nations Perspective
 Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
 Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
 CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
 Thus Spake Long-Context Large Language Model
 BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference
 Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties
 Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment
 Beyond Release: Access Considerations for Generative AI Systems
 Audio-FLAN: A Preliminary Release
 Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge
 Linear Attention for Efficient Bidirectional Sequence Modeling
 Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
 Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
 Empowering LLMs with Logical Reasoning: A Comprehensive Survey
 Activation Steering in Neural Theorem Provers
 Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease
 PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference
 PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving
 Reasoning with Latent Thoughts: On the Power of Looped Transformers
 COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs
 Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
 LettuceDetect: A Hallucination Detection Framework for RAG Applications
 Mapping 1,000+ Language Models via the Log-Likelihood Vector
 M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment
 Grounded Persuasive Language Generation for Automated Marketing
 Compression Scaling Laws: Unifying Sparsity and Quantization
 Self-Taught Agentic Long Context Understanding
 InductionBench: LLMs Fail in the Simplest Complexity Class
 Towards an AI co-scientist
 Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models
 DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers
 SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
 OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
 WebGames: Challenging General-Purpose Web-Browsing AI Agents
 Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective
 AAD-LLM: Neural Attention-Driven Auditory Scene Understanding
 MutaGReP: Execution-Free Repository-Grounded Plan Search for Code-Use
 Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models
 TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding
 Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
 Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
 Kanana: Compute-efficient Bilingual Language Models
 Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance
 WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging
 ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
 Language Models' Factuality Depends on the Language of Inquiry
 Scaling LLM Pre-training with Vocabulary Curriculum
 PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
 The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?
 ReaderLM-v2: Small Language Model for HTML to Markdown and JSON
 Introducing Visual Perception Token into Multimodal Large Language Model
 MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
 Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
 Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization
 AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
 Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents
 LaTIM: Measuring Latent Token-to-Token Interactions in Mamba Models
 Prompt-to-Leaderboard
 Optimizing Model Selection for Compound AI Systems
 Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking
 MixMin: Finding Data Mixtures via Convex Minimization
 Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
 Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs
 CritiQ: Mining Data Quality Criteria from Human Preferences
 Rank1: Test-Time Compute for Reranking in Information Retrieval
 BIG-Bench Extra Hard
 All That Glitters is Not Novel: Plagiarism in AI Generated Research
 FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users
 Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
 Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation
 KiRAG: Knowledge-Driven Iterative Retriever for Enhancing Retrieval-Augmented Generation
 LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers
 (Mis)Fitting: A Survey of Scaling Laws
 Towards Optimal Multi-draft Speculative Decoding
 Training a Generally Curious Agent
 Reward Shaping to Mitigate Reward Hacking in RLHF
 R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
 SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning
 LongRoPE2: Near-Lossless LLM Context Window Scaling
 MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge
 Self-rewarding correction for mathematical reasoning
 FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
 Granite Embedding Models
 R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning
 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
 Less or More: Towards Glanceable Explanations for LLM Recommendations Using Ultra-Small Devices
 Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance
 CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale
 NeoBERT: A Next-Generation BERT
 An Extensive Evaluation of PDDL Capabilities in off-the-shelf LLMs
 On Relation-Specific Neurons in Large Language Models
 Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System
 Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners
 Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review
 Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
 Applications of Large Models in Medicine
 Agent-centric Information Access
 A Systematic Survey of Automatic Prompt Optimization Techniques
 ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs
 HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
 DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking
 Retrieval Backward Attention without Additional Training: Enhance Embeddings of Large Language Models via Repetition
 SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
 Chain of Draft: Thinking Faster by Writing Less
 TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
 Multi-Turn Code Generation Through Single-Step Rewards
 Preference Learning Unlocks LLMs' Psycho-Counseling Skills
 Large-Scale Data Selection for Instruction Tuning
 Visual-RFT: Visual Reinforcement Fine-Tuning
 Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models
 Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
 Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
 Predictive Data Selection: The Data That Predicts Is the Data That Teaches
 DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
 EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
 U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack
 CodeArena: A Collective Evaluation Platform for LLM Code Generation
 CLEA: Closed-Loop Embodied Agent for Enhancing Task Execution in Dynamic Environments
 Word Form Matters: LLMs' Semantic Reconstruction under Typoglycemia
 DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
 Liger: Linearizing Large Language Models to Gated Recurrent Structures
 AI-Invented Tonal Languages: Preventing a Machine Lingua Franca Beyond Human Understanding
 From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens
 SampleMix: A Sample-wise Pre-training Data Mixing Strategey by Coordinating Data Quality and Diversity
 When an LLM is apprehensive about its answers -- and when its uncertainty is justified
 PodAgent: A Comprehensive Framework for Podcast Generation
 General Reasoning Requires Learning to Reason from the Get-go
 Efficient Test-Time Scaling via Self-Calibration
 Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis
 Wikipedia in the Era of LLMs: Evolution and Risks
 Language Models can Self-Improve at State-Value Estimation for Better Search
 Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs
 MPO: Boosting LLM Agents with Meta Plan Optimization
 Teaching Metric Distance to Autoregressive Multimodal Foundational Models
 ATLaS: Agent Tuning via Learning Critical Steps
 MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
 PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
 Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale
 Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
 FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
 State Stream Transformer (SST) : Emergent Metacognitive Behaviours Through Latent State Persistence
 DINT Transformer
 Matryoshka Re-Ranker: A Flexible Re-Ranking Architecture With Configurable Depth and Width
 Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data?
 FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
 Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models
 Attention is All You Need Until You Need Retention
 Complexity Control Facilitates Reasoning-Based Compositional Generalization in Transformers
 FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering
 Shrink the longest: improving latent space isotropy with symplicial geometry
 Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability of Decoder-Only Transformer Models
 Decoupling Knowledge and Reasoning in Transformers: A Modular Architecture with Generalized Cross-Attention
 Why Are Positional Encodings Nonessential for Deep Autoregressive Transformers? Revisiting a Petroglyph
 Superposition in Transformers: A Novel Way of Building Mixture of Experts
 Proactive Conversational Agents with Inner Thoughts
 Chunk-Distilled Language Modeling
 Transformer with Fourier Integral Attentions
 AppAgentX: Evolving GUI Agents as Proficient Smartphone Users
 IterPref: Focal Preference Learning for Code Generation via Iterative Debugging
 Iterative Value Function Optimization for Guided Decoding
 LADDER: Self-Improving LLMs Through Recursive Problem Decomposition
 Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression
 A Token-level Text Image Foundation Model for Document Understanding
 (How) Do Language Models Track State?
 Adapting Decoder-Based Language Models for Diverse Encoder Downstream Tasks
 Forgetting Transformer: Softmax Attention with a Forget Gate
 Societal Alignment Frameworks Can Improve LLM Alignment
 PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention
 One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings
 KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
 MCiteBench: A Benchmark for Multimodal Citation Text Generation in MLLMs
 IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval
 Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
 Interact, Instruct to Improve: A LLM-Driven Parallel Actor-Reasoner Framework for Enhancing Autonomous Vehicle Interactions
 ABC: Achieving Better Control of Multimodal Embeddings using VLMs
 Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions
 CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom
 Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection
 SwiLTra-Bench: The Swiss Legal Translation Benchmark
 HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
 Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective
 Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining
 Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs
 START: Self-taught Reasoner with Tools
 HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
 More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
 Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks
 FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion
 Token-Efficient Long Video Understanding for Multimodal LLMs
 PokéChamp: an Expert-level Minimax Language Agent
 LLM as a Broken Telephone: Iterative Generation Distorts Information
 Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps
 L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling
 Multi Agent based Medical Assistant for Edge Devices
 Lost in Literalism: How Supervised Training Shapes Translationese in LLMs
 LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation
 Identifying Sensitive Weights via Post-quantization Integral
 On the Acquisition of Shared Grammatical Representations in Bilingual Language Models
 TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge
 Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
 Enough Coin Flips Can Make LLMs Act Bayesian
 Position: Don't use the CLT in LLM evals with fewer than a few hundred datapoints
 All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
 From Language to Cognition: How LLMs Outgrow the Human Language Network
 R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
 R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning
 Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
 R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
 TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation
 Shifting Long-Context LLMs Research from Input to Output
 Learning from Failures in Multi-Attempt Reinforcement Learning
 EuroBERT: Scaling Multilingual Encoders for European Languages
 LoRACode: LoRA Adapters for Code Embeddings
 SAGE: A Framework of Precise Retrieval for RAG
 Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts
 An Empirical Study on Eliciting and Improving R1-like Reasoning Models
 S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information
 LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding
 RuCCoD: Towards Automated ICD Coding in Russian
 Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles
 Leveraging Domain Knowledge at Inference Time for LLM Translation: Retrieval versus Generation
 A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers
 Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing
 G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks
 DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
 Power-Softmax: Towards Secure LLM Inference over Encrypted Data
 SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models
 Detection Avoidance Techniques for Large Language Models
 MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
 Automated Movie Generation via Multi-Agent CoT Planning
 Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
 Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
 FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
 Agent models: Internalizing Chain-of-Action Generation into Reasoning models
 LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
 Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
 Should VLMs be Pre-trained with Image Data?
 GRITHopper: Decomposition-Free Multi-Hop Dense Retrieval
 MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
 RePO: ReLU-based Preference Optimization
 ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks
 Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces
 WritingBench: A Comprehensive Benchmark for Generative Writing
 SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing
 Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
 Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
 Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
 This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs
 Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning
 Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning
 Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries
 Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning
 Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning
 Enhancing Reasoning with Collaboration and Memory
 What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces
 Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents
 YuE: Scaling Open Foundation Models for Long-Form Music Generation
 SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
 Gemini Embedding: Generalizable Embeddings from Gemini
 Mixture of Experts Made Intrinsically Interpretable
 Implicit Reasoning in Transformers is Reasoning through Shortcuts
 LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
 Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model
 BiasEdit: Debiasing Stereotyped Language Models via Model Editing
 AI-native Memory 2.0: Second Me
 Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation
 VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering
 Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
 Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru
 Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence
 Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
 LocAgent: Graph-Guided LLM Agents for Code Localization
 PlainQAFact: Automatic Factuality Evaluation Metric for Biomedical Plain Language Summaries Generation
 Confident Adaptive Language Modeling
 SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
 Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
 Cost-Optimal Grouped-Query Attention for Long-Context LLMs
 Quantizing Large Language Models for Code Generation: A Differentiated Replication
 WildIFEval: Instruction Following in the Wild
 LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
 Self-Taught Self-Correction for Small Language Models
 The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models
 How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach
 Protein Large Language Models: A Comprehensive Survey
 Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
 Forecasting Rare Language Model Behaviors
 SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
 Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL
 VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
 UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
 Transformers without Normalization
 R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
 World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
 IteRABRe: Iterative Recovery-Aided Block Reduction
 Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
 Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
 VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
 AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation
 TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention
 MinorBench: A hand-built benchmark for content-based risks for children
 A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1
 KV-Distill: Nearly Lossless Learnable Context Compression for LLMs
 Medical Hallucinations in Foundation Models and Their Impact on Healthcare
 KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation
 Language Models Fail to Introspect About Their Knowledge of Language
 Constructions are Revealed in Word Distributions
 API Agents vs. GUI Agents: Divergence and Convergence
 A Survey on Knowledge-Oriented Retrieval-Augmented Generation
 Small Vision-Language Models: A Survey on Compact Architectures and Techniques
 Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty?
 Generative Modelling for Mathematical Discovery
 Ordered Semantically Diverse Sampling for Textual Data
 Semantic Wave Functions: Exploring Meaning in Large Language Models Through Quantum Formalism
 Evaluation of the Automated Labeling Method for Taxonomic Nomenclature Through Prompt-Optimized Large Language Model
 LimTopic: LLM-based Topic Modeling and Text Summarization for Analyzing Scientific Articles limitations
 From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
 ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
 ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
 WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models
 Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs
 Why do language models perform worse for morphologically complex languages?
 When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages
 MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
 R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
 reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs
 RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
 V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
 UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality
 Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words
 Basic Category Usage in Vision Language Models
 Investigating Human-Aligned Large Language Model Uncertainty
 A Review of DeepSeek Models' Key Innovative Techniques
 Agents Play Thousands of 3D Video Games
 Free-form language-based robotic reasoning and grasping
 Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
 SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?
 APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
 xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference
 SuperBPE: Space Travel for Language Models
 $φ$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation
 EXAONE Deep: Reasoning Enhanced Language Models
 Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs
 Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
 Frac-Connections: Fractional Extension of Hyper-Connections
 PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models
 MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification
 Auditing language models for hidden objectives
 Aligning Multimodal LLM with Human Preference: A Survey
 Measuring AI Ability to Complete Long Tasks
 Temporal Consistency for LLM Reasoning Process Error Identification
 Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts
 Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
 DAPO: An Open-Source LLM Reinforcement Learning System at Scale
 RWKV-7 "Goose" with Expressive Dynamic State Evolution
 CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning
 LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
 DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
 CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
 PENCIL: Long Thoughts with Short Memory
 Pensez: Less Data, Better Reasoning -- Rethinking French LLM
 MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
 Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs
 Learning to Inference Adaptively for Multimodal Large Language Models
 Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models
 EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees
 CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving
 Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions
 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
 Optimizing Retrieval Strategies for Financial Question Answering Documents in Retrieval-Augmented Generation Systems
 ELTEX: A Framework for Domain-Driven Synthetic Data Generation
 Enhancing Code LLM Training with Programmer Attention
 Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning
 ViSpeak: Visual Instruction Feedback in Streaming Videos
 STEVE: A Step Verification Pipeline for Computer-use Agent Training
 LEGION: Learning to Ground and Explain for Synthetic Image Detection
 GKG-LLM: A Unified Framework for Generalized Knowledge Graph Construction
 MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer
 VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
 SkyLadder: Better and Faster Pretraining via Context Window Scheduling
 LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers
 Inside-Out: Hidden Factual Knowledge in LLMs
 The KoLMogorov Test: Compression by Code Generation
 LLM-Mediated Guidance of MARL Systems
 XAttention: Block Sparse Attention with Antidiagonal Scoring
 Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
 Survey on Evaluation of LLM-based Agents
 CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners
 LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates
 Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
 Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning
 Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
 MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion
 CLS-RL: Image Classification with Rule-Based Reinforcement Learning
 SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs
 Mixture of Lookup Experts
 FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models
 Vision-Speech Models: Teaching Speech Models to Converse about Images
 Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
 BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?
 Why Personalizing Deep Learning-Based Code Completion Tools Matters
 Where do Large Vision-Language Models Look at when Answering Questions?
 Why Do Multi-Agent LLM Systems Fail?
 UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
 RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
 Causal Discovery from Data Assisted by Large Language Models
 Modifying Large Language Model Post-Training for Diverse Creative Writing
 Computation Mechanism Behind LLM Position Generalization
 A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules
 Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique
 OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement
 Capturing Individual Human Preferences with Reward Features
 PVChat: Personalized Video Chat with One-Shot Learning
 MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving
 MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization
 MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
 AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning
 GAEA: A Geolocation Aware Conversational Model
 What Makes a Reward Model a Good Teacher? An Optimization Perspective
 Can Large Vision Language Models Read Maps Like a Human?
 From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
 Implicit Bias-Like Patterns in Reasoning Models
 Thinking Machines: A Survey of LLM based Reasoning Strategies
 StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs
 A Comprehensive Survey on Long Context Language Modeling
 Deconstructing Long Chain-of-Thought: A Structured Reasoning Optimization Framework for Long CoT Distillation
 ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization
 Exploring Training and Inference Scaling Laws in Generative Retrieval
 Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
 Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
 FFN Fusion: Rethinking Sequential Computation in Large Language Models
 Optimizing Language Models for Inference Time Objectives using Reinforcement Learning
 SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
 I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
 Reasoning to Learn from Latent Thoughts
 Defeating Prompt Injections by Design
 Verbal Process Supervision Elicits Better Coding Agents
 Context-Efficient Retrieval with Factual Decomposition
 AgentRxiv: Towards Collaborative Autonomous Research
 Mind with Eyes: from Language Reasoning to Multimodal Reasoning
 Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?
 Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning
 SUNAR: Semantic Uncertainty based Neighborhood Aware Retrieval for Complex QA
 Feather-SQL: A Lightweight NL2SQL Framework with Dual-Model Collaboration Paradigm for Small Language Models
 Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM
 Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models
 Language Models May Verbatim Complete Text They Were Not Explicitly Trained On
 Variance Control via Weight Rescaling in LLM Pre-training
 Judge Anything: MLLM as a Judge Across Any Modality
 LEMMA: Learning from Errors for MatheMatical Advancement in LLMs
 V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms
 Typed-RAG: Type-aware Multi-Aspect Decomposition for Non-Factoid Question Answering
 Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings
 CoLLM: A Large Language Model for Composed Image Retrieval
 Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark
 MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search
 ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems
 Dewey Long Context Embedding Model: A Technical Report
 Fully Autonomous AI Agents Should Not be Developed
 Open Deep Search: Democratizing Search with Open-source Reasoning Agents
 Efficient Model Development through Fine-tuning Transfer
 LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
 LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
 LookAhead Tuning: Safer Language Models via Partial Answer Previews
 Scaling Vision Pre-Training to 4K Resolution
 Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators
 Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
 Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
 Learning to chain-of-thought with Jensen's evidence lower bound
 Scaling Laws of Synthetic Data for Language Models
 ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
 Can Vision-Language Models Answer Face to Face Questions in the Real-World?
 ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
 Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling
 xKV: Cross-Layer SVD for KV-Cache Compression
 LLaVAction: evaluating and training multi-modal large language models for action recognition
 When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making
 Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs
 MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
 FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement
 Going Beyond Linear Transformers with Recurrent Fast Weight Programmers
 Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging
 Video-R1: Reinforcing Video Reasoning in MLLMs
 ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation
 UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
 Large Language Model Agent: A Survey on Methodology, Applications and Challenges
 Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
 ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
 Identifying Emerging Concepts in Large Corpora
 ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging
 FinAudio: A Benchmark for Audio Large Language Models in Financial Applications
 LLPut: Investigating Large Language Models for Bug Report-Based Input Generation
 ViLBench: A Suite for Vision-Language Process Reward Modeling
 Qwen2.5-Omni Technical Report
 Gemma 3 Technical Report
 Overtrained Language Models Are Harder to Fine-Tune
 Don't lie to your friends: Learning what you know from collaborative self-play
 Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs
 Ensemble Learning for Large Language Models in Text and Code Generation: A Survey
 Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
 RONA: Pragmatically Diverse Image Captioning with Coherence Relations
 MARRO: Multi-headed Attention for Rhetorical Role Labeling in Legal Documents
 New Trends for Modern Machine Translation with Large Reasoning Models
 Compute Optimal Scaling of Skills: Knowledge vs Reasoning
 MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
 Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
 Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs
 LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
 Training Plug-n-Play Knowledge Modules with Deep Context Distillation
 Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
 HuixiangDou2: A Robustly Optimized GraphRAG Approach
 A Survey on Post-training of Large Language Models
 A Survey of Large Language Model Empowered Agents for Recommendation and Search: Towards Next-Generation Information Retrieval
 Knowledge Updating? No More Model Editing! Just Selective Contextual Reasoning
 Continual Pre-training of MoEs: How robust is your router?
 SafeArena: Evaluating the Safety of Autonomous Web Agents
 Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models
 Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models
 Generating Millions Of Lean Theorems With Proofs By Exploring State Transition Graphs
 L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
 PDX: A Data Layout for Vector Similarity Search
 Process-based Self-Rewarding Language Models
 SoftMatcha: A Soft and Fast Pattern Matcher for Billion-Scale Corpus Searches
 LLM as GNN: Graph Vocabulary Learning for Text-Attributed Graph Foundation Models
 FANS -- Formal Answer Selection for Natural Language Math Reasoning Using Lean4
 Improving LLM-as-a-Judge Inference with the Judgment Distribution
 Tabby: Tabular Data Synthesis with Language Models
 How to Steer LLM Latents for Hallucination Detection?
 Better Embeddings with Coupled Adam
 RSQ: Learning from Important Tokens Leads to Better Quantized LLMs
 CoSMoEs: Compact Sparse Mixture of Experts
 Steering Large Language Model Activations in Sparse Spaces
 VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning
 LLM Post-Training: A Deep Dive into Reasoning Large Language Models
 Token-level Ensembling of Models with Different Vocabularies
 WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval
 FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
 $Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training
 Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?
 Long-Context Inference with Retrieval-Augmented Speculative Decoding
 Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale
 Implicit Search via Discrete Diffusion: A Study on Chess
 Speculative Decoding and Beyond: An In-Depth Survey of Techniques
 GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration
 No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data
 Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs
 Learning Code-Edit Embedding to Model Student Debugging Behavior
 UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering
 Talking like Piping and Instrumentation Diagrams (P&IDs)
 Letters from Future Self: Augmenting the Letter-Exchange Exercise with LLM-based Agents to Enhance Young Adults' Career Exploration
 Seeing the Forest for the Trees: A Large Scale, Continuously Updating Meta-Analysis of Frontier LLMs
 Automatic Prompt Optimization via Heuristic Search: A Survey
 TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning
 Unveiling and Causalizing CoT: A Causal Perspective
 Bayesian Optimization for Controlled Image Editing via LLMs
 Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning
 MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks
 An Overview of Large Language Models for Statisticians
 From System 1 to System 2: A Survey of Reasoning Large Language Models
 LLM-QE: Improving Query Expansion by Aligning Large Language Models with Ranking Preferences
 Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines
 Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries
 Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores
 Interrogating LLM design under a fair learning doctrine
 The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination
 Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models
 LLMs in Mobile Apps: Practices, Challenges, and Opportunities
 An Agent Framework for Real-Time Financial Information Searching with Large Language Models
 Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training
 Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning
 DReSD: Dense Retrieval for Speculative Decoding
 SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention
 LEDD: Large Language Model-Empowered Data Discovery in Data Lakes
 Which Attention Heads Matter for In-Context Learning?
 MuDAF: Long-Context Multi-Document Attention Focusing through Contrastive Learning on Attention Heads
 How Do LLMs Perform Two-Hop Reasoning in Context?
 Lost in Sequence: Do Large Language Models Understand Sequential Recommendation?
 Evaluating Step-by-step Reasoning Traces: A Survey
 Idiosyncrasies in Large Language Models
 Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control
 TokenSkip: Controllable Chain-of-Thought Compression in LLMs
 CONSTRUCTA: Automating Commercial Construction Schedules in Fabrication Facilities with Large Language Models
 LLM Agents Making Agent Tools
 SMART: Self-Aware Agent for Tool Overuse Mitigation
 Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents
 Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
 RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation
 Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective
 Solvable Dynamics of Self-Supervised Word Embeddings and the Emergence of Analogical Reasoning
 Spectral Journey: How Transformers Predict the Shortest Path
 IHEval: Evaluating Language Models on Following the Instruction Hierarchy
 Human Decision-making is Susceptible to AI-driven Manipulation
 A Comprehensive Review of Protein Language Models
 Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation
 Developmentally-plausible Working Memory Shapes a Critical Period for Language Acquisition
 SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning
 CARROT: A Cost Aware Rate Optimal Router
 Peri-LN: Revisiting Layer Normalization in the Transformer Architecture
 Layer by Layer: Uncovering Hidden Representations in Language Models
 VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
 Explaining Context Length Scaling and Bounds for Language Models
 Eliciting Language Model Behaviors with Investigator Agents
 Internal Activation as the Polar Star for Steering Unsafe LLM Behavior
 MINT: Mitigating Hallucinations in Large Vision-Language Models via Token Reduction
 Sigmoid Self-Attention is Better than Softmax Self-Attention: A Mixture-of-Experts Perspective
 Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023
 Rope to Nope and Back Again: A New Hybrid Attention Strategy
 MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
 Improving LLM Leaderboards with Psychometrical Methodology
 Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
 ADAM-1: AI and Bioinformatics for Alzheimer's Detection and Microbiome-Clinical Data Integrations
 Engineering LLM Powered Multi-agent Framework for Autonomous CloudOps
 I Can Find You in Seconds! Leveraging Large Language Models for Code Authorship Attribution
 Large Language Model Interface for Home Energy Management Systems
 Enhancing Talent Employment Insights Through Feature Extraction with LLM Finetuning
 Hierarchical Repository-Level Code Summarization for Business Applications Using Local LLMs
 Flow: Modularized Agentic Workflow Automation
 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding
 CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation
 Visual Language Models as Operator Agents in the Space Domain
 Large Language Models for Interpretable Mental Health Diagnosis
 SafePowerGraph-LLM: Novel Power Grid Graph Embedding and Optimization with Large Language Models
 Evaluating Agent-based Program Repair at Google
 RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment
 AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
 LLM-Net: Democratizing LLMs-as-a-Service through Blockchain-based Expert Networks
 Touched by ChatGPT: Using an LLM to Drive Affective Tactile Interaction
 How is Google using AI for internal code migrations?
 Eliza: A Web3 friendly AI Agent Operating System
 Using Pre-trained LLMs for Multivariate Time Series Forecasting
 IntelEX: A LLM-driven Attack-level Threat Intelligence Extraction Framework
 ConSim: Measuring Concept-Based Explanations' Effectiveness with Automated Simulatability
 Debugging Without Error Messages: How LLM Prompting Strategy Affects Programming Error Explanation Effectiveness
 MDSF: Context-Aware Multi-Dimensional Data Storytelling Framework based on Large Language Model
 LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management
 Efficient Reasoning Models: A Survey
 Embracing Large Language Models in Traffic Flow Forecasting
 TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs
 Semantic Steganography: A Framework for Robust and High-Capacity Information Hiding using Large Language Models
 A Contextualized BERT model for Knowledge Graph Completion
 WHAT-IF: Exploring Branching Narratives by Meta-Prompting Large Language Models
 GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning
 ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
 Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries
 EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance
 SwarmGPT-Primitive: A Language-Driven Choreographer for Drone Swarms Using Safe Motion Primitive Composition
 CMT: A Memory Compression Method for Continual Knowledge Learning of Large Language Models
 EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models
 Asynchronous LLM Function Calling
 AutoReason: Automatic Few-Shot Reasoning Decomposition
 Taming Sensitive Weights: Noise Perturbation Fine-tuning for Robust LLM Quantization
 Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families
 Antidistillation Sampling
 LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation
 Beyond pip install: Evaluating LLM Agents for the Automated Installation of Python Projects
 iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
 SiReRAG: Indexing Similar and Related Information for Multihop Reasoning
 Enhanced Computationally Efficient Long LoRA Inspired Perceiver Architectures for Auto-Regressive Language Modeling
 Does RLHF Scale? Exploring the Impacts From Data, Model, and Method
 LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
 HiVeGen -- Hierarchical LLM-based Verilog Generation for Scalable Chip Design
 Rethinking Time Series Forecasting with LLMs via Nearest Neighbor Contrastive Learning
 Cross-Self KV Cache Pruning for Efficient Vision-Language Inference
 MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification
 DataLab: A Unified Platform for LLM-Powered Business Intelligence
 Towards Adaptive Mechanism Activation in Language Agent
 InstCache: A Predictive Cache for LLM Serving
 DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation
 ARChef: An iOS-Based Augmented Reality Cooking Assistant Powered by Multimodal Gemini LLM
 Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects
 AIDetx: a compression-based method for identification of machine-learning generated text
 Build An Influential Bot In Social Media Simulations With Large Language Models
 Context-Aware Membership Inference Attacks against Pre-trained Large Language Models
 Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning
 Marconi: Prefix Caching for the Era of Hybrid LLMs
 CoVis: A Collaborative Framework for Fine-grained Graphic Visual Understanding
 FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving
 MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
 FlexiBit: Fully Flexible Precision Bit-parallel Accelerator Architecture for Arbitrary Mixed Precision AI
 Automated Test Transfer Across Android Apps Using Large Language Models
 Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
 Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models
 MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
 Reassessing Layer Pruning in LLMs: New Insights and Methods
 Measuring Bullshit in the Language Games played by ChatGPT
 AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution
 Multiverse of Greatness: Generating Story Branches with LLMs
 DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization
 How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT, AI Models and Human Perception
 One fish, two fish, but not the whole sea: Alignment reduces language models' conceptual diversity
 Accelerated AI Inference via Dynamic Execution Methods
 Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench
 Centaur: a foundation model of human cognition
 Markov Chain of Thought for Efficient Mathematical Reasoning
 Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling
 Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles
 Compositional Entailment Learning for Hyperbolic Vision-Language Models
 Uplifting Lower-Income Data: Strategies for Socioeconomic Perspective Shifts in Large Multi-modal Models
 ImProver: Agent-Based Automated Proof Optimization
 ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement
 SELP: Generating Safe and Efficient Task Plans for Robot Agents with Large Language Models
 From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
 AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
 What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning
 Strategies for Improving NL-to-FOL Translation with LLMs: Data Generation, Incremental Fine-Tuning, and Verification
 Advertiser Content Understanding via LLMs for Google Ads Safety
 Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits
 Logically Consistent Language Models via Neuro-Symbolic Integration
 Evaluating Defences against Unsafe Feedback in RLHF
 SpecEval: Evaluating Code Comprehension in Large Language Models via Program Specifications
 Human Interest or Conflict? Leveraging LLMs for Automated Framing Analysis in TV Shows
 Profiling Patient Transcript Using Large Language Model Reasoning Augmentation for Alzheimer's Disease Detection
 On the consistent reasoning paradox of intelligence and optimal trust in AI: The power of 'I don't know'
 A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
 Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness
 GenoTEX: A Benchmark for Automated Gene Expression Data Analysis in Alignment with Bioinformaticians
 Why Would You Suggest That? Human Trust in Language Model Responses
 SLMRec: Distilling Large Language Models into Small for Sequential Recommendation
 LLMs can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought
 Does Mapo Tofu Contain Coffee? Probing LLMs for Food-related Cultural Knowledge
 How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
 Prompt Optimization via Adversarial In-Context Learning
 PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving
 Generative Sequential Recommendation with GPTRec
 Does Vision Accelerate Hierarchical Generalization in Neural Language Learners?
 Hierarchical LLMs In-the-loop Optimization for Real-time Multi-Robot Target Tracking under Unknown Hazards
 ChefFusion: Multimodal Foundation Model Integrating Recipe and Food Image Generation
 MEOW: MEMOry Supervised LLM Unlearning Via Inverted Facts
 LLM-Powered Text Simulation Attack Against ID-Free Recommender Systems
 Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview
 LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs
 Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style
 Self-Attention Limits Working Memory Capacity of Transformer-Based Models
 E2Map: Experience-and-Emotion Map for Self-Reflective Robot Navigation with Language Models
 Confidence Estimation for LLM-Based Dialogue State Tracking
 Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU
 FP-VEC: Fingerprinting Large Language Models via Efficient Vector Addition
 LLaQo: Towards a Query-Based Coach in Expressive Music Performance Assessment
 Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions
 Intelligent LiDAR Navigation: Leveraging External Information and Semantic Maps with LLM as Copilot
 Faster Speech-LLaMA Inference with Multi-token Prediction
 Enabling Cost-Effective UI Automation Testing with Retrieval-Based LLMs: A Case Study in WeChat
 Exploring LLMs for Malware Detection: Review, Framework Design, and Countermeasure Approaches
 Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
 Geometric-Averaged Preference Optimization for Soft Preference Labels
 Optimal Workload Placement on Multi-Instance GPUs
 Towards Agentic AI on Particle Accelerators
 Enhancing Long Video Understanding via Hierarchical Event-Based Memory
 Algorithmic Language Models with Neurally Compiled Libraries
 Harder Tasks Need More Experts: Dynamic Routing in MoE Models
 Shared Global and Local Geometry of Language Model Embeddings
 Multi-head Reward Aggregation Guided by Entropy
 How do language models learn facts? Dynamics, curricula and hallucinations
 A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
 debug-gym: A Text-Based Environment for Interactive Debugging
 SWI: Speaking with Intent in Large Language Models
 From Deep Learning to LLMs: A survey of AI in Quantitative Investment
 RALLRec+: Retrieval Augmented Large Language Model Recommendation with Reasoning
 Speculative Decoding for Verilog: Speed and Quality, All in One
 Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
 Writing as a testbed for open ended agents
 A Survey of Large Language Model Agents for Question Answering
 Self-Organizing Graph Reasoning Evolves into a Critical State for Continuous Discovery Through Structural-Semantic Dynamics
 YoChameleon: Personalized Vision and Language Generation
 Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models
 Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration
 Improving Low-Resource Retrieval Effectiveness using Zero-Shot Linguistic Similarity Transfer
 A Refined Analysis of Massive Activations in LLMs
 Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
 MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via Reasoning Agentic Workflow
 On Large Multimodal Models as Open-World Image Classifiers
 ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback
 AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation
 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
 OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning
 Both Direct and Indirect Evidence Contribute to Dative Alternation Preferences in Language Models
 Challenges and Paths Towards AI for Software Engineering
 Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities
 Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
 Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey
 Supposedly Equivalent Facts That Aren't? Entity Frequency in Pre-training Induces Asymmetry in LLMs
 Sharpe Ratio-Guided Active Learning for Preference Optimization in RLHF
 Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment
 Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
 Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models
 Effectively Controlling Reasoning Models through Thinking Intervention
 Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
 What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
 TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
 Expanding RL with Verifiable Rewards Across Diverse Domains
 KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
 Efficient Inference for Large Reasoning Models: A Survey
 Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code
 LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
 Entropy-Based Adaptive Weighting for Self-Training
 Decoupling Angles and Strength in Low-rank Adaptation
 UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
 Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning
 Better wit than wealth: Dynamic Parametric Retrieval Augmented Generation for Test-time Knowledge Enhancement
 Adaptive Layer-skipping in Pre-trained LLMs
 RARE: Retrieval-Augmented Reasoning Modeling
 TRA: Better Length Generalisation with Threshold Relative Attention
 ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
 PAVE: Patching and Adapting Video Large Language Models
 When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
 Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base
 Multi-Token Attention
 Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
 m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models
 Z1: Efficient Test-time Scaling with Code
 Command A: An Enterprise-Ready Large Language Model
 Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
 TimeLMs: Diachronic Language Models from Twitter
 Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead
 Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
 Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
 Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
 AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
 CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis
 JudgeLRM: Large Reasoning Models as a Judge
 Towards Trustworthy GUI Agents: A Survey
 OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
 Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
 LARGE: Legal Retrieval Augmented Generation Evaluation Tool
 Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL
 Token embeddings violate the manifold hypothesis
 Hawkeye: Efficient Reasoning with Model Collaboration
 Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
 Improved Visual-Spatial Reasoning via R1-Zero-Like Training
 PaperBench: Evaluating AI's Ability to Replicate AI Research
 YourBench: Easy Custom Evaluation Sets for Everyone
 Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
 AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
 Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
 ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations
 Inference-Time Scaling for Generalist Reward Modeling
 Efficient Model Selection for Time Series Forecasting via LLMs
 Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length?
 ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
 Bhakti: A Lightweight Vector Database Management System for Endowing Large Language Models with Semantic Search Capabilities and Memory
 MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage
 ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning
 Medical large language models are easily distracted
 VerifiAgent: a Unified Verification Agent in Language Model Reasoning
 DASH: Detection and Assessment of Systematic Hallucinations of VLMs
 Understanding R1-Zero-Like Training: A Critical Perspective
 Affordable AI Assistants with Knowledge Graph of Thoughts
 ZClip: Adaptive Spike Mitigation for LLM Pre-Training
 Scaling Analysis of Interleaved Speech-Text Language Models
 MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
 GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
 ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
 Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
 OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
 Comment Staytime Prediction with LLM-enhanced Comment Understanding
 Scaling Laws in Scientific Discovery with AI and Robot Scientists
 Analyzing the Generalization and Reliability of Steering Vectors
 Why do LLMs attend to the first token?
 Beyond Outlining: Heterogeneous Recursive Planning for Adaptive Long-form Writing with Language Models
 UNDO: Understanding Distillation as Optimization
 How Do Large Language Monkeys Get Their Power (Laws)?
 Multi-Agent Multimodal Models for Multicultural Text to Image Generation
 Large Language Models Pass the Turing Test
 Mixture of Routers
 MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
 SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
 Agentic Knowledgeable Self-awareness
 VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
 Soft Policy Optimization: Online Off-Policy RL for Sequence Models
 Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
 APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay
 BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models
 MegaMath: Pushing the Limits of Open Math Corpora
 Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
 Slow-Fast Architecture for Video Multi-Modal Large Language Models
 TransMamba: Flexibly Switching between Transformer and Mamba
 Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning
 ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning
 Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions
 URECA: Unique Region Caption Anything
 Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
 Align to Structure: Aligning Large Language Models with Structural Information
 DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
 Understanding Aha Moments: from External Observations to Internal Mechanisms
 Attention Sinks and Outlier Features: A 'Catch, Tag, and Release' Mechanism for Embeddings
 LiveVQA: Live Visual Knowledge Seeking
 M-Prometheus: A Suite of Open Multilingual LLM Judges
 A Llama walks into the 'Bar': Efficient Supervised Fine-Tuning for Legal Reasoning in the Multi-state Bar Exam
 Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
 Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs
 Clinical ModernBERT: An efficient and long context encoder for biomedical text
 JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
 RaanA: A Fast, Flexible, and Data-Efficient Post-Training Quantization Algorithm
 Robustly identifying concepts introduced during chat fine-tuning using crosscoders
 SmolVLM: Redefining small and efficient multimodal models
 DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation
 VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
 T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
 GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models
 Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources
 DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models
 Mixture-of-Personas Language Models for Population Simulation
 Constitution or Collapse? Exploring Constitutional AI with Llama 3-8B
 Rethinking Reflection in Pre-Training
 Sample, Don't Search: Rethinking Test-Time Alignment for Language Models
 Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)
 Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
 Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
 Kimi-VL Technical Report
 Saliency-driven Dynamic Token Pruning for Large Language Models
 Retro-Search: Exploring Untaken Paths for Deeper and Efficient Reasoning
 Gating is Weighting: Understanding Gated Linear Attention through In-context Learning
 Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking
 EnrichIndex: Using LLMs to Enrich Retrieval Indices Offline
 Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
 An Empirical Study of GPT-4o Image Generation Capabilities
 Generative Evaluation of Complex Reasoning in Large Language Models
 CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation
 Do Transformer Modifications Transfer Across Implementations and Applications?
 Retrieval Augmented Generation with Collaborative Filtering for Personalized Text Generation
 Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
 V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models
 COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
 Leanabell-Prover: Posttraining Scaling in Formal Reasoning
 HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
 Efficient Reinforcement Finetuning via Adaptive Curriculum Learning
 From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models
 Lattice: Learning to Efficiently Compress the Memory
 Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure
 OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
 SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills
 OmniCaptioner: One Captioner to Rule Them All
 A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility
 Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
 Exact Unlearning of Finetuning Data via Model Merging at Scale
 Self-Steering Language Models
 VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
 RuOpinionNE-2024: Extraction of Opinion Tuples from Russian News Texts
 An Investigation of Prompt Variations for Zero-shot LLM-based Rankers
 Rethinking RoPE: A Mathematical Blueprint for N-dimensional Positional Encoding
 Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
 Pretraining Language Models for Diachronic Linguistic Change Discovery
 A Survey on Personalized and Pluralistic Preference Alignment in Large Language Models
 Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling
 CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers
 C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing
 MM-IFEngine: Towards Multimodal Instruction Following
 MemInsight: Autonomous Memory Augmentation for LLM Agents
 VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
 MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations
 A System for Comprehensive Assessment of RAG Frameworks
 DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
 OSCAR: Online Soft Compression And Reranking
 Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking
 Towards Visual Text Grounding of Multimodal Large Language Model
 To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning
 Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions
 Current and Future Use of Large Language Models for Knowledge Work
 Synthetic Data Generation Using Large Language Models: Advances in Text and Code
 Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
 SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning
 VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
 Real-World Gaps in AI Governance Research
 Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression
 Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization
 A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models
 How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective
 LLM4Ranking: An Easy-to-use Framework of Utilizing Large Language Models for Document Reranking
 Towards Distribution Matching between Collaborative and Language Spaces for Generative Recommendation
 On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation
 Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
 GOLLuM: Gaussian Process Optimized LLMs -- Reframing LLM Finetuning through Bayesian Optimization
 Increasing happiness through conversations with artificial intelligence
 End-To-End Memory Networks
 Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction
 Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization
 ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance
 SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning
 Out of Style: RAG's Fragility to Linguistic Variation
 SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
 CoRAG: Collaborative Retrieval-Augmented Generation
 Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs
 Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models
 Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images
 SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling
 On The Landscape of Spoken Language Models: A Comprehensive Survey
 SEAL: Steerable Reasoning Calibration of Large Language Models for Free
 Perception-R1: Pioneering Perception Policy with Reinforcement Learning
 InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
 LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
 S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
 Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
 SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users
 Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
 KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference
 FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
 Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems
 DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training
 EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety
 TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
 Do Reasoning Models Show Better Verbalized Calibration?
 AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
 VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
 PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
 Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability
 Evaluating the Generalization Capabilities of Large Language Models on Code Reasoning
 Amuse: Human-AI Collaborative Songwriting with Multimodal Inspirations
 Language Models can Evaluate Themselves via Probability Discrepancy
 M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
 LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models
 A Survey of Personalization: From RAG to Agent
 Reasoning Models Can Be Effective Without Thinking
 RAKG: Document-level Retrieval Augmented Knowledge Graph Construction
 VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
 Iterative Self-Training for Code Generation via Reinforced Re-Ranking
 (How) Do reasoning models reason?
 How new data permeates LLM knowledge and how to dilute it
 VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
 UltraRAG: A Modular and Automated Toolkit for Adaptive Retrieval-Augmented Generation
 DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?
 Teaching Large Language Models to Reason through Learning and Forgetting
 MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
 MIEB: Massive Image Embedding Benchmark
 Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models
 Long Context In-Context Compression by Getting to the Gist of Gisting
 From Tokens to Lattices: Emergent Lattice Structures in Language Models
 MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits
 Language Model Alignment in Multilingual Trolley Problems
 Robust and Fine-Grained Detection of AI Generated Texts
 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
 DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
 TextArena
 Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts
 Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning
 DataDecide: How to Predict Best Pretraining Data with Small Experiments
 A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
 Looking beyond the next token
 LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews
 ReZero: Enhancing LLM search ability by trying one-more-time
 Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From
 How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients
 Better Estimation of the KL Divergence Between Language Models
 Efficient Process Reward Model Training via Active Learning
 HeteRAG: A Heterogeneous Retrieval-augmented Generation Framework with Decoupled Knowledge Representations
 xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
 Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
 The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
 Multimodal Long Video Modeling Based on Temporal Dynamic Context
 VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge
 Heimdall: test-time scaling on the generative verification
 RealHarm: A Collection of Real-World Language Model Application Failures
 AI-University: An LLM-based platform for instructional alignment to scientific classrooms
 Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
 Word Embeddings Track Social Group Changes Across 70 Years in China
 Adaptive Computation Pruning for the Forgetting Transformer
 PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs
 Liquid: Language Models are Scalable and Unified Multi-modal Generators
 BitNet b1.58 2B4T Technical Report
 d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
 A Library of LLM Intrinsics for Retrieval-Augmented Generation
 NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes
 ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
 AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference
 SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
 Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution
 SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
 Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT?
 FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents
 A Survey of Multimodal Retrieval-Augmented Generation
 MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?
 CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
 Exploring Expert Failures Improves LLM Agent Tuning
 VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
 Retrieval-Augmented Generation with Conflicting Evidence
 Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization
 HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation
 FocusedAD: Character-centric Movie Audio Description
 Improving Instruct Models for Free: A Study on Partial Adaptation
 CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives
 Vikhr: The Family of Open-Source Instruction-Tuned Large Language Models for Russian
 A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis
 Sleep-time Compute: Beyond Inference Scaling at Test-time
 MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation
 PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
 Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs?
 Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time
 Reinforcement Learning from Human Feedback
 Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis
 LLMTaxo: Leveraging Large Language Models for Constructing Taxonomy of Factual Claims from Social Media
 MIB: A Mechanistic Interpretability Benchmark
 LitLLMs, LLMs for Literature Review: Are we there yet?
 VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers
 Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
 Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
 MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
 Generative AI Act II: Test Time Scaling Drives Cognition Engineering
 Could Thinking Multilingually Empower LLM Reasoning?
 Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models
 Analyzing LLMs' Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
 RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search
 Learning to Reason under Off-Policy Guidance
 Local Prompt Optimization
 UFO2: The Desktop AgentOS
 NEMOTRON-CROSSTHINK: Scaling Self-Learning beyond Math Reasoning
 Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning
 Learning to Attribute with Attention
 Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results
 Prejudge-Before-Think: Enhancing Large Language Models at Test-Time by Process Prejudge Reasoning
 Cost-of-Pass: An Economic Framework for Evaluating Language Models
 Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models
 Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
 THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
 Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
 Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
 FlowReasoner: Reinforcing Query-Level Meta-Agents
 EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models
 LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs
 InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
 ToolRL: Reward is All Tool Learning Needs
 LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark
 X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
 An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
 OTC: Optimal Tool Calls via Reinforcement Learning
 SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging
 Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
 Mixer Metaphors: audio interfaces for non-musical applications
 Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
 Compute-Optimal LLMs Provably Generalize Better With Scale
 Vector Embedding, Retrieval-Augmented Generation, CPU-NPU Collaboration, Heterogeneous Computing
 Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions
 Kuwain 1.5B: An Arabic SLM via Language Injection
 PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines
 Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey
 Less is More: Adaptive Coverage for Synthetic Training Data
 CacheFormer: High Attention-Based Segment Caching
 LoRe: Personalizing LLMs via Low-Rank Reward Modeling
 TALES: Text Adventure Learning Environment Suite
 TTRL: Test-Time Reinforcement Learning
 MR. Video: "MapReduce" is the Principle for Long Video Understanding
 LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
 LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
 Efficient Pretraining Length Scaling
 WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents
 The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
 Learning Adaptive Parallel Reasoning with Language Models
 IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
 BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation
 Progent: Programmable Privilege Control for LLM Agents
 IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property
 AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
 Process Reward Models That Think
 Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
 Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model
 A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment
 Trillion 7B Technical Report
 VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
 BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
 GraphEval: A Lightweight Graph-Based LLM Framework for Idea Evaluation
 miniCTX: Neural Theorem Proving with (Long-)Contexts
 Tina: Tiny Reasoning Models via LoRA
 CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation
 Unchecked and Overlooked: Addressing the Checkbox Blind Spot in Large Language Models with CheckboxQA
 PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
 Causal-Copilot: An Autonomous Causal Analysis Agent
 Rethinking the Generation of High-Quality CoT Data from the Perspective of LLM-Adaptive Question Difficulty Grading
 ComPO: Community Preferences for Language Model Personalization
 Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
 Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
 Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
 QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
 TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
 IberBench: LLM Evaluation on Iberian Languages
 Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions
 The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
 Energy Considerations of Large Language Model Inference and Efficiency Optimizations
 Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation
 Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning
 Scaling Laws For Scalable Oversight
 Kimi-Audio Technical Report
 HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?
 Even Small Reasoners Should Quote Their Sources: Introducing the Pleias-RAG Model Family
 Think, Prune, Train, Improve: Scaling Reasoning without Scaling Models
 PropRAG: Guiding Retrieval with Beam Search over Proposition Paths
 SMARTFinRAG: Interactive Modularized Financial RAG Benchmark
 VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
 Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
 Exploring How LLMs Capture and Represent Domain-Specific Knowledge
 Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
 Clinical knowledge in LLMs does not translate to human interactions
 DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models
 LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
 Reconstructing Context: Evaluating Advanced Chunking Strategies for Retrieval-Augmented Generation
 AlphaFuse: Learn ID Embeddings for Sequential Recommendation in Null Space of Language Embeddings
 MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
 ICL CIPHERS: Quantifying "Learning" in In-Context Learning via Substitution Ciphers
 SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning
 CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges
 Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency
 ChiseLLM: Unleashing the Power of Reasoning LLMs for Chisel Agile Hardware Development
 Accelerating Mixture-of-Experts Training with Adaptive Expert Replication
 Attention Mechanism, Max-Affine Partition, and Universal Approximation
 Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
 UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities
 ReasonIR: Training Retrievers for Reasoning Tasks
 Reinforcement Learning for Reasoning in Large Language Models with One Training Example
 GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection
 A Review of 3D Object Detection with Vision-Language Models
 Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
 X-Fusion: Introducing New Modality to Frozen Large Language Models
 The Leaderboard Illusion
 TreeHop: Generate and Filter Next Query Embeddings Efficiently for Multi-hop Question Answering
 Certified Mitigation of Worst-Case LLM Copyright Infringement
 RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
 LawFlow: Collecting and Simulating Lawyers' Thought Processes
 Learning Explainable Dense Reward Shapes via Bayesian Optimization
 Phi-4-reasoning Technical Report
 Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption
 COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
 DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition
 WebThinker: Empowering Large Reasoning Models with Deep Research Capability
 Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
 Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
 Taming the Titans: A Survey of Efficient LLM Inference Serving
 Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report
 Sadeed: Advancing Arabic Diacritization Through Small Language Model
 UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation
 Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
 NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens
 AdaR1: From Long-CoT to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization
 Memorization and Knowledge Injection in Gated LLMs
 Model Connectomes: A Generational Approach to Data-Efficient Language Models
 ParamΔ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost
 DeepCritic: Deliberate Critique with Large Language Models
 Investigating Task Arithmetic for Zero-Shot Information Retrieval
 Efficient Recommendation with Millions of Items by Dynamic Pruning of Sub-Item Embeddings
 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models
 Distributed Retrieval-Augmented Generation
 EnronQA: Towards Personalized RAG over Private Documents
 TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
 LLMs for Engineering: Teaching Models to Design High Powered Rockets
 A Survey of Scaling in Large Language Model Reasoning
 Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks
 A Practical Examination of AI-Generated Text Detectors for Large Language Models
 On the generalization of language models from in-context learning and finetuning: a controlled study
 Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
 Scaling On-Device GPU Inference for Large Generative Models
 Base Models Beat Aligned Models at Randomness and Creativity
 The Mind in the Machine: A Survey of Incorporating Psychological Theories in LLMs
 Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs
 Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning
 Investigating task-specific prompts and sparse autoencoders for activation monitoring
 Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
 Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs
 Learning to Plan Before Answering: Self-Teaching LLMs to Learn Abstract Plans for Problem Solving
 NeMo-Inspector: A Visualization Tool for LLM Generation Analysis
 Llama-Nemotron: Efficient Reasoning Models
 CORG: Generating Answers from Complex, Interrelated Contexts
 X-Cross: Dynamic Integration of Language Models for Cross-Domain Sequential Recommendation
 Muon is Scalable for LLM Training
 Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
 Evaluating Frontier Models for Stealth and Situational Awareness
 Improving Large Language Model Planning with Action Sequence Similarity
 ICQuant: Index Coding enables Low-bit LLM Quantization
 A Survey on Large Language Model based Human-Agent Systems
 R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
 FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
 LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
 Think on your Feet: Adaptive Thinking via Reinforcement Learning for Social Agents
 Low-Precision Training of Large Language Models: Methods, Challenges, and Opportunities
 ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations
 Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
 SymbioticRAG: Enhancing Document Intelligence Through Human-LLM Symbiotic Collaboration
 Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
 RM-R1: Reward Modeling as Reasoning
 A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
 TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
 Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
 Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
 Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
 FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees
 Attention Mechanisms Perspective: Exploring LLM Processing of Graph-Structured Data
 Knowing You Don't Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing
 Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality
 Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques
 Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
 LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load
 Absolute Zero: Reinforced Self-play Reasoning with Zero Data
 Invoke Interfaces Only When Needed: Adaptive Invocation for Large Language Models in Question Answering
 LlamaFirewall: An open source guardrail system for building secure AI agents
 Avoid Recommending Out-of-Domain Items: Constrained Generative Recommendation with LLMs
 Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
 Direct Retrieval-augmented Optimization: Synergizing Knowledge Selection and Language Models
 RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
 RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
 Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
 An Empirical Study of Qwen3 Quantization
 STORY2GAME: Generating (Almost) Everything in an Interactive Fiction Game
 Geospatial Mechanistic Interpretability of Large Language Models
 Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading
 Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant
 VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
 Don't be lazy: CompleteP enables compute-efficient deep transformers
 Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
 SWE-smith: Scaling Data for Software Engineering Agents
 Improving Model Alignment Through Collective Intelligence of Open-Source LLMs
 Teaching Models to Understand (but not Generate) High-risk Data
 DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning
 Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation
 ZeroSearch: Incentivize the Search Capability of LLMs without Searching
 Benchmarking LLMs' Swarm intelligence
 A Reasoning-Focused Legal Retrieval Benchmark
 OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation
 Knowledge Augmented Complex Problem Solving with Large Language Models: A Survey
 Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
 R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
 OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
 LLM-Independent Adaptive RAG: Let the Question Speak for Itself
 AutoLibra: Agent Metric Induction from Open-Ended Feedback
 OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
 Steerable Chatbots: Personalizing LLMs with Preference-Based Activation Steering
 X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
 Splitwiser: Efficient LM inference with constrained resources
 COSMOS: Predictable and Cost-Effective Adaptation of LLMs
 StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
 Scalable Chain of Thoughts via Elastic Reasoning
 Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards
 On Path to Multimodal Generalist: General-Level and General-Bench
 Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
 BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
 Crosslingual Reasoning through Test-Time Scaling
 ICon: In-Context Contribution for Automatic Data Selection
 Chain-of-Thought Tokens are Computer Program Variables
 Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
 Understanding In-context Learning of Addition via Activation Subspaces
 WaterDrum: Watermarking for Data-centric Unlearning Metric
 Rethinking Invariance in In-context Learning
 When Bad Data Leads to Good Models
 SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
 Towards Quantifying the Hessian Structure of Neural Networks
 What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction
 ExpeL: LLM Agents Are Experiential Learners
 New News: System-2 Fine-tuning for Robust Integration of New Knowledge
 LLMs Get Lost In Multi-Turn Conversation
 Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information
 What Is Next for LLMs? Next-Generation AI Computing Hardware Using Photonic Chips
 Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
 Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions
 Understanding Stragglers in Large Model Training Using What-if Analysis
 Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
 Bielik v3 Small: Technical Report
 Bielik 11B v2 Technical Report
 Qwen3 Technical Report
 Fleet of Agents: Coordinated Problem Solving with Large Language Models
 Learning Dynamics in Continual Pre-Training for Large Language Models
 Learning from Peers in Reasoning Models
 MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
 Neural Thermodynamic Laws for Large Language Model Training
 Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent
 AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
 INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning
 Measuring General Intelligence with Generated Games
 Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition
 Seed1.5-VL Technical Report
 MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG
 REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback
 Document Attribution: Examining Citation Relationships using Large Language Models
 FloE: On-the-Fly MoE Inference on Memory-constrained GPU
 WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
 Understanding Chain-of-Thought in LLMs through Information Theory
 Overflow Prevention Enhances Long-Context Recurrent LLMs
 UMoE: Unifying Attention and FFN with Shared Experts
 DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation
 The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts
 Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
 MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills
 Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation
 Imagining and building wise machines: The centrality of AI metacognition
 LLAMAPIE: Proactive In-Ear Conversation Assistants
 DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMs
 Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
 Aya Vision: Advancing the Frontier of Multilingual Multimodality
 WixQA: A Multi-Dataset Benchmark for Enterprise Retrieval-Augmented Generation
 Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
 AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale
 A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
 Optimizing Retrieval-Augmented Generation: Analysis of Hyperparameter Impact on Performance and Efficiency
 Advancing Arabic Reverse Dictionary Systems: A Transformer-Based Approach with Dataset Construction Guidelines
 The Geometry of Self-Verification in a Task-Specific Reasoning Model
 Memorization-Compression Cycles Improve Generalization
 TRAIL: Trace Reasoning and Agentic Issue Localization
 Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
 Probability Consistency in Large Language Models: Theoretical Foundations Meet Empirical Discrepancies
 InfoPO: On Mutual Information Maximization for Large Language Model Alignment
 Lost in Transmission: When and Why LLMs Fail to Reason Globally
 Putting It All into Context: Simplifying Agents with LCLMs
 Scalable LLM Math Reasoning Acceleration with Low-rank Distillation
 Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation
 MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
 WorldPM: Scaling Human Preference Modeling
 Hierarchical Document Refinement for Long-context Retrieval-augmented Generation
 J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
 The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
 Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?
 Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
 Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs
 A Comprehensive Analysis of Large Language Model Outputs: Similarity, Diversity, and Bias
 Behind Maya: Building a Multilingual Vision Language Model
 Self Rewarding Self Improving
 VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models
 Visually Interpretable Subtask Reasoning for Visual Question Answering
 SweRank: Software Issue Localization with Code Ranking
 MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering
 Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA
 End-to-End Vision Tokenizer Tuning
 Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
 Parallel Scaling Law for Language Models
 AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenge
 Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning
 System Prompt Optimization with Meta-Learning
 Analog Foundation Models
 OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
 Unilogit: Robust Machine Unlearning for LLMs Using Uniform-Target Self-Distillation
 Superposition Yields Robust Neural Scaling
 Predictability Shapes Adaptation: An Evolutionary Perspective on Modes of Learning in Transformers
 Revealing economic facts: LLMs know more than they say
 From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models
 Must Read: A Systematic Survey of Computational Persuasion
 PlanGenLLMs: A Modern Survey of LLM Planning Capabilities
 Chronocept: Instilling a Sense of Time in Machines
 MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
 Visual Planning: Let's Think Only with Images
 MPS-Prover: Advancing Stepwise Theorem Proving by Multi-Perspective Search and Data Curation
 Mergenetic: a Simple Evolutionary Model Merging Library
 Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs
 mmRAG: A Modular Benchmark for Retrieval-Augmented Generation over Text, Tables, and Knowledge Graphs
 Scaling Reasoning can Improve Factuality in Large Language Models
 Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity
 Humans expect rationality and cooperation from LLM opponents in strategic games
 GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
 MatTools: Benchmarking Large Language Models for Materials Science Tools
 Multi-Token Prediction Needs Registers
 Improving Assembly Code Performance with Large Language Models via Reinforcement Learning
 ShiQ: Bringing back Bellman to LLMs
 BLEUBERI: BLEU is a surprisingly effective reward for instruction following
 Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models
 Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images
 PoE-World: Compositional World Modeling with Products of Programmatic Experts
 Unifying Segment Anything in Microscopy with Multimodal Large Language Model
 AdaptThink: Reasoning Models Can Learn When to Think
 ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
 MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision
 Thinkless: LLM Learns When to Think
 FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA
 CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models
 Neuro-Symbolic Query Compiler
 AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning
 Chain-of-Model Learning for Language Model
 Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
 Fractured Chain-of-Thought Reasoning
 Emerging Properties in Unified Multimodal Pretraining
 LLM Context Conditioning and PWP Prompting for Multimodal Validation of Chemical Formulas
 Model Merging in Pre-training of Large Language Models
 When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research
 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
 AI-Driven Scholarly Peer Review via Persistent Workflow Prompting, Meta-Prompting, and Meta-Reasoning
 Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
 ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models
 ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning
 Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
 Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation
 TechniqueRAG: Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text
 SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning
 G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning
 HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
 SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization
 R3: Robust Rubric-Agnostic Reward Models
 A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone
 MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports
 NExT-Search: Rebuilding User Feedback Ecosystem for Generative AI Search
 Latent Flow Transformer
 Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits
 Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning
 EfficientLLM: Efficiency in Large Language Models
 Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale
 CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition
 Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection
 Truth Neurons
 Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
 Understanding Gen Alpha Digital Language: Evaluation of LLM Safety Systems for Content Moderation
 Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
 Reward Reasoning Model
 General-Reasoner: Advancing LLM Reasoning Across All Domains
 The Hallucination Tax of Reinforcement Finetuning
 VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
 Think Only When You Need with Large Hybrid-Reasoning Models
 Lessons from Defending Gemini Against Indirect Prompt Injections
 Reasoning Models Better Express Their Confidence
 Not All Correct Answers Are Equal: Why Your Distillation Source Matters
 Towards eliciting latent knowledge from LLMs with mechanistic interpretability
 Visual Agentic Reinforcement Fine-Tuning
 Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings
 CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models
 Exploring Federated Pruning for Large Language Models
 Fine-tuning Quantized Neural Networks with Zeroth-order Optimization
 Fixing 7,400 Bugs for 1$: Cheap Crash-Site Program Repair
 Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier
 Phare: A Safety Probe for Large Language Models
 MigrationBench: Repository-Level Code Migration Benchmark from Java 8
 The Aloe Family Recipe for Open and Specialized Healthcare LLMs
 Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM
 Learning to Reason via Mixture-of-Thought for Logical Reasoning
 Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
 Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
 Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
 Evaluate Bias without Manual Test Sets: A Concept Representation Perspective for LLMs
 How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
 When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning
 Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
 Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs
 lmgame-Bench: How Good are LLMs at Playing Games?
 PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration
 RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning
 Language Specific Knowledge: Do Models Know Better in X than in English?
 Text Generation Beyond Discrete Token Sampling
 WebNovelBench: Placing LLM Novelists on the Web Novel Distribution
 Visual Instruction Bottleneck Tuning
 Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training
 Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
 KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models
 Void in Language Models
 Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach
 Scaling Law for Quantization-Aware Training
 UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
 Prior Prompt Engineering for Reinforcement Fine-Tuning
 MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations
 Do Language Models Use Their Depth Efficiently?
 CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs
 Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
 A*-Decoding: Token-Efficient Inference Scaling
 BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
 J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization
 Harnessing the Universal Geometry of Embeddings
 Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation
 Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling
 HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
 The Distracting Effect: Understanding Irrelevant Passages in RAG
 VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
 BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms
 The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
 Meta-Design Matters: A Self-Design Multi-Agent System
 Self-Evolving Curriculum for LLM Reasoning
 SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
 SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding
 Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
 NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification
 LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
 Backdoor Cleaning without External Guidance in MLLM Fine-tuning
 Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
 AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
 QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
 Training-Free Reasoning and Reflection in MLLMs
 Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
 OViP: Online Vision-Language Preference Learning
 GRIT: Teaching MLLMs to Think with Images
 Reinforcement Learning from User Feedback
 In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties
 Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
 TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning
 Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
 AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
 PaTH Attention: Position Encoding via Accumulating Householder Transformations
 VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
 SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
 When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction
 Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
 SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution
 How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads
 Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning
 Let LLMs Break Free from Overthinking via Self-Braking Tuning
 MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language
 Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework
 Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
 Steering Large Language Models for Machine Translation Personalization
 Training Step-Level Reasoning Verifiers with Formal Verification Tools
 Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
 WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
 Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models
 SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
 CRISP: Clustering Multi-Vector Representations for Denoising and Pruning
 Understanding Prompt Tuning and In-Context Learning via Meta-Learning
 In-Context Watermarks for Large Language Models
 Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning
 Small-to-Large Generalization: Data Influences Models Consistently Across Scale
 Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge
 Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment
 ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay
 VeriThinker: Learning to Verify Makes Reasoning Model Efficient
 Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models
 Distilling LLM Agent into Small Models with Retrieval and Code Tools
 Teaching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection
 On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
 Speechless: Speech Instruction Training Without Speech for Low Resource Languages
 Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models
 Large Language Models Implicitly Learn to See and Hear Just By Reading
 Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning
 Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning
 AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
 Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities
 Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
 QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization
 Position of Uncertainty: A Cross-Linguistic Study of Positional Bias in Large Language Models
 QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
 Synthetic Data RL: Task Definition Is All You Need
 Augmenting LLM Reasoning with Dynamic Notes Writing for Complex QA
 PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
 TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios
 MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback
 FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow
 Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
 NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning
 Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering
 Quartet: Native FP4 Training Can Be Optimal for Large Language Models
 s3: You Don't Need That Much Data to Train a Search Agent via RL
 Time-R1: Towards Comprehensive Temporal Reasoning in LLMs
 Value-Guided Search for Efficient Chain-of-Thought Reasoning
 FREESON: Retriever-Free Retrieval-Augmented Reasoning via Corpus-Traversing MCTS
 ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection
 FuxiMT: Sparsifying Large Language Models for Chinese-Centric Multilingual Machine Translation
 Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
 How Can I Publish My LLM Benchmark Without Giving the True Answers Away?
 Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL
 DataRater: Meta-Learned Dataset Curation
 Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning
 Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration
 Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
 ARM: Adaptive Reasoning Model
 Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
 Which Data Attributes Stimulate Math and Code Reasoning? An Investigation via Influence Functions
 Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
 Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision
 DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue
 Learning to Reason without External Rewards
 Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI
 WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference
 Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey
 MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search
 Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps
 Flex-Judge: Think Once, Judge Anywhere
 B-score: Detecting biases in large language models using response history
 InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
 Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models
 NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities
 VIBE: Vector Index Benchmark for Embeddings
 Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective
 CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents
 The Coverage Principle: A Framework for Understanding Compositional Generalization
 Lifelong Safety Alignment for Language Models
 Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles
 Done Is Better than Perfect: Unlocking Efficient Reasoning by Structured Multi-Turn Decomposition
 The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models
 Interleaved Reasoning for Large Language Models via Reinforcement Learning
 BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
 Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers
 Shifting AI Efficiency From Model-Centric to Data-Centric Compression
 The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
 Removal of Hallucination on Hallucination: Debate-Augmented RAG
 Dynamic Risk Assessments for Offensive Cybersecurity Agents
 Walk&Retrieve: Simple Yet Effective Zero-shot Retrieval-Augmented Generation via Knowledge Graph Walks
 Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance
 EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning
 KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing
 StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs
 MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
 MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs
 Accelerating Nash Learning from Human Feedback via Mirror Prox
 PATS: Process-Level Adaptive Thinking Mode Switching
 LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
 An Embarrassingly Simple Defense Against LLM Abliteration Attacks
 AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting
 Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models
 TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification
 Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model
 ModernGBERT: German-only 1B Encoder Model Trained from Scratch
 Large Language Models Are More Persuasive Than Incentivized Human Persuaders
 MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
 Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models
 FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models
 REARANK: Reasoning Re-ranking Agent via Reinforcement Learning
 The Limits of Preference Data for Post-Training
 DocMMIR: A Framework for Document Multi-modal Information Retrieval
 POQD: Performance-Oriented Query Decomposer for Multi-vector retrieval
 FP4 All the Way: Fully Quantized Training of LLMs
 Efficient Data Selection at Scale via Influence Distillation
 AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking
 The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models
 Hybrid Latent Reasoning via Reinforcement Learning
 UFT: Unifying Supervised and Reinforcement Fine-Tuning
 Don't "Overthink" Passage Reranking: Is Reasoning Truly Necessary?
 Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms
 STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
 Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
 Sentience Quest: Towards Embodied, Emotionally Adaptive, Self-Evolving, Ethically Aligned Artificial General Intelligence
 Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability
 Exploring the Latent Capacity of LLMs for One-Step Text Generation
 Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
 Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
 MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios
 rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset
 MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems
 How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective
 UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
 Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment
 HoliTom: Holistic Token Merging for Fast Video Large Language Models
 MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
 Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning
 Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents
 VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration
 GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
 VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
 DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
 SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning
 VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization
 Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks
 R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
 MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks
 Improving Chemical Understanding of LLMs via SMILES Parsing
 Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning
 Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval
 SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond
 Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression
 Can Large Reasoning Models Self-Train?
 MLLMs are Deeply Affected by Modality Bias
 R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning
 Do RAG Systems Suffer From Positional Bias?
 Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration
 Improved Representation Steering for Language Models
 MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
 BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
 Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution
 An Explainable Diagnostic Framework for Neurodegenerative Dementias via Reinforcement-Optimized LLM Reasoning
 ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
 Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL
 Explaining Sources of Uncertainty in Automated Fact-Checking
 Reinforcing General Reasoning without Verifiers
 Towards Better Instruction Following Retrieval Models
 Thinker: Learning to Think Fast and Slow
 Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence
 Capability-Based Scaling Laws for LLM Red-Teaming
 Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
 CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models
 Do Large Language Models (Really) Need Statistical Foundations?
 PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval
 SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use
 Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones
 MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning
 FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information
 WebDancer: Towards Autonomous Information Seeking Agency
 Spatial Knowledge Graph-Guided Multimodal Synthesis
 The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
 Thinking with Generated Images
 Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
 Text2Grad: Reinforcement Learning from Natural Language Feedback
 Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start
 Skywork Open Reasoner 1 Technical Report
 Reverse Preference Optimization for Complex Instruction Following
 Knowledge Base Construction for Knowledge-Augmented Text-to-SQL
 R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
 SageAttention2++: A More Efficient Implementation of SageAttention2
 DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
 LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
 Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
 Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese
 Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning
 Let's Predict Sentence by Sentence
 VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
 Unveiling Instruction-Specific Neurons & Experts: An Analytical Framework for LLM's Instruction-Following Capabilities
 MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
 Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs
 Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods
 ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges
 Sherlock: Self-Correcting Reasoning in Vision-Language Models
 Fostering Video Reasoning via Next-Event Prediction
 Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models
 AITEE -- Agentic Tutor for Electrical Engineering
 SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
 First Finish Search: Efficient Test-Time Scaling in Large Language Models
 CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature
 Zero-Shot Vision Encoder Grafting via LLM Surrogates
 Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality
 Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach
 Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems
 Maximizing Confidence Alone Improves Reasoning
 Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
 Precise In-Parameter Concept Erasure in Large Language Models
 EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles
 Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks
 Can Large Language Models Infer Causal Relationships from Real-World Text?
 PixelThink: Towards Efficient Chain-of-Pixel Reasoning
 VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
 To Trust Or Not To Trust Your Vision-Language Model's Prediction
 SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents
 Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness
 When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy
 PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective
 One-shot Entropy Minimization
 CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
 IQBench: How "Smart" Are Vision-Language Models? A Study with Human IQ Tests
 DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning
 Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
 ATLAS: Learning to Optimally Memorize the Context at Test Time
 MuLoCo: Muon is a practical inner optimizer for DiLoCo
 Table-R1: Inference-Time Scaling for Table Reasoning
 Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering
 On-Policy RL with Optimal Reward Baseline
 SWE-bench Goes Live!
 VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
 ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind
 Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
 In Search of Adam's Secret Sauce
 The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
 ChartLens: Fine-grained Visual Attribution in Charts
 A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models
 Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator
 PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions
 Differential Information: An Information-Theoretic Perspective on Preference Optimization
 Are Reasoning Models More Prone to Hallucination?
 KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
 StressTest: Can YOUR Speech LM Handle the Stress?
 Multi-Domain Explainability of Preferences
 Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
 Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization
 Train Sparse Autoencoders Efficiently by Utilizing Features Correlation
 Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement
 Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities
 Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting
 MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
 Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
 System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts
 Learning Compositional Functions with Transformers from Easy-to-Hard Data
 SlimLLM: Accurate Structured Pruning for Large Language Models
 GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
 Model-Preserving Adaptive Rounding
 Toward Reliable Biomedical Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models