The GenLIP framework introduces a minimalist generative pre-training approach for Vision Transformers, enabling them to directly predict language tokens from visual inputs. It achieves competitive or superior performance on 14 diverse multimodal benchmarks with 8 billion pretraining samples, outperforming baselines that use up to 40 billion samples.
MolmoAct2, developed by the Allen Institute for AI and the University of Washington, is a fully open-source action reasoning model designed for real-world robot deployment, enhancing generalizability and efficiency. It incorporates an embodied reasoning VLM backbone and a continuous action expert, achieving up to 87.1% success on real-world DROID tasks with unseen objects and a 2.42x speedup in control rate compared to unoptimized inference.
DeepSeek-AI researchers introduced "Thinking with Visual Primitives," a framework that integrates points and bounding boxes as fundamental units of thought into Multimodal Large Language Models (MLLMs) to address the "Reference Gap" in complex visual reasoning. This approach improves performance on tasks like counting, spatial deduction, and topological navigation while significantly enhancing visual token efficiency.
ByteDance's Mamoda2.5 unifies multimodal understanding, image generation, and video generation/editing within a single Autoregressive–Diffusion framework, leveraging a fine-grained Mixture-of-Experts Diffusion Transformer (DiT-MoE) for computational efficiency. The model demonstrates competitive performance on benchmarks like VBench 2.0 (61.64) and OpenVE-Bench (3.86), while its distilled version achieves up to 95.9 times faster video editing inference.
Model Spec Midtraining (MSM) enhances the generalization of large language models by integrating an intermediate training phase that instills a deep understanding of a Model Spec's principles and values. This approach reduces agentic misalignment in out-of-distribution scenarios and improves the compute efficiency of alignment fine-tuning by up to 60x in low-sample regimes.
This work introduces on-policy distillation, a post-training method for large language models that combines the on-policy relevance of reinforcement learning with dense, token-level feedback from a teacher model. The approach achieved 70% on the AIME'24 mathematical reasoning benchmark with Qwen3-8B and demonstrated a 30x cost reduction compared to off-policy distillation, while also recovering instruction-following abilities in personalized models.
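The core mechanic described here, dense token-level feedback on sequences the student itself generates, can be sketched with a per-token reverse KL between student and teacher distributions. This is a minimal illustration under that assumption, not the paper's implementation; the function name and the numpy setup are hypothetical stand-ins.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def on_policy_distill_loss(student_logits, teacher_logits):
    """Per-token reverse KL(student || teacher), averaged over a
    student-sampled sequence: dense feedback on on-policy tokens.

    Both inputs have shape (T, V): T sampled positions, vocab size V.
    """
    p_s = softmax(student_logits)  # student distributions at each position
    p_t = softmax(teacher_logits)  # teacher distributions at the same positions
    kl = (p_s * (np.log(p_s + 1e-12) - np.log(p_t + 1e-12))).sum(axis=-1)
    return float(kl.mean())
```

In use, the student samples a completion, the teacher scores the same prefix at every position, and the student is updated to minimize this loss, which is what distinguishes the approach from off-policy distillation on teacher-generated data.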
Researchers from RLWRLD and KAIST developed RLDX-1, a general-purpose robotic policy that integrates motion awareness, long-term memory, and physical sensing into a Vision-Language-Action (VLA) model for dexterous manipulation. The system consistently outperformed state-of-the-art VLAs in both simulation and real-world tasks, exhibiting superior performance in dynamic and contact-rich environments.
Researchers at Stanford University developed a generalization theory for deep neural networks that operates in the full feature-learning regime, showing how the empirical Neural Tangent Kernel (eNTK) partitions output space into a signal channel and a test-invisible reservoir. This framework leads to an optimization method that accelerates generalization by suppressing noise, achieving, for instance, a 5-fold speedup in grokking and improving reward accuracy in LLM fine-tuning under noisy preferences.
HEAVYSKILL formalizes a two-phase 'heavy thinking' process as an intrinsic LLM skill, combining parallel reasoning with sequential deliberation. This framework consistently improves performance on complex reasoning tasks, outperforming traditional strategies like Best-of-N and demonstrating that deliberation can synthesize correct solutions not present in individual initial attempts.
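For context, the Best-of-N baseline this framework is compared against simply draws several independent candidates and keeps the highest-scoring one, with no interaction between attempts. A minimal sketch, with `generate` and `score` as hypothetical stand-ins for a sampler and a verifier:

```python
import numpy as np

def best_of_n(generate, score, prompt, n=8, seed=0):
    """Best-of-N baseline: sample n independent candidate answers and
    return the one a scoring function ranks highest.

    Unlike deliberation-style methods, no candidate can borrow partial
    progress from another, so the best answer must already appear whole
    among the n samples.
    """
    rng = np.random.default_rng(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)
```

The summary's point is precisely the limitation visible here: Best-of-N can only select an existing sample, whereas a sequential deliberation phase can combine pieces of several imperfect attempts into a solution none of them contained.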
T2PO (Token- and Turn-level Policy Optimization) introduces an uncertainty-guided exploration control framework for multi-turn agentic reinforcement learning, achieving improved training stability and task performance by adaptively mitigating inefficient token-level thinking and turn-level repetition. The method demonstrated higher success rates and reduced token consumption and interaction turns across diverse interactive environments like WebShop and ALFWorld.
This research introduces FD-loss, a method that directly optimizes Fréchet Distance as a training objective for generative models by decoupling population statistics from batch-level gradient computation. This approach enhances the visual quality of one-step generators and repurposes multi-step models for efficient single-step generation, while also proposing FDr_k, a new multi-representation metric for comprehensive evaluation.
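The quantity being optimized here, the Fréchet distance between two Gaussians fitted to feature statistics, has a closed form. A minimal sketch of the diagonal-covariance special case (where the matrix square root reduces to elementwise square roots); the function name is illustrative, not from the paper:

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2 * sqrt(var1 * var2))."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = ((mu1 - mu2) ** 2).sum()
    cov_term = (var1 + var2 - 2.0 * np.sqrt(var1 * var2)).sum()
    return float(mean_term + cov_term)
```

The difficulty FD-loss reportedly addresses is that these statistics are population-level quantities, so naive per-batch estimates give biased, high-variance gradients; the summary says the method decouples the two rather than changing the distance itself.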
This survey systematically reviews world models in robot learning, offering a robotics-centric definition and categorizing existing approaches by architectural coupling, functional roles, and application domains. It synthesizes current research, identifies key challenges, and outlines future directions for integrating predictive modeling into embodied AI.
Jeremy Levy's textbook "From Qubit to Qubit" introduces a graduate quantum mechanics curriculum by first developing foundational concepts using finite-dimensional spin-1/2 systems (qubits) before progressing to continuous quantum mechanics. This approach aims to provide a more intuitive and unified understanding, integrating quantum information, condensed matter, and atomic physics.
Posterior Augmented Flow Matching (PAFM) reformulates the Flow Matching objective to address sparse supervision in continuous-time generative models, incorporating a multi-target supervision signal during training. This method achieved up to a 3.4-point FID50K improvement on ImageNet-1K and a 0.92-point FID5K improvement on CC12M, while introducing negligible computational overhead.
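The standard Flow Matching objective that PAFM builds on regresses a learned velocity field onto the target velocity of a linear path between a noise sample and a data sample, with a single target per sampled point, which is the sparse supervision the summary refers to. A minimal sketch of that baseline objective (not PAFM's multi-target variant; names are illustrative):

```python
import numpy as np

def cfm_loss(v_theta, x0, x1, t):
    """Conditional flow-matching loss on a linear path:
    x_t = (1 - t) * x0 + t * x1, target velocity u = x1 - x0.

    v_theta: callable (x_t, t) -> predicted velocity, same shape as x_t.
    x0: noise samples (B, D); x1: data samples (B, D); t: times (B,).
    """
    t = t.reshape(-1, 1)            # broadcast time over feature dimension
    x_t = (1.0 - t) * x0 + t * x1   # point on the probability path
    u = x1 - x0                     # the single supervision target per point
    return float(((v_theta(x_t, t) - u) ** 2).mean())
```

Each sampled `x_t` sees exactly one regression target `u`; per the summary, PAFM augments this with multiple supervision targets per training point.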
Researchers from Shanghai AI Laboratory and Shanghai Jiao Tong University developed Persistent Visual Memory (PVM), a module designed to counteract visual signal dilution in autoregressive Large Vision-Language Models (LVLMs) during deep generation tasks. Integrating PVM led to an absolute improvement of 4.8% in average accuracy on eight multimodal benchmarks, reaching 71.5% with a Qwen3-VL-8B-Instruct backbone, and showed a 27.3% relative performance boost for long output sequences.
RecursiveMAS introduces a framework that integrates recursive computation into multi-agent systems, enabling agents to refine collaborative reasoning through iterative latent-space interactions rather than explicit text. This approach leads to average accuracy improvements of up to 20.2% and inference speedups of up to 2.4x compared to text-based recursive multi-agent systems, significantly reducing token usage.
Tsinghua University researchers developed T5 (Transformer To Test-Time Training), a method for linearizing pre-trained Softmax Vision Transformers into linear-complexity architectures with minimal fine-tuning. This approach, which includes architectural and representational alignments, enables near-full performance recovery in classification and generation tasks while significantly accelerating inference, such as a 1.47x speedup for Stable Diffusion at 2048x2048 resolution.
The University of Texas at Austin and independent researchers developed Flow-Anchored Noise-conditioned Q-Learning (FAN), an offline reinforcement learning algorithm that leverages expressive flow policies and distributional critics while significantly improving computational efficiency. This method achieved competitive or superior task performance across D4RL and OGBench benchmarks, exhibiting 5-14 times faster training runtime and competitive inference speed compared to previous distributional approaches.
This empirical study identifies intrinsic task horizon length as a fundamental bottleneck in training large language model agents, demonstrating that longer horizons cause severe training instability and performance collapse. The research shows that applying horizon reduction techniques, such as macro actions and subgoal decomposition, effectively stabilizes training and improves performance, enabling generalization to longer, previously unseen tasks.