The GenLIP framework introduces a minimalist generative pre-training approach for Vision Transformers, enabling them to directly predict language tokens from visual inputs. It achieves competitive or superior performance on 14 diverse multimodal benchmarks with 8 billion pretraining samples, outperforming baselines that use up to 40 billion samples.
MolmoAct2, developed by the Allen Institute for AI and the University of Washington, is a fully open-source action reasoning model designed for real-world robot deployment, enhancing generalizability and efficiency. It incorporates an embodied reasoning VLM backbone and a continuous action expert, achieving up to 87.1% success on real-world DROID tasks with unseen objects and a 2.42x speedup in control rate compared to unoptimized inference.
DeepSeek-AI researchers introduced "Thinking with Visual Primitives," a framework that integrates points and bounding boxes as fundamental units of thought into Multimodal Large Language Models (MLLMs) to address the "Reference Gap" in complex visual reasoning. This approach improves performance on tasks like counting, spatial deduction, and topological navigation while significantly enhancing visual token efficiency.
ByteDance's Mamoda2.5 unifies multimodal understanding, image generation, and video generation/editing within a single Autoregressive–Diffusion framework, leveraging a fine-grained Mixture-of-Experts Diffusion Transformer (DiT-MoE) for computational efficiency. The model demonstrates competitive performance on benchmarks like VBench 2.0 (61.64) and OpenVE-Bench (3.86), while its distilled version achieves up to 95.9 times faster video editing inference.
Model Spec Midtraining (MSM) enhances the generalization of large language models by integrating an intermediate training phase that instills a deep understanding of a Model Spec's principles and values. This approach reduces agentic misalignment in out-of-distribution scenarios and improves the compute efficiency of alignment fine-tuning by up to 60x in low-sample regimes.
This work introduces on-policy distillation, a post-training method for large language models that combines the on-policy relevance of reinforcement learning with dense, token-level feedback from a teacher model. The approach achieved 70% on the AIME'24 mathematical reasoning benchmark with Qwen3-8B and demonstrated a 30x cost reduction compared to off-policy distillation, while also recovering instruction-following abilities in personalized models.
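The core mechanic described here, dense token-level feedback on sequences the student itself generates, can be sketched with a per-token reverse KL between student and teacher distributions. This is a minimal illustration under that assumption, not the paper's implementation; the function name and the numpy setup are hypothetical stand-ins.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def on_policy_distill_loss(student_logits, teacher_logits):
    """Per-token reverse KL(student || teacher), averaged over a
    student-sampled sequence: dense feedback on on-policy tokens.

    Both inputs have shape (T, V): T sampled positions, vocab size V.
    """
    p_s = softmax(student_logits)  # student distributions at each position
    p_t = softmax(teacher_logits)  # teacher distributions at the same positions
    kl = (p_s * (np.log(p_s + 1e-12) - np.log(p_t + 1e-12))).sum(axis=-1)
    return float(kl.mean())
```

In use, the student samples a completion, the teacher scores the same prefix at every position, and the student is updated to minimize this loss, which is what distinguishes the approach from off-policy distillation on teacher-generated data.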
Researchers from RLWRLD and KAIST developed RLDX-1, a general-purpose robotic policy that integrates motion awareness, long-term memory, and physical sensing into a Vision-Language-Action (VLA) model for dexterous manipulation. The system consistently outperformed state-of-the-art VLAs in both simulation and real-world tasks, exhibiting superior performance in dynamic and contact-rich environments.
Researchers at Stanford University developed a generalization theory for deep neural networks that operates in the full feature-learning regime, showing how the empirical Neural Tangent Kernel (eNTK) partitions output space into a signal channel and a test-invisible reservoir. This framework leads to an optimization method that accelerates generalization by suppressing noise, achieving, for instance, a 5-fold speedup in grokking and improving reward accuracy in LLM fine-tuning under noisy preferences.
HEAVYSKILL formalizes a two-phase 'heavy thinking' process as an intrinsic LLM skill, combining parallel reasoning with sequential deliberation. This framework consistently improves performance on complex reasoning tasks, outperforming traditional strategies like Best-of-N and demonstrating that deliberation can synthesize correct solutions not present in individual initial attempts.
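For context, the Best-of-N baseline this framework is compared against simply draws several independent candidates and keeps the highest-scoring one, with no interaction between attempts. A minimal sketch, with `generate` and `score` as hypothetical stand-ins for a sampler and a verifier:

```python
import numpy as np

def best_of_n(generate, score, prompt, n=8, seed=0):
    """Best-of-N baseline: sample n independent candidate answers and
    return the one a scoring function ranks highest.

    Unlike deliberation-style methods, no candidate can borrow partial
    progress from another, so the best answer must already appear whole
    among the n samples.
    """
    rng = np.random.default_rng(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)
```

The summary's point is precisely the limitation visible here: Best-of-N can only select an existing sample, whereas a sequential deliberation phase can combine pieces of several imperfect attempts into a solution none of them contained.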
T2PO (Token- and Turn-level Policy Optimization) introduces an uncertainty-guided exploration control framework for multi-turn agentic reinforcement learning, achieving improved training stability and task performance by adaptively mitigating inefficient token-level thinking and turn-level repetition. The method demonstrated higher success rates and reduced token consumption and interaction turns across diverse interactive environments like WebShop and ALFWorld.
This research introduces FD-loss, a method that directly optimizes Fréchet Distance as a training objective for generative models by decoupling population statistics from batch-level gradient computation. This approach enhances the visual quality of one-step generators and repurposes multi-step models for efficient single-step generation, while also proposing FDr_k, a new multi-representation metric for comprehensive evaluation.
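The quantity being optimized here, the Fréchet distance between two Gaussians fitted to feature statistics, has a closed form. A minimal sketch of the diagonal-covariance special case (where the matrix square root reduces to elementwise square roots); the function name is illustrative, not from the paper:

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2 * sqrt(var1 * var2))."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = ((mu1 - mu2) ** 2).sum()
    cov_term = (var1 + var2 - 2.0 * np.sqrt(var1 * var2)).sum()
    return float(mean_term + cov_term)
```

The difficulty FD-loss reportedly addresses is that these statistics are population-level quantities, so naive per-batch estimates give biased, high-variance gradients; the summary says the method decouples the two rather than changing the distance itself.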
This survey systematically reviews world models in robot learning, offering a robotics-centric definition and categorizing existing approaches by architectural coupling, functional roles, and application domains. It synthesizes current research, identifies key challenges, and outlines future directions for integrating predictive modeling into embodied AI.
Jeremy Levy's textbook "From Qubit to Qubit" introduces a graduate quantum mechanics curriculum by first developing foundational concepts using finite-dimensional spin-1/2 systems (qubits) before progressing to continuous quantum mechanics. This approach aims to provide a more intuitive and unified understanding, integrating quantum information, condensed matter, and atomic physics.
Posterior Augmented Flow Matching (PAFM) reformulates the Flow Matching objective to address sparse supervision in continuous-time generative models, incorporating a multi-target supervision signal during training. This method achieved up to a 3.4-point FID50K improvement on ImageNet-1K and a 0.92-point FID5K improvement on CC12M, while introducing negligible computational overhead.
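The standard Flow Matching objective that PAFM builds on regresses a learned velocity field onto the target velocity of a linear path between a noise sample and a data sample, with a single target per sampled point, which is the sparse supervision the summary refers to. A minimal sketch of that baseline objective (not PAFM's multi-target variant; names are illustrative):

```python
import numpy as np

def cfm_loss(v_theta, x0, x1, t):
    """Conditional flow-matching loss on a linear path:
    x_t = (1 - t) * x0 + t * x1, target velocity u = x1 - x0.

    v_theta: callable (x_t, t) -> predicted velocity, same shape as x_t.
    x0: noise samples (B, D); x1: data samples (B, D); t: times (B,).
    """
    t = t.reshape(-1, 1)            # broadcast time over feature dimension
    x_t = (1.0 - t) * x0 + t * x1   # point on the probability path
    u = x1 - x0                     # the single supervision target per point
    return float(((v_theta(x_t, t) - u) ** 2).mean())
```

Each sampled `x_t` sees exactly one regression target `u`; per the summary, PAFM augments this with multiple supervision targets per training point.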
Researchers from Shanghai AI Laboratory and Shanghai Jiao Tong University developed Persistent Visual Memory (PVM), a module designed to counteract visual signal dilution in autoregressive Large Vision-Language Models (LVLMs) during deep generation tasks. Integrating PVM led to an absolute improvement of 4.8% in average accuracy on eight multimodal benchmarks, reaching 71.5% with a Qwen3-VL-8B-Instruct backbone, and showed a 27.3% relative performance boost for long output sequences.
RecursiveMAS introduces a framework that integrates recursive computation into multi-agent systems, enabling agents to refine collaborative reasoning through iterative latent-space interactions rather than explicit text. This approach leads to average accuracy improvements of up to 20.2% and inference speedups of up to 2.4x compared to text-based recursive multi-agent systems, significantly reducing token usage.
Tsinghua University researchers developed T5 (Transformer To Test-Time Training), a method for linearizing pre-trained Softmax Vision Transformers into linear-complexity architectures with minimal fine-tuning. This approach, which includes architectural and representational alignments, enables near-full performance recovery in classification and generation tasks while significantly accelerating inference, such as a 1.47x speedup for Stable Diffusion at 2048x2048 resolution.
The University of Texas at Austin and independent researchers developed Flow-Anchored Noise-conditioned Q-Learning (FAN), an offline reinforcement learning algorithm that leverages expressive flow policies and distributional critics while significantly improving computational efficiency. This method achieved competitive or superior task performance across D4RL and OGBench benchmarks, exhibiting 5-14 times faster training runtime and competitive inference speed compared to previous distributional approaches.
This empirical study identifies intrinsic task horizon length as a fundamental bottleneck in training large language model agents, demonstrating that longer horizons cause severe training instability and performance collapse. The research shows that applying horizon reduction techniques, such as macro actions and subgoal decomposition, effectively stabilizes training and improves performance, enabling generalization to longer, previously unseen tasks.