Table of Contents
- AGENTIC REINFORCED POLICY OPTIMIZATION
- KIMI-K2
- Flow Matching Policy Gradients
- Geometric-Mean Policy Optimization
- ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
- DriveAgent-R1: Advancing VLM-based Autonomous Driving with Hybrid Thinking and Active Perception
- Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
- Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
- Self-Guided Masked Autoencoder
- TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
AGENTIC REINFORCED POLICY OPTIMIZATION
The paper “Agentic Reinforced Policy Optimization (ARPO)” introduces a novel reinforcement learning (RL) algorithm designed to enhance the performance and efficiency of Large Language Models (LLMs) in multi-turn, tool-augmented reasoning tasks. Existing RL methods for LLMs often fall short in balancing intrinsic long-horizon reasoning with proficiency in multi-turn tool interactions, primarily due to their trajectory-level sampling approaches.
ARPO addresses this gap by recognizing that LLMs exhibit high token entropy (i.e., uncertainty) immediately after interacting with external tools. Leveraging this insight, ARPO incorporates an entropy-based adaptive rollout mechanism that dynamically balances global trajectory sampling with step-level sampling. This promotes exploration at points of high uncertainty following tool usage. Furthermore, an advantage attribution estimation mechanism is integrated, allowing LLMs to internalize advantage differences in stepwise tool-use interactions.
Experimental results across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO’s significant superiority over traditional trajectory-level RL algorithms. Crucially, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, presenting a scalable and efficient solution for aligning LLM-based agents with dynamic real-time environments.
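To make the rollout idea concrete, here is a minimal Python sketch of an entropy-gated branching rollout. The interfaces (`generate_step` returning the next segment plus its next-token probabilities, `call_tool` executing a tool call) and the threshold and branching values are illustrative assumptions, not APIs or hyperparameters from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token probability distribution."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def adaptive_rollout(generate_step, call_tool, prompt, max_turns=8,
                     entropy_threshold=2.0, branch_factor=2):
    """Entropy-gated branching rollout (illustrative sketch, not ARPO's implementation).

    Starts from a single trajectory and, whenever the model is unusually uncertain
    right after a tool response, forks extra partial rollouts from that state.
    """
    trajectories = [prompt]
    for _ in range(max_turns):
        expanded = []
        for traj in trajectories:
            segment, next_token_probs = generate_step(traj)   # assumed LLM interface
            traj = traj + segment
            if "</tool_call>" in segment:                     # assumed tool-call format
                traj = traj + call_tool(segment)              # append the tool response
                if token_entropy(next_token_probs) > entropy_threshold:
                    # High post-tool entropy: branch to explore alternative continuations.
                    expanded.extend([traj] * branch_factor)
                    continue
            expanded.append(traj)
        trajectories = expanded
    return trajectories
```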
KIMI-K2
The source introduces Kimi K2, a large language model designed for agentic intelligence, emphasizing its ability to autonomously learn and interact. The paper details the MuonClip optimizer which ensures stable pre-training on a massive dataset of 15.5 trillion tokens, highlighting its efficiency and stability. Post-training involves large-scale agentic data synthesis for tool use and a reinforcement learning framework utilizing both verifiable rewards and self-critique. The document also presents extensive evaluation results showcasing Kimi K2’s state-of-the-art performance in areas like coding, mathematics, reasoning, and tool use, often surpassing other open-source and proprietary models. Finally, it outlines the model’s architecture, training infrastructure, and safety evaluations, noting its limitations and future research directions.
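The report credits MuonClip with keeping pre-training stable; one ingredient described for it is clipping runaway attention logits by rescaling the query and key projections after an update. The sketch below shows one way such a rescaling step could look; the threshold value, the square-root split of the shrinkage, and the per-head bookkeeping are assumptions on my part, not the report's exact rule.

```python
import torch

@torch.no_grad()
def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor, max_logit: float, tau: float = 100.0) -> None:
    """Sketch of a QK-clip style rescaling: if the largest attention logit observed for a
    head exceeds tau, shrink that head's query and key projection weights in place so
    future logits stay bounded."""
    if max_logit > tau:
        scale = (tau / max_logit) ** 0.5   # split the correction evenly between Q and K
        w_q.mul_(scale)
        w_k.mul_(scale)
```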
Flow Matching Policy Gradients
Flow Policy Optimization (FPO) is a novel, on-policy reinforcement learning (RL) algorithm designed to train flow-based generative models, including diffusion models, within the policy gradient framework. FPO addresses key limitations of prior approaches by reformulating policy optimization as maximizing an advantage-weighted ratio derived from the conditional flow matching (CFM) loss. This method sidesteps the need for computationally expensive exact likelihood calculations, a common hurdle for flow-based models in RL. FPO is sampler-agnostic, meaning it is compatible with various diffusion or flow integration methods at both training and inference times, unlike previous diffusion-based RL techniques that bind training to specific sampling procedures. Empirical validation across diverse continuous control tasks, including GridWorld, MuJoCo Playground, and high-dimensional humanoid control, demonstrates that FPO can effectively train diffusion-style policies from scratch. Notably, FPO-trained policies can capture multimodal action distributions and achieve superior performance compared to traditional Gaussian policies, especially in under-conditioned scenarios.
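To illustrate the advantage-weighted CFM ratio, here is a compact sketch that assumes a linear interpolation path and a single (t, noise) draw per sample shared between the old and new policies; the paper's estimator and clipping details may differ, and `v_net(obs, x_t, t)` is an assumed velocity-network interface.

```python
import torch

def cfm_loss(v_net, obs, action, t, noise):
    """Per-sample conditional flow matching loss on a linear interpolation path."""
    x_t = (1.0 - t) * noise + t * action            # point along the probability path
    target_v = action - noise                       # velocity of the linear path
    pred_v = v_net(obs, x_t, t)
    return ((pred_v - target_v) ** 2).mean(dim=-1)  # one scalar per sample

def fpo_surrogate(v_net, v_net_old, obs, action, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate where the likelihood ratio is replaced by a ratio
    built from CFM losses (a sketch of the idea, not the authors' exact estimator)."""
    t = torch.rand(action.shape[0], 1)
    noise = torch.randn_like(action)
    with torch.no_grad():
        loss_old = cfm_loss(v_net_old, obs, action, t, noise)
    loss_new = cfm_loss(v_net, obs, action, t, noise)
    ratio = torch.exp(loss_old - loss_new)          # larger when the new policy fits better
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```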
On-policy methods learn about the same policy that is used to generate the data: think of it as learning by doing, using only experiences produced by your current way of acting. Off-policy methods, conversely, learn about a different policy than the one generating the data, which lets them learn from past experiences (even from an older or different behavior policy) and potentially from observing others.
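A toy contrast of the two regimes, with `collect_rollouts`, `policy_gradient_loss`, and `q_learning_loss` standing in as assumed helpers:

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # only the off-policy learner may reuse old data

def on_policy_update(policy, env, optimizer):
    """On-policy: collect fresh trajectories with the CURRENT policy, update, then discard."""
    batch = collect_rollouts(policy, env)          # assumed helper: roll out the current policy
    loss = policy_gradient_loss(policy, batch)     # assumed helper: e.g. a REINFORCE/PPO loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def off_policy_update(policy, optimizer, batch_size=256):
    """Off-policy: learn from transitions produced by older or different behavior policies."""
    batch = random.sample(list(replay_buffer), batch_size)
    loss = q_learning_loss(policy, batch)          # assumed helper: e.g. a TD / Q-learning loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```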
Geometric-Mean Policy Optimization
This document introduces Geometric-Mean Policy Optimization (GMPO), a new approach designed to enhance the stability and performance of large language models (LLMs) during reinforcement learning, particularly for reasoning tasks. It addresses issues found in previous methods like Group Relative Policy Optimization (GRPO), which often suffer from unstable policy updates due to sensitivity to outlier rewards. GMPO achieves this by optimizing the geometric mean of token-level rewards, which inherently handles outliers more effectively, leading to more stable training and improved exploration capabilities. The paper provides theoretical justifications and experimental results, showcasing GMPO’s superior performance on various mathematical and multimodal reasoning benchmarks compared to existing methods.
Geometric-Mean Policy Optimization (GMPO) enhances large language model fine-tuning by replacing the arithmetic mean with a geometric mean in the policy optimization objective, which stabilizes training and improves exploration. GMPO leads to up to 4.1% higher Pass@1 accuracy on mathematical benchmarks and 1.4% on multimodal tasks compared to its predecessor, GRPO.
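A minimal sketch of the arithmetic-versus-geometric-mean swap at the heart of GMPO. It omits the full PPO-style min/clip structure and the group-relative advantage computation, and the log-space clip bounds are illustrative values rather than the paper's.

```python
import torch

def grpo_surrogate(logp_new, logp_old, advantage):
    """GRPO-style term (to be maximized): ARITHMETIC mean of per-token importance ratios.
    A single outlier ratio can dominate this average and destabilize the update."""
    ratios = torch.exp(logp_new - logp_old)          # [T] one ratio per token
    return (ratios * advantage).mean()

def gmpo_surrogate(logp_new, logp_old, advantage, clip_low=-0.4, clip_high=0.4):
    """GMPO-style term (to be maximized): GEOMETRIC mean of the per-token ratios, i.e. the
    exponential of the average log-ratio. Outlier tokens enter additively in log space,
    so no single token can blow up the objective."""
    log_ratios = torch.clamp(logp_new - logp_old, clip_low, clip_high)
    return torch.exp(log_ratios.mean()) * advantage
```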
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
The document introduces ARC-Hunyuan-Video-7B, a novel multimodal model designed for structured comprehension of real-world short videos, particularly those from platforms like WeChat Channel and TikTok. Unlike previous models, it processes visual, audio, and textual signals end-to-end to address challenges posed by the fast-paced, information-dense nature of user-generated content. The model excels at multi-granularity timestamped video captioning, summarization, open-ended question answering, temporal video grounding, and video reasoning. Its development involved a comprehensive training regimen, including pre-training, instruction fine-tuning, cold start, and reinforcement learning, leveraging a high-quality, automatically annotated dataset. The paper presents qualitative and quantitative evaluations demonstrating the model’s superior performance in understanding the chronological flow, thematic nuances, and creative intent of videos, with real-world deployment showing improved user engagement.
DriveAgent-R1: Advancing VLM-based Autonomous Driving with Hybrid Thinking and Active Perception
The research introduces DriveAgent-R1, an advanced autonomous driving agent designed to address the limitations of current Vision-Language Models (VLMs) in complex driving scenarios. It features a Hybrid-Thinking framework that adaptively switches between efficient text-based reasoning and in-depth, tool-based reasoning for enhanced decision-making. The agent also incorporates an Active Perception mechanism with a Vision Toolkit to proactively gather crucial visual information and resolve uncertainties, mirroring human driver behavior. A novel three-stage progressive reinforcement learning strategy trains the agent to master these capabilities, enabling it to achieve state-of-the-art performance by grounding its decisions in actively perceived visual evidence. This approach aims to create safer and more intelligent autonomous systems by balancing efficiency with reliability.
Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
This academic paper details a large-scale public red-teaming competition designed to evaluate the security vulnerabilities of AI agents powered by Large Language Models (LLMs). The study involved 22 frontier AI agents across 44 realistic deployment scenarios, where participants submitted 1.8 million prompt-injection attacks, with over 60,000 successfully causing policy violations such as unauthorized data access or illicit financial actions. The researchers developed the Agent Red Teaming (ART) benchmark from these attacks, demonstrating that nearly all agents exhibit policy violations for most behaviors within 10–100 queries due to high attack transferability and universality across models and tasks. A crucial finding is the lack of correlation between an agent’s robustness and its model size, capability, or inference-time compute, emphasizing the urgent need for new defense mechanisms. The paper concludes by releasing the ART benchmark to support more rigorous security assessments and drive progress toward safer AI agent deployment.
Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
This paper introduces Rep-MTL, a novel approach to multi-task learning (MTL) that aims to enhance performance by addressing negative transfer and promoting inter-task complementarity directly within the shared representation space. Unlike conventional multi-task optimization (MTO) techniques that primarily focus on optimizer-centric loss scaling and gradient manipulation, Rep-MTL utilizes representation-level task saliency. This method, through its Task-specific Saliency Regulation (TSR) and Cross-task Saliency Alignment (CSA) modules, preserves individual task learning patterns and facilitates beneficial information sharing without altering optimizers or network architectures. Empirical results across various benchmarks demonstrate Rep-MTL’s consistent performance gains and efficiency, even when paired with basic weighting policies.
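The summary above stays high level, so the following is only a toy illustration of what "representation-level task saliency" can mean in code: the gradient of each task's loss with respect to the shared representation, plus a cross-task cosine term that nudges those saliencies to agree. The actual TSR and CSA formulations in Rep-MTL differ; treat this purely as an illustration of regularizing the shared representation rather than manipulating optimizer gradients.

```python
import torch
import torch.nn.functional as F

def task_saliency(task_loss, shared_repr):
    """Representation-level saliency: sensitivity of one task's loss to the shared features."""
    return torch.autograd.grad(task_loss, shared_repr,
                               retain_graph=True, create_graph=True)[0]

def cross_task_alignment_penalty(task_losses, shared_repr):
    """Toy regularizer in the spirit of cross-task saliency alignment: encourage the
    saliency maps of different tasks to point in similar directions."""
    sals = [task_saliency(loss, shared_repr).flatten(1) for loss in task_losses]
    penalty = 0.0
    for i in range(len(sals)):
        for j in range(i + 1, len(sals)):
            penalty = penalty + (1.0 - F.cosine_similarity(sals[i], sals[j], dim=1).mean())
    return penalty
```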
Self-Guided Masked Autoencoder
This collection of sources centers on an in-depth analysis and proposed improvement for Masked Autoencoders (MAE), a self-supervised learning approach used in computer vision. The authors uncover that MAE inherently learns pattern-based patch clustering from early stages of pre-training. Building on this understanding, they introduce a “self-guided masked autoencoder” that generates informed masks internally by leveraging its progress in patch clustering, unlike the original MAE’s random masking. This novel approach significantly boosts MAE’s learning process without requiring external models or additional information, a benefit verified through comprehensive experiments on various downstream tasks like image classification, object detection, and semantic segmentation.
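As a toy illustration of "informed masking from the model's own patch clustering" (not the paper's algorithm), the sketch below bipartitions one image's patch embeddings with a tiny two-way k-means and spends the mask budget preferentially on one cluster; the cluster choice, mask ratio, and iteration count are assumptions.

```python
import torch

def informed_mask(patch_feats: torch.Tensor, mask_ratio: float = 0.75, iters: int = 10):
    """Toy informed-masking sketch. patch_feats is [N, D], the model's own embeddings of
    the N patches of one image; returns the indices of patches to mask."""
    patch_feats = patch_feats.detach()
    n = patch_feats.shape[0]
    centroids = patch_feats[torch.randperm(n)[:2]].clone()     # 2-way k-means init
    for _ in range(iters):
        assign = torch.cdist(patch_feats, centroids).argmin(dim=1)
        for k in range(2):
            if (assign == k).any():
                centroids[k] = patch_feats[assign == k].mean(dim=0)
    # Spend the mask budget on one cluster first, then top up from the other.
    cluster_a = (assign == 0).nonzero(as_tuple=True)[0]
    cluster_b = (assign == 1).nonzero(as_tuple=True)[0]
    order = torch.cat([cluster_a[torch.randperm(len(cluster_a))],
                       cluster_b[torch.randperm(len(cluster_b))]])
    return order[: int(mask_ratio * n)]
```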
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
TransPrune introduces a method for Large Vision-Language Models that prunes less important visual tokens based on their representation transitions within transformer layers. This approach reduces inference TFLOPs of LLaVA-v1.5-7B by nearly 60% without performance degradation, and for LLaVA-Next-7B by 60% with minimal accuracy loss, addressing limitations of prior attention-based pruning.
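The idea of scoring tokens by their representation transitions can be sketched as follows; combining magnitude and angular change with equal weight and the keep ratio below are illustrative choices, not TransPrune's exact criterion.

```python
import torch
import torch.nn.functional as F

def token_transition_scores(h_prev, h_curr):
    """Score tokens by how much their hidden states change across a transformer layer,
    mixing magnitude change and directional (cosine) change.
    h_prev, h_curr: [N_tokens, D] hidden states before/after the layer."""
    magnitude = (h_curr - h_prev).norm(dim=-1)
    direction = 1.0 - F.cosine_similarity(h_curr, h_prev, dim=-1)
    return magnitude / (magnitude.max() + 1e-6) + direction

def prune_visual_tokens(hidden, h_prev, visual_idx, keep_ratio=0.4):
    """Keep only the highest-scoring visual tokens; text tokens are left untouched."""
    scores = token_transition_scores(h_prev[visual_idx], hidden[visual_idx])
    n_keep = max(1, int(keep_ratio * len(visual_idx)))
    kept_visual = visual_idx[scores.topk(n_keep).indices]
    keep_mask = torch.ones(hidden.shape[0], dtype=torch.bool)
    keep_mask[visual_idx] = False       # drop all visual tokens ...
    keep_mask[kept_visual] = True       # ... except the top-scoring ones
    return hidden[keep_mask], keep_mask
```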