
Daily Paper | Aug 3, 2025


  1. Table of Contents
  2. SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model
  3. Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
  4. Phi-Ground Tech Report: Advancing Perception in GUI Grounding
  5. FairReason: Balancing Reasoning and Social Bias in MLLMs
  6. DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model
  7. TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs
  8. AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning
  9. Policy Learning from Large Vision-Language Model Feedback Without Reward Modeling
  10. First Return, Entropy-Eliciting Explore

Table of Contents

SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model

Github Link: https://github.com/maitrix-org/llm-reasoners/tree/main/examples/ReasonerAgent-Web
Paper Link: https://arxiv.org/abs/2507.23773

AI agents built on large language models (LLMs) hold enormous promise, but current practice focuses on a one-task-one-agent approach, which not only falls short in scalability and generality but also suffers from the fundamental limitations of autoregressive LLMs. Humans, by contrast, are general agents who reason by mentally simulating the outcomes of their actions and plans. Moving towards a more general and powerful AI agent, we introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based on a principled formulation of an optimal agent in any environment, SimuRA overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. The generalized world model is implemented with an LLM, which can flexibly plan in a wide range of environments using the concept-rich latent space of natural language. Experiments on difficult web-browsing tasks show that SimuRA improves the success rate of flight search from 0% to 32.2%. World-model-based planning, in particular, shows a consistent advantage of up to 124% over autoregressive planning, demonstrating the value of world-model simulation as a reasoning paradigm. We are excited about the possibility of training a single, general agent model based on LLMs that can act superintelligently across environments. To start, we make ReasonerAgent-Web, a web-browsing agent built on SimuRA with pretrained LLMs, available as a research demo for public testing.
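
The core loop is reasoning by simulation: propose candidate actions, imagine their outcomes with an LLM-based world model, and pick the most promising one. A minimal sketch of that loop, where llm(), the prompt wording, and the scoring scheme are placeholders rather than the authors' implementation:

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to a pretrained LLM (e.g. an API client)."""
    raise NotImplementedError


def plan_step(goal: str, state: str, n_candidates: int = 4) -> str:
    """Choose the next action by simulating candidate actions with an LLM world model."""
    # 1. Propose candidate actions conditioned on the goal and current observation.
    candidates = [
        llm(f"Goal: {goal}\nState: {state}\nPropose one next action (variant {i}):")
        for i in range(n_candidates)
    ]
    # 2. World model: imagine the state that would result from each action.
    imagined = [
        llm(f"State: {state}\nAction: {a}\nDescribe the resulting state:")
        for a in candidates
    ]
    # 3. Score each imagined outcome against the goal and return the best action.
    scores = [
        float(llm(f"Goal: {goal}\nPredicted outcome: {s}\nRate progress from 0 to 10:"))
        for s in imagined
    ]
    return candidates[scores.index(max(scores))]
```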

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Github Link: https://github.com/raymin0223/mixture_of_recursions
Paper Link: https://www.alphaxiv.org/abs/2507.10524

The Mixture-of-Recursions (MoR) framework unifies parameter sharing and adaptive computation within Recursive Transformer architectures, allowing models to dynamically apply shared layers to individual tokens. MoR achieves competitive performance with vanilla Transformers while using 50% fewer parameters, 25% fewer training FLOPs, and boosting inference throughput by up to 2.06x.
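
The key mechanism is a router that decides, per token, how many times a shared block of layers is reapplied. A minimal PyTorch sketch of that idea; the layer sizes, the hard argmax routing rule, and the masked-update scheme are assumptions for illustration, not the paper's exact design (a trainable version would also need a differentiable or auxiliary routing objective):

```python
import torch
import torch.nn as nn


class MoRBlock(nn.Module):
    """One shared Transformer layer applied a token-dependent number of times."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, max_recursions: int = 3):
        super().__init__()
        # A single shared layer reused at every recursion depth (parameter sharing).
        self.shared_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Lightweight router: its argmax picks each token's recursion depth.
        self.router = nn.Linear(d_model, max_recursions)
        self.max_recursions = max_recursions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        depth = self.router(x).argmax(dim=-1) + 1            # (batch, seq), values 1..R
        h = x
        for r in range(1, self.max_recursions + 1):
            update = self.shared_layer(h)
            # Only tokens whose assigned depth reaches this recursion get refined.
            active = (depth >= r).unsqueeze(-1).to(h.dtype)  # (batch, seq, 1)
            h = active * update + (1 - active) * h
        return h


x = torch.randn(2, 16, 256)
print(MoRBlock()(x).shape)  # torch.Size([2, 16, 256])
```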

Phi-Ground Tech Report: Advancing Perception in GUI Grounding

Project Link: https://zhangmiaosen2000.github.io/Phi-Ground/
Paper Link: https://www.alphaxiv.org/abs/2507.23779

The Phi-Ground model family from Microsoft introduces an efficient and high-performing approach to Graphical User Interface (GUI) grounding, achieving state-of-the-art accuracy on five challenging benchmarks for models under 10B parameters by systematically optimizing data, training, and architectural considerations. This work explores practical aspects such as the impact of image tokens and the unexpected effectiveness of Direct Preference Optimization (DPO) for perceptual tasks.
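
One of the report's observations is that Direct Preference Optimization helps even on this perceptual task. For reference, a minimal sketch of the standard DPO loss on (preferred, rejected) grounding responses, with the beta value and the per-sequence log-probabilities as assumed inputs:

```python
import torch
import torch.nn.functional as F


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss given sequence log-probs under the policy and a frozen reference."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # implicit reward of the preferred output
    rejected_ratio = logp_rejected - ref_logp_rejected  # implicit reward of the rejected output
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()


# Toy usage with made-up per-sequence log-probabilities.
lp_c, lp_r = torch.tensor([-5.0]), torch.tensor([-7.0])
ref_c, ref_r = torch.tensor([-5.5]), torch.tensor([-6.5])
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```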

FairReason: Balancing Reasoning and Social Bias in MLLMs

Github Link: https://github.com/Yutongzhang20080108/FairReason-Balancing-Reasoning-and-Social-Bias-in-MLLMs
Paper Link: https://www.alphaxiv.org/abs/2507.23067

Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities. To push their reasoning ability further, recent studies explore advanced prompting schemes and post-training fine-tuning. Although these techniques improve logical accuracy, they frequently leave the models' outputs burdened with pronounced social biases. Clarifying how reasoning gains interact with bias mitigation, and whether the two objectives inherently trade off, therefore remains an open and pressing research problem. Our study begins by benchmarking three bias-mitigation strategies—supervised fine-tuning (SFT), knowledge distillation (KD), and rule-based reinforcement learning (RL)—under identical conditions, establishing their baseline strengths and weaknesses. Building on these results, we vary the proportion of debias-focused and reasoning-centric samples within each paradigm to chart the reasoning-versus-bias trade-off. Our sweeps reveal a consistent sweet spot: a roughly 1:4 mix trained with reinforcement learning cuts stereotype scores by 10% while retaining 88% of the model's original reasoning accuracy, offering concrete guidance for balancing fairness and capability in MLLMs.
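
The reported sweet spot is roughly one debias-focused sample for every four reasoning-centric samples in the RL training mix. A small sketch of how such a mixture could be assembled; the sampling logic is an illustration, not the authors' pipeline:

```python
import random


def build_training_mix(debias_pool, reasoning_pool, total, debias_ratio=0.2, seed=0):
    """Sample a training set with ~1 debias example for every 4 reasoning examples."""
    rng = random.Random(seed)
    n_debias = int(total * debias_ratio)
    mix = rng.sample(debias_pool, n_debias) + rng.sample(reasoning_pool, total - n_debias)
    rng.shuffle(mix)
    return mix


debias = [f"debias_{i}" for i in range(100)]        # placeholder debias-focused samples
reasoning = [f"reason_{i}" for i in range(400)]     # placeholder reasoning-centric samples
batch = build_training_mix(debias, reasoning, total=50)
print(sum(x.startswith("debias") for x in batch), "debias samples out of", len(batch))
```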

DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model

affiliation: KAIST

Learning-based monocular visual odometry (VO) faces robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system leveraging the DINOv2 visual foundation model for its sparse feature matching. To address the integration challenge, we propose a salient keypoint detector tailored to DINOv2's coarse features. Furthermore, we complement DINOv2's robust semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and a differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior detector-descriptor networks like SuperPoint, DINO-VO demonstrates greater robustness in challenging environments. Furthermore, we show superior accuracy and generalization of the proposed feature descriptors against standalone DINOv2 coarse features. DINO-VO outperforms prior frame-to-frame VO methods on the TartanAir and KITTI datasets and is competitive on the EuRoC dataset, while running efficiently at 72 FPS with less than 1GB of memory usage on a single GPU. Moreover, it performs competitively against Visual SLAM systems on outdoor driving scenarios, showcasing its generalization capabilities.
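
The pipeline runs in stages: DINOv2 patch features, salient keypoint selection, matching, and differentiable pose estimation. A sketch of the first two stages; only the DINOv2 torch.hub entry point is a real API here, and the norm-based saliency rule is a stand-in for the paper's learned detector:

```python
import torch

# Downloads DINOv2 ViT-S/14 weights on first use.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()


@torch.no_grad()
def coarse_features(img: torch.Tensor) -> torch.Tensor:
    """img: (1, 3, H, W) with H and W multiples of 14 -> patch features (N, C)."""
    return dinov2.forward_features(img)["x_norm_patchtokens"][0]


def salient_keypoints(feats: torch.Tensor, k: int = 64) -> torch.Tensor:
    """Toy saliency: keep the k patch locations with the largest feature norm."""
    return feats.norm(dim=-1).topk(k).indices


img0 = torch.randn(1, 3, 224, 224)  # stand-in for a normalized video frame
kps = salient_keypoints(coarse_features(img0))
print(kps.shape)  # torch.Size([64]); downstream: transformer matcher + differentiable pose solver
```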

TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs

Project Link: https://kejiazhang-robust.github.io/tars_web/
Paper Link: https://www.alphaxiv.org/abs/2507.21584

Multimodal large language models (MLLMs) enable vision-language reasoning, yet often generate plausible outputs that are factually incorrect or visually ungrounded, thereby compromising their reliability. Direct preference optimization (DPO) is a common strategy for correcting hallucinations by aligning model outputs with human preferences. Existing DPO strategies typically treat hallucination-related preferences as fixed targets, relying on static supervision signals during training. This approach tends to overfit to superficial linguistic cues in preference data, leading to distributional rigidity and spurious correlations that impair grounding in causally relevant visual information. To overcome this limitation, we propose TARS, a token-adaptive preference strategy that reformulates DPO as a min-max optimization problem. TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty, and simultaneously minimizes the expected preference loss under these controlled perturbations. This joint objective preserves causal grounding while mitigating overfitting to preference patterns, thereby reducing hallucinations in multimodal reasoning. We evaluate TARS on multiple hallucination benchmarks and find consistently strong performance. Using only 4.8k preference samples and no expert feedback, TARS reduces hallucination rates from 26.4% to 13.2% and decreases cognition value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.
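
The min-max reformulation can be read as adversarial training on the preference inputs: an inner step finds a bounded, worst-case shift of the token representations, and the outer step minimizes the DPO loss under that shift. A schematic sketch under those assumptions (single inner step, perturbation applied to input embeddings, sign-gradient ascent), not the authors' exact constraint set:

```python
import torch
import torch.nn.functional as F


def dpo_loss(logp_c, logp_r, ref_c, ref_r, beta=0.1):
    """Standard DPO loss on per-sequence log-probs (same form as the earlier sketch)."""
    return -F.logsigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))).mean()


def minmax_dpo_step(policy_logp, emb_c, emb_r, ref_c, ref_r, eps=0.01):
    """One outer step: find a bounded worst-case shift of token embeddings (inner max),
    then return the DPO loss under that shift (outer min, to be backpropagated)."""
    delta_c = torch.zeros_like(emb_c, requires_grad=True)
    delta_r = torch.zeros_like(emb_r, requires_grad=True)
    # Inner maximization: one ascent step on the loss w.r.t. the perturbations.
    inner = dpo_loss(policy_logp(emb_c + delta_c), policy_logp(emb_r + delta_r), ref_c, ref_r)
    g_c, g_r = torch.autograd.grad(inner, [delta_c, delta_r])
    delta_c, delta_r = eps * g_c.sign(), eps * g_r.sign()   # bounded perturbation
    # Outer minimization: DPO loss under the controlled perturbation.
    return dpo_loss(policy_logp(emb_c + delta_c), policy_logp(emb_r + delta_r), ref_c, ref_r)


# Toy check: a linear scorer stands in for the policy's per-sequence log-prob.
w = torch.randn(8, requires_grad=True)
logp = lambda e: e @ w
loss = minmax_dpo_step(logp, torch.randn(4, 8), torch.randn(4, 8), torch.zeros(4), torch.zeros(4))
loss.backward()
print(loss.item())
```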

AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning

Github Link: https://github.com/weiyifan1023/AutoTIR
Paper Link: https://www.alphaxiv.org/abs/2507.21836

Large Language Models (LLMs), when enhanced through reasoning-oriented post-training, evolve into powerful Large Reasoning Models (LRMs). Tool-Integrated Reasoning (TIR) further extends their capabilities by incorporating external tools, but existing methods often rely on rigid, predefined tool-use patterns that risk degrading core language competence. Inspired by the human ability to adaptively select tools, we introduce AutoTIR, a reinforcement learning framework that enables LLMs to autonomously decide whether and which tool to invoke during the reasoning process, rather than following static tool-use strategies. AutoTIR leverages a hybrid reward mechanism that jointly optimizes for task-specific answer correctness, structured output adherence, and penalization of incorrect tool usage, thereby encouraging both precise reasoning and efficient tool integration. Extensive evaluations across diverse knowledge-intensive, mathematical, and general language modeling tasks demonstrate that AutoTIR achieves superior overall performance, significantly outperforming baselines and exhibiting superior generalization in tool-use behavior. These results highlight the promise of reinforcement learning in building truly generalizable and scalable TIR capabilities in LLMs. The code and data are available at https://github.com/weiyifan1023/AutoTIR.
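
The hybrid reward combines three signals: answer correctness, structured-output adherence, and a penalty for incorrect tool usage. A toy sketch of such a reward function; the tags, weights, and checking rules are assumptions for illustration:

```python
import re


def hybrid_reward(response: str, gold_answer: str, tool_needed: bool) -> float:
    reward = 0.0
    # 1. Task-specific correctness of the final answer.
    answer = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if answer and answer.group(1).strip() == gold_answer:
        reward += 1.0
    # 2. Structured-output adherence (answer tags present exactly once).
    if response.count("<answer>") == 1 and response.count("</answer>") == 1:
        reward += 0.2
    # 3. Penalize tool calls when the task does not need a tool (and vice versa).
    used_tool = "<tool_call>" in response
    if used_tool != tool_needed:
        reward -= 0.5
    return reward


print(hybrid_reward("<answer>42</answer>", "42", tool_needed=False))  # 1.2
```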

Policy Learning from Large Vision-Language Model Feedback Without Reward Modeling

affiliation: KAIST

Offline reinforcement learning (RL) provides a powerful framework for training robotic agents using precollected, suboptimal datasets, eliminating the need for costly, time-consuming, and potentially hazardous online interactions. This is particularly useful in safety-critical real-world applications, where online data collection is expensive and impractical. However, existing offline RL algorithms typically require reward-labeled data, which introduces an additional bottleneck: reward function design is itself costly, labor-intensive, and requires significant domain expertise. In this paper, we introduce PLARE, a novel approach that leverages large vision-language models (VLMs) to provide guidance signals for agent training. Instead of relying on manually designed reward functions, PLARE queries a VLM for preference labels on pairs of visual trajectory segments based on a language task description. The policy is then trained directly from these preference labels using a supervised contrastive preference learning objective, bypassing the need to learn explicit reward models. Through extensive experiments on robotic manipulation tasks from the MetaWorld benchmark, PLARE achieves performance on par with or surpassing existing state-of-the-art VLM-based reward generation methods. Furthermore, we demonstrate the effectiveness of PLARE in real-world manipulation tasks with a physical robot, further validating its practical applicability.
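
The recipe is: query a VLM for a preference between two trajectory segments given the task description, then train the policy directly on those labels with a contrastive preference objective instead of fitting a reward model. A minimal sketch under those assumptions; the VLM query is a placeholder and the loss shown is a generic CPL-style objective, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F


def query_vlm_preference(task: str, segment_a, segment_b) -> int:
    """Placeholder for a VLM call returning 0 if segment_a is preferred, else 1."""
    raise NotImplementedError


def contrastive_preference_loss(logp_preferred: torch.Tensor,
                                logp_rejected: torch.Tensor,
                                alpha: float = 0.1) -> torch.Tensor:
    """Contrastive, CPL-style preference loss on summed per-segment action log-probs."""
    # Each input is (batch, T): log pi(a_t | s_t) along the segment.
    s_pref = alpha * logp_preferred.sum(dim=1)
    s_rej = alpha * logp_rejected.sum(dim=1)
    # Push the VLM-preferred segment's score above the rejected one's.
    return -F.logsigmoid(s_pref - s_rej).mean()


lp_pref, lp_rej = torch.randn(8, 20), torch.randn(8, 20)  # toy action log-probs
print(contrastive_preference_loss(lp_pref, lp_rej))
```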

First Return, Entropy-Eliciting Explore

Project Link: https://huggingface.co/FR3E-Bytedance
Paper Link: https://www.alphaxiv.org/abs/2507.07017

ByteDance researchers developed First Return, Entropy-Eliciting Explore (FR3E), a value-model-free reinforcement learning framework that enhances LLM reasoning by providing semantically grounded intermediate feedback. It identifies high-uncertainty decision points in reasoning paths using token-level entropy and conducts targeted partial rollouts, leading to more stable training and improved performance on mathematical reasoning benchmarks, with an average accuracy increase of over 3% on Qwen2.5 models.
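
The exploration signal comes from token-level entropy: positions where the model is most uncertain become anchors for targeted partial rollouts. A small sketch of that selection step; the top-k rule and the downstream rollout logic are assumptions:

```python
import torch


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: (T, vocab) -> per-position entropy (T,)."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)


def pick_branch_points(logits: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Return the k positions where the model is most uncertain, in order."""
    return token_entropy(logits).topk(k).indices.sort().values


logits = torch.randn(128, 32000)  # one rollout's per-token logits (toy values)
anchors = pick_branch_points(logits)
# From each anchor, one would resample continuations (partial rollouts) and use
# their outcomes as intermediate feedback for the RL update.
print(anchors)
```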
