
Daily Paper | Aug 1, 2025


  1. Table of Contents
  2. RecGPT Technical Report
  3. RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
  4. Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
  5. Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning
  6. CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
  7. UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
  8. C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations
  9. ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
  10. I Am Big, You Are Little; I Am Right, You Are Wrong
  11. AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data

Table of Contents

RecGPT Technical Report

Author: Alibaba Taobao Team

RecGPT, developed by Taobao, integrates large language models into its recommender system to enable intent-centered personalization. This framework, fully deployed on the Taobao App, increased click-through rate by 6.33%, dwell time by 4.82%, and user-clicked item category diversity by 6.96%, while also mitigating the Matthew effect for merchants.
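To make "intent-centered personalization" concrete, here is a minimal sketch of the general idea: an LLM summarizes recent behavior into an explicit intent, which then drives candidate retrieval. The function names, prompt, and retrieval interface are hypothetical illustrations, not taken from the RecGPT report.

```python
# Hypothetical sketch of intent-centered recommendation with an LLM.
# The prompt format and the retrieval step are illustrative assumptions.

def infer_user_intent(llm, behavior_log: list[str]) -> str:
    """Ask an LLM to summarize recent behavior into a short intent phrase."""
    prompt = (
        "Recent items the user viewed or bought:\n"
        + "\n".join(f"- {item}" for item in behavior_log)
        + "\nDescribe the user's current shopping intent in one short phrase."
    )
    return llm.generate(prompt).strip()

def recommend(llm, behavior_log, item_index, k=20):
    """Retrieve candidates by matching the inferred intent, not raw clicks."""
    intent = infer_user_intent(llm, behavior_log)
    return item_index.search(query=intent, top_k=k)  # e.g. embedding search
```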

RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents

Github Link: https://github.com/Tencent/DigitalHuman/tree/main/RLVMR
Paper Link: https://www.alphaxiv.org/abs/2507.22844

RLVMR, developed by Tencent, trains Large Language Model agents to perform complex, long-horizon tasks by providing dense, verifiable meta-reasoning rewards during reinforcement learning. This approach leads to enhanced task success and generalization while significantly reducing inefficient exploration, such as repetitive and invalid actions, on benchmarks like ALFWorld and ScienceWorld.
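A rough sketch of what a dense, verifiable meta-reasoning reward can look like, assuming the agent tags each step (e.g., plan/reflect) and the environment reports invalid actions; the tag names, weights, and checks below are illustrative assumptions, not RLVMR's exact reward design.

```python
# Minimal sketch of a "verifiable meta-reasoning" reward, assuming the agent
# emits tagged steps such as <plan>...</plan> or <reflect>...</reflect>.
# Tag names, weights, and checks are assumptions for illustration only.

def meta_reasoning_reward(step_tag: str, action: str, history: list[str]) -> float:
    reward = 0.0
    if step_tag == "plan" and action not in history:
        reward += 0.1          # reward novel, plan-consistent actions
    if step_tag == "reflect" and history and history[-1] == action:
        reward -= 0.2          # penalize blindly repeating the last action
    if action == "invalid":
        reward -= 0.3          # penalize actions rejected by the environment
    return reward

def total_reward(task_success: bool, dense_rewards: list[float]) -> float:
    # Sparse outcome reward plus the dense, rule-verifiable shaping terms.
    return (1.0 if task_success else 0.0) + sum(dense_rewards)
```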


Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Author: ByteDance Seed AI4Math

Github Link: https://github.com/ByteDance-Seed/Seed-Prover
Paper Link: https://www.alphaxiv.org/abs/2507.23726

ByteDance Seed AI4Math’s Seed-Prover and Seed-Geometry are AI systems that successfully proved 5 out of 6 problems in the IMO 2025 competition, establishing new state-of-the-art results across several formal mathematical benchmarks including MiniF2F and PutnamBench. The systems achieve this through lemma-style proving, multi-tiered inference strategies that integrate iterative refinement and broad conjecture generation, and a fast, specialized geometry engine.
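As a toy illustration of what "lemma-style proving" means in a formal setting, the following Lean 4 (Mathlib) snippet first establishes a named intermediate fact and then uses it to close the goal; it is a format sketch only, not Seed-Prover output.

```lean
-- Toy illustration of lemma-style proving in Lean 4 with Mathlib:
-- establish an intermediate fact as a named `have`, then finish the goal.
theorem sum_sq_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + 2 * a * b + b ^ 2 := by
  have key : a ^ 2 + 2 * a * b + b ^ 2 = (a + b) ^ 2 := by ring
  rw [key]
  exact sq_nonneg (a + b)
```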

Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning

This academic paper explores how the position of demonstrations (demos) within a Large Language Model’s (LLM) prompt affects its performance, a phenomenon termed DPP bias. Researchers evaluated ten open-source LLMs across various NLP tasks, discovering that placing demos at the beginning of the prompt generally leads to higher accuracy and greater prediction stability. Conversely, demos positioned at the end of the user message can drastically alter predictions without improving correctness. The study highlights that the optimal demo placement is not universal, varying significantly with both the LLM’s size and the specific task, emphasizing the critical need for model-aware and task-sensitive prompt design.
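A small sketch of the two prompt layouts being compared, assuming a standard chat-style message list; the helper and message structure are illustrative assumptions rather than the paper's exact setup.

```python
# Hypothetical sketch of two demo placements: at the start of the prompt
# vs. appended to the end of the user message.

def build_prompt(demos: list[tuple[str, str]], query: str, position: str) -> list[dict]:
    demo_text = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    if position == "start":          # demos before everything else
        return [{"role": "system", "content": demo_text},
                {"role": "user", "content": query}]
    elif position == "end_of_user":  # demos appended after the actual query
        return [{"role": "user", "content": f"{query}\n\n{demo_text}"}]
    raise ValueError(position)

# Run both layouts with the same demos and queries, then compare accuracy
# and how often predictions flip between the two layouts.
```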

CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

CoT-Self-Instruct, developed by FAIR at Meta, introduces a method for generating high-quality synthetic data for Large Language Models by combining Chain-of-Thought reasoning for instruction creation with robust, automated filtering mechanisms. This approach enables models trained on the synthetic data to achieve superior performance on both reasoning and general instruction-following benchmarks, often surpassing existing synthetic methods and human-annotated datasets.
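A rough sketch of the generate-then-filter idea: prompt a model to reason step by step before writing a new task, then keep only tasks whose sampled answers largely agree. The prompt wording and the self-consistency filter below are assumptions; the paper's actual filtering criteria may differ.

```python
# Rough sketch of a CoT-then-filter synthetic prompt pipeline.
from collections import Counter

def generate_synthetic_prompt(llm, seed_task: str) -> str:
    meta_prompt = (
        f"Here is an example task:\n{seed_task}\n\n"
        "Think step by step about what makes this task useful, then write one "
        "new, self-contained task of similar difficulty. End with 'Task:' "
        "followed by the task only."
    )
    return llm.generate(meta_prompt).split("Task:")[-1].strip()

def keep_prompt(llm, prompt: str, n_samples: int = 8, min_agreement: float = 0.5) -> bool:
    """Keep the prompt only if sampled answers mostly agree (self-consistency)."""
    answers = [llm.generate(prompt) for _ in range(n_samples)]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / n_samples >= min_agreement
```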

UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

In this paper, we propose UniLIP, which extends CLIP to reconstruction, generation, and editing, thereby building a unified tokenizer upon its exceptional comprehension capabilities. Previous CLIP-based unified methods often require additional diffusion decoders or quantization to support reconstruction and generation tasks, leading to inconsistent reconstruction or degradation of the original comprehension performance. In contrast, we introduce a two-stage training scheme and a self-distillation strategy that progressively integrate reconstruction capabilities into CLIP, allowing it to maintain its original comprehension performance while achieving effective image reconstruction. Furthermore, we propose a dual-condition architecture to connect the MLLM and diffusion transformer, using both learnable queries and the last-layer multimodal hidden states as joint conditions. This method not only enables the utilization of the MLLM’s strong reasoning capabilities in generation tasks, but also maximizes the exploitation of the rich information in UniLIP features during editing tasks. In text-to-image generation, UniLIP obtains scores of 0.87 and 0.53 on the GenEval and WISE benchmarks respectively, surpassing all previous unified models of similar scale. In image editing, UniLIP also achieves a score of 3.62 on the ImgEdit Benchmark, surpassing recent state-of-the-art models such as BAGEL and UniWorld-V1. UniLIP effectively expands the application scope of CLIP, enabling continuous CLIP features not only to serve as the optimal choice for understanding tasks but also to achieve highly competitive performance in generation and editing tasks.
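A schematic PyTorch sketch of the dual-condition connector described above: learnable queries and the MLLM's last-layer hidden states are concatenated into a single conditioning sequence for the diffusion transformer. Dimensions, projections, and names are assumptions, not the exact UniLIP architecture.

```python
# Schematic "dual-condition" connector: learnable queries + last-layer
# multimodal hidden states form one conditioning sequence. Shapes and
# module names are illustrative assumptions.
import torch
import torch.nn as nn

class DualConditionConnector(nn.Module):
    def __init__(self, mllm_dim: int, cond_dim: int, num_queries: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, cond_dim) * 0.02)
        self.proj = nn.Linear(mllm_dim, cond_dim)  # map hidden states to cond space

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (batch, seq_len, mllm_dim) last-layer multimodal states
        batch = mllm_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        h = self.proj(mllm_hidden)
        return torch.cat([q, h], dim=1)  # (batch, num_queries + seq_len, cond_dim)

# cond = DualConditionConnector(4096, 1024)(hidden_states)
# The diffusion transformer would then cross-attend to `cond`.
```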

C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Project Link: https://step-out.github.io/C3-web/
Github Link: https://github.com/step-out/C3
Paper Link: https://www.alphaxiv.org/abs/2507.22968

This document introduces C3, a new bilingual benchmark designed to assess Spoken Dialogue Models (SDMs) in complex conversational scenarios. It highlights five key challenges in human speech: phonological ambiguity, semantic ambiguity, omission, coreference, and multi-turn interaction. The paper presents a dataset of 1,079 instances in both English and Chinese, evaluated using an LLM-based method that strongly correlates with human judgment. Experimental results reveal that ambiguity, especially semantic ambiguity in Chinese, poses significant difficulties for SDMs, and that omission is the most challenging aspect of context-dependency.
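For intuition, here is a minimal LLM-as-judge scoring sketch of the kind such a benchmark might use; the judging prompt, 1-5 scale, and parsing are assumptions, not C3's exact protocol.

```python
# Illustrative LLM-as-judge scoring for a spoken-dialogue response.
import re

def judge_response(judge_llm, context: str, reference: str, response: str) -> int:
    prompt = (
        "You are grading a spoken dialogue system.\n"
        f"Dialogue context:\n{context}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"System response (transcribed):\n{response}\n\n"
        "Score how well the response resolves the ambiguity, omission, or "
        "coreference in context, from 1 (wrong) to 5 (fully correct). "
        "Reply with the number only."
    )
    reply = judge_llm.generate(prompt)
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1
```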

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Github Link: https://github.com/leigest519/ScreenCoder
Paper Link: https://www.alphaxiv.org/abs/2507.22827

The source introduces ScreenCoder, a novel framework designed to automate the conversion of user interface (UI) designs into front-end code, specifically HTML/CSS. It highlights the limitations of existing methods that primarily rely on text-to-code generation and struggle with visual design nuances. ScreenCoder addresses this by employing a modular multi-agent system comprising three stages: grounding (detecting and labeling UI components), planning (structuring a hierarchical layout), and generation (synthesizing code from the structured layout). Furthermore, the paper describes how this framework functions as a scalable data engine, generating UI-image/code pairs to enhance vision-language models (VLMs) through supervised fine-tuning and reinforcement learning, ultimately achieving state-of-the-art performance in UI-to-code synthesis.
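The three-stage flow can be pictured as a simple pipeline; the agent interfaces and data shapes below are assumptions for illustration, not ScreenCoder's actual APIs.

```python
# Schematic pipeline mirroring the grounding -> planning -> generation flow.
from dataclasses import dataclass

@dataclass
class UIComponent:
    label: str                        # e.g. "navbar", "button", "image"
    bbox: tuple[int, int, int, int]   # (x, y, w, h) in screenshot pixels

def screenshot_to_html(screenshot, grounder, planner, generator) -> str:
    # 1) Grounding: detect and label UI components in the screenshot.
    components: list[UIComponent] = grounder.detect(screenshot)
    # 2) Planning: arrange components into a hierarchical layout tree.
    layout_tree = planner.build_layout(components)
    # 3) Generation: synthesize HTML/CSS from the structured layout.
    return generator.render(layout_tree)
```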

I Am Big, You Are Little; I Am Right, You Are Wrong

Paper Link: https://www.alphaxiv.org/abs/2507.23509
Github Link: https://github.com/ReX-XAI/ReX

This paper presents a study on minimal sufficient pixel sets (MPSs), which are the smallest sets of pixels needed for an image classification model to make its original prediction. The authors investigate various neural network architectures, including Inception, ResNet, ConvNext, ViT, and EVA, to understand how different models process visual information. They specifically examine whether MPS size and location vary across models and architectures, and if misclassifications correlate with larger MPSs. The research utilizes ReX, a causal explainable AI (XAI) tool, to generate these pixel sets, demonstrating that models like ConvNext and EVA often rely on fewer, more spatially distinct pixels.
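As a point of reference, a naive way to approximate a small sufficient pixel set is greedy occlusion: remove patches and keep only the removals that preserve the original prediction. This is a baseline sketch only; ReX itself uses a causal, responsibility-based procedure rather than the loop below.

```python
# Greedy occlusion baseline for a (near-)minimal sufficient pixel set.
import numpy as np

def greedy_sufficient_mask(model, image: np.ndarray, patch: int = 16) -> np.ndarray:
    h, w = image.shape[:2]
    target = model.predict(image).argmax()
    mask = np.ones((h, w), dtype=bool)            # True = pixel kept
    baseline = np.zeros_like(image)               # occluded pixels -> black
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            trial = mask.copy()
            trial[y:y + patch, x:x + patch] = False
            occluded = np.where(trial[..., None], image, baseline)
            if model.predict(occluded).argmax() == target:
                mask = trial                      # removal is safe, keep it
    return mask                                   # surviving pixels are sufficient
```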

AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data

Paper Link: https://papers-pdfs.assets.alphaxiv.org/2507.22291v1.pdf

The paper introduces AlphaEarth Foundations (AEF), a novel embedding field model developed by Google DeepMind and Google, designed for accurate and efficient global mapping from sparse Earth observation data. AEF creates a highly general, geospatial representation by integrating spatial, temporal, and measurement contexts from various sources like Sentinel and Landsat imagery, LiDAR, climate data, and even text. This innovation addresses the challenge of creating high-quality global maps despite the scarcity of detailed, labeled data, consistently outperforming existing featurization approaches across diverse mapping tasks such as thematic mapping, biophysical variable estimation, and change detection. The authors plan to release a dataset of global, annual, analysis-ready embedding field layers from 2017 to 2024, enabling practitioners to leverage this technology without complex deep learning workflows.
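For a sense of how such embedding layers are typically consumed, the sketch below fits a lightweight classifier on embeddings sampled at sparse labeled locations and then predicts a full map; the data layout is an assumption, not an official AEF interface.

```python
# Sparse-label thematic mapping on top of a precomputed embedding field.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_thematic_map(embeddings: np.ndarray, labels_yx: list[tuple[int, int, int]]):
    # embeddings: (H, W, D) annual embedding field; labels_yx: (row, col, class)
    X = np.array([embeddings[r, c] for r, c, _ in labels_yx])
    y = np.array([cls for _, _, cls in labels_yx])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    H, W, D = embeddings.shape
    return clf.predict(embeddings.reshape(-1, D)).reshape(H, W)  # full-coverage map
```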
