
Daily Paper | July 29, 2025


Table of Contents

  1. AlphaGo Moment for Model Architecture Discovery
  2. Group Sequence Policy Optimization
  3. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
  4. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
  5. Back to the Features: DINO as a Foundation for Video World Models
  6. Gemini 2.5 Pro Capable of Winning Gold at IMO 2025
  7. Learning without training: The implicit dynamics of in-context learning
  8. Checklists Are Better Than Reward Models For Aligning Language Models
  9. GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset

AlphaGo Moment for Model Architecture Discovery


This paper describes ASI-ARCH, an AI system designed to autonomously discover novel neural network architectures, moving beyond traditional human-limited search methods. The system operates through three core modules: a Researcher that proposes novel designs, an Engineer that implements and evaluates them, and an Analyst that extracts insights from the experiments. ASI-ARCH has identified 106 state-of-the-art linear attention architectures, demonstrating a computational scaling law for scientific discovery in AI research. This suggests that AI can significantly accelerate the pace of architectural innovation by continually learning from and evolving its own designs, much as AlphaGo revealed new strategic principles in the game of Go.
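
The closed discovery loop described above can be pictured with a short sketch. This is only an illustrative skeleton, not the authors' implementation: the module names (Researcher, Engineer, Analyst) follow the paper, while the data structures and the stubbed scoring are assumptions.

```python
# Illustrative skeleton of the Researcher -> Engineer -> Analyst discovery loop.
# Module names follow the paper; everything else is a placeholder.
import random
from dataclasses import dataclass, field

@dataclass
class Candidate:
    description: str          # natural-language architecture proposal from the Researcher
    score: float = 0.0        # benchmark result filled in by the Engineer

@dataclass
class Archive:
    candidates: list = field(default_factory=list)
    insights: list = field(default_factory=list)   # notes distilled by the Analyst

def researcher(archive: Archive) -> Candidate:
    """Propose a new linear-attention variant, conditioned on accumulated insights."""
    return Candidate(description=f"variant conditioned on {len(archive.insights)} insights")

def engineer(candidate: Candidate) -> float:
    """Implement, train, and benchmark the design (stubbed with a random score here)."""
    return random.random()

def analyst(candidate: Candidate) -> str:
    """Turn the finished experiment into a reusable insight."""
    return f"'{candidate.description}' scored {candidate.score:.2f}"

archive = Archive()
for _ in range(3):            # the real system runs this cycle autonomously at scale
    cand = researcher(archive)
    cand.score = engineer(cand)
    archive.candidates.append(cand)
    archive.insights.append(analyst(cand))
```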

Group Sequence Policy Optimization

This paper introduces Group Sequence Policy Optimization (GSPO), which addresses critical stability and efficiency issues in previous state-of-the-art RL algorithms such as Group Relative Policy Optimization (GRPO), particularly when training very large Mixture-of-Experts (MoE) LLMs.

The core innovation of GSPO lies in its sequence-level approach to importance ratio definition, clipping, rewarding, and optimization, contrasting with GRPO’s token-level methods. This fundamental change is theorized to align more closely with the basic principles of importance sampling, where rewards are granted to entire sequences. GSPO has demonstrated superior training stability, efficiency, and performance, notably resolving the instability challenges in MoE RL training without complex workarounds. These advancements have been instrumental in the “remarkable improvements in the latest Qwen3 models.” GSPO also offers potential for simplifying RL infrastructure by enabling direct use of likelihoods from inference engines.
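
A minimal numpy sketch of the distinction, assuming a standard clipped surrogate objective; the clipping range, shapes, and advantage handling are illustrative rather than the paper's exact hyperparameters:

```python
# Sequence-level (GSPO) vs. token-level (GRPO) clipped objectives.
# logp_new / logp_old: [batch, seq_len] per-token log-probs under the new / old policy.
# advantages: [batch] group-normalized, sequence-level advantages.
import numpy as np

def gspo_loss(logp_new, logp_old, advantages, eps=0.2):
    # Length-normalized sequence ratio: exp(mean per-token log-ratio),
    # i.e. one importance ratio (and one clip decision) per sequence.
    ratio = np.exp((logp_new - logp_old).mean(axis=1))           # [batch]
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

def grpo_loss(logp_new, logp_old, advantages, eps=0.2):
    # Token-level ratios: a separate clip decision for every token,
    # each weighted by the same sequence-level advantage.
    ratio = np.exp(logp_new - logp_old)                          # [batch, seq_len]
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    adv = advantages[:, None]
    return -np.mean(np.minimum(ratio * adv, clipped * adv))
```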

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

This paper introduces GEPA (Genetic-Pareto), a novel prompt optimizer designed for compound AI systems. GEPA distinguishes itself by employing a multi-objective evolutionary search that incorporates natural language feedback from new system rollouts to iteratively refine prompts. Unlike greedy update methods, GEPA maintains a Pareto front of top-performing prompts, fostering diversity and robust generalization to avoid local optima. The authors demonstrate GEPA’s superior sample efficiency and performance against state-of-the-art optimizers like MIPROv2 and GRPO across various tasks, highlighting its ability to generate shorter, more effective prompts and adapt AI systems rapidly even with limited data or budget. The methodology involves reflective prompt mutation and Pareto-based candidate selection, as illustrated by the system’s iterative search process and sample prompt examples for tasks like HotpotQA and PUPA.
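
The Pareto-based selection step can be illustrated with a small sketch. The prompts and scores below are made up, and the reflective mutation step (an LLM rewriting a prompt based on natural-language rollout feedback) is not shown; only the non-dominated filtering that keeps a diverse front of top candidates is sketched here.

```python
# Keep the Pareto front of prompt candidates scored on multiple tasks/instances.
def dominates(a, b):
    """True if score vector `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """candidates: list of (prompt, per-task score tuple). Return the non-dominated ones."""
    return [
        (p, s) for p, s in candidates
        if not any(dominates(other, s) for _, other in candidates)
    ]

candidates = [
    ("prompt A", (0.62, 0.40)),   # strongest on task 1
    ("prompt B", (0.55, 0.58)),   # strongest on task 2
    ("prompt C", (0.50, 0.35)),   # dominated by A, so it is discarded
]
print(pareto_front(candidates))   # keeps A and B: a diverse set of top performers
```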

Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding

Step-3 is a 321-billion-parameter Vision-Language Model (VLM) developed by StepFun Inc., specifically designed to minimize decoding costs for long-context reasoning tasks. It introduces a novel “model-system co-design” approach that focuses on hardware efficiency. Key innovations include the Multi-Matrix Factorization Attention (MFA) mechanism, which reduces KV cache size and computation while maintaining expressiveness, and Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers.

Step-3 activates 38 billion parameters per token, more than comparable models like DeepSeek-V3 and Qwen3 MoE 235B, yet achieves significantly lower theoretical decoding costs. Its implementation on Hopper GPUs demonstrates a decoding throughput of up to 4,039 tokens per second per GPU, outperforming DeepSeek-V3 and setting a new Pareto frontier for LLM decoding efficiency. The paper argues that total or activated parameter count is a poor indicator of decoding costs, highlighting the critical role of hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD in achieving cost-effectiveness.
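
The "parameters are a poor proxy for decoding cost" argument comes down to arithmetic intensity: how many FLOPs the attention performs per byte of KV cache it streams from memory at decode time. Below is a back-of-envelope sketch of that quantity; the head counts, context length, and the GQA-style KV sharing used to illustrate the effect are assumptions for illustration, not Step-3's actual MFA configuration.

```python
# Rough decode-time attention arithmetic intensity: FLOPs per byte of KV cache read.
def attention_intensity(n_q_heads, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # Per decoded token, QK^T and PV each cost ~2 * n_q_heads * head_dim * context_len FLOPs...
    flops = 2 * 2 * n_q_heads * head_dim * context_len
    # ...while the K and V caches (sized by the number of distinct KV heads) are streamed once.
    kv_bytes = 2 * n_kv_heads * head_dim * context_len * bytes_per_elem
    return flops / kv_bytes

# Sharing or factorizing KV (fewer distinct KV heads for the same query work) raises
# intensity, which is what lets a design match the GPU's compute/bandwidth ratio:
print(attention_intensity(n_q_heads=64, n_kv_heads=64, head_dim=128, context_len=32_768))  # 1.0
print(attention_intensity(n_q_heads=64, n_kv_heads=8,  head_dim=128, context_len=32_768))  # 8.0
```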

Back to the Features: DINO as a Foundation for Video World Models

The paper introduces DINO-world, a novel generalist video world model designed to predict future frames in the latent space of DINOv2. This model addresses key challenges in training effective world models, such as the need for large-scale, annotated video data and the computational cost of pixel-based generative models. By leveraging a pre-trained image encoder (DINOv2) and training a future predictor on a massive, uncurated video dataset, DINO-world achieves superior performance in diverse video prediction benchmarks, including segmentation and depth forecasting, and demonstrates a strong understanding of intuitive physics. A significant advantage is its ability to be fine-tuned on observation-action trajectories for planning, making it suitable for controlling agents in simulated environments. DINO-world offers a more resource-efficient architecture compared to state-of-the-art pixel-based models, making it a promising step towards more generalist and adaptable AI agents.
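
A simplified PyTorch-style sketch of this training recipe, assuming a frozen DINOv2 backbone and a small transformer predictor regressing the next frame's features; for brevity it predicts a single frame-level (CLS) feature rather than the dense patch-level latents a full video world model would use, and the architecture and loss are placeholders rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

# Frozen DINOv2 encoder (ViT-S/14 from torch.hub); only the predictor is trained.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
for p in encoder.parameters():
    p.requires_grad_(False)

predictor = nn.TransformerEncoder(   # stands in for the future-prediction module
    nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True), num_layers=4
)
head = nn.Linear(384, 384)
opt = torch.optim.AdamW(list(predictor.parameters()) + list(head.parameters()), lr=1e-4)

def train_step(past_frames, next_frame):
    """past_frames: [B, T, 3, H, W]; next_frame: [B, 3, H, W] (already resized/normalized)."""
    B, T = past_frames.shape[:2]
    with torch.no_grad():
        z_past = encoder(past_frames.flatten(0, 1)).reshape(B, T, -1)  # frame-level features
        z_next = encoder(next_frame)
    pred = head(predictor(z_past)[:, -1])              # predict the next latent from history
    loss = nn.functional.smooth_l1_loss(pred, z_next)  # regression in DINOv2 latent space
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```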

Gemini 2.5 Pro Capable of Winning Gold at IMO 2025

A recent research paper, “Gemini 2.5 Pro Capable of Winning Gold at IMO 2025,” presents a significant breakthrough in the field of Large Language Models (LLMs) and their ability to solve highly complex mathematical problems, specifically those encountered in the International Mathematical Olympiad (IMO). The authors, Yichen Huang and Lin F. Yang, demonstrate that by employing a sophisticated “self-verification pipeline” with meticulous prompt engineering, Google’s Gemini 2.5 Pro model successfully solved 5 out of 6 problems from the newly released IMO 2025 competition. This performance level is unprecedented for LLMs on Olympiad-level mathematics and suggests a shift from mere pattern recognition or data retrieval to more genuine complex reasoning and proof construction capabilities. The methodology emphasizes rigorous, multi-step logical deduction and an iterative refinement process, indicating that optimal strategies are crucial for harnessing the full potential of powerful LLMs for complex reasoning tasks.
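
The "self-verification pipeline" can be thought of as a generate / grade / revise loop. The outline below is a hypothetical reconstruction: the prompts, the `call_model` wrapper, and the stopping rule are assumptions for illustration, not the authors' actual prompts or pipeline.

```python
def call_model(prompt: str) -> str:
    """Wrap your LLM API of choice here (e.g. a Gemini 2.5 Pro call)."""
    raise NotImplementedError

def solve_with_self_verification(problem: str, max_rounds: int = 5) -> str:
    # 1) Draft a full proof.
    solution = call_model(f"Write a complete, rigorous proof.\n\nProblem:\n{problem}")
    for _ in range(max_rounds):
        # 2) Grade it adversarially, listing every gap or unjustified step.
        report = call_model(
            "Act as a strict IMO grader. List every gap, unjustified step, or error in the "
            "proof below. Reply 'NO ISSUES' if it is airtight.\n\n"
            f"Problem:\n{problem}\n\nProof:\n{solution}"
        )
        if "NO ISSUES" in report:
            break                                   # the verifier accepts the proof
        # 3) Revise the proof against the critique and repeat.
        solution = call_model(
            f"Revise the proof to fix every listed issue.\n\nProblem:\n{problem}\n\n"
            f"Proof:\n{solution}\n\nIssues:\n{report}"
        )
    return solution
```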

Learning without training: The implicit dynamics of in-context learning

This paper explores In-Context Learning (ICL) in Large Language Models (LLMs), a phenomenon where LLMs acquire new patterns from examples presented in the prompt without explicit weight updates. The authors propose that a transformer block, specifically the stacking of a self-attention layer with an MLP (Multi-Layer Perceptron), implicitly modifies the MLP layer's weights based on the input context. They introduce the concept of a "contextual block" and demonstrate through theory and experiment how the context translates into a low-rank weight update of the MLP, effectively acting as an implicit fine-tuning mechanism. This implicit update is shown to resemble gradient descent, suggesting a form of implicit learning dynamics at inference time. The research offers a theoretical framework for understanding ICL beyond the restrictive assumptions of prior work, although the analysis is limited to a single transformer block and the first generated token.
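
The core identity is easy to check numerically: pushing a context-dependent shift through a linear layer is equivalent to leaving the input alone and applying a rank-1 update to the weights. The toy check below is a simplified illustration of that equivalence (a single linear layer standing in for the MLP), not the paper's full construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 16
W = rng.normal(size=(d_out, d_in))   # MLP weight (first layer, simplified to a linear map)
x = rng.normal(size=d_in)            # query-token representation without any context
delta = rng.normal(size=d_in)        # shift contributed by attending to the in-context examples

# Rank-1 "implicit fine-tuning" update determined by the context:
dW = np.outer(W @ delta, x) / (x @ x)

with_context   = W @ (x + delta)     # contextual block: attention shifts the MLP's input
updated_weight = (W + dW) @ x        # equivalent view: the context edits the MLP's weights
print(np.allclose(with_context, updated_weight))   # True
```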

Checklists Are Better Than Reward Models For Aligning Language Models

This paper introduces Reinforcement Learning from Checklist Feedback (RLCF), a novel method for aligning large language models (LLMs) to better follow user instructions. Unlike reinforcement learning methods that rely on fixed, generic criteria, RLCF evaluates responses against dynamic, instruction-specific checklists, yielding more flexible and precise reward signals. The approach uses both AI judges and specialized verifier programs to score how well an LLM's output satisfies each checklist item. The authors show that RLCF consistently improves performance across benchmarks, demonstrating its effectiveness at eliciting desirable behaviors in open-ended instruction following, even for models that have not been specifically instruction-tuned. The paper also details the creation of "WildChecklists," a large dataset of instructions and corresponding synthetically generated checklists, and examines the method's computational efficiency.
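
In spirit, the reward computation loops over checklist items, routes each one either to an LLM judge or to a programmatic verifier, and aggregates the item scores into a single scalar. The sketch below is an assumption-laden illustration: `call_judge`, the word-limit verifier, the routing rule, and the plain averaging are placeholders, not the paper's exact scoring scheme.

```python
from statistics import mean

def call_judge(instruction: str, response: str, item: str) -> float:
    """Ask an LLM judge how well `response` satisfies this checklist item (0..1)."""
    raise NotImplementedError("wrap an LLM judging call here")

def word_limit_verifier(response: str, limit: int = 200) -> float:
    """Example of a programmatic verifier for an objectively checkable item."""
    return 1.0 if len(response.split()) <= limit else 0.0

def checklist_reward(instruction: str, response: str, checklist: list[str]) -> float:
    scores = []
    for item in checklist:
        if item.startswith("word_limit"):        # route checkable items to a verifier program
            scores.append(word_limit_verifier(response))
        else:                                    # everything else goes to the AI judge
            scores.append(call_judge(instruction, response, item))
    return mean(scores)                          # one scalar reward for the RL update
```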

GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset

The “GPT-IMAGE-EDIT-1.5M” paper introduces a significant new public dataset designed to advance open-source research in instruction-guided image editing. This dataset comprises over 1.5 million high-quality triplets of {instruction, source image, edited image}. It was systematically constructed by leveraging the advanced capabilities of GPT-4o to unify and refine existing popular image-editing datasets (OmniEdit, HQ-Edit, and UltraEdit). The core methodology involves regenerating output images for enhanced visual quality and instruction alignment, and selectively rewriting prompts to improve semantic clarity.

Models fine-tuned on GPT-IMAGE-EDIT-1.5M, specifically FluxKontext, have achieved state-of-the-art performance among open-source methods across various benchmarks (e.g., 7.24@GEdit-EN, 3.80@ImgEdit-Full, 8.78@Complex-Edit). This significantly narrows the performance gap with leading proprietary models like GPT-4o. The release of this dataset aims to “catalyze further open research in instruction-guided image editing.”
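
For concreteness, a triplet in such a dataset can be represented roughly as below. The field names and the JSON-lines manifest layout are hypothetical, chosen only to illustrate the {instruction, source image, edited image} structure, not the dataset's published schema.

```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class EditTriplet:
    instruction: str        # editing instruction (possibly rewritten by GPT-4o for clarity)
    source_image: Path      # path to the original input image
    edited_image: Path      # path to the regenerated, instruction-aligned output image

def load_triplets(manifest: Path) -> list[EditTriplet]:
    """Read a simple JSON-lines manifest with one triplet per line."""
    triplets = []
    for line in manifest.read_text().splitlines():
        rec = json.loads(line)
        triplets.append(EditTriplet(rec["instruction"], Path(rec["source"]), Path(rec["edited"])))
    return triplets
```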
