After Leaving Alibaba, Junyang Lin Reflects on Qwen's Gains and Losses for the First Time

AGI
Why dissect the AI technical paradigm right now?

On March 26, as the controversy over his departure earlier in the month was dying down (see "Alibaba Qwen's 'Change of Command' Storm: Junyang Lin Moves On, and the Large-Model Race Bids Farewell to the Era of Heroes"), Junyang Lin (林俊旸), the soul of the Qwen large models and reputedly "Alibaba's youngest P10," published a long essay on X, "From 'Reasoning' Thinking to 'Agentic' Thinking," systematically laying out his analysis of how AI's technical paradigm is evolving. In it, Lin not only takes stock of the past but also points to the focus of AI's next phase of competition: an agentic era that moves beyond single-model contests and hinges on systems, environments, and coordination.

Lin defines 2024-2025 as the "reasoning thinking" phase, represented by OpenAI's o1 and DeepSeek-R1. Its core achievement was proving that "thinking" can be trained for and delivered as a first-class capability. The essence of the phase was using reinforcement learning (RL) to obtain deterministic feedback in verifiable domains such as math and code, letting models "optimize for correctness rather than plausibility." Behind this, however, lay an enormous infrastructure challenge: reasoning RL evolved from a lightweight fine-tuning add-on into a systems-engineering problem requiring large-scale rollouts and high-throughput verification.

The real difficulty, however, goes further. The second part of the essay examines the practical dilemma of fusing "thinking mode" with "instruct mode." The analysis mirrors commercial reality: after Alibaba attempted the fusion in Qwen3, the subsequent 2507 releases shipped Instruct and Thinking as separate versions, because many customers still needed cost-effective, highly controllable instruct behavior for batch operations.

The essay proposes "agentic thinking" as the core paradigm of next-generation AI, marking a shift in the object of training from the model itself to the model-environment system. At its heart, agentic thinking is "thinking in order to act": it must handle problems pure reasoning models never face, such as deciding when to act, which tools to invoke, how to absorb uncertain environmental feedback, how to revise plans after failures, and how to stay coherent across many turns of interaction.

Lin argues that in the reasoning era, advantage came from better RL algorithms and feedback signals; in the agentic era, competitive advantage will rest on better environment design, tighter train-serve integration, and stronger multi-agent harness engineering. The environment itself becomes a first-class artifact, whose stability, realism, feedback richness, and exploit resistance are critical. At the same time, multi-agent organizations, systems built from planners, domain experts, and executor sub-agents, will become a core source of intelligence.

Nothing in the essay is frontier-level novelty, yet it drew wide attention on publication. For most readers, the draw is seeing how the former core lead of the Qwen models understands where AI is heading today; it may also hint at the startup or research direction he favors next.

The full text of the essay follows; the Chinese translation in the original article was produced by Qwen.

From "Reasoning" Thinking to "Agentic" Thinking


The last two years reshaped how we evaluate models and what we expect from them. OpenAI's o1 showed that "thinking" could be a first-class capability, something you train for and expose to users. DeepSeek-R1 proved that reasoning-style post-training could be reproduced and scaled outside the original labs. OpenAI described o1 as a model trained with reinforcement learning to "think before it answers." DeepSeek positioned R1 as an open reasoning model competitive with o1.


That phase mattered. But the first half of 2025 was mostly about reasoning thinking: how to make models spend more inference-time compute, how to train them with stronger rewards, how to expose or control that extra reasoning effort. The question now is what comes next. I believe the answer is agentic thinking: thinking in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world.


1. What the Rise of o1 and R1 Actually Taught Us


The first wave of reasoning models taught us that if we want to scale reinforcement learning in language models, we need feedback signals that are deterministic, stable, and scalable. Math, code, logic, and other verifiable domains became central because rewards in these settings are much stronger than generic preference supervision. They let RL optimize for correctness rather than plausibility. Infrastructure became critical.

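The point about deterministic feedback can be made concrete with a toy verifier. A minimal sketch, where the `\boxed{}` answer convention and all names are illustrative assumptions rather than any lab's actual reward code:

```python
import re

def verifiable_reward(model_output, ground_truth):
    """Binary reward from a deterministic verifier: 1.0 only when the final
    boxed answer exactly matches the reference, else 0.0. This is what lets
    RL optimize for correctness rather than plausibility."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0                       # no checkable answer means no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```

Under this signal, a long and plausible-sounding trace with the wrong boxed answer scores nothing, while a terse trace with the right one earns full reward, which is exactly the opposite of what generic preference supervision tends to favor.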

Once a model is trained to reason through longer trajectories, RL stops being a lightweight add-on to supervised fine-tuning. It becomes a systems problem. You need rollouts at scale, high-throughput verification, stable policy updates, efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story. OpenAI described o1 as a reasoning line trained with RL, and DeepSeek R1 later reinforced that direction by showing how much dedicated algorithmic and infrastructure work reasoning-based RL demands. The first big transition: from scaling pretraining to scaling post-training for reasoning.


2. The Real Problem Was Never Just "Merge Thinking and Instruct"


At the beginning of 2025, many of us on the Qwen team had an ambitious picture in mind. The ideal system would unify thinking and instruct modes. It would support adjustable reasoning effort, similar in spirit to low / medium / high reasoning settings. Better still, it would automatically infer the appropriate amount of reasoning from the prompt and context, so the model could decide when to answer immediately, when to think longer, and when to spend much more computation on a truly difficult problem.


Conceptually, this was the right direction. Qwen3 was one of the clearest public attempts. It introduced "hybrid thinking modes," supported both thinking and non-thinking behavior in one family, emphasized controllable thinking budgets, and described a four-stage post-training pipeline that explicitly included "thinking mode fusion" after long-CoT cold start and reasoning RL.


But merging is much easier to describe than to execute well. The hard part is data. When people talk about merging thinking and instruct, they often think first about model-side compatibility: can one checkpoint support both modes, can one chat template switch between them, can one serving stack expose the right toggles. The deeper issue is that the data distributions and behavioral objectives of the two modes are substantially different.


We did not get everything right when trying to balance model merging with improving the quality and diversity of post-training data. During that revision process, we also paid close attention to how users were actually engaging with thinking and instruct modes. A strong instruct model is typically rewarded for directness, brevity, formatting compliance, low latency on repetitive, high-volume enterprise tasks such as rewriting, labeling, templated support, structured extraction, and operational QA. A strong thinking model is rewarded for spending more tokens on difficult problems, maintaining coherent intermediate structure, exploring alternative paths, and preserving enough internal computation to meaningfully improve final correctness.


These two behavior profiles pull against each other. If the merged data is not carefully curated, the result is usually mediocre in both directions: the "thinking" behavior becomes noisy, bloated, or insufficiently decisive, while the "instruct" behavior becomes less crisp, less reliable, and more expensive than what commercial users actually want.


Separation remained attractive in practice. Later in 2025, after the initial hybrid framing of Qwen3, the 2507 line shipped distinct Instruct and Thinking updates, including separate 30B and 235B variants. In commercial deployment, a large number of customers still wanted high-throughput, low-cost, highly steerable instruct behavior for batch operations. For those scenarios, merging wasn't obviously a benefit. Separating the lines allowed teams to focus on solving the data and training problems of each mode more cleanly.


Other labs chose the opposite route. Anthropic publicly argued for an integrated model philosophy: Claude 3.7 Sonnet was introduced as a hybrid reasoning model where users could choose ordinary responses or extended thinking, and API users could set a thinking budget. Anthropic explicitly said they believed reasoning should be an integrated capability rather than a separate model. GLM-4.5 also publicly positioned itself as a hybrid reasoning model with both thinking and non-thinking modes, unifying reasoning, coding, and agent capabilities; DeepSeek later moved in a similar direction with V3.1's "Think & Non-Think" hybrid inference.


The key question is whether the merge is organic. If thinking and instruct are merely co-located inside one checkpoint but still behave like two awkwardly stitched personalities, the product experience remains unnatural. A truly successful merge requires a smooth spectrum of reasoning effort. The model should be able to express multiple levels of effort, and ideally choose among them adaptively. GPT-style effort control points toward this: a policy over compute, rather than a binary switch.

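The idea of "a policy over compute, rather than a binary switch" can be sketched as a mapping from an estimated difficulty score to a thinking budget. The thresholds, field names, and token counts below are purely illustrative assumptions, not any real API:

```python
def choose_reasoning_effort(difficulty):
    """Map an estimated difficulty score in [0, 1] to a thinking budget:
    a spectrum of effort levels rather than a binary thinking switch.
    In a full system the score would itself be predicted from the prompt
    and context, making the allocation adaptive rather than hand-tuned."""
    if difficulty < 0.2:
        return {"effort": "low", "max_thinking_tokens": 0}        # answer immediately
    if difficulty < 0.7:
        return {"effort": "medium", "max_thinking_tokens": 2048}  # think briefly
    return {"effort": "high", "max_thinking_tokens": 32768}       # spend real compute
```

An organic merge would learn this routing end to end instead of exposing fixed tiers, but even the fixed version makes the target clear: effort is a resource to be allocated, not a mode to be toggled.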

3. Why Anthropic's Direction Was a Useful Corrective


Anthropic's public framing around Claude 3.7 and Claude 4 was restrained. They emphasized integrated reasoning, user-controlled thinking budgets, real-world tasks, coding quality, and later the ability to use tools during extended thinking. Claude 3.7 was presented as a hybrid reasoning model with controllable budgets; Claude 4 extended that by allowing reasoning to interleave with tool use, while Anthropic simultaneously emphasized coding, long-running tasks, and agent workflows as primary goals.


Producing a longer reasoning trace doesn't automatically make a model more intelligent. In many cases, excessive visible reasoning signals weak allocation. If the model is trying to reason about everything in the same verbose way, it may be failing to prioritize, failing to compress, or failing to act. Anthropic's trajectory suggested a more disciplined view: thinking should be shaped by the target workload. If the target is coding, then thinking should help with codebase navigation, planning, decomposition, error recovery, and tool orchestration. If the target is agent workflows, then thinking should improve execution quality over long horizons rather than producing impressive intermediate prose.


This emphasis on targeted utility points toward something larger: we are moving from the era of training models to the era of training agents. We made this explicit in the Qwen3 blog, writing that "we are transitioning from an era focused on training models to one centered on training agents," and linking future RL advances to environmental feedback for long-horizon reasoning. An agent is a system that can formulate plans, decide when to act, use tools, perceive environment feedback, revise strategy, and continue over long horizons. It is defined by closed-loop interaction with the world.


4. What "Agentic Thinking" Really Means


Agentic thinking is a different optimization target. Reasoning thinking is usually judged by the quality of internal deliberation before a final answer: can the model solve the theorem, write the proof, produce the correct code, or pass the benchmark. Agentic thinking is about whether the model can keep making progress while interacting with an environment.


The central question shifts from "Can the model think long enough?" to "Can the model think in a way that sustains effective action?" Agentic thinking has to handle several things that pure reasoning models can mostly avoid:

  • Deciding when to stop thinking and take an action
  • Choosing which tool to invoke and in what order
  • Incorporating noisy or partial observations from the environment
  • Revising plans after failures
  • Maintaining coherence across many turns and many tool calls

Agentic thinking means a model that reasons through action.

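The requirements above reduce to one closed loop: think, act, observe, revise. A minimal sketch, where `policy` and `env` are hypothetical stand-ins for a real model and tool harness:

```python
def run_agent(task, policy, env, max_turns=10):
    """Minimal agentic loop: at each turn the policy decides whether to
    call a tool or commit to an answer, and every tool call feeds an
    observation back into the next decision."""
    observation = env.reset(task)
    trajectory = []
    for _ in range(max_turns):
        action = policy(task, observation, trajectory)
        if action["type"] == "final_answer":       # decided to stop thinking and act on it
            return action["content"], trajectory
        observation = env.step(action)             # feedback may be noisy or partial
        trajectory.append((action, observation))   # retained context for plan revision
    return None, trajectory                        # budget exhausted without an answer
```

Everything that makes this regime hard lives inside `env.step` and the policy's stopping decision: latency, statefulness, partial observability, and knowing when further thought stops paying for itself.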

5. Why Agentic RL Infrastructure Is Harder


Once the objective shifts from solving benchmark problems to solving interactive tasks, the RL stack changes. The infrastructure used for classical reasoning RL isn't enough. In reasoning RL, you can often treat rollouts as mostly self-contained trajectories with relatively clean evaluators. In agentic RL, the policy is embedded inside a larger harness: tool servers, browsers, terminals, search engines, simulators, execution sandboxes, API layers, memory systems, and orchestration frameworks. The environment is no longer a static verifier; it's part of the training system.


This creates a new systems requirement: training and inference must be more cleanly decoupled. Without that decoupling, rollout throughput collapses. Consider a coding agent that must execute generated code against a live test harness: the inference side stalls waiting for execution feedback, the training side starves for completed trajectories, and the whole pipeline operates far below the GPU utilization you would expect from classical reasoning RL. Adding tool latency, partial observability, and stateful environments amplifies these inefficiencies. The result is that experimentation slows and becomes painful long before you reach the capability levels you are targeting.

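The decoupling argument can be illustrated with a toy producer-consumer layout: rollout workers block on (simulated) environment latency while the trainer consumes finished trajectories from a queue, so no single stalled rollout starves the update loop. All names and numbers are illustrative:

```python
import queue
import threading
import time

def rollout_worker(traj_queue, env_latency, n_trajectories):
    """Inference side: ships each finished trajectory off immediately,
    even though the worker itself blocks on environment feedback."""
    for i in range(n_trajectories):
        time.sleep(env_latency)                    # stand-in for tool / sandbox latency
        traj_queue.put({"id": i, "reward": 1.0})   # a completed trajectory

def trainer(traj_queue, batch_size, n_batches):
    """Training side: consumes completed trajectories as they arrive,
    coupled only to aggregate throughput, not to any one rollout."""
    updates = 0
    for _ in range(n_batches):
        batch = [traj_queue.get() for _ in range(batch_size)]  # blocks until a full batch
        updates += 1                               # a policy update would happen here
    return updates

q = queue.Queue()
workers = [threading.Thread(target=rollout_worker, args=(q, 0.01, 4))
           for _ in range(4)]
for w in workers:
    w.start()
n_updates = trainer(q, batch_size=4, n_batches=4)  # 4 workers x 4 trajectories = 16
for w in workers:
    w.join()
```

In a tightly coupled design the trainer would instead sit idle through every `time.sleep`; the queue is what lets slow, stateful environments coexist with high GPU utilization.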

The environment itself also becomes a first-class research artifact. In the SFT era, we obsessed over data diversity. In the agent era, we should obsess over environment quality: stability, realism, coverage, difficulty, diversity of states, richness of feedback, exploit resistance, and scalability of rollout generation. Environment-building has started to become a real startup category rather than a side project. If the agent is being trained to operate in production-like settings, then the environment is part of the core capability stack.

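What "environment as first-class artifact" implies for the interface can be sketched by mapping the listed qualities onto code. The class and method names are hypothetical, and real environment frameworks are far richer:

```python
import random

class ToolEnvironment:
    """Toy stateful environment illustrating the qualities listed above:
    seeded randomness for stability, noisy observations for realism, and a
    reward that only fires on a verifiable finish (exploit resistance)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)   # reproducible rollouts across runs
        self.state = None

    def reset(self, task):
        self.state = {"task": task, "steps": 0}
        return {"observation": f"task: {task}"}

    def step(self, action):
        self.state["steps"] += 1
        # Realism: feedback can be partial; the policy must tolerate gaps.
        observation = None if self.rng.random() < 0.1 else {"result": f"ran {action}"}
        # Exploit resistance: reward only an explicit, verifiable completion.
        reward = 1.0 if action == "submit_verified_answer" else 0.0
        return observation, reward
```

Scaling rollout generation then becomes a question of how cheaply and reliably many such instances can be spun up, which is precisely why environment-building is turning into its own discipline.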

6. The Next Frontier Is More Usable Thought


My expectation is that agentic thinking will become the dominant form of thinking. I think it may eventually replace much of the old static-monologue version of reasoning thinking: excessively long, isolated internal traces that try to compensate for lack of interaction by emitting more and more text. Even on very difficult math or coding tasks, a genuinely advanced system should have the right to search, simulate, execute, inspect, verify, and revise. The objective is to solve problems robustly and productively.


The hardest challenge in training such systems is reward hacking. As soon as the model gets meaningful tool access, reward hacking becomes much more dangerous. A model with search might learn to look up answers directly during RL. A coding agent might exploit future information in a repository, misuse logs, or discover shortcuts that invalidate the task. An environment with hidden leaks can make the policy look superhuman while actually training it to cheat. This is where the agent era becomes much more delicate than the reasoning era. Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization. We should expect the next serious research bottlenecks to come from environment design, evaluator robustness, anti-cheating protocols, and more principled interfaces between policy and world. Still, the direction is clear. Tool-enabled thinking is simply more useful than isolated thinking, and has a far better chance of improving real productivity.


Agentic thinking will also mean harness engineering. The core intelligence will increasingly come from how multiple agents are organized: an orchestrator that plans and routes work, specialized agents that act like domain experts, and sub-agents that execute narrower tasks while helping control context, avoid pollution, and preserve separation between different levels of reasoning. The future is a shift from training models to training agents, and from training agents to training systems.


Conclusion


The first phase of the reasoning wave established something important: RL on top of language models can produce qualitatively stronger cognition when the feedback signal is reliable and the infrastructure can support it.


The deeper transition is from reasoning thinking to agentic thinking: from thinking longer to thinking in order to act. The core object of training has shifted. It is the model-plus-environment system, or more concretely, the agent and the harness around it. That changes what research artifacts matter most: model architecture and training data, yes, but also environment design, rollout infrastructure, evaluator robustness, and the interfaces through which multiple agents coordinate. It changes what "good thinking" means: the most useful trace for sustaining action under real-world constraints, rather than the longest or most visible one.


It also changes where the competitive edge will come from. In the reasoning era, the edge came from better RL algorithms, stronger feedback signals, and more scalable training pipelines. In the agentic era, the edge will come from better environments, tighter train-serve integration, stronger harness engineering, and the ability to close the loop between a model's decisions and the consequences those decisions produce.

