For years, the CEOs of leading high-tech companies have promoted the vision of AI agents that can use software applications to complete people’s tasks. But whether it’s Openai’s ChatGpt agent or Perplexity’s comet, spin today’s consumer AI agents. That way you can quickly realize how limited the technology is still. Making AI agents more robust could potentially adopt a new set of techniques the industry is still discovering.
One of these techniques is to carefully simulate a workspace where agents can be trained in multi-step tasks known as multi-step tasks (RL) environments. Just as how labeled datasets move in the final wave of AI, the RL environment is beginning to appear as an important factor in agent development.
AI researchers, founders and investors tell TechCrunch that the leading AI labs are demanding more RL environments and there is a shortage of startups that want to provide them.
“All big AI labs are building RL environments in-house,” Jennifer Li, general partner at Andreessen Horowitz, said in an interview with TechCrunch. “But as you can imagine, creating these datasets is so complicated that AI Labs is also looking at third-party vendors who can create high-quality environments and assessments. Everyone is looking at this space.”
The push of the RL environment has minted a new class of newly funded startups, including mechanization and key intelligence, aimed at leading the space. Meanwhile, large data label companies like Mercor and Surge say they are investing more in RL environments, addressing the industry’s shift from static datasets to interactive simulations. Major labs are also considering investing heavily. According to information, human leaders are debating more than $1 billion in spending on the RL environment over next year.
The hope for investors and founders is that one of these startups emerges as a “Scale AI for the Environment” and refers to $29 billion in data labeled Powerhouse, powered by the age of chatbots.
The question is whether the RL environment will truly boost the frontier of AI progression.
TechCrunch Events
San Francisco
|
October 27th-29th, 2025
What is an RL environment?
Because RL environments are core, they are the basis for training to simulate what AI agents do in real software applications. One founder explained in a recent interview that they will build them in “creating very boring video games, etc.”
For example, the environment can simulate a Chrome browser and task AI agents to buy socks on Amazon. The agent is graded for its performance and sends a reward signal when it is successful (in this case, it buys valuable socks).
Such tasks sound relatively simple, but there are many places where AI agents can stumble. You may be navigating through drop-down menus on a web page or purchasing too many socks. Also, since developers cannot accurately predict what wrong an agent is doing, the environment itself must be robust enough to capture unexpected behavior and still provide useful feedback. This makes the built environment much more complicated than a static dataset.
Some environments are very elaborate, allowing AI agents to use tools, access the Internet, and use a variety of software applications to complete specific tasks. Others are narrower and aimed to help agents learn specific tasks in enterprise software applications.
The RL environment is currently the hottest thing in Silicon Valley, but there are many precedents for using this technique. One of Openai’s first projects in 2016 was to build “RL Gyms.” This was very similar to the modern concept of the environment. In the same year, Google Deepmind’s Alphago AI system defeated the world champion in the board game. We also used RL technology within a simulated environment.
What’s unique about today’s environment is that researchers are trying to build computer-based AI agents with large-scale trans models. Unlike Alphago, a specialized AI system that runs in a closed environment, today’s AI agents are trained to have more general functions. AI researchers today have a stronger starting point, but there are also complex goals that don’t go well with many more.
A busy field
AI Scale AI data labeling companies like AI, Surge, Mercor are trying to meet at the moment and build an RL environment. These companies have more resources than many startups in this space, as well as their deeper relationships with AI Labs.
Surge CEO Edwin Chen told TechCrunch that demand for RL environments within AI Labs has been “a significant increase” recently. Surge, which reportedly worked with AI labs such as Openai, Google, Anthropic and Meta last year to generate revenues in $1.2 billion, recently said it has spun a new internal organisation specially charged to build an RL environment.
Just behind Surge is Mercor, a $10 billion worth of startup that also works in Openai, Meta, and humanity. Mercor is pitching investors to a business building RL environment for domain-specific tasks such as coding, healthcare and law, according to marketing materials seen by TechCrunch.
“Leah, few people understand how big the opportunities around the RL environment are,” Melkor CEO Brendan Hoody told TechCrunch in an interview.
The scale AI used to control the labeling space for data has lost ground since Meta invested $14 billion and hired CEOs. Since then, Google and Openai have removed Scale AI as data providers, and startups have even faced a race for data labeling work within Meta. But even so, Scale is trying to meet at the moment and create an environment.
“This lies in the nature of the business (Scale AI),” said Chetan Rane, head of product for agents and RL environments. “Scale proves its ability to adapt quickly. We did this early on in our first business unit, the self-driving cars. When ChatGPT came out, AI adapted to it.
Some new players have focused solely on the environment from the start. Among them is a startup that was founded about six months ago with the bold goal of “automating all jobs.” However, co-founder Matthew Barnett tells TechCrunch that his company starts with an AI coding agent’s RL environment.
Mechanization aims to provide AI labs with a small number of robust RL environments, says Barnett, rather than a large data company that creates a wide range of simple RL environments. At this point, the startup is building an RL environment by offering a $500,000 salary to software engineers. This is much higher than hourly contractors can work with AI or surges.
Mechanize is already working with humanity in an RL environment, two sources familiar with the issue told TechCrunch. Mechanization and humanity declined to comment on the partnership.
Other startups bet that the RL environment will have an impact outside of AI Labs. Prime Intellect – A startup supported by AI researchers Andrej Karpathy, Founders Fund and Menlo Ventures targets small developers in RL environments.
Last month, Prime Intellect launched the RL Environments Hub. This is intended to “hugging the face of an RL environment.” The idea is to allow open source developers to access the same resources that large AI Labs have, allowing those developers to access computational resources in the process.
According to Will Brown of Prime Intellect Researcher, a generally capable agent can be more computational than previous AI training techniques in an RL environment. Along with startups building RL environments, GPU providers can enhance their processes have another opportunity.
“The RL environment would be too big for one company to control,” Brown said in an interview. “Part of what we do is try to build a great open source infrastructure around it. The services we sell are calculations, so it’s a convenient on-ramp to use GPUs, but this is what we’re thinking about in the long run.”
Does it scale?
An unresolved question regarding the RL environment is whether the technique is scaled like previous AI training methods.
Reinforcement learning has driven some of the biggest leaps in AI over the past year, including models such as Openai’s O1 and Anthropic’s Claude Opus 4. These are particularly important breakthroughs as methods previously used to improve AI models show reduced returns.
The environment is part of AI Labs’ larger bets on RL, and we believe that many will continue to drive progress as data and computational resources are added to the process. Some Openai researchers behind O1 previously told TechCrunch that the company originally invested in the AI Reasoning model (created through investment in RL and calculations during testing) and thought it would be a good extension.
The best way to scale RL remains unknown, but the environment appears to be a promising candidate. Instead of simply rewarding the chatbot for text responses, agents use tools and computers at their disposal to run in simulations. It’s much more resource intensive, but potentially rewarding.
Some are skeptical that all these RL environments will pan out. Ross Taylor, former AI research lead at Meta, who co-founded general reasoning, tells TechCrunch that RL environments tend to reward hacking. This is the process in which AI models cheats to earn rewards without actually performing tasks.
“I think people underestimate how difficult it is to expand the environment,” Taylor said. “Even the best (RL environments) that are generally available will not normally work without serious changes.”
Sherwin Wu, Head of Engineering for API Business at Openai, said in a recent podcast that it was “short” at RL environment startups. Wu said it is a very competitive space, but AI research has evolved so quickly that it is difficult to serve AI labs well.
Karpathy, a leading intelligence investor who calls the RL environment a potential breakthrough, has paid more broad attention to the RL space. In X’s post, he raised concerns about whether he could squeeze more AI progress from RL.
“I’m bullish about the interaction between environment and agent, but specifically, I’m bearish towards reinforcement learning,” says Karpathy.
Update: Previous versions of this article were called mechanization work. Updated to reflect the company’s official name.