BWE News – USA, World, Tech, AI, Finance, Sports & Entertainment Updates

Are AI agents ready for the workplace? New benchmarks raise questions

By admin, January 23, 2026


It’s been nearly two years since Microsoft CEO Satya Nadella predicted that AI would replace knowledge work: the white-collar jobs held by lawyers, investment bankers, librarians, accountants, IT staff, and others.

However, despite the great advances made by foundation models, changes in knowledge work have been slow to arrive. Models have mastered in-depth research and agentic planning, but for whatever reason, most white-collar work remains relatively untouched.

This is one of the biggest mysteries in AI, and thanks to new research from training data giant Mercor, we finally have some answers.

The new research examines how leading AI models hold up when performing real white-collar work tasks drawn from consulting, investment banking, and law. The result is a new benchmark called APEX-Agents, which so far has given every AI lab a failing grade. Faced with queries from real professionals, even the best models struggled to get more than a quarter of the questions right. Most of the time, a model returned a wrong answer or no answer at all.

Mercor CEO Brendan Foody, who helped write the paper, said the models’ biggest stumbling block was tracking down information across multiple domains, something integral to most of the knowledge work humans perform.

“One of the big changes in this benchmark is that we modeled the entire environment after real-world professional services,” Foody told TechCrunch. “The way we work is not one person providing all the context in one place. We actually work across Slack and Google Drive and all these other tools.” For many agentic AI models, this kind of multi-domain reasoning remains hit-or-miss.


All the scenarios were drawn from actual professionals on Mercor’s expert marketplace, who both posed the queries and set the criteria for a successful response. If you look through the questions published on Hugging Face, you’ll see how complex the tasks can be.


One of the questions in the “Legal” section is:

During the first 48 minutes of the EU production shutdown, Northstar’s engineering team exported one or two bundled sets of EU production event logs containing personal data to a U.S. analytics vendor. Based on Northstar’s own policies, could the export of one or two logs be reasonably treated as consistent with Section 49?

The correct answer is yes, but getting there requires a detailed assessment of the company’s own policies as well as the relevant EU privacy laws.

This might stump even a well-informed person, but the researchers were trying to model the work done by professionals in the field. If an LLM can reliably answer these questions, it could effectively replace many of the lawyers working today. “I think this is probably the most important topic in economics,” Foody told TechCrunch. “The benchmarks are very reflective of the actual work of these people.”

OpenAI has also attempted to measure professional skills with its GDPval benchmark, but the APEX-Agents test differs in important ways. While GDPval tests general knowledge across a wide range of professions, APEX-Agents measures a system’s ability to perform sustained tasks in a narrow set of high-value professions. The result is harder for models, but also more closely tied to whether these jobs can be automated.

Although none of the models proved ready to take over as an investment banker, some clearly came closer to the mark. Gemini 3 Flash performed best of the group with 24% one-shot accuracy, closely followed by GPT-5.2 at 23%. Below that, Opus 4.5, Gemini 3 Pro, and GPT-5 all scored around 18%.

Although the initial results fall short, the AI field has a history of blowing through difficult benchmarks. Now that the APEX-Agents test is public, it is an open challenge for AI labs that believe they can do better, and Foody fully expects them to do so in the coming months.

“It’s improving really quickly,” he told TechCrunch. “Right now, you could say it’s like an intern that gets it right a quarter of the time, whereas last year it was an intern that got it right 5 to 10 percent of the time. Year-on-year improvements like this can have an impact very quickly.”


