
As AI adoption accelerates, operational complexity, not model intelligence, is becoming the primary barrier to reliable AI at scale, according to Datadog's State of AI Engineering 2026 report.
The report highlights a compounding complexity challenge as AI systems scale. Nearly seven in ten companies (69%) now use three or more models alongside increasingly complex agent workflows. Around 5% of AI model requests fail in production, with nearly 60% of those failures caused by capacity limits — leading to slowdowns, errors, and broken experiences in AI-powered applications.
Additional key findings:
- Multi-model is now the norm: OpenAI remains the most widely used provider at 63% share, alongside rising adoption of Google Gemini and Anthropic Claude, which grew by 20 and 23 percentage points, respectively.
- Agent framework adoption doubled year-over-year, accelerating development but also introducing more moving parts into production systems.
- The amount of data sent to AI models per request is also rising: the average token count more than doubled for median-usage teams (50th percentile of usage volume) and quadrupled for heavy users (90th percentile).
"AI is starting to look a lot like the early days of cloud," said Yanbing Li, Chief Product Officer at Datadog. "The cloud made systems programmable but much more complex to manage. AI is now doing the same thing to the application layer. The companies that win won't just build better models — they'll build operational control around them. In this new era, AI observability becomes as essential as cloud observability was a decade ago."
Speed Requires Control
Competitive pressure is accelerating AI deployment across startups and large enterprises alike. But as systems scale, speed without control creates risk. Failures are increasingly driven by system design, including fragmented workflows, excessive retries, and inefficient routing.
"Innovation alone isn't enough," added Li. "To scale AI with confidence, organizations need real-time visibility across the entire stack — from GPU utilization to model behavior to agent workflows. Visibility and operational control are what allow teams to move fast without sacrificing reliability or governance. At scale, how you operate AI may matter more than the models you choose."
Methodology: Datadog analyzed anonymized usage data from thousands of customers using LLMs in production environments, with global coverage across industries and geographies.