Galileo announced the launch of its comprehensive platform update for AI agent reliability, free for developers around the world.
As AI agents become increasingly autonomous and multi-step, traditional evaluation tools struggle to detect their complex failure modes. Galileo's new agent reliability solution is purpose-built for multi-agent AI systems and addresses this critical gap with agentic observability, evaluation, and guardrail capabilities working in concert.
Galileo's platform addresses the high-stakes nature of enterprise AI deployment, where a single agent failure can expose sensitive data, cost real money, or damage customer relationships. Galileo's new Luna-2 small language models(SLMs) deliver up to 97% cost reduction in production monitoring while enabling real-time protection against failures that could derail enterprise AI initiatives.
"When your agent fails, you shouldn't have to become a detective," said Vikram Chatterji, CEO and Co-founder of Galileo. "Our agent reliability platform, fueled by our world-first Insights Engine, represents a fundamental shift from reactive debugging to proactive intelligence, giving developers the confidence to deploy AI agents that perform reliably in production."
The platform tackles the unique challenges of agentic AI development, where a single bad action can expose sensitive data or cost real money, requiring guardrails that trigger before tools execute. Galileo's platform powers custom real-time evaluations and guardrails with new Luna-2 small language models, giving developers targeted visibility into agent behavior across every step, tool call, and output.
Galileo's Agent Reliability Platform delivers four key capabilities:
1. Agent Observability Reimagined
- Framework-agnostic Graph Engine that renders every branch, decision, and tool call
- Timeline View for execution flow analysis and bottleneck identification
- Conversation View for user-perspective debugging
2. Insights Engine for Automatic Failure Detection Powered by bespoke evaluation reasoning models, the Insights Engine automatically identifies failure modes and surfaces actionable insights, including:
- Root cause analysis linking errors to exact traces
- Multi-agent coordination analysis
- Tool usage optimization recommendations
- Conversation flow and performance monitoring
3. Scalable Agentic Metrics Purpose-built metrics covering flow adherence, task completion, conversation quality, and agent efficiency, with support for custom metrics using code-based approaches, LLM-as-a-judge, or Galileo's new Luna-2 small language models.
4. Real-Time Production Guardrails Luna-2 powered guardrails enable low-cost, real-time protection against malicious user behavior and agent mistakes without the expense of traditional LLM-based solutions.
Central to the platform are Galileo's Luna-2 small language models, specifically designed for always-on agent evaluations. Unlike traditional approaches that rely on expensive, slow LLMs, Luna-2 enables:
- 10-20 sophisticated metrics running simultaneously
- Sub-200ms latency even at 100% sampling rates
- Enterprise-scale production monitoring at 97% cheaper costs
- Session-level metrics that capture the entire agent journey
"Multiturn agents never follow a single script, so your tests can't either," explained Atin Sanyal, CTO and Co-founder of Galileo. "Luna-2's session metrics capture conversation quality, intent changes, efficiency, and compound-request resolution across the whole journey, not just individual turns."
The Galileo Agent Reliability Platform is available now as part of Galileo's free tier, with additional enterprise features available through paid plans. The platform integrates with popular agent frameworks, including CrewAI, LangGraph, OpenAI's Agent SDK, LlamaIndex, and Amazon Strands, leveraging open standards like OpenTelemetry for maximum compatibility.
To accompany the platform, Galileo has also released a new v2 of its viral AI agent leaderboard today. The leaderboard evaluates models for their effectiveness in solving domain-specific enterprise tasks across different purpose-built agent metrics and datasets covering banking, healthcare, insurance, investments, and telecoms. OpenAI's GPT-4.1 tops the updated research, and Kimi K2 leads among open-source models.
The Latest
As AI moves from generating responses to performing actions, the need for trust increases exponentially. And as organizations enlist AI agents for increasingly sophisticated business processes, trust is going to be the single most important theme for spurring adoption. What can organizations do to build trustworthy AI agents? ...
I've spent a lot of time in the channel, and one thing I keep coming back to is this: a partner program is only as good as what it looks like in the field. Many programs look great on paper, but when a partner is in front of a customer navigating a complex hybrid environment or trying to make the case for AI-powered observability, the gap between what a vendor promises and what it actually delivers becomes very clear, very fast ...
Enterprises today operate in a real-time environment where uninterrupted access to trusted data has become a baseline expectation for users, applications and automated systems. Traditional DataOps models, built on manual effort and human triage, cannot keep pace with this always active demand. AI agents are emerging as the operational backbone, ensuring consistent data availability, reinforcing trustworthiness and enabling a level of scale that manual processes cannot achieve ...
For decades, trust in the digital workplace rested on familiar signals. We trusted faces on video calls, voices on the phone, and emails that appeared to come from people we knew. These cues felt human and intuitive. They anchored how decisions were made, approvals were granted, and access was authorized. AI-powered deepfakes have quietly broken that model ...
Cloud migration was supposed to be a one-way door. For most enterprises, it turns out it isn't. Cloud data repatriation is a real and growing trend. A new survey ... finds that 89% of organizations plan to expand their on-premises infrastructure footprint over the next two years — and 75% have already moved at least some workloads back from public cloud in the past 24 months. The findings point to a broad rethinking of where data belongs ...
Over the past few years, large language models (LLMs) have revolutionized the software industry. Given their ability to excel at multi-step reasoning, LLMs have helped enterprises streamline workflows and adapt to the unknown. However, employing such models comes with sky-high costs, latency issues, and limited flexibility. In the realm of IT operations, it is generally wiser to employ smaller, domain-specific models instead ...
For years, DevOps teams operated under a simple assumption: collect enough telemetry, and you can find and fix any problem. That assumption is breaking down. Modern enterprises now operate across microservices, hybrid cloud environments, APIs, Kubernetes, and highly automated delivery pipelines. Releases happen continuously, dependencies shift constantly, and failures spread faster than teams can diagnose them ...
New Relic surveyed IT and engineering leaders from the media and entertainment (M&E) sector to understand what's working — and where challenges persist with their observability practices. The findings reveal how M&E organizations are navigating rising platform complexity, audience expectations, and AI-driven change. Below are five takeaways that stand out ...
Let me start with something I've seen play out more times than I can count. A team hits a wall with the cloud. Costs creep up, then spike. Performance starts to feel inconsistent. Someone in finance asks a simple question like "why did this double?" and nobody has a clean answer ... Maybe this isn't the right place for everything. That realization feels like a breakthrough, like you've identified the problem. In reality, you've just identified the starting line ...
In MEAN TIME TO INSIGHT Episode 24, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at EMA discusses network observability tool sprawl ...