Galileo Releases Agentic Evaluations

Galileo unveiled Agentic Evaluations, a solution for evaluating the performance of AI agents powered by large language models (LLMs). 

With Agentic Evaluations, developers gain the tools and insights needed to optimize agent performance and reliability at every step—ensuring readiness for real-world deployment.

"AI agents are unlocking a new era of innovation, but their complexity has made it difficult for developers to understand where failures occur and why," said Vikram Chatterji, CEO and co-founder of Galileo. "With LLMs driving decision-making, teams need tools to pinpoint and understand an agent's failure modes. Agentic Evaluations delivers unprecedented visibility into every action, across entire workflows, empowering developers to build, ship, and scale reliable, trustworthy AI solutions."

Galileo's Agentic Evaluations provides an end-to-end framework for both system-level and step-by-step evaluation, enabling developers to build reliable, resilient, and high-performing AI agents.

Key capabilities include:

  • Complete Visibility into Agent Workflows: Gain a clear view of entire multi-step agent completions, from input to final action, with comprehensive tracing and simple visualizations that help developers quickly pinpoint inefficiencies and errors in agent sessions.
  • Agent-Specific Metrics: Measure agent performance with proprietary, research-backed metrics built to evaluate agents at multiple levels:
    • LLM Planner: Assess the quality of tool selection and whether the right instructions are passed on.
    • Tool Calls: Assess errors in individual tool completions.
    • Overall session success: Measure overall task completion and successful agentic interactions.
  • Granular Cost and Latency Tracking: Optimize the cost-effectiveness of agents with aggregate tracking for cost, latency, and errors across sessions and spans.
  • Seamless Integrations: Support for popular AI frameworks like LangGraph and CrewAI.
  • Proactive Insights: Alerts and dashboards help developers identify systemic issues, such as failed tool calls or misalignment between the final action and the initial instructions, and uncover actionable insights for continuous improvement.
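
To make the cost and latency tracking item above concrete, here is a minimal Python sketch of rolling span-level records up to session totals. It is an illustration only and does not use Galileo's API: the `Span` record, its field names, and the `aggregate_by_session` helper are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """One step in an agent session (hypothetical record, not Galileo's schema)."""
    session_id: str
    kind: str            # assumed labels, e.g. "planner" or "tool_call"
    cost_usd: float
    latency_ms: float
    error: bool = False


def aggregate_by_session(spans):
    """Sum cost and latency, and count errors and spans, for each session."""
    totals = defaultdict(
        lambda: {"cost_usd": 0.0, "latency_ms": 0.0, "errors": 0, "spans": 0}
    )
    for span in spans:
        entry = totals[span.session_id]
        entry["cost_usd"] += span.cost_usd
        entry["latency_ms"] += span.latency_ms
        entry["errors"] += int(span.error)
        entry["spans"] += 1
    return dict(totals)


# Made-up sample data: two sessions, one of which has a failed tool call.
spans = [
    Span("s1", "planner", 0.002, 410.0),
    Span("s1", "tool_call", 0.0, 120.0, error=True),
    Span("s2", "planner", 0.003, 380.0),
]
summary = aggregate_by_session(spans)
```

A session whose error count is above zero, such as `s1` here, is the kind of case an alert or dashboard might surface for closer inspection.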

Agentic Evaluations is now available to all Galileo users.

The Latest

In MEAN TIME TO INSIGHT Episode 12, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at EMA, discusses purchasing new network observability solutions ...

There's an image problem with mobile app security. While it's critical for highly regulated industries like financial services, it is often overlooked in others. This usually comes down to development priorities, which typically fall into three categories: user experience, app performance, and app security. When dealing with finite resources such as time, shifting priorities, and team skill sets, engineering teams often have to prioritize one over the others. Usually, security is the odd man out ...

IT outages, caused by poor-quality software updates, are no longer rare incidents but rather frequent occurrences, directly impacting over half of US consumers. According to the 2024 Software Failure Sentiment Report from Harness, many now equate these failures to critical public health crises ...

In just a few months, Google will again head to Washington DC and meet with the government for a two-week remedy trial to cement the fate of what happens to Chrome and its search business in the face of ongoing antitrust court case(s). Or, Google may proactively decide to make changes, putting the power in its hands to outline a suitable remedy. Regardless of the outcome, one thing is sure: there will be far more implications for AI than just a shift in Google's Search business ... 

In today's fast-paced digital world, Application Performance Monitoring (APM) is crucial for maintaining the health of an organization's digital ecosystem. However, the complexities of modern IT environments, including distributed architectures, hybrid clouds, and dynamic workloads, present significant challenges ... This blog explores the challenges of implementing application performance monitoring (APM) and offers strategies for overcoming them ...

Service disruptions remain a critical concern for IT and business executives, with 88% of respondents saying they believe another major incident will occur in the next 12 months, according to a study from PagerDuty ...

IT infrastructure (on-premises, cloud, or hybrid) is becoming larger and more complex. IT management tools need data to drive better decision making and more process automation to complement manual intervention by IT staff. That is why smart organizations invest in the systems and strategies needed to make their IT infrastructure more resilient in the event of disruption, and why many are turning to application performance monitoring (APM) in conjunction with high availability (HA) clusters ...

In today's data-driven world, the management of databases has become increasingly complex and critical. The following are findings from Redgate's 2025 The State of the Database Landscape report ...

With the 2027 deadline for SAP S/4HANA migrations fast approaching, organizations are accelerating their transition plans ... For organizations that intend to remain on SAP ECC in the near-term, the focus has shifted to improving operational efficiencies and meeting demands for faster cycle times ...

As applications expand and systems intertwine, performance bottlenecks, quality lapses, and disjointed pipelines threaten progress. To stay ahead, leading organizations are turning to three foundational strategies: developer-first observability, API platform adoption, and sustainable test growth ...
