
Virtana announced AI Factory Observability for Dell AI Factory environments, bringing its observability platform to one of the industry’s most widely deployed enterprise AI infrastructure stacks.
The integration spans Dell PowerEdge compute, PowerScale and ObjectScale storage, high-performance networking fabrics (InfiniBand, Ethernet, and NVLink), and Dell’s Smart Fabric Manager (SFM) orchestration layer. As enterprises deploy Dell AI Factory to run GPU-intensive training and inference at scale, the operational challenge shifts from acquiring infrastructure to getting performance out of it: understanding not just whether components are running, but whether the system is producing outcomes efficiently. Virtana addresses this challenge by giving infrastructure and AI platform teams end-to-end visibility and control across every layer of the Dell AI Factory stack. Having established deep integrations with NVIDIA and Nutanix, Virtana continues to extend full-stack observability across the major ecosystem environments where enterprises are building and operating AI at scale.
“Dell AI Factory gives enterprises a world-class foundation for running AI at scale. The challenge every organization faces, regardless of platform, is connecting infrastructure performance to actual AI outcomes,” said Paul Appleby, CEO of Virtana. “Virtana solves that. We give Dell AI Factory customers the end-to-end visibility to know whether their GPUs are producing value, where constraints exist, and how to optimize the system to get more from their investment.”
Virtana AI Factory Observability integrates natively across every layer of the Dell AI Factory architecture. Rather than adding telemetry volume, Virtana connects signals across the entire stack and explains why the system behaves the way it does by correlating GPU performance with storage I/O, network fabric throughput, workload orchestration, and AI model output in a single operational view.
Virtana AI Factory Observability capabilities delivered across the Dell AI Factory stack include:
- GPU and compute performance across PowerEdge infrastructure: map utilization to workload output, expose idle and misallocated capacity, and correlate GPU performance with upstream and downstream dependencies
- Storage observability across PowerScale and ObjectScale: identify I/O latency that directly impacts training and inference, correlate data pipeline performance with model slowdown, and make storage bottlenecks visible and actionable
- Network fabric intelligence across InfiniBand, Ethernet, and NVLink: detect east-west congestion across GPU clusters, correlate fabric performance with job latency, and identify constraints that limit scaling efficiency in distributed training environments
- Cluster and fabric management visibility through SFM integration: surface workload placement behavior and provide directional insight into potential imbalances or inefficiencies, without requiring deep manual correlation across tools
- Node-level hardware intelligence from iDRAC telemetry: correlate power, thermal, and health signals with system impact to distinguish hardware issues from workload or orchestration problems
- AI workload and cost optimization: connect LLM behavior, token usage, and latency to infrastructure performance, map cost per token to actual infrastructure consumption, and enable optimization of AI economics
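As a rough illustration of what mapping cost per token to infrastructure consumption could look like, the sketch below divides the GPU spend in a time window by the tokens served in that window, and separately estimates how much of that spend went to idle capacity. It is not Virtana's implementation or API. The `GpuWindow` type, its fields, the function names, and the hourly rate are all hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class GpuWindow:
    """Hypothetical summary of one workload over one time window."""
    gpu_hours: float      # GPU-hours allocated to the workload
    busy_fraction: float  # average GPU utilization, from 0.0 to 1.0
    tokens_served: int    # tokens the workload produced in the window


def cost_per_token(window: GpuWindow, hourly_rate: float) -> float:
    """Total GPU spend in the window divided by tokens served."""
    if window.tokens_served <= 0:
        raise ValueError("tokens_served must be positive")
    return window.gpu_hours * hourly_rate / window.tokens_served


def idle_cost(window: GpuWindow, hourly_rate: float) -> float:
    """Portion of the GPU spend attributable to idle capacity."""
    if not 0.0 <= window.busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0.0 and 1.0")
    return window.gpu_hours * hourly_rate * (1.0 - window.busy_fraction)


# Example with made-up numbers: 8 GPU-hours at an assumed 2.50 per GPU-hour.
window = GpuWindow(gpu_hours=8.0, busy_fraction=0.75, tokens_served=4_000_000)
per_token = cost_per_token(window, hourly_rate=2.5)
wasted = idle_cost(window, hourly_rate=2.5)
print(f"cost per token: {per_token:.8f}, idle spend: {wasted:.2f}")
```

In practice the inputs would come from correlated telemetry rather than hand-entered values. The point of the sketch is only that cost per token depends on both spend and output, so idle or misallocated GPUs raise it even when the model itself is unchanged.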
“AI workloads at scale are complex by nature; they span GPUs, storage, networking, and orchestration. Performance depends on how all of those layers interact,” said Amitkumar Rathi, Chief Product Officer at Virtana. “The Dell AI Factory gives enterprises a powerful, integrated foundation. Virtana connects the signals across that foundation so teams can resolve issues faster, maximize GPU ROI, and scale from pilot to production with confidence.”