NeuBird AI study finds executive-engineer gap over AI in incident management
NeuBird AI has published research showing a sharp gap between executives and engineers in how they view the use of artificial intelligence in incident management. The survey also found that alert fatigue is contributing to outages.
The study surveyed 1,039 site reliability engineering, DevOps and IT operations professionals at organisations with 100 or more employees. Respondents included C-suite executives, IT and engineering leaders, and practitioners such as software engineers, system administrators, DevOps engineers and SREs.
According to the findings, 74% of C-suite respondents said their organisations actively use AI for incident management, compared with 39% of practitioners. Executives were also nearly three times as likely as practitioners to say AI had significantly reduced operational toil, at 35% versus 12%.
The figures suggest a divide between budget holders and the engineering teams working directly with operational systems. Among practitioners who do use AI tools, 28% said the tools had reduced their workload by less than 10%.
Alert burden
Incident management consumes a large share of engineering time. Most teams spend 40% or more of their time handling incidents rather than working on product development and innovation.
The burden grows when incidents affect the business. In 93% of organisations, three or more engineers are pulled in to resolve a business-impacting incident, and in nearly 40% the response involves six to ten people.
Post-incident work adds further strain. Thirty-six per cent of teams spend five to ten hours each week on incident reports and post-mortems alone, and 83% use four or more tools during a live incident.
The survey linked this environment to growing operational risk. It found that 77% of on-call teams receive at least ten alerts a day, while 57% said fewer than 30% of those alerts are actionable.
As a result, 83% said teams ignore or dismiss alerts at least occasionally. Forty-four per cent of organisations had suffered an outage in the past year directly linked to suppressed or ignored alerts.
In addition, 78% reported at least one incident in which no alert fired at all, meaning engineers discovered failures only after customers had been affected. That suggests monitoring systems are missing issues as production environments become more complex.
Downtime costs
The financial impact can be substantial. Sixty-one per cent of organisations estimated infrastructure downtime costs at least USD $50,000 an hour, while 34% put the figure at USD $100,000 or more.
Almost 60% said their mean time to resolve a critical incident was between 30 minutes and two hours. Nearly 90% handle up to 50 incidents a month, adding up to a significant cost burden.
The research also pointed to human costs. Nearly 40% of organisations said more than a quarter of their on-call engineers showed burnout symptoms related to incident management.
Respondents ranked alert fatigue and noise as their biggest challenge. Other issues included limited automation, knowledge silos, documentation gaps, difficulty identifying root causes and integration problems between tools.
Automated root cause analysis was the most common use of AI among organisations that have deployed it in incident management. Anomaly detection and prediction, along with alert correlation and noise reduction, followed.
Budget limits were cited as the main barrier to broader adoption. Respondents also raised concerns that AI could add system complexity, along with security and compliance risks.
Gou Rao, chief executive officer and co-founder of NeuBird AI, said the findings showed existing tools were struggling to keep up with modern production systems. "This data highlights a gap in how today's tools support modern production environments," Rao said. "As systems grow more complex, alert-driven approaches alone can't keep pace. Teams need AI that works alongside them to identify risks before they surface, resolve incidents faster and continuously improve operations so reliability scales with the business."
He also pointed to the cost of delays in resolving incidents. "The math is stark. At a median downtime cost between $50,000 and $100,000 per hour, a one-to-two-hour resolution window for a critical incident represents $50,000 to $200,000 in direct exposure per event, not counting the engineering hours that disappear into diagnosis, root cause analysis and post-mortems," Rao said. "MTTR is the number one KPI organizations track for incident response, which reflects how central resolution speed is to operational performance, yet most organizations are still resolving incidents the same way they were five years ago."
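The arithmetic behind Rao's exposure range can be reproduced in a few lines of Python. This is a minimal sketch and not part of the NeuBird research: the function name `exposure_range` is hypothetical, and it assumes direct cost scales linearly with downtime and excludes the engineering hours Rao mentions.

```python
def exposure_range(cost_low, cost_high, hours_low, hours_high):
    """Return (min, max) direct downtime exposure in USD for one incident.

    Assumption: cost accrues linearly per hour of downtime.
    """
    # Best case: cheapest hourly cost over the shortest resolution window.
    # Worst case: highest hourly cost over the longest resolution window.
    return cost_low * hours_low, cost_high * hours_high


# Figures quoted by Rao: $50,000 to $100,000 per hour, one to two hours.
low, high = exposure_range(50_000, 100_000, 1, 2)
print(f"${low:,} to ${high:,} per event")  # prints "$50,000 to $200,000 per event"
```

The lower bound pairs the lowest hourly cost with the shortest window and the upper bound pairs the highest with the longest, which matches the $50,000 to $200,000 range in the quote.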
Alongside the research, NeuBird said it had raised USD $19.3 million in new funding led by Xora Innovation and launched an autonomous production operations agent.