What Happens When Your Intelligence Tool Goes Down During an Active Incident

Jun 11

Before the Next Incident

It is 2:23 AM on a Tuesday when the alert fires across the security operations center at a regional water authority. The initial indicators are familiar enough: anomalous authentication attempts against an operational technology network, lateral movement signatures, a process controller queried from an endpoint that has no business touching it. The SOC analyst on duty pulls up the intelligence platform, begins correlating the indicators against known threat actor profiles, and starts building the picture that will determine whether this escalates to incident response within the hour or gets triaged down for morning review.

At 2:31 AM, the platform goes offline.

The interface returns a connection error. The analyst's working collection, the half-built threat actor correlation, and the source documents pulled since the alert fired eight minutes earlier are all inaccessible. The browser tabs are still open, but the session state is lost and the underlying data is unreachable.

The analyst still has raw telemetry from the SIEM and endpoint logs. What's missing is the analytical layer that was turning those disparate signals into an intelligible picture: the context about which threat groups target water treatment infrastructure, the behavioral pattern matching against historical intrusion data, and the structured report framework that would let them communicate findings up the chain with the speed and specificity that a potential OT network compromise demands.

The clock does not stop because the tool went down. The intrusion — if that is what this is — continues to develop. The window for containment is measured in minutes, not hours, and every minute the analyst spends reconstructing context from memory, manually cross-referencing sources that were already curated and organized, or waiting on a support queue to restore access is a minute the threat has to move laterally, establish persistence, or reach something it should never touch.

By 3:10 AM, access is restored. The platform was down for thirty-nine minutes. No data was lost. The vendor's status page would log it as a minor infrastructure disruption affecting a subset of customers.

But in the SOC's accounting, it was not minor at all. The analyst's investigation lost its continuity. The initial threat assessment, which should have been completed and escalated by 3:00 AM, was delivered at 3:47 AM after a shift change had already complicated the handoff. The utility's incident response team was notified late.

While hypothetical, this scenario is a compressed version of what happens whenever a team has built their investigation workflow around a platform that was never stress-tested as infrastructure. The tool worked exactly as advertised — right up until the moment it mattered most.

Why This Risk Is Underweighted at Procurement

Most procurement conversations about AI intelligence platforms follow a familiar pattern: capabilities get demonstrated, integrations get checked, pricing gets negotiated. Reliability, if it comes up at all, surfaces near the end.

When AI was a background tool — helping analysts draft faster, compile sources, format reports — a few hours of downtime was an inconvenience. The work would take longer, but it would get done.

Today, AI intelligence platforms have moved to the center of analytical workflows. Teams have built their collection logic, template architecture, source validation processes, and escalation cadences around the assumption that the platform is available. Institutional knowledge about how investigations are structured, how sources are weighted, how reports are formatted for specific stakeholders — much of that now lives inside the platform rather than in an analyst's head or a shared drive. When the platform goes down, analysts lose the analytical layer entirely.

This shift carries a specific implication that procurement processes haven't caught up to: evaluating an AI intelligence platform requires different criteria depending on whether it's a productivity tool or operational infrastructure. Productivity tools are evaluated on what they add. Operational infrastructure is also evaluated on what happens when it fails.

The distinction matters because the failure modes are categorically different. When a productivity tool fails, output slows. When operational infrastructure fails, the operation itself is impaired. In a SOC context, impairment during an active investigation is a gap in coverage at the moment it matters most.

What makes this risk easy to underweight at procurement is that it's almost never tested. Vendors don't demo degraded states, procurement teams don't simulate outages, and SLAs get reviewed as documents rather than as commitments with real response structures behind them. The implicit assumption throughout the evaluation process is that the platform will be available, and that assumption goes unchallenged because in a demo environment, it always is.

Organizations tend to discover this gap mid-investigation, under pressure, with no pre-established fallback. The questions of what the vendor actually commits to during an outage, how fast it responds, and whether any operational capability survives while the platform is unavailable become urgent long after the moment to negotiate them has passed.

Before any AI intelligence platform reaches final procurement consideration, one question should come first: are we treating this as a productivity tool or as operational infrastructure?

What Mission-Critical Continuity Actually Requires

When a platform becomes load-bearing in a security operation, the evaluation criteria have to reflect that. These six areas are the minimum standard.

Uptime commitments have to be specific and contractually binding. A vendor claiming 'high availability' without a published SLA is making a marketing statement. You need to find out what they guarantee, how they measure it, and what remedies exist when they miss it. For a SOC running continuous operations, the difference between 99.5% and 99.99% availability translates to nearly 44 hours of potential downtime per year versus under an hour.

Redundancy and failover architecture determine whether an outage is a momentary interruption or an operational collapse. Platforms built on single-region cloud infrastructure can fail across an entire instance when a provider experiences a regional event. The platforms that hold up during those moments are the ones architected with geographic redundancy, automatic failover to secondary environments, and load balancing designed for fault tolerance. This is infrastructure investment on the vendor's side, and it shows up, or fails to, the moment the primary environment goes offline.

Degraded-mode capability is the criterion most organizations fail to think through until they need it. What happens when one data source is down or collaborative features drop? Does core reporting remain accessible? A platform with graceful degradation keeps analysts working through a partial failure. Losing 20% of functionality mid-investigation is manageable, but losing all of it is a containment failure.

Data residency during an outage is a dimension of continuity that deserves its own conversation. When a platform goes offline, the critical question is where your data lives and whether it remains accessible. Organizations operating under data sovereignty requirements, federal compliance frameworks, or strict information security policies need explicit answers: Is data stored in isolated customer environments? Is it encrypted at rest and in transit with standards like AES-256 and TLS 1.2+? Can data be exported before or during an outage? Vendors who cannot answer these questions with specifics introduce compliance and continuity risk simultaneously.

In an operational context, support response time is a recovery metric, not a customer service one. When an analyst in an active incident cannot access their platform, the clock on that response gap runs against the investigation. The relevant question is what the vendor's committed response time is for a P1 incident at 3:00 AM on a weekend. Organizations should understand whether their contract includes a defined SLA for critical issues, what escalation paths exist, and whether they have a named point of contact or are routed through a general queue.

SLA transparency is what makes the criteria above enforceable. Each of these criteria becomes meaningful only when it is documented, published, and auditable. A verbal uptime commitment that doesn't appear in the contract is not a commitment. Security leadership and procurement teams should expect to receive, review, and retain the vendor's full SLA documentation before signing, and should treat any reluctance to provide that documentation as a signal worth taking seriously.
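
The availability figures in the first criterion are easy to check for any percentage a vendor quotes. The short Python sketch below converts an availability percentage into the downtime it permits per year. It assumes a 365-day year and treats the percentage as measured over the full year; real SLAs often measure monthly and exclude planned maintenance, so treat the output as an upper-level estimate rather than a contractual figure.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def annual_downtime_hours(availability_pct: float) -> float:
    """Downtime per year (in hours) permitted by an availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_pct / 100.0)


# Compare the tiers a vendor is likely to quote.
for pct in (99.5, 99.9, 99.95, 99.99):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% availability allows {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

At 99.5% the budget is about 43.8 hours a year; at 99.99% it is about 53 minutes.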

Questions to Ask Any Vendor Before You Commit

The conversations that happen in most vendor evaluations — features, seat counts, data encryption — won't tell you whether the platform holds up at 2 AM during an active incident. These questions will.

What is your guaranteed uptime SLA, and what are the actual remedies if you miss it? Every vendor will cite a number — 99.9 percent, 99.95 percent, five nines. What matters is what happens when they fall short. Service credits on your next invoice don't reflect the operational cost of failure during an active incident. Ask specifically whether your contract includes performance remedies that reflect real operational impact, or whether the SLA exists primarily to protect the vendor.
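
To see why the remedy matters more than the headline number, it helps to measure the outage from the opening scenario against a few common SLA tiers. The sketch below is illustrative only: it assumes a 30-day measurement window and that the 39-minute outage was the only downtime in that window, and the calendar date is an arbitrary placeholder. An actual contract defines its own window and exclusions.

```python
from datetime import datetime

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200; assumed measurement window


def measured_availability_pct(outages, period_minutes=MINUTES_PER_30_DAYS):
    """Return (availability %, downtime minutes) for a list of (start, end) outages."""
    downtime = sum((end - start).total_seconds() / 60 for start, end in outages)
    return 100.0 * (1.0 - downtime / period_minutes), downtime


# The scenario's outage: offline at 2:31 AM, restored at 3:10 AM.
# The date itself is a placeholder, not taken from the article.
outages = [(datetime(2025, 1, 7, 2, 31), datetime(2025, 1, 7, 3, 10))]

availability, downtime = measured_availability_pct(outages)
print(f"downtime: {downtime:.0f} min, measured availability: {availability:.3f}%")

for target in (99.9, 99.95, 99.99):
    status = "met" if availability >= target else "missed"
    print(f"SLA target {target}%: {status}")
```

Under these assumptions, the outage still satisfies a 99.9% monthly commitment, whose budget is 43.2 minutes, so no remedy would be owed even though the investigation lost its continuity.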

What does the platform look like during a partial outage or degraded state? This question separates vendors who have genuinely thought about resilience from those who have not. A platform that goes fully dark during infrastructure issues is a different risk profile than one that maintains core functionality — search, template access, report generation from a local cache — while backend systems recover. Ask them to walk you through what your analyst can and cannot do during a degraded state.
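
One common pattern behind this kind of degraded-mode behavior is a "last known good" cache: when the backend is unreachable, the client serves the most recent successful result and flags it as stale. The Python sketch below is a minimal illustration of that idea, not a description of how any particular platform works. The class name, the `fetch_template` function, and the template key are all hypothetical.

```python
from typing import Callable, Dict, Tuple


class LastGoodCache:
    """Serve the most recent successful result when the backend is unreachable."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # hypothetical call into the platform backend
        self._last_good: Dict[str, str] = {}

    def get(self, key: str) -> Tuple[str, bool]:
        """Return (value, stale); stale is True when the value came from the cache."""
        try:
            value = self._fetch(key)
        except ConnectionError:
            if key in self._last_good:
                return self._last_good[key], True
            raise  # nothing cached for this key: the caller sees the outage
        self._last_good[key] = value
        return value, False


# Simulated backend that can be switched off to imitate an outage.
backend = {"up": True}


def fetch_template(name: str) -> str:
    if not backend["up"]:
        raise ConnectionError("platform unreachable")
    return f"template body for {name}"


templates = LastGoodCache(fetch_template)
print(templates.get("ot-incident-report"))  # live fetch, stale is False
backend["up"] = False
print(templates.get("ot-incident-report"))  # served from cache, stale is True
```

The stale flag is the important design choice: an analyst working from cached material during an incident needs to know that it may not reflect the latest data.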

Where does my data reside during an outage, and who can access it? Data residency questions are typically asked in the context of compliance, but they matter operationally, too. During a platform disruption, what happens to the intelligence your team has been building — the source collections, the in-progress reports, the historical analysis? Can your analysts access that material through any fallback mechanism, or does it become inaccessible until the vendor restores service? For organizations in regulated sectors or with data sovereignty requirements, you also need to know whether your data moves jurisdictions during a failover event.

What is your incident response time when we have a critical production issue? Ask for specifics, not category names. Not "Priority 1 tickets receive expedited handling" — ask what expedited means in hours, which support tier that response comes from, whether that tier is available around the clock and across time zones, and whether the SLA clock starts when you submit the ticket or when a human actually reads it. Then ask whether that response commitment appears in your contract or exists only in the vendor's current support documentation, which can change.
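
The clock-start question can decide whether the same event counts as a breach. The sketch below uses an entirely hypothetical ticket timeline and a hypothetical 15-minute P1 response commitment to compare the two definitions; none of these values come from a real contract.

```python
from datetime import datetime, timedelta


def response_minutes(clock_start: datetime, first_response: datetime) -> float:
    """Minutes between the moment the SLA clock starts and the first response."""
    return (first_response - clock_start).total_seconds() / 60


# Hypothetical timeline for a single P1 ticket.
submitted = datetime(2025, 1, 7, 2, 33)       # analyst files the ticket
acknowledged = datetime(2025, 1, 7, 2, 58)    # a human first opens it
first_response = datetime(2025, 1, 7, 3, 5)   # vendor replies

sla = timedelta(minutes=15)  # hypothetical P1 response commitment

from_submission = response_minutes(submitted, first_response)
from_acknowledgement = response_minutes(acknowledged, first_response)

for label, minutes in (
    ("clock starts at submission", from_submission),
    ("clock starts at acknowledgement", from_acknowledgement),
):
    verdict = "within SLA" if timedelta(minutes=minutes) <= sla else "SLA breached"
    print(f"{label}: {minutes:.0f} min, {verdict}")
```

With this timeline, the response takes 32 minutes measured from submission and 7 minutes measured from acknowledgement, so the same ticket is a breach under one definition and compliant under the other.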

Can you show me the last three times your platform experienced unplanned downtime, and what the resolution timeline looked like? Vendors who have mature operational practices will be able to answer this directly. They will have public or shared status histories, post-incident reports, and honest timelines. Vendors who struggle with this question, or who redirect to their uptime percentage without addressing actual incident history, are signaling that transparency is not a priority. Reliability shows up in how incidents are managed and communicated, not just how rarely they occur. You want evidence of the former.

How does your architecture handle regional infrastructure failures from your cloud provider? Most AI platforms run on major cloud infrastructure, which means a regional availability event at the provider level can cascade into a platform outage. Ask whether the platform is deployed across multiple regions with automatic failover, or whether it runs out of a single region. This is particularly relevant for organizations where geographic separation matters either for compliance or for operational redundancy across distributed SOC environments.

Outages happen. No vendor can promise otherwise. What you are evaluating is whether the vendor has thought seriously about operational continuity, communicates honestly about its limits, and has built the contractual and technical infrastructure to back up what it tells you.

Know What You’re Buying

If these questions don't have easy answers, it's worth doing that evaluation before the next incident makes it urgent. Book a demo with the Indago team to walk through how the platform performs under the conditions that actually matter.

Indago Team