By Nate Rich, Principal Engineer, Foulk Consulting
It’s 2:15 AM on a Tuesday. Your phone is screaming on the nightstand. You stumble to your laptop, eyes blurring as you try to parse a critical alert, only to realize it’s a momentary CPU spike on a non-production database that self-corrected before you even logged in.
We’ve all been there. In the world of IT Operations and Site Reliability Engineering (SRE), alert fatigue isn’t just a nuisance; it’s a systemic risk. When your engineers are drowning in a sea of “noise,” they stop treating alerts with urgency. Eventually, a real, business-impacting outage gets buried in the static.
At Foulk Consulting, we advocate for a shift from “monitoring everything” to “observing what matters.” Here is our four-step framework for taming the noise and reclaiming your team’s focus.
Step 1: The Alert Audit (Inventory vs. Action)
The first step to recovery is admitting you have a noise problem. Start by pulling a report of every alert triggered in the last 30 days. Categorize them into two buckets:
- Actionable: Did this alert require a human to take a specific, immediate action to prevent or fix an issue?
- Informational: Was this just a “good to know” event? (e.g., a backup finished, or a scheduled reboot occurred).
The Rule: If an alert isn’t actionable, it shouldn’t be a page. Move informational alerts to a daily summary email or a non-urgent Slack channel. If you find yourself clicking “Acknowledge” without investigating, that alert is a candidate for the scrap heap.
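If your alerting platform can export the last 30 days of alert history, even a rough script makes this audit concrete. The sketch below assumes a hypothetical export format where each alert record carries a name, a flag for whether a human had to act, and a flag for whether it was simply acknowledged; the field names are illustrative, not any specific vendor’s schema.

```python
from collections import Counter

def audit_alerts(alerts):
    """Split an exported alert history into actionable vs. informational buckets."""
    buckets = {"actionable": [], "informational": []}
    for alert in alerts:
        key = "actionable" if alert.get("required_human_action") else "informational"
        buckets[key].append(alert)
    return buckets

def scrap_heap_candidates(alerts, ack_ratio=0.9):
    """Flag alerts that were acknowledged without investigation 90%+ of the time."""
    fired, acked = Counter(), Counter()
    for alert in alerts:
        fired[alert["name"]] += 1
        if alert.get("acknowledged_without_action"):
            acked[alert["name"]] += 1
    return sorted(name for name, total in fired.items()
                  if acked[name] / total >= ack_ratio)
```

Anything that lands in the informational bucket, or on the scrap-heap list, becomes a candidate for the daily digest rather than a page.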
Step 2: Tune Your Thresholds
Static thresholds are the primary cause of false positives. A fixed alert at 80% CPU usage might make sense for a steady-state legacy app, but in modern, elastic environments it’s a recipe for constant false alarms.
- Move to Percentiles: Don’t alert on average latency. Averages hide the “long tail” of frustrated users. Instead, alert on the 95th or 99th percentile (p95/p99), which tells you what your worst-off users are experiencing (see the sketch after this list).
- Implement “Sustained” Logic: Avoid alerting on “blips.” Instead of triggering an alert the moment a threshold is crossed, require the condition to persist for a specific window (e.g., “Latency > 500ms for 5 consecutive minutes”).
- Embrace Dynamic Baselines: Use tools that leverage machine learning to establish a “normal” range for your specific environment, accounting for time-of-day or seasonal fluctuations.
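Here is a rough sketch of the first two ideas combined: a nearest-rank percentile plus a “sustained for N minutes” check. The 500 ms threshold, the five-minute window, and the data layout are illustrative assumptions, not recommendations for your system.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile (e.g., pct=95 for p95)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def should_page(latency_windows_ms, threshold_ms=500, sustained_minutes=5):
    """Page only if p95 latency breaches the threshold for N consecutive
    one-minute windows. `latency_windows_ms` is a list of per-minute
    sample lists, most recent last."""
    recent = latency_windows_ms[-sustained_minutes:]
    if len(recent) < sustained_minutes:
        return False  # not enough history yet; stay quiet
    return all(window and percentile(window, 95) > threshold_ms
               for window in recent)
```

In practice you would express the same rule in your monitoring tool’s own query language (most support percentile functions and “for” durations natively), but the shape of the logic is the same.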
Step 3: Standardize on the “Golden Signals”
If you’re overwhelmed by which metrics to watch, simplify. Google’s SRE book popularized the Four Golden Signals, which provide a high-level health check for any user-facing system.
- Latency: The time it takes to service a request.
- Traffic: The demand being placed on your system (e.g., HTTP requests per second).
- Errors: The rate of requests that fail, either explicitly (HTTP 500s) or implicitly (the wrong content delivered).
- Saturation: How “full” your service is. This is your leading indicator for future latency issues.
By focusing your primary alerting on these four pillars, you shift attention from the cause (e.g., “CPU is high”) to the symptom (e.g., “Users are seeing errors”).
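To make that concrete, here is a minimal sketch of how the four signals might be derived from a window of raw request data. The record layout, field names, and the saturation proxy (in-flight work versus capacity) are assumptions for illustration; in a real environment these values usually come straight from your metrics platform.

```python
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    latency_p95_ms: float   # Latency: what the slowest 5% of users experience
    traffic_rps: float      # Traffic: requests per second
    error_rate: float       # Errors: fraction of failed requests
    saturation: float       # Saturation: fraction of capacity in use

def golden_signals(requests, window_seconds, in_flight, capacity):
    """Compute the four signals from (duration_ms, status_code) tuples
    collected over `window_seconds`. All names here are illustrative."""
    durations = sorted(duration for duration, _ in requests)
    p95 = durations[max(0, int(len(durations) * 0.95) - 1)] if durations else 0.0
    errors = sum(1 for _, status in requests if status >= 500)
    return GoldenSignals(
        latency_p95_ms=p95,
        traffic_rps=len(requests) / window_seconds,
        error_rate=errors / len(requests) if requests else 0.0,
        saturation=in_flight / capacity,
    )
```

Page on sustained degradation of these signals; keep cause-level metrics like CPU, memory, and queue depth on dashboards for diagnosis.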
Step 4: Silence the Noise with “Intelligent Observability”
Finally, make your tooling work for you, not against you.
- Suppression and Dependency Mapping: If your primary database goes down, you don’t need 50 alerts from every microservice that depends on it. Use dependency mapping to suppress downstream alerts and highlight the root cause.
- Composite Alarms: Trigger a critical page only when multiple conditions are met (e.g., “High Latency” AND “High Error Rate”).
- Automated Triage: For common, predictable issues, implement automated playbooks. If a service hangs, let the system attempt a restart and log the result before waking up an engineer. (A combined sketch of all three patterns follows this list.)
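The sketch below compresses the three patterns into a few functions. The dependency map, the alert and metric shapes, and the restart/health-check/paging helpers are hypothetical placeholders; real implementations would live in your alerting platform and runbook automation.

```python
# Hypothetical dependency map: service -> the upstreams it relies on.
DEPENDS_ON = {
    "checkout-service": {"primary-db"},
    "cart-service": {"primary-db"},
}

def suppress_downstream(active_alerts):
    """Hide alerts from services whose upstream dependency is already alerting,
    so only the root cause reaches the on-call engineer."""
    alerting_services = {alert["service"] for alert in active_alerts}
    return [alert for alert in active_alerts
            if not (DEPENDS_ON.get(alert["service"], set()) & alerting_services)]

def composite_page(metrics):
    """Composite alarm: page only when latency AND error rate are both bad."""
    return metrics["latency_p95_ms"] > 500 and metrics["error_rate"] > 0.05

def triage_hung_service(service, restart_service, is_healthy, page_oncall):
    """Automated playbook: attempt one restart and only page if it didn't help."""
    restart_service(service)
    if not is_healthy(service):
        page_oncall(f"{service} still unhealthy after automated restart")
```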
The Path to Maturity
Reducing alert fatigue isn’t a one-time project; it requires a culture of continuous improvement. By auditing your alert inventory, tuning your thresholds, and focusing on the Golden Signals, you transform your monitoring from a source of stress into a strategic asset.
At Foulk Consulting, we help organizations bridge the gap between complex infrastructure and intelligent observability. If your team is struggling to see the signal through the noise, let’s talk about building a framework that lets you sleep through the night.

***
Nate Rich is a Principal Engineer at Foulk Consulting, where he helps enterprises master performance engineering and full-stack observability.