Why Your Spark Jobs Are Failing at 2am (And Why You're the Last to Know)

May 26, 2026

10 minute

Failures of Spark jobs during off-hours can significantly disrupt workflows, causing delays in data pipelines and requiring unplanned incident responses. Built-in and external monitoring tools, essential Spark metrics, and proactive alerting strategies can identify issues before they escalate into critical production incidents.

Imagine this: your phone buzzes at 6 AM with an urgent message from a downstream analyst reporting missing data. In a rush, you check the Spark History Server, sift through event logs, inspect kubectl output, and finally trace the failure back to an executor that crashed at 2:17 AM. The job silently retried, encountered the same issue, and ultimately failed. By the time the problem was identified, four hours’ worth of pipeline output had been lost.

This scenario highlights the pitfalls of Apache Spark monitoring that is primarily reactive. While the built-in UI can show what has already occurred, what you truly need is a monitoring stack that informs you of potential issues before they arise.

In the sections below, we will explore failure modes, critical metrics, Prometheus/Grafana setup, effective alerting patterns, executor sizing, and diagnostics for OOMKill events.

Why Spark Jobs Fail in the Middle of the Night

Failures that occur during the night are often not random; they tend to fit within a limited range of recognizable categories, each accompanied by specific signals you can monitor.

Data Skew and Stragglers: Tasks that handle an excessive proportion of shuffle data can cause runtime delays. To diagnose, use the Spark web UI stage detail to view task duration distributions and any shuffle read/write imbalances.
Executor Memory Pressure and GC Churn: These issues typically manifest before an outright job failure. The Executors tab provides crucial information about GC time and memory usage. According to Spark’s tuning guide, instances of complete garbage collections (GC) before a task completes are strong indicators that the executor may not have sufficient memory.
Unhandled Application Exceptions: These tend to be subtler. Failed stages display reasons in the Stages view, and you can access the Spark REST API at /api/v1 to programmatically identify failed or completed applications—highlighting the importance of alerting hooks.
Resource Contention and Kubernetes Disruptions: These can be the hardest to detect without specialized tools. If a container exceeds its memory limit, the Linux OOM killer may terminate it, which appears as a terminated Kubernetes pod with a reason of OOMKilled and an exit code of 137, visible via kubectl describe pod.

The common denominator in all these failure modes is the existence of detectable signals, contingent upon the monitoring infrastructure you implement.

Spark Metrics That Reveal Trouble Early

Understanding the right Spark metrics to collect begins with knowing their sources. Spark has distinct metrics originating from the driver and executors.

Driver Metrics: These highlight application-level behaviors, including job scheduling, directed acyclic graph (DAG) planning, and resource requests.
Executor Metrics: These are crucial for operational stability, including JVM heap usage, GC behavior, and task throughput. Metrics are sent from executors to drivers as part of heartbeats and can be accessed via the REST endpoint at /applications/[app-id]/executors.

Rather than applying fixed global thresholds, calibrate against your actual workload. Spark's tuning documentation recommends analyzing real GC statistics and storage behavior to understand memory pressure in the context of your specific jobs.

Configuration is handled through $SPARK_HOME/conf/metrics.properties or the spark.metrics.conf property. Available sinks include MetricsServlet (JSON format) and the experimental PrometheusServlet. Keep overhead low: enable only the sinks you need, and tune Prometheus scrape intervals and timeouts to match your alerting requirements.

Built-In Spark Monitoring Tools (UI and History Server)

The Spark web UI is the quickest way to investigate during live runs. The Jobs view provides an event timeline along with a DAG visualization. The Stage detail section reveals GC times, peak execution memory, and a detailed breakdown per task, while the SQL tab associates query durations with logical and physical execution plans.

However, the UI is ephemeral by default. If an application stops running, the UI disappears unless you explicitly enable event logging by setting spark.eventLog.enabled=true and defining the location with spark.eventLog.dir. The History Server can then reconstruct the UI from preserved logs, allowing for post-mortem investigation.

Managing log volume is simplified with controls in the History Server, such as:

spark.history.fs.cleaner.enabled
spark.history.fs.cleaner.maxAge
spark.history.fs.cleaner.maxNum

For applications that run continuously, consider enabling rolling event logs with spark.eventLog.rolling.enabled to facilitate management.

Security Considerations: Security is paramount in multi-tenant environments. Spark’s security model supports servlet-filter authentication for the History Server, alongside authorization settings to manage access control effectively.

While the built-in tools provide valuable insights, they lack capabilities such as persistent time-series storage, alert routing, silencing, or cross-signal correlation. Thus, for holistic monitoring, an external layer is essential.

Setting Up Prometheus and Grafana for Apache Spark

To route Prometheus data from Spark, start by activating the executor metrics endpoint. Set spark.ui.prometheus.enabled=true in your Spark configuration — this exposes executor metrics at /metrics/executors/prometheus on the driver UI. Also configure PrometheusServlet as a sink in metrics.properties.

Note: Spark's documentation marks PrometheusServlet as experimental. Evaluate stability carefully before relying on it in production.

Prometheus setup: Create a scrape_config block specifying job_name, metrics_path, scrape_interval, and scrape_timeout. Before building dashboards, confirm the target appears as "up" in your Prometheus targets view.

Comparison of Spark Monitoring Approaches

A useful Grafana dashboard for Spark should cover executor memory and GC indicators, shuffle read/write pressure, job and stage success/failure counts, and Prometheus scrape health.

Approach	How It Works	Pros	Cons
Spark Web UI + History Server	During live runs and post-termination via event logs	Extensive per-run investigation capabilities	No time-series; limited for alerting
Spark REST API (/api/v1) polling	Programmatically polls driver/SHS endpoints	Programmatic detection of state changes	Coarse polling, not PromQL-native
MetricsServlet (JSON)	JSON metrics served via Spark UI	Lightweight and uses native configurations	Requires a translation layer for Prometheus
PrometheusServlet (experimental)	Provides Prometheus format metrics	Direct path for Prometheus scrapes	Experimental status, potentially unstable
Executor Prometheus endpoint	Active through spark.ui.prometheus.enabled=true	Native executor health metrics	Requires secure exposure for access

Apply the same security standards to metrics endpoints that you apply to the History Server: authentication, ACLs, and restricted access to /api/v1. Verify the full setup by checking http://<driver>:4040/api/v1 and confirming Prometheus is successfully scraping the expected path.

Proactive Alerting: Stopping 2 AM Failures Before They Escalate

Static thresholds set without baseline analysis create noise or miss the signals that matter. Instead, build symptom chains from your workload's baseline metrics. Escalating GC time coupled with rising peak execution memory and stalled stage progress is a meaningful alert pattern. Each signal alone might be noise; together, they indicate a problem in progress.

Alertmanager handles the mechanics of routing, grouping, deduplication, silencing, and inhibition. Use inhibition rules to suppress downstream alerts when a root-cause alert fires. For example, if an executor memory exhaustion alert fires, suppress the resulting task and job failure alerts that stem from it.

Grafana Alerting integrates with Slack, webhooks, PagerDuty, and OpsGenie. Use notification policies for severity-based routing: GC pressure warnings route to Slack; executor failures page the on-call responder. For planned maintenance, use silences for one-off events and mute timings for recurring quiet windows.

Runbooks reduce MTTR. Every alert should link to a runbook that covers, at minimum:

Checking the Spark UI for stage failure reasons
Reviewing executor GC times and memory peaks
Verifying pod termination reasons via kubectl describe pod

This compresses the first phase of incident response without overcomplicating resolution steps.

Right-Sizing Spark Executor Instances for Stability

The stability of Spark executor instances hinges on a comprehensive understanding of how executor count, cores per executor, and memory interact. Enhancing parallelism can shrink per-task working set sizes—critical for avoiding out-of-memory (OOM) events. Spark’s tuning guide suggests beginning with a target of 2 to 3 tasks per CPU core.

Memory allocated for each executor container combines:

spark.executor.memory
spark.executor.memoryOverhead
spark.memory.offHeap.size
spark.executor.pyspark.memory

Within Kubernetes environments, Spark's Kubernetes documentation notes that spark.kubernetes.memoryOverheadFactor defaults higher for non-JVM workloads due to their tendency to fail with Memory Overhead Exceeded errors.

Workload-to-configuration mapping

Dynamic allocation introduces a specific monitoring wrinkle: when executors are removed, shuffle files they held may need recomputation. Spark's job scheduling documentation covers the external shuffle service as the mitigation. Monitor executor churn alongside job duration to catch operational thrashing before it becomes a problem.

Workload Type	Common Instability Driver	Executor Configuration Approach	Validation Signals
Wide transformations / heavy shuffles	Large per-task working set; GC churn	Increase parallelism; fine-tune execution memory per observed GC frequency	Monitoring peak execution memory, GC times, and long-tail tasks
Long-running streaming apps	Accumulating event logs; UI degradation	Enable rolling event logs; configure SHS compaction with known lossy trade-offs	Assess SHS response; control event log directory growth
Multi-tenant clusters with spiky demand	Over/under provisioning issues	Consider dynamic allocation with an external shuffle service to preserve shuffle files	Monitor executor addition/removal rates; evaluate job duration variance
Spark on Kubernetes	OOMKilled due to container limits	Adjust total container memory across the four components; apply overhead factors for non-JVM workloads	Verify kubectl describe pod exit code 137; correlate executor GC/memory signals

The right-sizing loop is iterative: observe GC times, peak execution memory, and shuffle behavior via the UI or Prometheus → adjust → redeploy → repeat. No formula replaces this cycle for your specific workloads.

Diagnosing OOMKilled Spark Executors and Other Common Errors

A spark oomkilled event can be traced on two fronts: Kubernetes and Spark. Use kubectl describe pod on Kubernetes to examine the terminated container’s final state, looking for Reason: OOMKilled and Exit Code: 137, along with timestamps.

Microsoft's AKS troubleshooting guidance reinforces this as the standard protocol.

On the Spark side, align the timestamps with the Executors tab in the UI or SHS, examining spikes in GC time, memory pressure trends, and shuffle imbalances to pinpoint the source of memory exhaustion.

OOMKilled executor diagnostic checklist:

Confirm OOMKilled reason and Exit Code: 137 via kubectl describe pod and note the timestamp
Ensure total Spark memory configuration fits within container limits
Analyze executor GC times and shuffle behavior
Check stage-level peak execution memory
Increase executor memory or rebalance parallelism if needed
Adjust workload distribution to reduce per-task memory pressure
Enable metrics collection for future observability

For non-OOM failures, begin by examining failure reasons in the Stage view. For deeper insights, consult the Spark configuration reference.

Your 2 AM Calls Are Optional — If You're Watching the Right Signals

Spark failures at 2 AM aren't inevitable. They're the result of specific monitoring gaps: missing event logs, no time-series metrics, poor alert routing, no runbooks. Spark's own documentation frames effective monitoring as a composite — web UIs, metrics, and external instrumentation working together.

The stack covered here reflects that layered approach: built-in tools for investigation → Prometheus and Grafana for visibility and alerting → proactive alert workflows → executor right-sizing → structured OOMKill diagnostics.

For teams running Spark on Kubernetes at scale, xLake's pod-level observability brings driver logs, executor health, OOMKill signals, and spot eviction reasons into a unified control plane—eliminating the multi-tool context-switching that lets incidents go undetected overnight.

Ready to stop reacting to Spark failures after the fact? See how xLake's unified observability surfaces problems before your users do — book a demo at acceldata.io.

Apache Spark Monitoring: Frequently Asked Questions

How do I enable metrics in production without impacting performance?

Limit your configuration to essential sinks only, and use conservative Prometheus scrape settings aligned with your actual alerting needs. Avoid enabling every available sink by default.

What's the typical overhead of enabling Spark monitoring?

It depends on which sinks are enabled, scrape frequency, and total metric volume. Start conservative, measure the impact on your workloads, and adjust from there.

How long should I retain Spark History Server data?

Control retention with:

spark.history.fs.cleaner.enabled
spark.history.fs.cleaner.maxAge
spark.history.fs.cleaner.maxNum

Set maxAge based on how far back your incident investigations typically need to go.

How do I secure metric endpoints in multi-tenant clusters?

Enable authentication on the History Server, configure ACLs, and restrict access to /api/v1 endpoints. Apply the same controls to Prometheus scrape targets and Grafana data sources.

What should I do first when a Spark job fails silently overnight?

Open the Spark History Server
Check stage failures for failure reasons
Review executor GC times and memory peaks
Inspect the Kubernetes pod via kubectl describe pod

About Author