My First Bad Databricks Execution Plan

Lately we’ve been tuning and refactoring several Databricks workflows. Nothing radical — mostly sensible engineering improvements:

replacing count-based existence checks with is_empty
swapping single-task loops for foreach tasks
upgrading Databricks runtimes

The code is shared across five datasets, so we expected broad performance gains.

During testing, everything looked promising. Cluster runtimes dropped, DBU usage improved, and the jobs consistently finished faster.

After a long regression-testing cycle, we released to production.

Four of the five datasets behaved exactly as expected: runtimes dropped from roughly an hour down to 20–30 minutes.

Then the fifth dataset did the opposite.

A workflow that normally completed in ~45 minutes suddenly started taking over an hour.

Initial Investigations

Looking at the foreach task, two iterations stood out immediately. While most iterations completed in under five minutes, these two were each taking close to an hour.

The odd part was that they didn’t correspond to unusually large partitions or datasets — at least not based on the metrics we initially checked.

Digging deeper into the slow iterations, we found two non-lazy operations taking significant runtimes:

a query retrieving the max timestamp
an is_empty existence check

Both queries were taking around 15 minutes each.

That was unexpected, especially given how lightweight those operations normally are.

The not so successful attempts at tuning

We tried several tuning approaches that seemed likely to help:

Changed the sequence of the foreach iterations so tasks wouldn't run concurrently and contend for resources.
Bumped up driver/executor resources to see if it was resource starvation.
Tweaked Spark parallelism and shuffle partitioning.
Looked for skew and tried redistributing keys.
Considered setting the concurrency to 1 but decided it might make the job run for 2-3 hours.
Considered gathering stats but saw the full stats were already there.
Enabled (and disabled) a few runtime optimizations.

Those only cut off less then 5 minutes. Not enough. How can the serialized operation run faster than the foreach task in this scenario. The problem remained stubbornly intermittent.

What finally helped (the fix)

The problem kept bothering me over the weekend. It started to feel very similar to the old days of wrestling with the Oracle optimizer and comparing execution plans.

Back on Monday, I dug deeper into the Spark UI, looking for the SQL DataFrames behind the slow queries.

Once I found them, I used AI to help translate the Spark execution plans into something closer to the Oracle-style plans I was more familiar with reading.

The problem kept bothering me over the weekend. It started to feel very similar to the old days of wrestling with the Oracle optimizer and comparing execution plans.

Back on Monday, I dug deeper into the Spark UI, looking for the SQL DataFrames behind the slow queries.

Once I found them, I used AI to help translate the Spark execution plans into something closer to the Oracle-style plans I was more familiar with reading.

Takeaways

After solving that, we could finally celebrate. Four datasets got dramatically faster, and the fifth turned out to be a reminder that performance tuning is rarely as straightforward as “newer runtime = better runtime.”

A few things stood out from this exercise:

Runtime upgrades deserve the same scrutiny as application changes.
We treated the DBR upgrade as a low-risk improvement bundled into a broader refactor. In reality, it changed query behaviour enough to introduce a major regression on a specific workload. Runtime upgrades should go through the same level of regression and performance testing as code changes.
Execution plans still matter.
Even in managed platforms like Databricks, understanding what the engine is actually doing remains critical. The issue only became visible once we dug into the Spark UI and compared execution plans closely enough to spot the bloom filter difference. The tooling changes, but the skill of reading plans is still invaluable.
Parallelism can expose problems that serialized workloads hide.
The foreach implementation surfaced behaviour that never appeared in the single-task version. Concurrency, optimizer choices, and runtime heuristics can interact in unexpected ways, especially under production-scale workloads.
Non-production environments rarely behave exactly like production.
We never reproduced the slowdown outside production, despite extensive testing. That highlighted gaps in our validation approach — dataset shape, concurrency patterns, cluster sizing, and workload characteristics all matter when testing performance-sensitive changes.
Observability pays for itself during incidents.
Better instrumentation would have shortened the investigation significantly. Faster access to execution plans, clearer query-level metrics, and more targeted logging would have reduced a lot of the guesswork.

The silver lining is that these are the kinds of problems that sharpen engineering instincts. There’s something oddly satisfying about chasing down a subtle optimizer regression, especially when the final fix turns a one-hour task back into a two-minute operation.

My First Bad Databricks Execution Plan

Initial Investigations

The not so successful attempts at tuning

What finally helped (the fix)

Takeaways

Comments

Laying bricks with Databricks

IAC Face-Off: Databricks Asset Bundles vs Terraform

More from this blog

Databricks Antipatterns: Two Hard Lessons from the Field

Kurdapyo's Newbie Guide to PySpark

Counting when it really counts

From For Loops to For Each

Command Palette

Initial Investigations

The not so successful attempts at tuning

What finally helped (the fix)

Takeaways

Comments

Laying bricks with Databricks

IAC Face-Off: Databricks Asset Bundles vs Terraform

More from this blog