Databricks can have bad plans too
Lately we’ve been tuning and refactoring several Databricks workflows. Nothing radical — mostly sensible engineering improvements:
replacing count-based existence checks with
is_emptyswapping single-task loops for
foreachtasksupgrading Databricks runtimes
The code is shared across five datasets, so we expected broad performance gains.
During testing, everything looked promising. Cluster runtimes dropped, DBU usage improved, and the jobs consistently finished faster.
After a long regression-testing cycle, we released to production.
Four of the five datasets behaved exactly as expected: runtimes dropped from roughly an hour down to 20–30 minutes.
Then the fifth dataset did the opposite.
A workflow that normally completed in ~45 minutes suddenly started taking over an hour.
Initial Investigations
Looking at the foreach task, two iterations stood out immediately. While most iterations completed in under five minutes, these two were each taking close to an hour.
The odd part was that they didn’t correspond to unusually large partitions or datasets — at least not based on the metrics we initially checked.
Digging deeper into the slow iterations, we found two non-lazy operations taking significant runtimes:
a query retrieving the max timestamp
an
is_emptyexistence check
Both queries were taking around 15 minutes each.
That was unexpected, especially given how lightweight those operations normally are.
The not so successful attempts at tuning
We tried several tuning approaches that seemed likely to help:
Changed the sequence of the
foreachiterations so tasks wouldn't run concurrently and contend for resources.Bumped up driver/executor resources to see if it was resource starvation.
Tweaked Spark parallelism and shuffle partitioning.
Looked for skew and tried redistributing keys.
Considered setting the concurrency to 1 but decided it might make the job run for 2-3 hours.
Considered gathering stats but saw the full stats were already there.
Enabled (and disabled) a few runtime optimizations.
Those only cut off less then 5 minutes. Not enough. How can the serialized operation run faster than the foreach task in this scenario. The problem remained stubbornly intermittent.
What finally helped (the fix)
The problem kept bothering me over the weekend. It started to feel very similar to the old days of wrestling with the Oracle optimizer and comparing execution plans.
Back on Monday, I dug deeper into the Spark UI, looking for the SQL DataFrames behind the slow queries.
Once I found them, I used AI to help translate the Spark execution plans into something closer to the Oracle-style plans I was more familiar with reading.
It looked like a subtle optimizer change, but it had a massive impact on those two queries under concurrency.
That narrowed the investigation to one major variable: the DBR upgrade from 15.4 to 16.4.
After validating the hypothesis, we rolled the runtime back to 15.4.
The result was immediate:
total job runtime dropped to 11 minutes
the two problematic iterations fell from over an hour to roughly two minutes
Interestingly, the refactor itself wasn’t the real problem. The new concurrency model simply exposed an optimizer behaviour change that had never surfaced in the serialized implementation.
Takeaways
After solving that, we could finally celebrate. Four datasets got dramatically faster, while the fifth reminded us that performance tuning is rarely as simple as “newer runtime = better runtime.”
A few things stood out from this exercise:
Runtime upgrades deserve the same scrutiny as application changes.
We treated the DBR upgrade as a low-risk improvement bundled into a broader refactor. In reality, it introduced different optimizer behaviour that caused a major regression for a specific workload.Runtime upgrades aren’t just infrastructure changes — they can alter execution plans, heuristics, and query strategies in ways that materially affect performance. They deserve the same level of regression and performance testing as application code changes.
Execution plans still matter.
Even in managed platforms like Databricks, understanding what the engine is actually doing remains critical.The breakthrough only came after digging through the Spark UI and comparing execution plans closely enough to notice that the faster plan was using a bloom filter while the slower one was not.
The tooling may have changed, but the engineering skill of reading and reasoning about execution plans is still incredibly valuable.
Parallelism can expose problems that serialized workloads hide.
The foreach implementation surfaced behaviour that never appeared in the single-task version. Concurrency, optimizer choices, and runtime heuristics can interact in unexpected ways, particularly at production scale.Non-production environments rarely behave exactly like production.
We never reproduced the slowdown outside production, despite extensive testing.The issue only surfaced under production concurrency and workload characteristics, which exposed gaps in our validation approach. Dataset shape, partition distribution, cluster sizing, and concurrent execution patterns all matter when testing performance-sensitive changes.
Observability pays for itself during incidents.
Better instrumentation would have shortened this investigation significantly.Faster access to execution plans, clearer query-level metrics, better task-level visibility, and more targeted logging would have reduced a lot of the guesswork and helped isolate the regression earlier.
Performance tuning becomes dramatically easier when the system makes its decisions observable.
The silver lining is that these are the kinds of problems that sharpen engineering instincts. There’s something deeply satisfying about tracking down a subtle optimizer regression — especially when the final fix turns an hour-long task back into a two-minute operation.
Addendum
After the typical back-and-forth shuffle with Databricks Support, they identified a new Spark planning optimisation in DBR 16.4 called CollapsePartialAndFinalAggregates as the root cause of the plan change. This feature is enabled by default in this version. To disable it, set the Spark configuration spark.databricks.execution.collapsePartialAndFinalAggregates.enabled to false.