Databricks Antipatterns: Two Hard Lessons from the Field
I’m the kuya at Kurdapyo Labs — a recovering Oracle developer who saw the light and helped migrate legacy systems out of Oracle (and saved a lot of money doing it). I used to write PL/SQL, Perl, ksh, Bash, and all kinds of hand-crafted ETL. These days, I wrestle with PySpark, Airflow, Terraform, and YAML that refuses to cooperate. I’ve been around long enough to know when things were harder… and when they were actually better. This blog is where I write (and occasionally rant) about modern data tools — especially the ones marketed as “no-code” that promise simplicity, but still break in production anyway.
Disclaimer: These are my thoughts—100% my own, not my employer’s, my client’s, or that one loud guy on tech Twitter. I’m just sharing what I’ve learned (and unlearned) along the way. No promises, no warranties—just real talk, some opinions, and the occasional coffee/beer-fueled rant.
If something here helps you out, awesome! If you think I’ve missed something or want to share your own take, I’d love to hear from you. Let’s learn from each other.
Experience is a great teacher. After several years working with Databricks, I’ve noticed patterns that made perfect sense at the start but eventually became costly. At small scale they work fine — we get away with them for months or even years — until an edge case hits or volumes grow beyond what we expected. The event we joked was rarer than winning the Powerball happens, or that hypothetical load we thought we'd never reach becomes reality.
This is 2 of those hard-earned lessons: antipatterns I now watch for and the guidance I share when I see them creeping into projects.
Single Row DMLs
Many engineers coming from a stored-procedure or OLTP background (I was one of them) naturally reach for single-row INSERT/UPDATE/DELETE operations: write a checkpoint row, flip a flag, append a log entry — all from the same SQL client used for development. That feels simple and fast during development, and it usually “just works” at low scale.
Why this becomes a problem on Delta
Every individual DML tends to produce a tiny file. Tens, then thousands, then millions of those files drive:
poor read performance (many small files = high task scheduling and IO overhead),
long metadata operations (Delta/Apache Spark must track many files and transactions),
higher storage and management costs,
increased maintenance (VACUUM, compaction, and longer checkpoint/commit times).
Single-row DMLs also make compaction and concurrency trickier — your housekeeping jobs have to work harder and run more frequently.
Practical guidance and alternatives
Avoid single-row transactional writes to Delta tables whenever possible.
Prefer append-only, batched writes:
accumulate events/changes and write them in bulk (micro-batches) rather than per row,
use a streaming approach (Structured Streaming) with appropriate micro-batch sizes and checkpointing for incremental, bounded writes.
For upserts/merges, use MERGE with a source batch (not one-row MERGEs). Build a small staging dataset of changes and MERGE that in one operation.
Control file count at write time:
repartition/coalesce to a reasonable number of output files before writing,
avoid writing one tiny file per write operation.
If you must persist fine-grained events, write them to append-only files (Delta or Parquet) and compact periodically.
Use Delta compaction tools on a schedule (OPTIMIZE/compaction jobs) to consolidate small files, and monitor file counts so compaction is run before it becomes painful.
Prefer external logging/monitoring for operational logs: cluster logs, Databricks/job metrics, or a centralized logging system (e.g., ELK, Splunk, cloud logging). Application-level logs in Delta are only justified if you need them for audit or replay — and then treat them as batched event streams, not single-row table DMLs.
When you can’t avoid single-row operations
If strict transactional single-row updates are unavoidable, isolate that usage:
put such records in a dedicated table with a clear compaction policy,
limit retention or TTL to keep the table small,
schedule frequent, automated compaction and vacuuming,
and instrument monitoring to alert when small-file counts grow.
Bottom line Single-row DMLs are easy and familiar, but they don’t scale well on Delta. Design for batched or append-only writes, use MERGE for bulk upserts, and rely on compaction/OPTIMIZE and centralized logging when possible to avoid the small-files and metadata headaches.
dbutils.fs — convenience vs. scalability
dbutils.fs is incredibly handy in notebooks: quick file reads/writes, directory listings, tiny helper copies and deletes. That convenience makes it an easy go-to during development — but several common uses become antipatterns in production.
Common antipatterns
Using
dbutils.fs.putor repeateddbutils.fs.cp/mv/rmin tight loops to emit many small files (e.g., one file per event or checkpoint).Relying on file presence /
_SUCCESSfiles as locks or transactional signals (race conditions and non-atomic checks).Using
dbutils.fs.lsto enumerate very large directories from notebooks (slow, paginated REST calls).Using dbutils for large data movement (instead of Spark, cloud SDKs or storage-native tools).
Storing secrets/credentials in files written via dbutils or using it to persist sensitive info.
Orchestrating business logic by chaining many dbutils operations synchronously (blocking, brittle).
Using dbutils.fs for compaction or heavy maintenance in production without retry/monitoring.
Why these are a problem
Many dbutils.fs calls translate to REST operations against cloud storage; they’re relatively slow and expensive when done at scale.
They don’t provide the transactional/atomic semantics Delta or Spark offer — leads to races and inconsistent state.
Repeated small-file writes create the same small-files and metadata explosion problems described earlier.
Lack of retries, backoff, and backpressure patterns in ad-hoc uses makes jobs brittle and hard to debug.
Practical guidance and alternatives
For data writes, prefer Spark/DataFrame APIs or Delta write APIs:
- Use DataFrame.write.format("delta").mode("append").save(path) or Structured Streaming with controlled micro-batch sizing instead of
dbutils.fs.put.
- Use DataFrame.write.format("delta").mode("append").save(path) or Structured Streaming with controlled micro-batch sizing instead of
For bulk copy/move of large datasets, prefer:
- Spark read/write (distributed), cloud storage SDKs (AWS S3 SDK, Azure Data Lake SDK) or storage-native copy tools — not
dbutils.fs.cpfor terabytes of data.
- Spark read/write (distributed), cloud storage SDKs (AWS S3 SDK, Azure Data Lake SDK) or storage-native copy tools — not
For listing/metadata at scale, prefer:
- Use Spark’s distributed listing (spark.read or cloud SDK with pagination) or storage-native APIs that support efficient pagination.
For coordination and locking, prefer:
- Use Delta transactions, a small metadata table (Delta) for control records, or an external lock service (Azure Blob lease, S3 atomic rename patterns, Redis, etc.), not file-existence checks.
For event/app-level logs, write append-only files in batches and compact (OPTIMIZE) periodically; avoid single-row file writes.
For secrets, use secret scopes / cloud KMS; never write credentials with dbutils.fs to storage.
Add robust retry/backoff, idempotency, and alerting around any required dbutils.fs calls to handle transient errors.
Safe uses of dbutils.fs
Quick inspections in interactive sessions (
dbutils.fs.lsof small directories,headfor a sample).Small administrative tasks in low-volume jobs, with clear TTL and compaction policy.
One-off notebook utilities, not core production pipelines.
Operational checklist
Audit notebook/job code for dbutils.fs usage in loops or per-row logic.
Replace per-row file writes with batched writes or Delta upserts.
Add metrics: file count, average file size, listing latencies; alert when thresholds exceeded.
Schedule compaction/OPTIMIZE jobs before file counts get out of hand.
Use cloud-native locking or Delta where atomicity is required.
Bottom line dbutils.fs is great for development and light administrative tasks, but it’s not a substitute for distributed, transactional, or high-throughput I/O patterns. Replace repetitive or large-scale dbutils.fs usage with Spark/Delta writes, cloud storage SDKs, or dedicated coordination mechanisms — and treat any remaining dbutils operations as small, monitored, and idempotent utilities.
Closing thoughts
Experience taught me that solutions that feel “simple and fast” in development often turn expensive at scale. Small decisions—like doing one-row DMLs from a notebook—compound into operational debt: slower queries, heavier maintenance, and surprising costs. The best defense is awareness plus a few engineering guardrails: batch where possible, use the platform’s compaction and write-optimization features, and add monitoring to catch small-file growth early. Those moves keep your Delta tables healthy and let you scale from a prototype to production without painful rewrites.