<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[kurdapyo does data]]></title><description><![CDATA[kurdapyo does data]]></description><link>https://data.kurdapyo.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 24 Apr 2026 18:54:37 GMT</lastBuildDate><atom:link href="https://data.kurdapyo.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Counting when it really counts]]></title><description><![CDATA[A seasoned data engineer is deliberate on their use of expensive operations such as counts. It’s fairly common to see a misuse of this. Such as using a count to check if a query returns some rows. Or running a count query to figure out the number of ...]]></description><link>https://data.kurdapyo.com/counting-when-it-really-counts</link><guid isPermaLink="true">https://data.kurdapyo.com/counting-when-it-really-counts</guid><category><![CDATA[Databricks]]></category><category><![CDATA[dataframe]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Mon, 03 Nov 2025 11:16:29 GMT</pubDate><content:encoded><![CDATA[<p>A seasoned data engineer is deliberate in their use of expensive operations such as counts. Misuse is fairly common: using a count just to check whether a query returns any rows, or running a separate count query to figure out the number of records to be inserted by an insert-select statement.</p>
<p>Frankly speaking, doing a select count just to check whether a query will return any rows is inefficient, and many systems I’ve come across are littered with examples of this. I won’t dwell on that too much here.</p>
<p>What I want to discuss is how to get row counts after running <code>dataframe.write</code> statements in Spark.</p>
<h2 id="heading-the-old-traditions">The Old Traditions</h2>
<p>Let me start with how it used to be done with traditional databases. Across database systems, the method for determining the number of inserted rows differs. In MySQL, the <code>ROW_COUNT()</code> function returns the count of rows affected by the preceding <code>INSERT</code> statement. PostgreSQL offers a <code>RETURNING</code> clause within the <code>INSERT</code> statement itself, which can provide a count or return the inserted data. In SQL Server, the <code>@@ROWCOUNT</code> system variable holds the number of rows impacted by the most recent statement. Snowflake provides a <code>SQLROWCOUNT</code> variable for this purpose, specifically for DML statements within its scripting environment.</p>
<p>Amazon Redshift is not really a traditional database, but it is a fork of good old Postgres. Obtaining an accurate insert count there typically involves querying the <code>stl_insert</code> system table for the query ID of the recent <code>INSERT</code> operation, which can be more complex due to the distributed nature of the database. These variations mean developers must use the correct function or variable for their specific database to get an immediate and accurate count of inserted rows.</p>
<p>Simply put, there used to be return values or special variables that provided a way to count.</p>
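<p>As a small illustration of this pattern (a sketch using Python’s <code>sqlite3</code> as a stand-in; SQLite is not one of the databases above, but its driver exposes the same idea through <code>cursor.rowcount</code>):</p>
<pre><code class="lang-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER)")
conn.executemany("INSERT INTO src VALUES (?)", [(1,), (2,), (3,)])
conn.execute("CREATE TABLE dst (id INTEGER)")

# The affected-row count comes back on the cursor, much like
# ROW_COUNT() or @@ROWCOUNT would after an insert-select.
cur = conn.execute("INSERT INTO dst SELECT id FROM src")
print(cur.rowcount)  # 3
</code></pre>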
<h2 id="heading-breaking-from-tradition">Breaking from tradition</h2>
<p>Redshift was already hinting at the counting difficulties that come with distributed databases.</p>
<p>In my early days using Databricks, I tried this:</p>
<pre><code class="lang-python">results = df.write \
  .format(<span class="hljs-string">"csv"</span>) \
  .option(<span class="hljs-string">"header"</span>, <span class="hljs-string">"true"</span>) \
  .mode(<span class="hljs-string">"overwrite"</span>) \
  .save(output_path)

print(results)
</code></pre>
<p>What a surprise to see this back:</p>
<blockquote>
<p><code>&gt; None</code></p>
</blockquote>
<p>A common workaround is to call <code>df.count()</code> either before or after the operation. This is okay in cases where the data is not changing between operations. However, it means that the dataframe will actually be executed twice, albeit the count run would be optimized to just produce a count.</p>
<p>I explored this a bit further and found that accumulators can be used.</p>
<pre><code class="lang-python">sc = spark.sparkContext
# PySpark accumulators are created via sc.accumulator (longAccumulator is Scala-only)
rows_written_acc = sc.accumulator(0)

def increment_counter_and_return(row):
    rows_written_acc.add(1)
    return row

# Route the write through the counting map; writing the original df
# would never trigger the accumulator
df_counted = df.rdd.map(increment_counter_and_return).toDF(df.schema)

df_counted.write \
  .format("csv") \
  .option("header", "true") \
  .mode("overwrite") \
  .save(output_path)

print(rows_written_acc.value)
</code></pre>
<p>This code works BUT only if the cluster’s isolation mode is set to Single User. That rules out Unity Catalog, so I had to rule the approach out instead. Back to the drawing board.</p>
<h2 id="heading-observing-counts">Observing counts</h2>
<p>My search for a better way to count (or not count) led me to the <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.observe.html">Dataframe.observe function</a>. This was introduced in Spark 3.3.0, and it computes the defined aggregates during a dataframe operation. This was exactly what I needed. I ended up with this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> Observation
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> lit, count

observation = Observation(<span class="hljs-string">"write-metrics"</span>)

df.observe(observation, count(lit(<span class="hljs-string">"1"</span>)).alias(<span class="hljs-string">"count"</span>)) \
  .write \
  .format(<span class="hljs-string">"csv"</span>) \
  .option(<span class="hljs-string">"header"</span>, <span class="hljs-string">"true"</span>) \
  .mode(<span class="hljs-string">"overwrite"</span>) \
  .save(output_path)

print(observation.get)
</code></pre>
<p>This returned a dict of the aggregates I put on the observation:</p>
<blockquote>
<p><code>&gt; {'count': 4425435}</code></p>
</blockquote>
<p>More can be added to the observe function, such as getting max timestamps or summing up values. This was basically all I needed for now.</p>
<p>It is important to note, though, that the Observation class only works on batch queries. Streaming needs a slightly different approach, but that is a topic for a future blog.</p>
<h2 id="heading-the-catch">The catch</h2>
<p>This approach worked well and good for 99.9% of our use cases. However, I did observe that 0.1% of our processes got <code>None</code> back from <code>observation.get</code>. I’m still getting to the bottom of it, but it seems to happen when I overload the cluster with too much work. I eventually added a check on the result of <code>observation.get</code> that falls back to a count when it comes back empty, because I really don’t want to write a big file again. Wish me luck on this investigation, and if I ever get to the reasons I’ll put an addendum on this article.</p>
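<p>For reference, the fallback boils down to logic like this (a self-contained sketch; the names are mine, and in the real job the fallback callable is the expensive <code>df.count()</code>):</p>
<pre><code class="lang-python">def safe_row_count(metrics, fallback_count):
    """Prefer the observed count; only re-count when the observation is empty."""
    if metrics and metrics.get("count") is not None:
        return metrics["count"]
    # Last resort: pay for an explicit count once.
    return fallback_count()

print(safe_row_count({"count": 4425435}, lambda: 0))  # 4425435
print(safe_row_count(None, lambda: 42))               # 42
</code></pre>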
]]></content:encoded></item><item><title><![CDATA[From For Loops to For Each]]></title><description><![CDATA[I'm a strong advocate for using the enterprise chosen orchestration tool for complex data pipelines. However, there are instances where using that strictly is not practical, and the scheduling tool within the database itself makes more sense.
Take Da...]]></description><link>https://data.kurdapyo.com/from-for-loops-to-for-each</link><guid isPermaLink="true">https://data.kurdapyo.com/from-for-loops-to-for-each</guid><category><![CDATA[Databricks asset bundles]]></category><category><![CDATA[Databricks]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Tue, 30 Sep 2025 08:15:55 GMT</pubDate><content:encoded><![CDATA[<p>I'm a strong advocate for using the enterprise chosen orchestration tool for complex data pipelines. However, there are instances where using that strictly is not practical, and the scheduling tool within the database itself makes more sense.</p>
<p>Take Databricks, for example. We create Databricks workflows, but the Databricks workflow itself acts as an orchestrator. The challenge lies in determining what should be executed as a single Databricks workflow and what should be tasks within that workflow.</p>
<h2 id="heading-the-problem-and-initial-approach"><strong>The Problem and Initial Approach</strong></h2>
<p>The problem I wanted to solve was flattening nested JSON data into multiple child tables. This required using the variant datatype to transform and store the data in standard tables with native datatypes. While creating a separate Databricks workflow for each table was possible, it wasn't practical. Instead, we chose to use a single Databricks workflow, triggered by the enterprise orchestration tool (Airflow), to process all the necessary tables. The simplest solution was to go through each table one by one to refine the flattening logic. This was version 1 of the implementation.</p>
<h2 id="heading-the-issue-with-the-loop"><strong>The Issue with the Loop</strong></h2>
<p>While this approach worked in most cases, the total processing time was the sum of all individual iterations. For example, if we had 10 tables that finished in 1 minute each and 2 tables that took 20 minutes each, the entire workflow would take 50 minutes to complete. Adding more compute power wouldn’t significantly reduce this time and would only increase costs.</p>
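<p>The arithmetic is worth making concrete: with enough parallel slots, the lower bound on wall-clock time is the longest single iteration rather than the sum (a back-of-the-envelope sketch; real clusters schedule less perfectly):</p>
<pre><code class="lang-python">durations = [1] * 10 + [20, 20]  # minutes per table, as in the example

serial_total = sum(durations)   # one-by-one loop: 50 minutes
best_parallel = max(durations)  # ideal concurrency: 20 minutes

print(serial_total, best_parallel)  # 50 20
</code></pre>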
<h2 id="heading-for-each-to-the-rescue">For Each to the Rescue</h2>
<p>This is where the "for each" Databricks task type becomes helpful. Instead of using a loop, I restructured the iterations into a single generic notebook, which became an individual Databricks workflow task. In the job definition, I can specify the individual tables in a JSON array that is formatted for YAML.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">tasks:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">task_key:</span> <span class="hljs-string">"process"</span>
    <span class="hljs-attr">for_each_task:</span>
      <span class="hljs-attr">inputs:</span> <span class="hljs-string">"[\"table_a\",\"table_b\",\"table_c\"]"</span>
      <span class="hljs-attr">concurrency:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">task:</span>
        <span class="hljs-attr">task_key:</span> <span class="hljs-string">"process_iterator"</span>
        <span class="hljs-attr">notebook_task:</span>
          <span class="hljs-attr">base_parameters:</span>
            <span class="hljs-attr">p_table:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{input}}</span>"</span>
          <span class="hljs-attr">notebook_path:</span> <span class="hljs-string">"process.py"</span>
</code></pre>
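<p>Note that <code>inputs</code> is a JSON array encoded as a string; each element becomes one iteration’s <code>{{input}}</code> value. In Python terms:</p>
<pre><code class="lang-python">import json

inputs = "[\"table_a\",\"table_b\",\"table_c\"]"
print(json.loads(inputs))  # ['table_a', 'table_b', 'table_c']
</code></pre>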
<h2 id="heading-a-pre-task-to-make-it-dynamic">A Pre-Task to Make it Dynamic</h2>
<p>That's all well and good, but sometimes we don't want to hardcode everything in the job definition. We want things to be dynamic.</p>
<p>This is where introducing a pre-task becomes useful. By using another task, we can set up some Python code that builds an array and passes it to the next tasks.</p>
<p>For example:</p>
<pre><code class="lang-python"><span class="hljs-comment"># sample logic to build a list for the foreach</span>
tables_to_process = [<span class="hljs-string">"table_1"</span>, <span class="hljs-string">"table_2"</span>]
tables_to_process.append(<span class="hljs-string">"table_3"</span>)

dbutils.jobs.taskValues.set(key=<span class="hljs-string">"tables_to_process"</span>, value=tables_to_process)
</code></pre>
<h3 id="heading-how-it-all-looks"><strong>How It All Looks</strong></h3>
<p>The job definition should look like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">tasks:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">task_key:</span> <span class="hljs-string">"prepare"</span>
    <span class="hljs-attr">notebook_task:</span>
      <span class="hljs-attr">notebook_path:</span> <span class="hljs-string">"prepare.py"</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">task_key:</span> <span class="hljs-string">"process"</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">task_key:</span> <span class="hljs-string">prepare</span>
    <span class="hljs-attr">for_each_task:</span>
      <span class="hljs-attr">inputs:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ tasks.prepare.values.tables_to_process }}</span>"</span>
      <span class="hljs-attr">concurrency:</span> <span class="hljs-number">4</span> <span class="hljs-comment"># or whatever</span>
      <span class="hljs-attr">task:</span>
        <span class="hljs-attr">task_key:</span> <span class="hljs-string">"process_iterator"</span>
        <span class="hljs-attr">notebook_task:</span>
          <span class="hljs-attr">base_parameters:</span>
            <span class="hljs-attr">p_table:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{input}}</span>"</span>
          <span class="hljs-attr">notebook_path:</span> <span class="hljs-string">"process.py"</span>
</code></pre>
<p>Once everything is set up, the Databricks job/workflow definition will have the "for each" task receiving input from the pre-task's output.</p>
<p>After deployment:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759148490477/ddedd549-a3b3-455b-a8b4-851177cbdde7.png" alt class="image--center mx-auto" /></p>
<p>In a dynamic "for each" setup, you won't see the actual list of tables or iterations beforehand. This list only appears when the job runs and finishes the pre-step. Once that's done, the "for each" task starts. Clicking on the "for each" task takes you to a page that looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759148613864/337c0043-a697-4101-a616-d81f4f6d0a37.png" alt class="image--center mx-auto" /></p>
<p>Interestingly, the iterations were not in alphabetical order, and I haven't needed to make them alphabetical.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>So there you have it, the "for each" loop. It's quite useful and can help improve serialized for loops if you have them.</p>
<p>I did encounter one limitation. Currently, the "for each" task can only take inputs from other tasks. It can't pass task values to the next tasks. My workaround was to append the results to a single table which the next tasks can query later.</p>
]]></content:encoded></item><item><title><![CDATA[Databricks Environment Splits]]></title><description><![CDATA[Three-tier architecture was the norm when I started my IT career. It consisted of the presentation layer (the web servers), the application layer (the app servers), and the data layer (the databases).
We could spin up as many environments as needed f...]]></description><link>https://data.kurdapyo.com/databricks-environment-splits</link><guid isPermaLink="true">https://data.kurdapyo.com/databricks-environment-splits</guid><category><![CDATA[Databricks]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Tue, 05 Aug 2025 09:39:43 GMT</pubDate><content:encoded><![CDATA[<p>Three-tier architecture was the norm when I started my IT career. It consisted of the <strong>presentation layer</strong> (the web servers), the <strong>application layer</strong> (the app servers), and the <strong>data layer</strong> (the databases).</p>
<p>We could spin up as many environments as needed for the presentation and application layers — most engineers typically had their own local setups for these. But the <strong>database layer?</strong> That was usually shared.</p>
<p>This setup works fine when the schema is stable and the only operations are data manipulation (DML: inserts, updates, deletes). But in my experience, that’s rarely the case.</p>
<p><strong>Data-heavy applications</strong> are always evolving — new tables here, an extra column there, another index for that slow join. Sharing a database across multiple developers and environments quickly becomes a source of friction.</p>
<h2 id="heading-modern-workarounds-for-database-isolation"><strong>Modern Workarounds for Database Isolation</strong></h2>
<p>Today, there are far more elegant ways to manage this. I've worked on systems where each developer spun up their own isolated data layer using <strong>Docker containers</strong> for PostgreSQL or MySQL. For Oracle, we found it more practical to use <strong>PDBs (Pluggable Databases)</strong> instead. In AWS-heavy environments, some teams spun up databases from <strong>RDS snapshots</strong> using <strong>CloudFormation</strong> or <strong>Terraform</strong>.</p>
<p>There are also broader solutions involving full-stack virtualization, infrastructure-as-code, and automation tools. These approaches help bring the <strong>data layer</strong> up to the same level of flexibility as the app and presentation layers — and reduce cross-team collisions along the way.</p>
<h2 id="heading-databricks-and-the-environment-problem"><strong>Databricks and the Environment Problem</strong></h2>
<p>But what happens when your “database layer” isn’t a traditional RDBMS at all?</p>
<p>What if it’s <strong>Databricks</strong>, where data lives in cloud-managed storage, and the interface revolves around notebooks and distributed compute jobs? In this world, <strong>environment isolation</strong> takes a different form — and requires different strategies.</p>
<h2 id="heading-what-is-an-environment-in-databricks"><strong>What <em>Is</em> an Environment in Databricks?</strong></h2>
<p>The first important question is: <strong>how do you define an environment in Databricks?</strong> There are several valid approaches.</p>
<p>A common pattern is to define an environment as a <strong>Databricks workspace</strong> — e.g., <code>kurdapyo-prd</code>, <code>kurdapyo-uat</code>, <code>kurdapyo-sit</code>, <code>kurdapyo-dev</code>. This gives you clean isolation. Pre–Unity Catalog, you could even reuse catalog names across workspaces. But with Unity Catalog, <strong>catalog names must be globally unique</strong>, so it's now common to prefix them — e.g., <code>gold_uat</code>, <code>gold_prd</code>.</p>
<p>The downside? Spinning up a new workspace environment isn’t cheap. On AWS, it often involves provisioning a new account and VPC, setting up DNS, configuring SCIM and user permissions — the kind of setup best left to platform teams or DevOps experts.</p>
<h2 id="heading-splitting-a-workspace-into-logical-environments"><strong>Splitting a Workspace into Logical Environments</strong></h2>
<p>The next logical approach is to carve out <strong>logical environments</strong> within a single Databricks workspace. We explored two main options:</p>
<ul>
<li><p>Splitting at the <strong>catalog level</strong></p>
</li>
<li><p>Splitting at the <strong>schema level</strong></p>
</li>
</ul>
<p>I personally prefer the <strong>schema-level split</strong>. I even advocated for building automation to create schemas on demand and using <strong>zero-copy cloning</strong> to generate like-for-like environments — perfect for safe, isolated testing.</p>
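<p>For Delta tables, the cloning piece is a one-liner per table, which the automation would loop over a schema’s tables (a sketch with illustrative names; a shallow clone copies metadata and references the source’s data files rather than duplicating them):</p>
<pre><code class="lang-sql">-- names are illustrative
CREATE SCHEMA IF NOT EXISTS gold_dev_001;

CREATE TABLE gold_dev_001.orders
  SHALLOW CLONE gold_prd.orders;
</code></pre>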
<p>However, that approach wasn’t prioritized at the time. Instead, we went with a <strong>catalog-level split</strong>, managed through Terraform to keep things clean and avoid ad-hoc changes by engineers.</p>
<h2 id="heading-naming-your-logical-environments"><strong>Naming Your Logical Environments</strong></h2>
<p>Once you've defined your environment boundaries, the next step is naming. While I prefer a “<strong>treat them as cattle, not pets</strong>” mindset, stakeholders usually prefer meaningful names.</p>
<p>A practical trick I’ve learned: <strong>always include a numeric suffix</strong>, like <code>dev-001</code> or <code>uat-02</code>. It sounds simple, but it's incredibly effective — because no matter how many environments you start with, you'll always need more later. This naming convention makes it easy to scale your environments without resorting to naming gymnastics.</p>
<h2 id="heading-aspirations-automation-and-production-parity"><strong>Aspirations: Automation and Production Parity</strong></h2>
<p>Once your environments and naming conventions are in place, the long-term goal is automation:</p>
<ul>
<li><p>Automate environment <strong>creation and teardown</strong></p>
</li>
<li><p>Make it easy to spin up <strong>production-like environments</strong></p>
</li>
<li><p>Enable <strong>regression testing</strong> in consistent, isolated setups</p>
</li>
</ul>
<p>These efforts can be time-consuming upfront but pay off significantly in delivery speed and engineering confidence. Of course, it’s essential to balance these aspirations with business priorities. Delivering value to users comes first — but carving out time for automation is what sets teams up for long-term success.</p>
<p>One tool I’m particularly curious about is <strong>SQLMesh</strong>. It promises branch-based environments, automated model testing, and declarative change tracking — all of which seem like a natural fit for solving the isolation problem at the data transformation layer. I haven’t tried it in anger yet, but it’s definitely on my list to explore as we continue refining our approach to Databricks environments.</p>
<h2 id="heading-final-thoughts"><strong>Final Thoughts</strong></h2>
<p>The challenges of environment isolation aren’t new — they just evolve with the tools we use. Whether it's a traditional database in a three-tier architecture or a modern platform like Databricks, the underlying principles remain the same: <strong>avoid shared mutable state</strong>, <strong>automate what you can</strong>, and <strong>build with repeatability in mind</strong>.</p>
]]></content:encoded></item><item><title><![CDATA[What's with a name (in Databricks)]]></title><description><![CDATA[Naming conventions are often a contentious topic among architects. In Databricks, these debates become even more pronounced due to the extensive platform work required to establish a Databricks implementation.
In an AWS-based Databricks setup, naming...]]></description><link>https://data.kurdapyo.com/whats-with-a-name-in-databricks</link><guid isPermaLink="true">https://data.kurdapyo.com/whats-with-a-name-in-databricks</guid><category><![CDATA[Databricks]]></category><category><![CDATA[unity catalog]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Fri, 11 Jul 2025 06:10:14 GMT</pubDate><content:encoded><![CDATA[<p>Naming conventions are often a contentious topic among architects. In Databricks, these debates become even more pronounced due to the extensive platform work required to establish a Databricks implementation.</p>
<p>In an AWS-based Databricks setup, naming conventions extend to AWS resources like VPNs, S3 buckets, IAM Roles, and IAM Policies. Additionally, with SCIM integration, service principals and groups are synced into Databricks, retaining their original names.</p>
<p>In Databricks, you'll encounter a mix of naming styles, including snake_case, kebab-case, camelCase, and PascalCase. This mix can include inconsistently named resources and occasionally misspelled ones that, for various reasons, aren't worth the effort to correct. It can be quite frustrating.</p>
<h2 id="heading-give-to-caesar-the-things-which-are-caesars">Give to Caesar the things which are Caesar’s</h2>
<p>Typically, different teams manage the cloud and Databricks environments. It’s essential to adhere to the cloud team’s established naming conventions; they service other technological needs aside from data, so they know their world best. Cloud resources often use kebab-case because bucket names become part of URLs. Interestingly, Google advises against using underscores in URLs, as it doesn’t treat them as word separators (e.g., my_site is read as mysite).</p>
<p>Databricks starts with the workspace name, which appears in a URL and should therefore be in kebab-case.</p>
<p>For groups and users, it is common practice to implement SCIM (System for Cross-domain Identity Management) integration: groups and users are synced from ActiveDirectory/IdentityNow into Databricks, carrying their original names over. For service principals, I’ve always seen kebab-case used. There is also the option of creating Databricks groups separate from the SCIM-integrated ones; these often just get the same convention as service principals.</p>
<h2 id="heading-enter-unity-catalog">Enter Unity Catalog</h2>
<p>Unity Catalog introduces additional complexity. Most resources, like Catalogs, Schemas, Tables, and Views, are used in SQL, where snake_case or UPPER_SNAKE_CASE is preferred. SQL syntax generally handles these well, as many databases aren't case-sensitive unless names are enclosed in double quotes. Using kebab-case can be problematic because the dash might be interpreted as a subtraction sign, necessitating double quotes.</p>
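<p>The dash problem is easy to demonstrate (using SQLite here purely as a stand-in; the double-quoting rule is standard SQL):</p>
<pre><code class="lang-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "my-table" (id INTEGER)')
conn.execute('INSERT INTO "my-table" VALUES (1)')

# Quoted, the kebab-case name resolves to the table.
print(conn.execute('SELECT count(*) FROM "my-table"').fetchone()[0])  # 1

# Unquoted, my-table is read as the expression my minus table and fails.
try:
    conn.execute("SELECT count(*) FROM my-table")
except sqlite3.OperationalError as exc:
    print("error:", exc)
</code></pre>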
<p>PascalCase and camelCase are popular among Java developers and in SQL Server, but using these conventions requires double quotes due to case sensitivity.</p>
<p>Beyond SQL objects, resources like storage credentials and external locations often follow kebab-case, especially when linked to S3 buckets.</p>
<h2 id="heading-abcs">ABCs</h2>
<p>Ultimately, the key to effective naming is to Always Be Consistent. While smaller projects often achieve this, larger implementations may struggle due to the involvement of many people, time constraints, and limited reviews. Therefore, in addition to striving for consistency, it's important to Also Be Considerate. People generally do their best with the knowledge and resources available, and while names that violate standards can be changed, consistency might not always be prioritized due to other pressing business needs.</p>
]]></content:encoded></item><item><title><![CDATA[IAC Face-Off: Databricks Asset Bundles vs Terraform]]></title><description><![CDATA[Data Engineers often have to juggle both data processing and DevOps pipelines. I’ve worked with many skilled data professionals, but they tend to shy away from the DevOps side of things. Often, I get asked to take over this aspect due to my backgroun...]]></description><link>https://data.kurdapyo.com/iac-face-off-databricks-asset-bundles-vs-terraform</link><guid isPermaLink="true">https://data.kurdapyo.com/iac-face-off-databricks-asset-bundles-vs-terraform</guid><category><![CDATA[Databricks]]></category><category><![CDATA[Databricks asset bundles]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Sun, 06 Jul 2025 12:13:28 GMT</pubDate><content:encoded><![CDATA[<p>Data Engineers often have to juggle both data processing and DevOps pipelines. I’ve worked with many skilled data professionals, but they tend to shy away from the DevOps side of things. Often, I get asked to take over this aspect due to my background.</p>
<p>For Databricks, resource deployment has always been a combination of Terraform and Databricks Asset Bundles. Terraform likely needs no introduction, as it’s prevalent in infrastructure deployment nowadays; other options like Pulumi or even Ansible are a distant second. While I’m no expert, I have a working understanding of how to implement changes with it.</p>
<p>Databricks Asset Bundles (DABs), however, are not as widely known. Before DABs, deploying jobs and workflows in Databricks was converging on DBX. Eventually, Databricks added the bundle command to the Databricks CLI, and it’s now being pushed more and more.</p>
<p>When I first heard about DABs from our Solutions Architect, it seemed like we could use it for all resource deployments. I was disappointed to find out later that it only deployed Workflows and DLT Pipelines.</p>
<h2 id="heading-whats-missing">What's Missing?</h2>
<p>All the platform-related components, of course. Our project involved setting up Databricks on AWS from scratch, and we aimed to minimize manual configurations (clickops) while adhering to strict security protocols. Both Account Admin and workspace admin privileges were tightly controlled, as they should be.</p>
<p>We needed to deploy everything from the metastore, workspaces, groups, and service principals to catalogs, schemas, tables, views, and grants. However, aside from Workflows and DLT Pipelines, DABs couldn't create any of these components—not even tables and views. While we could write code to create these using the Databricks API, it required running them through notebooks, leading to less elegant, more cumbersome code compared to the declarative style of Terraform.</p>
<h2 id="heading-what-we-did-eventually">What we did eventually</h2>
<p>Terraform continues to be our primary tool for provisioning most resources, including Catalogs, Schemas, External Locations, and Volumes. It also manages Shared Databricks Clusters and SQL Warehouses effectively. To maintain simplicity and security, we limit grants to the Schema level, avoiding individual object grants.</p>
<p>Alas, Terraform is not very common in our current crew of data engineers. This equated to a higher platform engineer to data engineer ratio for us.</p>
<p>We still rely on Databricks Asset Bundles for deploying Databricks workflows, as they offer a straightforward method to deploy the same code across different environments.</p>
<p>For creating tables and views, we utilize DBT (Data Build Tool), which provides a declarative approach to object creation. As for User-Defined Functions (UDFs), they might still require the less elegant notebook approach, but fortunately, we haven't needed to create any yet.</p>
<p>I am hopeful that Databricks Asset Bundles will eventually support a broader range of resources. The tool is rapidly evolving, and I remain optimistic about its future capabilities.</p>
]]></content:encoded></item><item><title><![CDATA[Fivetran Schema Change Handling]]></title><description><![CDATA[Schema changes can be a significant challenge for many ETL pipelines. In the past, changes were more manageable because systems operated independently. However, with the rise of SaaS products, out-of-the-box solutions, and frequent release cycles, ch...]]></description><link>https://data.kurdapyo.com/fivetran-schema-change-handling</link><guid isPermaLink="true">https://data.kurdapyo.com/fivetran-schema-change-handling</guid><category><![CDATA[fivetran]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Fri, 04 Jul 2025 10:45:44 GMT</pubDate><content:encoded><![CDATA[<p>Schema changes can be a significant challenge for many ETL pipelines. In the past, changes were more manageable because systems operated independently. However, with the rise of SaaS products, out-of-the-box solutions, and frequent release cycles, change has become a constant, and occasional schema changes must be addressed.</p>
<h2 id="heading-setting-the-scene"><strong>Setting the Scene</strong></h2>
<p>In our scenario, our data source is a Snowflake database managed by a SaaS vendor. We decided to use Fivetran to sync data from Snowflake into our target database. The SaaS vendor follows a weekly release cycle for minor updates and a monthly cadence for major releases, with changes occurring in different environments on varying dates.</p>
<p>We set up Fivetran replication from this Snowflake source. Our connector is configured to sync a substantial number of tables, and we have been syncing daily for several days without issues.</p>
<h2 id="heading-cue-in-the-schema-change"><strong>Cue in the Schema Change</strong></h2>
<p>One day, our sync did not complete successfully. We received a status of “Rescheduled” with the reason “Unsupported schema change requires table resync: ADD_COLUMN” for one of the tables.</p>
<p>As we were still new to analyzing the Fivetran logs (which are not available in the Fivetran Console), we were unsure of what had happened. We asked questions within the team, and while it was a learning experience, it was challenging at the time. Here are the questions we faced:</p>
<h3 id="heading-question-1-what-column-was-added">Question 1: What column was added?</h3>
<p>This information was not readily available. However, by querying the Fivetran logs for <code>alter_table</code> events, we could identify the table and the column that was added.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>   *
<span class="hljs-keyword">FROM</span>     &lt;destination_db&gt;.&lt;destination_schema&gt;.log
<span class="hljs-keyword">WHERE</span>    connector_id = :connector_id
<span class="hljs-keyword">AND</span>      message_event = <span class="hljs-string">'alter_table'</span>
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> time_stamp <span class="hljs-keyword">DESC</span>;
</code></pre>
<h3 id="heading-question-2-what-happened-to-the-sync-of-the-other-tables">Question 2: What happened to the sync of the other tables?</h3>
<p>We were initially unaware of the status of the other tables. The Fivetran Console did not provide this information. However, we later learned that the other tables had indeed been synced. This information can also be queried from the logs.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>   message_data:<span class="hljs-keyword">schema</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">"schema"</span>,
         message_data:<span class="hljs-keyword">table</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">"table"</span>,
         message_data:<span class="hljs-keyword">count</span>
<span class="hljs-keyword">FROM</span>     &lt;destination_db&gt;.&lt;destination_schema&gt;.log
<span class="hljs-keyword">WHERE</span>    connector_id = :connector_id
<span class="hljs-keyword">AND</span>      message_event = <span class="hljs-string">'records_modified'</span>
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> time_stamp <span class="hljs-keyword">DESC</span>;
</code></pre>
<h3 id="heading-question-3-should-we-just-rerun-the-sync">Question 3: Should we just rerun the sync?</h3>
<p>If we had relied on the Fivetran scheduler, it would have retried the sync automatically without any intervention from us. So, we decided to rerun the sync.</p>
<p>What we now understand is that Fivetran, on that first sync that ended in a reschedule, had already altered the table with the schema change. At this point, it simply required an acknowledgment to proceed with a full sync for that table.</p>
<p>This realization was surprising. I initially thought that Fivetran would detect the schema change and, upon rerunning the sync, would then alter the table and perform the sync. However, that was not the case.</p>
<p>This issue is somewhat unique to us because we do not own the source. But imagine this happening to a table of multi-terabyte size. Attempting to resync could take days or even weeks. And there was no way of getting around it.</p>
<h2 id="heading-rtfm">RTFM</h2>
<p>The documentation clarified the issue and explained the behavior we saw.</p>
<p><a target="_blank" href="https://fivetran.com/docs/connectors/databases/snowflake#automatictableresyncs">https://fivetran.com/docs/connectors/databases/snowflake#automatictableresyncs</a></p>
<blockquote>
<h3 id="heading-automatic-table-re-syncs">Automatic table re-syncs</h3>
<p>For tables with a primary key, we support ADD COLUMN DDL operations with null or default static values. All other table or column alterations trigger an automatic table re-sync. For tables without a primary key, any DDL operation triggers an automatic table re-sync.</p>
</blockquote>
<p>Apparently the fault was ours, though we can’t really do anything about it since the Snowflake source is vendor-owned.</p>
<h2 id="heading-what-to-do-now">What to do now</h2>
<p>Sad to say, not much could be done.</p>
<p>We educated our support team about this behavior so they can watch for it and prepare communications when these occurrences arise.</p>
<p>We also raised the issue with Fivetran and made a feature request.</p>
<p>Now we wait.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">📧</div>
<div data-node-type="callout-text">We'd love to hear from you! Share your experiences with schema changes in ETL processes in the comments below. Have questions or feedback on the strategies discussed? Don't hesitate to reach out.</div>
</div>]]></content:encoded></item><item><title><![CDATA[Upset with Fivetran Scheduler Offsets]]></title><description><![CDATA[A few months ago, I finally had the opportunity to use Fivetran in a real-world project. To provide some background, this was a greenfield project involving a financial system, tooling and architecture had been selected before my arrival. The tech st...]]></description><link>https://data.kurdapyo.com/upset-with-fivetran-scheduler-offsets</link><guid isPermaLink="true">https://data.kurdapyo.com/upset-with-fivetran-scheduler-offsets</guid><category><![CDATA[control-m]]></category><category><![CDATA[fivetran]]></category><category><![CDATA[Orchestration]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Tue, 24 Jun 2025 09:44:20 GMT</pubDate><content:encoded><![CDATA[<p>A few months ago, I finally had the opportunity to use Fivetran in a real-world project. To provide some background, this was a greenfield project involving a financial system; tooling and architecture had been selected before my arrival. The tech stack included Databricks for the Data Lake (or Delta Lake), with Fivetran chosen as the data acquisition tool. The company was using Control-M as its scheduler, but no one really wanted to use that. For this specific task, it was decided to utilize the Fivetran Scheduler to trigger the data acquisition processes.</p>
<h2 id="heading-my-mission-or-so-i-thought">My Mission (or so I thought)</h2>
<p>I was brought into this project to work on Databricks components. Despite this, I was eager to see Fivetran in action. Architect colleagues I’d collaborated with on DMS work often touted how much easier Fivetran would make such acquisition or migration tasks.</p>
<p>I diligently explored how to orchestrate the data acquisition with Databricks workflows. As expected, Databricks didn’t have direct connectivity to Fivetran. This is a common security practice. It prevents data engineers from being tempted to create code to directly access the Fivetran APIs, potentially leading to messy implementations.</p>
<p>I raised the concern that orchestration should go through the company’s preferred orchestration tool, even if it was considered outdated technology. Unfortunately, my concerns were ignored. With my well-defined JIRA card languishing without any progress, I decided to press on with the work regardless.</p>
<p>With no API connectivity, I began examining Fivetran logs, which were being synced by Fivetran to Databricks. BTW, Fivetran doesn’t provide logs on its web console; the only way a customer can access logs is by syncing log data to a destination using a Fivetran connector. (This novel topic might be a future blog post.)</p>
<p>The Fivetran log connector had a 5-15 minute delay, which was acceptable for now. SLAs would be dealt with much later on.</p>
<h2 id="heading-initial-signs">Initial Signs</h2>
<p>We had a job scheduled at 4 p.m. From the logs, I noticed syncs were only starting around 4:11 p.m. This seemed odd, but I didn't think much of it at the time, as it was still early in the project and nothing was stable yet.</p>
<p>After a few more weeks, we were preparing to promote the code to the next higher environment. We deployed the same code (Fivetran was deployed using Terraform, by the way) for a 4 p.m. sync. We waited, but by 4:15, the job hadn't started. By 4:30, still nothing. 4:45, nothing. Finally, the job only started at 4:54 p.m.</p>
<h2 id="heading-rtfm">RTFM</h2>
<p>Digging through the Fivetran documentation, I stumbled upon this warning:</p>
<p><a target="_blank" href="https://fivetran.com/docs/core-concepts/syncoverview">https://fivetran.com/docs/core-concepts/syncoverview</a></p>
<blockquote>
<p>When you add a new destination, Fivetran assigns it a fixed time offset. The offset can be any random value in minutes ranging from 0 to 60. It is derived from the destination ID hash. This offset is shared by every connector in the destination. The offset value remains the same regardless of the set sync frequency.</p>
</blockquote>
<p>Let that sink in: a random 0-60 minute offset. And it can’t be modified, even through a support request (trust me, I tried).</p>
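<p>To make the behavior concrete, here is a purely illustrative sketch of how a hash-derived offset behaves. Fivetran does not publish its actual algorithm, and the function name and hashing scheme below are my own invention; the point is only that such an offset is stable for a given destination yet effectively random across destinations.</p>

```python
import hashlib


def sync_offset_minutes(destination_id: str) -> int:
    """Illustrative only: derive a stable minute offset from an ID hash.

    This is NOT Fivetran's actual algorithm; it just demonstrates why the
    offset looks random per destination but never changes for one
    destination, regardless of the configured sync frequency.
    """
    digest = hashlib.sha256(destination_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 60  # always in the 0-59 minute range


# The same destination ID always yields the same offset, so a "4 p.m." sync
# in one environment can consistently fire at 4:11, and in another at 4:54.
```

<p>This also explains why the two environments behaved so differently for us: each destination gets its own fixed offset, so identical Terraform code can produce very different effective start times.</p>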
<h2 id="heading-denouement">Denouement</h2>
<p>This discovery threw a wrench in our plans to avoid using Control-M. After sinking a moderate amount of effort into a log-based solution, we pivoted to developing a simple script to be triggered by Control-M. In terms of code, it was the simpler solution. However, in terms of infrastructure, it was a minor nightmare. We had to sort out an agent for Control-M, open up network connectivity, create a Fivetran service user to trigger the API calls, and manage secrets, among other tasks.</p>
<h2 id="heading-things-i-learned">Things I learned</h2>
<p>Avoid the Fivetran scheduler like the plague; the API is the way to go to trigger Fivetran syncs. You won’t be hit by an unreasonably high offset, and you’ll have more options to handle reschedules or errors on syncs. I’ll rant about the Fivetran Scheduler retry logic another time.</p>
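<p>For the curious, triggering a sync via the API is a small amount of code. The sketch below builds the request against Fivetran’s documented <code>POST /v1/connectors/{connector_id}/sync</code> endpoint; the connector ID, key, and secret are placeholders, and you should double-check the endpoint and the <code>force</code> flag against the current Fivetran REST API docs before relying on it.</p>

```python
import base64
import json
import urllib.request

FIVETRAN_API = "https://api.fivetran.com/v1"


def build_sync_request(
    connector_id: str, api_key: str, api_secret: str
) -> urllib.request.Request:
    """Build (but do not send) a 'trigger sync' request for one connector.

    Per Fivetran's REST API docs: POST /v1/connectors/{id}/sync, with the
    API key/secret pair sent as HTTP basic auth.
    """
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return urllib.request.Request(
        url=f"{FIVETRAN_API}/connectors/{connector_id}/sync",
        # "force": interrupt an already-running sync rather than skipping
        data=json.dumps({"force": True}).encode(),
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# A scheduler such as Control-M would then run something like:
#   urllib.request.urlopen(build_sync_request("my_connector", key, secret))
```

<p>The nice part is that the sync starts immediately when the scheduler says so, with no destination offset in the way.</p>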
<p>Embrace the orchestration tool as your ally. While Control-M is still not my preferred choice, I recognize the robust infrastructure this organization has established around it, including change management processes, 24/7 support teams, alerts, and more. It does make me think of what hurdles would need to be cleared to get something like Airflow productionized.</p>
]]></content:encoded></item></channel></rss>