<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[kurdapyo does data]]></title><description><![CDATA[kurdapyo does data]]></description><link>https://data.kurdapyo.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 24 Apr 2026 18:54:37 GMT</lastBuildDate><atom:link href="https://data.kurdapyo.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Counting when it really counts]]></title><description><![CDATA[A seasoned data engineer is deliberate on their use of expensive operations such as counts. It’s fairly common to see a misuse of this. Such as using a count to check if a query returns some rows. Or running a count query to figure out the number of ...]]></description><link>https://data.kurdapyo.com/counting-when-it-really-counts</link><guid isPermaLink="true">https://data.kurdapyo.com/counting-when-it-really-counts</guid><category><![CDATA[Databricks]]></category><category><![CDATA[dataframe]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Mon, 03 Nov 2025 11:16:29 GMT</pubDate><content:encoded><![CDATA[<p>A seasoned data engineer is deliberate in their use of expensive operations such as counts. Misuse is fairly common: using a count just to check whether a query returns any rows, or running a separate count query to figure out the number of records to be inserted by an insert-select statement.</p>
<p>Frankly speaking, doing a select count just to check whether a query will return any rows is inefficient, and many systems I’ve come across are littered with examples of this. I won’t dwell on that too much here.</p>
<p>What I want to discuss is how to get row counts after running <code>dataframe.write</code> statements in Spark.</p>
<h2 id="heading-the-old-traditions">The Old Traditions</h2>
<p>Let me start with how it used to be done with traditional databases. Across database systems, the method for determining the number of inserted rows differs. In MySQL, the <code>ROW_COUNT()</code> function returns the count of rows affected by the preceding <code>INSERT</code> statement. PostgreSQL offers a <code>RETURNING</code> clause within the <code>INSERT</code> statement itself, which can provide a count or return the inserted data. In SQL Server, the <code>@@ROWCOUNT</code> system variable holds the number of rows impacted by the most recent statement. Snowflake provides a <code>SQLROWCOUNT</code> variable for this purpose, specifically for DML statements within its scripting environment.</p>
<p>Amazon Redshift is not really a traditional database, but it is a fork of good old Postgres. Obtaining an accurate insert count there typically involves querying the <code>stl_insert</code> system table for the query ID of the recent <code>INSERT</code> operation, which can be more complex due to the distributed nature of the database. These variations mean developers must use the correct function or variable for their specific database to get an immediate and accurate count of inserted rows.</p>
<p>Simply put, there used to be return values or special variables that provided a way to count.</p>
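<p>As a small illustration of this pattern (a sketch using Python’s <code>sqlite3</code> as a stand-in; SQLite is not one of the databases above, but its driver exposes the same idea through <code>cursor.rowcount</code>):</p>
<pre><code class="lang-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER)")
conn.executemany("INSERT INTO src VALUES (?)", [(1,), (2,), (3,)])
conn.execute("CREATE TABLE dst (id INTEGER)")

# The affected-row count comes back on the cursor, much like
# ROW_COUNT() or @@ROWCOUNT would after an insert-select.
cur = conn.execute("INSERT INTO dst SELECT id FROM src")
print(cur.rowcount)  # 3
</code></pre>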
<h2 id="heading-breaking-from-tradition">Breaking from tradition</h2>
<p>Redshift was already hinting at the counting difficulties that come with distributed databases.</p>
<p>In my early days using Databricks, I tried this:</p>
<pre><code class="lang-python">results = df.write \
  .format(<span class="hljs-string">"csv"</span>) \
  .option(<span class="hljs-string">"header"</span>, <span class="hljs-string">"true"</span>) \
  .mode(<span class="hljs-string">"overwrite"</span>) \
  .save(output_path)

print(results)
</code></pre>
<p>What a surprise to see this back:</p>
<blockquote>
<p><code>&gt; None</code></p>
</blockquote>
<p>A common workaround is to call <code>df.count()</code> either before or after the operation. This is okay in cases where the data is not changing between operations. However, it means that the dataframe will actually be executed twice, albeit the count run would be optimized to just produce a count.</p>
<p>I explored this a bit further and found that accumulators can be used.</p>
<pre><code class="lang-python">sc = spark.sparkContext
# PySpark accumulators are created via sc.accumulator (longAccumulator is Scala-only)
rows_written_acc = sc.accumulator(0)

def increment_counter_and_return(row):
    rows_written_acc.add(1)
    return row

# Route the write through the counting map; writing the original df
# would never trigger the accumulator
df_counted = df.rdd.map(increment_counter_and_return).toDF(df.schema)

df_counted.write \
  .format("csv") \
  .option("header", "true") \
  .mode("overwrite") \
  .save(output_path)

print(rows_written_acc.value)
</code></pre>
<p>This code works BUT only if the cluster’s isolation mode is set to Single User. That rules out Unity Catalog, so I had to rule the approach out instead. Back to the drawing board.</p>
<h2 id="heading-observing-counts">Observing counts</h2>
<p>My search for a better way to count (or not count) led me to the <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.observe.html">Dataframe.observe function</a>. This was introduced in Spark 3.3.0, and it computes the defined aggregates during a dataframe operation. This was exactly what I needed. I ended up with this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> Observation
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> lit, count

observation = Observation(<span class="hljs-string">"write-metrics"</span>)

df.observe(observation, count(lit(<span class="hljs-string">"1"</span>)).alias(<span class="hljs-string">"count"</span>)) \
  .write \
  .format(<span class="hljs-string">"csv"</span>) \
  .option(<span class="hljs-string">"header"</span>, <span class="hljs-string">"true"</span>) \
  .mode(<span class="hljs-string">"overwrite"</span>) \
  .save(output_path)

print(observation.get)
</code></pre>
<p>This returned a dict of the aggregates I put on the observation:</p>
<blockquote>
<p><code>&gt; {'count': 4425435}</code></p>
</blockquote>
<p>More can be added to the observe function, such as getting max timestamps or summing up values. This was basically all I needed for now.</p>
<p>It is important to note, though, that the Observation class only works on batch queries. Streaming needs a slightly different approach, but that is a topic for a future blog.</p>
<h2 id="heading-the-catch">The catch</h2>
<p>This approach worked well and good for 99.9% of our use cases. However, I did observe that 0.1% of our processes got <code>None</code> back from <code>observation.get</code>. I’m still getting to the bottom of it, but it seems to happen when I overload the cluster with too much work. I eventually added a check on the result of <code>observation.get</code> that falls back to a count when it comes back empty, because I really don’t want to write a big file again. Wish me luck on this investigation, and if I ever get to the reasons I’ll put an addendum on this article.</p>
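<p>For reference, the fallback boils down to logic like this (a self-contained sketch; the names are mine, and in the real job the fallback callable is the expensive <code>df.count()</code>):</p>
<pre><code class="lang-python">def safe_row_count(metrics, fallback_count):
    """Prefer the observed count; only re-count when the observation is empty."""
    if metrics and metrics.get("count") is not None:
        return metrics["count"]
    # Last resort: pay for an explicit count once.
    return fallback_count()

print(safe_row_count({"count": 4425435}, lambda: 0))  # 4425435
print(safe_row_count(None, lambda: 42))               # 42
</code></pre>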
]]></content:encoded></item><item><title><![CDATA[From For Loops to For Each]]></title><description><![CDATA[I'm a strong advocate for using the enterprise chosen orchestration tool for complex data pipelines. However, there are instances where using that strictly is not practical, and the scheduling tool within the database itself makes more sense.
Take Da...]]></description><link>https://data.kurdapyo.com/from-for-loops-to-for-each</link><guid isPermaLink="true">https://data.kurdapyo.com/from-for-loops-to-for-each</guid><category><![CDATA[Databricks asset bundles]]></category><category><![CDATA[Databricks]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Tue, 30 Sep 2025 08:15:55 GMT</pubDate><content:encoded><![CDATA[<p>I'm a strong advocate for using the enterprise chosen orchestration tool for complex data pipelines. However, there are instances where using that strictly is not practical, and the scheduling tool within the database itself makes more sense.</p>
<p>Take Databricks, for example. We create Databricks workflows, but the Databricks workflow itself acts as an orchestrator. The challenge lies in determining what should be executed as a single Databricks workflow and what should be tasks within that workflow.</p>
<h2 id="heading-the-problem-and-initial-approach"><strong>The Problem and Initial Approach</strong></h2>
<p>The problem I wanted to solve was flattening nested JSON data into multiple child tables. This required using the variant datatype to transform and store the data in standard tables with native datatypes. While creating a separate Databricks workflow for each table was possible, it wasn't practical. Instead, we chose to use a single Databricks workflow, triggered by the enterprise orchestration tool (Airflow), to process all the necessary tables. The simplest solution was to go through each table one by one to refine the flattening logic. This was version 1 of the implementation.</p>
<h2 id="heading-the-issue-with-the-loop"><strong>The Issue with the Loop</strong></h2>
<p>While this approach worked in most cases, the total processing time was the sum of all individual iterations. For example, if we had 10 tables that finished in 1 minute each and 2 tables that took 20 minutes each, the entire workflow would take 50 minutes to complete. Adding more compute power wouldn’t significantly reduce this time and would only increase costs.</p>
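<p>The arithmetic is worth making concrete: with enough parallel slots, the lower bound on wall-clock time is the longest single iteration rather than the sum (a back-of-the-envelope sketch; real clusters schedule less perfectly):</p>
<pre><code class="lang-python">durations = [1] * 10 + [20, 20]  # minutes per table, as in the example

serial_total = sum(durations)   # one-by-one loop: 50 minutes
best_parallel = max(durations)  # ideal concurrency: 20 minutes

print(serial_total, best_parallel)  # 50 20
</code></pre>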
<h2 id="heading-for-each-to-the-rescue">For Each to the Rescue</h2>
<p>This is where the "for each" Databricks task type becomes helpful. Instead of using a loop, I restructured the iterations into a single generic notebook, which became an individual Databricks workflow task. In the job definition, I can specify the individual tables in a JSON array that is formatted for YAML.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">tasks:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">task_key:</span> <span class="hljs-string">"process"</span>
    <span class="hljs-attr">for_each_task:</span>
      <span class="hljs-attr">inputs:</span> <span class="hljs-string">"[\"table_a\",\"table_b\",\"table_c\"]"</span>
      <span class="hljs-attr">concurrency:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">task:</span>
        <span class="hljs-attr">task_key:</span> <span class="hljs-string">"process_iterator"</span>
        <span class="hljs-attr">notebook_task:</span>
          <span class="hljs-attr">base_parameters:</span>
            <span class="hljs-attr">p_table:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{input}}</span>"</span>
          <span class="hljs-attr">notebook_path:</span> <span class="hljs-string">"process.py"</span>
</code></pre>
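<p>Note that <code>inputs</code> is a JSON array encoded as a string; each element becomes one iteration’s <code>{{input}}</code> value. In Python terms:</p>
<pre><code class="lang-python">import json

inputs = "[\"table_a\",\"table_b\",\"table_c\"]"
print(json.loads(inputs))  # ['table_a', 'table_b', 'table_c']
</code></pre>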
<h2 id="heading-a-pre-task-to-make-it-dynamic">A Pre-Task to Make it Dynamic</h2>
<p>That's all well and good, but sometimes we don't want to hardcode everything in the job definition. We want things to be dynamic.</p>
<p>This is where introducing a pre-task becomes useful. By using another task, we can set up some Python code that builds an array and passes it to the next tasks.</p>
<p>For example:</p>
<pre><code class="lang-python"><span class="hljs-comment"># sample logic to build a list for the foreach</span>
tables_to_process = [<span class="hljs-string">"table_1"</span>, <span class="hljs-string">"table_2"</span>]
tables_to_process.append(<span class="hljs-string">"table_3"</span>)

dbutils.jobs.taskValues.set(key=<span class="hljs-string">"tables_to_process"</span>, value=tables_to_process)
</code></pre>
<h3 id="heading-how-it-all-looks"><strong>How It All Looks</strong></h3>
<p>The job definition should look like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">tasks:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">task_key:</span> <span class="hljs-string">"prepare"</span>
    <span class="hljs-attr">notebook_task:</span>
      <span class="hljs-attr">notebook_path:</span> <span class="hljs-string">"prepare.py"</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">task_key:</span> <span class="hljs-string">"process"</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">task_key:</span> <span class="hljs-string">prepare</span>
    <span class="hljs-attr">for_each_task:</span>
      <span class="hljs-attr">inputs:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ tasks.prepare.values.tables_to_process }}</span>"</span>
      <span class="hljs-attr">concurrency:</span> <span class="hljs-number">4</span> <span class="hljs-comment"># or whatever</span>
      <span class="hljs-attr">task:</span>
        <span class="hljs-attr">task_key:</span> <span class="hljs-string">"process_iterator"</span>
        <span class="hljs-attr">notebook_task:</span>
          <span class="hljs-attr">base_parameters:</span>
            <span class="hljs-attr">p_table:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{input}}</span>"</span>
          <span class="hljs-attr">notebook_path:</span> <span class="hljs-string">"process.py"</span>
</code></pre>
<p>Once everything is set up, the Databricks job/workflow definition will have the "for each" task receiving input from the pre-task's output.</p>
<p>After deployment:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759148490477/ddedd549-a3b3-455b-a8b4-851177cbdde7.png" alt class="image--center mx-auto" /></p>
<p>In a dynamic "for each" setup, you won't see the actual list of tables or iterations beforehand. This list only appears when the job runs and finishes the pre-step. Once that's done, the "for each" task starts. Clicking on the "for each" task takes you to a page that looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759148613864/337c0043-a697-4101-a616-d81f4f6d0a37.png" alt class="image--center mx-auto" /></p>
<p>Interestingly, the iterations were not in alphabetical order, and I haven't needed to make them alphabetical.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>So there you have it, the "for each" loop. It's quite useful and can help improve serialized for loops if you have them.</p>
<p>I did encounter one limitation. Currently, the "for each" task can only take inputs from other tasks. It can't pass task values to the next tasks. My workaround was to append the results to a single table which the next tasks can query later.</p>
]]></content:encoded></item><item><title><![CDATA[Databricks Environment Splits]]></title><description><![CDATA[Three-tier architecture was the norm when I started my IT career. It consisted of the presentation layer (the web servers), the application layer (the app servers), and the data layer (the databases).
We could spin up as many environments as needed f...]]></description><link>https://data.kurdapyo.com/databricks-environment-splits</link><guid isPermaLink="true">https://data.kurdapyo.com/databricks-environment-splits</guid><category><![CDATA[Databricks]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Tue, 05 Aug 2025 09:39:43 GMT</pubDate><content:encoded><![CDATA[<p>Three-tier architecture was the norm when I started my IT career. It consisted of the <strong>presentation layer</strong> (the web servers), the <strong>application layer</strong> (the app servers), and the <strong>data layer</strong> (the databases).</p>
<p>We could spin up as many environments as needed for the presentation and application layers — most engineers typically had their own local setups for these. But the <strong>database layer?</strong> That was usually shared.</p>
<p>This setup works fine when the schema is stable and the only operations are data manipulation (DML: inserts, updates, deletes). But in my experience, that’s rarely the case.</p>
<p><strong>Data-heavy applications</strong> are always evolving — new tables here, an extra column there, another index for that slow join. Sharing a database across multiple developers and environments quickly becomes a source of friction.</p>
<h2 id="heading-modern-workarounds-for-database-isolation"><strong>Modern Workarounds for Database Isolation</strong></h2>
<p>Today, there are far more elegant ways to manage this. I've worked on systems where each developer spun up their own isolated data layer using <strong>Docker containers</strong> for PostgreSQL or MySQL. For Oracle, we found it more practical to use <strong>PDBs (Pluggable Databases)</strong> instead. In AWS-heavy environments, some teams spun up databases from <strong>RDS snapshots</strong> using <strong>CloudFormation</strong> or <strong>Terraform</strong>.</p>
<p>There are also broader solutions involving full-stack virtualization, infrastructure-as-code, and automation tools. These approaches help bring the <strong>data layer</strong> up to the same level of flexibility as the app and presentation layers — and reduce cross-team collisions along the way.</p>
<h2 id="heading-databricks-and-the-environment-problem"><strong>Databricks and the Environment Problem</strong></h2>
<p>But what happens when your “database layer” isn’t a traditional RDBMS at all?</p>
<p>What if it’s <strong>Databricks</strong>, where data lives in cloud-managed storage, and the interface revolves around notebooks and distributed compute jobs? In this world, <strong>environment isolation</strong> takes a different form — and requires different strategies.</p>
<h2 id="heading-what-is-an-environment-in-databricks"><strong>What <em>Is</em> an Environment in Databricks?</strong></h2>
<p>The first important question is: <strong>how do you define an environment in Databricks?</strong> There are several valid approaches.</p>
<p>A common pattern is to define an environment as a <strong>Databricks workspace</strong> — e.g., <code>kurdapyo-prd</code>, <code>kurdapyo-uat</code>, <code>kurdapyo-sit</code>, <code>kurdapyo-dev</code>. This gives you clean isolation. Pre–Unity Catalog, you could even reuse catalog names across workspaces. But with Unity Catalog, <strong>catalog names must be globally unique</strong>, so it's now common to prefix them — e.g., <code>gold_uat</code>, <code>gold_prd</code>.</p>
<p>The downside? Spinning up a new workspace environment isn’t cheap. On AWS, it often involves provisioning a new account and VPC, setting up DNS, configuring SCIM and user permissions — the kind of setup best left to platform teams or DevOps experts.</p>
<h2 id="heading-splitting-a-workspace-into-logical-environments"><strong>Splitting a Workspace into Logical Environments</strong></h2>
<p>The next logical approach is to carve out <strong>logical environments</strong> within a single Databricks workspace. We explored two main options:</p>
<ul>
<li><p>Splitting at the <strong>catalog level</strong></p>
</li>
<li><p>Splitting at the <strong>schema level</strong></p>
</li>
</ul>
<p>I personally prefer the <strong>schema-level split</strong>. I even advocated for building automation to create schemas on demand and using <strong>zero-copy cloning</strong> to generate like-for-like environments — perfect for safe, isolated testing.</p>
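<p>For Delta tables, the cloning piece is a one-liner per table, which the automation would loop over a schema’s tables (a sketch with illustrative names; a shallow clone copies metadata and references the source’s data files rather than duplicating them):</p>
<pre><code class="lang-sql">-- names are illustrative
CREATE SCHEMA IF NOT EXISTS gold_dev_001;

CREATE TABLE gold_dev_001.orders
  SHALLOW CLONE gold_prd.orders;
</code></pre>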
<p>However, that approach wasn’t prioritized at the time. Instead, we went with a <strong>catalog-level split</strong>, managed through Terraform to keep things clean and avoid ad-hoc changes by engineers.</p>
<h2 id="heading-naming-your-logical-environments"><strong>Naming Your Logical Environments</strong></h2>
<p>Once you've defined your environment boundaries, the next step is naming. While I prefer a “<strong>treat them as cattle, not pets</strong>” mindset, stakeholders usually prefer meaningful names.</p>
<p>A practical trick I’ve learned: <strong>always include a numeric suffix</strong>, like <code>dev-001</code> or <code>uat-02</code>. It sounds simple, but it's incredibly effective — because no matter how many environments you start with, you'll always need more later. This naming convention makes it easy to scale your environments without resorting to naming gymnastics.</p>
<h2 id="heading-aspirations-automation-and-production-parity"><strong>Aspirations: Automation and Production Parity</strong></h2>
<p>Once your environments and naming conventions are in place, the long-term goal is automation:</p>
<ul>
<li><p>Automate environment <strong>creation and teardown</strong></p>
</li>
<li><p>Make it easy to spin up <strong>production-like environments</strong></p>
</li>
<li><p>Enable <strong>regression testing</strong> in consistent, isolated setups</p>
</li>
</ul>
<p>These efforts can be time-consuming upfront but pay off significantly in delivery speed and engineering confidence. Of course, it’s essential to balance these aspirations with business priorities. Delivering value to users comes first — but carving out time for automation is what sets teams up for long-term success.</p>
<p>One tool I’m particularly curious about is <strong>SQLMesh</strong>. It promises branch-based environments, automated model testing, and declarative change tracking — all of which seem like a natural fit for solving the isolation problem at the data transformation layer. I haven’t tried it in anger yet, but it’s definitely on my list to explore as we continue refining our approach to Databricks environments.</p>
<h2 id="heading-final-thoughts"><strong>Final Thoughts</strong></h2>
<p>The challenges of environment isolation aren’t new — they just evolve with the tools we use. Whether it's a traditional database in a three-tier architecture or a modern platform like Databricks, the underlying principles remain the same: <strong>avoid shared mutable state</strong>, <strong>automate what you can</strong>, and <strong>build with repeatability in mind</strong>.</p>
]]></content:encoded></item><item><title><![CDATA[What's with a name (in Databricks)]]></title><description><![CDATA[Naming conventions are often a contentious topic among architects. In Databricks, these debates become even more pronounced due to the extensive platform work required to establish a Databricks implementation.
In an AWS-based Databricks setup, naming...]]></description><link>https://data.kurdapyo.com/whats-with-a-name-in-databricks</link><guid isPermaLink="true">https://data.kurdapyo.com/whats-with-a-name-in-databricks</guid><category><![CDATA[Databricks]]></category><category><![CDATA[unity catalog]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Fri, 11 Jul 2025 06:10:14 GMT</pubDate><content:encoded><![CDATA[<p>Naming conventions are often a contentious topic among architects. In Databricks, these debates become even more pronounced due to the extensive platform work required to establish a Databricks implementation.</p>
<p>In an AWS-based Databricks setup, naming conventions extend to AWS resources like VPNs, S3 buckets, IAM Roles, and IAM Policies. Additionally, with SCIM integration, service principals and groups are synced into Databricks, retaining their original names.</p>
<p>In Databricks, you'll encounter a mix of naming styles, including snake_case, kebab-case, camelCase, and PascalCase. This mix can include inconsistently named resources and occasionally misspelled ones that, for various reasons, aren't worth the effort to correct. It can be quite frustrating.</p>
<h2 id="heading-give-to-caesar-the-things-which-are-caesars">Give to Caesar the things which are Caesar’s</h2>
<p>Typically, different teams manage the cloud and Databricks environments. It’s essential to adhere to the cloud team’s established naming conventions; they service other technological needs aside from data, so they know their world best. Cloud resources often use kebab-case because bucket names become part of URLs. Interestingly, Google advises against using underscores in URLs, as it doesn’t treat them as word separators (e.g., my_site is read as mysite).</p>
<p>Databricks starts with the workspace name, which appears in a URL and should therefore be in kebab-case.</p>
<p>For groups and users, it is common practice to implement SCIM (System for Cross-domain Identity Management) integration: groups and users are synced from ActiveDirectory/IdentityNow into Databricks, carrying their original names over. For service principals, I’ve always seen kebab-case used. There is also the option of creating Databricks groups separate from the SCIM-integrated ones; these often just get the same convention as service principals.</p>
<h2 id="heading-enter-unity-catalog">Enter Unity Catalog</h2>
<p>Unity Catalog introduces additional complexity. Most resources, like Catalogs, Schemas, Tables, and Views, are used in SQL, where snake_case or UPPER_SNAKE_CASE is preferred. SQL syntax generally handles these well, as many databases aren't case-sensitive unless names are enclosed in double quotes. Using kebab-case can be problematic because the dash might be interpreted as a subtraction sign, necessitating double quotes.</p>
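<p>The dash problem is easy to demonstrate (using SQLite here purely as a stand-in; the double-quoting rule is standard SQL):</p>
<pre><code class="lang-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "my-table" (id INTEGER)')
conn.execute('INSERT INTO "my-table" VALUES (1)')

# Quoted, the kebab-case name resolves to the table.
print(conn.execute('SELECT count(*) FROM "my-table"').fetchone()[0])  # 1

# Unquoted, my-table is read as the expression my minus table and fails.
try:
    conn.execute("SELECT count(*) FROM my-table")
except sqlite3.OperationalError as exc:
    print("error:", exc)
</code></pre>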
<p>PascalCase and camelCase are popular among Java developers and in SQL Server, but using these conventions requires double quotes due to case sensitivity.</p>
<p>Beyond SQL objects, resources like storage credentials and external locations often follow kebab-case, especially when linked to S3 buckets.</p>
<h2 id="heading-abcs">ABCs</h2>
<p>Ultimately, the key to effective naming is to Always Be Consistent. While smaller projects often achieve this, larger implementations may struggle due to the involvement of many people, time constraints, and limited reviews. Therefore, in addition to striving for consistency, it's important to Also Be Considerate. People generally do their best with the knowledge and resources available, and while names that violate standards can be changed, consistency might not always be prioritized due to other pressing business needs.</p>
]]></content:encoded></item><item><title><![CDATA[IAC Face-Off: Databricks Asset Bundles vs Terraform]]></title><description><![CDATA[Data Engineers often have to juggle both data processing and DevOps pipelines. I’ve worked with many skilled data professionals, but they tend to shy away from the DevOps side of things. Often, I get asked to take over this aspect due to my backgroun...]]></description><link>https://data.kurdapyo.com/iac-face-off-databricks-asset-bundles-vs-terraform</link><guid isPermaLink="true">https://data.kurdapyo.com/iac-face-off-databricks-asset-bundles-vs-terraform</guid><category><![CDATA[Databricks]]></category><category><![CDATA[Databricks asset bundles]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Sun, 06 Jul 2025 12:13:28 GMT</pubDate><content:encoded><![CDATA[<p>Data Engineers often have to juggle both data processing and DevOps pipelines. I’ve worked with many skilled data professionals, but they tend to shy away from the DevOps side of things. Often, I get asked to take over this aspect due to my background.</p>
<p>For Databricks, resource deployment has always been a combination of Terraform and Databricks Asset Bundles. Terraform likely needs no introduction, as it’s prevalent in infrastructure deployment nowadays; other options like Pulumi or even Ansible are a distant second. While I’m no expert, I have a working understanding of how to implement changes with it.</p>
<p>Databricks Asset Bundles (DABs), however, are not as widely known. Before DABs, deploying jobs and workflows in Databricks was converging on DBX. Eventually, Databricks added the bundle command to the Databricks CLI, and it’s now being pushed more and more.</p>
<p>When I first heard about DABs from our Solutions Architect, it seemed like we could use it for all resource deployments. I was disappointed to find out later that it only deployed Workflows and DLT Pipelines.</p>
<h2 id="heading-whats-missing">What's Missing?</h2>
<p>All the platform-related components, of course. Our project involved setting up Databricks on AWS from scratch, and we aimed to minimize manual configurations (clickops) while adhering to strict security protocols. Both Account Admin and workspace admin privileges were tightly controlled, as they should be.</p>
<p>We needed to deploy everything from the metastore, workspaces, groups, and service principals to catalogs, schemas, tables, views, and grants. However, aside from Workflows and DLT Pipelines, DABs couldn't create any of these components—not even tables and views. While we could write code to create these using the Databricks API, it required running them through notebooks, leading to less elegant, more cumbersome code compared to the declarative style of Terraform.</p>
<h2 id="heading-what-we-did-eventually">What we did eventually</h2>
<p>Terraform continues to be our primary tool for provisioning most resources, including Catalogs, Schemas, External Locations, and Volumes. It also manages Shared Databricks Clusters and SQL Warehouses effectively. To maintain simplicity and security, we limit grants to the Schema level, avoiding individual object grants.</p>
<p>Alas, Terraform is not very common in our current crew of data engineers. This equated to a higher platform engineer to data engineer ratio for us.</p>
<p>We still rely on Databricks Asset Bundles for deploying Databricks workflows, as they offer a straightforward method to deploy the same code across different environments.</p>
<p>For creating tables and views, we utilize DBT (Data Build Tool), which provides a declarative approach to object creation. As for User-Defined Functions (UDFs), they might still require the less elegant notebook approach, but fortunately, we haven't needed to create any yet.</p>
<p>I am hopeful that Databricks Asset Bundles will eventually support a broader range of resources. The tool is rapidly evolving, and I remain optimistic about its future capabilities.</p>
]]></content:encoded></item><item><title><![CDATA[Fivetran Schema Change Handling]]></title><description><![CDATA[Schema changes can be a significant challenge for many ETL pipelines. In the past, changes were more manageable because systems operated independently. However, with the rise of SaaS products, out-of-the-box solutions, and frequent release cycles, ch...]]></description><link>https://data.kurdapyo.com/fivetran-schema-change-handling</link><guid isPermaLink="true">https://data.kurdapyo.com/fivetran-schema-change-handling</guid><category><![CDATA[fivetran]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Fri, 04 Jul 2025 10:45:44 GMT</pubDate><content:encoded><![CDATA[<p>Schema changes can be a significant challenge for many ETL pipelines. In the past, changes were more manageable because systems operated independently. However, with the rise of SaaS products, out-of-the-box solutions, and frequent release cycles, change has become a constant, and occasional schema changes must be addressed.</p>
<h2 id="heading-setting-the-scene"><strong>Setting the Scene</strong></h2>
<p>In our scenario, our data source is a Snowflake database managed by a SaaS vendor. We decided to use Fivetran to sync data from Snowflake into our target database. The SaaS vendor follows a weekly release cycle for minor updates and a monthly cadence for major releases, with changes occurring in different environments on varying dates.</p>
<p>We set up Fivetran replication from this Snowflake source. Our connector is configured to sync a substantial number of tables, and we have been syncing daily for several days without issues.</p>
<h2 id="heading-cue-in-the-schema-change"><strong>Cue in the Schema Change</strong></h2>
<p>One day, our sync did not complete successfully. We received a status of “Rescheduled” with the reason “Unsupported schema change requires table resync: ADD_COLUMN” for one of the tables.</p>
<p>As we were still new to analyzing the Fivetran logs (which are not available in the Fivetran Console), we were unsure of what had happened. We asked questions within the team, and while it was a learning experience, it was challenging at the time. Here are the questions we faced:</p>
<h3 id="heading-question-1-what-column-was-added">Question 1: What column was added?</h3>
<p>This information was not readily available. However, by querying the Fivetran logs for <code>alter_table</code> events, we could identify the table and the column that was added.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>   *
<span class="hljs-keyword">FROM</span>     &lt;destination_db&gt;.&lt;destination_schema&gt;.log
<span class="hljs-keyword">WHERE</span>    connector_id = :connector_id
<span class="hljs-keyword">AND</span>      message_event = <span class="hljs-string">'alter_table'</span>
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> time_stamp <span class="hljs-keyword">DESC</span>;
</code></pre>
<h3 id="heading-question-2-what-happened-to-the-sync-of-the-other-tables">Question 2: What happened to the sync of the other tables?</h3>
<p>We were initially unaware of the status of the other tables. The Fivetran Console did not provide this information. However, we later learned that the other tables had indeed been synced. This information can also be queried from the logs.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>   message_data:<span class="hljs-keyword">schema</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">"schema"</span>,
         message_data:<span class="hljs-keyword">table</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">"table"</span>,
         message_data:<span class="hljs-keyword">count</span>
<span class="hljs-keyword">FROM</span>     &lt;destination_db&gt;.&lt;destination_schema&gt;.log
<span class="hljs-keyword">WHERE</span>    connector_id = :connector_id
<span class="hljs-keyword">AND</span>      message_event = <span class="hljs-string">'records_modified'</span>
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> time_stamp <span class="hljs-keyword">DESC</span>;
</code></pre>
<h3 id="heading-question-3-should-we-just-rerun-the-sync">Question 3: Should we just rerun the sync?</h3>
<p>If we had relied on the Fivetran scheduler, it would have retried the sync automatically without any intervention from us. So, we decided to rerun the sync.</p>
<p>What we now understand is that Fivetran, on that first sync that ended in a reschedule, had already altered the table with the schema change. At this point, it simply required an acknowledgment to proceed with a full sync for that table.</p>
<p>This realization was surprising. I initially thought that Fivetran would detect the schema change and, upon rerunning the sync, would then alter the table and perform the sync. However, that was not the case.</p>
<p>This issue is somewhat unique to us because we do not own the source. But imagine this happening to a table of multi-terabyte size. Attempting to resync could take days or even weeks. And there was no way of getting around it.</p>
<h2 id="heading-rtfm">RTFM</h2>
<p>The documentation clarified the issue and explained the behavior we saw.</p>
<p><a target="_blank" href="https://fivetran.com/docs/connectors/databases/snowflake#automatictableresyncs">https://fivetran.com/docs/connectors/databases/snowflake#automatictableresyncs</a></p>
<blockquote>
<h3 id="heading-automatic-table-re-syncs">Automatic table re-syncs</h3>
<p>For tables with a primary key, we support ADD COLUMN DDL operations with null or default static values. All other table or column alterations trigger an automatic table re-sync. For tables without a primary key, any DDL operation triggers an automatic table re-sync.</p>
</blockquote>
<p>Apparently the fault was ours, though we can’t really do anything about it since the Snowflake source is vendor-owned.</p>
<h2 id="heading-what-to-do-now">What to do now</h2>
<p>Sad to say, not much could be done.</p>
<p>We educated our support team about this behavior so they can watch for it and prepare communications when these occurrences arise.</p>
<p>We also raised the issue with Fivetran and made a feature request.</p>
<p>Now we wait.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">📧</div>
<div data-node-type="callout-text">We'd love to hear from you! Share your experiences with schema changes in ETL processes in the comments below. Have questions or feedback on the strategies discussed? Don't hesitate to reach out.</div>
</div>]]></content:encoded></item><item><title><![CDATA[Upset with Fivetran Scheduler Offsets]]></title><description><![CDATA[A few months ago, I finally had the opportunity to use Fivetran in a real-world project. To provide some background, this was a greenfield project involving a financial system, tooling and architecture had been selected before my arrival. The tech st...]]></description><link>https://data.kurdapyo.com/upset-with-fivetran-scheduler-offsets</link><guid isPermaLink="true">https://data.kurdapyo.com/upset-with-fivetran-scheduler-offsets</guid><category><![CDATA[control-m]]></category><category><![CDATA[fivetran]]></category><category><![CDATA[Orchestration]]></category><dc:creator><![CDATA[Kurdapyo Data Engineer]]></dc:creator><pubDate>Tue, 24 Jun 2025 09:44:20 GMT</pubDate><content:encoded><![CDATA[<p>A few months ago, I finally had the opportunity to use Fivetran in a real-world project. To provide some background, this was a greenfield project involving a financial system; tooling and architecture had been selected before my arrival. The tech stack included Databricks for the Data Lake (or Delta Lake), with Fivetran chosen as the data acquisition tool. The company was using Control-M as its scheduler, but no one really wanted to use that. For this specific task, it was decided to utilize the Fivetran Scheduler to trigger the data acquisition processes.</p>
<h2 id="heading-my-mission-or-so-i-thought">My Mission (or so I thought)</h2>
<p>I was brought into this project to work on Databricks components. Despite this, I was eager to see Fivetran in action. Architect colleagues I’d collaborated with on DMS work often touted how much easier Fivetran would make such acquisition or migration tasks.</p>
<p>I diligently explored how to orchestrate the data acquisition with Databricks workflows. As expected, Databricks didn’t have direct connectivity to Fivetran. This is a common security practice. It prevents data engineers from being tempted to create code to directly access the Fivetran APIs, potentially leading to messy implementations.</p>
<p>I raised the concern that orchestration should go through the company’s preferred orchestration tool, even if it was considered outdated technology. Unfortunately, my concerns were ignored. With my well-defined JIRA card languishing without any progress, I decided to press on with the work regardless.</p>
<p>With no API connectivity, I began examining Fivetran logs, which were being synced by Fivetran to Databricks. BTW, Fivetran doesn’t provide logs on its web console; the only way a customer can access logs is by syncing log data to a destination using a Fivetran connector. (This novel topic might be a future blog post.)</p>
<p>The Fivetran log connector had a 5-15 minute delay, which was acceptable for now. SLAs would be dealt with much later on.</p>
<h2 id="heading-initial-signs">Initial Signs</h2>
<p>We had a job scheduled at 4 p.m. From the logs, I noticed syncs were only starting around 4:11 p.m. This seemed odd, but I didn't think much of it at the time, as it was still early in the project and nothing was stable yet.</p>
<p>After a few more weeks, we were preparing to promote the code to the next higher environment. We deployed the same code (Fivetran was deployed using Terraform, by the way) for a 4 p.m. sync. We waited, but by 4:15, the job hadn't started. By 4:30, still nothing. 4:45, nothing. Finally, the job only started at 4:54 p.m.</p>
<h2 id="heading-rtfm">RTFM</h2>
<p>Digging through the Fivetran documentation, I stumbled upon this warning:</p>
<p><a target="_blank" href="https://fivetran.com/docs/core-concepts/syncoverview">https://fivetran.com/docs/core-concepts/syncoverview</a></p>
<blockquote>
<p>When you add a new destination, Fivetran assigns it a fixed time offset. The offset can be any random value in minutes ranging from 0 to 60. It is derived from the destination ID hash. This offset is shared by every connector in the destination. The offset value remains the same regardless of the set sync frequency.</p>
</blockquote>
<p>Let that sink in: a random 0-60 minute offset. And it can’t be modified, even through a support request (trust me, I tried).</p>
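<p>To make the behavior concrete, here is a purely illustrative sketch of how a hash-derived offset behaves. Fivetran does not publish its actual algorithm, and the function name and hashing scheme below are my own invention; the point is only that such an offset is stable for a given destination yet effectively random across destinations.</p>

```python
import hashlib


def sync_offset_minutes(destination_id: str) -> int:
    """Illustrative only: derive a stable minute offset from an ID hash.

    This is NOT Fivetran's actual algorithm; it just demonstrates why the
    offset looks random per destination but never changes for one
    destination, regardless of the configured sync frequency.
    """
    digest = hashlib.sha256(destination_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 60  # always in the 0-59 minute range


# The same destination ID always yields the same offset, so a "4 p.m." sync
# in one environment can consistently fire at 4:11, and in another at 4:54.
```

<p>This also explains why the two environments behaved so differently for us: each destination gets its own fixed offset, so identical Terraform code can produce very different effective start times.</p>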
<h2 id="heading-denouement">Denouement</h2>
<p>This discovery threw a wrench in our plans to avoid using Control-M. After sinking a moderate amount of effort into a log-based solution, we pivoted to developing a simple script to be triggered by Control-M. In terms of code, it was the simpler solution. However, in terms of infrastructure, it was a minor nightmare. We had to sort out an agent for Control-M, open up network connectivity, create a Fivetran service user to trigger the API calls, and manage secrets, among other tasks.</p>
<h2 id="heading-things-i-learned">Things I learned</h2>
<p>Avoid the Fivetran scheduler like the plague; the API is the way to go to trigger Fivetran syncs. You won’t be hit by an unreasonably high offset, and you’ll have more options to handle reschedules or errors on syncs. I’ll rant about the Fivetran Scheduler retry logic another time.</p>
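<p>For the curious, triggering a sync via the API is a small amount of code. The sketch below builds the request against Fivetran’s documented <code>POST /v1/connectors/{connector_id}/sync</code> endpoint; the connector ID, key, and secret are placeholders, and you should double-check the endpoint and the <code>force</code> flag against the current Fivetran REST API docs before relying on it.</p>

```python
import base64
import json
import urllib.request

FIVETRAN_API = "https://api.fivetran.com/v1"


def build_sync_request(
    connector_id: str, api_key: str, api_secret: str
) -> urllib.request.Request:
    """Build (but do not send) a 'trigger sync' request for one connector.

    Per Fivetran's REST API docs: POST /v1/connectors/{id}/sync, with the
    API key/secret pair sent as HTTP basic auth.
    """
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return urllib.request.Request(
        url=f"{FIVETRAN_API}/connectors/{connector_id}/sync",
        # "force": interrupt an already-running sync rather than skipping
        data=json.dumps({"force": True}).encode(),
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# A scheduler such as Control-M would then run something like:
#   urllib.request.urlopen(build_sync_request("my_connector", key, secret))
```

<p>The nice part is that the sync starts immediately when the scheduler says so, with no destination offset in the way.</p>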
<p>Embrace the orchestration tool as your ally. While Control-M is still not my preferred choice, I recognize the robust infrastructure this organization has established around it, including change management processes, 24/7 support teams, alerts, and more. It does make me think of what hurdles would need to be cleared to get something like Airflow productionized.</p>
]]></content:encoded></item></channel></rss>