Skip to main content

Command Palette

Search for a command to run...

Kurdapyo's Newbie Guide to PySpark

Updated
6 min read
K

I’m the kuya at Kurdapyo Labs — a recovering Oracle developer who saw the light and helped migrate legacy systems out of Oracle (and saved a lot of money doing it). I used to write PL/SQL, Perl, ksh, Bash, and all kinds of hand-crafted ETL. These days, I wrestle with PySpark, Airflow, Terraform, and YAML that refuses to cooperate. I’ve been around long enough to know when things were harder… and when they were actually better. This blog is where I write (and occasionally rant) about modern data tools — especially the ones marketed as “no-code” that promise simplicity, but still break in production anyway.

Disclaimer: These are my thoughts—100% my own, not my employer’s, my client’s, or that one loud guy on tech Twitter. I’m just sharing what I’ve learned (and unlearned) along the way. No promises, no warranties—just real talk, some opinions, and the occasional coffee/beer-fueled rant.

If something here helps you out, awesome! If you think I’ve missed something or want to share your own take, I’d love to hear from you. Let’s learn from each other.

Lately, I've been working for an enterprise customer with hundreds of data engineers in a chapter. They're new in their Databricks journey, and their language of choice is SQL with a sprinkling of oracle PL/SQL, redshift stored procedures, some shell scripting, some panda-esq python code thrown in the mix. Having done more Databricks work more than the average data engineer, little old rebellious me has been asked to come out with some standards on top of my regular programming. I couldn't get to the task immediately, and ended up in some meetings to review initial coding attempts by our data engineers. Some patterns emerged and now I get to write them down.

Mindset Shift

Coding for Databricks using PySpark comes with a slight paradigm shift. Instead of the usual procedural approach, involving row-by-row processing with cursors and loops, one must think of how to do operations in bulk. How to setup dataframes with data transformations that will be lazily evaluated. Storage is not tightly coupled with the database engine, hence the usual context switching problems with PL/SQL hits extra hard. Understanding the columnar way the data is stored is important. There are no indexes and data is organized into files and partitions, whether explicitly or implicitly defined. Optimizing access paths is about eliminating partitions or files that do not need to be read.

SQL vs PySpark Syntax

Given that background, I present how an SQL statement actually looks like converted to PySpark. For example:

SELECT ename, job, sal
FROM emp
WHERE job IN ('CLERK', 'ANALYST') 
  AND sal > 1000
ORDER BY sal DESC;

This converted to PySpark looks like this:

from PySpark.sql import functions as F

df = spark.table("emp")

result = (
    df.filter(F.col("job").isin("CLERK", "ANALYST"))
    .filter(F.col("sal") > 1000)
    .select("ename", "job", "sal")
    .orderBy(col("sal").desc())
)

Some things stand out immediately. Firstly, we start with the FROM clause which initializes the dataframe. The next part is a set of filters, a select clause, and an order by clause. This introduces the chaining syntax which can be useful for refactoring later on. I would also highlight the fact that with PySpark, the filters should normally be defined before the select clause. This brings me to some key takeaways I've come to know about and want to impart.

Takeaway Boxes

Filtration is key

Think about the filters first. It's a good rule because this makes one think of how to filter a table so that the least amount of rows are subjected to the next steps. Multiple filter functions can be chained as well to avoid the multitude of AND clauses which I can never decide when to cut to a new line and indent for readability.

Be Lazy Until You Can't

Running the code above segways into the concept of lazy evaulation. Data Engineers should be conscious of the operations that are not lazily evaluated.

A df.count() actually executes, do not use it to check for the existence of a row.

A df.collect() executes and fetches all rows and puts it into an array of python rows which can easily be iterated. It should only be used on small data sets for possibly metadata.

A df.show() or a display(df) actually runs the query and displays it on the cell.

A df.write() should almost always be the last operation as this means the dataframe has all the lazy transformations it needs and can now be written to a table, or the storage.

Why not spark.sql

There is a tendency for a data engineer with insane SQL skills to write their queries and transformations in SQL statements when they are in a python notebook. Therein lies the problem, using spark.sql() means building dynamic SQL strings in python, F-strings can only get you so far. You've got to handle the WHERE clauses with multiple AND conditions. You'll have to handle subqueries and make sure that parenthesis are matching. Simple put, it quickly becomes a string manipulation exercise. Some of the points I raise regularly are:

  • SQL Injection - this used to be a mortal sin in database development

  • Using spark.sql means everything is a string. Datatypes need to be casted properly from string

  • Capitalization and SQL standards are optional and not enforced. Linting tools can easily refactor python PySpark functions, but the strings inside a spark.sql is free form.

Know your (data) types

I can usually tell a very good engineer by how they handle data types. In databases, using the correct native datatype gives significant gains in performance and storage efficiency. Relying on implicit casting or leaving types ambiguous often leads to subtle bugs, data quality issues, and degraded performance.

In PySpark, a strong grasp of data types is essential because you are working across two systems:

  • Python data types (e.g. int, float, str, datetime)

  • PySpark SQL data types (e.g. IntegerType, DoubleType, StringType, TimestampType)

Misalignment between these can introduce silent errors, especially during ingestion, transformation, and writes. Timestamps (or datetime in python) can also be easily confusing. Using the python datetime variables help in getting all the nuances on timestamps in place rather than converting to specific datetime/timestamp string formats. Databricks also by default does not display all the microseconds when it implicitly converts timestamps to string. Hence loss of precision happens when someone naively just uses strings.

Explicitly cast literals on spark dataframes:

from PySpark.sql import functions as F
from datetime import datetime

current_time = datetime.now()

df.filter(F.col("id") == F.lit(123).cast("int"))
df.withColumn("extract_timestamp", F.lit(current_time).cast("timestamp"))

Null and None Matters

Nulls are always a fun database concept. In PySpark, nulls actually highlight the importance of explicitly setting data types. In Python, None is the equivalent of the Spark null.

Shuffling Out

Wrapping up, I realize I haven't even discussed shuffling yet. I'll save that for later.

So shuffling out, the biggest shift when moving SQL/PLSQL-minded engineers to PySpark on Databricks isn’t just new syntax — it’s a different way of thinking about data. Instead of procedural, row-by-row logic, you design transformations as lazy, set-based operations on dataframes; instead of relying on indexes you reason about files, partitions and columnar layouts; and instead of micro-optimizations in a procedural loop you focus on eliminating unnecessary reads and choosing the right execution patterns.

For the next article I’ll dig into concrete patterns and anti-patterns — the idioms you’ll reach for most often and the traps you’ll want to avoid. I'll aim for runnable examples, checklist-style rules of thumb, and before/after comparisons so engineers coming from SQL or PL/SQL can map familiar concepts to PySpark best practices.