Kurdapyo's Newbie Guide to PySpark

Lately, I've been working for an enterprise customer with hundreds of data engineers in a chapter. They're new in their Databricks journey, and their language of choice is SQL with a sprinkling of oracle PL/SQL, redshift stored procedures, some shell scripting, some panda-esq python code thrown in the mix. Having done more Databricks work more than the average data engineer, little old rebellious me has been asked to come out with some standards on top of my regular programming. I couldn't get to the task immediately, and ended up in some meetings to review initial coding attempts by our data engineers. Some patterns emerged and now I get to write them down.

Mindset Shift

Coding for Databricks using PySpark comes with a slight paradigm shift. Instead of the usual procedural approach, involving row-by-row processing with cursors and loops, one must think of how to do operations in bulk. How to setup dataframes with data transformations that will be lazily evaluated. Storage is not tightly coupled with the database engine, hence the usual context switching problems with PL/SQL hits extra hard. Understanding the columnar way the data is stored is important. There are no indexes and data is organized into files and partitions, whether explicitly or implicitly defined. Optimizing access paths is about eliminating partitions or files that do not need to be read.

SQL vs PySpark Syntax

Given that background, I present how an SQL statement actually looks like converted to PySpark. For example:

SELECT ename, job, sal
FROM emp
WHERE job IN ('CLERK', 'ANALYST') 
  AND sal > 1000
ORDER BY sal DESC;

This converted to PySpark looks like this:

from PySpark.sql import functions as F

df = spark.table("emp")

result = (
    df.filter(F.col("job").isin("CLERK", "ANALYST"))
    .filter(F.col("sal") > 1000)
    .select("ename", "job", "sal")
    .orderBy(col("sal").desc())
)

Some things stand out immediately. Firstly, we start with the FROM clause which initializes the dataframe. The next part is a set of filters, a select clause, and an order by clause. This introduces the chaining syntax which can be useful for refactoring later on. I would also highlight the fact that with PySpark, the filters should normally be defined before the select clause. This brings me to some key takeaways I've come to know about and want to impart.

Takeaway Boxes

Filtration is key

Think about the filters first. It's a good rule because this makes one think of how to filter a table so that the least amount of rows are subjected to the next steps. Multiple filter functions can be chained as well to avoid the multitude of AND clauses which I can never decide when to cut to a new line and indent for readability.

Be Lazy Until You Can't

Running the code above segues into the concept of lazy evaluation. Data Engineers should be conscious of the operations that are eager actions which is the opposite of .

A df.count() actually executes, do not use it to check for the existence of a row.

A df.collect() executes and fetches all rows and puts it into an array of python rows which can easily be iterated. It should only be used on small data sets for possibly metadata.

A df.show() or a display(df) actually runs the query and displays it on the cell.

A df.write() should almost always be the last operation as this means the data frame has all the lazy transformations it needs and can now be written to a table, or the storage.

Why not spark.sql

Data engineers with insane SQL skills new to PySpark tend to write their queries and transformations in SQL statements when they are in a python notebook. Therein lies the problem, using spark.sql() means building dynamic SQL strings in python, F-strings can only get you so far. You've got to handle the WHERE clauses with multiple AND conditions. You've got to escape special characters. You'll have to handle subqueries and make sure that parenthesis are matching. Simple put, it quickly becomes a string manipulation exercise. Some of the points I raise regularly are:

SQL Injection - this used to be a mortal sin in database development
Using spark.sql means everything is a string. Datatypes need to be casted properly from string. Special characters need to be escaped properly to become the correct value for the SQL statement.
Capitalization and SQL standards are optional and not enforced. Linting tools can easily refactor python PySpark functions, but the strings inside a spark.sql is free form.

Know your (data) types

I can usually tell a very good engineer by how they handle data types. In databases, using the correct native datatype gives significant gains in performance and storage efficiency. Relying on implicit casting or leaving types ambiguous often leads to subtle bugs, data quality issues, and degraded performance.

In PySpark, a strong grasp of data types is essential because you are working across two systems:

Python data types (e.g. int, float, str, datetime)
PySpark SQL data types (e.g. IntegerType, DoubleType, StringType, TimestampType)

Misalignment between these can introduce silent errors, especially during ingestion, transformation, and writes. Timestamps (or datetime in python) can also be easily confusing. Using the python datetime variables help in getting all the nuances on timestamps in place rather than converting to specific datetime/timestamp string formats. Databricks also by default does not display all the microseconds when it implicitly converts timestamps to string. Hence loss of precision happens when someone naively just uses strings.

Explicitly cast literals on spark dataframes:

from PySpark.sql import functions as F
from datetime import datetime

current_time = datetime.now()

df.filter(F.col("id") == F.lit(123).cast("int"))
df.withColumn("extract_timestamp", F.lit(current_time).cast("timestamp"))

If needing to actually put datetimes into strings, use the isoformat. That format maintains readability and the option to put the desired time offset that makes comparison of the data easier later on.

Null and None Matters

Nulls are always a fun database concept. In PySpark, nulls actually highlight the importance of explicitly setting data types. In Python, None is the equivalent of the Spark null.

Shuffling Out

Wrapping up, I realize I haven't even discussed shuffling yet. I'll save that for later.

So shuffling out, the biggest shift when moving SQL/PLSQL-minded engineers to PySpark on Databricks isn’t just new syntax — it’s a different way of thinking about data. Instead of procedural, row-by-row logic, you design transformations as lazy, set-based operations on dataframes; instead of relying on indexes you reason about files, partitions and columnar layouts; and instead of micro-optimizations in a procedural loop you focus on eliminating unnecessary reads and choosing the right execution patterns.

For the next article I’ll dig into concrete patterns and anti-patterns — the idioms you’ll reach for most often and the traps you’ll want to avoid. I'll aim for runnable examples, checklist-style rules of thumb, and before/after comparisons so engineers coming from SQL or PL/SQL can map familiar concepts to PySpark best practices.

Kurdapyo's Newbie Guide to PySpark

Mindset Shift

SQL vs PySpark Syntax

Takeaway Boxes

Filtration is key

Be Lazy Until You Can't

Why not spark.sql

Know your (data) types

Null and None Matters

Shuffling Out

Comments

Laying bricks with Databricks

Databricks Antipatterns: Two Hard Lessons from the Field

More from this blog

Elements of Style and Substance: Databricks Edition

Databricks can have bad plans too

Databricks Antipatterns: Two Hard Lessons from the Field

Counting when it really counts

Command Palette

Mindset Shift

SQL vs PySpark Syntax

Takeaway Boxes

Filtration is key

Be Lazy Until You Can't

Why not spark.sql

Know your (data) types

Null and None Matters

Shuffling Out

Comments

Laying bricks with Databricks

Databricks Antipatterns: Two Hard Lessons from the Field

More from this blog