PySpark for Beginners: Understanding the Basics

Working with data in Python usually starts with tools like pandas. These tools are simple, powerful, and ideal for small to medium-sized datasets. But as soon as your data grows beyond what fits comfortably in memory, performance problems start to appear. This is where PySpark comes in.
Note that in this article I will often use the terms Spark and PySpark interchangeably. For our purposes, it doesn't matter, but you should remember that they are different. Spark is a comprehensive distributed computing framework (written in Scala), and PySpark is the Python API for Spark.
What is PySpark?
PySpark is a Python API for Apache Spark, a distributed computing framework for efficiently processing large volumes of data. Instead of running all the calculations on a single machine, Spark distributes the work across multiple machines (a cluster), allowing you to process data at scale while writing code that still feels familiar to Python users.
One of the key advantages of PySpark is that it removes much of the complexity of distributed programming. You don't need to manually manage threads, memory, or network connections. Spark handles these concerns for you, while you focus on describing what you want to do with the data rather than how it should be done.
If you're new to Spark, there are three key concepts you should understand before using it. These are:
1. Clusters
When people hear that Spark runs on a “cluster,” it can sound intimidating. In reality, you don't need deep knowledge of distributed systems to get started. A cluster is simply a group of servers working together that can share a workload. In a Spark application running on a cluster, one machine acts as the driver, coordinating the work, while the others act as executors, performing calculations on partitions of the data. When an executor has finished its work, it returns its results to the driver, and the driver can do whatever is needed with the final result set.
                 ┌───────────────────┐
                 │       Driver      │
                 │ (your PySpark app)│
                 └─────────┬─────────┘
                           │
                           │  The driver farms out work
                           │  to one or more executors
        ┌──────────────────┼──────────────────┐
        │                  │                  │
┌───────▼────────┐ ┌───────▼────────┐ ┌───────▼────────┐
│   Executor 1   │ │   Executor 2   │ │   Executor N   │
│ processes part │ │ processes part │ │ processes part │
│  of the data   │ │  of the data   │ │  of the data   │
└────────────────┘ └────────────────┘ └────────────────┘
Just remember, you don't need a cluster of machines to run Spark. If you use PySpark locally, Spark simulates a cluster on your laptop or PC using multiple cores. One of the strengths of PySpark is that the same code can later be deployed to a real cluster, either in the cloud or on premises, with very few changes.
This separation of coordination and computation is what enables Spark to scale. As datasets grow, more executors can be added to process the data in parallel, reducing runtime without requiring changes to your code.
2. The Spark DataFrame
At the heart of PySpark is the DataFrame API, which is the main way you work with data in Spark. A DataFrame is simply a table of data, made up of rows and columns – very similar to a table in a database or a DataFrame in pandas. If you've used SQL or pandas before, the basic concepts will feel familiar.
With Spark DataFrames, you can perform common data operations such as sorting rows, selecting columns, grouping data, joining tables, and calculating summaries such as counts or averages. These operations are easy to read and write, allowing you to focus on what you want to do with the data rather than the technical details of how it is done.
What makes Spark special is what happens behind the scenes. Spark automatically determines the most efficient way to run your DataFrame operations and executes them in parallel across multiple computers in the cluster. You don't have to manage this yourself – Spark handles things like data partitioning, task coordination, and failure recovery if something goes wrong.
Because of this, Spark DataFrames can handle very large datasets, even those that are too large to fit into the memory of a single machine. At the same time, they provide a simple and familiar interface, making PySpark a powerful yet accessible tool for working with big data.
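To give you a feel for the API, here is a minimal sketch of the kind of DataFrame operations described above. The data and column names are invented for illustration, and it assumes an existing SparkSession called spark (we create one in Example 1 later).
from pyspark.sql import functions as F
# A small, made-up DataFrame of orders
orders = spark.createDataFrame(
    [("Alice", "Books", 12.99), ("Bob", "Games", 59.99), ("Alice", "Games", 24.50)],
    ["customer", "category", "amount"],
)
# The same verbs you know from SQL or pandas: filter, group, aggregate, sort
summary = (
    orders
    .filter(F.col("amount") > 10)                      # keep rows over 10
    .groupBy("customer")                               # group by customer
    .agg(F.count("*").alias("orders"),                 # orders per customer
         F.sum("amount").alias("total_spent"))         # total spent per customer
    .orderBy(F.col("total_spent").desc())              # sort by total, descending
)
summary.show()   # an action: this is when Spark actually does the work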
3. Lazy vs eager execution
Another feature of PySpark that you should understand is its lazy execution model, as opposed to the eager execution you may be used to.
Most Python data libraries, such as pandas, execute eagerly. This means that when you perform an operation, it is computed immediately, followed by the next operation, and so on.
PySpark handles this differently by using a technique called lazy execution. When you write data manipulations, such as selecting columns or sorting rows, Spark doesn't apply them immediately. Instead, it builds up an execution plan and only performs the computation when an action (such as displaying results or writing data to disk) is triggered. This allows Spark to optimize the whole workflow before executing it, making your code run more efficiently without any extra effort on your part.
Eager execution (e.g. pandas)
data ──filter──► result (computed immediately)
In pandas, each operation runs as soon as it is called. This is
intuitive but can be inefficient for large datasets.
PySpark uses lazy execution.
Lazy execution (PySpark)
data ──filter──►
│
└─groupby──► (plan builds here)
│
└─agg──► (still no execution)
│
action ──► executes here
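In PySpark code, that distinction looks roughly like this – a minimal sketch, assuming a DataFrame called df already exists and using placeholder column names:
from pyspark.sql import functions as F
# Transformations: Spark only records these in its plan, no data is processed yet
filtered = df.filter(F.col("amount") > 100)                             # no work done
totals = filtered.groupBy("city").agg(F.sum("amount").alias("total"))   # still no work
# Action: only now does Spark execute the whole plan
totals.show()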
To drive this point home, consider the following scenario. Let's say we have a dataset of 10 million records that we want to…
a) Add a new empty column to it called X
b) Filter the data in some way causing us to remove 50% of the records.
c) Perform a calculation on the remaining records so that the new column X contains the MAX of the other values in that row.
d) Print the row with the highest value of X
In a system that uses eager execution, like pandas, every step is performed exactly as described above. For 10 million records, it will look like this:
- Add column: The program creates a new version of the 10-million-row dataset in memory, adding column X.
- Filter: The program scans all 10 million rows, removes 5 million of them, and writes a new 5-million-row dataset in memory.
- Calculate: It computes the MAX value for every remaining row and updates column X.
- Print: It finds the top row and shows it to you.
The problem is that we did a huge amount of “heavy lifting” (adding a column to 10 million rows) to quickly throw away part of that work in the next step.
On the other hand, Spark, because of its lazy model, does not do any work when you define steps (a), (b), or (c). Instead, it builds a logical plan (also called a DAG – Directed Acyclic Graph) describing the job.
When you finally trigger step (d) – the action – Spark's optimizer looks at the whole plan and realizes that it can work more intelligently:
- Predicate pushdown: Spark detects the filter (remove 50% of the records). Instead of adding column X to all 10 million rows, it moves the filter to the beginning.
- Projection: It adds column X only to the 5 million rows that remain after the filter.
- Result: It avoids processing 5 million records, saving roughly 50% of the memory and CPU time.
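If you want to see this for yourself, PySpark lets you inspect the plan it intends to run before anything executes. Below is a minimal sketch of the scenario above – big_df, col_a, and col_b are invented names, and the exact plan you get depends on the data source – using .explain() to print the optimized plan before triggering the action:
from pyspark.sql import functions as F
# Steps (a)-(c): only a plan is built here, no data is touched yet
result = (
    big_df
    .withColumn("X", F.greatest("col_a", "col_b"))   # add X as a row-wise MAX
    .filter(F.col("col_a") > 0)                      # drop roughly half the rows
)
# Inspect the optimized plan: with a filterable source such as Parquet,
# Spark will typically push the filter down before the column is added
result.explain()
# Step (d): the action that finally triggers execution
result.orderBy(F.col("X").desc()).show(1)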
Setting up the dev environment
Okay, that's enough theory. Let's look at how to install PySpark on your system and run some example code snippets. Since this is an introductory article, creating a real-world multi-node cluster is beyond its scope. But as I said before, Spark can simulate a cluster on your PC or laptop as long as it has multiple cores, which it will if your system is less than 10 years old.
The first thing we'll do is set up a separate development environment for this project, making sure our projects stay isolated and don't interfere with each other. I'm using WSL2 Ubuntu for Windows and Conda for this part, but feel free to use whatever platform and tooling you're familiar with.
Install PySpark, etc.
# 1. Create a new environment with Python 3.11 (very stable for Spark)
conda create -n spark_env python=3.11 -y
# 2. Activate it
conda activate spark_env
# 3. Install PySpark and PyArrow (needed for Parquet files)
pip install pyspark pyarrow jupyter
To check that PySpark is installed correctly, type the pyspark command in a terminal window.
$ pyspark
Python 3.11.14 | packaged by conda-forge | (main, Oct 22 2025, 22:46:25) [GCC 14.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: Using incubator modules: jdk.incubator.vector
WARNING: package sun.security.action not in java.base
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
26/01/15 16:15:21 WARN Utils: Your hostname, tpr-desktop, resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
26/01/15 16:15:21 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/15 16:15:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARNING: A terminally deprecated method in sun.misc.Unsafe has been called
WARNING: sun.misc.Unsafe::arrayBaseOffset has been called by org.apache.spark.unsafe.Platform (file:/home/tom/miniconda3/envs/pandas_to_pyspark/lib/python3.11/site-packages/pyspark/jars/spark-unsafe_2.13-4.1.1.jar)
WARNING: Please consider reporting this to the maintainers of class org.apache.spark.unsafe.Platform
WARNING: sun.misc.Unsafe::arrayBaseOffset will be removed in a future release
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.1.1
      /_/
Using Python version 3.11.14 (main, Oct 22 2025 22:46:25)
Spark context Web UI available at
Spark context available as 'sc' (master = local[*], app id = local-1768493723158).
SparkSession available as 'spark'.
>>>
If you don't see the Spark welcome banner, then something has gone wrong, and you should double-check your installation.
Example 1 – Creating a local cluster
This is actually very simple. Just type the following in your notebook.
from pyspark.sql import SparkSession
# Initialize the Spark Session
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("MyLocalCluster") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()
# Verify the cluster is running
print(f"Spark is running version: {spark.version}")
print(f"Master URL: {spark.sparkContext.master}")
#
# The output
#
Spark is running version: 4.1.1
Master URL: local[*]
The SparkSession concept is important. In the early days of Spark, users had to juggle multiple “entry points” (such as a SparkContext for core functions, a SQLContext for DataFrames, and a HiveContext for Hive tables). It was confusing for beginners.
SparkSession was introduced in Spark 2.0 as a “one stop shop” for everything. It is a single point of entry for interacting with Spark functionality.
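As a quick, hedged illustration (reusing the spark session created in Example 1), that one object now covers what the old contexts used to do:
# One entry point for everything that previously needed separate contexts
print(spark.version)        # version information
sc = spark.sparkContext     # the lower-level SparkContext is still there if you need it
# DataFrame and SQL functionality come from the same session
nums = spark.range(5)                   # a tiny one-column DataFrame (id: 0..4)
nums.createOrReplaceTempView("numbers")
spark.sql("SELECT id * 2 AS doubled FROM numbers").show()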
Example 2 – Creating a DataFrame
Creating DataFrames and manipulating the data they contain is what you'll spend most of your time doing in PySpark. And it's very straightforward. Here, we define a DataFrame containing three records and three named columns.
# 1. Define your data as a list of tuples
data = [
("Alice", 34, "New York"),
("Bob", 45, "London"),
("Catherine", 29, "Paris")
]
# 2. Define your column names
columns = ["Name", "Age", "City"]
# 3. Create the DataFrame
df = spark.createDataFrame(data, columns)
# 4. Show the result
df.show()
#
# The output
#
+---------+---+--------+
| Name|Age| City|
+---------+---+--------+
| Alice| 34|New York|
| Bob| 45| London|
|Catherine| 29| Paris|
+---------+---+--------+
In practice, most DataFrames you use will be created by reading data from a file or database. Create a CSV file named sales_data.csv on your system with the following content.
transaction_id,customer_name,net_amount,tax_amount, is_member
101,Alice,250.50,25.05,true
102,Bob,120.00,6.00, false
103,Charlie,450.75,25.07,true
104,David,89.99,5.73,false
Creating a DataFrame from a file like this is easy:
# Load the CSV file
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("sales_data.csv")
# Show the data
print("Dataframe Contents:")
df.show()
# Show the data types (Schema)
print("Data Schema:")
df.printSchema()
#
# The output
#
Dataframe Contents:
+--------------+-------------+----------+----------+----------+
|transaction_id|customer_name|net_amount|tax_amount| is_member|
+--------------+-------------+----------+----------+----------+
| 101| Alice| 250.5| 25.05| true|
| 102| Bob| 120.0| 6.0| false|
| 103| Charlie| 450.75| 25.07| true|
| 104| David| 89.99| 5.73| false|
+--------------+-------------+----------+----------+----------+
Data Schema:
root
|-- transaction_id: integer (nullable = true)
|-- customer_name: string (nullable = true)
|-- net_amount: double (nullable = true)
|-- tax_amount: double (nullable = true)
|-- is_member: string (nullable = true)
Example 3 – Data processing
Of course, once you have your input data in a data frame, the next thing you'll want to do is process or manipulate it in some way. It's easy too. Referring to the sales_data we just loaded, let's say we want to calculate the total price (net + tax) and the tax rate as a percentage of the total price for each record and add those to our original data frame.
from pyspark.sql import functions as F
# 1. Add 'gross_amount' by adding net and tax
# 2. Add 'tax_percentage' by dividing tax by the new gross amount
df_extended = df.withColumn("gross_amount", F.col("net_amount") + F.col("tax_amount")) \
    .withColumn("tax_percentage",
                (F.col("tax_amount") / (F.col("net_amount") + F.col("tax_amount"))) * 100)
# 3. Optional: Round the percentage to 2 decimal places for readability
df_extended = df_extended.withColumn("tax_percentage", F.round(F.col("tax_percentage"), 2))
# Show the new columns along with the old ones
df_extended.show()
#
# The output
#
+--------------+-------------+----------+----------+----------+------------+--------------+
|transaction_id|customer_name|net_amount|tax_amount| is_member|gross_amount|tax_percentage|
+--------------+-------------+----------+----------+----------+------------+--------------+
| 101| Alice| 250.5| 25.05| true| 275.55| 9.09|
| 102| Bob| 120.0| 6.0| false| 126.0| 4.76|
| 103| Charlie| 450.75| 25.07| true| 475.82| 5.27|
| 104| David| 89.99| 5.73| false| 95.72| 5.99|
+--------------+-------------+----------+----------+----------+------------+--------------+
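Finally, if you want to persist the result rather than just display it, Spark can write DataFrames to columnar formats such as Parquet. Here is a minimal sketch, with the output path made up:
# Write the extended DataFrame out as Parquet (an action, so this triggers execution)
df_extended.write.mode("overwrite").parquet("sales_extended.parquet")
# Read it back later as a new DataFrame
df_back = spark.read.parquet("sales_extended.parquet")
df_back.show()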
Summary
That concludes our brief journey into the world of distributed computing with PySpark. I explained what PySpark is and why you should consider using it when the data you're processing exceeds what fits in memory. In short, PySpark's ability to harness large multi-node clusters, its lazy execution model, and its DataFrame data structure make it an ideal powerhouse for data processing.
PySpark is widely used in data engineering, analytics, and machine learning pipelines. It integrates well with cloud platforms, supports various data sources (such as CSV, Parquet, and databases), and scales from laptops to large production clusters.
If you're comfortable with Python and want to work with large datasets without abandoning standard syntax, PySpark is a great next step. It bridges the gap between simple data analysis and large-scale data processing, making it an essential tool for anyone entering the world of big data.
Hopefully, you can use my simple coding examples and explanations to take the next step: using PySpark in the real world, on a real cluster, doing real big data processing.



