
How to Build an AI-Powered Weather ETL Pipeline with Databricks and GPT-4o: From API to Dashboard

Databricks recently shook up the data market by launching its Free Edition, with virtually all of the platform's functionality included. It's a great resource for learning and exploring, to say the least.

With that in mind, I've created an end-to-end project to help you learn the basics of the main resources within Databricks.

This project demonstrates a complete Extract, Transform, Load (ETL) workflow within Databricks. It integrates the OpenWeatherMap API for data retrieval and the OpenAI GPT-4o-mini model to provide personalized, weather-based clothing suggestions.

Let's learn more about it.

The project

The project covers a full data pipeline within Databricks, following these steps.

  1. Extract: Downloads the current weather data for New York City from the OpenWeatherMap API [1].
  2. Transform: Converts UTC timestamps to New York local time and uses the OpenAI [2] GPT-4o-mini model to generate personalized clothing recommendations based on the temperature.
  3. Load: Persists the data in the Databricks Unity Catalog as both raw JSON files and a structured Delta table (Silver layer).
  4. Orchestration: A notebook with the ETL code is added to a Job and scheduled to run every hour on Databricks.
  5. Analytics: The Silver layer feeds a Databricks Dashboard that displays the weather information alongside the LLM suggestions.

Here is the architecture.

Project architecture. Image by the author.

Good. Now that we understand what to do, let's move on to the How part of this project.

Note: if you don't already have a Databricks account, go to the Databricks Free Edition page [3], click Sign up for the Free Edition, and then follow the on-screen instructions to get your free access.

Extract: Integrating the API with Databricks

As I often say, a data project needs data to start, right? So our task here is to integrate the OpenWeatherMap API to import data directly into a PySpark notebook within Databricks. This task may seem complicated at first, but trust me, it is not.

On the Databricks home page, create a new notebook by clicking the + New button and selecting Notebook.

Creating a new notebook. Image by the author.

For the Extract part, we will need:

1. An API key from OpenWeatherMap.

To get one, go to the OpenWeatherMap registration page and complete the free sign-up process. Once you're logged in to the dashboard, click the API keys tab, where you will be able to see your key.

2. Import packages

# Imports
import requests
import json

Next, we'll build a Python class to modularize our code and make it closer to production-ready.

  • This class receives the API_KEY we just created, as well as the city and country to retrieve the weather.
  • Returns the response in JSON format.
# Creating a class to modularize our code

class Weather:
    
    # Define the constructor
    def __init__(self, API_KEY):
        self.API_KEY = API_KEY

    # Define a method to retrieve weather data
    def get_weather(self, city, country, units='imperial'):
        self.city = city
        self.country = country
        self.units = units

        # Make a GET request to an API endpoint that returns JSON data
        url = f"https://api.openweathermap.org/data/2.5/weather?q={self.city},{self.country}&appid={self.API_KEY}&units={self.units}"
        response = requests.get(url)

        # Use the .json() method to parse the response text and return
        if response.status_code != 200:
            raise Exception(f"Error: {response.status_code} - {response.text}")
        return response.json()

Good. Now we can run this class. Note that we use dbutils.widgets.get(). This command reads the Parameters configured in the Job, which we will see later in this article. It is a best practice that keeps secrets out of the notebook code.
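If you are running the notebook interactively before the Job exists, the widgets won't be defined yet. A minimal sketch to create them by hand (the empty defaults and labels below are placeholders of my own):

# Only needed for interactive runs: create the widgets so dbutils.widgets.get() has something to read.
# When the notebook runs as a scheduled Job, these values come from the Job's Parameters instead.
dbutils.widgets.text('API_KEY', '', 'OpenWeatherMap API key')
dbutils.widgets.text('OPENAI_API_KEY', '', 'OpenAI API key')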

# Get the API OpenWeatherMap key
API_KEY = dbutils.widgets.get('API_KEY')

# Instantiate the class
w = Weather(API_KEY=API_KEY)

# Get the weather data
nyc = w.get_weather(city='New York', country='US')
nyc

Here is the response.

{'coord': {'lon': -74.006, 'lat': 40.7143},
 'weather': [{'id': 804,
   'main': 'Clouds',
   'description': 'overcast clouds',
   'icon': '04d'}],
 'base': 'stations',
 'main': {'temp': 54.14,
  'feels_like': 53.44,
  'temp_min': 51.76,
  'temp_max': 56.26,
  'pressure': 992,
  'humidity': 89,
  'sea_level': 992,
  'grnd_level': 993},
 'visibility': 10000,
 'wind': {'speed': 21.85, 'deg': 270, 'gust': 37.98},
 'clouds': {'all': 100},
 'dt': 1766161441,
 'sys': {'type': 1,
  'id': 4610,
  'country': 'US',
  'sunrise': 1766146541,
  'sunset': 1766179850},
 'timezone': -18000,
 'id': 5128581,
 'name': 'New York',
 'cod': 200}

With that answer in hand, we can move on to the Transformation part of our project, where we will clean and transform the data.

Transform: Formatting the Data

In this section, we will clean and transform the raw data. We'll start by selecting the pieces of data needed for our dashboard, which is simply a matter of pulling values out of a dictionary (or JSON).

# Getting information
id = nyc['id']
timestamp = nyc['dt']
weather = nyc['weather'][0]['main']
temp = nyc['main']['temp']
tmin = nyc['main']['temp_min']
tmax = nyc['main']['temp_max']
country = nyc['sys']['country']
city = nyc['name']
sunrise = nyc['sys']['sunrise']
sunset = nyc['sys']['sunset']
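As a side note, the direct indexing above will raise a KeyError if the API ever omits a field. If you want the extraction to degrade gracefully, a small defensive sketch (purely illustrative, not used downstream) could look like this:

# Defensive variant: fall back to None instead of raising KeyError
main_block = nyc.get('main', {})
sys_block = nyc.get('sys', {})
weather_block = (nyc.get('weather') or [{}])[0]

record = {
    'id': nyc.get('id'),
    'timestamp': nyc.get('dt'),
    'weather': weather_block.get('main'),
    'temp': main_block.get('temp'),
    'tmin': main_block.get('temp_min'),
    'tmax': main_block.get('temp_max'),
    'country': sys_block.get('country'),
    'city': nyc.get('name'),
    'sunrise': sys_block.get('sunrise'),
    'sunset': sys_block.get('sunset'),
}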

Next, let's convert the timestamps to the New York time zone, since the API returns them in UTC.

# Transform sunrise and sunset to datetime in NYC timezone
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
import time

# Timestamp, Sunrise and Sunset to NYC timezone
target_timezone = ZoneInfo("America/New_York")
dt_utc = datetime.fromtimestamp(sunrise, tz=timezone.utc)
sunrise_nyc = str(dt_utc.astimezone(target_timezone).time()) # keep only the sunrise time
dt_utc = datetime.fromtimestamp(sunset, tz=timezone.utc)
sunset_nyc = str(dt_utc.astimezone(target_timezone).time()) # keep only the sunset time
dt_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc)
time_nyc = str(dt_utc.astimezone(target_timezone))
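Since the same conversion runs three times, you could also wrap it in a small helper; a quick sketch (the to_nyc name is my own):

# Helper: convert a Unix timestamp (UTC) to America/New_York local time
def to_nyc(unix_ts, time_only=False):
    local_dt = datetime.fromtimestamp(unix_ts, tz=timezone.utc).astimezone(target_timezone)
    return str(local_dt.time()) if time_only else str(local_dt)

sunrise_nyc = to_nyc(sunrise, time_only=True)
sunset_nyc = to_nyc(sunset, time_only=True)
time_nyc = to_nyc(timestamp)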

Finally, we format it as a Spark data frame.

# Create a dataframe from the variables
df = spark.createDataFrame(
    [[id, time_nyc, weather, temp, tmin, tmax, country, city, sunrise_nyc, sunset_nyc]],
    schema=['id', 'timestamp', 'weather', 'temp', 'tmin', 'tmax', 'country', 'city', 'sunrise', 'sunset']
)
Cleaned and transformed data. Image by the author.
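If you prefer not to rely on Spark's type inference here, you can also pass an explicit schema; a minimal sketch (the column types are my assumptions based on the values above):

# Optional: explicit schema instead of letting Spark infer the column types
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

weather_schema = StructType([
    StructField('id', LongType()),
    StructField('timestamp', StringType()),
    StructField('weather', StringType()),
    StructField('temp', DoubleType()),
    StructField('tmin', DoubleType()),
    StructField('tmax', DoubleType()),
    StructField('country', StringType()),
    StructField('city', StringType()),
    StructField('sunrise', StringType()),
    StructField('sunset', StringType()),
])

df = spark.createDataFrame(
    [[id, time_nyc, weather, temp, tmin, tmax, country, city, sunrise_nyc, sunset_nyc]],
    schema=weather_schema
)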

The last step in this section is to add a suggestion from the LLM. Here, we will select some of the data downloaded from the API and pass it to the model, asking it to return a suggestion on how to dress for the weather.

  • You will need an OpenAI API key.
  • Pass the weather condition and the max and min temperatures (weather, tmax, tmin) to the model.
  • Ask the LLM to return a suggestion on how to dress according to the weather.
  • Add the suggestion to the final data frame.
%pip install openai --quiet
from openai import OpenAI
import pyspark.sql.functions as F
from pyspark.sql.functions import col

# Get OpenAI Key
OPENAI_API_KEY= dbutils.widgets.get('OPENAI_API_KEY')

client = OpenAI(
    # This is the default and can be omitted
    api_key=OPENAI_API_KEY
)

response = client.responses.create(
    model="gpt-4o-mini",
    instructions="You are a weatherman that gives suggestions about how to dress based on the weather. Answer in one sentence.",
    input=f"The weather is {weather}, with max temperature {tmax} and min temperature {tmin}. How should I dress?"
)

suggestion = response.output_text

# Add the suggestion to the df
df = df.withColumn('suggestion', F.lit(suggestion))
display(df)
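Since the Job will run unattended every hour, you may not want a transient OpenAI error to abort the whole run. A defensive variation of the call above (the fallback text is arbitrary):

# Variation: fall back to a neutral message if the OpenAI call fails,
# so the hourly Job still writes a row to the Silver table
try:
    response = client.responses.create(
        model="gpt-4o-mini",
        instructions="You are a weatherman that gives suggestions about how to dress based on the weather. Answer in one sentence.",
        input=f"The weather is {weather}, with max temperature {tmax} and min temperature {tmin}. How should I dress?"
    )
    suggestion = response.output_text
except Exception as e:
    print(f"OpenAI call failed: {e}")
    suggestion = "No suggestion available for this run."

# The suggestion is then added to the DataFrame with F.lit(), exactly as above.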

Good. We are almost done with the ETL. Now it's time to load the data, which is the next stage.

Load: Saving Data and Building a Silver Layer

The last part of ETL is loading the data. We will load it in two different ways.

  1. Persisting the raw files on a Unity Catalog volume.
  2. Saving the transformed data frame directly to the Silver layer, a Delta table ready for use in the dashboard.

Let's build a catalog that will hold all the weather data we get from the API.

-- Creating a Catalog
CREATE CATALOG IF NOT EXISTS pipeline_weather
COMMENT 'This is the catalog for the weather pipeline';

Next, we create a lakehouse schema. This one will hold the volume with the raw JSON files we download.

-- Creating a Schema
CREATE SCHEMA IF NOT EXISTS pipeline_weather.lakehouse
COMMENT 'This is the schema for the weather pipeline';

Now, we create a volume for the raw files.

-- Let's create a volume
CREATE VOLUME IF NOT EXISTS pipeline_weather.lakehouse.raw_data
COMMENT 'This is the raw data volume for the weather pipeline';

We then create another schema to hold the Silver layer Delta table.

--Creating Schema to hold transformed data
CREATE SCHEMA IF NOT EXISTS pipeline_weather.silver
COMMENT 'This is the schema for the weather pipeline';

Once we've set everything up, here's what our Catalog looks like.

The catalog is ready to receive data. Image by the author.

Now, let's save the raw JSON response to our raw Volume. To keep everything organized and prevent overwriting, we'll attach a unique timestamp to each file name.

By accumulating these files on the volume rather than overwriting them, we create a reliable audit trail. This acts as a safety net: if a downstream process fails or we experience data loss later, we can always go back to the source and reprocess the original data whenever we need it.

# Get timestamp
stamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')

# Path to save
json_path = f'/Volumes/pipeline_weather/lakehouse/raw_data/weather_{stamp}.json'

# Save the data into a json file
df.write.mode('append').json(json_path)
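Note that the snippet above writes the already-transformed DataFrame as JSON. If you would rather keep the untouched API payload as the raw copy, Unity Catalog volumes can also be written like regular file paths from the notebook; a small sketch of that idea (the weather_raw_ file prefix is my own naming):

# Persist the untouched API response (the nyc dict) as a single JSON file
raw_path = f'/Volumes/pipeline_weather/lakehouse/raw_data/weather_raw_{stamp}.json'

with open(raw_path, 'w') as f:
    json.dump(nyc, f)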

While we keep the raw JSON as our "source of truth," we save the cleaned data to the Delta table in the Silver layer, where the real magic happens. By using .mode("append") and the Delta format, we ensure that our data is structured, schema-enforced, and ready for fast queries or BI tools. This layer turns raw API responses into a reliable, queryable table that grows with every pipeline run.

# Save the transformed data into a table (schema)
(
    df
    .write
    .format('delta')
    .mode("append")
    .saveAsTable('pipeline_weather.silver.weather')
)
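Before moving on, a quick way to confirm that each run appends a new row is to query the table straight from the notebook:

# Inspect the most recent rows written to the Silver table
display(
    spark.table('pipeline_weather.silver.weather')
         .orderBy(col('timestamp').desc())
         .limit(5)
)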

Good! With this all set up, let's check how our table looks now.

Silver layer table. Image by the author.

Now, let's orchestrate this pipeline.

Orchestration: Scheduling the Notebook to Run Automatically

Continuing with the project, it's time to make this pipeline run on its own, with minimal supervision. For that, Databricks has the Jobs & Pipelines tab, where it is easy to schedule tasks to run.

  1. Click the Jobs & Pipelines tab in the left panel.
  2. Click the Create button and select Job.
  3. Select the notebook to add it as a task.
  4. Configure it as shown below.
  5. Add the API keys to the Parameters.
  6. Click Create task.
  7. Click Run Now to check that it works.
Adding a notebook as a Job task. Image by the author.

When you click the Run Now button, it should run the notebook and display a Succeeded message.

The Job run succeeded. Image by the author.

If the task is working well, it's time to schedule it to run automatically.

  1. Click Add trigger on the right side of the screen, under the Schedules & Triggers section.
  2. Trigger type: Scheduled.
  3. Schedule type: select Advanced.
  4. Select Every 1 hour from the dropdown.
  5. Save it.

Very nice. Our pipeline is now on autopilot! Every hour, it will call the OpenWeatherMap API, fetch the latest NYC weather information, and save it to our Silver layer table.

Analytics: Building a Dashboard for Data-Driven Decisions

The last piece of this puzzle is the Analytics layer: a dashboard that displays the weather information and gives the user practical guidance on how to dress for the weather outside.

  1. Click the Dashboards tab in the left panel.
  2. Click the Create dashboard button.
  3. A blank canvas will open up for us to work on.
A new blank dashboard. Image by the author.

Dashboards are fed by data fetched through SQL queries. So, before we start adding text and visuals to the canvas, we first need to create the metrics that will serve as the variables feeding the dashboard cards and images.

Click the + Create from SQL button to start a metric and give it a name. For example, for the Location metric, which retrieves the last fetched city name, I use the following query.

-- Get the latest city name fetched
SELECT city
FROM pipeline_weather.silver.weather
ORDER BY timestamp DESC
LIMIT 1

We have to create one SQL query for each metric. You can see them all in the GitHub repository [4].

Next, we click on the dashboard tab and start dragging and dropping things on the canvas.

Dashboard creation items menu. Image by the author.

Clicking Text lets you insert a box on the canvas and edit the text. Clicking the image widget inserts a placeholder for the image and opens a menu on the right side where you can select variables and settings.

Interacting with dashboards in Databricks. Image by the author.

All right. Once all the items are added, the dashboard will look like this.

Completed dashboard. Image by the author.

Very nice! And that concludes our project.

Before You Go

You can easily replicate this project in about an hour, depending on your experience with the Databricks ecosystem. Although it's quick to build, it packs a lot in terms of key engineering skills you'll practice:

  • Architectural Design: You'll learn how to build a modern Lakehouse from the ground up.
  • Seamless Data Integration: You will bridge the gap between external web APIs and the Databricks platform for real-time data ingestion.
  • Clean, Modular Code: We go beyond simple scripts by using Python classes and functions to keep the codebase organized and maintainable.
  • Automation & Orchestration: You will gain hands-on experience scheduling and orchestrating jobs so your project runs reliably on autopilot.
  • Delivering Real Value: The goal is not just to move data; it's to provide value. By turning raw weather metrics into actionable clothing recommendations with AI, we turn “cold data” into a useful service for the end user.

If you liked this content, find my contacts and more about me on my website.

GitHub Repository

Here is the repository for this project.

References

[1] OpenWeatherMap API

[2] OpenAI Platform

[3] Databricks Free Edition

[4] GitHub Repository
