
Bootstrap a Data Lakehouse in an Afternoon

Building a data lakehouse doesn't have to be complicated. In this article, I will show you how to develop a basic, “starter” lakehouse using Iceberg tables on AWS S3 storage. Once a table is registered with AWS Glue, you will be able to query and modify it from Amazon Athena, including:

  • Adding, updating and deleting data
  • Optimizing and vacuuming your tables

I will also show you how to query the same tables locally using DuckDB, and we will see how you can use Glue / Spark to insert additional data into the table.

Our example may be basic, but it will demonstrate the setup, tools and processes you can put in place for a wide range of data. All of today's major cloud providers have equivalents to the AWS services I discuss in this article, so you should be able to replicate what I cover here on Azure, Google Cloud, and others.

To make sure we're all on the same page, here's a brief description of some of the key technologies we'll be using.

AWS Glue / Spark

AWS Glue is a fully managed, serverless ETL service from Amazon that streamlines data preparation and integration for analytics and machine learning. It automatically discovers and catalogs metadata from various sources, such as S3, in a central data catalog. Additionally, it can run custom Python or PySpark scripts to perform these tasks on the Apache Spark platform. This makes it ideal for building data lakes in Amazon S3, loading data into data warehouses such as Amazon Redshift, and performing data cleansing and transformation, all without managing the underlying infrastructure.

AWS Athena

AWS Athena is a serverless, interactive query service that lets you analyze data directly in Amazon S3 using standard SQL. As a serverless platform, there is no need to manage or provision servers; just point Athena at your S3 data, define your schema (usually with AWS Glue), and start running SQL queries. It is often used for ad hoc analysis, reporting, and exploration of large datasets in formats such as CSV, JSON, ORC, or Parquet.

Iceberg tables

Iceberg tables are an open table format for large analytical datasets that provides warehouse-like capabilities for data stored in data lakes, such as Amazon S3 object storage. Traditionally, in S3, you can create, read, and delete objects (files), but updating them in place is impossible. The Iceberg format addresses this limitation while providing other benefits, including ACID transactions, schema evolution, hidden partitioning, and time travel.

Duckdb

DuckDB is an in-process analytical database written in C++ and designed for analytical SQL workloads. Since its release a few years ago, it has grown in popularity and is now one of the premier data processing tools used by data engineers and scientists, thanks to its speed, rich SQL support, and ease of use.

Scenario overview

Let's say you've been tasked with standing up a thin “warehouse-lite” table for order events, but you don't want to take on a heavyweight platform right now. You need:

  • Safe writes (no broken readers, no partial files)
  • Row-level changes (update / delete / merge, not just append)
  • Point-in-time reads (for audits and debugging)
  • Local analytics against the exact production data for quick checks

What we're going to build

  1. Create an Iceberg table registered with Glue, stored on S3 and queryable from Athena
  2. Load and modify rows (insert / update / delete / merge)
  3. Time travel to previous snapshots (by timestamp and by snapshot ID)
  4. Keep the table healthy with OPTIMIZE and VACUUM
  5. Query the same table locally from DuckDB (S3 access via DuckDB secrets)
  6. See how to add new records to the table using Glue Spark code

So, in short, we will be using:

  • S3 for data storage
  • Glue Catalog for table metadata / discovery
  • Athena for serverless SQL reads and writes
  • DuckDB for cheap, local analytics against the same Iceberg table
  • Spark (via Glue) for heavier processing

The key takeaway is that by combining the technologies above, we will be able to run data-warehouse-style queries directly against files in our data lake.

Setting up our development environment

I prefer to keep my local tooling in a separate environment. Use whichever tool you like for this; I'll demonstrate using conda, as that is what I usually do. For demo purposes, I'll be running all the code in a Jupyter Notebook environment.

# create and activate a local env
conda create -n iceberg-demo python=3.11 -y
conda activate iceberg-demo

# install duckdb CLI + Python package and awscli for quick tests
pip install duckdb awscli jupyter

Requirements

Since we will be using AWS services, you will need an AWS account. In addition, you'll need:

  • An S3 bucket for data storage (e.g. s3://my-demo-lake/warehouse/)
  • A Glue database (we will create one)
  • Athena engine version 3 in your workgroup
  • An IAM role or user with S3 + Glue permissions for Athena

1) Athena setup

Once you're logged in to AWS, open Athena in the console and set your workgroup, engine version and S3 output location (for query results). To do this, look for the hamburger-style menu at the top left of the Athena screen. Click on it to bring up a menu panel on the left, where you should see an Administration -> Workgroups link. You will be assigned to the primary workgroup by default. You can stick with this or create a new one if you like. Whichever option you choose, edit it and ensure that the following options are selected.

  • Analytics engine – Athena SQL. Manually set the engine version to 3.
  • Select the customer managed query result configuration and enter the required bucket and account details.

2) Create an Iceberg table in Athena

We'll store order events and let Iceberg handle partitioning transparently. I will use “hidden” partitioning on the day of the event timestamp, so neither writes nor reads need to deal with partition columns directly. Return to the Athena home page and launch the (Trino-based) SQL query editor. Your screen should look like this.

Image from the AWS website

Type and run the following SQL. Change the buckets / table names to match.

-- This automatically creates a Glue database 
-- if you don't have one already
CREATE DATABASE IF NOT EXISTS analytics;
CREATE TABLE analytics.sales_iceberg (
  order_id    bigint,
  customer_id bigint,
  ts          timestamp,
  status      string,
  amount_usd  double
)
PARTITIONED BY (day(ts))
LOCATION 's3://your_bucket/warehouse/sales_iceberg/'
TBLPROPERTIES (
  'table_type' = 'ICEBERG',
  'format' = 'parquet',
  'write_compression' = 'snappy'
)
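To make the “hidden” part of hidden partitioning concrete: Iceberg derives each row's partition value from the ts column itself, so neither writers nor readers ever handle a separate partition column. Here is a minimal Python sketch of the day() transform (purely illustrative, not Iceberg internals):

```python
from datetime import datetime, date

def day_transform(ts: datetime) -> date:
    """Iceberg-style day() partition transform: the partition value
    is derived from the timestamp column itself."""
    return ts.date()

# Rows are routed to day partitions with no explicit partition column:
rows = [
    (101, datetime(2025, 8, 1, 10, 0, 0)),
    (103, datetime(2025, 8, 2, 9, 12, 0)),
]
partitions = {}
for order_id, ts in rows:
    partitions.setdefault(day_transform(ts), []).append(order_id)

print(partitions)
```

A query filtering on ts (e.g. `WHERE ts >= date '2025-08-02'`) is then pruned to the matching day partitions automatically.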

3) Load and modify data (Insert / Update / Delete / Merge)

Athena supports true Iceberg DML, allowing you to insert rows, update and delete records, and upsert using a MERGE statement. Under the hood, Iceberg uses snapshot-based ACID with delete files; readers remain consistent while writers work in parallel.

Seed a few rows.

INSERT INTO analytics.sales_iceberg VALUES
  (101, 1, timestamp '2025-08-01 10:00:00', 'created', 120.00),
  (102, 2, timestamp '2025-08-01 10:05:00', 'created',  75.50),
  (103, 2, timestamp '2025-08-02 09:12:00', 'created',  49.99),
  (104, 3, timestamp '2025-08-02 11:47:00', 'created', 250.00);

A quick sanity check.

SELECT * FROM analytics.sales_iceberg ORDER BY order_id;

 order_id | customer_id |          ts           |  status  | amount_usd
----------+-------------+-----------------------+----------+-----------
  101     | 1           | 2025-08-01 10:00:00   | created  | 120.00
  102     | 2           | 2025-08-01 10:05:00   | created  |  75.50
  103     | 2           | 2025-08-02 09:12:00   | created  |  49.99
  104     | 3           | 2025-08-02 11:47:00   | created  | 250.00

Update and delete.

UPDATE analytics.sales_iceberg
SET status = 'paid'
WHERE order_id IN (101, 102);

-- removes order 103
DELETE FROM analytics.sales_iceberg
WHERE status = 'created' AND amount_usd < 60;

Idempotent upserts with MERGE

Let's treat Order 104 as a refund and create a new Order 105.

MERGE INTO analytics.sales_iceberg AS t
USING (
  VALUES
    (104, 3, timestamp '2025-08-02 11:47:00', 'refunded', 250.00),
    (105, 4, timestamp '2025-08-03 08:30:00', 'created',   35.00)
) AS s(order_id, customer_id, ts, status, amount_usd)
ON s.order_id = t.order_id
WHEN MATCHED THEN 
  UPDATE SET 
    customer_id = s.customer_id,
    ts = s.ts,
    status = s.status,
    amount_usd = s.amount_usd
WHEN NOT MATCHED THEN 
  INSERT (order_id, customer_id, ts, status, amount_usd)
  VALUES (s.order_id, s.customer_id, s.ts, s.status, s.amount_usd);

Query again and you should see: 101/102 → paid, 103 removed, 104 → refunded, and 105 → created.

SELECT * FROM analytics.sales_iceberg ORDER BY order_id

 order_id | customer_id |            ts              |  status  | amount_usd
----------+-------------+----------------------------+----------+-----------
  101     | 1           | 2025-08-01 10:00:00.000000 | paid     | 120.00
  102     | 2           | 2025-08-01 10:05:00.000000 | paid     |  75.50
  104     | 3           | 2025-08-02 11:47:00.000000 | refunded | 250.00
  105     | 4           | 2025-08-03 08:30:00.000000 | created  |  35.00
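One useful property of this MERGE is idempotency: replaying the same source rows produces the same final table. A small plain-Python sketch of the upsert semantics (illustrative only, using a dict keyed by order_id):

```python
def merge_upsert(table: dict, source_rows: list) -> dict:
    """MERGE semantics: update when the key matches, insert otherwise.
    Returns a new dict, mimicking Iceberg's new-snapshot-per-commit model."""
    merged = dict(table)
    for row in source_rows:
        merged[row["order_id"]] = row  # matched -> update, not matched -> insert
    return merged

table = {104: {"order_id": 104, "status": "created", "amount_usd": 250.00}}
source = [
    {"order_id": 104, "status": "refunded", "amount_usd": 250.00},
    {"order_id": 105, "status": "created", "amount_usd": 35.00},
]

once = merge_upsert(table, source)
twice = merge_upsert(once, source)  # replaying the same source changes nothing
print(once == twice)  # True
```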

4) Time travel (and version travel)

This is where the real value of Iceberg shines. You can query the table as it looked at a moment in time, or as of a specific snapshot ID. In Athena, use this syntax:

-- Time travel to noon on Aug 2 (UTC)
SELECT order_id, status, amount_usd
FROM analytics.sales_iceberg
FOR TIMESTAMP AS OF TIMESTAMP '2025-08-02 12:00:00 UTC'
ORDER BY order_id;

-- Or Version travel (replace the id with an actual snapshot id from your table)

SELECT *
FROM analytics.sales_iceberg
FOR VERSION AS OF 949530903748831860;

To find the snapshot IDs associated with a particular table, use this query:

SELECT * FROM "analytics"."sales_iceberg$snapshots"
ORDER BY committed_at DESC;

5) Keeping your table healthy: OPTIMIZE and VACUUM

Row-level updates and deletes produce delete files, and frequent small writes produce many small data files. Two statements keep things fast and tidy:

  • OPTIMIZE … REWRITE DATA USING BIN_PACK – compacts small files / splits and folds in delete files
  • VACUUM – expires old snapshots + cleans up orphaned files

-- compact "hot" data (yesterday) and merge deletes
OPTIMIZE analytics.sales_iceberg
REWRITE DATA USING BIN_PACK
WHERE ts >= date_trunc('day', current_timestamp - interval '1' day);

-- expire old snapshots and remove orphan files
VACUUM analytics.sales_iceberg;

6) Local analytics with DuckDB (read-only)

It's handy to be able to sanity-check production tables from your laptop without spinning up a cluster. DuckDB's httpfs and iceberg extensions make this easy.

6.1 Install & Load Extensions

Open your Jupyter notebook and type in the following.

# httpfs gives S3 support; iceberg adds Iceberg readers.

import duckdb as db
db.sql("install httpfs; load httpfs;")
db.sql("install iceberg; load iceberg;")

6.2 Give S3 credentials to DuckDB the “right” way (secrets)

DuckDB has a small but powerful secrets manager. The simplest setup for AWS is the credential_chain provider, which picks up anything the AWS SDK can find (environment variables, IAM roles, etc.). You will therefore need to ensure that, for example, your AWS CLI credentials are configured.

db.sql("""CREATE SECRET ( TYPE s3, PROVIDER credential_chain )""")

After that, any s3:// reads in this DuckDB session will use those credentials.
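If you can't (or don't want to) rely on the credential chain, for example in CI, a DuckDB secret can also carry explicit keys. Here is a sketch with placeholder values (the secret name and credentials below are illustrative, not real):

```sql
-- Alternative: an S3 secret with explicit placeholder credentials
CREATE SECRET my_s3_secret (
    TYPE s3,
    KEY_ID 'YOUR_ACCESS_KEY_ID',
    SECRET 'YOUR_SECRET_ACCESS_KEY',
    REGION 'eu-west-1'
);
```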

6.3 Point DuckDB at the Iceberg table metadata

The most straightforward way is to refer to a concrete metadata file (e.g., the latest one in your table's metadata/ folder).

To find a list of those, use this query:

result = db.sql("""
SELECT *
FROM glob('s3://your_bucket/warehouse/**')
ORDER BY file
""")
print(result)

...
...
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00000-942a25ce-24e5-45f8-ae86-b70d8239e3bb.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00001-fa2d9997-590e-4231-93ab-642c0da83f19.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00002-0da3a4af-64af-4e46-bea2-0ac450bf1786.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00003-eae21a3d-1bf3-4ed1-b64e-1562faa445d0.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json
...
...
...

Look for the metadata.json file with the highest numeric prefix in the file name (00004 in my case). You can then use that file in a query like this to retrieve the latest state of your underlying table.
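Picking the latest metadata file by eye gets tedious. Here is a small plain-Python helper (the key names below are shortened, hypothetical examples) that selects the highest-numbered metadata file from a listing like the one above:

```python
def latest_metadata(keys: list) -> str:
    """Return the metadata.json key with the highest numeric prefix
    (e.g. '00004-...' beats '00003-...')."""
    meta = [k for k in keys if k.endswith(".metadata.json")]
    # File names begin with a zero-padded sequence number
    return max(meta, key=lambda k: int(k.rsplit("/", 1)[-1].split("-", 1)[0]))

# Shortened, illustrative keys (yours will carry full UUIDs):
keys = [
    "s3://your_bucket/warehouse/sales_iceberg/metadata/00003-eae21a3d.metadata.json",
    "s3://your_bucket/warehouse/sales_iceberg/metadata/00004-4a2cff23.metadata.json",
    "s3://your_bucket/warehouse/sales_iceberg/metadata/snap-123-1-abc.avro",
]
print(latest_metadata(keys))  # the 00004-... file
```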

# Use the highest numbered metadata file (00004 appears to be the latest in my case)
result = db.sql("""
SELECT *
FROM iceberg_scan('s3://your_bucket/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json')
LIMIT 10
""")
print(result)

┌──────────┬─────────────┬─────────────────────┬──────────┬────────────┐
│ order_id │ customer_id │         ts          │  status  │ amount_usd │
│  int64   │    int64    │      timestamp      │ varchar  │   double   │
├──────────┼─────────────┼─────────────────────┼──────────┼────────────┤
│      105 │           4 │ 2025-08-03 08:30:00 │ created  │       35.0 │
│      104 │           3 │ 2025-08-02 11:47:00 │ refunded │      250.0 │
│      101 │           1 │ 2025-08-01 10:00:00 │ paid     │      120.0 │
│      102 │           2 │ 2025-08-01 10:05:00 │ paid     │       75.5 │
└──────────┴─────────────┴─────────────────────┴──────────┴────────────┘

Looking for a specific snapshot? Use this to get a list:

result = db.sql("""
SELECT *
FROM iceberg_snapshots('s3://your_bucket/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json')
""")
print("Available Snapshots:")
print(result)

Available Snapshots:
┌─────────────────┬─────────────────────┬─────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ sequence_number │     snapshot_id     │      timestamp_ms       │                                                          manifest_list                                                           │
│     uint64      │       uint64        │        timestamp        │                                                             varchar                                                              │
├─────────────────┼─────────────────────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│               1 │ 5665457382547658217 │ 2025-09-09 10:58:44.225 │ s3://your_bucket/warehouse/sales_iceberg/metadata/snap-5665457382547658217-1-bb7d0497-0f97-4483-98e2-8bd26ddcf879.avro │
│               3 │ 8808557756756599285 │ 2025-09-09 11:19:24.422 │ s3://your_bucket/warehouse/sales_iceberg/metadata/snap-8808557756756599285-1-f83d407d-ec31-49d6-900e-25bc8d19049c.avro │
│               2 │   31637314992569797 │ 2025-09-09 11:08:08.805 │ s3://your_bucket/warehouse/sales_iceberg/metadata/snap-31637314992569797-1-000a2e8f-b016-4d91-9942-72fe9ddadccc.avro   │
│               4 │ 4009826928128589775 │ 2025-09-09 11:43:18.117 │ s3://your_bucket/warehouse/sales_iceberg/metadata/snap-4009826928128589775-1-cd184303-38ab-4736-90da-52e0cf102abf.avro │
└─────────────────┴─────────────────────┴─────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

7) Optional: Writing from Spark / Glue

If you prefer larger batch Spark writes, Glue can read / write Iceberg tables registered in the Glue Catalog. You'll probably want to keep Athena for ad hoc SQL, time travel, and maintenance, but big CTAS / ETL jobs can run as Glue jobs. (Just be aware that version compatibility and permissions across AWS services can be tricky, as Glue and Athena may lag slightly behind in the Iceberg versions they support.)

Here is an example of some Glue Spark code that inserts five new rows, starting at order_id = 110, into our existing table. Before running this, you must add the following Glue job parameter (under Job details -> Advanced properties -> Job parameters):

Key: --conf
Value: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

import sys
import random
from datetime import datetime
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import Row

# --------------------------------------------------------
# Init Glue job
# --------------------------------------------------------
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# --------------------------------------------------------
# Force Iceberg + Glue catalog configs (dynamic only)
# --------------------------------------------------------
spark.conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://your_bucket/warehouse/")
spark.conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
spark.conf.set("spark.sql.defaultCatalog", "glue_catalog")

# --------------------------------------------------------
# Debug: list catalogs to confirm glue_catalog is registered
# --------------------------------------------------------
print("Current catalogs available:")
spark.sql("SHOW CATALOGS").show(truncate=False)

# --------------------------------------------------------
# Read existing Iceberg table (optional)
# --------------------------------------------------------
existing_table_df = glueContext.create_data_frame.from_catalog(
    database="analytics",
    table_name="sales_iceberg"
)
print("Existing table schema:")
existing_table_df.printSchema()

# --------------------------------------------------------
# Create 5 new records
# --------------------------------------------------------
new_records_data = []
for i in range(5):
    order_id = 110 + i
    record = {
        "order_id": order_id,
        "customer_id": 1000 + (i % 10),
        "price": round(random.uniform(10.0, 500.0), 2),
        "created_at": datetime.now(),
        "status": "completed"
    }
    new_records_data.append(record)

new_records_df = spark.createDataFrame([Row(**r) for r in new_records_data])
print(f"Creating {new_records_df.count()} new records:")
new_records_df.show()

# Register temp view for SQL insert
new_records_df.createOrReplaceTempView("new_records_temp")

# --------------------------------------------------------
# Insert into Iceberg table (alias columns as needed)
# --------------------------------------------------------
spark.sql("""
    INSERT INTO analytics.sales_iceberg (order_id, customer_id, ts, status, amount_usd)
    SELECT order_id,
           customer_id,
           created_at AS ts,
           status,
           price AS amount_usd
    FROM new_records_temp
""")

print("Successfully added 5 new records to analytics.sales_iceberg")

# --------------------------------------------------------
# Commit Glue job
# --------------------------------------------------------
job.commit()

Double check with Athena.

SELECT * FROM analytics.sales_iceberg
ORDER BY order_id;

 order_id | customer_id |            ts              |  status   | amount_usd
----------+-------------+----------------------------+-----------+-----------
  101     | 1           | 2025-08-01 10:00:00.000000 | paid      | 120.00
  102     | 2           | 2025-08-01 10:05:00.000000 | paid      |  75.50
  104     | 3           | 2025-08-02 11:47:00.000000 | refunded  | 250.00
  105     | 4           | 2025-08-03 08:30:00.000000 | created   |  35.00
  110     | 1000        | 2025-09-10 16:06:45.505935 | completed | 248.64
  111     | 1001        | 2025-09-10 16:06:45.505947 | completed | 453.76
  112     | 1002        | 2025-09-10 16:06:45.505955 | completed | 467.79
  113     | 1003        | 2025-09-10 16:06:45.505963 | completed | 359.90
  114     | 1004        | 2025-09-10 16:06:45.506059 | completed | 398.52

Next Steps

From here, you can:

  • Create more tables and load more data.
  • Try partition evolution (e.g., change the table's partitioning from day → hour as volumes grow).
  • Add scheduled maintenance. For example, EventBridge, Step Functions, and Lambdas can be used to run OPTIMIZE / VACUUM on a regular cadence.

To put it briefly

In this article, I tried to lay out a clear path for bootstrapping an Iceberg-based lakehouse. It should serve as a guide for data engineers who want to bridge the gap between simple object storage and business-critical warehouse capabilities.

Hopefully, I have shown that building a capable lakehouse, a system that combines the low cost of data lakes with the reliability of data warehouses, does not require a massive infrastructure deployment. And while creating a full lakehouse is a long-term effort, I hope I have convinced you that you really can lay its foundations in an afternoon.

By combining Apache Iceberg with Amazon S3 cloud storage, I showed you how to turn static files into dynamic, managed tables capable of ACID transactions, row-level changes (merge, update, delete), and time travel, all without provisioning a single server.

I also showed that with modern analytical tools like DuckDB, it is possible to explore small and medium-sized lakes locally. And when your data volumes grow too large for local processing, I pointed out how easy it is to step up to an enterprise data processing platform like Spark.
