
The Anatomy of a Parquet File

In recent years, Parquet has become a standard data storage format in big data ecosystems. Its column-oriented layout offers several advantages:

  • Faster query execution when only a subset of columns is processed
  • Quick computation of statistics across all the data
  • Reduced storage volume thanks to efficient compression

When combined with storage frameworks such as Delta Lake or Apache Iceberg, it integrates seamlessly with query engines. In this article, we dissect the contents of a Parquet file to understand its internal structure.

Writing Parquet file(s)

To produce Parquet files, we use PyArrow, a Python binding for Apache Arrow that stores dataframes in memory in a columnar format. PyArrow allows fine-grained tuning of the parameters when writing the file. This makes PyArrow well suited for dissecting Parquet files (one could also simply use pandas).

# generator.py

import pyarrow as pa
import pyarrow.parquet as pq
from faker import Faker

fake = Faker()
Faker.seed(12345)
num_records = 100

# Generate fake data
names = [fake.name() for _ in range(num_records)]
addresses = [fake.address().replace("\n", ", ") for _ in range(num_records)]
birth_dates = [
    fake.date_of_birth(minimum_age=67, maximum_age=75) for _ in range(num_records)
]
cities = [addr.split(", ")[1] for addr in addresses]
birth_years = [date.year for date in birth_dates]

# Cast the data to the Arrow format
name_array = pa.array(names, type=pa.string())
address_array = pa.array(addresses, type=pa.string())
birth_date_array = pa.array(birth_dates, type=pa.date32())
city_array = pa.array(cities, type=pa.string())
birth_year_array = pa.array(birth_years, type=pa.int32())

# Create schema with non-nullable fields
schema = pa.schema(
    [
        pa.field("name", pa.string(), nullable=False),
        pa.field("address", pa.string(), nullable=False),
        pa.field("date_of_birth", pa.date32(), nullable=False),
        pa.field("city", pa.string(), nullable=False),
        pa.field("birth_year", pa.int32(), nullable=False),
    ]
)

table = pa.Table.from_arrays(
    [name_array, address_array, birth_date_array, city_array, birth_year_array],
    schema=schema,
)

print(table)
pyarrow.Table
name: string not null
address: string not null
date_of_birth: date32[day] not null
city: string not null
birth_year: int32 not null
----
name: [["Adam Bryan","Jacob Lee","Candice Martinez","Justin Thompson","Heather Rubio"]]
address: [["822 Jennifer Field Suite 507, Anthonyhaven, UT 98088","292 Garcia Mall, Lake Belindafurt, IN 69129","31738 Jonathan Mews Apt. 024, East Tammiestad, ND 45323","00716 Kristina Trail Suite 381, Howelltown, SC 64961","351 Christopher Expressway Suite 332, West Edward, CO 68607"]]
date_of_birth: [[1955-06-03,1950-06-24,1955-01-29,1957-02-18,1956-09-04]]
city: [["Anthonyhaven","Lake Belindafurt","East Tammiestad","Howelltown","West Edward"]]
birth_year: [[1955,1950,1955,1957,1956]]

The output clearly shows that the data is stored column by column, unlike pandas, which usually displays a traditional row-oriented table.
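For comparison, converting the same Arrow table back to pandas displays the familiar row-oriented view (a quick sketch, assuming pandas is installed alongside PyArrow):

# generator.py (optional check)

# Convert the column-oriented Arrow table to a pandas DataFrame,
# which prints as a row-oriented table
df = table.to_pandas()
print(df.head())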

How are Parquet files stored?

Parquet files are usually stored in cheap object storage such as S3 (AWS) or GCS (GCP) so that data processing pipelines can access them easily. These files are usually organized with a partitioning strategy by leveraging directory structures:

# generator.py

num_records = 100

# ...

# Writing the parquet files to disk
pq.write_to_dataset(
    table,
    root_path='dataset',
    partition_cols=['birth_year', 'city']
)

When the birth_year and city columns are defined as partitioning keys, PyArrow creates a tree structure like the following in the dataset directory:

dataset/
├─ birth_year=1949/
├─ birth_year=1950/
│ ├─ city=Aaronbury/
│ │ ├─ 828d313a915a43559f3111ee8d8e6c1a-0.parquet
│ │ ├─ …
│ ├─ city=Alicialand/
│ ├─ …
├─ birth_year=1951/
├─ …

This strategy enables partition pruning: when queries filter on these columns, the engine can use the folder names to read only the required files. This is why the partitioning strategy is crucial for limiting latency, I/O, and compute resources when handling large volumes of data (as has been true for decades with traditional relational databases).
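Query engines are not the only way to benefit from this layout: PyArrow itself performs partition pruning when reading a Hive-partitioned dataset. Here is a minimal sketch, assuming the dataset/ directory produced above:

# prune.py (illustrative)

import pyarrow.dataset as ds

# Discover the Hive-style partitioning (birth_year=..., city=...) from the folder names
dataset = ds.dataset("dataset", format="parquet", partitioning="hive")

# Only the fragments under birth_year=1949 are scanned, thanks to partition pruning
table_1949 = dataset.to_table(filter=ds.field("birth_year") == 1949)
print(table_1949.num_rows)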

Partition pruning can easily be verified by tracing the files opened by a Python script that filters on the birth year:

# query.py
import duckdb

duckdb.sql(
    """
    SELECT * 
    FROM read_parquet('dataset/*/*/*.parquet', hive_partitioning = true)
    where birth_year = 1949
    """
).show()
> strace -e trace=open,openat,read -f python query.py 2>&1 | grep "dataset/.*.parquet"

[pid    37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=New%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=New%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3

Only 23 files are read out of 100.
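The total of 100 files can be checked directly on disk, for instance with a recursive glob (a small verification sketch):

# count_files.py (illustrative)

import glob

# Count every Parquet file written under the partitioned dataset directory
all_files = glob.glob("dataset/**/*.parquet", recursive=True)
print(f"Total parquet files in the dataset: {len(all_files)}")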

Reading a raw Parquet file

Let's now decode a raw Parquet file without using any specialized library. For simplicity, the dataset is dumped into a single file, without compression or dictionary encoding.

# generator.py

# ...

pq.write_table(
    table,
    "dataset.parquet",
    use_dictionary=False,
    compression="NONE",
    write_statistics=True,
    column_encoding=None,
)

The first thing to know is that the binary file is framed by 4 bytes whose ASCII representation is "PAR1". The file is corrupted if this is not the case.

# reader.py

with open("dataset.parquet", "rb") as file:
    parquet_data = file.read()

assert parquet_data[:4] == b"PAR1", "Not a valid parquet file"
assert parquet_data[-4:] == b"PAR1", "File footer is corrupted"

As described in the documentation, the file is divided into two parts: the "row groups" containing the actual data, and a footer containing the metadata (schema, statistics, etc.).

The footer

The size of the footer is indicated in the 4 bytes preceding the end marker, as an unsigned integer written in "little endian" format (the "<I" format of Python's struct module).

# reader.py

import struct

# ...

footer_length = struct.unpack("<I", parquet_data[-8:-4])[0]
print(f"Footer size in bytes: {footer_length}")

# The footer ends 8 bytes before the end of the file
# (4 bytes for its length + 4 bytes for the "PAR1" marker)
footer_data = parquet_data[-(footer_length + 8) : -8]

Footer size in bytes: 1088

The footer's information is encoded in a cross-language serialization format called Apache Thrift. Unlike human-readable but verbose formats such as JSON, Thrift translates data structures into a compact binary representation that is far more efficient in terms of memory usage. With Thrift, one can declare data structures as follows:

struct Customer {
	1: required string name,
	2: optional i16 birthYear,
	3: optional list<string> interests
}

Based on this declaration, Thrift can generate Python code to decode byte strings that follow such a data structure (it also generates the code needed for the serialization part). The Thrift file containing all the data structures used in a Parquet file can be downloaded here. After installing the thrift binary, let's run:

thrift -r --gen py parquet.thrift

The generated Python code ends up in the "gen-py" folder. The footer's data structure is represented by the FileMetaData class, automatically generated from the Thrift schema. Using Thrift's Python utilities, the binary data is deserialized and loaded into an instance of this FileMetaData class.

# reader.py

import sys

# ...

# Add the generated classes to the python path
sys.path.append("gen-py")
from parquet.ttypes import FileMetaData, PageHeader
from thrift.transport import TTransport
from thrift.protocol import TCompactProtocol

def read_thrift(data, thrift_instance):
    """
    Read a Thrift object from a binary buffer.
    Returns the Thrift object and the number of bytes read.
    """
    transport = TTransport.TMemoryBuffer(data)
    protocol = TCompactProtocol.TCompactProtocol(transport)
    thrift_instance.read(protocol)
    return thrift_instance, transport._buffer.tell()

# The number of bytes read is not used for now
file_metadata_thrift, _ = read_thrift(footer_data, FileMetaData())

print(f"Number of rows in the whole file: {file_metadata_thrift.num_rows}")
print(f"Number of row groups: {len(file_metadata_thrift.row_groups)}")

Number of rows in the whole file: 100
Number of row groups: 1

The footer contains extensive information about the structure and content of the file. For instance, it accurately reports the number of rows in the generated dataframe. These rows are all contained within a single "row group." But what is a "row group"?
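The parsed FileMetaData object exposes much more than row counts. For instance, the schema elements and the library that wrote the file can be listed directly (a short exploration sketch reusing the same Thrift object):

# reader.py (continued)

# The schema is stored as a flat list of SchemaElement objects,
# the first element being the root of the schema tree
for element in file_metadata_thrift.schema:
    print(element.name, element.type)

# The writer library is also recorded in the footer
print(file_metadata_thrift.created_by)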

Row groups

Unlike purely column-oriented formats, Parquet uses a hybrid approach. Before writing column blocks, the dataframe is first split into row groups (the file we generated is too small to be split into multiple row groups).
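The same layout information can be obtained without decoding the Thrift footer by hand, through PyArrow's metadata API (a convenience sketch, assuming the single dataset.parquet file written earlier):

# inspect.py (illustrative)

import pyarrow.parquet as pq

# High-level view of the row-group layout, equivalent to reading the footer
parquet_file = pq.ParquetFile("dataset.parquet")
print(parquet_file.metadata.num_row_groups)        # 1 for our small file
print(parquet_file.metadata.row_group(0).num_rows) # 100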

This hybrid structure offers several benefits:

  • Parquet stores statistics (such as min/max values) for each column within each row group. These statistics are crucial for query performance: they allow query engines to skip entire row groups that cannot match the filtering criteria. For example, if a query filters on birth_year > 1955 and a row group's maximum birth year is 1954, the engine can skip that group entirely. This optimization is called "predicate pushdown". Parquet also stores other useful statistics such as the distinct count and the null count.

# reader.py
# ...

first_row_group = file_metadata_thrift.row_groups[0]
birth_year_column = first_row_group.columns[4]

min_stat_bytes = birth_year_column.meta_data.statistics.min
max_stat_bytes = birth_year_column.meta_data.statistics.max

min_year = struct.unpack("<i", min_stat_bytes)[0]
max_year = struct.unpack("<i", max_stat_bytes)[0]

print(f"The birth year range is between {min_year} and {max_year}")

The birth year range is between 1949 and 1958
  • Row groups enable parallel data processing (especially with frameworks such as Apache Spark). The size of these row groups can be configured based on the available computing resources (using the row_group_size property of the write_table function when using PyArrow).
# generator.py

# ...

pq.write_table(
    table,
    "dataset.parquet",
    row_group_size=100,
)

# /!\ Keep the default value of "row_group_size" for the next parts
  • Even though this is not the primary goal of a columnar format, Parquet's hybrid structure maintains reasonable performance when complete rows need to be reconstructed. Without row groups, rebuilding an entire row might require scanning every column in full, which would be extremely inefficient for large files (see the sketch below).
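A single row group can indeed be materialized on its own with PyArrow, without touching the rest of the file (a sketch reusing dataset.parquet; read_row_group is part of PyArrow's public API):

# inspect.py (continued, illustrative)

# Read only the first row group; other row groups, if any, are never scanned
first_group = pq.ParquetFile("dataset.parquet").read_row_group(0)
print(first_group.num_rows)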

Data pages

The smallest substructure of a Parquet file is the page. It contains a sequence of values from the same column and, therefore, of the same type. The choice of page size is the result of a trade-off (a tuning sketch follows the list below):

  • Larger pages mean less metadata to store and read, which is ideal for queries that filter very little.
  • Smaller pages reduce the amount of unnecessary data read, which is better for queries that target small, scattered ranges of values.
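The page size is not fixed: with PyArrow it can be adjusted at write time through the data_page_size argument of write_table (a minimal sketch; the right value depends on the workload):

# generator.py (variation)

# Target roughly 64 KB per data page instead of the library default
pq.write_table(
    table,
    "dataset_small_pages.parquet",
    data_page_size=64 * 1024,
)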

Now let's decode the content of the first page of the column dedicated to addresses, whose location is given by the data_page_offset attribute of the corresponding ColumnMetaData. Each page is preceded by a Thrift PageHeader object containing some metadata. The offset actually points to the Thrift binary representation of the page header, which precedes the page itself. The Thrift class is called PageHeader and can also be found in the gen-py directory.

💡 Between the PageHeader and the actual values contained within the page, there may be a few bytes dedicated to the Dremel format, which allows encoding nested data structures. Since our data has a regular tabular format and the values are not nullable, these bytes are skipped when writing the file (https://parquet.apache.org/docs/file-format/data-pages/).

# reader.py
# ...

address_column = first_row_group.columns[1]
column_start = address_column.meta_data.data_page_offset
column_end = column_start + address_column.meta_data.total_compressed_size
column_content = parquet_data[column_start:column_end]

page_thrift, page_header_size = read_thrift(column_content, PageHeader())
page_content = column_content[
    page_header_size : (page_header_size + page_thrift.compressed_page_size)
]
print(page_content[:100])
b'6\x00\x00\x00481 Mata Squares Suite 260, Lake Rachelville, KY 874642\x00\x00\x00671 Barker Crossing Suite 390, Mooreto'

The values finally appear, in plain text and uncompressed (as specified when writing the file). However, to fully benefit from the columnar format, it is recommended to use one of the available compression algorithms and encodings: since a page contains values of the same type (all addresses, all integers, etc.), it usually compresses very well.
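As an illustration, the same table can be rewritten with dictionary encoding and Snappy compression, and the resulting file sizes compared (a sketch; the exact ratio depends on the data):

# generator.py (variation)

import os

# Rewrite the same table with dictionary encoding and Snappy compression
pq.write_table(
    table,
    "dataset_snappy.parquet",
    compression="snappy",
    use_dictionary=True,
)

print(os.path.getsize("dataset.parquet"))
print(os.path.getsize("dataset_snappy.parquet"))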

As stated in the specification, when character strings (BYTE_ARRAY) are stored with the PLAIN encoding, each value is preceded by its size encoded as a 4-byte integer. This can be observed in the previous output:

Reading all the values (for example, the first 10) only requires a simple loop:

idx = 0
for _ in range(10):
    str_size = struct.unpack("<I", page_content[idx : idx + 4])[0]
    idx += 4
    print(page_content[idx : idx + str_size].decode("utf-8"))
    idx += str_size

481 Mata Squares Suite 260, Lake Rachelville, KY 87464
671 Barker Crossing Suite 390, Mooretown, MI 21488
62459 Jordan Knoll Apt. 970, Emilyfort, DC 80068
948 Victor Square Apt. 753, Braybury, RI 67113
365 Edward Place Apt. 162, Calebborough, AL 13037
894 Reed Lock, New Davidmouth, NV 84612
24082 Allison Squares Suite 345, North Sharonberg, WY 97642
00266 Johnson Drives, South Lori, MI 98513
15255 Kelly Plains, Richardmouth, GA 33438
260 Thomas Glens, Port Gabriela, OH 96758

And there we have it! We have reproduced, in a very simplified way, what a specialized library does to read a Parquet file. By understanding its building blocks, including headers, footers, row groups, and data pages, we can better appreciate how features like predicate pushdown and partition pruning deliver such impressive performance benefits. I am convinced that knowing how Parquet works under the hood helps make better decisions about storage strategies, compression choices, and performance optimization.

All the code used in this article is available on my GitHub repository, where you can explore more examples and experiment with different Parquet file configurations.

Whether you are building data pipelines, optimizing query performance, or simply curious about data storage formats, I hope this deep dive into Parquet has provided valuable insight for your data engineering journey.

All images are by the author.
