Parquet File Format – Everything You Need to Know!

The amount of data has grown rapidly over the last few years, and one of the biggest challenges has become finding the best way to store its many flavors. Unlike in the (not so distant) past, when relational databases were the default choice, organizations now also want to analyze raw data (think of sentiment analysis), and storing such data in a traditional way takes significant time, which increases the overall time to analysis.

Another challenge was to keep the data stored in a structured way, but without having to design complex and time-consuming ETL workloads to load it into an enterprise data warehouse. Additionally, what if half of the data professionals in your organization are proficient in, let's say, Python (data scientists, data engineers), and the other half in SQL? Would you insist that the “Pythonists” learn SQL? Or the other way around?

Or would you prefer a storage option that plays to the strengths of your entire data team? I have good news for you – something like this has been available since 2013, and it is called Apache Parquet!

Parquet file format in a nutshell

Before I show you the ins and outs of the Parquet file format, here are the five main reasons why Parquet is considered a de facto standard for storing data these days:

  • Data compression – by applying various encoding and compression algorithms, a Parquet file uses less memory
  • Columnar storage – this is very important for analytic workloads, where fast data reads are a key requirement. But more on that later in the article …
  • Language agnostic – as mentioned earlier, developers can use different programming languages to manipulate the data in Parquet files
  • Open-source format – which means you are not locked in with a particular vendor
  • Support for complex data types

Row store vs column store

We have already described Parquet as a column-based storage format. However, to understand the benefits of using the Parquet file format, we first need to draw a line between the row-based and the column-based way of storing data.

In traditional, row-based storage, the data is stored as a sequence of rows. Something like this:

Image by author

Now, when we talk about OLAP scenarios, some common questions your users may ask are:

  • How many balls did we sell?
  • How many users from the USA bought a T-shirt?
  • What is the total amount spent by customer Maria Adams?
  • How many sales did we have on January 2nd?

To be able to answer any of these questions, the engine has to scan each and every row from the beginning to the very end! So, to answer the question "How many users from the USA bought a T-shirt?", the engine must do something like this:

Image by author

In fact, we need information from only two columns: Product (T-shirt) and Country (USA), but the engine will scan all five columns! This is not the most efficient solution – I think we can agree on that …
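To make the cost of a row-based scan concrete, here is a minimal Python sketch. The table itself only appears as an image in the article, so the rows, column names and the `scan_row_store` function below are hypothetical, and the code is a toy model of the idea rather than how a real engine is implemented.

```python
# Hypothetical sample data: six sales rows, five columns each.
COLUMNS = ("product", "customer", "country", "date", "amount")
ROWS = [
    ("Ball",    "John Doe",      "USA", "2023-01-01", 100),
    ("T-Shirt", "John Doe",      "USA", "2023-01-02", 200),
    ("Socks",   "Maria Adams",   "UK",  "2023-01-01", 300),
    ("Socks",   "Antonio Grant", "USA", "2023-01-03", 100),
    ("T-Shirt", "Maria Adams",   "UK",  "2023-01-02", 500),
    ("Socks",   "John Doe",      "USA", "2023-01-05", 200),
]


def scan_row_store(rows):
    """Count T-shirt sales in the USA; return (matches, values_read)."""
    product_idx = COLUMNS.index("product")
    country_idx = COLUMNS.index("country")
    matches = 0
    values_read = 0
    for row in rows:
        # A row store hands back the whole row, so all five values are read
        # even though the filter only looks at two of them.
        values_read += len(row)
        if row[product_idx] == "T-Shirt" and row[country_idx] == "USA":
            matches += 1
    return matches, values_read


matches, values_read = scan_row_store(ROWS)
print(f"matching rows: {matches}, values read: {values_read}")
```

With this sample data the scan touches 6 rows × 5 columns = 30 values to find a single matching row.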

Column store

Now let us examine how the column store works. As you might guess, the approach is 180 degrees different:

Image by author

In this case, each column is a separate entity – which means each column is physically separated from the other columns! Going back to our previous business question: the engine can now scan only the columns needed by the query (Product and Country) and skip the unnecessary ones. In most cases, this should improve the performance of analytical queries.
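The same question against a columnar layout can be sketched as follows. Again, the data, the column names and `scan_column_store` are hypothetical and only illustrate the principle: each column is kept as its own list, and the query touches just the two lists it needs.

```python
# Hypothetical sample data: six sales stored column by column.
COLUMN_STORE = {
    "product":  ["Ball", "T-Shirt", "Socks", "Socks", "T-Shirt", "Socks"],
    "customer": ["John Doe", "John Doe", "Maria Adams",
                 "Antonio Grant", "Maria Adams", "John Doe"],
    "country":  ["USA", "USA", "UK", "USA", "UK", "USA"],
    "date":     ["2023-01-01", "2023-01-02", "2023-01-01",
                 "2023-01-03", "2023-01-02", "2023-01-05"],
    "amount":   [100, 200, 300, 100, 500, 200],
}


def scan_column_store(store):
    """Count T-shirt sales in the USA; return (matches, values_read)."""
    # Only the two columns used by the query are touched;
    # customer, date and amount are never read.
    products = store["product"]
    countries = store["country"]
    values_read = len(products) + len(countries)
    matches = sum(
        1 for product, country in zip(products, countries)
        if product == "T-Shirt" and country == "USA"
    )
    return matches, values_read


matches, values_read = scan_column_store(COLUMN_STORE)
print(f"matching rows: {matches}, values read: {values_read}")
```

Here the scan reads 2 columns × 6 rows = 12 values instead of all 30.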

All right, that's nice, but the column store existed before Parquet, and it still exists outside of Parquet as well. So, what is so special about the Parquet format?

Parquet is a columnar format that stores data in row groups

Wait, what?! Wasn't it complicated enough even before this? Don't worry, it's much easier than it sounds 🙂

Let's go back to our previous example and show how Parquet stores this same chunk of data:

Image by author

Let's stop for a moment and explain the image above, because this is exactly the structure of a Parquet file (some items were deliberately left out, but we will get to them soon). Columns are still stored as separate units, but Parquet adds another structure on top of them: the row group.

Why is this additional structure so important?

You will need to wait a little for the answer :). In OLAP scenarios, we are mainly concerned with two concepts: projection and predicate(s). Projection refers to the SELECT statement in SQL – which columns are needed by the query. Back to our previous example: we only need the Product and Country columns, so the engine can skip scanning the remaining ones.

Predicate(s) refer to the WHERE clause in SQL – which rows satisfy the criteria defined in the query. In our case, we are only interested in T-shirts, so the engine can completely skip scanning row group 2, where none of the values in the Product column is a T-shirt!

Image by author
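Projection and predicate-based skipping can be combined in one small sketch. Everything here is an assumption made for illustration: the three row groups of two rows each, the sample values, the `STATS` structure and the `scan_row_groups` function. Real engines work on encoded column chunks, but the decision logic is similar in spirit: check the per-group minimum and maximum first, and read the values only if a match is possible.

```python
# Hypothetical data: three row groups of two rows each. Only the two
# columns needed by the query are shown; the other three are omitted.
ROW_GROUPS = [
    {"product": ["Ball", "T-Shirt"],  "country": ["USA", "USA"]},
    {"product": ["Socks", "Socks"],   "country": ["UK", "USA"]},
    {"product": ["T-Shirt", "Socks"], "country": ["UK", "USA"]},
]

# Min/max per column and row group, computed once when the data is
# written and kept apart from the values themselves.
STATS = [
    {name: (min(values), max(values)) for name, values in group.items()}
    for group in ROW_GROUPS
]


def scan_row_groups(groups, stats, wanted):
    """Return (matches, values_read, skipped_group_numbers)."""
    matches = 0
    values_read = 0
    skipped = []
    for number, (group, group_stats) in enumerate(zip(groups, stats), start=1):
        # Predicate check on metadata only: if a wanted value lies outside
        # a column's [min, max] range, no row in this group can match.
        if any(
            not (group_stats[col][0] <= value <= group_stats[col][1])
            for col, value in wanted.items()
        ):
            skipped.append(number)
            continue
        # Projection: read only the columns the query asks for.
        columns = [group[col] for col in wanted]
        values_read += sum(len(column) for column in columns)
        for row_values in zip(*columns):
            if row_values == tuple(wanted.values()):
                matches += 1
    return matches, values_read, skipped


result = scan_row_groups(ROW_GROUPS, STATS, {"product": "T-Shirt", "country": "USA"})
print(result)
```

With this data the second row group is skipped without reading any of its values, so the scan touches 2 columns × 4 rows = 8 values.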

Let's quickly stop here, because I want you to see the difference between the storage types in terms of the work the engine has to do:

  • Row store – the engine needs to scan all 5 columns and all 6 rows
  • Column store – the engine needs to scan 2 columns and all 6 rows
  • Column store with row groups – the engine needs to scan 2 columns and 4 rows

Obviously, this is an oversimplified example, with only 6 rows and 5 columns, where you will definitely not see any difference between these three options. However, in real life, when dealing with much larger amounts of data, the difference becomes evident.

Now, the fair question would be: how does Parquet “know” which row group to skip or scan?

Parquet file contains metadata

This means that every Parquet file contains “data about data” – information such as the minimum and maximum values of a specific column within a certain row group. In addition, every Parquet file contains a footer, which keeps information about the format version, schema details, column metadata, and so on. You can find more information about the types of Parquet metadata here.

Important: in order to optimize performance and skip unnecessary data structures (row groups and columns), the engine first has to read the metadata, and that takes a small amount of time. Querying many small files is therefore more expensive than querying fewer, larger ones (but still not too large 🙂) …

I hear you, I hear you: what is “small” and what is “big”? Unfortunately, there is no single “golden” number, but for example, Microsoft Azure Synapse Analytics recommends that each Parquet file should be at least a few hundred MB in size.

What else is in there?

Here is a simplified, high-level image of the Parquet file format:

Image by author
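The role of the footer can be illustrated with a short Python sketch. It relies on the layout defined by the Parquet specification for unencrypted files: the file starts and ends with the 4-byte marker `PAR1`, and the 4 bytes before the closing marker hold the footer length as a little-endian integer. The bytes below are a synthetic stand-in, not a real Parquet file, and `locate_footer` is a hypothetical helper; in a real file the footer is a Thrift-serialized structure that this sketch does not attempt to decode.

```python
import struct

MAGIC = b"PAR1"  # marker at both ends of an unencrypted Parquet file


def locate_footer(data: bytes):
    """Return (footer_start, footer_length) for Parquet-style bytes."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 marker, not a Parquet-style file")
    # The 4 bytes before the closing marker store the footer length
    # as a little-endian unsigned integer.
    (footer_length,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_length
    if footer_start < 4:
        raise ValueError("footer length does not fit inside the file")
    return footer_start, footer_length


# Synthetic stand-in: placeholder bytes where the row groups would be,
# followed by a placeholder footer (a real footer is Thrift-encoded).
fake_row_groups = b"\x00" * 64
fake_footer = b"schema+row-group-stats"
blob = (
    MAGIC
    + fake_row_groups
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + MAGIC
)

start, length = locate_footer(blob)
print(blob[start:start + length])
```

Because the metadata sits at the end of the file, a reader can jump there first and decide which row groups and columns to read before touching any of the data.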

Can it be better than this? Yes, with data compression

All right, I explained how skipping the scan of unnecessary data structures (row groups and columns) can benefit your queries and increase overall performance. However, that is not all – remember when I told you at the beginning that one of the main benefits of the Parquet format is reduced memory usage? This is achieved through various compression algorithms.

I have already written about different types of data compression in Power BI (and the Tabular model in general) here, so maybe it is a good idea to start by reading that article.

There are two main encoding types that enable Parquet to compress the data and achieve astonishing savings in space:

  • Dictionary encoding – Parquet creates a dictionary of the distinct values in a column, and then replaces the “real” values with index values from the dictionary. Going back to our example, this process looks something like this:
Image by author

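You might think: why bother, when the product names are so short anyway? Fair enough, but now imagine that you store a long, detailed product description instead, and that this product is sold a million times. Instead of repeating that long text a million times, Parquet stores only the index value from the dictionary (an integer instead of text).

  • Run-length encoding with bit-packing – when a column contains many repeating values, run-length encoding (RLE) can bring additional savings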
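Dictionary encoding, as described above, fits in a few lines of Python. This is a minimal sketch of the idea only: the function names and the sample column are hypothetical, and a real Parquet writer additionally packs the indices into a compact binary form.

```python
def dictionary_encode(values):
    """Replace each value with its index in a dictionary of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in the dictionary
    indices = []      # one small integer per original value
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


# Hypothetical product column.
products = ["Ball", "T-Shirt", "Socks", "Socks", "T-Shirt", "Socks"]
dictionary, indices = dictionary_encode(products)
print(dictionary)  # ['Ball', 'T-Shirt', 'Socks']
print(indices)     # [0, 1, 2, 2, 1, 2]
```

Each distinct value is stored once, however long it is, and every occurrence costs only one small integer, which is where the savings come from when long values repeat many times.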

Can it be better than this?! Yes, with the Delta Lake file format

Okay, what the heck is the Delta Lake format now?! This is an article about Parquet, right?

So, to put it in plain English: Delta Lake is nothing else but the Parquet format “on steroids”. When I say “steroids”, the main one is the versioning of Parquet files. It also maintains a transaction log, which makes it possible to track all changes applied to the Parquet files. This is also known as ACID-compliant transactions.

Because it supports not only ACID transactions but also time travel (rollbacks, audits, etc.), Delta Lake brings the benefits of the Lakehouse concept. If you want to learn more, read this article from Databricks.
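The idea of a transaction log that enables time travel can be sketched in plain Python. This is a heavily simplified illustration and not Delta Lake's actual log format or API: the commit structure, the file names and the `files_at_version` function are all invented for this example. The point is only that replaying the log up to a given version tells you which Parquet files made up the table at that moment.

```python
# Hypothetical, simplified transaction log: one list of actions per commit.
# The position in the outer list is the table version.
COMMITS = [
    [{"add": "part-0001.parquet"}, {"add": "part-0002.parquet"}],     # version 0
    [{"remove": "part-0001.parquet"}, {"add": "part-0003.parquet"}],  # version 1
    [{"add": "part-0004.parquet"}],                                   # version 2
]


def files_at_version(commits, version):
    """Replay the log up to `version` and return the files that are live."""
    live = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                # The data file is only dropped from the live set; older
                # versions can still refer to it.
                live.discard(action["remove"])
    return sorted(live)


print(files_at_version(COMMITS, 0))  # the table as first written
print(files_at_version(COMMITS, 2))  # the table after all three commits
```

Reading the table "as of" an earlier version then simply means replaying fewer commits, which is the intuition behind rollbacks and audits.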

Conclusion

We evolve! And just like us, data evolves too. New flavors of data require new ways of storing them. The Parquet file format is one of the most efficient storage options available today, because it provides many benefits – both reduced memory usage, through various compression algorithms, and fast query processing, by enabling the engine to skip scanning unnecessary data.

Thanks for reading!
