Data model validation at the column level

Tools such as dbt make creating SQL data pipelines easy and structured. But even with clearly defined models and structure, pipelines are still complex, making it easy to introduce errors and hard to verify changes to data models.
As data transformation logic grows more complex, two issues arise:
- Traditional code review processes look only at the code change and do not take into account the data impact of those changes.
- The data impact of a code change is difficult to trace. In projects with sprawling dependencies, determining the data impact becomes time-consuming, or even impossible.
GitLab's dbt DAG (shown in the image above) is a prime example of how complex real-world data projects can be. Imagine trying to trace the impact of a simple SQL logic change to a single column through this lineage DAG. Reviewing a data model update can be a daunting job.
How would you approach this kind of review?
What is data validation?
Data validation refers to the process of determining whether data is correct according to real-world requirements. This means ensuring that the SQL logic in a data model behaves as intended by checking that the resulting data is correct. Validation is usually performed after a data model has been changed, such as for new requirements, or as part of a refactor.
The challenge of data review
Data has state and is directly affected by the transformations applied to it. This is why validating data model changes is a unique challenge: both the code and the data require review.
As a result, data model reviews should validate not only correctness, but also context. In other words, how the data has changed, and whether data and metrics were impacted in unintended ways.
The data validation gap
In many data teams, the person who makes the change relies on their own knowledge, understanding, or experience to assess the impact and validate the change.
"I made a change to X, and I think I know what the impact should be. I'll validate it by doing Y."
The validation method usually falls into one of two extremes, neither of which is ideal:
- Spot checking with queries and high-level checks such as row counts and schema. It's fast, but risks missing the real impact; subtle and silent errors can slip through.
- Full testing of every downstream model. It's slow and resource-intensive, and can become prohibitively expensive as the pipeline grows.
This results in an ad-hoc data review process that is difficult to repeat and often lets silent errors through. A new methodology is needed to help developers perform accurate and targeted data validation.
A better approach using data model dependencies
To validate a change in a data project, it's essential to understand the relationships between models and how data flows through them. The dependencies between models determine how data is transferred and modified from one model to the next.
Analyze the relationships between models
As we've seen, data projects can be huge, but a data model change usually affects only a subset of models. By isolating this subset and analyzing the relationships between models, you can rule out unaffected branches and focus on the models that actually need validation, given a particular SQL logic change.
The dependencies in a data project take the following forms:
Model-to-model
The structural dependency in which columns are selected from an upstream model.
-- downstream_model
select
a,
b
from {{ ref("upstream_model") }}
Column-to-column
The column-level dependency in which a downstream column is a projection, rename, or transformation of an upstream column.
-- downstream_model
select
a,
b as b2
from {{ ref("upstream_model") }}
Model-to-column
The filtering dependency in which the downstream model uses an upstream column in a where, join, or other conditional clause.
-- downstream_model
select
a
from {{ ref("upstream_model") }}
where b > 0
Understanding the dependencies between models helps us scope the impact of a data model logic change.
Identify the impact radius
When making changes to a model's SQL, it's important to understand which other models are affected (the models to check). At a high level, this is determined through model-to-model relationships. This subset of DAG nodes is known as the impact radius.
In the DAG below, the impact radius includes nodes B (the modified model) and D (the downstream model). In dbt, these nodes can be listed using the `+` selector option.
Identifying the modified model and its downstream models is a good start, and for small changes like this it will reduce the potential surface area of data validation. However, in larger projects it can still leave a large number of downstream models.
Classifying the types of SQL changes enables you to prioritize which models need to be validated by understanding the severity of each change, ruling out branches that are known to be safe.
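The impact radius can be computed with a simple downstream traversal of the model-to-model edges. A minimal sketch in Python (the DAG edges here are a hypothetical example, not taken from a real project):

```python
# Sketch: computing the impact radius of a modified model by walking
# model-to-model edges downstream. This mirrors what dbt's `model+`
# graph selector does at the node level.
from collections import deque


def downstream(dag: dict, start: str) -> set:
    """Return the modified model plus everything downstream of it."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in dag.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical edge list: A feeds B and C; B feeds D; C feeds H.
dag = {"A": ["B", "C"], "B": ["D"], "C": ["H"]}
print(sorted(downstream(dag, "B")))  # ['B', 'D']
```

With dbt itself, the equivalent node set can be listed with `dbt ls -s b+`.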
Classify the SQL changes
Not all SQL changes carry the same level of risk to downstream data, and so they should be classified accordingly. By classifying SQL changes in this way, you can bring a systematic approach to your data review process.
A SQL change to a data model can be classified as one of the following:
Non-breaking change
Changes that do not impact downstream models, such as adding new columns, SQL formatting, or comments.
-- Non-breaking change: New column added
select
id,
category,
created_at,
-- new column
now() as ingestion_time
from {{ ref('a') }}
Partial breaking change
Changes that only impact downstream models that reference specific columns, such as removing or renaming a column, or changing a column's definition.
-- Partial breaking change: `category` column renamed
select
id,
created_at,
category as event_category
from {{ ref('a') }}
Breaking change
Changes that impact all downstream models, such as filtering, sorting, or otherwise changing the structure or meaning of the transformed data.
-- Breaking change: Filtered to exclude data
select
id,
category,
created_at
from {{ ref('a') }}
where category != 'internal'
Apply the classifications to reduce the scope
After applying these classifications to the impact radius, the number of models that need to be validated can be reduced.

In the DAG below, nodes B, C, and F have been modified, resulting in seven nodes that potentially need validation. However, not every branch containing SQL changes actually needs to be validated. Let's look at each branch:
Node C: non-breaking change
C is classified as a non-breaking change, so both C and H do not need to be checked and can be ruled out.
Node B: partial breaking change
B is classified as a partial breaking change due to a modification to the column B.C1. So, D and E only need to be checked if they reference column B.C1.
Node F: breaking change
The modification to model F is classified as a breaking change. Therefore, all downstream nodes (G and E) need to be checked. For example, model G may depend on data in the modified upstream column.
The original seven nodes have already been reduced to five that need to be checked for data impact (B, D, E, F, G). Now, by examining the SQL changes at the column level, we can reduce that number even further.
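The pruning logic described above can be sketched in a few lines of Python. The DAG edges and classifications below reproduce the worked example (B partial breaking, C non-breaking, F breaking); the function name and data layout are illustrative assumptions, not part of any tool's API:

```python
# Sketch: pruning the impact radius using change classifications.
from collections import deque


def models_to_validate(dag: dict, changes: dict) -> set:
    """Return the set of models needing validation.

    non-breaking -> nothing to validate
    partial      -> modified model + downstream models (column-level
                    lineage can narrow this subset further)
    breaking     -> modified model + all downstream models
    """
    result = set()
    for model, kind in changes.items():
        if kind == "non-breaking":
            continue
        result.add(model)
        queue = deque(dag.get(model, []))
        while queue:
            node = queue.popleft()
            if node not in result:
                result.add(node)
                queue.extend(dag.get(node, []))
    return result


dag = {"B": ["D"], "D": ["E"], "C": ["H"], "F": ["G", "E"]}
changes = {"B": "partial", "C": "non-breaking", "F": "breaking"}
print(sorted(models_to_validate(dag, changes)))  # ['B', 'D', 'E', 'F', 'G']
```

The non-breaking branch (C and H) drops out entirely, leaving the five models from the walkthrough.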
Reducing the scope with column-level lineage
Non-breaking and breaking changes are easy to classify, but when it comes to checking partial breaking changes, models need to be analyzed at the column level.
Let's look at the partial breaking change to model B, where the logic of column C1 has been modified. This change may have impacted four downstream nodes: D, E, K, and J. By following the column-level lineage, this subset can be narrowed down further.

Following column B.C1 downstream, we can see that:
- B.C1 → D.C1 is a column-to-column dependency (projection).
- D.C1 → E is a model-to-column (filtering) dependency.
- D → K is a model-to-model dependency. However, as D.C1 is not used in K, this model can be ruled out.
Therefore, the models that need to be validated in this branch are B, D, and E. Combined with the breaking change to F and its downstream model G, only five of the nine models can have a potential data impact.
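The column-level narrowing can be sketched by walking (model, column) edges instead of model edges. The edge list below encodes the worked example; the representation (tuples, `None` for a whole-model filter dependency) is an assumption made for illustration:

```python
# Sketch: narrowing a partial breaking change with column-level lineage.
# Edges are ((model, column), (model, column)) pairs. A `None` column on
# the target means the whole model depends on the source column (e.g.
# the column is used in a where clause).
def affected_by_column(column_edges: list, start: tuple) -> set:
    """Follow column lineage from a changed column and return the models
    that actually consume it."""
    affected = set()
    seen = {start}
    queue = [start]
    while queue:
        node = queue.pop()
        for src, dst in column_edges:
            if src == node and dst not in seen:
                seen.add(dst)
                affected.add(dst[0])  # record the consuming model
                queue.append(dst)
    return affected


# Worked example: B.C1 -> D.C1 (projection), D.C1 -> E (filter).
# K selects from D but never uses D.C1, so it has no edge in the
# column lineage and is pruned automatically.
edges = [(("B", "C1"), ("D", "C1")), (("D", "C1"), ("E", None))]
print(sorted(affected_by_column(edges, ("B", "C1"))))  # ['D', 'E']
```

Together with the modified model B itself, this gives exactly the branch's validation set {B, D, E}.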
Summary
Data validation after a model change is difficult, especially in large and complex pipelines. It's easy to miss silent errors, and validating changes becomes a daunting job, with data models often feeling like black boxes when it comes to downstream impact.
A systematic and repeatable process
By using this method of classifying changes and following column-level lineage, you can bring structure and accuracy to the data review process, making it efficient and repeatable. It reduces the number of models that need to be checked, simplifies the review process, and lowers cost by validating only the models that require it.
Before you go …
Dave works at Recce, where the team is building a toolkit to enable better data validation workflows. He's always happy to discuss SQL, data engineering, or to help teams navigate their data quality challenges. Connect with Dave on LinkedIn.
Review of this article was kindly provided by … (Popcorny).



