Stop Creating Bad DAGs – Optimize Your Airflow Environment by Improving Your Python Code | by Alvaro Leandro Carneiro | Jan, 2025
Apache Airflow is one of the most popular orchestration tools in the data field, powering the daily operations of companies of all sizes. However, anyone who has worked with Airflow in production, especially in a complex environment, knows that it can occasionally present problems and strange bugs.
Among the many things you need to manage in an Airflow environment, one critical metric often flies under the radar: DAG parse time. Monitoring and optimizing parse time is essential to avoid performance bottlenecks and ensure the correct functioning of your orchestrations, as we'll explore in this article.
That said, this article intends to introduce airflow-parse-bench, an open-source tool that helps data engineers monitor and optimize their Airflow environments, providing insights to reduce code complexity and parse time.
Regarding Airflow, DAG parse time is often an overlooked metric. Parsing occurs every time Airflow processes your Python files to build the DAGs dynamically.
By default, all your DAGs are parsed every 30 seconds, a frequency controlled by the configuration variable min_file_process_interval. This means that every 30 seconds, all the Python code in your dags folder is read, imported, and processed to generate DAG objects containing the tasks to be scheduled. Successfully processed files are then added to the DAG Bag.
Two main Airflow components handle this process:
- DagFileProcessorManager: determines which files need to be processed.
- DagFileProcessorProcess: parses the individual files and turns them into DAG objects.
Together, these components (often referred to as the DAG processor) are executed by the Airflow scheduler, ensuring that your DAG objects are up to date before being triggered. However, for scalability and security reasons, it is also possible to run the DAG processor as a separate component of your cluster.
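For reference, recent Airflow 2.x releases can run it as a standalone service via the following command (assuming the standalone DAG processor is enabled in your scheduler configuration):
airflow dag-processor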
If your environment has only a handful of DAGs, it's unlikely that the parsing process will cause any problems. However, in production scenarios it's common to have hundreds of DAGs. In this case, if your parse time is too high, it can lead to:
- Delays in DAG scheduling.
- Increased resource utilization.
- Environment heartbeat issues.
- Scheduler failures.
- Excessive CPU and memory usage, wasting resources.
Now, imagine an environment with hundreds of DAGs containing unnecessarily complex parsing logic. Small inefficiencies can quickly turn into significant problems, affecting the stability and performance of your entire Airflow setup.
When writing Airflow DAGs, there are some important best practices to keep in mind to create optimized code. Although you can find many tutorials on how to improve your DAGs, I'll summarize some of the key principles that can significantly improve your DAG performance.
Limit Top-Level Code
One of the most common causes of slow DAG parsing is inefficient or complex top-level code. Top-level code in an Airflow DAG file is executed every time the scheduler parses the file. If this code includes resource-intensive operations, such as database queries, API calls, or dynamic task generation, it can significantly impact parsing performance.
The following code shows an example of a non-optimized DAG (a minimal sketch; the API endpoint and the DataFrame processing are hypothetical placeholders):
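import requests
import pandas as pd
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Top-level code: executed on EVERY parse cycle, not only when the task runs
response = requests.get("https://api.example.com/data")  # hypothetical endpoint
df = pd.DataFrame(response.json())
active_rows = df[df["status"] == "active"]  # hypothetical processing

def print_rows():
    print(active_rows)

with DAG(
    dag_id="dag_test",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="print_rows",
        python_callable=print_rows,
    )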
In this case, every time the file is parsed by the scheduler, the top-level code is executed, making an API request and processing a DataFrame, which can significantly affect the parse time.
Another important factor that contributes to slow parsing is top-level imports. Every library imported at the top level is loaded into memory during parsing, which can be time-consuming. To avoid this, you can move imports into functions or task definitions.
The following code shows a better version of the same DAG (again a sketch under the same assumptions), moving the heavy imports and the API call inside the task callable:
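from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_and_print_rows():
    # Heavy imports and the API call now run only at task runtime,
    # keeping the parse phase lightweight
    import requests
    import pandas as pd

    response = requests.get("https://api.example.com/data")  # hypothetical endpoint
    df = pd.DataFrame(response.json())
    print(df[df["status"] == "active"])  # hypothetical processing

with DAG(
    dag_id="dag_test",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fetch_and_print_rows",
        python_callable=fetch_and_print_rows,
    )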
Avoid Xcoms and Variables in Top-Level Code
Still on the same topic, it's particularly interesting to avoid using Xcoms and Variables in your top-level code. As stated by Google's documentation:
If you are using Variable.get() in top-level code, every time the .py file is parsed, Airflow executes a Variable.get(), which opens a session to the DB. This can dramatically slow down parse times.
To deal with this, consider using a JSON dictionary to retrieve multiple variables in a single database query, rather than making multiple Variable.get() calls. Alternatively, use Jinja templates, as variables retrieved this way are only resolved during task execution, not during DAG parsing.
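To illustrate both approaches, here's a minimal sketch; the Variable names my_config and MY_VARIABLE are placeholders:
from datetime import datetime
from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="variables_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # A single DB query retrieving several values stored as one JSON Variable,
    # instead of several separate Variable.get() calls (still runs at parse
    # time, but opens only one session; default_var avoids parse errors if
    # the Variable doesn't exist yet)
    config = Variable.get("my_config", deserialize_json=True, default_var={})

    # Jinja-templated variables are only resolved at task runtime,
    # so no DB session is opened during parsing
    BashOperator(
        task_id="print_variable",
        bash_command="echo {{ var.value.MY_VARIABLE }}",
    )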
Remove Unnecessary DAGs
Although it may seem obvious, it's important to always remember to clean up unnecessary DAGs and files from your environment:
- Remove unused DAGs: Check your dags folder and delete any files that are no longer needed.
- Use .airflowignore: Specify the files Airflow should intentionally ignore during parsing, as shown in the sketch after this list.
- Review paused DAGs: Paused DAGs are still parsed by the scheduler, consuming resources. If they are no longer required, consider removing or archiving them.
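For reference, a minimal .airflowignore could look like the following; the file names are illustrative, and by default each non-comment line is treated as a regular expression matched against file paths:
# Files matching these patterns are skipped during DAG parsing
helper_scripts/.*
legacy_dag\.py
.*_backup\.py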
Change Airflow Configuration
Finally, you can change some Airflow configurations to reduce the scheduler's resource usage, as illustrated in the example after the list below:
- min_file_process_interval: This setting controls how often (in seconds) Airflow parses your DAG files. Increasing it from the default 30 seconds can reduce the scheduler's load, at the cost of slower DAG updates.
- dag_dir_list_interval: This determines how often (in seconds) Airflow scans the dags directory for new files. If you deploy new DAGs infrequently, consider increasing this interval to reduce CPU usage.
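For illustration, both options live under the [scheduler] section of airflow.cfg (they can also be set through the corresponding environment variables); the values below are arbitrary examples rather than recommendations:
[scheduler]
# Parse each DAG file at most every 2 minutes (default: 30 seconds)
min_file_process_interval = 120
# Scan the dags folder for new files every 10 minutes (default: 300 seconds)
dag_dir_list_interval = 600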
We've discussed a lot about the importance of creating optimized DAGs to maintain a healthy Airflow environment. But how do you actually measure the parse time of your DAGs? Fortunately, there are several ways to do this, depending on your Airflow deployment or operating system.
For example, if you have a Cloud Composer deployment, you can easily retrieve a DAG parse report by executing the following command in the Google Cloud CLI:
gcloud composer environments run $ENVIRONMENT_NAME
--location $LOCATION
dags report
While retrieving the parse metric is straightforward, measuring the performance impact of your code changes can be less convenient. Every time you modify your code, you need to redeploy the updated Python file to your cloud provider, wait for the DAG to be parsed, and generate a new report, which is a slow and time-consuming process.
Another possible approach, if you're on Linux or Mac, is to run a command like this to measure the parse time locally on your machine:
time python airflow/example_dags/example.py
However, while simple, this approach isn't practical for systematically measuring and comparing the parse times of multiple DAGs.
To address these limitations, I created airflow-parse-bench, a Python library that makes it easy to measure and compare the parse times of your DAGs using Airflow's native parse method.
The airflow-parse-bench tool makes it easy to store parse times, compare results, and standardize comparisons across all your DAGs.
Installing the Library
Before installing, it's recommended to use a virtualenv to avoid library conflicts. Once that's set up, you can install the package by running the following command:
pip install airflow-parse-bench
Note: This command only installs the essential dependencies (related to Airflow and Airflow providers). You must manually install any additional libraries your DAGs depend on.
For example, if a DAG uses boto3 to interact with AWS, make sure boto3 is installed in your environment. Otherwise, you'll run into parse errors.
After that, it's necessary to initialize your Airflow database. This can be done by executing the following command:
airflow db init
In addition, if your DAGs use Airflow Variables, you must define them locally as well. However, you don't need to set real values for these variables, since the actual values aren't required for parsing purposes:
airflow variables set MY_VARIABLE 'ANY TEST VALUE'
Otherwise, you'll encounter an error like this:
error: 'Variable MY_VARIABLE does not exist'
Using the Tool
After installing the library, you can begin measuring parse times. For example, suppose you have a DAG file named dag_test.py containing the non-optimized DAG code used in the example above.
To measure its parse time, simply run:
airflow-parse-bench --path dag_test.py
This execution produces the following result:
As observed, our DAG had a parse time of 0.61 seconds. If I run the command again, I'll see slightly different values, as parse times vary across executions due to system and environmental factors:
To get a more representative number, you can average multiple executions by specifying the number of iterations:
airflow-parse-bench --path dag_test.py --num-iterations 5
Although it takes a bit longer to complete, this calculates an average parse time across five executions.
Now, to evaluate the impact of the optimizations described above, I replaced the code in my dag_test.py with the optimized version shared earlier. After executing the same command, I got the following result:
As noticed, simply applying some good practices reduced the DAG parse time by nearly 0.5 seconds, highlighting the importance of the changes we made!
There are a few other interesting features that I think are worth sharing.
As a reminder, if you have any doubts or problems using the tool, you can access the complete documentation on GitHub.
Besides that, to see all the parameters supported by the library, simply run:
airflow-parse-bench --help
Measuring Multiple DAGs
In most cases, you'll have multiple DAGs whose parse times you want to measure. To handle this use case, I created a folder named dags and placed four Python files in it.
To measure the parse times of all the DAGs in a folder, you just need to specify the folder path in the --path parameter:
airflow-parse-bench --path my_path/dags
Running this command produces a table summarizing the parse times of all the DAGs in the folder:
By default, the table is sorted from the fastest to the slowest DAG. However, you can reverse the ordering using the --order parameter:
airflow-parse-bench --path my_path/dags --order desc
Skip Unchanged DAGs
The --skip-unchanged parameter can be especially helpful during development. As the name suggests, this option skips the parse execution for DAGs that haven't changed since the last run:
airflow-parse-bench --path my_path/dags --skip-unchanged
As shown below, when some DAGs are unchanged, the result only shows the parse time differences for the modified files:
Resetting the Database
All DAG information, including metrics and history, is stored in a local SQLite database. If you want to clear all stored data and start fresh, use the --reset-db flag:
airflow-parse-bench --path my_path/dags --reset-db
This command resets the database and processes the DAGs as if it were the first execution.
Parse time is a key metric for maintaining scalable and healthy Airflow environments, especially as your orchestration needs grow more complex.
For this reason, the airflow-parse-bench library can be an important tool for helping data engineers build better DAGs. By testing your DAGs' parse times locally, you can easily and quickly find bottlenecks in your code, making your DAGs faster and more efficient.
Since the code is executed locally, the measured parse time won't be identical to the one in your Airflow cluster. However, if you manage to reduce the parse time on your local machine, a similar improvement is likely to be reproduced in your cloud environment.
Finally, this project is open to collaboration! If you have suggestions, ideas, or improvements, feel free to contribute on GitHub.