While developing Wizart Home Visualizer, we use complex algorithms based on deep learning models. To keep them performing well, we rely on a data-centric approach. However, building high-quality datasets requires increasingly complex data pipelines.
So today, we'll explain to you how we deal with this challenge using Apache Airflow.
Apache Airflow is a tool for automating ETL (extract, transform, load) pipelines. It has an intuitive interface for scheduling and managing workflow dependencies. With its help, you can maintain and manage data pipelines that involve AWS S3, RabbitMQ, various DBMSs, and load-testing tools.
In Airflow, workflows are defined as directed acyclic graphs (DAGs). A DAG describes a pipeline in which each downstream node depends on the successful completion of its upstream nodes.
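To illustrate, here is a minimal sketch of a DAG definition; the dag id, task ids, schedule, and commands are placeholders, not our actual pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_etl",          # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract'")
    transform = BashOperator(task_id="transform", bash_command="echo 'transform'")
    load = BashOperator(task_id="load", bash_command="echo 'load'")

    # Each downstream task runs only after its upstream task succeeds.
    extract >> transform >> load
```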
When working with Airflow, you'll need to provide hosts, ports, logins, and passwords for other systems and services so that your pipelines can connect to them.
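For example, such credentials can be configured as Airflow connections (Admin → Connections in the UI) and then referenced from code by their id. A small sketch, where the connection id is a made-up placeholder:

```python
from airflow.hooks.base import BaseHook

# "wizart_postgres" is an illustrative connection id configured in the Airflow UI.
conn = BaseHook.get_connection("wizart_postgres")
print(conn.host, conn.port, conn.login)  # the password is available via conn.password
```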
As mentioned, Apache Airflow is our ETL (extract-transform-load) orchestration tool of choice. Its main advantages are the following:
Automated pipelines help us process dozens of material brands in our PIM system and prepare them for visualization in digital showrooms. They also speed up the processing of raw data, which later feeds the continuous improvement of our computer vision algorithms.
You can find more details and examples in our article on Medium. For now, we are going to focus on another way to use Apache Airflow: automating metrics reports.
Our second automation use case, after data processing pipelines, is gathering analytical and technical insights for the company. It helps the team generate reports with BI Engine and periodically track the key metrics of our services.
The resulting dashboards contain a detailed summary with visualizations of how key metrics change, for example, API usage statistics over a given period.
Currently, we prepare insights for several scheduled and on-demand reports in Wizart:
Let’s have a quick look at one of them.
Wizart Visualizer keeps gaining advanced features, and the number of active users of our APIs and services grows over time. These are the two main reasons why it is so important to monitor the performance of our services and keep it high.
For this purpose, we started running load tests regularly for each standalone service. The tests were developed using Locust, a popular load-testing tool for Python. Even though the tests themselves took only a few lines of code, interpreting and properly sharing their results within the company often took time. So we went further with automation.
The DAG runs load tests by triggering a locustfile (i.e. a Python script) in a virtual environment. Each test run is configured with the following parameters:
The intuition behind adding a custom requests-per-second parameter is to maintain a stable load across test iterations, which is not quite achievable with Locust's built-in functionality. To produce concurrent requests, we've used a pseudo-thread group available in the gevent library. A short example of defining a Locust task is below:
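Here is a minimal sketch of what such a task could look like; the RPS value, host, and /predict endpoint are illustrative assumptions, not our actual configuration:

```python
from gevent.pool import Group
from locust import HttpUser, constant, task

# Custom requests-per-second parameter for the test run (illustrative value).
RPS = 20


class ApiUser(HttpUser):
    host = "https://api.example.com"  # placeholder host
    wait_time = constant(1)           # one task iteration per second

    @task
    def load_endpoint(self):
        # Fire RPS concurrent requests in a gevent pseudo-thread group so that
        # each one-second iteration produces a stable number of requests.
        group = Group()
        for _ in range(RPS):
            group.spawn(self.client.get, "/predict")
        group.join()
```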
When the rps value is too high, the test stops safely to prevent a denial of service. The rps value can then be interpreted as the upper bound on the number of requests per second the service can handle.
Since Locust automatically saves request stats to a .csv file, Airflow picks it up, transforms it into a pandas dataframe, and populates a Google Sheets table with it. This way, we automatically receive fresh data for our analytical dashboards.
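A rough sketch of that publishing step might look like the following; the gspread client, the service-account file, and the sheet and CSV names are assumptions made for illustration:

```python
import gspread
import pandas as pd


def publish_stats(csv_path: str = "results_stats.csv") -> None:
    # Locust writes aggregated request stats to <prefix>_stats.csv.
    df = pd.read_csv(csv_path)

    # Authenticate with a Google service account and open the target sheet
    # (both names are placeholders).
    client = gspread.service_account(filename="service_account.json")
    worksheet = client.open("load-test-metrics").sheet1

    # Write the header plus all rows so the BI dashboard sees fresh data.
    worksheet.update([df.columns.tolist()] + df.values.tolist())
```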
A DAG tree represents a sequential set of tasks designed for collecting performance metrics.
All the job steps are atomic, so the chance of something failing in an unexpected way is minimal, and when something does fail, debugging is easy.
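As a rough sketch (with illustrative paths, schedule, and task ids, not our exact implementation), such a sequential DAG could be wired up like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def publish_stats():
    # Placeholder for the CSV -> Google Sheets step sketched above.
    ...


with DAG(
    dag_id="load_test_report",       # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    run_load_test = BashOperator(
        task_id="run_load_test",
        # Run the locustfile headless inside the prepared virtual environment;
        # the paths and run time are placeholders.
        bash_command=(
            "/opt/venvs/loadtest/bin/locust -f /opt/loadtests/locustfile.py "
            "--headless --run-time 10m --csv /tmp/results"
        ),
    )
    publish_results = PythonOperator(
        task_id="publish_results",
        python_callable=publish_stats,
    )

    # Each atomic step runs only if the previous one succeeded.
    run_load_test >> publish_results
```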
So, this is how we implemented a fully automated process for preparing metrics reports. Some extra features made the pipelines even more advanced, such as the ability to run back-dated jobs by simply passing the beginning and ending dates as parameters. We also integrated callbacks with Microsoft Teams via a webhook mechanism to send notifications, e.g. when a DAG task fails. By analyzing these insights, we improve our products to provide a better user experience, increase service stability, and boost sales.
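For illustration, a failure callback of this kind could be attached roughly as follows; the webhook URL and message format are placeholders, not our actual integration:

```python
import requests

TEAMS_WEBHOOK_URL = "https://example.webhook.office.com/placeholder"  # placeholder


def notify_teams_on_failure(context):
    # Airflow passes the task context to on_failure_callback handlers.
    task_instance = context["task_instance"]
    message = (
        f"DAG {task_instance.dag_id}, task {task_instance.task_id} "
        f"failed on {context['ds']}."
    )
    requests.post(TEAMS_WEBHOOK_URL, json={"text": message})


# Attach the callback via default_args (or per task) in the DAG definition:
default_args = {"on_failure_callback": notify_teams_on_failure}
```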
As you can see, Apache Airflow is a powerful automation tool that can be used in many ways to make your daily tasks easier.