Optimizing dbt Models & DAGs for Performance


In today’s world of data & analytics engineering, teams are often faced with syncing a large number of data sources bidirectionally. You may notice that a lot of companies are hiring for “technology & tooling” roles aimed at reducing the amount of software used internally, since they are clearly running into the problems of working with too many dimensions (curse of dimensionality, anyone?… cough cough).

Data modeling and understanding your DAGs have become more important for any data person. These aren’t new tools at our disposal, but ones teams are paying more attention to due to the pricing models of new and efficient tooling (Fivetran, dbt, cloud warehousing, etc.) that has become a more expensive utility cost over time.

DAGs (directed acyclic graphs) are simply a data structure used to represent and model a wide range of problems involving dependencies, order, and relationships. Their nature makes them easy to interpret and suitable for many applications.

Why Bother Looking at Your DAGs?

There are a lot of reasons to keep an eye on your models, but some of the main reasons within the data world include:

  • Dependency Management: DAGs are used to model dependencies between tasks or events. A DAG can represent the order in which different components depend on each other.
  • Data Flow: They are used to represent the flow of data and computations. DAGs are used to define and execute data pipelines, ensuring that data transformations and analysis are performed in the right sequence.
  • Causal Relationships: DAGs are used in causal inference to represent causal relationships between variables. They help researchers and data scientists analyze and understand the cause-and-effect relationships in data.
  • Optimization and Decision Making: DAGs are employed in decision theory and optimization problems. They can help model various choices along with their consequences.

dbt & DAGs

One of dbt’s features is that it lets users view their DAGs in a clean and informative manner. It is super important to pay attention to them in dbt, because the world of data modelling gets quite complex and the relationships in your system don’t stop growing.

If you are on dbt Cloud, viewing DAGs is quite easy, but on dbt Core all you have to do is run this in your terminal (and navigate to the appropriate section) with dbt installed and from the proper directory:

dbt docs generate; dbt docs serve

You can also work with the generated .json files (manifest.json and catalog.json in the target/ directory) and parse them in more useful ways. We can import them into our favorite data catalog tool, Secoda, so we can use them alongside all our other governance information!

The page that opens up is quite useful and hosts a ton of information for your engineers. It’s important to stay diligent with your documentation, as boring as it may be to do so.

Before we get to optimizing your DAGs, it is important to have a few things checked off. Your relationships, models, and pipeline stages need to be clear and follow rules. We aren’t saying they have to follow some sort of golden rule… there could be groups of rules based on the data source or structure of the data, but there needs to be consistency. Focus on optimizing your DAGs only when:

  1. You have processes and pipeline stages clearly defined
  2. You are comfortable that there is a pipeline architecture (source, staging, viewing) with layers to be followed
  3. You feel there is clarity and consistency within your DAGs
  4. You feel your DAGs follow a certain rhythm and are not all over the place

If your DAG, or sections of it, look like the joke below, there is a different issue you must solve first…

Some Basics to Help Optimize

The first thing to do before you begin to optimize should be self-explanatory… you must first profile and identify your bottlenecks. Take note of all the areas that need improvement.

Simplify Complex Transformations

Heavy transformations come with higher CPU and performance costs. The more joins and aggregations present, the more they should be broken down into smaller, manageable steps. It is even possible to create intermediate models that handle the joins independently.
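As a sketch (model and column names here are hypothetical), a heavy join can be pulled out into its own intermediate model:

```sql
-- models/intermediate/int_orders_enriched.sql
-- Isolate the expensive join in one intermediate model so every
-- downstream model can reference the result instead of repeating it.
select
    o.order_id,
    o.customer_id,
    c.customer_region,
    o.order_total
from {{ ref('stg_orders') }} as o
left join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id
```

Downstream marts then select from `{{ ref('int_orders_enriched') }}`, keeping each model small and letting dbt schedule the pieces independently.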

Implement Incremental Models

dbt’s incremental models will save you tons of time when dealing with extremely large tables (think transactional data). These models add new records and update existing ones, so you don’t need to reprocess the entire dataset. Some of the biggest bottleneck models can be fixed by making them incremental.
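A minimal incremental model might look like the following (table and column names are hypothetical); the `is_incremental()` block only applies after the first full build:

```sql
-- models/marts/fct_transactions.sql
{{ config(
    materialized='incremental',
    unique_key='transaction_id'
) }}

select
    transaction_id,
    customer_id,
    amount,
    updated_at
from {{ ref('stg_transactions') }}

{% if is_incremental() %}
-- On incremental runs, only pick up rows newer than what the
-- target table ({{ this }}) already contains.
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

The `unique_key` tells dbt which existing rows to update rather than duplicate.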

Optimize your SQL queries

DISTINCT clauses, window functions, and certain joins… that’s all we need to say…
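As one illustration of the kind of rewrite we mean (names hypothetical): aggregating before a join is usually cheaper than joining row-level data and de-duplicating the wide result afterwards.

```sql
-- Instead of: select distinct ... from orders join customers ...
-- reduce the large side first, then join the smaller result.
with order_totals as (
    select
        customer_id,
        sum(order_total) as lifetime_value
    from {{ ref('stg_orders') }}
    group by customer_id
)
select
    t.customer_id,
    c.customer_region,
    t.lifetime_value
from order_totals as t
join {{ ref('stg_customers') }} as c
    on t.customer_id = c.customer_id
```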

Use Your Warehouse Features

Don’t forget that dbt sits on top of your warehouse, which means you can utilize the benefits and features of why you chose your warehouse in the first place. Sorting and clustering tables in Redshift? Partitioning in BigQuery? Result caching in Snowflake?
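For example, on BigQuery a dbt model can declare partitioning and clustering directly in its config (model and field names are hypothetical), so queries that filter on those columns scan far less data:

```sql
-- models/marts/fct_events.sql
{{ config(
    materialized='table',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    cluster_by=['event_type']
) }}

select * from {{ ref('stg_events') }}
```

On Snowflake the equivalent knob is `cluster_by`; on Redshift, the `sort` and `dist` configs.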

Testing & Monitoring

Please don’t forget to test and monitor your pipeline to ensure it continues to perform efficiently. dbt has built-in tests that can do this simply for you, but you can add a dedicated testing tool if needed. People tend to put a problem away once it is solved, but without consistent testing and monitoring, you are just creating and adding to a growing garbage bin.
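Beyond the generic tests declared in schema.yml, dbt also supports singular tests: a SQL file in the tests/ directory that fails if it returns any rows. A sketch, with a hypothetical model and column:

```sql
-- tests/assert_no_negative_amounts.sql
-- dbt runs this file as a test; any returned row counts as a failure.
select transaction_id
from {{ ref('fct_transactions') }}
where amount < 0
```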

An Introduction to Some Other Techniques

A lot of improvement can be made from the above alone, but here are brief introductions to a few methodologies that increase performance when executing your DAGs.

Materialization Schedules

Any model that requires intense calculations should be run as a materialized table as opposed to a view, and run on a specific schedule. Views are fine in certain situations and can be used for minimal editing in your presentation layer (say, self-serve drag & drop tools, or cleaning names to be more human friendly).
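A sketch of the config that makes this choice explicit (model name hypothetical); the heavy aggregation is computed once per scheduled run instead of on every query:

```sql
-- models/marts/fct_revenue_daily.sql
{{ config(materialized='table') }}

select
    date_trunc('day', order_date) as order_day,
    sum(order_total) as daily_revenue
from {{ ref('stg_orders') }}
group by 1
```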

The argument that there is urgency to have updated data will always come up, but more often than not, no one needs near real-time data unless they work in certain fields (say, stock markets or certain health care providers) where it is absolutely necessary. Even then, not all data in these fields needs minimal lag between ingestion and presentation.

When you have a solid understanding of your different model schedules and when they need to run, you can start to implement tags. Tags can be applied per model and will help greatly with your orchestration. Decoupling one giant run for everything into different schedules through tags ensures you only run what needs to be run at a specific time.
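Tags are set in a model’s config (the tag name here is hypothetical) and then selected at run time:

```sql
-- models/marts/fct_revenue_daily.sql
{{ config(tags=['daily']) }}

select * from {{ ref('stg_orders') }}
```

Your daily job then runs `dbt run --select tag:daily`, while an hourly job runs only `dbt run --select tag:hourly`, so each schedule touches only its own slice of the DAG.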


Multithreading

Multithreading in dbt does exist, and if you have nicely designed DAGs… why would you not run multiple models at the same time? Simply turning up the thread count (for example, `dbt run --threads 8`) will accomplish a fair amount, but it may actually stall your DAG unless you apply it with some thought.

It is possible to use multithreading to set up a different queue solution instead of dbt’s built-in one, but this requires a fair amount of time, effort, and code complexity. It generally yields benefits worth the effort only in larger organizations, where a 5% decrease in run time saves more cash than your house costs nowadays.

dbt runs models in “levels”, and this works well because it can’t execute models in the wrong order (unless you tried to hack up its core and screwed up!), but this is also what can lead to stalling. By running a different queue solution, you can ensure each model executes as soon as its dependencies are ready.

The State of Analytics Engineering

The truth is that analytics engineering as a field is getting larger by the day, and new tools and methodologies will keep coming out to solve the problems an analytics engineer solves. Data teams know the importance of the field because they have gone through the suffering of poor data governance. Companies tend to put the need for this field aside because they may not see “immediate ROI benefits”. Yet they expect that hiring data analysts, data scientists, and data engineers will increase their knowledge so they can realize the greater ROI.

When companies are ready to truly invest in their data capabilities, and not just by hiring an analyst, they will realize the quality they can get out of their data team.

At Kizmet Solutions we can help achieve your strategies and see the potential in your data stack.

