Skip to main content

Developer Blog | dbt Developer Hub

Find tutorials, product updates, and developer insights in the dbt Developer Blog.

Start here

· 8 min read

The problem, the builder and tooling

The problem: My partner and I are considering buying a property in Portugal. There is no reference data for the real estate market here - how many houses are being sold, for what price? Nobody knows except the property office and maybe the banks, and they don’t readily divulge this information. The only data source we have is Idealista, which is a portal where real estate agencies post ads.

Unfortunately, there are significantly fewer properties than ads - it seems many real estate companies re-post the same ad that others do, with intentionally different data and often misleading bits of info. The real estate agencies do this so the interested parties reach out to them for clarification, and from there they can start a sales process. At the same time, the website with the ads is incentivised to allow this to continue as they get paid per ad, not per property.

The builder: I’m a data freelancer who deploys end to end solutions, so when I have a data problem, I cannot just let it go.

The tools: I want to be able to run my project on Google Cloud Functions due to the generous free tier. dlt is a new Python library for declarative data ingestion which I have wanted to test for some time. Finally, I will use dbt Core for transformation.

The starting point

If I want to have reliable information on the state of the market I will need to:

  • Grab the messy data from Idealista and historize it.
  • Deduplicate existing listings.
  • Try to infer what listings sold for how much.

Once I have deduplicated listings with some online history, I can get an idea:

  • How expensive which properties are.
  • How fast they get sold, hopefully a signal of whether they are “worth it” or not.

Towards a solution

The solution has pretty standard components:

  • An EtL pipeline. The little t stands for normalisation, such as transforming strings to dates or unpacking nested structures. This is handled by dlt functions written in Python.
  • A transformation layer taking the source data loaded by my dlt functions and creating the tables necessary, handled by dbt.
  • Due to the complexity of deduplication, I needed to add a human element to confirm the deduplication in Google Sheets.

These elements are reflected in the diagram below and further clarified in greater detail later in the article:

Project architectureProject architecture

Ingesting the data

For ingestion, I use a couple of sources:

First, I ingest home listings from the Idealista API, accessed through API Dojo's freemium wrapper. The dlt pipeline I created for ingestion is in this repo.

After an initial round of transformation (described in the next section), the deduplicated data is loaded into BigQuery where I can query it from the Google Sheets client and manually review the deduplication.

When I'm happy with the results, I use the ready-made dlt Sheets source connector to pull the data back into BigQuery, as defined here.

Transforming the data

For transforming I use my favorite solution, dbt Core. For running and orchestrating dbt on Cloud Functions, I am using dlt’s dbt Core runner. The benefit of the runner in this context is that I can re-use the same credential setup, instead of creating a separate profiles.yml file.

This is the package I created: https://github.com/euanjohnston-dev/idealista_dbt_pipeline

Production-readying the pipeline

To make the pipeline more “production ready”, I made some improvements:

  • Using a credential store instead of hard-coding passwords, in this case Google Secret Manager.
  • Be notified when the pipeline runs and what the outcome is. For this I sent data to Slack via a dlt decorator that posts the error on failure and the metadata on success.
from dlt.common.runtime.slack import send_slack_message

def notify_on_completion(hook):
def decorator(func):
def wrapper(*args, **kwargs):
try:
load_info = func(*args, **kwargs)
message = f"Function {func.__name__} completed successfully. Load info: {load_info}"
send_slack_message(hook, message)
return load_info
except Exception as e:
message = f"Function {func.__name__} failed. Error: {str(e)}"
send_slack_message(hook, message)
raise
return wrapper
return decorator

The outcome

The outcome was first and foremost a visualisation highlighting the unique properties available in my specific area of search. The map shown on the left of the page gives a live overview of location, number of duplicates (bubble size) and price (bubble colour) which can amongst other features be filtered using the sliders on the right. This represents a much better decluttered solution from which to observe the actual inventory available.

Dashboard mapping overviewDashboard mapping overview

Further charts highlight additional metrics which – now that deduplication is complete – can be accurately measured including most importantly, the development over time of “average price/square metre” and those properties which have been inferred to have been sold.

Next steps

This version was very much about getting a base from which to analyze the properties for my own personal use case.

In terms of further development which could take place, I have had interest from people to run the solution on their own specific target area.

For this to work at scale I would need a more robust method to deal with duplicate attribution, which is a difficult problem as real estate agencies intentionally change details like number of rooms or surface area.

Perhaps this is a problem ML or GPT could solve equally well as a human, given the limited options available.

Learnings and conclusion

The data problem itself was an eye opener into the real-estate market. It’s a messy market full of unknowns and noise, which adds a significant purchase risk to first time buyers.

Tooling wise, it was surprising how quick it was to set everything up. dlt integrates well with dbt and enables fast and simple data ingestion, making this project simpler than I thought it would be.

dlt

Good:

  • As a big fan of dbt I love how seamlessly the two solutions complement one another. dlt handles the data cleaning and normalisation automatically so I can focus on curating and modelling it in dbt. While the automatic unpacking leaves some small adjustments for the analytics engineer, it’s much better than cleaning and typing json in the database or in custom python code.
  • When creating my first dummy pipeline I used duckdb. It felt like a great introduction into how simple it is to get started and provided a solid starting block before developing something for the cloud.

Bad:

  • I did have a small hiccup with the google sheets connector assuming an oauth authentication over my desired sdk but this was relatively easy to rectify. (explicitly stating GcpServiceAccountCredentials in the init.py file for the source).
  • Using both a verified source in the gsheets connector and building my own from Rapid API endpoints seemed equally intuitive. However I would have wanted more documentation on how to run these 2 pipelines in the same script with the dbt pipeline.

dbt

No surprises there. I developed the project locally, and to deploy to cloud functions I injected credentials to dbt via the dlt runner. This meant I could re-use the setup I did for the other dlt pipelines.

def dbt_run():
# make an authenticated connection with dlt to the dwh
pipeline = dlt.pipeline(
pipeline_name='dbt_pipeline',
destination='bigquery', # credentials read from env
dataset_name='dbt'
)
# make a venv in case we have lib conflicts between dlt and current env
venv = dlt.dbt.get_venv(pipeline)
# package the pipeline, dbt package and env
dbt = dlt.dbt.package(pipeline, "dbt/property_analytics", venv=venv)
# and run it
models = dbt.run_all()
# show outcome
for m in models:
print(f"Model {m.model_name} materialized in {m.time} with status {m.status} and message {m.message}"

Cloud functions

While I had used cloud functions before, I had never previously set them up for dbt and I was able to easily follow dlt’s docs to run the pipelines there. Cloud functions is a great solution to cheaply run small scale pipelines and my running cost of the project is a few cents a month. If the insights drawn from the project help us save even 1% of a house price, the project will have been a success.

To sum up

dlt feels like the perfect solution for anyone who has scratched the surface of python development. To be able to have schemas ready for transformation in such a short space of time is truly… transformational. As a freelancer, being able to accelerate the development of pipelines is a huge benefit within companies who are often frustrated with the amount of time it takes to start ‘showing value’.

I’d welcome the chance to discuss what’s been built to date or collaborate on any potential further development in the comments below.

· 9 min read
Dave Connors

Picture this — you’ve got a massive dbt project, thousands of models chugging along, creating actionable insights for your stakeholders. A ticket comes your way a model needs to be refactored! "No problem," you think to yourself, "I will simply make that change and test it locally!" You look at your lineage, and realize this model is many layers deep, buried underneath a long chain of tables and views.

“OK,” you think further, “I’ll just run a dbt build -s +my_changed_model to make sure I have everything I need built into my dev schema and I can test my changes”. You run the command. You wait. You wait some more. You get some coffee, and completely take yourself out of your dbt development flow state. A lot of time and money down the drain to get to a point where you can start your work. That’s no good!

Luckily, dbt’s defer functionality allow you to only build what you care about when you need it, and nothing more. This feature helps developers spend less time and money in development, helping ship trusted data products faster. dbt Cloud offers native support for this workflow in development, so you can start deferring without any additional overhead!

· 9 min read
Amy Chen

Overview

Over the course of my three years running the Partner Engineering team at dbt Labs, the most common question I've been asked is, How do we integrate with dbt? Because those conversations often start out at the same place, I decided to create this guide so I’m no longer the blocker to fundamental information. This also allows us to skip the intro and get to the fun conversations so much faster, like what a joint solution for our customers would look like.

This guide doesn't include how to integrate with dbt Core. If you’re interested in creating a dbt adapter, please check out the adapter development guide instead.

Instead, we're going to focus on integrating with dbt Cloud. Integrating with dbt Cloud is a key requirement to become a dbt Labs technology partner, opening the door to a variety of collaborative commercial opportunities.

Here I'll cover how to get started, potential use cases you want to solve for, and points of integrations to do so.

· 9 min read
Jordan Stein

There’s nothing quite like the feeling of launching a new product. On launch day emotions can range from excitement, to fear, to accomplishment all in the same hour. Once the dust settles and the product is in the wild, the next thing the team needs to do is track how the product is doing. How many users do we have? How is performance looking? What features are customers using? How often? Answering these questions is vital to understanding the success of any product launch.

At dbt we recently made the Semantic Layer Generally Available. The Semantic Layer lets teams define business metrics centrally, in dbt, and access them in multiple analytics tools through our semantic layer APIs. I’m a Product Manager on the Semantic Layer team, and the launch of the Semantic Layer put our team in an interesting, somewhat “meta,” position: we need to understand how a product launch is doing, and the product we just launched is designed to make defining and consuming metrics much more efficient. It’s the perfect opportunity to put the semantic layer through its paces for product analytics. This blog post walks through the end-to-end process we used to set up product analytics for the dbt Semantic Layer using the dbt Semantic Layer.

· 5 min read
Joel Labes
The Bottom Line:

You should split your Jobs across Environments in dbt Cloud based on their purposes (e.g. Production and Staging/CI) and set one environment as Production. This will improve your CI experience and enable you to use dbt Explorer.

Environmental segmentation has always been an important part of the analytics engineering workflow:

  • When developing new models you can process a smaller subset of your data by using target.name or an environment variable.
  • By building your production-grade models into a different schema and database, you can experiment in peace without being worried that your changes will accidentally impact downstream users.
  • Using dedicated credentials for production runs, instead of an analytics engineer's individual dev credentials, ensures that things don't break when that long-tenured employee finally hangs up their IDE.

Historically, dbt Cloud required a separate environment for Development, but was otherwise unopinionated in how you configured your account. This mostly just worked – as long as you didn't have anything more complex than a CI job mixed in with a couple of production jobs – because important constructs like deferral in CI and documentation were only ever tied to a single job.

But as companies' dbt deployments have grown more complex, it doesn't make sense to assume that a single job is enough anymore. We need to exchange a job-oriented strategy for a more mature and scalable environment-centric view of the world. To support this, a recent change in dbt Cloud enables project administrators to mark one of their environments as the Production environment, just as has long been possible for the Development environment.

Explicitly separating your Production workloads lets dbt Cloud be smarter with the metadata it creates, and is particularly important for two new features: dbt Explorer and the revised CI workflows.

· 6 min read
Kshitij Aranke
Doug Beatty

Hi all, I’m Kshitij, a senior software engineer on the Core team at dbt Labs. One of the coolest moments of my career here thus far has been shipping the new dbt clone command as part of the dbt-core v1.6 release.

However, one of the questions I’ve received most frequently is guidance around “when” to clone that goes beyond the documentation on “how” to clone. In this blog post, I’ll attempt to provide this guidance by answering these FAQs:

  1. What is dbt clone?
  2. How is it different from deferral?
  3. Should I defer or should I clone?

· 11 min read
Amy Chen
note

This blog post was updated on December 18, 2023 to cover the support of MVs on dbt-bigquery and updates on how to test MVs.

Introduction

The year was 2020. I was a kitten-only household, and dbt Labs was still Fishtown Analytics. A enterprise customer I was working with, Jetblue, asked me for help running their dbt models every 2 minutes to meet a 5 minute SLA.

After getting over the initial terror, we talked through the use case and soon realized there was a better option. Together with my team, I created lambda views to meet the need.

Flash forward to 2023. I’m writing this as my giant dog snores next to me (don’t worry the cats have multiplied as well). Jetblue has outgrown lambda views due to performance constraints (a view can only be so performant) and we are at another milestone in dbt’s journey to support streaming. What. a. time.

Today we are announcing that we now support Materialized Views in dbt. So, what does that mean?

· 8 min read
Pedro Brito de Sa

Whether you are creating your pipelines into dbt for the first time or just adding a new model once in a while, good documentation and testing should always be a priority for you and your team. Why do we avoid it like the plague then? Because it’s a hassle having to write down each individual field, its description in layman terms and figure out what tests should be performed to ensure the data is fine and dandy. How can we make this process faster and less painful?

By now, everyone knows the wonders of the GPT models for code generation and pair programming so this shouldn’t come as a surprise. But ChatGPT really shines at inferring the context of verbosely named fields from database table schemas. So in this post I am going to help you 10x your documentation and testing speed by using ChatGPT to do most of the leg work for you.

· 15 min read
Rastislav Zdechovan
Sean McIntyre

Data Vault 2.0 is a data modeling technique designed to help scale large data warehousing projects. It is a rigid, prescriptive system detailed vigorously in a book that has become the bible for this technique.

So why Data Vault? Have you experienced a data warehousing project with 50+ data sources, with 25+ data developers working on the same data platform, or data spanning 5+ years with two or more generations of source systems? If not, it might be hard to initially understand the benefits of Data Vault, and maybe Kimball modelling is better for you. But if you are in any of the situations listed, then this is the article for you!

· 14 min read
Santiago Jauregui

Introduction

Most data modeling approaches for customer segmentation are based on a wide table with user attributes. This table only stores the current attributes for each user, and is then loaded into the various SaaS platforms via Reverse ETL tools.

Take for example a Customer Experience (CX) team that uses Salesforce as a CRM. The users will create tickets to ask for assistance, and the CX team will start attending them in the order that they are created. This is a good first approach, but not a data driven one.

An improvement to this would be to prioritize the tickets based on the customer segment, answering our most valuable customers first. An Analytics Engineer can build a segmentation to identify the power users (for example with an RFM approach) and store it in the data warehouse. The Data Engineering team can then export that user attribute to the CRM, allowing the customer experience team to build rules on top of it.

· 9 min read
Mikael Thorup

At Lunar, most of our dbt models are sourcing from event-driven architecture. As an example, we have the following models for our activity_based_interest folder in our ingestion layer:

  • activity_based_interest_activated.sql
  • activity_based_interest_deactivated.sql
  • activity_based_interest_updated.sql
  • downgrade_interest_level_for_user.sql
  • set_inactive_interest_rate_after_july_1st_in_bec_for_user.sql
  • set_inactive_interest_rate_from_july_1st_in_bec_for_user.sql
  • set_interest_levels_from_june_1st_in_bec_for_user.sql

This results in a lot of the same columns (e.g. account_id) existing in different models, across different layers. This means I end up:

  1. Writing/copy-pasting the same documentation over and over again
  2. Halfway through, realizing I could improve the wording to make it easier to understand, and go back and update the .yml files I already did
  3. Realizing I made a syntax error in my .yml file, so I go back and fix it
  4. Realizing the columns are defined differently with different wording being used in other folders in our dbt project
  5. Reconsidering my choice of career and pray that a large language model will steal my job
  6. Considering if there’s a better way to be generating documentation used across different models

· 18 min read
Sterling Paramore

This article covers an approach to handling time-varying ragged hierarchies in a dimensional model. These kinds of data structures are commonly found in manufacturing, where components of a product have both parents and children of arbitrary depth and those components may be replaced over the product's lifetime. The strategy described here simplifies many common types of analytical and reporting queries.

To help visualize this data, we're going to pretend we are a company that manufactures and rents out eBikes in a ride share application. When we build a bike, we keep track of the serial numbers of the components that make up the bike. Any time something breaks and needs to be replaced, we track the old parts that were removed and the new parts that were installed. We also precisely track the mileage accumulated on each of our bikes. Our primary analytical goal is to be able to report on the expected lifetime of each component, so we can prioritize improving that component and reduce costly maintenance.

· 12 min read
Sung Won Chung
Kira Furuichi

I, Sung, entered the data industry by chance in Fall 2014. I was using this thing called audit command language (ACL) to automate debits equal credits for accounting analytics (yes, it’s as tedious as it sounds). I remember working my butt off in a hotel room in Des Moines, Iowa where the most interesting thing there was a Panda Express. It was late in the AM. I’m thinking about 2 am. And I took a step back and thought to myself, “Why am I working so hard for something that I just don’t care about with tools that hurt more than help?”

· 5 min read
Callum McCann

Hello, my dear data people.

If you haven’t read Nick & Roxi’s blog post about what’s coming in the future of the dbt Semantic Layer, I highly recommend you read through that, as it gives helpful context around what the future holds.

With that said, it has come time for us to bid adieu to our beloved dbt_metrics package. Upon the release of dbt-core v1.6 in late July, we will be deprecating support for the dbt_metrics package.

· 12 min read
Arthur Marcon
Lucas Bergo Dias
Christian van Bellen

Alteryx is a visual data transformation platform with a user-friendly interface and drag-and-drop tools. Nonetheless, Alteryx may have difficulties to cope with the complexity increase within an organization’s data pipeline, and it can become a suboptimal tool when companies start dealing with large and complex data transformations. In such cases, moving to dbt can be a natural step, since dbt is designed to manage complex data transformation pipelines in a scalable, efficient, and more explicit manner. Also, this transition involved migrating from on-premises SQL Server to Snowflake cloud computing. In this article, we describe the differences between Alteryx and dbt, and how we reduced a client's 6-hour runtime in Alteryx to 9 minutes with dbt and Snowflake at Indicium Tech.

· 20 min read
Jonathan Neo
Dimensional modeling is one of many data modeling techniques that are used by data practitioners to organize and present data for analytics. Other data modeling techniques include Data Vault (DV), Third Normal Form (3NF), and One Big Table (OBT) to name a few.
Data modeling techniques on a normalization vs denormalization scaleData modeling techniques on a normalization vs denormalization scale

While the relevance of dimensional modeling has been debated by data practitioners, it is still one of the most widely adopted data modeling technique for analytics.

Despite its popularity, resources on how to create dimensional models using dbt remain scarce and lack detail. This tutorial aims to solve this by providing the definitive guide to dimensional modeling with dbt.

By the end of this tutorial, you will:

  • Understand dimensional modeling concepts
  • Set up a mock dbt project and database
  • Identify the business process to model
  • Identify the fact and dimension tables
  • Create the dimension tables
  • Create the fact table
  • Document the dimensional model relationships
  • Consume the dimensional model

· 12 min read
João Antunes
Yannick Misteli
Sean McIntyre

Teams thrive when each team member is provided with the tools that best complement and enhance their skills. You wouldn’t hand Cristiano Ronaldo a tennis racket and expect a perfect serve! At Roche, getting the right tools in the hands of our teammates was critical to our ability to grow our data team from 10 core engineers to over 100 contributors in just two years. We embraced both dbt Core and dbt Cloud at Roche (a dbt-squared solution, if you will!) to quickly scale our data platform.

· 7 min read
Benoit Perigaud

Editor's note—this post assumes intermediate knowledge of Jinja and macros development in dbt. For an introduction to Jinja in dbt check out the documentation and the free self-serve course on Jinja, Macros, Pacakages.

Jinja brings a lot of power to dbt, allowing us to use ref(), source() , conditional code, and macros. But, while Jinja brings flexibility, it also brings complexity, and like many times with code, things can run in expected ways.

The debug() macro in dbt is a great tool to have in the toolkit for someone writing a lot of Jinja code, but it might be difficult to understand how to use it and what benefits it brings.

Let’s dive into the last time I used debug() and how it helped me solve bugs in my code.

· 15 min read
Arthur Marcon
Lucas Bergo Dias
Christian van Bellen

Auditing tables is a major part of analytics engineers’ daily tasks, especially when refactoring tables that were built using SQL Stored Procedures or Alteryx Workflows. In this article, we present how the audit_helper package can (as the name suggests) help the table auditing process to make sure a refactored model provides (pretty much) the same output as the original one, based on our experience using this package to support our clients at Indicium Tech®.