Creating testable and verifiable data pipelines is one of the focuses of Dagster. We believe ensuring data quality is critical for managing the complexity of data systems. Here, we'll cover how to write unit tests for individual assets, as well as for graphs of assets together.
This guide builds off of the project written in the tutorial. If you haven't already, you should complete the tutorial before continuing. Other guides may also build off the project created in the tutorial, but for this guide, we'll assume that the Dagster project is the same as the one created in the tutorial.
It also assumes that you have installed a test runner like pytest.
We'll start by writing a test for the topstories_word_cloud asset definition, which is an image of a word cloud of the titles of top stories on Hacker News. To run the function that derives an asset from its upstream dependencies, we can invoke it directly, as if it's a regular Python function.
Add the following code to the test_assets.py file in your tutorial_project_tests directory:
import pandas as pd
from tutorial_project.assets import topstories_word_cloud
deftest_topstories_word_cloud():
df = pd.DataFrame([{"title":"Wow, Dagster is such an awesome and amazing product. I can't wait to use it!"},{"title":"Pied Piper launches new product"},])
results = topstories_word_cloud(df)assert results isnotNone# It returned something
We'll also write a test for all the assets together. To do that, we can put them in a list and then pass it to the materialize function. That returns an ExecuteInProcessResult object, whose methods let us investigate, in detail, the success or failure of execution, the values produced by the computation, and other events associated with execution.
Update the test_assets.py file to include the following code:
from dagster import materialize
from tutorial_project.assets import(
topstory_ids,
topstories,
topstories_word_cloud
)# Instead of importing one asset, import them alldeftest_hackernews_assets():
assets =[topstory_ids, topstories, topstories_word_cloud]
result = materialize(assets)assert result.success
df = result.output_for_node("topstories")assertlen(df)==100
Use pytest, or your test runner of choice, to run the unit tests. Navigate to the top-level tutorial_project directory (the one that contains the tutorial_project_tests directory) and run:
pytest tutorial_project_tests
Wait a few seconds for the tests to run and observe the output in your terminal.