In this guide, we'll walk through a fully featured Dagster project that takes advantage of a wide range of Dagster features. This example can be useful as a point of reference for using different Dagster APIs and integrating other tools.
At a high level, this project consists of three asset groups, all centered around a contrived organization that wants to do ML and analysis on Hacker News user activity data.
Software-defined assets - An asset is a software object that models a data asset. The prototypical example is a table in a database or a file in cloud storage.
core: Contains data sets of activity on Hacker News, fetched from the Hacker News API. These are partitioned by hour and updated every hour.
recommender: A machine learning model that recommends stories to specific users based on their comment history, as well as the features and training set used to fit that model. These are dropped and recreated whenever the core assets receive updates.
activity_analytics: Aggregate statistics computed about Hacker News activity represented by dbt models and a Python model that depends on them. These are dropped and recreated whenever the core assets receive updates.
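To make "partitioned by hour" concrete, here is a minimal standard-library sketch of how hourly partition keys line up with time windows. It does not use Dagster's partitioning APIs; the function names and the key format are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

# Assumed key format for this sketch; the real project may use another one.
KEY_FORMAT = "%Y-%m-%d-%H:%M"


def hourly_partition_keys(start: datetime, end: datetime) -> list[str]:
    """Return one key per completed hour between start (rounded down) and end."""
    # Align the first partition to the top of the hour.
    current = start.replace(minute=0, second=0, microsecond=0)
    keys = []
    # Only emit a partition once its full hour has elapsed.
    while current + timedelta(hours=1) <= end:
        keys.append(current.strftime(KEY_FORMAT))
        current += timedelta(hours=1)
    return keys


def partition_window(key: str) -> tuple[datetime, datetime]:
    """Map a partition key back to the hour of data it covers."""
    start = datetime.strptime(key, KEY_FORMAT)
    return start, start + timedelta(hours=1)
```

Each hourly update then only needs to fetch and store the data that falls inside one such window, rather than reprocessing the whole history.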
Varying external services or I/O without changing your DAG
Resources - A resource is an object that models a connection to a (typically) external service. Resources can be shared between assets, and different implementations of resources can be used depending on the environment. In this example, we built multiple Hacker News API resources, all of which have the same interface but different implementations:
HNAPIClient interacts with the real Hacker News API and gets the full data set, which will be used in production.
HNAPISubsampleClient talks to the real API but subsamples the data, which is much faster than the normal implementation and is well suited to demos.
HNSnapshotClient reads from a local snapshot, which is useful for unit testing or environments where the connection isn't available.
Modeling resources this way separates business logic from environment-specific details: you can switch resources without changing your pipeline code.
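The pattern of one interface with several implementations can be sketched in plain Python. This is not the project's actual code and it does not use Dagster's resource API. The class names, the `fetch_item_ids` method, and the `every_nth` parameter are hypothetical stand-ins.

```python
from typing import Protocol


class ItemClient(Protocol):
    """Hypothetical shared interface that every client implements."""

    def fetch_item_ids(self) -> list[int]: ...


class FullClient:
    """Stand-in for a client that returns the full data set."""

    def __init__(self, ids):
        self._ids = list(ids)

    def fetch_item_ids(self) -> list[int]:
        return list(self._ids)


class SubsampleClient:
    """Wraps another client and keeps only every n-th item."""

    def __init__(self, inner: ItemClient, every_nth: int = 10):
        self._inner = inner
        self._every_nth = every_nth

    def fetch_item_ids(self) -> list[int]:
        return self._inner.fetch_item_ids()[:: self._every_nth]


class SnapshotClient:
    """Serves ids from a fixed snapshot, with no network access."""

    def __init__(self, snapshot: list[int]):
        self._snapshot = list(snapshot)

    def fetch_item_ids(self) -> list[int]:
        return list(self._snapshot)


def count_items(client: ItemClient) -> int:
    """Business logic: depends on the interface, not on a concrete client."""
    return len(client.fetch_item_ids())
```

Because `count_items` only relies on the shared method, a test can pass a `SnapshotClient` while production passes the full client, and the function body stays the same.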
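The same separation applies to storage. This example uses several I/O managers, which handle how asset outputs are stored and loaded:
ParquetIOManager: Stores outputs as Parquet files on your local file system. It minimizes setup difficulty and is useful for local development.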
DuckDBIOManager: A Dagster-provided DuckDB I/O manager that can store and load data as Pandas or PySpark DataFrames. It is useful for local development, since DuckDB runs locally and requires minimal setup. In this example, a DuckDB I/O manager that can handle Pandas and PySpark DataFrames is built using the build_duckdb_io_manager function.
SnowflakeIOManager: A Dagster-provided Snowflake I/O manager that can store and load data as Pandas or PySpark DataFrames. In this example, a Snowflake I/O manager that can handle Pandas and PySpark DataFrames is built using the build_snowflake_io_manager function.
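The idea behind swapping storage backends can be sketched without Dagster. The method names below echo Dagster's I/O manager methods (`handle_output` and `load_input`), but the signatures are simplified and take a plain asset name. JSON files stand in for Parquet so the sketch needs only the standard library. All class names and the `run_pipeline` helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path


class InMemoryStore:
    """Keeps outputs in a dict; handy for tests."""

    def __init__(self):
        self._values = {}

    def handle_output(self, asset_name, value):
        self._values[asset_name] = value

    def load_input(self, asset_name):
        return self._values[asset_name]


class LocalJsonStore:
    """Writes each output to <base_dir>/<asset_name>.json."""

    def __init__(self, base_dir):
        self._base = Path(base_dir)

    def _path(self, asset_name):
        return self._base / f"{asset_name}.json"

    def handle_output(self, asset_name, value):
        self._base.mkdir(parents=True, exist_ok=True)
        self._path(asset_name).write_text(json.dumps(value))

    def load_input(self, asset_name):
        return json.loads(self._path(asset_name).read_text())


def run_pipeline(store):
    """Two dependent steps that never mention where the data lives."""
    store.handle_output("comments", [{"id": 1, "by": "alice"}, {"id": 2, "by": "bob"}])
    comments = store.load_input("comments")
    store.handle_output("comment_count", len(comments))
    return store.load_input("comment_count")


if __name__ == "__main__":
    print(run_pipeline(InMemoryStore()))
    with tempfile.TemporaryDirectory() as tmp_dir:
        print(run_pipeline(LocalJsonStore(tmp_dir)))
```

The pipeline steps are identical in both runs; only the store passed in changes, which is the property that lets one project target local files in development and a warehouse in production.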
Schedules - A schedule allows you to execute a job at a fixed interval. This example includes an hourly schedule that materializes the core asset group every hour.
Sensors - A sensor allows you to instigate runs based on some external state change. This example includes sensors that react to different state changes.
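As a rough illustration of what reacting to a state change means, the sketch below models one sensor evaluation in plain Python: it compares observed external state against a stored cursor and requests a run for anything newer. It is not Dagster's sensor API. `RunTrigger`, `evaluate_sensor`, and the version-number scheme are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTrigger:
    """Hypothetical stand-in for a request to launch one run."""

    run_key: str


def evaluate_sensor(observed: dict[str, int], cursor: dict[str, int]):
    """Return (triggers, new_cursor) for every name whose version advanced."""
    triggers = []
    new_cursor = dict(cursor)  # never mutate the caller's cursor
    for name in sorted(observed):
        version = observed[name]
        # A name the cursor has not seen yet counts as changed.
        if version > cursor.get(name, -1):
            triggers.append(RunTrigger(run_key=f"{name}:{version}"))
            new_cursor[name] = version
    return triggers, new_cursor
```

Persisting the returned cursor between evaluations is what keeps the same change from launching a second run on the next poll.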
Dagster orchestrates dbt alongside other tools, so you can combine dbt with Python, Spark, etc. in a single workflow. This example includes a standalone dbt_project and loads dbt models as Dagster assets from an existing manifest.json file in the dbt project. This approach is useful for larger dbt projects, where you may not want to recompile the entire dbt project every time you load the Dagster project.
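To show why a precompiled manifest is enough to build an asset graph, here is a small sketch that reads model dependencies out of a manifest-shaped dictionary. It relies only on the `nodes`, `resource_type`, and `depends_on` fields of dbt's manifest and is not the loader Dagster's dbt integration uses. The model and source names in the sample are invented.

```python
import json

# A tiny, invented excerpt in the shape of a dbt manifest.json file.
SAMPLE_MANIFEST = json.loads(
    """
    {
      "nodes": {
        "model.demo.daily_stats": {
          "resource_type": "model",
          "depends_on": {"nodes": ["source.demo.core.comments"]}
        },
        "model.demo.weekly_stats": {
          "resource_type": "model",
          "depends_on": {"nodes": ["model.demo.daily_stats"]}
        },
        "test.demo.not_null_daily_stats": {
          "resource_type": "test",
          "depends_on": {"nodes": ["model.demo.daily_stats"]}
        }
      }
    }
    """
)


def model_dependencies(manifest: dict) -> dict[str, list[str]]:
    """Map each model's unique id to the ids of the nodes it depends on."""
    deps = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        # Skip tests, seeds, and other non-model nodes.
        if node.get("resource_type") != "model":
            continue
        deps[unique_id] = list(node.get("depends_on", {}).get("nodes", []))
    return deps
```

Reading the graph from the file is a plain JSON parse, so it avoids invoking dbt to recompile the project each time the definitions are loaded.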
Dagster provides out-of-the-box support for messaging a given Slack channel, via a resource that connects to Slack. Resources are useful for interacting with Slack, as you may want to send messages in production but mock the Slack resource while testing.
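The testing benefit can be sketched with the standard library's `unittest.mock`. The helper below is hypothetical. It assumes the injected client exposes a `chat_postMessage(channel=..., text=...)` method, as the Slack SDK's web client does, and the channel and job names are placeholders.

```python
from unittest.mock import MagicMock


def notify_failure(slack_client, channel: str, job_name: str) -> None:
    """Post a failure notice using whatever client is injected."""
    slack_client.chat_postMessage(channel=channel, text=f"Job {job_name} failed")


def check_notify_failure_with_mock() -> MagicMock:
    """Exercise notify_failure without sending a real message."""
    fake_client = MagicMock()
    notify_failure(fake_client, "#alerts", "hourly_refresh")
    # The mock records the call, so the test can assert on it.
    fake_client.chat_postMessage.assert_called_once_with(
        channel="#alerts", text="Job hourly_refresh failed"
    )
    return fake_client
```

In production the same function would receive a real Slack client, while tests pass the mock and inspect the recorded call.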
This project builds a PartitionedParquetIOManager that can take a PySpark DataFrame and store it in Parquet at the given path. It uses pyspark_resource to access a PySpark SparkSession for executing PySpark code within Dagster.
Dagster ops can also perform computations using Spark. Visit Using Dagster with Spark to learn more.
As time goes on, this guide will be kept up to date, taking advantage of new Dagster features and learnings from the community. If you have anything you'd like to add, or an additional example you'd like to see, don't hesitate to reach out!