Fully Featured Project#

In this guide, we'll walk through a fully featured Dagster project that takes advantage of a wide range of Dagster features. This example can be useful as a point of reference for using different Dagster APIs and integrating other tools.

At a high level, this project consists of three asset groups, all centered around a contrived organization that wants to do ML and analysis on Hacker News user activity data.

Getting started#

You can find the code for this example on GitHub.

To follow along with this guide, you can bootstrap your own project with this example:


dagster project from-example \
    --name my-dagster-project \
    --example project_fully_featured

To install this example and its Python dependencies, run:


cd my-dagster-project
pip install -e .

Once you've done this, you can run:


dagster dev

to view this example in the Dagster UI.

Using Dagster concepts#

This example shows useful patterns for many Dagster concepts, including:

Organizing your assets in groups#

Software-defined assets - An asset is a software object that models a data asset. The prototypical example is a table in a database or a file in cloud storage.

This example contains three asset groups:

core: Contains data sets of activity on Hacker News, fetched from the Hacker News API. These are partitioned by hour and updated every hour.
recommender: A machine learning model that recommends stories to specific users based on their comment history, as well as the features and training set used to fit that model. These are dropped and recreated whenever the core assets receive updates.
activity_analytics: Aggregate statistics computed about Hacker News activity represented by dbt models and a Python model that depends on them. These are dropped and recreated whenever the core assets receive updates.

Varying external services or I/O without changing your DAG#

Resources - A resource is an object that models a connection to a (typically) external service. Resources can be shared between assets, and different implementations of resources can be used depending on the environment. In this example, we built multiple Hacker News API resources, all of which have the same interface but different implementations:

HNAPIClient interacts with the real Hacker News API and gets the full data set, which will be used in production.
HNAPISubsampleClient talks to the real API but subsamples the data, which is much faster than the normal implementation and is great for demoing purposes.
HNSnapshotClient reads from a local snapshot, which is useful for unit testing or environments where the connection isn't available.

Modeling resources this way separates business logic from environment-specific details: you can switch resources without changing your pipeline code.
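The idea of interchangeable clients behind one interface can be sketched in plain Python. Everything below is a simplified assumption for illustration: the `fetch_item_ids` method, the class names, and the fake data are hypothetical and do not reflect the project's real client classes.

```python
from typing import List, Protocol


class HNClient(Protocol):
    """Hypothetical shared interface; the real clients may expose other methods."""

    def fetch_item_ids(self) -> List[int]: ...


class FullClient:
    """Stands in for a client that returns the full data set."""

    def fetch_item_ids(self) -> List[int]:
        return list(range(100))


class SubsampleClient:
    """Stands in for a client that keeps only every Nth item."""

    def __init__(self, every_nth: int = 10) -> None:
        self.every_nth = every_nth

    def fetch_item_ids(self) -> List[int]:
        return list(range(0, 100, self.every_nth))


def count_items(client: HNClient) -> int:
    """Business logic depends only on the interface, not on a concrete client."""
    return len(client.fetch_item_ids())
```

Calling `count_items(FullClient())` and `count_items(SubsampleClient())` runs the same logic against different implementations, which is the property that lets an environment swap one client for another.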

I/O managers - An I/O manager is a special kind of resource that handles storing and loading assets. This example includes a wide range of I/O managers such as:

ParquetIOManager: Stores outputs as Parquet files on your local file system. It minimizes setup difficulty and is useful for local development.
DuckDBIOManager: A Dagster-provided DuckDB I/O manager that can store and load data as Pandas or PySpark DataFrames. It is useful for local development because DuckDB runs locally and requires minimal setup. In this example, a DuckDB I/O manager that can handle Pandas and PySpark DataFrames is built using the build_duckdb_io_manager function.
SnowflakeIOManager: A Dagster-provided Snowflake I/O manager that can store and load data as Pandas or PySpark DataFrames. In this example, a Snowflake I/O manager that can handle Pandas and PySpark DataFrames is built using the build_snowflake_io_manager function.

Scheduling and triggering jobs#

Schedules - A schedule allows you to execute a job at a fixed interval. This example includes an hourly schedule that materializes the core asset group every hour.

Sensors - A sensor allows you to instigate runs based on some external state change. This example includes sensors that react to different state changes.

Testing#

Testing - All Dagster entities are unit-testable. This example illustrates lightweight invocations in unit tests, including:

Testing sensors by mocking event storage and ticks and verifying definitions can load. Visit Testing sensors for more details.
Testing assets by directly invoking the @asset-decorated functions. Read more about testing assets on the Testing page.
Testing I/O managers by mocking the I/O and constructing OutputContext and InputContext with the mocks. Check out Testing an IO manager to learn more.

Environments#

This example is meant to be loaded from three deployments:

A production deployment, which stores assets in S3 and Snowflake.
A staging deployment, which stores assets in S3 and Snowflake, under a different key and database.
A local deployment, which stores assets in the local filesystem and DuckDB.

By default, it will load for the local deployment. You can toggle deployments by setting the DAGSTER_DEPLOYMENT env var to prod or staging.
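The toggle can be sketched in plain Python as a lookup keyed by the environment variable. The dictionary contents below are placeholder labels rather than real resource objects, and the function name is an assumption for illustration.

```python
import os
from typing import Dict, Mapping

# Placeholder descriptions of each deployment's storage; a real project
# would map to configured resources and I/O managers instead of strings.
RESOURCES_BY_DEPLOYMENT: Dict[str, Dict[str, str]] = {
    "prod": {"file_storage": "s3", "warehouse": "snowflake"},
    "staging": {"file_storage": "s3", "warehouse": "snowflake"},
    "local": {"file_storage": "local filesystem", "warehouse": "duckdb"},
}


def resources_for_deployment(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Pick the resource set named by DAGSTER_DEPLOYMENT, defaulting to local."""
    deployment_name = environ.get("DAGSTER_DEPLOYMENT", "local")
    return RESOURCES_BY_DEPLOYMENT[deployment_name]
```

Passing the environment in as a mapping keeps the selection logic easy to test without mutating `os.environ`.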

Integrating other tools#

Beyond leveraging Dagster core concepts, this project also uses several Dagster integration libraries:

dagster_dbt
- Dagster orchestrates dbt alongside other tools, so you can combine dbt with Python, Spark, etc. in a single workflow. This example includes a standalone dbt_project and loads dbt models as Dagster assets from an existing manifest.json file in the dbt project. This approach is useful for larger dbt projects, where you may not want to recompile the entire dbt project every time you load the Dagster project.
- Check out Using dbt with Dagster for more recommendations.
dagster_aws
- Dagster provides utilities for interfacing with AWS services, including S3, ECS, Redshift, EMR, etc.
dagster_slack
- Dagster provides out-of-the-box support for messaging a given Slack channel, via a resource that connects to Slack. Resources are useful for interacting with Slack, as you may want to send messages in production but mock the Slack resource while testing.
dagster_pyspark
- This project builds a PartitionedParquetIOManager that can take a PySpark DataFrame and store it in Parquet at the given path. It uses pyspark_resource to access a PySpark SparkSession for executing PySpark code within Dagster.
- In addition, Dagster ops can also perform computations using Spark. Visit Using Dagster with Spark to learn more.

Conclusion#

As time goes on, this guide will be kept up to date, taking advantage of new Dagster features and learnings from the community. If you have anything you'd like to add, or an additional example you'd like to see, don't hesitate to reach out!
