Backfills#

Overview#

Dagster supports data backfills for individual partitions or subsets of partitions. After defining a partitioned asset or job, you can launch a backfill, which submits runs to fill in multiple partitions at once.
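
For example, a daily-partitioned asset that a backfill might target could look like the following minimal sketch (daily_events and its body are illustrative, not from this guide):

from dagster import DailyPartitionsDefinition, asset


@asset(partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"))
def daily_events(context):
    # By default, each run materializes a single partition (one day)
    context.log.info(f"Materializing partition {context.partition_key}")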


Launching Backfills#

Partitioned Asset Backfills#

You can open the backfill modal to launch backfills for a partitioned asset using the "Materialize" button, either on the Asset detail page or when viewing a graph of assets. Backfills can also be launched for a selection of differently partitioned assets, as long as the roots share the same partitioning.

[Image: the backfill launch modal]

To observe the progress of an asset backfill, visit the backfill details page. You can get there by clicking the notification that appears after launching a backfill, or by clicking "Overview" in the top navigation pane, then the "Backfills" tab, then the ID of one of the backfills.

[Image: the backfill details page]

Single-run backfills#

By default, if you launch a backfill that covers N partitions, Dagster will launch N separate runs, one for each partition. This works well when your code is single-threaded, because it avoids overwhelming any one run with large amounts of data. However, if you're using a parallel-processing engine like Spark or Snowflake, you often don't need Dagster to provide parallelism, so splitting the backfill into multiple runs just adds extra overhead.

Dagster also supports backfills that execute as a single run covering a range of partitions. For example, this makes it possible to execute the backfill as a single Snowflake query. After the run completes, Dagster will track that all of those partitions have been filled.

For this to work, your code needs to be written in a way that allows it to operate on partition ranges, instead of just single partitions.
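
One way to opt an asset into single-run backfills is to set a backfill policy on it; a minimal sketch, assuming a Dagster version where BackfillPolicy is available (it was experimental around this release):

from dagster import BackfillPolicy, DailyPartitionsDefinition, asset


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2020-01-01"),
    backfill_policy=BackfillPolicy.single_run(),
)
def events():
    ...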

If your code uses any of the single-partition context methods or properties (partition_key, asset_partition_key, asset_partition_key_for_output, or asset_partition_key_for_input), you'll need to update it to use range-aware methods or properties like asset_partitions_time_window, asset_partition_key_range, asset_partition_keys, asset_partitions_time_window_for_output, asset_partition_key_range_for_output, asset_partitions_time_window_for_input, or asset_partition_key_range_for_input instead.

from dagster import AssetKey, DailyPartitionsDefinition, asset


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2020-01-01"),
    deps=[AssetKey("raw_events")],
)
def events(context):
    # Time window spanning every "raw_events" partition in this run's range
    (
        input_start_datetime,
        input_end_datetime,
    ) = context.asset_partitions_time_window_for_input("raw_events")
    input_data = read_data_in_datetime_range(input_start_datetime, input_end_datetime)
    output_data = compute_events_from_raw_events(input_data)

    # Time window spanning every output partition in this run's range
    (
        output_start_datetime,
        output_end_datetime,
    ) = context.asset_partitions_time_window_for_output()
    return overwrite_data_in_datetime_range(
        output_start_datetime,
        output_end_datetime,
        output_data,
    )
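
If your partitions aren't time-windowed, the same migration can be expressed with partition key ranges instead; a minimal sketch (process_key_range is a hypothetical helper, not a Dagster API):

from dagster import AssetKey, StaticPartitionsDefinition, asset


@asset(
    partitions_def=StaticPartitionsDefinition(["a", "b", "c"]),
    deps=[AssetKey("raw_events")],
)
def events_by_key(context):
    # A PartitionKeyRange with .start and .end covering the whole backfill range
    key_range = context.asset_partition_key_range
    process_key_range(key_range.start, key_range.end)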

Alternatively, if you're using an I/O manager to handle saving and loading your data, you'll need to ensure that I/O manager also uses these methods. If you're using any of the built-in database I/O managers, like those for Snowflake, BigQuery, or DuckDB, you get this behavior out of the box.

from dagster import IOManager


class MyIOManager(IOManager):
    def load_input(self, context):
        # Read the full datetime range covered by the run's input partitions
        start_datetime, end_datetime = context.asset_partitions_time_window
        return read_data_in_datetime_range(start_datetime, end_datetime)

    def handle_output(self, context, obj):
        # Overwrite the full datetime range covered by the run's output partitions
        start_datetime, end_datetime = context.asset_partitions_time_window
        return overwrite_data_in_datetime_range(start_datetime, end_datetime, obj)
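
To put this I/O manager to use, you could bind it as the io_manager resource; a minimal sketch using the io_manager decorator (my_io_manager is an illustrative name):

from dagster import Definitions, io_manager


@io_manager
def my_io_manager():
    return MyIOManager()


defs = Definitions(
    assets=[events],
    resources={"io_manager": my_io_manager},
)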

Partitioned Job Backfills#

You can launch and monitor backfills of a job using the Partitions tab.

To launch a backfill, click the "Launch backfill" button at the top center of the Partitions tab. This opens the "Launch backfill" modal, which lets you select the set of partitions to launch the backfill over. A run will be launched for each partition.

[Image: the "Launch backfill" modal]

Click the button in the bottom right to submit the runs. What happens when you click it depends on your run coordinator: with the default run coordinator, the modal exits after all runs have been launched; with the queued run coordinator, it exits after all runs have been queued.

After all the runs have been submitted, you'll be returned to the Partitions page, filtered to the runs inside the backfill. The page refreshes periodically, letting you watch the backfill's progress. Boxes turn green or red as steps in the backfill runs succeed or fail.

[Image: the Partitions page]

Using the Backfill CLI#

You can also launch backfills using the backfill CLI.

On the Partitions concept page, we defined a partitioned job called do_stuff_partitioned that has date partitions.
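
As a refresher, that job might look something like the following condensed sketch (see the Partitions page for the full definition):

from dagster import DailyPartitionsDefinition, job, op


@op
def do_stuff(context):
    context.log.info(f"Processing partition {context.partition_key}")


@job(partitions_def=DailyPartitionsDefinition(start_date="2021-04-01"))
def do_stuff_partitioned():
    do_stuff()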

With that job in place, we can run the dagster job backfill command to execute the backfill:

$ dagster job backfill -p do_stuff_partitioned

This will display a list of all the partitions in the job, ask you if you want to proceed, and then launch a run for each partition.

Executing a subset of partitions#

You can also execute a subset of the partitions rather than the full set.

You can pass the --partitions argument a comma-separated list of partition names to backfill:

$ dagster job backfill -p do_stuff_partitioned --partitions 2021-04-01,2021-04-02

Alternatively, you can specify a range of partitions using the --from and --to arguments:

$ dagster job backfill -p do_stuff_partitioned --from 2021-04-01 --to 2021-05-01