This documentation is for an older version (1.4.7) of Dagster.

Hello Dagster#

Welcome to Dagster! In this guide, you'll build a simple data pipeline in Dagster that downloads the top 10 HackerNews stories. In three quick steps, you'll have functional code and begin exploring Dagster's user interface.

Note: Before you dive in, make sure you have Python 3.8+ installed.

Let's get started!


Step 1: Create hello-dagster.py#

Create a file named hello-dagster.py that contains the following code:

import json

import pandas as pd
import requests

from dagster import AssetExecutionContext, MetadataValue, asset


@asset
def hackernews_top_story_ids():
    """Get top stories from the HackerNews top stories endpoint.

    API Docs: https://github.com/HackerNews/API#new-top-and-best-stories.
    """
    top_story_ids = requests.get(
        "https://hacker-news.firebaseio.com/v0/topstories.json"
    ).json()

    with open("hackernews_top_story_ids.json", "w") as f:
        json.dump(top_story_ids[:10], f)


# asset dependencies can be declared with the deps argument
@asset(deps=[hackernews_top_story_ids])
def hackernews_top_stories(context: AssetExecutionContext):
    """Get items based on story ids from the HackerNews items endpoint."""
    with open("hackernews_top_story_ids.json", "r") as f:
        hackernews_top_story_ids = json.load(f)

    results = []
    for item_id in hackernews_top_story_ids:
        item = requests.get(
            f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json"
        ).json()
        results.append(item)

    df = pd.DataFrame(results)
    df.to_csv("hackernews_top_stories.csv")

    # recorded metadata can be customized
    metadata = {
        "num_records": len(df),
        "preview": MetadataValue.md(df[["title", "by", "url"]].to_markdown()),
    }

    context.add_output_metadata(metadata=metadata)

Step 2: Install Python packages#

Next, install the Python packages you'll need to run your code in your favorite Python environment. Note that `tabulate` is required for the `DataFrame.to_markdown` call in the example:

# run in a terminal in your favorite python environment
pip install dagster dagster-webserver pandas tabulate

Unsure? Check out the installation guide.


Step 3: Start the Dagster UI and materialize assets#

  1. In the same directory as hello-dagster.py, run the dagster dev command shown below. This command starts a web server to host Dagster's user interface:

    # run in a terminal in your favorite python environment
    dagster dev -f hello-dagster.py
    
  2. In your browser, navigate to http://localhost:3000/.

  3. Click Materialize All to run the pipeline and create your assets. Materializing an asset runs the asset function and saves the result. This pipeline uses the Dagster defaults to save the result to a pickle file on disk.

    HackerNews assets in Dagster's Asset Graph, unmaterialized
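The step above notes that, by default, Dagster persists an asset's return value as a pickle file on disk. As a rough stdlib-only illustration of what that means (this is not Dagster's actual implementation; the path and helper names are hypothetical):

```python
import os
import pickle
import tempfile

# Rough illustration of pickle-based persistence: write a computed value
# to a pickle file, then load it back in a later "run".
def materialize(value, path):
    with open(path, "wb") as f:
        pickle.dump(value, f)

def load_materialization(path):
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.gettempdir(), "asset_output.pickle")
materialize({"num_records": 10}, path)
print(load_materialization(path))  # → {'num_records': 10}
```

Because materializations are saved, downstream assets and later runs can reuse a result without recomputing it.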

That's it! You now have two materialized Dagster assets:

HackerNews asset graph

But wait - there's more. Because the hackernews_top_stories asset specified metadata, you can view the metadata right in the UI:

  1. Click the asset.

  2. In the sidebar that displays, click the Show Markdown link in the Materialization in Last Run section. This opens a preview of the pipeline result, allowing you to view the top 10 HackerNews stories:

    Markdown preview of HackerNews top 10 stories
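The preview is simply a Markdown table produced by `df.to_markdown()` and wrapped in `MetadataValue.md`. A rough stdlib stand-in showing the shape of that table (the helper below is hypothetical, not part of pandas or Dagster):

```python
def to_markdown_table(rows, headers):
    # Build a simple pipe-delimited Markdown table, a rough stand-in for
    # pandas' DataFrame.to_markdown, which the asset uses for its preview
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(v) for v in row) + " |")
    return "\n".join(lines)

stories = [("Example story", "alice", "https://example.com")]
print(to_markdown_table(stories, ["title", "by", "url"]))
```

The UI renders Markdown metadata like this directly, which is why the story preview appears as a formatted table.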

Next steps#

Congrats on your first Dagster pipeline! This example used assets, which most Dagster projects use because they let data engineers:

  • Think in the same terms as stakeholders
  • Answer questions about data quality and lineage
  • Work with the modern data stack (dbt, Airbyte/Fivetran, Spark)
  • Create declarative freshness policies instead of task-driven cron schedules

Dagster also offers ops and jobs, but we recommend starting with assets.

While this example used a single file, most Dagster projects are organized as Python packages. From here, you can: