There are a substantial number of breaking changes in the 0.7.0 release.
Please see 070_MIGRATION.md for instructions regarding migrating old code.
Scheduler
The scheduler configuration has been moved from the @schedules decorator to DagsterInstance.
Existing schedules that have been running are no longer compatible with current storage. To
migrate, remove the scheduler argument on all @schedules decorators:
Finally, if you had any existing schedules running, delete the existing $DAGSTER_HOME/schedules
directory and run dagster schedule wipe && dagster schedule up to re-instantiate schedules in a
valid state.
The should_execute and environment_dict_fn arguments to ScheduleDefinition now have a
required first argument context, representing the ScheduleExecutionContext.
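As a rough illustration of the new callback signatures, the two functions can be sketched as plain Python callables that accept context first. This is a minimal sketch under assumptions: is_weekday, the solid name my_solid, and its config key are hypothetical, context is unused here, and passing the functions to ScheduleDefinition is omitted.

```python
import datetime


def is_weekday(day):
    # Monday is 0 and Sunday is 6, so weekdays are 0 through 4.
    return day.weekday() < 5


def should_execute(context):
    # `context` stands in for the ScheduleExecutionContext; it is unused in this sketch.
    return is_weekday(datetime.date.today())


def environment_dict_fn(context):
    # `my_solid` and its `date` config key are hypothetical names.
    return {"solids": {"my_solid": {"config": {"date": datetime.date.today().isoformat()}}}}
```

Callbacks written for earlier releases with no parameters would need the extra leading parameter even if they ignore it.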
Config System Changes
In the config system, Dict has been renamed to Shape; List to Array; Optional to
Noneable; and PermissiveDict to Permissive. The motivation here is to clearly delineate
config use cases versus cases where you are using types as the inputs and outputs of solids as
well as python typing types (for mypy and friends). We believe this will be clearer to users in
addition to simplifying our own implementation and internal abstractions.
Our recommended fix is not to use Shape and Array, but instead to use our new condensed
config specification API. This allows one to use bare dictionaries instead of Shape, lists with
one member instead of Array, bare types instead of Field with a single argument, and python
primitive types (int, bool etc) instead of the dagster equivalents. These result in
dramatically less verbose config specs in most cases.
So instead of
from dagster import Shape, Field, Int, Array, String
# ... code
config=Shape({ # Dict prior to change
    'some_int': Field(Int),
    'some_list': Field(Array[String]) # List prior to change
})
one can instead write:
config={'some_int': int, 'some_list': [str]}
No imports and much simpler, cleaner syntax.
config_field is no longer a valid argument on solid, SolidDefinition, ExecutorDefinition,
executor, LoggerDefinition, logger, ResourceDefinition, resource, system_storage, and
SystemStorageDefinition. Use config instead.
For composite solids, the config_fn no longer takes a ConfigMappingContext, and the context
has been deleted. To upgrade, remove the first argument to config_fn.
Field takes an is_required rather than an is_optional argument. This is to avoid confusion
with python's typing and dagster's definition of Optional, which indicates None-ability,
rather than existence. is_optional is deprecated and will be removed in a future version.
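The distinction between existence (is_required) and None-ability (Noneable) can be sketched with a toy validator. This is an illustrative sketch only, not Dagster's implementation: check_field and its parameters are hypothetical names.

```python
def check_field(config, key, is_required, noneable):
    """Toy validator showing existence versus None-ability for one config key."""
    if key not in config:
        # A missing key is acceptable only when the field is not required.
        return not is_required
    if config[key] is None:
        # A present key holding None is acceptable only when the field is noneable.
        return noneable
    return True
```

Under this reading, a required noneable field accepts {'name': None} but rejects {}, while an optional non-noneable field accepts {} but rejects {'name': None}.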
Required Resources
All solids, types, and config functions that use a resource must explicitly list that
resource using the argument required_resource_keys. This is to enable efficient
resource management during pipeline execution, especially in a multiprocessing or
remote execution environment.
The @system_storage decorator now requires the argument required_resource_keys, which was
previously optional.
Dagster Type System Changes
dagster.Set and dagster.Tuple can no longer be used within the config system.
Dagster types are now instances of DagsterType, rather than a class that inherits from
RuntimeType. Instead of dynamically generating a class to create a custom runtime type, just
create an instance of a DagsterType. The type checking function is now an argument to the
DagsterType, rather than an abstract method that has to be implemented in
a subclass.
RuntimeType has been renamed to DagsterType and is now an encouraged API for type creation.
Core type check function of DagsterType can now return a naked bool in addition
to a TypeCheck object.
type_check_fn on DagsterType (formerly type_check and RuntimeType, respectively) now
takes a first argument context of type TypeCheckContext in addition to the second argument of
value.
define_python_dagster_type has been eliminated in favor of PythonObjectDagsterType.
dagster_type has been renamed to usable_as_dagster_type.
as_dagster_type has been removed and similar capabilities added as
make_python_type_usable_as_dagster_type.
PythonObjectDagsterType and usable_as_dagster_type no longer take a type_check argument. If
a custom type_check is needed, use DagsterType.
As a consequence of these changes, if you were previously using dagster_pyspark or
dagster_pandas and expecting Pyspark or Pandas types to work as Dagster types, e.g., in type
annotations to functions decorated with @solid to indicate that they are input or output types
for a solid, you will need to call make_python_type_usable_as_dagster_type from your code in
order to map the Python types to the Dagster types, or just use the Dagster types
(dagster_pandas.DataFrame instead of pandas.DataFrame) directly.
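The type_check_fn changes described above (a leading context argument, and a naked bool as a valid return value) can be sketched as a plain function. This is a sketch under assumptions: is_even_int is a hypothetical name, context stands in for the TypeCheckContext and is unused, and passing the function to DagsterType is omitted.

```python
def is_even_int(context, value):
    # `context` would be the TypeCheckContext; this sketch does not use it.
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    # Returning a naked bool, rather than a TypeCheck object.
    return value % 2 == 0
```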
Other
We no longer publish base Docker images. Please see the updated deployment docs for an example
Dockerfile off of which you can work.
step_metadata_fn has been removed from SolidDefinition & @solid.
SolidDefinition & @solid now take tags and enforce that values are strings or
are safely encoded as JSON. metadata is deprecated and will be removed in a future version.
resource_mapper_fn has been removed from SolidInvocation.
New
Dagit now includes a much richer execution view, with a Gantt-style visualization of step
execution and a live timeline.
Early support for Python 3.8 is now available, and Dagster/Dagit along with many of our libraries
are now tested against 3.8. Note that several of our upstream dependencies have yet to publish
wheels for 3.8 on all platforms, so running on Python 3.8 likely still involves building some
dependencies from source.
dagster/priority tags can now be used to prioritize the order of execution for the built-in
in-process and multiprocess engines.
dagster-postgres storages can now be configured with separate arguments and environment
variables, such as:
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      username: test
      password:
        env: ENV_VAR_FOR_PG_PASSWORD
      hostname: localhost
      db_name: test
Support for RunLaunchers on DagsterInstance allows for execution to be "launched" outside of
the Dagit/Dagster process. As one example, this is used by dagster-k8s to submit pipeline
execution as a Kubernetes Job.
Added support for adding tags to runs initiated from the Playground view in Dagit.
Added @monthly_schedule decorator.
Added Enum.from_python_enum helper to wrap Python enums for config. (Thanks @kdungs!)
[dagster-bash] The Dagster bash solid factory now passes along kwargs to the underlying
solid construction, and now has a single Nothing input by default to make it easier to create a
sequencing dependency. Also, logs are now buffered by default to make execution less noisy.
[dagster-aws] We've improved our EMR support substantially in this release. The
dagster_aws.emr library now provides an EmrJobRunner with various utilities for creating EMR
clusters, submitting jobs, and waiting for jobs/logs. We also now provide an
emr_pyspark_resource, which together with the new @pyspark_solid decorator makes moving
pyspark execution from your laptop to EMR as simple as changing modes.
[dagster-pandas] Added create_dagster_pandas_dataframe_type, PandasColumn, and
Constraint APIs so that users can create custom types which perform column validation,
dataframe validation, summary statistics emission, and dataframe serialization/deserialization.
[dagster-gcp] GCS is now supported for system storage, as well as being supported with the
Dask executor. (Thanks @habibutsu!) Bigquery solids have also been updated to support the new API.
Bugfix
Ensured that all implementations of RunStorage clean up pipeline run tags when a run
is deleted. Requires a storage migration, using dagster instance migrate.
The multiprocess and Celery engines now handle solid subsets correctly.
The multiprocess and Celery engines will now correctly emit skip events for steps downstream of
failures and other skips.
The @solid and @lambda_solid decorators now correctly wrap their decorated functions, in the
sense of functools.wraps.
Performance improvements in Dagit when working with runs with large configurations.
The Helm chart in dagster_k8s has been hardened against various failure modes and is now
compatible with Helm 2.
SQLite run and event log storages are more robust to concurrent use.
Improvements to error messages and to handling of user code errors in input hydration and output
materialization logic.
Fixed an issue where the Airflow scheduler could hang when attempting to load dagster-airflow
pipelines.
We now handle our SQLAlchemy connections in a more canonical way (thanks @zzztimbo!).
Fixed an issue using S3 system storage with certain custom serialization strategies.
Fixed an issue leaking orphan processes from compute logging.
Fixed an issue leaking semaphores from Dagit.
Setting the raise_error flag in execute_pipeline now actually raises user exceptions instead
of a wrapper type.
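For readers unfamiliar with what wrapping "in the sense of functools.wraps" means for the @solid and @lambda_solid fix above, the standard-library behavior can be sketched as follows. The passthrough decorator is hypothetical and is not Dagster's implementation; it only shows which attributes functools.wraps preserves.

```python
import functools


def passthrough(fn):
    """Hypothetical decorator that forwards calls unchanged."""

    @functools.wraps(fn)  # copies __name__, __doc__, etc. and sets __wrapped__
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)

    return wrapper


@passthrough
def add_one(num):
    """Return num plus one."""
    return num + 1
```

Without functools.wraps, add_one.__name__ would be 'wrapper' and the docstring would be that of the inner function.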
Documentation
Our docs have been reorganized and expanded (thanks @habibutsu, @vatervonacht, @zzztimbo). We'd
love feedback and contributions!
Thank you
Thank you to all of the community contributors to this release!! In alphabetical order: @habibutsu,
@kdungs, @vatervonacht, @zzztimbo.
Added the dagster-github library, a community contribution from @Ramshackle-Jamathon and
@k-mahoney!
dagster-celery
Simplified and improved config handling.
An engine event is now emitted when the engine fails to connect to a broker.
Bugfix
Fixes a file descriptor leak when running many concurrent dagster-graphql queries (e.g., for
backfill).
The @pyspark_solid decorator now handles inputs correctly.
The handling of solid compute functions that accept kwargs but which are decorated with explicit
input definitions has been rationalized.
Fixed race conditions in concurrent execution using SQLite event log storage, uncovered by
upstream improvements in the Python inotify library we use.
Documentation
Improved error messages when using system storages that don't fulfill executor requirements.
We are now more permissive when specifying configuration schema in order to make constructing
configuration schema more concise.
When specifying the value of scalar inputs in config, one can now specify that value directly as
the key of the input, rather than having to embed it within a value key.
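The scalar input change can be illustrated by comparing the two shapes of environment config as Python dictionaries. This is a sketch under assumptions: the solid name add_one and the input name num are hypothetical.

```python
# Previous form: the scalar is nested under a `value` key.
old_style = {"solids": {"add_one": {"inputs": {"num": {"value": 5}}}}}

# Newly permitted form: the scalar is given directly under the input name.
new_style = {"solids": {"add_one": {"inputs": {"num": 5}}}}
```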
Breaking
The implementation of SQL-based event log storages has been consolidated,
which has entailed a schema change. If you have event logs stored in a
Postgres- or SQLite-backed event log storage, and you would like to maintain
access to these logs, you should run dagster instance migrate. To check
what event log storages you are using, run dagster instance info.
Type matches on both sides of an InputMapping or OutputMapping are now enforced.
New
Dagster is now tested on Python 3.8.
Added the dagster-celery library, which implements a Celery-based engine for parallel pipeline
execution.
Added the dagster-k8s library, which includes a Helm chart for a simple Dagit installation on a
Kubernetes cluster.
Dagit
The Explore UI now allows you to render a subset of a large DAG via a new solid
query bar that accepts terms like solid_name+* and +solid_name+. When viewing
very large DAGs, nothing is displayed by default and * produces the original behavior.
Performance improvements in the Explore UI and config editor for large pipelines.
The Explore UI now includes a zoom slider that makes it easier to navigate large DAGs.
Dagit pages now render more gracefully in the presence of inconsistent run storage and event logs.
Improved handling of GraphQL errors and backend programming errors.
Minor display improvements.
dagster-aws
A default prefix is now configurable on APIs that use S3.
S3 APIs now parametrize region_name and endpoint_url.
dagster-gcp
A default prefix is now configurable on APIs that use GCS.
dagster-postgres
Performance improvements for Postgres-backed storages.
dagster-pyspark
Pyspark sessions may now be configured to be held open after pipeline execution completes, to
enable extended test cases.
dagster-spark
spark_outputs must now be specified when initializing a SparkSolidDefinition, rather than in
config.
Added new create_spark_solid helper and new spark_resource.
Improved EMR implementation.
Bugfix
Fixed an issue retrieving output values using SolidExecutionResult (e.g., in test) for
dagster-pyspark solids.
Fixes an issue when expanding composite solids in Dagit.
Better errors when solid names collide.
Config mapping in composite solids now works as expected when the composite solid has no top
level config.
Compute log filenames are now guaranteed not to exceed the POSIX limit of 255 chars.
Fixes an issue when copying and pasting solid names from Dagit.
Termination now works as expected in the multiprocessing executor.
The multiprocessing executor now executes parallel steps in the expected order.
The multiprocessing executor now correctly handles solid subsets.
Fixed a bad error condition in dagster_ssh.sftp_solid.
Fixed a bad error message giving incorrect log level suggestions.
Documentation
Minor fixes and improvements.
Thank you
Thank you to all of the community contributors to this release!! In alphabetical order: @cclauss,
@deem0n, @irabinovitch, @pseudoPixels, @Ramshackle-Jamathon, @rparrapy, @yamrzou.
The selector argument to PipelineDefinition has been removed. This API made it possible to
construct a PipelineDefinition in an invalid state. Use PipelineDefinition.build_sub_pipeline
instead.
New
Added the dagster_prometheus library, which exposes a basic Prometheus resource.
Dagster Airflow DAGs may now use GCS instead of S3 for storage.
Expanded interface for schedule management in Dagit.
Dagit
Performance improvements when loading, displaying, and editing config for large pipelines.
Smooth scrolling zoom in the explore tab replaces the previous two-step zoom.
No longer depends on internet fonts to run, allowing fully offline dev.
Typeahead behavior in search has improved.
Invocations of composite solids remain visible in the sidebar when the solid is expanded.
The config schema panel now appears when the config editor is first opened.
Interface now includes hints for autocompletion in the config editor.
Improved display of solid inputs and outputs in the explore tab.
Provides visual feedback while filter results are loading.
Better handling of pipelines that aren't present in the currently loaded repo.
Bugfix
Dagster Airflow DAGs previously could crash while handling Python errors in DAG logic.
Step failures when running Dagster Airflow DAGs were previously not being surfaced as task
failures in Airflow.
Dagit could previously get into an invalid state when switching pipelines in the context of a
solid subselection.
frozenlist and frozendict now pass Dagster's parameter type checks for list and dict.
The GraphQL playground in Dagit is now working again.
Nits
Dagit now prints its pid when it loads.
Third-party dependencies have been relaxed to reduce the risk of version conflicts.