About durable steps

Steps are the building blocks of a Taskurai command. They are an optional extension on top of the built-in durability of tasks, allowing you to structure and orchestrate work in a more fine-grained way. They can be used to perform sub tasks, orchestrate stateful coordination in microservices, run inline code in a durable way, sleep for a (long) period of time, and wait for external events. Steps automatically persist their state and can recover after failure.

Steps enhance commands with additional fault-tolerance and orchestration capabilities:

Commands that are difficult to design as idempotent;
Long running processes that benefit from being kept in memory but also must be able to recover from failures;
Complex workflows, such as approval processes, order processing, and data processing pipelines, that can run for months or years.

Depending on the requirements, commands can be designed to be kept in memory or to be suspended while waiting. For example, long running commands that benefit from being kept in memory are supported.

Commands can be designed as pure orchestration commands, but the framework does not require this: a command can mix orchestration and business logic. There is no strict separation between orchestration and actions, so the programmer can design the command as needed.

Durable steps

The following steps are available in Taskurai:

Sleep: Durable sleep for a specified amount of time.
Call tasks: Durably call one or more tasks and wait for the tasks to be completed.
Create tasks: Durably create tasks.
Run inline: Run inline code in a durable way.
Wait for external events: Wait for one or more external events to occur or tasks to be completed.

Task patterns

As a bonus, all patterns that are possible on tasks are also available when designing sub tasks:

Scheduled start: Start a task at a specific time.
Not start after: Do not start a task after a specific time.
Max duration: Set a maximum duration for a task.
Custom retry policy: Set a custom retry policy for a task.

Command catalogue

Any command in your catalogue can be used as a sub task of an orchestrating command:

Simple commands without any steps or durable state;
Commands with manual state management;
Commands without sub tasks, but with durable milestones using steps;
Complete orchestrations or workflows with sub tasks, which can act as child workflows.

Application patterns

The following application patterns are possible:

Command chaining: Commands can call other commands, creating a chain of commands.
Fan out/fan in: Commands can create multiple tasks, wait for all tasks to complete and continue.
Async HTTP APIs: Commands can call external APIs, wait for the response and continue. Tasks that initiate commands keep track of the command's progress, so the end-user process can follow it.
Monitor: Commands can monitor other processes, wait for events to occur, take actions based on the events and send events themselves.
Human interaction: Commands can wait for one or more external events, like human interactions, wait for approval, wait for input, etc.
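
The fan out/fan in pattern from the list above can be sketched in plain Python with asyncio. This is a conceptual illustration only, not the Taskurai API: process_item and orchestrate are hypothetical names, and in a real command each item would presumably be a sub task rather than a local coroutine.

```python
import asyncio


async def process_item(item: int) -> int:
    # Stand-in for a sub task; a real sub task would run in its own worker.
    await asyncio.sleep(0)
    return item * item


async def orchestrate(items: list[int]) -> int:
    # Fan out: start one unit of work per item.
    pending = [process_item(item) for item in items]
    # Fan in: wait for all of them; results keep the input order.
    results = await asyncio.gather(*pending)
    return sum(results)


total = asyncio.run(orchestrate([1, 2, 3, 4]))
```

The orchestrating code only continues once every unit of work has finished, which is the essence of the pattern.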

Step identifier

Each step should have a unique step id within the command. The step id is used to identify the step and to store the step state, and should be a string that is easy to read and understand.
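
When steps are created in a loop, each iteration needs its own id. The following is a minimal sketch of deriving readable, unique ids from stable business keys. Both make_step_id and assert_unique are hypothetical helpers for illustration and are not part of Taskurai.

```python
def make_step_id(action: str, key: str) -> str:
    # Build a readable id from a fixed action name and a stable business key.
    # Random values or timestamps are avoided on purpose: since the id is used
    # to look up the stored step state, it should be the same on every re-run.
    return f"{action}-{key}"


def assert_unique(step_ids: list[str]) -> None:
    # Fail fast if two steps in the same command would share an id.
    duplicates = {s for s in step_ids if step_ids.count(s) > 1}
    if duplicates:
        raise ValueError(f"duplicate step ids: {sorted(duplicates)}")


order_lines = ["line-1", "line-2", "line-3"]
step_ids = [make_step_id("reserve-stock", line) for line in order_lines]
assert_unique(step_ids)
```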

Milestones in long running processes

Steps can act as milestones in a long running command, for example to store the results of intermediate, resource-intensive calculations or to avoid repeating calls that count against rate limits.

When a command is restarted for any reason, the succeeded steps are restored automatically, and the command can continue from the last succeeded step.
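
The restore behaviour can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Taskurai's implementation: run_step, step_store and the step ids are hypothetical, and a dictionary stands in for the durable store.

```python
from typing import Any, Callable

# In-memory stand-in for the persisted step state; a real store is durable.
step_store: dict[str, Any] = {}
executions: list[str] = []


def run_step(step_id: str, work: Callable[[], Any]) -> Any:
    # A step that already succeeded is restored instead of executed again.
    if step_id in step_store:
        return step_store[step_id]
    executions.append(step_id)
    result = work()
    step_store[step_id] = result  # persist only after success
    return result


def command(fail_after_first: bool) -> int:
    subtotal = run_step("expensive-calculation", lambda: sum(range(1000)))
    if fail_after_first:
        raise RuntimeError("simulated crash between steps")
    return run_step("apply-discount", lambda: subtotal - 500)


try:
    command(fail_after_first=True)
except RuntimeError:
    pass  # the task would be retried here

final = command(fail_after_first=False)
```

On the second run the expensive calculation is not executed again; its stored result is reused and only the remaining step runs.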

Step style

Each step can opt for one of the following styles:

Suspend: Except for the RunInline step, all steps can be designed as suspend style steps. When a step is waiting for tasks to be completed or external events to occur, the step is suspended and resumed when the tasks are completed or the external event occurs. This kind of step style is very suited for long running business orchestrations and workflows.
WaitUntil: All steps can be configured to run and wait until tasks or events are received. If no error occurs, the process is kept in memory, supporting processes that benefit in terms of overall performance (e.g. avoiding the overhead of starting a new process, memory cache, ...). This kind of step style is suited for resource-intensive processes that benefit from a continuously running approach. Please note that keeping the process in memory is best effort: for example, when tasks or steps fail and are retried, or during long periods of inactivity (like waiting for an approval), the process may be suspended and resumed.

Stateful steps

Steps, just like tasks, are automatically persisted and can recover after a resume or restart.

The following data is persisted:

Step initialization data;
Step progress;
Step output data.

All data that is persisted in a step should be serializable.

info

Just like Tasks, Steps are not designed to store large amounts of data or binary data. Large amounts of data or binary content should be stored in Taskurai's built-in state stores or in a storage location of choice. The step initialization and output data can reference data stored in a state store.
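
A minimal sketch of the reference approach, assuming a simple key-value store. The names state_store, save_large_payload and load_large_payload are hypothetical and only model the idea; they do not reflect the API of Taskurai's state stores.

```python
import json

# Hypothetical stand-in for a state store keyed by string.
state_store: dict[str, bytes] = {}


def save_large_payload(key: str, payload: bytes) -> dict:
    # Store the bulky content outside the step and return a small,
    # JSON-serializable reference that is suitable as step output.
    state_store[key] = payload
    return {"state_key": key, "size_bytes": len(payload)}


def load_large_payload(reference: dict) -> bytes:
    # Resolve the reference back to the stored content.
    return state_store[reference["state_key"]]


report = b"x" * 1_000_000
step_output = save_large_payload("reports/2024-q1", report)
serialized = json.dumps(step_output)  # raises TypeError if not serializable
```

The step output stays a few dozen bytes regardless of how large the referenced content is.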

Step lifecycle and retry policy

By design, commands can (and will) run multiple times to run orchestrations or handle task and step retries. Code should be designed to be idempotent, meaning that the command can be run multiple times without side effects.

The overall retry policy is part of the task execution options. The task containing steps will be retried according to the default retry policy or a custom task retry policy.

The following step lifecycle is used:

Step initialization: When used, initialization code is executed only once, to guarantee Deterministic initialization;
Running steps:
- Sleep is a durable step with an exact duration. When the duration is reached, the step is automatically completed and will not be retried.
- Call tasks can configure a retry policy on the tasks created. The retry is handled by the sub task, the invoking step is just waiting for the task to be completed (Succeeded or Failed). All tasks must be created successfully before any task can start. The step is considered completed when all sub tasks succeed or have exhausted all retries.
- Create tasks only creates tasks, but does not wait for the tasks to be completed. The step is considered completed when the tasks are successfully created. When one or more tasks fail to create, no tasks are created or started.
- Inline run steps have a default retry policy and can be retried on failure. Depending on the delay, determined by the retry policy, the step will be retried in the same task run or the command is suspended and resumed (see InlineStepRetryDelayThresholdSec). The step is considered completed when the step succeeds or has exhausted all retries.
- Wait for external events steps are considered complete when all events are received.
Step progress: Only available when the step is running in Suspend style.
- Progress is reported when any sub task is completed or when events are received. Note that progress is reported in intervals, so multiple progress events can be reported at once.
- When used, progress is reported each time the step is called, even if the step is completed.
Maximum duration:
- Steps: Some steps can be configured with a maximum duration. When the step is not completed within the maximum duration, the step fails and an exception is thrown.
- Tasks: Tasks can be configured with a maximum duration. The maximum duration can be used to limit the complete orchestration of steps and sub tasks.
Completed steps:
- Succeeded steps: When a step is completed successfully, results are persisted, the step will not be retried or executed again even if the task is retried.
- Failed steps: When a step fails and has exhausted all retries, the step is considered failed. The step will not be retried or executed again even if the task is retried. The step will throw a StepFailedException, StepCanceledException or StepTimeoutException exception. When this exception is not handled, the task will fail in a Fatal state and will not be retried.
Step results: Step results are persisted, the step will behave the same way when the task is resumed or restarted:
- Succeeded steps: The step result is returned.
- Failed steps: Exceptions are thrown.
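
The retry decision described for inline run steps can be sketched as a small function. This is an assumption-laden illustration: the text only names the InlineStepRetryDelayThresholdSec setting, so the comparison used here (delays up to the threshold are retried in the same run, longer delays suspend the command) and all names in the code are hypothetical.

```python
from enum import Enum


class RetryAction(Enum):
    RETRY_IN_SAME_RUN = "retry_in_same_run"
    SUSPEND_AND_RESUME = "suspend_and_resume"
    FAIL = "fail"


def next_action(attempt: int, max_attempts: int, delay_sec: float,
                threshold_sec: float) -> RetryAction:
    # All retries used up: the step is considered failed.
    if attempt >= max_attempts:
        return RetryAction.FAIL
    # Assumption: short delays are waited out in the running process,
    # longer delays suspend the command until the retry is due.
    if delay_sec <= threshold_sec:
        return RetryAction.RETRY_IN_SAME_RUN
    return RetryAction.SUSPEND_AND_RESUME


short_delay = next_action(attempt=1, max_attempts=3, delay_sec=2, threshold_sec=30)
long_delay = next_action(attempt=1, max_attempts=3, delay_sec=120, threshold_sec=30)
exhausted = next_action(attempt=3, max_attempts=3, delay_sec=2, threshold_sec=30)
```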

All steps and commands should respect the CancellationToken passed in the context. The cancellation token is used to cancel the step when the task is canceled. The cancellation token should be passed to all async calls and should be used to cancel long running operations. When a step or command fails to stop, then the worker will be marked as unhealthy and will be restarted.
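
Cooperative cancellation can be sketched as follows. In this Python illustration a threading.Event plays the role of the CancellationToken; incoming_batches and process are hypothetical names, and the cancellation is simulated from inside the batch source.

```python
import threading
from typing import Iterator


def incoming_batches(cancel: threading.Event) -> Iterator[list[int]]:
    yield [1, 2]
    yield [3, 4]
    cancel.set()  # simulate the task being canceled from outside
    yield [5, 6]


def process(batches: Iterator[list[int]], cancel: threading.Event) -> int:
    processed = 0
    for batch in batches:
        # Check between units of work so the loop stops promptly.
        if cancel.is_set():
            break
        processed += len(batch)
    return processed


cancel = threading.Event()
processed = process(incoming_batches(cancel), cancel)
```

The third batch is never processed because the loop checks the cancellation signal before starting each unit of work.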

Deterministic initialization

While a succeeded step will never be called again, other steps that are part of the orchestration or are retried can be called multiple times. This is due to the design of the orchestration, or due to failures, scale downs, upgrades, etc.

It is important, when initializing data or calling APIs, services or methods, that the outcome is deterministic. Each run should result in the same output values, without having side effects.

Most steps support a deterministic initialization phase, where the step is initialized only the first time the step is executed. When the task is restarted or resumed, the same input values are used to initialize the step. This makes the step deterministic: it can be resumed multiple times with the same input values, and the same output values are returned.

All calls that are difficult to design as idempotent or that do not return the same deterministic result should be wrapped in durable steps.
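
The idea of initializing once and reusing the persisted value can be modelled in a few lines. This is a conceptual sketch, not Taskurai's API: initialize_once and init_store are hypothetical, and a dictionary stands in for the persisted initialization data.

```python
import uuid
from typing import Any, Callable

init_store: dict[str, Any] = {}  # stand-in for persisted initialization data


def initialize_once(step_id: str, init: Callable[[], Any]) -> Any:
    # Run the initialization code only the first time; later runs of the
    # command get the value that was persisted on the first run.
    if step_id not in init_store:
        init_store[step_id] = init()
    return init_store[step_id]


def command() -> str:
    # uuid4() differs on every call, so it is captured during initialization.
    return initialize_once("create-invoice", lambda: str(uuid.uuid4()))


first_run = command()
second_run = command()  # e.g. after a resume or restart
```

Both runs see the same identifier, even though the underlying call is not deterministic.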

Versioning durable commands

Versioning commands enables you to introduce new logic for new future tasks, without affecting running tasks.

However, when there are issues with long running commands, like orchestrations or workflows, it may be necessary to introduce new logic into running tasks. Versioning allows the programmer to introduce specific conversions for running tasks, such as introducing new steps, correcting step data or ignoring old steps.

There are two situations:

Tasks are resumed or failed and are retried: New logic can be introduced in the same versioned command.
Tasks keep running (step style WaitUntil): Unless a failure occurs, the command will keep running and the new logic will not be introduced.

Running tasks are never upgraded to new versions of a command; instead, the existing version should be patched with new logic.

It is possible to introduce new logic into running tasks, for example to add functionality or solve bugs:

In most cases, new steps can be added safely, even between steps that have already succeeded. On each resume of the task, all the code in the command is re-run and steps are re-evaluated.
It is possible to use the StepClient in controllers derived from WorkController to get, list or remove steps.

Existing steps that are already initialized and whose step id is kept unchanged will not be initialized again. Steps that have succeeded and whose step id is kept unchanged will not be retried.
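
Patching a command with an extra step can be modelled as follows. This is a sketch under assumptions: run_step, the dictionary store and all step ids are hypothetical, and the model only shows why steps with unchanged ids are skipped while a step with a new id runs.

```python
from typing import Any, Callable

step_store: dict[str, Any] = {}  # stand-in for persisted step results
executed: list[str] = []


def run_step(step_id: str, work: Callable[[], Any]) -> Any:
    # Steps that already succeeded under this id are not executed again.
    if step_id in step_store:
        return step_store[step_id]
    executed.append(step_id)
    step_store[step_id] = work()
    return step_store[step_id]


def command_original() -> None:
    run_step("validate-order", lambda: "ok")
    run_step("charge-payment", lambda: "paid")


def command_patched() -> None:
    run_step("validate-order", lambda: "ok")
    run_step("check-fraud", lambda: "clear")  # new step with a new id
    run_step("charge-payment", lambda: "paid")


command_original()   # the task ran with the original logic
executed.clear()
command_patched()    # the task resumes with the patched logic
```

After the resume, only the newly added step has executed; the two existing steps were restored from their stored results.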

Be aware that steps may have persisted initialization data to keep the command deterministic. It is possible to retrieve the initialization data of an existing step by retrieving the step using the TaskuraiStepsClient and reading the step's arguments.

Step exceptions

Steps can throw exceptions when steps are unable to be called (argument mismatch, configuration problems, timeouts, ...) or when steps fail to execute successfully.

A step will only throw a failed completion exception when all retry options are exhausted.

Runtime exceptions

The following exceptions are thrown when the configuration of the step is incorrect or when a runtime failure occurs:

StepBadRequestException: The step is called with invalid arguments.
StepInvocationException: The step is unable to be called (service unavailable, etc.).
StepPersistingException: The step failed to persist the step state.

The impacted steps will be retried according to the retry policy of the task. While some exceptions are caused by transient errors, others are caused by configuration problems (bad parameters, non existing commands, etc.). Configuration problems should be fixed by updating workers or fixing tasks.

Step exceptions

The following exceptions are step lifetime exceptions and can be used to handle step failures and provide compensation logic:

StepFailedException: The step failed to execute successfully (after all retry options are exhausted).
StepCanceledException: The step is canceled.
StepTimeoutException: The step is unable to complete within the maximum duration.

warning

Exceptions should only be caught selectively; a catch-all is not allowed. All other exceptions should be considered part of the Taskurai runtime and should be handled by the Taskurai runtime.

Handling compensation when a step fails

Compensation logic can be provided in two ways:

Normal flow control: your code detects unexpected return results and acts accordingly.
Catching exceptions: catch step exceptions to introduce compensation logic.

The following exceptions can be monitored to insert compensation logic:

StepFailedException: The step failed to execute successfully (after all retry options are exhausted).
StepCanceledException: The step is canceled.
StepTimeoutException: The step is unable to complete within the maximum duration.
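
Selective catching with compensation can be sketched as follows. The exception classes below are local Python stand-ins that only mirror the names used in the text; they are not the Taskurai types, and book_flight, cancel_flight and book_trip are hypothetical.

```python
from typing import Callable


class StepFailedException(Exception):
    """Local stand-in that mirrors the exception name used in the text."""


class StepTimeoutException(Exception):
    """Local stand-in that mirrors the exception name used in the text."""


log: list[str] = []


def book_flight() -> str:
    log.append("flight booked")
    return "FL-1"


def cancel_flight(booking: str) -> None:
    log.append(f"flight {booking} canceled")


def reserve_hotel_failing() -> str:
    raise StepFailedException("hotel step exhausted all retries")


def book_trip(reserve_hotel: Callable[[], str]) -> str:
    flight = book_flight()
    try:
        reserve_hotel()
    except (StepFailedException, StepTimeoutException):
        # Catch only the step exceptions we can compensate for;
        # anything else propagates untouched.
        cancel_flight(flight)
        return "compensated"
    return "booked"


outcome = book_trip(reserve_hotel_failing)
```

Because the except clause names specific exception types, any other exception raised inside the step still propagates to the caller.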
