Exceptions happen. Workflow Services, implemented using .NET 4 and hosted in the Windows Application Server Extensions (code-named "Dublin"), offer three options for dealing with exceptions: they can immediately stop running, shut down gracefully, or roll back to a last-known good state. This article goes beyond the structured exception handling offered by the TryCatch activity to demonstrate what can be achieved by a combination of workflow activities, workflow design, and Dublin configuration.

What’s Reliable?
Reliability entails handling or recovering from exceptions generated within the workflow logic as well as faults generated by interactions with systems external to the workflow. When the exception is caused by the internal logic, it might be acceptable to simply discard any work performed thus far and return to a last-known good state on the assumption that retrying the operation may result in success. However, with external systems you will typically want to perform some form of cleanup of the work done in order to return the entire system to its previous good state. For example, a graceful cleanup may entail deleting inserted records in a database, making calls to external services that "undo" completed work, or otherwise restoring the external system to the state it was in before the action occurred (e.g., if a file was created, delete it).

Upon encountering an exception, the workflow service host runtime can treat the error as a fatal exception and simply quit the instance, resulting in a terminated workflow. If the error might be recoverable (e.g., by retrying the operation), the in-memory instance of the workflow can be discarded, and the workflow is considered aborted. The runtime will automatically resume execution of the aborted workflow from the point at which it was last persisted. Alternatively, an aborted workflow instance can be suspended, remaining persisted until a user explicitly resumes the instance through the IIS Manager. Sometimes the error is not recoverable, but additional logic is required to return the system to a consistent state; a canceled workflow provides the opportunity to execute this cleanup logic. On top of these options, workflow also supports the notion of compensation, which allows one to define best-effort “undo” logic that runs automatically when the workflow encounters an unhandled exception.

Configuring Reliable Workflows in Dublin
Through the IIS Manager, a running workflow service instance that is configured for persistence in Dublin can be suspended, terminated, or canceled. One way to access these options within the IIS Manager is to select an application, double-click the Dashboard feature, and then click the Persisted WF Instances feature hyperlink. Right clicking an item in the listing for a running workflow instance will produce a menu similar to that shown in Figure 1. By selecting terminate, suspend, or cancel, the user is able to force the workflow instance to take the selected action.

Dublin can be configured to automatically take one of these actions for any workflow instance that throws an unhandled exception. Configuring Dublin to react in this way is the cornerstone of reliable workflow services and is performed while enabling an application for persistence. Within IIS Manager, right click an application and choose .NET 4 WCF and WF, Configuration, then select the Workflow Persistence tab. On that tab, check the Enable Application Persistence check box to enable the Advanced… button. Clicking this button displays the dialog shown in Figure 2. Take notice of the Action on unhandled exception drop-down, as this will play a critical role in reliability.

Figure 3 summarizes the states (cancel, terminate, or abort) that a workflow instance can enter in response to some form of error condition. There are three triggers: a user performing an action through IIS Manager as described previously (UI), an exception that has gone unhandled and bubbled all the way to the top (unhandled exception), or a particular activity. Cancellation allows cleanup logic to be defined, whereas an aborted workflow allows the instance to be resumed from a last-known good checkpoint. A canceled workflow also allows for logic to undo work previously done in the face of an unhandled exception via a process known as implicit compensation. A terminated workflow is one that halts execution immediately and does not allow for cleanup or returning to a previously known state. Let's consider the ramifications to workflow service reliability from entering each of these states.

Aborting Workflows
A workflow can be aborted either by aborting it through IIS Manager or by configuring the Advanced Persistence Settings to abandon the instance on unhandled exceptions, as described previously. Note that in Dublin the terminology is abandon, though .NET 4 Workflow uses the term abort. Abort effectively discards any state changes to the workflow instance that have occurred in memory and resets its current state to what is stored in the persistence database. Therefore, creating a reliable workflow service using abort amounts to establishing checkpoints that update the persistence store. This can be done most easily by placing Persist activities at the desired locations, but can also be enabled on SendReply activities by setting their PersistBeforeSend property to true. Figure 4 shows a sample workflow that puts a persistence “checkpoint” between the receive operations. In calling DoMoreWork, an exception will be thrown which will remain unhandled. The Action on unhandled exception in this case is set to Abandon. The runtime will act as if DoMoreWork was never called, and re-schedule it so it can be called again.

Consider that instead of using Throw, we have an activity that calls out to another service and that service is simply unreachable because the network is down. Abort provides a simple way to retry that call by invoking DoMoreWork again, once network connectivity is restored. If we had instead chosen Abandon and suspend as the Action on unhandled exception, the persisted workflow instance would have to be manually resumed via IIS Manager before DoMoreWork could be called again. This extra step can be useful to exert more precise control over when workflows resume their execution.
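
The checkpoint-and-retry behavior can be illustrated outside of WF. The following Python sketch is a toy model under stated assumptions, not the .NET 4 or Dublin API: the Instance class, its persist and abort methods, and the network flag are hypothetical names invented for illustration. It shows an instance that loses its in-memory changes on abort and picks up again at the step after the last checkpoint.

```python
import copy


class Instance:
    """Toy workflow instance (hypothetical model, not the WF runtime)."""

    def __init__(self, steps):
        self.steps = steps        # ordered list of callables taking the instance
        self.position = 0         # index of the next step to run
        self.state = {}           # in-memory workflow variables
        self._saved = (0, {})     # stands in for the persistence database

    def persist(self):
        # Checkpoint: record the step after this one plus a copy of the state.
        self._saved = (self.position + 1, copy.deepcopy(self.state))

    def abort(self):
        # Discard in-memory changes and reload the last checkpoint.
        position, state = self._saved
        self.position, self.state = position, copy.deepcopy(state)

    def run(self):
        while self.position < len(self.steps):
            try:
                self.steps[self.position](self)
            except Exception:
                self.abort()
                return "aborted"
            self.position += 1
        return "completed"


network = {"up": False}  # hypothetical stand-in for an external dependency


def do_work(inst):
    inst.state["work"] = "done"


def checkpoint(inst):
    inst.persist()


def do_more_work(inst):
    inst.state["attempts"] = inst.state.get("attempts", 0) + 1
    if not network["up"]:
        raise ConnectionError("service unreachable")
    inst.state["more_work"] = "done"


inst = Instance([do_work, checkpoint, do_more_work])
first = inst.run()       # DoMoreWork fails; state rolls back to the checkpoint
network["up"] = True
second = inst.run()      # resumes at DoMoreWork and succeeds
```

In this model the attempts counter ends at 1 rather than 2, because the increment made during the failed call was discarded along with the rest of the in-memory state.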

Cancelling Workflows
A workflow can be canceled in two ways: through IIS Manager or in response to an unhandled exception. In addition, cancellation occurs for activities such as the Pick—when one PickBranch executes, the other branches that are not executed are canceled. Only activities that are currently executing or scheduled can be canceled; in other words, an activity that has already completed cannot be canceled. In order to specify the logic that should be executed when canceled, you need to use a CancellationScope activity. Figure 5 provides an example service that wraps a sequence of Receive activities in separate CancellationScopes.

If we execute the Test1 operation, a new instance of the workflow will be launched and CancellationScope1 will execute the Receive in its body. When you locate the workflow in the Persisted Workflows in IIS Manager and cancel it, only the Cancel Test2 activity will execute because the runtime will only run the CancellationHandlers for CancellationScope activities that have not completed. Similarly, if we were to set the workflow's Action on unhandled exception to Cancel, and an unhandled exception were thrown while executing the body of CancellationScope2, its CancellationHandler would execute. In practice, the CancellationHandler is the location where you can define cleanup logic to run in response to an exception.

It’s important to contrast this to abort: Once a workflow has been canceled, it is finished executing and cannot be resumed.
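
The rule that only unfinished scopes run their handlers can be modeled in a few lines. This Python sketch is a simplification and not the WF API: the Scope class and the cancel function are hypothetical, and the status values are assumptions of the model. It mirrors the scenario above, in which Test1 has already been received and the workflow is waiting on Test2.

```python
class Scope:
    """Toy stand-in for a CancellationScope (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.status = "pending"   # "pending", "executing", "completed", "canceled"


def cancel(scopes):
    """Return the names of the cancellation handlers that would run."""
    ran = []
    for scope in scopes:
        # Completed scopes are skipped: there is nothing left to cancel.
        if scope.status == "executing":
            ran.append("Cancel " + scope.name)
            scope.status = "canceled"
    return ran


scope1, scope2 = Scope("Test1"), Scope("Test2")
scope1.status = "completed"   # Test1 was received, so its scope finished
scope2.status = "executing"   # the workflow is now waiting on Test2
handlers = cancel([scope1, scope2])
```

Here handlers contains only the entry for Test2, and a second call to cancel returns an empty list because nothing is left executing.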

Compensation
As described previously, compensation allows you to define the best-effort undo of any completed work. Compensation is defined for a workflow by adding the work-performing activities to the body of the CompensableActivity (Figure 6). Similarly, the activities that undo the work, in the exception case, are added to the CompensationHandler. In some situations, a two-phase commit is required (one that creates the initial change and another that commits the actions as final at a later point in time). In this case, the first phase of changes is performed in the body, and upon successful execution, the commit phase defined in the ConfirmationHandler will execute.

There are two ways that compensation/confirmation can be triggered in workflow services, known as implicit and explicit compensation. When the Action on unhandled exception is Cancel or the user selects to cancel an instance through the IIS Manager, implicit compensation will execute the CompensationHandler for any compensable activities that have completed successfully. Those that are still executing will have their CancellationHandler executed. If the workflow runs to a successful completion, the ConfirmationHandler of each CompensableActivity will be executed. Implicit compensation will always execute the compensation, confirmation, or CancellationHandler of each CompensableActivity in the reverse order of completion in the containing workflow.
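
The ordering rules for implicit compensation can be made concrete with a small model. The Python sketch below is not the WF runtime: the Workflow class and the activity names A, B, and C are hypothetical, and running cancellation handlers before compensation handlers is an assumption of this sketch rather than something stated above.

```python
class Workflow:
    """Toy model of implicit compensation (hypothetical, not the WF runtime)."""

    def __init__(self):
        self.executing = []   # compensable activities whose body is running
        self.completed = []   # compensable activities, in order of completion
        self.log = []         # (handler, activity) pairs in the order they ran

    def start(self, name):
        self.executing.append(name)

    def complete(self, name):
        self.executing.remove(name)
        self.completed.append(name)

    def cancel(self):
        # Assumption of this sketch: cancellation handlers run first.
        for name in reversed(self.executing):
            self.log.append(("cancel", name))
        # Completed work is compensated in reverse order of completion.
        for name in reversed(self.completed):
            self.log.append(("compensate", name))
        self.executing, self.completed = [], []

    def finish(self):
        # Successful completion confirms instead of compensating.
        for name in reversed(self.completed):
            self.log.append(("confirm", name))
        self.completed = []


failed = Workflow()
for name in ("A", "B"):
    failed.start(name)
    failed.complete(name)
failed.start("C")         # C is still executing when the workflow is canceled
failed.cancel()

succeeded = Workflow()
for name in ("A", "B"):
    succeeded.start(name)
    succeeded.complete(name)
succeeded.finish()
```

In the failed run, C is canceled and then B and A are compensated, newest first. In the successful run, no compensation occurs and B and A are confirmed.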

Figure 7 shows compensation in the context of a workflow service. Calling DoWork will execute the body of the CompensableActivity, which is defined to complete successfully and without error. If this were the end of the workflow definition, then the workflow would finish and the confirmation handler of the CompensableActivity in DoWork’s implementation would be called. In order to demonstrate implicit compensation, we throw an exception in response to a DoMoreWork invocation. This error goes unhandled and will cause the workflow’s state to be changed according to the Action on unhandled exception option, which must be set to Cancel to trigger the implicit compensation. Because CompensableActivity completed its execution successfully, it is a candidate for compensation and will have its CompensationHandler invoked.

Explicit compensation makes use of activities to define the execution of compensation and confirmation handlers. Specifically, the Compensate activity names, by means of a CompensationToken, a successfully completed CompensableActivity to compensate. The Confirm activity likewise causes the execution of the ConfirmationHandler.

The application for performing undo work within the CompensationHandler is fairly apparent if you’re manipulating a database or working with the file system. Where it adds to the reliability of services is in its ability to also enable calling out to other services and invoking their clean-up operations. For example, in an inventory system, the original body might call ReserveItem to remove an item from inventory, but in the CompensationHandler, RestoreItem might be invoked, which adds the unused item back.
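
As a rough illustration of the inventory example, the following Python sketch pairs a body with its undo. It is a plain-function analogy, not workflow code: ReserveItem and RestoreItem come from the example above, while the inventory dictionary, place_order, and the charge_card step are hypothetical additions.

```python
inventory = {"widget": 3}   # hypothetical stock levels


def reserve_item(sku):
    """Body: take one item out of inventory."""
    if inventory[sku] == 0:
        raise ValueError("out of stock")
    inventory[sku] -= 1


def restore_item(sku):
    """Compensation: put the unused item back."""
    inventory[sku] += 1


def place_order(sku, charge_card):
    reserve_item(sku)           # completed work that may need undoing
    try:
        charge_card()           # hypothetical later step that may fail
    except Exception:
        restore_item(sku)       # best-effort undo of the reservation
        return "compensated"
    return "completed"


def declined():
    raise RuntimeError("card declined")


result = place_order("widget", declined)
```

After the failed order the stock level is back at 3, whereas an order whose later step succeeds leaves it at 2.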

Cancellation and explicit compensation can work together to deliver a reliable solution. For example, suppose you place a Compensate activity in the Catch of a TryCatch activity to selectively undo the work performed, and in the same Catch add a Rethrow activity. The rethrown exception will then go unhandled. This will cause the remainder of the executing activities in the workflow to be canceled, and any successfully completed CompensableActivity will be implicitly compensated.
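
The interplay of a selective undo followed by a rethrow can be approximated with nested exception handling. This Python sketch is an analogy only: the CompensationLog class and the ChargeCard activity are hypothetical, and skipping an activity that was already explicitly compensated is an assumption of the model.

```python
class CompensationLog:
    """Toy tracker for compensable work (hypothetical, not the WF runtime)."""

    def __init__(self):
        self.completed = []       # activity names in order of completion
        self.compensated = set()  # names whose undo has already run
        self.log = []

    def complete(self, name):
        self.completed.append(name)

    def compensate(self, name):
        # Explicit compensation of one completed activity.
        if name in self.completed and name not in self.compensated:
            self.compensated.add(name)
            self.log.append("compensate " + name)

    def cancel(self):
        # Implicit compensation of whatever is left, newest first.
        for name in reversed(self.completed):
            self.compensate(name)


def run(wf):
    wf.complete("ReserveItem")
    wf.complete("ChargeCard")
    try:
        try:
            raise RuntimeError("shipping failed")
        except RuntimeError:
            wf.compensate("ReserveItem")  # selective undo inside the Catch
            raise                         # rethrow: the error goes unhandled
    except RuntimeError:
        wf.cancel()  # models Action on unhandled exception set to Cancel


wf = CompensationLog()
run(wf)
```

In this model ReserveItem is undone once, by the explicit step, and the implicit pass then compensates only ChargeCard.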

Proper Configuration Leads to Reliability
By combining the CancellationScope and CompensableActivity with the appropriate Action on unhandled exception configuration, one can build workflow services that can be made to recover, retry, or at least fail gracefully, ultimately delivering reliable workflow services.