Deadlocking / Aborting Workflows with Tracking and Persistence

UPDATE: Thanks to a very quick response from the product team on this one. It turns out the workflow is not deadlocking, but aborting. And it is not completely silent about it: you just have to be listening to the WorkflowAborted event on the runtime.
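For anyone who wants to see the abort rather than a workflow that just seems to hang, here is a minimal sketch of hooking the event before starting the runtime. The class name and the console logging are my own illustration. I am also subscribing to ServicesExceptionNotHandled on the assumption that it may surface the underlying exception from the services; I have not confirmed that it fires in this particular scenario.

```csharp
using System;
using System.Workflow.Runtime; // requires a reference to System.Workflow.Runtime.dll

// Hypothetical host class, for illustration only.
class AbortLoggingHost
{
    static void Main()
    {
        using (WorkflowRuntime runtime = new WorkflowRuntime())
        {
            // Raised when the runtime aborts a workflow instance.
            runtime.WorkflowAborted += delegate(object sender, WorkflowEventArgs e)
            {
                Console.WriteLine("Workflow {0} was aborted.",
                    e.WorkflowInstance.InstanceId);
            };

            // Assumption: an exception that a runtime service cannot handle
            // may be reported here along with the affected instance id.
            runtime.ServicesExceptionNotHandled +=
                delegate(object sender, ServicesExceptionNotHandledEventArgs e)
                {
                    Console.WriteLine("Service exception for workflow {0}: {1}",
                        e.WorkflowInstanceId, e.Exception);
                };

            runtime.StartRuntime();
            // Create and start workflow instances here.
        }
    }
}
```

The handlers need to be attached before any workflows are started, otherwise an early abort can slip by unnoticed.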

I’ve been bitten a couple of times by this one, so I thought I would record it here, as much for my own reference as for others who might get similarly stuck.

Tracking and Persistence are two of the powerful capabilities that Windows Workflow Foundation (WF) offers that work right out of the box. You can easily enable them with code like the following:

    [STAThread]
    static void Main()
    {
        using (WorkflowRuntime runtime = new WorkflowRuntime())
        {
            // Arguments: connection string, unloadOnIdle, instance ownership
            // duration, and how often to poll for workflows that are ready to load.
            SqlWorkflowPersistenceService persistence =
                new SqlWorkflowPersistenceService(persistConnString, true,
                    TimeSpan.MaxValue, TimeSpan.FromMinutes(1));
            runtime.AddService(persistence);

            // Tracking writes to its own store, identified by its own connection string.
            SqlTrackingService tracking = new SqlTrackingService(trackConnString);
            runtime.AddService(tracking);

            runtime.StartRuntime();
            …
        }
    }

Tracking allows you to capture lifecycle events of your workflows and activities to a provider (a SQL Server store with the built-in provider) for operations management, monitoring, or diagnostic purposes. You can even use tracking events to drive some of your application logic, such as knowing when particular workflows have completed.

Persistence is even more powerful. With persistence enabled, when workflows go idle (waiting on an external event or timer), the workflow runtime can persist the workflow (serialize the workflow object and the state of all of its contained activities) and put it in a persistent store (SQL Server with the built-in provider). Then when the event occurs, the runtime can load the workflow, turn it back into objects, and get it running again. This keeps you from consuming resources on the server for long-running workflows that are just waiting for something that might take hours, days, weeks, or months. It also allows you to scale out the execution of your workflows to multiple servers in a farm, because another (less busy) server can pick up the workflow when it is ready to run instead of the one that was originally running it. You can also prevent this from happening with a process lock if that makes sense for your scenario. Persistence also allows your running workflows to survive server restarts. All amazing capabilities that you get for free from the workflow runtime.

Both tracking and persistence can also be enabled through your config file.

So where is the gotcha? Tracking and Persistence work perfectly fine with one another, normally. However, you can find yourself with your workflows deadlocking indefinitely if you turn them both on, and the cure may not be very apparent unless you really understand how they are working under the covers. OK, even if you understand that, you really have to think it through to understand the cause.

First, the short answer: If you find that your workflows are hanging or deadlocking when they go idle with both persistence and tracking turned on, check to make sure your Distributed Transaction Coordinator (DTC) service is running.
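If you would rather have the host check this at startup than find out the hard way, something like the following sketch could work. It assumes the DTC is registered under its usual Windows service name, MSDTC, and it only looks at the local machine. The class name and messages are hypothetical.

```csharp
using System;
using System.ServiceProcess; // requires a reference to System.ServiceProcess.dll

// Hypothetical startup check, for illustration only.
class DtcCheck
{
    static void Main()
    {
        // "MSDTC" is the service name of the Distributed Transaction Coordinator.
        // Reading Status throws InvalidOperationException if no such service exists.
        using (ServiceController dtc = new ServiceController("MSDTC"))
        {
            if (dtc.Status != ServiceControllerStatus.Running)
            {
                Console.WriteLine(
                    "DTC is {0}. Workflows using both tracking and persistence may abort.",
                    dtc.Status);
            }
            else
            {
                Console.WriteLine("DTC is running.");
            }
        }
    }
}
```

Whether to merely warn or to refuse to start the runtime is a judgment call for your host. I would lean toward failing fast.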

The explanation: If you enable tracking and persistence with code like that shown above, both services write their data to the database as part of a transaction. Depending on whether you put them in the same database, and depending on which workflow commit batch service (whole separate topic there) you use, the writes to your tracking and persistence stores may or may not use the same connection to talk to the database (obviously, with separate databases they will not). However, they do get wrapped up into a single transaction when a persistence point is hit at the same time as a tracking point (such as when a workflow goes idle). So if the DTC is not running, the transaction cannot be promoted to a DTC transaction and the workflow ends up deadlocking. Getting the DTC service running fixes the problem.
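As an aside on the commit batch service point: when tracking and persistence live in the same database, one option is to add the SharedConnectionWorkflowCommitWorkBatchService so that both services write over a single connection, which should keep the transaction from needing promotion in the first place. The following is a sketch under that assumption, not something I have verified against the failure described here. The connection string and class name are hypothetical, and both the tracking and persistence schemas are assumed to be installed in that one database.

```csharp
using System;
using System.Workflow.Runtime;          // requires System.Workflow.Runtime.dll
using System.Workflow.Runtime.Hosting;
using System.Workflow.Runtime.Tracking;

// Hypothetical host class, for illustration only.
class SharedConnectionHost
{
    static void Main()
    {
        // Hypothetical connection string; one database holds both stores.
        string connString =
            "Data Source=localhost;Initial Catalog=WorkflowStore;Integrated Security=SSPI;";

        using (WorkflowRuntime runtime = new WorkflowRuntime())
        {
            // Lets persistence and tracking share one connection for the batch commit.
            runtime.AddService(
                new SharedConnectionWorkflowCommitWorkBatchService(connString));

            runtime.AddService(new SqlWorkflowPersistenceService(
                connString, true, TimeSpan.MaxValue, TimeSpan.FromMinutes(1)));
            runtime.AddService(new SqlTrackingService(connString));

            runtime.StartRuntime();
            // Create and start workflow instances here.
        }
    }
}
```

This does not help if the two stores are in separate databases. In that case the transaction still has to be distributed and the DTC still has to be running.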

Normally DTC should be running on any machine that runs SQL Server anyway, but I have hit this problem on more than one machine where it was not running.

The workflow runtime should really be throwing an exception or timing out after a reasonable time if it can’t promote the transaction, but instead it just silently deadlocks.

Hopefully, if you have this problem, you will find this post through the search engine of your choice and your problem will be solved.