On a recent project, we created a workflow to send an email to each of around 3000 users – the users’ email address and name came from a database table.
Our first attempt was to use the SQL component to read the users’ names and email addresses into list variables (separate lists for name and email address). Once we had the data in lists, iterating over them was a reasonably easy workflow task:
- find out the size of the list, using an XPath expression: /process_data/@count = xxx
- define an integer counter variable, /process_data/@index
- define a simple loop in the workflow, indexing each element in the lists like this: /process_data/email_list[/process_data/@index]
- loop with the condition /process_data/@index < /process_data/@count
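The loop above can be sketched in plain Python, with ordinary lists standing in for the process variables (the data values here are illustrative):

```python
# Stand-ins for the two list variables read by the SQL component.
email_list = ["a@example.com", "b@example.com", "c@example.com"]  # /process_data/email_list
name_list = ["Alice", "Bob", "Carol"]                             # /process_data/name_list

count = len(email_list)  # /process_data/@count
index = 0                # /process_data/@index

sent = []
while index < count:              # loop condition: @index < @count
    email = email_list[index]     # email_list[/process_data/@index]
    name = name_list[index]
    sent.append((name, email))    # stand-in for the "send email" step
    index += 1
```

In the real workflow, of course, each pass around this loop is a separate engine step rather than a tight in-memory iteration, which is exactly where the trouble described below comes from.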
This all worked fine during simple testing, but once we started testing with large numbers of users, we suddenly experienced a massive performance degradation – around 2-3 seconds of overhead for every single step in the workflow. We suspected the very large list variables were the culprit, but interestingly, even steps in the workflow that did not access the list variables were running slow. While this was happening, both CPU and disk were maxing out. What was going on?
A little knowledge of what goes on under the covers when a workflow runs explains the behaviour. When a long-lived workflow runs, every single step of the workflow effectively runs as a separate, asynchronous step, generally within its own transaction. This means that before each step runs, the workflow engine must establish the “process context”, loading it from the database, and after each step, the process context must be persisted back to the database. The process context basically contains all information about the process, including the values of all variables. The process context is loaded and persisted regardless of whether a particular variable is actually used by a particular step, because the process engine doesn’t know in advance whether a particular variable is going to be used or not.
So what is going on is that we have two very large Java List variables that are being loaded into memory from the database and deserialized, and then serialized and stored back into the database, on each and every step – regardless of whether the list is actually needed in that step or not. For two large lists, this accounted for the 2-3 seconds of activity on each step.
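The per-step cost is easy to reproduce in miniature. The sketch below uses Python's pickle as a stand-in for the engine's context persistence (the real engine serializes Java objects to the database, not pickles to memory, and the list sizes here are made up):

```python
import pickle
import time

# Stand-ins for two large list variables held in the process context.
emails = [f"user{i}@example.com" for i in range(100_000)]
names = [f"User {i}" for i in range(100_000)]

def run_step(context_bytes):
    """Simulate one workflow step: load the context, do trivial work, persist it."""
    context = pickle.loads(context_bytes)   # deserialize on step entry
    context["steps_run"] += 1               # the step itself never touches the lists...
    return pickle.dumps(context)            # ...but the whole context round-trips anyway

context = pickle.dumps({"emails": emails, "names": names, "steps_run": 0})
start = time.perf_counter()
for _ in range(10):
    context = run_step(context)
elapsed = time.perf_counter() - start
print(f"10 steps took {elapsed:.3f}s despite the lists never being used")
```

The point of the sketch: the serialization cost scales with the size of everything in the context, not with what a step actually uses.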
How to solve the problem?
Our first approach was to simply change the workflow to a short-lived workflow. Short-lived workflows don’t persist their data; they run more like a regular Java method, creating all variables in memory at the beginning of the process and garbage-collecting them at the end. All variables stay in memory the whole time, and the process context is available to each of the steps – no intermediate serialization or storage.
A short-lived workflow did indeed solve the problem, but introduced several new problems, related to error recovery:
- If the process failed for any reason (and it did – every now and then, the SendEmail step would fail due to a very overworked Mail Server being too busy to respond), the process threw an exception, all the data was discarded, and we had no record of how many emails had been sent. It was impossible to restart the stalled process, because short-lived processes don’t stall; they simply throw an exception and they’re done.
- We could simply restart the failed process, but then some users would get two emails – not a good look.
- The process took several hours to complete, and this exceeded the default transaction timeout on the workflow, because the engine treats short-lived workflows as transactional by default. This can be fixed by modifying the workflow’s transaction properties, but it felt “wrong”.
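The first of those failure modes can be sketched in a few lines of Python. Here send_email is a hypothetical stand-in that fails on the third recipient, and the delivered list plays the role of the mail server's record of what actually went out:

```python
delivered = []  # what the mail server actually received

def send_email(address):
    """Hypothetical mail call: fails on the third recipient."""
    if address == "c@example.com":
        raise RuntimeError("mail server too busy")
    delivered.append(address)

def short_lived_run(addresses):
    sent = []                    # progress lives only in memory
    for address in addresses:
        send_email(address)
        sent.append(address)
    return sent

addresses = ["a@example.com", "b@example.com", "c@example.com", "d@example.com"]
for attempt in range(2):         # fail once, then naively restart from the top
    try:
        short_lived_run(addresses)
    except RuntimeError:
        pass                     # the exception discards `sent` entirely
```

After the naive restart, the first two users have each received the email twice, because the only record of progress died with the exception.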
Finally, we solved the problem by completely removing the list variables, and adding a status column to the database table. The logic in our long-lived workflow was something like this:
- Using the database component, select a row using a query something like:
“select id, name, email from usertable where status is null”.
Retrieve only the first matching row into simple variables called “rowid”, “username” and “email”.
- Do stuff. In this case, this meant creating a document, and sending an email.
- Using the database component, update the previous row using a statement something like:
“update usertable set status = ’sent’ where id = ?”
and set the parameter to /process_data/@rowid
- Loop. The next time around the loop, the query will pick up the second row, because the first row no longer has a null status. Repeat until no row is returned.
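The whole select/work/update loop can be sketched with an in-memory SQLite table standing in for the real database (table and column names follow the queries above; the "order by id limit 1" is our illustration of "retrieve only the first matching row"):

```python
import sqlite3

# In-memory stand-in for the real user table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table usertable (id integer primary key, name text, email text, status text)"
)
conn.executemany(
    "insert into usertable (name, email) values (?, ?)",
    [("Alice", "a@example.com"), ("Bob", "b@example.com"), ("Carol", "c@example.com")],
)

sent = []
while True:
    # Step 1: select the first row that has not been processed yet.
    row = conn.execute(
        "select id, name, email from usertable "
        "where status is null order by id limit 1"
    ).fetchone()
    if row is None:
        break                      # no unprocessed rows left: we're done
    rowid, username, email = row   # /process_data/@rowid, @username, @email
    # Step 2: "do stuff" -- here, pretend to create the document and send the email.
    sent.append(email)
    # Step 3: mark the row done, so any restart resumes from the next row.
    conn.execute("update usertable set status = 'sent' where id = ?", (rowid,))
    conn.commit()
```

Because progress is committed to the table after every row, killing and restarting this loop at any point resumes cleanly from the first row whose status is still null.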
This approach solves the entire problem very elegantly:
- The status column ensures that you always know where you’re up to. It’s persisted in the database, so it is resilient to any type of failure.
- There are no large collections slowing down the running of the process – performance was almost an order of magnitude faster.
- If the workflow stalls at any step, you can just restart it, and things will continue where they left off.
- Even if the workflow stalls in a weird way, where restarting it doesn’t actually cause it to continue (we have seen this occasionally), you can simply start another instance of the workflow, and it will pick up the next row in the database. The only danger is that one user may get the same email twice, although since it’s usually the mail server that stalls, not the database query, this is unlikely.
- We’re using a long-lived process, not because we have a user-step in the workflow, but because we want the auditability and reliability that long-lived processes deliver.
- We believe (although we haven’t verified) that Document variables have some “smarts” that Lists and Maps don’t have. In particular, they are cached in memory between steps. It is therefore probable that a Document variable (perhaps containing the XML data from the query) would not create the loading/storing problems that we experienced with Lists. However, the XML data would still need to be parsed into an XML document each time a row was needed, and for a large data-set this could introduce delays. Overall, we believe that the above solution is superior.