Workflow Environments Guide


By Chris Dwan

Aug 15, 2005 | During our recent work with the Web services interface to iNquiry, BioTeam has gained familiarity with several graphical workflow packages for scientific computing. These tools have been gathering acceptance in bioinformatics, genomics, and general scientific computing groups from large pharmaceutical companies to single investigators.

I’ve compiled a short list of the features that I use to differentiate these offerings when selecting the one that is most appropriate for a particular user. As with many technology decisions, the choice of a workflow environment is seldom clear. Many factors must be weighed in the context of user requirements, local expertise, and required features.

The packages I’ve worked with are Taverna, a free, open-source workflow environment produced as part of the MyGrid project; InforSense and Scitegic’s Pipeline Pilot, commercial products with robust features and enterprise-level support; and Apple’s Automator. Apple has built Web services capabilities into its Tiger operating system, and Automator is one way to access these services. Packages I have not yet had time to try are TurboWorx, the Broad Institute’s GenePattern, and VIBE from Incogen.

Features I use to differentiate between offerings are:

Support for basic programmatic constructs. While graphical environments will never replace traditional interpreted or compiled programs, they should still support the full range of language constructs required to implement arbitrary algorithms. This includes conditional execution (if/else), loops (do/while), and rudimentary variables. These features are absolutely essential to developing large, complex protocols.

Multiple inputs/outputs for modules. Useful modules consume multiple input streams and produce multiple output streams.

Failure handling. Developing workflows for a complex, heterogeneous, highly connected infrastructure requires what might be called defensive programming. Errors will inevitably occur outside the purview of the developer. Workflow environments need to provide easy access to underlying error codes and messages, as well as clear notification as to which steps in a process failed and need to be recomputed. A clean way to differentiate between transient and permanent errors would be a huge plus.
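
As a rough illustration of the transient/permanent distinction, here is a minimal Python sketch. The exception classes, the run_workflow function, and the report format are all hypothetical names invented for this example; none of them come from any of the packages discussed. The sketch retries steps that fail transiently, gives up immediately on permanent failures, and records which steps would need to be recomputed.

```python
import time


class TransientError(Exception):
    """Hypothetical: a failure that may succeed on retry (e.g. a timeout)."""


class PermanentError(Exception):
    """Hypothetical: a failure that retrying cannot fix (e.g. bad input)."""


def run_workflow(steps, max_retries=3, delay=0.0):
    """Run (name, callable) steps in order and report each step's outcome."""
    report = {}
    for name, step in steps:
        for _attempt in range(max_retries):
            try:
                step()
                report[name] = "ok"
                break
            except TransientError as err:
                # Worth retrying: record the message and try again.
                report[name] = f"transient failure: {err}"
                time.sleep(delay)
            except PermanentError as err:
                # Retrying will not help: record the message and move on.
                report[name] = f"permanent failure: {err}"
                break
    return report
```

Any step whose entry in the report is not "ok" is a candidate for recomputation, and the stored message preserves the underlying error text for the developer.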

Cached results/partial reexecution. For me, at least, debugging requires running a process over and over again, working out the errors from beginning to end. The ability to selectively reexecute those portions of a workflow that have changed or depend on those changed modules helps accelerate this process.
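
One way to picture partial reexecution is a cache keyed on each step's identity plus everything upstream of it. The Python sketch below assumes a strictly linear pipeline and a hand-maintained version string per step; step_key, run_cached, and the tuple layout are hypothetical, not taken from any of the tools above. Changing one step's version invalidates that step and everything after it, while earlier results are reused.

```python
import hashlib
import json


def step_key(name, version, upstream_key):
    """Derive a cache key from a step's identity and its upstream key."""
    payload = json.dumps([name, version, upstream_key])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(steps, initial_input, cache):
    """Run (name, version, func) steps in order, reusing cached results.

    Returns the final value and the names of the steps actually executed.
    The initial input must be JSON-serializable so it can be hashed.
    """
    executed = []
    value = initial_input
    upstream = step_key("input", "", json.dumps(initial_input, sort_keys=True))
    for name, version, func in steps:
        key = step_key(name, version, upstream)
        if key in cache:
            value = cache[key]
        else:
            value = func(value)
            cache[key] = value
            executed.append(name)
        # Downstream keys depend on this key, so a change here propagates.
        upstream = key
    return value, executed
```

Passing the same dictionary as the cache across runs is what makes the second and later runs cheap; a real environment would presumably persist results to disk rather than keep them in memory.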

User interaction/steering. Some processes (particularly those relevant to a bench scientist) require interaction and decision making in the middle. While it is simple enough to create N+1 workflows for a process with N user interactions, it is better to explicitly support user choice, input, and notification without stopping and restarting the entire pipeline. A very-high-level version of this would involve publishing process status notification via an RSS feed or similar technology. Of course, this would only encourage the Blackberry crowd to check their processes more frequently than they already do.

Ease of relocation. Perhaps the best part of Web services technology is the fact that services are explicitly virtualized. In theory, this means that workflows should be entirely portable. Workflow environments should make it simple to point a particular action at a different service provider. If I publish a workflow that points at a set of services on my cluster/database/grid then remote users should be able to redirect each call to their local resource with minimal effort.
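
The idea can be sketched in Python by keeping logical service names separate from the endpoints they resolve to. Everything here is an assumption made for illustration: the dictionary layout, the relocate function, and the example.org and example.net URLs are placeholders, not the file format of any real workflow package.

```python
def relocate(workflow, overrides):
    """Return a copy of the workflow with selected service endpoints replaced."""
    unknown = set(overrides) - set(workflow["services"])
    if unknown:
        # Refuse to silently add services the workflow never calls.
        raise KeyError(f"unknown services: {sorted(unknown)}")
    relocated = dict(workflow)
    relocated["services"] = {**workflow["services"], **overrides}
    return relocated


# Hypothetical published workflow pointing at the author's own resources.
published = {
    "name": "annotate-sequences",
    "services": {
        "blast": "http://cluster.example.org/services/blast?wsdl",
        "lookup": "http://db.example.org/services/lookup?wsdl",
    },
}

# A remote user redirects one call to a local resource; the rest is untouched.
local = relocate(published, {"blast": "http://localhost.example.net/blast?wsdl"})
```

Because the steps refer only to the logical names, redirecting a call is a one-line change rather than an edit to every step that uses the service.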

Revision history. As workflows become part of the enterprise environment, they will need the same sort of revision control as any other document. For workflows saved as XML files, this can be simply implemented with a revision control system such as RCS, CVS, or SVN. Robust integration with the workflow environment itself is a big plus.

Command line execution. The emerging use model for workflows appears to be that expert developers will create protocols for use by others. This means that in many cases, the workflows themselves will be pieces in other automated systems. Therefore, they must support execution from the command line, which in turn allows automated or remote invocation.
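
A command-line entry point for a saved workflow might look something like the following Python sketch. The argument names, the KEY=VALUE parameter convention, and the JSON echo are all hypothetical; a real driver would load and execute the workflow where this one only reports what it was asked to do.

```python
import argparse
import json
import sys


def build_parser():
    parser = argparse.ArgumentParser(description="Run a saved workflow without the GUI.")
    parser.add_argument("workflow", help="path to the saved workflow file")
    parser.add_argument("--param", action="append", default=[], metavar="KEY=VALUE",
                        help="workflow parameter; may be given more than once")
    return parser


def parse_params(pairs):
    """Turn ['db=nr', ...] into {'db': 'nr', ...}, rejecting malformed pairs."""
    params = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        params[key] = value
    return params


def main(argv=None):
    args = build_parser().parse_args(argv)
    try:
        params = parse_params(args.param)
    except ValueError as err:
        print(err, file=sys.stderr)
        return 2
    # Placeholder for loading and executing the workflow.
    print(json.dumps({"workflow": args.workflow, "params": params}, sort_keys=True))
    return 0
```

A script would finish with sys.exit(main()) so that the exit code reaches the calling system, which is what lets a scheduler or another pipeline detect failure.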

Encapsulated scripting. No environment will ever provide every possible module. One of the most powerful features I’ve seen in any of these tools is the ability to very simply define a “script wrapper” action. Of course, this could lead to abuses of the environment such as wrapping an existing monolithic Perl script in a single action and declaring it a workflow.
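
In spirit, a script wrapper is just an external command presented as an action with text in and text out. The Python sketch below uses the standard subprocess module; the script_action name and the convention of raising on a nonzero exit code are assumptions for this example, and the wrapped command is a trivial Python one-liner standing in for whatever script a user might already have.

```python
import subprocess
import sys


def script_action(command):
    """Wrap an external command as a workflow action: text in, text out."""
    def action(input_text):
        result = subprocess.run(
            command, input=input_text, capture_output=True, text=True, check=False
        )
        if result.returncode != 0:
            # Surface the script's own error output to the workflow.
            raise RuntimeError(
                f"{command[0]} exited {result.returncode}: {result.stderr.strip()}"
            )
        return result.stdout
    return action


# Stand-in for an existing script: uppercase whatever arrives on stdin.
upper = script_action(
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
)
```

Calling upper("acgt") runs the child process and returns its standard output, so the wrapped script can sit between any two steps that exchange text.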

Disconnect/reconnect. Production workflows must support long-running processes. In the extreme case, some pipelines will run perpetually, receiving new data from automated instruments. I simply cannot endorse any product that requires me to leave my laptop connected to the Internet for my jobs to run.

Process encapsulation. Both of the commercial offerings allow me to wrap up a set of calls into the equivalent of a subroutine and then to republish that subroutine as a Web service using WSDL and SOAP. This is absolutely imperative for many reasons, not least of which is the fact that the whole point of a graphical workflow system is to mitigate complexity and provide a clear and simple view of the process being implemented. When workflows require wall-sized posters to display, they no longer serve that purpose.

Parse WSDL; speak SOAP. This seems self-evident to me: Any new programmatic technology should make use of Web services and discoverable resources.

I’m certain that this is not an exhaustive list. These are just a few points that I’ve seen in a couple of months of working with the technology.

The compelling differentiator for me comes down to user expectations and needs. An academic lab with limited financial resources will find the free and open-source tools appealing. Corporations with enterprise-level computing needs tend to be willing to pay a premium for tools with support teams to back them up. The technology is still young and malleable enough that both groups will find plenty of opportunity to do great and interesting things, and these graphical environments provide a valuable addition to the scientific computing toolbox.

Chris Dwan is a senior consultant with The BioTeam. E-mail: cdwan@bioteam.net.
