Introduction

This document enumerates the SCAPE requirements on provenance information to be provided by Taverna. It is based on the provenance requirements of the Wf4Ever project.

Taverna is a workflow execution environment, so the provenance provided will concern the generation of artifacts through the execution of workflows. This page does not address wider provenance requirements that may arise from SCAPE's use of Taverna; requirements relating to provenance determined by means other than workflow execution are not covered here.

Provenance is concerned with a specific workflow execution (not a definition or template), which is assumed to consist of a number of individual process executions (any of which could itself be a workflow execution). Each process execution uses a number of input parameters or artifacts, and generates one or more artifacts. Similarly, a workflow execution consumes and generates artifacts.

Minimum requirements

SCAPE MUST be able to access the following information, with each item, where appropriate, identified uniquely at least within the scope of the related workflow execution. Globally accessible resources should be identified with globally unique identifiers.

  • Artifacts used by a workflow - at least identifier/URL
  • Artifacts generated by a workflow - at least identifier/URL
  • Process executions
  • For each process execution, a description of what artifacts it used (at least identifier/URL), and in what role
  • For each process execution, a description of what artifacts it generated (at least identifier/URL), and in what role
  • For each process execution (output) which failed, an indication that the error occurred. This could be expressed both as direct information about the execution and as an indication that the generated artifacts are non-data-carrying errors (e.g. Taverna's error documents)
  • The specific format and/or vocabulary used for provenance data is to be selected by the Taverna team, but once chosen it should be stable, and as far as reasonably possible any changes should be in a backward compatible fashion that doesn't break software written to consume a previous version.

That is: sufficient information to construct a provenance trace, from the workflow inputs via process executions, for all artifacts created by a workflow, including intermediate artifacts. Each output artifact should be traceable back to the initial workflow inputs and/or to initial non-input process executions.

The actual data values passed through the workflow are not included in the minimum requirements, but all artifacts should have identifiers - e.g. a second import of the same execution provenance should use the same artifact identifiers. A second execution with the same byte-wise values SHOULD NOT have the same identifiers for generated workflow and process artifacts, but SHOULD have the same identifiers for top-level *used* artifacts if they come from the same source (e.g. a file). (In any case, we are doubly guarded against such cross-run collisions by keeping each workflow run's provenance in a separate annotation body.)
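For illustration, the intended identifier behaviour might be expressed in RDF (Turtle) as follows; the per-run namespaces and the file URI are invented for this sketch:

    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix run1: <http://example.com/run/1/> .   # illustrative per-run namespaces
    @prefix run2: <http://example.com/run/2/> .

    # Each run mints fresh identifiers for the artifacts it generates...
    run1:outputArtifact1 a prov:Entity .
    run2:outputArtifact1 a prov:Entity .

    # ...but both runs reuse one identifier for the same source file they read
    run1:workflowExec prov:used <http://example.com/files/input.txt> .
    run2:workflowExec prov:used <http://example.com/files/input.txt> .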

The reference to "role" here is intended to be sufficient to determine, for any process execution, which artifact was used for each possible input. (For Taverna, these roles are port names.)
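As an illustrative sketch only (the choice of vocabulary is left to the Taverna team, per the requirement above), a minimal trace covering used/generated artifacts, process executions and roles could look like this in PROV-O Turtle; all URIs and port names below are invented:

    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix ex:   <http://example.com/run/1/> .   # illustrative run-scoped namespace

    # A process execution within one workflow run
    ex:processExec1 a prov:Activity .

    # An artifact used by it, qualified with its role (for Taverna, a port name)
    ex:processExec1 prov:qualifiedUsage [
        a prov:Usage ;
        prov:entity  ex:inputArtifact1 ;
        prov:hadRole ex:port_sequence      # role = input port "sequence"
    ] .

    # An artifact it generated, again qualified with the output port as role
    ex:outputArtifact1 prov:qualifiedGeneration [
        a prov:Generation ;
        prov:activity ex:processExec1 ;
        prov:hadRole  ex:port_result       # role = output port "result"
    ] .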

Highly desirable requirements

SCAPE SHOULD be able to access the following information for any workflow execution:

  • A reference to the workflow instance which defines a specific workflow execution (including inputs, outputs, options, etc.). Note that this is distinct from a workflow template that can be instantiated with different inputs by different workflow instances. If a workflow is repeated, there may be several workflow executions defined by a workflow instance.
  • A unique identifier, or information from which a globally unique identifier can be constructed for any workflow execution (e.g. host name and timestamp for workflow execution)
  • A unique identifier for a value at its point of use or generation (to be able to annotate a given usage)
  • Date, time and timezone at which the workflow execution started
  • Date, time and timezone at which the workflow execution completed
  • Identification of the person or other agent that initiated the workflow execution
  • Identification of the host (or cluster, or hosts) on which the workflow was run
  • Software or service(s) used to perform each process execution (possibly by reference to an element in the corresponding workflow instance)
  • Date, time and timezone at which any external services were invoked, or any software image was retrieved and loaded (if materially different from the time at which the process execution was started).
  • Date, time and timezone at which each process execution started
  • Date, time and timezone at which each process execution completed
  • Identification of the host (or cluster, or hosts) on which each process execution was run (WSDL endpoint, SSH node, REST URL)
  • For each process execution (output) which failed, details about the error, such as a message and stack trace
  • For each artifact which can be represented as a binary blob, a SHA-1 (or SHA-256?) checksum of the binary value
  • For each artifact, the date, time and timezone it was created
  • For each artifact used in a process execution, the date, time and timezone it was accessed by that process execution. (This is particularly relevant for external artifacts.)
  • For each external artifact accessed via the Web, the URI from which it was retrieved
  • For each created artifact that is subsequently web-accessible, the URI at which it is made available
  • For each workflow input/output artifact, the actual values as a binary blob (which can be stored in the RO)
  • Details of artifact list/collection memberships (list of artifacts as another artifact)

Note on artifact checksums: not all artifacts (e.g. intermediate process artifacts) may be representable as a blob - for instance a reference to a JVM object. Some artifacts might not have a uniform binary representation (e.g. a table in Galaxy). The actual value might also be inaccessible because it is large, secured, or held in a different system like GridFTP - but in these cases it should generally still have a URI reference.
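To illustrate how several of the attributes above (start/stop times, initiating agent, creation time, checksum, retrieval URI) could be attached, here is a hedged PROV-O sketch; the ex: namespace is invented, and ex:sha1 is a made-up property since PROV-O itself defines no checksum term:

    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <http://example.com/run/1/> .   # illustrative

    # Timing and the initiating agent for a process execution
    ex:processExec1 a prov:Activity ;
        prov:startedAtTime "2012-02-28T09:00:00Z"^^xsd:dateTime ;
        prov:endedAtTime   "2012-02-28T09:00:05Z"^^xsd:dateTime ;
        prov:wasAssociatedWith ex:alice .

    # A generated artifact with creation time, an (illustrative) checksum,
    # and the URI at which it is made available
    ex:outputArtifact1 a prov:Entity ;
        prov:generatedAtTime "2012-02-28T09:00:05Z"^^xsd:dateTime ;
        ex:sha1 "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12" ;
        prov:atLocation <http://example.com/files/output1> .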

Note that Taverna performs implicit iteration: if service A generates a list and service B consumes single items, an iteration over B will occur for each value, generating a new list of B's outputs. Provenance-wise it might initially look like B is consuming artifacts "out of nowhere", unless you either claim that A also generated each list value (but then you should also store the list position somewhere) or provide a mechanism to describe such compound artifacts - see PROV-DM Collections and the sketch below.
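A sketch of the Collections approach, again with invented URIs (note that core PROV collections do not record member positions, so list indexes would need an additional vocabulary):

    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix ex:   <http://example.com/run/1/> .   # illustrative

    # The list generated by service A, described as a PROV collection
    ex:listFromA a prov:Collection ;
        prov:hadMember ex:item0 , ex:item1 .

    # One implicit iteration of service B consumes a single member of that list
    ex:iterationB_0 a prov:Activity ;
        prov:used ex:item0 .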

Additional requirements

Access to the following information could extend the capabilities of certain Research Object evaluation tasks, but is not required:

  • Resources (memory, CPU time, transient disk space, special compute resources) used by each process execution
  • User comments and annotations about a particular execution
  • For each artifact used or generated, information that can be used to verify its integrity (e.g. checksum, details of web retrieval transaction, etc.)
  • For each software or system used in a process execution, information that can be used to verify its integrity
  • For each software component or system used, including the workflow engine: version, path and compile information (32/64-bit, OS, etc.)
  • For each external resource used (artifact, software or system), links to human- and machine-readable information about that resource (this may be in the workflow instance).
  • For each process execution, an indication of whether it is a "shim" process or something more substantial (Note: should this not be specified in the Workflow Template?)
  • For each process execution (output) which gave a warning, details about this warning
  • For external processes - metadata about the service at the time of invocation, e.g. a copy of the WSDL and XSD, HTTP headers from the server, etc.
  • For external processes - a full protocol-level trace of the invocation - e.g. HTTP headers and body of request(s) and response(s) 
  • For each artifact retrieved from external sources, the protocol-level trace of its retrieval e.g. HTTP headers of request and response
  • For each process input/output artifact, the actual values as a binary blob (which can be stored in the RO)

Non-requirements

(Add here any plausible requirements considered that turn out to be not necessary - this should help to clarify the intended scope of these requirements.)

  • Identification of alternative provenance "Accounts" (per OPM) 

Other information that Taverna might provide

Examples:

  • Implicit iteration details - which iteration was performed for which artifacts (each iteration should be a new Process, but linked to the same process definition, with a virtual process for the actual iteration which consumes and produces lists)
  • With which software the workflow was run (workbench, command line - and which version)
  • Which versions of which Taverna plugins were installed (not necessarily used)
  • Trace of the Taverna dispatch stack, e.g.:
    • any retries that occurred (errors repaired by retries are currently hidden from final provenance)
    • how many parallel calls were made (this can be inferred)
    • details of looping (I think this is important for Pique): what each loop returned - and, for loops over nested workflows, the process execution details of the non-final loops
    • Errors bounced - e.g. executions that never happened because one of the inputs was an upstream error
    • Which activity/service was used in cases of Failover
    • In/out from any custom dispatch layers, e.g. if/else branching or dynamic service lookup
  • Stdout/stderr log for the run (currently captured for the whole session, but thread magic could capture it per workflow run)
  • Taverna's internal log (currently just a rolling file)
  • For Taverna Server - details of who submitted the workflow job run, when it was submitted, when it was scheduled, etc
  • Implied wasDerivedFrom (or equivalent) relationships between generated artifacts and used artifacts - including list expansion (a service takes in an artifact which is a list of artifacts and produces a single artifact, say a calculated average); see the sketch after this list
  • For each external artifact - any additional URIs from which it might be accessible - e.g. http://example.com/fred is also at sftp://example.org/tmp/fred and gridftp://example.net/some/grid/fred
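A sketch of such implied derivations in PROV-O terms (URIs invented): the single output is linked both to the list artifact and to each of its members.

    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix ex:   <http://example.com/run/1/> .   # illustrative

    # A service consumed a list and produced a single calculated average;
    # the implied derivations link the output to each member, not just the list
    ex:average prov:wasDerivedFrom ex:inputList , ex:value0 , ex:value1 , ex:value2 .
    ex:inputList a prov:Collection ;
        prov:hadMember ex:value0 , ex:value1 , ex:value2 .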

Information Taverna already captures

  • Provenance database - See http://www.mygrid.org.uk/dev/wiki/display/developer/Provenance+schema+in+2.2.0 - but minus ServiceInvocation and Activity
    • Older structure (used by OPM/Janus export):
      • Copy of workflow definition (and its UUID) at time of run
      • Abstract structure of workflow and its nested workflows (Processor/Port/Datalink)
      • For every workflow run (incl. nested workflows), run ID, workflow definition ID, when it started, and in which processor it is if nested
      • Which values/lists were seen at a workflow run, and at which workflow input/output
      • Each value (but not list) seen at a processor port in a given run. Reference to the value - the actual value is accessible if run with database storage, or in-memory within the same session (Taverna normally deletes in-memory run provenance on exit)
      • Which collection a value was a member of, and in which position - but only the last collection seen
    • Newer structure (only used by GUI and experimental PROV-O export)
      • For each process execution per iteration: start/stop time, inputs/outputs per port (top-level list), and the parent process if nested (supports nesting within nesting)
      • For each (also nested) workflow execution - the same as for process executions above
  • At execution time Taverna also knows anything shown in the Result perspective - but it does not store this at the moment.

Information Taverna already exports

Taverna can currently export provenance to OPM and Janus - see http://www.mygrid.org.uk/dev/wiki/display/taverna/Provenance+export+to+OPM+and+Janus

Both of the current export options have known issues.

OPM:

  • Identifier for main workflow run (but not start/stop times)
  • Reference to which defined process template (e.g. processor in workflow definition) generated/used an artifact - but not to the identifier of the overall workflow template
  • Reference to in which iteration/workflow run an artifact was used/generated (but not at which port, and only for processes in main workflow)

In particular, OPM export is missing start/stop times, separation of each process execution, ports, iterations, and nested workflows. OPM is currently generated from a provenance query, but would probably be better generated from the new ProcessorEnactment table.

Janus:

  • A semi-abstract structure of the workflow template, including processors, ports and links
  • Indication of process types (WSDL, Beanshell)
  • Defined processor names
  • Identifiers for workflow definitions
  • Which artifacts were seen at which input/output ports
  • Which list an artifact was a member of (but not which position)
  • Which iterations an artifact was involved with (but not in which process execution)
  • The actual string value of the artifact (if the value is large, the export does not work)

In particular, Janus export is missing start/stop times, separation of process executions, and list positions.

General preferences

These are not requirements, but satisfying them could make life easier for SCAPE:

  • Use RDF for provenance information, in one of the commonly-supported syntaxes.
  • Use URIs for identifying things (artifacts, process executions, etc.). Where appropriate, use globally dereferenceable URIs. But note that identifier URIs may differ from retrieval URIs if the latter cannot be expected to be globally unique, e.g. if a locator is re-used when a workflow is re-run; see the sketch after this list.
  • Present provenance information in a form that is broadly compatible with OPM
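For example (a sketch only; prov:atLocation is used here merely as one possible way to record a re-usable retrieval location separately from the globally unique identifier URI):

    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix run1: <http://example.com/run/1/> .   # illustrative run-scoped namespace

    # The identifier URI is minted per run and stays globally unique,
    # while the (potentially re-used) retrieval location is recorded separately
    run1:outputArtifact1 a prov:Entity ;
        prov:atLocation <http://workflow-host.example.org/results/output.txt> .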
Comment (Feb 28, 2012): We put together a few properties that were of interest here: Process performance metrics. That list should probably just be folded into here.