XA.WP.2 Technical Co-ordination
Task 1: Implementation Guidelines - BL; AIT, EA, ONB, TUW
Defining project-wide technologies (operating systems support, programming languages, and standards). Providing guidelines for coding best practices (naming conventions, coding style, and unit testing) and recommendations for development environments and tools.
This report gives an outline of the technical architecture of the SCAPE project, and defines project-wide technologies, coding best practices, and recommendations for development environments and tools. It is primarily aimed at those developing tools under the Preservation Components work package, but touches on broader issues.
- 1 Summary
- 2 Technologies
- 3 Architectural Roadmap
- 4 Development Best Practices
- 4.1 Java Development Best Practices
- 4.2 Development Environment & Tools
- 4.2.1 Development Resources
- 4.3 Code Repositories & Conventions
- 4.3.1 Repositories
- 4.3.2 Naming Conventions
- 4.3.3 Licensing
- 4.3.4 Using the Approved Useful Software Registry
- 4.4 Build Management
- 4.5 Large-scale Testing Infrastructure
- 4.6 Publishing our outputs via Maven
- 4.7 Working with Third-Parties
- 5 Communications & Feedback
The SCAPE architecture attempts to make a clean separation between the preservation component (tool) layer, the web layer, the platform layer and the application layer. The technologies and requirements are different at each level, and an initial roadmap is in place to ensure that these components can be fully integrated as the project proceeds.
The primary user applications in the SCAPE architecture are Taverna (for the Testbeds) and PLATO (for Preservation Watch). Taverna will be used to develop the Testbed scenarios; it has a mature integration framework for WSDL/SOAP services, and more recent support for RESTful services and for integrating command-line tools (the latter as of version 2.3). PLATO will be used for preservation planning, and needs to invoke preservation tools during the planning process. To do this, it simply requires a web service (WSDL or REST) with a stable API.
Please note that tool development should be driven by the needs of the Testbed Scenarios, and so it must be possible to invoke the tools from Taverna.
It is envisaged that simple command-line or GUI clients to local tools or remote services may be required later, in order to assist broader integration of the SCAPE outputs. Repository integration may need remote services or local tools (as per the ePrints PLATO integration). For general systems and web integration, a RESTful style would be preferred.
The Platform layer exposes the tools and Taverna workflows that run on a cluster of machines as local command-line applications and remote services. This should allow the tools developed during this project to be tested and deployed at scale, processing large collections with relative ease. The Platform layer will be able to invoke Preservation Components either as local Java or command line tools. Initially, the Platform will be Apache Hadoop, with data stored in HDFS and/or HBase. Later on, Taverna workflows will be run directly on the cluster, and tools that have particular platform requirements may be invoked as 'remote' services from inside locally-deployed virtual machines constructed precisely for that purpose.
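To illustrate the local invocation path described above, the sketch below shows how a platform component might run a command-line preservation tool and capture its exit code and output. The class and method names are hypothetical, chosen for this example only; they are not part of any SCAPE codebase.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

/** Minimal sketch: run a command-line tool and capture its output. */
public class ToolRunner {

    /** Runs the given command, returning its merged stdout/stderr text. */
    public static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        Process process = builder.start();
        StringBuilder output = new StringBuilder();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            output.append(line).append('\n');
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Tool failed with exit code " + exitCode);
        }
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        // "echo" stands in for a real preservation tool here:
        System.out.print(run(Arrays.asList("echo", "hello")));
    }
}
```

The same wrapper logic can sit behind a web service endpoint, which is what allows the identical tool invocation to be executed locally or remotely.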
In order to make the tools available during the development of Taverna workflows and during the PLATO planning process (or indeed during broader institutional or repository integration), it is necessary to have the SCAPE Preservation Components available over the web, either as WSDL or RESTful endpoints.
Preservation components should, in general, be developed as simple command-line tools or as Java classes implementing a standardised interface. Any command-line tools we write or adopt must be compatible with the SCAPE Platform. The process of turning a tool into a remote service will be standardised by the Platform and TCC work packages, and tool developers should not need to write wrapper code or other web-layer components. Only limited standardisation of the tool inputs and outputs is absolutely required, but further standardisation will be encouraged in order to allow the results from different tools to be combined more reliably.
The four layers described above will work together best if we can find a way of ensuring that the preservation tools can be invoked in a reproducible manner, whether the call is being executed locally or remotely. We aim to do this by writing declarative specifications that describe the tools and indicate how they should be invoked from the command line or as Java classes, and then standardising the way these specifications can be invoked as WSDL/SOAP or as RESTful services. To improve the integration further, we will propose interoperable data formats and consistent semantics across contexts where required. As the different layers start with different capabilities and will mature over time, we have a staged roadmap, as follows.
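The following sketch shows, in miniature, how a declarative specification can back reproducible invocation: a command template with named placeholders is resolved to an identical concrete command line whichever layer performs the call. The class and placeholder syntax are hypothetical, not the actual SCAPE schema.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of a declarative tool specification (hypothetical class;
 * not the actual SCAPE schema): a command template with named
 * placeholders that resolves identically whether the tool is
 * invoked locally or behind a web service.
 */
public class ToolSpec {

    private final String commandTemplate;

    public ToolSpec(String commandTemplate) {
        this.commandTemplate = commandTemplate;
    }

    /** Substitutes %{name} placeholders to build the concrete command line. */
    public String resolve(Map<String, String> parameters) {
        String command = commandTemplate;
        for (Map.Entry<String, String> entry : parameters.entrySet()) {
            command = command.replace("%{" + entry.getKey() + "}", entry.getValue());
        }
        return command;
    }

    public static void main(String[] args) {
        ToolSpec spec = new ToolSpec("convert %{input} %{output}");
        // Explicit generics rather than the diamond operator, for Java 6.
        Map<String, String> params = new HashMap<String, String>();
        params.put("input", "page.tiff");
        params.put("output", "page.jp2");
        System.out.println(spec.resolve(params));
    }
}
```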
In order to meet its deadlines, the Testbed sub-project must be able to invoke tools from Taverna straight away. The fastest route to making this happen has been to start with the Axis2-based WSDL/SOAP wrapper code that was developed for the IMPACT project, and extend it to meet SCAPE's needs. This led to the WSDL Tool Wrapper codebase (the xa-toolwrapper project).
This framework allows ad-hoc WSDL/SOAP web services to be deployed, based on simple configuration files that specify how to invoke a given command-line tool. Using this information, it builds a deployment package that exposes the tool as a web service, with the command parameters exposed as ports so that Taverna can show them. This permits loose integration via WSDL/SOAP to be set up fairly easily, with results returned as a pointer to a new file hosted at a URL local to the service, while service and execution metadata is returned in a standardised form.
Currently, the ONB is hosting a number of services for the Testbed sub-project, thus ensuring that the current SCAPE workflows (http://www.myexperiment.org/groups/490.html) can be executed locally without installing anything other than Taverna itself. At the current time, all tools should be deployed as services in this way. Note that the process of deploying a tool does not require any software development, just the creation of a new set of configuration files in the xa-toolwrapper project. See the tool wrapper instructions for more details.
Although Taverna allows ad-hoc service integration, there have been ongoing efforts to standardise different types of services as hot-swappable components, making it easy to shift between different services that perform the same operation. This comes close to the original Planets approach, where strongly-typed web services described the standardised forms for preservation actions like Migrate, Identify and Characterise. However, the Planets interfaces suffered from being too complex, inflexible, and clumsy, especially in the way the digital object was transferred (as a SOAP attachment, by default). By shifting to typed, extensible tool specifications (backed by Java interfaces per type), we can combine the best of all of these approaches, leading to a lightweight deployment procedure that efficiently exploits local data while allowing remote integration to be supported as required.
During year one, for adoption in year two, we will work with the Taverna External Tools & Components efforts to build these standardised interfaces. This will be done by defining a shared schema for tool specifications. These simple XML definitions describe individual tools and how to invoke them to perform different types of actions. The specifications are based on those used by the Taverna External Tools plugin (http://www.mygrid.org.uk/dev/wiki/display/developer/Calling+external+commands+from+Taverna).
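As a purely illustrative sketch of what such an XML tool specification might contain, consider the fragment below. The element names here are hypothetical and are not the actual plugin schema; consult the plugin documentation linked above for the real format.

```xml
<!-- Illustrative only: element names are hypothetical -->
<toolspec>
  <tool name="file" version="5.03"/>
  <action type="identify">
    <command>file --mime-type %%input%%</command>
    <input name="input" type="file"/>
    <output name="mimetype" type="stdout"/>
  </action>
</toolspec>
```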
These tool specifications will be extended to meet the SCAPE project's needs, and will allow us to use standardised software to invoke these tools locally, or to expose these tools as web services. In the latter case, we will target RESTful deployment, as this is the easiest to integrate across a wide range of contexts. In both cases, we can use the standard invoker to automatically collect performance metrics during execution, re-using and extending the methods used by the MiniMe registry from the Planets version of PLATO.
Furthermore, by adopting these re-usable tool specifications, we make it easy to share information on tool invocation patterns reliably, e.g. via email or GitHub. It also provides a route by which tool parameters may be standardised across different tools that perform the same operation. Note also that, for format conversion, this approach is similar to (and broadly compatible with) that of the NCSA Polyglot system (http://isda.ncsa.uiuc.edu/NARA/conversion.html) and its associated Conversion Software Registry (http://isda.ncsa.uiuc.edu/NARA/csrAbout.html).
During code development, Java is generally the preferred platform language, as this makes sharing code across contexts easier. This does not, however, mean that the tools themselves must be developed in Java, and indeed the tool specification approach is designed to avoid making this necessary. In particular, it is preferable to extend an existing project in the language that the developers are currently using, rather than attempting to port to Java. Nevertheless, Java should be considered as the default language for new developments.
We will initially target Java 6, and Java 7 will be evaluated as soon as possible after its release.
Any Java code should follow Oracle's Code Conventions for the Java Programming Language (http://www.oracle.com/technetwork/java/codeconv-138413.html). Conformance to those conventions will be evaluated using the Checkstyle tool (http://checkstyle.sourceforge.net/).
All code should be accompanied by unit tests. The Clover code coverage tool (http://www.atlassian.com/software/clover/) will be used to assess how well the supplied tests cover the codebase.
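As a minimal illustration of code shipped with its tests, the sketch below pairs a hypothetical utility class with plain `assert` checks. Real SCAPE code would express these as JUnit test methods, so that Clover can report on coverage; plain assertions are used here only to keep the sketch self-contained.

```java
/** Example class under test (hypothetical; for illustration only). */
public class Checksums {

    /** Returns a simple additive checksum of the given bytes, modulo 256. */
    public static int additiveChecksum(byte[] data) {
        int sum = 0;
        for (byte b : data) {
            sum = (sum + (b & 0xFF)) % 256;
        }
        return sum;
    }

    public static void main(String[] args) {
        // A real project would place these in a JUnit test class.
        assert additiveChecksum(new byte[0]) == 0;
        assert additiveChecksum(new byte[] {1, 2, 3}) == 6;
        System.out.println("All checks passed.");
    }
}
```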
The status of the codebase will be monitored via an integrated build server. This automatically builds the code after each commit to the code repository, and ensures that any changes that break the build are brought to the attention of the author of the code as soon as possible. See the section on Build Management below.
The supported build tool is Maven 2. By handling the build and the dependencies in this way, we can support a wide range of development environments without managing IDE-specific configuration files. SCAPE developers' IDEs include Eclipse, NetBeans, IntelliJ and the plain command line. However, Eclipse should be considered the default IDE, as this is the development platform that the SCAPE partners have the most experience with.
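For orientation, a per-task Maven project needs little more than the minimal POM sketched below. The group and artifact identifiers here are placeholders, not the project's agreed values.

```xml
<!-- Minimal sketch of a per-task POM; identifiers are placeholders -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId><!-- project root group id --></groupId>
  <artifactId>example-task</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
</project>
```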
If you are using Eclipse, here is a list of useful plugins that will help you develop SCAPE code.
- http://www.eclipse.org/m2e/ for Maven support.
- http://www.eclipse.org/egit/ for Git integration.
- http://www.atlassian.com/software/ideconnector/ for integration with Atlassian tools, e.g. JIRA, Bamboo, etc.
- http://wiki.apache.org/hadoop/EclipsePlugIn for integration with Hadoop.
- http://eclipse-cs.sourceforge.net/ for using the Checkstyle tool.
- http://www.javaforge.com/project/HGE (MercurialEclipse) for working with Mercurial repositories.
- http://subclipse.tigris.org/ for working with Subversion repositories.
- http://pydev.org/ for Python or Jython development.
- http://eclipse.org/mylyn/ for task management.
Along with the basic build tools, the SCAPE project is using the following resources to help manage the code, hosted by the Open Planets Foundation.
- http://wiki.opf-labs.org/display/SP The public SCAPE wiki, powered by Confluence.
- http://jira.opf-labs.org/browse/SP The SCAPE issue tracker, powered by JIRA.
All code will be open source and publicly available by default. If particular pieces of code are found to have dependencies or legal restrictions inconsistent with this approach, they will be moved elsewhere.
The initial plan is for new project code to use a central repository where we all work together, hosted by GitHub at https://github.com/openplanets/scape. All developers will have write access, and code will be arranged by work package (see below). Any pull requests will be fielded by the Technical Coordinator. As the functional requirements become clearer, the code may be refactored along functional lines rather than by work package. As the code stabilises and matures, we should switch to a more tightly managed system in which the core repository has limited commit access and larger modifications are managed via pull requests. Similarly, as the modules become clearer, we can split the code into sub-repositories as required.
Of course, for more mature projects, or if the work involves forking an existing project, this may not be appropriate. Separate SCAPE repositories can be set up for specific purposes, and of course in some cases it will be necessary to use external repositories of various kinds (e.g. we may fork the JHOVE2 BitBucket Mercurial repository). Please let the Technical Coordinator know about any external repositories, so that the code can be tracked and integrated into the build as necessary.
Java code developed under the SCAPE project should belong to the project's root package, and it is recommended that, within that package, the sub-package names reflect the project structure. Similarly, it is recommended that code held in the main Git repository be organised as distinct Maven projects per task, with the name of the top-level folder reflecting the sub-project and work package as well as the task.
Any new developments should, in general, be released under the Apache 2.0 licence. For example, a Java program should have this header:
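The standard Apache License 2.0 source header reads as follows, expressed as a Java comment (replace the copyright line as appropriate for your contribution):

```java
/*
 * Copyright [yyyy] [name of copyright owner]
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
```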
Of course, when extending existing code, you should NOT remove the existing header and replace it with this unless you are the copyright holder. If you cannot use the Apache 2.0 licence (e.g. when extending an existing project that is under the GPL), or have any other questions about licensing, you should get in touch with the Technical Coordinator.
The Approved Useful Software Registry is a list of the SCAPE project's dependencies on third-party software tools and their associated licences.
Any software your code depends on should appear on that list. The purpose is so that all parties can have an opportunity to object to any items whose licensing conditions would prohibit further legitimate use of the software as per the consortium agreement.
Currently, the Technical Coordinator is the build manager, and the Open Planets Foundation is providing the automated testing and continuous integration server:
The build manager role will shortly be taken over by the Internet Memory Foundation, and the build server is expected to migrate to IMF systems.
As per the description of work, the SCAPE project will set up a central instance of the SCAPE platform, hosted by the Internet Memory Foundation. The system manager for the central platform will make it available to SCAPE partners so that they can test their tools and software at scale.
As well as publishing our own artefacts so that they can be re-used, we will also pursue the possibility of getting other related projects to publish their artefacts through Sonatype, and indeed help them do so. This will encourage code reuse and help ensure that the network of software dependencies we need will be preserved over time. If we cannot publish JARs directly, we can also use Sonatype to upload third-party JARs.
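As an indication of what publishing through Sonatype involves on the build side, a project's POM typically declares where releases and snapshots are deployed, along the lines of the fragment below. The repository ids and URLs shown are assumptions based on Sonatype's open source hosting and must be checked against the actual account setup.

```xml
<!-- Sketch only: ids and URLs must match the actual Sonatype setup -->
<distributionManagement>
  <snapshotRepository>
    <id>sonatype-nexus-snapshots</id>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
  </snapshotRepository>
  <repository>
    <id>sonatype-nexus-staging</id>
    <url>https://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
  </repository>
</distributionManagement>
```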
When extending or improving existing tools, the Technical Coordinator should be made aware of the tool being adopted and of the identities of the people involved (on both sides). In particular, the Technical Coordinator will wish to know about the licensing terms and the location of the source code repository. We will also offer our build integration and testing services to third-party projects, as appropriate.
In general, it is recommended that SCAPE project partners work closely with the third-party developers to agree and document the distinct features that they wish to add to the third-party project (e.g. as issues in the SCAPE JIRA tracker). Specific branches should be constructed for each feature, and a road-map should be agreed specifying the terms and time-scales under which these feature branches should be folded into the core project.
If this is not possible, the SCAPE project partners may wish to consider forking the original project and creating a SCAPE variant of the original work. However, this is not encouraged in general, as we are aiming to improve the quality of existing tools rather than invest heavily in branches that may not be supported once the project draws to a close. We are not just preserving digital objects. We are also preserving code, and the representation information enshrined in the code is extremely valuable; its sustainability should be a primary goal.
In general, if you don't know who else to get in touch with, get in touch with the Technical Coordinator.
Communication will be via github.com and the OPF Labs resources. See Getting started for details.
The first developer workshop has been announced here, with details here. Please keep an eye on the SCAPE and/or OPF web sites for future announcements. More information will be added under the Events page.
People in SCAPE on twitter include:
The Technical Coordinator also runs the Technical Coordination Committee, and two mailing lists for technical discussion. For TCC discussions:
and for general SCAPE technical discussions:
Both lists are fully public, and open to non-project members.