Net Traveller: Big Data for Big Science

Tuesday, July 10, 2012

Big Data for Big Science

Greetings from the Australian National University in Canberra, where Ian Gorton, from the Pacific Northwest National Laboratory of the US Dept. of Energy, is speaking on "Towards a Collaborative Scientific Knowledge Management Platform with Velo". The software has been used for modelling carbon sequestration and climate change. He began by explaining that "Velo" is not an acronym for anything; the name came from one French collaborator's interest in bicycles.

Velo is designed for high performance computing. Scientists want to be able to get at their data easily, and large scientific instruments can produce large amounts of data quickly. The people building the software therefore need to understand the definitions of the data and the requirements ("Scientific Knowledge-Management Requirements"). New applications also need to be able to access old data and applications.

Traditionally, a new software tool would be built for each area of science. However, this is becoming prohibitively expensive. Just as other sectors are using more packaged software, so is science.

Velo uses open source software, including MediaWiki and the Alfresco Content Management System (CMS). These are tools normally thought of as being used for e-publishing (the Australian Computer Society uses Alfresco for keeping its course content).

Originally ontologies were used for defining the data, but this was found to be too complex for scientists, and now a spreadsheet is used.

The Velo approach is an interesting one, as it goes some way to uniting the document and data repositories which universities such as the ANU are now building. The Australian National Data Service (ANDS), located at ANU, is building a system for cataloguing and sharing research data. But this is separate from the repositories used for the research papers about the same data. Essentially the same technology is used to describe the data and the papers, so there is no reason why they need to be kept separate.

At question time I asked Ian when the software would be freely available. He replied that the interfaces are now being documented and the software should be freely available online as open source in ten weeks' time.

The rival product to Velo is "HUBzero" from Purdue.

There is a set of slides with an overview of "Velo: Knowledge Management for Collaborative Science" available. Unfortunately, this is a large file, so here is the text (minus images):

Velo:
Knowledge Management for Collaborative (Science | Biology) Projects

A framework to support collaborative

Scientific Knowledge Management (KM)

Knowledge Management
systematic strategy of creating, conserving, and sharing knowledge to increase performance and innovation
Capabilities required for a collaborative scientific KM Platform
Associating disparate information
Questioning data and results
Experimenting with data
Sharing hypotheses, data, results

Velo Overview

Velo supports common knowledge management needs across science domains
Carbon sequestration
Climate Modeling
Bioinformatics
Subsurface modeling
…..
Easily customized to specific science needs
Data types
Analysis/simulation tools
Pluggable, extensible architecture
Robust and scalable – built on widely used open source technologies
Built to support collaboration across multi-disciplinary teams

Knowledge Management in Velo
Knowledge = data + models + results + provenance
Scientific Data
Manage empirical/observational/derived data used to set up and parameterize models
Velo can be easily customized to handle different data types
Models and Simulations
Manage multiple versions of models and associated results
Launch simulations and data analysis on HPC/cloud platforms
Results
Automatically retrieve and store outputs associated with specific model versions
Incorporate visualizations of simulation/model outputs
Provenance
Automatically and manually create links between related inputs and outputs and computational processes

Velo: Data Management
Ingest any data types into Velo
Incorporate scripts and tools to visualize and analyze data
Extensible programmatic framework for new data types
Examples:
Incorporating well bore data logs for subsurface modeling
Managing genome data for bioinformatics

Models: Model Setup and Simulation

Manage conceptual models
Launch simulations on remote HPC platforms
Extensible to incorporate tools for model creation
Examples:
Conceptual model worksheets for subsurface models
Simulation launching
Mesh visualization

Results: Management and Analysis

Retrieve simulation results from execution platforms
Automatically visualize results
Framework for incorporating analysis and visualization tools
Examples:
Plots for climate simulation outputs
Visualizing plume extents for contaminants

Tool History

Velo gives the user the option to record
Inputs
Outputs
Control parameters

Automatically loads the last saved inputs in tool’s input form

Current Development plan – Browse and re-run any earlier invocation
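
To make the Tool History slide concrete, here is a minimal Python sketch of recording a tool's inputs, outputs and control parameters, then reloading the last saved inputs to pre-fill a form. It is only an illustration: the class names, field names and the "plume_simulator" example are my own assumptions, not Velo's actual design or API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Invocation:
    """One recorded run of a tool (hypothetical structure, not Velo's)."""
    inputs: Dict[str, Any]
    outputs: Dict[str, Any]
    control_parameters: Dict[str, Any]


@dataclass
class ToolHistory:
    """Keeps every recorded invocation of a single tool, oldest first."""
    tool_name: str
    invocations: List[Invocation] = field(default_factory=list)

    def record(self, inputs, outputs, control_parameters) -> None:
        # Copy the dicts so later edits by the caller do not alter history.
        self.invocations.append(
            Invocation(dict(inputs), dict(outputs), dict(control_parameters))
        )

    def last_inputs(self) -> Optional[Dict[str, Any]]:
        # Used to pre-fill the tool's input form; None if nothing is recorded.
        if not self.invocations:
            return None
        return dict(self.invocations[-1].inputs)


# Hypothetical tool name and file names, for illustration only.
history = ToolHistory("plume_simulator")
history.record({"mesh": "site_a.msh"}, {"plot": "plume_a.png"}, {"steps": 100})
history.record({"mesh": "site_b.msh"}, {"plot": "plume_b.png"}, {"steps": 250})
print(history.last_inputs())
```

Keeping the full list of invocations, rather than only the latest one, is what would make the planned "browse and re-run any earlier invocation" feature possible in a design like this.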

Provenance

Ability to link related artifacts for forensic investigations
Both manually and automatically
Examples:
Link input data sets to models
Link conceptual model versions to results
Associate comments and analyses to simulation outputs
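
The provenance links described in this slide can be pictured as a small directed graph. The sketch below is my own illustration in Python, assuming artifacts are identified by simple names; the class, the relation labels and the file names are hypothetical and are not taken from Velo.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class ProvenanceGraph:
    """Directed links between artifact names (hypothetical, not Velo's API)."""

    def __init__(self) -> None:
        # source artifact -> list of (relation, target artifact)
        self._links: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def link(self, source: str, relation: str, target: str) -> None:
        self._links[source].append((relation, target))

    def downstream(self, artifact: str) -> Set[str]:
        """Return every artifact reachable from `artifact` by following links."""
        seen: Set[str] = set()
        stack = [artifact]
        while stack:
            current = stack.pop()
            for _relation, target in self._links.get(current, []):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen


graph = ProvenanceGraph()
# The three example links from the slide, with made-up artifact names.
graph.link("well_logs.csv", "input_to", "conceptual_model_v2")
graph.link("conceptual_model_v2", "produced", "simulation_run_7")
graph.link("simulation_run_7", "annotated_by", "reviewer_comment_1")

# Forensic question: what depends on this input data set?
affected = graph.downstream("well_logs.csv")
print(sorted(affected))
```

Following the links forward from an input data set answers the forensic question of which models, results and comments would be affected if that data turned out to be wrong.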

Next Steps
We’re keen to work with others
To deploy Velo to support scientific communities
To partner on proposals
To collaborate on projects
To enhance the technology
We’ll open source the Velo technology mid-year
Downloadable
User documentation
Programmer documentation
