Tuesday, July 10, 2012

Big Data for Big Science

Greetings from the Australian National University in Canberra, where Ian Gorton, from the Pacific Northwest National Laboratory of the US Dept. of Energy, is speaking on "Towards a Collaborative Scientific Knowledge Management Platform with Velo". The software has been used for modelling carbon sequestration and climate change. He began by explaining that "Velo" is not an acronym for anything; the name just came from one of the French collaborators' interest in bicycles.

Velo is designed for high performance computing. Scientists want to be able to get at their data easily, and large scientific instruments can produce large amounts of data quickly. This requires that the definitions of the data and the requirements ("Scientific Knowledge-Management Requirements") be understood by the people building the software. New applications also need to be able to access old data and applications.

Traditionally, a new software tool would be built for each area of science. However, this is becoming prohibitively expensive. Just as business is using more packaged software, so is science.

Velo uses open source software, including MediaWiki and the Alfresco Content Management System (CMS). These are tools normally thought of as being used for e-publishing (the Australian Computer Society uses Alfresco for keeping its course content).

Originally ontologies were used for defining the data, but this was found to be too complex for scientists and now a spreadsheet is used.

The Velo approach is an interesting one, as it goes some way to uniting the document and data repositories which universities such as the ANU are now building. The Australian National Data Service (ANDS), located at ANU, is building a system for cataloguing and sharing research data. But this is separate from the repositories used for the research papers about the same data. Essentially the same technology is used to describe the data and the papers, so there is no reason why they need to be kept separate.

At question time I asked Ian when the software would be freely available. He replied that the interfaces are now being documented and the software should be freely available online as open source in ten weeks' time.

The rival product to Velo is "HUBzero" from Purdue.

There is a set of slides with an overview of "Velo: Knowledge Management for Collaborative Science" available. Unfortunately, this is a large file, so here is the text (minus images):

Knowledge Management for Collaborative (Science | Biology) Projects

A framework to support collaborative


Scientific Knowledge Management (KM)

    • Knowledge Management
      • systematic strategy of creating, conserving, and sharing knowledge to increase performance and innovation
    • Capabilities required for a collaborative scientific KM Platform
      • Associating disparate information
      • Questioning data and results
      • Experimenting with data
      • Sharing hypotheses, data, results


Velo Overview

    • Velo supports common knowledge management needs across science domains
      • Carbon sequestration
      • Climate Modeling
      • Bioinformatics
      • Subsurface modeling
      • …
    • Easily customized to specific science needs
      • Data types
      • Analysis/simulation tools
    • Pluggable, extensible architecture
    • Robust and scalable – built on widely used open source technologies
    • Built to support collaboration across multi-disciplinary teams

Knowledge Management in Velo
    • Knowledge = data + models + results + provenance
      • Scientific Data
        • Manage empirical/observational/derived data used to set up and parameterize models
        • Velo can be easily customized to handle different data types
      • Models and Simulations
        • Manage multiple versions of models and associated results
        • Launch simulations and data analysis on HPC/cloud platforms
      • Results
        • Automatically retrieve and store outputs associated with specific model versions
        • Incorporate visualizations of simulation/model outputs
      • Provenance
        • Automatically and manually create links between related inputs and outputs and computational processes


Velo: Data Management
    • Ingest any data types into Velo
      • Incorporate scripts and tools to visualize and analyze data
        • Extensible programmatic framework for new data types
    • Examples:
      • Incorporating well bore data logs for subsurface modeling
      • Managing genome data for bioinformatics


Models: Model Setup and Simulation

    • Manage conceptual models
    • Launch simulations on remote HPC platforms
    • Extensible to incorporate tools for model creation
    • Examples:
      • Conceptual model worksheets for subsurface models
      • Simulation launching
      • Mesh visualization


Results: Management and Analysis

    • Retrieve simulation results from execution platforms
    • Automatically visualize results
    • Framework for incorporating analysis and visualization tools
    • Examples:
      • Plots for climate simulation outputs
      • Visualizing plume extents for contaminants


Tool History

    1. Velo gives the user the option to record
      • Inputs
      • Outputs
      • Control parameters

    2. Automatically loads the last saved inputs in the tool’s input form

    • Current Development plan – Browse and re-run any earlier invocation
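
To make the tool history idea concrete, here is a minimal sketch of my own in Python. It is not Velo's code or API: the `Invocation` and `ToolHistory` names, the tool name and the file names are all invented for illustration. It records the inputs, outputs and control parameters of each run, and returns the last saved inputs so an input form could be pre-filled.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Invocation:
    """One recorded run of a tool (hypothetical structure, not Velo's)."""
    inputs: Dict[str, Any]
    control_parameters: Dict[str, Any]
    outputs: Dict[str, Any] = field(default_factory=dict)


class ToolHistory:
    """Keeps the recorded invocations of each tool, oldest first."""

    def __init__(self) -> None:
        self._runs: Dict[str, List[Invocation]] = {}

    def record(self, tool: str, invocation: Invocation) -> None:
        """Append one run to the history of the named tool."""
        self._runs.setdefault(tool, []).append(invocation)

    def last_inputs(self, tool: str) -> Optional[Dict[str, Any]]:
        """Return a copy of the most recent inputs, e.g. to pre-fill a form."""
        runs = self._runs.get(tool)
        return dict(runs[-1].inputs) if runs else None

    def browse(self, tool: str) -> List[Invocation]:
        """List every earlier invocation, so one could be picked and re-run."""
        return list(self._runs.get(tool, []))


# Invented example data: two runs of an imaginary plume simulation tool.
history = ToolHistory()
history.record(
    "plume_sim",
    Invocation({"mesh": "site_a.msh"}, {"steps": 100}, {"plot": "run1.png"}),
)
history.record("plume_sim", Invocation({"mesh": "site_b.msh"}, {"steps": 250}))

print(history.last_inputs("plume_sim"))
```

The `browse` method corresponds to the planned "browse and re-run" feature; a real system would presumably persist the history in the repository rather than in memory.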



    • Ability to link related artifacts for forensic investigations
      • Both manually and automatically
    • Examples:
      • Link input data sets to models
      • Link conceptual model versions to results
      • Associate comments and analyses to simulation outputs
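
The linking described above can be pictured as a small directed graph. The following Python is a sketch under my own assumptions, not how Velo stores provenance: the `ProvenanceGraph` class, the relation names and the artifact names are hypothetical. It shows automatic links created when a run finishes, a manual link added by a user, and a "forensic" query for everything that depends on a given input.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class ProvenanceGraph:
    """Directed links between named artifacts (hypothetical, not Velo's model)."""

    def __init__(self) -> None:
        # source artifact -> set of (relation, target artifact)
        self._links: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def link(self, source: str, relation: str, target: str) -> None:
        """Manual linking: a user relates two artifacts."""
        self._links[source].add((relation, target))

    def record_run(self, model: str, inputs: List[str], outputs: List[str]) -> None:
        """Automatic linking: called when a simulation run completes."""
        for name in inputs:
            self.link(name, "input_to", model)
        for name in outputs:
            self.link(model, "produced", name)

    def downstream(self, artifact: str) -> Set[str]:
        """Everything reachable from an artifact by following links."""
        seen: Set[str] = set()
        stack = [artifact]
        while stack:
            current = stack.pop()
            for _, target in self._links.get(current, ()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen


# Invented example: one automatic run record plus one manual comment link.
graph = ProvenanceGraph()
graph.record_run("model_v2", inputs=["well_log.csv"], outputs=["plume.png"])
graph.link("plume.png", "annotated_by", "comment_17")

print(sorted(graph.downstream("well_log.csv")))
```

If the well log later turns out to be faulty, a query like `downstream` identifies the model version, the plot and the comment that may need to be revisited.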


Next Steps
    • We’re keen to work with others
      • To deploy Velo to support scientific communities
      • To partner on proposals
      • To collaborate on projects
      • To enhance the technology
    • We’ll open source the Velo technology mid-year
      • Downloadable
      • User documentation
      • Programmer documentation

