Work package 7: Data warehouse

Brief description and aims of work

Future biomedical informatics systems can lead to personalized, predictive and integrative information-based medicine if they use a sustainable, scalable infrastructure that can store and manage large data sets for long periods of time in an affordable manner. A sound understanding of the links between genes, diseases and treatments requires an advanced persistent storage infrastructure to store and manage vast amounts of distributed data with varying characteristics.

This work package will meet that requirement by developing data warehousing infrastructure components. Initially this data warehouse will store data on the three different clinical scenarios being considered by this project, but the warehousing system developed will be generic enough to apply to many different medical scenarios. The data warehouse will contain three different types of data: imaging data, structured clinical data, and file-based data such as histological data. The data stored in the data warehouse will be suitably (pseudo-)anonymized before leaving the clinical environment, with anonymized patient ID tags being used to label different data sets belonging to the same patient. The components will be easy to deploy at partner sites (e.g. by being released as self-contained virtual machine images) and will in effect constitute a private data cloud for use by the p-medicine project.
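As an illustration of how an anonymized patient ID tag could link data sets belonging to the same patient, the following is a minimal sketch using a keyed hash (HMAC-SHA256). The function name, the key, the identifiers and the tag length are all hypothetical; the project does not specify its pseudonymization scheme here, and this is not presented as the method the project uses.

```python
import hashlib
import hmac


def pseudonymize_patient_id(patient_id: str, site_key: bytes) -> str:
    """Derive a stable pseudonymous tag from a clinical patient identifier.

    The same (patient_id, site_key) pair always yields the same tag, so
    different data sets from one patient can be linked without exposing
    the original identifier.
    """
    digest = hmac.new(site_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability (assumption)


# Hypothetical secret key, assumed to stay inside the clinical environment.
SITE_KEY = b"example-key-kept-at-the-clinical-site"

# Two data sets of different types belonging to the same (made-up) patient.
imaging_record = {
    "patient": pseudonymize_patient_id("HOSP-000123", SITE_KEY),
    "type": "imaging",
}
clinical_record = {
    "patient": pseudonymize_patient_id("HOSP-000123", SITE_KEY),
    "type": "structured clinical",
}
# A data set belonging to a different (made-up) patient.
other_record = {
    "patient": pseudonymize_patient_id("HOSP-000456", SITE_KEY),
    "type": "imaging",
}
```

In this sketch the key never leaves the clinical site, so the tag can be recomputed there when new data for the same patient is exported, while holders of the warehouse data alone cannot recompute it from candidate identifiers.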

Much of the data that will be stored in this project will constitute a valuable, highly curated resource, useful not only to this project but also to many current and future projects. The requirements gathering exercise (WP2) and infrastructure design (WP3) will help inform the final design of this system. However, initial discussions with project partners indicate that the system will need to be distributed in nature, rather than a monolithic data warehouse. This work package will therefore develop data storage components that can be deployed at any partner site that wishes to share data. It will also develop the infrastructure and security services needed to facilitate the federation of these data sites.

The different types of data held in the repository will be unified by the application of suitable ontological tools. The cancer description developed in the FP6 ACGT project will be used as a starting point, and other ontologies will be developed or adopted as appropriate. Ontological markup will be added to the data held in the repository, enabling users, to some extent, to reason about the properties of the data rather than just making simple SQL-style queries across a series of tables.
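The difference between reasoning over ontological markup and a plain table lookup can be sketched as follows. This is a toy example only: the class names, the is-a hierarchy and the data set records are hypothetical and are not taken from the ACGT ontology or from the project's data.

```python
# Hypothetical "is-a" relations: each term maps to its direct parent class.
SUBCLASS_OF = {
    "nephroblastoma": "kidney_tumour",
    "kidney_tumour": "solid_tumour",
    "breast_carcinoma": "solid_tumour",
    "acute_lymphoblastic_leukaemia": "leukaemia",
    "solid_tumour": "cancer",
    "leukaemia": "cancer",
}

# Hypothetical data sets, each annotated with one ontology term.
DATASETS = [
    {"id": "ds1", "annotation": "nephroblastoma"},
    {"id": "ds2", "annotation": "breast_carcinoma"},
    {"id": "ds3", "annotation": "acute_lymphoblastic_leukaemia"},
]


def is_a(term, ancestor):
    """Return True if `term` equals `ancestor` or is a descendant of it."""
    while term is not None:
        if term == ancestor:
            return True
        term = SUBCLASS_OF.get(term)  # walk one step up the hierarchy
    return False


def find_datasets(concept):
    """Return ids of data sets whose annotation falls under `concept`."""
    return [d["id"] for d in DATASETS if is_a(d["annotation"], concept)]
```

Here `find_datasets("solid_tumour")` returns `ds1` and `ds2` even though neither record mentions the string "solid_tumour"; an exact-match query on the annotation column would return nothing. A real deployment would presumably rely on an ontology language and reasoner rather than a hand-written dictionary.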

The main objective of this work package is to develop a sustainable and persistent archive of clinical data. The specific objectives of this work package are:

  • To develop federated data warehouse infrastructure components to store the multiple different types of medical data produced in this project.
  • To develop storage services for large data objects.
  • To develop mechanisms for ensuring reliability and auditability of the data.
  • To develop and deploy suitable programmatic interfaces, to allow the different types of data to be uploaded to the warehouse via tools developed in other work packages.
  • To develop a service integrated in the web portal to allow users to search for and view available data.
  • To develop federated capability-based secure access mechanisms compliant with the legal and ethical framework of the project.
  • To integrate security and role based access, based on the legal and ethical framework developed in the project.
  • To integrate the disparate types of data produced and available in the project using appropriate ontological tools.
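One possible way to approach the auditability objective listed above is an append-only, hash-chained event log, in which each entry records the hash of the previous one so that later tampering becomes detectable. The sketch below is an assumption for illustration only; the field names and functions are hypothetical and the project does not state which mechanism it will adopt.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry
FIELDS = ("actor", "action", "dataset", "prev")


def entry_hash(body: dict) -> str:
    """Hash an entry body deterministically (keys sorted before hashing)."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_event(log: list, actor: str, action: str, dataset: str) -> None:
    """Append an event that is chained to the hash of the previous entry."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"actor": actor, "action": action, "dataset": dataset, "prev": prev}
    log.append({**body, "hash": entry_hash(body)})


def verify(log: list) -> bool:
    """Recompute every hash and check that the chain is unbroken."""
    prev = GENESIS
    for entry in log:
        body = {key: entry[key] for key in FIELDS}
        if entry["prev"] != prev or entry["hash"] != entry_hash(body):
            return False
        prev = entry["hash"]
    return True


# Usage with made-up actors and data set names.
audit_log = []
append_event(audit_log, "site-a-uploader", "upload", "ds1")
append_event(audit_log, "researcher-42", "read", "ds1")
```

Calling `verify(audit_log)` succeeds for the untouched log and fails if any earlier entry is altered afterwards, because the stored hashes no longer match the recomputed ones.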

Work package leader

Peter V. Coveney

University College London
Gower Street 15
WC1E 6BT London/United Kingdom