Semantic Mashup for Nicotine Dependence Research

Here’s an excellent paper by S Sahoo, O Bodenreider, JL Rutter, KJ Skinner, and AP Sheth titled “An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence”. From the abstract:

“This paper illustrates how Semantic Web technologies (especially RDF, OWL, and SPARQL) can support information integration and make it easy to create semantic mashups (semantically integrated resources). In the context of understanding the genetic basis of nicotine dependence, we integrate gene and pathway information and show how three complex biological queries can be answered by the integrated knowledge base.”

The paper describes the methodology they used to create an RDF store for data related to genes and pathways associated with nicotine dependence.  I was particularly interested in how they handled data extracted from the NCBI Entrez database.  I’ve struggled with this myself in generating RDF data from PubChem.  They developed a simple XML->RDF translator and then used OWL to provide the semantic framework for the results.
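
To make the idea concrete, here is a minimal sketch, in Java with Apache Jena, of the kind of XML->RDF translation described above. The Entrez Gene element names, the example namespace, and the Gene class and symbol property are illustrative assumptions of mine, not the mapping or OWL ontology the authors actually used.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class EntrezXmlToRdf {
    // Placeholder namespace for the generated resources (my assumption, not the paper's)
    static final String NS = "http://example.org/gene#";

    public static void main(String[] args) throws Exception {
        // Parse an Entrez Gene XML export; file name and element names are illustrative
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("entrez_gene.xml"));

        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("gene", NS);

        NodeList genes = doc.getElementsByTagName("Entrezgene");
        for (int i = 0; i < genes.getLength(); i++) {
            Element gene = (Element) genes.item(i);
            String geneId = gene.getElementsByTagName("Gene-track_geneid")
                                .item(0).getTextContent();
            String symbol = gene.getElementsByTagName("Gene-ref_locus")
                                .item(0).getTextContent();

            // One RDF resource per gene, typed against a class from the OWL schema
            Resource r = model.createResource(NS + geneId);
            r.addProperty(RDF.type, model.createResource(NS + "Gene"));
            r.addProperty(model.createProperty(NS, "symbol"), symbol);
        }

        model.write(System.out, "TURTLE");
    }
}
```

The point is simply that once the XML fields are lifted into triples typed against an OWL class, they can be merged with pathway data and queried with SPARQL.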

This is a good paper illustrating the advantages of using semantic technologies for data integration.

W3C Final Report on Relational Data Schemas to RDF

The W3C have published the final report from the RDB2RDF incubator group, with their recommendation that the W3C proceed to initiate a formal working group to standardize a language for mapping relational database schemas into RDF and OWL.

“Such a standard will enable the vast amount of data stored in relational databases to be published easily and conveniently on the Web.  It will also facilitate integrating data from separate relational databases and adding semantics to relational data.”

They go on to describe a use case for integration of enterprise information systems:

“Efficient information and data exchange between application systems within and across enterprises is of paramount importance in the increasingly networked and IT-dominated business atmosphere. Existing Enterprise Information Systems such as CRM, CMS and ERP systems use Relational database backends for persistence. RDF and Linked Data can provide data exchange and integration interfaces for such application systems, which are easy to implement and use, especially in settings where a loose and flexible coupling of the systems is required.”

It’s easy to see where this is headed.  One scenario, for drug discovery data integration, would be to publish data from relational databases to an RDF store while maintaining the RDBMS schema semantics.  Additional data semantics could then be layered on within the RDF store.
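
As a rough illustration of what such a table-to-triples mapping might look like, here is a sketch in Java that walks a relational table over JDBC and emits one RDF resource per row and one property per column, using Apache Jena. This is not the standard the proposed working group would produce; the table name, column names, connection URL, and base URI are all hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;

public class TableToRdf {
    public static void main(String[] args) throws Exception {
        String base = "http://example.org/db/";   // placeholder base URI
        Model model = ModelFactory.createDefaultModel();

        // Hypothetical connection and table; a suitable JDBC driver is assumed on the classpath
        try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/discovery", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM compound")) {

            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                // Subject URI built from the table name plus the primary key value
                Resource row = model.createResource(base + "compound/" + rs.getString("id"));
                for (int c = 1; c <= md.getColumnCount(); c++) {
                    String column = md.getColumnName(c);
                    String value = rs.getString(c);
                    if (value != null) {
                        // Each column becomes a predicate in a schema-derived namespace
                        row.addProperty(model.createProperty(base + "compound#", column), value);
                    }
                }
            }
        }
        model.write(System.out, "TURTLE");
    }
}
```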

Interview with Martin Leach of Merck Research Labs

Here’s an interview from BioInform with Martin Leach, Executive Director of Information Technology at Merck Research Labs. He discusses the issues and challenges of supporting Merck’s research data output.  Even though I’ve visited almost all of the large pharmaceutical companies over the years, it’s still hard to imagine the complexity of trying to manage the knowledge output across research. From my experience, Merck seems to do as good a job at this as any other organization I’ve been exposed to.

The practical nuts-and-bolts issues of how to deal with petabytes of data, especially data coming from high-resolution instruments such as Illumina’s, are a bit sobering.  Even though Martin Leach doesn’t seem impressed with the contributions the Semantic Web can make, I think that if he considered RDF, RDFS and OWL as the platform for data integration across instruments, he might see an opportunity that goes beyond exporting XML files and then importing them into Oracle.  Just using RDF to capture instrument metadata could be a significant step toward integrating experiments across laboratories.
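
As a small, purely hypothetical illustration of that last point, here is a sketch in Java (using Apache Jena) that records a few metadata properties for a sequencing run as RDF triples and then queries them with SPARQL. The lab: namespace, property names, and sample values are assumptions of mine, not anything from Merck’s environment.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;

public class InstrumentMetadata {
    static final String LAB = "http://example.org/lab#";   // placeholder namespace

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("lab", LAB);

        // A few metadata triples for one sequencing run: instrument, site, and date
        Resource run = model.createResource(LAB + "run42");
        run.addProperty(model.createProperty(LAB, "instrument"), "Illumina GA II");
        run.addProperty(model.createProperty(LAB, "site"), "Boston");
        run.addProperty(model.createProperty(LAB, "date"), "2009-02-10");

        // Find all runs on this instrument, whichever laboratory produced them
        String sparql =
            "PREFIX lab: <" + LAB + "> " +
            "SELECT ?run ?site WHERE { ?run lab:instrument \"Illumina GA II\" ; lab:site ?site }";

        QueryExecution qe = QueryExecutionFactory.create(sparql, model);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.getResource("run") + " ran at " + row.getLiteral("site"));
            }
        } finally {
            qe.close();
        }
    }
}
```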

Anyway, my hat’s off to these guys.  I’m sure there are many at Merck who don’t really have a good appreciation of their efforts.

Java Content Repositories

I’ve been exploring ideas (using Apache’s Jackrabbit) for a specialized Content Repository as the basis for a collaboration tool for researchers involved in drug discovery.  Most research project teams are organized as a matrix of specialized laboratory and computational skill sets that combine to collaborate on data acquisition, analysis, integration and publication.  Much of the knowledge produced is stored in a variety of structured, semi-structured and unstructured formats. Capturing the knowledge generated during the research workflow and supporting the variety of data formats is challenging.  However, I’m starting to see the value of applying a Content Repository data model for capturing research workflow data as an alternative to a traditional relational database.  Here’s an essay by Bertil Chapuis comparing the rationale for choosing content repositories versus relational databases.  This is an excellent introduction to the design of the Java Content Repository specification.

The Java community have worked to define a Java API specification called JSR 170: Content Repository for Java Technology API.  The Apache community have released a reference implementation of JSR 170 called Jackrabbit.
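
For a flavor of what the JSR 170 API looks like in practice, here is a small sketch using Jackrabbit’s TransientRepository to store an experiment record as a hierarchy of nodes and properties. The node names, property names, and credentials are illustrative assumptions, not part of the specification.

```java
import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import org.apache.jackrabbit.core.TransientRepository;

public class ExperimentRepository {
    public static void main(String[] args) throws Exception {
        // TransientRepository starts an embedded repository on first login
        Repository repository = new TransientRepository();
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            Node root = session.getRootNode();

            // Hierarchy: project -> experiment -> result, each node free to carry
            // its own mix of structured properties and unstructured content
            Node project = root.addNode("nicotine-project");
            Node experiment = project.addNode("assay-2009-01-30");
            experiment.setProperty("operator", "jsmith");
            experiment.setProperty("instrument", "plate reader");

            Node result = experiment.addNode("raw-data");
            result.setProperty("format", "csv");

            session.save();
        } finally {
            session.logout();
        }
    }
}
```

Because nodes can carry arbitrary properties and child nodes, the same hierarchy can hold structured metadata alongside unstructured attachments, which is exactly the mix a research workflow produces.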

LabAutomation 2009 Presentations

I recently participated in a couple of sessions at the LabAutomation 2009 conference in Palm Springs, CA.  LabAutomation is an annual meeting of the laboratory automation industry focused on the Life Sciences.

I participated in two sessions: “Data Management, Mining & Visualization” chaired by Petar Stojadinovic and “Ontologies and Semantic Technologies in Drug Discovery” chaired by Dr. Reinhold Shafer.

You can find my presentations at semanticlaboratories.com/labautomation2009.  This was the first LabAutomation meeting where semantic technologies were discussed as a potential technology for Life Sciences data integration, so the main goal at this meeting was to introduce the audience to the concepts behind the technology.

During the first session there was an excellent talk by Randall Julian, President of Indigo Biosystems, who described some of his work using semantic technologies for data integration in the Life Sciences. I’m hoping that he will share his presentation at some point in the near future.

During the second session there was an excellent talk by Alan Ruttenberg of Science Commons. Science Commons is a project within the Creative Commons framework.  Check out their excellent work at the Neurocommons, where they are developing a knowledge management platform for biological research.