S. aureus fitness

The Infectious Disease group at Merck have (finally) published a couple of papers related to the bacterial “fitness test” we developed at Elitra Pharmaceuticals.  Check out the paper “A Staphylococcus aureus Fitness Test Platform for Mechanism-Based Profiling of Antibacterial Compounds” in the current issue of Chemistry & Biology (Cell Press), which describes the design of this innovative experiment.

The S. aureus fitness test is a product of work we did at Elitra Pharmaceuticals, extending the invention by Allyn Forsyth* of engineering bacterial strains (via antisense) to control the expression of individual genes, making them more sensitive to antimicrobial compounds acting through specific targets.  By pooling a collection of engineered strains corresponding to all of the essential genes of S. aureus and screening antimicrobial compounds against the pool, we were able to discover targets for antimicrobial compounds whose mechanism of action had not yet been determined.  We later collaborated with Merck to apply the fitness test to, among other things, their library of natural products.  Merck acquired the technology platform after Elitra shut down.

My particular contribution to the experiment was the method of using unique fluorescently-labeled PCR primers corresponding to each of the individual essential-gene bacterial strains and resolving them by capillary electrophoresis using a multiplexing strategy.  I continue to believe that this approach is superior to using microarrays (an alternative successfully used by other groups), since the sensitivity of capillary electrophoresis allowed for semi-quantitative analysis of the multiplex PCR results and was quite flexible to iterations on the pooling and multiplex PCR design.

Congratulations to the Merck team for these publications and their commitment to the project.

*A genome-wide strategy for the identification of essential genes in Staphylococcus aureus

Semantic Mashup for Nicotine Dependence Research

Here’s an excellent paper by S Sahoo, O Bodenreider, JL Rutter, KJ Skinner, and AP Sheth titled “An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence”. From the abstract:

“This paper illustrates how Semantic Web technologies (especially RDF, OWL, and SPARQL) can support information integration and make it easy to create semantic mashups (semantically integrated resources). In the context of understanding the genetic basis of nicotine dependence, we integrate gene and pathway information and show how three complex biological queries can be answered by the integrated knowledge base.”

The paper describes the methodology they used to create an RDF store for data related to genes and pathways associated with nicotine dependence.  I was particularly interested in how they handled data extracted from the NCBI Entrez database.  I’ve struggled with this myself in generating RDF data from PubChem.  They developed a simple XML->RDF translator and then used OWL to provide the semantic framework for the results.
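Their actual translator isn’t shown in the post, but the core of the XML->RDF idea can be sketched with Python’s standard library.  The element names and the example namespace below are invented for illustration; they are not the schema the authors actually used.

```python
# A minimal sketch of an XML->RDF translator: walk a (hypothetical,
# simplified) Entrez-style gene record and emit N-Triples.
# The element names and namespace are assumptions, not the real schema.
import xml.etree.ElementTree as ET

NS = "http://example.org/gene/"

def xml_to_ntriples(xml_text):
    """Turn each <Gene> element into triples: one per child element."""
    root = ET.fromstring(xml_text)
    triples = []
    for gene in root.iter("Gene"):
        subject = f"<{NS}{gene.findtext('GeneID')}>"
        for child in gene:
            if child.tag == "GeneID":
                continue  # already used as the subject URI
            predicate = f"<{NS}has{child.tag}>"
            obj = '"%s"' % child.text.replace('"', '\\"')
            triples.append(f"{subject} {predicate} {obj} .")
    return triples

sample = """
<GeneSet>
  <Gene><GeneID>1137</GeneID><Symbol>CHRNA4</Symbol>
        <Description>cholinergic receptor nicotinic alpha 4</Description></Gene>
</GeneSet>
"""
for t in xml_to_ntriples(sample):
    print(t)
```

The point of the exercise is that once the records are triples, OWL classes and properties can be layered on top without touching the translator.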

This is a good paper illustrating the advantages of using semantic technologies for data integration.

W3C Final Report on Relational Data Schemas to RDF

The W3C have published the final report from the RDB2RDF incubator group, with their recommendation that the W3C proceed to initiate a formal working group to standardize a language for mapping relational database schemas into RDF and OWL.

“Such a standard will enable the vast amount of data stored in relational databases to be published easily and conveniently on the Web.  It will also facilitate integrating data from separate relational databases and adding semantics to relational data.”

They go on to describe a use case for integration of enterprise information systems:

“Efficient information and data exchange between application systems within and across enterprises is of paramount importance in the increasingly networked and IT-dominated business atmosphere. Existing Enterprise Information Systems such as CRM, CMS and ERP systems use Relational database backends for persistence. RDF and Linked Data can provide data exchange and integration interfaces for such application systems, which are easy to implement and use, especially in settings where a loose and flexible coupling of the systems is required.”

It’s easy to see where this is headed.  One scenario, for drug discovery data integration purposes, would be to publish data from relational databases to an RDF store while maintaining the semantics of the RDBMS schema.  Additional data semantics could then be integrated within the RDF store.
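A minimal sketch of what such a mapping standard automates, in the “direct mapping” flavor of the idea: every row becomes a subject URI and every column a predicate.  The compound table and its columns here are made up for illustration.

```python
# A naive direct-mapping sketch in the spirit of RDB2RDF:
# each row of a table becomes a subject URI, each column a predicate.
# Table and column names are hypothetical.
import sqlite3

BASE = "http://example.org/db/"

def table_to_triples(conn, table, key):
    """Publish every row of `table` as N-Triples, keyed on `key`."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    triples = []
    for row in cur:
        subject = f"<{BASE}{table}/{row[cols.index(key)]}>"
        for col, val in zip(cols, row):
            if col == key:
                continue  # the key column is folded into the subject URI
            triples.append(f'{subject} <{BASE}{table}#{col}> "{val}" .')
    return triples

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compound (id INTEGER, name TEXT, ic50 REAL)")
conn.execute("INSERT INTO compound VALUES (1, 'cmpd-001', 0.8)")
for t in table_to_triples(conn, "compound", "id"):
    print(t)
```

A real standard would of course also handle foreign keys (as object links between URIs) and datatypes, which is exactly what a formal working group would pin down.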

Interview with Martin Leach of Merck Research Labs

Here’s an interview from BioInform with Martin Leach, Executive Director of Information Technology at Merck Research Labs. He discusses the issues and challenges of supporting Merck’s research data output.  Even though I’ve visited almost all of the large pharmaceutical companies over the years, it’s still hard to imagine the complexity of trying to manage the knowledge output across research. From my experience, Merck seems to do as good a job at this as any other organization I’ve been exposed to.

The practical nuts-and-bolts issues of how to deal with petabytes of data, especially data coming from high-resolution instruments such as Illumina’s, are a bit sobering.  Even though Martin Leach doesn’t seem impressed with the contributions the Semantic Web can make, I think that if he thought of RDF, RDFS and OWL as the platform for data integration across instruments, he might see an opportunity that goes beyond exporting XML files and then importing them into Oracle.  Just using RDF to capture instrument metadata could be a significant step in integrating experiments across laboratories.
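Even a few triples of run-level metadata would make instruments queryable across labs.  The vocabulary and identifiers below are invented for illustration, not any existing instrument ontology:

```turtle
# Hypothetical vocabulary and identifiers, purely illustrative.
@prefix lab: <http://example.org/lab#> .

<http://example.org/run/2009-01-15-007>
    lab:instrument    <http://example.org/instrument/illumina-ga2> ;
    lab:operator      "J. Smith" ;
    lab:protocolName  "paired-end sequencing" ;
    lab:rawDataFile   "run007.srf" .
```

Because RDF is schema-last, each laboratory could start emitting metadata like this immediately and reconcile vocabularies later, rather than waiting for a shared Oracle schema.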

Anyway, my hat’s off to these guys.  I’m sure there are many at Merck who don’t really have a good appreciation of their efforts.

Java Content Repositories

I’ve been exploring ideas (using Apache’s Jackrabbit) for a specialized Content Repository as the basis for a collaboration tool to be used by researchers involved in drug discovery research.  Most research project teams are organized as a matrix of specialized laboratory and computational skill sets that combine to collaborate on data acquisition, analysis, integration and publication.  Much of the knowledge produced is stored in a variety of structured, semi-structured and unstructured formats. Capturing the knowledge generated during the research workflow and supporting the variety of data formats is challenging.  However, I’m starting to see the value of applying a Content Repository data model for capturing research workflow data as an alternative to a traditional relational database.  Here’s an essay by Bertil Chapuis comparing the rationale for choosing content repositories versus relational databases.  This is an excellent introduction to the design of the Java Content Repository specification.

The Java community have worked to define a Java API specification called JSR 170: Content Repository for Java Technology API.  The Apache community have released a reference implementation of JSR 170 called Jackrabbit.
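To illustrate why this data model suits research workflow data, here is a toy sketch of the JCR idea (a tree of nodes carrying arbitrary named properties) written in Python.  This is not the javax.jcr API, just the shape of the model it standardizes.

```python
# A toy model of the JCR content repository idea, NOT the javax.jcr API:
# content lives in a tree of nodes, each holding arbitrary named
# properties (strings, numbers, binary attachments) with no fixed schema.
class Node:
    def __init__(self, name):
        self.name = name
        self.properties = {}  # named values; no table schema required
        self.children = {}

    def add_node(self, name):
        child = Node(name)
        self.children[name] = child
        return child

    def get(self, path):
        """Resolve a /-separated path to a node, JCR-style."""
        node = self
        for part in path.strip("/").split("/"):
            node = node.children[part]
        return node

root = Node("")
run = root.add_node("projects").add_node("kinase-x").add_node("assay-run-12")
run.properties["instrument"] = "plate reader"
run.properties["report"] = b"...binary PDF attachment..."

print(root.get("/projects/kinase-x/assay-run-12").properties["instrument"])
```

The appeal for research workflows is visible even in the toy: structured results, notes, and binary attachments all hang off the same hierarchy, whereas a relational design would need a new table (or a blob column) for each.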

LabAutomation 2009 Presentations

I recently participated in a couple of sessions at the LabAutomation 2009 conference in Palm Springs, CA.  LabAutomation is an annual meeting of the laboratory automation industry focused on the Life Sciences.

I participated in two sessions: “Data Management, Mining & Visualization” chaired by Petar Stojadinovic and “Ontologies and Semantic Technologies in Drug Discovery” chaired by Dr. Reinhold Shafer.

You can find my presentations at semanticlaboratories.com/labautomation2009.  This was the first LabAutomation meeting where semantic technologies were discussed as a potential technology for Life Sciences data integration, so the main goal at this meeting was to introduce the audience to the concepts behind the technology.

During the first session there was an excellent talk by Randall Julian, President of Indigo Biosystems, who described some of his work using semantic technologies for data integration in the Life Sciences. I’m hoping that he will share his presentation at some point in the near future.

During the second session there was an excellent talk by Alan Ruttenberg of the Science Commons. Science Commons is a project within the Creative Commons framework.  Check out their excellent work at the Neurocommons, where they are conceiving a platform for knowledge management for biological research. 

D2RQ Platform

I ran across another reference to the D2RQ platform recently and decided to explore this RDF mapping tool.  D2RQ describes itself as a platform "for accessing non-RDF, relational databases as virtual, read-only RDF graphs. D2RQ offers a variety of different RDF-based access mechanisms to the content of huge, non-RDF databases without having to replicate the database into RDF.  Using D2RQ you can:

    • query a non-RDF database using SPARQL or find(spo) queries,
    • access information in a non-RDF database using the Jena API or the Sesame API,
    • access the content of the database as Linked Data over the Web,
    • ask SPARQL queries over the SPARQL Protocol against the database."
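As a flavor of what this enables, a SPARQL query like the following could be answered by D2RQ directly against the relational backend, rewritten into SQL on the fly.  The assay vocabulary here is hypothetical, not a mapping I have actually built:

```sparql
# Hypothetical vocabulary; given a D2RQ mapping, this query would be
# translated into SQL against the underlying relational tables.
PREFIX ex: <http://example.org/assay#>

SELECT ?compound ?ic50
WHERE {
  ?result ex:compound ?compound ;
          ex:ic50     ?ic50 .
  FILTER (?ic50 < 1.0)
}
ORDER BY ?ic50
```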

I tried it on some simple data sets I downloaded from PubChem BioAssay using the Jena API.  Originally I planned to build an RDF store from the bioassay data as a way to reason across data sets, but now I’m thinking I can accomplish some of my goals with D2RQ.  Descriptions of other similar tools can be found on the ESW Wiki.  D2RQ strikes me as an interesting platform for performing data integration across the various chemical screening data repositories generated in support of a drug discovery program.
