S. aureus fitness

The Infectious Disease group at Merck has (finally) published a couple of papers related to the bacterial “fitness test” we developed at Elitra Pharmaceuticals.  Check out the paper “A Staphylococcus aureus Fitness Test Platform for Mechanism-Based Profiling of Antibacterial Compounds” in the current issue of Chemistry & Biology, which describes the design of this innovative experiment.

The S. aureus fitness test is a product of work we did at Elitra Pharmaceuticals, extending Allyn Forsyth’s invention* of engineering bacterial strains (via antisense) to reduce the expression of individual genes, making each strain more sensitive to antimicrobial compounds acting through the corresponding target.  By pooling a collection of engineered strains covering all of the essential genes of S. aureus and screening antimicrobial compounds against the pool, we were able to discover the targets of antimicrobial compounds whose mechanism of action had not yet been determined.  We later collaborated with Merck to apply the fitness test to, among other things, their library of natural products.  Merck acquired the technology platform after Elitra shut down.

My particular contribution to the experiment was the method for deconvoluting the pooled strains: unique fluorescently labeled PCR primers corresponding to each of the individual essential-gene strains, resolved on a capillary gel using a multiplexing strategy.  I continue to believe that this approach is superior to using microarrays (an alternative used successfully by other groups), since the sensitivity of capillary electrophoresis allowed for semi-quantitative analysis of the multiplex PCR results and was quite flexible to iterations on the pooling and multiplex PCR design.

Congratulations to the Merck team for these publications and their commitment to the project.

*A genome-wide strategy for the identification of essential genes in Staphylococcus aureus

Semantic Mashup for Nicotine Dependence Research

Here’s an excellent paper by S Sahoo, O Bodenreider, JL Rutter, KJ Skinner, and AP Sheth titled “An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence”. From the abstract:

“This paper illustrates how Semantic Web technologies (especially RDF, OWL, and SPARQL) can support information integration and make it easy to create semantic mashups (semantically integrated resources). In the context of understanding the genetic basis of nicotine dependence, we integrate gene and pathway information and show how three complex biological queries can be answered by the integrated knowledge base.”

The paper describes the methodology they used to create an RDF store of gene and pathway data associated with nicotine dependence.  I was particularly interested in how they handled data extracted from the NCBI Entrez database; I’ve struggled with this myself in generating RDF data from PubChem.  They developed a simple XML-to-RDF translator and then used OWL to provide the semantic framework for the results.
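
Out of curiosity, here’s a minimal sketch of the kind of XML-to-RDF translation involved, written against the Jena API.  The namespace and the element-to-property mapping are invented for illustration (they are not the paper’s actual scheme): each record element under the document root becomes a resource, and its child elements become literal-valued properties.

    import javax.xml.parsers.DocumentBuilderFactory;
    import com.hp.hpl.jena.rdf.model.*;
    import org.w3c.dom.*;

    public class XmlToRdf {
        // Hypothetical namespace for the generated resources
        static final String NS = "http://example.org/entrez#";

        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(args[0]);
            Model model = ModelFactory.createDefaultModel();
            model.setNsPrefix("ex", NS);

            // Each child element of the root becomes an RDF resource;
            // each of its leaf elements becomes a literal-valued property.
            NodeList records = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < records.getLength(); i++) {
                if (!(records.item(i) instanceof Element)) continue;
                Element record = (Element) records.item(i);
                Resource subject = model.createResource(NS + record.getTagName() + "_" + i);
                NodeList fields = record.getChildNodes();
                for (int j = 0; j < fields.getLength(); j++) {
                    if (!(fields.item(j) instanceof Element)) continue;
                    Element field = (Element) fields.item(j);
                    subject.addProperty(model.createProperty(NS, field.getTagName()),
                            field.getTextContent().trim());
                }
            }
            model.write(System.out, "RDF/XML");
        }
    }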

This is a good paper illustrating the advantages of using semantic technologies for data integration.

W3C Final Report on Relational Data Schemas to RDF

The W3C has published the final report from the RDB2RDF Incubator Group, recommending that the W3C initiate a formal working group to standardize a language for mapping relational database schemas into RDF and OWL.

“Such a standard will enable the vast amount of data stored in relational databases to be published easily and conveniently on the Web.  It will also facilitate integrating data from separate relational databases and adding semantics to relational data.”

They go on to describe a use case for integration of enterprise information systems:

“Efficient information and data exchange between application systems within and across enterprises is of paramount importance in the increasingly networked and IT-dominated business atmosphere. Existing Enterprise Information Systems such as CRM, CMS and ERP systems use Relational database backends for persistence. RDF and Linked Data can provide data exchange and integration interfaces for such application systems, which are easy to implement and use, especially in settings where a loose and flexible coupling of the systems is required.”

It’s easy to see where this is headed.  One scenario, for drug discovery data integration, would be publishing data from relational databases to an RDF store while preserving the semantics of the RDBMS schema; additional semantics could then be layered on within the RDF store.
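
As a toy illustration of the idea (not the mapping language the working group would standardize, which doesn’t exist yet), here’s a Jena sketch that publishes rows from a hypothetical compound table as RDF, with the primary key minted into the resource URI and the columns carried over as properties.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Resource;

    public class CompoundTableToRdf {
        // Hypothetical vocabulary; a standardized mapping language would
        // declare this table-to-RDF correspondence instead of hard-coding it.
        static final String NS = "http://example.org/discovery#";

        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(args[0]); // JDBC URL
            Model model = ModelFactory.createDefaultModel();
            model.setNsPrefix("d", NS);

            ResultSet rs = conn.createStatement()
                    .executeQuery("SELECT id, name, mw FROM compound");
            while (rs.next()) {
                // The primary key is minted into the resource URI;
                // each column becomes a property of that resource.
                Resource c = model.createResource(NS + "compound/" + rs.getInt("id"));
                c.addProperty(model.createProperty(NS, "name"), rs.getString("name"));
                c.addProperty(model.createProperty(NS, "molecularWeight"),
                        model.createTypedLiteral(rs.getDouble("mw")));
            }
            model.write(System.out, "TURTLE");
            conn.close();
        }
    }

A standardized mapping language would replace this hard-coded correspondence with a declarative description of the schema, which is exactly what the incubator group is recommending.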

Interview with Martin Leach of Merck Research Labs

Here’s an interview from BioInform with Martin Leach, Executive Director of Information Technology at Merck Research Labs. He discusses the issues and challenges of supporting Merck’s research data output.  Even though I’ve visited almost all of the large pharmaceutical companies over the years, it’s still hard to imagine the complexity of trying to manage the knowledge output across research. From my experience, Merck seems to do as good a job at this as any other organization I’ve been exposed to.

The practical nuts-and-bolts issues of dealing with petabytes of data, especially data coming from high-resolution instruments such as Illumina’s sequencers, are a bit sobering.  Even though Martin Leach doesn’t seem impressed with the contributions the Semantic Web can make, if he thought of RDF, RDFS and OWL as a platform for data integration across instruments, he might see an opportunity that goes beyond exporting XML files and importing them into Oracle.  Just using RDF to capture instrument metadata could be a significant step toward integrating experiments across laboratories.
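
To make that concrete, here’s a minimal Jena sketch of capturing run-level instrument metadata as RDF.  The vocabulary, run identifier and file path are invented placeholders; in practice a shared instrument ontology would supply the terms.

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Resource;
    import com.hp.hpl.jena.vocabulary.DC;

    public class InstrumentRunMetadata {
        // Hypothetical vocabulary; a shared instrument ontology
        // would supply these terms in practice.
        static final String NS = "http://example.org/instrument#";

        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            model.setNsPrefix("inst", NS);

            // One resource per instrument run, described by a few triples
            Resource run = model.createResource(NS + "run/0042");
            run.addProperty(model.createProperty(NS, "instrument"),
                    "Illumina Genome Analyzer II");
            run.addProperty(model.createProperty(NS, "operator"), "jdoe");
            run.addProperty(DC.date, "2009-02-17");
            run.addProperty(model.createProperty(NS, "dataFile"),
                    model.createResource("file:///archive/runs/0042/reads.fastq"));

            model.write(System.out, "N-TRIPLE");
        }
    }

A few triples like these, attached to every data file an instrument emits, would let experiments from different laboratories be queried through one graph regardless of where the raw files live.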

Anyway, my hat’s off to these guys.  I’m sure there are many at Merck who don’t fully appreciate their efforts.

Java Content Repositories

I’ve been exploring ideas (using Apache’s Jackrabbit) for a specialized content repository as the basis for a collaboration tool for researchers involved in drug discovery.  Most research project teams are organized as a matrix of specialized laboratory and computational skill sets that collaborate on data acquisition, analysis, integration and publication.  Much of the knowledge produced is stored in a variety of structured, semi-structured and unstructured formats, and capturing the knowledge generated during the research workflow across all of these formats is challenging.  However, I’m starting to see the value of applying a content repository data model for capturing research workflow data as an alternative to a traditional relational database.  Here’s an essay by Bertil Chapuis comparing the rationale for choosing content repositories versus relational databases; it is an excellent introduction to the design of the Java Content Repository specification.

The Java community has defined a Java API specification called JSR 170: Content Repository for Java Technology API.  The Apache community has released a reference implementation of JSR 170 called Jackrabbit.
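
For a feel of the JCR model, here’s a minimal sketch using Jackrabbit’s embedded TransientRepository.  The project/experiment hierarchy and the property names are invented for illustration; the point is that hierarchy, properties and files live in one repository without an up-front relational schema.

    import javax.jcr.Node;
    import javax.jcr.Repository;
    import javax.jcr.Session;
    import javax.jcr.SimpleCredentials;

    import org.apache.jackrabbit.core.TransientRepository;

    public class JcrSketch {
        public static void main(String[] args) throws Exception {
            // TransientRepository starts an embedded Jackrabbit instance
            // that shuts down when the last session logs out.
            Repository repository = new TransientRepository();
            Session session = repository.login(
                    new SimpleCredentials("admin", "admin".toCharArray()));
            try {
                // Hypothetical hierarchy: project -> experiment, with
                // arbitrary properties and no up-front schema.
                Node root = session.getRootNode();
                Node project = root.addNode("projectX");
                Node experiment = project.addNode("assay-2009-01");
                experiment.setProperty("scientist", "jdoe");
                experiment.setProperty("status", "complete");
                session.save();

                // Content is addressed by path, like a file system
                Node found = root.getNode("projectX/assay-2009-01");
                System.out.println(found.getProperty("scientist").getString());
            } finally {
                session.logout();
            }
        }
    }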

LabAutomation 2009 Presentations

I recently participated in a couple of sessions at the LabAutomation 2009 conference in Palm Springs, CA.  LabAutomation is an annual meeting of the laboratory automation industry focused on the Life Sciences.

I participated in two sessions: “Data Management, Mining & Visualization” chaired by Petar Stojadinovic and “Ontologies and Semantic Technologies in Drug Discovery” chaired by Dr. Reinhold Shafer.

You can find my presentations at semanticlaboratories.com/labautomation2009.  This was the first LabAutomation meeting where semantic technologies were discussed as a potential technology for Life Sciences data integration, so my main goal was to introduce the audience to the concepts behind the technology.

During the first session there was an excellent talk by Randall Julian, President of Indigo Biosystems, who described some of his work using semantic technologies for data integration in the Life Sciences. I’m hoping that he will share his presentation at some point in the near future.

During the second session there was an excellent talk by Alan Ruttenberg of Science Commons, a project within the Creative Commons framework.  Check out their excellent work at the Neurocommons, where they are building a knowledge management platform for biological research.

D2RQ Platform

I ran across another reference to the D2RQ platform recently and decided to explore this RDF mapping tool.  D2RQ describes itself as a platform "for accessing non-RDF, relational databases as virtual, read-only RDF graphs. D2RQ offers a variety of different RDF-based access mechanisms to the content of huge, non-RDF databases without having to replicate the database into RDF.  Using D2RQ you can:

    • query a non-RDF database using SPARQL or find(spo) queries,
    • access information in a non-RDF database using the Jena API or the Sesame API,
    • access the content of the database as Linked Data over the Web,
    • ask SPARQL queries over the SPARQL Protocol against the database."

I tried it on some simple data sets I downloaded from PubChem BioAssay, using the Jena API.  Originally I planned to build an RDF store from the bioassay data as a way to reason across data sets, but now I’m thinking I can accomplish some of my goals with D2RQ.  Descriptions of other similar tools can be found on the ESW Wiki.  D2RQ strikes me as an interesting platform for performing data integration across the various chemical screening data repositories generated in support of a drug discovery program.
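
For reference, here’s roughly what this looks like through the Jena API, assuming the ModelD2RQ wrapper shipped with the D2RQ distribution.  The mapping file name and the bioassay vocabulary below are placeholders for whatever your mapping actually declares about the underlying tables.

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.QuerySolution;
    import com.hp.hpl.jena.query.ResultSet;

    import de.fuberlin.wiwiss.d2rq.ModelD2RQ;

    public class BioassayQuery {
        public static void main(String[] args) {
            // The mapping file (a placeholder name here) declares how the
            // relational bioassay tables map onto RDF classes and properties.
            ModelD2RQ model = new ModelD2RQ("file:bioassay-mapping.n3");

            // Hypothetical vocabulary; the real terms come from the mapping
            String sparql =
                "PREFIX bio: <http://example.org/bioassay#> " +
                "SELECT ?compound ?activity WHERE { " +
                "  ?result bio:compound ?compound ; bio:activity ?activity . " +
                "} LIMIT 10";

            QueryExecution qe = QueryExecutionFactory.create(sparql, model);
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.get("compound") + "\t" + row.get("activity"));
            }
            qe.close();
            model.close();
        }
    }

The appeal is that the SPARQL query runs against the live database, so nothing has to be replicated into a triple store before you can start integrating.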

OWL 2 Web Ontology Language Drafts

The W3C has just released draft specifications for the OWL 2 Web Ontology Language. Interestingly, the sub-languages of OWL 2 are being called "profiles".  As described in the Profiles document, there are three OWL 2 profiles currently specified (a small reasoning sketch follows the quoted descriptions):

    • "OWL 2 EL is particularly useful in applications employing ontologies that contain very large numbers of properties and/or classes: it captures the expressive power used by many such ontologies and is a subset of OWL 2 for which the basic reasoning problems can be performed in time that is polynomial with respect to the size of the ontology [EL++]. Dedicated reasoning algorithms for this profile are available and have been demonstrated to be implementable in a highly scalable way.
    • OWL 2 QL is aimed at applications that use very large volumes of instance data, and where query answering is the most important reasoning task. In OWL 2 QL, conjunctive query answering can be implemented using conventional relational database systems, and can directly access data stored in such systems. Using this technique, sound and complete query answering can be performed in LOGSPACE with respect to the size of the data (assertions). As in OWL 2 EL, there are polynomial time algorithms for consistency, subsumption, and classification reasoning. The expressive power of the profile is necessarily quite limited, although it does include most of the main features of conceptual models such as UML class diagrams and ER diagrams.
    • OWL 2 RL is aimed at applications that require scalable reasoning without sacrificing too much expressive power. It is designed to accommodate both OWL 2 applications that can trade the full expressivity of the language for efficiency, and RDF(S) applications that need some added expressivity. OWL 2 RL reasoning systems can be implemented using rule-based reasoning engines. Such rule-based approaches can be used to perform consistency, satisfiability, subsumption, classification, instance checking, and conjunctive query answering in time that is polynomial with respect to the size of the ontology."
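
What the profiles have in common is that they restrict OWL 2 to constructs whose core inferences stay tractable, for example propagating class membership through a subclass hierarchy.  Here’s a toy Jena sketch of that kind of inference; the assay vocabulary is invented, and Jena’s rule reasoner stands in for a dedicated OWL 2 profile reasoner.

    import com.hp.hpl.jena.ontology.Individual;
    import com.hp.hpl.jena.ontology.OntClass;
    import com.hp.hpl.jena.ontology.OntModel;
    import com.hp.hpl.jena.ontology.OntModelSpec;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class ProfileSketch {
        // Toy vocabulary, invented for illustration
        static final String NS = "http://example.org/assay#";

        public static void main(String[] args) {
            // A rule-based reasoner, standing in for a dedicated
            // OWL 2 profile reasoner
            OntModel model = ModelFactory.createOntologyModel(
                    OntModelSpec.OWL_MEM_MICRO_RULE_INF);

            OntClass assay = model.createClass(NS + "Assay");
            OntClass bindingAssay = model.createClass(NS + "BindingAssay");
            bindingAssay.addSuperClass(assay);

            Individual run42 = model.createIndividual(NS + "run42", bindingAssay);

            // The subclass axiom lets the reasoner infer that run42
            // is also an Assay: prints "true"
            System.out.println(run42.hasRDFType(assay));
        }
    }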

Semantic Web Industry Review

David Provost has published an industry survey titled "On The Cusp: A Global Review of the Semantic Web Industry", in which he reviews the current international players in the Semantic Web arena.  From his conclusions:

"The Semantic Web industry is alive, well, and it’s increasingly competitive as a commercial technology. At this point, there are too many success stories and too much money being invested to dismiss the technology as non-viable. The Semantic Web is presently building a track record, which means the big wins and unanticipated uses are yet to come. In the meantime, adoption is occurring, and the early news is very good indeed."

Although I’m focused on open source technologies for my Semantic Web development, companies like TopQuadrant are definitely on my radar for future integration.

Corporate (and Research) Intranet Wikis

Here’s a recent article in CIO by C.G. Lynch making the argument for integrating wikis into a corporate intranet. For those who are using or experimenting with wikis, the benefits are a no-brainer.  The ease with which wikis support sharing of collaborative information is well documented, and I use dozens of public wikis on a regular basis for collecting information from the research projects I follow. It truly is a "bottom up" technology, as the CIO article suggests, where the wiki authors are in charge of the platform.  Many public wikis allow any user to add content.  A research organization, however, can present additional challenges, such as restricted access and data security; in these cases, user authentication should be required both for authoring and for accessing a wiki. Setting up single sign-on for intranet web applications is still a challenge, at least in my hands, and for this routine monitoring and configuration by the IT team are necessary.

There are many excellent open-source wikis.  Probably the best known is MediaWiki, the software behind Wikipedia.  I was able to install and configure MediaWiki myself with no problems.  Here is a comparison of wiki software provided by Wikipedia.  The CIO article describes Socialtext, a commercial product targeted at the enterprise.

Much like content management systems, discussion forums and blogs, wikis are an excellent research IT component for supporting the "documentation" of the projects and processes of a typical research organization. By reducing the barriers to timely publication of project-related information, wikis offer a robust platform for capturing research knowledge that can be further mined by intranet search engines.