Mass Spec Data Standards
Proteomics and transcriptomics grew up relatively isolated
from each other, and with separate data standards. The result: a challenge to
integrate data, and the need for a single, better standard that works well with
XML.
One of the great challenges for bioinformatics over the
course of the next decade will be erecting a framework that allows the interpretation
of transcription and proteomic data in a consistent and informative way. At
the moment, large amounts of both types of data are being generated by laboratories
all over the world, but little headway has been made in linking the information
derived from bioinformatics analysis into the larger, worldwide accumulation
of archived analytical results. For now, biologists must interpret their experiments
in a vacuum.
Creating archives of information to allow rapid query
and visualization of proteomic and transcriptomic information will not be easy.
The two fields have developed in near isolation from one another — they use
different sequence sets, accession numbers, interchange standards, and terminology.
Any effort to unite the two fields will require a good set of standards, and
the hardest part will be getting the details right. It’s possible to jury-rig
compatibility between each of the standards and a given program (more on that later),
but that’s not a long-term solution.
I recently attended a meeting organized by the NIH to
try to reach some consensus on necessary standards in the sub-field of proteomics
associated with the complicated chromatographic and mass spectrometric experimental
setups used to produce lists of the proteins in a sample. A wide variety of
different experimental and informatics techniques are employed, but they tend
to be lumped together under the general name “protein identification.”
A group looking at the very complex problem of how to
exchange data on protein-protein interactions has made a lot of headway, and
I expect it to arrive at a standard everyone can live with. On the other hand,
the group looking at the comparatively simple problem of how to represent a
mass spectrum (a histogram made up of mass-intensity tuples) has split into
at least two camps girding for what will probably be a war of attrition between
competing standards, one European and one American.
Too Many Standards
How could something so simple go so wrong? There already
are perfectly acceptable ways to represent these tuples. The most commonly used
interchange format is the structured text file specified by one of the pioneering
companies in this area, Matrix Science. It was invented as an alternative to
the proprietary formats provided by instrument vendors, and it has been generally
accepted because it is easy to parse and rigid enough so that you can’t express
the same thing in too many different ways.
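That format, now generally known as the Mascot Generic Format, is plain text. A minimal fragment with invented values would look something like this:

BEGIN IONS
TITLE=a one-peak example spectrum
PEPMASS=1430.83
CHARGE=2+
1430.83 2205.0
END IONS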
To invent an XML dialect to replace this simple format, the Human
Proteome Organization impaneled a subcommittee of its Proteomics Standards Initiative.
Now, after several years of deliberation, the committee has published the schema
for its approved XML dialect, mzData. A code fragment representing a very simple
spectrum in mzData would look something like this:
<mzArrayBinary>
  <data precision="32" endian="little" length="1">ZtqyRA==</data>
</mzArrayBinary>
<intenArrayBinary>
  <data precision="32" endian="little" length="1">ANAJRQ==</data>
</intenArrayBinary>
The data tags enclose Base64 encoded floating point numbers.
The syntax is similar to the older Generalized Analytical Markup Language (GAML), but the
structure of mzData is defined so that it can only be applied to mass spectrometry
data.
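Decoding these values is straightforward in any language with Base64 and IEEE-754 support. Here is a minimal Python sketch (the function name is mine; the attribute values match the fragment above):

import base64
import struct

def decode_mzdata_array(text, endian="little", precision=32):
    # Decode the Base64 text of an mzData <data> element into floats.
    raw = base64.b64decode(text)
    fmt = "f" if precision == 32 else "d"
    order = "<" if endian == "little" else ">"
    count = len(raw) // struct.calcsize(fmt)
    return list(struct.unpack(order + str(count) + fmt, raw))

# The two arrays from the fragment above describe a single peak.
print(decode_mzdata_array("ZtqyRA=="))  # m/z array, roughly [1430.83]
print(decode_mzdata_array("ANAJRQ=="))  # intensity array, [2205.0]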
Not to be outdone, a group from the Institute for Systems
Biology has created its own standard, a dialect called mzXML, as part of its
ironically named “MS glossolalia” project. This dialect would record the same
information in a slightly different way:
<peaks precision="32" byteOrder="network"
       pairOrder="m/z-int">yqtZARQJRANA==</peaks>
This format is similar to mzData, the main difference being that rather
than splitting the elements of the tuples into two separate arrays, the tuples
are stored as interleaved pairs of Base64-encoded numbers, in the order given
by the value of the pairOrder attribute.
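A corresponding Python sketch of the mzXML convention, assuming 32-bit precision and the attribute values shown (network byte order is big-endian); for illustration it round-trips the peak decoded from the mzData example:

import base64
import struct

def encode_mzxml_peaks(pairs):
    # Interleave (m/z, intensity) pairs as big-endian 32-bit floats.
    flat = [value for pair in pairs for value in pair]
    raw = struct.pack(">" + str(len(flat)) + "f", *flat)
    return base64.b64encode(raw).decode("ascii")

def decode_mzxml_peaks(text):
    # Inverse: decode and regroup into (m/z, intensity) pairs.
    raw = base64.b64decode(text)
    flat = struct.unpack(">" + str(len(raw) // 4) + "f", raw)
    return list(zip(flat[0::2], flat[1::2]))

encoded = encode_mzxml_peaks([(1430.83, 2205.0)])
print(decode_mzxml_peaks(encoded))  # [(1430.83..., 2205.0)]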
The two formulations have a similar structure, mzData
being somewhat more verbose. However, as the W3C XML Commandment says, “Terseness
in XML markup is of minimal importance.” More important from the point of view
of adoption is another W3C XML commandment: “It shall be easy to write programs
which process XML documents,” and it is here that both standards run into trouble.
The choice of Base64 encoding may have seemed like a good way to reduce file
sizes, but differences between how Base64 and XML handle white space often make
the encoded text difficult to parse. The current dialects of mzData and mzXML
try to skirt the issue by excluding white space from the encoded text, but that
means an mzXML or mzData document considered valid by XML and Base64 rules may
be declared invalid by a dialect-specific parser. It also means that XSLT, the
accepted means of translating one markup language into another, cannot easily
be used to translate between mzXML and mzData; instead, a platform-specific
script is needed to convert between the two representations and check for
white-space dialect compliance.
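A defensive reader can sidestep the problem by normalizing the text content before decoding. A one-function Python sketch:

import base64

def forgiving_b64decode(text):
    # White space is legal in XML text content but trips up strict
    # Base64 readers, so strip it all before decoding.
    return base64.b64decode("".join(text.split()))

A strict, dialect-compliant writer would do the opposite: refuse to emit any white space inside the encoded text in the first place.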
Is all lost? Can we never agree? Fortunately, Rob Craig,
my chief programmer, and Patrick Lacosse from the Department of Medicine bioinformatics
group at Laval University rigged up compatibility with both mzXML and mzData
for one of our open source projects, adapting a C++ mzXML interpreter available
at http://sashimi.sourceforge.net
to read both standards.
The self-documenting nature of XML, the fact that only
a few tags enclose useful data for a particular application, and the availability
of some good open source code mean that so long as the standard dialects are
at least valid XML, it is pretty easy to use either one (or both). Here’s hoping
for a better solution down the road — and, in the meantime, no further additions
to our mix of standards.
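To illustrate the point, here is a rough Python sketch of a dialect-agnostic peak reader. It keys on the tag names from the fragments above and assumes 32-bit precision and the default byte orders; real documents carry namespaces and attributes it ignores:

import base64
import struct
import xml.etree.ElementTree as ET

def _floats(element, order):
    # Decode an element's Base64 text into 32-bit floats,
    # tolerating any white space in the text content.
    raw = base64.b64decode("".join(element.text.split()))
    return struct.unpack(order + str(len(raw) // 4) + "f", raw)

def read_peaks(path):
    # Return (m/z, intensity) pairs from either dialect.
    root = ET.parse(path).getroot()

    def strip_ns(tag):
        return tag.rsplit("}", 1)[-1]

    peaks = [el for el in root.iter() if strip_ns(el.tag) == "peaks"]
    if peaks:  # mzXML: one interleaved, network-byte-order array
        flat = _floats(peaks[0], ">")
        return list(zip(flat[0::2], flat[1::2]))
    data = [el for el in root.iter() if strip_ns(el.tag) == "data"]
    if len(data) >= 2:  # mzData: separate m/z and intensity arrays
        return list(zip(_floats(data[0], "<"), _floats(data[1], "<")))
    return []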
Ron Beavis has developed instrumentation and informatics
for protein analysis since joining Brian Chait’s group at Rockefeller University
in 1989. He currently runs his own bioinformatics design and consulting company,
Beavis Informatics, based in Winnipeg, Canada.
Copyright © 2004 GenomeWeb LLC. All Rights Reserved.