Last week, I was a bit baffled by a tweet from our "Linked Data" guru Tim
Williams. It said:
"A move to
XML would be the same 'inside the box' thinking that has perpetuated the XPT
format. We need more forward thinking".
I agree with the latter (forward thinking), but not with the first part. I
think Tim mixed up a few things here, probably due to the 280-character limit
of Twitter.
In my opinion, the problem is not XML, nor is it JSON or Notation3. These are
transport formats. RDF Linked Data also has an XML-based transport format for
its triples: RDF/XML.
What Tim probably meant is that keeping SDTM, but just replacing the XPT format
with an XML format, is not the solution. If that is what he meant, I can
partially agree. In my opinion, it is not that important in which format we
transport the data, as long as the format does not limit us in WHAT we can
transport. SAS-XPT clearly limits us in what we can transport: it makes it
impossible to transport the metadata, it only supports US-ASCII characters
(which already excludes over 50 million Spanish-speaking people in the USA, or
requires that everything be translated and reformatted to US-ASCII), and it
does not allow us to use more than 8 characters for codes, 40 for test code
descriptions and 200 characters for free text (or one must "split" into the
terrible SUPPQUALS). XML (or JSON) would at least solve these problems.
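To make these limits concrete, here is a minimal Python sketch that flags values which would not fit. The limits are the ones quoted above; the dictionary keys and the function name `xpt_problems` are purely illustrative and not part of any standard or tool.

```python
# Limits as quoted in this post; the key names are illustrative assumptions.
XPT_LIMITS = {"code": 8, "test_description": 40, "free_text": 200}


def xpt_problems(kind: str, value: str) -> list[str]:
    """Return the reasons why `value` would not fit in an XPT field of this kind."""
    problems = []
    if not value.isascii():
        problems.append("contains non-US-ASCII characters")
    limit = XPT_LIMITS[kind]
    if len(value) > limit:
        problems.append(f"{len(value)} characters exceeds the limit of {limit}")
    return problems


print(xpt_problems("code", "SYSBP"))
print(xpt_problems("code", "HEARTRATE"))
print(xpt_problems("free_text", "Dolor de cabeza después de la comida"))
```

The first call reports no problems, the second trips the 8-character limit, and the third is rejected only because of the "é". An XML or JSON transport format would accept all three values unchanged.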
The real
problem however is that SDTM, originally developed with relational databases in
mind, has degraded to a VIEW on a database whereas the reviewers at the FDA
essentially never get the relational database itself. In SDTM, more and more
derived variables have been added, for "the ease of review", with duplicate
information in different tables, thus violating every one of the "normal forms" of relational databases.
SDTM tables do not have foreign keys, which is of course usual for "views".
Essentially, the "foreign keys", describing the relations between data
(bringing us back to “linked data”) are only described in the SDTM-IG, a PDF
document that is not machine-readable.
So the real
problem is the SDTM model, as it is now.
SDTM is "two-dimensional", "table" thinking. There are relations, but these are only
described in a PDF (the "Implementation Guide"). Simple examples are that each
value of USUBJID in a non-DM dataset must also appear in DM, or that
DM.RFXSTDTC must match the start of the first exposure record for that subject
in either EX or EC. And --DY variables should not be in a "relational" SDTM at
all, as they can be derived "on the fly".
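Both points are easy to express in code once the relations are machine-readable. The Python sketch below uses made-up records and hypothetical helper names (`orphan_subjects`, `study_day`). It assumes the usual SDTM convention that study days are counted from DM.RFSTDTC, that the reference date itself is day 1, and that there is no day 0.

```python
from datetime import date

# Made-up example records; real datasets carry many more variables.
dm = [
    {"USUBJID": "S-001", "RFSTDTC": "2020-03-10"},
    {"USUBJID": "S-002", "RFSTDTC": "2020-03-12"},
]
vs = [
    {"USUBJID": "S-001", "VSDTC": "2020-03-10"},
    {"USUBJID": "S-002", "VSDTC": "2020-03-11"},
    {"USUBJID": "S-003", "VSDTC": "2020-03-15"},
]


def orphan_subjects(dm_rows, other_rows):
    """USUBJID values used in another dataset that do not appear in DM."""
    known = {row["USUBJID"] for row in dm_rows}
    return sorted({row["USUBJID"] for row in other_rows} - known)


def study_day(dtc, rfstdtc):
    """Derive a --DY value on the fly from two ISO 8601 dates (no day 0)."""
    delta = (date.fromisoformat(dtc) - date.fromisoformat(rfstdtc)).days
    return delta + 1 if delta >= 0 else delta


print(orphan_subjects(dm, vs))                # ['S-003']
print(study_day("2020-03-10", "2020-03-10"))  # 1
print(study_day("2020-03-11", "2020-03-12"))  # -1
```

With a declared foreign key, the first check would be enforced by the database itself, and the second function shows why storing --DY adds nothing that cannot be recomputed.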
It is much better when relationships between objects are explicit, as is the
case, for example, in FHIR (references between resources), and of course in
"linked data", where they are at the core of the model. For the transport
format, FHIR supports XML, JSON and Turtle. An example
of the Patient resource in Turtle can be found here.
Implementations are supposed to support at least XML and JSON (most do). For
SDTM, the FDA only supports ... XPT. Essentially, the FDA should be able to
accept XPT (as a legacy format) as well as XML and JSON. If a normal hospital
information system can accept multiple formats, why shouldn't the FDA?
But, and this is where Tim is right (although unfortunately expressed), in the
longer term "linked data" for submissions will probably be the right choice,
especially when more and more data do not come from CRFs anymore, but from a
variety of sources, including EHRs, devices, etc., which encourages the use of
"biomedical concepts" (BCs). FHIR is essentially already "linked data", but
organized a bit differently than with "triples". A LOINC code already
essentially represents a BC anyway.
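As a rough illustration of how close FHIR already is to "triples", the Python sketch below flattens a simplified Observation into (subject, predicate, object) tuples. The resource content, the predicate names and the function `to_triples` are my own assumptions for illustration; this is not the official FHIR RDF mapping.

```python
# A heavily simplified FHIR Observation (LOINC 8480-6: systolic blood pressure).
observation = {
    "resourceType": "Observation",
    "id": "bp1",
    "code": {"coding": [{"system": "http://loinc.org", "code": "8480-6"}]},
    "subject": {"reference": "Patient/123"},
}


def to_triples(resource):
    """Turn the explicit links of one resource into (subject, predicate, object) tuples."""
    subject = f'{resource["resourceType"]}/{resource["id"]}'
    # The reference to the patient is an explicit, machine-readable link.
    triples = [(subject, "subject", resource["subject"]["reference"])]
    # Each coding points to a concept in an external terminology.
    for coding in resource["code"]["coding"]:
        triples.append((subject, "code", f'{coding["system"]}/{coding["code"]}'))
    return triples


for triple in to_triples(observation):
    print(triple)
```

The link to the patient and the link to the LOINC concept are both part of the data itself, not of a separate PDF.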
Essentially, whether "linked data" is transported using the format of
Notation3, JSON or XML, I do not really care, as long as the format does not
limit us in what we can transport. So the question "XML or Linked Data" is not
the right one. The question should be "SDTM or Linked Data".
It will not be easy to get rid of SDTM. It has been a huge success, as there
was nothing before SDTM. The merit of SDTM is that it ensured that, for the
first time in clinical research history, clinical data were submitted in a
standardized (semantic) way. Unfortunately, in the last 10 years, first
principles were abandoned for "ease of review", deteriorating the model itself.
The fact that ever more domains were added and some variables (like --CAT) have
been abused for different kinds of purposes doesn't make it better of course.
However, moving from XPT to an XML format for SDTM submissions would at least
solve some of the major problems of XPT, for as long as SDTM still exists and
is still required by the FDA. We cannot wait another 10 years before the FDA
even starts thinking about "linked data".
In the longer term, moving to "linked data", also for submissions, is surely a
good idea, but we can then still use RDF/XML, JSON, Notation3 or any other
format for transporting data between A and B.
But honestly,
will we still need "files" with "formatted" information for submissions in the
future? Or can reviewers start doing reviews long before the last data point is
collected, using machine-machine communication as described in the blog "The 0.5MB submission"?
Such an approach could save the lives of tens of thousands of patients each year,
by making new therapies available considerably earlier.