Tuesday, October 30, 2018

SDTM: XML or Linked Data?

Last week, I was a bit baffled by a tweet from our “Linked Data” guru Tim Williams. It said:
"A move to XML would be the same 'inside the box' thinking that has perpetuated the XPT format. We need more forward thinking".


I agree with the latter (forward thinking), but not with the first part. I think Tim mixed up a few things here, probably due to the 280-character limitation of Twitter. In my opinion, the problem is not XML, nor is it JSON or Notation3. These are transport formats. RDF Linked Data also has an XML-based transport format for the triples: RDF/XML.

What Tim probably meant is that keeping SDTM, but just replacing the XPT format by an XML format, is not the solution. If that is what he meant, I can partially agree. In my opinion, it is not that important in which format we transport the data, as long as the format does not limit us in WHAT we can transport. SAS-XPT clearly limits us in what we can transport: it makes it impossible to transport the metadata, it only supports US-ASCII characters (which already excludes over 50 million Spanish-speaking people in the USA, unless everything is translated and formatted to US-ASCII), and it does not allow us to use more than 8 characters for codes, 40 for test code descriptions and 200 characters for free text (or one must "split" into the terrible SUPPQUALS). XML (or JSON) would at least solve these problems.
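To make these limits concrete, here is a minimal Python sketch that checks a single value against the XPT restrictions listed above. The function name `xpt_problems` and the category names in `XPT_LIMITS` are made up for this illustration; only the numbers (8, 40, 200) and the US-ASCII restriction come from the text.

```python
# Hypothetical helper, for illustration only: the limits are the ones
# named in the text, the names and categories are assumptions.
XPT_LIMITS = {"code": 8, "test_description": 40, "free_text": 200}


def xpt_problems(kind, value):
    """Return a list of reasons why `value` would not fit into XPT."""
    problems = []
    # XPT only supports US-ASCII characters.
    if not value.isascii():
        problems.append("contains non-US-ASCII characters")
    # Each kind of value has a maximum length.
    limit = XPT_LIMITS[kind]
    if len(value) > limit:
        problems.append(f"longer than {limit} characters ({len(value)})")
    return problems


print(xpt_problems("code", "GLUC"))
print(xpt_problems("code", "HEMOGLOBIN"))
print(xpt_problems("free_text", "Señor"))
```

A value that fails either check must be truncated, transliterated or split before it can go into an XPT file. An XML or JSON file would simply carry it as is.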

The real problem, however, is that SDTM, originally developed with relational databases in mind, has degraded to a VIEW on a database, whereas the reviewers at the FDA essentially never get the relational database itself. In SDTM, more and more derived variables have been added for "the ease of review", with duplicate information in different tables, thus violating every one of the "normal forms" of relational databases. SDTM tables do not have foreign keys, which is of course usual for "views". Essentially, the "foreign keys" describing the relations between data (bringing us back to “linked data”) are only described in the SDTM-IG, a PDF document that is not machine-readable.

So the real problem is the SDTM model, as it is now.
SDTM is "two-dimensional", "table" thinking. There are relations, but these are only described in a PDF (the "Implementation Guide"). Simple examples are that the value of USUBJID in each non-DM dataset must also appear in DM, or that RFXSTDTC in DM corresponds to the first record for that subject in either EX or EC. And --DY variables should not be in a "relational" SDTM at all, as they can be derived "on the fly".
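Two of these rules are easy to express in code once they are explicit. The following Python sketch is only an illustration: the function names and the list-of-dicts layout are assumptions, and the study-day rule (counted from the subject's reference start date, with no day 0) is the convention as I understand it from the SDTM-IG, not something stated in this post.

```python
from datetime import date


def orphan_subjects(dm_rows, other_rows):
    """USUBJID values used in a non-DM dataset but missing from DM."""
    known = {row["USUBJID"] for row in dm_rows}
    used = {row["USUBJID"] for row in other_rows}
    return sorted(used - known)


def study_day(event, reference_start):
    """Derive a --DY value "on the fly" (assumed rule: there is no day 0)."""
    delta = (event - reference_start).days
    # On or after the reference date: count starts at 1.
    # Before the reference date: count starts at -1.
    return delta + 1 if delta >= 0 else delta


# Made-up rows, just to exercise the two functions.
dm = [{"USUBJID": "S1"}]
lb = [{"USUBJID": "S1"}, {"USUBJID": "S2"}]
print(orphan_subjects(dm, lb))
print(study_day(date(2018, 11, 1), date(2018, 10, 30)))
```

The first function is the "foreign key" check that a relational database would enforce by itself. The second shows why a stored --DY variable is redundant: it follows directly from two dates that are already in the data.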

It is much better when relationships between objects are explicit, as is the case e.g. in FHIR (references between resources), and of course in "linked data", where they are at the core of the model. For the transport format, FHIR supports XML, JSON and Turtle. An example of the Patient resource in Turtle can be found here. Implementations are supposed to support at least XML and JSON (most do). For SDTM, the FDA only supports … XPT. Essentially, the FDA should be able to accept XPT (as a legacy format) as well as XML and JSON. If a normal hospital information system can accept multiple formats, why shouldn’t the FDA?

But, and this is where Tim is right (although unfortunately expressed), in the longer term "linked data" for submissions will probably be the right choice, especially when more and more data do not come from CRFs anymore, but from a variety of sources, including EHRs, devices, etc., which encourages the use of "biomedical concepts" (BCs). FHIR is essentially already "linked data", but organized a bit differently than using "triples". A LOINC code already essentially represents a BC anyway.

Essentially, whether "linked data" is transported using the format of Notation3, JSON or XML, I do not really care, as long as the format does not limit us in what we can transport. So the question "XML or Linked Data" is not a correct one. The question should be "SDTM or Linked Data".
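The point that the format is only the transport can be shown with a few lines of Python. This is a sketch under stated assumptions: the triples, the identifiers in them and the JSON and XML layouts are all made up for this illustration, and neither layout is a real RDF serialization such as RDF/XML or Notation3.

```python
import json
import xml.etree.ElementTree as ET

# Made-up (subject, predicate, object) triples, for illustration only.
triples = [
    ("subject/001", "hasObservation", "obs/17"),
    ("obs/17", "hasCode", "loinc/hypothetical-code"),
]


def to_json(ts):
    """Serialize the triples as a JSON array of 3-element arrays."""
    return json.dumps([list(t) for t in ts])


def from_json(text):
    return [tuple(t) for t in json.loads(text)]


def to_xml(ts):
    """Serialize the triples as <triple> elements with three attributes."""
    root = ET.Element("triples")
    for s, p, o in ts:
        ET.SubElement(root, "triple", subject=s, predicate=p, object=o)
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    return [
        (e.get("subject"), e.get("predicate"), e.get("object"))
        for e in ET.fromstring(text).iter("triple")
    ]


# Both transports give back exactly the same triples.
print(from_json(to_json(triples)) == triples)
print(from_xml(to_xml(triples)) == triples)
```

Whatever goes over the wire, the receiver ends up with the same set of relations. The choice that matters is the model behind the triples, not the syntax used to ship them.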

It will not be easy to get rid of SDTM. It has been a huge success, as there was nothing before SDTM. The merit of SDTM is that, for the first time in clinical research history, clinical data was submitted in a standardized (semantic) way. Unfortunately, in the last 10 years, first principles were abandoned for "ease of review", deteriorating the model itself. The fact that ever more domains were added and that some variables (like --CAT) have been abused for different kinds of purposes doesn’t make it better, of course. However, moving from XPT to an XML format for SDTM submissions would at least solve some of the major problems of XPT, for as long as SDTM still exists and is still required by the FDA. We cannot wait another 10 years before the FDA even starts thinking about "linked data".

In the longer term, moving to "linked data", also for submissions, is surely a good idea, but we can then still use RDF/XML, JSON, Notation3 or any other format for transporting the data between A and B.

But honestly, will we still need "files" with "formatted" information for submissions in the future? Or can reviewers start doing reviews long before the last data point is collected, using machine-to-machine communication as described in the blog "The 0.5MB submission"? Such an approach could save the lives of tens of thousands of patients each year, by making new therapies available considerably earlier.