Thursday, March 22, 2018

The machine-readable SDTM-IG

One of the major problems with the CDISC SDTM Implementation Guide (SDTM-IG) is that it is a PDF document. The latest published version (SDTM-IG v.3.2) is just under 400 pages long and contains information for almost 50 domains.

As the SDTM-IG (as a PDF) is not readily machine-readable, implementing its contents in software means reading each line, implementing each table, going through all the "assumptions", and first interpreting and then implementing every rule (which is not even designated as a "rule" in the text). This and other factors have led to validation software that completely over-interprets the IG and also implements many of the "rules" incorrectly, so that it reports many false-positive errors. Unfortunately, this "buggy" software is also used by the FDA for all incoming SDTM submissions!

So implementing an SDTM-IG in software (the same applies to the SEND-IG) is not only a huge task but also error-prone: a lot of copy-and-paste may be involved. As the implementation requires human interpretation, each software package for generating SDTM uses its own interpretation of the standard, which of course undermines the meaning of the word "standard".

So wouldn't it be better if we had a machine-readable SDTM-IG, published by CDISC? Software could then simply read the electronic version and implement it, instead of developers spending weeks writing code that results in yet another interpretation of the standard. The call for such an electronic version has existed for many years, but each time it is requested during the public review, the so-called "CDISC disposition" is: "considered for the future".

Now, I am sick and tired of hearing "considered for the future" again and again for almost the last 10 years, so I decided to start working on a machine-readable version myself. This is also a lot of work (but maybe only 10% of the overall effort of developing a completely new version of the SDTM-IG). Fortunately, I was able to convince four of my undergraduate students to do this within the scope of their "Bachelor project", in which they essentially learn all aspects of working in a project team, but must also deliver a technical result. So the SDTM-IG in XML was their envisaged result.

They did a good job and delivered the result well in time. It wasn't perfect (SDTM was new to them), so I still needed to make a few corrections and some additions (I am still adding new "features"). We also developed an XSLT stylesheet that creates a human-readable view of the standard, which is >99% identical to what is found in the PDF published by CDISC. This means that we created a machine-readable document which at the same time displays almost identically to the PDF document.

In the XML version of the IG, all the domains are grouped per class, and for each domain, the variables are defined in XML elements:

For example, the element for the variable EXDOSFRM:

Note that extra information was added:
  • The "modern" (XML) datatype of the variable, which is also used in the define.xml
  • When a CDISC codelist is attached, the NCI code of that codelist.
This would allow a define.xml template to be generated automatically for each domain.
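How such a define.xml template could be generated from the extra information can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element and attribute names of the IG fragment (`Domain`, `Variable`, `DataType`, `CodeListNCICode`) are hypothetical and not the actual schema of our XML, and the generated output is a simplified, namespace-free skeleton rather than a complete define.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical IG fragment: the element and attribute names are assumptions
# made for this sketch, not the real schema of the XML SDTM-IG.
IG_FRAGMENT = """
<Domain Name="EX" Class="Interventions">
  <Variable Name="STUDYID" Label="Study Identifier" DataType="text" Core="Req"/>
  <Variable Name="EXDOSFRM" Label="Dose Form" DataType="text" Core="Exp"
            CodeListNCICode="C66726"/>
</Domain>
"""


def build_itemdefs(ig_xml: str) -> ET.Element:
    """Create simplified define.xml-style ItemDef elements for one domain."""
    domain = ET.fromstring(ig_xml)
    root = ET.Element("MetaDataVersion")
    for var in domain.iter("Variable"):
        name = var.get("Name")
        item = ET.SubElement(root, "ItemDef", {
            "OID": f"IT.{domain.get('Name')}.{name}",
            "Name": name,
            "DataType": var.get("DataType"),
        })
        # The variable label becomes the ItemDef description.
        desc = ET.SubElement(item, "Description")
        ET.SubElement(desc, "TranslatedText").text = var.get("Label")
        # Only variables with an attached CDISC codelist get a CodeListRef.
        code = var.get("CodeListNCICode")
        if code is not None:
            ET.SubElement(item, "CodeListRef", {"CodeListOID": f"CL.{code}"})
    return root


if __name__ == "__main__":
    print(ET.tostring(build_itemdefs(IG_FRAGMENT), encoding="unicode"))
```

The OID naming pattern (`IT.…`, `CL.…`) is just a convention chosen for the sketch; a real tool would also add lengths, origins and the proper define.xml namespaces.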

For codelists that are "sponsor defined", the "*" from the PDF is replaced by a machine-readable XML attribute:

The "Assumptions" were also implemented as XML, for the moment still as simple narrative:

However, the sentences about "variables generally not used ..." (whatever that means - such sentences should not appear in a standard) were structured, and dedicated elements were created for them:

so that software can use this information to generate an "info message" when the user tries to add one of these variables to the domain.
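As a rough illustration of how software might use this, the following Python sketch looks up a variable in a structured "not generally used" list and returns an info message. The element names (`NotGenerallyUsed`, `VariableRef`) and the listed variables are hypothetical placeholders, not taken from our XML or from the IG.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure and content: the element names and the listed
# variables are placeholders for this sketch only.
IG_FRAGMENT = """
<Domain Name="EX">
  <NotGenerallyUsed>
    <VariableRef Name="EXPRESP"/>
    <VariableRef Name="EXOCCUR"/>
  </NotGenerallyUsed>
</Domain>
"""


def info_message(ig_xml, variable_name):
    """Return an info message if the variable is flagged, otherwise None."""
    domain = ET.fromstring(ig_xml)
    flagged = {
        ref.get("Name")
        for ref in domain.findall("NotGenerallyUsed/VariableRef")
    }
    if variable_name in flagged:
        return (f"Info: {variable_name} is generally not used "
                f"in domain {domain.get('Name')}.")
    return None


if __name__ == "__main__":
    for candidate in ("EXOCCUR", "EXDOSE"):
        print(candidate, "->", info_message(IG_FRAGMENT, candidate))
```

Returning a message instead of raising an error matches the intent here: the variable is still allowed, the user is only informed.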

In a number of cases, the students could also add the "examples", formatted as XHTML within the XML (as is also done in HL7 FHIR):

Here, some of the table formatting was done in the XML, but that can of course also be taken care of by the stylesheet.

Rules are important. Unfortunately, they are "hidden" in the SDTM-IG as simple text, which easily leads to different implementations. So, we also started adding some of the rules in machine-readable pseudo code. For example:

In the future, we will add the rules as XQuery code ("Open Rules for CDISC Standards"), which will be enormously useful once the FDA moves away from SAS-XPT and finally accepts Dataset-XML as the file format for submissions. We will use the "SDTMIG v.3.2 Conformance Rules" published by CDISC, unfortunately long after the publication of the IG itself. Essentially, such rules should be published at the same time as the standard itself. Using our approach, they can be published as part of the standard (XML) document itself.
Also note that these rules are already available as XQuery, so we only need to transfer them from there into the XML for the SDTM-IG.
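To give an idea of what a rule embedded in the XML makes possible, here is a minimal sketch that evaluates one structured rule against a few records. It deliberately uses Python rather than XQuery, the rule itself is a made-up illustration (not quoted from the CDISC conformance rules), and the `Rule`, `If` and `Then` elements are assumed names that only support a "populated" condition.

```python
import xml.etree.ElementTree as ET

# Made-up rule for illustration; element names and rule ID are assumptions.
RULE_XML = """
<Rule ID="EX-DEMO-1" Domain="EX">
  <Description>When EXDOSE is populated, EXDOSU must be populated.</Description>
  <If Variable="EXDOSE" Condition="populated"/>
  <Then Variable="EXDOSU" Condition="populated"/>
</Rule>
"""


def is_populated(record, variable):
    """A value counts as populated when it is present and not blank."""
    value = record.get(variable)
    return value is not None and str(value).strip() != ""


def check_rule(rule_xml, records):
    """Return (row number, rule ID) for every record violating the rule."""
    rule = ET.fromstring(rule_xml)
    if_var = rule.find("If").get("Variable")
    then_var = rule.find("Then").get("Variable")
    violations = []
    for row_number, record in enumerate(records, start=1):
        if is_populated(record, if_var) and not is_populated(record, then_var):
            violations.append((row_number, rule.get("ID")))
    return violations


if __name__ == "__main__":
    demo_records = [
        {"EXDOSE": 10, "EXDOSU": "mg"},
        {"EXDOSE": 20, "EXDOSU": ""},
        {"EXDOSE": None, "EXDOSU": ""},
    ]
    print(check_rule(RULE_XML, demo_records))
```

Because the rule lives in the XML rather than in the program, every tool reading the same document would apply the same check instead of its own interpretation.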

So, we have everything as machine-readable XML, opening a lot of opportunities. But how does this all display when a normal human wants to see the information? In that case, the stylesheet is applied and the result is displayed in the browser. Here are a few screenshots:

And for the "Assumptions":

or the examples:

This looks almost 100% identical to what is seen when inspecting the PDF file.

All this work was done by four undergraduate students as part of a not-all-too-large project.
So you may ask: why did the SDTM team not take this approach?

Honestly, I don't know.
Maybe they never cared about machine-readable standards? Or maybe they prefer "business as usual"?

I often have the strong impression that SDTM-IGs are still being developed using Excel worksheets and Word documents. The better way is surely to develop them using a database, in which each variable, each assumption, each rule, even each example is a row in a table in the database.
Of course one can also use a native XML database.
It then only takes one click to generate the XML version of the IG, and a second to generate the human-readable display of the IG.
That database system could be SHARE. Unfortunately, this is not yet the case: SHARE currently only contains non-machine-readable versions of the standards. Instead of publishing the standards to SHARE, they should be developed within SHARE. This would also enable services (e.g. using RESTful technologies) like those that have so far only been implemented by some visionary volunteers.
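The database-driven approach described above can be sketched with an in-memory SQLite database in which each variable is one row, and a function that exports the rows as XML. The table layout, column names and XML element names are assumptions made for this sketch; they do not describe SHARE or our actual XML.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_demo_db():
    """Create a tiny in-memory database with one row per variable."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variable "
        "(domain TEXT, position INTEGER, name TEXT, label TEXT, core TEXT)"
    )
    conn.executemany(
        "INSERT INTO variable VALUES (?, ?, ?, ?, ?)",
        [
            ("EX", 1, "STUDYID", "Study Identifier", "Req"),
            ("EX", 2, "DOMAIN", "Domain Abbreviation", "Req"),
            ("DM", 1, "STUDYID", "Study Identifier", "Req"),
        ],
    )
    return conn


def export_ig(conn):
    """Serialize all variables, grouped per domain, as one XML string."""
    root = ET.Element("SDTMIG")
    domains = {}
    rows = conn.execute(
        "SELECT domain, name, label, core FROM variable "
        "ORDER BY domain, position"
    )
    for domain, name, label, core in rows:
        # Create the Domain element the first time a domain is seen.
        if domain not in domains:
            domains[domain] = ET.SubElement(root, "Domain", {"Name": domain})
        ET.SubElement(
            domains[domain], "Variable",
            {"Name": name, "Label": label, "Core": core},
        )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(export_ig(build_demo_db()))
```

Assumptions, rules and examples could be stored in further tables and exported in the same way, after which a stylesheet produces the human-readable view.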

Is this SDTM-IG in XML freely available? Yes and no. It is still a prototype and not 100% ready yet. But yes, it is available to companies that want to fund the further development at my institute at the university. You can find the contact details here.


  1. For those interested, the slides of my presentation at the CDISC European Interchange in Berlin (April 2018) on this topic can now be found at:

  2. I would like to try it. Please make it available when finished.

    I wish even the PDF would be a linkable document. It is behind "free-login-wall".