Friday, May 20, 2016

Electronic Health Record Data in SDTM

The FDA's newest publication, "Use of Electronic Health Record Data in Clinical Investigations", prompted me to pick up the topic of EHRs in SDTM again. The FDA publication describes the framework in which the use of EHRs in clinical research is allowed and encouraged. Although it does not contain any really new information, it should allay the fears of sponsors and investigators about using EHR data in FDA-regulated clinical trials.

One of the things that will surely happen in future is that the FDA reviewer will want to see the EHR data point that was used as the source of a data point in the SDTM submission. The investigator will then ask the sponsor, who will then ask the site...: another delay in bringing this innovative new drug or therapy to the market. In the meantime, patients will die ...

So can't we have the EHR datapoint in the SDTM itself?

Of course! It is even very easy, but only if the FDA would finally decide to get rid of SAS-XPT, this ancient binary format with all its stupid limitations.

Already some years ago, the CDISC XML Technologies Team developed the Dataset-XML standard, as a simple replacement for SAS-XPT. The FDA did a pilot, but since then nothing has happened - "business as usual" seems to have returned.
Dataset-XML was developed to allow the FDA a smooth transition from XPT to XML. It doesn't change anything in SDTM, it just changes the way the data is transported from A to B. However, Dataset-XML has the potential to do things better, as it isn't bound to the two-dimensional table approach of XPT (which in turn forces SDTM to consist of two-dimensional tables).

So, let's try to do better!

Suppose I have a VS dataset with a systolic blood pressure for subject "CDISC01.100008", and the data point was retrieved from the EHR of the patient. Forget about adding the EHR data point to the SDTM using ancient SAS-XPT! We need Dataset-XML.

This is how the SDTM records look:

Now, the EHR is based on the new HL7-FHIR standard, and the record is very similar to the one at  How do we get this data point in our SDTM?

Dataset-XML, as it is based on CDISC ODM, is extensible. This means that XML data from other sources can be embedded as long as the namespace of the embedded XML is different from the ODM namespace. As FHIR has an XML implementation, the FHIR data point can easily be embedded into the Dataset-XML SDTM record.
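To make the embedding idea concrete, here is a minimal sketch in Python (not the original example file). It builds an ODM-style record and appends a FHIR "Observation" in its own namespace. The ItemGroupOID, the ItemOIDs and the value 120 are hypothetical, and the FHIR fragment is heavily trimmed; only the LOINC code 8459-0 is taken from the example discussed in this post.

```python
import xml.etree.ElementTree as ET

ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"  # CDISC ODM 1.3 namespace
FHIR_NS = "http://hl7.org/fhir"              # HL7 FHIR XML namespace

# Prefixes are only cosmetic; what matters is that the namespaces differ.
ET.register_namespace("odm", ODM_NS)
ET.register_namespace("fhir", FHIR_NS)


def build_observation(loinc_code, value, unit):
    """Build a heavily trimmed FHIR Observation (illustration only)."""
    obs = ET.Element(f"{{{FHIR_NS}}}Observation")
    code = ET.SubElement(obs, f"{{{FHIR_NS}}}code")
    coding = ET.SubElement(code, f"{{{FHIR_NS}}}coding")
    ET.SubElement(coding, f"{{{FHIR_NS}}}system", {"value": "http://loinc.org"})
    ET.SubElement(coding, f"{{{FHIR_NS}}}code", {"value": loinc_code})
    quantity = ET.SubElement(obs, f"{{{FHIR_NS}}}valueQuantity")
    ET.SubElement(quantity, f"{{{FHIR_NS}}}value", {"value": value})
    ET.SubElement(quantity, f"{{{FHIR_NS}}}unit", {"value": unit})
    return obs


def build_record(items, observation):
    """Build one ItemGroupData record and embed the foreign-namespace element."""
    record = ET.Element(f"{{{ODM_NS}}}ItemGroupData", {"ItemGroupOID": "IG.VS"})
    for item_oid, value in items.items():
        ET.SubElement(record, f"{{{ODM_NS}}}ItemData",
                      {"ItemOID": item_oid, "Value": value})
    record.append(observation)  # the EHR data point travels with the record
    return record


# Hypothetical OIDs and values, for illustration only.
record = build_record(
    {"IT.VS.USUBJID": "CDISC01.100008",
     "IT.VS.VSTESTCD": "SYSBP",
     "IT.VS.VSORRES": "120"},
    build_observation("8459-0", "120", "mm[Hg]"),
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

A reader that only understands ODM can skip everything outside the ODM namespace, which is exactly what makes this kind of extension harmless for existing tools.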

In the following example (which you can download from here), I decided to add the FHIR-EHR data point to the SDTM record, and not to VSORRES (for which one could also argue), as I think that the data point belongs to the record, and not to the "original result" - we will discuss this further on.

The SDTM record then becomes:

Note that the "Observation" element "lives" in the HL7 namespace "".

continued by:

Important here is that LOINC coding is used for an exact description of the test (systolic, sitting - LOINC code 8459-0), and that SNOMED-CT is used for coding the body part. This is important - the SDTM and CT teams are still refusing to allow the LOINC code to be used as the unique identifier for the test in VS and LB. Instead, they reinvented the wheel and developed their own list of codes, leading to ambiguities. LOINC coding is mandated to be used in most national EHR systems, including the US Meaningful Use. The same applies to the use of UCUM units.

Now, if you inspect the record carefully, you will notice that a good amount of the information is present twice. The only information that is NOT in the EHR data point is STUDYID, USUBJID (although,..), DOMAIN, VISITNUM, VISITDY (planned study day) and VSDY (actual day). STUDYID is an artefact of SAS-XPT, as ODM/Dataset-XML could allow grouping all records per subject (using ODM "SubjectData/@SubjectKey"). DOMAIN is also an artefact, as within the dataset, DOMAIN must always be "VS" and is given by the define.xml anyway with a reference to the correct file. VSDY is derived and can easily be calculated "on the fly" by the FDA tools. Even VSSEQ is artificial and could easily be replaced by a worldwide unique identifier (making it referenceable worldwide, as in ... FHIR). VISIT (name) is also derived in the case of a planned visit and can be looked up in TV (trial visits).
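As an illustration of how little is needed to derive VSDY "on the fly", here is a minimal sketch in Python. It assumes the usual SDTM study day convention (the reference start date, RFSTDTC in DM, is day 1 and there is no day 0) and complete dates; the function name is my own.

```python
from datetime import date


def study_day(observation_date: date, reference_start: date) -> int:
    """Derive an SDTM-style study day (--DY) from two complete dates.

    Assumed convention: the reference start date is day 1, the day before
    it is day -1, and day 0 does not exist.
    """
    delta = (observation_date - reference_start).days
    return delta + 1 if delta >= 0 else delta


# Usage: a measurement taken on the reference start date itself is day 1.
print(study_day(date(2016, 5, 20), date(2016, 5, 20)))
```

Partial dates would need extra handling, which this sketch deliberately leaves out.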

So, if we allow Dataset-XML to become multi-dimensional (grouping data by subject), the only SDTM variables that explicitly need to be present are VISITNUM and VISITDY. So essentially, our SDTM record could be reduced to:


Note the annotations I made, which give the mapping to SDTM variables.

If the reviewer would still like to see the record in the classic two-dimensional table way, that's a piece of cake: an extremely simple transformation (e.g. using XSLT) does the job.
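The flattening itself is trivial in any language. The following sketch uses Python rather than XSLT, and a plain dictionary as a stand-in for the subject-grouped XML; the input structure and the function name are assumptions made only for this illustration.

```python
def flatten(study_id, domain, records_by_subject):
    """Turn subject-grouped records back into classic two-dimensional rows.

    STUDYID, DOMAIN, USUBJID and --SEQ are re-created from the grouping,
    so they do not need to be stored in each record.
    """
    rows = []
    for usubjid, records in records_by_subject.items():
        for seq, record in enumerate(records, start=1):
            rows.append({
                "STUDYID": study_id,
                "DOMAIN": domain,
                "USUBJID": usubjid,
                f"{domain}SEQ": seq,
                **record,  # remaining variables, e.g. VISITNUM and VISITDY
            })
    return rows


# Hypothetical input: one subject with two reduced records.
grouped = {
    "CDISC01.100008": [
        {"VISITNUM": 1, "VISITDY": 1},
        {"VISITNUM": 2, "VISITDY": 8},
    ]
}
table = flatten("CDISC01", "VS", grouped)
for row in table:
    print(row)
```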

Now, reviewers always complain about file sizes (however, reviewers should be forbidden to use "files"), and will surely do so when they see how much "size" the FHIR information takes. But who says that the FHIR information must be in the same file? Can't it just be referenced, or better, can't we state where the information can be found using a secured RESTful web service?
This is done all the time in FHIR! So we could further reduce our SDTM record to:

Note that the "http://..." is not simply an HTTP address: just using it in a browser will not return the subject's data point. The RESTful web service in our case will require authentication, usually using the OAuth2 authentication mechanism.
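For illustration, this is roughly what an authenticated request for such a data point could look like, sketched in Python. The URL and the token are hypothetical, and I assume that an OAuth2 access token has already been obtained and is sent as a bearer token; the sketch only builds the request and does not send it.

```python
import urllib.request


def build_fhir_request(resource_url: str, access_token: str) -> urllib.request.Request:
    """Prepare a GET request for a protected FHIR resource.

    The token is passed in the Authorization header as an OAuth2 bearer
    token; obtaining the token is outside the scope of this sketch.
    """
    return urllib.request.Request(
        resource_url,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/xml",
        },
    )


# Hypothetical endpoint and token, for illustration only.
request = build_fhir_request(
    "https://ehr.example.org/fhir/Observation/12345",
    "not-a-real-token",
)
print(request.full_url)
print(request.get_header("Authorization"))
# Sending it would be: urllib.request.urlopen(request)
```

Without a valid token the server would simply refuse the request, which is why the address alone is of no use in a browser.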

Comments are very welcome - as ever ...

Tuesday, May 3, 2016

Ask SHARE - SDTM validation rules in XQuery

This weekend, after returning from the European CDISC Interchange (where I gave a presentation titled "Less is more - A Visionary View of the Future of CDISC Standards"), I continued my work on the implementation of the SDTM validation rules in the open and vendor-neutral XQuery language (also see earlier postings here and here).
This time, I worked on a rule that is not so easy. It is the PMDA rule CT2001: "Variable must be populated with terms from its CDISC controlled terminology codelist. New terms cannot be added into non-extensible codelists".
This looks like an easy one at first sight, but it isn't. How does a machine-executable rule know whether a codelist is extensible or not? Looking into an Excel worksheet is not the best way (also because Excel is not a vendor-neutral standard) and is cumbersome to program (if possible at all). So we do need something better.

So we developed the human-readable, machine-executable rule (it can be found here) using the following algorithm:

  • the define.xml is taken and an iteration is performed over each dataset (ItemGroupDef)
  • within the dataset definition, an iteration is performed over all the defined variables (ItemRef/ItemDef) and it is checked whether a codelist is attached to the variable
  • if a codelist is attached, the NCI code of the codelist is taken. A web service request "Get whether a CDISC-CT Codelist is extensible or not" is triggered, which returns whether the codelist is extensible or not. Only the codelists that are not extensible are retained. This leads to a list of non-extensible codelists for each dataset
  • the next step could be that each of the (non-extensible) codelists is inspected for whether it has some "EnumeratedItem" or "CodeListItem" elements that have the flag 'def:ExtendedValue="Yes"'. This is however not "bullet proof", as sponsors may have added terms and forgotten to add the flag.
  • the step could also have been to use the web service "Get whether a coded value is a valid value for the given codelist" to query whether each of the values in the codelist is really a value of that codelist as published by CDISC. This relies on the values in the dataset itself for the given variable all being present in the codelist (enforced by another rule). The XQuery implementation can be found here.
  • we chose, however, to inspect each value in the dataset of the given variable which has a non-extensible codelist for whether it is an allowed value for that codelist, by using the web service "Get whether a coded value is a valid value for the given codelist". If the answer from the web service is "false", an error is returned in XML format (allowing reuse in other applications). The XQuery implementation can be found here.
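The logic of the chosen approach can be sketched in a few lines of Python. This is not the XQuery implementation: the two web services are replaced by local stub functions, the codelist identifiers are stand-ins for real NCI codes, the example rows are invented, and errors are returned as dictionaries rather than as XML.

```python
def check_ct2001(variable_codelists, rows, is_extensible, is_valid_value):
    """Report values that are not in their non-extensible codelist.

    variable_codelists maps variable name -> codelist identifier (as read
    from the define.xml); is_extensible and is_valid_value stand in for
    the two web service calls.
    """
    # Retain only the variables whose codelist is NOT extensible.
    non_extensible = {
        variable: codelist
        for variable, codelist in variable_codelists.items()
        if not is_extensible(codelist)
    }
    errors = []
    for row_number, row in enumerate(rows, start=1):
        for variable, codelist in non_extensible.items():
            value = row.get(variable, "")
            if value and not is_valid_value(codelist, value):
                errors.append({"rule": "CT2001", "row": row_number,
                               "variable": variable, "value": value})
    return errors


# Stubs replacing the web services; identifiers and terms are hypothetical.
PUBLISHED = {
    "CODELIST_A": {"extensible": False, "terms": {"Y", "N"}},
    "CODELIST_B": {"extensible": True, "terms": {"mmHg"}},
}


def stub_is_extensible(codelist):
    return PUBLISHED[codelist]["extensible"]


def stub_is_valid_value(codelist, value):
    return value in PUBLISHED[codelist]["terms"]


rows = [
    {"VSBLFL": "Y", "VSORRESU": "mmHg"},
    {"VSBLFL": "MAYBE", "VSORRESU": "furlong"},
]
errors = check_ct2001({"VSBLFL": "CODELIST_A", "VSORRESU": "CODELIST_B"},
                      rows, stub_is_extensible, stub_is_valid_value)
print(errors)
```

In this sketch only the second row's VSBLFL value is reported: the unknown unit in the same row passes, because its codelist is extensible and this rule therefore does not apply to it.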
You can inspect each of the XQuery implementations using either NotePad or NotePad++, or a modern XML editor like Altova XML Spy, oXygen-XML or EditiX.

A partial view is below:

What does this have to do with SHARE?


All of the above-mentioned RESTful web services (available at ) are based on SHARE content. The latter has been implemented as a relational database (it could however also have been a native XML database), and a pretty large number of RESTful web services have been built around it.

In future, the SHARE API will deliver such web services directly from the SHARE repository. So our own web services are only a "test bed" for finding out what is possible with SHARE.

So in future, forget about "black box" validation tools - simply "ask SHARE"!