Wednesday, March 28, 2018

Annotating Clinical Research Protocols


As promised, somewhat more information about the protocol annotation tool that I developed in the last few months.

The tool reads protocols as a simple text file, which is then converted to XML. Once loaded, the user can select text, and is then prompted for which coding system to use for the annotation:


Currently, about 20 different coding systems can be used. The main dialog only shows the most needed ones for protocols. Clicking "UMLS - Multiple Coding Systems" displays the other ones:


Now, don't think that we have all these coding system libraries within the software. No, we use RESTful web services to connect to publicly available services, such as UMLS RESTful web services. Some of these RESTful web services have been developed by us, but are also publicly available from our public XML4Pharma server (so YOU can use them too).
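To give an idea of what such a lookup amounts to, here is a minimal Python sketch that only builds the request URL (the tool itself is written in Java). The base URL and the parameter names `system` and `text` are placeholders invented for this illustration. They are not the actual interface of the UMLS or XML4Pharma services, so consult the documentation of the service you use.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the base URL of the service you actually use.
BASE_URL = "https://example.org/rest/search"


def build_lookup_url(selected_text: str, coding_system: str,
                     base_url: str = BASE_URL) -> str:
    """Return a GET URL asking a lookup service for matches to the selected text.

    The parameter names "system" and "text" are assumptions for this sketch.
    """
    query = urlencode({"system": coding_system, "text": selected_text.strip()})
    return f"{base_url}?{query}"


url = build_lookup_url("  pulse rate ", "CDISC-CT")
print(url)
```

No request is sent here. Actually calling a service would additionally require whatever authentication that service asks for.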

Once a coding system is selected, the selected text is submitted to the RESTful web service, which is asked for the best possible match. Here is a short movie about how this works for CDISC test codes:


In this case, trial summary parameters (TSPARMCD) and values (TSVAL) are being derived.
Near the bottom, one can see how each annotation is stored. A UUID is generated to identify the annotation and is stored internally together with the text coordinates (start index and length), the coding system, and the code. When exported to the XML file, this looks like:
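The record described above can be sketched in a few lines of Python. This is an illustration only: the class name, the element name `Annotation` and the attribute names are assumptions, not the tool's actual export format, and the protocol sentence and code value are made up for the example.

```python
import uuid
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """One annotation: text coordinates plus coding system and code."""
    start: int            # index of the first annotated character
    length: int           # number of annotated characters
    coding_system: str
    code: str
    # A random UUID identifies the annotation.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def covered_text(self, protocol_text: str) -> str:
        """Return the piece of protocol text this annotation points at."""
        return protocol_text[self.start:self.start + self.length]

    def to_xml(self) -> ET.Element:
        """Serialize to an XML element (attribute names are hypothetical)."""
        return ET.Element("Annotation", {
            "id": self.id,
            "start": str(self.start),
            "length": str(self.length),
            "codingSystem": self.coding_system,
            "code": self.code,
        })


protocol = "Measure pulse rate after 5 minutes of rest."
ann = Annotation(start=protocol.index("pulse rate"), length=len("pulse rate"),
                 coding_system="CDISC CT VSTESTCD", code="PULSE")
print(ann.covered_text(protocol))
print(ET.tostring(ann.to_xml(), encoding="unicode"))
```

Storing start index and length rather than the text itself keeps the annotation tied to an exact position, so the same phrase occurring twice in a protocol can carry two different annotations.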


When selecting "pulse rate":


and again choosing "CDISC Test Codes", one gets:


resulting in an annotation: "CDISC CT VSTESTCD" in the XML:



One can of course also use any of the other coding systems. Here is a short movie about assigning LOINC codes (unfortunately rarely practiced in protocols!) for lab tests:


We also applied the system to annotations for inclusion and exclusion criteria:


This makes it possible to extract the inclusion and exclusion criteria from the "annotated protocol", together with the "trial parameters", and to automatically generate a CTR-XML (CDISC Clinical Trial Registries) file, which can then easily be converted to the SDTM datasets TS and TI.
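As a rough illustration of the last step, the following Python sketch turns a list of extracted criteria into rows for the SDTM TI (Trial Inclusion/Exclusion Criteria) dataset. The `Criterion` class, the function name and the INCLnn/EXCLnn numbering scheme are assumptions made for this example; the tool's actual route goes through CTR-XML, which is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    category: str  # "INCLUSION" or "EXCLUSION"
    text: str      # criterion text as annotated in the protocol


def to_ti_rows(study_id: str, criteria: list[Criterion]) -> list[dict]:
    """Build one TI row per criterion, numbering each category separately."""
    prefixes = {"INCLUSION": "INCL", "EXCLUSION": "EXCL"}
    counters = {"INCLUSION": 0, "EXCLUSION": 0}
    rows = []
    for criterion in criteria:
        counters[criterion.category] += 1
        rows.append({
            "STUDYID": study_id,
            "DOMAIN": "TI",
            # Hypothetical code scheme, e.g. INCL01, EXCL01
            "IETESTCD": f"{prefixes[criterion.category]}{counters[criterion.category]:02d}",
            "IETEST": criterion.text,
            "IECAT": criterion.category,
        })
    return rows


rows = to_ti_rows("STUDY1", [
    Criterion("INCLUSION", "Age 18 years or older"),
    Criterion("EXCLUSION", "Pregnancy"),
    Criterion("INCLUSION", "Signed informed consent"),
])
for row in rows:
    print(row["IETESTCD"], row["IECAT"], row["IETEST"])
```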

Unfortunately, CDISC does not yet encourage the use of SNOMED-CT coding (another "not invented here" case). SNOMED-CT coding can however be enormously useful, especially when data needs to be retrieved from electronic health records (these usually use LOINC and SNOMED-CT coding).
Here is a short movie about SNOMED-CT coding of some text in the protocol:


REMARK: not all movies use the same version of the software: the software is in constant evolution.

Limitations:

Unfortunately, protocol writers often use simple tables or even embedded pictures to display the study workflow or "schedule of events" (which in this form I usually call the "schedule of disaster"). There is however an international standard for workflows: BPMN2 (Business Process Model and Notation), which even has an XML implementation. I have used this standard a lot in the past, e.g. for study design, but I haven't yet seen a single protocol where BPMN2 is used, and especially not in XML. What a pity!

Future prospects:

If you watched the movies and know something about Machine Learning (ML) and Artificial Intelligence (AI), think about the following: you annotate 10 protocols and then feed them as input to an ML program. As you will know, protocols often look similar or at least have similar elements. Can you then imagine that the ML program can annotate the 11th protocol itself without human interaction? Some parts, such as the Inclusion/Exclusion Criteria, are easy prey for such ML systems! And then imagine that you also have a metadata repository containing standardized CRFs or CRF templates. With a bit of luck this combination could make it possible to automate 80% or more of the study design, AND at the same time generate the SDTM trial design datasets and a CTR-XML dataset for submissions to clinical trial registries.

Conclusions:

Using a simple Java program in combination with RESTful web services makes it possible to annotate protocols with codes from over 20 coding systems used in medical informatics, healthcare and biology. Such annotations allow for much more precise instructions to the sites on what exactly should be done.
Currently, in many cases sites interpret the instructions in different ways. For example, when you instruct the site to measure "albumin in blood" you might obtain results of 20 different tests with 30 different units.
I consider this a first step towards the "e-Protocol". I don't claim this "annotated protocol" is the best way on the road to an "e-Protocol", but at least it is a first step.

Comments are of course very welcome!

Thursday, March 22, 2018

The machine-readable SDTM-IG

One of the major problems with the CDISC SDTM Implementation Guide (SDTM-IG) is that it is a PDF document. The last published version (SDTM-IG v.3.2) is almost 400 pages long, containing information for almost 50 domains.

Because the SDTM-IG (as a PDF) is not easily machine-readable, implementing its contents in software means reading each line, implementing each table, going through all the "assumptions", and first interpreting and then implementing every rule (which is not even designated as a "rule" in the text). This and other factors have led to validation software that completely over-interprets the IG and also implements many of the "rules" incorrectly, so that it reports many false-positive errors. Unfortunately, this "buggy" software is also used by the FDA for all incoming SDTM submissions!

So implementing an SDTM-IG in software (the same applies to the SEND-IG) is not only a huge task, but also error-prone - a lot of copy-and-paste may be involved. As the implementation requires human interpretation, each piece of software for generating SDTM uses its own interpretation of the standard, which of course undermines the meaning of the word "standard".

So wouldn't it be better if we had a machine-readable SDTM-IG, published by CDISC? Software could then simply read the electronic version and implement it, instead of someone spending weeks writing code that results in yet another interpretation of the standard. Such an electronic version has been called for for many years, but each time it is requested during public review, the so-called "CDISC disposition" is: "considered for the future".

Now, I am sick and tired of hearing "considered for the future" again and again for almost 10 years, so I decided to start working on a machine-readable version myself. This is also a lot of work (but maybe only 10% of the overall effort of developing a completely new version of the SDTM-IG). Fortunately, I could convince four of my undergraduate students to do this as their "Bachelor project", in which they essentially learn all the aspects of working in a project team, but must also deliver a technical result. So the SDTM-IG in XML was their envisaged result.

They did a good job and delivered the result well in time. It wasn't perfect (SDTM was new to them), so I still needed to make a few corrections and some additions (I am still adding new "features"). We also developed an XSLT stylesheet to create a human-readable view of the standard, which is >99% identical to what is found in the PDF published by CDISC. This means that we created a machine-readable document which at the same time displays almost exactly like the PDF document.

In the XML version of the IG, all the domains are grouped per class, and for each domain, the variables are defined in XML elements:


For example, the definition for the variable EXDOSFRM:

Note that extra information was added:
  • The "modern" (XML) datatype of the variable, which is also used in the define.xml
  • When a CDISC codelist is attached, the NCI code of that codelist.
This would make it possible to generate a define.xml template for each domain automatically.
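A minimal Python sketch of that idea follows. The `Variable` element and its attribute names are invented for this illustration and are not the actual schema of our XML version of the IG. The sample values are illustrative, the OID patterns are assumptions, and the define.xml namespaces are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical variable definition as it might appear in an XML version of the IG.
IG_VARIABLE = ('<Variable Name="EXDOSFRM" Label="Dose Form" '
               'DataType="text" CodeListNCICode="C66726"/>')


def to_item_def(domain: str, variable_xml: str) -> ET.Element:
    """Turn one IG variable definition into a define.xml-style ItemDef."""
    var = ET.fromstring(variable_xml)
    name = var.get("Name")
    item = ET.Element("ItemDef", {
        "OID": f"IT.{domain}.{name}",      # assumed OID pattern
        "Name": name,
        "DataType": var.get("DataType"),
    })
    description = ET.SubElement(item, "Description")
    ET.SubElement(description, "TranslatedText").text = var.get("Label")
    nci_code = var.get("CodeListNCICode")
    if nci_code:
        # Reference the codelist only when the IG attaches one.
        ET.SubElement(item, "CodeListRef", {"CodeListOID": f"CL.{nci_code}"})
    return item


item_def = to_item_def("EX", IG_VARIABLE)
print(ET.tostring(item_def, encoding="unicode"))
```

Because datatype and codelist reference come straight from the IG file, nothing has to be retyped from the PDF when preparing the template.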

For codelists that are "sponsor defined", the "*" from the PDF is replaced by a machine-readable XML attribute:


The "Assumptions" were also implemented as XML, for the moment still as simple narrative:


However, the sentences about "variables generally not used ..." (whatever that means - such sentences should not appear in a standard) were structured, and dedicated elements were created for them:

so that software can use this information to generate an "info message" when the user tries to add one of these variables to the domain.
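How software might consume such structured information is sketched below in Python. The element names `NotGenerallyUsed` and `VariableRef` and the variable list are assumptions for this illustration only; they are not the names used in our XML.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical fragment: element names and variable list are illustrative.
IG_FRAGMENT = """
<Domain Name="AE">
  <NotGenerallyUsed>
    <VariableRef Name="AEOCCUR"/>
    <VariableRef Name="AESTAT"/>
    <VariableRef Name="AEREASND"/>
  </NotGenerallyUsed>
</Domain>
"""


def info_message(domain_xml: str, variable: str) -> Optional[str]:
    """Return an info message if the variable is flagged for this domain."""
    domain = ET.fromstring(domain_xml)
    flagged = {ref.get("Name")
               for ref in domain.iterfind("NotGenerallyUsed/VariableRef")}
    if variable in flagged:
        return (f"Info: {variable} is generally not used "
                f"in domain {domain.get('Name')}.")
    return None  # nothing to report


print(info_message(IG_FRAGMENT, "AEOCCUR"))
print(info_message(IG_FRAGMENT, "AETERM"))
```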

In a number of cases, the students could also add the "examples", formatted as XHTML within the XML (as is also done in HL7 FHIR):


Here, some of the table formatting was done in the XML, but that can of course also be taken care of by the stylesheet.

Rules are important. Unfortunately, they are "hidden" in the SDTM-IG as simple text, which easily leads to different implementations. So, we also started adding some of the rules in machine-readable pseudo code. For example:

In the future, we will add the rules as XQuery code ("Open Rules for CDISC Standards"), which will be enormously useful once the FDA moves away from SAS-XPT and finally accepts Dataset-XML as the file format for submissions. We will use the "SDTMIG v.3.2 Conformance Rules" published by CDISC, unfortunately long after the publication of the IG itself. Ideally, such rules should be published at the same time as the standard itself. Using our approach, they can be published as part of the standard (XML) document itself.
Also note that these rules are already available as XQuery, so we only need to move them from there into the XML for the SDTM-IG.
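To illustrate what a machine-executable rule looks like, here is a small sketch in Python (not the XQuery used in "Open Rules for CDISC Standards"). The rule shown, that a test marked "NOT DONE" should not carry an original result, is chosen as a simple example and is not a transcription of any published conformance rule. The function name and record layout are assumptions.

```python
def check_not_done_has_no_result(domain: str, records: list[dict]) -> list[str]:
    """Flag records where --STAT is 'NOT DONE' but --ORRES is populated."""
    stat_var = f"{domain}STAT"
    orres_var = f"{domain}ORRES"
    messages = []
    for number, record in enumerate(records, start=1):
        if record.get(stat_var) == "NOT DONE" and record.get(orres_var):
            messages.append(
                f"Record {number}: {orres_var} must be empty "
                f"when {stat_var} is 'NOT DONE'")
    return messages


vs_records = [
    {"VSTESTCD": "PULSE", "VSORRES": "72", "VSSTAT": ""},
    {"VSTESTCD": "TEMP", "VSORRES": "36.8", "VSSTAT": "NOT DONE"},  # violation
    {"VSTESTCD": "SYSBP", "VSORRES": "", "VSSTAT": "NOT DONE"},
]
for message in check_not_done_has_no_result("VS", vs_records):
    print(message)
```

When such a rule travels with the standard itself, every implementer runs the same check instead of re-deriving it from narrative text.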

So, we have everything as machine-readable XML, opening a lot of opportunities. But how does this all display when a normal human wants to see the information? In that case, the stylesheet is applied and the result is displayed in the browser. Here are a few screenshots:

And for the "Assumptions":

or the examples:


This looks almost 100% identical to what is seen when inspecting the PDF file.

All this work was done by four undergraduate students as part of a not-all-too-large project.
So you may ask: why did the SDTM team not take this approach?

Honestly: I don't know.
Maybe they never cared about machine-readable standards? Or maybe they prefer "business as usual"?

I often have the strong impression that SDTM-IGs are still being developed using Excel worksheets and Word documents. The better way is surely to develop them using a database, in which each variable, each assumption, each rule, even each example is a row in a table in the database.
Of course one can also use a native XML database.
It then only takes one click to generate the XML version of the IG, and a second to generate the human-readable display of the IG.
That database system could be SHARE. Unfortunately, this is not yet done: SHARE currently only contains non-machine-readable versions of the standards. Instead of being published to SHARE, the standards should be developed within SHARE. This would also enable services (e.g. using RESTful technologies) like those that have so far only been implemented by some visionary volunteers.
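The database-driven approach described above can be sketched with an in-memory SQLite database in Python. The table layout and the XML element names are assumptions made for this illustration, and only variables are covered. Assumptions, rules and examples would get their own tables in the same way.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_ig_xml(conn: sqlite3.Connection) -> ET.Element:
    """Generate an XML tree with one Domain element per domain in the table."""
    root = ET.Element("SDTMIG")
    domains = {}
    query = "SELECT domain, name, label FROM variable ORDER BY domain, position"
    for domain, name, label in conn.execute(query):
        if domain not in domains:
            domains[domain] = ET.SubElement(root, "Domain", {"Name": domain})
        ET.SubElement(domains[domain], "Variable", {"Name": name, "Label": label})
    return root


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variable (domain TEXT, position INTEGER, name TEXT, label TEXT)")
conn.executemany("INSERT INTO variable VALUES (?, ?, ?, ?)", [
    ("EX", 2, "DOMAIN", "Domain Abbreviation"),
    ("EX", 1, "STUDYID", "Study Identifier"),
    ("DM", 1, "STUDYID", "Study Identifier"),
])
root = build_ig_xml(conn)
print(ET.tostring(root, encoding="unicode"))
```

The human-readable view would then be a second, separate step: applying the stylesheet to the generated XML.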

Is this SDTM-IG in XML freely available? Yes and no. It is still a prototype and not 100% ready yet. But yes, it is available to companies that want to fund the further development at my institute at the university. You can find the contact details here.