Wednesday, March 28, 2018

Annotating Clinical Research Protocols

As promised, here is some more information about the protocol annotation tool that I developed over the last few months.

The tool reads protocols as a simple text file, which is then converted to XML. Once loaded, the user can select text, and is then prompted for which coding system to use for the annotation:

Currently, about 20 different coding systems can be used. The main dialog only shows the most needed ones for protocols. Clicking "UMLS - Multiple Coding Systems" displays the other ones:

Now, don't think that we have all these coding system libraries within the software. No, we use RESTful web services to connect to publicly available services, such as the UMLS RESTful web services. Some of these RESTful web services have been developed by us, but they are also publicly available from our public XML4Pharma server (so YOU can use them too).
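
As a rough sketch of what a lookup request to such a service could look like in Java: this is not the tool's actual code, and the base URL and the parameter names `system` and `text` are hypothetical placeholders, not the real interface of the UMLS or XML4Pharma services. It only shows how the selected text and the chosen coding system could be URL-encoded into a GET request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

// Sketch only: BASE_URL and the "system"/"text" parameter names are hypothetical,
// not the documented interface of any real terminology service.
class LookupRequestSketch {

    // Hypothetical placeholder; replace with the documented endpoint of the service you use.
    static final String BASE_URL = "https://example.org/rest/lookup";

    /** Builds the GET request URI for the selected text and the chosen coding system. */
    static URI buildLookupUri(String codingSystem, String selectedText) {
        try {
            // URL-encode both values so spaces and special characters survive the request.
            String query = "system=" + URLEncoder.encode(codingSystem, "UTF-8")
                    + "&text=" + URLEncoder.encode(selectedText, "UTF-8");
            return URI.create(BASE_URL + "?" + query);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is supported by every Java runtime, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The text selected by the user and the coding system picked in the dialog.
        System.out.println(buildLookupUri("CDISC-CT", "pulse rate"));
    }
}
```

The resulting URI could then be sent with any HTTP client. The response format depends entirely on the service and is not shown here.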

Once a coding system is selected, the selected text is submitted to the RESTful web service, which is asked for the best possible match. Here is a short movie about how this works for CDISC test codes:

In this case, trial summary parameters (TSPARMCD) and values (TSVAL) are being derived.
Near the bottom, one can see how each annotation is stored. A UUID is generated to identify the annotation, and it is stored internally together with the text coordinates (start index and length), the coding system and the code. When exported to the XML file, this looks like:
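
The export format itself is not reproduced here. As a minimal sketch of the data described above, an annotation could be modelled in Java as follows. The class, field and XML attribute names are hypothetical and the tool's real model and export format may differ. The code value `PULSE` is used for illustration only.

```java
import java.util.UUID;

// Sketch only: class, field and XML attribute names are hypothetical;
// the tool's real internal model and export format may differ.
class AnnotationSketch {

    static final class Annotation {
        final UUID id;              // generated identifier of the annotation
        final int start;            // start index of the annotated text
        final int length;           // number of annotated characters
        final String codingSystem;  // e.g. "CDISC CT VSTESTCD"
        final String code;          // code returned by the lookup service

        Annotation(int start, int length, String codingSystem, String code) {
            this.id = UUID.randomUUID();
            this.start = start;
            this.length = length;
            this.codingSystem = codingSystem;
            this.code = code;
        }

        /** Returns the annotated span of the protocol text. */
        String annotatedText(String protocolText) {
            return protocolText.substring(start, start + length);
        }

        /** Serializes the annotation as a single (hypothetical) XML element. */
        String toXml() {
            return String.format(
                    "<Annotation id=\"%s\" start=\"%d\" length=\"%d\" codingSystem=\"%s\" code=\"%s\"/>",
                    id, start, length, escape(codingSystem), escape(code));
        }
    }

    /** Escapes the characters that are not allowed in a double-quoted XML attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String protocol = "Vital signs: pulse rate is measured at each visit.";
        String selected = "pulse rate";
        Annotation a = new Annotation(protocol.indexOf(selected), selected.length(),
                "CDISC CT VSTESTCD", "PULSE");
        System.out.println(a.annotatedText(protocol));
        System.out.println(a.toXml());
    }
}
```

Storing only the start index and length keeps the protocol text itself untouched, but it also means the annotations are only valid for exactly that version of the text.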

When selecting "pulse rate":

and again choosing "CDISC Test Codes", one gets:

resulting in an annotation: "CDISC CT VSTESTCD" in the XML:

One can of course also use any of the other coding systems. Here is a short movie about assigning LOINC codes (unfortunately rarely practiced in protocols!) for lab tests:

We also applied the system to annotations for inclusion and exclusion criteria:

This makes it possible to extract the inclusion and exclusion criteria from the "annotated protocol", together with the "trial parameters", and to automatically generate a CTR-XML (CDISC Clinical Trial Registries) file, which can then easily be converted to the SDTM datasets TS and TI.
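
The extraction step can be sketched as follows, assuming each annotation carries a category label next to its text coordinates. The labels `INCLUSION` and `EXCLUSION` and all class names are hypothetical. Generating the actual CTR-XML from the collected criteria is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: the category labels and class names are hypothetical.
class CriteriaExtractionSketch {

    /** An annotated span of the protocol text with a category label. */
    static final class Span {
        final String category;
        final int start;
        final int length;

        Span(String category, int start, int length) {
            this.category = category;
            this.start = start;
            this.length = length;
        }
    }

    /** Collects the annotated text per category, in order of first appearance. */
    static Map<String, List<String>> extract(String text, List<Span> spans) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Span s : spans) {
            result.computeIfAbsent(s.category, k -> new ArrayList<>())
                    .add(text.substring(s.start, s.start + s.length));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Age 18 or older. Pregnant women.";
        String incl = "Age 18 or older";
        String excl = "Pregnant women";
        List<Span> spans = Arrays.asList(
                new Span("INCLUSION", text.indexOf(incl), incl.length()),
                new Span("EXCLUSION", text.indexOf(excl), excl.length()));
        System.out.println(extract(text, spans));
    }
}
```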

Unfortunately, CDISC does not yet encourage the use of SNOMED-CT coding (another "not invented here" case). SNOMED-CT coding, however, can be enormously useful, especially when data need to be retrieved from electronic health records (these usually use LOINC and SNOMED-CT coding).
Here is a short movie about SNOMED-CT coding of some text in the protocol:

REMARK: not all movies use the same version of the software: the software is in constant evolution.


Unfortunately, protocol writers often use simple tables or even embedded pictures to display the study workflow or "schedule of events" (which in this form I usually call the "schedule of disaster"). There is, however, an international standard for workflows: BPMN2 (Business Process Model and Notation), which even has an XML implementation. I have used this standard a lot in the past, e.g. for study design, but I haven't yet seen a single protocol in which BPMN2 is used, and especially not in XML. What a pity!

Future prospects:

If you watched the movies and know something about Machine Learning (ML) and Artificial Intelligence (AI), think about the following: you annotate 10 protocols and then feed them as input to an ML program. As you will know, protocols often look similar or at least have similar elements. Can you then imagine that the ML program can annotate the 11th protocol itself, without human interaction? Some parts, such as the Inclusion/Exclusion Criteria, are easy prey for such ML systems! And then imagine that you also have a metadata repository containing standardized CRFs or CRF templates. With a bit of luck, this combination could make it possible to automate 80% or more of the study design, AND at the same time generate the SDTM trial design datasets and a CTR-XML dataset for submissions to clinical trial registries.


Using a simple Java program in combination with RESTful web services makes it possible to annotate protocols with codes from over 20 coding systems used in medical informatics, healthcare and biology. Such annotations allow for much more precise instructions to the sites on what exactly should be done.
Currently, in many cases, sites interpret the instructions in different ways. For example, when you instruct the site to measure "albumin in blood", you might obtain results of 20 different tests with 30 different units.
I consider this a first step towards the "e-protocol". I don't claim this "annotated protocol" is the best way on the road to an "e-protocol", but at least it is a first step.

Comments are of course very welcome!


  1. For those interested, the slides of my presentation at the CDISC European Interchange (Berlin, April 2018) on this topic can now be found at:

  2. Nice program. I would like to test it.