Saturday, March 14, 2020

Automated Generation of CDASH ODM forms starting from CDISC Library API calls, and thoughts about future CDISC Library features

In the last few days, when needing a break from other (paid) work, I wrote a small Java program that queries the CDISC Library through its RESTful API and generates CDISC ODM forms for all domains and scenarios of the different CDASHIG versions (1.1.1, 2.0 and 2.1).
Additionally, the user can select whether the ODM should be completed with all necessary CodeLists for a given CDISC-CT version (the default being the latest available CDISC-CT version).

The algorithm I used fully exploits the HATEOAS features of the CDISC Library. It starts by iterating over the available domains of the given CDASHIG version, then checks within each domain (using the hyperlinks provided in the RESTful web service response) whether there are "scenarios" for that domain (often the case for "Findings" domains, seldom for others). If so, it again picks up the links, queries each single scenario, and retrieves the information from its "fields". If there are no scenarios, the field information can be retrieved directly from the domain.
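This link-following traversal can be sketched in Java roughly as below. This is a minimal sketch, not my actual program: the "api-key" header name, the exact endpoint paths, and the regex-based link extraction (a real implementation would use a JSON library such as Jackson) are simplifications, and a CDISC Library account with an API key is assumed.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CdashCrawler {
    // Base URL of the CDISC Library API; an account/API key is required.
    static final String BASE = "https://library.cdisc.org/api";

    // Extract all "href" values from a JSON response. A real implementation
    // would use a JSON parser; a regex keeps this sketch dependency-free.
    static List<String> extractHrefs(String json) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = Pattern.compile("\"href\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
        while (m.find()) hrefs.add(m.group(1));
        return hrefs;
    }

    static String get(HttpClient client, String path, String apiKey) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(BASE + path))
                .header("api-key", apiKey)
                .header("Accept", "application/json")
                .build();
        return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String apiKey = System.getenv("CDISC_API_KEY");
        // Start at the domain list for CDASHIG 2.1 and follow the links.
        String domains = get(client, "/mdr/cdashig/2-1/domains", apiKey);
        for (String domainHref : extractHrefs(domains)) {
            if (!domainHref.contains("/domains/")) continue; // skip self/other links
            String domain = get(client, domainHref, apiKey);
            // Scenario links (when present) contain "/scenarios/"; otherwise
            // the domain's "fields" can be read directly from this response.
            for (String href : extractHrefs(domain)) {
                if (href.contains("/scenarios/")) {
                    String scenario = get(client, href, apiKey);
                    // ... read the "fields" from the scenario response
                }
            }
        }
    }
}
```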

When a field has a reference to a codelist, a new query is performed to retrieve the codelist information. The NCI code and the name of the codelist are then combined to generate a CodeList OID (like "CL.C66742.NY"), which is added to the ODM "ItemDef". If the user also wants the complete codelist included, all items in the codelist are retrieved and (at the end) transformed into an ODM-XML structure.
As retrieving codelists with all their items from the CDISC Library is somewhat more time-consuming, a list of unique codelists is maintained, so that the same codelist is never queried twice.
Just as an example, the aforementioned "NY" codelist is referenced 98 times in CDASHIG 2.1.
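The OID generation and the "query each codelist only once" idea could look like the following sketch. The class and method names are hypothetical, and the fetcher parameter stands in for the actual CDISC Library query:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class CodeListCache {
    // Cache keyed by CodeList OID, so that each codelist is fetched only
    // once, even when it is referenced many times (the "NY" codelist is
    // referenced 98 times in CDASHIG 2.1).
    private final Map<String, String> cache = new HashMap<>();

    // Build the ODM CodeList OID from the NCI code and the codelist's
    // short name, e.g. buildOid("C66742", "NY") -> "CL.C66742.NY".
    static String buildOid(String nciCode, String shortName) {
        return "CL." + nciCode + "." + shortName;
    }

    // Return the cached codelist, or fetch it (once) via the given function.
    public String getOrFetch(String oid, Function<String, String> fetcher) {
        return cache.computeIfAbsent(oid, fetcher);
    }
}
```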

In the past, when a new version of a standard like CDASH or SDTMIG came out, it usually cost me a week of evenings (I was still a professor in medical informatics at the University of Applied Sciences in Graz at the time) doing copy-paste work from PDF files, which is not only boring but also very error-prone. With a little programming, I can now generate electronic versions of such standards within minutes.
So, in the future, when e.g. a new SDTMIG is published and becomes available in the CDISC Library, I can update all my systems in just a few minutes, without any copy-paste.
What is still missing, what should be the next steps?

For the CDASHIG, the CDISC Library is missing the "Assumptions for the xx Domain", which is essentially a list of bullets in the PDF version. Would it be necessary/useful in an electronic version? The text in there isn't really machine-interpretable at all, is it? For example, what could a machine do with an "assumption" like: "As required or defined by the study protocol, clinically significant results may need to be reported on the Adverse Event CRF"? The CDISC Library does already contain the "mapping instructions" for each CDASH field, but only in a human-readable version, like "This does not map directly to an SDTMIG variable. For the SDTM submission dataset, concatenate all collected CDASH DATE and TIME components and populate the SDTMIG variable EGDTC in ISO 8601 format".

For a computerized system, this is not very helpful, but the following would probably do:

EGDTC = (ISO8601)concat(EGDAT,'T',EGTIM)

where "(ISO8601)" is a cast, as programmers know it.
Enabling such machine-readable code, however, requires that we agree on a "language", which is hard. As healthcare at large already seems to be doing, we may in the future all agree on HL7 CQL (Clinical Quality Language).
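Whatever the language, the point is that such a rule is directly executable. A hypothetical Java rendering of the rule above (assuming the collected CDASH date and time already come as ISO 8601 parts):

```java
public class SdtmMapping {
    // Hypothetical implementation of the machine-readable rule
    //   EGDTC = (ISO8601)concat(EGDAT,'T',EGTIM)
    // Assumes the collected date and time are already ISO 8601 parts
    // ("2020-03-14", "13:45"); a date without a time is itself valid ISO 8601.
    static String toDtc(String date, String time) {
        if (time == null || time.isEmpty()) return date;
        return date + "T" + time;
    }
}
```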

Another major problem however is how we publish our draft standards for public review. CDISC has already moved from PDF to "wiki sites" for this, which are easier to manage by the development teams, especially in combination with the Jira issue ticketing system. But this does not allow for e.g. "impact analysis". For example, if a new draft version of SDTMIG or Controlled Terminology comes out, how can we find out what the impact on our systems will be? Maybe the new version even has errors and might damage our existing systems upon an update? At the moment, we cannot do any such impact analysis easily, as our drafts are published in such a way that they are only usable for human eye consumption.
Even when a new draft version is published as an Excel file, impact analysis remains very difficult, as it would require several days of programming and applying transformations, and that for each individual reviewer.

A good example where something went pretty wrong in the draft publication mechanism is the recent "LOINC to CDISC mapping" for public review. Don't get me wrong: the CT team did a great job there (it wasn't easy, as the concepts of LOINC codes and CDISC-SDTM-CT are rather different). The draft was, however, published as an Excel file, with different tabs for different (arbitrary) laboratory categories.

What was pretty disastrous is that LBTESTCD (the test code) is missing everywhere. Or was the idea that people don't need codes anyway and only want to see text (in LBTEST)? So, "human eye consumption" only, instead of "machine-readability"?
Anyway, the publication form of the draft makes it pretty hard for me to test it in my SDTM-ETL mapping software, where it really will be of great use!

Just suppose that every draft of a new CDISC standard version, be it a CDISC model, an IG, or controlled terminology, were published to a "special corner" of the CDISC Library, before or synchronized with the "human eye" presentation (PDF, wiki, Excel). Reviewers could then do impact analysis in a fully automated way by querying the CDISC Library, as they already do anyway for the "final" versions of the standards.

A nice example is S-Cubed's "A3 Community MDR", which allows comparing different CDISC-CT versions very quickly. Note that this MDR is not based on the CDISC Library yet. A recent testimony from Erin Muhlbradt, our CDISC-CT guru: "I just wanted to say that I used the Community A3 MDR today to figure out when the MOTEST/TESTCD codelists were deprecated. It took me all of 7 seconds and about 3 of those seconds were spent figuring out which buttons to push. This is massively better than the 10s of minutes it would have taken me otherwise."

Now suppose we could do "diffs" using the A3 Community MDR on a draft CT version, BEFORE it is published as "final", allowing us to automate QC instead of relying on "visual inspection". The reviewers (and of course the CT team itself) would then be able to find issues much more easily and correct them before final publication. Let's not forget: once a version is published as "final", there is no way back to correct issues! That can then only be done in the next version.
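The core of such an automated "diff" is trivial once both versions are machine-readable. A sketch, assuming we have already retrieved the term sets (NCI "C-codes") of one codelist in two CT versions; the codes in the usage example are only illustrative:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CtDiff {
    // Compare the term sets of one codelist in two CT versions and report
    // which terms were added and which were removed - the core operation
    // behind an automated QC "diff" of a draft CT package.
    static Map<String, Set<String>> diff(Set<String> oldTerms, Set<String> newTerms) {
        Set<String> added = new HashSet<>(newTerms);
        added.removeAll(oldTerms);
        Set<String> removed = new HashSet<>(oldTerms);
        removed.removeAll(newTerms);
        return Map.of("added", added, "removed", removed);
    }
}
```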

Having draft versions of CDISC standards in a "special corner" of the CDISC Library would also have the advantage that vendors with systems based on CDISC standards (there are many of them) could much more easily start adapting their systems when a new version of a standard is upcoming. With the current PDF, wiki pages and Excel files, this is barely possible. Vendors could then submit their findings, report ambiguities, and make much better suggestions for improvements than when only using the "human eye". This would lead to much higher quality CDISC standards.

Once published as "final", a vendor could then possibly even deploy the new version with a single click, or just have the application connect to the CDISC Library directly, making the new version available to customers in just a few seconds. In many systems nowadays, users must wait for months …