Saturday, November 25, 2017

Implementing CDASH 2.0: first impressions

Last weekend, I started implementing CDASH 2.0 in my software offerings.

It wasn't easy...

One can download the CDASH 2.0 package from the CDISC SHARE repository. The model and implementation guide come as HTML files (believed to be an improvement over PDF). Additionally, two Excel files are delivered.

Well, Excel files are not really machine-readable, and surely not machine-interpretable. As I want my software programs to be able to "read" CDASH 2.0 in XML (JSON or RDF may have been alternatives) I needed to do something. Unfortunately, SHARE does not provide CDASH 2.0 in any of these formats.
So I used my complete "box of tricks" to at least transform the Excel contents to ODM "ItemDefs" (though not complete ones), which included exporting to CSV, writing some Java code, etc.
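To give an idea of what such a transformation involves, here is a minimal sketch in Python (my actual code was Java and is not shown here). It assumes the Excel sheet was exported to CSV with columns named "Variable", "Label" and "DataType"; those column names, the "IT." OID prefix and the placeholder labels are assumptions for illustration, not what the CDASH 2.0 Excel file really contains. The ODM 1.3 namespace and the ItemDef/Question/TranslatedText structure are standard ODM.

```python
import csv
import io
import xml.etree.ElementTree as ET

ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ET.register_namespace("", ODM_NS)  # serialize ODM as the default namespace


def csv_to_itemdefs(csv_text):
    """Turn CSV rows into a list of (incomplete) ODM ItemDef elements."""
    item_defs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get("Variable") or "").strip()
        if not name:
            continue  # skip rows without a variable name
        item = ET.Element(f"{{{ODM_NS}}}ItemDef", {
            "OID": f"IT.{name}",  # assumed OID convention
            "Name": name,
            "DataType": (row.get("DataType") or "text").strip(),
        })
        # The question text goes into Question/TranslatedText
        question = ET.SubElement(item, f"{{{ODM_NS}}}Question")
        text = ET.SubElement(question, f"{{{ODM_NS}}}TranslatedText",
                             {XML_LANG: "en"})
        text.text = (row.get("Label") or "").strip()
        item_defs.append(item)
    return item_defs


# Hypothetical CSV export; the labels are placeholders, not CDASH texts
sample = (
    "Variable,Label,DataType\n"
    "PRDAT,Placeholder label for PRDAT,date\n"
    "HOSTAT,Placeholder label for HOSTAT,text\n"
)
item_defs = csv_to_itemdefs(sample)
for item in item_defs:
    print(ET.tostring(item, encoding="unicode"))
```

The resulting ItemDefs still lack codelist references and other details, which is why I call them "not complete".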
Using the popular "ODM Study Designer", I then started generating groups of items and composing the forms (it uses simple drag-and-drop), guided by the "CDASH 2.0 Implementation Guide", which now comes as HTML. Here the second problem occurred: in Firefox, the IG did not display correctly and some of the text was lost. In MS IE, the text showed correctly, but the pictures did not. OK, so use both ... (did they really test the generated HTML in different browsers?).

I soon found out that there are a good number of discrepancies between the IG and what is in the Excel file: something like 15 CDASH variables described/used in the IG were not present in the Excel file! These included "PRDAT", "HOSTAT", "FADAT" and many more.
OK, I added them as ODM ItemDefs in my ODM file. I also had some problems understanding exactly which forms, with which content, were defined (but that may be due to my inexperience). So, after somewhat less than 2 days, I managed to get everything together, except for the codelists.

Now the codelists do not come with the CDASH 2.0 distribution. You have to download them separately from the NCI-CDISC codelist website. So I downloaded the latest version of the CDASH-CT (2017-09-29). Fortunately, this is nowadays available as CDISC ODM (it is also still available as Excel, HTML and PDF - why?). I added the codelists to my CDASH ODM, which was by now in pretty good shape, and started assigning codelists to the CDASH items that require controlled terminology. This was a bit tedious, as the codelist identifiers in the ODM files published by CDISC have the format "CL." followed by the NCI code, followed by a dot and then the codelist name as published in the CDASH-IG. For example, the CDASH codelist "EXFLRTU" (Unit of Measure for Flow Rate) is identified by "CL.C78429.EXFLRTU" in the published CDISC controlled terminology. So I needed a lot of manual lookup.
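In hindsight, part of that lookup could be scripted. The following is a small Python sketch, assuming the codelist OIDs have already been collected from the CT file into a list and that they all follow the "CL.<NCI code>.<name>" pattern described above. The function name is my own, and the second OID in the example is a made-up placeholder.

```python
def index_codelists_by_name(oids):
    """Map the codelist name (last OID part) to the full codelist OID."""
    index = {}
    for oid in oids:
        # Split into at most three parts: prefix, NCI code, codelist name
        prefix, nci_code, name = oid.split(".", 2)
        if prefix != "CL":
            raise ValueError(f"Not a codelist OID: {oid}")
        index[name] = oid
    return index


oids = [
    "CL.C78429.EXFLRTU",   # example taken from the text above
    "CL.C00000.EXAMPLE",   # hypothetical placeholder, not a real codelist
]
index = index_codelists_by_name(oids)
print(index["EXFLRTU"])  # prints CL.C78429.EXFLRTU
```

With such an index, assigning a codelist to a CDASH item becomes a lookup by the name used in the CDASH-IG, and any name that is not in the index is immediately flagged as missing.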
After I had added all the codelists from the CDASH-CT, a good number of codelists were still missing. They were mentioned in the CDASH-IG, but not part of the latest CDASH controlled terminology as published by the CDISC-CT team. In most cases, however, I could copy-paste them from the latest published SDTM controlled terminology.
Frightening however was that some of the controlled terminology mentioned in the CDASH-IG is not present anymore in the latest published controlled terminology, neither in the CDASH-CT nor in the SDTM-CT! Fortunately, I have a database with all controlled terminology published since early 2014. So for the referenced controlled terminology that is not present anymore in the latest version (2017-09-29), I looked up the latest version in which it was still present in the CDISC-CT publication. Some examples are:

  • DDTEST: 2017-03-31
  • FATEST: 2016-06-24
  • MITEST: 2016-09-30
So for example, the codelist for FATEST disappeared in all CDISC-CT publications after June 2016. Whether it was deleted, or just simply forgotten is not clear. It is something I will try to find out later.
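The lookup itself is simple once all versions are in one place. Here is a Python sketch of the idea, assuming the database can be reduced to a dictionary from publication date to the set of codelist names in that publication. The data below is a toy reconstruction from the dates mentioned in this post only (real publications contain many more versions and codelists), and the function name is my own.

```python
def last_version_containing(versions, codelist):
    """Return the latest publication date that still has the codelist."""
    present = [date for date, names in versions.items() if codelist in names]
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings
    return max(present) if present else None


# Toy data: only the publication dates named in this post
versions = {
    "2016-06-24": {"FATEST", "MITEST", "DDTEST"},
    "2016-09-30": {"MITEST", "DDTEST"},
    "2017-03-31": {"DDTEST"},
    "2017-09-29": {"EXFLRTU"},
}

for name in ("DDTEST", "FATEST", "MITEST"):
    print(name, last_version_containing(versions, name))
```

A codelist that never appears in any version yields None, which is a useful signal in itself.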

So, what do we have?
We generated a "form" for each "use case" in the implementation guide (which corresponds to a set of rows with subsequent numbers in the tables in the implementation guide), and a "form" for each "example" that was published in the implementation guide. This brings us to a total of 37 forms with 660 different variables and 58 unique codelists.
We will make CDASH 2.0 available in our new version of the "Study Designer", which will be rolled out in early 2018. The users will then have the choice between CDASH versions 1.0, 1.1 and 2.0.



Altogether, it took me 2.5 days to implement CDASH 2.0 in my software, during which I found a good number of errors and discrepancies between the implementation guide and the Excel file that was published. I found codelists that officially do not exist anymore. Whether this relates to communication problems between the CDASH team and the CT team, or is just a plain omission in the latest controlled terminology versions, is not clear to me. Others have also reported problems with controlled terminology governance in the past.

You will think: "but there has been a public review! So how did all these problems pass unnoticed during review?". Personally, I think that the reason is that our standards are still just published as PDF files and Excel worksheets. During the (much too short) public review period, nobody is going through all the pain to convert these into machine-readable files (as I just did), and then test the new standard in their tools. Almost all review is done manually, i.e. by visual inspection, which essentially is far from sufficient.

"A standard is as good as its implementation". When nobody is doing implementation testing during the review period, due to the fact that no machine-readable version is available, it is no wonder that the problems first show up after the "final" publication of the standard.

Without implementation testing, quality assurance is doomed to fail. It is as if I would write source code for a program, do quality control by visually inspecting the source code, and then distribute it to my customers without ever having tested it.

Essentially, it should have been the CDASH team who did the work I did during the last 2.5 days, and this prior to public review. They could of course then still make a PDF or HTML specification available for review, but not an Excel file that is not suited at all for implementation testing. An Excel file is surely better than only the PDF: in the past, it took me over a week to implement a new SDTM-IG in software, doing copy-and-paste from the PDF into an XML file. In the future, we will be able to retrieve real machine-readable information about our standards from SHARE, using a standardized API. But will this information already be available through the API during the public review period?

This got me thinking about the CDISC public review period. In most cases, it is too short, and not suitable for implementation testing.
One of the ideas that is currently supported by some of the CDISC volunteers is to switch to a "Draft Standard for Trial Use" (DSTU) process. It means that the standard is published for "trial use" for a longer time (e.g. 6 months) and that a machine-readable set of files to use in the "trial use" is delivered. In some cases, the machine-readable files can be replaced by a "reference implementation", which is usually "open source". When not everything is perfect, a DSTU-1 version can be followed by a DSTU-2 version, etc.
Essentially, this is the way HL7 works (very successfully) with the FHIR standard. It is also the way new versions of the Java computer language are developed, by the use of reference implementations.

I recognize that such an approach takes considerably more time to complete than the current approach. It would mean that we "slow down" on generating new standards. The result, however, will be a much higher quality of our standards.