Wednesday, November 20, 2013

Submissions in XML - Supplemental Qualifiers

Maybe you will say: "Wait a minute Jozef, it is not allowed by the SDTM Standard to leave the supplemental qualifiers in the original domain!".
You are right - it is not allowed YET.
So sponsors will probably continue to generate SUPPQUAL datasets for some time. SDS-XML supports this as well, since it can contain any tabular clinical data.

One of the major problems reviewers had was that, when looking at a supplemental qualifier data point in the SASViewer, it was not easy to quickly find the corresponding record in the parent dataset.

So how does the "Smart SDS-XML Viewer" deal with this?

Below you will find a few screenshots of the example files that came with the define.xml 2.0 specification (but now formatted as SDS-XML). The example contains 10 supplemental qualifier datasets, of which 3 are for the QS domain only (SUPPQSCG, SUPPQSCS and SUPPQSMM).

Let us now look at the SUPPAE dataset. For a good number of the subjects, it has one or more records.

So how can we now quickly see the parent record in the parent dataset?
The "Smart SDS-XML Viewer" makes this easy through the menu "Tools - Show parent record of SUPPQUAL record". Just select a record (or a single cell) in the SUPPAE table and then use that menu. A message shows up:

saying that it found a parent record in the AE dataset (note that there can be more than one parent record, e.g. in case IDVAR is a --CAT variable). Now just click OK or press the Return key. The software automatically switches to the AE table, and selects and highlights the parent record:

This can also be achieved using the keyboard shortcut CTRL-S ("S" standing for SUPPQUAL). One can then easily toggle between the SUPPAE table and the AE table by clicking the tab with the mouse, by using the menu "Tools - View - Last selected table", or, even simpler, by using the keyboard shortcut CTRL-B ("B" for back). So CTRL-B just toggles between the two tables.

The sample file also contains SUPPQUAL datasets whose records do not point to an individual parent record, but to a set of records in the parent dataset. An example is the SUPPQSCG dataset, for which IDVAR has the value "QSCAT". For the first record, for instance, this means that the supplemental qualifier refers to all records in the QS dataset for which QSCAT is "CORNELL SCALE FOR DEPRESSION IN DEMENTIA (CSDD)".
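The lookup described above can be sketched in a few lines of Python. This is only a minimal illustration, not the viewer's actual code: it assumes each dataset has already been loaded as a list of dictionaries, it uses the standard SUPPQUAL variables USUBJID, IDVAR and IDVARVAL, and the function name and sample values are made up for the example.

```python
def find_parent_records(supp_record, parent_records):
    """Return all parent records that a SUPPQUAL record points to (hypothetical helper)."""
    idvar = supp_record.get("IDVAR")
    matches = []
    for rec in parent_records:
        # The parent must belong to the same subject
        if rec["USUBJID"] != supp_record["USUBJID"]:
            continue
        # If IDVAR is populated, the parent's value for that variable must equal IDVARVAL.
        # Values are compared as text, because IDVARVAL is always stored as text.
        if idvar and str(rec.get(idvar, "")) != str(supp_record["IDVARVAL"]):
            continue
        matches.append(rec)
    return matches


# Made-up sample data: a tiny QS dataset and two supplemental qualifier records
qs = [
    {"USUBJID": "001", "QSSEQ": 1, "QSCAT": "CSDD"},
    {"USUBJID": "001", "QSSEQ": 2, "QSCAT": "CSDD"},
    {"USUBJID": "001", "QSSEQ": 3, "QSCAT": "MMSE"},
    {"USUBJID": "002", "QSSEQ": 1, "QSCAT": "CSDD"},
]
supp_by_cat = {"RDOMAIN": "QS", "USUBJID": "001", "IDVAR": "QSCAT", "IDVARVAL": "CSDD"}
supp_by_seq = {"RDOMAIN": "QS", "USUBJID": "001", "IDVAR": "QSSEQ", "IDVARVAL": "3"}

parents_by_cat = find_parent_records(supp_by_cat, qs)  # a set of parent records
parents_by_seq = find_parent_records(supp_by_seq, qs)  # a single parent record
print(len(parents_by_cat), len(parents_by_seq))
```

When IDVAR is a --SEQ variable the function returns a single parent record; when it is a --CAT variable it returns every record of that subject in that category.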

Selecting the fourth record and then using CTRL-S (or using the menu), then gives:

stating that there are 78 parent records, i.e. 78 records for that subject for which the QSCAT applied.
Hitting the OK button then brings us to the QS table and selects and highlights these 78 parent records:

Cool, isn't it? Try to do this using the SASViewer ...

Saturday, November 16, 2013

Submissions in XML - first results

As already stated in my previous post, creating SDTM/SEND/ADaM datasets in XML has a considerable number of advantages. Today I want to demonstrate a few of them as implemented in the "Smart SDS-XML Viewer", a software tool that we will make available as "open source" when the final specification is published by CDISC.

The first advantage I want to demonstrate is real linking between datasets (also known as "joins"). For example, the FDA has insisted that the DM domain contains the information about the first and last treatment date/time, even though this information is already present in the EX dataset (first record of EXSTDTC and last record of EXENDTC). So the variables RFXSTDTC and RFXENDTC were created in the DM domain. That this violates the third normal form for good relational database design was obviously not taken into account. The reason the FDA wanted this is that their tool (SASViewer) was not able to link records in the EX and the DM datasets. The SASViewer can only read SAS-Transport-5 files but has no idea what they mean.

With the new XML-based format, linking between datasets becomes very easy. It took me less than an hour to implement such a lookup for the date/time of first and last treatment, so that this information appears as a tooltip on USUBJID in the DM table. Here is the screenshot:

Both the dates were taken from the EX dataset and are displayed in a very user friendly format.
One thing that could easily be added (I haven't done so) is the number of days between both dates, shown as a fourth line in the tooltip. Programming this would take me about 15 minutes, I guess. If I had to do this from SAS-Transport-5 files however, I don't think I would have a chance (at least not without having to use SAS).
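To give an idea of how little is needed for such a lookup, here is a minimal Python sketch. It is not the code of the viewer: it assumes the EX records are available as a list of dictionaries, it takes the earliest EXSTDTC and the latest EXENDTC instead of relying on record order, and it only handles complete ISO 8601 dates (partial dates would need extra care). The function name and the sample values are invented for the illustration.

```python
from datetime import date


def exposure_summary(ex_records, usubjid):
    """Return first/last exposure date and the days in between for one subject."""
    # Keep only the date part (first 10 characters) so that date/times also work
    starts = [date.fromisoformat(r["EXSTDTC"][:10])
              for r in ex_records
              if r["USUBJID"] == usubjid and r.get("EXSTDTC")]
    ends = [date.fromisoformat(r["EXENDTC"][:10])
            for r in ex_records
            if r["USUBJID"] == usubjid and r.get("EXENDTC")]
    if not starts or not ends:
        return None  # nothing to show in the tooltip
    first, last = min(starts), max(ends)
    return {"first": first, "last": last, "days": (last - first).days}


# Made-up EX records for two subjects
ex = [
    {"USUBJID": "001", "EXSTDTC": "2013-01-05", "EXENDTC": "2013-01-19"},
    {"USUBJID": "001", "EXSTDTC": "2013-01-20T08:00", "EXENDTC": "2013-02-02"},
    {"USUBJID": "002", "EXSTDTC": "2013-03-01", "EXENDTC": "2013-03-15"},
]
summary = exposure_summary(ex, "001")
print(summary)
```

The "days" entry is the fourth tooltip line mentioned above: the number of days between the first and the last exposure date.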

A second advantage I would like to demonstrate is that supplemental qualifiers can now easily be kept in the parent domain. Most tools that generate SDTM datasets keep the supplemental qualifiers in the parent domain until the very last moment before the SAS-Transport-5 datasets must be generated. At that moment they are split off and "banned" to a separate SUPPQUAL domain, such as the SUPPDM domain in case these are additional qualifiers for DM. The reason (I guess) that this was a requirement in SDTM is that there was no way to indicate in the dataset itself that a variable is a supplemental qualifier.
With the new format however, this is not necessary at all anymore. If the supplemental qualifier is marked as such in the define.xml, there is no reason anymore to "ban" it to another dataset. Software can then make sure that these variables are marked as being supplemental, e.g. by a different color.

The following screenshot shows how this has been done in the "Smart SDS-XML Viewer". In our study, there are 6 supplemental qualifiers for DM. Instead of "banning" them to a SUPPDM dataset, they were simply retained in the DM dataset. In the define.xml, they have been marked as supplemental by setting the value of the "Role" attribute to "SUPPLEMENTAL QUALIFIER". As the software also reads the metadata, it knows what to do with these variables; in this case, it colors them blue.
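How software can pick up that flag from the metadata can be sketched as follows. This is an illustration under assumptions, not the viewer's code: the XML fragment is a hand-written, simplified piece of define.xml, the OIDs are hypothetical, and the ODM 1.3 namespace is assumed. Only the "Role" value "SUPPLEMENTAL QUALIFIER" comes from the text above.

```python
import xml.etree.ElementTree as ET

# Assumed namespace (ODM 1.3); the final format may use a different one
ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"

# Simplified, hand-written define.xml fragment with hypothetical OIDs
DEFINE_FRAGMENT = """
<ItemGroupDef xmlns="http://www.cdisc.org/ns/odm/v1.3" OID="IG.DM" Name="DM" Repeating="No">
  <ItemRef ItemOID="IT.DM.USUBJID" Mandatory="Yes" Role="IDENTIFIER"/>
  <ItemRef ItemOID="IT.DM.AGE" Mandatory="No" Role="RECORD QUALIFIER"/>
  <ItemRef ItemOID="IT.DM.COMPLT16" Mandatory="No" Role="SUPPLEMENTAL QUALIFIER"/>
</ItemGroupDef>
"""


def supplemental_item_oids(xml_text):
    """Return the ItemOIDs of all variables flagged as supplemental qualifiers."""
    root = ET.fromstring(xml_text.strip())
    return [ref.get("ItemOID")
            for ref in root.iter(f"{{{ODM_NS}}}ItemRef")
            if ref.get("Role") == "SUPPLEMENTAL QUALIFIER"]


supp_oids = supplemental_item_oids(DEFINE_FRAGMENT)
print(supp_oids)  # the columns a viewer would color differently
```

A viewer then only needs to map these OIDs to the table columns and give those columns a different color.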

A third advantage I would like to demonstrate is the lifting of the 8-, 40-, and 200-character limitations, which caused so much pain in the past. In the following screenshot, the label for the variable COMPLT16 is displayed as a tooltip when the user hovers the mouse over the column header. In SAS-Transport-5, there was a 40-character limitation for labels, which we can now get rid of.

Similarly, we can now get rid of the 200-character limitation. The SDTM forces us to split values with more than 200 characters into different variables and even different datasets (also here, there is a "banning" to a SUPPQUAL dataset). The splitting even has to be done between words, and not in the middle of a word. In the CO domain (Comments domain), comment values have to be split and distributed over different variables, e.g. over COVAL, COVAL1 and COVAL2.
None of this is needed when using the new format. We do not split anything, as there is no reason anymore to do so.
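For readers who have never had to do it, the splitting that SAS-Transport-5 forces on us looks roughly like the sketch below. It is only an illustration: it uses Python's standard textwrap module to break the text between words, the function name is invented, and the comment text is artificial.

```python
import textwrap


def split_comment(text, limit=200):
    """Split a long comment between words into pieces of at most `limit` characters."""
    chunks = textwrap.wrap(text, width=limit)
    # First piece goes into COVAL, the following ones into COVAL1, COVAL2, ...
    names = ["COVAL"] + [f"COVAL{i}" for i in range(1, len(chunks))]
    return dict(zip(names, chunks))


# Artificial comment of 599 characters (120 four-letter words separated by spaces)
comment = " ".join(["word"] * 120)
pieces = split_comment(comment)
print(list(pieces.keys()))
```

Each piece stays within the 200-character limit and no word is cut in two, but the reader of the data has to glue the pieces together again.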
The following snapshot shows a record in the CO dataset as displayed by the "Smart SDS-XML Viewer":

Sunday, November 10, 2013

Submissions in XML

The CDISC XML Technologies Team is currently working on an ODM-XML based format for SDTM, SEND and ADaM submissions (or exchange). This format is envisaged to replace the old SAS Transport 5 (SAS-XPT, .xpt) format. The immediate advantages are obvious:
  • no more 8-, 40-, and 200-character limitations 
  • limitations on test codes disappear (these e.g. prevented us from using a LOINC code for LBTESTCD, as a LOINC code starts with a number, like 1234-5)
  • supplemental qualifiers can remain in their parent domain/dataset. All that needs to be done is to flag them as such in the define.xml file (e.g. using Role="SUPPLEMENTAL QUALIFIER")
  • no more splitting of information over different fields (the COVAL1, COVAL2, ... disaster) or even datasets ("banning" of information with over 200 characters to supplemental qualifier datasets)
  • perfect fit with define.xml 1.0 and 2.0: validation of SDTM/SEND/ADaM datasets against the define.xml now really becomes a piece of cake. Both the metadata (define.xml) and the data (the new format) now use the same format - both are extended ODM.
These are the advantages for the use of the standards (SDTM, SEND, ADaM) themselves. Further great advantages are:
  • we can now really achieve an end-to-end process, as we now have one format to transport information from study design to submission. Of course the contents will differ, but at least we do not need to switch between formats (and technologies) during the process
  • XML is the format of choice for exchange of information in the modern world. This also means that an enormous amount of software programs and software libraries are available for working with XML
  • real vendor-neutrality: as ODM is an open standard (SAS-XPT was semi-open and very hard to implement in software), anyone with some basic XML knowledge can now develop great software with great features that work with SDTM/SEND/ADaM datasets. In the >20 years of SAS-XPT for SDTM, I haven't seen a single successful third-party software program using it.
As the industry will need a transition period, the XML Technologies Team will also provide some tools like:
  • tools to transform existing SAS-XPT datasets into the new format 
  • tools to transform files in the new format to the old SAS-XPT format (but who would like to do so?) 
  • tools or scripts for loading the datasets in popular statistical software packages 
  • a viewer for inspecting datasets in the new format 
Development of that viewer is my task in the team. I called it the "Smart SDS-XML Viewer" as the name of the new standard will probably be "SDS-XML" and "smart" as the viewer will have capabilities and features that will go far beyond what the SASViewer could do.
The latter was just a viewer for SAS-XPT files. It was not "SDTM-savvy": it did not even understand what SDTM is about or how it works.

The picture below shows a few of the first features that have been implemented so far:

  • simple SDTM/SEND/ADaM validation such as uniqueness of the USUBJID in the DM dataset
  • check whether the subject is really present in the DM dataset
  • validation whether all required/expected fields really have a value 
  • validation of dates: is the date a real existing date (2013-03-32 is not), does RFENDTC really come after RFSTDTC? 
  • calculation of age from BRTHDTC (when present) and RFSTDTC and checking against the value given in AGE 
  • display of "date of first study medication exposure" and "date of last study medication exposure" as retrieved from the EX dataset in the DM dataset. The latter means that we can now remove RFXSTDTC and RFXENDTC from the DM domain - they should never have been there as they are copied from EX
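A few of the checks in the list above can be sketched in Python as follows. This is a simplified illustration and not the viewer's implementation: a DM record is assumed to be a dictionary, only complete ISO 8601 dates are handled (a partial date would be reported as invalid here), and the function names and sample values are made up.

```python
from datetime import date


def parse_iso_date(value):
    """Return a date, or None when the value is not a real calendar date."""
    try:
        return date.fromisoformat((value or "")[:10])
    except ValueError:
        return None


def age_in_years(birth, reference):
    """Completed years between the birth date and the reference date."""
    years = reference.year - birth.year
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1  # birthday not yet reached in the reference year
    return years


def check_dm_record(rec):
    """Return a list of human-readable issues found in one DM record."""
    issues = []
    parsed = {}
    for name in ("BRTHDTC", "RFSTDTC", "RFENDTC"):
        parsed[name] = parse_iso_date(rec.get(name))
        if rec.get(name) and parsed[name] is None:
            issues.append(f"{name} is not an existing date: {rec[name]}")
    start, end, birth = parsed["RFSTDTC"], parsed["RFENDTC"], parsed["BRTHDTC"]
    if start and end and end < start:
        issues.append("RFENDTC is before RFSTDTC")
    if birth and start and rec.get("AGE") is not None:
        calculated = age_in_years(birth, start)
        if calculated != rec["AGE"]:
            issues.append(f"AGE is {rec['AGE']} but calculated age is {calculated}")
    return issues


# Made-up DM record with an impossible end date and a wrong age
dm_record = {"USUBJID": "001", "BRTHDTC": "1950-06-15",
             "RFSTDTC": "2013-03-01", "RFENDTC": "2013-03-32", "AGE": 63}
issues = check_dm_record(dm_record)
for issue in issues:
    print(issue)
```

For this record, the sketch reports the non-existing end date and the age mismatch (the subject had not yet turned 63 on RFSTDTC).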
The second screenshot shows how easily supplemental qualifiers can now be visualized: the picture shows the right side of the DM table where the supplemental DM qualifiers are shown.

The columns containing these are colored somewhat differently (that information is retrieved from the define.xml). For ease of use, the USUBJID column has been shifted.

Other features that have been implemented, but which can be better demonstrated using a movie (soon to come, stay tuned) are one-click "jumping" to the corresponding record in the DM dataset (and back), one-click jumping from a comment record in CO to its parent record in another dataset and few-click jumping from a RELREC record to its parent records.
Of course the software also allows sorting and filtering. For example, one can first load the DM dataset, filter all subjects above a certain age, and then load other datasets for those subjects only. This feature will probably make the life of reviewers much easier.

Another small feature I implemented is highlighting of values (--STRESN) that are outside the reference range (defined by --STNRLO and --STNRHI) for all findings datasets.
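The rule behind this highlighting is simple, as the following minimal sketch shows. It is an illustration only: a findings record is assumed to be a dictionary with numeric values, the two-letter domain prefix (e.g. "LB") is passed in, and the function name and sample values are invented.

```python
def is_out_of_range(record, prefix):
    """True when --STRESN lies outside the range given by --STNRLO and --STNRHI."""
    value = record.get(f"{prefix}STRESN")
    low = record.get(f"{prefix}STNRLO")
    high = record.get(f"{prefix}STNRHI")
    if value is None:
        return False  # no numeric result, nothing to highlight
    if low is not None and value < low:
        return True
    if high is not None and value > high:
        return True
    return False


# Made-up LB records
lb = [
    {"LBTESTCD": "GLUC", "LBSTRESN": 7.9, "LBSTNRLO": 3.9, "LBSTNRHI": 5.6},
    {"LBTESTCD": "SODIUM", "LBSTRESN": 140, "LBSTNRLO": 135, "LBSTNRHI": 145},
    {"LBTESTCD": "ALB", "LBSTRESN": None, "LBSTNRLO": 35, "LBSTNRHI": 50},
]
flags = [is_out_of_range(rec, "LB") for rec in lb]
print(flags)
```

A viewer would then simply render the cells for which the function returns True in a different color.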

Now you will probably ask what the cost of this viewer software will be. The answer is "nothing". It will become available for free as "open source" with a license similar to that of OpenCDISC. So reviewers at the FDA will be able to use it for free from day 1, and users at sponsor companies will have the same tool available as what the FDA reviewers are using. Even more important: as the tool will be open source, everyone can extend it, add great new features, for example for analysis, visualization, etc.

Stay tuned for more information and the public release announcement!