Thursday, December 29, 2011

One step further - SDTM in XML

We already submit the metadata of our SDTM submission in XML format (define.xml).
So why not submit the SDTM/SEND/ADaM data themselves as XML?
It would have so many advantages!
Here are just a few (I could name about 100):
  • really vendor-neutral (SAS XPT is not vendor-neutral at all)
  • vendors can develop tools for working with the submissions much more easily, as there are so many libraries for working with XML in Java, C#, C++, Python, PHP, Perl, ... (I do not know of a single library in any of these languages for working with SAS XPT).
  • easier to validate (e.g. using XML-Schema and Schematron)
  • easy to display (through XSLT stylesheets)
  • easy to develop different "views" on the data (through stylesheets)
  • extremely easy to combine different datasets / studies etc.
  • display stylesheets can mark (e.g. by background color) violations of the SDTM-IG rules
  • much more compact than SAS XPT (one of the major complaints of the FDA is that they cannot open large SAS XPT files)
  • ...
But what could such an SDTM dataset look like?
Here is an example:

This is just simple ODM, and can be generated from the source ODM "ClinicalData" that every modern EDC system generates by a simple transformation - no expensive statistical software necessary. It can however also be generated very simply by using statistical software (if that is your preference).
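As a rough illustration of how little machinery such a transformation needs, here is a minimal Python sketch that writes SDTM-like records as ODM "ClinicalData". It is not the snippet referred to above: the function name `records_to_odm`, the OID patterns (`IG.VS`, `IT.VS.VSORRES`), the study OID and the sample values are all invented for illustration, and the flat layout with "ItemGroupData" directly under "ClinicalData" is an assumption made for brevity (standard ODM nests it under SubjectData, StudyEventData and FormData).

```python
import xml.etree.ElementTree as ET

ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"
ET.register_namespace("", ODM_NS)  # serialize ODM as the default namespace


def q(tag):
    """Return the namespace-qualified ElementTree name for an ODM tag."""
    return f"{{{ODM_NS}}}{tag}"


def records_to_odm(domain, records, study_oid="ST.EXAMPLE", mdv_oid="MDV.1"):
    """Write one ItemGroupData per record and one ItemData per variable.

    All OIDs are hypothetical; a real submission would take them from
    the define.xml that describes the datasets.
    """
    odm = ET.Element(q("ODM"))
    clinical = ET.SubElement(
        odm, q("ClinicalData"), StudyOID=study_oid, MetaDataVersionOID=mdv_oid
    )
    for seq, record in enumerate(records, start=1):
        group = ET.SubElement(
            clinical,
            q("ItemGroupData"),
            ItemGroupOID=f"IG.{domain}",
            ItemGroupRepeatKey=str(seq),
            TransactionType="Insert",
        )
        for name, value in record.items():
            ET.SubElement(
                group, q("ItemData"), ItemOID=f"IT.{domain}.{name}", Value=str(value)
            )
    return odm


# Invented sample record, for illustration only
vs_records = [
    {
        "STUDYID": "EXAMPLE",
        "USUBJID": "EXAMPLE-001",
        "VSTESTCD": "HEIGHT",
        "VSORRES": "178",
        "VSORRESU": "cm",
    }
]
odm = records_to_odm("VS", vs_records)
xml_text = ET.tostring(odm, encoding="unicode")
print(xml_text)
```

Only the Python standard library is used here; the same few lines could equally well be written as an XSLT stylesheet or in any other language with an XML library.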

Just a few remarks about the advantages (I will discuss many more in later posts):
  • the first line states 'TransactionType="Insert"'.
    One of the FDA's complaints is that they receive updates of submissions and then must load the complete datasets into their tools again, without any possibility of comparing the old datasets with the new ones. The "TransactionType" mechanism of ODM, however, allows sending only the updates themselves, i.e. only the data points that were changed, and it is always clear what the status of each data point is.
  • this is human-readable! Did you ever try to open SAS XPT files with a tool other than SAS? SAS XPT is binary, so one needs either to write special software (writing software is always expensive) or to use SAS to be able to inspect the contents of the file
  • one can easily develop different stylesheets to get different "views" on the same set of data. With the conventional tools (SASViewer, JMP, ...) you get only one view: the tabular view.
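The incremental-update idea behind "TransactionType" can be sketched in a few lines of Python. This is a deliberately simplified model, not the ODM processing rules themselves: the store is a plain dictionary, the transaction records are hypothetical dictionaries rather than parsed ODM, and audit trails are ignored. Only the transaction type names (Insert, Update, Upsert, Remove) are taken from ODM.

```python
def apply_transactions(store, transactions):
    """Apply incremental changes to a store keyed by (subject, item).

    Simplified assumption: each transaction is a dict with the keys
    "type", "subject", "item" and (except for Remove) "value".
    """
    for tx in transactions:
        key = (tx["subject"], tx["item"])
        kind = tx["type"]
        if kind == "Insert":
            if key in store:
                raise ValueError(f"{key} already exists")
            store[key] = tx["value"]
        elif kind == "Update":
            if key not in store:
                raise KeyError(f"{key} cannot be updated: not present")
            store[key] = tx["value"]
        elif kind == "Upsert":
            store[key] = tx["value"]
        elif kind == "Remove":
            store.pop(key, None)
        else:
            raise ValueError(f"unsupported TransactionType: {kind}")
    return store


# First delivery: everything is an Insert (values invented for illustration)
store = apply_transactions({}, [
    {"type": "Insert", "subject": "EXAMPLE-001", "item": "VSORRES", "value": "178"},
    {"type": "Insert", "subject": "EXAMPLE-002", "item": "VSORRES", "value": "165"},
])

# Later delivery: only the changed data points are sent
store = apply_transactions(store, [
    {"type": "Update", "subject": "EXAMPLE-001", "item": "VSORRES", "value": "179"},
    {"type": "Remove", "subject": "EXAMPLE-002", "item": "VSORRES"},
])
print(store)
```

The second delivery contains two data points instead of the whole dataset, and the receiver knows for each of them whether it was changed or removed.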
You may see redundant information in the XML snippet. This is due to the SDTM standard itself, and ultimately to the restrictions of the SAS XPT format: because of these restrictions, the developers of the SDTM standard had to add extra variables to make some information visible.

So the next step could be to write an extension (similar to define.xml) for submission data in XML that allows us to get rid of much of the redundant information that is now present. This would further reduce the file size of SDTM submission datasets.
But that is new material for another post.

Speaking of file size: it was the FDA that chose SAS XPT in the past (although there were alternatives). The SAS XPT format wastes enormous amounts of disk space. And now the same FDA is complaining about the file size of SDTM submissions! Well, they got what they asked for (i.e. trouble), didn't they?
So it is high time that the FDA started investing in XML knowledge, as XML is the standard for exchange of information worldwide, not only in healthcare, but also in finance, travel, bioinformatics, chemical research, astronomy, etc.

One could now argue that HL7-v3 messages would be a good alternative (they are also based on XML, though in my view they misuse the XML standard). I have already written about why that is not a good idea, and I will write some more about it in one of my next posts.

Tuesday, December 27, 2011

Overloading in define.xml - should it be allowed or forbidden?

Did you like the example from the previous post where metadata information from the study design is copied into (or imported into) the define.xml?
Then realize that some people currently working on define.xml 2.0 want to forbid this.
Their argument is that define.xml "is another animal".
They state that it should be forbidden to use any elements and attributes in define.xml files that are not explicitly mentioned in the specification, even though define.xml is an extension of the CDISC ODM standard.
So they want to forbid the use of "MeasurementUnitRef", "MeasurementUnitDef", "Question", "Alias", and everything else that is not explicitly mentioned in the define.xml specification.
In our interpretation of "extension" however (and this comes from the ODM specification itself), an extension means that one adds stuff, not that one removes stuff.
The ODM specification states:

Requirements for Vendor extensions to the ODM model are:
  1. The vendor must supply a XML Schema fully describing their extended ODM format.
  2. Extended ODM files should reference the proper extension Schema.
  3. The extension may add new XML elements and attributes, but may not render any standard ODM elements or attributes obsolete. Vendor extensions cannot be used for information that is normally expressed using other ODM elements.
  4. All new element and attribute names must use distinct XML namespaces to insure that there are no naming conflicts with other vendor extensions.
  5. Removing all vendor extensions from an extended ODM file must result in a meaningful and accurate standard ODM file.
  6. Vendors should be able to produce ODM files free of any vendor extensions upon request.
Applications that use extended ODM files must also accept standard ODM files.

Requirements 1 and 2 are fulfilled by the define.xml standard. Requirement 3 is not, in my opinion, as the "Name" attribute in ODM is meant for a "free text short description" whereas in define.xml it is abused (?) for the SDTM variable name (enumerated). Instead, "def:Label" is used for what should essentially go into "Name". But that is a minor issue which I hope will be repaired in the next version of define.xml.
Requirement 4 is fulfilled: the additional elements and attributes are in a separate namespace.
The next requirement (requirement 5) is more problematic. Does a define.xml file without the additional (define-specific) attributes and elements still make sense? Partially, I think.
But if the define.xml standard forbids us to use regular ODM elements such as "MeasurementUnit" (reference and definition) or "Question", isn't that data loss, and doesn't this violate requirement 5?
Requirement 6 is not a problem: it is easy to remove extension elements and attributes, e.g. using a simple XSLT stylesheet.
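XSLT is the natural tool for this, but to show how small the task is, here is an equivalent sketch in Python. It removes every element and attribute that lives in a namespace other than the ODM one (keeping the built-in `xml:` namespace, so that `xml:lang` survives). The function name `strip_extensions`, the sample document and the `VL.EXAMPLE` OID are invented for illustration, and the define.xml namespace URI shown is my assumption for define.xml 1.0.

```python
import xml.etree.ElementTree as ET

ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"
DEF_NS = "http://www.cdisc.org/ns/def/v1.0"  # assumed define.xml 1.0 namespace
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace of xml:lang


def strip_extensions(element, keep_ns=ODM_NS):
    """Remove, in place, all elements and attributes outside keep_ns.

    Unqualified attributes (the normal case in ODM) and xml:* attributes
    are kept; child elements in any other namespace are dropped together
    with their content.
    """
    for attr in list(element.attrib):
        if attr.startswith("{") and not attr.startswith(
            (f"{{{keep_ns}}}", f"{{{XML_NS}}}")
        ):
            del element.attrib[attr]
    for child in list(element):
        if child.tag.startswith(f"{{{keep_ns}}}"):
            strip_extensions(child, keep_ns)
        else:
            element.remove(child)
    return element


# Hypothetical extended ODM fragment, for illustration only
sample = f"""
<ODM xmlns="{ODM_NS}" xmlns:def="{DEF_NS}">
  <ItemDef OID="I_HEIGHT" Name="Height" DataType="float" def:Label="Height">
    <Question><TranslatedText xml:lang="en">Height</TranslatedText></Question>
    <def:ValueListRef ValueListOID="VL.EXAMPLE"/>
  </ItemDef>
</ODM>
""".strip()

root = strip_extensions(ET.fromstring(sample))
item = root.find(f"{{{ODM_NS}}}ItemDef")
print(sorted(item.attrib))
```

After stripping, the ItemDef keeps its standard ODM attributes and its "Question" child, which is exactly the "meaningful and accurate standard ODM file" that requirement 5 asks for.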

What about the requirement "Applications that use extended ODM files must also accept standard ODM files"?
That is highly problematic! This would require applications such as those used by the FDA to accept standard ODM files, such as define.xml files that contain the snippet from the previous post.
But some of the define.xml team would like to see that elements such as "MeasurementUnit" and "Question" are marked by receiving applications as "non-standard"! Isn't that in full contradiction and conflict with the above rule?

What I want to say is that if one decides to define define.xml as an ODM extension, then the result should obey the rules for ODM extensions (as given above).
If one wants to define define.xml as a "restriction" of ODM, one should not use the extension mechanism, but write a completely new XML Schema, not based on ODM, as ODM does not provide a mechanism for restrictions.

The next post will probably be about using ODM-XML for submitting SDTM, SEND and ADaM data to the FDA.

Sunday, December 25, 2011

A first simple example of end-to-end

Today, we will look at a simple example of CDISC end-to-end.
Suppose we have a question about the height of a subject in our CRF. As the study has been deployed in the US, in Germany, in France and in Korea, there are some translations. Therefore the ODM study design looks like:

Note that I collapsed some of the XML nodes (the ErrorMessage nodes).
Now have a look at some of the annotations (marked with a star). The first is the attribute "SDSVarName" stating that the captured value will later go into the SDTM variable "VSORRES".
The second states that in some deployments, the units that will be used are inches, and in other deployments, the units that will be used are centimeters. This information will later go into the SDTM variable "VSORRESU".
Also note that it is not stated under which circumstances which units will be used. This feature will be added to the next version of ODM.
The third annotation is especially important. It is the ODM "Alias" element, which can be used to state a synonym for the item in a non-ODM context. In this case it states that in SDTM, the item will be used for the combination of VSORRES with VSTESTCD=HEIGHT.
Brilliant, isn't it? All the information for generating the SDTM records is already present! And that at study design time!
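To make this concrete, here is a small Python sketch that reads these three annotations from an ItemDef. The ItemDef below is a reconstruction for illustration, not the actual study design: the OIDs (`I_HEIGHT`, `MU.IN`, `MU.CM`) and in particular the text used in the Alias "Name" attribute are assumptions. Only the element and attribute names (SDSVarName, MeasurementUnitRef, Alias with Context and Name) come from ODM.

```python
import xml.etree.ElementTree as ET

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

# Hypothetical ItemDef; OIDs and the Alias Name text are invented
item_def_xml = """
<ItemDef xmlns="http://www.cdisc.org/ns/odm/v1.3"
         OID="I_HEIGHT" Name="Height" DataType="float" SDSVarName="VSORRES">
  <Question><TranslatedText xml:lang="en">Height</TranslatedText></Question>
  <MeasurementUnitRef MeasurementUnitOID="MU.IN"/>
  <MeasurementUnitRef MeasurementUnitOID="MU.CM"/>
  <Alias Context="SDTM" Name="VSORRES where VSTESTCD=HEIGHT"/>
</ItemDef>
""".strip()


def sdtm_annotations(item_def):
    """Collect the SDTM-related annotations of a single ODM ItemDef."""
    return {
        "item_oid": item_def.get("OID"),
        # Target SDTM variable for the collected value
        "sds_var_name": item_def.get("SDSVarName"),
        # Candidate units, which would later populate VSORRESU
        "unit_oids": [
            ref.get("MeasurementUnitOID")
            for ref in item_def.findall("odm:MeasurementUnitRef", NS)
        ],
        # Synonyms of the item in the SDTM context
        "sdtm_aliases": [
            alias.get("Name")
            for alias in item_def.findall("odm:Alias", NS)
            if alias.get("Context") == "SDTM"
        ],
    }


annotations = sdtm_annotations(ET.fromstring(item_def_xml))
print(annotations)
```

A mapping tool can run this over all ItemDefs of a study design and has, before the first patient is enrolled, a table of which CRF item feeds which SDTM variable.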

The units of measurement are further defined in more detail in the ODM "MeasurementUnit" elements:

Here too, we could add an Alias, e.g. to add the UCUM equivalents of the units. UCUM is a worldwide standard for units of measurement, and is used in particular in electronic health records, such as those based on HL7 CDA.

Now, how will all this go into SDTM?
The information is already present at study design, so a define.xml can already be generated at this stage (and not as the last step in the process, as some vendors and processes still do). Also, the first (or prototype) SDTM datasets can be generated as soon as the first data come in. This has the great advantage that one can start doing quality control on the very first data points. This can save huge amounts of time and money!

Part of the define.xml for this study (which can already be generated at study design time!) can then look like:

It references a codelist for the different test codes, and even more important, references a "ValueList" stating the metadata for each of the tests, such as codelists and ... units of measurement.
The ValueList looks e.g. like:

It states that for VSTESTCD "HEIGHT", there is an ItemDef "I_HEIGHT" which defines the metadata for this test.

Now guess what this metadata definition for the test code "HEIGHT" looks like. Here it is:

You may now scroll up to the top of this article and compare this metadata definition for test code "HEIGHT" in define.xml, with the metadata definition for the CRF for the datapoint "Height".

Did you compare? Here is the result:


This is exactly what we mean by CDISC end-to-end!
The metadata definition for the CRF datapoint "Height" is just copied "as is" into the define.xml, containing the metadata for submission to the FDA.
No transformations necessary!
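The "copy as is" step can be sketched in Python as follows. Both documents below are minimal stand-ins, not the real study design or define.xml, and the function name `copy_item_def` and all OIDs are hypothetical. The point is only that the ItemDef is cloned unchanged, without any mapping logic.

```python
import copy
import xml.etree.ElementTree as ET

ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"


def q(tag):
    """Return the namespace-qualified ElementTree name for an ODM tag."""
    return f"{{{ODM_NS}}}{tag}"


def copy_item_def(study_root, define_root, item_oid):
    """Clone the ItemDef with the given OID into the define's MetaDataVersion."""
    target = define_root.find(f".//{q('MetaDataVersion')}")
    for item in study_root.iter(q("ItemDef")):
        if item.get("OID") == item_oid:
            clone = copy.deepcopy(item)  # unchanged copy, no transformation
            target.append(clone)
            return clone
    raise KeyError(f"ItemDef {item_oid} not found in study design")


# Minimal stand-in documents, invented for illustration
study_root = ET.fromstring(
    f'<ODM xmlns="{ODM_NS}"><Study OID="ST.EXAMPLE">'
    '<MetaDataVersion OID="MDV.DESIGN" Name="Study design">'
    '<ItemDef OID="I_HEIGHT" Name="Height" DataType="float" SDSVarName="VSORRES">'
    '<Question><TranslatedText xml:lang="en">Height</TranslatedText></Question>'
    "</ItemDef></MetaDataVersion></Study></ODM>"
)
define_root = ET.fromstring(
    f'<ODM xmlns="{ODM_NS}"><Study OID="ST.EXAMPLE">'
    '<MetaDataVersion OID="MDV.DEFINE" Name="Define"/>'
    "</Study></ODM>"
)

clone = copy_item_def(study_root, define_root, "I_HEIGHT")
original = study_root.find(f".//{q('ItemDef')}")
print(ET.tostring(clone) == ET.tostring(original))
```

Comparing the serialized original and clone is the programmatic version of "scroll up and compare": the two definitions are identical.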

We can now also copy the "MeasurementUnit" elements from the ODM study design to the define.xml. We do not need to transform them into a codelist. They simply remain valid in the define.xml as well.

In a future article, we will give further examples of CDISC end-to-end, and of how we can reuse information from the study design in the metadata that is sent to the FDA.

Saturday, December 10, 2011

What is CDISC end-to-end?

If we want to discuss something, we first need to define what we are discussing.

CDISC end-to-end is the long-standing idea of using a single transport format from clinical research protocol design to submission to the authorities (especially the FDA).
It is however also the idea of re-using information from (earlier) electronic submissions to the FDA for the design or setup of new or follow-up clinical studies.

Until now, CDISC end-to-end was mostly realized and implemented using the CDISC ODM standard and its extensions, such as SDM-XML (Study Design Model in XML) and define.xml (officially "Case Report Tabulations Data Definition Standard").

The next step would be to also develop an ODM-based transport format for submitting the (SDTM/SEND/ADaM) data to the FDA.
The advantages would be tremendous:
  • get rid of the ancient SAS XPT format which has huge disadvantages
  • vendor neutral (SAS XPT isn't)
  • fits seamlessly with define.xml
  • can be inspected (by use of a stylesheet) in a browser - no further tools necessary
  • a stylesheet can mark inconsistencies in the data and violations of the SDTM/SEND standards
  • reduced file size relative to SAS XPT (the FDA currently has enormous problems opening large SAS XPT files)
  • extremely easy to feed to a database or data warehouse
  • analysis tools can easily be developed, as it is all XML
  • and many many more ...
There are, however, people in CDISC and at the FDA who do not want such an end-to-end format, and especially not for the "last mile" to the FDA. They argue that submission data is "something totally different".
In my opinion however, submission data is also clinical data, even when it contains some derived data and has been "categorized" into "SDTM drawers".