Tuesday, April 21, 2015

The Dataset-XML FDA Pilot - report and first comments

The FDA very recently published a report on their pilot of the use of Dataset-XML as an alternative to SAS-XPT (SAS Transport 5) for SDTM, SEND and ADaM submissions. Here is a short summary of this report together with my comments.

The pilot was limited to SDTM datasets. The FDA departments that were involved were CDER and CBER. The major objectives of the pilot were:
  • ensuring that data integrity was maintained when going from SAS datasets (sas7bdat) to Dataset-XML and back.
  • ensuring that the Dataset-XML format supports longer variable names, labels and text fields than SAS-XPT does (XPT has limitations of 8 characters for names, 40 for labels and 200 for text fields).
Comment: Unfortunately, the following was not part of the testing in the pilot:
  • the capability of transporting non-ASCII-7 characters (another limitation of XPT)
Six sponsors were selected out of fourteen candidates. The selection criteria can be found in the report and will not be discussed here.

Two types of tests were performed:
a) test whether the Dataset-XML files can be transformed into sas7bdat and can be read by FDA data analysis software (e.g. JMP)
b) test whether data integrity is preserved when converting sas7bdat files to Dataset-XML files and then back to sas7bdat files

Comment: Although the report doesn't mention this at all, I heard that one of the submissions was generated using Java/Groovy. This also proves that correct Dataset-XML files can be generated by tools other than statistical software (which was pretty hard with XPT). In my opinion, this is a valuable result that should have been mentioned.

Further, sponsors were asked to submit Dataset-XML files that contain variable names longer than 8 characters, variable labels longer than 40 characters, and text content larger than 200 characters. The goal of this was to test whether Dataset-XML files can (sic) "facilitate a longer variable name (>8 characters), a longer label name (>40 characters) and longer text fields (>200 characters)".

Comment: well, that is something we have already known for many, many years ...

Issues found by the FDA

During the pilot, a number of issues were encountered, all of which could be resolved.
  • Initially, testing was not successful due to a memory issue caused by the large dataset size. This was resolved after the SAS tool was updated to address the high memory consumption.
Comment: Well-designed modern software that parses XML should not use more memory than software parsing text files or SAS-XPT. See e.g. my comments about memory usage with technologies like VTD-XML in an earlier blog. It is a myth that processing large XML files necessarily consumes much memory; XML technologies like SAX are even known for using only small amounts of memory. The issue could, however, quickly be resolved by the SAS people who were cooperating in the pilot (more about that later).
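As an illustration, here is a minimal sketch (in Python, with a hypothetical file name) of how a large Dataset-XML file can be parsed in a streaming fashion so that complete records never accumulate in memory, independent of the file size. It assumes the usual ODM "ItemGroupData" record elements of Dataset-XML.

import xml.etree.ElementTree as ET

ODM_NS = "{http://www.cdisc.org/ns/odm/v1.3}"

def count_records(path):
    # stream through the file, counting ItemGroupData records and
    # discarding each element as soon as it has been processed
    count = 0
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == ODM_NS + "ItemGroupData":
            count += 1
        elem.clear()  # free the element again, keeping memory usage low
    return count

print(count_records("lb.xml"))  # hypothetical 66 MB LB Dataset-XML file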
  • Encoding problems in a define.xml file
Comment: this has nothing to do with Dataset-XML itself. What happened was that a define.xml used curly quotes ("MS Office quotes") as the delimiters of XML attributes. Probably the define.xml was created either by copy-paste from a Word document or generated from an Excel file. These "curly quotes" are non-standard and are certainly not allowed as attribute delimiters in XML.
Generating define.xml from Excel files or Word documents is extremely bad practice. See my blog entry "Creating define.xml - best and worst practices". Ideally, define.xml files should be created even before the study starts, e.g. as a specification of what SDTM datasets are expected as a result of the study.
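The nice thing is that this kind of problem can be caught very early: any conforming XML parser rejects curly quotes used as attribute delimiters. A minimal sketch (in Python) of such a well-formedness check on two small snippets:

import xml.etree.ElementTree as ET

good = '<ItemDef OID="IT.LB.LBTESTCD" Name="LBTESTCD"/>'
bad = '<ItemDef OID=\u201cIT.LB.LBTESTCD\u201d Name=\u201cLBTESTCD\u201d/>'  # "MS Office" curly quotes

for label, snippet in (("straight quotes", good), ("curly quotes", bad)):
    try:
        ET.fromstring(snippet)
        print(label, "-> well-formed")
    except ET.ParseError as err:
        print(label, "-> not well-formed:", err)

Running the same kind of check (ET.parse) on the complete define.xml file before submission would have revealed the issue immediately.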

  • A problem with an invalid (?) variable label in the define.xml
Comment: The FDA found out that "Dataset-XML requires consistency between the data files and define.xml". Now, there is something strange about the statement concerning the variable label, as variable labels do not appear at all in the Dataset-XML files. What I understood is that the define.xml file that came with the Dataset-XML files had one label that was not consistent with the label in the original XPT file. With Dataset-XML, define.xml becomes "leading", and that is exactly how it HAS to be. With XPT, there is a complete disconnect between the data files and the define.xml.
So yes, Dataset-XML requires that you put effort into providing a correct define.xml file (as it is leading), and that is a good thing.
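A minimal sketch (in Python, with hypothetical file names) of such a consistency check: every ItemOID that is referenced in the Dataset-XML data must be declared as an ItemDef in the define.xml. Variable labels live only in the define.xml, so that is where tools have to look them up.

import xml.etree.ElementTree as ET

ODM_NS = "{http://www.cdisc.org/ns/odm/v1.3}"

def defined_item_oids(define_path):
    # all ItemDef OIDs declared in the define.xml
    return {e.get("OID") for e in ET.parse(define_path).iter(ODM_NS + "ItemDef")}

def referenced_item_oids(dataset_path):
    # all ItemOIDs actually used in the Dataset-XML file (read in streaming mode)
    oids = set()
    for _, elem in ET.iterparse(dataset_path, events=("end",)):
        if elem.tag == ODM_NS + "ItemData":
            oids.add(elem.get("ItemOID"))
        elem.clear()
    return oids

missing = referenced_item_oids("lb.xml") - defined_item_oids("define.xml")
print("ItemOIDs used in the data but not defined in define.xml:", missing or "none")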


File sizes

A major concern of the FDA is and always has been the file sizes of Dataset-XML files. Yes, XML files usually are larger than the corresponding XPT files. However, this does not have to be the case.
The largest files in an SDTM submission usually are the SUPPQUAL files.
SUPPQUAL files can be present for several reasons:
  • text values longer than 200 characters. Starting from the 201st character, everything is "banned" to SUPPQUAL records. This is not at all necessary when using Dataset-XML.
  • non-standard variables (NSVs). According to the SDTM-IG, NSVs may not appear in the "parent" dataset, but must be provided in the very inefficient SUPPQUAL datasets. The latter can then also grow quickly in size. If NSVs were allowed to remain in the parent dataset (and marked as such in the define.xml) we would mostly not need SUPPQUAL datasets at all, and so the largest files would disappear from our submission. Unfortunately the report does not give us any information about what the largest files were.
Let's take an example: I took a classic submission which has an LB.xpt file with laboratory data of 33MB and a SUPPLB.xpt file of 55MB. So the SUPPQUAL file SUPPLB.xpt is considerably larger, although it only contains data for 2 variables (the LB.xpt file has data for 23 variables). The corresponding Dataset-XML files have sizes of 66MB and 40MB, so together they are somewhat larger than the two XPT files. If one now brings the 2 NSVs back into the parent records, the Dataset-XML file is 80MB in size (and there is no SUPPLB.xml file anymore), so smaller than the sum of the LB.xpt and SUPPLB.xpt files.
Of course, one could also move NSVs to the parent dataset when using XPT.
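A minimal sketch (in Python with pandas, hypothetical file names) of what "bringing the NSVs back into the parent dataset" means in practice, assuming the two datasets have already been read into DataFrames (e.g. with pandas.read_sas for the XPT files):

import pandas as pd

def merge_supp_into_parent(parent, supp):
    # Pivot the SUPP-- records (QNAM/QVAL pairs) into columns and join them
    # onto the parent records via USUBJID and the IDVAR/IDVARVAL reference.
    idvar = supp["IDVAR"].iloc[0]  # e.g. "LBSEQ"; assumes a single IDVAR is used
    wide = supp.pivot_table(index=["USUBJID", "IDVARVAL"],
                            columns="QNAM", values="QVAL",
                            aggfunc="first").reset_index()
    wide[idvar] = wide["IDVARVAL"].astype(parent[idvar].dtype)
    return parent.merge(wide.drop(columns="IDVARVAL"),
                        on=["USUBJID", idvar], how="left")

# usage (hypothetical file names):
# lb = pd.read_sas("lb.xpt", format="xport", encoding="ascii")
# supplb = pd.read_sas("supplb.xpt", format="xport", encoding="ascii")
# lb_with_nsvs = merge_supp_into_parent(lb, supplb)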

In the pilot, the FDA observed file size increases (relative to XPT) up to 264%, and considers this to be a problem. Why?
It cannot be memory consumption when loading in modern analysis tools. As I have shown before, modern XML technologies like SAX and VTD-XML are known for their low memory consumption.
Disk costs cannot be an issue either. The largest submission was 17GB in size, which comes at a disk cost of 0.51 US$ (at 3 dollar cents per GB).
So what is the issue? Here is the citation from the report:
"Based on the file size observations, Dataset-XML produces much larger files than XPORT, which may impact the Electronic Submissions Gateway (ESG)".
OK, can't the ESG handle a 17GB submission? If not, let us zip the Dataset-XML files. Going back to my 80MB LB.xml file: when I do a simple zipping, it reduces to 2MB, so the file size is reduced by a factor of 40! If I did the same with my 17GB submission, it would reduce to a mere 425MB (for the whole submission), something the ESG can surely handle. So, what's the problem?
Wait a minute Jozef! Didn't we tell you that the ESG does not accept .zip files?
A few thoughts:
  • would the ESG accept .zzz files? (a little trick we sometimes use to get zipped files through e-mail filters: just rename the .zip file to .zzz and it passes ...)
  • would the ESG accept .rar files? RAR is another compression format that is also very efficient.
  • why does the ESG not accept .zip files? We will ask the FDA. Is it fear of viruses? PDFs can also contain viruses, and modern virus scanners can easily scan .zip files for viruses before unzipping. Or is it superstition? Or is it the misbelief that zipping can change the file contents?
  • modern software tools like the "Smart Dataset-XML Viewer" can parse zipped XML files without the need to unzip the files first. SAS can also read zipped files.
Compression of XML files is extremely efficient, so those claiming that large file sizes can lead to problems (I cannot see why) should surely use a compression method like ZIP or RAR.
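A minimal sketch (in Python, hypothetical file names) of both steps: zipping a Dataset-XML file and then parsing it again directly from the archive, without first unzipping it to disk:

import zipfile
import xml.etree.ElementTree as ET

# compress: XML typically shrinks dramatically (a factor of 20-40 is not unusual)
with zipfile.ZipFile("lb.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("lb.xml")

# parse straight from the archive, streaming, without extraction
with zipfile.ZipFile("lb.zip") as zf, zf.open("lb.xml") as xml_stream:
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        elem.clear()  # process each element here; memory and disk usage stay small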

A few things that were not considered in the pilot:

  • data quality
    The combination of Dataset-XML and define.xml allows better data quality checks than XPT does. Tools can easily validate the contents of the Dataset-XML against the metadata in the define.xml. With XPT this is much harder, as it needs a lot of "hard coding". Although OpenCDISC supports Dataset-XML, it does not (yet) validate Dataset-XML against the information in the define.xml file (or only to a very limited extent).
  • the fact that define.xml is "leading" brings a lot of new opportunities. For example, the define.xml can be used to automatically generate a relational database (e.g. by transformation into SQL "CREATE TABLE" statements), and the database can then be automatically filled from the information in the Dataset-XML files (e.g. by transformation into SQL "INSERT" statements); see the sketch after this list. This is also possible with XPT, but much, much harder when SAS is not available.
  • this brings us to another advantage of Dataset-XML. As it is XML, it is a truly "open" and "vendor neutral" format. So software vendors, but also the FDA itself, could develop software tools to do very smart things with the contents of the submission. This, too, seems not to have been considered during the course of the pilot.
  • non-ASCII support. As already stated, XPT only supports ASCII-7. This means that e.g. special Spanish characters are not supported (there are 45 million Spanish-speaking people in the US). XML can (and does by default) use UTF-8 encoding (essentially this means "Unicode"), supporting a much, much larger character set. This is one of the main reasons why the Japanese PMDA is so interested in Dataset-XML: XML easily supports Japanese characters, whereas there is no support for them at all in ASCII-7.
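As an illustration of the database point above, here is a minimal sketch (in Python, hypothetical file name, deliberately simplified data type mapping) that turns the define.xml metadata for one domain into a SQL "CREATE TABLE" statement; the element and attribute names follow the usual ODM/Define-XML conventions:

import xml.etree.ElementTree as ET

ODM_NS = "{http://www.cdisc.org/ns/odm/v1.3}"
SQL_TYPES = {"integer": "INTEGER", "float": "REAL"}  # everything else becomes VARCHAR

def create_table_sql(define_path, domain):
    tree = ET.parse(define_path)
    item_defs = {e.get("OID"): e for e in tree.iter(ODM_NS + "ItemDef")}
    columns = []
    for igd in tree.iter(ODM_NS + "ItemGroupDef"):
        if igd.get("Name") != domain:
            continue
        for ref in igd.iter(ODM_NS + "ItemRef"):
            item = item_defs[ref.get("ItemOID")]
            sql_type = SQL_TYPES.get(item.get("DataType"),
                                     "VARCHAR(%s)" % (item.get("Length") or "200"))
            columns.append("    %s %s" % (item.get("Name"), sql_type))
    return "CREATE TABLE %s (\n%s\n);" % (domain, ",\n".join(columns))

print(create_table_sql("define.xml", "LB"))

Filling the tables is then a matter of streaming through the Dataset-XML files and emitting the corresponding "INSERT" statements in the same way.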

What's next?

Very little information is provided in the report. It only states:
"FDA envisages conducting several pilots to evaluate new transport formats before a decision is made to support a new format". So it might be that the FDA will conduct other pilots with other proposed formats, such as semantic web technology, or maybe even CSV (comma-separated values).
There are also some rumours about a Dataset-XML pilot for SEND files.

Conclusions

The pilot was successful as the major goals were reached: ensuring data integrity during transport, and ensuring that Dataset-XML supports longer variable names, labels and text values.
The FDA keeps repeating its concerns about file sizes, although these can easily be overcome by allowing NSVs to be kept in the parent dataset, and by allowing compression techniques, which are very efficient for XML files.

Some further personal remarks

I have always found it very strange that the FDA complains about file sizes. It has been the FDA that has been asking for new derived variables in SDTM datasets (like the --DY variables) and for duplicate information (e.g. the test name --TEST, which has a 1:1 relationship with the test code --TESTCD). Derivations like the --DY calculation can easily be done by the FDA tools themselves (it is also one of the features of the "Smart Dataset-XML Viewer"), and e.g. the test name can easily be retrieved using a web service (see here and here). Removing these unnecessary derived or redundant variables from the SDTM would reduce the file sizes by at least 30-40%.
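To illustrate how trivial such a derivation is for a review tool, here is a minimal sketch (in Python) of the usual --DY study day calculation (day 1 is the day of RFSTDTC, there is no day 0):

from datetime import date

def study_day(dtc, rfstdtc):
    # derive --DY from the date parts of --DTC and RFSTDTC (ISO 8601 strings)
    delta = (date.fromisoformat(dtc[:10]) - date.fromisoformat(rfstdtc[:10])).days
    return delta + 1 if delta >= 0 else delta

print(study_day("2014-01-13", "2014-01-15"))        # two days before first exposure -> -2
print(study_day("2014-01-22T08:30", "2014-01-15"))  # one week after -> 8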

Acknowledgements

Special thanks are due to the SAS company and its employee Lex Jansen, who is a specialist in both SAS and XML (well, I would even state that he is a "guru"). Lex spent a lot of time working together with the FDA people and resolving the issues. Special thanks are also due to a number of FDA people whom I cannot mention by name here, for their openness and the many good discussions with a number of the CDISC XML Technology team volunteers.