The pilot was limited to SDTM datasets. The FDA centers involved were CDER and CBER. The major objectives of the pilot were:
- ensuring that data integrity is maintained when going from SAS datasets (sas7bdat) to Dataset-XML and back.
- ensuring that the Dataset-XML format supports longer variable names, labels and text fields than SAS-XPT (which is limited to 8 characters for names, 40 for labels and 200 for text fields).
- ensuring that non-ASCII-7 characters can be transported (another limitation of XPT).
Two types of tests were performed:
a) test whether the Dataset-XML files can be transformed into sas7bdat and can be read by FDA data analysis software (e.g. JMP)
b) test whether data integrity is preserved when converting sas7bdat files to Dataset-XML files and then back to sas7bdat files
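Test (b) can be illustrated with a minimal round-trip sketch. It does not use the tools from the pilot. It assumes a simplified, namespace-free stand-in for Dataset-XML (the `Seq` attribute and the helper names `records_to_xml` and `xml_to_records` are hypothetical), and it starts from in-memory records rather than real sas7bdat files:

```python
import xml.etree.ElementTree as ET

def records_to_xml(dataset_name, records):
    """Serialize a list of {variable: value} dicts to a simplified Dataset-XML-like string."""
    root = ET.Element("ClinicalData")
    for seq, record in enumerate(records, start=1):
        group = ET.SubElement(root, "ItemGroupData",
                              ItemGroupOID=f"IG.{dataset_name}", Seq=str(seq))
        for name, value in record.items():
            ET.SubElement(group, "ItemData",
                          ItemOID=f"IT.{dataset_name}.{name}", Value=value)
    return ET.tostring(root, encoding="unicode")

def xml_to_records(dataset_name, xml_text):
    """Parse the simplified XML back into a list of {variable: value} dicts."""
    prefix = f"IT.{dataset_name}."
    records = []
    for group in ET.fromstring(xml_text).iter("ItemGroupData"):
        records.append({item.get("ItemOID")[len(prefix):]: item.get("Value")
                        for item in group.iter("ItemData")})
    return records

# Illustrative data: a variable name longer than 8 characters, a text value
# longer than 200 characters, and a value containing a non-ASCII character.
original = [
    {"USUBJID": "STUDY1-001", "LONGVARIABLENAME": "x" * 250},
    {"USUBJID": "STUDY1-002", "COVAL": "Niño con fiebre"},
]
restored = xml_to_records("CO", records_to_xml("CO", original))
round_trip_ok = restored == original
print("round trip preserved all values:", round_trip_ok)
```

A real test would of course compare the sas7bdat contents before and after conversion; the point here is only that none of the XPT length or character limits apply on the XML side.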
Comment: Although the report doesn't mention this at all, I heard that one of the submissions was generated using Java/Groovy. This also proves that correct Dataset-XML files can be generated by tools other than statistical software (which was pretty hard with XPT). In my opinion, this is a valuable result that should be mentioned.
Further, sponsors were asked to submit Dataset-XML files that contain variable names longer than 8 characters, variable labels longer than 40 characters, and text content longer than 200 characters. The goal of this was to test whether Dataset-XML files can (sic) "facilitate a longer variable name (>8 characters), a longer label name (>40 characters) and longer text fields (>200 characters)".
Comment: well, that is something we have already known for many, many years ...
Issues found by the FDA
During the pilot, a number of issues were encountered, all of which could be resolved.
- Initially, testing was not successful due to a memory issue caused by the large dataset size. This was resolved once the SAS tool was updated to address its high memory consumption.
- Encoding problems in a define.xml file
Generating define.xml from Excel files or Word documents is extremely bad practice. See my blog entry "Creating define.xml - best and worst practices". Ideally, define.xml files should be created even before the study starts, e.g. as a specification of what SDTM datasets are expected as a result of the study.
- A problem with an invalid (?) variable label in the define.xml
So yes, Dataset-XML requires that you put effort into providing a correct define.xml file (as it is leading), and that is a good thing.
File sizes
A major concern of the FDA is and always has been the file sizes of Dataset-XML files. Yes, XML files usually are larger than the corresponding XPT files. However, this does not have to be the case.
The largest files in an SDTM submission usually are the SUPPQUAL files.
SUPPQUAL files can be present for several reasons:
- text values longer than 200 characters. Starting from the 201st character, everything is "banished" to SUPPQUAL records. This is not at all necessary when using Dataset-XML.
- non-standard variables (NSVs). According to the SDTM-IG, NSVs may not appear in the "parent" dataset, but must be provided in the very inefficient SUPPQUAL datasets. The latter can then also grow quickly in size. If NSVs were allowed to remain in the parent dataset (and were marked as NSVs in the define.xml), we would mostly not need SUPPQUAL datasets at all, and so the largest files would disappear from our submissions. Unfortunately the report does not give us any information about what the largest files were.
Of course, one could also move NSVs to the parent dataset when using XPT.
In the pilot, the FDA observed file size increases (relative to XPT) of up to 264%, and considers this to be a problem. Why?
It cannot be memory consumption when loading in modern analysis tools. As I have shown before, modern XML technologies like SAX and VTD-XML are known for their low memory consumption.
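To make the memory argument concrete, here is a minimal sketch of a SAX-style reader using Python's standard `xml.sax` module. The handler only sees one element at a time and keeps two counters, so memory use does not grow with file size. The element names follow a simplified, namespace-free stand-in for Dataset-XML, and `RecordCounter` and `count_records` are hypothetical names for illustration:

```python
import io
import xml.sax

class RecordCounter(xml.sax.ContentHandler):
    """Counts records and values as they stream past, without building a tree."""

    def __init__(self):
        super().__init__()
        self.records = 0
        self.values = 0

    def startElement(self, name, attrs):
        if name == "ItemGroupData":
            self.records += 1
        elif name == "ItemData":
            self.values += 1

def count_records(stream):
    """Stream-parse an XML file-like object and return (records, values)."""
    handler = RecordCounter()
    xml.sax.parse(stream, handler)
    return handler.records, handler.values

# Synthetic sample data; a real file would be opened from disk instead.
sample = (
    "<ClinicalData>"
    + "".join(
        f'<ItemGroupData Seq="{i}"><ItemData ItemOID="IT.LB.LBSEQ" Value="{i}"/></ItemGroupData>'
        for i in range(1, 1001)
    )
    + "</ClinicalData>"
).encode("utf-8")

records, values = count_records(io.BytesIO(sample))
print(records, values)
```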
Disk cost cannot be the issue either. The largest submission was 17GB in size, which comes at a disk cost of 0.51 US$ (3 US cents per GB).
So what is the issue? Here is the citation from the report:
"Based on the file size observations, Dataset-XML produces much larger files than XPORT, which may impact the Electronic Submissions Gateway (ESG)".
OK, can't the ESG handle a 17GB submission? If not, let us zip the Dataset-XML files. Going back to my 80MB LB.xml file: when I simply zip it, it shrinks to 2MB, so the file size is reduced by a factor of 40! If I did the same with the 17GB submission, it would shrink to a mere 425MB (for the whole submission), something the ESG can surely handle. So, what's the problem?
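The arithmetic above, and the reason XML compresses so well, can be checked with a short sketch. The XML here is synthetic and the measured factor will differ from that of a real LB.xml file; `compression_factor` is a hypothetical helper, and standard DEFLATE compression (as used by zip) is assumed:

```python
import zlib

def compression_factor(data: bytes) -> float:
    """Return original size divided by DEFLATE-compressed size."""
    return len(data) / len(zlib.compress(data, 9))

# Synthetic, repetitive record structure similar in spirit to Dataset-XML.
synthetic_xml = "".join(
    f'<ItemGroupData ItemGroupOID="IG.LB" Seq="{i}">'
    f'<ItemData ItemOID="IT.LB.USUBJID" Value="SUBJ-{i % 50:03d}"/>'
    f'<ItemData ItemOID="IT.LB.LBSEQ" Value="{i}"/>'
    f'<ItemData ItemOID="IT.LB.LBTESTCD" Value="GLUC"/>'
    f"</ItemGroupData>"
    for i in range(5000)
).encode("utf-8")
factor = compression_factor(synthetic_xml)
print(f"synthetic XML shrinks by a factor of about {factor:.0f}")

# The figures quoted in the text.
observed_factor = 80 / 2                  # 80MB LB.xml zipped to 2MB
submission_mb = 17 * 1000                 # 17GB submission, decimal units
estimated_zipped_mb = submission_mb / observed_factor
disk_cost_usd = round(17 * 0.03, 2)       # 3 US cents per GB
print(estimated_zipped_mb, disk_cost_usd)
```

The repeated tag and attribute names are exactly what makes XML verbose, and they are also what a dictionary-based compressor removes most effectively.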
Wait a minute Jozef! Didn't we tell you that the ESG does not accept .zip files?
A few thoughts:
- would the ESG accept .zzz files? (A little trick we sometimes use to get zipped files through e-mail filters: just rename the .zip file to .zzz and it passes...)
- would the ESG accept .rar files? RAR is another compression format that is also very efficient.
- why does the ESG not accept .zip files? We will ask the FDA. Is it fear of viruses? PDFs can also contain viruses, and modern virus scanners can easily scan .zip files for viruses before unzipping. Or is it superstition? Or is it the mistaken belief that zipping can change the file contents?
- modern software tools like the "Smart Dataset-XML Viewer" can parse zipped XML files without unzipping them first. SAS can also read zipped files.
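Reading a zipped XML file without unzipping it to disk first takes only a few lines in most languages. The following is a minimal Python sketch using the standard `zipfile` module and a streaming parser. It is not how the "Smart Dataset-XML Viewer" or SAS do it, and the member name `lb.xml`, the simplified element names and the helper `count_records_in_zip` are assumptions for illustration:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def count_records_in_zip(zip_source, member_name):
    """Stream-parse one XML member of a zip archive and count its records."""
    count = 0
    with zipfile.ZipFile(zip_source) as archive:
        # archive.open() decompresses on the fly; nothing is written to disk.
        with archive.open(member_name) as member:
            for _event, element in ET.iterparse(member, events=("end",)):
                if element.tag == "ItemGroupData":
                    count += 1
                    element.clear()  # drop the record's children and attributes
    return count

# Build a small zip archive in memory to stand in for a zipped submission file.
xml_text = (
    "<ClinicalData>"
    + "".join(
        f'<ItemGroupData Seq="{i}"><ItemData ItemOID="IT.LB.LBSEQ" Value="{i}"/></ItemGroupData>'
        for i in range(1, 501)
    )
    + "</ClinicalData>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("lb.xml", xml_text)
buffer.seek(0)

zipped_record_count = count_records_in_zip(buffer, "lb.xml")
print(zipped_record_count)
```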
A few things that were not considered in the pilot:
- data quality
The combination of Dataset-XML and define.xml allows better data quality checks than are possible with XPT. Tools can easily validate the contents of the Dataset-XML files against the metadata in the define.xml. With XPT this is much harder, as it needs a lot of "hard coding". Although OpenCDISC supports Dataset-XML, it does not (yet) validate Dataset-XML against the information in the define.xml file (or only to a very limited extent). The fact that define.xml is "leading" brings a lot of new opportunities. For example, the define.xml can be used to automatically generate a relational database (e.g. by transformation into SQL "CREATE TABLE" statements), and the database can then be automatically filled from the information in the Dataset-XML files (e.g. by transformation into SQL "INSERT" statements). This is also possible with XPT, but much harder when SAS is not available.
- this brings us to another advantage of Dataset-XML. As it is XML, it is really an "open" and "vendor neutral" format. So software vendors, but also the FDA itself, could develop software tools to do very smart things with the contents of the submission. This, too, seems not to have been considered during the course of the pilot.
- non-ASCII support. As already stated, XPT only supports ASCII-7. This means that e.g. special Spanish characters are not supported (there are 45 million Spanish-speaking people in the US). XML can (and does by default) use UTF-8 encoding (essentially this means "Unicode"), supporting a much larger character set. This is one of the main reasons why the Japanese PMDA is so interested in Dataset-XML: XML easily supports Japanese characters, for which there is no support at all in ASCII-7.
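The database idea from the data quality item above can be sketched in a few dozen lines. This is a minimal illustration, not a complete implementation: it assumes heavily simplified, namespace-free fragments of define.xml and Dataset-XML, a hypothetical `SQL_TYPES` mapping, and SQLite as the target database. Because every lookup goes through the define.xml metadata, a record that references an undefined dataset or variable fails with a `KeyError`, which is a very basic form of validation against the define.xml:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Simplified stand-ins for define.xml and a Dataset-XML file (assumptions).
DEFINE_XML = """
<MetaDataVersion>
  <ItemGroupDef OID="IG.DM" Name="DM">
    <ItemRef ItemOID="IT.DM.USUBJID"/>
    <ItemRef ItemOID="IT.DM.AGE"/>
  </ItemGroupDef>
  <ItemDef OID="IT.DM.USUBJID" Name="USUBJID" DataType="text" Length="20"/>
  <ItemDef OID="IT.DM.AGE" Name="AGE" DataType="integer" Length="3"/>
</MetaDataVersion>
"""
DATASET_XML = """
<ClinicalData>
  <ItemGroupData ItemGroupOID="IG.DM">
    <ItemData ItemOID="IT.DM.USUBJID" Value="STUDY1-001"/>
    <ItemData ItemOID="IT.DM.AGE" Value="54"/>
  </ItemGroupData>
  <ItemGroupData ItemGroupOID="IG.DM">
    <ItemData ItemOID="IT.DM.USUBJID" Value="STUDY1-002"/>
    <ItemData ItemOID="IT.DM.AGE" Value="61"/>
  </ItemGroupData>
</ClinicalData>
"""
# Hypothetical mapping from define.xml data types to SQLite column types.
SQL_TYPES = {"text": "TEXT", "integer": "INTEGER", "float": "REAL"}

def create_table_statements(define_root):
    """Derive one CREATE TABLE statement per ItemGroupDef."""
    item_defs = {d.get("OID"): d for d in define_root.iter("ItemDef")}
    statements = []
    for group in define_root.iter("ItemGroupDef"):
        columns = []
        for ref in group.iter("ItemRef"):
            item = item_defs[ref.get("ItemOID")]
            name = item.get("Name")
            sql_type = SQL_TYPES.get(item.get("DataType"), "TEXT")
            columns.append(f'"{name}" {sql_type}')
        table = group.get("Name")
        statements.append(f'CREATE TABLE "{table}" ({", ".join(columns)})')
    return statements

def load_records(connection, define_root, data_root):
    """Insert every record, resolving table and column names via define.xml."""
    columns = {d.get("OID"): d.get("Name") for d in define_root.iter("ItemDef")}
    tables = {g.get("OID"): g.get("Name") for g in define_root.iter("ItemGroupDef")}
    for record in data_root.iter("ItemGroupData"):
        table = tables[record.get("ItemGroupOID")]  # KeyError if undefined
        items = list(record.iter("ItemData"))
        names = ", ".join(f'"{columns[i.get("ItemOID")]}"' for i in items)
        marks = ", ".join("?" for _ in items)
        values = [i.get("Value") for i in items]
        connection.execute(f'INSERT INTO "{table}" ({names}) VALUES ({marks})', values)

define_root = ET.fromstring(DEFINE_XML)
connection = sqlite3.connect(":memory:")
statements = create_table_statements(define_root)
for statement in statements:
    connection.execute(statement)
load_records(connection, define_root, ET.fromstring(DATASET_XML))
rows = connection.execute('SELECT "USUBJID", "AGE" FROM "DM" ORDER BY "USUBJID"').fetchall()
print(statements[0])
print(rows)
```

Nothing in this sketch requires SAS or any other statistical package, which is the point of having an open, vendor-neutral format.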
What's next?
Very little information is provided in the report. It only states:
"FDA envisages conduction several pilots to evaluate new transport formats before a decision is made to support a new format". So it might be that the FDA conducts other pilots with other proposed formats such as semantic web technology, or even maybe CSV (comma separated values).
There are also some rumours about a Dataset-XML pilot for SEND files.
Conclusions
The pilot was successful as the major goals were reached: ensuring data integrity during transport, and ensuring that Dataset-XML supports longer variable names, labels and text values.
The FDA keeps repeating its concerns about file sizes, although these can easily be overcome by allowing NSVs to be kept in the parent dataset, and by allowing compression techniques, which are very efficient for XML files.