Monday, July 9, 2018

The 0.5 MB SDTM/SEND/ADaM submission

Submission files to FDA and PMDA can be large. Reviewers there often complain that they have problems loading files or transporting them between people and applications. Validation tools complain if a variable is declared 1 byte longer than absolutely necessary. Files need to be split and then remerged.


First of all, FDA and PMDA still require us to submit in the outdated, highly inefficient XPT format. Additional, so-called "non-standard" variables need to be submitted in "SUPPxx" datasets that are even more inefficient. If they were kept in the "parent" dataset and marked as "non-standard", the size of the whole submission could be 50% smaller.

Amazon, Google and Twitter handle volumes of information that are millions of times the size of our regulatory submissions. They do not complain about file sizes. Why?
The answer is simple: because they do not use files.
In these organizations, all information is stored in databases, and RESTful web services are used to exchange information between machines.

The define.xml contains the metadata of a submission. It is relatively small, usually somewhere between 0.5 and 1 MB. It is the "sponsor's truth about the submission", containing almost all the information the reviewer needs to be able to work with the data.
The define.xml also contains the location of the data - as file references. It is these data files that can become very large. Unlike Amazon, Google and Twitter, we (clinical research) still use "files" for exchanging the information. The rest of the world is using RESTful web services.

Imagine that the sponsor keeps the study information in a system (which usually is a database), and provides an "SDTM view" of the data. Yes, SDTM is a "view" on the real data! It is just an ETL view on the data in order to make it easy for the reviewers to understand and work with the data. In conflict with good database practice, it contains a lot of redundancy and derived data, even though this is often completely unnecessary. But it is all "for the sake of easy review".
So, suppose that a sponsor can provide an "SDTM view" on the data "on the fly", and that this "view" is available through a RESTful web service.

The define.xml entry for the "dataset" DM can then look like:

No, this is NOT a hyperlink! It is a RESTful web service instruction, stating "GET the SDTM dataset DM" from study "cdisc01". The parts "Standard", "Study" and "Dataset" are parts of the REST API for the service to retrieve submission information.
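As a minimal sketch, such a request string might be assembled like this in Python. Only the base address and the three path keywords "Standard", "Study" and "Dataset" come from this post; the exact segment order, the value "SDTM" for the standard, and the function name `dataset_url` are assumptions made for illustration.

```python
from urllib.parse import quote

# Base address taken from the simple example in this post.
BASE = "https://mypharmacompany/submissions/REST"


def dataset_url(standard: str, study: str, dataset: str, base: str = BASE) -> str:
    """Build the GET address for one dataset of one study.

    The segment order Standard/Study/Dataset is an assumed layout.
    """
    parts = ["Standard", standard, "Study", study, "Dataset", dataset]
    # Percent-encode each segment so unusual study identifiers stay valid in a URL.
    return base + "/" + "/".join(quote(p, safe="") for p in parts)


print(dataset_url("SDTM", "cdisc01", "DM"))
```

A client would issue an HTTP GET against the resulting address instead of opening a file.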

This API can then easily be extended, e.g. with filtering criteria, like:


where "[base]" is the base of the RESTful web service (in our simple example "https://mypharmacompany/submissions/REST") and the question mark "?" introduces a "where" clause. So the REST string then means: provide the DM data of the subject with USUBJID "CDISC01.10008".

This can of course easily be extended to any of the variables and datasets. For example, to obtain all the AE records for severe adverse events that led to hospitalization, the string would be:
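A hedged sketch of how both filtered requests could be built in Python follows. The path layout, the helper name `query_url`, the use of the SDTM variables AESEV and AESHOSP with the values "SEVERE" and "Y", and the idea that several criteria joined by "&" mean "and" are all assumptions for illustration, not a defined API.

```python
from urllib.parse import urlencode

BASE = "https://mypharmacompany/submissions/REST"


def query_url(study: str, dataset: str, **criteria: str) -> str:
    """Append "where" criteria to a dataset address as a query string."""
    # Assumed path layout: Standard/SDTM/Study/<study>/Dataset/<dataset>
    url = f"{BASE}/Standard/SDTM/Study/{study}/Dataset/{dataset}"
    if criteria:
        url += "?" + urlencode(criteria)
    return url


# The DM record of a single subject.
dm_one = query_url("cdisc01", "DM", USUBJID="CDISC01.10008")

# Severe adverse events that led to hospitalization (assumed coding).
ae_severe = query_url("cdisc01", "AE", AESEV="SEVERE", AESHOSP="Y")

print(dm_one)
print(ae_severe)
```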


Simple, isn't it? No files anymore, just services and queries ...
How this is implemented on the server is completely unimportant. The API is essentially the "service contract" guaranteeing that the service provides exactly what is asked for.
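To make the "contract" idea concrete, here is a toy resolver in Python that answers such a request from an in-memory table. Everything in it is hypothetical: the records are invented, the path layout and the "all criteria must match" rule are assumptions, and a real service would sit on top of the sponsor's database. Any implementation that returns the same records for the same request would satisfy the contract equally well.

```python
from urllib.parse import parse_qsl, urlsplit

# Toy in-memory "database"; the records are invented for illustration.
DATA = {
    "AE": [
        {"USUBJID": "CDISC01.10008", "AETERM": "HEADACHE", "AESEV": "MILD", "AESHOSP": "N"},
        {"USUBJID": "CDISC01.10008", "AETERM": "SYNCOPE", "AESEV": "SEVERE", "AESHOSP": "Y"},
        {"USUBJID": "CDISC01.10014", "AETERM": "NAUSEA", "AESEV": "SEVERE", "AESHOSP": "N"},
    ],
}


def resolve(url: str) -> list[dict]:
    """Return the requested records: dataset from the path, filters from the query."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # The dataset name is the segment following the "Dataset" keyword.
    dataset = segments[segments.index("Dataset") + 1]
    criteria = parse_qsl(parts.query)
    return [
        row
        for row in DATA.get(dataset, [])
        if all(row.get(name) == value for name, value in criteria)
    ]


hits = resolve(
    "https://mypharmacompany/submissions/REST/Standard/SDTM/Study/cdisc01"
    "/Dataset/AE?AESEV=SEVERE&AESHOSP=Y"
)
print([row["AETERM"] for row in hits])
```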

Now, what would be the consequences of basing electronic submissions on RESTful web services?
  • Only a define.xml file needs to be submitted (0.5-1 MB). Even that could be handled by a RESTful web service.
  • No files are needed anymore on the regulatory authorities' side. They can, however, still store the information if they want, as the usual formats for returning the information from RESTful web services are XML, JSON and Turtle (RDF), which can all be used to populate databases and data warehouses.
  • At the sponsor's site, only an SDTM/SEND/ADaM database is necessary (which is already there anyway); only the API implementation needs to be taken care of.
  • No more 8-, 40- and 200-character limitations of the XPT file format, as files are not used anyway. No problems with non-ASCII characters, as RESTful web services use XML or JSON.
  • No need for SUPPxx "files", as there are no files. SUPPxx datasets should not be necessary either: non-standard variables (NSVs) can just be marked as such in the define.xml.
  • The SDTM data can contain the real source data (points), like one or more FHIR resources of the electronic health record that was the source of the data.
  • Validation of data (for compliance with the standard) can also be done using RESTful web services. Because the validation software is server-based, bug fixes can be deployed within hours instead of taking years (as with the FDA's current validation software).
  • Review can start before the last data point is captured! SDTM and SEND datasets in particular are usually assembled long before that point, and since the SDTM database is already there anyway, regulatory authorities could begin parts of the review before the study ends and then, after database closure, simply repeat on the complete data the analyses they already ran on the partial data. This could save the lives of thousands of patients each year who are waiting for a new medication or treatment.
  • Easy filtering and querying: as filtering can already be done by the RESTful web service itself, it makes life much easier for the reviewer, as he/she does not need to learn the filtering features of different packages.
  • Much easier and faster collaboration between reviewer and submitting pharma company. If the reviewer requires the information to be updated or organized differently, there is no longer any need to regenerate and exchange files. Just implement the necessary change in the database, and it is there.
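The point above about authorities still being able to store the returned information can be sketched as follows, using JSON and an SQLite database from the Python standard library. The payload is invented, and the assumption that the service returns a flat list of records is mine; an actual response format would have to be defined by the API.

```python
import json
import sqlite3

# Invented example payload; a flat list of records is an assumed response layout.
payload = json.dumps([
    {"USUBJID": "CDISC01.10008", "SEX": "M", "AGE": 34},
    {"USUBJID": "CDISC01.10014", "SEX": "F", "AGE": 51},
])


def load_records(conn: sqlite3.Connection, table: str, records: list[dict]) -> None:
    """Create a table from the keys of the first record and insert all records."""
    columns = list(records[0])
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_list})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})',
        [tuple(r.get(c) for c in columns) for r in records],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_records(conn, "DM", json.loads(payload))
print(conn.execute('SELECT COUNT(*) FROM "DM"').fetchone()[0])
```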
Some of us will immediately object that such "http" requests would mean that anyone can execute them (e.g. in a browser) and so obtain all the information. That is not true: RESTful web services can be secured by requiring authentication, often using the OAuth mechanism, but other authentication mechanisms can also be used, such as the "Ticket Granting Ticket" (TGT) and "Ticket" that the NIH/NLM uses in its UMLS RESTful web services.
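As an illustration of the OAuth case, a client could attach a bearer token to each request. The sketch below only prepares the request and does not send it; the token value is a placeholder, the address follows an assumed path layout, and how the token is obtained depends on the authorization server.

```python
from urllib.request import Request


def authorized_request(url: str, access_token: str) -> Request:
    """Prepare a GET request carrying an OAuth 2.0 bearer token."""
    return Request(
        url,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
        method="GET",
    )


req = authorized_request(
    "https://mypharmacompany/submissions/REST/Standard/SDTM/Study/cdisc01/Dataset/DM",
    access_token="example-token",  # placeholder, not a real credential
)
print(req.get_method(), req.get_header("Authorization"))
```

Without a valid token, the server would simply refuse the request, so pasting the address into a browser yields nothing.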

Electronic submissions to the regulatory authorities based on only one small (define.xml) file. Doesn't this sound cool? Such an approach could save the lives of thousands of patients each year!


  1. This would be so good!!! I fully share your point of view, and I think we could go even further: we could also have an ODM XML file that uses only RESTful web service instructions to SHARE for every piece of metadata that could be used as is (defined in CDASH as ItemGroup – Item, as Controlled Terminology, or as SDTM metadata). For sponsor-specific metadata, another RESTful web service could use the same technology. That way, our studies would be very “standard”. But to be able to generate our clinical data like this, we really need an efficient Metadata Repository solution that allows us to define every clinical concept once and reference it where and when we need it… with a very strong version control system and a unique identifier that does not change over time… Thank you again for guiding us in that direction!!