Monday, July 9, 2018

The 0.5 MB SDTM/SEND/ADaM submission

Submission files to the FDA and PMDA can be large. Reviewers there often complain that they have problems loading the files or moving them between people and applications. Validation tools complain when a variable is declared one byte longer than absolutely necessary. Files need to be split and then re-merged.


First of all, the FDA and PMDA still require us to submit in the outdated, highly inefficient XPT format. Additionally, so-called "non-standard" variables need to be submitted in "SUPPxx" datasets that are even more inefficient. If they were kept in the "parent" dataset and simply marked as "non-standard", the size of the whole submission could be 50% smaller.

Amazon, Google and Twitter handle volumes of information that are millions of times the size of our regulatory submissions. They do not complain about file sizes. Why?
The answer is simple: because they do not use files.
In these organizations, all information is stored in databases, and RESTful web services are used to exchange information between machines.

The define.xml contains the metadata of a submission. It is relatively small, usually somewhere between 0.5 and 1 MB. It is the "sponsor's truth about the submission", containing almost all the information the reviewer needs to be able to work with the data.
The define.xml also contains the location of the data - as file references. It is these data files that can become very large. Unlike Amazon, Google and Twitter, we (clinical research) still use "files" to exchange the information. The rest of the world uses RESTful web services.

Imagine that the sponsor keeps the study information in a system (which usually is a database), and provides an "SDTM view" of the data. Yes, SDTM is a "view" on the real data! It is just an ETL view on the data, meant to make it easy for reviewers to understand and work with it. In conflict with good database practice, it contains a lot of redundancy and derived data, even though this is often completely unnecessary. But it is all "for the sake of easy review".
So, suppose that a sponsor can provide an "SDTM view" on the data "on the fly", and that this "view" is available through a RESTful web service.

The define.xml entry for the "dataset" DM can then contain, instead of a file reference, a REST query string.

No, such a string is NOT a hyperlink! It is a RESTful web service instruction, stating "GET the SDTM dataset DM from study cdisc01". The parts "Standard", "Study" and "Dataset" are parts of the REST API for the service to retrieve submission information.
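Reconstructed from that description, such a define.xml entry might look like (the base URL is the example one used in this post; the exact path layout is of course hypothetical):

```
https://mypharmacompany/submissions/REST/Standard/SDTM/Study/cdisc01/Dataset/DM
```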

This API can then easily be extended, e.g. with filtering criteria.
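For example, a filtered request could look like (reconstructed from the description below; "[base]" abbreviates the service root):

```
[base]/Standard/SDTM/Study/cdisc01/Dataset/DM?USUBJID=CDISC01.10008
```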


where "[base]" is the base of the RESTful web service (in our simple example "https://mypharmacompany/submissions/REST") and the question mark "?" introduces a "where" clause. So the REST string then means: provide the DM data of the subject with USUBJID "CDISC01.10008".

This can of course easily be extended to any of the variables and datasets. For example, to obtain all the AE records for severe adverse events that led to hospitalization, the string would carry similar filters on the AE severity and hospitalization variables.
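Assuming the standard SDTM AE variables AESEV (severity) and AESHOSP (the "leading to hospitalization" flag) - the exact variable choice is an assumption here - such a query could look like:

```
[base]/Standard/SDTM/Study/cdisc01/Dataset/AE?AESEV=SEVERE&AESHOSP=Y
```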


Simple, isn't it? No files anymore, just services and queries ...
How this is implemented on the server is completely unimportant. The API is essentially the "service contract", guaranteeing that the service provides exactly what is asked.
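As a minimal sketch of the server side of such a contract (purely illustrative - the path layout is the hypothetical API above, not any real FDA or CDISC interface), the service only needs to split the request path into its "Standard", "Study" and "Dataset" parts and read the query string as a row filter:

```python
from urllib.parse import urlsplit, parse_qsl

def parse_submission_request(url):
    """Split a hypothetical submission-API URL into its parts.

    Returns (standard, study, dataset, filters), where filters is a dict
    of variable -> required value, taken from the query string.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: .../Standard/<std>/Study/<study>/Dataset/<ds>
    api = dict(zip(segments[-6::2], segments[-5::2]))
    filters = dict(parse_qsl(parts.query))
    return api["Standard"], api["Study"], api["Dataset"], filters

url = ("https://mypharmacompany/submissions/REST"
       "/Standard/SDTM/Study/cdisc01/Dataset/DM?USUBJID=CDISC01.10008")
print(parse_submission_request(url))
# → ('SDTM', 'cdisc01', 'DM', {'USUBJID': 'CDISC01.10008'})
```

The server would then run the filter against its SDTM database and return the matching rows as XML or JSON; the client never learns (and never needs to learn) how that lookup is implemented.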

Now, what would be the consequences of basing electronic submissions on RESTful web services?
  • Only a define.xml file (0.5-1 MB) needs to be submitted. Even that could be taken care of by a RESTful web service.
  • No files are needed at the regulatory authorities' side anymore. They can, however, still store the information if they want, as the usual formats for returning information from RESTful web services are XML, JSON and Turtle (RDF), all of which can be used to populate databases and data warehouses.
  • At the sponsor's side, only an SDTM/SEND/ADaM database is necessary (which is already there anyway); only the API implementation needs to be taken care of.
  • No 8-, 40- and 200-character limitations of the XPT file format anymore, as files are not used anyway. No problems with non-ASCII characters, as the RESTful web services use XML or JSON.
  • No need for SUPPxx "files" - there are no files. SUPPxx datasets should not be necessary at all; non-standard variables (NSVs) can simply be marked as such in the define.xml.
  • The SDTM data points can carry the real source data, like one or more FHIR resources of the electronic health record that was the source of the data.
  • Validation of the data (for compliance with the standard) can also be done using RESTful web services. As the validation software is server-based, bug fixes can be deployed within hours, instead of having to wait for years (as with the current validation software of the FDA).
  • Review can already start before the last data point is captured! SDTM and SEND datasets are usually assembled well before the last data point is captured, and as the SDTM database is already there anyway, regulatory authorities could start part of the review before the study ends, and then, after database closure, simply repeat the analyses they already ran on the partial data. This could save the lives of thousands of patients each year who are waiting for a new medication or treatment.
  • Easy filtering and querying: as filtering can already be done by the RESTful web service itself, life becomes much easier for reviewers, who no longer need to learn the filtering features of different packages.
  • Much easier and faster collaboration between the reviewer and the submitting pharma company. If the reviewer needs the information updated or organized differently, there is no need anymore to regenerate and exchange files. Just implement the necessary change in the database, and it is there.
Some of us will immediately object that such "http" requests would mean that anyone can execute them (e.g. in a browser) and so obtain all the information. That is not true: RESTful web services can be secured by requiring authentication, often using the OAuth mechanism, but other authentication mechanisms can also be used, such as the "Ticket Granting Ticket" (TGT) and "Ticket" mechanism the NIH/NLM uses in its UMLS RESTful web services.
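As a minimal sketch of what such an authenticated call could look like (the URL and token value are of course made up; a real service would issue the token through its own OAuth flow):

```python
from urllib.request import Request

# Hypothetical endpoint and token -- purely illustrative values.
url = ("https://mypharmacompany/submissions/REST"
       "/Standard/SDTM/Study/cdisc01/Dataset/DM")
token = "reviewer-access-token-issued-by-oauth"

# An OAuth 2.0 bearer token travels in the Authorization header;
# without it, the service would answer 401 Unauthorized.
request = Request(url, headers={"Authorization": "Bearer " + token})

print(request.get_header("Authorization"))
# → Bearer reviewer-access-token-issued-by-oauth
```

An anonymous browser request without that header would simply be refused, so "it is an http URL" does not mean "it is public".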

Electronic submissions to the regulatory authorities based on only one small (define.xml) file. Doesn't this sound cool? Such an approach could save the lives of thousands of patients each year!

Sunday, June 3, 2018

Source EHR records in SDTM

When submitting SDTM datasets to the FDA, the source data get lost. The reason is that the generation of SDTM datasets is essentially an ETL (extract-transform-load) process. In the case of data collection using CRFs (case report forms), this is compensated for by the FDA requirement that an annotated CRF (in PDF format) be delivered. In the define.xml (using def:Origin), one then points to the page number(s) or the section (bookmark) of the originating field in the CRF. Such an annotated PDF-CRF must also be delivered when the CRFs were electronic. This doesn't make sense, as in that case an SDTM-annotated ODM-XML file would be a much better choice. It has the advantages of being really electronic and of being easy to visualize. But the reviewers at the FDA still cannot handle XML well; it is still a PDF and XPT world.
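For comparison, this is roughly how such a CRF origin is declared in Define-XML 2.0; the OID, leaf ID and page number below are invented for illustration:

```xml
<ItemDef OID="IT.VS.VSORRES" Name="VSORRES" DataType="text" Length="8">
  <def:Origin Type="CRF">
    <def:DocumentRef leafID="LF.acrf">
      <def:PDFPageRef Type="PhysicalRef" PageRefs="12"/>
    </def:DocumentRef>
  </def:Origin>
</ItemDef>
```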

[Stylesheet developed by David Iberson-Hurst, Assero]

But what if the source of the data is an electronic health record (EHR)? Of course one can transcribe the data into the CRF, thus again losing the original record. So, what if the reviewer wants to see the real original record? How can this be accomplished in SDTM using the mandatory SAS XPT format?

It can't.

When using CDISC Dataset-XML as the transport format, however, much more is possible. Dataset-XML has been developed for transporting tabular data, but as it is based on CDISC ODM, it can also natively transport audit trails, signatures and annotations. Data points in ODM can also carry electronic health records, as was demonstrated several times in the past []. The same is true for Dataset-XML, as technically there is no difference between an ODM data point and a Dataset-XML data point.

Already some time ago, I extended Dataset-XML to also allow HL7-FHIR "resources", i.e. FHIR-based EHR data (5 minutes work). Yesterday, I extended the popular open-source "Smart Dataset-XML Viewer" for picking up FHIR resources and visualizing them (3 hours work). 

As an example, I took the VS (vital signs) dataset of the FDA pilot study of 2013.  For a few VS records, I added the corresponding FHIR "Observation" source record. Here is how it looks in Dataset-XML:

The FHIR "Observation" resource is embedded in an ODM "ItemGroupData", which corresponds to a single SDTM record. The "Observation" resource itself then looks as follows (not everything is shown):
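In outline, the embedding looks roughly like this (a hand-written sketch, not the actual file from the pilot study; the OIDs and values are invented, and the FHIR resource is abbreviated to its human-readable narrative):

```xml
<ItemGroupData ItemGroupOID="IG.VS" data:ItemGroupDataSeq="1">
  <ItemData ItemOID="IT.VS.USUBJID" Value="CDISC01.701-1015"/>
  <ItemData ItemOID="IT.VS.VSTESTCD" Value="SYSBP"/>
  <ItemData ItemOID="IT.VS.VSORRES" Value="121"/>
  <!-- ... further VS variables ... -->
  <Observation xmlns="http://hl7.org/fhir">
    <text>
      <status value="generated"/>
      <div xmlns="http://www.w3.org/1999/xhtml">
        Systolic blood pressure: 121 mmHg
      </div>
    </text>
    <!-- machine-readable code, value and subject elements follow -->
  </Observation>
</ItemGroupData>
```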

If you would like a copy of this dataset, just drop me an e-mail, and I will be glad to provide it.
Now, how does this look like in the open-source "Smart Dataset-XML Viewer"?
First, we need to remark that each FHIR resource contains a human-readable part (using HTML) and a machine-readable part. For the visualization in the viewer, we chose to display only the human-readable part of the FHIR resource; that is what it is for. The machine-readable part is still present in the VS file and could be used by machines.

Here is the result in the viewer:

I programmed it in such a way that when the user holds the mouse over "USUBJID", the FHIR-EHR data point is displayed in a tooltip. Of course, other types of visualization, such as in a separate window or in the browser, could also easily be implemented.
Also remark that in this case the age, sex and actual arm of the subject are displayed as well, another, older (optional) feature of the "Smart Dataset-XML Viewer".

The outdated XPT format makes it impossible to add such additional information. The FDA is the only organization in the whole world using (and unfortunately also mandating) this format. With Dataset-XML, adding such additional information is a "piece of cake", and implementing and deploying the visualization of a specific resource type in the generally available, open-source "Smart Dataset-XML Viewer" is a matter of hours.

Yet another argument for the FDA to move away from XPT