CDISC end-to-end: 2017

The title of this post should have been something like "CDISC SDTM and Controlled Terminology – post-coordinated versus pre-coordinated", but then most people would probably have no idea what I am talking about. So a little bit of explanation first.

CDISC SDTM uses "post-coordinated" controlled terminology. This means that controlled terms are combined "as needed", so that a test description can be built "as required". The consequence is that the result is dynamic, the ontology is "what you see", and any combination of terms is possible. So essentially, the combination of e.g. LBTESTCD=Albumin with LBSPEC=Blood and LBMETHOD=dipstick is valid, although you can't test albumin in blood using the dipstick method (that method is only available for albumin in urine).
"Post-coordination" has its advantages. It brings (some) order into chaos. It is especially useful when it is not known in advance (or cannot be envisaged) which tests will be performed.

Most systems in healthcare use "pre-coordination". This means that the possible combinations are assembled in advance and, when meaningful, are assigned a single code. So not all combinations are possible. An example of such a system is LOINC. So in LOINC, you won't find a code for "albumin in blood measured using dipstick", but you will find a code (1751-7) for "albumin in serum or plasma measured quantitatively as mass/volume". Pre-coordinated codes are (must be) precise: each code should uniquely describe a term (a test in this case).
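
The difference can be sketched in a few lines of Python. This is only an illustration: the term lists are tiny made-up subsets (not the real CDISC codelists or the real LOINC table), the function names are hypothetical, and the tuple attached to 1751-7 is a simplification of the description quoted above.

```python
# Post-coordination: every variable is checked against its own term list,
# independently of the other variables (illustrative subsets only).
TEST_TERMS = {"Albumin", "Glucose"}
SPECIMEN_TERMS = {"Blood", "Urine"}
METHOD_TERMS = {"dipstick"}


def post_coordinated_is_valid(test, specimen, method):
    """A record is accepted as soon as each individual term is known."""
    return (
        test in TEST_TERMS
        and specimen in SPECIMEN_TERMS
        and method in METHOD_TERMS
    )


# Pre-coordination: only combinations assembled in advance have a code.
# The single entry is the LOINC code quoted in the text.
PRE_COORDINATED = {
    "1751-7": ("Albumin", "Serum or Plasma", "Mass/volume"),
}


def pre_coordinated_lookup(code):
    """Return the meaning of a code, or None when no such code exists."""
    return PRE_COORDINATED.get(code)


# The clinically meaningless combination passes the post-coordinated check ...
print(post_coordinated_is_valid("Albumin", "Blood", "dipstick"))
# ... whereas a pre-coordinated system simply has no code for it.
print(pre_coordinated_lookup("1751-7"))
```

The point of the sketch is that the post-coordinated check has no place where "albumin in blood by dipstick" could be rejected, because no rule ever looks at the three terms together.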

CDISC SDTM findings domains have been developed to bring "order into chaos". Essentially this means the paper world, or the world where protocols do not precisely describe which tests need to be performed. For example, in the famous LZZT protocol we find the following tests defined: "Urinalysis: Color, Specific gravity, pH, Protein, Glucose, Ketones, Bilirubin, …". That's it. So not very precise. The problem with this is that each site can (and probably will) perform different tests. For example, for "glucose in urine", LOINC lists over 20 different tests (even when excluding all the "post" and "challenge" tests). When the data is then submitted, post-coordination is necessary, but the results will not be comparable between sites, studies and sponsors. Even the combination of LBTESTCD (essentially the analyte), LBSPEC (the specimen, e.g. "urine") and LBMETHOD does not at all guarantee that the test is uniquely identified. So it is no wonder that the FDA recently mandated the use of LBLOINC, i.e. it requires (as of 2020) that the unique LOINC identifier is added as well.

The problem, however, is not limited to laboratory tests. For example, there has been a discussion on the CDISC wiki about the "ebola vital signs CRF", and about how the important test "highest temperature in the last 24 hours" must be annotated for SDTM. Using SDTM, it cannot be done, as there is no way to define "in the last 24 hours".

The solution is, however, simple when using LOINC: the LOINC code 8315-4 "Body temperature 24 hour maximum" describes this test exactly.

Note that the argument "pre-coordination could result in an explosion of new CT terms ..." is nonsense if CDISC finally allows LOINC to be used (it is not a problem in healthcare ...).

This means that our current SDTM findings variables are not always able to exactly describe tests, even when using post-coordination.

Nowadays, we see that research data is more and more extracted from electronic health records (EHRs) and hospital information systems (HIS), rather than collected separately (e-Source). There are even voices that say: "in 5 years from now, everything will be e-Source". Data from e-Source is almost always pre-coordinated, i.e. it uses pre-coordinated terminology like LOINC, SNOMED-CT, etc.
When e-Source data is used, and the data is submitted, the pre-coordinated terminology must be translated to post-coordinated terminology, which is arbitrary, ambiguous, and not always possible, as the "highest temperature in the last 24 hours" example clearly shows. For lab tests, we can use the LOINC tests 5792-7, 22705-8 and 25428-4 as an example: all three would be modeled in SDTM as LBTESTCD=GLUC, LBSPEC=URINE and LBMETHOD=TEST STRIP. One can only distinguish by looking at the results themselves and at the units used.
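
A minimal sketch of this information loss, in Python. The mapping table only restates what is said above (three distinct LOINC codes ending up with the same three SDTM values); the function names are hypothetical, and a real mapping would of course be far larger.

```python
from collections import defaultdict

# Translation from pre-coordinated codes to post-coordinated SDTM values,
# as described in the text: (LBTESTCD, LBSPEC, LBMETHOD).
LOINC_TO_SDTM = {
    "5792-7": ("GLUC", "URINE", "TEST STRIP"),
    "22705-8": ("GLUC", "URINE", "TEST STRIP"),
    "25428-4": ("GLUC", "URINE", "TEST STRIP"),
}


def invert(mapping):
    """Group the source codes by the SDTM triple they are translated to."""
    inverse = defaultdict(list)
    for code, triple in mapping.items():
        inverse[triple].append(code)
    return dict(inverse)


def ambiguous_triples(mapping):
    """Return the triples that more than one source code maps to."""
    return {
        triple: codes
        for triple, codes in invert(mapping).items()
        if len(codes) > 1
    }


collisions = ambiguous_triples(LOINC_TO_SDTM)
for triple, codes in collisions.items():
    # Each line printed here is a case where the original test
    # cannot be recovered from the SDTM values alone.
    print(triple, "<-", sorted(codes))
```

Any triple reported by `ambiguous_triples` marks a translation that cannot be reversed: once the LOINC code is dropped, a reviewer has only the results and units left to guess which test was actually performed.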

Both examples "maximum temperature in the last 24 hours" and "glucose in urine by test strip" demonstrate that information loss is possible or even unavoidable. So, even when the test is exactly described by a pre-coordinated code (LOINC, SNOMED, …), we are forced to submit using a post-coordinated system with loss of information or test uniqueness.

This leads me to an important conclusion: the current SDTM is not fit for use with e-Source.
It is great for the paper world and for classic EDC where data is collected separately from the healthcare world.

How can we do better? Especially when the statement "everything will be e-Source in 5 years from now" becomes true.

In the past, I published an article "An Alternative CDISC-Submission Domain for Laboratory Data (LB) for Use with Electronic Health Record Data" in the "European Journal of BioMedical Informatics" (EJBI), in which I proposed that, at least for laboratory data coming from e-Source, the typical LBTESTCD, LBTEST, LBSPEC and LBMETHOD are replaced by a set of variables that align with the 6 dimensions of LOINC.

However, this only provides a solution for laboratory data using LOINC, and more coding systems are used in e-Source data. For example, for microbiology data, NCBI coding is often used. This means that when using e-Source, (pre-coordinated) data using NCBI coding must be translated to one or more of the variables in the SDTM domain, which uses its own CDISC controlled terminology, with guaranteed loss of information, as NCBI is much more specific.

Essentially, all this means that we need an alternative "e-Source" domain for each of the existing SDTM findings domains. These new domains can be much simpler than the existing SDTM domains, as much of the information for which several variables are needed in the "classic" domains can now be in one single variable, the "test code". As these domains need to be "code system neutral", the core variables in these "e-Source" domains would be "test code", "code system" and maybe "test name". The latter is not even necessary, as there is a 1:1 relationship with "test code", and it can easily be looked up automatically by computer systems, e.g. using one of the many RESTful web services from NLM, UMLS, NIH, HIPAA etc.

So for example, for the "e-Source LB" domain, the core variables would be:
"Study ID" to "Sponsor-Defined Identifier" and then "test code" and "test system", "original test result", "original result units" (using UCUM). The classic LBCAT, LBSCAT, LBSPEC, LBMETHOD can be removed, as they are all already included in the pre-coordinated "test code". Note that I avoid assigning variable names, as e.g. "LBTESTCD" would mean completely different things in the two variations of the LB domain. In the e-Source domain it would mean "the unique test code", whereas in the classic domain, LBTESTCD is essentially misleading, as it specifies the analyte, and not the test (note that –TESTCD has a different meaning depending on the domain in classic SDTM).

In the "e-Source" LB domain, the first records in example 1 of the SDTM-IG (page LB-5) would look like:

Study Identifier	Domain	Unique Subject ID	Sequence Number	Test Code	Code System	Original Result	Original Result Units (UCUM)
ABC	LB	ABC-001-001	1	1751-7	LOINC	30	g/L
ABC	LB	ABC-001-001	2	6768-6	LOINC	398	[iU]/L
ABC	LB	ABC-001-001	5	26464-8	LOINC	5.9	10*9/L
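
As a rough sketch, the rows above could be represented in Python as follows. The class and attribute names are purely illustrative Python identifiers (they are not proposed SDTM variable names), and results are kept as strings so that the original values are passed through unchanged.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class ESourceFinding:
    """One record of a hypothetical code-system-neutral findings domain."""
    study_id: str
    domain: str
    subject_id: str
    sequence: int
    test_code: str        # pre-coordinated code, e.g. a LOINC code
    code_system: str      # which system the code comes from
    original_result: str  # kept as text, exactly as received
    original_unit: str    # UCUM notation


ROWS = [
    ESourceFinding("ABC", "LB", "ABC-001-001", 1, "1751-7", "LOINC", "30", "g/L"),
    ESourceFinding("ABC", "LB", "ABC-001-001", 2, "6768-6", "LOINC", "398", "[iU]/L"),
    ESourceFinding("ABC", "LB", "ABC-001-001", 5, "26464-8", "LOINC", "5.9", "10*9/L"),
]


def to_tsv(rows):
    """Serialize the records as tab-separated lines, one record per line."""
    return "\n".join("\t".join(str(v) for v in astuple(row)) for row in rows)


print(to_tsv(ROWS))
```

Note that nothing in the structure is specific to laboratory data or to LOINC: the same record type would serve any findings domain and any coding system, which is what "code system neutral" amounts to in practice.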

Using UCUM notation is important, as UCUM notation is almost always used in e-Source, and we don't want information loss or conversion errors. Moreover, UCUM allows automated conversions (e.g. for the "standardized result"), using one of the RESTful web services available (NLM's and our own).

The next columns in the "e-Source" LB domain would then be "reference range indicators" and the "standardized results". The latter could then (at least for quantitative results) use e.g. the "LOINC proposed unit".

Similarly, for the ebola "highest temperature in the last 24 hours", which cannot be exactly described at all in classic SDTM, the "e-Source" VS domain could contain a record like:

Study Identifier	Domain	Unique Subject ID	Sequence Number	Test Code	Code System	Original Result	Original Result Units (UCUM)
ABC	VS	EBO-001-001	1	8315-4	LOINC	31.5	Cel

Here too, VSCAT (and VSSCAT) are not used, as that information is already contained in the test code 8315-4. In most cases, even VSPOS will be unnecessary (e.g. for blood pressure), as it is already included in the LOINC code.

Conclusions:

As it is clear that the current SDTM is not fit for use with e-Source, we make a first proposal for a set of "e-Source" findings domains, using pre-coordinated coding systems (as are already used in e-Source), and using UCUM as much as possible for unit notation.

These "e-Source" domains are not meant to replace the "classic" SDTM domains, as these retain their value for the "classic" case where data is collected separately (paper, classic EDC). The "classic" domains can only be deprecated when "everything is e-Source", so maybe in 5 years from now?

Please note that with this first proposal, I do not encourage the use of "tables" for regulatory submissions. On the contrary, in the longer term, we need to move to the submission of "biomedical concept" data points or "resources".

But that's another discussion …

Sunday, December 10, 2017

SDTM and CDISC-CT: fit for e-Source?