The title of this post should have been something like
"CDISC SDTM and Controlled Terminology – post-coordinated versus pre-coordinated", but then most people would probably have no idea what
I am talking about. So a little bit of explanation first.
CDISC SDTM uses "post-coordinated" controlled
terminology. This means that controlled terms from separate variables are combined "as
needed", so that the description of a test is built "as required". The consequence
is that the result is dynamic, the ontology is "what you see", and
any combination of terms is possible. So essentially, the combination of e.g.
LBTESTCD=Albumin with LBSPEC=Blood and LBMETHOD=dipstick is valid, although you
can't test albumin in blood using the dipstick method (that method is only
available for albumin in urine).
"Post-coordination" has its advantages. It brings (some) order into chaos. It is especially useful when it is not known in advance (or cannot be envisaged) which tests will be performed.
Most systems in healthcare use "pre-coordination".
This means that any possible combinations are assembled in advance and, when
meaningful, obtain a single code. So not all combinations are possible. An
example of such a system is LOINC. So in LOINC, you won't find a code for
"albumin in blood measured using dipstick", but you will find a code
(1751-7) for "albumin in serum or plasma measured quantitatively as mass/volume".
Pre-coordinated codes are (must be) precise: each code should uniquely describe a term
(a test in this case).
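The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: the term lists are tiny made-up subsets (not the real CDISC codelists), the function names are hypothetical, and the tuple used as key for the LOINC code is my own simplification of how a pre-coordinated code is defined.

```python
# Post-coordinated: every variable is checked against its own term list,
# independently of the other variables.
# Illustrative subsets only, NOT the real CDISC controlled terminology.
LBTESTCD_TERMS = {"ALB", "GLUC"}
LBSPEC_TERMS = {"BLOOD", "URINE", "SERUM OR PLASMA"}
LBMETHOD_TERMS = {"DIPSTICK", "TEST STRIP"}


def valid_post_coordinated(testcd, spec, method):
    """A combination is accepted as soon as each single term is known."""
    return (
        testcd in LBTESTCD_TERMS
        and spec in LBSPEC_TERMS
        and method in LBMETHOD_TERMS
    )


# Pre-coordinated: only combinations that were assembled in advance
# have a code. The key layout (analyte, specimen, method) is a
# simplification made for this sketch.
PRE_COORDINATED = {
    ("ALB", "SERUM OR PLASMA", None): "1751-7",
}


def pre_coordinated_code(testcd, spec, method):
    """Return the single code for a combination, or None if none exists."""
    return PRE_COORDINATED.get((testcd, spec, method))


# "Albumin in blood by dipstick" passes the post-coordinated check ...
print(valid_post_coordinated("ALB", "BLOOD", "DIPSTICK"))
# ... but there is no pre-coordinated code for it.
print(pre_coordinated_code("ALB", "BLOOD", "DIPSTICK"))
```

The post-coordinated check cannot reject a meaningless combination, because it never looks at the terms together.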
CDISC SDTM findings domains have been developed to bring
"order in chaos". Essentially this means the paper world or the world
where protocols do not precisely describe which tests need to be performed. For
example, in the famous LZZT protocol we find the following tests defined: "Urinalysis: Color, Specific gravity, pH, Protein, Glucose,
Ketones, Bilirubin, …".
That's it. So not very precise. The problem with this is that each site can
(and will probably) perform different tests. For example, for "glucose in urine",
LOINC lists over 20 different tests (even when excluding all the
"post" and "challenge" tests). When such data is then submitted,
post-coordination is necessary, but the results will not be comparable between
sites, studies and sponsors. Even the combination of LBTESTCD (essentially the
analyte), LBSPEC (the specimen, e.g. "urine") and LBMETHOD does not
guarantee that a test is uniquely identified. So it is no wonder that the FDA recently mandated the use of LBLOINC, i.e. it requires (as of 2020) that
the unique LOINC identifier is added as well.
The problem however is not limited to laboratory tests
alone. For example, there has been a discussion on the CDISC wiki about the
"ebola vital signs CRF", about how the important test "highest temperature in the last 24 hours" must be annotated for SDTM. Using SDTM, it
cannot be done, as there is no way to define "in the last 24 hours".
The solution is however simple when using LOINC: the LOINC code 8315-4 "Body temperature 24 hour maximum" describes this test exactly.
Note that the argument "pre-coordination could result in an explosion of new CT terms ..." is nonsense if CDISC finally allows LOINC to be used (it is not a problem in healthcare ...).
This means that our current SDTM findings variables are not always able to exactly describe tests, even when using post-coordination.
Nowadays, we see that research data are more and more extracted from electronic health records (EHRs) and hospital information systems (HIS), rather than collected separately (e-Source). There are even voices that say: "in 5 years from now, everything will be e-Source". Data from e-Source is almost always pre-coordinated, i.e. it uses pre-coordinated terminology like LOINC, SNOMED-CT, etc.
When e-Source data is used and the data is submitted, the pre-coordinated terminology must be translated to post-coordinated terminology, which is arbitrary, ambiguous, and not always possible, as the "highest temperature in the last 24 hours" example clearly shows. For lab tests, we can use the LOINC tests 5792-7, 22705-8 and 25428-4 as an example: all three would be modeled in SDTM as LBTESTCD=GLUC, LBSPEC=URINE and LBMETHOD=TEST STRIP. One can only distinguish them by looking at the results themselves and at the units used.
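The ambiguity of the glucose example can be made explicit with a small Python sketch. The mapping below only contains the three LOINC codes and the SDTM values named above; the dictionary and function names are hypothetical and chosen for illustration.

```python
from collections import defaultdict

# LOINC code -> (LBTESTCD, LBSPEC, LBMETHOD), as described in the text.
LOINC_TO_SDTM = {
    "5792-7": ("GLUC", "URINE", "TEST STRIP"),
    "22705-8": ("GLUC", "URINE", "TEST STRIP"),
    "25428-4": ("GLUC", "URINE", "TEST STRIP"),
}


def invert(mapping):
    """Group the source codes by the SDTM triple they are mapped to."""
    inverse = defaultdict(list)
    for code, triple in mapping.items():
        inverse[triple].append(code)
    return dict(inverse)


# Going back from the SDTM triple yields three candidates, so the
# original test cannot be recovered from these variables alone.
candidates = invert(LOINC_TO_SDTM)[("GLUC", "URINE", "TEST STRIP")]
print(candidates)
```

A mapping that sends several codes to the same triple has no inverse, which is exactly the information loss at stake here.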
Both examples "maximum temperature in the last 24
hours" and "glucose in urine by test strip" demonstrate that information
loss is possible or even unavoidable. So, even when the test is exactly described
by a pre-coordinated code (LOINC, SNOMED, …), we
are forced to submit using a post-coordinated system with loss of information
or test uniqueness.
This leads me to an important conclusion: the current SDTM
is not fit for use with e-Source.
It is great for the paper world and for classic EDC where data is collected separately from the healthcare world.
How can we do better? Especially when the statement
"everything will be e-Source in 5 years from now" becomes true.
In the past, I published an article "An Alternative CDISC-Submission Domain for Laboratory Data (LB) for Use with Electronic Health Record Data" in the "European Journal of BioMedical Informatics" (EJBI) where I
proposed that, at least for laboratory data coming from e-Source, the typical
LBTESTCD, LBTEST, LBSPEC and LBMETHOD are replaced by a set of variables that align with the 6 dimensions
of LOINC.
However, this only provides a solution for laboratory data
using LOINC. There are however more coding systems used in e-Source data. For
example, for microbiology data, NCBI coding is often used. This means that when
using e-Source, data (pre-coordinated) using NCBI coding must be translated to one
or more of the SDTM variables in the SDTM domain, which uses its own CDISC controlled
terminology, and with guaranteed loss of information, as NCBI is much more
specific.
Essentially, all this means that we need an alternative
"e-Source" domain for each of the existing SDTM findings domains.
These new domains can be much simpler than the existing SDTM domains, as much
of the information for which several variables are needed in the "classic"
domains, can now be in one single variable, the "test code". As these
domains need to be "code system neutral", the core variables in these
"e-Source" domains would be "test code", "code
system" and maybe "test name". The latter is not even necessary,
as there is a 1:1 relationship with "test code", so the name can easily be
looked up automatically by computer systems, e.g. using one of the many RESTful web services from NLM, UMLS, NIH, HIPAA etc.
So for example, for the "e-Source LB" domain, the core
variables would be:
"Study ID" to "Sponsor-Defined Identifier" and then "test code" and "code system", "original test result", "original result units" (using UCUM). The classic LBCAT, LBSCAT, LBSPEC, LBMETHOD can be removed, as they are all already included in the pre-coordinated "test code". Note that I avoid assigning variable names, as e.g. "LBTESTCD" would mean completely different things in the two variations of the LB domain. In the e-Source domain it would mean "the unique test code" whereas in the classic domain, LBTESTCD is essentially misleading, as it specifies the analyte, and not the test (note that --TESTCD has a different meaning depending on the domain in classic SDTM).
In the "e-Source" LB domain, the first records in example 1 of the SDTM-IG
(page LB-5) would look like:
Study Identifier | Domain | Unique Subject ID | Sequence Number | Test Code | Code System | Original Result | Original Result Units (UCUM)
ABC | LB | ABC-001-001 | 1 | | LOINC | 30 | g/L
ABC | LB | ABC-001-001 | 2 | | LOINC | 398 | [iU]/L
ABC | LB | ABC-001-001 | 5 | | LOINC | 5.9 | 10*9/L
Using UCUM notation is important, as UCUM notation is almost
always used in e-Source, and we don't want information loss or conversion
errors. Moreover, UCUM allows automated conversions (e.g. for the
"standardized result"), using one of the RESTful web services
available (NLM and our own one).
The next columns in the "e-Source" LB domain would
then be "reference range indicators", and the "standardized
results". The latter could then (at least for quantitative results) e.g.
use the "LOINC proposed unit".
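How a "standardized result" could be derived from the original result is sketched below. In practice the conversion would be delegated to a UCUM-aware library or web service. Here a small hand-written factor table stands in for that. The table, the function name and the error handling are assumptions made for this illustration.

```python
# (from_unit, to_unit) -> multiplicative factor, units in UCUM notation.
# Hand-written stand-in for a real UCUM conversion service.
FACTORS = {
    ("g/dL", "g/L"): 10.0,
}


def standardize(value, unit, target_unit):
    """Convert a quantitative result to the target unit."""
    if unit == target_unit:
        return value
    try:
        factor = FACTORS[(unit, target_unit)]
    except KeyError:
        # Never guess: an unknown conversion must be reported, not skipped.
        raise ValueError(f"no conversion from {unit} to {target_unit}")
    return value * factor


# An albumin result of 3.0 g/dL expressed in g/L.
print(standardize(3.0, "g/dL", "g/L"))
```

Because the units are machine-readable, an unknown conversion fails loudly instead of silently producing a wrong number.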
Similarly, for the "ebola highest temperature in the
last 24 hours", which cannot be exactly described at all in classic SDTM, the
"e-Source" VS domain could contain a record like:
Study Identifier | Domain | Unique Subject ID | Sequence Number | Test Code | Code System | Original Result | Original Result Units (UCUM)
ABC | VS | EBO-001-001 | 1 | 8315-4 | LOINC | 31.5 | Cel
Also here, VSCAT (and VSSCAT) are not used, as that information is already contained in the test code 8315-4. In most cases, even VSPOS will be unnecessary (e.g. for blood pressure), as the position is already included in the LOINC code.
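As a sketch of how simple such a "code system neutral" record is, the VS example can be written as a small Python data structure. The class and field names are hypothetical (I deliberately avoid SDTM variable names), and the local dictionary stands in for a lookup service that returns the test name for a code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ESourceFinding:
    """One record of a hypothetical "e-Source" findings domain."""
    study_id: str
    domain: str
    subject_id: str
    sequence: int
    test_code: str
    code_system: str
    original_result: str
    original_unit_ucum: str


# Stand-in for a terminology lookup: (code system, code) -> test name.
TEST_NAMES = {
    ("LOINC", "8315-4"): "Body temperature 24 hour maximum",
}


def lookup_test_name(record):
    """The test name is derived from the code, so it need not be stored."""
    return TEST_NAMES.get((record.code_system, record.test_code))


record = ESourceFinding(
    study_id="ABC",
    domain="VS",
    subject_id="EBO-001-001",
    sequence=1,
    test_code="8315-4",
    code_system="LOINC",
    original_result="31.5",
    original_unit_ucum="Cel",
)
print(lookup_test_name(record))
```

Category, position and the "24 hour" timing need no fields of their own, as they are all carried by the single test code.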
Conclusions:
As it is clear that the current SDTM is not fit for use with
e-Source, we make a first proposal for a set of "e-Source" findings
domains, using pre-coordinated coding systems (as is already used in e-Source),
and using UCUM as much as possible for unit notation.
These "e-Source" domains are not meant to replace
the "classic" SDTM domains, as these retain their value for the
"classic" case where data is collected separately (paper, classic EDC).
These "classic" domains can only be deprecated when
"everything is e-Source", so maybe in 5 years from now?
Please note that with this first proposal, I do not
encourage the use of "tables" for regulatory submissions. On the contrary,
in the longer term, we need to move to the submission of "biomedical concept"
data points or "resources".
But that's another discussion …