CDISC Biomedical Concepts (usually abbreviated as "BCs") and SDTM Dataset Specializations (abbreviated here as "DSSs") are great building blocks for setting up clinical studies. The latter link the BCs to specific SDTM domains and guarantee correct relations between the different variable values, especially for Findings SDTM datasets.
For example, the BC "Diastolic Blood Pressure" (C25299) is implemented as the DSS "DIABP", which states that the domain for this measurement is VS (Vital Signs), that the test code VSTESTCD has the value DIABP with VSTEST="Diastolic Blood Pressure", that the values of VSORRES and VSSTRESN are expected to be of type "integer", and that the value of VSORRESU (Original Units) and VSSTRESU (Standard Units) is expected to be "mmHg". It also states that the value of VSPOS (Vital Signs Position of Subject) is expected to be one of "PRONE", "SEMI-RECUMBENT", "SITTING", "STANDING", and "SUPINE".
Essentially, this is also what we usually find in the "Value Level Metadata" (VLM) in the define.xml.
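To make the idea concrete, here is a minimal Python sketch of the DIABP constraints described above, written as a plain dictionary and checked against a single record. The dictionary layout and the function name `check_vs_record` are assumptions made for illustration only; they are not the format used by the CDISC Library or by define.xml.

```python
import re

# Hypothetical, simplified stand-in for the "DIABP" Dataset Specialization.
DIABP_ALLOWED = {
    "VSTEST": {"Diastolic Blood Pressure"},
    "VSORRESU": {"mmHg"},
    "VSSTRESU": {"mmHg"},
    "VSPOS": {"PRONE", "SEMI-RECUMBENT", "SITTING", "STANDING", "SUPINE"},
}


def check_vs_record(record):
    """Return violation messages for one VS record (empty list if none)."""
    if record.get("VSTESTCD") != "DIABP":
        return []  # the specialization does not apply to this record
    problems = []
    for variable, allowed in DIABP_ALLOWED.items():
        value = record.get(variable)
        # Absent or empty values are skipped in this sketch.
        if value not in (None, "") and value not in allowed:
            problems.append(f"{variable}={value!r} is not an allowed value")
    # VSORRES is expected to hold an integer for this specialization.
    result = str(record.get("VSORRES", ""))
    if not re.fullmatch(r"-?\d+", result):
        problems.append(f"VSORRES={result!r} is not an integer")
    return problems


record = {"VSTESTCD": "DIABP", "VSTEST": "Diastolic Blood Pressure",
          "VSORRES": "80", "VSORRESU": "mmHg", "VSPOS": "SITTING"}
print(check_vs_record(record))
```

A record that follows the specialization yields an empty list; each deviation adds one message.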
When one thinks carefully about this, one could also say that the DSS defines a "rule". But once the collected data has been mapped to SDTM, e.g. using the SDTM-ETL mapping software, how can we ensure that these "rules" have been followed?
This is where CDISC CORE (CDISC Open Rules Engine) comes into play.
CORE already comes with a very large rule set for SDTMIG, SENDIG and ADaM, as well as FDA business rules. However, CORE also allows you to add your own sets of rules, either by adding them to the CORE cache as an "own standard", or by keeping them in a separate directory as a set of YAML or JSON files, to which one can point using the "-lr" (or "--local-rules") parameter when using the CLI.
But how can we use these "Dataset Specializations" as "rules" in CORE?
The DSSs can be retrieved from the CDISC Library and its API in an automated way. For each of the Specializations, we can then analyze the
content and transform that to a CORE rule in YAML format.
In early June this year, we developed a software program to automate this process, and we are still refining it. For each DSS, it generates about 3 rules on average, depending of course on the complexity and content of the DSS. For example, we must take into account whether a variable is "required", "expected" or "permissible".
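The following Python sketch shows, in a very simplified way, how one DSS could fan out into several rules. It is not our actual software: the input layout, the output dictionaries (which are not the CORE YAML schema), the "core" values in the example, and the choice to flag rules on "permissible" variables for manual review are all assumptions made for illustration.

```python
# Hypothetical, simplified DSS content; the "core" values are illustrative only.
EXAMPLE_DSS = {
    "id": "DIABP",
    "variables": [
        {"name": "VSTESTCD", "core": "required", "value": "DIABP"},
        {"name": "VSORRES", "core": "expected", "datatype": "integer"},
        {"name": "VSORRESU", "core": "expected", "allowed": ["mmHg"]},
        {"name": "VSPOS", "core": "permissible",
         "allowed": ["PRONE", "SEMI-RECUMBENT", "SITTING", "STANDING", "SUPINE"]},
    ],
}


def rules_from_dss(dss):
    """Build one simplified rule per constrained, non-fixed variable of a DSS."""
    # Variables with a single fixed value identify the test: they become pre-conditions.
    conditions = {v["name"]: v["value"] for v in dss["variables"] if "value" in v}
    rules = []
    for var in dss["variables"]:
        if "value" in var:
            continue  # already used as a pre-condition
        if "allowed" in var:
            check = {"variable": var["name"], "one_of": var["allowed"]}
        elif "datatype" in var:
            check = {"variable": var["name"], "datatype": var["datatype"]}
        else:
            continue  # nothing to check for this variable
        rules.append({
            "id": f'{dss["id"]}.{var["name"]}',
            "when": conditions,
            "check": check,
            # Rules on permissible variables are flagged for manual curation.
            "needs_review": var.get("core") == "permissible",
        })
    return rules


for rule in rules_from_dss(EXAMPLE_DSS):
    print(rule["id"], rule["needs_review"])
```

Keeping the pre-conditions separate from the check mirrors the "when ..., then ..." reading of a specialization.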
Very recently (September 2025), CDISC updated the BCs and DSSs in the CDISC Library, so we ran the software again, leading to 2,627 (yes, you read that right: two thousand six hundred twenty-seven) CORE rules. That will be hell to review and curate ...
Let's have a first look at an example of such a rule, derived from the BC "CDISC ADAS-Cog - Orientation Summary Score" (C100238).
Like every CORE rule, it starts with the list of standards and versions it is applicable to:
followed by the rule logic itself:
stating that when FTTESTCD=ADCOR, FTCAT=ADAS-COG and FTSCAT=ORIENTATION, the value of FTSTRESN (which represents the "score") must be an integer between 0 and 8.
A similar rule is also generated stating that, under the same pre-conditions, the value in FTORRES must be a string with allowed values "0", "1", … "8", and the same for FTSTRESC.
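Expressed outside of CORE, the logic of these two rules can be sketched in a few lines of Python. This is only an illustration of the conditions described above, not the generated YAML; the function names are hypothetical.

```python
# Pre-conditions shared by both rules, as described in the text.
PRECONDITIONS = {"FTTESTCD": "ADCOR", "FTCAT": "ADAS-COG", "FTSCAT": "ORIENTATION"}
ALLOWED_TEXT = {str(score) for score in range(9)}  # "0" .. "8"


def applies(record):
    """True when the record is an ADAS-Cog orientation summary score."""
    return all(record.get(name) == value for name, value in PRECONDITIONS.items())


def numeric_score_violation(record):
    """True when the numeric score is not an integer between 0 and 8."""
    if not applies(record):
        return False
    value = record.get("FTSTRESN")
    is_integer = isinstance(value, int) and not isinstance(value, bool)
    return not (is_integer and 0 <= value <= 8)


def text_score_violation(record, variable="FTORRES"):
    """True when the character result is not one of "0" .. "8"."""
    return applies(record) and record.get(variable) not in ALLOWED_TEXT


record = {"FTTESTCD": "ADCOR", "FTCAT": "ADAS-COG", "FTSCAT": "ORIENTATION",
          "FTORRES": "9", "FTSTRESC": "9", "FTSTRESN": 9}
print(numeric_score_violation(record), text_score_violation(record))
```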
The lower part of the YAML then provides further information: the status of the rule, an Id, the description, and the violation message. The latter is also generated automatically. Then follows the information about which SDTM classes and domains the rule is applicable to, and whether the rule needs to be applied on the record level or the dataset level:
In this case, for this questionnaire, FTSTRESC is equal to FTSTRESN, but this does not always have to be the case, of course.
Another, very different rule automatically generated is derived from the BC "Ventricular Arrhythmia ECG Assessment" (C111320):
stating that when EGTESTCD=VTARRY and EGCAT=FINDING, the value of EGSTRESC is restricted to an enumeration of 15 possible values.
Most of you have probably already noticed that some of the code logic has been commented out. This was done by the software itself. And this is where the hard part starts …
For EGEVAL, one sees that, as a pre-condition, it states that the value must be one of "ADJUDICATION COMMITTEE", ... to "VENDOR". However, EGEVAL is "permissible". So, if we leave the rule as it is, and EGEVAL is not provided in the dataset to be tested, the final test in the rule will not be executed, and if EGSTRESC is something other than what is provided in the list, this will not be detected. So in this case, one may want to uncomment lines 58 to 64.
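The effect of a pre-condition on a permissible variable can be illustrated with a small Python sketch. The flag `allow_missing_evaluator` is a hypothetical stand-in for the choice discussed above, and the set of results is a placeholder, as the 15 real values are not repeated here.

```python
# The two ends of the EGEVAL list mentioned above; the full list is not repeated here.
EVALUATORS = {"ADJUDICATION COMMITTEE", "VENDOR"}
# Placeholder for the 15 enumerated EGSTRESC values (not the real terms).
RESULTS = {"PLACEHOLDER RESULT"}


def violates(record, allow_missing_evaluator):
    """Return True when the enumerated-result rule flags the record."""
    if record.get("EGTESTCD") != "VTARRY" or record.get("EGCAT") != "FINDING":
        return False
    evaluator = record.get("EGEVAL")
    if evaluator in (None, ""):
        if not allow_missing_evaluator:
            # Pre-condition not met: the result is never checked.
            return False
    elif evaluator not in EVALUATORS:
        return False
    return record.get("EGSTRESC") not in RESULTS


# EGEVAL is absent and EGSTRESC is not in the list.
record = {"EGTESTCD": "VTARRY", "EGCAT": "FINDING", "EGSTRESC": "SOMETHING ELSE"}
print(violates(record, allow_missing_evaluator=False))  # problem goes undetected
print(violates(record, allow_missing_evaluator=True))   # problem is reported
```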
Allowing the permissible variable to be absent or not populated is however not always a good idea …
This applies especially to LBSPEC for lab tests, which is also "permissible".
A very simple example is "Carbon Monoxide" (C139084). For the DSS "Carbon Monoxide Concentration in Blood", we get:
stating that when LBTESTCD=CMONOX and LBSPEC=BLOOD, then LBORRES must be a floating point number.
Allowing LBSPEC to be absent or empty doesn't make sense here, although the SDTMIG states it is "permissible".
Note that we have no indication here of "what property" is measured, except from the title. In LOINC, this is captured by the variable "Property" (examples are Scnc, Mcnc, Pres, …); in CDISC, starting from SDTMIG 3.4, by the variable "--RESTYP" (Result Type). Similarly, no indication is provided about what the result type must be, like "Quantitative", "Ordinal", … In LOINC this is provided by the required variable "Scale Type", in SDTM by the new variable --RESSCL (Result Scale). It allows one to distinguish, for example, between a quantitative test and an ordinal test. This is especially important for urine tests, where in some cases a concentration is measured, and in other cases the presence is reported as "ordinal", e.g. "+1", "+2", or essentially as a sort of boolean ("Positive", "Negative"). The current DSSs, however, never contain the important --RESSCL variable nor the --RESTYP variable, as these were only introduced in IG version 3.4, and the DSS system is supposed to also be applicable to earlier versions of the SDTMIG.
When going through the DSSs for lab tests and comparing them with the LOINC database, we found that some types of tests are not yet represented in the DSS suite, e.g. those that have the LOINC "property" "SRat" (Substance Rate) or "MRat" (Mass Rate), which are about quantities per unit of time.
Let us start with what looks to be a rather simple example: Potassium in urine.
There is one DSS with code KURIN "Potassium Concentration in Urine".
If we look into the LOINC database selecting component=potassium and system=urine, we get 13 codes, of which one (111459-4) is a "generic" one. The list is:
All have the "scale type" "Quantitative" (Qn), i.e. a number is expected for the result.
This would probably mean we want to add LBRESSCL=QUANTITATIVE to the DSS.
Would this be good enough for a "dataset specialization", or do we need to "slice" more?
If we look more carefully, we see that there are distinct groups for "property":
- "SRat" and "MRat", meaning "Substance Rate" and "Mass Rate", i.e. a quantity per (some) unit of time, where "Substance" essentially means "molar units" and "Mass" means "mass units".
- "Scnc", meaning "Substance Concentration", i.e. a concentration in molar units.
The current "KURIN" DSS covers the "Scnc" case (as it says "Concentration"): LOINC code 2828-2, which is also listed in the properties of this DSS itself.
The second thing we need to look into is the "time aspect". The one described in "KURIN" is essentially "Pt" (Point in Time), although the "dataset specialization" doesn't say so explicitly.
There are others, however, such as "24H", for which we find "in 24 hour Urine" in the "long common name". This makes sense, as concentrations in urine usually vary a lot during the day (this is especially the case for glucose), so having an "average" over 24 hours is really useful.
P.S. It does not always have to be the "average". For example, in malaria, the maximum body temperature over a period of 24 hours is very important.
But we also find "12H" and "2H" for the "time aspect".
So, how could this flow into a "dataset specialization"? Should we have one for the "Point in time" case (this is the KURIN case), and another for the other time aspects, which would then include LBPDUR (Planned Duration) with the possible values "PT24H", "PT12H" and "PT2H" (as we use the ISO 8601 "duration" format in SDTM)? And should other values (for which there is no LOINC code) perhaps be added as well?
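The mapping from a LOINC "time aspect" to an LBPDUR value is mechanical for the hour-based cases mentioned here. A minimal Python sketch, assuming only the "Pt" and "<n>H" forms need to be handled (the function name is hypothetical, and other time aspects are deliberately rejected rather than guessed):

```python
import re


def time_aspect_to_lbpdur(time_aspect):
    """Map a LOINC time aspect such as "24H" to an ISO 8601 duration.

    "Pt" (point in time) has no duration, so None is returned.
    """
    if time_aspect == "Pt":
        return None
    match = re.fullmatch(r"(\d+)H", time_aspect)
    if match is None:
        # Anything else is out of scope for this sketch.
        raise ValueError(f"Unsupported time aspect: {time_aspect!r}")
    return f"PT{int(match.group(1))}H"


for aspect in ("Pt", "24H", "12H", "2H"):
    print(aspect, "->", time_aspect_to_lbpdur(aspect))
```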
We do not have a DSS (yet) for the case of "Substance Rate" and "Mass Rate". Such a DSS will always have a "time aspect", i.e. we will need to have LBPDUR included. So in this case, LBPDUR
is "key".
If we limit ourselves to what we get from the set of LOINC codes, the possible values will be "PT6H", "PT24H", "PT1H" and "PT12H", each with their unit, like "mmol/h" or "mEq/12h". For the latter, there is no unit in CDISC Controlled Terminology: we only have "mEq/day" and "meq/h". Note the different ways of writing "milliequivalent" here! The reason is that in CDISC Controlled Terminology (CDISC-CT), the "Unit" codelist is a … list, which in my opinion can never be complete. UCUM, however, is a "system" or "notation", not a list, which just makes more sense. The UCUM notation for "milliequivalent per 12 hours" is meq/(12.h), but as the "12" is just a number in UCUM, "meq/(23.h)" would also be valid. UCUM also allows unit conversions to be automated, even between "conventional" and "SI" units.
Unfortunately, SDTM does not allow the use of UCUM (yet), although there is some overlap, e.g. for concentrations.
One may also discuss whether, for the "Dataset Specialization", one should split between "Mass (Concentration/Rate)" and "Substance (Concentration/Rate)". CDISC has decided not to do so; LOINC did. The reason is probably the difference in usage: LOINC is "pre-coordinated", and in healthcare, labs will usually use either "Mass (concentration/rate)" or "Substance (concentration/rate)" for a specific measurement, depending on country, region of the world, culture, … CDISC SDTM, however, is "post-coordinated", meaning "assignment/categorization afterwards". BCs are, however, meant to act "pre-coordinated", i.e. for planning studies. When describing what needs to be done in a study, the BC needs to take different CROs, sites and labs in different parts of the world into account, some of which will be accustomed to using "Mass" and others "Substance". That's fine, as conversion or standardization to a single one for submission (e.g. LBSTRESN/LBSTRESU) is possible, and can easily be automated when UCUM notation is used for the units. But for protocol development, the BC must allow the implementers to have a choice.
Essentially, one could also say that each LOINC code represents a BC. For this purpose, however, LOINC is too strict, as it does not allow any choices at all, i.e. it represents the lowest level of granularity. BCs act on a higher level, still allowing choices. For example, the BC for "Potassium Measurement" (C64853) allows the specimen to be "Blood", "Plasma", "Serum" or "Urine". Further details and subdivisions are then provided by the "Dataset Specialization". This means that the BC allows, and I would even say requires, a DSS to be implementable: in a clinical trial protocol, one cannot just state "use the Biomedical Concept 'Potassium Measurement'", as one CRO/site could interpret this as "in blood" and another as "in urine", with non-comparable results as a consequence.
In the past, we did have BCs that were more detailed, like "Urine Glucose Test Strip Measurement", but these have been retired. There now only is "Glucose Measurement" (C105585). Some
Biomedical Concepts however (still) have sub-concepts, like "Blood Pressure" (C54706), which has "Diastolic Blood Pressure" (C25299) and "Systolic Blood Pressure" (C25298).
For "Glucose Measurement",
the "Dataset Specializations" are now "Plasma Equivalent Glucose" (GLUCPE), "Glucose Presence in Urine" (GLUCURINPRES), "Glucose Concentration in Urine" (GLUCURIN), "Glucose in
Urine by Dipstick" (GLUCUA), "Glucose Concentration in Serum or Plasma" (GLUCSERPL), "Glucose Concentration in Serum" (GLUCSER), "Glucose Concentration in Plasma" (GLUCPL), "Glucose
Concentration in Blood" (GLUCBLD).
If we do a query:
SELECT distinct(`system`) FROM loincdb_281.loinc where component='glucose' and property IN ('Scnc','Mcnc');
in LOINC, asking which (SDTM) specimen values would be possible for "glucose concentration", we get a list of 27 possible values:
So, depending on priority, we may also want to develop several more.
If we look at the "Dataset Specializations" for the BC "Potassium Measurement" (C64853), we only find specializations that are about "point in time concentrations" (KURIN, KSERPL, KBLD), not any about "rate" or "collected over time" (i.e. with a non-"Pt" "time aspect" in LOINC), so here too, there is still room for extension.
This raises the question whether it would be possible to use the LOINC database to generate "Dataset Specializations" in an automated or semi-automated way.
For example, if in the LOINC database, I do a select "Component=Potassium", I get 57 rows:
already suggesting a good number of possible SDTM specializations based on the specimen. So I do a query:
SELECT distinct(system) … WHERE component=Potassium, leading to:
getting 29 distinct values for the specimen … So this promises a lot of work when I want to treat them all for CDISC DSSs. I could of course also use the query:
SELECT … WHERE component=Potassium
ORDER BY system, leading to:
and so on …
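For readers who want to try this kind of query without a full LOINC installation, here is a small self-contained Python sketch using an in-memory SQLite table. The table layout is a simplified assumption, and the rows are illustrative stand-ins: only 2828-2 is a code mentioned in this post, the "demo-…" identifiers are placeholders, not real LOINC codes.

```python
import sqlite3

# Illustrative rows: (loinc_num, component, property, time_aspct, system).
ROWS = [
    ("2828-2", "Potassium", "Scnc", "Pt", "Urine"),
    ("demo-1", "Potassium", "SRat", "24H", "Urine"),
    ("demo-2", "Potassium", "Scnc", "Pt", "Ser/Plas"),
    ("demo-3", "Potassium", "Scnc", "Pt", "Bld"),
    ("demo-4", "Glucose", "Scnc", "Pt", "Urine"),
]


def distinct_systems(connection, component):
    """Return the sorted, distinct specimen systems for one component."""
    cursor = connection.execute(
        'SELECT DISTINCT "system" FROM loinc WHERE component = ? ORDER BY "system"',
        (component,),
    )
    return [row[0] for row in cursor]


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE loinc (loinc_num TEXT, component TEXT, property TEXT, "
    'time_aspct TEXT, "system" TEXT)'
)
connection.executemany("INSERT INTO loinc VALUES (?, ?, ?, ?, ?)", ROWS)
print(distinct_systems(connection, "Potassium"))
```

Passing the component as a bound parameter (the `?`) avoids quoting problems when the component name contains special characters.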
And when I then want to concentrate on, for example, "Potassium in Stool", I see that this could lead to 2 "Dataset Specializations": one for "Concentration in Stool" (but LOINC only seems to have a "Substance Concentration" code) and one for "Concentration collected over time in Stool", where the time aspect comes into play, causing LBPDUR to be represented in the DSS.
The decision to make then is whether this is sufficient to really make 2 "Dataset Specializations", or to merge them into (or keep them as) only one. Such a decision may not always be easy to make …
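One way to prepare such a decision is to group the LOINC rows into candidate specializations and then review each group by hand. The Python sketch below is an assumption about how that grouping could look, not an existing tool: the row layout, the grouping key and the "demo-…" identifiers are all hypothetical.

```python
from collections import defaultdict

CONCENTRATION = {"Scnc", "Mcnc"}
RATE = {"SRat", "MRat"}


def propose_candidates(rows):
    """Group LOINC-like rows by (system, kind of result, collected over time)."""
    candidates = defaultdict(list)
    for row in rows:
        if row["property"] in CONCENTRATION:
            kind = "Concentration"
        elif row["property"] in RATE:
            kind = "Rate"
        else:
            continue  # other properties are out of scope for this sketch
        # A non-"Pt" time aspect means LBPDUR would be part of the specialization.
        timed = row["time_aspct"] != "Pt"
        candidates[(row["system"], kind, timed)].append(row["loinc_num"])
    return dict(candidates)


# Placeholder rows for "Potassium in Stool"; the identifiers are not real codes.
STOOL_ROWS = [
    {"loinc_num": "demo-1", "property": "Scnc", "time_aspct": "Pt", "system": "Stool"},
    {"loinc_num": "demo-2", "property": "Scnc", "time_aspct": "24H", "system": "Stool"},
]

for key, codes in propose_candidates(STOOL_ROWS).items():
    print(key, codes)
```

Whether two groups should really become two specializations, or be merged into one, remains a human decision; the sketch only lists the options.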
Of course, such an approach can only work for those domains for which there is a good number of LOINC codes, i.e. laboratory, vital signs, microbiology, virology, …
In the coming weeks, if I find the time, I want to try to develop software that can help speed up the development of Dataset Specializations using LOINC queries.
I will keep you informed …
Comments are, as always, very welcome!