Sunday, September 14, 2025

SDTM Mapping without Excel - A Comparison of Software Offerings

Certara, which acquired Pinnacle21 a few years ago, seems to have started a marketing offensive for Pinnacle21 "Enterprise" (P21E), now claiming "Faster Mapping from Source to Target SDTM" and "without spreadsheets" (spreadsheets being the classic approach for setting up and managing the mapping specifications).
As our own (competing) SDTM-ETL mapping software has already provided this for about 10 years, I was pretty curious, so I attended their 35-minute webinar.

After a few pure marketing slides, it was explained what the "Methods and benefits of cross-study reuse" are. OK, you don't need to convince me about that. This, too, is something our SDTM-ETL software has provided for a long time.


(the image is from a customer presentation in 2015)

It was then shown what the "without spreadsheet" mapping approach in P21E is. To my surprise, the approach seems to boil down to replacing the "external" spreadsheet with an "internal" spreadsheet, which then needs to be adapted for each new study from within the software. In my opinion, this creates a new (vendor) dependency.
It also uses "predefined specs" for each of the SDTM versions. It was unclear to me whether these can be edited or corrected when something is wrong. In our software we call these "templates"; they are delivered as define.xml files, meaning that if something is wrong, it can easily be corrected. In SDTM-ETL, using define.xml for the templates also makes it possible to generate a "near-submission-ready" define.xml, as all the metadata is kept in sync with an underlying define.xml instance during mapping development.
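
As a small illustration of why define.xml-based templates are convenient (this is a generic sketch, not SDTM-ETL code; the file name is hypothetical): any XML tool can read, inspect and, if necessary, correct such a template.

```python
# Minimal sketch: listing SDTM variable metadata from a define.xml template.
# The file name is hypothetical; the namespace is the ODM namespace used by Define-XML 2.x.
import xml.etree.ElementTree as ET

ns = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

tree = ET.parse("SDTMIG_template_define.xml")   # hypothetical template file
root = tree.getroot()

# Index all ItemDefs (variable definitions) by their OID
itemdefs = {i.get("OID"): i for i in root.iterfind(".//odm:ItemDef", ns)}

# For each dataset (ItemGroupDef), list its variables with data type and length
for igd in root.iterfind(".//odm:ItemGroupDef", ns):
    print("Dataset:", igd.get("Name"))
    for ref in igd.findall("odm:ItemRef", ns):
        item = itemdefs.get(ref.get("ItemOID"))
        if item is not None:
            print("  ", item.get("Name"), item.get("DataType"), item.get("Length"))
```
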
From the P21E demo, it also looks as if the maximal length of the SDTM variables needs to be adapted manually for each new study - I did not hear anything about automated adaptation from the data themselves, as we have in our software.
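
The principle of such an automated adaptation is simple (a generic sketch, not the actual SDTM-ETL implementation; the records below are of course just examples): the maximal lengths can be derived from the generated data themselves.

```python
# Minimal sketch of deriving variable lengths from the actual data rather than
# setting them by hand. In practice the records would come from the generated
# SDTM dataset (e.g. Dataset-JSON or SAS XPT); these two are just examples.
def max_lengths(records):
    """Return the maximum observed character length per (character) variable."""
    lengths = {}
    for record in records:
        for var, value in record.items():
            if isinstance(value, str):
                lengths[var] = max(lengths.get(var, 0), len(value))
    return lengths

lb_records = [
    {"USUBJID": "CDISC01-101", "LBTESTCD": "ALT", "LBORRESU": "U/L"},
    {"USUBJID": "CDISC01-102", "LBTESTCD": "SODIUM", "LBORRESU": "mmol/L"},
]
print(max_lengths(lb_records))
# {'USUBJID': 11, 'LBTESTCD': 6, 'LBORRESU': 6}
```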

Most disappointing, however, was that the mappings are still kept in tables, just as in the past, only now in tables within the software instead of in external Excel files. So, no user-friendly graphical user interface, no "drag-and-drop" from source data, no wizards, e.g. for codelist-to-codelist mapping, and no automated generation of the mapping script. None of that, at least from what was presented.

It also looks to me as if it does not have its own execution engine, but still relies completely on either SAS (expensive ...) or R for the execution. This also means that the software still requires very good knowledge of either SAS or R on top of very good knowledge of SDTM. For the latter, SDTM-ETL offers a lot of help through wizards, with e.g. all the "CDISC Notes" available by a simple click or "Ctrl-H".


Of course, the advantage of having all the "mapping specifications" for all studies within the software is that it is easy to make comparisons ("diffs") between studies, something that is more difficult when working with Excel worksheets.

What was also shown was validation of the results, but it looked to me as if that still used the P21 validation software, and not the modern CDISC CORE engine. For example, when just generating LB and SUPPLB, the validation result complains about missing files like DM and EX and … This is really not something I want to see when my mappings are still in development. When using CORE, one can ask to validate just the generated datasets, and even select which rules to apply and which ones to exclude. For example, while still developing the mappings, I would always exclude the "FDA Business Rules", as these only make sense for (near-)submission-ready packages, and I would include them again when I am nearing the end and getting everything ready.
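
Just to illustrate the principle of such rule selection during development (this is not the CORE API, only a conceptual sketch, and the rule identifiers are made up):

```python
# Conceptual sketch of rule selection during mapping development - NOT the CDISC
# CORE API, just an illustration of filtering a rule set before validation.
rules = [
    {"id": "CORE-000252", "source": "SDTMIG Conformance Rules"},   # ids are made up
    {"id": "CORE-000999", "source": "FDA Business Rules"},
]

def select_rules(all_rules, exclude_sources=()):
    """Keep only the rules whose source is not in the excluded categories."""
    return [r for r in all_rules if r["source"] not in exclude_sources]

# While still developing the LB/SUPPLB mappings: skip the FDA Business Rules
dev_rules = select_rules(rules, exclude_sources={"FDA Business Rules"})
print([r["id"] for r in dev_rules])   # ['CORE-000252']
```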

Regarding SUPPLB (as for any other SUPPxx dataset), it looks as if it requires its very own mapping specifications. In our SDTM-ETL software, "non-standard" variables (that need to be "banned" to SUPPxx) are treated as just normal variables within the domain, and are split off at the very end, which also guarantees consistency through --SEQ between the parent domain and the SUPPxx dataset. How this can be guaranteed when the SUPPxx datasets are handled separately right from the start (as I think is required by P21E), I don't know - it doesn't look easy to me ...
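
As an illustration of the "split off at the very end" principle (a generic sketch, not the actual SDTM-ETL code; LBCLSIG is just an example of a non-standard variable), each non-standard value becomes a SUPPLB record that points back to its parent record via IDVAR/IDVARVAL, using the LBSEQ that was already assigned:

```python
# Illustration of splitting off non-standard variables at the very end:
# the parent LB records keep their LBSEQ, and each non-standard value becomes a
# SUPPLB record pointing back via IDVAR/IDVARVAL. The non-standard variable name
# (LBCLSIG) and its label are just examples.
NON_STANDARD = {"LBCLSIG": "Clinical Significance"}

def split_suppqual(records, domain="LB"):
    parent, supp = [], []
    for rec in records:
        parent.append({k: v for k, v in rec.items() if k not in NON_STANDARD})
        for qnam, qlabel in NON_STANDARD.items():
            if rec.get(qnam) not in (None, ""):
                supp.append({
                    "STUDYID": rec["STUDYID"], "RDOMAIN": domain,
                    "USUBJID": rec["USUBJID"],
                    "IDVAR": f"{domain}SEQ", "IDVARVAL": str(rec[f"{domain}SEQ"]),
                    "QNAM": qnam, "QLABEL": qlabel, "QVAL": rec[qnam],
                })
    return parent, supp

lb = [{"STUDYID": "XYZ", "USUBJID": "XYZ-001", "LBSEQ": 1,
       "LBTESTCD": "ALT", "LBCLSIG": "Y"}]
lb_final, supplb = split_suppqual(lb)
```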

What I did like, however, was the availability of a RESTful web service to communicate with one's own SAS or R engine. But this is of course really needed, as P21E does not have its own execution engine.
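
For readers unfamiliar with such a setup, the idea is roughly as follows (the endpoint, payload and response fields below are purely hypothetical, invented to illustrate the architecture, and are not the actual P21E web service):

```python
# Hypothetical example of handing a generated mapping script to a remote R engine
# over REST. The URL, payload layout and response fields are invented here purely
# to illustrate the architecture.
import requests

payload = {
    "language": "R",
    "script": 'lb <- haven::read_sas("lb_raw.sas7bdat")  # ... mapping code ...',
    "output": "LB",
}
resp = requests.post("https://execution-engine.example.com/api/v1/jobs",
                     json=payload, timeout=60)
resp.raise_for_status()
print(resp.json().get("status"))   # e.g. "queued" or "completed"
```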

What I also liked is that the system seems to implement LOINC-CDISC mapping. It is not clear to me how this is done, or whether it uses the LOINC-CDISC mappings originally published by CDISC (2,400 mappings for 1,400 distinct LOINC codes). In SDTM-ETL, a RESTful web service is used for that, querying a database with over 18,000 mappings for almost 10,000 LOINC codes. I presume P21E does not use these extended mappings, but I may be wrong. LOINC-CDISC mappings for only 1,400 LOINC codes is a bit low for real-life usage.
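
The idea of such a service is simple (the URL and the response fields below are hypothetical and only meant as an illustration, not our actual web service): given a LOINC code, it returns the corresponding LB variable values.

```python
# Hypothetical sketch of querying a LOINC-to-CDISC-LB mapping web service:
# the URL and the response layout are illustrative only.
import requests

loinc_code = "2951-2"   # Sodium [Moles/volume] in Serum or Plasma
resp = requests.get(f"https://loinc2sdtm.example.com/api/map/{loinc_code}", timeout=30)
resp.raise_for_status()
mapping = resp.json()
# A typical result would carry the LB variables derived from the LOINC code, e.g.:
# {"LBTESTCD": "SODIUM", "LBTEST": "Sodium", "LBSPEC": "SERUM OR PLASMA",
#  "LBORRESU": "mmol/L"}
print(mapping)
```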

What I also liked is that there is at least a minimal form of traceability checking in P21E, similar to what we have in SDTM-ETL. It does not, however, guarantee that all subjects have been taken into account, something for which we do have features in SDTM-ETL.
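
A very simple example of what I mean by checking that all subjects have been taken into account (a generic sketch, not the actual SDTM-ETL feature; the subject identifiers are just examples):

```python
# Simple illustration of a subject-completeness check: compare the subjects in the
# source data with those that ended up in the generated SDTM dataset.
source_subjects = {"XYZ-001", "XYZ-002", "XYZ-003"}   # e.g. from the ODM/raw data
lb_subjects = {"XYZ-001", "XYZ-003"}                  # USUBJIDs found in the generated LB
missing = sorted(source_subjects - lb_subjects)
if missing:
    print("Subjects without LB records:", missing)    # ['XYZ-002']
```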

So, does what is presented by Certara really speed up SDTM generation (as Certara claims)? I think, yes, maybe a bit. Essentially, as far as I can see, the presented software is just a mapping-specifications management tool. It does not help in generating the mappings themselves, though, as SDTM-ETL does. There are no drag-and-drop features, no wizards, no built-in SDTM knowledge, no own execution engine, and no CORE validation.
So, I think it may be a small step forward, but at the same time it carries a great danger of vendor lock-in.

I also asked ChatGPT for a comparison between P21E and SDTM-ETL, and this is what it provided:

Of course, I do not take responsibility for what ChatGPT emitted ...

Your comments are of course, as always, very welcome!