Sunday, February 20, 2022

Wait a Second! CDISC and Units of Measure.

I recently worked on a customer project where we needed to use the CDISC "UNIT" controlled terminology list (NCI code C71620) and we needed to do some simple derivations, combining measurements that have different units.

One of these units is the time unit "seconds". The CDISC notation for that unit is "sec" (NCI code C42535). We needed to combine this with a distance unit, such as "m" (meter) or "cm" (centimeter), to obtain a velocity.

To my surprise, the "submission value" for such velocities is not consistent. For the time part, either "sec" is used, or "s". For example (taken from the CDISC Library, CT-version 2021-12-26): 

 

When the distance part is centimeter (cm) or micrometer (um), the time part is "s" (seconds), whereas when the distance is meter (m) or millimeter (mm), the time part is "sec". 

Logical, isn't it?

No, not at all! Looking further into other "composed" units that have "second" as the time part, one finds that either "sec" or "s" is used, apparently at random.
In our data, velocity is not measured directly, but derived from distance and time. The time part is always "s", as the source system uses UCUM notation for units. If we just combine the unit for the distance and the unit for the time, as is always valid when using UCUM notation, we get e.g. "m/s" and "cm/s". The first one (m/s), however, is INVALID according to CDISC controlled terminology. So, we cannot easily automate the velocity calculation without writing some "implementation logic" that is essentially "illogical" ...
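To make the problem concrete, here is a minimal sketch in Python. The set of submission values contains only the four velocity terms mentioned above (CT version 2021-12-26), not the full UNIT codelist, and the names `CDISC_VELOCITY_UNITS`, `compose_ucum` and `is_valid_cdisc` are made up for this illustration.

```python
# Only the four velocity submission values discussed in this post;
# an illustrative subset, NOT the full CDISC UNIT codelist.
CDISC_VELOCITY_UNITS = {"cm/s", "um/s", "m/sec", "mm/sec"}


def compose_ucum(distance_unit: str, time_unit: str) -> str:
    """Combine two unit strings the UCUM way: simply 'distance/time'."""
    return f"{distance_unit}/{time_unit}"


def is_valid_cdisc(unit: str) -> bool:
    """Check a unit string against the (partial) list above."""
    return unit in CDISC_VELOCITY_UNITS


# The source system always delivers the time part as "s".
for distance_unit in ("m", "cm", "mm", "um"):
    unit = compose_ucum(distance_unit, "s")
    verdict = "in the CDISC list" if is_valid_cdisc(unit) else "NOT in the CDISC list"
    print(f"{unit}: {verdict}")
```

With this partial list, "cm/s" and "um/s" pass, whereas "m/s" and "mm/s" are rejected, so a mapping table per combined unit would be needed instead of one simple rule.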

For minutes, we find a similar, but less catastrophic situation. The CDISC unit for "minutes" is "min" (C48154). For "counts per minute", however, the unit is "cpm", so one may presume that "min" is replaced by "m", which was essentially meant for "meter". Similarly, the CDISC unit for "disintegrations per minute" is "DPM", fully uppercase. However, for "disintegrations per minute per milligram", the unit is "dpm/mg", with the disintegrations part written in lowercase.

So, also here, if I need to combine (e.g. in a calculation) "disintegrations" and a weight, like milligram, I cannot simply combine the two units as is always possible when using UCUM notation.

Now, I would like to standardize my data in -STRESN and -STRESU. Suppose that I want to standardize to "m/sec" units. So I, or better, my computer, need to know how many "cm/s" go into one "m/sec". Can my computer find that from the published CDISC controlled terminology? I looked into it, and found no way...
There is not even a way for my computer to find out automatically that 1 inch is 2.54 centimeters ...

How can we do better?

CDISC does not allow UCUM notation for units. There is however some overlap, especially for units for concentrations, but it is limited. UCUM does allow for fully automated unit conversion between units without "lists of conversion factors". How that works is explained in an older blog. In combination with either the LOINC code for the test, or the molecular weight, it is even possible to automate conversions between "US conventional" and "SI" units. The National Library of Medicine (NLM) has a free RESTful web service that can be used by anyone for conversions based on UCUM and validation of units. Last year (2021), the service was called over 260,000,000 (yes indeed: over a quarter of a billion) times, with a new all-time record this January of almost 44,000,000 calls.
These numbers just demonstrate the success of UCUM worldwide.
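The principle behind such automated conversion can be sketched in a few lines of Python. This is a strongly simplified illustration and not a real UCUM implementation: it only knows a handful of prefixes and unit atoms, it only handles units of the form "a" or "a/b", and the names `PREFIXES`, `ATOMS`, `parse_atom`, `parse_unit` and `convert` are made up for this example. The two facts it relies on are that metric prefixes are powers of ten and that UCUM defines "[in_i]" as 2.54 cm and "[nmi_i]" as 1852 m.

```python
from fractions import Fraction

# A subset of the metric prefixes, as exact powers of ten.
PREFIXES = {
    "k": Fraction(1000),
    "d": Fraction(1, 10),
    "c": Fraction(1, 100),
    "m": Fraction(1, 1000),
    "u": Fraction(1, 1_000_000),
}

# Unit atoms: symbol -> (factor to the reference unit, kind, accepts prefixes?)
ATOMS = {
    "m": (Fraction(1), "length", True),
    "[in_i]": (Fraction(254, 10_000), "length", False),  # 2.54 cm
    "[nmi_i]": (Fraction(1852), "length", False),        # 1852 m
    "s": (Fraction(1), "time", True),
    "min": (Fraction(60), "time", False),
}


def parse_atom(symbol):
    """Return (factor, kind) for an atom, with or without a metric prefix."""
    if symbol in ATOMS:
        factor, kind, _ = ATOMS[symbol]
        return factor, kind
    prefix, rest = symbol[:1], symbol[1:]
    if prefix in PREFIXES and rest in ATOMS and ATOMS[rest][2]:
        factor, kind, _ = ATOMS[rest]
        return PREFIXES[prefix] * factor, kind
    raise ValueError(f"unknown unit: {symbol!r}")


def parse_unit(unit):
    """Return (factor, kinds) for a unit of the form 'a' or 'a/b'."""
    parts = unit.split("/")
    if len(parts) > 2:
        raise ValueError(f"only 'a' or 'a/b' is supported in this sketch: {unit!r}")
    factor, kind = parse_atom(parts[0])
    kinds = [kind]
    for part in parts[1:]:
        divisor, kind = parse_atom(part)
        factor /= divisor
        kinds.append("per " + kind)
    return factor, kinds


def convert(value, from_unit, to_unit):
    """Convert a value between two commensurable units, exactly."""
    from_factor, from_kinds = parse_unit(from_unit)
    to_factor, to_kinds = parse_unit(to_unit)
    if from_kinds != to_kinds:
        raise ValueError(f"cannot convert {from_unit!r} to {to_unit!r}")
    return Fraction(value) * from_factor / to_factor


print(float(convert(1, "cm/s", "m/s")))   # 0.01
print(float(convert(1, "[in_i]", "cm")))  # 2.54
```

Note that no list of pairwise conversion factors appears anywhere: every factor is derived from the notation itself, which is exactly what a list of opaque strings such as "m/sec" and "cm/s" cannot offer.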

So, why does CDISC still not allow UCUM notation in SDTM? Let's have a look at what the "CDISC Knowledge Base" says about UCUM:

 

The first argument against UCUM, "CDISC has not received a request for the concept represented by the expression", is nonsense. First of all, it is not true. For example, I submitted a request to include the UCUM notation "[in_i]" for inches (which allows inches to be converted automatically to any other unit of length, even to things like "nautical miles"), but it was refused! Also, as UCUM is a system and CDISC UNIT is a list, one could easily submit millions of "requests" (as combinations) for new UNIT terms to the CDISC "list". One can already imagine that "CDISC will not be amused".
The second argument, that there are mathematically synonymous units but CDISC allows only one of them, is nonsense too in my opinion. These "synonymous" units are exactly the power of UCUM. Any computer software that implements UCUM automatically knows that 1 g/L = 1 dg/dL = 1 mg/mL. CDISC doesn't like this, so to me it looks as if the CDISC UNIT list is not meant for use with computers.
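That these three notations are synonymous follows directly from the prefixes, as this small Python sketch shows. The function name `in_grams_per_liter` and the `PREFIX` table are made up for this illustration; only the prefix values themselves (deci = 1/10, milli = 1/1000) are standard.

```python
from fractions import Fraction

# Exact values of the prefixes used in the example ("" means no prefix).
PREFIX = {"": Fraction(1), "d": Fraction(1, 10), "m": Fraction(1, 1000)}


def in_grams_per_liter(mass_prefix: str, volume_prefix: str) -> Fraction:
    """Value of 1 <mass_prefix>g/<volume_prefix>L expressed in g/L."""
    return PREFIX[mass_prefix] / PREFIX[volume_prefix]


for mass_prefix, volume_prefix in (("", ""), ("d", "d"), ("m", "m")):
    value = in_grams_per_liter(mass_prefix, volume_prefix)
    print(f"1 {mass_prefix}g/{volume_prefix}L = {value} g/L")
```

The prefixes in the numerator and the denominator cancel, so each of the three lines reports 1 g/L. The same function gives 1/100 g/L for "mg/dL", again without any stored conversion factor.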

The argument that with the "CDISC lists", reviewers do not need to "mentally translate" is weird, as such "translations" can be fully automated, e.g. by using the NLM RESTful web service. And that is exactly what we are doing with -STRESN/-STRESU, isn't it? With UCUM, the "standardization" to -STRESN/-STRESU can be fully automated, whereas with CDISC units it may be a nightmare to program.
And even CDISC publishes synonyms for its units ...

An example given by CDISC is that it doesn't support "mg/mL". This means that when I receive an original lab value in "mg/mL", I need to convert it to "g/L" and put that in SDTM -ORRESU, which is defined as "Original units in which the data were collected". When I do this conversion (in order not to violate CDISC-CT), the submitted unit is no longer the one "originally received", which breaks traceability and is error-prone anyway.

The argument that some UCUM units are "unfamiliar" to most lay users is strongly biased towards CDISC units: after all, the "lay users" must also learn the CDISC units, mustn't they? Or does the lay user know what "dpm/mL" is? And does the "lay user" understand why the list uses "sec" for seconds in one case and "s" in other cases when it is part of a combined term? I would even say that CDISC units confuse lay users, due to their inconsistency.

And even when the database stores the units using "unfamiliar" UCUM notation, it doesn't mean that they need to be presented to the user in the same way, does it? The above arguments, i.m.o., testify that the authors of the above "knowledge base" do not really understand UCUM, and/or are heavily biased against it. They also testify to "paper thinking", which no longer belongs to modern times. It is also no wonder that an ever-growing share of healthcare IT systems solely use UCUM notation for units, and none uses CDISC units. In some countries, including my own, UCUM notation is even mandated by law for use in electronic health records, just like LOINC is mandated.

So, why is CDISC still blocking the use of UCUM notation e.g. in SDTM submissions?

 

Reactions to this blog are of course always welcome!