Monday, September 29, 2014

The meaning of "Unit"

Last week, I worked on a mapping between the CDISC-CT [UNIT] codelist and UCUM. For every "unit" published by CDISC (there are about 550 of them), I tried to find an appropriate UCUM notation. Then I used the mapping (which was done by extending the Excel worksheet provided by CDISC) to populate a relational database, and built a RESTful web service on top of it.
So, when you make an HTTP request "", the corresponding UCUM notation "{beats}/min" will be returned. Similarly, if you submit "", then "mm[Hg]" will be returned.
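
As a rough illustration of the lookup behind such a service, here is a minimal Python sketch using an in-memory SQLite database. The table name, column names, and function names are hypothetical, and the CDISC-side spellings ("beats/min", "mmHg") are assumptions; only the two UCUM notations come from the text above. It is not the actual implementation.

```python
import sqlite3
from typing import Optional

# Illustrative rows only. The CDISC-side spellings are assumptions;
# the UCUM notations are the two quoted in the text above.
MAPPING_ROWS = [
    ("beats/min", "{beats}/min"),
    ("mmHg", "mm[Hg]"),
]


def build_database() -> sqlite3.Connection:
    """Create a tiny in-memory mapping table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE unit_mapping ("
        "cdisc_unit TEXT PRIMARY KEY, ucum TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO unit_mapping VALUES (?, ?)", MAPPING_ROWS)
    conn.commit()
    return conn


def ucum_for(conn: sqlite3.Connection, cdisc_unit: str) -> Optional[str]:
    """Return the UCUM notation for a CDISC unit, or None if unmapped."""
    row = conn.execute(
        "SELECT ucum FROM unit_mapping WHERE cdisc_unit = ?", (cdisc_unit,)
    ).fetchone()
    return row[0] if row else None
```

A REST handler would then only need to call something like `ucum_for(conn, unit)` and return the result, or an error status when the value is not in the [UNIT] codelist.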

I then implemented this web service in the Smart Dataset-XML viewer: when the user right-clicks a cell with a unit (e.g. --ORRESU or --STRESU value), the web service is triggered and the UCUM notation is shown (when the cell value is a valid unit from the [UNIT] list). A few screenshots are shown below:

In some cases, the CDISC notation follows the UCUM notation, but this is certainly not always the case. Especially for non-SI units, the deviations are considerable.

What difficulties did I encounter during the mapping exercise?
Quite a few ...
Some CDISC "units" are not units at all. For example "Virtual Pixel" (NCI C71620).
Other "units" mix up the unit itself with the object it is about. For example "g/mol Creatinine". UCUM has recognized that this bad habit exists and has solved this by so-called "annotations" (see the UCUM specification). So the UCUM notation for this is "g/mol{creatinine}".
In my opinion, CDISC should control annotations for use in clinical research, not the units themselves.
A difficulty that arose, and cost me quite some time, is the "unit" "U/kg". The CDISC definition is: "An arbitrary unit of substance content expressed in units of biological activity per unit of mass equal to one kilogram. Unit per kilogram is also used as a dose calculation unit expressed in arbitrary units per one kilogram of body mass". This sounds like a dual definition, i.e. "U" is used for two different things. When it is a unit of biological or catalytic activity, the UCUM unit "U" can be used, which is equal to 1 umol/min:

So when a biological activity is meant, the corresponding UCUM notation for "U/kg" would then simply be "U/kg", which is equal to 1 umol/min/kg.

When "arbitrary units per one kilogram of body mass" is meant (second part of the CDISC definition), then it is something arbitrary that depends on what is measured. In such a case, an annotation must be used. So, in the second case, the UCUM notation must be "{Unit}/kg".
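
The difference between the two readings becomes visible when the annotations are removed. In UCUM, the text in curly braces does not contribute to the meaning of the unit, and an annotation standing on its own counts as the unit 1. The small Python sketch below (function and constant names are my own, and it ignores the rest of the UCUM grammar) strips annotations from the notations mentioned above:

```python
import re

# A UCUM annotation is any text enclosed in curly braces.
ANNOTATION = re.compile(r"\{[^{}]*\}")


def strip_annotations(ucum: str) -> str:
    """Remove all {annotations}, leaving only the part that carries meaning."""
    return ANNOTATION.sub("", ucum)


def has_annotation(ucum: str) -> bool:
    """True when the notation contains at least one {annotation}."""
    return ANNOTATION.search(ucum) is not None


print(strip_annotations("g/mol{creatinine}"))  # g/mol
print(strip_annotations("U/kg"))               # U/kg  (catalytic activity per kg)
print(strip_annotations("{Unit}/kg"))          # /kg   (an arbitrary count per kg)
```

So the first reading of "U/kg" keeps a real unit of catalytic activity, whereas the second reading reduces to "per kilogram" once the annotation is dropped.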

Is it OK that a "CDISC unit" means two completely different things depending on the use case? I don't think so. Is an "arbitrary unit" a unit at all? Isn't the wording "arbitrary units" a "contradictio in terminis"?

Do you also think CDISC should stop developing controlled terminology for "units" and use UCUM?

Your reactions are, as always, highly appreciated.

Saturday, September 20, 2014

SDTM: let the service do the work - not the dataset

This week, I found some time to continue working on SDTM. Or better: on services for SDTM. In my previous blog entry, I already showed how web services can help working with controlled terminology such as the LOINC codelist for laboratory tests.
I have now extended this to CDISC controlled terminology (CDISC-CT) in general, based on the work of my student Wolfgang Hof. First, I downloaded the latest CDISC-CT (June 26) from the NCI website as a set of XML files. Starting from these, I generated and populated a database with about 6 tables. I then wrote some RESTful services so that remote applications can retrieve information for answering questions like:
  • what is the test name for test code XYZ?
  • what is the NCI code for test code or test name XYZ (or the other way around)?
  • what is the CDISC definition of controlled term ABC?
  • are there any synonyms for controlled term ABC?
I hope to make these services available for the general public in the next few weeks.
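
To give an idea of what such a service has to answer, here is a minimal in-memory Python sketch of the four lookups listed above. All class and field names are hypothetical, the real services sit on a relational database, and the NCI code, definition, and synonym in the sample record are placeholders, not actual CDISC-CT content.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Term:
    submission_value: str            # e.g. a test code
    name: str                        # e.g. the test name
    nci_code: str
    definition: str
    synonyms: Tuple[str, ...] = ()


class Terminology:
    """Index terms by test code, test name, and NCI code (case-insensitive)."""

    def __init__(self, terms: Iterable[Term]) -> None:
        self._by_key: Dict[str, Term] = {}
        for term in terms:
            for key in (term.submission_value, term.name, term.nci_code):
                self._by_key[key.upper()] = term

    def find(self, key: str) -> Optional[Term]:
        """Look up a term by code, name, or NCI code; None when unknown."""
        return self._by_key.get(key.strip().upper())


# Placeholder record: only the test code and test name are real.
ct = Terminology([
    Term("SPGRAV", "Specific Gravity", "C-PLACEHOLDER",
         "Placeholder definition text.", ("Placeholder synonym",)),
])

term = ct.find("SPGRAV")
print(term.name)  # Specific Gravity
```

Each of the questions in the list then becomes a single attribute of the returned record (`name`, `nci_code`, `definition`, `synonyms`).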

Then I implemented a good number of these services in the "Smart Dataset-XML Viewer". Here is a screenshot as an example:

What you see is that when the user hovers the mouse over a test code (in this case an LBTESTCD: SPGRAV), the web service is triggered, and the test name and NCI code are retrieved from the remote server/database and displayed as a tooltip on the cell contents.
When the user right-clicks the LBTESTCD cell, the web service is triggered and looks up the "CDISC definition" for the given test code and displays it in a separate window (left upper corner).
When the user right-clicks the LOINC code for this test (in this case 2965-2), a request is sent to the RESTful web service of the "National Library of Medicine", returning the address of a website with explanations about the test, which is then displayed in a browser window that pops up.

On the right, you also see some yellow-colored cells. These indicate that there is something special with the data. In the current case, the cell is colored because its value is lower than the low normal range limit. This is not done by a web service, but by the viewer software itself. Thus, when using this feature, the SDTM variable LBNRIND is superfluous and can be removed from the SDTM specification ("let the service do the work - not the dataset"). Other such features that are already present in the "Smart Dataset-XML Viewer" are:

  • show date of first and last exposure in the DM dataset (retrieved from EX)
  • show --DY value on any --DTC value (calculated from difference with RFSTDTC in DM)
  • show visit name on VISITNUM (retrieved from TV)
Essentially this means that many of the SDTM variables (all the ones that are "derived") are superfluous. We estimate that about 1 in 3 SDTM variables could be removed from the SDTM-IG, as they can be calculated "on the fly" from the data that are already present in the datasets, or retrieved by a web service. For example, all --TEST variables are superfluous, as their value can be obtained from a web service.
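
Two of these "on the fly" derivations can be sketched in a few lines of Python. This is an illustration under assumptions, not the viewer's code: the function names are mine, only complete dates are handled, and the study-day rule used is the usual SDTM convention (there is no day 0; the reference date itself is day 1).

```python
from datetime import date
from typing import Optional


def study_day(dtc: str, rfstdtc: str) -> Optional[int]:
    """Derive --DY from a --DTC value and RFSTDTC (ISO 8601 strings).

    Only the date part is used. Returns None when either value
    does not contain a complete date (e.g. "2014-09").
    """
    if len(dtc) < 10 or len(rfstdtc) < 10:
        return None
    try:
        day = date.fromisoformat(dtc[:10])
        reference = date.fromisoformat(rfstdtc[:10])
    except ValueError:
        return None
    delta = (day - reference).days
    # No day 0: on or after the reference date add 1, before it do not.
    return delta + 1 if delta >= 0 else delta


def normal_range_indicator(value: float,
                           low: Optional[float],
                           high: Optional[float]) -> str:
    """Derive an LBNRIND-like flag from a result and its normal range limits."""
    if low is not None and value < low:
        return "LOW"
    if high is not None and value > high:
        return "HIGH"
    return "NORMAL"


print(study_day("2014-09-29T08:00", "2014-09-20"))  # 10
print(study_day("2014-09-19", "2014-09-20"))        # -1
print(normal_range_indicator(1.001, 1.002, 1.030))  # LOW
```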

Now, this is just the tip of the iceberg. So many other things are possible which can considerably contribute to data quality in SDTM submissions. A few examples:

  • the web service informs about what the usual units for the test or observation are. For example: mm[Hg] for SYSBP and DIABP, cm and [in_i] (inches) for HEIGHT, no units for SPGRAV. This can be used to test whether the combination of ORRES and ORRESU is reasonable and acceptable
  • if it were allowed to use UCUM notation for ORRESU/STRESU (unfortunately it is not yet, although all EHR systems and Hospital Information Systems work with UCUM - it is even mandated by Meaningful Use), then the value of --STRESN could be calculated automatically. The combination of the value of --TESTCD with --ORRES and --ORRESU could be sent to the web service with the request "please calculate the standardized numerical value", as the web service already knows to which unit the value must be standardized for the specific test. This would even work when some of the values for blood pressure are e.g. in [psi] (pounds per square inch)
In my opinion, these are the kind of features and services people will expect from SHARE in the future. SHARE should be more than a repository of standard specifications; it should behave as a semi-intelligent system that helps sponsors and reviewers improve the data quality of electronic submissions.
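
Both ideas from the list above, the plausibility check on units and the automatic calculation of --STRESN, can be sketched in Python as follows. The tables and function names are hypothetical and cover only a handful of pressure units; a real service would use a full UCUM library. The conversion factors are the standard definitions of these units in pascal.

```python
from typing import Dict, Optional, Set, Tuple

# Usual units per test code (illustrative subset taken from the examples above).
USUAL_UNITS: Dict[str, Set[str]] = {
    "SYSBP": {"mm[Hg]"},
    "DIABP": {"mm[Hg]"},
    "SPGRAV": set(),          # dimensionless: no unit expected
}

# Unit to which each test is standardized (hypothetical choice).
STANDARD_UNIT: Dict[str, str] = {"SYSBP": "mm[Hg]", "DIABP": "mm[Hg]"}

# Conversion factors to pascal for a few UCUM pressure units.
TO_PASCAL: Dict[str, float] = {
    "Pa": 1.0,
    "kPa": 1000.0,
    "mm[Hg]": 133.322,
    "[psi]": 6894.757,
}


def is_usual_unit(testcd: str, unit: Optional[str]) -> bool:
    """Check whether --ORRESU is one of the usual units for the test."""
    usual = USUAL_UNITS[testcd]
    if not usual:
        return unit in (None, "")
    return unit in usual


def standardize(testcd: str, orres: str, orresu: str) -> Tuple[float, str]:
    """Return (--STRESN, --STRESU) for a result given in a convertible unit."""
    target = STANDARD_UNIT[testcd]
    if orresu not in TO_PASCAL:
        raise ValueError(f"cannot convert {orresu!r} to {target!r}")
    stresn = float(orres) * TO_PASCAL[orresu] / TO_PASCAL[target]
    return stresn, target


value, unit = standardize("SYSBP", "2.32", "[psi]")
print(round(value, 1), unit)  # 120.0 mm[Hg]
```

A result reported in [psi] would fail the "usual unit" check, but it can still be standardized, which is exactly the kind of case a reviewer would want flagged and converted.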

For those who like these features of the "Smart Dataset-XML Viewer" and these web services, I am still working on improving the features and extending them, and I hope to make a new (branched off) version of the Viewer available on the Sourceforge website within the next 1-3 weeks. So please remain a bit patient ...

Comments are of course always welcome!

Wednesday, September 3, 2014

LOINC Web Services

In my previous post, I showed how a simple web service (using REST) can be used to retrieve information about a LOINC code from a remote public server. In the "Smart Dataset-XML Viewer", when the user hovers the mouse over a LOINC code, additional information about that LOINC code is displayed as a tooltip. I have now extended this principle to connect to a web service from the National Library of Medicine, named MedlinePlus Connect. What this web service returns, however, is a snippet of XML containing a reference to a website with a lot of explanation about the given test. I implemented this such that when the user right-clicks a LOINC code in the "Smart Dataset-XML Viewer", the MedlinePlus web service is called, the website reference is retrieved, and the user's default browser is opened with the given URL. So when I right-click "3094-0" in the viewer, a new browser window pops up, giving me the MedlinePlus information about the corresponding "Blood Urea Nitrogen" (BUN) test:

Cool, isn't it?

Is this rocket science? No, not at all. It only cost me 3 hours (including writing this blog) to implement this in the "Smart Dataset-XML Viewer".
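
For those who want to try something similar, the steps described above (build the request, read the XML, open the browser) can be sketched in Python as follows. This is not the viewer's code. The endpoint and parameter names are assumptions based on the public MedlinePlus Connect documentation and should be checked before use; the function names are my own, and I assume the response is an Atom feed whose entries carry the link.

```python
import urllib.request
import webbrowser
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlencode

# Assumed endpoint of MedlinePlus Connect; verify against the documentation.
BASE_URL = "https://connect.medlineplus.gov/service"
LOINC_OID = "2.16.840.1.113883.6.1"   # OID identifying the LOINC code system
ATOM = "{http://www.w3.org/2005/Atom}"


def build_request_url(loinc_code: str) -> str:
    """Build the request URL for one LOINC code (assumed parameter names)."""
    params = {
        "mainSearchCriteria.v.cs": LOINC_OID,
        "mainSearchCriteria.v.c": loinc_code,
    }
    return BASE_URL + "?" + urlencode(params)


def extract_links(atom_xml: str) -> List[str]:
    """Collect the href of every link inside the entries of an Atom feed."""
    root = ET.fromstring(atom_xml)
    return [
        link.get("href")
        for entry in root.iter(ATOM + "entry")
        for link in entry.iter(ATOM + "link")
        if link.get("href")
    ]


def show_info(loinc_code: str) -> None:
    """Call the service and open the first returned page in the browser."""
    with urllib.request.urlopen(build_request_url(loinc_code), timeout=10) as resp:
        links = extract_links(resp.read().decode("utf-8"))
    if links:
        webbrowser.open(links[0])
```

Calling `show_info("3094-0")` would then correspond to the right-click on the BUN test described above, provided the service responds as assumed.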

MedlinePlus also has similar web services for SNOMED-CT, ICD-9 and ICD-10, and medications (RXCUI). So what I want to do in the next few days is to see whether we can implement more of these web services in the "Smart Dataset-XML Viewer".