Sunday, November 15, 2020

Automated Generation of CDISC Biomedical Concepts starting from LOINC codes. Part 1: first beginnings


CDISC is increasingly working on "biomedical concepts", partly in order to move away from the "table thinking" caused by SDTM. The original idea of SDTM was probably to mimic a relational database, but that stopped working after a while: the relations have become more and more implicit (you have to learn about them from an Implementation Guide, usually in PDF or HTML form). This was fine 20 years ago, but, especially due to the strong growth in both SDTM domains and variables, it is not good enough anymore in 2020.

We find such "biomedical concepts" (which I will abbreviate as "BCs" from here on) in a lot of the "CDISC Therapeutic Area User Guides" (TAUGs), usually as "mind map" images (mostly generated by the CMap software) embedded in the PDF, so not in machine-readable form. The way these "mind maps", representing the BCs, were set up differed quite a lot in the past, but it looks as if consensus has been reached in recent years on the color coding and shape conventions.
However, as long as there are no real guidelines on how to develop and generate such BCs, such a diversity will remain. There is also currently no good way to compare BCs, as these are currently not available in a machine-readable form.
Especially the latter is and will be a serious "show stopper" for the use of BCs in real life.

Very recently, during the US CDISC Interchange, CDISC coworker Sally Cassells announced that CDISC wants to start using Define-XML files to store BCs. This is indeed a very good idea: first of all, BCs are about metadata (and so is Define-XML), and Define-XML's "ValueLists" are indeed used to capture relations between items, such as "when VSTESTCD=SYSBP or DIABP, then VSSTRESU=mmHg".
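
To make the "when ... then ..." relation concrete, here is a minimal Python sketch of such a rule as a plain data structure. It is not Define-XML itself and not a CDISC API: the class name `ValueLevelRule` and its fields are hypothetical, and the sketch assumes that the unit is carried in the variable VSSTRESU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueLevelRule:
    """One 'when ... then ...' relation, loosely modeled on a ValueList item.

    Hypothetical illustration only, not a Define-XML data model.
    """
    when_variable: str      # variable tested in the condition, e.g. VSTESTCD
    when_values: frozenset  # values for which the rule applies
    then_variable: str      # variable constrained by the rule (assumed: VSSTRESU)
    then_value: str         # the single allowed value

    def applies_to(self, record: dict) -> bool:
        return record.get(self.when_variable) in self.when_values

    def is_satisfied_by(self, record: dict) -> bool:
        # Records outside the condition are not constrained by this rule.
        if not self.applies_to(record):
            return True
        return record.get(self.then_variable) == self.then_value


bp_unit_rule = ValueLevelRule(
    when_variable="VSTESTCD",
    when_values=frozenset({"SYSBP", "DIABP"}),
    then_variable="VSSTRESU",
    then_value="mmHg",
)

print(bp_unit_rule.is_satisfied_by({"VSTESTCD": "SYSBP", "VSSTRESU": "mmHg"}))
print(bp_unit_rule.is_satisfied_by({"VSTESTCD": "DIABP", "VSSTRESU": "kPa"}))
```

A record for a test code outside the condition is simply not constrained by the rule, which mirrors how a where clause scopes a value-level definition.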

One problem with this approach, however, would be that the BCs defined and stored in this way are extremely "SDTM-centric" (or SEND- or ADaM-centric), neglecting "the rest of the world". I will explain this in more detail in an example later on.

Biomedical concepts are not new, neither are they limited to clinical research. Such concepts have been used for many many years in health care, only they use other names there.
For example, openEHR archetypes are very close to BCs. And in FHIR, we have "FHIR profiles" such as the one for blood pressure. Such "profiles" usually use LOINC coding for describing the "observations" (named "Findings" at CDISC).
LOINC is also increasingly developing links between codes, e.g. using "panels". Classically, in medical laboratory language, a "panel" is a set of lab tests that are usually ordered and reported as a whole set. A typical example of such a panel is the "Hepatic Function Panel 2000" (LOINC code 24325-3) containing 7 tests (each of course with its own test code):

It is usually ordered by medical doctors to obtain information on how well the liver is functioning. LOINC is also more and more publishing "LOINC answers to LOINC codes" and linking them to other coding systems.

For example, for "Rhesus Type in Blood" (LOINC code 10331-7):


including the SNOMED-CT code for the answer, which is usually used in electronic health records.

This looks pretty much like a Define-XML ValueList "LBORRES where LBTESTCD='RH'" (where "RH" is "Rhesus Factor"), doesn't it?
CDISC calls this "Codetables" and has published a number of them (not for LB however), but unfortunately without linking to other coding systems such as SNOMED-CT which is heavily used all over the world in healthcare.

So, and this is a question I have been asking myself for quite some time: now that we have the CDISC LOINC-LB mapping, which is also available as a RESTful web service and has even been extended for VS and MB (currently only for COVID-19 tests), wouldn't it be possible to generate CDISC BCs automatically (or semi-automatically) directly from LOINC panels? This would allow us to auto-generate BCs and their Define-XML ValueList prototypes for hundreds, maybe thousands, of panels that are routinely used in hospitals anyway. We would then not only store these BCs as Define-XML files, but also make them available as a kind of "knowledge graph", e.g. in the modern RDF format, so really machine-readable, and transformable to mind maps such as those currently used in TAUGs.
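
As a rough sketch of what "panel to knowledge graph" could mean at its very simplest, the following Python snippet turns a panel code and its member codes into N-Triples lines. Everything about the vocabulary is an assumption: the namespace `https://example.org/bc/` and the predicates `type` and `hasMember` are hypothetical, not CDISC or LOINC terms, and the member codes are placeholders rather than real LOINC codes.

```python
# Hypothetical namespace, not an official CDISC or LOINC one.
EX = "https://example.org/bc/"


def panel_to_ntriples(panel_code, member_codes):
    """Return N-Triples lines linking a panel to each of its member tests."""
    panel = f"<{EX}loinc/{panel_code}>"
    lines = [f"{panel} <{EX}type> <{EX}Panel> ."]
    for code in member_codes:
        lines.append(f"{panel} <{EX}hasMember> <{EX}loinc/{code}> .")
    return lines


# "24325-3" is the panel code mentioned above; the members are placeholders.
for line in panel_to_ntriples("24325-3", ["MEMBER-1", "MEMBER-2"]):
    print(line)
```

A real BC would of course need many more relations (specimen, units, result type, and so on); the point here is only that panel membership is already graph-shaped.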

This week, I finally found the time to start on such an attempt.
A well documented example of a BC is that for "vital signs blood pressure", but I did not want to take that, as there are already some people within the CDISC community who think that BCs are only applicable to blood pressure. So, I looked for something a bit more difficult, but not something too difficult either. 

After inspecting the LOINC database, I decided to start with the LOINC panel "Microalbumin Urin 24 hours panel" (LOINC code 58431-8): 


This panel is used to obtain information about microalbumin excretion in urine collected over a (planned) time span of 24 hours. The tests are:

- total concentration of microalbumin in the 24 hour urine
- excretion rate of microalbumin (2 codes)
- the actual collection duration
- the total volume of urine collected over the collection duration

In order to retrieve all metadata, I refreshed my SQL knowledge, and wrote the following SQL statement:

SELECT loinc_num, component, property, time_aspct, system, scale_typ, method_typ, class, unitsrequired, shortname, cdisc_common_tests, example_units, long_common_name, example_ucum_units
FROM loincdb_268.loinc
WHERE loinc_num IN (
    -- all members of the panel, excluding the row for the panel itself
    SELECT loinc FROM loincdb_268.panelsandforms
    WHERE ParentLoinc = '58431-8' AND loinc != ParentLoinc
);

leading to:
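
The shape of such a result can be tried out without a full LOINC installation. The following is a minimal sketch using Python's built-in `sqlite3` module with a toy in-memory database attached under the schema name `loincdb_268`. The tables are cut down to three columns, and all rows except the panel code itself are placeholders, not real LOINC content. The sketch also assumes that the panels table contains a row in which the panel is its own parent, which is what the `loinc != ParentLoinc` condition filters out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database under the schema name used in the query.
conn.execute("ATTACH DATABASE ':memory:' AS loincdb_268")
conn.execute(
    "CREATE TABLE loincdb_268.loinc "
    "(loinc_num TEXT PRIMARY KEY, component TEXT, property TEXT)"
)
conn.execute(
    "CREATE TABLE loincdb_268.panelsandforms (ParentLoinc TEXT, loinc TEXT)"
)

# Placeholder rows only: the codes below (except the panel code) are made up.
conn.executemany(
    "INSERT INTO loincdb_268.loinc VALUES (?, ?, ?)",
    [
        ("58431-8", "placeholder panel", "-"),
        ("PLACEHOLDER-1", "placeholder component 1", "PROP-A"),
        ("PLACEHOLDER-2", "placeholder component 2", "PROP-B"),
        ("PLACEHOLDER-3", "member of another panel", "PROP-C"),
    ],
)
conn.executemany(
    "INSERT INTO loincdb_268.panelsandforms VALUES (?, ?)",
    [
        ("58431-8", "58431-8"),        # assumed self-row for the panel
        ("58431-8", "PLACEHOLDER-1"),
        ("58431-8", "PLACEHOLDER-2"),
        ("OTHER-PANEL", "PLACEHOLDER-3"),
    ],
)

query = """
SELECT loinc_num, component, property
FROM loincdb_268.loinc
WHERE loinc_num IN (
    SELECT loinc FROM loincdb_268.panelsandforms
    WHERE ParentLoinc = ? AND loinc != ParentLoinc
)
ORDER BY loinc_num
"""
rows = conn.execute(query, ("58431-8",)).fetchall()
for row in rows:
    print(row)
```

Passing the panel code as a bound parameter rather than a literal makes it easy to run the same query for any other panel.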


I then started thinking: "how to start?"

So, I started looking at a number of CDISC BC "concept maps", trying to find out exactly how they work. I then decided to take the BC "Glycosylated Hemoglobin A1c Assay", which has been presented as part of the "CDISC 360" project and is a little less complicated than the BC for "Hemoglobin A1C to Hemoglobin Ratio Measurement".

I soon found out that even within these, there seem to be different approaches to developing such a BC, especially regarding the order in which the concepts are treated and the naming of the relations. The latter is, however, of the utmost importance for being able to compare BCs, merge them, and make subsets and supersets.
I hope and presume that such exact instructions ("how to") will be an outcome of the CDISC project.

What I also soon found out is that these BCs are extremely "CDISC-centric", i.e. they exclude everything that is outside the CDISC world. That is a pity, as it would mean that they are unusable, for example, for the use case in which "real world data" (RWD) and electronic health records (EHRs) are used as the source. So, in my own attempts, I tried to avoid this pitfall, and also tried to use other coding systems where useful.

Just as an example from the CDISC 360 project:


This map suggests that "blood" is only a CDISC-CT "specimen type" and does not exist outside the "CDISC world". This is of course not true. My own approach (also found in BCs from some TAUGs) is that there is a specimen "blood", that is categorized as "LBSPEC" with the NCI code C12434 and the name "BLOOD" in CDISC-CT.
In SNOMED-CT however, the same concept has code 119297000 (blood specimen).
So I decided to use the "classic" approach of the TAUGs to just take "specimen" as a "general" concept, and color-code it as explained in the TAUGs:

so that I can then refer both to the CDISC code for "blood" as a specimen and to the SNOMED-CT code.
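
The idea of one general concept that points to codes in several coding systems can be sketched as follows. This is a hypothetical illustration, not a CDISC or FHIR data model: the class names `Coding` and `Concept` and the informal system labels are assumptions, while the two codes are the ones mentioned above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Coding:
    system: str   # informal label of the coding system (assumed, not a URI)
    code: str
    display: str


@dataclass
class Concept:
    """A general concept that can carry codes from several coding systems."""
    name: str
    codings: list = field(default_factory=list)

    def code_in(self, system):
        """Return the code of this concept in the given system, or None."""
        for coding in self.codings:
            if coding.system == system:
                return coding.code
        return None


blood = Concept(
    name="blood specimen",
    codings=[
        Coding("CDISC-CT", "C12434", "BLOOD"),              # LBSPEC codelist
        Coding("SNOMED-CT", "119297000", "blood specimen"),
    ],
)

print(blood.code_in("CDISC-CT"))
print(blood.code_in("SNOMED-CT"))
print(blood.code_in("LOINC"))
```

With this shape, a CDISC-only consumer and an EHR-based consumer can read the same concept and each pick the code they understand.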

Below is a very first map that I generated manually. It is far from perfect, far from complete, and only one of the many possible ways of doing it. Please keep that in mind before you start throwing rotten tomatoes.
I do however think it contains the most important information, including some (but not all) that can be very useful when using RWD or EHRs.
Please also note that I take the liberty of updating this picture as my project proceeds.