CDISC end-to-end: 2016

Thursday, October 13, 2016

Units and ODM 2.0

A few of us are already thinking about what the requirements for a CDISC ODM 2.0 standard should be. Integration with healthcare is one of the main topics. Support for RESTful web services and an additional JSON implementation are surely on the list.

One of the main problems with the current version, ODM 1.3, is the way units of measure are handled. Ten years ago, when ODM 1.3 was developed, we were not yet aware of UCUM, nor of LOINC and other coding systems in healthcare. At that time, we were only just starting to experiment with extracting information from electronic health records (EHRs). ODM was very case report form (CRF) centric, without much consideration of how one can (automatically) populate a CRF from an electronic health record or a hospital information system (HIS).

The way units of measure are implemented in ODM is very simple: one just defines a list of units of measure, and then references them later. For example:

Defining "inches" and "centimeters". What these exactly mean (e.g. that "centimeter" means 1/100 of a meter and that the latter is an SI unit) is not included, nor is any conversion information (e.g. that 1 inch is 2.54 cm). It does not even state what the property is (in this case "length").

In the definition of the data point (an ItemDef), these are then referenced, e.g. by:

Stating that the height can either be expressed in a unit "inches" or a unit "centimeter" whatever these may mean - a machine will not really understand. Also a machine will not understand that this is about body height. For that we need to add some semantic information, like a LOINC code. Currently, this can be done using the "Alias" child element:

(P.S. some elements have been collapsed for clarity)
Also note the "SDTM annotation", stating that this data point will later go into VSORRES in the case that VSTESTCD has the value "HEIGHT".
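As the original snippets are not reproduced here, the following is only a minimal sketch of the mechanism described above: units defined once, referenced from an ItemDef, with a LOINC code attached through "Alias". The ODM fragment is hypothetical (all OIDs are invented for illustration), and Python is used merely to show how a program could resolve the references.

```python
import xml.etree.ElementTree as ET

ODM_NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

# Hypothetical ODM 1.3 fragment: all OIDs are invented for this sketch.
ODM_FRAGMENT = """
<Study xmlns="http://www.cdisc.org/ns/odm/v1.3" OID="ST.EXAMPLE">
  <BasicDefinitions>
    <MeasurementUnit OID="MU.IN" Name="inches">
      <Symbol><TranslatedText xml:lang="en">in</TranslatedText></Symbol>
    </MeasurementUnit>
    <MeasurementUnit OID="MU.CM" Name="centimeters">
      <Symbol><TranslatedText xml:lang="en">cm</TranslatedText></Symbol>
    </MeasurementUnit>
  </BasicDefinitions>
  <MetaDataVersion OID="MV.1" Name="Example">
    <ItemDef OID="IT.HEIGHT" Name="Height" DataType="float">
      <MeasurementUnitRef MeasurementUnitOID="MU.IN"/>
      <MeasurementUnitRef MeasurementUnitOID="MU.CM"/>
      <Alias Context="LOINC" Name="8302-2"/>
    </ItemDef>
  </MetaDataVersion>
</Study>
"""


def describe_item(odm_xml, item_oid):
    """Return the unit names and the aliases attached to one ItemDef."""
    root = ET.fromstring(odm_xml)
    # Map each unit OID to its name, as defined under BasicDefinitions.
    units = {
        mu.get("OID"): mu.get("Name")
        for mu in root.iterfind(".//odm:MeasurementUnit", ODM_NS)
    }
    item = root.find(f".//odm:ItemDef[@OID='{item_oid}']", ODM_NS)
    # Resolve every MeasurementUnitRef to the unit it points to.
    unit_names = [
        units[ref.get("MeasurementUnitOID")]
        for ref in item.iterfind("odm:MeasurementUnitRef", ODM_NS)
    ]
    # "Alias" carries the semantics; the Context values are not standardized.
    aliases = {
        alias.get("Context"): alias.get("Name")
        for alias in item.iterfind("odm:Alias", ODM_NS)
    }
    return unit_names, aliases


unit_names, aliases = describe_item(ODM_FRAGMENT, "IT.HEIGHT")
print(unit_names, aliases)
```

Note that the program learns the LOINC code only because it happens to know that the "Context" value "LOINC" is meaningful; nothing in ODM itself guarantees that.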

So, how is this implemented in the CRF? The ODM doesn't tell us. Are there two checkboxes on the CRF, one with "in" and one with "cm", and the investigator needs to check one of them? Or are there two versions of the CRF, one for Anglo-Saxon countries with "inches" preprinted and one for countries using metric units with "cm" preprinted?

If there is only one unit of measure assigned, the case is clear. For example, for a blood pressure:

with a single reference to a unit of measure "millimeter mercury column":

A computer system however does not know what this really means (semantically), e.g. that it represents a pressure. We can however add that information by adding the UCUM notation, using an "Alias" again:

And if we look in the UCUM "ucum-essence.xml" file, we get all information for free:

stating that "meter mercury column" is a unit for the property "pressure" and that it is equal to 133.322 kilopascal. For "millimeter mercury column", the system knows that in UCUM there is a prefix "m" with meaning "milli" and value "0.001" (also defined in the ucum-essence.xml).

This also allows unit conversions to be done in an automated way, even using publicly available RESTful web services for UCUM unit conversions. In that way, a system can easily find out that a blood pressure of 2.5 [psi] (pounds per square inch) corresponds to 129.29 mm[Hg].
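The arithmetic behind such a conversion can be sketched in a few lines. This is not a UCUM implementation: it is a toy lookup table, assuming only the "m[Hg]" definition quoted above (133.322 kPa), the "milli" prefix, and an approximate pascal value for [psi] that I added for the sketch.

```python
# Pressure units expressed in pascal. The m[Hg] value is the UCUM definition
# quoted above; the [psi] value is an approximation added for this sketch.
PASCAL_PER_UNIT = {
    "Pa": 1.0,
    "m[Hg]": 133322.0,
    "[psi]": 6894.757,
}

# A small subset of the UCUM metric prefixes.
PREFIXES = {"k": 1e3, "m": 1e-3}


def to_pascal_factor(unit):
    """Return how many pascal one `unit` is, resolving a single prefix."""
    if unit in PASCAL_PER_UNIT:
        return PASCAL_PER_UNIT[unit]
    prefix, base = unit[:1], unit[1:]
    if prefix in PREFIXES and base in PASCAL_PER_UNIT:
        return PREFIXES[prefix] * PASCAL_PER_UNIT[base]
    raise ValueError(f"Unknown unit: {unit}")


def convert(value, from_unit, to_unit):
    """Convert a pressure value by going through pascal."""
    return value * to_pascal_factor(from_unit) / to_pascal_factor(to_unit)


print(round(convert(2.5, "[psi]", "mm[Hg]"), 2))  # 129.29
```

A real system would of course read the unit and prefix definitions from ucum-essence.xml (or call a conversion web service) instead of hardcoding them.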

So, as an intermediate conclusion, we can state that it is already possible to give semantic meaning to measurements (by providing their LOINC code) and the units in which they are expressed (by providing the UCUM notation), by using the "Alias" mechanism.

This would also enable systems to automatically extract information from EHRs (e.g. using HL7 CDA or FHIR) as in these systems, body height is coded using the LOINC code "8302-2" and the value MUST be given using UCUM notation. For example (as FHIR):

with the LOINC code in the "code" element (middle part of the snapshot), the value in the "value" element (near the bottom) and the (UCUM) unit in the "code" element under "valueQuantity" (lower part of the snapshot).
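To illustrate how little work the extraction is, here is a small sketch that pulls these three pieces of information out of a FHIR Observation in its JSON form. The resource below is an invented, trimmed-down example (the value is made up), not the snapshot referred to above.

```python
import json

# Invented, minimal FHIR Observation for body height (JSON form).
OBSERVATION_JSON = """
{
  "resourceType": "Observation",
  "status": "final",
  "code": {
    "coding": [
      {"system": "http://loinc.org", "code": "8302-2", "display": "Body height"}
    ]
  },
  "valueQuantity": {
    "value": 172.0,
    "unit": "cm",
    "system": "http://unitsofmeasure.org",
    "code": "cm"
  }
}
"""

LOINC_SYSTEM = "http://loinc.org"
UCUM_SYSTEM = "http://unitsofmeasure.org"


def extract_measurement(observation):
    """Return (LOINC code, value, UCUM unit) from an Observation dict."""
    loinc_code = next(
        coding["code"]
        for coding in observation["code"]["coding"]
        if coding.get("system") == LOINC_SYSTEM
    )
    quantity = observation["valueQuantity"]
    if quantity.get("system") != UCUM_SYSTEM:
        raise ValueError("Unit is not expressed in UCUM")
    return loinc_code, quantity["value"], quantity["code"]


measurement = extract_measurement(json.loads(OBSERVATION_JSON))
print(measurement)
```

A CRF field carrying the alias "8302-2" and a UCUM unit could be populated from this tuple without any human interpretation.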

Is this sufficient?

I do not think so.

"Alias" can be used for anything, and the content of "Context" is not standardized. Also, we should encourage the use of UCUM, as the current codelist for units developed by CDISC is a disaster anyway. Even for pre-clinical studies, to be submitted as SEND, the use of UCUM units would be a great step forward. So we are thinking about "promoting" the UCUM notation to an attribute on MeasurementUnitDef itself, something like (but don't pin me down on that!):

However, that doesn't solve everything...

When talking about measurements and units in clinical research, and especially for laboratory tests, I think we can see the following categories:

The measurement has no unit. For example: "pH"
We know the exact unit of measure in advance. For example "millimeter mercury column" for a "blood pressure". This is covered by the current use of "MeasurementUnit" in ODM. The unit can then be preprinted on the form, and/or stored in the database as the one we know will always be the case
There is a choice of units. For example: choice between "cm" and "inches". Also this is covered except for how it is "rolled out", e.g. by different CRF versions based on culture or country
We don't know what units we will get back. This is often the case for lab tests. Unfortunately, most protocols do not provide sufficient details about what exactly should be done; they simply state e.g. "do a glucose in urine test". We can then expect a multitude of units (or their absence) back: one lab will report in mg/dL, one in mmol/L, and others will provide ordinal information (1+, 2+, ... - no units), making comparison hard (how will we standardize to --STRESU in SDTM?). In such a case, the unit information is usually a field on the CRF. For example:

The latter is OK, but as the "question" about the unit is just another question, we lose the information that a) it is a unit, and b) it is a unit for the albumin concentration. Of course, this could e.g. be solved by an "Alias", like:

but this is not a very elegant solution, as the content of "Alias" is not standardized.
In CDA and FHIR, this is easy, as these do not define what is to be measured, just what has been measured. In ODM, it is just a bit more difficult.
Now, I do not know the solution for this, but it is something that we (the XML technologies team) will need to tackle.

Friday, September 30, 2016

Generating Define-XML: new software

Although my posting "Creating define.xml - best and worst practices" is the most read entry on my blog site, people seem not to learn (or maybe do not want to learn). Almost daily, I read complaints from people who use free-of-charge "black box software" for generating define.xml, starting from Excel worksheets, and who do not get the result they would like to obtain, at least not when viewing the define.xml visualized by the stylesheet. Some do not even realize that what they see in the browser is not the define.xml, but only a visualization of it.

There does not seem to be much user-friendly software on the market for generating or working with Define-XML. As already explained in the above-mentioned blog, the best way to generate a define.xml is "upfront", and a few software packages for mapping to SDTM and at the same time generating a good define.xml exist. One of them is my own SDTM-ETL software.

For people that cannot use this approach (e.g. for legacy studies), there was not much out there yet. They usually used the "Excel" approach, often leading to very bad results.

We recently released new software named the "ODM Study and Define.xml Designer 2016", which can be used both for setting up study designs in ODM format and for generating define.xml files. For the latter, 4 use cases are supported:

creating a define.xml from scratch
creating a define.xml starting from an SDTM template (SDTM 1.2, 1.3 or 1.4 - SDTM-IG 3.1.2, 3.1.3 or 3.2)
creating a define.xml starting from a set of SAS-XPT files
starting from an incomplete define.xml file, e.g. generated by other tools

In all cases, the user can choose between define.xml v.1.0 and v.2.0. The upcoming v.2.1 will also be supported as soon as it is published by CDISC. The user can also choose between all CDISC controlled terminology versions that were released since 2013.

Unlike the "black box tools", the software comes with a very nice graphical user interface, has very many wizards, and performs validation using validation rules developed by CDISC. For example for generating the define.xml v.2.0 "Where Clauses", there is a wizard:

making it extremely easy to develop "where-clauses".

At each moment during the process, the user can inspect the generated define.xml, either as XML, as a tree structure, or visualized in the user's favorite browser, and using the default CDISC stylesheet or using an own stylesheet.

The validation features go beyond anything else that is currently available, and can be done on different levels. Moreover, unlike with other tools, no false positive errors are generated. This is due to the fact that the developer of the software (well, that's me, a CDISC volunteer for 15 years now) is one of the co-developers of the Define-XML standard, and a CDISC authorized Define-XML trainer (I give most of the CDISC Define-XML trainings in Europe), and thus knows every detail of the standard.

The software is not free-of-charge, but it is not expensive either. So, there is now no excuse anymore for generating bad define.xml files!

Information, including a user manual, can be found at:
http://www.xml4pharma.com/CDISC_Products/ODM-Define_Designer_2016.html

Thursday, September 15, 2016

FDA and SAS Transport 5 - survey results

As promised on LinkedIn, I analyzed the results of the survey where people were asked the question "In my opinion, the FDA should ..." with the following possible answers:

Continue requiring SAS-Transport-5 (XPT) as the transport format
Move to XML (e.g. CDISC Dataset-XML) as soon as possible
Move to RDF as soon as possible
Other

People were also asked who they are working for: Pharma Sponsor, CRO, Service Provider, Software Company, or Other.
We had 57 answers (which is considerably less than I had hoped for). Here are the first results:

with a relatively good distribution between all groups (some ticked more than 1 box), with a slight overrepresentation of pharma sponsors (which isn't a surprise as they do the FDA submissions).

And here come the results about the question on the exchange format:

Over 50% voted for moving to an XML-based format like CDISC-Dataset-XML, about 25% for moving to RDF. A minority of less than 20% voted for continuation of the current FDA policy to require SAS Transport 5.

I tried to make a detailed analysis looking for relations between the answer about the preferred format and the company type, but didn't find any. The only slight trends I could see (but statistically not significant at all) are that RDF is a bit overrepresented in the "Sponsor" group, and that "SAS-Transport-5" is slightly overrepresented in the "CRO" group. Only 3 (out of the 20) "sponsor voters" voted for "Continue requiring SAS-Transport-5".

The survey also allowed respondents to provide comments. Here are the most interesting ones:

If it's not broken, don't fix it. Pharma is a big industry and slow to change/adapt
We must move beyond the restrictive row/column structure
SDTM is useless and error prone. We need modern data models and semantics
Consider JSON also. Get rid of Supplemental domains
Going for RDF means that ADaM, SDTM and the rest could be all linked together ...

If anyone would like to analyze the results in more detail, just mail me and I can send the complete results as a spreadsheet or as CSV or similar.

Saturday, June 4, 2016

MedDRA rules validation in XQuery and the Smart Dataset-XML Viewer

Yesterday, I accomplished something that I believed was difficult, but after all wasn't: to develop the FDA and PMDA MedDRA validation rules in XQuery (it's easy if you know how).

The problem with MedDRA is that it is not open and public - you need a license. Once you have one (I got a free one as I, as a professor in medical informatics, use MedDRA for research), you can download the files. When I did, I expected some modern file format like XML or JSON, but to my surprise, the files come as old-fashioned ASCII (".asc") files with the "$" character as field separator. According to the explanations that come with the files, they can be used to build a relational database. However, the license does not allow me to redistribute the information in any form, so I could not build a RESTful web service that could then be used in the validation. As the other validator also just uses the ".asc" files "as is", I needed to find out how XQuery can read ASCII files that do not represent any XML.
I regret that MedDRA is not open and free for everyone (CDISC controlled terminology is). How can we ever empower patients to report adverse events when each patient separately needs to apply for a MedDRA license? This model is not of this time anymore...

The FDA and PMDA rule sets each contain about 20 rules that involve MedDRA. One of them is i.m.o. not implementable in software. Rule FDAC165/(PMDA)SD2006 states "MedDRA coding info should be populated using variables in the Events General Observation Class, but not in SUPPQUAL domains". How the hell can software know whether a variable in a SUPPQUAL domain has to do with MedDRA? The only way I can see is that there is a codelist attached to that variable pointing to MedDRA. If this is not the case, one can only guess (something computers are not so good at).

As MedDRA files are text files that do not represent XML, we cannot use the usual XQuery constructs to read them. Fortunately, XQuery 3.0 comes with the function "unparsed-text-lines()" which (among others) takes a file address as an argument. The file address however needs to be formatted as a URI, e.g.:

unparsed-text-lines('file:///e:/Validation_Rules_XQuery/meddra_17_1_english/MedAscii/pt.asc')

This function reads the file line by line. If it is then combined with the function "tokenize", which splits strings into tokens based on a field separator, then XQuery can also easily read such old-fashioned text files. So the beginning of our XQuery file (here for rule FDAC350/SD2007), after all the namespace and function declarations, looks like:

The first five lines in this part (18-22) define the location of the define.xml file and of the MedDRA pt.asc (preferred terms file). For each of them, we use a "base", as we later want to enable these to be passed from an external program.

In line 24, the file is parsed, the result is an array of strings "$lines". In lines 26-29, we select the first item in each line (with the "$" character as the field separator). As such "$ptcodes" now simply consists of all the PT codes (preferred term codes).
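For readers who do not know XQuery, the same two steps (read the file line by line, then take the first "$"-separated field of each line) can be sketched in Python. This is only an analogue of the approach described above, not the XQuery itself; the function names are mine, and the sample lines are invented as MedDRA content may not be redistributed.

```python
from pathlib import Path


def pt_codes_from_lines(lines):
    """Return the set of PT codes: the first '$'-separated field per line."""
    return {line.split("$")[0] for line in lines if line.strip()}


def load_pt_codes(pt_asc_path):
    """Read a pt.asc file line by line and collect its PT codes."""
    text = Path(pt_asc_path).read_text(encoding="ascii", errors="replace")
    return pt_codes_from_lines(text.splitlines())


# Invented lines mimicking the "$" field separator (not real MedDRA content).
SAMPLE_LINES = [
    "90000001$Invented term A$$90000100$$$$$$$$",
    "90000002$Invented term B$$90000100$$$$$$$$",
]

ptcodes = pt_codes_from_lines(SAMPLE_LINES)
print(sorted(ptcodes))
```

Keeping the codes in a set makes the later "is this value an allowed PT code?" lookup cheap, whatever the size of the dataset.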

Then, the define.xml file is read, and the AE, MH and CE dataset definitions are selected:

An iteration is started over the AE, MH and CE datasets (note that the selection allows for "split" datasets), and in each of them, the define.xml OID of the --PTCD variable is captured, together with the variable name (which can be "AEPTCD", "MHPTCD" or "CEPTCD"). The location of the dataset is then obtained from the "def:leaf" element.

In the next part, we iterate over each record in the dataset, and get the value of the --PTCD variable (line 50):

and then check whether the value captured is one in the list of allowed PT codes (line 53). If it is not, an error message is generated (lines 53-55).
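The check itself is nothing more than a membership test per record. Here is a minimal Python sketch of that logic, again only as an analogue of the XQuery; the "error" element and its attribute names are hypothetical and do not claim to match the messages produced by the real rule.

```python
import xml.etree.ElementTree as ET


def check_pt_codes(records, allowed_codes, rule_id="FDAC350"):
    """Return one XML error message per record with an unknown PT code.

    `records` is an iterable of (record number, --PTCD value) pairs.
    """
    messages = []
    for record_number, value in records:
        # Empty values are left to other rules; only populated ones are checked.
        if value and value not in allowed_codes:
            error = ET.Element(
                "error", {"rule": rule_id, "recordnumber": str(record_number)}
            )
            error.text = f"PT code {value} was not found in the MedDRA PT file"
            messages.append(ET.tostring(error, encoding="unicode"))
    return messages


# Invented codes, for illustration only.
allowed = {"90000001", "90000002"}
messages = check_pt_codes([(1, "90000001"), (2, "12345678"), (3, "")], allowed)
print(messages)
```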

That's it! Once you know how it works, it is so easy: it took me less than 15 minutes to develop each of these 20 rules.

I talked about these XQuery rules implementations with an FDA representative at the European CDISC Interchange in Vienna. When he saw the demo, his face became slightly pale, and he asked me: "Do you know what we paid these guys to implement our rules in software, and you tell me your implementation comes for free and is fully transparent?".

Beyond free, open and fully transparent (if that were not sufficient) the advantage of these rules is that the rules are completely independent of the software to do the validation: anyone can now write his own software without needing to code the rules themselves. You could even create a server that validates your complete repository of submissions during night time. As the messages come as XML, you can easily reuse them in any application that you want (try this with Excel!).

In the next section, I would like to explain how extremely easy it is to write software for executing the validations. The "Smart Dataset-XML Viewer" allows you to do these validations (but you can choose not to do any validation at all, or only for some rules), so I just took a few code snippets to explain this. We use the well-known open source Saxon library for XML parsing and validation, developed by the XML guru Michael Kay, which is available both for Java and for C# (.NET). If you would like to see the complete implementation of our code, just go to the SourceForge site, where you can download the complete source code or just browse through it. The most interesting class is the class "XQueryValidation" in the package "edu.fhjoanneum.ehealth.smartdatasetxmlviewer.xqueryvalidation".
Here is a snippet:

First of all, the file location of the define.xml is transformed to a "URI". A new StringBuffer is prepared to keep all the messages. In the following lines, the Saxon XQuery engine is initialized

and the base location and file name of the define.xml file is passed to the XQuery engine (remark that the define.xml can also be located in a native XML database, with one collection per submission, something that also the FDA and PMDA could easily do). This "passing" is done in the lines with "exp.bindObject" (in the center of the snippet).
In case MedDRA is involved in the rule execution, the same is done in the last part of the snippet (whether a rule requires MedDRA is given by the "requiresMedDRA" attribute in the XML file containing all the rules):

The rule is then executed, and the error messages (as XML strings) captured in the earlier defined StringBuffer:

So, the contents of the "messageBuffer" StringBuffer is essentially an XML that can be parsed, or just written to file, or transformed to a table, or stored in a native XML database, or ...
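As an illustration of such reuse, the following sketch turns a buffer of XML messages into a CSV table. It is written in Python rather than Java, and it assumes a hypothetical message format ("error" elements with "rule" and "recordnumber" attributes); the real messages may look different.

```python
import csv
import io
import xml.etree.ElementTree as ET


def messages_to_csv(message_buffer):
    """Convert concatenated <error> fragments into a CSV string."""
    # The buffer holds sibling elements, so wrap them in a single root first.
    root = ET.fromstring(f"<messages>{message_buffer}</messages>")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["rule", "recordnumber", "text"])
    for error in root.iter("error"):
        writer.writerow(
            [error.get("rule"), error.get("recordnumber"), (error.text or "").strip()]
        )
    return out.getvalue()


buffer_content = (
    '<error rule="FDAC350" recordnumber="2">Unknown PT code</error>'
    '<error rule="FDAC350" recordnumber="5">Unknown PT code</error>'
)
table = messages_to_csv(buffer_content)
print(table)
```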

In order to accept the passing of parameters from the program to the XQuery rule, we only need to change the hardcoded file paths and locations to "external" ones, i.e. stating that some program will be responsible for passing the information. In the XQuery itself, this is done by:

As one sees, lines 17-19 have been commented out, and lines 14-16 declare that the values for the location of the define.xml file and of the directory with MedDRA files will come from a calling program.

In the "Smart Dataset-XML Viewer", the user can himself decide where the MedDRA files are located (so it is not necessary to copy files to the directory of the application), using the button "MedDRA Files Directory":

A directory chooser then shows up, allowing the user to set where the MedDRA files need to be read from. This can also be a network drive, as is pretty usual in companies.

If you are interested in implementing these MedDRA validation rules, just download them from our website, or use the RESTful web service to get the latest update.

Again, the "Smart Dataset-XML Viewer" is completely "open source". Please feel free to use the source code, to extend it, to use parts of it in your own applications, to redistribute it with your own applications, etc. Of course, we would highly welcome it if you also donated back the source code of extensions that you wrote, so that we can further develop this software.

Friday, May 20, 2016

Electronic Health Record Data in SDTM

The newest publication of the FDA about "Use of Electronic Health Record Data in Clinical Investigations" triggered me to pick up the topic of EHRs in SDTM again. The FDA publication describes the framework in which use of EHRs in clinical research is allowed and encouraged. Although it does not contain really new information, it should take away the fears of sponsors and investigators about the use of EHR data in FDA regulated clinical trials.

One of the things that will surely happen in future is that the FDA reviewer wants to see the EHR data point that was used as the source of a data point in the SDTM submission. The investigator will then ask the sponsor who will then ask the site...: another delay in bringing this innovative new drug or therapy to the market. In the meantime patients will die ...

So can't we have the EHR datapoint in the SDTM itself?

Of course! It is even very easy, but only if the FDA would finally decide to get rid of SAS-XPT, this ancient binary format with all its stupid limitations.

Already some years ago, the CDISC XML Technologies Team developed the Dataset-XML standard, as a simple replacement for SAS-XPT. The FDA did a pilot, but since then nothing has happened - "business as usual" seems to have returned.
Dataset-XML was developed to allow the FDA a smooth transition from XPT to XML. It doesn't change anything about SDTM, it just changes the way the data is transported from A to B. However, Dataset-XML has the potential to do things better, as it isn't bound to the two-dimensional table approach of XPT (which in turn forces SDTM to consist of two-dimensional tables).

So, let's try to do better!

Suppose that I do have a VS dataset with a systolic blood pressure for subject "CDISC01.100008" and the data point was retrieved from the EHR of the patient. Forget about adding the EHR data point in the SDTM using ancient SAS-XPT! We need Dataset-XML.

This is how the SDTM records look:

Now, the EHR is based on the new HL7-FHIR standard, and the record is very similar to the one at https://www.hl7.org/fhir/observation-example-bloodpressure.xml.html. How do we get this data point in our SDTM?

Dataset-XML, as it is based on CDISC ODM, is extensible. This means that XML data from other sources can be embedded as long as the namespace of the embedded XML is different from the ODM namespace. As FHIR has an XML implementation, the FHIR data point can easily be embedded into the Dataset-XML SDTM record.
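The principle can be sketched as follows. This is not the downloadable example, only a hedged illustration in Python: the record is heavily simplified, the OIDs are invented, and the FHIR Observation is reduced to a single value. The helper simply refuses to embed anything that does not live in the FHIR namespace.

```python
import xml.etree.ElementTree as ET

ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"
FHIR_NS = "http://hl7.org/fhir"

# Prefixes used when serializing; the choice of prefixes is arbitrary.
ET.register_namespace("odm", ODM_NS)
ET.register_namespace("fhir", FHIR_NS)


def embed_observation(item_group_data, fhir_element):
    """Append a FHIR element to a record, insisting on the foreign namespace."""
    if not fhir_element.tag.startswith("{" + FHIR_NS + "}"):
        raise ValueError("Embedded element must be in the FHIR namespace")
    item_group_data.append(fhir_element)
    return item_group_data


# Simplified SDTM record (invented OIDs), as it could appear in Dataset-XML.
record = ET.Element(f"{{{ODM_NS}}}ItemGroupData", {"ItemGroupOID": "IG.VS"})
ET.SubElement(
    record, f"{{{ODM_NS}}}ItemData", {"ItemOID": "IT.VS.VSTESTCD", "Value": "SYSBP"}
)

# Strongly reduced FHIR Observation in its XML form.
observation = ET.fromstring(
    '<Observation xmlns="http://hl7.org/fhir">'
    '<valueQuantity><value value="107"/><unit value="mm[Hg]"/></valueQuantity>'
    "</Observation>"
)

embed_observation(record, observation)
print(ET.tostring(record, encoding="unicode"))
```

Because the two vocabularies are kept apart by their namespaces, an ODM-only tool can skip the embedded part, while a FHIR-aware tool can pick it up unchanged.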

In the following example (which you can download from here), I decided to add the FHIR-EHR data point to the SDTM record, and not to VSORRES (for which one could plead), as I think that the data point belongs to the record, and not to the "original result" - we will discuss this further on.

The SDTM record then becomes:

Remark that the "Observation" element "lives" in the HL7 namespace "http://hl7.org/fhir".

continued by:

Important here is that LOINC coding is used for an exact description of the test (systolic, sitting - LOINC code 8459-0), and that SNOMED-CT is used for coding the body part. This is important - the SDTM and CT teams are still refusing to allow the LOINC code to be used as the unique identifier for the test in VS and LB. Instead, they reinvented the wheel and developed their own list of codes, leading to ambiguities. LOINC coding is mandated to be used in most national EHR systems, including the US Meaningful Use. The same applies to the use of UCUM units.

Now, if you inspect the record carefully, you will notice that a good amount of the information is present twice. The only information that is NOT in the EHR data point is STUDYID, USUBJID (although ...), DOMAIN, VISITNUM, VISITDY (planned study day) and VSDY (actual day). STUDYID is an artefact of SAS-XPT, as ODM/Dataset-XML could allow all records to be grouped per subject (using ODM "SubjectData/@SubjectKey"). DOMAIN is also an artefact, as within the data set, DOMAIN must always be "VS" and is given by the define.xml anyway with a reference to the correct file. VSDY is derived and can easily be calculated "on the fly" by the FDA tools. Even VSSEQ is artificial and could easily be replaced by a worldwide unique identifier (making it worldwide referenceable, as in ... FHIR). VISIT (name) is also derived in the case of a planned visit and can be looked up in TV (trial visits).
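To show how trivial the "on the fly" calculation of a study day is, here is a minimal sketch following the usual SDTM convention (there is no day 0: the reference start date is day 1, the day before is day -1). The function name and the dates are mine.

```python
from datetime import date


def study_day(observation_date, reference_start_date):
    """Return the SDTM study day (--DY) relative to the reference start date."""
    delta = (observation_date - reference_start_date).days
    # On or after the reference date: add 1, so that the first day is day 1.
    return delta + 1 if delta >= 0 else delta


reference = date(2016, 5, 20)
print(study_day(date(2016, 5, 20), reference))  # 1
print(study_day(date(2016, 5, 19), reference))  # -1
print(study_day(date(2016, 5, 30), reference))  # 11
```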

So, if we allow Dataset-XML to become multi-dimensional (grouping data by subject), the only SDTM variables that explicitly need to be present are VISITNUM and VISITDY. So essentially, our SDTM record could be reduced to:

and:

Remark the annotations I made, making the mapping to SDTM variables.

If the reviewer still likes to see the record in the classic two-dimensional table way, that's a piece of cake: an extremely simple transformation (e.g. using XSLT) does the job.
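The kind of transformation meant here can be sketched as follows. I use Python instead of XSLT purely for illustration, and the nested input structure (observations grouped per subject) is a hypothetical stand-in for the reduced records above: the study identifier, the domain, the subject key and a sequence number are simply added back to every row.

```python
def flatten(study_id, domain, observations_by_subject):
    """Turn observations grouped per subject into classic SDTM-like rows."""
    rows = []
    for usubjid, observations in observations_by_subject.items():
        for sequence, observation in enumerate(observations, start=1):
            # Re-create the variables that the grouping made redundant.
            row = {
                "STUDYID": study_id,
                "DOMAIN": domain,
                "USUBJID": usubjid,
                f"{domain}SEQ": sequence,
            }
            row.update(observation)
            rows.append(row)
    return rows


grouped = {
    "CDISC01.100008": [
        {"VSTESTCD": "SYSBP", "VISITNUM": 1},
        {"VSTESTCD": "DIABP", "VISITNUM": 1},
    ]
}
rows = flatten("CDISC01", "VS", grouped)
print(rows)
```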

Now, reviewers always complain about file sizes (however, reviewers should be forbidden to use "files"), and will surely do when they see how much "size" the FHIR information takes. But who says that the FHIR information must be in the same file? Can't it just be referenced, or better, can't we state where the information can be found using a secured RESTful web service?
This is done all the time in FHIR! So we could further reduce our SDTM record to:

Note that the "http://..." is not simply an HTTP address: just using it in a browser will not retrieve the subject's data point. The RESTful web service in our case will require authentication, usually using the OAuth2 authentication mechanism.
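What this means for a client program can be sketched in a few lines. The URL and the token below are placeholders, and obtaining the token (the actual OAuth2 flow) is left out; the sketch only shows that the request must carry a bearer token, which a plain browser call would not do. The request is built but deliberately not sent.

```python
import urllib.request


def build_observation_request(url, access_token):
    """Build (but do not send) an authenticated request for a FHIR resource."""
    return urllib.request.Request(
        url,
        headers={
            # OAuth2 bearer token, obtained earlier from an authorization server.
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/fhir+json",
        },
    )


# Placeholder URL and token, for illustration only.
request = build_observation_request(
    "https://ehr.example.org/fhir/Observation/12345", "example-token"
)
print(request.full_url, request.get_header("Authorization"))
```

Sending it would be a matter of calling urllib.request.urlopen(request) against a real, authorized endpoint.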

Comments are very welcome - as ever ...

Tuesday, May 3, 2016

Ask SHARE - SDTM validation rules in XQuery

This weekend, after returning from the European CDISC Interchange (where I gave a presentation titled "Less is more - A Visionary View of the Future of CDISC Standards"), I continued my work on the implementation of the SDTM validation rules in the open and vendor-neutral XQuery language (also see earlier postings here and here).
This time, I worked on a rule that is not so easy. It is the PMDA rule CT2001: "Variable must be populated with terms from its CDISC controlled terminology codelist. New terms cannot be added into non-extensible codelists".
This looks like an easy one on first sight, but it isn't. How does a machine-executable rule know whether a codelist is extensible or not? Looking into an Excel worksheet is not the best way (also as Excel is not a vendor-neutral standard) and cumbersome to program (if possible at all). So we do need something better.

So we developed the human-readable, machine-executable rule (it can be found here) using the following algorithm:

the define.xml is taken and iteration is performed over each dataset (ItemGroupDef)
within the dataset definition, an iteration is performed over all the defined variables (ItemRef/ItemDef) and it is checked whether a codelist is attached to the variable
if a codelist is attached, the NCI code of the codelist is taken. A web service request "Get whether a CDISC-CT Codelist is extensible or not" is triggered which returns whether the codelist is extensible or not. Only the codelists that are not extensible are retained. This leads to a list of non-extensible codelists for each dataset
the next step could be that each of the (non-extensible) codelists is inspected for whether it has some "EnumeratedItem" or "CodeListItem" elements that have the flag 'def:ExtendedValue="Yes"'. This is however not "bullet proof", as sponsors may have added terms and forgotten to add the flag.
the step could also have been to use the web service "Get whether a coded value is a valid value for the given codelist" to query whether each of the values in the codelist is really a value of that codelist as published by CDISC. This relies on that the values in the dataset itself for the given variable are all present in the codelist (enforced by another rule). The XQuery implementation can be found here.
we chose, however, to inspect each value in the dataset of the given variable which has a non-extensible codelist for whether it is an allowed value for that codelist, by using the web service "Get whether a coded value is a valid value for the given codelist". If the answer from the web service is "false", an error is returned in XML format (allowing reuse in other applications). The XQuery implementation can be found here.
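The chosen variant of the algorithm can be sketched as below. This is a Python illustration, not the XQuery implementation: the two web services are replaced by local stand-in functions, the codelist identifiers and terms are placeholders, and the XML error format is hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-ins for the two web services; identifiers and terms are placeholders.
NON_EXTENSIBLE_CODELISTS = {"CL.SEX"}
PUBLISHED_TERMS = {"CL.SEX": {"F", "M", "U"}, "CL.UNIT": {"cm", "kg"}}


def is_extensible(codelist):
    """Stand-in for 'Get whether a CDISC-CT Codelist is extensible or not'."""
    return codelist not in NON_EXTENSIBLE_CODELISTS


def is_valid_value(codelist, value):
    """Stand-in for 'Get whether a coded value is a valid value ...'."""
    return value in PUBLISHED_TERMS.get(codelist, set())


def check_ct2001(variable_codelists, dataset_records):
    """Return XML errors for values that are not in a non-extensible codelist."""
    errors = []
    for variable, codelist in variable_codelists.items():
        if is_extensible(codelist):
            continue  # only non-extensible codelists are retained
        for record_number, record in enumerate(dataset_records, start=1):
            value = record.get(variable)
            if value and not is_valid_value(codelist, value):
                error = ET.Element(
                    "error",
                    {
                        "rule": "CT2001",
                        "variable": variable,
                        "recordnumber": str(record_number),
                    },
                )
                error.text = f"Value {value} is not in codelist {codelist}"
                errors.append(ET.tostring(error, encoding="unicode"))
    return errors


errors = check_ct2001(
    {"SEX": "CL.SEX", "VSORRESU": "CL.UNIT"},
    [{"SEX": "M", "VSORRESU": "cm"}, {"SEX": "MALE", "VSORRESU": "furlong"}],
)
print(errors)
```

In this toy run only "MALE" is reported: "furlong" passes because its (placeholder) codelist is extensible, which is exactly the distinction the rule asks for.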

For each of the XQuery implementations, you can inspect them either using NotePad or NotePad++, or a modern XML editor like Altova XML Spy, oXygen-XML or EditiX.

A partial view is below:

What does this have to do with SHARE?

Everything!

All of the above-mentioned RESTful web services (available at www.xml4pharmaserver.com/WebServices/index.html) are based on SHARE content. The latter has been implemented as a relational database (it could however also have been a native XML database) and a pretty large number of RESTful web services have been built around it.

In future, the SHARE API will deliver such web services directly from the SHARE repository. So our own web services are only a "test bed" for finding out what is possible with SHARE.

So in future, forget about "black box" validation tools - simply "ask SHARE"!

Monday, March 28, 2016

FDA SDTM validation rules, XQuery and the "Smart Dataset-XML Viewer"

During the last days, I could again make considerable progress in writing FDA and CDISC SDTM and ADaM validation rules in the vendor-neutral XQuery language (a W3C standard).

With this project, we aim to:

come to a truly vendor-neutral set of validation rules that is both human-readable and machine-executable (no black-box implementations anymore)
have rules that are easily readable by persons in the CDISC community, and commented on
develop rules that do not lead to false positives
come to a reference implementation of the validation rules, meaning that, after acceptance by CDISC, other implementations (e.g. from commercial vendors) always need to come to the same result for the same test case
make these rules available by CDISC SHARE for applications and humans, by using RESTful web services and the SHARE API

I was now also able to implement these rules in the "Smart Dataset-XML Viewer":

The set of rules itself is provided as an XML file, for which we already have a RESTful web service for rapid updates, meaning that if someone finds a bug or an issue with a rule implementation, it can be updated within hours, and the software can automatically retrieve the corrected rule implementation (no more waiting for the next software release or software bug fix).

In the "Smart Dataset-XML Viewer", the validation is optional, and when the user clicks the button "Validation Rules Selections", all the available rules are listed, and can be selected/deselected, meaning that the user (and not the software) decides for which rules the submission data sets are validated:

Some of these rules use web services themselves, for example to detect whether an SDTM variable is "required", "expected" or "permissible", something that cannot be obtained from the define.xml.
A great advantage is that any rule violations are immediately visible in the viewer itself, i.e. the user does not need to retrieve the information from an Excel file anymore and then look up the record manually in the data set.

At the same time, all violations are gathered into an XML structure, which can easily be (re)used in other applications (we do not consider Excel as a suitable information exchange format between software applications).

And even better, all this is real "open source" without any license or redistribution limitations, so that people can integrate the "Smart Dataset-XML Viewer", including its XQuery validation, into any other application, even commercial ones.

I am currently continuing to work on this implementation, and on the validation rules in XQuery. I did most of the FDA-SDTM rules (well, at least those that are not wrong, incomprehensible, or an expectation rather than a rule).

I also did about 40% of the ADaM 1.3 validation checks, and will start on the CDISC SDTM conformance rules as soon as they are officially published by CDISC.
I can however use help with the ADaM validation rules, as I lack some suitable real-life test files. So if you do ADaM validation in your company and have some basic XQuery knowledge (or are willing to acquire it), please let me know, so that we can make rapid progress on this.
Another nice thing about having the rules in XQuery is that companies can easily start developing their own sets of validation rules in this vendor-neutral language, be it for SDTM, SEND or ADaM, and just add them to a specific directory in the "Smart Dataset-XML Viewer", after which they will immediately become available to the viewer.

I hope to make a first release on SourceForge (application + source code) in the next few weeks, so stay tuned!

Thursday, February 25, 2016

Phil's webinar on LOINC and CDISC

Today, I attended an excellent "CDISC members only Webinar" given by Phil Pochon (Covance) on LOINC and its use in CDISC.
Phil is an early contributor to CDISC (he has been involved considerably longer than I have), and is very well known for the development of the CDISC Lab Standard and his contributions to SDTM.

So Phil is one of the people I highly respect.

Phil explained the concepts of LOINC very well and especially the differences with the CDISC controlled terminology for lab tests and results.

He also answered the questions that were posed extremely well, giving his opinion about how LOINC should be used in combination with CDISC standards (there isn't a CDISC policy on this yet).

In this blog, I want to expand on some of the questions that were posed and on which I have a different opinion (Phil knows that).

There were several questions about how to map information from local labs (such as local lab test codes) to LOINC. Phil gave some suggestions about looking at the units used, the method provided (if any), and so on.
My opinion about this is: "don't": if the lab cannot provide you the LOINC code with the test result, don't try to derive it. Even if "LBLOINC" were to become "expected" in SDTM in the future, I would suggest not trying to derive it (SDTM is about captured data, not about derived data). The reason is that such a derivation may lead to disaster, also because the reviewer at the FDA cannot see whether the given LOINC code comes from the instrument that did the measurement, or was "guessed" by the person that created the SDTM files. This is a serious data quality issue.

There was a short discussion about whether labs should provide the LOINC code together with each test result for each measurement. My opinion about that is that if your lab cannot provide LOINC codes, you should not work with that lab anymore. Also, sponsors and CROs should have a statement in their data transfer agreements with labs that the latter should not only deliver "a" LOINC code with each result, but should deliver "the correct" LOINC code with each result.
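
Whether a delivered LOINC code is "the correct" one for the test cannot be decided by software alone, but a purely syntactic plausibility check on incoming codes is easy to sketch. The Python example below assumes that the last digit of a LOINC code is a Mod 10 (Luhn) check digit over the digits before the hyphen; this holds for the codes mentioned in this post, but the function names are my own and this is not an official LOINC tool. It only catches typos and transposed digits, not a wrongly chosen test.

```python
import re

# A LOINC code is written as <number>-<check digit>, e.g. "22705-8"
LOINC_PATTERN = re.compile(r"^(\d+)-(\d)$")


def mod10_check_digit(number_part):
    """Compute the Luhn (Mod 10) check digit for a string of digits."""
    total = 0
    # Walk from the rightmost digit, doubling every second digit
    # starting with the rightmost one
    for position, char in enumerate(reversed(number_part)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return (10 - total % 10) % 10


def is_well_formed_loinc(code):
    """Return True if the code has the LOINC shape and a matching check digit."""
    match = LOINC_PATTERN.match(code.strip())
    if match is None:
        return False
    number_part, check = match.groups()
    return mod10_check_digit(number_part) == int(check)


for candidate in ["22705-8", "5792-7", "25428-4", "22750-8"]:
    print(candidate, is_well_formed_loinc(candidate))
```

The last candidate is a deliberately transposed variant of the first one and is included to show a code that fails the check.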

Phil also answered a question about having LOINC codes specified for lab tests in the protocol. He stated that this is a long-term goal. I would however state that sponsors should start doing this now. Even if not all labs can always provide what (LOINC code) is described in the protocol, giving the (expected/preferred/suggested) LOINC code in the protocol would immediately increase data quality, as the bandwidth of what is finally delivered would surely become smaller. For example, for "glucose in urine", there is a multitude of lab tests, ranging from quantitative to ordinal to qualitative, each with a different LOINC code. It is impossible to bring all these results to a single "standardized value" (this is required by SDTM). Providing the (expected/preferred/suggested) LOINC code in the protocol and passing this information to the labs would at least reduce the breadth of different tests that were actually done, making the mapping to SDTM considerably easier, and at the same time improving data quality considerably.

An interesting question was whether LBLOINC applies to LBORRES (original result) or to LBSTRES(C/N) (standardized result). I checked it in the SDTM and it is not specified there. It still states "Dictionary-derived LOINC code for LBTEST", which is a disastrous definition, as LBLOINC should be taken from the data transfer itself and not be derived (deriving it will probably lead to disaster).
If I understood it well, Phil suggested applying it to the "standardized" result. For example, if I obtain the glucose value in "moles/l" (e.g. LOINC code 22705-8) but standardize on "mg/dL", this would mean that I need to (automatically?) transform this LOINC code into the alternative one in "mg/dL", which is (I think) 5792-7.
In my opinion, one should not do this, but use the LOINC code that was delivered by the lab, so the one for the original result. Why? Let me take an example: Suppose one of the labs has delivered the value as "ordinal", so using values like +1, +2, etc. (LOINC code 25428-4). How can I ever standardize these to a concentration? If I can, I guess this is highly arbitrary, and thus leads to another decrease in data quality. So I would propose that LBLOINC is always given as the value that is provided by the lab (so for the original results) and that this is clearly explained in the SDTM-IG.
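
The difference between the two cases can be made concrete with a small Python sketch. It assumes a glucose molar mass of about 180.16 g/mol and that the quantitative result arrives in "mmol/L"; the function name and the choice to return None for anything non-convertible are my own, purely for illustration.

```python
# Molar mass of glucose (C6H12O6), rounded; an assumption of this sketch
GLUCOSE_MOLAR_MASS_G_PER_MOL = 180.16


def glucose_to_mg_per_dl(value, unit):
    """Standardize a glucose result to mg/dL, or return None if impossible."""
    if unit == "mg/dL":
        return float(value)
    if unit == "mmol/L":
        # mmol/L times g/mol gives mg/L; dividing by 10 gives mg/dL
        return float(value) * GLUCOSE_MOLAR_MASS_G_PER_MOL / 10.0
    # Ordinal or qualitative results ("+1", "+2", "positive", ...) have no
    # defensible numeric equivalent, so no standardized value is produced
    return None


print(glucose_to_mg_per_dl(5.5, "mmol/L"))   # quantitative: convertible
print(glucose_to_mg_per_dl("+2", None))      # ordinal: not convertible
```

Note that the conversion only changes the number and the unit. It says nothing about which LOINC code belongs to the converted value, which is exactly why I would keep the code that the lab delivered with the original result.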

Another interesting discussion was on the use of UCUM unit notation. According to Phil, most of the lab units published by the CDISC team in the past are identical with or very similar to the UCUM notation. My experience is different. What was very interesting is that Phil said that when they (the CDISC-CT team) receive a "new term request" for a lab unit, they first look into UCUM to see whether there is one there, and if so, take that one. I am very glad about that!
But Phil also said that they get many requests for new unit terms that do not fit into the UCUM system (like arbitrary units for some special tests), so that they then develop their own.
Personally, I think they shouldn't. If a unit term is so special that it does not fit with UCUM, and also cannot be handled by so-called "UCUM annotations" (CDISC should standardize on the annotations, not on the units), then I wonder whether the requested term is good and generic enough to be standardized on at all. After all, the [UNIT] codelist is extensible.

My personal opinion is still that CDISC should stop the development of lab test (code) terminology and steadily move to LOINC completely. For example, it is my opinion that it should now already be allowed to simply put the LOINC code in LBTESTCD (maybe with an "L" in front, as test codes are still not allowed to start with a number, a rule stemming from the punchcard era...).
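
As a minimal sketch of that last suggestion (in Python, with a function name I made up), the mapping would be nothing more than a prefix. Whether the hyphen or the resulting length would be acceptable under the current rules for test codes is left open here; the sketch only deals with the "may not start with a number" rule.

```python
def loinc_to_lbtestcd(loinc_code):
    """Build a test code from a LOINC code by putting an 'L' in front."""
    code = loinc_code.strip()
    if not code or not code[0].isdigit():
        raise ValueError(f"Does not look like a LOINC code: {loinc_code!r}")
    # The prefix ensures the test code no longer starts with a digit
    return "L" + code


print(loinc_to_lbtestcd("5792-7"))
```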