Saturday, July 1, 2017

Submission dataset validation: Regular expressions versus XQuery

An interesting post recently showed up at the Pinnacle21 forum regarding validation of ISO-8601 durations. For those not familiar with ISO-8601 durations, this is about expressing time spans ("durations") in a machine-readable format. For example, a duration of 1 week is expressed as "P1W", a duration of 1 month and 21 days as "P1M21D". You can even have more complicated durations such as "3 days and 5 milliseconds" which is expressed as "P3DT0.005S".

The discussion was about rule FDAN039 (a SEND rule), which states "Value of Duration, Elapsed Time, and Interval variables (--DUR, --ELTM, --EVLINT) must conform to the ISO 8601 international standard". This rule is a consequence of section "Intervals of Time and Use of Duration for --DUR Variables" in the SDTM-IG and section 4.4.3 "Intervals of Time and Use of Duration for --DUR Variables" in the SEND-IG.

First remark that the rule as formulated by the FDA (was it?) is not 100% correctly defined, as it suggests that the valid ISO-8601 value "2017-07-01" is a valid value for e.g. "CLELTM", which it isn't. Rules should be exact!

The discussion on the Pinnacle21 forum was about the regular expression to validate this rule. Here is a snapshot:

Got it? Understood it?

I am not going to show you the alternative the forum contributor proposed as it is as unreadable as the one above ...

So, how would the rule implementation look like if we were (finally) allowed to submit data in XML? In that case, we could easily use XQuery, and make the rule implementation independent of the software used. Such rule implementations are already public available for all SDTM, SEND and most ADaM validation rules published by CDISC, FDA and PMDA.
Here is the core of the rule FDAN039 expressed in XQuery:

The first 8 lines check whether the value is a valid "week" duration expression ("PnW"), the last line checks whether the value is either a valid XML-Schema duration (which is 1:1 implementation of ISO-8601 "duration" except for "week duration"), or is a valid "week" duration.
An error is then returned when the value is not a valid schema duration and not a valid "week" duration:

Which of both, the regular expression (from Pinnacle21) or the XQuery do you find better readable? Which of both can one best check on whether the rule is correctly implemented?

Time to move away from SAS Transport 5 and to move to a modern format like XML ... It would make validation so much easier ...

Your comments are welcome as always!


  1. Jozef, I appreciate your blog, and like to move to a modern transport format as much as you do. But please do not blame a complicated regular expression on the transport format. Pinnacle 21, if they choose to, could have easily implemented your logic in Java. Any language that is a little bit advanced can implement the logic that you implemented in XQuery.

  2. Lex, I completely agree. They could have implemented the (Java) logic easily, but they still refuse to do so. Problem with them is that they regard everything as "char" (text) and "numeric", and they do not think any further than that. If one does not realize that there is something beyond that, the only way to validate text is using ... regular expressions.

  3. I probably made a "mental jump" when writing the blog and did not explain well enough ...
    During Define-XML courses that I teach, I see that many participants "think" in terms of "char" and "num" due to XPT, and do not understand about e.g. ISO-8601. One can then do one of two things: move away from XPT (and force them to think in terms of modern datatypes), or educate them about the difference between the transport format and the real datatype. I am always surprised that many of them (even after working for years with SDTM) regard ISO-8601 as something "inconvenient", not realizing it is an international standard. But they learn it in 30 minutes or less.
    But with XML they might well start regarding everything as text. I once had a french colleague stating that one cannot use XML (or JSON, or similar) because one cannot put numeric values in it, as it is ... text ;-)