Saturday, July 25, 2020

SAS Transport 5, non-ASCII characters, UTF-8, and the FDA Technical Conformance Guide

With the Chinese authorities now accepting CDISC standards, some developments in Japan, and the new FDA "Study Data Technical Conformance Guide", there has been a lot of discussion in recent weeks about the use of non-ASCII characters and UTF-8 encoding.

CDISC submission standards are tightly bound to the "SAS Transport 5" format, usually also designated as "XPT" format. It is a data format stemming from the era of mainframe computers, meant for the exchange of SAS data between IBM mainframes and VAX computers (do you still have one of these?). The specification is said to be "open", you can find the specification on the SAS website. The format is discouraged for use by the "US Library of Congress": "The Library of Congress Recommended Formats Statement (RFS) does not list either of the SAS Transport File Formats as preferred or acceptable for acquiring datasets for the its collections because the RFS expresses a preference for platform-independent, character-based formats for datasets".

Although the format is said to be "open", there is a major problem: numeric values use the "IBM mainframe representation", which is not used by modern computers anymore. The specification on the SAS website just mentions this, but leaves it to the implementor to find out how this IBM notation for numbers works. So, I do not consider the format as "open", as it has dependencies on other, undocumented, formats. For example, until now, I haven't been able to find "IBM notation" libraries for modern computer languages such as Java, C#, or Python. I had to program the "format" myself, bit by bit …
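
For readers who want to see what this involves: the layout commonly described for 8-byte IBM hexadecimal floating point is one sign bit, a 7-bit exponent in base 16 with a bias of 64, and a 56-bit fraction. The following is only a minimal Python sketch under that assumption. The function name `ibm_to_float` is made up for this illustration, and SAS missing-value markers are not handled.

```python
import struct


def ibm_to_float(raw: bytes) -> float:
    """Decode 8 bytes of IBM hexadecimal floating point (big-endian).

    Assumed layout: 1 sign bit, 7-bit base-16 exponent (bias 64),
    56-bit fraction. Missing-value markers are NOT handled here.
    """
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    (bits,) = struct.unpack(">Q", raw)           # one unsigned 64-bit integer
    sign = -1.0 if bits >> 63 else 1.0           # highest bit is the sign
    exponent = (bits >> 56) & 0x7F               # next 7 bits, biased by 64
    fraction = bits & 0x00FFFFFFFFFFFFFF         # remaining 56 bits
    if fraction == 0:
        return 0.0
    # value = sign * 0.fraction * 16 ** (exponent - 64)
    return sign * (fraction / 2**56) * 16.0 ** (exponent - 64)


print(ibm_to_float(bytes.fromhex("4110000000000000")))  # 1.0
print(ibm_to_float(bytes.fromhex("C276A00000000000")))  # -118.625
```

Note that the exponent counts powers of 16, not of 2, which is the main difference from the IEEE 754 numbers that modern computers use.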

Another "feature" of SAS Transport 5 is that it only supports US-ASCII.

So, essentially, it is not possible to represent any characters other than US-ASCII when using SAS Transport 5 as a transport format. The same applies to the newer SAS Transport 8.

So, for non-ASCII characters, people have started using "tricks".

In modern computers, we work with bytes. 1 byte has 8 bits. Essentially this means that, when using 1 byte per character, one can support 256 different characters. Note that a lower-case and an upper-case character are different things.
US-ASCII supports 128 characters, i.e. it uses only 7 of the 8 bits. This is where the "tricks" start …

How a character is represented by bits ("0" and "1"s) is named "encoding". There are numerous character encodings in the world, and you have probably run into problems with them when you saw a "very strange" character or a rectangle symbol on your screen where you expected a true character of your language. So, defining the character encoding, when possible in the file itself, is of utmost importance when exchanging documents between partners in a process.

For example, here are some encodings from the US-ASCII "code table":

Character          Bits
A                  100 0001
a                  110 0001
Z                  101 1010
z                  111 1010
:                  011 1010
line feed          000 1010
carriage return    000 1101
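
The bit patterns in the code table above can be reproduced in a few lines of Python. This is just an illustrative sketch; the helper name `ascii_bits` is my own.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit US-ASCII pattern of a single character as '### ####'."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a US-ASCII character")
    bits = format(code, "07b")          # 7 binary digits, zero-padded
    return bits[:3] + " " + bits[3:]    # same grouping as in the table


for name, ch in [("A", "A"), ("a", "a"), ("Z", "Z"), ("z", "z"), (":", ":"),
                 ("line feed", "\n"), ("carriage return", "\r")]:
    print(f"{name:16} {ascii_bits(ch)}")
```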

Most of these characters are "printable characters", a designation stemming from the time when we did not have computer screens (yes, I am that old) and output was only possible on line printers. "Non-printable" characters like the "line feed" or "carriage return" (they are NOT the same) were, and still are, used to steer the printer (nowadays the computer screen), e.g. to state that a new line needs to be started.

Officially, SAS Transport 5 only supports (7-bit) US-ASCII. As modern computers store each character in a byte of 8 bits, there is still one bit left, and people have started using "tricks" that exploit it to also represent non-ASCII characters.

Most of these are based on "extended ASCII", i.e. all ASCII characters remain as they are, using 7 bits, and additional characters are added by also using the 8th bit. These extensions are however not always standardized: there are many "proprietary" ones. Some however are standardized, like the ISO-8859-1 encoding, also called "ISO Latin 1". It uses 8 bits (so exactly 1 byte) per character, and supports most European languages, plus a number of others, like Kurdish and Indonesian. It includes the German character "ß", but, surprisingly, not the euro sign "€".

So, what happens if we use the full 8 bits in a SAS Transport 5 file for such a "European" character using ISO-8859-1?

The answer is: "start praying".

As SAS Transport 5 officially only supports US-ASCII, it will depend on the tool that is used at the receiving site whether it supports the 8 bits, and whether it then also assumes that the encoding is ISO-8859-1. As it was never meant for different encodings, there is also no way in the XPT file itself to state that ISO-8859-1 was used, and that the receiving tool must use the 8th bit.
So, if you do submit an XPT file to the FDA, and used an "extended ASCII" encoding like ISO-8859-1, better inform the FDA about this, stating exactly which 8-bit encoding (there are numerous) was used, and then … start praying.
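
To illustrate why the receiver must be told which encoding was used, here is a small Python sketch that decodes the very same bytes under a few different assumptions. The example word and the list of candidate encodings are my own choice; all codec names are standard Python codecs.

```python
# The French word "Fièvre" written with ISO-8859-1: "è" becomes the single byte 0xE8.
raw = "Fièvre".encode("latin-1")

candidates = ["latin-1", "cp437", "iso8859_7", "ascii", "utf-8"]
decoded = {}
for enc in candidates:
    try:
        decoded[enc] = raw.decode(enc)
    except UnicodeDecodeError:
        decoded[enc] = None             # the bytes are invalid in this encoding

for enc, text in decoded.items():
    print(f"{enc:10} {text if text is not None else '<cannot be decoded>'}")
```

Only the first guess gives back the original word. The other 8-bit code pages silently produce a different character, and strict US-ASCII and UTF-8 readers reject the byte altogether.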

Modern computers and computer languages and modern software are using "UTF-8 encoding". Unlike US-ASCII and its simple extensions, it is a "variable-width" encoding, meaning that for some characters (like "English" characters) it uses only 1 byte, and for some other characters (like "Asian" characters), it uses more than 1 byte.
UTF-8 is "backward compatible" with US-ASCII, meaning that the same 7 bit values (0 or 1) are used when the character is an "ASCII" character. This means that software that is able to read UTF-8 encoded files can also read "US-ASCII"-encoded files. Please note that this is something else than reading the "numeric" values of SAS Transport 5! For those, one needs to do very special things.
UTF-8 uses 1 to 4 bytes for a character. For example, "Hebrew" and "Arabic" characters require two bytes, "Japanese", "Chinese" and "Korean" characters require 3 bytes.
Important to note is also that the encoding is flexible ("variable width"). For example, when one has a mixture of e.g. "English" and "Japanese" characters, one byte will be used for the "English" character, and 3 bytes will be used for the "Japanese" characters. This will be of extreme importance when discussing the use of "mixed" text in SAS Transport 5, using some "tricks", where such flexibility is not present at all.
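
The variable width is easy to check in Python, where `str.encode("utf-8")` returns the actual bytes. The sample characters below are my own picks, one per script.

```python
samples = {
    "English":  "A",
    "European": "é",
    "Hebrew":   "ש",
    "Arabic":   "ع",
    "Japanese": "日",
}

# Number of bytes UTF-8 needs for each sample character
widths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}
for label, n in widths.items():
    print(f"{label:9} U+{ord(samples[label]):04X}  {n} byte(s)")

# Mixed text: every character takes only as many bytes as it needs
mixed = "Fever 日本語"
print(len(mixed), "characters,", len(mixed.encode("utf-8")), "bytes")
```

For the mixed string, the six ASCII characters (including the blank) take one byte each and the three Japanese characters three bytes each, so 9 characters become 15 bytes.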

Four bytes are then used by UTF-8 for e.g. emoji and some mathematical symbols. How a computer program knows whether it needs to read a single byte for a character or several bytes is pretty technical. For those interested, please read the Wikipedia article and look for "continuation bytes". Also note that ISO-8859-1 and UTF-8 are not fully compatible: they only encode the same way for ASCII characters, not for the other "European" characters.
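
In short, the first byte of each UTF-8 sequence announces through its leading bits how many bytes belong to the character, and all following ("continuation") bytes start with the bits 10. A minimal Python sketch of that rule follows; the function name `utf8_sequence_length` is hypothetical, and the sketch does not validate the continuation bytes themselves.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Number of bytes in a UTF-8 sequence, derived from its first byte."""
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx: plain ASCII, 1 byte
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx: 2-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx: 3-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx: 4-byte sequence
        return 4
    raise ValueError("continuation byte (10xxxxxx) or invalid lead byte")


for ch in ["A", "é", "日", "😀"]:
    encoded = ch.encode("utf-8")
    bits = " ".join(format(b, "08b") for b in encoded)
    print(f"U+{ord(ch):04X}  {utf8_sequence_length(encoded[0])} byte(s)  {bits}")

# The same "European" character in ISO-8859-1 and in UTF-8
latin1_bytes = "é".encode("latin-1")   # one byte: E9
utf8_bytes = "é".encode("utf-8")       # two bytes: C3 A9
print(latin1_bytes.hex(), "versus", utf8_bytes.hex())
```

The last two lines show the incompatibility: the single ISO-8859-1 byte E9 is not what a UTF-8 reader expects for "é".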

As SAS Transport 5 does not support UTF-8 at all, our Japanese colleagues have started using "tricks" to also be able to represent "Japanese" characters in SAS Transport 5 files. One of these is to use the SHIFT-JIS encoding, which uses one to two bytes per character.
This looks like a nice "workaround", but it isn't, as it has severe consequences.

First of all, as SAS Transport 5 does not have a mechanism to indicate which encoding was used (as it supposes US-ASCII), one needs to inform the PMDA (Japanese regulatory authority) which files are SHIFT-JIS encoded (using 1-2 bytes per character), and which are still US-ASCII encoded, so that each of them is correctly treated by the reviewer. Start praying – in Japanese …

In Japan, a single text message can be composed of characters from four different writing systems. Kanji has tens of thousands of characters, which are represented by pictures. Hiragana and Katakana are syllabaries, each containing about 80 sounds, which are also represented as ideographs. The Roman characters include some 95 letters, digits, and punctuation marks. These must all be covered. This of course requires a lot from the software. Most modern software however uses UTF-8 as the default, and it would thus be extremely expensive to add support for SHIFT-JIS. Probably only Japanese companies will want to make such investments. Also note that SHIFT-JIS and UTF-8 are not compatible.
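
The incompatibility between SHIFT-JIS and UTF-8 can be demonstrated with Python's built-in `shift_jis` codec. The sample text is my own; this is a sketch, not a statement about how any regulatory tool behaves.

```python
text = "日本語 ABC"                      # mixed Japanese and Roman characters

sjis_bytes = text.encode("shift_jis")   # 2 bytes per kanji, 1 per ASCII character
utf8_bytes = text.encode("utf-8")       # 3 bytes per kanji, 1 per ASCII character
print(len(text), "characters")
print(len(sjis_bytes), "bytes in SHIFT-JIS:", sjis_bytes.hex(" "))
print(len(utf8_bytes), "bytes in UTF-8:    ", utf8_bytes.hex(" "))

# A reader that assumes UTF-8 cannot make sense of the SHIFT-JIS bytes
try:
    sjis_bytes.decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False
print("SHIFT-JIS bytes readable as UTF-8:", readable_as_utf8)
```

Only the ASCII part ("ABC" and the blank) has the same bytes in both encodings.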

Here is an example of such mixed content taken from the Apache C++ Standard Library web page:


SAS Transport 5 however is a "fixed length" format, meaning that for variable values, one defines the maximum length (in ASCII characters) for each variable, and that when a value is shorter, the remainder is filled with blanks. For example, when I have two records in MH, with values for MHDECOD being "Fever" (5 characters) and "Amygdalohippocampectomy" (23 characters), then 23 characters will always be used in the file, even for "Fever", which has only 5 characters and to which 18 blanks are then added. I.e.:

|A|m|y|g|d|a|l|o|h|i|p|p|o|c|a|m|p|e|c|t|o|m|y|
|F|e|v|e|r| | | | | | | | | | | | | | | | | | |


Total number of bytes used: 46.
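
The same arithmetic as a small Python sketch: pad every value with blanks to the length of the longest one, then count the bytes. This only mimics the padding of the values; it does not write a real XPT file.

```python
values = ["Amygdalohippocampectomy", "Fever"]

# The allotted length is the length of the longest value, in bytes
width = max(len(v.encode("ascii")) for v in values)

# Shorter values are padded on the right with blanks (0x20)
records = [v.encode("ascii").ljust(width, b" ") for v in values]

for rec in records:
    print(repr(rec), len(rec))
total = sum(len(rec) for rec in records)
print("Total number of bytes used:", total)
```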

This makes SAS Transport 5 extremely inefficient, a fact that not many people are aware of. So, essentially, the famous FDA/Pinnacle21 rule "The allotted length for each column containing character (text) data should be set to the maximum length of the variable used across all datasets in the study except for suppqual datasets" has been introduced (by FDA) to counteract the inefficiency of its own mandated format ("build badly first, repair later" principle).

So, in the case of SHIFT-JIS, what is the "maximum length"? Is it the number of bytes (personally, I presume so), or the length in characters? And what about the "200 character limitation"? Essentially, it cannot hold as a number of characters when using an encoding that (partially) uses two bytes for a character, such as SHIFT-JIS. As two bytes are necessary for most of the "Japanese" characters, this reduces the maximum number of (Japanese) characters to 100. Now, one usually requires fewer characters in Japanese for the same English expression, but it still is a concern. And what about the "8-character" and "40-character" limitations? Does this mean a reduction to 4 and 20 characters when using SHIFT-JIS? Of course, for most variables where Japanese (verbatim) terms will occur, the latter limitations will not be so important.
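
Under my presumption that the limits are counted in bytes, the effect on the number of usable characters can be calculated as follows. The helper `max_chars` and the sample character are only for illustration.

```python
LIMITS_IN_BYTES = (200, 40, 8)


def max_chars(limit_bytes: int, sample_char: str, encoding: str) -> int:
    """How many copies of sample_char fit into limit_bytes in the given encoding."""
    bytes_per_char = len(sample_char.encode(encoding))
    return limit_bytes // bytes_per_char


for encoding in ("ascii", "shift_jis"):
    sample = "A" if encoding == "ascii" else "語"
    fits = [max_chars(limit, sample, encoding) for limit in LIMITS_IN_BYTES]
    print(f"{encoding:10} {fits}")
```

For a text consisting only of two-byte SHIFT-JIS characters, this gives 100, 20 and 4 characters instead of 200, 40 and 8.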

The latest FDA "Study Data Technical Conformance Guide" contains a very weird paragraph about non-ASCII characters.

Here is a screenshot:

What is meant here?  

If one has read the above on UTF-8, being a "variable multibyte encoding", and considers the use case of the SAS Transport 5 format, one can only come to the conclusion that this doesn't make sense. It looks to me as if this has been written by someone who does not have any technical insight into encodings. Whereas one could use an "extended ASCII" encoding like ISO-8859-1 (1 byte = 1 extended character) and "hope and pray" that it is readable by the tools at the FDA, I would not know how to generate an XPT file with UTF-8 encoding and at the same time comply with the FDA rules. It would require a careful counting of the number of bytes used (not the number of characters), as no more than 200 bytes can be used for a variable value. It would probably also generate a huge number of errors when validating with the validation software the FDA is using.
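
Such a byte count could look like the following Python sketch. The function `check_value` is hypothetical, and the 200-byte limit is taken from the discussion above; how the FDA tools actually count is not known to me.

```python
def check_value(value: str, encoding: str = "utf-8", limit: int = 200):
    """Return (number of characters, number of bytes, fits within the limit)."""
    n_chars = len(value)
    n_bytes = len(value.encode(encoding))
    return n_chars, n_bytes, n_bytes <= limit


comment = "日本語" * 30                      # a 90-character Japanese text
print(check_value(comment))                 # counted as UTF-8
print(check_value(comment, "shift_jis"))    # counted as SHIFT-JIS
```

The same 90 characters need 270 bytes in UTF-8, which exceeds the limit, but only 180 bytes in SHIFT-JIS, so a check on the number of characters alone says nothing.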

I haven't tried it out yet, but I intend to generate such a file, e.g. with Chinese or Japanese comments (SDTM-CO domain), encoded as UTF-8 and SHIFT-JIS, using XPT format. I will do so in the next months, and hope to have everything ready by the 2020 CDISC US Interchange, where I would like to present my findings as a poster. I will also try to make calculations about the effect on file size, and compare that with using the CDISC Dataset-XML format, a modern, XML-based format and standard (using UTF-8 encoding by default) developed by CDISC as a replacement for the outdated SAS Transport 5 format.



 

2 comments:

  1. Very informative article. Thank you.
    I asked the FDA eData Technical Assistance group about this recently, and I was told to remap ASCII characters with codes > 128 to their analogs in the 32 – 128 character range. In our data, this is almost exclusively seen in the Conmed verbatim terms.

  2. Thanks! I wonder what FDA means by "remap ASCII characters with codes > 128 to their analogs" ...
    For example, what would be the "analog" for character 143 (Å)?
    I have the impression FDA has no clue about "non-English" characters. With 13.5% Spanish-speaking population in the USA, this looks frightening to me...
