With the Chinese authorities now accepting CDISC standards, some evolutions in Japan, and the new FDA "Study Data Technical Conformance Guide", there have been a lot of discussions in recent weeks about the use of non-ASCII characters and UTF-8 encoding.
CDISC submission standards are tightly bound to the "SAS Transport 5" format, usually also designated as the "XPT" format. It is a data format stemming from the era of mainframe computers, meant for the exchange of SAS data between IBM mainframes and VAX computers (do you still have one of these?). The specification is said to be "open" and can be found on the SAS website. The format is discouraged for use by the US Library of Congress: "The Library of Congress Recommended Formats Statement (RFS) does not list either of the SAS Transport File Formats as preferred or acceptable for acquiring datasets for its collections because the RFS expresses a preference for platform-independent, character-based formats for datasets".
Although the format is said to be "open", there is a major problem: numeric values use the "IBM mainframe representation", which is no longer used by modern computers. The specification on the SAS website just mentions this, but leaves it to the implementer to find out how this IBM notation for numbers works. So, I do not consider the format to be "open", as it has dependencies on other, undocumented, formats. For example, until now, I haven't been able to find "IBM notation" libraries for modern computer languages such as Java, C# or Python. I had to program the "format" myself, bit by bit …
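For those who are curious what such bit-by-bit programming looks like, here is a minimal Python sketch of how an 8-byte IBM mainframe number can be converted to a modern (IEEE-754) floating point value, assuming the classic System/360 layout (1 sign bit, a 7-bit excess-64 base-16 exponent, and a 56-bit fraction). It is only an illustration: special SAS "missing" values and truncated numerics are not handled here.

```python
import struct

def ibm_to_ieee(raw: bytes) -> float:
    """Convert an 8-byte IBM mainframe (System/360) floating point number,
    as stored in SAS Transport 5, to a regular Python (IEEE-754) float.
    Sketch only: SAS 'missing' values and truncated numerics are not handled."""
    if raw == b'\x00' * 8:
        return 0.0
    (bits,) = struct.unpack('>Q', raw)          # read as a big-endian 64-bit integer
    sign = -1.0 if bits >> 63 else 1.0          # 1 sign bit
    exponent = (bits >> 56) & 0x7F              # 7-bit exponent, excess-64, base 16
    fraction = bits & ((1 << 56) - 1)           # 56-bit fraction, no implicit leading 1
    return sign * (fraction / float(1 << 56)) * 16.0 ** (exponent - 64)

# Example: the IBM representation of 1.0 is 0x4110000000000000
print(ibm_to_ieee(bytes.fromhex('4110000000000000')))   # prints 1.0
```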
Another "feature" of SAS Transport 5 is that it only supports US-ASCII.
So, essentially, it is not possible to represent any characters other than US-ASCII when using SAS Transport 5 as a transport format. The same applies to the newer SAS Transport 8.
So, for non-ASCII characters, people have started using "tricks".
In modern computers, we work with bytes. 1 byte has 8 bits.
Essentially this means that, when using 1 byte per character, one can support
256 different characters. Remark that a lower-case and an upper-case character
are different things.
US-ASCII supports 128 characters, i.e. it uses only 7 of the 8 bits. This is where the "tricks" start …
How a character is represented by bits ("0"s and "1"s) is called its "encoding". There are numerous character encodings in the world, and you have probably run into problems with them when you saw a "very strange" character or a rectangle symbol on your screen where you expected a real character for your language. So, declaring the character encoding, when possible in the file itself, is of the utmost importance when exchanging documents between partners in a process.
For example, here are some encodings from the US-ASCII "code table":
Character        | Bits
A                | 100 0001
a                | 110 0001
Z                | 101 1010
z                | 111 1010
:                | 011 1010
line feed        | 000 1010
carriage return  | 000 1101
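These bit patterns are easy to reproduce yourself, for example with a few lines of Python (just an illustration, nothing XPT-specific):

```python
# Print the 7-bit US-ASCII patterns from the table above.
for label, ch in [('A', 'A'), ('a', 'a'), ('Z', 'Z'), ('z', 'z'),
                  (':', ':'), ('line feed', '\n'), ('carriage return', '\r')]:
    print(f'{label:15s} {ord(ch):07b}')
```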
Most of these characters are "printable characters", a designation stemming from the time when we did not have computer screens (yes, I am that old) and output was only possible on line printers. "Non-printable" characters like the "line feed" or "carriage return" (they are NOT the same) were, and still are, used to tell the printer (nowadays the computer screen) that a new line needs to be started.
Officially, SAS Transport 5 only supports (7-bit) US-ASCII. As this leaves one bit per byte unused on modern computers, people have started using "tricks" to also be able to represent non-ASCII characters.
Most of these are based on "extended ASCII", i.e. all ASCII characters remain as they are, using 7 bits, and additional characters are added that also use the 8th bit. These are however not always standardized: there are many "proprietary" extensions. Some however are standardized, like the ISO-8859-1 encoding, also called "ISO Latin 1".
It uses 8 bits (so exactly 1 byte) per character, and supports most European languages, plus a number of others, like Kurdish and Indonesian. Surprisingly, it does not support the French ligature "œ", nor the euro sign "€".
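As an illustration of "exactly 1 byte per character", here is a small Python sketch (the characters chosen are just examples):

```python
# In ISO-8859-1, ASCII characters keep their 7-bit codes,
# while "extended" characters use the 8th bit as well.
for ch in ['A', 'é', 'ü', 'ñ']:
    (value,) = ch.encode('iso-8859-1')      # exactly one byte per character
    print(ch, format(value, '08b'), hex(value))
```

Note that "A" still starts with a 0 bit, whereas the extended characters all have the 8th bit set.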
So, what happens if we use the full 8 bits in a SAS Transport 5 file for such a "European" character using ISO-8859-1?
The answer is: "start praying".
As SAS Transport 5 officially only supports US-ASCII, it will depend on the tool used at the receiving site whether it supports the 8th bit, and whether it then also assumes that the encoding is ISO-8859-1. As the format was never meant for different encodings, there is also no way in the XPT file itself to state that ISO-8859-1 was used and that the receiving tool must use the 8th bit. So, if you do submit an XPT file to the FDA and used an "extended ASCII" encoding like ISO-8859-1, you had better inform the FDA about this, stating exactly which 8-bit encoding (there are numerous) was used, and then … start praying.
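The following small Python sketch shows what this "praying" is about: the same bytes give different results depending on which encoding the receiving tool assumes (the code pages used here are just examples).

```python
raw = 'Müller'.encode('iso-8859-1')              # the bytes as they would sit in the XPT file
print(raw.decode('iso-8859-1'))                  # Müller -> the tool guessed correctly
print(raw.decode('cp437'))                       # a different character appears for byte 0xFC
print(raw.decode('utf-8', errors='replace'))     # M�ller -> the tool assumed UTF-8
```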
Modern computers, programming languages and modern software use the "UTF-8" encoding. Unlike US-ASCII and its simple extensions, it is a "variable-width" encoding, meaning that for some characters (like "English" characters) it uses only 1 byte, and for other characters (like "Asian" characters) it uses more than 1 byte.
UTF-8 is "backward compatible" with US-ASCII, meaning that the same 7
bit values (0 or 1) are used when the character is a "ASCII"
character. This means that a software that is able to read UTF-8 encoded files
can also read "US-ASCII"-encoded files. Please remark that this is something
else than reading the "numeric" values of SAS Transport 5! For such,
one need to do very special things.
UTF-8 uses 1 to 4 bytes for a character. For example, "Hebrew" and "Arabic" characters require 2 bytes, while "Japanese", "Chinese" and "Korean" characters require 3 bytes.
It is also important to note that the encoding is flexible ("variable width"). For example, when one has a mixture of e.g. "English" and "Japanese" characters, one byte will be used for each "English" character, and 3 bytes for each "Japanese" character. This will be of extreme importance when discussing the use of "mixed" text in SAS Transport 5, using some "tricks", where such flexibility is not present at all.
Four bytes are then used by UTF-8 for e.g. mathematical symbols and emoji. How a computer program knows whether it needs to read a single byte or several bytes for a character is pretty technical. For those interested, please read the Wikipedia article and look for "continuation bytes". Also remark that ISO-8859-1 and UTF-8 are not fully compatible: they only encode the same way for ASCII characters, not for the other "European" characters.
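A few lines of Python illustrate these byte counts (the characters are arbitrary examples):

```python
# How many bytes UTF-8 needs for a single character:
for ch in ['A', 'é', 'א', '日', '😀']:
    print(ch, len(ch.encode('utf-8')), 'byte(s)')     # 1, 2, 2, 3 and 4 bytes

# Mixed "English" and "Japanese" text: 8 characters, but 12 bytes.
mixed = 'Fever 発熱'
print(len(mixed), 'characters,', len(mixed.encode('utf-8')), 'bytes')
```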
As SAS Transport 5 does not support UTF-8 at all, our Japanese colleagues have started using "tricks" to also be able to represent "Japanese" characters in SAS Transport 5 files. One of these is to use the SHIFT-JIS encoding, which uses one to two bytes per character.
This looks like a nice "workaround", but it isn't, as it has severe consequences.
First of all, as SAS Transport 5 does not have a mechanism to indicate which encoding was used (it assumes US-ASCII), one needs to inform the PMDA (the Japanese regulatory authority) which files are SHIFT-JIS encoded (using 1-2 bytes per character) and which are still US-ASCII encoded, so that each of them is treated correctly by the reviewer. Start praying – in Japanese …
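To see why this is dangerous, here is a small Python sketch comparing the two encodings for the Japanese word for "fever" (just an example string):

```python
text = '発熱'                              # "fever" in Japanese, 2 characters
sjis = text.encode('shift_jis')            # 2 bytes per character -> 4 bytes
utf8 = text.encode('utf-8')                # 3 bytes per character -> 6 bytes
print(sjis.hex(), utf8.hex())              # completely different byte sequences

# A reviewer's tool that assumes the wrong encoding either fails ...
try:
    sjis.decode('utf-8')
except UnicodeDecodeError as err:
    print('not valid UTF-8:', err)
# ... or silently shows garbage ("mojibake"):
print(utf8.decode('shift_jis', errors='replace'))
```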
In Japan, a single text message can be composed of characters from four different writing systems. Kanji has tens of thousands of characters, which are ideographs. Hiragana and Katakana are syllabaries, each containing about 80 characters representing sounds. The Roman characters include some 95 letters, digits, and punctuation marks. These must all be covered. This of course requires a lot from the software. Most modern software however uses UTF-8 as the default, and it would thus be extremely expensive to add support for SHIFT-JIS; probably only Japanese companies will want to make such an investment. Also remark that SHIFT-JIS and UTF-8 are not compatible.
Here is an example of such mixed content taken from the Apache C++ Standard Library web page:
SAS Transport 5 however is a "fixed length" format, meaning that for variable values, one defines the maximum length (in ASCII characters) for each variable, and that when a value is shorter, the remainder is filled with blanks. For example, when I have two records in MH, with values for MHDECOD being "Fever" (5 characters) and "Amygdalohippocampectomy" (23 characters), the file will always use 23 characters per value, even for "Fever", which has only 5 characters to which 18 blanks are then added. I.e.:
A | m | y | g | d | a | l | o | h | i | p | p | o | c | a | m | p | e | c | t | o | m | y   (23 bytes)
F | e | v | e | r |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |     (5 characters + 18 blanks = 23 bytes)
Total number of bytes used: 46.
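The padding and the resulting byte count can be reproduced with a few lines of Python (only meant to illustrate the principle, not to write a real XPT file):

```python
# Fixed-length storage: every value of the variable is padded with blanks
# to the length of the longest value (here MHDECOD, 23 characters).
values = ['Amygdalohippocampectomy', 'Fever']
width = max(len(v) for v in values)                    # 23
padded = [v.ljust(width) for v in values]              # 'Fever' + 18 blanks
total_bytes = sum(len(p.encode('ascii')) for p in padded)
print(width, total_bytes)                              # 23 46
```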
The latest FDA "Study Data Technical Conformance Guide" contains a very weird paragraph about non-ASCII characters.
Here is a screenshot:
What is meant here?
If one has read the above on UTF-8 being a "variable multibyte encoding", and considers the use case of the SAS Transport 5 format, one can only come to the conclusion that this doesn't make sense. It looks to me as if this has been written by someone who does not have any technical insight into encodings. Whereas one could use an "extended ASCII" encoding like ISO-8859-1 (1 byte = 1 extended character) and "hope and pray" that it is readable by the tools at the FDA, I would not know how to generate an XPT file with UTF-8 encoding and at the same time comply with the FDA rules. It would require careful counting of the number of bytes used (not the number of characters), as no more than 200 bytes can be used for a variable value. It would probably also generate a huge number of errors when validating with the validation software the FDA is using.
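If one nevertheless wanted to try this, the counting would have to be done on bytes, not on characters. A hypothetical check (my own sketch, not an official FDA or SAS rule implementation) could look like:

```python
def fits_in_xpt_variable(value: str, max_bytes: int = 200,
                         encoding: str = 'utf-8') -> bool:
    """Check whether a text value, once encoded, stays within the
    200-byte limit of a SAS Transport 5 character variable."""
    return len(value.encode(encoding)) <= max_bytes

comment = 'コメント' * 20                    # 80 Japanese characters ("comment" repeated)
print(len(comment))                          # 80 characters ...
print(len(comment.encode('utf-8')))          # ... but 240 bytes in UTF-8
print(fits_in_xpt_variable(comment))         # False: too long for one variable value
```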
I haven't tried it out yet, but I intend to generate such a file, e.g. with Chinese or Japanese comments (SDTM-CO domain), encoded as UTF-8 and as SHIFT-JIS, using the XPT format. I will do so in the next months, and hope to have everything ready by the 2020 CDISC US Interchange, where I would like to present my findings as a poster. I will also try to make calculations about the effect on file size, and compare that with using the CDISC Dataset-XML format, a modern, XML-based (using UTF-8 encoding by default) format and standard developed by CDISC as a replacement for the outdated SAS Transport 5 format.
Very informative article. Thank you.
I asked the FDA eData Technical Assistance group about this recently, and I was told to remap ASCII characters with codes > 128 to their analogs in the 32 – 128 character range. In our data, this is almost exclusively seen in the Conmed verbatim terms.
Thanks! I wonder what FDA means by "remap ASCII characters with codes > 128 to their analogs" ...
For example, what would be the "analog" for character 143 (Å)?
I have the impression the FDA has no clue about "non-English" characters. With a 13.5% Spanish-speaking population in the USA, this looks frightening to me...