Friday, March 27, 2015

Creating define.xml - best and worst practices

Define.xml is a CDISC standard, based on XML, allowing sponsors to provide metadata for SDTM, ADaM and SEND electronic submissions to the FDA and the Japanese Ministry of Health.
But define.xml is more than that: it is also a very good way to exchange metadata for SDTM, ADaM and SEND datasets between partners in the clinical trial world.

As a CDISC define.xml instructor, I am often asked what the best practices are for generating define.xml files that not only conform to the standard, but also correctly and concisely describe the metadata for SDTM, ADaM or SEND datasets.

In this blog post, I will discuss a few of what are, in my opinion, the best practices for generating define.xml. Although nobody ever asks for them, I will also list what are, in my opinion, the worst practices for creating define.xml.

In the following, I will use SDTM a lot as an example, but usually the same applies to ADaM and SEND.

Best practices

If you are a sponsor outsourcing the generation of SDTM datasets (e.g. to a CRO), the best practice is to generate a define.xml that the company providing the SDTM datasets can use as a specification of what will have to be delivered. This means that it can be an extremely good idea to generate a define.xml for a study even before the study has started. If you have run similar studies before for which you already have a define.xml, this is pretty straightforward. The define.xml that you supply as a specification does not need to be complete, but if you designed your study well (taking into account that you want to submit it later), it should be nearly complete.

Here is a slide set of my colleague Philippe Verplancke about this, even showing how a define.xml can be used as a direct source for CRF design for the next study.

Another best practice is to design your study using a study design tool, generate an ODM metadata file from it, and do the mapping between study design and SDTM even before the study starts. There are several good software tools on the market for doing so. One of my customers even does the complete mapping between ODM and SDTM long before the study starts, as a quality control for the study design. The idea is very simple: "if we cannot map the study design to SDTM now (before the study starts), we will be in big trouble later". The define.xml itself is used to store the mapping instructions.

Even if your study has already started, it is a very good practice to do this mapping long before the database closes. Partial data can be used to test the mappings, and if you do it well, you only need to run the mappings once (maybe taking half an hour or so) after database closure. And you already have your define.xml - you only need to remove the mapping instructions from it.

The reason for this is that the best tools on the market all use the same method: while you create the mappings (usually by drag-and-drop, which generates mapping scripts that can then be refined), a define.xml is kept in the background (also holding the mappings) and is automatically synchronized. So if you change something in the mapping, this is automatically reflected in the underlying define.xml.

Whether you use a special tool or a statistical package, it is essential that you have a process in place in which your define.xml is fully synchronized with your generated or to-be-generated SDTM datasets.
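As a rough illustration of what "in sync" can mean in practice, here is a minimal Python sketch. It assumes that the variable names per dataset have already been extracted, once from the define.xml and once from the datasets themselves. The function name `find_mismatches` and the example data are my own invention for illustration, not part of any CDISC tool.

```python
def find_mismatches(define_vars, dataset_vars):
    """Report, per dataset, variables declared in only one of the two sources.

    Both arguments map a dataset name (e.g. "DM") to a list of variable names.
    """
    report = {}
    for name in sorted(set(define_vars) | set(dataset_vars)):
        declared = set(define_vars.get(name, []))
        actual = set(dataset_vars.get(name, []))
        if declared != actual:
            report[name] = {
                "only_in_define": sorted(declared - actual),
                "only_in_data": sorted(actual - declared),
            }
    return report


# Hypothetical example: AGE is present in the data but not in the define.xml.
define_vars = {
    "DM": ["STUDYID", "USUBJID", "SEX"],
    "AE": ["STUDYID", "USUBJID", "AETERM"],
}
dataset_vars = {
    "DM": ["STUDYID", "USUBJID", "SEX", "AGE"],
    "AE": ["STUDYID", "USUBJID", "AETERM"],
}

print(find_mismatches(define_vars, dataset_vars))
```

An empty report means that both sides declare the same variables. A real process would of course also compare data types, lengths, codelists and so on.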

Other good practices

Learn XML or have someone in your team that has XML knowledge. My undergraduate students learn XML in just two lectures (each of 90 minutes), and one or two exercise afternoons. Learning XML is easy.
With a little XML knowledge, you can generate a define.xml file for your SDTM datasets starting from an existing sample file (like the one published by the CDISC define.xml team). Estimated effort: about 1-2 days. Use an XML editor (and not NotePad or WordPad or the like - disaster is preprogrammed) - there are even some very good XML editors that are free.
When you edit a define.xml file, be sure to regularly validate it against the define.xml XML-Schema (most XML editors have this functionality), which ensures that your basic structure is correct. Read the "define.xml validation whitepaper" for more information. There are some good tools on the market for validating define.xml files. However, do avoid those that do not allow you to validate your define.xml against the define.xml XML-Schema (such tools have implemented non-official "rule interpretations", which are the vendor's own interpretations of the define.xml specification).

There are also a few tools on the market that allow you to generate / edit define.xml in a user-friendly way without needing to see the XML itself (several new ones have been announced). As long as such a tool allows you to validate your define.xml against the XML-Schema and against other rules from the define.xml specification, to inspect the source XML, and to inspect the result using the stylesheet, it may be a good choice.

Bad practices

A bad practice that is often followed is to try to generate a define.xml file from a set of SDTM files (XPT files) using a "black box" tool (post-generation). You can use such a tool to generate a "first guess" of a define.xml file for your study, but you will still need either to edit it using an XML editor, or to use a tool mentioned before, to complete and fine-tune the define.xml.
Many users, however, expect such tools to generate a correct and ready-to-submit define.xml file. They inspect the result using the stylesheet (without inspecting the source XML or validating it against the XML-Schema), and are surprised that it doesn't work or doesn't provide what they expect.
One such tool allows you to write the "instructions" in an Excel worksheet and then use that to "automatically" create a define.xml file. There is, however, no manual on how to write these instructions. Instead, you are told to take an existing define.xml file (for example one published by CDISC), convert it to an Excel worksheet with such "instructions", and learn from that how to write or complete the worksheet for your own submission. In the time needed to find out how that works, you could already have become an XML expert!

Many people think that define.xml is what they see in the browser (using the stylesheet), not realizing (or not wanting to) that there is an XML file behind that, a file that is machine-readable and that can be used for much more than just displaying the study metadata. Unfortunately, most reviewers at the FDA do not realize this either.
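To make "machine-readable" a bit more concrete, here is a minimal Python sketch that lists the variables per dataset using only the standard library. The embedded XML is a heavily simplified fragment that I wrote for illustration (it is not a valid define.xml and leaves out all define.xml extension attributes). The function name `variables_by_dataset` is likewise just an assumption of mine.

```python
import xml.etree.ElementTree as ET

# Namespace of the ODM 1.3 elements on which define.xml is based.
ODM = "http://www.cdisc.org/ns/odm/v1.3"

# Simplified, illustrative fragment only - a real define.xml contains much more.
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="ST.1">
    <MetaDataVersion OID="MDV.1" Name="Example">
      <ItemGroupDef OID="IG.DM" Name="DM" Repeating="No">
        <ItemRef ItemOID="IT.DM.USUBJID" Mandatory="Yes"/>
        <ItemRef ItemOID="IT.DM.SEX" Mandatory="No"/>
      </ItemGroupDef>
      <ItemDef OID="IT.DM.USUBJID" Name="USUBJID" DataType="text"/>
      <ItemDef OID="IT.DM.SEX" Name="SEX" DataType="text"/>
    </MetaDataVersion>
  </Study>
</ODM>"""


def variables_by_dataset(xml_text):
    """Map each dataset name to its variable names, in ItemRef order."""
    root = ET.fromstring(xml_text)
    # Look-up table from ItemDef OID to variable name.
    names = {d.get("OID"): d.get("Name") for d in root.iter(f"{{{ODM}}}ItemDef")}
    result = {}
    for group in root.iter(f"{{{ODM}}}ItemGroupDef"):
        refs = group.findall(f"{{{ODM}}}ItemRef")
        result[group.get("Name")] = [names.get(r.get("ItemOID")) for r in refs]
    return result


print(variables_by_dataset(SAMPLE))
```

The same few lines of parsing could feed a specification document, a comparison against the actual datasets, or a CRF design tool. None of that is possible if you only look at the rendered view in the browser.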

Worst practices

Some companies have similar processes in place where the define.xml is generated post-SDTM, usually using Excel worksheets, or even using Word documents. The latter is especially dangerous, as XML usually uses UTF-8 encoding (the international standard), and your Word document might be using an encoding that is incompatible with UTF-8. So if you copy and paste from a Word document into an XML document, and you have characters that go beyond 7-bit ASCII, do not be surprised when you see the strangest things in the result.
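The effect is easy to reproduce. The following Python sketch assumes, purely for illustration, that the text coming from the office document is encoded as Windows-1252 (cp1252). Which encoding is actually involved depends on the application and on how the text is exported. The helper `is_valid_utf8` is a name I made up.

```python
def is_valid_utf8(data):
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "Dose: 5 \u00b5g"  # contains the micro sign, which is not 7-bit ASCII

# Case 1: cp1252 bytes end up in a file that claims to be UTF-8.
word_bytes = text.encode("cp1252")
print(is_valid_utf8(word_bytes))  # an XML parser would reject these bytes

# Case 2: UTF-8 bytes are interpreted as cp1252 somewhere along the way.
mojibake = text.encode("utf-8").decode("cp1252")
print(mojibake)  # the micro sign now shows up as two characters
```

In the first case an XML parser will typically stop with an encoding error. In the second case the file stays well-formed but the content is silently corrupted, which is arguably worse.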

  • create your define.xml before your study starts
  • use a tool or process that keeps your define.xml in sync with your mappings between study design and SDTM
  • post-SDTM generation of define.xml is a "last resort" method - try to avoid it
  • do not expect "black-box" tools to generate a "submission-ready" define.xml file for you
  • avoid the use of Excel worksheets and Word documents to generate a define.xml
  • learn XML - it is not difficult! Even if you use good tools, it will help you to understand what you have created
Also read our newest post with even more information.


  1. Excellent summary Jozef! That matches exactly how we see/do things at Formedix.

  2. Jozef, thanks for pointing out the encoding issues when using Word or Excel as source of metadata. This has been a great deal of the issues I have seen with Define-XML.
    Also a pity that the free XML editor, Exchange XML, that you point to has not been developed since 2010. It was a great entry point for people who start exploring XML. As Google Code is shutting down it will soon be gone.
    Great summary indeed!