NDA Harmonization Standards

Information on what is accepted and methods for harmonizing data to the NIMH Data Archive (NDA) standard can be found here, listed by data type.

Standards for all Collections

All NDA Collections are expected to submit a Research Subject and Pedigree data structure, using either the https://nda.nih.gov/data_structure.html?short_name=ndar_subject01 template or, if omics submission is expected, the https://nda.nih.gov/data_structure.html?short_name=genomics_subject02 template. The Research Subject and Pedigree structure contains core information about all study subjects, including diagnosis, phenotype, and family pedigree; information on biosamples that have been collected and/or analyzed, including sample type, description, and the subject ID in a given biorepository; and subject IDs in other information systems. Investigators should create a new entry in the Research Subject and Pedigree structure every time a subject’s diagnosis or phenotype is updated or a new biosample is collected and/or analyzed.

Clinical/Phenotypic Data Standards

NDA supports an unlimited amount of clinical, demographic, and phenotypic data associated with human subjects research (see Data Dictionary). To ensure the harmonization of data across projects, all data submitted to NDA must conform to a standardized data structure as defined in the Data Dictionary. See the steps to data sharing for more information.

Using Existing Assessments

The NDA Data Dictionary contains thousands of clinical assessments that have been used by mental health researchers to collect data for clinical research studies. Many of these assessments are utilized across different research studies, facilitating harmonization, reproducibility, and secondary data use. NDA encourages researchers to search the Data Dictionary to locate existing structures that can be used to harmonize all their assessments. Researchers can request to extend existing NDA data structures by adding new variables or values to accommodate research plans. Simply email the NDA Help Desk with the request details.

Submitting New Assessments

Projects submitting data to the NDA can provide new assessments as part of creating their Data Expected. Choose "Add new Data Expected" and then select "upload new definition" in the Data Expected web tool. See the tutorial here. New structure requests should be submitted in the same format as all NDA data structures. It is also helpful to provide an electronic copy of the assessment manual with any instructions and supporting documentation such as codebooks. NDA encourages all users submitting new assessments to include a clear description of the new structure that covers the purpose of the assessment, the data collection methodologies, and the assessment source, if applicable. The assessment description will be displayed to all users in the NDA Data Dictionary and will support other projects collecting similar assessments as well as secondary data users. NDA staff will then curate the definition and make it available to the research community for data submission.

Neuro-signal Recordings

NDA accepts evoked response/event-based data from EEG, fMRI, eye tracking, MEG, and EGG experiments. Each of these types of data has one standard structure in the Data Dictionary that can be used to upload the associated files.

These structures allow you to provide required information and contain a data file element that allows you to specify the path of associated data files for upload. To provide information on experimental parameters used to collect the data (e.g. event/task descriptions, acquisition hardware, postprocessing, etc.) on a participant-record level, these data types also require you to define an Experiment. This is done in your project's Collection, and when the Experiment definition is completed it will be assigned an ID number that must be provided in the "experiment_id" element in the appropriate structure. You can find a tutorial on using the definition tool here. When successfully harmonized, the submission will include the CSV submission structure specifying the location of associated data files, and an associated Experiment in your Collection. The CSV and your associated data files are then uploaded using the Validation and Upload Tool.

Additionally, the NDA provides imaging QA for submitted imaging files using the FSL FAST/FIRST computational pipelines. These quality assurance results are now available to authorized users for query and download under 'Evaluated Data' in the Data Dictionary.
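As a rough illustration of how the Experiment ID and data file path come together in the CSV, here is a minimal Python sketch. It is not NDA's tooling. The structure name example_structure, the column data_file1, the sample subject keys, and the two-row header layout are assumptions made for illustration; take the real short_name, version, and element names from the blank template in the Data Dictionary.

```python
import csv
import io

# Hypothetical structure name and version; use the short_name and version
# shown in the Data Dictionary for the structure you are submitting.
SHORT_NAME = "example_structure"
VERSION = "01"

# Column names other than subjectkey and experiment_id are placeholders.
COLUMNS = ["subjectkey", "experiment_id", "data_file1"]


def build_submission_csv(records, experiment_id):
    """Return CSV text with one row per recording, each tagged with the Experiment ID."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Assumed layout: first row identifies the structure, second row names the columns.
    writer.writerow([SHORT_NAME, VERSION])
    writer.writerow(COLUMNS)
    for record in records:
        writer.writerow([record["subjectkey"], experiment_id, record["path"]])
    return buffer.getvalue()


records = [
    {"subjectkey": "NDAR_EXAMPLE1", "path": "recordings/sub1_task.zip"},
    {"subjectkey": "NDAR_EXAMPLE2", "path": "recordings/sub2_task.zip"},
]
print(build_submission_csv(records, experiment_id=1234))
```

Every row carries the same Experiment ID here because all recordings are assumed to come from one Experiment definition; a project with several Experiments would pass the ID per record instead.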

PII Commonly Found in Imaging Headers

The NIMH Data Archive (NDA) Data Submission Agreement (DSA) terms and conditions (clause 3) state submitters agree that all submitted data have been de-identified so that the identities of subjects cannot be readily ascertained or otherwise associated with the data by the NIMH Data Archive staff or secondary data users (https://nda.nih.gov/nda/standard-operating-procedures.html#sop5).

NDA often discovers PII in the header fields of submitted DICOM or NIfTI files. It is helpful to work with your scanner technician to understand their normal process for entering information at the time of data collection, as this can give you hints about where information is most likely to be entered by mistake or which fields only occasionally contain data but should still be checked.

NDA’s recommended approach when submitting these types of files is to inspect the header information before submitting image files and to clean only the fields that contain PII, so that all data is de-identified. Please inspect all fields before submitting images, as PII is sometimes entered in non-standard fields either by accident or by local convention. Wiping all the header information before submitting can limit the ability of other researchers to conduct secondary analyses using the data submitted.

A variety of third-party software tools are available to facilitate this process. Dr. Chris Rorden, at the University of South Carolina, has compiled a list of stand-alone tools that are free to use. Moreover, below is a list of libraries or built-in functions available for interacting with DICOM files in most common programming languages:
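The inspect-then-clean approach recommended above can be sketched in a few lines of Python. This is only an illustration: it assumes the header has already been read into a plain dictionary by whichever DICOM or NIfTI library you use, and the PII_FIELDS set is a hypothetical result of your own inspection, not an NDA-provided list.

```python
# Hypothetical set of header fields judged to hold PII after manual inspection.
PII_FIELDS = {"PatientName", "PatientBirthDate", "PatientAddress"}


def find_populated(header, fields):
    """Return the subset of `fields` that are present and non-empty in `header`."""
    return sorted(name for name in fields if str(header.get(name, "")).strip())


def clean_header(header, fields):
    """Return a copy of `header` with only the listed fields blanked."""
    return {name: ("" if name in fields else value) for name, value in header.items()}


header = {
    "PatientName": "DOE^JANE",
    "PatientBirthDate": "",
    "Modality": "MR",
    "RepetitionTime": "2000",
}
print(find_populated(header, PII_FIELDS))  # ['PatientName']
cleaned = clean_header(header, PII_FIELDS)
print(cleaned["Modality"], cleaned["RepetitionTime"])  # MR 2000
```

Blanking only the listed fields keeps acquisition parameters such as the repetition time intact, which is the reason for cleaning selectively rather than wiping the whole header.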

Omics Definition

To submit omics data to the NDA, you will first need to use the Experiments tab of your project's Collection to create an Experiment that defines parameters like molecule, platform, software, etc. You can find a tutorial on using the definition tool here. Once the Experiment is created, it will be assigned an ID number. You can then use the standard omics definition (genomics_sample structure) to provide the required sample information and enter the Experiment ID in the field of the same name. This CSV file includes a field for specifying the path of the associated omics data files you will upload. Once your Experiment is defined, your CSV populated, and files specified, you can upload this data using the Validation and Upload Tool. Contact us at the NDA Help Desk for more information.

Projects submitting omics data should use the summary data structure Genomics Subject, rather than the standard Research Subject structure mentioned in the section on standards for all Collections.

Acceptable Genomic File Types

The following table provides guidance to users submitting genomic data to NDA and outlines the expected standards.

The data level designates how much processing the data has undergone, with Level 0 being raw data directly from an instrument. NDA will generally not take Level 0 data.

Standards for aligned and variant call file types are indicated in the table and are preferred over platform-specific file types. When a GA4GH standard exists, please use that standard.

Any analyzed data that relates genomic data to phenotype, or other biological states, should be submitted in an NDA study.

Data Level | Description | File Types | Submission
Level 0 | Raw data directly from instrument | | NDA will generally not take this type of data.
Level 1 | Basic data after initial processing of raw input data (DNA sequence) | FASTQ | NDA encourages submission of FASTQ files only when CRAM (or BAM) files containing both aligned and unaligned reads are not submitted.
Level 2 | Data after an initial round of processing or computation to clean the data and assess basic quality measures (Aligned DNA sequence) | BAM, CRAM, BAI | NDA encourages submission of CRAM (and CRAI) files, rather than BAM (and BAI) files.
Level 3 | Analysis to identify genetic variants, gene expression patterns, or other features of the dataset (Variant Call Format) | VCF, PLINK | NDA encourages submission of VCF files with associated information such as the sequencing technology, the type of biological variation captured in the VCF, and the extent of variant annotation. PLINK files can also be submitted. Submission of PLINK files should include PED+MAP, or TPED+TFAM, or BED+FAM+BIM. BED files can also be uploaded.
Other | Misc, Platform Specific File Types and legacy data | FAST5, SFF, HDF5, tbi | NDA will take other formats/types of experimental genomic data as appropriate.
Other | SNP Array | CEL, IDAT | NDA will take other formats/types of experimental genomic data as appropriate.
Analyzed Data | Final analysis that relates the genomic data to phenotype or other biological states (can include structural variant data) | XHM and others |
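
The PLINK combinations listed in the table above lend themselves to a quick pre-submission check. The following Python sketch is one possible approach and is not part of any NDA tool. It assumes that the files of one PLINK set share a base name and differ only by extension, which is a common convention but not something the table states.

```python
from pathlib import PurePath

# Accepted PLINK combinations, taken from the table above.
PLINK_SETS = [
    {".ped", ".map"},
    {".tped", ".tfam"},
    {".bed", ".fam", ".bim"},
]


def complete_plink_sets(filenames):
    """Map each file stem to True if its extensions cover one accepted PLINK set."""
    by_stem = {}
    for name in filenames:
        path = PurePath(name)
        by_stem.setdefault(path.stem, set()).add(path.suffix.lower())
    return {
        stem: any(required <= found for required in PLINK_SETS)
        for stem, found in by_stem.items()
    }


files = ["cohort.bed", "cohort.bim", "cohort.fam", "pilot.ped"]
print(complete_plink_sets(files))  # {'cohort': True, 'pilot': False}
```

In this example, pilot.ped is flagged because its companion MAP file is missing.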

NDA GUID

GUID Training

The NDA GUID is a subject ID that allows researchers to share data specific to a study participant without exposing personally identifiable information (PII) and makes it possible to match participants across labs and research data repositories. The NDA GUID is the subject ID standard developed for autism research and now adopted across mental health. Every data structure in the NDA Data Dictionary includes this identifier (labeled as the element subjectkey). Additionally, the GUID is used by researchers publishing the results of primary or secondary analyses on data shared through NDA to associate subjects to cohorts in an NDA Study. This allows a researcher to link publications directly to data in NDA (see NDA Study).

The GUID Tool is an electronic application, also available as a command-line application, that you can launch directly on Windows, Mac, or Linux operating systems. It supports single-subject data entry or bulk GUID generation. Email us at NDAHelp@mail.nih.gov for information on the command-line tool.

Creating a GUID requires an individual's legal name at birth, date of birth, sex, and city/municipality of birth. Because information on the birth certificate is constant over an individual's life, it is very important to include the information as it appears on the birth certificate. Otherwise, a subject mismatch will occur if the research subject enrolls in other autism research studies and another source is used. When generating GUIDs for twin subjects, the Get GUIDs for Multiple Subjects function must be used as described below in order to prevent a false positive match.

If you are submitting data to NDA, you can check the box to request access to the GUID Tool when creating your account. Please contact us if you already have an account and need access to the GUID Tool.

You can find more information about this feature on the NDA GUID website.

Resolve Subject Identifiers

The mental health research community has standardized on the NDA GUID as its cross-project subject identifier. However, many other identifiers remain in use by the research community. To resolve the appropriate GUID/subjectkey and ensure that no duplicate subjects exist in data retrieved from NDA, use the Resolve Subject Identifiers interface (single or multiple entries using the CSV template).

If a match is not found, NDA has not yet received that subject identifier. To add subject identifier associations to NDA, include the subject identifier and submit to NDA using one of our Resolve Identifiers data structures (e.g. ndar_subject, genomics_subject). We will then resolve the identifiers for you within NDA allowing us to collectively fix duplicate subject identifiers that are used in different systems. Note that NDA pseudoGUID promotions are automatically applied to the repository. If a source exists that we currently don't support, please contact us at the NDA Help Desk so that we can add it.

Data Definition

NDA has worked with the mental health research community to create a Data Dictionary containing standard structures for hundreds of assessments.

Here are a few notes about the data dictionary:

  • You can browse structures by Type (e.g. Omics, Neurosignal Recordings, Clinical Assessments), Source (e.g. NDAR, PediatricMRI, AGRE, NDCT) or Category (e.g. Behavior, IQ) to identify available data structures, as well as refine the results displayed with a text search.
  • By clicking on the name of a data structure, you can view a list of data elements and their attributes, download the detailed definition and a blank template for the submission of data, and see any related URLs.
  • IQ descriptions have been removed by request but are available if needed.
  • The column "Submission" indicates whether NDA currently accepts data uploads of this structure. If Submission is "Not Allowed" a more current measure usually exists.
  • There is a link to a Change History on each structure page that shows all the changes to the structure made within the last six months.
  • The Alias column in the structure displays other names that the Validation and Upload Tool will recognize for a specific data element.
  • NDA supports translation for data values, allowing data to be converted from a lab-specific value to the NDA-recognized value (e.g. Male to M). While this may be helpful to labs that have already collected data using a different set of values, most labs that are not using the standard should consider performing this conversion prior to submission.
  • A web service into our Data Dictionary is available with no authentication required at https://nda.nih.gov/api/datadictionary/v2/docs/swagger-ui.html. Please contact us with any questions.
  • For Autism Centers of Excellence II grantees, the ACE Common Measures Version 2 are used across projects, replacing the original ACE Common Measures, which have been deprecated.
  • Research Subject: This is a general summary structure that allows you to provide one record for each participant indicating the NDAR clinical diagnosis. If a clinical diagnosis is not provided elsewhere, which is typical for control subjects, a diagnosis can be provided using this measure. Additionally, this data structure is used to provide subject identifiers beyond the NDA GUID (e.g. AGRE, Rutgers, SFARI), allowing us to match subjects across repositories (see Resolve Identifiers). The same structure is also used by all projects collecting data not related to ASD.
  • Genomics Subject: This structure is required for Omics submission to NDAR in place of Research Subject. It is similar to Research Subject, but includes other data-/bio-repository identifiers that also allow us to match subjects across these repositories (see Resolve Identifiers).
  • Genetic Test: This structure allows you to specify a genetic test used and its result. It should currently include options for all known genetic tests. Please contact us at the NDA Help Desk if a new test is not represented in this form.
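
For labs that choose to convert values before submission, as the note on value translation suggests, the conversion can be as simple as a lookup table. The Python sketch below is illustrative only. The Male to M pair comes from the note above; the element name sex and the Female to F pair are assumptions, so check the valid values of each element in the Data Dictionary.

```python
# Hypothetical lab-specific values mapped to NDA-recognized values.
# Only the Male -> M pair comes from the text; the rest are illustrative.
TRANSLATIONS = {
    "sex": {"Male": "M", "Female": "F"},
}


def translate_record(record, translations):
    """Return a copy of `record` with known lab values replaced by NDA values."""
    translated = {}
    for element, value in record.items():
        mapping = translations.get(element, {})
        # Values without a translation entry pass through unchanged.
        translated[element] = mapping.get(value, value)
    return translated


print(translate_record({"sex": "Male", "interview_age": "120"}, TRANSLATIONS))
# {'sex': 'M', 'interview_age': '120'}
```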

Data Validation

NDA requires all data to be successfully validated prior to submission. This means that researchers submitting data must use the Validation and Upload Tool, available freely from NDA websites, to check their files against the standards defined in the Data Dictionary structure. Essentially, the tool allows you to specify where your data is located or drag and drop your files in, and will then inspect your files to determine which data structures they use (short_name and version are used by NDA to identify the data structure), and verify the data in your records conforms to the definition. When your data passes validation, you can then use the same tool to create and upload a submission package recognized by NDA.

A few notes about the details of using the Validation and Upload Tool:

  • Fields marked as Required in the data structure must be a column in your data and cannot be null/blank. If you do not have the data for that element, you must positively identify that the data is not available as defined by the valid values. If the valid values do not provide such an entry, contact us and we will add it.
  • Empty Recommended fields (as indicated in the same "Required" column) will not prevent submission. Please note that all item-level details are expected. The Validation and Upload Tool provides a warning if data for a Recommended field is null and no warning if data for an optional field is null.
  • The Validation and Upload Tool will test each row to ensure that it is harmonized to the Data Dictionary and will validate that the GUID/subjectkey exists within NDA. Each field must conform to the data element's value range if one has been defined.
  • The notation of "::" is used to indicate a range. For instance, 0::1200 for interview_age means within a range of 0 to 1200 months old.
  • Associated files (e.g. genomic and imaging files) do not need to be loaded into the tool. When the file is validated upon loading, or you re-run the validation, the tool will check for the existence of the file in the specified location.
  • The NDA is currently piloting programmatic data submission through web services. If you are interested in joining this pilot, please contact the NDA Help Desk.
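
To make the Required and range rules above concrete, here is a small Python sketch of the kind of check they describe. It is not the Validation and Upload Tool and does not reproduce its behavior. It assumes that both ends of a "::" range are inclusive and covers only numeric ranges; GUID lookup and other valid-value notations are out of scope.

```python
def parse_range(spec):
    """Parse a 'low::high' value range into a (low, high) tuple of floats."""
    low, high = spec.split("::")
    return float(low), float(high)


def check_row(row, required, ranges):
    """Return a list of problems found in one record.

    `required` lists element names that must be present and non-blank;
    `ranges` maps element names to 'low::high' range strings.
    """
    problems = []
    for name in required:
        if not str(row.get(name, "")).strip():
            problems.append(f"{name}: required value is missing")
    for name, spec in ranges.items():
        value = str(row.get(name, "")).strip()
        if not value:
            continue  # blank values are handled by the required check
        low, high = parse_range(spec)
        try:
            number = float(value)
        except ValueError:
            problems.append(f"{name}: {value!r} is not numeric")
            continue
        # Assumption: both ends of the range are inclusive.
        if not low <= number <= high:
            problems.append(f"{name}: {value} is outside {spec}")
    return problems


row = {"subjectkey": "NDAR_EXAMPLE1", "interview_age": "1500"}
print(check_row(row, required=["subjectkey", "interview_age"],
                ranges={"interview_age": "0::1200"}))
# ['interview_age: 1500 is outside 0::1200']
```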

For information on how to use the NDA Validation and Upload Tool, please review the tutorials here.

Submitting data in BIDS format

To support the BIDS format for imaging data, use the Manifest data element. The Manifest data element is like the File data element in that the data submission template specifies the location of an XML or JSON file that lists a collection of files. This makes it possible to create NDA data structures that describe a collection of related file resources. The files included in the Manifest are treated as associated files for purposes of submission. They are ingested and stored as individual objects in AWS S3 object storage, which also enables users to directly access specific files from the collection.

Click here for more information about Manifest data elements, or refer to our GitHub repository for examples and helper scripts.
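
To give a feel for what a JSON file listing a collection of files might look like, the following Python sketch walks a directory and records each file's relative path and size. The layout, a single "files" list with "path" and "size" keys, is a hypothetical stand-in and not NDA's Manifest schema; rely on the examples and helper scripts in the GitHub repository for the real format.

```python
import json
from pathlib import Path


def build_manifest(root):
    """List every file under `root` with its relative path and size in bytes."""
    root = Path(root)
    files = [
        {"path": p.relative_to(root).as_posix(), "size": p.stat().st_size}
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]
    # Hypothetical layout: a single "files" list. The real schema may differ.
    return {"files": files}


# Example usage (the directory name is a placeholder):
# print(json.dumps(build_manifest("my_bids_dataset"), indent=2))
```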

Submitting Data from REDCap

For researchers who use REDCap for data acquisition purposes, data transformations are often necessary to comply with the NIMH Data Archive (NDA) data harmonization standards. Assigned data curators will help with structure creation and variable mapping. Some examples of data transformations that may be necessary to REDCap data are as follows:

  • Splitting REDCap summary forms into individual NDA structures
  • Merging outputs from REDCap multichoice answers into one NDA element
  • Merging REDCap timepoint-dependent variables into NDA elements based on interview_date data
  • Using consistent names for each unique REDCap field
  • Editing REDCap field descriptions so that they read correctly on their own, as if there were no descriptive fields
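
As one example of the transformations listed above, merging multichoice outputs into a single element could look like the Python sketch below. It assumes the REDCap export uses one column per checkbox choice, named field___code and holding 1 when checked. The semicolon separator and the field name race are placeholders; your assigned data curator will confirm the format the target NDA element expects.

```python
def merge_checkbox_columns(record, field, separator=";"):
    """Collapse REDCap checkbox columns for `field` into one delimited value.

    Assumes the export uses one column per choice, named '<field>___<code>',
    holding '1' when the choice was checked.
    """
    prefix = field + "___"
    # Keep every column that does not belong to this checkbox field.
    merged = {k: v for k, v in record.items() if not k.startswith(prefix)}
    checked = [
        k[len(prefix):]
        for k, v in record.items()
        if k.startswith(prefix) and str(v).strip() == "1"
    ]
    merged[field] = separator.join(checked)
    return merged


row = {"record_id": "7", "race___1": "1", "race___2": "0", "race___3": "1"}
print(merge_checkbox_columns(row, "race"))
# {'record_id': '7', 'race': '1;3'}
```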

For more information, please review the Guidance for Preparing REDCap Data for NDA.
