FAIR data and FAIRtracks
Track data should be deposited in ways that allow for machine actionability, in line with the FAIR principles
The FAIR data principles provide technical guidelines to enable the Findability, Accessibility, Interoperability, and Reusability (FAIR) of research data. The main focus is on the machine-actionability of these aspects, i.e. the technical capability of performing these operations in an automated way, with minimal human intervention.
Metadata. Metadata and metadata models play a major role in this process, and should contain:
- Global and persistent identifiers to datasets
- A number of attributes providing descriptive information about the context, quality, condition, and characteristics of the data
- Links from the metadata attributes to controlled, shared vocabularies (or ontologies).
Identifiers and ontologies. To enable machine-actionability, the metadata needs to be indexed in a searchable resource and made retrievable via the identifiers using a standardized communication protocol. Moreover, a high level of standardization is required to achieve semantic interoperability allowing, e.g., for integration of different datasets. Linking metadata fields to ontologies provides context to the dataset as a self-describing information bundle where the links to ontologies provide the foundation to machine interpretation, inference, and logic.
Track data and FAIR principles. The degree to which deposited track data complies with the FAIR principles varies greatly, from near-perfect FAIRification practices in the context of certain consortia to the almost complete lack of metadata linked to track files in a range of smaller projects. Some common issues are listed in Figure 1.2. One of the major weaknesses is the lack of suitable uniform metadata schemas that can work across track collections. The lack of uniform metadata strongly limits the possibility of reusing or repurposing track data and hinders automation of these processes, especially when it comes to the "long arm" of deposited track data files. Furthermore, the lack of provenance information might introduce artefacts in the analyses. This lack of proper annotations and of a well-defined and universally adopted metadata standard is related to the lack of a central repository for track data, as described in the section Track collections.
The ambitions of the FAIRtracks project are two-fold:
Provide a set of pragmatic metadata schemas for genomic tracks that comply with the FAIR principles and are adopted and further developed by the community as a minimal metadata exchange standard, providing a unified view into both:
- novel track data depositions
- legacy track collections/data portals.
Provide a set of services to be integrated with downstream tools and libraries so that analytical end users can more easily discover and reuse the massive amounts of track data that have been and are being created:
- for various species
- from different types of sample material
- by applying diverse types of experiment assays and in silico processing workflows.
See Figure 1.3 for an overview of the secondary data life cycle that falls within the scope of the FAIRtracks project.
Assign a persistent reference to your published track data
FAIR principle F1 stipulates the need to assign identifiers to data and metadata that are both 1) globally unique and 2) persistent. The GO FAIR website page on the topic further asserts:
Principle F1 is arguably the most important because it will be hard to achieve other aspects of FAIR without globally unique and persistent identifiers.
Globally unique and persistent identifiers allow the linking of research data with different aspects of the research environment (Figure 3.1).
The need for track file identifiers: Track data files are seldom assigned identifiers directly; often it is only the raw sequence files used to generate the track files that are assigned identifiers, typically the accession numbers to data repositories such as the Sequence Read Archive (SRA) or the European Genome-Phenome Archive (EGA). The ENCODE project represents a notable exception: each track is associated with an identifier resolvable through Identifiers.org (see Figure 3.2) and a dedicated web page. Furthermore, a universally accessible service to assign and register identifiers to single track files and collections is currently missing.
We therefore strongly recommend the implementation of a track registry
Track files contain condensed data from bioinformatics workflows. They therefore depend on specific parameter settings and cannot be perfectly recreated from the raw data without also perfectly reproducing the full analysis workflow, which is often a difficult task. We therefore strongly recommend the implementation of a track registry that preserves the full context of tracks by assigning global identifiers not only to the track data files but also to the associated metadata.
Built for track file identifiers: The FAIRtracks draft standard is developed from the ground up to support globally unique and persistent identifiers for track files and could be suitable for use as a basis for a potential global registry of track metadata (see Figure 3.3). For now, global track identifiers are allowed, but not enforced by the FAIRtracks standard. Instead, we require the inclusion of local track identifiers within the dataset as well as Uniform Resource Locators (URLs) to track files, which, unfortunately, come without any guarantees of persistence or uniqueness. The FAIRtracks standard still provides globally unique and persistent identifiers to track files in an indirect manner, using document DOIs.
DOI as document identifier: If a direct identifier is not attached to a track file, the identifier of a parent record (e.g. study or experiment) can be used instead. On top of this, FAIRtracks requires a global identifier for the metadata file itself, in the form of a Digital Object Identifier (DOI). In principle, a track file can thus be uniquely pinpointed by a joint identifier containing the DOI of the FAIRtracks document and the locally unique track identifier. As our policy requires support for DOI versioning and DOI reservation prior to publication, we currently recommend Zenodo for publishing FAIRtracks documents. We would extend our list of recommended repositories and archives to any domain-specific services meeting our requirements.
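The joint identifier described above can be sketched in a few lines of Python. This is only an illustration: the text does not define a serialization for the joint identifier, so the "#" separator, the class name `JointTrackId`, and the placeholder DOI and track id below are all assumptions.

```python
from typing import NamedTuple


class JointTrackId(NamedTuple):
    """DOI of a FAIRtracks document paired with a locally unique track id."""

    document_doi: str
    local_track_id: str

    def __str__(self) -> str:
        # The "#" separator is an illustrative choice, not part of the standard.
        return f"{self.document_doi}#{self.local_track_id}"


def parse_joint_track_id(text: str) -> JointTrackId:
    """Split a joint identifier at the last '#' into DOI and local track id."""
    doi, separator, local_id = text.rpartition("#")
    if not separator or not doi or not local_id:
        raise ValueError(f"not a joint track identifier: {text!r}")
    return JointTrackId(doi, local_id)


# Placeholder values, not a real DOI or track.
joint = JointTrackId("doi:10.5281/zenodo.0000000", "track_001")
print(joint)
```

Because the DOI is versioned, a new version of the metadata document would yield a new joint identifier even when the local track id is unchanged.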
"Mixtape" track collection identifiers: Apart from the main use case of annotating primary track collections deposited in some repository, FAIRtracks is designed to also allow a more novel use case: to annotate secondary "mixtape" track collections of track files originating from different primary sources. The main example of this use case is to annotate the exact track data files used to generate the findings of a scientific publication, whether these track files represent novel data, are directly reused from other repositories, are regenerated from the raw data, or are in other ways derived from the original track files. To allow the provenance of such "mixtape" reuse of tracks to be recorded, assigning globally unique and persistent identifiers to track collections would be advantageous. Full support for secondary track collections is scheduled for version 2 of the FAIRtracks standard (coming soon). Currently, this concept is most fully developed in the form of GSuite files in the context of the GSuite HyperBrowser.
References to external records: FAIRtracks supports and recommends the inclusion of global identifiers to external records containing detailed metadata. We require these global identifiers to be represented in Compact Uniform Resource Identifier (CURIE) form, resolvable through Identifiers.org (see Figures 3.2 and 3.3). A mapping service from existing URIs to the corresponding CURIEs is desirable, as it would ease the conversion of existing metadata to the FAIRtracks standard.
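As a rough illustration of the CURIE form, the sketch below splits a CURIE into prefix and accession and builds the corresponding Identifiers.org URL. The regular expression is a deliberately loose assumption (each registered namespace has its own stricter pattern), the function names are hypothetical, and the accession in the example is illustrative only.

```python
import re
from typing import Tuple

# Loose "prefix:accession" pattern; real namespaces define stricter rules.
CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z0-9_.]+):(?P<accession>\S+)$")


def split_curie(curie: str) -> Tuple[str, str]:
    """Return (prefix, accession) or raise ValueError for malformed input."""
    match = CURIE_PATTERN.match(curie)
    if match is None:
        raise ValueError(f"not a CURIE: {curie!r}")
    return match["prefix"], match["accession"]


def identifiers_org_url(curie: str) -> str:
    """Build the resolver URL for a CURIE; no network request is made."""
    prefix, accession = split_curie(curie)
    return f"https://identifiers.org/{prefix}:{accession}"


print(identifiers_org_url("hgnc:11998"))
```

The sketch only constructs the URL. Whether the prefix is actually registered would still have to be checked against the Identifiers.org registry.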
Formal ontologies should be used for biological terms to provide context to the metadata
An ontology is a "representation of the shared background knowledge for a community" (Stevens, Rector & Hull, 2010). More than just a controlled vocabulary, an ontology provides a formal conceptualization of the nature and structure of the objects it refers to (Guarino, 2006). The ontology terms have formal definitions and relationships and are typically arranged hierarchically in the main structure, as illustrated with the term phagocyte in Figure 4.1. Each ontology term is assigned a persistent unique identifier (PID) which enables interoperability across datasets, services, repositories, and ontologies.
Linking terms across ontologies: Interoperability of biological terms across ontologies by the use of PIDs is invaluable for describing complex biological knowledge with composite annotations. Expanding the example from Figure 4.1, we see in Figure 4.2 that the phagocyte cell type in the Cell Ontology is linked to the biological process phagocytosis as described in the Gene Ontology through the relation:
phagocyte subClassOf: capable of some phagocytosis
On a technical level, this relationship is made possible by adding the PID of the Gene Ontology term phagocytosis as a foreign ID to the Cell Ontology (Figure 4.3).
Registries of ontologies:
- The most relevant ontologies are registered and accessible through the Ontology Lookup Service (OLS) and the NCBO BioPortal. These services allow for lookup of terms across multiple ontologies and provide support for ontology discovery based on a limited set of metadata fields.
- The OBO Foundry community provides access to a set of interoperable ontologies that are both logically well-formed and scientifically accurate, following a shared set of principles.
- FAIRsharing annotates ontologies with richer metadata and provides lists of related records, including standards and databases. For these reasons, FAIRsharing is a valuable tool for discovering ontologies by assessing their use in the communities.
One concept – one term! In order to provide a clean and simple interface to the end users, FAIRtracks aims to map each concept to one and only one ontology term. To this end, the ontologies have been carefully selected and organized in such a way that the domains do not overlap. Table 4.1 lists the ontologies, controlled vocabularies and databases used in the latest version of the FAIRtracks metadata standard.
Composite fields: Core biological ontologies often overlap considerably in their domains, not least due to the widespread practice of importing parts of other ontologies (as exemplified in Figure 4.3). However, most ontologies have certain branches or subdomains where they are particularly strong. In FAIRtracks we take advantage of this by splitting a few of the most important fields, in particular sample.sample_type, into more precise subfields. Each subfield is then linked to a specific branch of a specific ontology that is particularly strong in that subdomain.
Summary fields: Many subfields are only relevant to certain types of records and will thus have missing values elsewhere. To counteract this, we provide general summary fields, such as sample.sample_type.summary, that are automatically generated based on logic particular to each type of record (see section Augmentation below). End users and downstream software can then opt to ignore the subfields (as the values might be missing or might be too detailed) and instead depend only on the summary fields. The FAIRtracks standard (in its augmented form) guarantees that the values of the summary fields are present across all types of experiments and samples.
Community influence on ontology choices: When we developed the FAIRtracks standard in the context of the initial ELIXIR Implementation study, ontologies were chosen based on perceived quality as well as community uptake. However, the selection was also, to a certain extent, a subjective process. If you have opinions on the ontology choices, please join us as an early adopter to make your voice heard (see the Community page)!
| Schema | Field | Ontology, vocabulary, or database | Ancestor terms or allowed values |
|---|---|---|---|
| experiment | technique | The Ontology for Biomedical Investigations (OBI) | planned process |
| experiment | target.sequence_feature | The Sequence Ontology (SO) | sequence_feature |
| experiment | target.gene_id | HUGO Gene Nomenclature Committee (HGNC) | CURIEs in the "hgnc" namespace |
| experiment | target.gene_product_type | The National Cancer Institute Thesaurus (NCIt) | Gene product |
Table 4.1: Ontologies, databases, and controlled vocabularies used in v1.0.2 of the FAIRtracks metadata standard, ordered by FAIRtracks schema/field name. If ancestor terms are specified, only terms that descend from those ancestor terms are allowed, effectively limiting the domain of the field to particular sub-branches of particular ontologies.
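The ancestor-term restriction described in the table caption can be pictured as a reachability check in the ontology hierarchy. The sketch below uses a toy child-to-parents mapping with made-up "EX:" term names. A real implementation would load the ontology from a file or query a lookup service, and nothing here reflects how the FAIRtracks services implement the check.

```python
from typing import Dict, Iterable, Set

# Toy hierarchy (child -> direct parents) with made-up term names.
PARENTS: Dict[str, Set[str]] = {
    "EX:chip_seq_assay": {"EX:assay"},
    "EX:assay": {"EX:planned_process"},
    "EX:planned_process": set(),
    "EX:gene": {"EX:sequence_feature"},
    "EX:sequence_feature": set(),
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect all direct and indirect parents of a term."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def is_allowed(term: str, allowed_ancestors: Iterable[str],
               parents: Dict[str, Set[str]] = PARENTS) -> bool:
    """True if the term strictly descends from an allowed ancestor term."""
    return bool(ancestors(term, parents) & set(allowed_ancestors))


print(is_allowed("EX:chip_seq_assay", ["EX:planned_process"]))  # True
print(is_allowed("EX:gene", ["EX:planned_process"]))            # False
```

In this sketch the ancestor term itself is not accepted as a value. Whether the standard treats the ancestor as allowed is not stated in the caption, so that choice is an assumption.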
Automated power-up of metadata to improve human interaction
We developed the FAIRtracks metadata standard and the associated ecosystem of services with the aim of fulfilling two main objectives:
To facilitate deposition of novel track data and unification of metadata from legacy track collections in compliance with the FAIR data principles.
To improve downstream reuse of track data through services for data discovery and retrieval applied to unified and FAIR metadata.
Metadata for data providers vs end users: In the initial development phase we quickly discovered that these two objectives are somewhat at odds with each other. Downstream users will typically prefer strict, homogeneous, and human-readable metadata. Data providers, on the other hand, will typically prefer streamlined deposition, variation in metadata content, and machine-operable metadata.
Human-readable labels for ontology terms? This gap can be illustrated in the case of ontology terms. If data providers were to provide only the machine-operable identifiers for the terms, human users would not be able to understand them directly. Furthermore, it would be cumbersome for downstream developers and analytical end users to have to implement ontology lookup functionality in order to make use of the metadata. On the other hand, forcing data providers to provide both identifiers and human-readable labels for all ontology terms would discourage data deposition.
Bridging the gap: Based on the above considerations and similar examples, we discovered a need for intermediate solutions that provide ontology lookup functionality and other FAIR-oriented features to help bridge this gap:
Minimal and augmented versions of FAIRtracks: Data providers that adopt the FAIRtracks metadata standard only need to fill in the minimal set of fields that are marked as "required", which together constitute the minimal version of the FAIRtracks standard (see Figure 5.1). The FAIRtracks augmentation service is implemented as a REST API that takes a minimal FAIRtracks-compliant JSON document as input, adds a set of fields with human-readable values to the document, and provides this augmented FAIRtracks-compliant JSON document as output. The fields added include human-readable ontology labels, ontology versions, human-readable summary fields, and other pragmatic fields useful for interactions with the end user (see Figure 5.2). This service simplifies the job on the data providers' side and, at the same time, improves the quality of the data discovery and retrieval operations on the users' side in both automated and manual usage scenarios.
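To make the idea of augmentation concrete, here is a minimal local sketch that adds human-readable labels next to ontology term identifiers in a nested document. It does not call the FAIRtracks augmentation service: the field names `term_id` and `term_label`, the in-memory `LABELS` table, and the "EX:" identifiers are assumptions chosen for illustration.

```python
import copy
from typing import Any, Dict

# Stand-in for an ontology lookup; identifiers and labels are placeholders.
LABELS: Dict[str, str] = {
    "EX:0001": "example assay",
    "EX:0002": "example cell type",
}


def augment(document: Dict[str, Any],
            labels: Dict[str, str] = LABELS) -> Dict[str, Any]:
    """Return a copy of the document with labels added beside known term ids."""
    augmented = copy.deepcopy(document)

    def visit(node: Any) -> None:
        if isinstance(node, dict):
            term_id = node.get("term_id")
            if (isinstance(term_id, str) and "term_label" not in node
                    and term_id in labels):
                node["term_label"] = labels[term_id]
            for value in node.values():
                visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(augmented)
    return augmented


minimal = {"experiments": [{"technique": {"term_id": "EX:0001"}}]}
print(augment(minimal))
```

The input document is left untouched, which mirrors the separation between the minimal version supplied by the data provider and the augmented version served to end users.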
Automated validation of your metadata documents
Standards and validators: An important aspect of any kind of standardization is to define mechanisms for validating adherence to the standard. Such validation should be precise and thorough enough to uphold the required level of quality, while at the same time not being too much of a burden for adopters. In the domain of interoperable web services, the de facto standard is to deploy an HTTP-based API that follows some level of RESTfulness. The de facto standard for data representation is JSON documents, and the de facto standard that allows for annotation and validation of JSON documents is JSON Schema.
JSON Schema: Syntactically, a JSON Schema is just a JSON document that makes use of a standardized vocabulary and structure, as defined in a particular version of the JSON Schema specification. JSON Schemas are used to describe JSON data formats by describing a set of restrictions to the content and structure of JSON documents. These restrictions are typically upheld by a particular authority, such as a metadata standard or a REST API. An important property of a JSON Schema document is that it is both human-readable and machine-actionable. JSON Schema validators are available in most programming languages, simplifying the process of automatic validation of JSON documents according to the respective JSON Schemas.
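The sketch below illustrates the point that a JSON Schema is itself a JSON document that restricts other JSON documents. It hand-rolls a check for only two JSON Schema keywords (`required` and `type`, plus recursion into `properties`) using the Python standard library. It is not a conforming validator, and the schema and its field names (`local_id`, `file_url`) are made up for the example rather than taken from the FAIRtracks schemas.

```python
import json
from typing import Any, Dict, List

# A tiny schema written as JSON text; field names are illustrative only.
SCHEMA = json.loads("""
{
  "type": "object",
  "required": ["local_id", "file_url"],
  "properties": {
    "local_id": {"type": "string"},
    "file_url": {"type": "string"}
  }
}
""")

# Only a few JSON types are handled in this sketch.
JSON_TYPES = {"object": dict, "array": list, "string": str, "boolean": bool}


def check(instance: Any, schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Return error messages for violations of 'type' and 'required'."""
    expected = schema.get("type")
    if expected in JSON_TYPES and not isinstance(instance, JSON_TYPES[expected]):
        return [f"{path}: expected {expected}"]
    errors: List[str] = []
    if isinstance(instance, dict):
        for name in schema.get("required", []):
            if name not in instance:
                errors.append(f"{path}: missing required property '{name}'")
        for name, subschema in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(check(instance[name], subschema, f"{path}.{name}"))
    return errors


print(check({"local_id": "t1", "file_url": "https://example.org/t1.bed"}, SCHEMA))
print(check({"local_id": 5}, SCHEMA))
```

In practice one would use an existing JSON Schema validator library instead, since the specification defines many more keywords than the two handled here.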
FAIRtracks validator: The FAIRtracks metadata standard is implemented as a set of JSON Schemas, following the above-mentioned de facto standards for interoperable web services (see Figure 6.1). Use of JSON Schema provides a way to formalize restrictions that can easily be machine-validated. We provide the FAIRtracks validation service to allow data providers to verify the adherence of JSON metadata documents to the FAIRtracks metadata standard.
Features: The FAIRtracks validator extends standard JSON Schema validation technology through additional modules that allow for:
- Validation of ontology terms against specific ontology versions
- Checking CURIEs against the registered entries at the Identifiers.org resolution service
- Checking restrictions on a full set of documents, e.g. whether identifiers are unique across all documents and whether the records referred to by foreign keys actually exist.
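The cross-document checks in the last item above (unique identifiers and resolvable foreign keys) can be sketched as follows. This is not the FAIRtracks validator's implementation: the record layout and the field names `local_id` and `experiment_ref` are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List


def find_duplicate_ids(records: List[Dict[str, str]],
                       id_field: str = "local_id") -> List[str]:
    """Return identifiers that occur more than once, sorted."""
    counts = Counter(record[id_field] for record in records)
    return sorted(identifier for identifier, n in counts.items() if n > 1)


def find_dangling_refs(children: List[Dict[str, str]],
                       parents: List[Dict[str, str]],
                       ref_field: str,
                       id_field: str = "local_id") -> List[str]:
    """Return referenced identifiers that match no parent record, sorted."""
    known = {record[id_field] for record in parents}
    return sorted({record[ref_field] for record in children
                   if record[ref_field] not in known})


# Made-up records: one duplicated experiment id and one unresolved reference.
experiments = [{"local_id": "exp1"}, {"local_id": "exp1"}]
tracks = [
    {"local_id": "t1", "experiment_ref": "exp1"},
    {"local_id": "t2", "experiment_ref": "exp9"},
]
print(find_duplicate_ids(experiments))                            # ['exp1']
print(find_dangling_refs(tracks, experiments, "experiment_ref"))  # ['exp9']
```

Checks of this kind need the full set of documents at once, which is why they fall outside what a single-document JSON Schema validation can express.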
Managing ontology versions: To our knowledge, there is no consensus on how to relate versions of metadata schemas to versions of the ontologies they depend on. For the FAIRtracks standard, as well as the augmentation and validation services, we decided to relate only to the most recent version of each ontology. Our reasoning behind this choice is detailed in Technical note #1 (to the side). If you know more about this than we do, please let us know!