Figure 1.1: Overview of the FAIR Principles for research data. [From Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 (2016), made available under the CC BY 4.0 license]
Figure 1.2: Important topics where the current state of track data and metadata has potential for improvement, as mapped to the FAIR data principles. [From Gundersen S et al. Recommendations for the FAIRification of genomic track metadata. F1000Research 2021, 10(ELIXIR):268]
Figure 1.3: The secondary data life cycle in scope of the FAIRtracks project. Genomic track files are primarily deposited together with raw data and detailed metadata through the primary data life cycle. The secondary data life cycle supported by FAIRtracks is built around the FAIRtracks metadata (draft) standard, a minimal metadata exchange schema designed to refer back to the raw data/metadata if more details are required. The grey box shows the scope of the FAIRtracks project with current and potential integrations. Omnipy is a general Python library for scalable and reproducible data wrangling which aims to be useful across data models and research disciplines. (Illustration is planned for the RDMkit and is available for download from the Materials page)
FAIR data and FAIRtracks
Track data should be deposited in ways that allow for machine actionability, in line with the FAIR principles
The FAIR data principles provide technical guidelines to enable the Findability, Accessibility, Interoperability, and Reusability (FAIR) of research data. The main focus is on the machine-actionability of these aspects, i.e. the technical capability of performing these operations in an automated way, with minimal human intervention.
Metadata. Metadata and metadata models play a major role in this process, and should contain:
- Global and persistent identifiers to datasets
- A number of attributes providing descriptive information about the context, quality, condition, and characteristics of the data
- Links from the metadata attributes to controlled, shared vocabularies (or ontologies).
Identifiers and ontologies. To enable machine-actionability, the metadata needs to be indexed in a searchable resource and made retrievable via the identifiers using a standardized communication protocol. Moreover, a high level of standardization is required to achieve semantic interoperability allowing, e.g., for integration of different datasets. Linking metadata fields to ontologies provides context to the dataset as a self-describing information bundle where the links to ontologies provide the foundation to machine interpretation, inference, and logic.
Track data and FAIR principles. The degree to which deposited track data complies with the FAIR principles varies greatly, from near-perfect FAIRification practices in the context of certain consortia to an almost complete lack of metadata linked to track files in a range of smaller projects. Some common issues are listed in Figure 1.2. One of the major weaknesses is the lack of suitable uniform metadata schemas that can work across track collections. This strongly limits the possibility of reusing or repurposing track data and hinders automation of these processes, especially when it comes to the "long arm" of deposited track data files. Furthermore, missing provenance information might introduce artefacts in the analyses. The lack of proper annotations and of a well-defined and universally adopted metadata standard is related to the lack of a central repository for track data, as described in the section Track collections.
The ambitions of the FAIRtracks project are two-fold:
- Provide a set of pragmatic metadata schemas for genomic tracks that comply with the FAIR principles and are adopted and further developed by the community as a minimal metadata exchange standard, providing a unified view into both:
  - novel track data depositions
  - legacy track collections/data portals.
- Provide a set of services to be integrated with downstream tools and libraries so that analytical end users can more easily discover and reuse the massive amounts of track data that have been and are being created:
  - for various species
  - from different types of sample material
  - by applying diverse types of experimental assays and in silico processing workflows.
See Figure 1.3 for an overview of the secondary data life cycle that falls within the scope of the FAIRtracks project.
Figure 2.1: Overview of the metadata model of European Nucleotide Archive (ENA), which has evolved from the INSDC data model. [From the ENA web site, available under the CC BY 4.0 license]
Figure 2.2: Overview of the key objects in the FAIRtracks metadata standard. A "track" is an atomic element representing a genomic track data file. Each "track" is generated by an "experiment", physically or in silico. Physical "experiments" link out to "samples". Sets of "experiments" are contained within a "study" object. Finally, "tracks" are grouped into "track collections", which directly matches the existing track hub object. A "track collection" can also refer to an ad hoc collection of tracks, e.g., documenting the input data of published analyses. [From Gundersen S et al. Recommendations for the FAIRification of genomic track metadata. F1000Research 2021, 10(ELIXIR):268]
Track Metadata Models
Metadata models used to describe track data should be interoperable with other models commonly used to describe genomic datasets
INSDC. Many of the metadata models for annotation of genomic datasets available today evolved from INSDSeq. INSDSeq is the officially supported XML format of the International Nucleotide Sequence Database Collaboration (INSDC), a long-standing foundational initiative that operates between NCBI (in the USA), EMBL-EBI (in the UK) and DDBJ (in Japan) to facilitate genomic data exchange. The INSDSeq standard proposes the assignment of a number of interlinked metadata objects to each data file, the most relevant among these being "Experiment", "Study", and "Sample". Modern implementations of this model are the European Nucleotide Archive (ENA) metadata model and the Sequence Read Archive (SRA) schemas (see Figure 2.1).
ISA. The Investigation Study Assay (ISA) metadata framework provides a flexible solution for rich description and annotation of experimental outputs. The ISA abstract model exhibits a hierarchical nested structure comprising the "Investigation", "Study", and "Assay" metadata categories. The ISA model is implemented in tabular, JSON and Resource Description Framework (RDF) formats and is supported by dedicated software, ISA tools.
Table 2.1 lists these and other well-developed data models in use for genomic track data.
Objects. Developed in relation to these data models and others, the FAIRtracks data model is based on five key objects: Track collection, Study, Sample, Experiment, and Track. The relationships between the objects are illustrated and explained in Figure 2.2, while the mapping of the FAIRtracks objects to other data models is shown in Table 2.2.
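As a rough illustration, the object relationships described in Figure 2.2 can be sketched with plain Python dataclasses. This is only a sketch under stated assumptions: the class layout, the attribute names (`local_id`, `study_ref`, `sample_ref`, `experiment_ref`) and the helper `dangling_refs` are hypothetical and are not taken from the FAIRtracks JSON schemas.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Study:
    local_id: str


@dataclass
class Sample:
    local_id: str


@dataclass
class Experiment:
    local_id: str
    study_ref: str                     # an experiment is contained in a study
    sample_ref: Optional[str] = None   # only physical experiments link to a sample


@dataclass
class Track:
    local_id: str
    experiment_ref: str                # a track is generated by an experiment


@dataclass
class TrackCollection:
    tracks: List[Track] = field(default_factory=list)


def dangling_refs(
    collection: TrackCollection,
    experiments: List[Experiment],
    studies: List[Study],
    samples: List[Sample],
) -> List[str]:
    """List every reference that points to an object that does not exist."""
    experiment_ids = {e.local_id for e in experiments}
    study_ids = {s.local_id for s in studies}
    sample_ids = {s.local_id for s in samples}

    problems = []
    for track in collection.tracks:
        if track.experiment_ref not in experiment_ids:
            problems.append(f"track {track.local_id} -> experiment {track.experiment_ref}")
    for experiment in experiments:
        if experiment.study_ref not in study_ids:
            problems.append(f"experiment {experiment.local_id} -> study {experiment.study_ref}")
        if experiment.sample_ref is not None and experiment.sample_ref not in sample_ids:
            problems.append(f"experiment {experiment.local_id} -> sample {experiment.sample_ref}")
    return problems
```

An in silico experiment simply leaves `sample_ref` unset, which mirrors the statement that only physical experiments link out to samples.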
Metadata fields. The additional attributes for each object type in FAIRtracks have been defined by striking a compromise between the work imposed on the producer and the consumer of the metadata. As a result, the mandatory attributes are only the ones that appeared necessary for generic re-analysis of the data. Notably, these fields also include resolvable references to relevant existing metadata records in external resources, in compliance with the FAIR data principles (see above).
Originator | Description | Standard file format | Standard description | Databases/Dataportals | Notes |
---|---|---|---|---|---|
International Nucleotide Sequence Database Collaboration (INSDC) | Long-standing foundational initiative that operates between NCBI, EMBL-EBI and DDBJ to facilitate genomic data exchange. | XML | Modern implementations of this model are the ENA metadata model and the Sequence Read Archive (SRA) schemas. | NCBI, EMBL-EBI and DDBJ | |
International Human Epigenome Consortium (IHEC) | Organization aiming at coordinating the production of reference maps of human epigenomes. | XML | Extended from SRA described at GitHub repo. | BLUEPRINT and ENCODE | IHEC data portal |
Functional annotation of animal genomes (FAANG) | Consortium to discover basic functional knowledge of genome function to decipher the genotype-to-phenotype (G2P) link in farmed animals. | Inherited from hosting platform (BioSamples, EBI, ENA SRA, NCBI, ArrayExpress or ENA) | The FAANG metadata model supports the MIAME and MINSEQE guidelines. | FAANG data portal | Uses the Experimental Factor Ontology (EFO) and the ontologies it imports. |
Table 2.1: Description of the main metadata models relevant for life science data. Please report any errors or omissions to our GitHub repo as an issue, or provide a PR.
FAIRtracks | INSDC | ISA | Track Hub Registry | GSuite |
---|---|---|---|---|
Track collection | Submission (SRA) & Dataset (ENA) | Investigation | Track Hub | Track collection |
Study | Study | Study | | |
Sample | Sample | Sample | | |
Experiment | Experiment & Analysis | Assay & Process | | |
Track | Analysis | Data | Track | Track |
Table 2.2: Mapping of objects in the FAIRtracks metadata standard to objects in other metadata standards. [Adapted from Gundersen S et al. Recommendations for the FAIRification of genomic track metadata. F1000Research 2021, 10(ELIXIR):268]
Figure 3.1: Globally unique and persistent identifiers allow linking research data with different aspects of the research environment, such as physical samples, experiment setup, in silico analyses, studies, and publications. [From Plomp, E., 2020. Going Digital: Persistent Identifiers for Research Samples, Resources and Instruments. Data Science Journal, 19(1), p.46, made available under the CC BY 3.0 license.]
Figure 3.2: Screenshot from the ELIXIR-supported Identifiers.org, which resolves globally unique and persistent identifiers in the form of CURIEs and returns URLs to repository web pages containing information about the referred object. Identifiers.org is a partner of the US-based Names to things (N2T.net), which provides similar services.
Figure 3.3: Example track record according to the "Track" sub-schema of the FAIRtracks metadata standard. Highlighted are the fields for globally unique and persistent identifiers, locally unique identifiers, some references to external records, as well as track file URLs.
Identifiers
Assign a persistent reference to your published track data
FAIR principle F1 stipulates the need to assign identifiers to data and metadata that are both 1) globally unique and 2) persistent. The GO FAIR website page on the topic further asserts:
"Principle F1 is arguably the most important because it will be hard to achieve other aspects of FAIR without globally unique and persistent identifiers."
Globally unique and persistent identifiers allow the linking of research data with different aspects of the research environment (Figure 3.1).
The need for track file identifiers: Track data files are seldom assigned identifiers directly; often it is only the raw sequence files used to generate the track files that are assigned identifiers, typically the accession numbers to data repositories such as the Sequence Read Archive (SRA) or the European Genome-Phenome Archive (EGA). The ENCODE project represents a notable exception: each track is associated with an identifier resolvable through Identifiers.org (see Figure 3.2) and a dedicated web page. Furthermore, a universally accessible service to assign and register identifiers to single track files and collections is currently missing.
Track files contain condensed data from bioinformatics workflows. They therefore depend on specific parameter settings and cannot be perfectly recreated from the raw data without also perfectly reproducing the full analysis workflow, which is often a difficult task. We therefore strongly recommend the implementation of a track registry that preserves the full context of tracks by assigning global identifiers not only to the track data files but also to the associated metadata.
Built for track file identifiers: The FAIRtracks draft standard is developed from the ground up to support globally unique and persistent identifiers for track files and could be suitable for use as a basis for a potential global registry of track metadata (see Figure 3.3). For now, global track identifiers are allowed, but not enforced by the FAIRtracks standard. Instead, we require the inclusion of local track identifiers within the dataset as well as Uniform Resource Locators (URLs) to track files, which, unfortunately, come without any guarantees of persistence or uniqueness. The FAIRtracks standard still provides globally unique and persistent identifiers to track files in an indirect manner, using document DOIs.
DOI as document identifier: In case a direct identifier is not attached to a track file, the identifier of a parent record (e.g. study or experiment) can be used instead. On top of this, FAIRtracks requires a global identifier for the metadata file itself, in the form of a Digital Object Identifier (DOI). In principle, a track file can thus be uniquely pinpointed by a joint identifier containing the DOI of the FAIRtracks document and the locally unique track identifier. As our policy requires support for DOI versioning and DOI reservation prior to publication, we currently recommend Zenodo for publishing FAIRtracks documents. We would extend our list of recommended repositories and archives to include any domain-specific services that meet our requirements.
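The idea of a joint identifier can be sketched in a few lines of Python. Note that the source does not define a concrete syntax for such identifiers: the `#` separator, the `JointTrackId` type and the parsing helper below are assumptions made purely for illustration, and the DOI shown is a placeholder rather than a real record.

```python
from typing import NamedTuple


class JointTrackId(NamedTuple):
    """DOI of the FAIRtracks document plus the locally unique track identifier."""

    doi: str
    local_id: str

    def __str__(self) -> str:
        # Assumed convention: "<document DOI>#<local track id>"
        return f"{self.doi}#{self.local_id}"


def parse_joint_track_id(text: str) -> JointTrackId:
    """Split on the last '#', assuming local identifiers never contain '#'."""
    doi, separator, local_id = text.rpartition("#")
    if not separator or not doi or not local_id:
        raise ValueError(f"Not a joint track identifier: {text!r}")
    return JointTrackId(doi, local_id)


# Placeholder values, for illustration only
joint_id = JointTrackId("10.5281/zenodo.0000000", "track_1")
print(joint_id)
```

Because the DOI identifies one specific version of the metadata document, the local identifier only needs to be unique within that document for the pair to be globally unique.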
"Mixtape" track collection identifiers: Apart from the main use case of annotating primary track collections deposited in some repository, FAIRtracks is also designed to allow a more novel use case: annotating secondary "mixtape" track collections of track files originating from different primary sources. The main example of this use case is to annotate the exact track data files used to generate the findings of a scientific publication, whether these track files represent novel data, are directly reused from other repositories, are regenerated from the raw data, or are in other ways derived from the original track files. To preserve the provenance of such "mixtape" reuse of tracks, assigning globally unique and persistent identifiers to track collections would be advantageous. Full support for secondary track collections is scheduled for version 2 of the FAIRtracks standard (coming soon). Currently, this concept is most fully developed in the form of GSuite files in the context of the GSuite HyperBrowser.
References to external records: FAIRtracks supports and recommends the inclusion of global identifiers to external records containing detailed metadata. We require these global identifiers to be represented in Compact Uniform Resource Identifier (CURIE) form, resolvable through Identifiers.org (see Figures 3.2 and 3.3). A mapping service from existing URIs to the corresponding CURIEs is desirable, as it would simplify the conversion of existing metadata to the FAIRtracks standard.
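To make the CURIE form concrete, the sketch below turns a CURIE into the corresponding Identifiers.org resolver URL, which follows the pattern `https://identifiers.org/<prefix>:<accession>`. The regular expression is a deliberately loose assumption rather than the validation applied by FAIRtracks, the function name is hypothetical, and the accession in the example is a placeholder. No network request is made.

```python
import re

# Loose pattern (an assumption): a namespace prefix, a colon, then the accession.
CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z0-9_.]+):(?P<accession>\S+)$")


def curie_to_url(curie: str) -> str:
    """Build the Identifiers.org URL for a CURIE without contacting the resolver."""
    match = CURIE_PATTERN.match(curie)
    if match is None or match["accession"].startswith("//"):
        # The second check rejects ordinary URLs such as "https://..."
        raise ValueError(f"Not a CURIE: {curie!r}")
    return f"https://identifiers.org/{match['prefix']}:{match['accession']}"


# "hgnc" is one of the namespaces used by FAIRtracks; the number is a placeholder.
print(curie_to_url("hgnc:12345"))
```

Keeping the CURIE in the metadata and deriving the URL on demand means the stored reference stays valid even if the web address of the target repository changes.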
Figure 4.1: Example ontology term from the Cell Ontology: The cell type phagocyte is represented by a term (or concept) with a description, a persistent unique identifier (PID) (here Concept ID) and relationships (see also Figure 4.3). The current figure shows the other terms that are direct or indirect ancestors of phagocyte: native cell, motile cell, defensive cell, and stuff accumulating cell, all the way to the top-level terms. As can be gathered from the illustration, the same concept phagocyte is inserted in several places in the hierarchy simultaneously, uniquely referenced by its PID in all relationships. Numerous descendants of the term phagocyte exist, but are not shown in the illustration. [The visualization has been fetched from the NCBO BioPortal]
Figure 4.2: Examples of domain knowledge represented as generously annotated relationships connecting the Gene Ontology term phagocytosis to other terms in the context of the Cell Ontology. [From Ontobee: "phagocytosis" in Cell Ontology. See also Figure 4.3]
Figure 4.3: The Gene Ontology term phagocytosis merged into the Cell Ontology. [From Ontobee: "phagocytosis" in Cell Ontology]
Ontologies
Formal ontologies should be used for biological terms to provide context to the metadata
An ontology is a "representation of the shared background knowledge for a community" (Stevens, Rector & Hull, 2010). More than just a controlled vocabulary, an ontology provides a formal conceptualization of the nature and structure of the objects it refers to (Guarino, 2006). The ontology terms have formal definitions and relationships and are typically arranged hierarchically in the main structure, as illustrated with the term phagocyte in Figure 4.1. Each ontology term is assigned a persistent unique identifier (PID) which enables interoperability across datasets, services, repositories, and ontologies.
Linking terms across ontologies: Interoperability of biological terms across ontologies by the use of PIDs is invaluable for describing complex biological knowledge with composite annotations. Expanding the example from Figure 4.1, we see in Figure 4.2 that the phagocyte cell type in the Cell Ontology is linked to the biological process phagocytosis as described in the Gene Ontology through the relation:
phagocyte subClassOf: capable of some phagocytosis
On a technical level, this relationship has been made possible through the addition of the PID of the Gene Ontology term phagocytosis as a foreign ID to the Cell Ontology (Figure 4.3).
Registries of ontologies:
- The most relevant ontologies are registered and accessible through the Ontology Lookup Service (OLS) and the NCBO BioPortal. These services allow for lookup of terms across multiple ontologies and provide support for ontology discovery based on a limited set of metadata fields.
- The OBO foundry community provides access to a set of interoperable ontologies that are both logically well-formed and scientifically accurate, following these sets of principles.
- FAIRsharing annotates ontologies with richer metadata and provides lists of related records, including standards and databases. For these reasons, FAIRsharing is a valuable tool for discovering ontologies by assessing their use in the communities.
One concept – one term! In order to provide a clean and simple interface to the end users, FAIRtracks aims to map each concept to one and only one ontology term. To this end, the ontologies have been carefully selected and organized in such a way that the domains do not overlap. Table 4.1 lists the ontologies, controlled vocabularies and databases used in the latest version of the FAIRtracks metadata standard.
Composite fields: Core biological ontologies often overlap considerably domain-wise, not least due to the widespread practice of importing parts of other ontologies (as exemplified in Figure 4.3). However, most ontologies have certain branches or subdomains where they are particularly strong. In FAIRtracks we take advantage of this by splitting a few of the most important fields, in particular the fields experiment.target and sample.sample_type, into more precise subfields. Each subfield is then linked to a specific branch of a specific ontology which is particularly strong in that subdomain.
Summary fields: Many subfields are only relevant to certain types of records and will thus have missing values elsewhere. To counteract this we provide the general fields experiment.target.summary and sample.sample_type.summary that are automatically generated based on logic particular to each type of record (see section Augmentation below). End users and downstream software can then opt to ignore the subfields (as the values might be missing or might be too detailed) and instead depend only on the summary fields. The FAIRtracks standard (in its augmented form) guarantees that the values of the summary fields are present across all types of experiments and samples.
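From a consumer's point of view, relying on the summary fields could look like the sketch below. It assumes that a dotted field name such as target.summary corresponds to nested JSON objects; the helper `get_path` is hypothetical and the record contents are invented for illustration.

```python
from typing import Any, Optional


def get_path(record: dict, dotted_path: str) -> Optional[Any]:
    """Follow a dotted field name through nested dicts; None if any step is missing."""
    current: Any = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Two invented, augmented experiment records with different subfields filled in
experiments = [
    {"target": {"gene_id": "hgnc:12345", "summary": "example gene target"}},
    {"target": {"sequence_feature": {"term_label": "example feature"},
                "summary": "example feature target"}},
]

for experiment in experiments:
    # The subfield may be absent, while the summary is expected to be present
    print(get_path(experiment, "target.gene_id"), "|", get_path(experiment, "target.summary"))
```

Downstream code written this way needs no knowledge of which subfields apply to which type of experiment or sample.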
Community influence on ontology choices: When we developed the FAIRtracks standard in the context of the initial ELIXIR Implementation study, ontologies were chosen based on perceived quality as well as community uptake. The selection was however also, to a certain extent, a subjective process. If you have opinions on the ontology choices, please join us as an early adopter to make your voice heard (see the Community page)!
Schema | Field | Ontology | Ancestor terms | Comments |
---|---|---|---|---|
experiment | technique | The Ontology for Biomedical Investigations (OBI) | planned process | |
experiment | technique | EDAM | Operation | |
experiment | target.sequence_feature | The Sequence Ontology (SO) | sequence_feature | |
experiment | target.gene_id | HUGO Gene Nomenclature Committee (HGNC) | CURIEs in the "hgnc" namespace | |
experiment | target.gene_product_type | The National Cancer Institute Thesaurus (NCIt) | Gene product | |
Table 4.1: Ontologies, databases, and controlled vocabularies used in v1.0.2 of the FAIRtracks metadata standard, ordered by FAIRtracks schema/field name. If ancestor terms are specified, only terms that are somehow descending from the ancestor terms are allowed, effectively limiting the domain of the field to particular sub-branches of particular ontologies.
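The ancestor-term restriction can be illustrated with a toy hierarchy. The sketch below is not how FAIRtracks validates terms: it uses term labels instead of PIDs, the parent links are a simplified assumption loosely based on the ancestors named in Figure 4.1, and the function names are hypothetical. In this sketch the ancestor term itself is not counted as one of its own descendants.

```python
from typing import Dict, Iterable, List, Set

# Toy "is a" hierarchy; a term may have several parents (assumed links)
PARENTS: Dict[str, List[str]] = {
    "phagocyte": ["motile cell", "defensive cell", "stuff accumulating cell"],
    "motile cell": ["native cell"],
    "defensive cell": ["native cell"],
    "stuff accumulating cell": ["native cell"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect all direct and indirect ancestors of a term."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already reached through another parent
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def is_allowed(term: str, allowed_ancestors: Iterable[str],
               parents: Dict[str, List[str]]) -> bool:
    """A term is allowed if it descends from at least one of the ancestor terms."""
    return bool(ancestors(term, parents) & set(allowed_ancestors))


print(is_allowed("phagocyte", ["native cell"], PARENTS))
print(is_allowed("native cell", ["phagocyte"], PARENTS))
```

Because a term can sit in several places in the hierarchy at once, the traversal follows every parent and keeps a set of visited terms.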
Figure 5.1: Example of minimal FAIRtracks metadata record adhering to the Experiment schema. Only the identifiers are included to represent the ontology term values of the fields technique and target.sequence_feature. [Extracted from the minimal version of the FAIRtracks-aligned BLUEPRINT metadata document]
Figure 5.2: Example of augmented FAIRtracks metadata record adhering to the Experiment schema. The ontology term identifiers have been mapped to human-readable labels according to the latest version of the ontologies. Additionally, the field target.summary has been automatically filled based on the contents of the fields technique.term_id, target.sequence_feature.term_label and target.details. [Extracted from the augmented version of the FAIRtracks-aligned BLUEPRINT metadata document]
Augmentation
Automated power-up of metadata to improve human interaction
We developed the FAIRtracks metadata standard and the associated ecosystem of services with the aim of fulfilling two main objectives:
- To facilitate deposition of novel track data and unification of metadata from legacy track collections in compliance with the FAIR data principles.
- To improve downstream reuse of track data through services for data discovery and retrieval applied to unified and FAIR metadata.
Metadata for data providers vs end users: In the initial development phase we quickly discovered that these two objectives are somewhat at odds with each other. Downstream users will typically prefer strict, homogeneous, and human-readable metadata. Data providers, on the other hand, will typically prefer streamlined deposition, variation in metadata content, and machine-operable metadata.
Human-readable labels for ontology terms? This gap can be illustrated in the case of ontology terms. If data providers were to provide only the machine-operable identifiers for the terms, human users would not be able to directly understand them. Furthermore, it would be cumbersome for downstream developers and analytical end users to have to implement ontology lookup functionality in order to make use of the metadata. On the other hand, forcing data providers to provide both identifiers and human-readable labels for all ontology terms would discourage data deposition.
Bridging the gap: Based on the above considerations and similar examples, we discovered a need for intermediate solutions that provide ontology lookup functionality and other FAIR-oriented features to help bridge this gap:
Minimal and augmented versions of FAIRtracks: Data providers that adopt the FAIRtracks metadata standard only need to fill the minimal set of fields that are marked as "required", which together constitute the minimal version of the FAIRtracks standard (see Figure 5.1). The FAIRtracks augmentation service is implemented as a REST API that takes a minimal FAIRtracks-compliant JSON document as input, adds a set of fields with human-readable values to the document, and provides this augmented FAIRtracks-compliant JSON document as output. The fields added include human-readable ontology labels, ontology versions, human-readable summary fields, and other pragmatic fields useful for interactions with the end user (see Figure 5.2). This service simplifies the job on the data providers' side and, at the same time, improves the quality of the data discovery and retrieval operations on the users' side in both automated and manual usage scenarios.
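The label-adding step of augmentation can be illustrated with a small offline sketch. This is not the FAIRtracks augmentation service or its API: the function `add_term_labels` is hypothetical, the lookup table stands in for a real ontology lookup, and the term identifiers and labels are invented. Only the field names term_id and term_label are taken from the figure captions.

```python
from typing import Any, Dict

# Hypothetical stand-in for an ontology lookup; identifiers and labels are invented
TERM_LABELS: Dict[str, str] = {
    "http://example.org/ontology/TECH_0001": "example assay",
    "http://example.org/ontology/FEAT_0001": "example sequence feature",
}


def add_term_labels(node: Any, labels: Dict[str, str]) -> Any:
    """Return a copy of the document with term_label added next to known term_ids."""
    if isinstance(node, dict):
        augmented = {key: add_term_labels(value, labels) for key, value in node.items()}
        term_id = augmented.get("term_id")
        if isinstance(term_id, str) and term_id in labels and "term_label" not in augmented:
            augmented["term_label"] = labels[term_id]
        return augmented
    if isinstance(node, list):
        return [add_term_labels(item, labels) for item in node]
    return node  # strings, numbers, booleans and None are returned unchanged


minimal_experiment = {
    "technique": {"term_id": "http://example.org/ontology/TECH_0001"},
    "target": {"sequence_feature": {"term_id": "http://example.org/ontology/FEAT_0001"}},
}
augmented_experiment = add_term_labels(minimal_experiment, TERM_LABELS)
print(augmented_experiment["technique"])
```

The input document is left untouched, so the minimal version supplied by the data provider and the augmented version used by consumers can be kept side by side.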