This article describes how SolveBio manages a wide variety of reference data from hundreds of sources, each with many distinct versions. SolveBio’s platform is designed to help scientists manage, make sense of, and integrate otherwise isolated genomic reference datasets while maintaining data provenance.
SolveBio “depositories” are the top-level containers for versioned datasets. Depositories are like folders on your computer. A depository can represent a project, such as ClinVar or dbSNP, or an organization, such as the NCBI. Depository names must be unique across SolveBio. For this reason, private depositories are namespaced under your team’s domain. For example, if your team is named “MyTeam” and you use the domain “myteam.solvebio.com”, your private depositories will have the “myteam:” prefix.
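The namespacing rule can be sketched in a few lines of Python. This is only an illustration, not SolveBio code: the function names are hypothetical, the depository name “variants” is made up, and the sketch assumes the prefix is simply the first label of the team’s domain, which matches the “myteam.solvebio.com” example but is not confirmed beyond it.

```python
def private_prefix(domain: str) -> str:
    """Return the namespace prefix for a team domain.

    Assumption: the prefix is the first DNS label of the domain,
    lowercased, followed by a colon.
    """
    subdomain = domain.strip().lower().split(".")[0]
    if not subdomain:
        raise ValueError(f"cannot derive a prefix from domain {domain!r}")
    return f"{subdomain}:"


def qualify(domain: str, depository_name: str) -> str:
    """Build the full name of a private depository (hypothetical helper)."""
    return private_prefix(domain) + depository_name


if __name__ == "__main__":
    print(private_prefix("myteam.solvebio.com"))       # myteam:
    print(qualify("myteam.solvebio.com", "variants"))  # myteam:variants
```

Because the prefix is derived from the team’s domain, two teams can use the same depository name without colliding in the global namespace.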
Depositories can have one or more versions. The version format is based on a popular convention known as Semantic Versioning, which is typically used in software development. The convention is flexible but also helps users visually understand changes to underlying datasets. Versions use the following format: “<major>.<minor>.<patch>-<label>”.
Here are some examples:
You can create any number of versions. Version names must be unique within a depository. Just as depositories are like folders, versions are like subfolders within a depository.
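To make the “<major>.<minor>.<patch>-<label>” format concrete, here is a minimal Python sketch that parses and orders version names. It is not part of the SolveBio API. The version strings are hypothetical, the “-<label>” suffix is assumed to be optional (as it is in Semantic Versioning), and the set of characters allowed in a label is also an assumption.

```python
import re
from typing import NamedTuple, Optional

# Assumed grammar: three dot-separated integers, then an optional "-<label>".
VERSION_RE = re.compile(
    r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"(?:-(?P<label>[0-9A-Za-z.-]+))?"
)


class Version(NamedTuple):
    major: int
    minor: int
    patch: int
    label: Optional[str] = None


def parse_version(text: str) -> Version:
    """Parse a version name, raising ValueError if it does not fit the format."""
    match = VERSION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid version name: {text!r}")
    return Version(
        int(match["major"]),
        int(match["minor"]),
        int(match["patch"]),
        match["label"],
    )


def sort_versions(names):
    """Order version names by their numeric parts only (labels are ignored)."""
    return sorted(names, key=lambda name: parse_version(name)[:3])


if __name__ == "__main__":
    print(parse_version("2.0.1-beta"))
    # Version(major=2, minor=0, patch=1, label='beta')
    print(sort_versions(["1.10.0", "1.2.0", "1.2.1-beta"]))
    # ['1.2.0', '1.2.1-beta', '1.10.0']
```

Comparing the numeric parts as integers is what places 1.10.0 after 1.2.0; a plain string sort would put it first.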
Finally, every depository version can contain one or more datasets. Datasets represent the semi-structured data that users can browse and query through MESH or the SolveBio API. Dataset names must be unique within each version of a depository. A newly created depository version starts out empty, without any datasets. A dataset must be created before data can be imported into it.
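The depository, version, and dataset hierarchy and its uniqueness rules can be summarized as a small in-memory model. The sketch below is purely illustrative: the class and method names are hypothetical and are not the SolveBio API, and the dataset name “Variants” and the sample record are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dataset:
    name: str
    records: List[dict] = field(default_factory=list)


@dataclass
class DepositoryVersion:
    name: str
    # A new version starts out empty, with no datasets.
    datasets: Dict[str, Dataset] = field(default_factory=dict)

    def create_dataset(self, name: str) -> Dataset:
        # Dataset names must be unique within a depository version.
        if name in self.datasets:
            raise ValueError(f"dataset {name!r} already exists in {self.name!r}")
        self.datasets[name] = Dataset(name)
        return self.datasets[name]

    def import_records(self, dataset_name: str, records: List[dict]) -> None:
        # Importing requires that the dataset was created beforehand.
        if dataset_name not in self.datasets:
            raise KeyError(f"create dataset {dataset_name!r} before importing")
        self.datasets[dataset_name].records.extend(records)


@dataclass
class Depository:
    name: str
    versions: Dict[str, DepositoryVersion] = field(default_factory=dict)

    def create_version(self, name: str) -> DepositoryVersion:
        # Version names must be unique within a depository.
        if name in self.versions:
            raise ValueError(f"version {name!r} already exists in {self.name!r}")
        self.versions[name] = DepositoryVersion(name)
        return self.versions[name]


if __name__ == "__main__":
    clinvar = Depository("ClinVar")
    version = clinvar.create_version("1.0.0")
    version.create_dataset("Variants")
    version.import_records("Variants", [{"gene": "BRCA1"}])
    print(len(version.datasets["Variants"].records))  # 1
```

Keying each level by name in a dictionary gives the uniqueness checks for free, and the same dataset name can still appear in different versions of one depository.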