Creating Vault Data Feeds

Before reading this article, make sure to read the Data Management Overview.

SolveBio Data Feeds contain one or more datasets and can be versioned over time.

Dataset names follow a simple convention: “{Feed}/{Version}/{Dataset}”. For example, the Variants dataset in ClinVar version 3.7.1-2016-04-18 has the name “ClinVar/3.7.1-2016-04-18/Variants”. All private datasets are prefixed with your team’s domain. If your domain is “myteam.solvebio.com”, your data feed names will start with “myteam:” (e.g. “myteam:MyFeed/1.0.0/MyDataset”). Data feed and dataset names typically follow the “CamelCase” convention and should contain only letters, numbers, dashes, and underscores.
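The naming convention can be checked mechanically before a dataset is created. The following is a minimal sketch in plain Python, not part of any SolveBio client library: the names `DatasetName` and `parse_dataset_name` are hypothetical, and allowing dots in the version component is an assumption based on the ClinVar example.

```python
import re
from typing import NamedTuple, Optional

# Feed and dataset names: letters, numbers, dashes, and underscores only.
NAME_RE = re.compile(r"^[A-Za-z0-9_-]+$")
# Assumption: versions may also contain dots, as in "3.7.1-2016-04-18".
VERSION_RE = re.compile(r"^[A-Za-z0-9._-]+$")


class DatasetName(NamedTuple):
    team: Optional[str]  # None for names without a team prefix
    feed: str
    version: str
    dataset: str


def parse_dataset_name(full_name: str) -> DatasetName:
    """Split "[team:]Feed/Version/Dataset" into its components."""
    team = None
    rest = full_name
    if ":" in full_name:
        team, rest = full_name.split(":", 1)
        if not team:
            raise ValueError("empty team prefix in %r" % full_name)
    parts = rest.split("/")
    if len(parts) != 3:
        raise ValueError("expected Feed/Version/Dataset, got %r" % rest)
    feed, version, dataset = parts
    if not NAME_RE.match(feed) or not NAME_RE.match(dataset):
        raise ValueError("feed and dataset names allow only letters, "
                         "numbers, dashes, and underscores")
    if not VERSION_RE.match(version):
        raise ValueError("unexpected characters in version %r" % version)
    return DatasetName(team, feed, version, dataset)
```

Calling `parse_dataset_name("myteam:MyFeed/1.0.0/MyDataset")` yields the team prefix, feed, version, and dataset as separate values, while a name containing spaces raises `ValueError`.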

Datasets are similar to spreadsheets or traditional SQL tables, but support semi-structured (“NoSQL” style) data. Each record in a dataset may contain one or more field/value pairs, which are defined by the dataset’s template (and in some cases can be altered dynamically). Dataset size limits are determined by your SolveBio plan. Standard datasets can contain up to 30 gigabytes of data, which is equivalent to about 20 million records.

Dataset Templates

When creating a dataset, a template must be provided. Templates describe a dataset’s fields: the field names, underlying data types, and entity types. Field names are case-sensitive and should contain only letters, numbers, and underscores (no spaces or other special characters). They must start with a letter or a number. SolveBio recommends the “snake case” format (e.g. “my_field” rather than “MyField”). The following data types are available:

  • string (default): A “keyword” field useful for exact match filtering.
  • text: A free-text field tokenized and indexed for search.
  • integer: A signed 32-bit integer between -2^31 and 2^31-1.
  • long: A signed 64-bit integer between -2^63 and 2^63-1.
  • short: A signed 16-bit integer between -32,768 and 32,767.
  • float: A single-precision 32-bit IEEE 754 floating point.
  • double: A double-precision 64-bit IEEE 754 floating point.
  • boolean: Either “true” or “false”.
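When choosing between the three integer types, it can help to check the range of the values you plan to import. The sketch below is plain Python and independent of SolveBio; the function name `smallest_integer_type` is hypothetical. It returns the narrowest integer type from the list above that can hold every value.

```python
# Inclusive value ranges for the signed integer types, narrowest first.
INTEGER_RANGES = {
    "short": (-2**15, 2**15 - 1),    # 16-bit
    "integer": (-2**31, 2**31 - 1),  # 32-bit
    "long": (-2**63, 2**63 - 1),     # 64-bit
}


def smallest_integer_type(values):
    """Return the narrowest integer type name that fits all values."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    lowest, highest = min(values), max(values)
    for name, (type_min, type_max) in INTEGER_RANGES.items():
        if type_min <= lowest and highest <= type_max:
            return name
    raise ValueError("values do not fit in a signed 64-bit integer")
```

Because data types cannot be changed after a dataset is created, it may be safer to pick a wider type than the current data strictly requires.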

Fields may also represent a SolveBio entity such as a gene, variant, or chromosomal region. See the documentation on SolveBio entities for more information. SolveBio provides example dataset templates (see an example) that can be used to create common datasets, such as those used when importing VCF files. Once a dataset is created, it can be edited or used as an import destination. It will also be visible to all members of your SolveBio team.
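The field-name rules and data types described in this section can be checked before a template is submitted. The following sketch assumes a simplified, hypothetical representation of template fields (a list of dictionaries with “name” and “data_type” keys); this is not necessarily the actual SolveBio template format, and `validate_template_fields` is not a SolveBio function.

```python
import re

# Data types listed in this article.
DATA_TYPES = {"string", "text", "integer", "long", "short",
              "float", "double", "boolean"}

# Letters, numbers, and underscores; must start with a letter or a number.
FIELD_NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_]*$")


def validate_template_fields(fields):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    seen = set()
    for field in fields:
        name = field.get("name", "")
        # "string" is the default data type when none is given.
        data_type = field.get("data_type", "string")
        if not FIELD_NAME_RE.match(name):
            problems.append("invalid field name: %r" % name)
        if data_type not in DATA_TYPES:
            problems.append("unknown data type %r for field %r"
                            % (data_type, name))
        if name in seen:
            problems.append("duplicate field name: %r" % name)
        seen.add(name)
    return problems


# Hypothetical example fields, for illustration only.
example_fields = [
    {"name": "gene_symbol", "data_type": "string"},
    {"name": "description", "data_type": "text"},
    {"name": "allele_count", "data_type": "integer"},
]
```

Field names are case-sensitive, so this check treats “my_field” and “My_Field” as two different fields.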

 

Genomic Datasets

Many SolveBio datasets are “genomic datasets”. This means that they contain a reserved field called “genomic_coordinates”. Each record in the dataset refers to a single feature on the genome, such as a gene, variant, or transcript. The location is stored in the genomic_coordinates field, which contains the genome build (“build”), the start position (“start”), the end position (“stop”), and the chromosome (“chromosome”). SolveBio uses a one-based, fully-closed numbering system for all coordinates (the same as a VCF file and standard genomics data from the NCBI), so a single-base feature has equal start and stop values.
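Source data in other coordinate systems needs to be converted before import. As an illustration, the sketch below converts a zero-based, half-open interval (the convention used by BED files) into one-based, fully-closed coordinates. The `GenomicCoordinates` class and the function names are hypothetical; only the four key names (build, chromosome, start, stop) come from this article.

```python
from typing import NamedTuple


class GenomicCoordinates(NamedTuple):
    build: str
    chromosome: str
    start: int  # one-based, inclusive
    stop: int   # one-based, inclusive


def from_bed_interval(build, chromosome, bed_start, bed_end):
    """Convert a zero-based, half-open [bed_start, bed_end) interval."""
    if bed_start < 0 or bed_end <= bed_start:
        raise ValueError("invalid BED interval")
    # Only the start shifts by one; the exclusive zero-based end equals
    # the inclusive one-based stop.
    return GenomicCoordinates(build, chromosome, bed_start + 1, bed_end)


def feature_length(coords):
    """Number of bases covered by a fully-closed interval."""
    return coords.stop - coords.start + 1
```

With this convention, a single-nucleotide variant at position 100 has start 100 and stop 100, and a length of one base.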

Genomic datasets can have one or more genome builds representing the data. For example, the ClinVar Variants dataset can be queried in GRCh37 (hg19) or GRCh38 (hg38/hg20) coordinates.

When creating a genomic dataset, you can supply one or more genome builds. SolveBio does not automatically lift over coordinates between builds. A dataset’s genome builds are simply paths to the data. You are responsible for importing the correct data to the correct dataset and genome build path.

 

Reserved Dataset Fields

SolveBio sometimes uses reserved fields for special cases. The following fields are reserved:

  • _commit: Represents a dataset commit ID.
  • _id: Each record has a unique ID within the dataset.
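Before importing, it may be worth checking that your own records do not use the reserved field names. This is a minimal sketch in plain Python; `find_reserved_fields` is a hypothetical helper, and how SolveBio treats such fields on import is not covered here.

```python
# Reserved field names listed in this article.
RESERVED_FIELDS = {"_commit", "_id"}


def find_reserved_fields(record):
    """Return the reserved field names used as keys in a record (a dict)."""
    return sorted(RESERVED_FIELDS.intersection(record))
```

Any record for which this returns a non-empty list should be reviewed before it is imported.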

 

Access Controls

See the article about Data Feed Permissions on SolveBio.

 

Data Management Caveats

Datasets of any size should be managed with care. SolveBio provides you with the tools and best practices for creating and managing your datasets. In the end, you and your team must decide on conventions that work for your projects.

One important thing to remember is that while it is relatively easy to make “backwards compatible” changes to a dataset, it is much harder to undo them, because undoing a change is itself a “breaking change”. It is always worth spending some time upfront to decide on a template (the fields and data types) for your datasets. Once a dataset is created, you can always add new fields by simply importing data with the new fields. However, existing field names and data types cannot be changed, and fields cannot be removed (they can only be hidden from view).
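The difference between backwards-compatible and breaking changes can be made concrete by comparing two versions of a template. The sketch below assumes each template is reduced to a plain dictionary mapping field names to data types, which is a simplification for illustration and not the SolveBio template format; `classify_template_change` is a hypothetical name.

```python
def classify_template_change(old_fields, new_fields):
    """Compare two {field_name: data_type} mappings.

    Added fields are backwards compatible. Removed, renamed, or retyped
    fields are breaking changes (a rename shows up as one removed field
    plus one added field).
    """
    added = sorted(set(new_fields) - set(old_fields))
    removed = sorted(set(old_fields) - set(new_fields))
    retyped = sorted(
        name for name in old_fields
        if name in new_fields and old_fields[name] != new_fields[name]
    )
    return {
        "added": added,
        "removed": removed,
        "retyped": retyped,
        "compatible": not removed and not retyped,
    }
```

Running a comparison like this against a proposed template before importing new data can help a team catch changes that an existing dataset cannot accommodate.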

 

Copying/Cloning Datasets

It is not currently possible to copy an entire dataset through the API. For more information about copying datasets, contact SolveBio Support (support@solvebio.com).

 

 
