How does SolveBio import VCFs?
SolveBio’s import system transforms a VCF (version 4.0 and up) into a standard JSON format. The format is compatible with our standard “Variants” dataset template available on our website. VCF format can vary between tools, and SolveBio’s import system tries to accommodate different formats. However, in some cases it may be desirable to use a custom template and convert the VCF to JSON prior to importing into SolveBio.
In SolveBio’s default VCF importer, each row in the VCF (after the VCF header) is parsed in order, and the various columns are parsed to the best of our ability. The following sections describe how the different elements are parsed:
Some VCF metadata is extracted from the header and stored in the “dataset commit” record, which represents the set of changes made to a dataset such as the records imported from a VCF file.
Chromosome (CHROM)
Dataset field: genomic_coordinates.chromosome
Chromosome values are imported as-is. SolveBio expects the following chromosome values: 1 to 22, X, Y, and MT.
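Because chromosome values are imported as-is, it can help to check a file for unexpected values before importing. The following is a minimal sketch of such a check, not part of SolveBio's importer; the names `EXPECTED_CHROMOSOMES` and `unexpected_chromosomes` are hypothetical.

```python
# Chromosome values this page lists as expected: 1 to 22, X, Y, and MT.
EXPECTED_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def unexpected_chromosomes(values):
    """Return the sorted, de-duplicated values that are not in the expected set."""
    return sorted(set(values) - EXPECTED_CHROMOSOMES)


# Values such as "chr2" or "M" would be imported unchanged, so flag them first.
print(unexpected_chromosomes(["1", "chr2", "X", "M"]))
```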
Position (POS)
Dataset field: genomic_coordinates.start and genomic_coordinates.stop
The position column (POS) in a VCF is used as the “start” coordinate in a SolveBio dataset (within the “genomic_coordinates” field). The “stop” coordinate is the “start” position plus the length of the reference allele, minus 1. For a SNV, the “start” and “stop” values are therefore equal.
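The coordinate rule can be sketched in Python as follows. This is an illustration of the arithmetic only, not SolveBio's importer code; the function name `vcf_coordinates` is hypothetical.

```python
def vcf_coordinates(pos: int, ref: str) -> dict:
    """Derive start/stop from a VCF POS value and REF allele."""
    start = pos
    # stop = start + length of the reference allele - 1
    stop = start + len(ref) - 1
    return {"start": start, "stop": stop}


print(vcf_coordinates(12345, "A"))    # single-base REF: start and stop are equal
print(vcf_coordinates(12345, "ACG"))  # three-base REF: stop is start + 2
```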
Row ID (ID)
Dataset field: row_id
The row ID is a unique identifier (typically a dbSNP rs ID). This value is preserved as-is.
Reference Alleles (REF)
Dataset field: allele
The reference bases are expected to be one of the following: A, C, G, T, N. The value is preserved as-is.
Alternate Alleles (ALT)
Dataset field: alternate_alleles
If the row contains multiple alternate alleles (ALTs), the row is duplicated for each allele. For this reason, the number of rows indexed may be larger than the number of rows in your original VCF file. The values are preserved as-is.
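The row duplication can be pictured with a short Python sketch. It assumes the row is held as a dictionary keyed by VCF column name and relies on the VCF convention that multiple ALT values are comma-separated; `expand_alternate_alleles` is a hypothetical helper, and how SolveBio stores each duplicated row internally is not shown here.

```python
def expand_alternate_alleles(row: dict) -> list:
    """Return one copy of the row per comma-separated ALT value."""
    alts = row["ALT"].split(",")
    # Every other column is copied unchanged into each new row.
    return [{**row, "ALT": alt} for alt in alts]


row = {"CHROM": "1", "POS": 12345, "ID": "rs123", "REF": "A", "ALT": "G,T"}
for expanded in expand_alternate_alleles(row):
    print(expanded)
```

In this example one VCF row yields two rows, which is why the indexed row count can exceed the row count of the original file.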
Info Field (INFO)
Dataset field: info
The info field (INFO) sometimes contains a semicolon-separated series of short keys with optional values. SolveBio’s VCF importer will attempt to convert these key/value pairs into a nested dataset field.
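As a rough illustration of this kind of conversion, the sketch below turns an INFO string into a Python dictionary. It is an assumption about the general approach, not SolveBio's actual parser: `parse_info` is a hypothetical name, values are kept as strings, keys without a value are stored as `True`, and comma-separated values become lists.

```python
def parse_info(info: str) -> dict:
    """Convert a semicolon-separated INFO string into a dictionary."""
    if info in ("", "."):
        # "." is the VCF placeholder for a missing value.
        return {}
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            parsed[key] = True              # key with no value (a flag)
        elif "," in value:
            parsed[key] = value.split(",")  # multiple values
        else:
            parsed[key] = value
    return parsed


print(parse_info("DP=14;AF=0.5,0.25;DB"))
```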
Quality Score (QUAL)
The phred-scaled quality score column (QUAL) is not currently parsed.
Filter Field (FILTER)
The filter column (FILTER) is not currently parsed.
Do you save zygosity information?
The default VCF parser currently does not transfer zygosity information during an import. If you need this data, contact SolveBio Support for help.
What about VCFs with multiple samples?
The default VCF parser currently does not distinguish between single-sample and multi-sample VCF files. If you need to handle multiple samples in a specific way, please contact SolveBio Support.
What fields are extracted from the header?
The following header fields are extracted from a VCF:
Contact SolveBio Support if you want additional fields to be extracted.
What is the largest VCF that I can import?
SolveBio has tested VCF files with 1 million variants; however, it may be possible to import larger files. Feel free to try out any file size.
What happens after a VCF is imported?
After VCFs are fully imported, you'll be able to browse the variants in the dataset chosen during import. The uploaded VCF files can then be safely deleted.
Can I upload JSON format?
SolveBio supports JSON uploads. You may parse your VCF files locally and upload them as JSON.
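A local conversion might look something like the Python sketch below, which prints one JSON object per variant. The field names are taken from the dataset fields described on this page, but the exact record layout your dataset template expects is an assumption, as is the hypothetical function name `vcf_lines_to_records`. The sketch assumes tab-separated rows with at least the eight standard VCF columns and leaves the INFO column as a raw string.

```python
import json


def vcf_lines_to_records(lines):
    """Yield one dict per data row and ALT allele; header lines are skipped."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and "##"/"#CHROM" header lines
        chrom, pos, row_id, ref, alt, _qual, _filter, info = line.split("\t")[:8]
        start = int(pos)
        for allele in alt.split(","):
            yield {
                "genomic_coordinates": {
                    "chromosome": chrom,
                    "start": start,
                    "stop": start + len(ref) - 1,
                },
                "row_id": row_id,
                "allele": ref,
                "alternate_alleles": allele,
                "info": info,  # raw INFO string, not parsed in this sketch
            }


# A tiny in-memory VCF used in place of a real file.
sample = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t12345\trs123\tA\tG,T\t50\tPASS\tDP=14\n",
]
for record in vcf_lines_to_records(sample):
    print(json.dumps(record))
```

To convert a real file, pass an open file handle instead of `sample`, since iterating over a text file yields its lines.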