How to import a large amount of .bed, .gff, .vcf, .paf, .sam files into an SQL database?

Question

Are there best practices to load different bioinformatics file formats such as VCF, BED, GFF, and SAM to SQL databases? I am wondering how people out there do that efficiently.

All of these three formats are tab-separated files, so basically the following should work. I feel weird about it since most people I know don't use MySQL to work with these files.

LOAD DATA LOCAL INFILE 'bed.bed'
INTO TABLE bed-file
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS (list of the columns)
SET creation_date  = STR_TO_DATE(@creation_date, '%m/%d/%y');

Did you ever find a satisfactory answer to this question? I'd love to hear more about what sort of solution, if any, you ended up with. — Jessime Kirk, Nov 13 '22 at 04:07
@JessimeKirk not really. I've heard recently about Glow. Maybe it will be helpful to you. — 0x90, Nov 13 '22 at 05:29

Alex Reynolds · Answer 1 · 2021-11-09T00:02:30.307

The clodius tool puts tab-delimited data from a BED file into an rtree in a sqlite3 database:

https://github.com/higlass/clodius/blob/c98bb16ade93402fcea3b749d705c52ea165b609/clodius/cli/aggregate.py#L457-L777

Data are stored in this database at multiple "zoom levels", which allows for fast queries at wide genomic extents:

https://github.com/higlass/clodius/blob/develop/clodius/tiles/beddb.py

Aggregation (input) and querying (output) seem efficient in practice, but efficiency may depend on aspects of your specific dataset, i.e., whether you mean efficiency in execution time of INSERT or SELECT statements, or in space efficiency in the overall database size.

Other data can be converted to BED via vcf2bed, sam2bed, gff2bed, etc. and then aggregated. This is not the only way to do things, of course, but hopefully this can suggest directions to explore.

score 1 · Answer 2 · answered Nov 09 '21 at 19:52

You might also take a look at this genomics-focused sqlite extension: https://github.com/mlin/GenomicSQLite

As described in the repository README:

This SQLite3 loadable extension adds features to the ubiquitous embedded RDBMS supporting applications in genome bioinformatics:

genomic range indexing for overlap queries & joins

in-SQL utility functions, e.g. reverse-complement DNA, parse "chr1:2,345-6,789"

automatic streaming storage compression (also available standalone)

reading directly from HTTP(S) URLs (also available standalone)

pre-tuned settings for "big data"

How to import a large amount of .bed, .gff, .vcf, .paf, .sam files into an SQL database?

2 Answers2

Linked