17

I would like to convert a BED format to GFF3.

The only useful tool that I could find via a google search seems to be Galaxy, and I do not feel very comfortable with online tools, plus the webserver is currenlty under maintenance.

Does anyone knows about a command-line tool that can handle this conversion?

Edit: here are some lines of my BED file:

$ head -4 last_minion-r7_sort.bed
211000022278137 175 211 8e5d0959-7cdb-49cf-9298-94ed3b2aedb5_Basecall_2D_000_2d 42  +
211000022279134 0   503 e8a9c6b8-bad2-4a7e-97d8-ca4acb34ff70_Basecall_2D_000_2d 69  -
211000022279134 24  353 e258783d-95a3-41f5-9ad5-bb12311dbaf4_Basecall_2D_000_2d 45  -
211000022279134 114 429 26601afb-581a-41df-b42b-b366148ea06f_Basecall_2D_000_2d 100 -

The bed file thus has 6 columns as for now: chromosome, start coordinate, end coordinate, read name, score, strand. This file was obtained from conversion of MAF format (as output of alignment of RNA-seq reads to reference genome, using LAST) converted to SAM using maf-convert, then to BAM using samtools, finally to BED using bedtools.

The aim of my conversion is basically to convert SAM -> GTF, for post-processing. Since there is no straightforward way to do this, I am going through steps, the only way to do this in my knowledge is : SAM -> BAM -> BED -> GFF3 -> GTF but for now I am stuck in the BED -> GFF3 part.

aechchiki
  • 2,676
  • 11
  • 34
  • 2
    There isn't necessarily a very meaningful way to do that. Can you post a few lines from your BED file? Also, what's the end goal? – Devon Ryan Aug 08 '17 at 10:00
  • 2
    Yes, it would be helpful to see some of the BED. Is it BED12 or BED6, what does it represent? In GFF3 who have to say what your intervals represent. No reason it can't be done though. Oh, and are you happy with code? – Ian Sudbery Aug 08 '17 at 10:47
  • 2
    @DevonRyan I have edited the question, hopefully answers your comments. – aechchiki Aug 08 '17 at 11:09
  • 2
    @AminaEchchiki While I could provide code (or you could just use the tool in Galaxy), I don't think the information from the SAM file is being accurately transmitted to BED in a way that's needed to produce a useful GFF3 or GTF file. I strongly recommend you write a tool to directly produce a GTF/GFF from either SAM or MAF (presuming none exist). – Devon Ryan Aug 08 '17 at 11:15
  • 1
    if this is an organism that has splicing, your convsion to bed is unlikely to be accurate. I will post BAM -> GTF code and BED to GFF3, but let me know if your reads are spliced/ – Ian Sudbery Aug 08 '17 at 11:29
  • 1
    @IanSudbery yes my reads are spliced (D. melanogaster) – aechchiki Aug 08 '17 at 11:30
  • 3
    Right, in which case you should generate your bed files with bamtobed --bed12, but really you should go direct from BAM to GFF/GTF. – Ian Sudbery Aug 08 '17 at 11:35
  • 1
    @IanSudbery thanks for the tip, will do that – aechchiki Aug 08 '17 at 11:36

7 Answers7

8

To answer the question as asked, for people googling. For BED6, in python:

#contigs.tsv contians chromosome names and lengths in two columns
for line in open("contigs.tsv"):
   fields = line.strip().split("\t")
   print fields[0], ".", "contig","1",str(fields[1]), ".", "+", ".", "ID=%s" % fields[0]

for line in open("my_bed_file.bed"): fields = line.strip().split("\t")

note: BED is 0-based, half-open, GFF is 1-based, closed

start = str(int(fields[1]) + 1) print fields[0], "bed", "interval", fields[1], fields[2], fields[4], fields[5], ".", "ID=%s;parent=%s" % (fields[3], fields[0])

For bed12, in python:

#contigs.tsv contians chromosome names and lengths in two columns
for line in open("contigs.tsv"):
   fields = line.strip().split("\t")
   print fields[0], ".", "contig","1",str(fields[1]), ".", "+", ".", "ID=%s" % fields[0]

for line in open("my_bed12.bed"):

fields = line.strip().split("\t") contig = fields[0]

note: BED is 0-based, half-open, GFF is 1-based, closed

start= int(fields[1]) + 1) end=fields[2] name=fields[3] score=fields[4] strand=fields[5] print contig, "bed", "interval", str(start), end, score, strand, ".", "ID=%s;parent=%s" % (name, contig)

block_starts = map(int,fields[11].split(",")) block_sizes = map(int, fields[10].split(","))

for (block, (bstart, blen)) in enumerate(zip(block_starts, block_sizes)): bend = start + bstart + blen print contig, "bed", "block", str(start + bstart), str(bend), score, strand, ".", "ID=%s_%i;parent=%s" %(name, block, name)

Ram RS
  • 2,297
  • 10
  • 29
Ian Sudbery
  • 3,311
  • 1
  • 11
  • 21
8

Galaxy has API and API-consuming libraries (such as BioBlend) that will allow you to interactively script against it without opening the graphical interface at all.

However you can also take almost any tool out of Galaxy and use it independently since everything is open source. The converter you mentioned is available as a Python script here and the tool 'wrapper' which you can use to understand how to invoke the python script is next to it.

Severian
  • 196
  • 3
5

To convert BAM to GTF, which is the best way to get a file to compare with cuffcompare:

import pysam

bamfile=pysam.AlignmentFile("my_bam_file.bam")

for alignment in bamfile.fetch():

if alignment.is_unmapped:
    continue

contig = bamfile.get_reference_name(alignment.reference_id)
name = alignment.query_name

if alignment.is_reverse:
    strand = "-"
else: 
    strand = "+"

for start, end in alignment.getblocks():
    # note: BED is 0-based, half-open, GFF is 1-based, closed
    print contig, "BAM", "exon", str(start + 1), str(end), "0", strand, ".", 'gene_id "%s"; transcript_id "%s";' % (name, name)

NB, this will also work with a SAM file as long as it is headered.

Ram RS
  • 2,297
  • 10
  • 29
Ian Sudbery
  • 3,311
  • 1
  • 11
  • 21
  • 2
    As a note for future googlers, if your bam file comes from a pipeline producing unmapped BAMs, the code as it is will return error ValueError: file header is empty (mode='rb') - is it SAM/BAM format?. As a workaround it is necessary to add options check_header=False, check_sq=False while reading the bam.
    @IanSudbery might be useful to add it as a comment to your code
    – aechchiki Aug 10 '17 at 07:21
  • 3
    @Amina Echchiki, this error is not because your BAM contains unmapped reads, but because your BAM has no header. BAMs without headers are not valid BAM files. I am surprised fetching the contig name worked if the file has no header, even if you set check_header=False, check_sq=False because it uses the header to do so. – Ian Sudbery Aug 10 '17 at 13:38
4

Bioconductor makes this so easy. It does the coordinate conversion on import.

library(rtracklayer)

import the bed file

bed.ranges <- import.bed('regions.bed')

export as a gff3 file

export.gff3(bed.ranges,'regions.gff3')

And people wonder why R is so popular for bioinformatics...

Also if needed, you could go straight from BAM file to gff3 in R as well.

Ram RS
  • 2,297
  • 10
  • 29
story
  • 1,573
  • 1
  • 8
  • 15
  • 3
    As long as you have enough memory to hold the whole bed in memory at once (not a given if it comes from a BAM) this is indeed an easy option. – Ian Sudbery Aug 09 '17 at 17:21
3

I recently developed bed2gff to quickly convert .bed files to a gff3 format, a tool written in Rust. Could be of help here!

  • 3
    Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Community Oct 03 '23 at 23:49
2

If anybody needs a simple bed to gff converter, simple in the sense that only takes a bed with a single type of feature, lets say peaks and translate it to GFF I recently wrote this, is Bash and as far as I know all the tools it uses should be present in most Linux / Mac operating systems.

https://github.com/UBMI-IFC/bed2gff/tree/main

It cannot deal, however with multi featured Bed files as it ignores everything from the 7th column and beyond.

0

You could also use agat_convert_bed2gff. agat is generally a very useful tool when needing to convert from / to / between gff formats. See here how to use it to convert from bed to gff. https://agat.readthedocs.io/en/latest/tools/agat_convert_bed2gff.html

Niklas
  • 106
  • 5