Most Popular

1500 questions
6
votes
2 answers

BERT Language Model and Gene Sequences - How Do I Relate Clusters of Sequences?

I hope you'll indulge a question from a computer scientist with limited bioinformatics knowledge. I've been working with the Google tool for language modeling called BERT. It's generally regarded as state of the art when encoding language. It…
simusid
  • 161
  • 5
6
votes
4 answers

Is there a command line tool to split a SAM/BAM file by CB (cell barcode) tag?

I have a BAM file from a single cell sequencing experiment. Each read has had the cell barcode annotated in the CB tag. Some reads do not have a CB tag. For each of an arbitrary set of barcodes, I would like to split out the reads for each of those…
winni2k
  • 2,266
  • 11
  • 28
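
A minimal sketch of the splitting step described in the excerpt above, not an existing command line tool. It assumes the alignments are available as plain SAM text (for example from `samtools view -h`), and that the barcode sits in a `CB:Z:` optional field. The function name `split_by_cell_barcode` and its parameters are hypothetical, chosen for illustration only.

```python
from collections import defaultdict


def split_by_cell_barcode(sam_lines, wanted_barcodes):
    """Group SAM alignment lines by CB tag, keeping only the wanted barcodes.

    Returns (header_lines, {barcode: [alignment lines]}).
    Reads without a CB tag, or with a barcode outside the set, are skipped.
    """
    wanted = set(wanted_barcodes)
    header = []
    reads = defaultdict(list)
    for raw in sam_lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header lines are collected so they can be reused for every output.
            header.append(line)
            continue
        fields = line.split("\t")
        # Optional TAG:TYPE:VALUE fields follow the 11 mandatory SAM columns.
        barcode = next(
            (f[5:] for f in fields[11:] if f.startswith("CB:Z:")), None
        )
        if barcode in wanted:
            reads[barcode].append(line)
    return header, dict(reads)
```

Each group could then be written out together with the shared header and converted back to BAM. For large files, a streaming approach that writes each read as it is seen would avoid holding every read in memory.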
6
votes
2 answers

Theoretical limit of human genome compression

How small can a compressed file containing the human genome be? I'm aware that this question cannot actually be answered, since it is asking for the Kolmogorov complexity of the human genome, which is not computable. So reformulating: What is the…
Rexcirus
  • 171
  • 3
6
votes
1 answer

Index a BAM file using pysam

(How) can you index a BAM file using pysam? When I tried the intuitive pysam.index I got: import pysam my_bam = pysam.AlignmentFile("regular_bwamem_mapping.bam",…
Kamil S Jaron
  • 5,542
  • 2
  • 25
  • 59
6
votes
3 answers

How can I compute gene expression for a set of RNA reads?

I'm trying to compute a gene expression profile for an organism. I have gene nucleotide sequences of the mentioned organism stored in a fasta file and a set of paired reads stored in two separate files with the same fasta format. Now I want to…
hhoomn
  • 325
  • 1
  • 5
6
votes
1 answer

Y Chromosome Aligned Reads in scATAC-seq data from a female-derived cell line?

I'm working with scATAC-Seq data on the K562 cell line, which is supposed to be derived from a female patient. While following the scATAC-seq data analysis pipeline, after performing bowtie alignment they recommend filtering out all reads aligned to…
OrdiNeu
  • 140
  • 6
5
votes
3 answers

How to append numbers only on duplicate sequence names?

I have a reference database which contains 100s of sequences in fasta format. Some of these sequences have duplicate names like so: >1_uniqueGeneName atgc >1_anotherUniqueGeneName atgc >1_duplicateName atgc >1_duplicateName atgc Is it possible to…
AudileF
  • 955
  • 8
  • 25
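
One way to approach the renaming described in the excerpt above, as a minimal sketch. The excerpt is truncated before the desired output is shown, so the suffix format (`_1`, `_2`, ...) is an assumption, as are the function name `number_duplicate_headers` and the `sep` parameter. The whole header line is treated as the sequence name.

```python
from collections import Counter


def number_duplicate_headers(lines, sep="_"):
    """Append a running number to FASTA headers that occur more than once.

    Headers that are unique, and all sequence lines, are returned unchanged.
    """
    lines = [line.rstrip("\n") for line in lines]
    # First pass: count how often each header occurs.
    totals = Counter(line for line in lines if line.startswith(">"))
    seen = Counter()
    renamed = []
    # Second pass: number only the headers that were seen more than once.
    for line in lines:
        if line.startswith(">") and totals[line] > 1:
            seen[line] += 1
            renamed.append(f"{line}{sep}{seen[line]}")
        else:
            renamed.append(line)
    return renamed
```

Two passes are used so that the first occurrence of a duplicated name also receives a number. Note that a generated name could still collide with a header that already ends in the same suffix.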
5
votes
3 answers

Obtaining all protein sequences with a particular domain architecture from Pfam

I want to get the alignment of chain A of 1kf6 (PDB ID) from the Pfam database here. This protein chain has two main domains (FAD_binding_2 and Succ_DH_flav_C). In Pfam there is a link to one of these domains and after clicking in one of the…
Sara
  • 777
  • 1
  • 6
  • 18
5
votes
1 answer

Obtaining data table headers from GEO using GEOquery

For a study in GEO, I would like to obtain the data table header descriptions, specifically the "VALUE" column for all samples in the study. If you go here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99511 and then scroll down and click…
5
votes
2 answers

samtools mpileup skipping read

I run the command: samtools mpileup -O -s -q20-B -Q20 -f hg19.fa -r chr1:569929-569931 myFile.bam and get: chr1 569929 G 7 ...,,., EEEEEEE NTTVTTU 53,48,42,60,30,29,27 chr1 569930 C 6 ...,., EEAEEE NTTTTU 54,49,43,31,30,28 chr1 …
aerijman
  • 645
  • 5
  • 14
5
votes
1 answer

How can I dock a protein to a nucleic acid?

I have a protein of interest and I would like to know how it interacts with RNA. I have structures of both molecules. What tool can I use?
John Deo
  • 53
  • 3
5
votes
2 answers

High percentage of poly A sequences in 10X chromium R2 read

I'm currently analyzing two samples of eosinophil cells isolated from mouse lung and the samples are of very different quality. According to the Cell Ranger summary 56% of the reads can be mapped to the transcriptome in the first sample and only 32%…
PPK
  • 886
  • 4
  • 13
5
votes
1 answer

design formula question

I am not sure I am building the proper design formula for the question I want to answer. I have the following samples with three factors: clone, structure, and condition. clone structure diabetic 1 07 2D Dia 2 21 2D …
5
votes
2 answers

Tool for predicting interactions in the cell

What tools are available to predict, based on the structure of a certain protein, its interactions within a cell? For example, I am considering the split GFP protein and I am trying to predict if there will be any interactions which might hinder its…
jaslibra
  • 524
  • 2
  • 9
5
votes
2 answers

How can I locate duplicated regions in a sequence?

I am facing an issue when trying to align short reads against a region in human chr5. The two Survival of Motor Neuron genes (SMN1 and SMN2) are almost 100% identical, and this causes the aligner to fail to align reads correctly since each read matches…
terdon
  • 10,071
  • 5
  • 22
  • 48