Most Popular
1500 questions
6
votes
2 answers
BERT Language Model and Gene Sequences - How Do I Relate Clusters of Sequences?
I hope you'll indulge a question from a computer scientist with limited bioinformatics knowledge. I've been working with the Google tool for language modeling called BERT. It's generally regarded as state of the art when encoding language. It…
simusid
- 161
- 5
6
votes
4 answers
Is there a command line tool to split a SAM/BAM file by CB (cell barcode) tag?
I have a BAM file from a single cell sequencing experiment. Each read has had the cell barcode annotated in the CB tag. Some reads do not have a CB tag. For each of an arbitrary set of barcodes, I would like to split out the reads for each of those…
winni2k
- 2,266
- 11
- 28
6
votes
2 answers
Theoretical limit of human genome compression
How small can a compressed file containing the human genome be?
I'm aware that this question cannot actually be answered, since it is asking for the Kolmogorov complexity of the human genome, which is not computable. So reformulating:
What is the…
Rexcirus
- 171
- 3
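For scale on the genome compression question above, a naive baseline is a fixed 2 bits per base (A, C, G, T) with no modelling at all. The sketch below works through that arithmetic. The genome length of roughly 3.1 billion bases is an approximate figure assumed here, not taken from the question, and the function name is hypothetical. Any real compressor, and the theoretical limit the question asks about, would come in below this number.

```python
def two_bit_size_bytes(n_bases: int) -> int:
    """Bytes needed to store n_bases at a fixed 2 bits per base, rounded up."""
    return (2 * n_bases + 7) // 8


# Assumption: approximate haploid human genome length, for illustration only.
GENOME_BASES = 3_100_000_000

size = two_bit_size_bytes(GENOME_BASES)
print(f"Naive 2-bit encoding: about {size / 1e6:.0f} MB")
```

This ignores ambiguous bases (N), case, and any header metadata, all of which a real file format would have to handle.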
6
votes
1 answer
Index a BAM file using pysam
(How) can you index a BAM file using pysam?
When I tried the intuitive pysam.index I got:
import pysam
my_bam = pysam.AlignmentFile("regular_bwamem_mapping.bam",…
Kamil S Jaron
- 5,542
- 2
- 25
- 59
6
votes
3 answers
How can I compute gene expression for a set of RNA reads?
I'm trying to compute a gene expression profile for an organism. I have gene nucleotide sequences of the mentioned organism stored in a FASTA file and a set of paired reads stored in two separate files in the same FASTA format. Now I want to…
hhoomn
- 325
- 1
- 5
6
votes
1 answer
Y Chromosome Aligned Reads in scATAC-seq data from a female-derived cell line?
I'm working with scATAC-seq data on the K562 cell line, which is supposed to be derived from a female patient. While following the scATAC-seq data analysis pipeline, after performing Bowtie alignment they recommend filtering out all reads aligned to…
OrdiNeu
- 140
- 6
5
votes
3 answers
How to append numbers only to duplicate sequence names?
I have a reference database which contains hundreds of sequences in FASTA format. Some of these sequences have duplicate names, like so:
>1_uniqueGeneName
atgc
>1_anotherUniqueGeneName
atgc
>1_duplicateName
atgc
>1_duplicateName
atgc
Is it possible to…
AudileF
- 955
- 8
- 25
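One way to approach the duplicate-name question above is a two-pass scan: count how often each header occurs, then append a running number only to headers that occur more than once. This is a minimal sketch, not the asker's or any answerer's method. The function name and the `_1`, `_2` suffix format are assumptions, since the excerpt is cut off before the desired output is shown.

```python
from collections import Counter


def number_duplicate_names(lines):
    """Append _1, _2, ... to FASTA headers that occur more than once.

    Headers that occur exactly once, and all sequence lines, are left as-is.
    """
    lines = [line.rstrip("\n") for line in lines]
    # First pass: how many times does each header appear?
    totals = Counter(line for line in lines if line.startswith(">"))
    seen = Counter()
    out = []
    # Second pass: number only the headers that are duplicated.
    for line in lines:
        if line.startswith(">") and totals[line] > 1:
            seen[line] += 1
            out.append(f"{line}_{seen[line]}")
        else:
            out.append(line)
    return out


records = [
    ">1_uniqueGeneName", "atgc",
    ">1_anotherUniqueGeneName", "atgc",
    ">1_duplicateName", "atgc",
    ">1_duplicateName", "atgc",
]
renamed = number_duplicate_names(records)
print("\n".join(renamed))
```

The same function accepts an open file handle in place of the list, since it only iterates over lines. It holds the whole file in memory, which is reasonable for a database of a few hundred sequences.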
5
votes
3 answers
Obtaining all protein sequences with a particular domain architecture from Pfam
I want to get the alignment of chain A of 1kf6 (PDB ID) from the Pfam database here. This protein chain has two main domains (FAD_binding_2 and Succ_DH_flav_C). In Pfam there is a link to one of these domains and after clicking on one of the…
Sara
- 777
- 1
- 6
- 18
5
votes
1 answer
Obtaining data table headers from GEO using GEOquery
For a study in GEO, I would like to obtain the data table header descriptions, specifically the "VALUE" column for all samples in the study.
If you go here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99511
and then scroll down and click…
Ezra Bekele
- 83
- 4
5
votes
2 answers
samtools mpileup skipping read
I run the command:
samtools mpileup -O -s -q20 -B -Q20 -f hg19.fa -r chr1:569929-569931 myFile.bam
and get:
chr1 569929 G 7 ...,,., EEEEEEE NTTVTTU 53,48,42,60,30,29,27
chr1 569930 C 6 ...,., EEAEEE NTTTTU 54,49,43,31,30,28
chr1 …
aerijman
- 645
- 5
- 14
5
votes
1 answer
How can I dock a protein to a nucleic acid?
I have a protein of interest and I would like to know how it interacts with RNA. I have structures of both molecules.
What tool can I use?
John Deo
- 53
- 3
5
votes
2 answers
High percentage of poly-A sequences in 10x Chromium R2 read
I'm currently analyzing two samples of eosinophil cells isolated from mouse lung and the samples are of very different quality.
According to the Cell Ranger summary, 56% of the reads can be mapped to the transcriptome in the first sample and only 32%…
PPK
- 886
- 4
- 13
5
votes
1 answer
design formula question
I am not sure I am building the proper design formula for the question I want to answer.
I have the following samples with three factors: clone, structure, and condition.
clone structure diabetic
1 07 2D Dia
2 21 2D …
mario rossi
- 53
- 3
5
votes
2 answers
Tool for predicting interactions in the cell
What tools are available to predict, based on the structure of a certain protein, its interactions within a cell?
For example, I am considering the split GFP protein and I am trying to predict if there will be any interactions which might hinder its…
jaslibra
- 524
- 2
- 9
5
votes
2 answers
How can I locate duplicated regions in a sequence?
I am facing an issue when trying to align short reads against a region in human chr5. The two Survival of Motor Neuron genes (SMN1 and SMN2) are almost 100% identical, and this causes the aligner to fail to align reads correctly since each read matches…
terdon
- 10,071
- 5
- 22
- 48