
There is high-throughput sequencing data here, and I don't know what format it is in.

It was submitted in 2009, and the description says the following:

  • Library strategy: ncRNA-Seq

  • Library source: transcriptomic

  • Library selection: size fractionation

  • Instrument model: Illumina Genome Analyzer II

  • Description: CIPPNK, tar file of Illumina *_seq.txt files provided as supplementary file

I got the archive here:

ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM455nnn/GSM455387/suppl/GSM455387%5FWT%5FCIPPNK%5Fseq%5Fs1%2Etar%2Egz

Inside, there are 330 files from s_1_0001_seq.txt to s_1_0330_seq.txt which are tab-delimited text files where the first column is always 1, the second has the number found in the file name, then 2 mysterious integers, and then what looks like a read of length 36, with sometimes a dot instead of a letter:

$ head s_1_0330_seq.txt 
1   330 690 785 TTCCTACATTGTTCCCCCATGCTGTTGGCACCATCA
1   330 44  145 TTTTTATCACGAGTTTTAAATCTGTAGTCACCATCA
1   330 53  141 AATAATGCATAACAAAACGGAATCTGTAGAA.AAA.
1   330 784 461 TAATTGTAGTGATTGATCAATCTGTAGGCACCATCA
1   330 588 634 TATTATGCACATTTTCTAGTTCACTGTAGGCACCAT
1   330 718 678 TTACATGTTTCGGGTAGGAGCCTGTAGGCACCATCA
1   330 635 834 TGTGATCATTAGTTCAAAGCCCCCTGTCGGCACCCT
1   330 494 523 TGAAAATCAAAAATGCTGAACTGTAGGCACCATCAA
1   330 393 783 TTTTTTTTTAAATTTAAAAAAACTGTAGGCACCATC
1   330 48  148 GTTTAACCGTGTAGACGTTGGTTTCTGTAGGCACCA

This seems to be some "old" high-throughput sequencing format. I think someone in a 2008 message on SEQanswers was dealing with this type of file:

http://seqanswers.com/forums/showpost.php?p=1841&postcount=8

What is this format that seemed so standard back then that the authors gave no more information than describing the files as "Illumina *_seq.txt files"? I do not dare ask them such a trivial question (the given contact is a Nobel Prize laureate and is probably too busy to answer random bioinformatics questions).

In particular, what are the columns 3 and 4, and what do the dots mean?

terdon
bli

1 Answer


This is an early Solexa/Illumina sequencer format. The first four columns identify the location of the cluster on the flowcell. I believe the "." was the original placeholder for an unread base, which has been replaced by an "N" in current Illumina sequencing output.

From http://www.crg.eu/en/content/processing-and-analysis-illumina-sequencing-data:

seq.txt (Gerald)

4 1 23 1566 ACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCG

The first 4 columns are an ID that provides location details of the cluster in the flow cell, followed by the read sequence. Seq.txt file size typically ranges between 250 Megabytes and 1 Gigabyte. One file is generated per lane.
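
As a rough illustration of the layout described above, here is a minimal Python sketch that splits one line into its fields. The field names `lane`, `tile`, `x` and `y` are my assumption about what the four ID columns mean (the quoted description only says they locate the cluster in the flow cell), and `parse_seq_line` is a hypothetical helper, not part of any Illumina tool.

```python
from typing import NamedTuple


class SeqRecord(NamedTuple):
    lane: int   # assumed meaning of column 1
    tile: int   # assumed meaning of column 2
    x: int      # assumed meaning of column 3
    y: int      # assumed meaning of column 4
    read: str


def parse_seq_line(line: str) -> SeqRecord:
    """Split one whitespace-delimited *_seq.txt line into its five fields."""
    lane, tile, x, y, read = line.split()
    # "." marks an unread base; rewrite it as "N", as current output does.
    return SeqRecord(int(lane), int(tile), int(x), int(y), read.replace(".", "N"))


record = parse_seq_line("1\t330\t53\t141\tAATAATGCATAACAAAACGGAATCTGTAGAA.AAA.")
print(record)
```

Splitting on any whitespace rather than strictly on tabs is a deliberate choice here, so the same function also accepts space-separated lines like the example quoted above.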

Bioathlete
  • Ah, so if the authors did not provide other files, I guess the only way I can convert this to fastq is to choose arbitrary qualities. – bli Oct 31 '17 at 15:12
  • Correct, this format did not contain quality scores. There was a qseq.txt file that did. Although, to be honest, back then the quality scores were not that meaningful, as the Illumina algorithm was not very accurate in assigning them. – Bioathlete Oct 31 '17 at 16:23
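
Following up on the comments, a conversion to FASTQ with arbitrary qualities could be sketched as below. This is only one possible approach: the placeholder quality character `I` (Phred 40 under Phred+33 encoding), the read-name scheme built from the four ID columns, and the function name `seq_to_fastq` are all assumptions made for illustration.

```python
PLACEHOLDER_QUAL = "I"  # arbitrary choice: Phred 40 in Phred+33 encoding


def seq_to_fastq(lines, qual_char=PLACEHOLDER_QUAL):
    """Yield one four-line FASTQ record per *_seq.txt line."""
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        *location, read = fields
        read = read.replace(".", "N")  # unread bases become N
        name = ":".join(location)      # hypothetical read name from the ID columns
        yield f"@{name}\n{read}\n+\n{qual_char * len(read)}\n"


example = ["1\t330\t53\t141\tAATAATGCATAACAAAACGGAATCTGTAGAA.AAA.\n"]
print("".join(seq_to_fastq(example)), end="")
```

To convert a real file, pass an open file handle instead of the example list and write the yielded records to the output file. Since the qualities are made up, downstream tools that weight by base quality should not be trusted to do anything meaningful with them.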