Tutorial: Using Fasta/Fastq parsers

Fasta and fastq are the de facto standard formats for raw reads generated by high-throuput sequencing machines. Fasta files do not have a quality score associated with each base, while fastq do, the so-called “phred score”. HTSeq has dedicated parsers for them and their gzipped counterparts.

This tutorial will walk you through a few routine operations with fasta/fastq files. The example files used here are all in the example_data folder of the HTSeq repository on GitHub.

Parsing a fasta file

Let’s start by reading a fasta file using FastaReader:

>>> import HTSeq
>>> f = HTSeq.FastaReader('fastaEx.fa')

You can then iterate over the sequences:

>>> for read in f:
...     print(read)
AGTACGTAGTCGCTGCTGCTACGGGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTAAACGATGC
AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG

Each read is a Sequence object with a few properties, notably name, seq, and descr (for “description”). Notice that seq is not a Python string but a bytestring instead, i.e. each character only uses exactly 1 byte. This is for efficiency purposes. To convert each read.seq into a string, you can use the standard decode method:

>>> read.seq
b'AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG'
>>> read.seq.decode()
'AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG'

Although opening files like this is possible, FastaReader supports context management, which is safer because it automatically closes the file for you:

>>> with HTSeq.FastaReader('fastaEx.fa') as f:
...     for read in f:
...         print(read)

Note

HTSeq does not load the whole file in memory, so you can parse files much larger than your computer RAM as long as you don’t store the reads along the way.

Parsing a fastq file

Fastq files are handled approximately in the same way, but use FastqReader:

>>> with HTSeq.FastqReader('fastqEx.fastq') as f:
...     for read in f:
...         print(read.seq)
...         print(read.qualstr)
b'AGTACGTAGTCGCTGCTGCTACGGGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTAAACGATGC'
b'GGGGGGGGGGGGGGGGGGGGGGGGGCGBBBBBBBBBBBBBBBHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH'
b'AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG'
b'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBDDDDDDDDD???????????????????????'

Of course, each read has now quality scores (that’s the whole point of fast**Q**), which are also a bytestring (again, for efficiency reason). It is also customary to use numeric values for the base-by-base qualities or phred scores:

>>> read.qual
array([33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
    33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
    33, 33, 33, 33, 35, 35, 35, 35, 35, 35, 35, 35, 35, 30, 30, 30, 30,
    30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,
    30, 30], dtype=uint8)

Instead of returning a bytestring, read.qual returns a numpy array of unsigned integers (i.e. bytes) containing the numeric value of the quality score in the usual phred or 33-based ASCII encoding. This number q can be related to the likelihood of a mistake in the base calling by the formula \(q = -10 * log_{10}[ P_{mistake} ]\). Of course, the higher the quality, the lower the chance of it being a mistake (since the log is a monotonically increasing function).

Parsing gzip-compressed files

Both parsers handle gzipped files, which are commonly used to save disk space, transparently:

>>> with HTSeq.FastqReader('fastqExgzip.fastq.gz') as f:
...     for read in f:
...         print(read.seq)
...         print(read.qualstr)
...
...
b'AGTACGTAGTCGCTGCTGCTACGGGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTAAACGATGC'
b'GGGGGGGGGGGGGGGGGGGGGGGGGCGBBBBBBBBBBBBBBBHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH'
b'AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG'
b'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBDDDDDDDDD???????????????????????'

Note

Although some technologies produce reads with constant length (e.g. Illumina), in other scenarios (e.g. Oxford Nanopore) the read length might be heterogeneous. Be careful if you rely on read length for your computations.

Comparison with other libraries

  • Biopython is a set of freely available tools for biological computation that includes fasta/fastq manipulation. Compared to HTSeq, which is focused on high-throughput sequencing analyses on large genomes, it has a broader scope. If you are having trouble running a specific analysis with our parsers, you should take a look at it.

  • scikit-bio is also broader in scope and implements parsers for fasta/fastq.