Tutorial: Using Fasta/Fastq parsers
Fasta and fastq are the de facto standard formats for raw reads generated by
high-throuput sequencing machines. Fasta files do not have a quality score associated
with each base, while fastq do, the so-called “phred score”. HTSeq
has dedicated
parsers for them and their gzipped counterparts.
This tutorial will walk you through a few routine operations with fasta/fastq files.
The example files used here are all in the example_data
folder of the HTSeq
repository on GitHub.
Parsing a fasta file
Let’s start by reading a fasta file using FastaReader
:
>>> import HTSeq
>>> f = HTSeq.FastaReader('fastaEx.fa')
You can then iterate over the sequences:
>>> for read in f:
... print(read)
AGTACGTAGTCGCTGCTGCTACGGGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTAAACGATGC
AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG
Each read is a Sequence
object with a few properties, notably name
, seq
,
and descr
(for “description”). Notice that seq
is not a Python string but a
bytestring instead, i.e. each character only uses exactly 1 byte. This is for efficiency
purposes. To convert each read.seq
into a string, you can use the standard decode
method:
>>> read.seq
b'AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG'
>>> read.seq.decode()
'AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG'
Although opening files like this is possible, FastaReader
supports context
management, which is safer because it automatically closes the file for you:
>>> with HTSeq.FastaReader('fastaEx.fa') as f:
... for read in f:
... print(read)
Note
HTSeq
does not load the whole file in memory, so you can parse files
much larger than your computer RAM as long as you don’t store the reads along the way.
Parsing a fastq file
Fastq files are handled approximately in the same way, but use FastqReader
:
>>> with HTSeq.FastqReader('fastqEx.fastq') as f:
... for read in f:
... print(read.seq)
... print(read.qualstr)
b'AGTACGTAGTCGCTGCTGCTACGGGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTAAACGATGC'
b'GGGGGGGGGGGGGGGGGGGGGGGGGCGBBBBBBBBBBBBBBBHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH'
b'AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG'
b'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBDDDDDDDDD???????????????????????'
Of course, each read has now quality scores (that’s the whole point of fast**Q**), which are also a bytestring (again, for efficiency reason). It is also customary to use numeric values for the base-by-base qualities or phred scores:
>>> read.qual
array([33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 35, 35, 35, 35, 35, 35, 35, 35, 35, 30, 30, 30, 30,
30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,
30, 30], dtype=uint8)
Instead of returning a bytestring, read.qual
returns a numpy array of unsigned
integers (i.e. bytes) containing the numeric value of the quality score in the
usual phred or 33-based ASCII encoding. This number q
can be related to the likelihood of
a mistake in the base calling by the formula \(q = -10 * log_{10}[ P_{mistake} ]\).
Of course, the higher the quality, the lower the chance of it being a mistake (since
the log is a monotonically increasing function).
Parsing gzip-compressed files
Both parsers handle gzipped files, which are commonly used to save disk space, transparently:
>>> with HTSeq.FastqReader('fastqExgzip.fastq.gz') as f:
... for read in f:
... print(read.seq)
... print(read.qualstr)
...
...
b'AGTACGTAGTCGCTGCTGCTACGGGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTAAACGATGC'
b'GGGGGGGGGGGGGGGGGGGGGGGGGCGBBBBBBBBBBBBBBBHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH'
b'AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG'
b'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBDDDDDDDDD???????????????????????'
Note
Although some technologies produce reads with constant length (e.g. Illumina), in other scenarios (e.g. Oxford Nanopore) the read length might be heterogeneous. Be careful if you rely on read length for your computations.
Comparison with other libraries
Biopython is a set of freely available tools for biological computation that includes fasta/fastq manipulation. Compared to
HTSeq
, which is focused on high-throughput sequencing analyses on large genomes, it has a broader scope. If you are having trouble running a specific analysis with our parsers, you should take a look at it.scikit-bio is also broader in scope and implements parsers for fasta/fastq.