Tutorial: Using the SAM/BAM/CRAM parsers¶
High-throughput sequencing machines commonly produce fastq files. For many types of
analysis, the reads are then aligned to a reference genome by a “mapper” or “aligner”
software. Most of those software srote these output alignments as SAM/BAM/CRAM files.
HTSeq has a set of parsers for these file formats.
This tutorial will walk you through a few routine operations with SAM/BAM/CRAM files.
The example files used here are all in the
example_data folder of the
repository on GitHub.
What’s the difference?¶
- SAM is an uncompressed, text format
- BAM is its compressed, binary sibiling
- CRAM is a newer compressed, binary format that evolved from BAM but is less widespread
samtools can be used to simply convert between them from the command line.
SAM files sometimes have a header containing the names and lengths of chromosomes and a few more info (e.g. the aligner name). For BAM files, the header is mandatory instead of optional.
Parsing a SAM file¶
To parse a SAM (text) file, you can use the
>>> f = HTSeq.SAM_Reader('yeast_RNASeq_excerpt_withNH.sam')
You can use context management to ensure the file is closed properly. If we look at the first 3 reads only:
>>> with HTSeq.SAM_Reader('yeast_RNASeq_excerpt_withNH.sam') as f: ... for i, read in enumerate(f): ... print(read) ... if i == 2: ... break <SAM_Alignment object: Read 'HWI-EAS225:1:10:1284:1974#0/1' aligned to VIII:[394100,394136)/+> <SAM_Alignment object: Read 'HWI-EAS225:1:10:1284:986#0/1', not aligned> <SAM_Alignment object: Read 'HWI-EAS225:1:10:1284:2012#0/1' aligned to VII:[1001605,1001641)/+>
Some SAM/BAM files derive from paired-end experiments. The
above iterates over reads, not read pairs. See below for a parser that specifically
deals with read pairs.
Parsing a BAM file¶
To parse a BAM (binary) file, you can use the sibiling class
BAM_Reader. Using a
>>> with HTSeq.BAM_Reader('SRR001432_head.bam') as f: ... for i, read in enumerate(f): ... print(read) ... if i == 2: ... break <SAM_Alignment object: Read 'SRR001432.1 USI-EAS21_0008_3445:8:1:107:882 length=25', not aligned> <SAM_Alignment object: Read 'SRR001432.2 USI-EAS21_0008_3445:8:1:82:90 length=25', not aligned> <SAM_Alignment object: Read 'SRR001432.3 USI-EAS21_0008_3445:8:1:639:904 length=25', not aligned>
As for SAM files, the iteration is over each single read, not read pairs.
Illumina sequencers can sequence a DNA molecule from both ends, leading to so-called paired-end reads. It is sometimes useful to examine both reads of each pair (called mates) at the same time, because they are two representative of the same original DNA molecule.
HTSeq has two classes to achieve this goal, depending on whether the SAM/BAM file
is “unsorted” aka sorted by name, or “sorted” aka sorted by position. In the former
case, the alignments from the two mates are found in the BAM file on consecutive lines.
For name-sorted SAM/BAM files, you can use
>>> with HTSeq.BAM_Reader('SRR001432_head.bam') as f: ... for (read1, read2) in HTSeq.pair_SAM_alignments(f): ... print(read1) ... print(read2)
For position-sorted SAM/BAM files, you can use
>>> with HTSeq.BAM_Reader('SRR001432_head.bam') as f: ... for (read1, read2) in HTSeq.pair_SAM_alignments_with_buffer(f): ... print(read1) ... print(read2)
In practical terms, the main difference between these two functions is their operation when they meet a read but the next read is not the mate: the first function assumes the first read is an incomplete pair, while the second function waits to see if the mate will appear later on in the file.
The second function is used in
htseq-count to provide the -r pos option
for position-sorted BAM files. However, because the first read has to be stored in
memory until the second read is found, this approach incurs a significant memory cost.
It is recommended to call
htseq-count on unsorted/name-sorted BAM files, and
samtools sort can be used to “unsort” a BAM file.