There are a large number of different tools to align short reads to a reference. Most of them use their own output format, even though the SAM format seems to become the common standard now that many of the newer tools use it.
HTSeq aims to offer a uniform way to analyse alignments from different tools. To this end,
for all supported alignment formats a parse class is offered that reads an alignment file
and generates an iterator over the individual alignment records. These are represented as
objects of a sub-class of
Alignment and hence all offer a common interface.
So, you can easily write code that should work for all aligner formats. As a simple example, consider this function that counts the number of reads falling on each chromosome:
>>> import collections >>> def count_in_chroms( alignments ): ... counts = collections.defaultdict( lambda: 0 ) ... for almnt in alignments: ... if almnt.aligned: ... counts[ almnt.iv.chrom ] += 1 ... return counts
If you have a SAM file (e.g., from BWA or BowTie), you can call it with:
>>> sorted(count_in_chroms(HTSeq.BAM_Reader("yeast_RNASeq_excerpt.sam")).items()) [('2-micron', 46), ('I', 362), ('II', 1724), ('III', 365), ('IV', 3015), ('IX', 648), ('V', 999), ('VI', 332), ('VII', 2316), ('VIII', 932), ('X', 1129), ('XI', 1170), ('XII', 4215), ('XIII', 1471), ('XIV', 1297), ('XV', 2133), ('XVI', 1509)]
If, however, you have done your alignment with Eland from the SolexaPipeline, which
uses the “Solexa export” format, you can use the same function, only using
>>> count_in_chroms( HTSeq.SolexaExportReader( "mydata_export.txt" ) )
Both class generate iterators of similar objects. On the other hand, some formats contain more information
and then the
Alignment objects from these contain additional fields.
Depending on the format of your alignment file, choose from the following parsers:
All of these are derived from
FileOrSequence. When asked for an iterator,
Alignment objects of types
SolexaExportAlignment. See below for their properties.
Adding support for a new format is very easy. Ask me if you need something and I can probably add it right-away. Alternatively, you can convert your format to the SAM format. The SAMtools contain Perl skripts to convert nearly all common formats.
BAM_Reader.peek( num = 1 ):
Peek into a SAM file or connection, reporting the first
numrecords. If you then call an iterator on the
BAM_Reader, the record will be yielded again.
This is the base class of all Alignment classes. Any class derived from
Alignmenthas at least the following attributes:
The read. An object of type
SequenceWithQuality. See there for the sub-attributes.
Note that some aligners store the reverse complement of the read if it was aligned to the ‘-’ strand. In this case, the parser revers-complements the read again, so that you can be sure that the read is always presented as it was sequenced (see also
A boolean. Some formats (e.g., those of Maq and Bowtie) contain only aligned reads (and the aligner collects the unaligned reads in a seperate FASTQ file if requested). For these formats,
True. Other formats (e.g., SAM and Solexa Export) list all reads, including those which could not be aligned. In that case, check
alignedto see whether the read has an alignment.
An object of class
The genomic interval to which the read was aligned (or
GenomicIntervalfor the sub-attributes. Note that different formats have different conventions for genomic coordinates. The parser class takes care of normalizing this, so that
ivalways adheres to the conventions outlined in :class:GenomicInterval. Especially, all coordinates are counted from zero, not one.
A boolean. True if the read stems from a paired-end sequencing run. (Note: At the moment paired-end data is only supported for the SAM format.)
Some aligners store the reverse complement of the read if it was aligned to the ‘-’ strand. For these aligners, the Alignment class is derived from
AlignmentWithSequenceReversal, which undoes the reverse-complement if necessary to ensure that the
readattribute always presents the read in the ordder in which it was sequenced.
To get better performance, this is done via lazy evaluation, i.e., the reverse complement is only calculated when the
readattribute is accessed for the first time. The original read as read from the file is stored as well. You can access both with these attributes:
SequenceWithQualitiesobject. The read as it was found in the file.
SequenceWithQualitiesobject. The read as it was sequenced, i.e., an alias for
Format-specific Alignment classes¶
Note: All format-specific Alignment classes take a string as argument for their constructor. This
is a line from the alignment file describing the alignment and is passed in by the corresponding
Reader object. As you do not create
Alignment objects yourself but get them from the
object you typically never call the constructor yourself.
BowtieAlignmentobjects contain all the attributes from
AlignmentWithSequenceReversal, and, in addition, these:
A string. The
reservedfield from the Bowtie output file. See the Bowtie manual for its meaning.
A string. The substitutions string that describes mismatches in the format
22:A>C, 25:C>Tto indicate a change from A to C in position 22 and from C to T in position 25. No further parsing for this is offered yet.
SAM_Alignmentobjects contain all the attributes from
AlignmentWithSequenceReversal, and, in addition, these:
An int. The alignment quality score in Phread style encoding.
A list of
CigarOperationobjects, as parsed from the extended CIGAR string. See
A boolean. Whether the alignment is secondary. (See SAM format reference, flag 0x0100. See also supplementary alignments, flag 0x0800.)
A boolean. Whether the read failed a platform quality check. (See SAM format reference, flag 0x0200.)
A boolean. Whether the read is a PCR or optical duplicate. (See SAM format reference, flag 0x0400.)
A boolean. Whether the alignment is supplementary. (See SAM format reference, flag 0x0800.)
These methods access the optional fields:
Returns the optional field
tag. See SAM format reference for the defined tags (which are two-letter strings).
Returns a dict with all optional fields, using their tags as keys.
This method is useful to write out a SAM file:
Constructs a SAM line to describe the alignment, which is returned as a string.
SAM_Alignment objects can represent paired-end data. If
Alignment.paired_endis True, the following fields may be used:
A boolean. Whether the mate was aligned
A string. Takes one of the values “first”, “second”, “unknown” and “not_paired_end”, to indicate whether the read stems from the first or second pass of the paired-end sequencing.
Boolean. Whether the mates form a proper pair. (See SAM format reference, flag 0x0002.)
GenomicPositionobject. The start (i.e., left-most position) of the mate’s alignment. Note that mate_start.strand is opposite to iv.strand for proper pairs.
An int. The inferred size of the insert between the reads.
This function takes a generator of
SAM_Alignmentobjects (e.g., a
BAM_Readerobject) and yields a sequence of pairs of alignments. A typical use may be:
for first, second in HTSeq.BAM_Reader( "some_paired_end_data.sam" ): print("Pair, consisting of") print(" ", first) print(" ", second)
SAM_Alignmentobjects, representing two reads of the same cluster. For this to work, the SAM file has to be arranged such that paired reads are always in adjacent lines. As the SAM format requires that the query names (first column of the SAM file) is the same for mate pairs, this arrangement can easily be achieved by sorting the SAM file lines lexicographically.
Special care is taken to properly pair up multiple alignment lines for the same read.
In the SAM format, alignments for paired-end reads must be reported in paired alignment records. If the mate of an alignment record is missing, this fact is counted and, at the end, a warning stating the number of such violating reads is issued. The singleton alignments are yielded as pairs, with the alignment in the first or second element of the pair (depending on the sequencing pass it originates from) and the other element is set to
This function pairs up reads in a SAM file, in the same manner as
pair_SAM_alignments()but does not require that mated alignments appear in adjacent records, i.e., the SAM file does not need to be sorted by read name beforehand. Rather, once the first alignment of a pair is encountered, it is stored in a buffer until its mated alignment is encountered, and then both are yielded together as pair. It is recommended that the data should be sorted by position, because then, mated alignments will typicalle not be too distant from each other in the file and hence only a limited number of alignments have to be held concurrently in the buffer, thereby reducing memory needs. To avoid overflowing the system’s memory, the function stops and raises an exception once the number of alignment records held in the buffer exceeds
SolexaExportAlignmentobjects contain all the attributes from
AlignmentWithSequenceReversal, and, in addition, these:
A boolean. Whether the read passed the chastity filter. If
A string. For
aligned==False, a code indicating why no match could be found. See the description of the 11th column of the Solexa Export format in the SolexaPipeline manual for the meaning of the codes. For
Some alignment programs, e.g., Bowtie, can output multiple alignments, i.e., the same read is reported consecutively with different alignments. This function takes an iterator over alignments (as provided by one of the alignment Reader classes) and bundles consecutive alignments regarding the same read to a list of Alignment objects and returns an iterator over these.
When reading in SAM files, the CIGAR string is parsed and stored as a list of
CigarOperation objects. For example, assume, a 36 bp read has been aligned
to the ‘+’ strand of chromosome ‘chr3’, extending to the right from position
1000, with the CIGAR string
"20M6I10M". The function :function:parse_cigar
spells out what this means:
>>> HTSeq.parse_cigar( "20M6I10M", 1000, "chr2", "+" ) [< CigarOperation: 20 base(s) matched on ref iv chr2:[1000,1020)/+, query iv [0,20) >, < CigarOperation: 6 base(s) inserted on ref iv chr2:[1020,1020)/+, query iv [20,26) >, < CigarOperation: 10 base(s) matched on ref iv chr2:[1020,1030)/+, query iv [26,36) >]
We can see that the map includes an insert. Hence, the affected coordinates run from 1000 to 1030 on the reference (i.e., the chromosome) but only from 0 to 36 on the query (i.e., the read).
We can convenient access to the parsed data by looking at the attributes of the three
objects in the list.
CigarOperation(type, size, rfrom, rto, qfrom, qto, chrom, strand, check=True)¶
This class represents a CIGAR atomic operation (e.g. 5I - insertion of 5 bases). All constructor parameters except the last one are positional, i.e. the order matters and you cannot assign them by name:
type (str): the type of operation, must be one of M/I/D/N/S/H/P/=/X.
size (int): the length of the operation, must be a positive integer.
rfrom (int): the beginning of the operation in the reference genome.
rto (int): the end of the operation in the reference genome.
qfrom (int): the beginning of the operation in the query (sequencing read).
qto (int): the end of the operation in the query (sequencing read).
chrom (str): the name of the chromosome.
strand (str): the strandness of the operation, must be +/-/. .
check (bool, default True): whether to check the operation for internal consistency.
Instances of this class have only one method:
Check whether the operation is internally consistent. Returns a boolean.
Instances of this class contain the following attributes:
The type of the operation. One of the letters M, I, D, N, S, H, or P. Use the dict cigar_operation_names to transform this to names:
>>> sorted(HTSeq.cigar_operation_names.items()) [('=', 'sequence-matched'), ('D', 'deleted'), ('H', 'hard-clipped'), ('I', 'inserted'), ('M', 'matched'), ('N', 'skipped'), ('P', 'padded'), ('S', 'soft-clipped'), ('X', 'sequence-mismatched')]
The number of affected bases, an int.
GenomicIntervalspecifying the affected bases on the reference. In case of an insertion, this is a zero-length interval.