Other parsers
VCF_Reader and VariantCall
VCF is a text file format (usually stored compressed). It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome.
Data lines may optionally include genotype information on samples for each position.
See the definitions at
As usual, there is a parser class, called VCF_Reader, that can generate an
iterator of objects describing the structural variant calls. These objects are of type VariantCall
and each describes one line of a VCF file. See below for an example.
- class HTSeq.VCF_Reader(filename_or_sequence)
As a subclass of FileOrSequence, VCF_Reader can be initialized either with a file name or with an open file or another sequence of lines.
When requesting an iterator, it generates objects of type VariantCall.
- metadata
VCF_Reader skips all lines starting with a single ‘#’ as this marks a comment. However, lines starting with ‘##’ contain metadata (information about filters and about the fields in the ‘info’ column).
- parse_meta(header_filename=None)
The VCF_Reader normally does not parse the meta-information, and the VariantCall does not contain unpacked meta-information either. The function parse_meta reads the header information either from the attached FileOrSequence or from a file opened from the provided header_filename. This is important if you want to access sample-specific information for the VariantCall objects in your .vcf file.
- make_info_dict()
This function parses the info string and creates the attribute infodict, a dict of key:value pairs containing the type information for each entry of the VariantCall’s info field.
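To illustrate the kind of type information that the ‘##INFO’ meta lines carry, here is a minimal standalone sketch. It is not HTSeq's implementation: the helper name build_info_types and the regular expression are hypothetical, and the sketch assumes that the ID, Number and Type keys appear in that order, as they do in the VCF specification's examples.

```python
import re

# Hypothetical helper (not part of HTSeq): collect the declared Number and
# Type of every INFO key found in the '##' meta lines of a VCF header.
# Assumes the keys appear in the order ID, Number, Type.
INFO_LINE = re.compile(r'^##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),')


def build_info_types(lines):
    """Map each INFO ID to its declared (Number, Type) pair."""
    types = {}
    for line in lines:
        match = INFO_LINE.match(line)
        if match:
            key, number, value_type = match.groups()
            types[key] = (number, value_type)
    return types


header = [
    '##fileformat=VCFv4.0',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">',
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
]
info_types = build_info_types(header)
print(info_types)
```

Lines that are not ‘##INFO’ declarations (the file format line and the column header) are simply ignored by the sketch.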
- class HTSeq.VariantCall(line, nsamples=0, sampleids=[])
A VariantCall object always contains the following attributes:
- alt
The alternative base(s) of the VariantCall. This is a list containing all called alternatives.
- chrom
The chromosome on which the VariantCall was called.
- filter
Specifies whether the VariantCall passed all the filters given in the .vcf header (value=PASS); otherwise it contains a list of the filters that failed (the filter IDs are also specified in the header).
- format
Contains the format string specifying which per-sample information is stored in VariantCall.samples.
- id
The ID of the VariantCall if it has been found in any database; for unknown variants this will be “.”.
- info
This will contain either the string version of the info field for this VariantCall or a dict with the parsed and processed info string.
- pos
A HTSeq.GenomicPosition that specifies the position of the VariantCall.
- qual
The quality of the VariantCall.
- ref
The reference base(s) of the VariantCall.
- samples
A dict mapping sample IDs to subdicts which use the VariantCall.format as keys to store the per-sample information.
- unpack_info(infodict)
This function parses the info string and replaces it with a dict representation if the infodict of the originating VCF_Reader is provided.
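The idea of turning an info string into a dict, given the declared type of each key, can be sketched as follows. This is a standalone illustration under stated assumptions, not the code HTSeq uses: the function name unpack_info_string and the shape of info_types (a plain mapping from key to VCF type name) are hypothetical.

```python
def unpack_info_string(info, info_types):
    """Convert a VCF info string such as 'DP=14;DB' into a dict.

    info_types is assumed to map each key to a VCF type name
    ('Integer', 'Float', 'Flag', 'String', ...); keys that are not
    listed are kept as strings.
    """
    if info == ".":  # '.' marks a missing info field
        return {}
    converters = {"Integer": int, "Float": float}
    unpacked = {}
    for entry in info.split(";"):
        if "=" not in entry:
            # A key without a value is a flag.
            unpacked[entry] = True
            continue
        key, value = entry.split("=", 1)
        convert = converters.get(info_types.get(key), str)
        values = [convert(v) for v in value.split(",")]
        # Single values are stored directly, multiple values as a list.
        unpacked[key] = values[0] if len(values) == 1 else values
    return unpacked


info_types = {"DP": "Integer", "AF": "Float", "DB": "Flag"}
unpacked = unpack_info_string("DP=14;AF=0.5,0.25;DB;XX=abc", info_types)
print(unpacked)
```

Storing single values directly and multiple values as a list is a design choice of this sketch; an implementation could equally well always return lists.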
Example workflow for reading dbSNP in VCF format (obtained from dbSNP, ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz):
>>> import HTSeq
>>> vcfr = HTSeq.VCF_Reader("00-All.vcf.gz")
>>> vcfr.parse_meta()
>>> vcfr.make_info_dict()
>>> for vc in vcfr:
...     print(vc)
1:10327:'T'->'C'
1:10433:'A'->'AC'
1:10439:'AC'->'A'
1:10440:'C'->'A'
FIXME The example above is not run, as the example file is still missing!
Wiggle Reader
The Wiggle format (file extension often .wig) is a format to describe numeric scores assigned to base-pair positions on a genome.
The class WiggleReader is a parser for such files.
- class HTSeq.WiggleReader(filename_or_sequence, verbose=True)
The class is instantiated with the file name of a Wiggle file, or a sequence of lines in Wiggle format. A WiggleReader object generates an iterator, which yields pairs of the form (iv, score), where iv is a GenomicInterval object and score is a float with the score that the file assigns to the specified interval. If verbose is set to True, the user is alerted to skipped lines (comments or browser lines) by a message printed to the standard output.
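How Wiggle lines translate into intervals and scores can be sketched in plain Python. This is an illustration only, not WiggleReader's implementation: the generator iter_wiggle is hypothetical, it yields plain tuples instead of GenomicInterval objects, and it handles only fixedStep and variableStep blocks. Converting the 1-based Wiggle positions to 0-based, half-open coordinates is an assumption of this sketch.

```python
def iter_wiggle(lines):
    """Yield (chrom, start, end, score) tuples from Wiggle-formatted lines.

    Hypothetical sketch: supports 'fixedStep' and 'variableStep' blocks and
    converts the 1-based Wiggle positions to 0-based, half-open intervals.
    """
    mode = chrom = None
    span = step = pos = 1
    for line in lines:
        line = line.strip()
        # Skip empty lines, comments, and browser/track lines.
        if not line or line.startswith(("#", "browser", "track")):
            continue
        fields = line.split()
        if fields[0] in ("fixedStep", "variableStep"):
            # Declaration line: remember the settings for the data lines.
            mode = fields[0]
            opts = dict(f.split("=", 1) for f in fields[1:])
            chrom = opts["chrom"]
            span = int(opts.get("span", 1))
            if mode == "fixedStep":
                pos = int(opts["start"])
                step = int(opts.get("step", 1))
            continue
        if mode == "fixedStep":
            # Data line holds only the score; the position advances by step.
            yield (chrom, pos - 1, pos - 1 + span, float(fields[0]))
            pos += step
        elif mode == "variableStep":
            # Data line holds the position and the score.
            start = int(fields[0])
            yield (chrom, start - 1, start - 1 + span, float(fields[1]))
        else:
            raise ValueError("data line before a step declaration")


wiggle_lines = [
    "track type=wiggle_0",
    "fixedStep chrom=chr1 start=101 step=10 span=5",
    "1.5",
    "2.0",
    "variableStep chrom=chr2",
    "300 7",
]
records = list(iter_wiggle(wiggle_lines))
print(records)
```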
BED Reader
The BED format is a format originally used to describe gene models but is also commonly used to describe other genomic features.
- class HTSeq.BED_Reader(filename_or_sequence)
The class is instantiated with the file name of a BED file, or a sequence of lines in BED format. A BED_Reader object generates an iterator, which yields a GenomicFeature object for each line in the BED file (except for lines starting with track, which are skipped).
The attributes of the yielded GenomicFeature objects are as follows:
- iv
a GenomicInterval object with the coordinates as given by the 1st, 2nd, 3rd, and 6th column of the BED file. If the BED file has fewer than 6 columns, the strand is set to “.”.
- name
the name of the feature as given in the 4th column, or unnamed if the file has only three columns
- type
always the string BED line
- score
a float with the score as given by the 5th column (or None if the BED file has fewer than 5 columns)
- thick
a GenomicInterval object containing the “thick” part of the feature, as specified by the 7th and 8th column (thickStart and thickEnd in the BED format), with chromosome and strand copied from iv (or None if the BED file has fewer than 8 columns)
- itemRgb
a list of three int values, taken from the 9th column (None if the BED file has fewer than 9 columns). In a BED file, this triple is meant to specify the colour in which the feature should be drawn in a browser.
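The mapping from BED columns to the attributes listed above can be sketched without HTSeq. The names BedRecord and parse_bed_line are hypothetical, and a namedtuple stands in for GenomicFeature and GenomicInterval. Column positions follow the UCSC BED layout (strand in column 6, thickStart and thickEnd in columns 7 and 8, itemRgb in column 9).

```python
from collections import namedtuple

# Hypothetical stand-in for the GenomicFeature objects described above.
BedRecord = namedtuple(
    "BedRecord", "chrom start end strand name score thick item_rgb"
)


def parse_bed_line(line):
    """Parse one BED line, filling in defaults for missing optional columns."""
    f = line.split()
    if len(f) < 3:
        raise ValueError("a BED line needs at least 3 columns")
    return BedRecord(
        chrom=f[0],
        start=int(f[1]),
        end=int(f[2]),
        strand=f[5] if len(f) > 5 else ".",
        name=f[3] if len(f) > 3 else "unnamed",
        score=float(f[4]) if len(f) > 4 else None,
        # thickStart and thickEnd are the 7th and 8th columns.
        thick=(int(f[6]), int(f[7])) if len(f) > 7 else None,
        # itemRgb is the 9th column, a comma-separated RGB triple.
        item_rgb=[int(x) for x in f[8].split(",")] if len(f) > 8 else None,
    )


full = parse_bed_line(
    "chr7\t127471196\t127472363\tPos1\t0\t+\t127471196\t127472363\t255,0,0"
)
short = parse_bed_line("chr1 10 20")
print(full)
print(short)
```

A three-column line yields the defaults described above: the name unnamed, the strand “.”, and None for the remaining optional attributes.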
BigWig Reader
The BigWig format is a binary, compressed version of both Wiggle and bedGraph.
HTSeq supports it via pyBigWig (a great library, btw, thank you!), mainly for use with GenomicArray instances, i.e. sparse data on genomic intervals.
- class HTSeq.BigWig_Reader(filename)
This class is instantiated with the name of or path to a BigWig file. The file is opened upon instantiation, and the class can be used as a context manager (i.e. using “with”).
- Methods
- BigWig_Reader.chroms()
Return the list of chromosomes and their lengths, as a dictionary.
Example:
bw.chroms() -> {'chr1': 4568999, 'chr2': 87422, …}
- BigWig_Reader.intervals(chrom, strand='.', raw=False)
Lazy iterator over genomic intervals.
- Args:
chrom (str): The chromosome/scaffold to find intervals for.
strand ('.', '+', or '-'): Strandedness of the yielded GenomicInterval. If raw=True, this argument is ignored.
raw (bool): If True, return the raw triplet from pyBigWig. If False, return the result wrapped in a GenomicInterval with the appropriate strandedness.
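The effect of the raw switch can be sketched without pyBigWig or HTSeq installed. The names Interval and wrap_intervals are hypothetical; the input mimics the (start, end, value) triplets that pyBigWig returns for a chromosome. That the wrapped form is an (interval, value) pair is an assumption of this sketch, not a statement about the exact return value of BigWig_Reader.

```python
from collections import namedtuple

# Hypothetical stand-in for HTSeq.GenomicInterval.
Interval = namedtuple("Interval", "chrom start end strand")


def wrap_intervals(chrom, triplets, strand=".", raw=False):
    """Lazily yield intervals from (start, end, value) triplets.

    With raw=True the triplets are passed through unchanged and strand is
    ignored; otherwise each triplet becomes an (Interval, value) pair.
    """
    for start, end, value in triplets:
        if raw:
            yield (start, end, value)
        else:
            yield Interval(chrom, start, end, strand), value


triplets = [(0, 10, 1.5), (10, 25, 0.0)]
wrapped = list(wrap_intervals("chr1", triplets, strand="+"))
unwrapped = list(wrap_intervals("chr1", triplets, strand="+", raw=True))
print(wrapped)
print(unwrapped)
```

Because the function is a generator, nothing is read from the triplets until the caller starts iterating, which is what makes the iterator lazy.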