Other parsers
VCF_Reader and VariantCall
VCF is a text file format, usually stored in compressed form. It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome.
A VCF file may optionally contain genotype information on samples for each position.
See the definitions at
As usual, there is a parser class, called VCF_Reader, that can generate an
iterator of objects describing the structural variant calls. These objects are of type VariantCall
and each describes one line of a VCF file. See below for an example.
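The three kinds of lines described above can be told apart by their prefix. The following is a minimal, standalone sketch of that layout, not HTSeq's implementation; the sample text, the function name split_vcf, and the record IDs are made up for illustration.

```python
import io

# A tiny, made-up VCF snippet used only for illustration.
SAMPLE_VCF = """\
##fileformat=VCFv4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t10327\trsExample1\tT\tC\t.\tPASS\tDP=14
1\t10433\t.\tA\tAC\t.\tq10\tDP=9
"""


def split_vcf(lines):
    """Split VCF text lines into (meta_lines, header_fields, data_rows)."""
    meta, header, rows = [], None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Meta-information line, e.g. an INFO or FILTER definition.
            meta.append(line)
        elif line.startswith("#"):
            # The single header line naming the columns.
            header = line[1:].split("\t")
        else:
            # A data line: one variant call at one genomic position.
            rows.append(line.split("\t"))
    return meta, header, rows


meta, header, rows = split_vcf(io.StringIO(SAMPLE_VCF))
print(len(meta), header[:5], len(rows))
```

Each entry of rows corresponds to what a VariantCall object would describe; the sketch leaves the fields as plain strings.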
- class HTSeq.VCF_Reader(filename_or_sequence)
As a subclass of FileOrSequence, VCF_Reader can be initialized either with a file name or with an open file or another sequence of lines.
When requesting an iterator, it generates objects of type VariantCall.
- metadata
VCF_Reader skips all lines starting with a single ‘#’, as this marks a comment. However, lines starting with ‘##’ contain metadata (information about filters and about the fields in the ‘info’ column).
- parse_meta(header_filename=None)
The VCF_Reader normally does not parse the meta-information, and the VariantCall does not contain unpacked meta-information either. The function parse_meta reads the header information either from the attached FileOrSequence or from a file opened from the provided header_filename. This is important if you want to access sample-specific information for the VariantCall objects in your .vcf file.
- make_info_dict()
This function parses the info string and creates the attribute infodict, which contains a dict with key:value pairs giving the type information for each entry of the VariantCall’s info field.
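To illustrate the idea of a type table for the info field, here is a hedged sketch that extracts the ID and Type of each ##INFO meta line. It is a standalone approximation, not HTSeq's code; the header lines, the regular expression, and the name build_info_types are assumptions, and the actual contents of infodict may be shaped differently.

```python
import re

# Hypothetical meta-information lines in the style of a VCF header.
META_LINES = [
    '##fileformat=VCFv4.0',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=AF,Number=.,Type=Float,Description="Allele frequency">',
    '##INFO=<ID=VLD,Number=0,Type=Flag,Description="Is validated">',
    '##FILTER=<ID=q10,Description="Quality below 10">',
]

# Capture the ID and the Type of an ##INFO definition.
INFO_RE = re.compile(r"^##INFO=<ID=([^,]+),.*?Type=([^,>]+)")


def build_info_types(meta_lines):
    """Return a dict mapping each INFO key to its declared type name."""
    types = {}
    for line in meta_lines:
        match = INFO_RE.match(line)
        if match:
            types[match.group(1)] = match.group(2)
    return types


info_types = build_info_types(META_LINES)
print(info_types)
```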
- class HTSeq.VariantCall(line, nsamples=0, sampleids=[])
A VariantCall object always contains the following attributes:
- alt
The alternative base(s) of the
VariantCall. This is a list containing all called alternatives.
- chrom
The chromosome on which the
VariantCall was called.
- filter
This specifies whether the
VariantCall passed all the filters given in the .vcf header (value=PASS); otherwise it contains a list of the filters that failed (the filter IDs are also specified in the header).
- format
Contains the format string specifying which per-sample information is stored in
VariantCall.samples.
- id
The id of the
VariantCall, if it has been found in any database; for unknown variants this will be “.”.
- info
This will contain either the string version of the info field for this
VariantCall or a dict with the parsed and processed info string.
- pos
A
HTSeq.GenomicPosition that specifies the position of the VariantCall.
- qual
The quality of the
VariantCall.
- ref
The reference base(s) of the
VariantCall.
- samples
A dict mapping sample IDs to subdicts which use the
VariantCall.format as keys to store the per-sample information.
- unpack_info(infodict)
This function parses the info string and replaces it with a dict representation if the infodict of the originating VCF_Reader is provided.
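As a rough illustration of what unpacking an info string with a type table involves, the sketch below converts a semicolon-separated info string into a dict. It is not HTSeq's implementation: the type table, the converter mapping, the function name unpack_info_string, and the handling of flags and multi-valued entries are all assumptions.

```python
# Hypothetical type table, as might be derived from the ##INFO header lines.
INFO_TYPES = {"DP": "Integer", "AF": "Float", "VLD": "Flag", "GENE": "String"}

# Map VCF type names to Python converters; anything else stays a string.
CONVERTERS = {"Integer": int, "Float": float}


def unpack_info_string(info, info_types):
    """Turn 'KEY=value;FLAG;...' into a dict with typed values."""
    result = {}
    if info == ".":
        # "." marks an empty info field.
        return result
    for entry in info.split(";"):
        key, sep, raw = entry.partition("=")
        if not sep:
            # An entry without "=" is a flag; its presence means True.
            result[key] = True
            continue
        convert = CONVERTERS.get(info_types.get(key), str)
        values = [convert(v) for v in raw.split(",")]
        # Keep single values as scalars and multiple values as a list.
        result[key] = values[0] if len(values) == 1 else values
    return result


parsed = unpack_info_string("DP=14;AF=0.25,0.5;VLD;GENE=ABC1", INFO_TYPES)
print(parsed)
```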
Example workflow for reading dbSNP in VCF format (obtained from dbSNP, ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz):
>>> import HTSeq
>>> vcfr = HTSeq.VCF_Reader("00-All.vcf.gz")
>>> vcfr.parse_meta()
>>> vcfr.make_info_dict()
>>> for vc in vcfr:
...     print(vc)
1:10327:'T'->'C'
1:10433:'A'->'AC'
1:10439:'AC'->'A'
1:10440:'C'->'A'
FIXME The example above is not run, as the example file is still missing!
Wiggle Reader
The Wiggle format (file extension often .wig) is a format to describe numeric scores assigned to base-pair positions on a genome.
The class WiggleReader is a parser for such files.
- class HTSeq.WiggleReader(filename_or_sequence, verbose=True)
The class is instantiated with the file name of a Wiggle file, or a sequence of lines in Wiggle format. A WiggleReader object generates an iterator, which yields pairs of the form (iv, score), where iv is a GenomicInterval object and score is a float with the score that the file assigns to the specified interval. If verbose is set to True, the user is alerted to skipped lines (comments or browser lines) by a message printed to the standard output.
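To make the (interval, score) pairs concrete, here is a minimal sketch of a Wiggle parser that handles variableStep and fixedStep blocks. It is a simplified stand-in, not HTSeq's WiggleReader: the function name parse_wiggle and the sample lines are made up, intervals are plain (chrom, start, end) tuples instead of GenomicInterval objects, and the conversion of Wiggle's 1-based positions to 0-based, half-open coordinates is an assumption of this sketch.

```python
def parse_wiggle(lines):
    """Yield ((chrom, start, end), score) with 0-based, half-open coordinates."""
    mode = chrom = None
    span = step = 1
    position = None  # next 1-based start position in a fixedStep block
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and browser/track lines.
        if not line or line.startswith(("#", "browser", "track")):
            continue
        fields = line.split()
        if fields[0] in ("variableStep", "fixedStep"):
            # Declaration line: key=value parameters for the following block.
            params = dict(f.split("=", 1) for f in fields[1:])
            mode = fields[0]
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params.get("step", 1))
            continue
        if mode == "variableStep":
            start, score = int(fields[0]), float(fields[1])
        elif mode == "fixedStep":
            start, score = position, float(fields[0])
            position += step
        else:
            raise ValueError("data line before a step declaration")
        yield (chrom, start - 1, start - 1 + span), score


# Made-up example input.
WIGGLE_LINES = """\
track type=wiggle_0 name="example"
variableStep chrom=chr1 span=2
101 1.5
105 2.0
fixedStep chrom=chr2 start=11 step=10 span=5
0.5
0.75
""".splitlines()

pairs = list(parse_wiggle(WIGGLE_LINES))
for interval, score in pairs:
    print(interval, score)
```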
BED Reader
The BED format is a format originally used to describe gene models but is also commonly used to describe other genomic features.
- class HTSeq.BED_Reader(filename_or_sequence)
The class is instantiated with the file name of a BED file, or a sequence of lines in BED format. A BED_Reader object generates an iterator, which yields a GenomicFeature object for each line in the BED file (except for lines starting with track, which are skipped).
The attributes of the yielded GenomicFeature objects are as follows:
- iv
a GenomicInterval object with the coordinates as given by the 1st, 2nd, 3rd, and 6th column of the BED file. If the BED file has fewer than 6 columns, the strand is set to “.”.
- name
the name of the feature as given in the 4th column, or unnamed, if the file has only three columns
- type
always the string BED line
- score
a float with the score as given by the 5th column (or None if the BED file has fewer than 5 columns).
- thick
a GenomicInterval object containing the “thick” part of the feature, as specified by the 7th and 8th column, with chromosome and strand copied from iv (or None if the BED file has fewer than 8 columns).
- itemRgb
a list of three int values, taken from the 9th column (None if the BED file has fewer than 9 columns). In a BED file, this triple is meant to specify the colour in which the feature should be drawn in a browser.
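The mapping from BED columns to attributes can be sketched in plain Python as follows. This is an illustration only, not HTSeq's BED_Reader: BedRecord, parse_bed_line, and read_bed are hypothetical names, the record is a namedtuple rather than a GenomicFeature, and the column positions follow the UCSC BED column order (strand in column 6, thickStart and thickEnd in columns 7 and 8, itemRgb in column 9).

```python
from collections import namedtuple

# Hypothetical stand-in for the GenomicFeature yielded by a BED parser.
BedRecord = namedtuple(
    "BedRecord", "chrom start end strand name score thick item_rgb"
)


def parse_bed_line(line):
    """Parse one BED data line, filling defaults for missing columns."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least 3 columns")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else "unnamed"
    score = float(fields[4]) if len(fields) > 4 else None
    strand = fields[5] if len(fields) > 5 else "."
    # thickStart/thickEnd are columns 7 and 8 (indices 6 and 7).
    thick = (int(fields[6]), int(fields[7])) if len(fields) > 7 else None
    # itemRgb is column 9 (index 8), written as "R,G,B".
    item_rgb = (
        [int(c) for c in fields[8].split(",")] if len(fields) > 8 else None
    )
    return BedRecord(chrom, start, end, strand, name, score, thick, item_rgb)


def read_bed(lines):
    """Lazily yield one record per data line, skipping track lines."""
    for line in lines:
        if line.startswith("track") or not line.strip():
            continue
        yield parse_bed_line(line)


# Made-up example input.
BED_LINES = [
    'track name="example"',
    "chr1\t100\t200",
    "chr2\t500\t900\tgeneA\t960\t-\t550\t850\t255,0,0",
]
records = list(read_bed(BED_LINES))
for record in records:
    print(record)
```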
BigWig Reader
The BigWig format is a binary, compressed version of both Wiggle and bedGraph.
HTSeq supports it via pyBigWig (a great library, btw, thank you!), mainly for use
with GenomicArray instances, i.e. sparse data on genomic intervals.
- class HTSeq.BigWig_Reader(filename)
This class is instantiated with the name of or path to a BigWig file. The file is opened upon instantiation, and the class can be used as a context manager (i.e. using “with”).
- Methods
- BigWig_Reader.chroms()
Return the list of chromosomes and their lengths, as a dictionary.
Example:
bw.chroms() -> {'chr1': 4568999, 'chr2': 87422, …}
- BigWig_Reader.intervals(chrom, strand='.', raw=False)
Lazy iterator over genomic intervals.
- Args:
chrom (str): The chromosome/scaffold to find intervals for.
strand ('.', '+', or '-'): Strandedness of the yielded GenomicInterval. If raw=True, this argument is ignored.
raw (bool): If True, return the raw triplet from pyBigWig. If False, return the result wrapped in a GenomicInterval with the appropriate strandedness.
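The effect of the raw and strand arguments can be sketched without HTSeq or pyBigWig as a generator over (start, end, value) triplets. Everything here is a stand-in: Interval replaces GenomicInterval, iter_intervals is a hypothetical name, the triplets are made up, and yielding an (interval, value) pair in the wrapped case is an assumption, since the description above only says that the result is wrapped in a GenomicInterval.

```python
from collections import namedtuple

# Stand-in for HTSeq.GenomicInterval so the sketch runs without HTSeq.
Interval = namedtuple("Interval", "chrom start end strand")


def iter_intervals(raw_triplets, chrom, strand=".", raw=False):
    """Lazily yield raw triplets, or (Interval, value) pairs if raw is False."""
    for start, end, value in raw_triplets:
        if raw:
            # The strand argument is ignored for raw output.
            yield (start, end, value)
        else:
            yield Interval(chrom, start, end, strand), value


# Made-up triplets of the form (start, end, value).
triplets = [(0, 10, 0.5), (10, 25, 1.0)]

raw_out = list(iter_intervals(triplets, "chr1", strand="+", raw=True))
wrapped = list(iter_intervals(triplets, "chr1", strand="+"))
print(raw_out)
print(wrapped)
```

Because the function is a generator, nothing is read from the triplet source until the caller starts iterating.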