.. _features: ******** Features ******** .. currentmodule:: HTSeq .. doctest:: :hide: >>> import HTSeq The easiest way to work with annotation is to use :class:`GenomicArray` with ``typecode=='O'`` or :class:`GenomicArrayOfSets`. If you have your annotation in a flat file, with each line describing a feature and giving its coordinates, you can read in the file line for line, parse it (see the standard Python module ``csv``), use the information on chromosome, start, end and strand to create a :class:`GenomicInterval` object and then store the data from the line in the genomic array at the place indicated by the genomic interval. For example, if you have data in a tab-separated file as follows: .. doctest:: >>> for line in open( "feature_list.txt" ): #doctest:+NORMALIZE_WHITESPACE ... print(line) chr2 100 300 + "gene A" chr2 200 400 - "gene B" chr3 150 270 + "gene C" Then, you could load this information as follows:: >>> import csv >>> genes = HTSeq.GenomicArray( [ "chr1", "chr2", "chr3" ], typecode='O' ) >>> for (chrom, start, end, strand, name) in \ ... csv.reader( open("feature_list.txt"), delimiter="\t" ): ... iv = HTSeq.GenomicInterval( chrom, int(start), int(end), strand ) ... genes[ iv ] = name Now, to see whether there is a feature at a given :class:`GenomicPosition`, you just query the genomic array:: >>> print(genes[ HTSeq.GenomicPosition( "chr3", 100, "+" ) ]) None >>> print(genes[ HTSeq.GenomicPosition( "chr3", 200, "+" ) ]) gene C See :class:`GenomicArray` and :class:`GenomicArrayOfSets` for more sophisticated use. ``GFF_Reader`` and ``GenomicFeature`` ===================================== One of the most common format for annotation data is GFF_ (which includes GTF_ as a sub-type). Hence, a parse for GFF files is included in HTSeq. .. _GFF: http://www.sanger.ac.uk/resources/software/gff/spec.html .. _GTF: http://mblab.wustl.edu/GTF22.html As usual, there is a parser class, called **GFF_Reader**, that can generate an iterator of objects describing the features. These objects are of type :class`GenomicFeature` and each describes one line of a GFF file. See Section :ref:`tour` for an example. .. class:: GFF_Reader( filename_or_sequence, end_included=True ) As a subclass of :class:`FileOrSequence`, GFF_Reader can be initialized either with a file name or with an open file or another sequence of lines. When requesting an iterator, it generates objects of type :class:`GenomicFeature`. The GFF specification is unclear on whether the end coordinate marks the last base-pair of the feature (closed intervals, ``end_included=True``) or the one after (half-open intervals, ``end_included=False``). The default, True, is correct for Ensembl GTF files. If in doubt, look for a CDS or stop_codon feature in you GFF file. Its length should be divisible by 3. If "end-start" is divisible by 3, you need ``end_included=False``. If "end-start+1" is divisible by 3, you need ``end_included=True``. GFF_Reader will convert the coordinates from GFF standard (1-based, end maybe included) to HTSeq standard (0-base, end not included) by subtracting 1 from the start position, and, for ``end_included=True``, also subtract 1 from the end position. .. attribute:: GFF_Reader.metadata GFF_Reader skips all lines starting with a single '#' as this marks a comment. However, lines starying with '##' contain meta data (at least accoring to the Sanger Institute's version of the GFF standard.) Such meta data has the format ``##key value``. When a metadata line is encountered, it is added to the ``metadata`` dictionary. .. class:: GenomicFeature( name, type_, interval ) A GenomicFeature object always contains the following attributes: .. attribute:: GenomicFeature.name A name of ID for the feature. As the GFF format does not have a dedicated field for this, the value of the first attribute in the *attributes* column is assumed to be the name of ID. .. attribute:: GenomicFeature.type The type of the feature, i.e., a string like ``"exon"`` or ``"gene"``. For GFF files, the 3rd column (*feature*) is taken as the type. .. attribute:: GenomicFeature.interval The interval that the feature covers on the genome. For GFF files, this information is taken from the first (*seqname*), the forth (*start*), the fifth (*end*), and the seventh (*strand*) column. When created by a :class:`GFF_Reader` object, the following attributes are also present, with the information from the remaining GFF columns: .. attribute:: GenomicFeature.source The 2nd column, denoted *source* in the specification, and intended to specify the data source. .. attribute:: GenomicFeature.frame The 8th column (*frame*), giving the reading frame in case of a coding feature. Its value is an integer (0, 1, or 2), or the string ``'.'`` in case that a frame is not specified or would not make sense. .. attribute:: GenomicFeature.score The 6th column (*score*), giving some numerical score for the feature. Its value is a float, or ``'.'`` in case that a score is not specified or would not make sense .. attribute:: GenomicFeature.attr The last (9th) column of a GFF file contains *attributes*, i.e. a list of name/value pairs. These are transformed into a dict, such that, e.g., ``gf.attr['gene_id']`` gives the value of the attribute ``gene_id`` in the feature described by ``GenomicFeature`` object ``gf``. The parser for the attribute field is reasonably flexible to deal with format variations (it was never clearly established whetehr name and value should be sperarated by a colon or an equal sign, and whether quotes need to be used) and also does a URL style decoding, as is often required. In order to write a GFF file from a sequence of features, this method is provided: .. method:: GenomicFeature.get_gff_line( with_equal_sign=False ) Returns a line to describe the feature in the GFF format. This works even if the optional attributes given above are missing. Call this for each of your ``GenomicFeature`` objects and write the lines into a file to get a GFF file. .. function:: parse_GFF_attribute_string( attrStr, extra_return_first_value=False ) This is the function that :class:`GFF_Reader` uses to parse the attribute column. (See :attr:`GenomicFeature.attr`.) It returns a dict, or, if requested, a pair of the dict and the first value.