FILE FORMAT DESCRIPTION
The GTF format
Purpose
GTF (Gene Transfer Format) was designed for the interchange of exon-intron structures of genes.
It extends the earlier GFF (General Feature Format) with additional fields.
Structure
GTF is a tab-separated format in which each line describes one feature (e.g., exon, intron,
CDS, start/stop codon, ...) using the following fields:
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
Each attribute is a pair of: identifier "value";
Textual attributes should be surrounded by double quotes. Each attribute must end in a semicolon, which must be separated
from the start of any subsequent attribute by exactly one space character. Commonly used identifiers include
gene_id, transcript_id and exon_id.
Optional comments are ignored by AStalavista.
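As an illustration of the layout described above, here is a minimal Python sketch that splits one GTF line into its eight fixed fields and an attribute dictionary. The function name parse_gtf_line, the dictionary keys, and the regular expression are assumptions made for this example; they are not part of AStalavista.

```python
import re

# Field order as listed in the Structure section; the key names are illustrative.
GTF_FIELDS = ["seqname", "source", "feature", "start", "end", "score", "strand", "frame"]

# One attribute: identifier, whitespace, then a quoted or unquoted value, then ';'.
ATTR_RE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')

def parse_gtf_line(line):
    """Split one GTF line into its fixed fields plus an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    record = dict(zip(GTF_FIELDS, cols[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # The ninth column holds the attribute list; any later column is treated as a comment.
    attr_text = cols[8] if len(cols) > 8 else ""
    record["attributes"] = {
        m.group(1): m.group(2) if m.group(2) is not None else m.group(3)
        for m in ATTR_RE.finditer(attr_text)
    }
    return record

line = 'AB000381\tgene_id\texon\t150\t200\t.\t+\t.\tgene_id "AB000381.000"; transcript_id "AB000381.000.1";'
rec = parse_gtf_line(line)
print(rec["feature"], rec["start"], rec["end"], rec["attributes"]["transcript_id"])
```

The sketch keeps score, strand and frame as plain strings, since a dot is a legal placeholder in those columns.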
Field description
<seqname>
The FPC (fingerprint contig) ID from the Golden Path. The field is mandatory and needed for the correct clustering of transcripts
in AStalavista.
<source>
The source column should be a unique label indicating where the annotations came from, typically the name of either a prediction
program or a public database. The field is ignored by AStalavista.
<feature>
GTF defines the following features: "CDS", "start_codon", "stop_codon" and "exon".
AStalavista requires the feature "exon" to obtain the exon-intron structure of each transcript. The feature "CDS"
may optionally be provided to describe the coding sequence, starting with the first translated codon and proceeding to the last translated codon.
Unlike in GenBank annotation, the stop codon is not included in the CDS for the terminal exon. All other features (e.g., "start_codon" or "stop_codon")
are ignored by AStalavista.
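To show how an exon-intron structure can be read off "exon" features, the following Python sketch groups exon coordinates by transcript_id and reports the gaps between consecutive exons as introns. It is only an illustration under the assumption that exons of one transcript do not overlap; the function name introns_by_transcript is hypothetical and this is not AStalavista's own procedure. The sample coordinates are the exon coordinates used in the Example section of this document.

```python
from collections import defaultdict

def introns_by_transcript(exon_records):
    """exon_records: iterable of (transcript_id, start, end), 1-based inclusive."""
    exons = defaultdict(list)
    for transcript_id, start, end in exon_records:
        exons[transcript_id].append((start, end))
    introns = {}
    for transcript_id, coords in exons.items():
        coords.sort()  # order exons by genomic start position
        # An intron spans from the base after one exon to the base before the next.
        introns[transcript_id] = [
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(coords, coords[1:])
        ]
    return introns

example_exons = [
    ("AB000381.000.1", 150, 200),
    ("AB000381.000.1", 300, 401),
    ("AB000381.000.1", 501, 650),
    ("AB000381.000.1", 700, 800),
]
print(introns_by_transcript(example_exons))
# {'AB000381.000.1': [(201, 299), (402, 500), (651, 699)]}
```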
<start> <end>
Integer start and end coordinates of the feature relative to the beginning of the sequence named in <seqname>.
<start> must be less than or equal to <end>. Sequence numbering starts at 1.
Values of <start> and <end> that extend outside the reference sequence are technically acceptable, but they are discouraged for purposes of this project.
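Because coordinates are 1-based and both ends are inclusive, a feature's length is end - start + 1. The short Python sketch below checks the two stated constraints and also converts to the 0-based, half-open intervals used by Python slicing. Both function names are assumptions for this example.

```python
def feature_length(start, end):
    """Length of a feature given 1-based, inclusive GTF coordinates."""
    if start < 1:
        raise ValueError("sequence numbering starts at 1")
    if start > end:
        raise ValueError("<start> must be less than or equal to <end>")
    return end - start + 1

def to_zero_based_half_open(start, end):
    """Convert 1-based inclusive coordinates to a 0-based half-open interval."""
    return start - 1, end

print(feature_length(150, 200))           # 51
print(to_zero_based_half_open(150, 200))  # (149, 200)
```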
<score>
The score field (a float value) is not used by AStalavista, so it may be replaced by a dot.
<frame>
The number of nucleotides to skip from the start position of the current feature to reach the first position of the next codon.
The field is ignored by AStalavista.
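Although AStalavista ignores this field, the definition can be made concrete with a small Python sketch that derives the frame of each CDS segment from the lengths of the preceding segments. It assumes that the coding sequence is complete (the first segment starts at the first base of a codon) and that segments are supplied in transcription order; the function name cds_frames is hypothetical.

```python
def cds_frames(segments):
    """segments: list of (start, end) CDS coordinates in transcription order.

    Returns, for each segment, the number of bases to skip before the
    first complete codon of that segment begins.
    """
    frames = []
    frame = 0  # assumption: the first segment starts exactly at a codon boundary
    for start, end in segments:
        frames.append(frame)
        length = end - start + 1
        # Bases left over after the last complete codon decide the next frame.
        frame = (3 - (length - frame) % 3) % 3
    return frames

# CDS coordinates from the Example section of this document.
print(cds_frames([(380, 401), (501, 650), (700, 707)]))  # [0, 2, 2]
```

The result agrees with the frame values 0, 2 and 2 given for the three CDS lines in the example.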
gene_id
A unique identifier for the gene to which the corresponding feature is assigned. AStalavista performs its own transcript clustering
procedure and therefore ignores gene identifiers provided in the attribute list.
transcript_id
A unique identifier for the transcript to which the corresponding feature is assigned. This attribute is mandatory for the correct
functioning of AStalavista.
exon_id
A unique identifier for the exon to which the corresponding feature is assigned. AStalavista generates the exons in a cluster of
transcripts non-redundantly, i.e., exons with identical start/stop coordinates are regarded as identical even
if their exon_ids and/or transcript_ids differ.
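The non-redundancy rule can be sketched in Python by keying exons on their coordinates alone, so that records that differ only in their identifiers collapse into one entry. This is a minimal illustration, not AStalavista's implementation; the function name unique_exons and the identifiers t1 and t2 are hypothetical.

```python
def unique_exons(exons):
    """exons: iterable of (transcript_id, start, end) within one transcript cluster.

    Returns a dict mapping each distinct (start, end) pair to the list of
    transcripts that use an exon with exactly those coordinates.
    """
    merged = {}
    for transcript_id, start, end in exons:
        merged.setdefault((start, end), []).append(transcript_id)
    return merged

cluster = [
    ("t1", 150, 200),
    ("t2", 150, 200),  # same coordinates as in t1: counted as the same exon
    ("t1", 300, 401),
    ("t2", 300, 390),  # different end coordinate: a distinct exon
]
for coords, transcripts in unique_exons(cluster).items():
    print(coords, transcripts)
```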
Example
Here is an example in which the "exon" and the "CDS" features are used.
The GTF lines describe a transcript with 4 exons, 3 of which are translated.
AB000381 gene_id exon 150 200 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 gene_id exon 300 401 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 gene_id CDS 380 401 . + 0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 gene_id exon 501 650 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 gene_id CDS 501 650 . + 2 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 gene_id exon 700 800 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 gene_id CDS 700 707 . + 2 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
The ASTA format
Purpose
ASTA (Alternative Splicing Transfer) format is a flat file format designed to describe exon-intron structures in order to allow
for an easy interchange of alternative splicing (AS) events.
Structure
ASTA files are tab-separated and describe one AS event per line, each line containing the following fields:
<structure> <seqname> <transcriptID1> <var splice sites #1> <transcriptID2> <var splice sites #2>
Field description
<structure>
The exon-intron structure of the AS event as described by the AS code with the relative position of variable splice sites
(see AS codes for a description of the AS codes).
<seqname>
The FPC (fingerprint contig) ID from the Golden Path.
<transcriptID1>
Identifier of the first transcript involved in the variation.
<var splice sites #1>
List containing the genomic coordinates of variable splice sites in transcript #1, i.e., splice sites that are not used in transcript #2.
<transcriptID2>
Identifier of the second transcript involved in the variation.
<var splice sites #2>
List with the coordinates of variable splice sites in transcript #2, i.e., splice sites that are not used in transcript #1.
Note that the list may be empty, e.g., in case of an exon skipping event.
Examples
1^,2^ chr4 CG1909-RB 603475 CG1909-RA 603502
The line describes an alternative donor event on chromosome chr4, where the distal donor
(at position 603475) is used in transcript CG1909-RB and the proximal donor
(at position 603502) is used in transcript CG1909-RA.
1^2-,0 chrX CG11412-RA 1200035,1200111 CG11412-RC
The line describes an exon skipping event on chromosome chrX, where an exon (located at start=1200035,
end=1200111) is used in transcript CG11412-RA but skipped
in transcript CG11412-RC.
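The ASTA field layout can be read with a few lines of Python. The sketch below assumes tab-separated columns and comma-separated coordinate lists, and treats a missing or empty last column as an empty list of splice sites. The function name parse_asta_line and the dictionary keys are assumptions for this example; the AS code in the first column is kept as an uninterpreted string.

```python
def parse_asta_line(line):
    """Split one ASTA line into its six fields; coordinate lists become lists of int."""
    cols = line.rstrip("\n").split("\t")
    cols += [""] * (6 - len(cols))  # pad when trailing empty columns are missing
    structure, seqname, tid1, sites1, tid2, sites2 = cols[:6]

    def coords(text):
        # An empty column (e.g., the skipping transcript) yields an empty list.
        return [int(x) for x in text.split(",") if x.strip()]

    return {
        "structure": structure,
        "seqname": seqname,
        "transcript1": tid1,
        "sites1": coords(sites1),
        "transcript2": tid2,
        "sites2": coords(sites2),
    }

donor = parse_asta_line("1^,2^\tchr4\tCG1909-RB\t603475\tCG1909-RA\t603502")
skip = parse_asta_line("1^2-,0\tchrX\tCG11412-RA\t1200035,1200111\tCG11412-RC")
print(donor["sites1"], donor["sites2"])  # [603475] [603502]
print(skip["sites1"], skip["sites2"])    # [1200035, 1200111] []
```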