Shannon Information theory. Read about the application of information theory to molecular biology at Dr. Tom Schneider's page.
Presently, the system is based on the April 2003 reference sequence. It will soon be extended to all drafts available in the UCSC genome browser, so that users can choose the draft they are interested in.
Multiple accession numbers may be attached to the same functional gene name. In such cases, a list of the mRNA accession numbers is displayed, allowing the user to choose one of them.
The user is advised to choose the mRNA accession number that spans the largest range of base pairs.
If a particular accession number is missing from the UCSC genome April 2003 assembly, you can use the "Submit own sequence" option. This option lets you submit your own sequence and make the desired mutation(s). Hopefully, the server will be updated very soon, so that it is not limited to the April 2003 draft alone.
Some accession numbers still do not have an associated gene name. In such cases, the user can use this option, which allows mutational analyses to be done without requiring an associated HUGO designated gene name.
Wondering why the "Gene Name" field is still requested? Then click this link.
Designated Gene Name is a HUGO designated gene name that is present in the UCSC genome browser. To find the name of the associated gene, use the UCSC Genome Browser or the Genew database search engine. If you cannot find the gene name, you can choose the "Submit mRNA Accession #" option, where the gene name is requested for naming purposes only.
This is the Mutation / Variant field, where the user can submit a mutation or variant. The mutation must strictly conform to the HUGO Designated Mutation Nomenclature. Multiple mutations / variants can be analyzed together by submitting them separated by a '+'.
The Window Range is the region, in bases, before and after the base where the mutation takes place. It is the region in which the information content of sites is calculated; sites falling outside the window are ignored. For haplotypes, all sites lying between the mutated bases are also considered. The window range is limited to 1000 bases to reduce the overhead of scanning all the base pairs. The default value is 54, which is twice the acceptor Ri(b,l) matrix range.
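The window rule described above can be sketched roughly as follows. This is only an illustration, not the server's code: the function name `scan_region`, the 1-based coordinates, and the clamping of the start at position 1 are all assumptions.

```python
MAX_WINDOW = 1000     # upper limit stated in the help text
DEFAULT_WINDOW = 54   # default stated in the help text


def scan_region(mutation_positions, window=DEFAULT_WINDOW):
    """Return (start, end) of the region scanned around one or more mutations.

    Hypothetical helper: positions are assumed to be 1-based coordinates.
    """
    if not mutation_positions:
        raise ValueError("at least one mutation position is required")
    if not 0 <= window <= MAX_WINDOW:
        raise ValueError(f"window must be between 0 and {MAX_WINDOW}")
    first, last = min(mutation_positions), max(mutation_positions)
    # For haplotypes, everything between the first and last mutated base
    # is included automatically because the span runs from first to last.
    start = max(1, first - window)  # clamping at 1 is an assumption
    return start, last + window


print(scan_region([500]))           # single mutation, default window
print(scan_region([500, 620], 10))  # haplotype with a window of 10
```

With the default window, a single mutation at position 500 gives the region (446, 554); the two-mutation case gives (490, 630).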
A variety of information weight matrices (Ri(b,l)) are available, each of which recognizes a certain kind of site. The user can choose one or more information weight matrices. The acceptor and donor Ri(b,l) matrices are scanned by default. In the near future, more Ri(b,l) matrices will be added to the list.
The binding site selection method has been redesigned, using checkboxes instead of a drop down menu. Over time, the number of models developed for ASSEDA has increased, making the old selection method cumbersome. The new method allows for easier selection of specific models, and can easily be expanded without adding clutter. Simply check the models before submitting your mutation. Additional models of RNA binding proteins involved in splicing are planned to be added in the near future.
List of available binding sites: Donors and acceptors (human and mouse), branch point, SF2/ASF (SRSF1), SC35 (SRSF2), SRp40 (SRSF5), SRp55 (SRSF6), hnRNPA1, hnRNPH1.
CDS (CoDing Segment) introduces a complex section that describes the gene open reading frame (ORF), the portion of the sequence that codes for a protein product.
Most authors who report a mutation take the initial start codon as position 1, whereas some publications take the start position of the gene as position 1. This option lets users set the numbering according to their own convention.
Every region of DNA has six possible reading frames, three in each direction. The visualization map of binding sites is configured to show forward frames only. When the user selects this option, the map indicates all three forward frames with the encoded amino acids. This lets the user check whether or not the mutation shifts the reading frame.
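To illustrate what "three forward frames with amino acids encoded" means, here is a minimal sketch that translates a DNA string in each forward frame using the standard genetic code. The function name `forward_frames` and the use of "X" for unrecognized codons are assumptions made for this example; it does not reproduce how the server draws its map.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def forward_frames(seq):
    """Translate seq in forward frames 0, 1 and 2 (trailing partial codons dropped)."""
    seq = seq.upper()
    frames = []
    for offset in range(3):
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        # "X" marks codons containing characters other than A, C, G, T.
        frames.append("".join(CODON_TABLE.get(codon, "X") for codon in codons))
    return frames


print(forward_frames("ATGGCCT"))  # ['MA', 'WP', 'G']
```

Comparing the output for the wild-type and mutated sequence shows whether downstream codons in a given frame have changed, which is what a frameshift would produce.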
mRNA Accession Number is the accession number associated with the gene name. The user has to enter an accession number that is present in the April 2003 draft of the UCSC genome assembly, as this system is based on that draft. The user can find the mRNA Accession Number of the gene through the Genew database search engine or the UCSC Genome Browser. The accession number should not be the RefSeq accession number.
If a particular accession number is missing from the April 2003 draft of the UCSC genome assembly, you can use the "Submit own sequence" option. This option lets you submit your own sequence and make the desired mutation(s). Hopefully, the server will be updated very soon, so that it is not limited to the April 2003 draft alone.
This is simply to provide the gene name in the results pages. The user can enter any gene name, but it will not be tested or verified. It is used to generate comprehensive information results only.
Direction is the strand of the sequence pasted in the sequence text box. The user can specify either '+' or '-' strand.
This is the text box where the user can paste in their own sequence. The sequence is expected to contain only the characters a, g, c, or t. Any other characters found are removed from the sequence.
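The clean-up described above amounts to a simple character filter. A minimal sketch follows; the function name `clean_sequence` is hypothetical, and converting uppercase letters to lowercase (rather than dropping them) is an assumption, since the help text only mentions lowercase a, g, c, and t.

```python
def clean_sequence(raw):
    """Keep only a, c, g, t and drop every other character.

    Assumption: uppercase bases are lowercased instead of being removed.
    """
    return "".join(ch for ch in raw.lower() if ch in "acgt")


print(clean_sequence("ACGT nnac-gt 123"))  # acgtacgt
```

Note that a filter like this silently shortens the sequence, so coordinates in the cleaned sequence may differ from those in the pasted text.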
Depending on the option chosen (submitting your own sequence, a designated gene name, or an mRNA accession number), it takes approximately 30 to 60 seconds to analyze one mutation when the server load is optimal. A longer delay may be expected if the load is high.
The submitted mutation(s) / variant(s) are parsed, and the base pairs where the changes take place are identified. All the base pairs falling within the window range of those base pairs are pulled out of the library file for that chromosome, which consists of millions of bases. Identifying and extracting specific parts of the chromosome naturally causes some delay.
Not to forget, "It's always Worth Waiting!"
Genomic Coordinate: The genomic coordinate number of the base where the information content is measured.
Position Relative to Natural Site: The relative distance of the base from the closest natural site.
Closest Natural Site: The genomic coordinate number of the closest natural site. This link, when clicked, pops up a window containing the information content of all the natural sites of that particular mRNA accession.
Initial(Ri): Initial information content measured at the base before the mutation is made.
Final(Ri): Final information content measured at the base after the mutation is made.
ΔRi: Final(Ri) - Initial(Ri); change of information content obtained at the site due to a mutation or variant.
Fold change: A single bit difference in Ri value corresponds to at least a two-fold difference in binding site strength. Fold change indicates the change in binding affinity of two sites.
Fold Change = 2^{ΔRi}, where ΔRi is the difference between the individual information contents of the two sites (wild type, mutant type).
% Binding (Final/Initial): Indicates the change of binding energy calculated as a percentage.
Initial(Z): Z score for this evaluation, assuming that individual information values form a Gaussian distribution.
Final(Z): Z score after mutation.
ΔZ: Change in Z score obtained at the site due to a mutation or variant.
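The derived columns above follow directly from the initial and final values. The sketch below is illustrative only: `summarize_site` is a hypothetical name, ΔZ is assumed to use the same Final - Initial convention as ΔRi, and the fold change is reported as a magnitude (2 raised to the absolute value of ΔRi), which is an assumption about presentation rather than a statement of how the server prints it.

```python
def summarize_site(initial_ri, final_ri, initial_z, final_z):
    """Compute the derived result columns for one site (hypothetical helper)."""
    delta_ri = final_ri - initial_ri  # ΔRi = Final(Ri) - Initial(Ri)
    return {
        "delta_ri": delta_ri,
        # Magnitude of the change in binding affinity; the sign of
        # delta_ri tells whether the site became stronger or weaker.
        "fold_change": 2 ** abs(delta_ri),
        "delta_z": final_z - initial_z,  # assumed Final(Z) - Initial(Z)
    }


row = summarize_site(initial_ri=9.5, final_ri=2.5, initial_z=1.0, final_z=-2.0)
print(row)  # {'delta_ri': -7.0, 'fold_change': 128.0, 'delta_z': -3.0}
```

A 7 bit decrease therefore corresponds to a 128-fold decrease in binding affinity.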
This system uses the Delila system tools for the identification of potential sites.
The submitted sequence is scanned with the weight matrices selected by the user. Acceptor and donor weight matrices are used by default. Each matrix scans twice its own length on either side of the base(s) where the change is made. Since the acceptor weight matrix is the longest (27 bases), twice the length of the acceptor window on both sides of the changed base adds up to about 110 bases.
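As a rough illustration of scanning with a weight matrix, the sketch below slides a matrix along a sequence and sums the per-position weights to obtain an Ri value for each alignment. It is not the Delila implementation: the two-column `TOY_MATRIX` and its weights are invented for the example, and `bases_scanned` assumes the changed base itself is counted once.

```python
# Toy Ri(b,l) matrix: one dict per matrix position, weights in bits.
# These values are made up for illustration only.
TOY_MATRIX = [
    {"a": 1.0, "c": -1.0, "g": -1.0, "t": -1.0},
    {"a": -2.0, "c": -2.0, "g": 2.0, "t": -2.0},
]


def scan(sequence, matrix):
    """Return (start index, Ri) for every alignment where the matrix fits.

    The sequence is assumed to contain only a, c, g, t.
    """
    seq = sequence.lower()
    results = []
    for start in range(len(seq) - len(matrix) + 1):
        ri = sum(column[seq[start + l]] for l, column in enumerate(matrix))
        results.append((start, ri))
    return results


def bases_scanned(matrix_length):
    """Twice the matrix length on each side, plus the changed base (assumption)."""
    return 2 * (2 * matrix_length) + 1


print(scan("cagt", TOY_MATRIX))  # [(0, -3.0), (1, 3.0), (2, -3.0)]
print(bases_scanned(27))         # 109, i.e. about 110 bases
```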
The fold change in binding affinity of two sites (wild-type, mutant) is 2^{ΔRi}, where ΔRi is the difference between their respective individual information contents.
The architecture diagram can be found here, and the program flow can be found here.
By default, the ASSEDA server will only report potential binding sites that have a calculated bit score of 0 bits or more. We allow users to change this minimum in case they want to: 1) Increase the threshold to filter regions with a high number of potential binding sites, or 2) Decrease the threshold below 0 bits to investigate very weak splice binding sites that may be supported by splicing regulatory elements.
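The effect of the minimum threshold can be pictured as a simple filter. In this sketch the site records, their field names, and the function `filter_sites` are all hypothetical; only the default of 0 bits and the "0 bits or more" rule come from the description above.

```python
def filter_sites(sites, min_ri=0.0):
    """Keep sites whose Ri is at least min_ri bits (default: 0 bits)."""
    return [site for site in sites if site["ri"] >= min_ri]


# Hypothetical scan results for illustration.
sites = [
    {"coordinate": 101, "ri": 8.2},
    {"coordinate": 140, "ri": 0.0},
    {"coordinate": 155, "ri": -1.5},
]

print(len(filter_sites(sites)))               # 2: default keeps Ri >= 0
print(len(filter_sites(sites, min_ri=4.0)))   # 1: raised threshold filters more
print(len(filter_sites(sites, min_ri=-2.0)))  # 3: lowered threshold shows weak sites
```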
CDS (CoDing Segment) introduces a complex section that describes the gene open reading frame (ORF), the portion of the sequence that codes for a protein product. Most authors who report a mutation take the initial start codon as position 1, whereas some publications take the start position of the gene as position 1. This option lets users set the numbering according to their own convention:
Open Reading Frame (Initial CDS position in NCBI mRNA Accession): The initial start codon is considered as position 1.
First Position of the NCBI mRNA Accession: The first position of the mRNA Accession is considered as position 1.
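The two numbering conventions differ only by a fixed offset, as the following sketch shows. The function names and the `cds_start` parameter (the 1-based mRNA position of the first base of the start codon) are assumptions for illustration, and the sketch is restricted to positions at or after the start codon.

```python
def cds_to_mrna(cds_position, cds_start):
    """Convert a position numbered from the start codon to mRNA numbering."""
    if cds_position < 1:
        raise ValueError("this sketch only handles positions within the CDS")
    return cds_start + cds_position - 1


def mrna_to_cds(mrna_position, cds_start):
    """Convert an mRNA position to numbering from the start codon."""
    if mrna_position < cds_start:
        raise ValueError("position lies upstream of the start codon")
    return mrna_position - cds_start + 1


# Hypothetical transcript whose start codon begins at mRNA position 201.
print(cds_to_mrna(76, cds_start=201))   # 276
print(mrna_to_cds(276, cds_start=201))  # 76
```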
This enables the user to generate results without the visualization map, i.e. sequence walkers.
When calculating total exon information content (when Molecular Phenotype Prediction by Exon Definition is selected), splicing regulatory elements are not accounted for by default. If the user suspects that a mutation is altering an ESE/ISS, then it can be included in the calculation (currently, only SF2/ASF and SC35 sites are available). As a single mutation can lead to multiple redundant changes, only one altered site is considered (i.e. if two sites are weakened, the one which was initially stronger is considered, as it is the one most likely to be used).
When calculating total exon information content, ESE/ISS consideration can be selected by the user (above). This option allows the user to treat these splicing regulatory elements as exonic splicing enhancers (exonic enhancer strength is added to the calculation) or intronic splicing silencers (altered intronic regulatory elements are subtracted from the calculation). A second gap surprisal is also factored into the calculation; it is specific to the regulatory binding site type (SF2/ASF or SC35) and to whether ESE or ISS is selected.
Within the Phenotype Prediction tab, the second tab (Isoform Structure) displays a diagram for each predicted exon splice form. When a natural site is weakened, exon skipping can occur. ASSEDA will draw the exon skipping splice form if the mutation 1) abolishes the natural site (final Ri below 1.6 bits) or 2) decreases the natural site by at least 7 bits (a 128-fold decrease in binding affinity). This option allows the user to change the 7 bit value. The exon skipping splice form will also appear in the Custom UCSC Track tab.
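The two conditions above can be expressed as a small predicate. This is a sketch under stated assumptions, not ASSEDA's code: the function name `draws_exon_skipping` is hypothetical, and requiring the site to be weakened before applying the 1.6 bit test is an interpretation of "When a natural site is weakened".

```python
ABOLISHED_BELOW = 1.6        # bits; final Ri below this counts as abolished
DEFAULT_MIN_DECREASE = 7.0   # bits; user-adjustable per the option above


def draws_exon_skipping(initial_ri, final_ri, min_decrease=DEFAULT_MIN_DECREASE):
    """Return True if the exon skipping splice form would be drawn."""
    weakened = final_ri < initial_ri  # assumption: rule applies to weakened sites
    abolished = weakened and final_ri < ABOLISHED_BELOW
    large_drop = (initial_ri - final_ri) >= min_decrease
    return abolished or large_drop


print(draws_exon_skipping(8.0, 1.0))                     # True: site abolished
print(draws_exon_skipping(12.0, 5.0))                    # True: 7 bit decrease
print(draws_exon_skipping(12.0, 6.0))                    # False: 6 bit decrease
print(draws_exon_skipping(12.0, 6.0, min_decrease=5.0))  # True with a lower cutoff
```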
There have been recent changes to the Exon Definition formulation with regard to the impact of negative values. This update will not affect individual Ri values, but may affect previous computations of Ri,total involving sites which were abolished. The Logic and Formulation of Exon Definition for Splice and Splicing Regulatory Factors is described in detail here.