----------------------------------------------------------------------------
        UniProt - Swiss-Prot Protein Knowledgebase
        Swiss Institute of Bioinformatics (SIB); Geneva, Switzerland
        European Bioinformatics Institute (EBI); Hinxton, United Kingdom
        Protein Information Resource (PIR); Washington DC, USA
----------------------------------------------------------------------------

Description: A primer on UniProtKB/Swiss-Prot annotation
Name:        annbioch.txt
Release:     56.0 of 22-Jul-2008

----------------------------------------------------------------------------

============
Introduction
============

UniProtKB/Swiss-Prot is defined as an annotated protein sequence database.
We make every effort possible to ensure that all available biochemical
information accompanies the sequence data and that this information is as
complete and up-to-date as possible. This annotation is a labor-intensive
process that involves assessment of information from published articles
along with use of a variety of programs/algorithms. Use is also made of
Swiss-Prot itself in order to maintain standard nomenclature and
description comments. We describe here the steps we take to add all
relevant biochemical information to new entries going into Swiss-Prot. 

The biochemical information that accompanies a sequence data report varies
from case to case. Sometimes scientists isolate and then biochemically
characterize the protein encoded by the gene they have sequenced. At other
times they infer this information from similarity to other proteins within
the same conserved family. If the protein does not belong to a particular
family, they infer it purely from sequence similarity. Finally, genome
sequence data is rarely accompanied by a citation reporting any such
classification. Below we describe the steps we use to analyze these reports
and how we decide what information to add to the sequence entries, and how
to add it.

In all the scenarios below a new entry is taken from TrEMBL and, generally,
the first step is to get a copy of the article(s) given in the reference
lines. Then the sequence is aligned, using FastA or Blast, against all
existing Swiss-Prot and TrEMBL entries. This allows us, quickly and easily,
to assess if and how the sequence relates to existing families in
Swiss-Prot. The next step is to read the article(s), assess the information
given and add relevant comments and features to the entry.

It is important to note that the following is just an outline of the
annotation process. The whole process of assessing information for addition
into Swiss-Prot entries is MUCH MORE complex.

(I) Article(s) reports sequencing (nucleic acid and/or amino acid) and
    biochemical characterization

Often from reading the abstract of the paper and analyzing the FastA
results, we can see that the protein belongs to a particular family. In
these cases, care is taken to look at other members of the family and to
become familiar with the annotation that already exists. Any standard
annotation that is common to the family, for example, the description
line(s) and the keywords, can be added to the new entry. Other comments and
features, specific to the family, can be added in conjunction with reading
the paper. Any additional information from the paper, for example
post-translational modifications, is added to the entry.

(II) Article(s) reports sequencing with no biochemical characterization

In the majority of articles reporting gene sequencing, the gene is
translated to give the protein sequence but the in vivo protein is rarely
isolated and characterized. Often a probe from a similar organism is used to
pinpoint the gene and then the authors infer biochemical characteristics. In
these cases, curators assess what the authors imply alongside the results of
the alignments against Swiss-Prot and TrEMBL. When the sequence "hits" a
particular family, the description line(s), similarity comments and
keywords specific to the family can be added to the new entry. More care is
taken when looking at function, subunit and sequence features. This is the
first of the cases where we can introduce three of the four qualifiers
commonly found in Swiss-Prot, namely "probable", "potential" and "by
similarity" (for a description of "putative" please see later under Genome
Data).

When a gene has been identified by probing with the gene from another
organism, and that gene encodes a characterized protein, the description
line is copied over from the corresponding protein sequence entry. When the
function and other comment lines are present in the existing entry and are
not species-specific, they are added along with "by similarity" in
parentheses. It should be noted that "by similarity" is used when the
comment/feature in the existing entry has been proved, categorically, to be
so.
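
As a rough illustration of the rule above, the following Python sketch
appends the "By similarity" qualifier to a comment copied from a
characterized entry. The function name propagate_comment and the
species_specific flag are hypothetical; in practice this decision is made by
a curator, not by a program.

```python
def propagate_comment(text, species_specific=False):
    """Return a copied comment with '(By similarity)' appended, or None.

    Hypothetical helper: the curator's judgement is reduced here to a
    single boolean flag, which is an assumption made for illustration.
    """
    if species_specific:
        return None  # species-specific comments are not copied over
    text = text.rstrip()
    if text.endswith("(By similarity)."):
        return text  # already qualified in the source entry
    # Place the qualifier before the final full stop, as in the entries shown.
    body = text[:-1] if text.endswith(".") else text
    return body + " (By similarity)."


print(propagate_comment("Homohexamer."))
```

A comment that already carries the qualifier in the template entry is
returned unchanged, so the qualifier is never duplicated.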

Examples:

a) Swiss-Prot entry where authors have biochemically characterized the
   protein.

ID   AMPA_ECOLI              Reviewed;         503 AA.
AC   P68767; P11648; Q2M649;
DT   21-DEC-2004, integrated into UniProtKB/Swiss-Prot.
DT   21-DEC-2004, sequence version 1.
DT   23-OCT-2007, entry version 33.
DE   Cytosol aminopeptidase (EC 3.4.11.1) (Leucine aminopeptidase) (LAP)
DE   (Leucyl aminopeptidase) (Aminopeptidase A/I).
GN   Name=pepA; Synonyms=carP, xerB; OrderedLocusNames=b4260, JW4217;
OS   Escherichia coli (strain K12).
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC   Enterobacteriaceae; Escherichia.
OX   NCBI_TaxID=83333;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA], AND PROTEIN SEQUENCE OF 1-20.
RC   STRAIN=K12;
RX   MEDLINE=89356633; PubMed=2670557;
RA   Stirling C.J., Colloms S., Collins J.F., Szatmari G., Sherratt D.J.;
RT   "xerB, an Escherichia coli gene required for plasmid ColE1 site-
RT   specific recombination, is identical to pepA, encoding aminopeptidase
RT   A, a protein with substantial similarity to bovine lens leucine
RT   aminopeptidase.";
RL   EMBO J. 8:1623-1627(1989).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RC   STRAIN=K12;
RX   MEDLINE=95341674; PubMed=7616564;
RA   Charlier D., Hassanzadeh G., Kholti A., Gigot D., Pierard A.,
RA   Glansdorff N.;
RT   "carP, involved in pyrimidine regulation of the Escherichia coli
RT   carbamoylphosphate synthetase operon encodes a sequence-specific DNA-
RT   binding protein identical to XerB and PepA, also required for
RT   resolution of ColEI multimers.";
RL   J. Mol. Biol. 250:392-406(1995).
RN   [3]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=K12 / MG1655 / ATCC 47076;
RX   MEDLINE=95334362; PubMed=7610040;
RA   Burland V.D., Plunkett G. III, Sofia H.J., Daniels D.L.,
RA   Blattner F.R.;
RT   "Analysis of the Escherichia coli genome VI: DNA sequence of the
RT   region from 92.8 through 100 minutes.";
RL   Nucleic Acids Res. 23:2105-2119(1995).
RN   [4]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=K12 / MG1655 / ATCC 47076;
RX   MEDLINE=97426617; PubMed=9278503;
RA   Blattner F.R., Plunkett G. III, Bloch C.A., Perna N.T., Burland V.,
RA   Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F.,
RA   Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J.,
RA   Mau B., Shao Y.;
RT   "The complete genome sequence of Escherichia coli K-12.";
RL   Science 277:1453-1474(1997).
RN   [5]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=K12 / W3110 / ATCC 27325 / DSM 5911;
RX   PubMed=16738553;
RA   Hayashi K., Morooka N., Yamamoto Y., Fujita K., Isono K., Choi S.,
RA   Ohtsubo E., Baba T., Wanner B.L., Mori H., Horiuchi T.;
RT   "Highly accurate genome sequences of Escherichia coli K-12 strains
RT   MG1655 and W3110.";
RL   Mol. Syst. Biol. 2:E1-E5(2006).
RN   [6]
RP   MUTAGENESIS OF GLU-354.
RX   MEDLINE=94335644; PubMed=8057849;
RA   McCulloch R., Burke M.E., Sherratt D.J.;
RT   "Peptidase activity of Escherichia coli aminopeptidase A is not
RT   required for its role in Xer site-specific recombination.";
RL   Mol. Microbiol. 12:241-251(1994).
RN   [7]
RP   X-RAY CRYSTALLOGRAPHY (2.5 ANGSTROMS).
RX   PubMed=10449417;
RA   Strater N., Sherratt D.J., Colloms S.D.;
RT   "X-ray structure of aminopeptidase A from Escherichia coli and a model
RT   for the nucleoprotein complex in Xer site-specific recombination.";
RL   EMBO J. 18:4513-4522(1999).
CC   -!- FUNCTION: Presumably involved in the processing and regular
CC       turnover of intracellular proteins. Catalyzes the removal of
CC       unsubstituted N-terminal amino acids from various peptides.
CC       Required for plasmid ColE1 site-specific recombination but not in
CC       its aminopeptidase activity. Could act as a structural component
CC       of the putative nucleoprotein complex in which the Xer
CC       recombination reaction takes place.
CC   -!- CATALYTIC ACTIVITY: Release of an N-terminal amino acid, Xaa-|-
CC       Yaa-, in which Xaa is preferably Leu, but may be other amino acids
CC       including Pro although not Arg or Lys, and Yaa may be Pro. Amino
CC       acid amides and methyl esters are also readily hydrolyzed, but
CC       rates on arylamides are exceedingly low.
CC   -!- COFACTOR: Binds 2 manganese ions per subunit (By similarity).
CC   -!- ENZYME REGULATION: Inhibited by zinc and EDTA.
CC   -!- SUBUNIT: Homohexamer.
CC   -!- SIMILARITY: Belongs to the peptidase M17 family.
CC   -!- CAUTION: The ligation for manganese is based on the ligation for
CC       zinc, an inhibitor, in the crystallographic structure reported in
CC       PubMed:10449417. The ligation for manganese in the active form of
CC       the enzyme may differ.
DR   EMBL; X15130; CAA33225.1; -; Genomic_DNA.
DR   EMBL; X86443; CAA60164.1; -; Genomic_DNA.
DR   EMBL; U14003; AAA97157.1; -; Genomic_DNA.
DR   EMBL; U00096; AAC77217.1; -; Genomic_DNA.
DR   EMBL; AP009048; BAE78257.1; -; Genomic_DNA.
DR   PIR; S04462; APECA.
DR   RefSeq; AP_004756.1; -.
DR   RefSeq; NP_418681.1; -.
DR   PDB; 1GYT; X-ray; A/B/C/D/E/F/G/H/I/J/K/L=1-503.
DR   IntAct; P68767; -.
DR   MEROPS; M17.003; -.
DR   GeneID; 948791; -.
DR   GenomeReviews; U00096_GR; b4260.
DR   GenomeReviews; AP009048_GR; JW4217.
DR   KEGG; ecj:JW4217; -.
DR   KEGG; eco:b4260; -.
DR   EchoBASE; EB0688; -.
DR   EcoGene; EG10694; pepA.
DR   BioCyc; EcoCyc:EG10694-MONOMER; -.
DR   GO; GO:0004178; F:leucyl aminopeptidase activity; IEA:HAMAP.
DR   GO; GO:0030145; F:manganese ion binding; IEA:HAMAP.
DR   HAMAP; MF_00181; -; 1.
DR   InterPro; IPR011356; Peptidase_M17.
DR   InterPro; IPR000819; Peptidase_M17_C.
DR   InterPro; IPR008283; Peptidase_M17_N.
DR   PANTHER; PTHR11963:SF3; Peptidase_M17; 1.
DR   Pfam; PF00883; Peptidase_M17; 1.
DR   Pfam; PF02789; Peptidase_M17_N; 1.
DR   PIRSF; PIRSF001116; Ctsl_amnpptdse; 1.
DR   PRINTS; PR00481; LAMNOPPTDASE.
DR   PROSITE; PS00631; CYTOSOL_AP; 1.
PE   1: Evidence at protein level;
KW   3D-structure; Aminopeptidase; Complete proteome;
KW   Direct protein sequencing; Hydrolase; Manganese; Metal-binding;
KW   Protease.
FT   CHAIN         1    503       Cytosol aminopeptidase.
FT                                /FTId=PRO_0000165750.
FT   ACT_SITE    282    282       Potential.
FT   ACT_SITE    356    356       Potential.
FT   METAL       270    270       Manganese 2 (Probable).
FT   METAL       275    275       Manganese 1 (Probable).
FT   METAL       275    275       Manganese 2 (Probable).
FT   METAL       293    293       Manganese 2 (Probable).
FT   METAL       352    352       Manganese 1 (Probable).
FT   METAL       354    354       Manganese 1 (Probable).
FT   METAL       354    354       Manganese 2 (Probable).
FT   MUTAGEN     354    354       E->A: Loss of activity.
FT   STRAND        2      6
FT   HELIX        10     12
FT   STRAND       18     23
FT   TURN         24     26
FT   HELIX        30     36
FT   STRAND       39     41
FT   HELIX        42     49
FT   STRAND       59     64
FT   STRAND       69     77
FT   HELIX        86    102
FT   STRAND      106    110
FT   HELIX       112    114
FT   HELIX       122    137
FT   STRAND      156    160
FT   HELIX       164    166
FT   HELIX       167    192
FT   TURN        195    197
FT   HELIX       200    213
FT   TURN        214    217
FT   STRAND      218    223
FT   HELIX       225    230
FT   HELIX       234    241
FT   STRAND      243    245
FT   STRAND      248    255
FT   STRAND      265    275
FT   HELIX       287    294
FT   HELIX       295    310
FT   STRAND      313    325
FT   STRAND      337    339
FT   STRAND      345    347
FT   HELIX       355    365
FT   HELIX       366    369
FT   STRAND      372    378
FT   HELIX       382    388
FT   TURN        389    391
FT   STRAND      392    398
FT   HELIX       400    413
FT   STRAND      417    419
FT   HELIX       424    427
FT   HELIX       428    430
FT   STRAND      433    439
FT   HELIX       446    455
FT   STRAND      463    467
FT   TURN        469    471
FT   STRAND      472    474
FT   HELIX       476    478
FT   HELIX       486    496
SQ   SEQUENCE   503 AA;  54880 MW;  643DED17EAC44DCD CRC64;
     MEFSVKSGSP EKQRSACIVV GVFEPRRLSP IAEQLDKISD GYISALLRRG ELEGKPGQTL
     LLHHVPNVLS ERILLIGCGK ERELDERQYK QVIQKTINTL NDTGSMEAVC FLTELHVKGR
     NNYWKVRQAV ETAKETLYSF DQLKTNKSEP RRPLRKMVFN VPTRRELTSG ERAIQHGLAI
     AAGIKAAKDL GNMPPNICNA AYLASQARQL ADSYSKNVIT RVIGEQQMKE LGMHSYLAVG
     QGSQNESLMS VIEYKGNASE DARPIVLVGK GLTFDSGGIS IKPSEGMDEM KYDMCGAAAV
     YGVMRMVAEL QLPINVIGVL AGCENMPGGR AYRPGDVLTT MSGQTVEVLN TDAEGRLVLC
     DVLTYVERFE PEAVIDVATL TGACVIALGH HITGLMANHN PLAHELIAAS EQSGDRAWRL
     PLGDEYQEQL ESNFADMANI GGRPGGAITA GCFLSRFTRK YNWAHLDIAG TAWRSGKAKG
     ATGRPVALLA QFLLNRAGFN GEE
//
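
The entry above follows the Swiss-Prot flat-file layout, in which each line
starts with a two-letter line code (ID, AC, DE, CC, FT, ...) and the entry
ends with "//". As a minimal sketch, the Python below groups lines by code
and lists the features whose description carries one of the
non-experimental qualifiers. The helper names are hypothetical, the sample
is a shortened excerpt of the entry above, and the simple column slicing is
an assumption that holds only for unindented lines.

```python
from collections import defaultdict

# Shortened excerpt of the AMPA_ECOLI entry shown above.
SAMPLE = """\
ID   AMPA_ECOLI              Reviewed;         503 AA.
AC   P68767; P11648; Q2M649;
PE   1: Evidence at protein level;
FT   ACT_SITE    282    282       Potential.
FT   METAL       270    270       Manganese 2 (Probable).
FT   MUTAGEN     354    354       E->A: Loss of activity.
//
"""

QUALIFIERS = ("Potential", "Probable", "By similarity")


def group_by_line_code(entry_text):
    """Map each two-letter line code to the list of its line contents."""
    groups = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break  # end of entry
        code, content = line[:2], line[5:]
        if code.strip():  # skip blank lines and sequence lines
            groups[code].append(content)
    return groups


def qualified_features(groups):
    """Return (key, start, end, description) for features with a qualifier."""
    hits = []
    for content in groups.get("FT", []):
        fields = content.split(None, 3)
        if len(fields) == 4 and any(q in fields[3] for q in QUALIFIERS):
            hits.append(tuple(fields))
    return hits


groups = group_by_line_code(SAMPLE)
print(groups["AC"])
for feature in qualified_features(groups):
    print(feature)
```

Feature lines without a description (such as STRAND or HELIX) and
continuation lines (such as /FTId) have fewer than four fields and are
skipped by this sketch.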

b) Swiss-Prot entry where no characterization has taken place but where
   information has been added because the sequences are highly comparable
   and so we believe, beyond reasonable doubt, that it is such a protein.

The lines that have been indented are those where information has been
added.

ID   AMPA_HAEIN              Reviewed;         491 AA.
AC   P45334;
DT   01-NOV-1995, integrated into UniProtKB/Swiss-Prot.
DT   01-NOV-1995, sequence version 1.
DT   02-OCT-2007, entry version 58.
 DE   Cytosol aminopeptidase (EC 3.4.11.1) (Leucine aminopeptidase) (LAP)
 DE   (Leucyl aminopeptidase).
GN   Name=pepA; OrderedLocusNames=HI1705;
OS   Haemophilus influenzae.
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Pasteurellales;
OC   Pasteurellaceae; Haemophilus.
OX   NCBI_TaxID=727;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=ATCC 51907 / DSM 11121 / KW20 / Rd;
RX   MEDLINE=95350630; PubMed=7542800;
RA   Fleischmann R.D., Adams M.D., White O., Clayton R.A., Kirkness E.F.,
RA   Kerlavage A.R., Bult C.J., Tomb J.-F., Dougherty B.A., Merrick J.M.,
RA   McKenney K., Sutton G.G., FitzHugh W., Fields C.A., Gocayne J.D.,
RA   Scott J.D., Shirley R., Liu L.-I., Glodek A., Kelley J.M.,
RA   Weidman J.F., Phillips C.A., Spriggs T., Hedblom E., Cotton M.D.,
RA   Utterback T.R., Hanna M.C., Nguyen D.T., Saudek D.M., Brandon R.C.,
RA   Fine L.D., Fritchman J.L., Fuhrmann J.L., Geoghagen N.S.M.,
RA   Gnehm C.L., McDonald L.A., Small K.V., Fraser C.M., Smith H.O.,
RA   Venter J.C.;
RT   "Whole-genome random sequencing and assembly of Haemophilus influenzae
RT   Rd.";
RL   Science 269:496-512(1995).
 CC   -!- FUNCTION: Presumably involved in the processing and regular
 CC       turnover of intracellular proteins. Catalyzes the removal of
 CC       unsubstituted N-terminal amino acids from various peptides (By
 CC       similarity).
 CC   -!- CATALYTIC ACTIVITY: Release of an N-terminal amino acid, Xaa-|-
 CC       Yaa-, in which Xaa is preferably Leu, but may be other amino acids
 CC       including Pro although not Arg or Lys, and Yaa may be Pro. Amino
 CC       acid amides and methyl esters are also readily hydrolyzed, but
 CC       rates on arylamides are exceedingly low.
 CC   -!- COFACTOR: Binds 2 manganese ions per subunit (By similarity).
 CC   -!- SUBCELLULAR LOCATION: Cytoplasm (By similarity).
 CC   -!- SIMILARITY: Belongs to the peptidase M17 family.
DR   EMBL; L42023; AAC23351.1; -; Genomic_DNA.
DR   PIR; C64137; C64137.
DR   RefSeq; NP_439847.1; -.
DR   HSSP; P11648; 1GYT.
DR   MEROPS; M17.003; -.
DR   GeneID; 949712; -.
DR   GenomeReviews; L42023_GR; HI1705.
DR   KEGG; hin:HI1705; -.
DR   TIGR; HI1705; -.
DR   BioCyc; HINF71421:HI_1705-MONOMER; -.
DR   GO; GO:0004178; F:leucyl aminopeptidase activity; IEA:HAMAP.
DR   GO; GO:0030145; F:manganese ion binding; IEA:HAMAP.
DR   HAMAP; MF_00181; -; 1.
DR   InterPro; IPR011356; Peptidase_M17.
DR   InterPro; IPR000819; Peptidase_M17_C.
DR   InterPro; IPR008283; Peptidase_M17_N.
DR   PANTHER; PTHR11963:SF3; Peptidase_M17; 1.
DR   Pfam; PF00883; Peptidase_M17; 1.
DR   Pfam; PF02789; Peptidase_M17_N; 1.
DR   PIRSF; PIRSF001116; Ctsl_amnpptdse; 1.
DR   PRINTS; PR00481; LAMNOPPTDASE.
DR   PROSITE; PS00631; CYTOSOL_AP; 1.
PE   3: Inferred from homology;
 KW   Aminopeptidase; Complete proteome; Cytoplasm; Hydrolase; Manganese;
 KW   Metal-binding; Protease.
 FT   CHAIN         1    491       Cytosol aminopeptidase.
 FT                                /FTId=PRO_0000165758.
 FT   ACT_SITE    275    275       Potential.
 FT   ACT_SITE    349    349       Potential.
 FT   METAL       263    263       Manganese 2 (By similarity).
 FT   METAL       268    268       Manganese 1 (By similarity).
 FT   METAL       268    268       Manganese 2 (By similarity).
 FT   METAL       286    286       Manganese 2 (By similarity).
 FT   METAL       345    345       Manganese 1 (By similarity).
 FT   METAL       347    347       Manganese 1 (By similarity).
 FT   METAL       347    347       Manganese 2 (By similarity).
SQ   SEQUENCE   491 AA;  53529 MW;  71376DDB1B0076EB CRC64;
     MKYQAKNTAL SQATDCIVLG VYENNKFSKS FNEIDQLTQG YLNDLVKSGE LTGKLAQTVL
     LRDLQGLSAK RLLIVGCGKK GELTERQYKQ IIQAVLKTLK ETNTREVISY LTEIELKDRD
     LYWNIRFAIE TIEHTNYQFD HFKSQKAETS VLESFIFNTD CAQAQQAISH ANAISSGIKA
     ARDIANMPPN ICNPAYLAEQ AKNLAENSTA LSLKVVDEEE MAKLGMNAYL AVSKGSENRA
     YMSVLTFNNA PDKNAKPIVL VGKGLTFDAG GISLKPAADM DEMKYDMCGA ASVFGTMKTI
     AQLNLPLNVI GVLAGCENLP DGNAYRPGDI LTTMNGLTVE VLNTDAEGRL VLCDTLTYVE
     RFEPELVIDV ATLTGACVVA LGQHNSGLVS TDNNLANALL QAATETTDKA WRLPLSEEYQ
     EQLKSPFADL ANIGGRWGGA ITAGAFLSNF TKKYRWAHLD IAGTAWLQGA NKGATGRPVS
     LLTQFLINQV K
//

The alignment below shows that the degree of sequence similarity is such
that we can classify, beyond reasonable doubt, this protein as an
aminopeptidase A/I.

AMPA_ECOLI  MEFSVKSGSPEKQRSACIVVGVFEPRRLSPIAEQLDKISDGYISALLRRG
AMPA_HAEIN  MKYQAKN-TALSQATDCIVLGVYENNKFSKSFNEIDQLTQGYLNDLVKSG
            *.. .*. .. .* ..***.**.* ...*   ...*....**...*...*

AMPA_ECOLI  ELEGKPGQTLLLHHVPNVLSERILLIGCGKERELDERQYKQVIQKTINTL
AMPA_HAEIN  ELTGKLAQTVLLRDLQGLSAKRLLIVGCGKKGELTERQYKQIIQAVLKTL
            **.** .**.**...... ..*.*..****. **.******.** ...**

AMPA_ECOLI  NDTGSMEAVCFLTELHVKGRNNYWKVRQAVETAKETLYSFDQLKTNKSEP
AMPA_HAEIN  KETNTREVISYLTEIELKDRDLYWNIRFAIETIEHTNYQFDHFKSQKAET
            ..*...*....***...*.*. **..* *.** ..* * **..*..*.*.

AMPA_ECOLI  RRPLRKMVFNVPTRRELTSGERAIQHGLAIAAGIKAAKDLGNMPPNICNA
AMPA_HAEIN  S-VLESFIFNTDC----AQAQQAISHANAISSGIKAARDIANMPPNICNP
            .  * ...**.      . ...** *. **..*****.*..********.

AMPA_ECOLI  AYLASQARQLADSYSKNVITRVIGEQQMKELGMHSYLAVGQGSQNESLMS
AMPA_HAEIN  AYLAEQAKNLAEN-STALSLKVVDEEEMAKLGMNAYLAVSKGSENRAYMS
            ****.**..**.. *...  .*..*..* .***..****..**.* . **

AMPA_ECOLI  VIEYKGNASEDARPIVLVGKGLTFDSGGISIKPSEGMDEMKYDMCGAAAV
AMPA_HAEIN  VLTFNNAPDKNAKPIVLVGKGLTFDAGGISLKPAADMDEMKYDMCGAASV
            *..........*.************.****.**...************.*

AMPA_ECOLI  YGVMRMVAELQLPINVIGVLAGCENMPGGRAYRPGDVLTTMSGQTVEVLN
AMPA_HAEIN  FGTMKTIAQLNLPLNVIGVLAGCENLPDGNAYRPGDILTTMNGLTVEVLN
            .*.*. .*.*.**.***********.*.*.******.****.* ******

AMPA_ECOLI  TDAEGRLVLCDVLTYVERFEPEAVIDVATLTGACVIALGHHITGLMANHN
AMPA_HAEIN  TDAEGRLVLCDTLTYVERFEPELVIDVATLTGACVVALGQHNSGLVSTDN
            ***********.********** ************.***.* .**....*

AMPA_ECOLI  PLAHELIAASEQSGDRAWRLPLGDEYQEQLESNFADMANIGGRPGGAITA
AMPA_HAEIN  NLANALLQAATETTDKAWRLPLSEEYQEQLKSPFADLANIGGRWGGAITA
             **..*..*.....*.******..******.* ***.****** ******

AMPA_ECOLI  GCFLSRFTRKYNWAHLDIAGTAWRSGKAKGATGRPVALLAQFLLNRAGFNGEE
AMPA_HAEIN  GAFLSNFTKKYRWAHLDIAGTAWLQGANKGATGRPVSLLTQFLINQVK
            * ***.**.**.***********  * .********.**.***.*..

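For readers who want to put a number on an alignment such as the one above,
here is a minimal Python sketch that counts identical columns in a gapped
pairwise alignment. It regenerates only the "*" marks for identical
residues; the "." marks for conserved substitutions would need a
substitution matrix and are not attempted. The function names are
hypothetical, the input is the first block of the alignment above, and the
choice of ungapped columns as the denominator is one convention among
several.

```python
def identity_marks(seq_a, seq_b):
    """Return a marker line with '*' under every column of identical residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "*" if a == b and a != "-" else " " for a, b in zip(seq_a, seq_b)
    )


def percent_identity(seq_a, seq_b):
    """Identical columns as a percentage of the columns without a gap."""
    marks = identity_marks(seq_a, seq_b)
    ungapped = sum(1 for a, b in zip(seq_a, seq_b) if a != "-" and b != "-")
    if ungapped == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * marks.count("*") / ungapped


# First block of the AMPA_ECOLI / AMPA_HAEIN alignment shown above.
ECOLI = "MEFSVKSGSPEKQRSACIVVGVFEPRRLSPIAEQLDKISDGYISALLRRG"
HAEIN = "MKYQAKN-TALSQATDCIVLGVYENNKFSKSFNEIDQLTQGYLNDLVKSG"

print(ECOLI)
print(HAEIN)
print(identity_marks(ECOLI, HAEIN))
print(f"{percent_identity(ECOLI, HAEIN):.1f}% identity over ungapped columns")
```

A figure computed this way is only an aid: as described in this document,
the decision to transfer annotation rests on the curator's assessment, not
on a percentage alone.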

(III) Protein sequence data from translation of genome sequencing data

Genome sequencing has caused a massive influx of data into the nucleotide
sequence databases, and this has led to the same influx into TrEMBL, giving
thousands of entries waiting to go into Swiss-Prot. These sequence data are
submitted to the nucleotide sequence databases and are reported in
publications that present the entire genome sequence as well as genes
predicted by a number of methods. Apart from these gene designations, the
papers rarely include experimental information about any of the predicted
proteins. By combining what is reported with an assessment of sequence
alignments that hit both characterized and part-characterized protein
sequences (see above), we make an effort to add relevant biochemical
information to these translated protein sequences.

The first step here is to align the translated sequences against Swiss-Prot
and TrEMBL. (We run against TrEMBL as an additional check for exact matches,
which helps to reduce redundancy in our data and to pick up PROSITE/Pfam
information that may be missing from the entry being worked on.) This is
described fully further on. The results give rise to the following
scenarios:

  1. identical to an existing sequence in Swiss-Prot from the same organism
  2. identical to an existing sequence in Swiss-Prot from a different
     organism which may or may not be related
  3. strong similarity (i.e. many residues are identical or conserved), over
     the entire sequence, to an existing entry (from a related or different
     organism)
  4. strong similarity only at regions in the sequence (from same, related
     or different organism)
  5. some similarity to one or more existing entries
  6. no similarity to any existing entries

Each of these scenarios is described in detail below.
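
The six scenarios listed above can be pictured as a simple decision
procedure. The Python sketch below is illustrative only: the Hit class, the
scenario function and, above all, the two numeric thresholds are
assumptions introduced for the example. Curators apply no fixed cut-off and
judge similarity from experience.

```python
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: placeholder thresholds for illustration only. The annotation
# process described here uses no fixed cut-off values.
STRONG_IDENTITY = 0.40
FULL_LENGTH_COVERAGE = 0.90


@dataclass
class Hit:
    identity: float       # fraction of identical residues, 0.0 to 1.0
    coverage: float       # fraction of the query covered, 0.0 to 1.0
    same_organism: bool   # True if the hit comes from the same organism


def scenario(best_hit: Optional[Hit]) -> int:
    """Return the number (1 to 6) of the scenario matching the best hit."""
    if best_hit is None:
        return 6  # no similarity to any existing entry
    if best_hit.identity == 1.0 and best_hit.coverage == 1.0:
        return 1 if best_hit.same_organism else 2
    if best_hit.identity >= STRONG_IDENTITY:
        # Strong similarity: over the entire sequence, or only in regions.
        return 3 if best_hit.coverage >= FULL_LENGTH_COVERAGE else 4
    return 5  # some similarity to one or more existing entries


print(scenario(Hit(identity=1.0, coverage=1.0, same_organism=False)))
```

The example call models an identical sequence from a different organism
and therefore falls under scenario 2.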


1) Identical to an existing sequence in Swiss-Prot from the same organism

Update the existing Swiss-Prot entry by adding the new reference and new
EMBL DR line. Check new reference for any additional information.


2) Identical to an existing sequence in Swiss-Prot from a different organism
   which may or may not be related

We create a new entry based on the template entry. The majority of the
annotation information (comments, features, etc.) is copied with the
qualifier "By similarity" added. For example, the entry shown below has
been annotated based on the 100% identical (at protein level) entry from
E. coli which was shown in section II above.

ID   AMPA_ECO57              Reviewed;         503 AA.
AC   P68768; P11648;
DT   21-DEC-2004, integrated into UniProtKB/Swiss-Prot.
DT   21-DEC-2004, sequence version 1.
DT   02-OCT-2007, entry version 22.
DE   Cytosol aminopeptidase (EC 3.4.11.1) (Leucine aminopeptidase) (LAP)
DE   (Leucyl aminopeptidase) (Aminopeptidase A/I).
GN   Name=pepA; Synonyms=carP, xerB; OrderedLocusNames=Z5872, ECs5237;
OS   Escherichia coli O157:H7.
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC   Enterobacteriaceae; Escherichia.
OX   NCBI_TaxID=83334;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=O157:H7 / EDL933 / ATCC 700927 / EHEC;
RX   MEDLINE=21074935; PubMed=11206551;
RA   Perna N.T., Plunkett G. III, Burland V., Mau B., Glasner J.D.,
RA   Rose D.J., Mayhew G.F., Evans P.S., Gregor J., Kirkpatrick H.A.,
RA   Posfai G., Hackett J., Klink S., Boutin A., Shao Y., Miller L.,
RA   Grotbeck E.J., Davis N.W., Lim A., Dimalanta E.T., Potamousis K.,
RA   Apodaca J., Anantharaman T.S., Lin J., Yen G., Schwartz D.C.,
RA   Welch R.A., Blattner F.R.;
RT   "Genome sequence of enterohaemorrhagic Escherichia coli O157:H7.";
RL   Nature 409:529-533(2001).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=O157:H7 / Sakai / RIMD 0509952 / EHEC;
RX   MEDLINE=21156231; PubMed=11258796;
RA   Hayashi T., Makino K., Ohnishi M., Kurokawa K., Ishii K., Yokoyama K.,
RA   Han C.-G., Ohtsubo E., Nakayama K., Murata T., Tanaka M., Tobe T.,
RA   Iida T., Takami H., Honda T., Sasakawa C., Ogasawara N., Yasunaga T.,
RA   Kuhara S., Shiba T., Hattori M., Shinagawa H.;
RT   "Complete genome sequence of enterohemorrhagic Escherichia coli
RT   O157:H7 and genomic comparison with a laboratory strain K-12.";
RL   DNA Res. 8:11-22(2001).
CC   -!- FUNCTION: Presumably involved in the processing and regular
CC       turnover of intracellular proteins. Catalyzes the removal of
CC       unsubstituted N-terminal amino acids from various peptides.
CC       Required for plasmid ColE1 site-specific recombination but not in
CC       its aminopeptidase activity. Could act as a structural component
CC       of the putative nucleoprotein complex in which the Xer
CC       recombination reaction takes place (By similarity).
CC   -!- CATALYTIC ACTIVITY: Release of an N-terminal amino acid, Xaa-|-
CC       Yaa-, in which Xaa is preferably Leu, but may be other amino acids
CC       including Pro although not Arg or Lys, and Yaa may be Pro. Amino
CC       acid amides and methyl esters are also readily hydrolyzed, but
CC       rates on arylamides are exceedingly low.
CC   -!- COFACTOR: Binds 2 manganese ions per subunit (By similarity).
CC   -!- ENZYME REGULATION: Inhibited by zinc and EDTA (By similarity).
CC   -!- SUBUNIT: Homohexamer (By similarity).
CC   -!- SIMILARITY: Belongs to the peptidase M17 family.
DR   EMBL; AE005174; AAG59459.1; -; Genomic_DNA.
DR   EMBL; BA000007; BAB38660.1; -; Genomic_DNA.
DR   PIR; E91283; E91283.
DR   PIR; G86124; G86124.
DR   RefSeq; NP_290893.1; -.
DR   RefSeq; NP_313264.1; -.
DR   SMR; P68768; 1-503.
DR   GeneID; 913804; -.
DR   GeneID; 959777; -.
DR   GenomeReviews; BA000007_GR; ECs5237.
DR   GenomeReviews; AE005174_GR; Z5872.
DR   KEGG; ece:Z5872; -.
DR   KEGG; ecs:ECs5237; -.
DR   BioCyc; ECOL83334:ECS5237-MONOMER; -.
DR   GO; GO:0004178; F:leucyl aminopeptidase activity; IEA:HAMAP.
DR   GO; GO:0030145; F:manganese ion binding; IEA:HAMAP.
DR   HAMAP; MF_00181; -; 1.
DR   InterPro; IPR011356; Peptidase_M17.
DR   InterPro; IPR000819; Peptidase_M17_C.
DR   InterPro; IPR008283; Peptidase_M17_N.
DR   PANTHER; PTHR11963:SF3; Peptidase_M17; 1.
DR   Pfam; PF00883; Peptidase_M17; 1.
DR   Pfam; PF02789; Peptidase_M17_N; 1.
DR   PIRSF; PIRSF001116; Ctsl_amnpptdse; 1.
DR   PRINTS; PR00481; LAMNOPPTDASE.
DR   PROSITE; PS00631; CYTOSOL_AP; 1.
PE   3: Inferred from homology;
KW   Aminopeptidase; Complete proteome; Hydrolase; Manganese;
KW   Metal-binding; Protease.
FT   CHAIN         1    503       Cytosol aminopeptidase.
FT                                /FTId=PRO_0000165752.
FT   ACT_SITE    282    282       Potential.
FT   ACT_SITE    356    356       Potential.
FT   METAL       270    270       Manganese 2 (By similarity).
FT   METAL       275    275       Manganese 1 (By similarity).
FT   METAL       275    275       Manganese 2 (By similarity).
FT   METAL       293    293       Manganese 2 (By similarity).
FT   METAL       352    352       Manganese 1 (By similarity).
FT   METAL       354    354       Manganese 1 (By similarity).
FT   METAL       354    354       Manganese 2 (By similarity).
SQ   SEQUENCE   503 AA;  54880 MW;  643DED17EAC44DCD CRC64;
     MEFSVKSGSP EKQRSACIVV GVFEPRRLSP IAEQLDKISD GYISALLRRG ELEGKPGQTL
     LLHHVPNVLS ERILLIGCGK ERELDERQYK QVIQKTINTL NDTGSMEAVC FLTELHVKGR
     NNYWKVRQAV ETAKETLYSF DQLKTNKSEP RRPLRKMVFN VPTRRELTSG ERAIQHGLAI
     AAGIKAAKDL GNMPPNICNA AYLASQARQL ADSYSKNVIT RVIGEQQMKE LGMHSYLAVG
     QGSQNESLMS VIEYKGNASE DARPIVLVGK GLTFDSGGIS IKPSEGMDEM KYDMCGAAAV
     YGVMRMVAEL QLPINVIGVL AGCENMPGGR AYRPGDVLTT MSGQTVEVLN TDAEGRLVLC
     DVLTYVERFE PEAVIDVATL TGACVIALGH HITGLMANHN PLAHELIAAS EQSGDRAWRL
     PLGDEYQEQL ESNFADMANI GGRPGGAITA GCFLSRFTRK YNWAHLDIAG TAWRSGKAKG
     ATGRPVALLA QFLLNRAGFN GEE
//
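
The SQ line of this entry carries the same length, molecular weight and
CRC64 checksum as the SQ line of AMPA_ECOLI in section II, which is a quick
indication that the two sequences are identical. A minimal Python sketch of
that comparison follows. The function names are hypothetical, and a
matching checksum is treated only as a hint: the sequences themselves
should still be compared before two entries are called identical.

```python
import re

# Matches lines such as:
# SQ   SEQUENCE   503 AA;  54880 MW;  643DED17EAC44DCD CRC64;
SQ_PATTERN = re.compile(
    r"^SQ\s+SEQUENCE\s+(\d+) AA;\s+(\d+) MW;\s+([0-9A-F]{16}) CRC64;"
)


def parse_sq(line):
    """Return (length, molecular_weight, crc64) from an SQ header line."""
    match = SQ_PATTERN.match(line)
    if match is None:
        raise ValueError("not an SQ header line: " + line)
    return int(match.group(1)), int(match.group(2)), match.group(3)


def same_sequence_summary(sq_a, sq_b):
    """True if length, molecular weight and CRC64 all agree."""
    return parse_sq(sq_a) == parse_sq(sq_b)


# SQ lines copied from the entries shown in this document.
AMPA_ECOLI = "SQ   SEQUENCE   503 AA;  54880 MW;  643DED17EAC44DCD CRC64;"
AMPA_ECO57 = "SQ   SEQUENCE   503 AA;  54880 MW;  643DED17EAC44DCD CRC64;"
AMPA_HAEIN = "SQ   SEQUENCE   491 AA;  53529 MW;  71376DDB1B0076EB CRC64;"

print(same_sequence_summary(AMPA_ECOLI, AMPA_ECO57))
print(same_sequence_summary(AMPA_ECOLI, AMPA_HAEIN))
```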

3) Strong similarity (i.e. many residues are identical or conserved), over
   the entire sequence, to an existing entry (from a related or different
   organism)

There is no fixed cut-off point in percentage sequence similarity. The
curators assess from experience whether similarity is considered to be
strong or weak. For each individual case, we must also look to see
whether sequences are highly conserved between species. The following
example illustrates this.

This entry has been created from data submitted from the Schizosaccharomyces
pombe genome project.

ID   CHMU_SCHPO              Reviewed;         251 AA.
AC   O13739;
DT   15-JUL-1998, integrated into UniProtKB/Swiss-Prot.
DT   01-JAN-1998, sequence version 1.
DT   23-OCT-2007, entry version 53.
DE   Probable chorismate mutase (EC 5.4.99.5) (CM).
GN   ORFNames=SPAC16E8.04c;
OS   Schizosaccharomyces pombe (Fission yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC   Schizosaccharomycetes; Schizosaccharomycetales;
OC   Schizosaccharomycetaceae; Schizosaccharomyces.
OX   NCBI_TaxID=4896;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=ATCC 38366 / 972;
RX   MEDLINE=21848401; PubMed=11859360;
RA   Wood V., Gwilliam R., Rajandream M.A., Lyne M.H., Lyne R., Stewart A.,
RA   Sgouros J.G., Peat N., Hayles J., Baker S.G., Basham D., Bowman S.,
RA   Brooks K., Brown D., Brown S., Chillingworth T., Churcher C.M.,
RA   Collins M., Connor R., Cronin A., Davis P., Feltwell T., Fraser A.,
RA   Gentles S., Goble A., Hamlin N., Harris D.E., Hidalgo J., Hodgson G.,
RA   Holroyd S., Hornsby T., Howarth S., Huckle E.J., Hunt S., Jagels K.,
RA   James K.D., Jones L., Jones M., Leather S., McDonald S., McLean J.,
RA   Mooney P., Moule S., Mungall K.L., Murphy L.D., Niblett D., Odell C.,
RA   Oliver K., O'Neil S., Pearson D., Quail M.A., Rabbinowitsch E.,
RA   Rutherford K.M., Rutter S., Saunders D., Seeger K., Sharp S.,
RA   Skelton J., Simmonds M.N., Squares R., Squares S., Stevens K.,
RA   Taylor K., Taylor R.G., Tivey A., Walsh S.V., Warren T., Whitehead S.,
RA   Woodward J.R., Volckaert G., Aert R., Robben J., Grymonprez B.,
RA   Weltjens I., Vanstreels E., Rieger M., Schaefer M., Mueller-Auer S.,
RA   Gabel C., Fuchs M., Duesterhoeft A., Fritzc C., Holzer E., Moestl D.,
RA   Hilbert H., Borzym K., Langer I., Beck A., Lehrach H., Reinhardt R.,
RA   Pohl T.M., Eger P., Zimmermann W., Wedler H., Wambutt R., Purnelle B.,
RA   Goffeau A., Cadieu E., Dreano S., Gloux S., Lelaure V., Mottier S.,
RA   Galibert F., Aves S.J., Xiang Z., Hunt C., Moore K., Hurst S.M.,
RA   Lucas M., Rochet M., Gaillardin C., Tallada V.A., Garzon A., Thode G.,
RA   Daga R.R., Cruzado L., Jimenez J., Sanchez M., del Rey F., Benito J.,
RA   Dominguez A., Revuelta J.L., Moreno S., Armstrong J., Forsburg S.L.,
RA   Cerutti L., Lowe T., McCombie W.R., Paulsen I., Potashkin J.,
RA   Shpakovski G.V., Ussery D., Barrell B.G., Nurse P.;
RT   "The genome sequence of Schizosaccharomyces pombe.";
RL   Nature 415:871-880(2002).
CC   -!- CATALYTIC ACTIVITY: Chorismate = prephenate.
CC   -!- ENZYME REGULATION: Allosterically regulated.
CC   -!- PATHWAY: Metabolic intermediate biosynthesis; prephenate
CC       biosynthesis; prephenate from chorismate: step 1/1.
CC   -!- SUBUNIT: Homodimer (By similarity).
CC   -!- SIMILARITY: Contains 1 chorismate mutase domain.
DR   EMBL; CU329670; CAB11033.1; -; Genomic_DNA.
DR   PIR; T37784; T37784.
DR   HSSP; P32178; 5CSM.
DR   KEGG; spo:SPAC16E8.04c; -.
DR   GeneDB_Spombe; SPAC16E8.04c; -.
DR   BioCyc; SPOM-XXX-01:SPOM-XXX-01-001828-MONOMER; -.
DR   ArrayExpress; O13739; -.
DR   GO; GO:0005829; C:cytosol; IDA:GeneDB_SPombe.
DR   GO; GO:0005634; C:nucleus; IDA:GeneDB_SPombe.
DR   InterPro; IPR008238; Chor_mut_AroQ_eu.
DR   InterPro; IPR002701; Chorismate_mut.
DR   Gene3D; G3DSA:1.10.590.10; Chor_mut_AroQ_eu; 1.
DR   PANTHER; PTHR21145; Chor_mut_AroQ_eu; 1.
DR   Pfam; PF01817; CM_2; 1.
DR   PIRSF; PIRSF017318; Chor_mut_AroQ_eu; 1.
DR   TIGRFAMs; TIGR01802; CM_pl-yst; 1.
DR   PROSITE; PS51169; CHORISMATE_MUT_3; 1.
PE   2: Evidence at transcript level;
KW   Allosteric enzyme; Amino-acid biosynthesis;
KW   Aromatic amino acid biosynthesis; Complete proteome; Isomerase.
FT   CHAIN         1    251       Probable chorismate mutase.
FT                                /FTId=PRO_0000119205.
FT   DOMAIN        1    251       Chorismate mutase.
SQ   SEQUENCE   251 AA;  29050 MW;  1AC18AE4C1E6C4B7 CRC64;
     MSLVNEKLKL ENIRSALIRQ EDTIIFNFLE RAQFPRNEKV YKSGKEGCLN LENYDGSFLN
     YLLHEEEKVY ALVRRYASPE EYPFTDNLPE PILPKFSGKF PLHPNNVNVN SEILEYYINE
     IVPKISSPGD DFDNYGSTVV CDIRCLQSLS RRIHYGKFVA EAKYLANPEK YKKLILARDI
     KGIENEIVDA AQEERVLKRL HYKALNYGRD AADPTKPSDR INADCVASIY KDYVIPMTKK
     VEVDYLLARL L
//

When aligned to its closest homolog in Swiss-Prot and TrEMBL the
following results are obtained:

CHMU_YEAST  MDFTKPETVLNLQNIRDELVRMEDSIIFKFIERSHFATCPSVYEANHPG-
CHMU_SCHPO  MSLVNEK--LKLENIRSALIRQEDTIIFNFLERAQFPRNEKVYKSGKEGC
            *.... .  *.*.***..*.* **.***.*.**..*.   .**.... *

CHMU_YEAST  LEIPNFKGSFLDWALSNLEIAHSRIRRFESPDETPFFPDKIQKSFLPSIN
CHMU_SCHPO  LNLENYDGSFLNYLLHEEEKVYALVRRYASPEEYPF-TDNLPEPILP--K
            *.. *..****.. * . * ... .**..**.* ** .*......**  .

CHMU_YEAST  YPQILAPYAPEVNYNDKIKKVYIEKIIPLISKRDGDDKNNFGSVATRDIE
CHMU_SCHPO  FSGKFPLHPNNVNVNSEILEYYINEIVPKISSP-GDDFDNYGSTVVCDIR
            ..  .. .. .** *..* . **..*.* **.. *** .*.**... **

CHMU_YEAST  CLQSLSRRIHFGKFVAEAKFQSDIPLYTKLIKSKDVEGIMKNITNSAVEE
CHMU_SCHPO  CLQSLSRRIHYGKFVAEAKYLANPEKYKKLILARDIKGIENEIVDAAQEE
            **********.********. ..   *.*** ..*..** ..*...* **

CHMU_YEAST  KILERLTKKAEVYGVDPTNES-GERRITPEYLVKIYKEIVIPITKEVEVE
CHMU_SCHPO  RVLKRLHYKALNYGRDAADPTKPSDRINADCVASIYKDYVIPMTKKVEVD
            ..*.**  **  ** *... .  . **.......***. ***.**.***.

CHMU_YEAST  YLLRRLEE
CHMU_SCHPO  YLLARLL
            *** **
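
As a minimal sketch, and not the procedure used by the curators, the
similarity visible in an alignment like the one above can be expressed as a
percentage of identical residues. The function name percent_identity and
the choice of denominator (columns in which neither sequence has a gap) are
assumptions made for this illustration; other definitions of identity exist.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns without a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # columns containing a gap are not counted
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Last block of the CHMU_YEAST / CHMU_SCHPO alignment; the shorter row is
# padded with a gap character so that both rows have the same length.
yeast = "YLLRRLEE"
pombe = "YLLARLL-"
print(f"{percent_identity(yeast, pombe):.1f}% identity over this block")
```

Concatenating all blocks of an alignment row before calling the function
gives the figure for the full-length comparison.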

The sequences show a high degree of similarity over their entire lengths, so
it is highly likely that the sequence from the Schizosaccharomyces pombe
genome project is indeed a chorismate mutase. This allows us to add the
standard description line, comments describing the catalytic activity and
the pathway the enzyme is involved in, and the relevant keywords. We can
also add a subunit comment, but here we add "(By similarity)" to show that
this information comes from a characterized protein (in this case
CHMU_YEAST (P32178)) and has not been experimentally determined in
S. pombe. In addition, because this protein has not itself been
biochemically characterized, we add "Probable" to the DE line to indicate
this, e.g. "Probable chorismate mutase."


4) Strong similarity only at regions in the sequence (from same, related
   or different organism)

These cases often pick up on areas within a sequence that are responsible
for binding, for example, cofactors, metals, DNA or ATP/GTP. Here, a
function can often be assigned, leading to description lines, comments and
keywords being added to the new entry. In some cases, however, even though
areas are conserved there is no evidence to characterize the protein. It
should be noted that we also make use of domain/family databases such as
PROSITE and Pfam in these cases. Below are examples of both these cases.

The entry below is again from the S. pombe genome project.

ID   PPK14_SCHPO             Reviewed;         566 AA.
AC   Q09831;
DT   01-FEB-1996, integrated into UniProtKB/Swiss-Prot.
DT   01-FEB-1996, sequence version 1.
DT   23-OCT-2007, entry version 52.
DE   Serine/threonine-protein kinase ppk14 (EC 2.7.11.1).
GN   Name=ppk14; ORFNames=SPAC4G8.05;
OS   Schizosaccharomyces pombe (Fission yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC   Schizosaccharomycetes; Schizosaccharomycetales;
OC   Schizosaccharomycetaceae; Schizosaccharomyces.
OX   NCBI_TaxID=4896;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=ATCC 38366 / 972;
RX   MEDLINE=21848401; PubMed=11859360;
RA   Wood V., Gwilliam R., Rajandream M.A., Lyne M.H., Lyne R., Stewart A.,
RA   Sgouros J.G., Peat N., Hayles J., Baker S.G., Basham D., Bowman S.,
RA   Brooks K., Brown D., Brown S., Chillingworth T., Churcher C.M.,
RA   Collins M., Connor R., Cronin A., Davis P., Feltwell T., Fraser A.,
RA   Gentles S., Goble A., Hamlin N., Harris D.E., Hidalgo J., Hodgson G.,
RA   Holroyd S., Hornsby T., Howarth S., Huckle E.J., Hunt S., Jagels K.,
RA   James K.D., Jones L., Jones M., Leather S., McDonald S., McLean J.,
RA   Mooney P., Moule S., Mungall K.L., Murphy L.D., Niblett D., Odell C.,
RA   Oliver K., O'Neil S., Pearson D., Quail M.A., Rabbinowitsch E.,
RA   Rutherford K.M., Rutter S., Saunders D., Seeger K., Sharp S.,
RA   Skelton J., Simmonds M.N., Squares R., Squares S., Stevens K.,
RA   Taylor K., Taylor R.G., Tivey A., Walsh S.V., Warren T., Whitehead S.,
RA   Woodward J.R., Volckaert G., Aert R., Robben J., Grymonprez B.,
RA   Weltjens I., Vanstreels E., Rieger M., Schaefer M., Mueller-Auer S.,
RA   Gabel C., Fuchs M., Duesterhoeft A., Fritzc C., Holzer E., Moestl D.,
RA   Hilbert H., Borzym K., Langer I., Beck A., Lehrach H., Reinhardt R.,
RA   Pohl T.M., Eger P., Zimmermann W., Wedler H., Wambutt R., Purnelle B.,
RA   Goffeau A., Cadieu E., Dreano S., Gloux S., Lelaure V., Mottier S.,
RA   Galibert F., Aves S.J., Xiang Z., Hunt C., Moore K., Hurst S.M.,
RA   Lucas M., Rochet M., Gaillardin C., Tallada V.A., Garzon A., Thode G.,
RA   Daga R.R., Cruzado L., Jimenez J., Sanchez M., del Rey F., Benito J.,
RA   Dominguez A., Revuelta J.L., Moreno S., Armstrong J., Forsburg S.L.,
RA   Cerutti L., Lowe T., McCombie W.R., Paulsen I., Potashkin J.,
RA   Shpakovski G.V., Ussery D., Barrell B.G., Nurse P.;
RT   "The genome sequence of Schizosaccharomyces pombe.";
RL   Nature 415:871-880(2002).
RN   [2]
RP   IDENTIFICATION.
RX   PubMed=15821139;
RA   Bimbo A., Jia Y., Poh S.L., Karuturi R.K.M., den Elzen N., Peng X.,
RA   Zheng L., O'Connell M., Liu E.T., Balasubramanian M.K., Liu J.;
RT   "Systematic deletion analysis of fission yeast protein kinases.";
RL   Eukaryot. Cell 4:799-813(2005).
CC   -!- CATALYTIC ACTIVITY: ATP + a protein = ADP + a phosphoprotein.
CC   -!- SIMILARITY: Belongs to the protein kinase superfamily. Ser/Thr
CC       protein kinase family. KIN82 subfamily.
CC   -!- SIMILARITY: Contains 1 protein kinase domain.
DR   EMBL; CU329670; CAA91206.1; -; Genomic_DNA.
DR   PIR; S62482; S62482.
DR   HSSP; P31751; 1GZK.
DR   KEGG; spo:SPAC4G8.05; -.
DR   GeneDB_Spombe; SPAC4G8.05; -.
DR   BioCyc; SPOM-XXX-01:SPOM-XXX-01-000780-MONOMER; -.
DR   ArrayExpress; Q09831; -.
DR   GO; GO:0004674; F:protein serine/threonine kinase activity; TAS:GeneDB_SPombe.
DR   InterPro; IPR000719; Prot_kinase_core.
DR   InterPro; IPR008271; Ser_thr_pkin_AS.
DR   InterPro; IPR002290; Ser_thr_pkinase.
DR   Pfam; PF00069; Pkinase; 1.
DR   ProDom; PD000001; Prot_kinase; 1.
DR   SMART; SM00220; S_TKc; 1.
DR   PROSITE; PS00107; PROTEIN_KINASE_ATP; FALSE_NEG.
DR   PROSITE; PS50011; PROTEIN_KINASE_DOM; 1.
DR   PROSITE; PS00108; PROTEIN_KINASE_ST; 1.
PE   2: Evidence at transcript level;
KW   ATP-binding; Complete proteome; Kinase; Nucleotide-binding;
KW   Serine/threonine-protein kinase; Transferase.
FT   CHAIN         1    566       Serine/threonine-protein kinase ppk14.
FT                                /FTId=PRO_0000086043.
FT   DOMAIN      195    485       Protein kinase.
FT   NP_BIND     201    209       ATP (By similarity).
FT   ACT_SITE    320    320       Proton acceptor (By similarity).
FT   BINDING     224    224       ATP (By similarity).
SQ   SEQUENCE   566 AA;  63482 MW;  3D18B4F84E10AA13 CRC64;
     MNELHDGESS EEGRINVEDH LEEAKKDDTG HWKHSGTAKP SKFRAFIRLH FKDSRKFAFS
     RKKEKELTSE DSDAANQSPS GAPESQTEEE SDRKIDGTGS SAEGGDGSGT DSISVIKKSF
     FKSGRKKKDV PKSRNVSRSN GADTSVQREK LKDIFSPHGK EKELAHIKKT VATRARTYSS
     NSIKICDVEV GPSSFEKVFL LGKGDVGRVY LVREKKSGKF YAMKVLSKQE MIKRNKSKRA
     FAEQHILATS NHPFIVTLYH SFQSDEYLYL CMEYCMGGEF FRALQRRPGR CLSENEAKFY
     IAEVTAALEY LHLMGFIYRD LKPENILLHE SGHIMLSDFD LSKQSNSAGA PTVIQARNAP
     SAQNAYALDT KSCIADFRTN SFVGTEEYIA PEVIKGCGHT SAVDWWTLGI LFYEMLYATT
     PFKGKNRNMT FSNILHKDVI FPEYADAPSI SSLCKNLIRK LLVKDENDRL GSQAGAADVK
     LHPFFKNVQW ALLRHTEPPI IPKLAPIDEK GNPNISHLKE SKSLDITHSP QNTQTVEVPL
     SNLSGADHGD DPFESFNSVT VHHEWD
//

By looking at the alignment we can see that the conserved areas lie around
the ATP-binding sites (which are picked up by PROSITE and Pfam too) and that
the active site is also conserved. Hence we can add this information to the
entry, where it appears in the feature table flagged "By similarity". This
shows that there is no experimental proof, but that the protein is very
likely to be a serine/threonine protein kinase because conserved features of
that family of proteins are present in the sequence.

Below is the alignment to highlight this.

PPK14_SCHPO       MNELHDGESSEEGRINVEDHLEEAKKDD---TGHWKHSGTAKPSKFRAFIRLHFKDSR
NRC2_NEUCR      MPSTKNANGEGHFPSRIKQFFRINSGSKDHKDRDAHTTSSSHGGAPRADAKTPSGFRQSR
                   .:  :**.   .**:   :::...**.    .* . *. . ..:  *     *::**

PPK14_SCHPO     KFAFSRKKEKELTSED-------SDAANQSPSGAPESQ--TEEESD-----RKIDGTGSS
NRC2_NEUCR      FFSVGRLRSTTVVSEGNPLDESMSPTAHANPYFAHQGQPGLRHHNDGSVPPSPPDTPSLK
                 *:..* :.. :.**.       * :*: .*  * :.*   ....*        * .. .

PPK14_SCHPO     AEGGDGSGTDSISVIKKSFFKSGRKKKDVPKSRNVS---RSNG---ADTSVQRE---KLK
NRC2_NEUCR      VDGPEGS-QQPTAATKEELARKLRRVASAPNAQGLFSKGQGNGDRPATAELSKEPLEESK
                .:* :**  :. :. *:.: :. *:  ..*:::.:    :.**   * :.:.:*   : *

PPK14_SCHPO     DIFSPHGKEKE--------------------LAHIKKTVATRARTYSSNSIKICDVEVGP
NRC2_NEUCR      DSNTVGFAEQKPNNDSSTSLAAPDADGLGALPPPIRQSPLAFRRTYSSNSIKVRNVEVGP
                *  :    *::                     . *:::  :  *********: :*****

PPK14_SCHPO     SSFEKVFLLGKGDVGRVYLVREKKSGKFYAMKVLSKQEMIKRNKSKRAFAEQHILATSNH
NRC2_NEUCR      QSFDKIKLIGKGDVGKVYLVKEKKSGRLYAMKVLSKKEMIKRNKIKRALAEQEILATSNH
                .**:*: *:******:****:*****::********:******* ***:***.*******

PPK14_SCHPO     PFIVTLYHSFQSDEYLYLCMEYCMGGEFFRALQRRPGRCLSENEAKFYIAEVTAALEYLH
NRC2_NEUCR      PFIVTLYHSFQSEDYLYLCMEYCSGGEFFRALQTRPGKCIPEDDARFYAAEVTAALEYLH
                ************::********* ********* ***:*:.*::*:** ***********

PPK14_SCHPO     LMGFIYRDLKPENILLHESGHIMLSDFDLSKQSNSAGAPTVIQARNAPSAQNAYALDTKS
NRC2_NEUCR      LMGFIYRDLKPENILLHQSGHIMLSDFDLSKQSDPGGKPTMIIGKNGTSTSSLPTIDTKS
                *****************:***************:..* **:* .:*..*:..  ::****

PPK14_SCHPO     CIADFRTNSFVGTEEYIAPEVIKGCGHTSAVDWWTLGILFYEMLYATTPFKGKNRNMTFS
NRC2_NEUCR      CIANFRTNSFVGTEEYIAPEVIKGSGHTSAVDWWTLGILIYEMLYGTTPFKGKNRNATFA
                ***:********************.**************:*****.********** **:

PPK14_SCHPO     NILHKDVIFPEYADAPSISSLCKNLIRKLLVKDENDRLGSQAGAADVKLHPFFKNVQWAL
NRC2_NEUCR      NILREDIPFPDHAGAPQISNLCKSLIRKLLIKDENRRLGARAGASDIKTHPFFRTTQWAL
                ***::*: **::*.**.**.***.******:**** ***::***:*:* ****:..****

PPK14_SCHPO     LRHTEPPIIPKLAPIDEKGNPNISHLKESKSLDITHSPQNTQTVEVPLSNLSG-ADHGDD
NRC2_NEUCR      IRHMKPPIVPNQGRG--IDTLNFRNVKESESVDISGSRQMGLKGEPLESGMVTPGENAVD
                :** :***:*: .     .. *: ::***:*:**: * *   . *   *.:   .::. *

PPK14_SCHPO     PFESFNSVTVHHEWD
NRC2_NEUCR      PFEEFNSVTLHHDGDEEYHSDAYEKR
                ***.*****:**: *


5) Some similarity to one or more existing entries

It is in this category that the adjective "putative" comes into play. For
these cases, again there is no experimental proof that the protein exists
and there is only limited evidence to assign the protein to a particular
family. Again, we have no fixed rules on what is "limited" and what isn't.
It is a judgement that we make based on which family it is and which, if
any, areas are conserved. Below is one example of many that exist in
Swiss-Prot. From the alignments and from hits to the pattern databases we
attempt to add any information so that it is not lost. By using "Putative"
in the description line we show that there is evidence within the sequence
data but that we do not want to classify the protein definitively until
experimental proof is available. When it is, the entry will be updated
accordingly. Staying with the S. pombe project, the following entry shows
this.

ID   YA55_SCHPO              Reviewed;         513 AA.
AC   Q09735;
DT   01-NOV-1995, integrated into UniProtKB/Swiss-Prot.
DT   01-NOV-1995, sequence version 1.
DT   23-OCT-2007, entry version 56.
DE   Putative aminopeptidase C13A11.05 (EC 3.4.11.-).
GN   ORFNames=SPAC13A11.05;
OS   Schizosaccharomyces pombe (Fission yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC   Schizosaccharomycetes; Schizosaccharomycetales;
OC   Schizosaccharomycetaceae; Schizosaccharomyces.
OX   NCBI_TaxID=4896;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=ATCC 38366 / 972;
RX   MEDLINE=21848401; PubMed=11859360;
RA   Wood V., Gwilliam R., Rajandream M.A., Lyne M.H., Lyne R., Stewart A.,
RA   Sgouros J.G., Peat N., Hayles J., Baker S.G., Basham D., Bowman S.,
RA   Brooks K., Brown D., Brown S., Chillingworth T., Churcher C.M.,
RA   Collins M., Connor R., Cronin A., Davis P., Feltwell T., Fraser A.,
RA   Gentles S., Goble A., Hamlin N., Harris D.E., Hidalgo J., Hodgson G.,
RA   Holroyd S., Hornsby T., Howarth S., Huckle E.J., Hunt S., Jagels K.,
RA   James K.D., Jones L., Jones M., Leather S., McDonald S., McLean J.,
RA   Mooney P., Moule S., Mungall K.L., Murphy L.D., Niblett D., Odell C.,
RA   Oliver K., O'Neil S., Pearson D., Quail M.A., Rabbinowitsch E.,
RA   Rutherford K.M., Rutter S., Saunders D., Seeger K., Sharp S.,
RA   Skelton J., Simmonds M.N., Squares R., Squares S., Stevens K.,
RA   Taylor K., Taylor R.G., Tivey A., Walsh S.V., Warren T., Whitehead S.,
RA   Woodward J.R., Volckaert G., Aert R., Robben J., Grymonprez B.,
RA   Weltjens I., Vanstreels E., Rieger M., Schaefer M., Mueller-Auer S.,
RA   Gabel C., Fuchs M., Duesterhoeft A., Fritzc C., Holzer E., Moestl D.,
RA   Hilbert H., Borzym K., Langer I., Beck A., Lehrach H., Reinhardt R.,
RA   Pohl T.M., Eger P., Zimmermann W., Wedler H., Wambutt R., Purnelle B.,
RA   Goffeau A., Cadieu E., Dreano S., Gloux S., Lelaure V., Mottier S.,
RA   Galibert F., Aves S.J., Xiang Z., Hunt C., Moore K., Hurst S.M.,
RA   Lucas M., Rochet M., Gaillardin C., Tallada V.A., Garzon A., Thode G.,
RA   Daga R.R., Cruzado L., Jimenez J., Sanchez M., del Rey F., Benito J.,
RA   Dominguez A., Revuelta J.L., Moreno S., Armstrong J., Forsburg S.L.,
RA   Cerutti L., Lowe T., McCombie W.R., Paulsen I., Potashkin J.,
RA   Shpakovski G.V., Ussery D., Barrell B.G., Nurse P.;
RT   "The genome sequence of Schizosaccharomyces pombe.";
RL   Nature 415:871-880(2002).
CC   -!- COFACTOR: Binds 2 zinc ions per subunit (By similarity).
CC   -!- SUBCELLULAR LOCATION: Cytoplasm (By similarity).
CC   -!- SIMILARITY: Belongs to the peptidase M17 family.
DR   EMBL; CU329670; CAA90806.1; -; Genomic_DNA.
DR   PIR; T37612; T37612.
DR   HSSP; P00727; 1BPN.
DR   KEGG; spo:SPAC13A11.05; -.
DR   GeneDB_Spombe; SPAC13A11.05; -.
DR   BioCyc; SPOM-XXX-01:SPOM-XXX-01-000717-MONOMER; -.
DR   ArrayExpress; Q09735; -.
DR   InterPro; IPR011356; Peptidase_M17.
DR   InterPro; IPR000819; Peptidase_M17_C.
DR   InterPro; IPR008283; Peptidase_M17_N.
DR   PANTHER; PTHR11963:SF3; Peptidase_M17; 1.
DR   Pfam; PF00883; Peptidase_M17; 1.
DR   Pfam; PF02789; Peptidase_M17_N; 1.
DR   PIRSF; PIRSF001116; Ctsl_amnpptdse; 1.
DR   PRINTS; PR00481; LAMNOPPTDASE.
DR   PROSITE; PS00631; CYTOSOL_AP; 1.
PE   2: Evidence at transcript level;
KW   Aminopeptidase; Complete proteome; Cytoplasm; Hydrolase;
KW   Metal-binding; Protease; Zinc.
FT   CHAIN         1    513       Putative aminopeptidase C13A11.05.
FT                                /FTId=PRO_0000165853.
FT   ACT_SITE    292    292       Potential.
FT   ACT_SITE    366    366       Potential.
FT   METAL       280    280       Zinc 2 (By similarity).
FT   METAL       285    285       Zinc 1 (By similarity).
FT   METAL       285    285       Zinc 2 (By similarity).
FT   METAL       303    303       Zinc 2 (By similarity).
FT   METAL       362    362       Zinc 1 (By similarity).
FT   METAL       364    364       Zinc 1 (By similarity).
FT   METAL       364    364       Zinc 2 (By similarity).
SQ   SEQUENCE   513 AA;  56195 MW;  F904CC0607502018 CRC64;
     MKGLGLSTRT FNWSSLSSIL LPRIPLATTK ADSLILAVRH DKQVFSEDYR QVVDQYFETS
     PKKNDIRLFW NTQGFVRLAI VQLEENVSEK SVRSAAAEAA KILKSNGAKS IAVDGMGFPK
     DAALGAALAT YDFSLRRDHL SVYQDEKVVE KENLFTSPAP ERLTFQLLSN TSEKKTATAE
     ENAFKVGLIE AAAQNLARSL MECPANYMTS LQFCHFAQEL FQNSSKVKVF VHDEKWIDEQ
     KMNGLLTVNA GSDIPPRFLE VQYIGKEKSK DDGWLGLVGK GVTFDSGGIS IKPSQNMKEM
     RADMGGAAVM LSSIYALEQL SIPVNAVFVT PLTENLPSGS AAKPGDVIFM RNGLSVEIDN
     TDAEGRLILA DAVHYVSSQY KTKAVIEAST LTGAMLVALG NVFTGAFVQG EELWKNLETA
     SHDAGDLFWR MPFHEAYLKQ LTSSSNADLC NVSRAGGGCC TAAAFIKCFL AQKDLSFAHL
     DIAGVMDKQL NSWDCDGMSG RPVRTIIEVA RKY
//

The alignment shows that all the functional sites, i.e. the metal ion
binding sites and the active sites, are conserved between the S. pombe
sequence and the bovine one. However, because of the nature of the family
it is not possible, with the evidence available, to classify this protein
completely. Hence all available information is added and the entry is
referred to as a "putative" aminopeptidase.

YA55_SCHPO      MKGLGLSTRTFNWSSLSSILLPRIPLATTKADSLIL-AVRHDKQVFSEDYRQVVDQYFET
AMPL_BOVIN      TKGLVLGIYSKEKEEDE----PQFTSAGENFNKLVSGKLREILNISGPSLKAGKTRTFYG
                 *** *.  : : .. .    *::. *  : :.*:   :*.  :: . . :    : *

YA55_SCHPO      SPKKNDIRLFWNTQGFVRLAIVQLEENVSE--KSVRSAAAEAAKILKSNGAKSIAVDGMG
AMPL_BOVIN      --LHEDFPSVVVVGLGKKTAGIDEQENWHEGKENIRAAVAAGCRQIQDLEIPSVEVDPCG
                   ::*:  .  .    : * :: :**  *  :.:*:*.* ..: ::.    *: **  *

YA55_SCHPO      FPKDAALGAALATYDFSLRRDHLSVYQDEKVVEKENLFTSPAPERLTFQLLSNTSEKKTA
AMPL_BOVIN      DAQAAAEGAVLGLYEYDDLK------QKRKVVVSAKLHGSEDQE----------------
                 .: ** **.*. *::.  :      *..*** . :*. *   *

YA55_SCHPO      TAEENAFKVGLIEAAAQNLARSLMECPANYMTSLQFCHFAQELFQ-NSSKVKVFVHDEKW
AMPL_BOVIN      -----AWQRGVLFASGQNLARRLMETPANEMTPTKFAEIVEENLKSASIKTDVFIRPKSW
                     *:: *:: *:.***** *** *** **. :*..:.:* ::  * *..**:: :.*

YA55_SCHPO      IDEQKMNGLLTVNAGSDIPPRFLEVQYIGKEKSKDDGWLGLVGKGVTFDSGGISIKPSQN
AMPL_BOVIN      IEEQEMGSFLSVAKGSEEPPVFLEIHYKGSPNASE-PPLVFVGKGITFDSGGISIKAAAN
                *:**:*..:*:*  **: ** ***::* *. ::.:   * :****:**********.: *

YA55_SCHPO      MKEMRADMGGAAVMLSSIYALEQLSIPVNAVFVTPLTENLPSGSAAKPGDVIFMRNGLSV
AMPL_BOVIN      MDLMRADMGGAATICSAIVSAAKLDLPINIVGLAPLCENMPSGKANKPGDVVRARNGKTI
                *. *********.: *:* :  :*.:*:* * ::** **:***.* *****:  *** ::

YA55_SCHPO      EIDNTDAEGRLILADAVHYVSSQYKTKAVIEASTLTGAMLVALGNVFTGAFVQGEELWKN
AMPL_BOVIN      QVDNTDAEGRLILADALCYAHT-FNPKVIINAATLTGAMDIALGSGATGVFTNSSWLWNK
                ::**************: *. : ::.*.:*:*:****** :***.  **.*.:.. **::

YA55_SCHPO      LETASHDAGDLFWRMPFHEAYLKQLTSSSNADLCNVSRAG-GGCCTAAAFIKCFLAQKDL
AMPL_BOVIN      LFEASIETGDRVWRMPLFEHYTRQVIDCQLADVNNIGKYRSAGACTAAAFLKEFVTHP--
                *  ** ::** .****:.* * :*: ... **: *:.:   .*.******:* *:::

YA55_SCHPO      SFAHLDIAGVMD-KQLNSWDCDGMSGRPVRTIIEVARKY-----
AMPL_BOVIN      KWAHLDIAGVMTNKDEVPYLRKGMAGRPTRTLIEFLFRFSQDSA
                .:*********  *:  .:  .**:***.**:**.  ::


6) No similarity to any existing entries

From the genome sequencing data the majority of proteins translated from
predicted open reading frames have no sequence similarity to any existing
proteins. In these cases the proteins remain "hypothetical". It should be
noted here that we analyze these sequences by a number of programs so that
we can at least add some potential information, rather than having just an
entry containing submission and sequence data. Again, in these cases, care
is taken to show that this information is potential so that it cannot be
mixed up with data from classified proteins.

The features we currently look for are signal sequences, transmembrane
regions, coiled coil domains and a number of conserved domains described
in PROSITE, Pfam and SMART.


a) Signal sequence prediction

We make use of the SignalP program [R1] in its latest implementation
(version 3.0). The method incorporates a prediction of cleavage sites and
a signal peptide/non-signal peptide prediction based on a combination of
several artificial neural networks and hidden Markov models. The result in
the entry is of the type:

FT   SIGNAL        1      x       Potential.
FT   CHAIN         x      y


b) Transmembrane region prediction

Transmembrane helices are predicted using the TMHMM (version 2.0) program
[R2] which we have found [R3] to give the best results. In some cases we
complement the results of this method with predictions obtained with two
other programs, ESKM [R4] and MEMSAT [R5].

Predicted transmembrane helices are indicated as:

FT   TRANSMEM      x      y       Potential.


c) Coiled coil prediction

We make use of a program based on the algorithm of Lupas et al. [R6] that
predicts coiled coil regions within the sequence. A positive result from
this program is annotated as:

FT   DOMAIN        x      y       Coiled coil (Potential).
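
The three feature-line shapes shown in a) to c) share one column layout. As
a minimal sketch, assuming the layout seen in the example entries of this
document (feature key left-justified in 8 columns, each position
right-justified in 6 columns, description after 7 spaces), such lines could
be produced as follows. The function name ft_line and all positions used
below are made up for illustration; the predictions themselves would come
from the programs named above.

```python
def ft_line(key: str, start: int, end: int, description: str = "") -> str:
    """Format one feature-table line in the layout used by the examples."""
    line = f"FT   {key:<8} {start:>6} {end:>6}"
    if description:
        line += "       " + description
    return line


# Hypothetical predictions for a 300-residue protein. The mature chain is
# assumed to start at the residue following the signal peptide.
signal_end = 22
lines = [
    ft_line("SIGNAL", 1, signal_end, "Potential."),
    ft_line("CHAIN", signal_end + 1, 300),
    ft_line("TRANSMEM", 45, 65, "Potential."),
    ft_line("DOMAIN", 120, 160, "Coiled coil (Potential)."),
]
print("\n".join(lines))
```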


d) REP

The program REP [R7] is used to annotate a number of well defined, yet
very variable protein repeats. The program currently recognizes the
following types of repeats: Ankyrin, Armadillo, HAT, HEAT, HEAT_AAA,
HEAT_ADB, HEAT_IMB, Kelch, Leucine-rich Repeats, PFTA, PFTB, RCC1, TPR
and WD40.

Repeats detected by this program are annotated in the feature table;
specific keywords and CC lines are also added to the entry. In the
following example the lines that have been indented are those where
information has been added following the detection of a repeat:

ID   YEX2_SCHPO              Reviewed;         361 AA.
AC   O13856;
DT   16-AUG-2004, integrated into UniProtKB/Swiss-Prot.
DT   01-JAN-1998, sequence version 1.
DT   23-OCT-2007, entry version 37.
 DE   Uncharacterized WD repeat-containing protein C1A6.02.
GN   ORFNames=SPAC1A6.02, SPAC23C4.21;
OS   Schizosaccharomyces pombe (Fission yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC   Schizosaccharomycetes; Schizosaccharomycetales;
OC   Schizosaccharomycetaceae; Schizosaccharomyces.
OX   NCBI_TaxID=4896;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=ATCC 38366 / 972;
RX   MEDLINE=21848401; PubMed=11859360;
RA   Wood V., Gwilliam R., Rajandream M.A., Lyne M.H., Lyne R., Stewart A.,
RA   Sgouros J.G., Peat N., Hayles J., Baker S.G., Basham D., Bowman S.,
RA   Brooks K., Brown D., Brown S., Chillingworth T., Churcher C.M.,
RA   Collins M., Connor R., Cronin A., Davis P., Feltwell T., Fraser A.,
RA   Gentles S., Goble A., Hamlin N., Harris D.E., Hidalgo J., Hodgson G.,
RA   Holroyd S., Hornsby T., Howarth S., Huckle E.J., Hunt S., Jagels K.,
RA   James K.D., Jones L., Jones M., Leather S., McDonald S., McLean J.,
RA   Mooney P., Moule S., Mungall K.L., Murphy L.D., Niblett D., Odell C.,
RA   Oliver K., O'Neil S., Pearson D., Quail M.A., Rabbinowitsch E.,
RA   Rutherford K.M., Rutter S., Saunders D., Seeger K., Sharp S.,
RA   Skelton J., Simmonds M.N., Squares R., Squares S., Stevens K.,
RA   Taylor K., Taylor R.G., Tivey A., Walsh S.V., Warren T., Whitehead S.,
RA   Woodward J.R., Volckaert G., Aert R., Robben J., Grymonprez B.,
RA   Weltjens I., Vanstreels E., Rieger M., Schaefer M., Mueller-Auer S.,
RA   Gabel C., Fuchs M., Duesterhoeft A., Fritzc C., Holzer E., Moestl D.,
RA   Hilbert H., Borzym K., Langer I., Beck A., Lehrach H., Reinhardt R.,
RA   Pohl T.M., Eger P., Zimmermann W., Wedler H., Wambutt R., Purnelle B.,
RA   Goffeau A., Cadieu E., Dreano S., Gloux S., Lelaure V., Mottier S.,
RA   Galibert F., Aves S.J., Xiang Z., Hunt C., Moore K., Hurst S.M.,
RA   Lucas M., Rochet M., Gaillardin C., Tallada V.A., Garzon A., Thode G.,
RA   Daga R.R., Cruzado L., Jimenez J., Sanchez M., del Rey F., Benito J.,
RA   Dominguez A., Revuelta J.L., Moreno S., Armstrong J., Forsburg S.L.,
RA   Cerutti L., Lowe T., McCombie W.R., Paulsen I., Potashkin J.,
RA   Shpakovski G.V., Ussery D., Barrell B.G., Nurse P.;
RT   "The genome sequence of Schizosaccharomyces pombe.";
RL   Nature 415:871-880(2002).
 CC   -!- SIMILARITY: Contains 6 WD repeats.
DR   EMBL; Z99258; CAB16352.1; -; Genomic_DNA.
DR   PIR; T38005; T38005.
DR   KEGG; spo:SPAC1A6.02; -.
DR   GeneDB_Spombe; SPAC1A6.02; -.
DR   ArrayExpress; O13856; -.
DR   GO; GO:0005730; C:nucleolus; IDA:GeneDB_SPombe.
DR   InterPro; IPR001680; WD40.
DR   Pfam; PF00400; WD40; 1.
DR   SMART; SM00320; WD40; 5.
DR   PROSITE; PS00678; WD_REPEATS_1; FALSE_NEG.
DR   PROSITE; PS50082; WD_REPEATS_2; FALSE_NEG.
DR   PROSITE; PS50294; WD_REPEATS_REGION; 1.
PE   2: Evidence at transcript level;
 KW   Complete proteome; Repeat; WD repeat.
 FT   CHAIN         1    361       Uncharacterized WD repeat-containing
 FT                                protein C1A6.02.
 FT                                /FTId=PRO_0000051486.
 FT   REPEAT       57     96       WD 1.
 FT   REPEAT      103    142       WD 2.
 FT   REPEAT      146    184       WD 3.
 FT   REPEAT      187    229       WD 4.
 FT   REPEAT      237    275       WD 5.
 FT   REPEAT      280    318       WD 6.
SQ   SEQUENCE   361 AA;  39780 MW;  38DD785710325C03 CRC64;
     MGGTINAAIK QKFENEIFDL ACFGENQVLL GFSNGRVSSY QYDVAQISLV EQWSTKRHKK
     SCRNISVNES GTEFISVGSD GVLKIADTST GRVSSKWIVD KNKEISPYSV VQWIENDMVF
     ATGDDNGCVS VWDKRTEGGI IHTHNDHIDY ISSISPFEER YFVATSGDGV LSVIDARNFK
     KPILSEEQDE EMTCGAFTRD QHSKKKFAVG TASGVITLFT KGDWGDHTDR ILSPIRSHDF
     SIETITRADS DSLYVGGSDG CIRLLHILPN KYERIIGQHS SRSTVDAVDV TTEGNFLVSC
     SGTELAFWPV DQKEGDESSS SDNLDSDEDS SSDSEFSSPK KKKKVGNQGK KPLGTDFFDG
     L
//
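
In the entry above, the number given in the SIMILARITY comment matches the
number of REPEAT features. The sketch below shows how such a comment could
be derived from the feature table. It is an illustration only, not the REP
program: the function name repeat_comments is hypothetical, and the wording
of the generated line simply copies the example.

```python
import re
from collections import Counter

# Matches e.g. "FT   REPEAT       57     96       WD 1." and captures the
# repeat name ("WD") without its running number. Leading whitespace is
# allowed because the example entry indents the added lines.
REPEAT_LINE = re.compile(r"^\s*FT   REPEAT\s+\d+\s+\d+\s+(.+?)\s+\d+\.\s*$")


def repeat_comments(ft_lines):
    """Build one SIMILARITY comment line per repeat type found."""
    counts = Counter()
    for line in ft_lines:
        m = REPEAT_LINE.match(line)
        if m:
            counts[m.group(1)] += 1
    return [
        f"CC   -!- SIMILARITY: Contains {n} {name} "
        f"repeat{'s' if n != 1 else ''}."
        for name, n in counts.items()
    ]


yex2_features = [
    " FT   REPEAT       57     96       WD 1.",
    " FT   REPEAT      103    142       WD 2.",
    " FT   REPEAT      146    184       WD 3.",
    " FT   REPEAT      187    229       WD 4.",
    " FT   REPEAT      237    275       WD 5.",
    " FT   REPEAT      280    318       WD 6.",
]
print(repeat_comments(yex2_features))
```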

e) PROSITE

PROSITE (http://www.expasy.org/prosite/), the database of protein domains
and families, plays a major role in the addition of features to
Swiss-Prot entries, especially when no other information is available for
the sequence. Where patterns are matched this can lead to the addition of
comment lines, keywords and features, either individually or in any
combination. As an example:

ID   NOP12_SCHPO             Reviewed;         438 AA.
AC   O13741;
DT   02-NOV-2001, integrated into UniProtKB/Swiss-Prot.
DT   01-JAN-1998, sequence version 1.
DT   23-OCT-2007, entry version 52.
DE   Nucleolar protein 12.
GN   Name=nop12; ORFNames=SPAC16E8.06c;
OS   Schizosaccharomyces pombe (Fission yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC   Schizosaccharomycetes; Schizosaccharomycetales;
OC   Schizosaccharomycetaceae; Schizosaccharomyces.
OX   NCBI_TaxID=4896;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=ATCC 38366 / 972;
RX   MEDLINE=21848401; PubMed=11859360;
RA   Wood V., Gwilliam R., Rajandream M.A., Lyne M.H., Lyne R., Stewart A.,
RA   Sgouros J.G., Peat N., Hayles J., Baker S.G., Basham D., Bowman S.,
RA   Brooks K., Brown D., Brown S., Chillingworth T., Churcher C.M.,
RA   Collins M., Connor R., Cronin A., Davis P., Feltwell T., Fraser A.,
RA   Gentles S., Goble A., Hamlin N., Harris D.E., Hidalgo J., Hodgson G.,
RA   Holroyd S., Hornsby T., Howarth S., Huckle E.J., Hunt S., Jagels K.,
RA   James K.D., Jones L., Jones M., Leather S., McDonald S., McLean J.,
RA   Mooney P., Moule S., Mungall K.L., Murphy L.D., Niblett D., Odell C.,
RA   Oliver K., O'Neil S., Pearson D., Quail M.A., Rabbinowitsch E.,
RA   Rutherford K.M., Rutter S., Saunders D., Seeger K., Sharp S.,
RA   Skelton J., Simmonds M.N., Squares R., Squares S., Stevens K.,
RA   Taylor K., Taylor R.G., Tivey A., Walsh S.V., Warren T., Whitehead S.,
RA   Woodward J.R., Volckaert G., Aert R., Robben J., Grymonprez B.,
RA   Weltjens I., Vanstreels E., Rieger M., Schaefer M., Mueller-Auer S.,
RA   Gabel C., Fuchs M., Duesterhoeft A., Fritzc C., Holzer E., Moestl D.,
RA   Hilbert H., Borzym K., Langer I., Beck A., Lehrach H., Reinhardt R.,
RA   Pohl T.M., Eger P., Zimmermann W., Wedler H., Wambutt R., Purnelle B.,
RA   Goffeau A., Cadieu E., Dreano S., Gloux S., Lelaure V., Mottier S.,
RA   Galibert F., Aves S.J., Xiang Z., Hunt C., Moore K., Hurst S.M.,
RA   Lucas M., Rochet M., Gaillardin C., Tallada V.A., Garzon A., Thode G.,
RA   Daga R.R., Cruzado L., Jimenez J., Sanchez M., del Rey F., Benito J.,
RA   Dominguez A., Revuelta J.L., Moreno S., Armstrong J., Forsburg S.L.,
RA   Cerutti L., Lowe T., McCombie W.R., Paulsen I., Potashkin J.,
RA   Shpakovski G.V., Ussery D., Barrell B.G., Nurse P.;
RT   "The genome sequence of Schizosaccharomyces pombe.";
RL   Nature 415:871-880(2002).
CC   -!- FUNCTION: Involved in pre-25S rRNA processing (By similarity).
CC   -!- SUBCELLULAR LOCATION: Nucleus, nucleolus (By similarity).
CC   -!- SIMILARITY: Belongs to the RRM RBM34 family.
CC   -!- SIMILARITY: Contains 2 RRM (RNA recognition motif) domains.
DR   EMBL; CU329670; CAB11047.1; -; Genomic_DNA.
DR   PIR; T37786; T37786.
DR   HSSP; P33240; 1P1T.
DR   KEGG; spo:SPAC16E8.06c; -.
DR   GeneDB_Spombe; SPAC16E8.06c; -.
DR   BioCyc; SPOM-XXX-01:SPOM-XXX-01-001830-MONOMER; -.
DR   ArrayExpress; O13741; -.
DR   GO; GO:0005730; C:nucleolus; IDA:GeneDB_SPombe.
DR   InterPro; IPR012677; a_b_plait_nuc_bd.
DR   InterPro; IPR000504; RRM_RNP1.
DR   Gene3D; G3DSA:3.30.70.330; a_b_plait_nuc_bd; 2.
DR   Pfam; PF00076; RRM_1; 2.
DR   SMART; SM00360; RRM; 2.
DR   PROSITE; PS50102; RRM; 2.
PE   2: Evidence at transcript level;
KW   Complete proteome; Nucleus; Repeat; Ribosome biogenesis; RNA-binding;
KW   rRNA processing.
FT   CHAIN         1    438       Nucleolar protein 12.
FT                                /FTId=PRO_0000081673.
FT   DOMAIN      164    262       RRM 1.
FT   DOMAIN      270    348       RRM 2.
FT   COMPBIAS     20     23       Poly-Ser.
FT   COMPBIAS     81     90       Poly-Lys.
SQ   SEQUENCE   438 AA;  49381 MW;  3E943401F95E7C12 CRC64;
     MGETNSSLDN ENTSFVGKLS SSSNVDPTLN LLFSQSKPIP KPVAKETTVL TKKDVEVEEA
     NGVEEAAETI ESDTKEVQNI KPKSKKKKKK LNDSSDDIEG KYFEELLAEE DEEKDKDSAG
     LINDEEDKSP AKQSVLEERT SQEDVKSERE VAEKLANELE KSDKTVFVNN LPARVVTNKG
     DYKDLTKHFR QFGAVDSIRF RSLAFSEAIP RKVAFFEKKF HSERDTVNAY IVFRDSSSAR
     SALSLNGTMF MDRHLRVDSV SHPMPQDTKR CVFVGNLAFE AEEEPLWRYF GDCGSIDYVR
     IVRDPKTNLG KGFAYIQFKD TMGVDKALLL NEKKMPEGRT LRIMRAKSTK PKSITRSKRG
     DEKTRTLQGR ARKLIGKAGN ALLQQELALE GHRAKPGENP LAKKKVNKKR KERAAQWRNK
     KAESVGKKQK TAAGKKDK
//

In the above example note the PROSITE entries represented in the DR lines.
These matches have helped in the addition of the similarity comment and the
RNA-binding RRM domains to the feature table.
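
As a minimal sketch, the PROSITE cross-references can be read from the DR
lines of an entry as shown below. The function names parse_dr and
prosite_hits are hypothetical. The last field is returned as found in the
entries of this document, where it is either a match count or FALSE_NEG.

```python
def parse_dr(line: str):
    """Split one DR line into (database, list of remaining fields)."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line")
    body = line[5:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    fields = [f.strip() for f in body.split(";")]
    return fields[0], fields[1:]


def prosite_hits(lines):
    """Return (accession, name, status) for every PROSITE DR line."""
    hits = []
    for line in lines:
        if not line.startswith("DR   "):
            continue
        database, fields = parse_dr(line)
        if database == "PROSITE" and len(fields) == 3:
            hits.append(tuple(fields))
    return hits


dr_lines = [
    "DR   Pfam; PF00076; RRM_1; 2.",
    "DR   SMART; SM00360; RRM; 2.",
    "DR   PROSITE; PS50102; RRM; 2.",
]
print(prosite_hits(dr_lines))
```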

We have a method that automatically annotates a number of sites or domains
using PROSITE patterns and profiles. All features copied into the feature
table by this facility are closely assessed to ensure that they are valid
for the particular sequence from that particular organism.
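
To illustrate what matching a PROSITE pattern involves, the sketch below
translates the basic PROSITE pattern syntax into a regular expression and
reports 1-based match positions. It is not the annotation method described
above: it covers patterns only (not profiles), handles only the common
syntax elements (x, [..], {..}, repetition counts and the < and > anchors),
and the function names are hypothetical. The pattern used, N-{P}-[ST]-{P},
is the well-known N-glycosylation site motif and is not taken from this
document.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern such as N-{P}-[ST]-{P} into a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element {element!r}")
        core, low, high = m.groups()
        if core == "x":               # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        # [ABC] and single letters are already valid regex syntax
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_matches(pattern: str, sequence: str):
    """Return 1-based (start, end) of all matches, including overlaps."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(1) + 1, m.end(1)) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_matches("N-{P}-[ST]-{P}.", "AANGSAA"))
```

Positions found this way would still need the assessment described above
before being copied into a feature table.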


f) Pfam

Pfam [R10] (http://www.sanger.ac.uk/Software/Pfam/) is a large collection of
multiple sequence alignments and hidden Markov models covering many common
protein domains. Great use is made of this database, in conjunction with
PROSITE, for the automatic addition of annotation to TrEMBL entries. It also
provides important information for the curators as they begin to annotate
TrEMBL entries by highlighting the type of domain the sequence has.


g) Tyrosine sulfation sites

Tyrosine sulfation sites are predicted using a software tool called the
Sulfinator [R12]. The Sulfinator employs four different hidden Markov
models. The program is only run on eukaryotic proteins that are predicted
or supposed to be secreted or to have at least one extracellular
domain. The sulfation sites are indicated as being "Potential". Example:

FT   MOD_RES     200    200       Sulfotyrosine (Potential).
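
For readers who process entries programmatically, the following is a minimal
sketch of how feature (FT) lines in the layout shown in this document could
be read and how features flagged as "Potential" could be picked out. The
names Feature, parse_ft_lines and is_potential are hypothetical, and the
sketch assumes plain integer positions. The sample lines are copied from the
examples in this document and come from different entries.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    key: str
    start: int
    end: int
    description: str

def parse_ft_lines(lines):
    """Parse FT lines; assumes plain integer positions (no '<', '>' or '?')."""
    features = []
    for line in lines:
        if not line.startswith("FT"):
            continue
        if line[5:6].strip():
            # A feature key starting in column 6 opens a new feature.
            fields = line.split(None, 4)
            description = fields[4].strip() if len(fields) > 4 else ""
            features.append(
                Feature(fields[1], int(fields[2]), int(fields[3]), description))
        elif features:
            # Continuation line: append its text to the previous feature.
            last = features[-1]
            text = (last.description + " " + line[2:].strip()).strip()
            features[-1] = last._replace(description=text)
    return features

def is_potential(feature):
    """True when the description carries the 'Potential' qualifier."""
    return "Potential" in feature.description

EXAMPLE = [
    "FT   CHAIN         1    438       Nucleolar protein 12.",
    "FT                                /FTId=PRO_0000081673.",
    "FT   DOMAIN      164    262       RRM 1.",
    "FT   COMPBIAS     20     23       Poly-Ser.",
    "FT   MOD_RES     200    200       Sulfotyrosine (Potential).",
]

features = parse_ft_lines(EXAMPLE)
potential_keys = [f.key for f in features if is_potential(f)]
print(potential_keys)
```

Splitting on whitespace rather than on fixed columns keeps the sketch
tolerant of small spacing differences, at the cost of not handling unusual
position notations.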


Extradom.txt

This file outlines the nomenclature proposal for domains (or modules) found
mainly in extracellular proteins of higher eukaryotes. It shows the standard
nomenclature applied to these classified domains in Swiss-Prot entries. It
can be found via the Web at http://www.expasy.org/cgi-bin/lists?extradom.txt
It is one of numerous documents distributed with Swiss-Prot (all of which
are visible from http://www.expasy.org/sprot/sp-docu.html).

Please note that when a modification or a binding event has not been
determined experimentally, "Potential" is added to the feature description
to show this. Below is an example of such a case.

ID   YA9A_SCHPO              Reviewed;         530 AA.
AC   Q09788;
DT   01-NOV-1995, integrated into UniProtKB/Swiss-Prot.
DT   01-NOV-1995, sequence version 1.
DT   23-OCT-2007, entry version 40.
DE   Uncharacterized serine-rich protein C13G6.10c precursor.
GN   ORFNames=SPAC13G6.10c;
OS   Schizosaccharomyces pombe (Fission yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC   Schizosaccharomycetes; Schizosaccharomycetales;
OC   Schizosaccharomycetaceae; Schizosaccharomyces.
OX   NCBI_TaxID=4896;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=ATCC 38366 / 972;
RX   MEDLINE=21848401; PubMed=11859360;
RA   Wood V., Gwilliam R., Rajandream M.A., Lyne M.H., Lyne R., Stewart A.,
RA   Sgouros J.G., Peat N., Hayles J., Baker S.G., Basham D., Bowman S.,
RA   Brooks K., Brown D., Brown S., Chillingworth T., Churcher C.M.,
RA   Collins M., Connor R., Cronin A., Davis P., Feltwell T., Fraser A.,
RA   Gentles S., Goble A., Hamlin N., Harris D.E., Hidalgo J., Hodgson G.,
RA   Holroyd S., Hornsby T., Howarth S., Huckle E.J., Hunt S., Jagels K.,
RA   James K.D., Jones L., Jones M., Leather S., McDonald S., McLean J.,
RA   Mooney P., Moule S., Mungall K.L., Murphy L.D., Niblett D., Odell C.,
RA   Oliver K., O'Neil S., Pearson D., Quail M.A., Rabbinowitsch E.,
RA   Rutherford K.M., Rutter S., Saunders D., Seeger K., Sharp S.,
RA   Skelton J., Simmonds M.N., Squares R., Squares S., Stevens K.,
RA   Taylor K., Taylor R.G., Tivey A., Walsh S.V., Warren T., Whitehead S.,
RA   Woodward J.R., Volckaert G., Aert R., Robben J., Grymonprez B.,
RA   Weltjens I., Vanstreels E., Rieger M., Schaefer M., Mueller-Auer S.,
RA   Gabel C., Fuchs M., Duesterhoeft A., Fritzc C., Holzer E., Moestl D.,
RA   Hilbert H., Borzym K., Langer I., Beck A., Lehrach H., Reinhardt R.,
RA   Pohl T.M., Eger P., Zimmermann W., Wedler H., Wambutt R., Purnelle B.,
RA   Goffeau A., Cadieu E., Dreano S., Gloux S., Lelaure V., Mottier S.,
RA   Galibert F., Aves S.J., Xiang Z., Hunt C., Moore K., Hurst S.M.,
RA   Lucas M., Rochet M., Gaillardin C., Tallada V.A., Garzon A., Thode G.,
RA   Daga R.R., Cruzado L., Jimenez J., Sanchez M., del Rey F., Benito J.,
RA   Dominguez A., Revuelta J.L., Moreno S., Armstrong J., Forsburg S.L.,
RA   Cerutti L., Lowe T., McCombie W.R., Paulsen I., Potashkin J.,
RA   Shpakovski G.V., Ussery D., Barrell B.G., Nurse P.;
RT   "The genome sequence of Schizosaccharomyces pombe.";
RL   Nature 415:871-880(2002).
DR   EMBL; CU329670; CAA91103.1; -; Genomic_DNA.
DR   PIR; S62439; S62439.
DR   KEGG; spo:SPAC13G6.10c; -.
DR   GeneDB_Spombe; SPAC13G6.10c; -.
DR   BioCyc; SPOM-XXX-01:SPOM-XXX-01-000580-MONOMER; -.
DR   ArrayExpress; Q09788; -.
DR   GO; GO:0005783; C:endoplasmic reticulum; IDA:GeneDB_SPombe.
DR   GO; GO:0005794; C:Golgi apparatus; IDA:GeneDB_SPombe.
DR   InterPro; IPR013781; Glyco_hydro_cat.
DR   Gene3D; G3DSA:3.20.20.80; Glyco_hydro_cat; 1.
PE   2: Evidence at transcript level;
KW   Complete proteome; Glycoprotein; Signal.
FT   SIGNAL        1     18       Potential.
FT   CHAIN        19    530       Uncharacterized serine-rich protein
FT                                C13G6.10c.
FT                                /FTId=PRO_0000014190.
FT   CARBOHYD     55     55       N-linked (GlcNAc...) (Potential).
FT   CARBOHYD    120    120       N-linked (GlcNAc...) (Potential).
FT   CARBOHYD    128    128       N-linked (GlcNAc...) (Potential).
SQ   SEQUENCE   530 AA;  54211 MW;  1C6A0261F63DFF02 CRC64;
     MRTTFATVAL AFLSTVGALP YAPNHRHHRR DDDGVLTVYE TILETVYVTA VPGANSSSSY
     TSYSTGLASV TESSDDGAST ALPTTSTESV VVTTSAPAAS SSATSYPATF VSTPLYTMDN
     VTAPVWSNTS VPVSTPETSA TSSSEFFTSY PATSSESSSS YPASSTEVAS SYSASSTEVT
     SSYPASSEVA TSTSSYVAPV SSSVASSSEI SAGSATSYVP TSSSSIALSS VVASASVSAA
     NKGVSTPAVS SAAASSSAVV SSVVSSATSV AASSTISSAT SSSASASPTS SSVSGKRGLA
     WIPGTDLGYS DNFVNKGINW YYNWGSYSSG LSSSFEYVLN QHDANSLSSA SSVFTGGATV
     IGFNEPDLSA AGNPIDAATA ASYYLQYLTP LRESGAIGYL GSPAISNVGE DWLSEFMSAC
     SDCKIDFIAC HWYGIDFSNL QDYINSLANY GLPIWLTEFA CTNWDDSNLP SLDEVKTLMT
     SALGFLDGHG SVERYSWFAP ATELGAGVGN NNALISSSGG LSEVGEIYIS
//
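
The three CARBOHYD positions in the entry above coincide with the common
N-glycosylation consensus N-{P}-[ST]-{P}. This document does not state which
method produced them, so the following is only a sketch under that
assumption; the function name potential_n_glycosylation_sites is
hypothetical. It scans the sequence of the entry for the consensus and
prints the 1-based positions of the matching asparagine residues.

```python
import re

# Sequence of YA9A_SCHPO (Q09788), copied from the SQ block of the entry.
SEQUENCE = "".join("""
MRTTFATVAL AFLSTVGALP YAPNHRHHRR DDDGVLTVYE TILETVYVTA VPGANSSSSY
TSYSTGLASV TESSDDGAST ALPTTSTESV VVTTSAPAAS SSATSYPATF VSTPLYTMDN
VTAPVWSNTS VPVSTPETSA TSSSEFFTSY PATSSESSSS YPASSTEVAS SYSASSTEVT
SSYPASSEVA TSTSSYVAPV SSSVASSSEI SAGSATSYVP TSSSSIALSS VVASASVSAA
NKGVSTPAVS SAAASSSAVV SSVVSSATSV AASSTISSAT SSSASASPTS SSVSGKRGLA
WIPGTDLGYS DNFVNKGINW YYNWGSYSSG LSSSFEYVLN QHDANSLSSA SSVFTGGATV
IGFNEPDLSA AGNPIDAATA ASYYLQYLTP LRESGAIGYL GSPAISNVGE DWLSEFMSAC
SDCKIDFIAC HWYGIDFSNL QDYINSLANY GLPIWLTEFA CTNWDDSNLP SLDEVKTLMT
SALGFLDGHG SVERYSWFAP ATELGAGVGN NNALISSSGG LSEVGEIYIS
""".split())

# N-{P}-[ST]-{P}: the lookahead keeps overlapping candidate sites.
SEQUON = re.compile(r"N(?=[^P][ST][^P])")

def potential_n_glycosylation_sites(sequence):
    """Return 1-based positions of Asn residues matching the consensus."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence)]

print(potential_n_glycosylation_sites(SEQUENCE))  # [55, 120, 128]
```

A sequence match alone says nothing about whether the site is used, which is
why such features carry the "Potential" qualifier.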

=======
Summary
=======

This has been an introduction to the world of annotation at Swiss-Prot.
There are numerous sources of information available to the curators and it
is our job to assess these and to add only relevant information to the
entries. These sources are primarily publications reporting the isolation of
particular genes and proteins. It is not only biochemical data that is
gleaned but also molecular biology and genetic information. For example,
we have thousands of entries in Swiss-Prot with reports of alternative
splicing as well as genetic map information. Coupled to reading publications
is looking at the data bank itself. In an attempt to maintain consistency
all new entries are checked, via alignments, to see if they belong to a
particular family. If so, information is copied from similar entries, but
checked at the same time.

All sources of information are given in Swiss-Prot entries. The reference
blocks show what is represented in the corresponding publication(s). They
therefore act as sources for the information given in the entry. This can be
direct sequencing of the isolated protein (RP SEQUENCE), sequencing of the
gene encoding the protein (RP SEQUENCE FROM N.A.), biochemical studies (RP
CHARACTERIZATION) and 3D studies (e.g. RP X-RAY CRYSTALLOGRAPHY) to name but
a few. It should be noted that in the earlier days of Swiss-Prot annotation
characterization studies may have been carried out but were represented as
only "SEQUENCE FROM N.A." It would be possible to alter these retrospectively,
although doing so would detract from our current, labor-intensive process
of making new sequences available.

The annotation of Swiss-Prot entries involves extensive knowledge of all
types of proteins, a complete understanding of the Swiss-Prot database
itself as well as skills in assessing alignment programs and pattern
databases. All of these must be considered as one, for each individual
sequence, and all information resulting from these sources is skillfully
assessed before addition to the entry. Therefore we can say that every
effort is made to ensure that the features and comments in Swiss-Prot are
complete, correct and have pointers to the information source.

Note: a short version of this document was originally published as:

      Junker V.L., Apweiler R., Bairoch A.
      Representation of functional information in the Swiss-Prot data bank.
      Bioinformatics 15:1066-1067(1999).


==================
Methods references
==================

[R1]  Nielsen H., Engelbrecht J., Brunak S., von Heijne G.
      Identification of prokaryotic and eukaryotic signal peptides and
      prediction of their cleavage sites.
      Protein Eng. 10:1-6(1997).
      [Pubmed: 9051728]
[R2]  Krogh A., Larsson B., von Heijne G., Sonnhammer E.L.L.
      Predicting transmembrane protein topology with a hidden Markov
      model: application to complete genomes.
      J. Mol. Biol. 305:567-580(2001).
      [Pubmed: 11152613]
[R3]  Moeller S., Croning M.D.R., Apweiler R.
      Evaluation of methods for the prediction of membrane spanning
      regions.
      Bioinformatics 17:646-653(2001).
[R4]  Eisenberg D., Schwarz E., Komaromy M., Wall R.
      Analysis of membrane and surface protein sequences with the
      hydrophobic moment plot.
      J. Mol. Biol. 179:125-142(1984).
[R5]  Jones D.T., Taylor W.R., Thornton J.M.
      A model recognition approach to the prediction of all-helical
      membrane protein structure and topology.
      Biochemistry 33:3038-3049(1994).
[R6]  Lupas A., Van Dyke M., Stock J.
      Predicting coiled coils from protein sequences.
      Science 252:1162-1164(1991).
      [Pubmed: 2031185]
[R7]  Andrade M.A., Ponting C., Gibson T., Bork P.
      Identification of protein repeats and statistical significance of
      sequence comparisons.
      J. Mol. Biol. 298:521-537(2000).
[R8]  Apweiler R., Attwood T.K., Bairoch A., Bateman A., Birney E.,
      Biswas M., Bucher P., Cerutti L., Corpet F., Croning M.D., Durbin R.,
      Falquet L., Fleischmann W., Gouzy J., Hermjakob H., Hulo N.,
      Jonassen I., Kahn D., Kanapin A., Karavidopoulou Y., Lopez R.,
      Marx B., Mulder N.J., Oinn T.M., Pagni M., Servant F., Sigrist C.J.,
      Zdobnov E.M.
      InterPro -- an integrated documentation resource for protein
      families, domains and functional sites.
      Bioinformatics 16:1145-1150(2000).
      [Pubmed: 11125043]
[R9]  Falquet L., Pagni M., Bucher P., Hulo N., Sigrist C.J., Hofmann K.,
      Bairoch A.
      The PROSITE database, its status in 2002.
      Nucleic Acids Res. 30:235-238(2002).
      [Pubmed: 11752303] 
[R10] Bateman A., Birney E., Cerruti L., Durbin R., Etwiller L., Eddy S.R., 
      Griffiths-Jones S., Howe K.L., Marshall M., Sonnhammer E.L.
      The Pfam protein families database.
      Nucleic Acids Res. 30:276-280(2002).
[R11] Ponting C.P., Schultz J., Milpetz F., Bork P.
      SMART: identification and annotation of domains from signalling and
      extracellular protein sequences.
      Nucleic Acids Res. 27:229-232(1999).
      [Pubmed: 9847187]
[R12] Monigatti F., Gasteiger E., Bairoch A., Jung E.
      The Sulfinator: predicting tyrosine sulfation sites in protein
      sequences.
      Bioinformatics 18:769-770(2002).
      [Pubmed: 12050077]

-----------------------------------------------------------------------
Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
Distributed under the Creative Commons Attribution-NoDerivs License
-----------------------------------------------------------------------