----------------------------------------------------------------------------
        UniProt - Swiss-Prot Protein Knowledgebase
        Swiss Institute of Bioinformatics (SIB); Geneva, Switzerland
        European Bioinformatics Institute (EBI); Hinxton, United Kingdom
        Protein Information Resource (PIR); Washington DC, USA
----------------------------------------------------------------------------

Description: A primer on UniProtKB/Swiss-Prot annotation
Name:        annbioch.txt
Release:     55.5 of 10-Jun-2008

----------------------------------------------------------------------------

============
Introduction
============

UniProtKB/Swiss-Prot is defined as an annotated protein sequence database. We make every effort to ensure that all available biochemical information accompanies the sequence data and that this information is as complete and up-to-date as possible. This annotation is a labor-intensive process that involves assessing information from published articles and using a variety of programs and algorithms. Swiss-Prot itself is also used in order to maintain standard nomenclature and description comments. We describe here the steps we take to add all relevant biochemical information to new entries going into Swiss-Prot.

The biochemical information that accompanies sequence data reports varies. Sometimes scientists isolate and then biochemically characterize the protein encoded by the gene they have sequenced. Other times they infer this information through similarity to other proteins within the same conserved family. If the protein does not belong to a particular family, they infer it purely through sequence similarity. Finally, there is genome sequence data, which often has no accompanying citation reporting any such classification. Below are the steps we use to analyze these reports and to assess what information to add to the sequence entries, and how.
In all the scenarios below a new entry is taken from TrEMBL and, generally, the first step is to get a copy of the article(s) given in the reference lines. The sequence is then aligned, using FastA or Blast, against all existing Swiss-Prot and TrEMBL entries. This allows us to assess, quickly and easily, whether and how the sequence relates to existing families in Swiss-Prot. The next step is to read the article(s), assess the information given and add relevant comments and features to the entry.

It is important to note that the following is just an outline of the annotation process. The whole process of assessing information for addition into Swiss-Prot entries is MUCH MORE complex.

(I) Article(s) reports sequencing (nucleic acid and/or amino acid) and biochemical characterization

Often from reading the abstract of the paper and analyzing the FastA results, we can see that the protein belongs to a particular family. In these cases, care is taken to look at other members of the family and to become familiar with the annotation that already exists. Any standard annotation that is common to the family, for example the description line(s) and the keywords, can be added to the new entry. Other comments and features, specific to the family, can be added in conjunction with reading the paper. Any additional information from the paper, for example post-translational modifications, is added to the entry.

(II) Article(s) reports sequencing with no biochemical characterization

In the majority of articles reporting gene sequencing, the gene is translated to give the protein sequence but the in vivo protein is rarely isolated and characterized. Often a probe from a similar organism is used to pinpoint the gene and then the authors infer biochemical characteristics. In these cases, curators assess what the authors imply against the results of the alignments against Swiss-Prot and TrEMBL.
When the sequence "hits" against a particular family, the description line(s), the similarity comments and the keywords specific to the family can be added to the new entry. More care is taken when looking at function, subunit and sequence features. This is the first of the cases where we can introduce three of the four qualifiers commonly found in Swiss-Prot, namely "probable", "potential" and "by similarity" (for a description of "putative", please see later under Genome Data).

When a gene has been identified by probing with the gene from another organism, and that gene encodes a characterized protein, the description line is copied over from the corresponding protein sequence entry. When the function and other comment lines are present in the existing entry and are not species-specific, they are added along with "by similarity" in parentheses. It should be noted that "by similarity" is used when the comment/feature in the existing entry has been proved, categorically, to be so.

Examples:

a) Swiss-Prot entry where authors have biochemically characterized the protein.

ID AMPA_ECOLI Reviewed; 503 AA. AC P68767; P11648; Q2M649; DT 21-DEC-2004, integrated into UniProtKB/Swiss-Prot. DT 21-DEC-2004, sequence version 1. DT 23-OCT-2007, entry version 33. DE Cytosol aminopeptidase (EC 3.4.11.1) (Leucine aminopeptidase) (LAP) DE (Leucyl aminopeptidase) (Aminopeptidase A/I). GN Name=pepA; Synonyms=carP, xerB; OrderedLocusNames=b4260, JW4217; OS Escherichia coli (strain K12). OC Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; OC Enterobacteriaceae; Escherichia. OX NCBI_TaxID=83333; RN [1] RP NUCLEOTIDE SEQUENCE [GENOMIC DNA], AND PROTEIN SEQUENCE OF 1-20.
RC STRAIN=K12; RX MEDLINE=89356633; PubMed=2670557; RA Stirling C.J., Colloms S., Collins J.F., Szatmari G., Sherratt D.J.; RT "xerB, an Escherichia coli gene required for plasmid ColE1 site- RT specific recombination, is identical to pepA, encoding aminopeptidase RT A, a protein with substantial similarity to bovine lens leucine RT aminopeptidase."; RL EMBO J. 8:1623-1627(1989). RN [2] RP NUCLEOTIDE SEQUENCE [GENOMIC DNA]. RC STRAIN=K12; RX MEDLINE=95341674; PubMed=7616564; DOI=10.1006/jmbi.1995.0385; RA Charlier D., Hassanzadeh G., Kholti A., Gigot D., Pierard A., RA Glansdorff N.; RT "carP, involved in pyrimidine regulation of the Escherichia coli RT carbamoylphosphate synthetase operon encodes a sequence-specific DNA- RT binding protein identical to XerB and PepA, also required for RT resolution of ColEI multimers."; RL J. Mol. Biol. 250:392-406(1995). RN [3] RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA]. RC STRAIN=K12 / MG1655 / ATCC 47076; RX MEDLINE=95334362; PubMed=7610040; DOI=10.1093/nar/23.12.2105; RA Burland V.D., Plunkett G. III, Sofia H.J., Daniels D.L., RA Blattner F.R.; RT "Analysis of the Escherichia coli genome VI: DNA sequence of the RT region from 92.8 through 100 minutes."; RL Nucleic Acids Res. 23:2105-2119(1995). RN [4] RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA]. RC STRAIN=K12 / MG1655 / ATCC 47076; RX MEDLINE=97426617; PubMed=9278503; DOI=10.1126/science.277.5331.1453; RA Blattner F.R., Plunkett G. III, Bloch C.A., Perna N.T., Burland V., RA Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F., RA Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J., RA Mau B., Shao Y.; RT "The complete genome sequence of Escherichia coli K-12."; RL Science 277:1453-1474(1997). RN [5] RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA]. 
RC STRAIN=K12 / W3110 / ATCC 27325 / DSM 5911; RX PubMed=16738553; DOI=10.1038/msb4100049; RA Hayashi K., Morooka N., Yamamoto Y., Fujita K., Isono K., Choi S., RA Ohtsubo E., Baba T., Wanner B.L., Mori H., Horiuchi T.; RT "Highly accurate genome sequences of Escherichia coli K-12 strains RT MG1655 and W3110."; RL Mol. Syst. Biol. 2:E1-E5(2006). RN [6] RP MUTAGENESIS OF GLU-354. RX MEDLINE=94335644; PubMed=8057849; RA McCulloch R., Burke M.E., Sherratt D.J.; RT "Peptidase activity of Escherichia coli aminopeptidase A is not RT required for its role in Xer site-specific recombination."; RL Mol. Microbiol. 12:241-251(1994). RN [7] RP X-RAY CRYSTALLOGRAPHY (2.5 ANGSTROMS). RX PubMed=10449417; DOI=10.1093/emboj/18.16.4513; RA Strater N., Sherratt D.J., Colloms S.D.; RT "X-ray structure of aminopeptidase A from Escherichia coli and a model RT for the nucleoprotein complex in Xer site-specific recombination."; RL EMBO J. 18:4513-4522(1999). CC -!- FUNCTION: Presumably involved in the processing and regular CC turnover of intracellular proteins. Catalyzes the removal of CC unsubstituted N-terminal amino acids from various peptides. CC Required for plasmid ColE1 site-specific recombination but not in CC its aminopeptidase activity. Could act as a structural component CC of the putative nucleoprotein complex in which the Xer CC recombination reaction takes place. CC -!- CATALYTIC ACTIVITY: Release of an N-terminal amino acid, Xaa-|- CC Yaa-, in which Xaa is preferably Leu, but may be other amino acids CC including Pro although not Arg or Lys, and Yaa may be Pro. Amino CC acid amides and methyl esters are also readily hydrolyzed, but CC rates on arylamides are exceedingly low. CC -!- COFACTOR: Binds 2 manganese ions per subunit (By similarity). CC -!- ENZYME REGULATION: Inhibited by zinc and EDTA. CC -!- SUBUNIT: Homohexamer. CC -!- SIMILARITY: Belongs to the peptidase M17 family. 
CC -!- CAUTION: The ligation for manganese is based on the ligation for CC zinc, an inhibitor, in the crystallographic structure reported in CC PubMed:10449417. The ligation for manganese in the active form of CC the enzyme may differ. DR EMBL; X15130; CAA33225.1; -; Genomic_DNA. DR EMBL; X86443; CAA60164.1; -; Genomic_DNA. DR EMBL; U14003; AAA97157.1; -; Genomic_DNA. DR EMBL; U00096; AAC77217.1; -; Genomic_DNA. DR EMBL; AP009048; BAE78257.1; -; Genomic_DNA. DR PIR; S04462; APECA. DR RefSeq; AP_004756.1; -. DR RefSeq; NP_418681.1; -. DR PDB; 1GYT; X-ray; A/B/C/D/E/F/G/H/I/J/K/L=1-503. DR IntAct; P68767; -. DR MEROPS; M17.003; -. DR GeneID; 948791; -. DR GenomeReviews; U00096_GR; b4260. DR GenomeReviews; AP009048_GR; JW4217. DR KEGG; ecj:JW4217; -. DR KEGG; eco:b4260; -. DR EchoBASE; EB0688; -. DR EcoGene; EG10694; pepA. DR BioCyc; EcoCyc:EG10694-MONOMER; -. DR GO; GO:0004178; F:leucyl aminopeptidase activity; IEA:HAMAP. DR GO; GO:0030145; F:manganese ion binding; IEA:HAMAP. DR HAMAP; MF_00181; -; 1. DR InterPro; IPR011356; Peptidase_M17. DR InterPro; IPR000819; Peptidase_M17_C. DR InterPro; IPR008283; Peptidase_M17_N. DR PANTHER; PTHR11963:SF3; Peptidase_M17; 1. DR Pfam; PF00883; Peptidase_M17; 1. DR Pfam; PF02789; Peptidase_M17_N; 1. DR PIRSF; PIRSF001116; Ctsl_amnpptdse; 1. DR PRINTS; PR00481; LAMNOPPTDASE. DR PROSITE; PS00631; CYTOSOL_AP; 1. PE 1: Evidence at protein level; KW 3D-structure; Aminopeptidase; Complete proteome; KW Direct protein sequencing; Hydrolase; Manganese; Metal-binding; KW Protease. FT CHAIN 1 503 Cytosol aminopeptidase. FT /FTId=PRO_0000165750. FT ACT_SITE 282 282 Potential. FT ACT_SITE 356 356 Potential. FT METAL 270 270 Manganese 2 (Probable). FT METAL 275 275 Manganese 1 (Probable). FT METAL 275 275 Manganese 2 (Probable). FT METAL 293 293 Manganese 2 (Probable). FT METAL 352 352 Manganese 1 (Probable). FT METAL 354 354 Manganese 1 (Probable). FT METAL 354 354 Manganese 2 (Probable). FT MUTAGEN 354 354 E->A: Loss of activity. 
FT STRAND 2 6 FT HELIX 10 12 FT STRAND 18 23 FT TURN 24 26 FT HELIX 30 36 FT STRAND 39 41 FT HELIX 42 49 FT STRAND 59 64 FT STRAND 69 77 FT HELIX 86 102 FT STRAND 106 110 FT HELIX 112 114 FT HELIX 122 137 FT STRAND 156 160 FT HELIX 164 166 FT HELIX 167 192 FT TURN 195 197 FT HELIX 200 213 FT TURN 214 217 FT STRAND 218 223 FT HELIX 225 230 FT HELIX 234 241 FT STRAND 243 245 FT STRAND 248 255 FT STRAND 265 275 FT HELIX 287 294 FT HELIX 295 310 FT STRAND 313 325 FT STRAND 337 339 FT STRAND 345 347 FT HELIX 355 365 FT HELIX 366 369 FT STRAND 372 378 FT HELIX 382 388 FT TURN 389 391 FT STRAND 392 398 FT HELIX 400 413 FT STRAND 417 419 FT HELIX 424 427 FT HELIX 428 430 FT STRAND 433 439 FT HELIX 446 455 FT STRAND 463 467 FT TURN 469 471 FT STRAND 472 474 FT HELIX 476 478 FT HELIX 486 496 SQ SEQUENCE 503 AA; 54880 MW; 643DED17EAC44DCD CRC64; MEFSVKSGSP EKQRSACIVV GVFEPRRLSP IAEQLDKISD GYISALLRRG ELEGKPGQTL LLHHVPNVLS ERILLIGCGK ERELDERQYK QVIQKTINTL NDTGSMEAVC FLTELHVKGR NNYWKVRQAV ETAKETLYSF DQLKTNKSEP RRPLRKMVFN VPTRRELTSG ERAIQHGLAI AAGIKAAKDL GNMPPNICNA AYLASQARQL ADSYSKNVIT RVIGEQQMKE LGMHSYLAVG QGSQNESLMS VIEYKGNASE DARPIVLVGK GLTFDSGGIS IKPSEGMDEM KYDMCGAAAV YGVMRMVAEL QLPINVIGVL AGCENMPGGR AYRPGDVLTT MSGQTVEVLN TDAEGRLVLC DVLTYVERFE PEAVIDVATL TGACVIALGH HITGLMANHN PLAHELIAAS EQSGDRAWRL PLGDEYQEQL ESNFADMANI GGRPGGAITA GCFLSRFTRK YNWAHLDIAG TAWRSGKAKG ATGRPVALLA QFLLNRAGFN GEE // b) Swiss-Prot entry where no characterization has taken place but where information has been added because the sequences are highly comparable and so we believe, beyond reasonable doubt, that it is such a protein. The lines that have been indented are those where information has been added. ID AMPA_HAEIN Reviewed; 491 AA. AC P45334; DT 01-NOV-1995, integrated into UniProtKB/Swiss-Prot. DT 01-NOV-1995, sequence version 1. DT 02-OCT-2007, entry version 58. DE Cytosol aminopeptidase (EC 3.4.11.1) (Leucine aminopeptidase) (LAP) DE (Leucyl aminopeptidase). 
GN Name=pepA; OrderedLocusNames=HI1705; OS Haemophilus influenzae. OC Bacteria; Proteobacteria; Gammaproteobacteria; Pasteurellales; OC Pasteurellaceae; Haemophilus. OX NCBI_TaxID=727; RN [1] RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA]. RC STRAIN=ATCC 51907 / DSM 11121 / KW20 / Rd; RX MEDLINE=95350630; PubMed=7542800; RA Fleischmann R.D., Adams M.D., White O., Clayton R.A., Kirkness E.F., RA Kerlavage A.R., Bult C.J., Tomb J.-F., Dougherty B.A., Merrick J.M., RA McKenney K., Sutton G.G., FitzHugh W., Fields C.A., Gocayne J.D., RA Scott J.D., Shirley R., Liu L.-I., Glodek A., Kelley J.M., RA Weidman J.F., Phillips C.A., Spriggs T., Hedblom E., Cotton M.D., RA Utterback T.R., Hanna M.C., Nguyen D.T., Saudek D.M., Brandon R.C., RA Fine L.D., Fritchman J.L., Fuhrmann J.L., Geoghagen N.S.M., RA Gnehm C.L., McDonald L.A., Small K.V., Fraser C.M., Smith H.O., RA Venter J.C.; RT "Whole-genome random sequencing and assembly of Haemophilus influenzae RT Rd."; RL Science 269:496-512(1995). CC -!- FUNCTION: Presumably involved in the processing and regular CC turnover of intracellular proteins. Catalyzes the removal of CC unsubstituted N-terminal amino acids from various peptides (By CC similarity). CC -!- CATALYTIC ACTIVITY: Release of an N-terminal amino acid, Xaa-|- CC Yaa-, in which Xaa is preferably Leu, but may be other amino acids CC including Pro although not Arg or Lys, and Yaa may be Pro. Amino CC acid amides and methyl esters are also readily hydrolyzed, but CC rates on arylamides are exceedingly low. CC -!- COFACTOR: Binds 2 manganese ions per subunit (By similarity). CC -!- SUBCELLULAR LOCATION: Cytoplasm (By similarity). CC -!- SIMILARITY: Belongs to the peptidase M17 family. DR EMBL; L42023; AAC23351.1; -; Genomic_DNA. DR PIR; C64137; C64137. DR RefSeq; NP_439847.1; -. DR HSSP; P11648; 1GYT. DR MEROPS; M17.003; -. DR GeneID; 949712; -. DR GenomeReviews; L42023_GR; HI1705. DR KEGG; hin:HI1705; -. DR TIGR; HI1705; -. DR BioCyc; HINF71421:HI_1705-MONOMER; -. 
DR GO; GO:0004178; F:leucyl aminopeptidase activity; IEA:HAMAP. DR GO; GO:0030145; F:manganese ion binding; IEA:HAMAP. DR HAMAP; MF_00181; -; 1. DR InterPro; IPR011356; Peptidase_M17. DR InterPro; IPR000819; Peptidase_M17_C. DR InterPro; IPR008283; Peptidase_M17_N. DR PANTHER; PTHR11963:SF3; Peptidase_M17; 1. DR Pfam; PF00883; Peptidase_M17; 1. DR Pfam; PF02789; Peptidase_M17_N; 1. DR PIRSF; PIRSF001116; Ctsl_amnpptdse; 1. DR PRINTS; PR00481; LAMNOPPTDASE. DR PROSITE; PS00631; CYTOSOL_AP; 1. PE 3: Inferred from homology; KW Aminopeptidase; Complete proteome; Cytoplasm; Hydrolase; Manganese; KW Metal-binding; Protease. FT CHAIN 1 491 Cytosol aminopeptidase. FT /FTId=PRO_0000165758. FT ACT_SITE 275 275 Potential. FT ACT_SITE 349 349 Potential. FT METAL 263 263 Manganese 2 (By similarity). FT METAL 268 268 Manganese 1 (By similarity). FT METAL 268 268 Manganese 2 (By similarity). FT METAL 286 286 Manganese 2 (By similarity). FT METAL 345 345 Manganese 1 (By similarity). FT METAL 347 347 Manganese 1 (By similarity). FT METAL 347 347 Manganese 2 (By similarity). SQ SEQUENCE 491 AA; 53529 MW; 71376DDB1B0076EB CRC64; MKYQAKNTAL SQATDCIVLG VYENNKFSKS FNEIDQLTQG YLNDLVKSGE LTGKLAQTVL LRDLQGLSAK RLLIVGCGKK GELTERQYKQ IIQAVLKTLK ETNTREVISY LTEIELKDRD LYWNIRFAIE TIEHTNYQFD HFKSQKAETS VLESFIFNTD CAQAQQAISH ANAISSGIKA ARDIANMPPN ICNPAYLAEQ AKNLAENSTA LSLKVVDEEE MAKLGMNAYL AVSKGSENRA YMSVLTFNNA PDKNAKPIVL VGKGLTFDAG GISLKPAADM DEMKYDMCGA ASVFGTMKTI AQLNLPLNVI GVLAGCENLP DGNAYRPGDI LTTMNGLTVE VLNTDAEGRL VLCDTLTYVE RFEPELVIDV ATLTGACVVA LGQHNSGLVS TDNNLANALL QAATETTDKA WRLPLSEEYQ EQLKSPFADL ANIGGRWGGA ITAGAFLSNF TKKYRWAHLD IAGTAWLQGA NKGATGRPVS LLTQFLINQV K // The alignment below shows that the degree of sequence similarity is such that we can classify, beyond reasonable doubt, this protein as an aminopeptidase A/I. AMPA_ECOLI MEFSVKSGSPEKQRSACIVVGVFEPRRLSPIAEQLDKISDGYISALLRRG AMPA_HAEIN MKYQAKN-TALSQATDCIVLGVYENNKFSKSFNEIDQLTQGYLNDLVKSG *.. .*. .. 
.* ..***.**.* ...* ...*....**...*...* AMPA_ECOLI ELEGKPGQTLLLHHVPNVLSERILLIGCGKERELDERQYKQVIQKTINTL AMPA_HAEIN ELTGKLAQTVLLRDLQGLSAKRLLIVGCGKKGELTERQYKQIIQAVLKTL **.** .**.**...... ..*.*..****. **.******.** ...** AMPA_ECOLI NDTGSMEAVCFLTELHVKGRNNYWKVRQAVETAKETLYSFDQLKTNKSEP AMPA_HAEIN KETNTREVISYLTEIELKDRDLYWNIRFAIETIEHTNYQFDHFKSQKAET ..*...*....***...*.*. **..* *.** ..* * **..*..*.*. AMPA_ECOLI RRPLRKMVFNVPTRRELTSGERAIQHGLAIAAGIKAAKDLGNMPPNICNA AMPA_HAEIN S-VLESFIFNTDC----AQAQQAISHANAISSGIKAARDIANMPPNICNP . * ...**. . ...** *. **..*****.*..********. AMPA_ECOLI AYLASQARQLADSYSKNVITRVIGEQQMKELGMHSYLAVGQGSQNESLMS AMPA_HAEIN AYLAEQAKNLAEN-STALSLKVVDEEEMAKLGMNAYLAVSKGSENRAYMS ****.**..**.. *... .*..*..* .***..****..**.* . ** AMPA_ECOLI VIEYKGNASEDARPIVLVGKGLTFDSGGISIKPSEGMDEMKYDMCGAAAV AMPA_HAEIN VLTFNNAPDKNAKPIVLVGKGLTFDAGGISLKPAADMDEMKYDMCGAASV *..........*.************.****.**...************.* AMPA_ECOLI YGVMRMVAELQLPINVIGVLAGCENMPGGRAYRPGDVLTTMSGQTVEVLN AMPA_HAEIN FGTMKTIAQLNLPLNVIGVLAGCENLPDGNAYRPGDILTTMNGLTVEVLN .*.*. .*.*.**.***********.*.*.******.****.* ****** AMPA_ECOLI TDAEGRLVLCDVLTYVERFEPEAVIDVATLTGACVIALGHHITGLMANHN AMPA_HAEIN TDAEGRLVLCDTLTYVERFEPELVIDVATLTGACVVALGQHNSGLVSTDN ***********.********** ************.***.* .**....* AMPA_ECOLI PLAHELIAASEQSGDRAWRLPLGDEYQEQLESNFADMANIGGRPGGAITA AMPA_HAEIN NLANALLQAATETTDKAWRLPLSEEYQEQLKSPFADLANIGGRWGGAITA **..*..*.....*.******..******.* ***.****** ****** AMPA_ECOLI GCFLSRFTRKYNWAHLDIAGTAWRSGKAKGATGRPVALLAQFLLNRAGFNGEE AMPA_HAEIN GAFLSNFTKKYRWAHLDIAGTAWLQGANKGATGRPVSLLTQFLINQVK * ***.**.**.*********** * .********.**.***.*..

(III) Protein sequence data from translation of genome sequencing data

Genome sequencing has caused a massive influx of data into the nucleotide sequence databases, and this has led to the same influx into TrEMBL, giving thousands of entries waiting to go into Swiss-Prot.
This sequence data is submitted to the nucleotide sequence databases and is reported in publications that present the entire genome sequence as well as genes predicted by a number of methods. Apart from these gene designations, the papers rarely include experimental information about any of the predicted proteins. By combining what is reported with an assessment of the results of sequence alignments, which hit against both characterized and part-characterized protein sequences (see above), we make an effort to add relevant biochemical information to these translated protein sequences.

The first step here is to align the translated sequences against Swiss-Prot and TrEMBL. (We run against TrEMBL as an additional check for exact matches, which helps to reduce redundancy in our data, and to pick up PROSITE/Pfam information that may be missing from the entry being worked on.) This is described fully further on. The results give rise to a number of scenarios:

1. identical to an existing sequence in Swiss-Prot from the same organism
2. identical to an existing sequence in Swiss-Prot from a different organism, which may or may not be related
3. strong similarity (i.e. many residues are identical or conserved), over the entire sequence, to an existing entry (from a related or different organism)
4. strong similarity only at regions in the sequence (from same, related or different organism)
5. some similarity to one or more existing entries
6. no similarity to any existing entries

Here is a detailed description of each of the above scenarios.

1) Identical to an existing sequence in Swiss-Prot from the same organism

Update the existing Swiss-Prot entry by adding the new reference and the new EMBL DR line. Check the new reference for any additional information.

2) Identical to an existing sequence in Swiss-Prot from a different organism which may or may not be related

We create a new entry based on the template entry.
The majority of the annotation information (comments, features, etc.) is copied with the qualifier "By similarity" added. For example, the entry shown below has been annotated based on the 100% identical (at protein level) entry from E. coli which was shown in section (II) above.

ID AMPA_ECO57 Reviewed; 503 AA. AC P68768; P11648; DT 21-DEC-2004, integrated into UniProtKB/Swiss-Prot. DT 21-DEC-2004, sequence version 1. DT 02-OCT-2007, entry version 22. DE Cytosol aminopeptidase (EC 3.4.11.1) (Leucine aminopeptidase) (LAP) DE (Leucyl aminopeptidase) (Aminopeptidase A/I). GN Name=pepA; Synonyms=carP, xerB; OrderedLocusNames=Z5872, ECs5237; OS Escherichia coli O157:H7. OC Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; OC Enterobacteriaceae; Escherichia. OX NCBI_TaxID=83334; RN [1] RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA]. RC STRAIN=O157:H7 / EDL933 / ATCC 700927 / EHEC; RX MEDLINE=21074935; PubMed=11206551; DOI=10.1038/35054089; RA Perna N.T., Plunkett G. III, Burland V., Mau B., Glasner J.D., RA Rose D.J., Mayhew G.F., Evans P.S., Gregor J., Kirkpatrick H.A., RA Posfai G., Hackett J., Klink S., Boutin A., Shao Y., Miller L., RA Grotbeck E.J., Davis N.W., Lim A., Dimalanta E.T., Potamousis K., RA Apodaca J., Anantharaman T.S., Lin J., Yen G., Schwartz D.C., RA Welch R.A., Blattner F.R.; RT "Genome sequence of enterohaemorrhagic Escherichia coli O157:H7."; RL Nature 409:529-533(2001). RN [2] RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC STRAIN=O157:H7 / Sakai / RIMD 0509952 / EHEC; RX MEDLINE=21156231; PubMed=11258796; DOI=10.1093/dnares/8.1.11; RA Hayashi T., Makino K., Ohnishi M., Kurokawa K., Ishii K., Yokoyama K., RA Han C.-G., Ohtsubo E., Nakayama K., Murata T., Tanaka M., Tobe T., RA Iida T., Takami H., Honda T., Sasakawa C., Ogasawara N., Yasunaga T., RA Kuhara S., Shiba T., Hattori M., Shinagawa H.; RT "Complete genome sequence of enterohemorrhagic Escherichia coli RT O157:H7 and genomic comparison with a laboratory strain K-12."; RL DNA Res. 8:11-22(2001). CC -!- FUNCTION: Presumably involved in the processing and regular CC turnover of intracellular proteins. Catalyzes the removal of CC unsubstituted N-terminal amino acids from various peptides. CC Required for plasmid ColE1 site-specific recombination but not in CC its aminopeptidase activity. Could act as a structural component CC of the putative nucleoprotein complex in which the Xer CC recombination reaction takes place (By similarity). CC -!- CATALYTIC ACTIVITY: Release of an N-terminal amino acid, Xaa-|- CC Yaa-, in which Xaa is preferably Leu, but may be other amino acids CC including Pro although not Arg or Lys, and Yaa may be Pro. Amino CC acid amides and methyl esters are also readily hydrolyzed, but CC rates on arylamides are exceedingly low. CC -!- COFACTOR: Binds 2 manganese ions per subunit (By similarity). CC -!- ENZYME REGULATION: Inhibited by zinc and EDTA (By similarity). CC -!- SUBUNIT: Homohexamer (By similarity). CC -!- SIMILARITY: Belongs to the peptidase M17 family. DR EMBL; AE005174; AAG59459.1; -; Genomic_DNA. DR EMBL; BA000007; BAB38660.1; -; Genomic_DNA. DR PIR; E91283; E91283. DR PIR; G86124; G86124. DR RefSeq; NP_290893.1; -. DR RefSeq; NP_313264.1; -. DR SMR; P68768; 1-503. DR GeneID; 913804; -. DR GeneID; 959777; -. DR GenomeReviews; BA000007_GR; ECs5237. DR GenomeReviews; AE005174_GR; Z5872. DR KEGG; ece:Z5872; -. DR KEGG; ecs:ECs5237; -. DR BioCyc; ECOL83334:ECS5237-MONOMER; -. 
DR GO; GO:0004178; F:leucyl aminopeptidase activity; IEA:HAMAP. DR GO; GO:0030145; F:manganese ion binding; IEA:HAMAP. DR HAMAP; MF_00181; -; 1. DR InterPro; IPR011356; Peptidase_M17. DR InterPro; IPR000819; Peptidase_M17_C. DR InterPro; IPR008283; Peptidase_M17_N. DR PANTHER; PTHR11963:SF3; Peptidase_M17; 1. DR Pfam; PF00883; Peptidase_M17; 1. DR Pfam; PF02789; Peptidase_M17_N; 1. DR PIRSF; PIRSF001116; Ctsl_amnpptdse; 1. DR PRINTS; PR00481; LAMNOPPTDASE. DR PROSITE; PS00631; CYTOSOL_AP; 1. PE 3: Inferred from homology; KW Aminopeptidase; Complete proteome; Hydrolase; Manganese; KW Metal-binding; Protease. FT CHAIN 1 503 Cytosol aminopeptidase. FT /FTId=PRO_0000165752. FT ACT_SITE 282 282 Potential. FT ACT_SITE 356 356 Potential. FT METAL 270 270 Manganese 2 (By similarity). FT METAL 275 275 Manganese 1 (By similarity). FT METAL 275 275 Manganese 2 (By similarity). FT METAL 293 293 Manganese 2 (By similarity). FT METAL 352 352 Manganese 1 (By similarity). FT METAL 354 354 Manganese 1 (By similarity). FT METAL 354 354 Manganese 2 (By similarity). SQ SEQUENCE 503 AA; 54880 MW; 643DED17EAC44DCD CRC64; MEFSVKSGSP EKQRSACIVV GVFEPRRLSP IAEQLDKISD GYISALLRRG ELEGKPGQTL LLHHVPNVLS ERILLIGCGK ERELDERQYK QVIQKTINTL NDTGSMEAVC FLTELHVKGR NNYWKVRQAV ETAKETLYSF DQLKTNKSEP RRPLRKMVFN VPTRRELTSG ERAIQHGLAI AAGIKAAKDL GNMPPNICNA AYLASQARQL ADSYSKNVIT RVIGEQQMKE LGMHSYLAVG QGSQNESLMS VIEYKGNASE DARPIVLVGK GLTFDSGGIS IKPSEGMDEM KYDMCGAAAV YGVMRMVAEL QLPINVIGVL AGCENMPGGR AYRPGDVLTT MSGQTVEVLN TDAEGRLVLC DVLTYVERFE PEAVIDVATL TGACVIALGH HITGLMANHN PLAHELIAAS EQSGDRAWRL PLGDEYQEQL ESNFADMANI GGRPGGAITA GCFLSRFTRK YNWAHLDIAG TAWRSGKAKG ATGRPVALLA QFLLNRAGFN GEE // 3) Strong similarity (i.e. many residues are identical or conserved), over the entire sequence, to an existing entry (from a related or different organism) There is no fixed cut-off point in percentage sequence similarity. It is from experience that the curators assess whether similarity is considered to be strong or weak. 
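
Although no fixed cut-off is applied, percent identity over the aligned columns is one simple figure a reader can compute from an alignment such as the AMPA_ECOLI / AMPA_HAEIN one shown earlier. The following is a minimal Python sketch for illustration only; it is not part of the Swiss-Prot annotation procedure. The function name `percent_identity` and the decision to ignore columns containing a gap ("-") are assumptions. The two rows are the first block of that alignment.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of identical residues over gap-free aligned columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Assumption: columns where either row has a gap are left out of the count.
    columns = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


# First block of the AMPA_ECOLI / AMPA_HAEIN alignment from this document.
ECOLI = "MEFSVKSGSPEKQRSACIVVGVFEPRRLSPIAEQLDKISDGYISALLRRG"
HAEIN = "MKYQAKN-TALSQATDCIVLGVYENNKFSKSFNEIDQLTQGYLNDLVKSG"

if __name__ == "__main__":
    value = percent_identity(ECOLI, HAEIN)
    print(f"{value:.1f}% identity over gap-free columns")
```

Such a number is only one input; as stated above, the decision whether similarity is strong or weak rests with the curator.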
For each individual case, we must also look to see whether sequences are highly conserved between species. To exhibit this, please look at the following example. This entry has been created from data submitted from the Schizosaccharomyces pombe genome project. ID CHMU_SCHPO Reviewed; 251 AA. AC O13739; DT 15-JUL-1998, integrated into UniProtKB/Swiss-Prot. DT 01-JAN-1998, sequence version 1. DT 23-OCT-2007, entry version 53. DE Probable chorismate mutase (EC 5.4.99.5) (CM). GN ORFNames=SPAC16E8.04c; OS Schizosaccharomyces pombe (Fission yeast). OC Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; OC Schizosaccharomycetes; Schizosaccharomycetales; OC Schizosaccharomycetaceae; Schizosaccharomyces. OX NCBI_TaxID=4896; RN [1] RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA]. RC STRAIN=ATCC 38366 / 972; RX MEDLINE=21848401; PubMed=11859360; DOI=10.1038/nature724; RA Wood V., Gwilliam R., Rajandream M.A., Lyne M.H., Lyne R., Stewart A., RA Sgouros J.G., Peat N., Hayles J., Baker S.G., Basham D., Bowman S., RA Brooks K., Brown D., Brown S., Chillingworth T., Churcher C.M., RA Collins M., Connor R., Cronin A., Davis P., Feltwell T., Fraser A., RA Gentles S., Goble A., Hamlin N., Harris D.E., Hidalgo J., Hodgson G., RA Holroyd S., Hornsby T., Howarth S., Huckle E.J., Hunt S., Jagels K., RA James K.D., Jones L., Jones M., Leather S., McDonald S., McLean J., RA Mooney P., Moule S., Mungall K.L., Murphy L.D., Niblett D., Odell C., RA Oliver K., O'Neil S., Pearson D., Quail M.A., Rabbinowitsch E., RA Rutherford K.M., Rutter S., Saunders D., Seeger K., Sharp S., RA Skelton J., Simmonds M.N., Squares R., Squares S., Stevens K., RA Taylor K., Taylor R.G., Tivey A., Walsh S.V., Warren T., Whitehead S., RA Woodward J.R., Volckaert G., Aert R., Robben J., Grymonprez B., RA Weltjens I., Vanstreels E., Rieger M., Schaefer M., Mueller-Auer S., RA Gabel C., Fuchs M., Duesterhoeft A., Fritzc C., Holzer E., Moestl D., RA Hilbert H., Borzym K., Langer I., Beck A., Lehrach H., 
Reinhardt R., RA Pohl T.M., Eger P., Zimmermann W., Wedler H., Wambutt R., Purnelle B., RA Goffeau A., Cadieu E., Dreano S., Gloux S., Lelaure V., Mottier S., RA Galibert F., Aves S.J., Xiang Z., Hunt C., Moore K., Hurst S.M., RA Lucas M., Rochet M., Gaillardin C., Tallada V.A., Garzon A., Thode G., RA Daga R.R., Cruzado L., Jimenez J., Sanchez M., del Rey F., Benito J., RA Dominguez A., Revuelta J.L., Moreno S., Armstrong J., Forsburg S.L., RA Cerutti L., Lowe T., McCombie W.R., Paulsen I., Potashkin J., RA Shpakovski G.V., Ussery D., Barrell B.G., Nurse P.; RT "The genome sequence of Schizosaccharomyces pombe."; RL Nature 415:871-880(2002). CC -!- CATALYTIC ACTIVITY: Chorismate = prephenate. CC -!- ENZYME REGULATION: Allosterically regulated. CC -!- PATHWAY: Metabolic intermediate biosynthesis; prephenate CC biosynthesis; prephenate from chorismate: step 1/1. CC -!- SUBUNIT: Homodimer (By similarity). CC -!- SIMILARITY: Contains 1 chorismate mutase domain. DR EMBL; CU329670; CAB11033.1; -; Genomic_DNA. DR PIR; T37784; T37784. DR HSSP; P32178; 5CSM. DR KEGG; spo:SPAC16E8.04c; -. DR GeneDB_Spombe; SPAC16E8.04c; -. DR BioCyc; SPOM-XXX-01:SPOM-XXX-01-001828-MONOMER; -. DR ArrayExpress; O13739; -. DR GO; GO:0005829; C:cytosol; IDA:GeneDB_SPombe. DR GO; GO:0005634; C:nucleus; IDA:GeneDB_SPombe. DR InterPro; IPR008238; Chor_mut_AroQ_eu. DR InterPro; IPR002701; Chorismate_mut. DR Gene3D; G3DSA:1.10.590.10; Chor_mut_AroQ_eu; 1. DR PANTHER; PTHR21145; Chor_mut_AroQ_eu; 1. DR Pfam; PF01817; CM_2; 1. DR PIRSF; PIRSF017318; Chor_mut_AroQ_eu; 1. DR TIGRFAMs; TIGR01802; CM_pl-yst; 1. DR PROSITE; PS51169; CHORISMATE_MUT_3; 1. PE 2: Evidence at transcript level; KW Allosteric enzyme; Amino-acid biosynthesis; KW Aromatic amino acid biosynthesis; Complete proteome; Isomerase. FT CHAIN 1 251 Probable chorismate mutase. FT /FTId=PRO_0000119205. FT DOMAIN 1 251 Chorismate mutase. 
SQ SEQUENCE 251 AA; 29050 MW; 1AC18AE4C1E6C4B7 CRC64; MSLVNEKLKL ENIRSALIRQ EDTIIFNFLE RAQFPRNEKV YKSGKEGCLN LENYDGSFLN YLLHEEEKVY ALVRRYASPE EYPFTDNLPE PILPKFSGKF PLHPNNVNVN SEILEYYINE IVPKISSPGD DFDNYGSTVV CDIRCLQSLS RRIHYGKFVA EAKYLANPEK YKKLILARDI KGIENEIVDA AQEERVLKRL HYKALNYGRD AADPTKPSDR INADCVASIY KDYVIPMTKK VEVDYLLARL L // When aligned to its closest homolog in Swiss-Prot and TrEMBL the following results are obtained: CHMU_YEAST MDFTKPETVLNLQNIRDELVRMEDSIIFKFIERSHFATCPSVYEANHPG- CHMU_SCHPO MSLVNEK--LKLENIRSALIRQEDTIIFNFLERAQFPRNEKVYKSGKEGC *.... . *.*.***..*.* **.***.*.**..*. .**.... * CHMU_YEAST LEIPNFKGSFLDWALSNLEIAHSRIRRFESPDETPFFPDKIQKSFLPSIN CHMU_SCHPO LNLENYDGSFLNYLLHEEEKVYALVRRYASPEEYPF-TDNLPEPILP--K *.. *..****.. * . * ... .**..**.* ** .*......** . CHMU_YEAST YPQILAPYAPEVNYNDKIKKVYIEKIIPLISKRDGDDKNNFGSVATRDIE CHMU_SCHPO FSGKFPLHPNNVNVNSEILEYYINEIVPKISSP-GDDFDNYGSTVVCDIR .. .. .. .** *..* . **..*.* **.. *** .*.**... ** CHMU_YEAST CLQSLSRRIHFGKFVAEAKFQSDIPLYTKLIKSKDVEGIMKNITNSAVEE CHMU_SCHPO CLQSLSRRIHYGKFVAEAKYLANPEKYKKLILARDIKGIENEIVDAAQEE **********.********. .. *.*** ..*..** ..*...* ** CHMU_YEAST KILERLTKKAEVYGVDPTNES-GERRITPEYLVKIYKEIVIPITKEVEVE CHMU_SCHPO RVLKRLHYKALNYGRDAADPTKPSDRINADCVASIYKDYVIPMTKKVEVD ..*.** ** ** *... . . **.......***. ***.**.***. CHMU_YEAST YLLRRLEE CHMU_SCHPO YLLARLL *** ** The sequences show a high degree of similarity over their entire lengths and so it is highly likely that the sequence from the Schizosaccharomyces pombe genome project is indeed a chorismate mutase. This allows us to add the standard description line as well as comments describing catalytic activity, the pathway the enzyme is involved in as well as the relevant keywords. We can also add a subunit comment but here we add "(by similarity)" to show that this information has come from a characterized protein(s) (in this case from CHMU_YEAST (P32178)) and has not been experimentally determined in S. pombe. 
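
The "(by similarity)" convention can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tooling used by Swiss-Prot curators: the function name `add_by_similarity` is hypothetical, and the placement of the qualifier before the final full stop is inferred from the entries shown in this document (for example "Homohexamer." in AMPA_ECOLI and "Homohexamer (By similarity)." in AMPA_ECO57).

```python
QUALIFIER = "(By similarity)"


def add_by_similarity(comment: str) -> str:
    """Return the comment text with '(By similarity)' before the final period."""
    text = comment.strip()
    # Leave the text unchanged if it already carries the qualifier.
    if text.rstrip(".").endswith(QUALIFIER):
        return text
    if text.endswith("."):
        return f"{text[:-1]} {QUALIFIER}."
    return f"{text} {QUALIFIER}"


if __name__ == "__main__":
    # Comment texts taken from the AMPA_ECOLI entry shown in this document.
    for comment in ("Homohexamer.", "Inhibited by zinc and EDTA."):
        print(add_by_similarity(comment))
```

The function is deliberately idempotent, so applying it to a comment that was itself copied from a "by similarity" entry does not add the qualifier twice. Whether a comment should be propagated at all remains a curator's decision.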
In addition, because this protein has not been biochemically characterized, we add "probable" to the DE line to indicate this, e.g. "Probable chorismate mutase."

4) Strong similarity only at regions in the sequence (from same, related or different organism)

These cases often pick up on areas within a sequence responsible for binding, for example, cofactors, metals, DNA and ATP/GTP. Here, a function can often be assigned, leading to description lines, comments and keywords being added to the new entry. In some cases, however, even though areas are conserved there is no evidence to characterize the protein. It should be noted that we also make use of domain/family databases such as PROSITE and Pfam in these cases. Below are examples of both these cases. The entry below is again from the S. pombe genome project.

ID PPK14_SCHPO Reviewed; 566 AA.
AC Q09831;
DT 01-FEB-1996, integrated into UniProtKB/Swiss-Prot.
DT 01-FEB-1996, sequence version 1.
DT 23-OCT-2007, entry version 52.
DE Serine/threonine-protein kinase ppk14 (EC 2.7.11.1).
GN Name=ppk14; ORFNames=SPAC4G8.05;
OS Schizosaccharomyces pombe (Fission yeast).
OC Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC Schizosaccharomycetes; Schizosaccharomycetales;
OC Schizosaccharomycetaceae; Schizosaccharomyces.
OX NCBI_TaxID=4896;
RN [1]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC STRAIN=ATCC 38366 / 972; RX MEDLINE=21848401; PubMed=11859360; DOI=10.1038/nature724; RA Wood V., Gwilliam R., Rajandream M.A., Lyne M.H., Lyne R., Stewart A., RA Sgouros J.G., Peat N., Hayles J., Baker S.G., Basham D., Bowman S., RA Brooks K., Brown D., Brown S., Chillingworth T., Churcher C.M., RA Collins M., Connor R., Cronin A., Davis P., Feltwell T., Fraser A., RA Gentles S., Goble A., Hamlin N., Harris D.E., Hidalgo J., Hodgson G., RA Holroyd S., Hornsby T., Howarth S., Huckle E.J., Hunt S., Jagels K., RA James K.D., Jones L., Jones M., Leather S., McDonald S., McLean J., RA Mooney P., Moule S., Mungall K.L., Murphy L.D., Niblett D., Odell C., RA Oliver K., O'Neil S., Pearson D., Quail M.A., Rabbinowitsch E., RA Rutherford K.M., Rutter S., Saunders D., Seeger K., Sharp S., RA Skelton J., Simmonds M.N., Squares R., Squares S., Stevens K., RA Taylor K., Taylor R.G., Tivey A., Walsh S.V., Warren T., Whitehead S., RA Woodward J.R., Volckaert G., Aert R., Robben J., Grymonprez B., RA Weltjens I., Vanstreels E., Rieger M., Schaefer M., Mueller-Auer S., RA Gabel C., Fuchs M., Duesterhoeft A., Fritzc C., Holzer E., Moestl D., RA Hilbert H., Borzym K., Langer I., Beck A., Lehrach H., Reinhardt R., RA Pohl T.M., Eger P., Zimmermann W., Wedler H., Wambutt R., Purnelle B., RA Goffeau A., Cadieu E., Dreano S., Gloux S., Lelaure V., Mottier S., RA Galibert F., Aves S.J., Xiang Z., Hunt C., Moore K., Hurst S.M., RA Lucas M., Rochet M., Gaillardin C., Tallada V.A., Garzon A., Thode G., RA Daga R.R., Cruzado L., Jimenez J., Sanchez M., del Rey F., Benito J., RA Dominguez A., Revuelta J.L., Moreno S., Armstrong J., Forsburg S.L., RA Cerutti L., Lowe T., McCombie W.R., Paulsen I., Potashkin J., RA Shpakovski G.V., Ussery D., Barrell B.G., Nurse P.; RT "The genome sequence of Schizosaccharomyces pombe."; RL Nature 415:871-880(2002). RN [2] RP IDENTIFICATION. 
RX PubMed=15821139; DOI=10.1128/EC.4.4.799-813.2005; RA Bimbo A., Jia Y., Poh S.L., Karuturi R.K.M., den Elzen N., Peng X., RA Zheng L., O'Connell M., Liu E.T., Balasubramanian M.K., Liu J.; RT "Systematic deletion analysis of fission yeast protein kinases."; RL Eukaryot. Cell 4:799-813(2005). CC -!- CATALYTIC ACTIVITY: ATP + a protein = ADP + a phosphoprotein. CC -!- SIMILARITY: Belongs to the protein kinase superfamily. Ser/Thr CC protein kinase family. KIN82 subfamily. CC -!- SIMILARITY: Contains 1 protein kinase domain. DR EMBL; CU329670; CAA91206.1; -; Genomic_DNA. DR PIR; S62482; S62482. DR HSSP; P31751; 1GZK. DR KEGG; spo:SPAC4G8.05; -. DR GeneDB_Spombe; SPAC4G8.05; -. DR BioCyc; SPOM-XXX-01:SPOM-XXX-01-000780-MONOMER; -. DR ArrayExpress; Q09831; -. DR GO; GO:0004674; F:protein serine/threonine kinase activity; TAS:GeneDB_SPombe. DR InterPro; IPR000719; Prot_kinase_core. DR InterPro; IPR008271; Ser_thr_pkin_AS. DR InterPro; IPR002290; Ser_thr_pkinase. DR Pfam; PF00069; Pkinase; 1. DR ProDom; PD000001; Prot_kinase; 1. DR SMART; SM00220; S_TKc; 1. DR PROSITE; PS00107; PROTEIN_KINASE_ATP; FALSE_NEG. DR PROSITE; PS50011; PROTEIN_KINASE_DOM; 1. DR PROSITE; PS00108; PROTEIN_KINASE_ST; 1. PE 2: Evidence at transcript level; KW ATP-binding; Complete proteome; Kinase; Nucleotide-binding; KW Serine/threonine-protein kinase; Transferase. FT CHAIN 1 566 Serine/threonine-protein kinase ppk14. FT /FTId=PRO_0000086043. FT DOMAIN 195 485 Protein kinase. FT NP_BIND 201 209 ATP (By similarity). FT ACT_SITE 320 320 Proton acceptor (By similarity). FT BINDING 224 224 ATP (By similarity). 
SQ SEQUENCE 566 AA; 63482 MW; 3D18B4F84E10AA13 CRC64; MNELHDGESS EEGRINVEDH LEEAKKDDTG HWKHSGTAKP SKFRAFIRLH FKDSRKFAFS RKKEKELTSE DSDAANQSPS GAPESQTEEE SDRKIDGTGS SAEGGDGSGT DSISVIKKSF FKSGRKKKDV PKSRNVSRSN GADTSVQREK LKDIFSPHGK EKELAHIKKT VATRARTYSS NSIKICDVEV GPSSFEKVFL LGKGDVGRVY LVREKKSGKF YAMKVLSKQE MIKRNKSKRA FAEQHILATS NHPFIVTLYH SFQSDEYLYL CMEYCMGGEF FRALQRRPGR CLSENEAKFY IAEVTAALEY LHLMGFIYRD LKPENILLHE SGHIMLSDFD LSKQSNSAGA PTVIQARNAP SAQNAYALDT KSCIADFRTN SFVGTEEYIA PEVIKGCGHT SAVDWWTLGI LFYEMLYATT PFKGKNRNMT FSNILHKDVI FPEYADAPSI SSLCKNLIRK LLVKDENDRL GSQAGAADVK LHPFFKNVQW ALLRHTEPPI IPKLAPIDEK GNPNISHLKE SKSLDITHSP QNTQTVEVPL SNLSGADHGD DPFESFNSVT VHHEWD // By looking at the alignment we can see that the areas conserved are around ATP-binding sites (which is picked up by PROSITE and Pfam too) and the active site is also conserved. Hence we can add this information to the entry as can be seen in the feature table by similarity. This shows that there is no experimental proof but that it is very likely to be a serine/threonine protein kinase because conserved features of that family of proteins are present in the sequence. Below is the alignment to highlight this. PPK14_SCHPO MNELHDGESSEEGRINVEDHLEEAKKDD---TGHWKHSGTAKPSKFRAFIRLHFKDSR NRC2_NEUCR MPSTKNANGEGHFPSRIKQFFRINSGSKDHKDRDAHTTSSSHGGAPRADAKTPSGFRQSR .: :**. .**: :::...**. .* . *. . ..: * *::** PPK14_SCHPO KFAFSRKKEKELTSED-------SDAANQSPSGAPESQ--TEEESD-----RKIDGTGSS NRC2_NEUCR FFSVGRLRSTTVVSEGNPLDESMSPTAHANPYFAHQGQPGLRHHNDGSVPPSPPDTPSLK *:..* :.. :.**. * :*: .* * :.* ....* * .. . PPK14_SCHPO AEGGDGSGTDSISVIKKSFFKSGRKKKDVPKSRNVS---RSNG---ADTSVQRE---KLK NRC2_NEUCR VDGPEGS-QQPTAATKEELARKLRRVASAPNAQGLFSKGQGNGDRPATAELSKEPLEESK .:* :** :. :. *:.: :. *: ..*:::.: :.** * :.:.:* : * PPK14_SCHPO DIFSPHGKEKE--------------------LAHIKKTVATRARTYSSNSIKICDVEVGP NRC2_NEUCR DSNTVGFAEQKPNNDSSTSLAAPDADGLGALPPPIRQSPLAFRRTYSSNSIKVRNVEVGP * : *:: . 
*::: : *********: :***** PPK14_SCHPO SSFEKVFLLGKGDVGRVYLVREKKSGKFYAMKVLSKQEMIKRNKSKRAFAEQHILATSNH NRC2_NEUCR QSFDKIKLIGKGDVGKVYLVKEKKSGRLYAMKVLSKKEMIKRNKIKRALAEQEILATSNH .**:*: *:******:****:*****::********:******* ***:***.******* PPK14_SCHPO PFIVTLYHSFQSDEYLYLCMEYCMGGEFFRALQRRPGRCLSENEAKFYIAEVTAALEYLH NRC2_NEUCR PFIVTLYHSFQSEDYLYLCMEYCSGGEFFRALQTRPGKCIPEDDARFYAAEVTAALEYLH ************::********* ********* ***:*:.*::*:** *********** PPK14_SCHPO LMGFIYRDLKPENILLHESGHIMLSDFDLSKQSNSAGAPTVIQARNAPSAQNAYALDTKS NRC2_NEUCR LMGFIYRDLKPENILLHQSGHIMLSDFDLSKQSDPGGKPTMIIGKNGTSTSSLPTIDTKS *****************:***************:..* **:* .:*..*:.. ::**** PPK14_SCHPO CIADFRTNSFVGTEEYIAPEVIKGCGHTSAVDWWTLGILFYEMLYATTPFKGKNRNMTFS NRC2_NEUCR CIANFRTNSFVGTEEYIAPEVIKGSGHTSAVDWWTLGILIYEMLYGTTPFKGKNRNATFA ***:********************.**************:*****.********** **: PPK14_SCHPO NILHKDVIFPEYADAPSISSLCKNLIRKLLVKDENDRLGSQAGAADVKLHPFFKNVQWAL NRC2_NEUCR NILREDIPFPDHAGAPQISNLCKSLIRKLLIKDENRRLGARAGASDIKTHPFFRTTQWAL ***::*: **::*.**.**.***.******:**** ***::***:*:* ****:..**** PPK14_SCHPO LRHTEPPIIPKLAPIDEKGNPNISHLKESKSLDITHSPQNTQTVEVPLSNLSG-ADHGDD NRC2_NEUCR IRHMKPPIVPNQGRG--IDTLNFRNVKESESVDISGSRQMGLKGEPLESGMVTPGENAVD :** :***:*: . .. *: ::***:*:**: * * . * *.: .::. * PPK14_SCHPO PFESFNSVTVHHEWD NRC2_NEUCR PFEEFNSVTLHHDGDEEYHSDAYEKR ***.*****:**: * 5) Some similarity to one or more existing entries It is in this category that the adjective "putative" comes into play. For these cases, again there is no experimental proof that the protein exists and there is only limited evidence to point the protein to a particular family. Again, we have no fixed rules on what is "limited" and what isn't. It is a judgement that we make based on which family it is and which, if any, areas are conserved. Below is one example of many that exist in Swiss-Prot. From the alignments and from hits to the pattern databases we attempt to add any information so that it is not lost. 
By using "putative" in the description line we show that there is evidence within the sequence data but that we do not want to classify the protein definitively until experimental proof is available. When it is, the entry will be updated accordingly. Staying with the S. pombe project, the following entry shows this.

ID YA55_SCHPO Reviewed; 513 AA.
AC Q09735;
DT 01-NOV-1995, integrated into UniProtKB/Swiss-Prot.
DT 01-NOV-1995, sequence version 1.
DT 23-OCT-2007, entry version 56.
DE Putative aminopeptidase C13A11.05 (EC 3.4.11.-).
GN ORFNames=SPAC13A11.05;
OS Schizosaccharomyces pombe (Fission yeast).
OC Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC Schizosaccharomycetes; Schizosaccharomycetales;
OC Schizosaccharomycetaceae; Schizosaccharomyces.
OX NCBI_TaxID=4896;
RN [1]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC STRAIN=ATCC 38366 / 972;
RX MEDLINE=21848401; PubMed=11859360; DOI=10.1038/nature724;
RA Wood V., Gwilliam R., Rajandream M.A., Lyne M.H., Lyne R., Stewart A.,
RA Sgouros J.G., Peat N., Hayles J., Baker S.G., Basham D., Bowman S.,
RA Brooks K., Brown D., Brown S., Chillingworth T., Churcher C.M.,
RA Collins M., Connor R., Cronin A., Davis P., Feltwell T., Fraser A.,
RA Gentles S., Goble A., Hamlin N., Harris D.E., Hidalgo J., Hodgson G.,
RA Holroyd S., Hornsby T., Howarth S., Huckle E.J., Hunt S., Jagels K.,
RA James K.D., Jones L., Jones M., Leather S., McDonald S., McLean J.,
RA Mooney P., Moule S., Mungall K.L., Murphy L.D., Niblett D., Odell C.,
RA Oliver K., O'Neil S., Pearson D., Quail M.A., Rabbinowitsch E.,
RA Rutherford K.M., Rutter S., Saunders D., Seeger K., Sharp S.,
RA Skelton J., Simmonds M.N., Squares R., Squares S., Stevens K.,
RA Taylor K., Taylor R.G., Tivey A., Walsh S.V., Warren T., Whitehead S.,
RA Woodward J.R., Volckaert G., Aert R., Robben J., Grymonprez B.,
RA Weltjens I., Vanstreels E., Rieger M., Schaefer M., Mueller-Auer S.,
RA Gabel C., Fuchs M., Duesterhoeft A., Fritzc C., Holzer E., Moestl D.,
RA Hilbert H., Borzym
K., Langer I., Beck A., Lehrach H., Reinhardt R., RA Pohl T.M., Eger P., Zimmermann W., Wedler H., Wambutt R., Purnelle B., RA Goffeau A., Cadieu E., Dreano S., Gloux S., Lelaure V., Mottier S., RA Galibert F., Aves S.J., Xiang Z., Hunt C., Moore K., Hurst S.M., RA Lucas M., Rochet M., Gaillardin C., Tallada V.A., Garzon A., Thode G., RA Daga R.R., Cruzado L., Jimenez J., Sanchez M., del Rey F., Benito J., RA Dominguez A., Revuelta J.L., Moreno S., Armstrong J., Forsburg S.L., RA Cerutti L., Lowe T., McCombie W.R., Paulsen I., Potashkin J., RA Shpakovski G.V., Ussery D., Barrell B.G., Nurse P.; RT "The genome sequence of Schizosaccharomyces pombe."; RL Nature 415:871-880(2002). CC -!- COFACTOR: Binds 2 zinc ions per subunit (By similarity). CC -!- SUBCELLULAR LOCATION: Cytoplasm (By similarity). CC -!- SIMILARITY: Belongs to the peptidase M17 family. DR EMBL; CU329670; CAA90806.1; -; Genomic_DNA. DR PIR; T37612; T37612. DR HSSP; P00727; 1BPN. DR KEGG; spo:SPAC13A11.05; -. DR GeneDB_Spombe; SPAC13A11.05; -. DR BioCyc; SPOM-XXX-01:SPOM-XXX-01-000717-MONOMER; -. DR ArrayExpress; Q09735; -. DR InterPro; IPR011356; Peptidase_M17. DR InterPro; IPR000819; Peptidase_M17_C. DR InterPro; IPR008283; Peptidase_M17_N. DR PANTHER; PTHR11963:SF3; Peptidase_M17; 1. DR Pfam; PF00883; Peptidase_M17; 1. DR Pfam; PF02789; Peptidase_M17_N; 1. DR PIRSF; PIRSF001116; Ctsl_amnpptdse; 1. DR PRINTS; PR00481; LAMNOPPTDASE. DR PROSITE; PS00631; CYTOSOL_AP; 1. PE 2: Evidence at transcript level; KW Aminopeptidase; Complete proteome; Cytoplasm; Hydrolase; KW Metal-binding; Protease; Zinc. FT CHAIN 1 513 Putative aminopeptidase C13A11.05. FT /FTId=PRO_0000165853. FT ACT_SITE 292 292 Potential. FT ACT_SITE 366 366 Potential. FT METAL 280 280 Zinc 2 (By similarity). FT METAL 285 285 Zinc 1 (By similarity). FT METAL 285 285 Zinc 2 (By similarity). FT METAL 303 303 Zinc 2 (By similarity). FT METAL 362 362 Zinc 1 (By similarity). FT METAL 364 364 Zinc 1 (By similarity). 
FT METAL 364 364 Zinc 2 (By similarity). SQ SEQUENCE 513 AA; 56195 MW; F904CC0607502018 CRC64; MKGLGLSTRT FNWSSLSSIL LPRIPLATTK ADSLILAVRH DKQVFSEDYR QVVDQYFETS PKKNDIRLFW NTQGFVRLAI VQLEENVSEK SVRSAAAEAA KILKSNGAKS IAVDGMGFPK DAALGAALAT YDFSLRRDHL SVYQDEKVVE KENLFTSPAP ERLTFQLLSN TSEKKTATAE ENAFKVGLIE AAAQNLARSL MECPANYMTS LQFCHFAQEL FQNSSKVKVF VHDEKWIDEQ KMNGLLTVNA GSDIPPRFLE VQYIGKEKSK DDGWLGLVGK GVTFDSGGIS IKPSQNMKEM RADMGGAAVM LSSIYALEQL SIPVNAVFVT PLTENLPSGS AAKPGDVIFM RNGLSVEIDN TDAEGRLILA DAVHYVSSQY KTKAVIEAST LTGAMLVALG NVFTGAFVQG EELWKNLETA SHDAGDLFWR MPFHEAYLKQ LTSSSNADLC NVSRAGGGCC TAAAFIKCFL AQKDLSFAHL DIAGVMDKQL NSWDCDGMSG RPVRTIIEVA RKY // The alignment shows that all the functional sites are conserved i.e. metal ion binding sites and the active sites between the S. pombe sequence and the bovine one. However, because of the nature of the family it is not possible, with the evidence available, to classify this completely. Hence all available information is added and the entry is referred to as a "putative" aminopeptidase. YA55_SCHPO MKGLGLSTRTFNWSSLSSILLPRIPLATTKADSLIL-AVRHDKQVFSEDYRQVVDQYFET AMPL_BOVIN TKGLVLGIYSKEKEEDE----PQFTSAGENFNKLVSGKLREILNISGPSLKAGKTRTFYG *** *. : : .. . *::. * : :.*: :*. :: . . : : * YA55_SCHPO SPKKNDIRLFWNTQGFVRLAIVQLEENVSE--KSVRSAAAEAAKILKSNGAKSIAVDGMG AMPL_BOVIN --LHEDFPSVVVVGLGKKTAGIDEQENWHEGKENIRAAVAAGCRQIQDLEIPSVEVDPCG ::*: . . : * :: :** * :.:*:*.* ..: ::. *: ** * YA55_SCHPO FPKDAALGAALATYDFSLRRDHLSVYQDEKVVEKENLFTSPAPERLTFQLLSNTSEKKTA AMPL_BOVIN DAQAAAEGAVLGLYEYDDLK------QKRKVVVSAKLHGSEDQE---------------- .: ** **.*. *::. : *..*** . :*. * * YA55_SCHPO TAEENAFKVGLIEAAAQNLARSLMECPANYMTSLQFCHFAQELFQ-NSSKVKVFVHDEKW AMPL_BOVIN -----AWQRGVLFASGQNLARRLMETPANEMTPTKFAEIVEENLKSASIKTDVFIRPKSW *:: *:: *:.***** *** *** **. :*..:.:* :: * *..**:: :.* YA55_SCHPO IDEQKMNGLLTVNAGSDIPPRFLEVQYIGKEKSKDDGWLGLVGKGVTFDSGGISIKPSQN AMPL_BOVIN IEEQEMGSFLSVAKGSEEPPVFLEIHYKGSPNASE-PPLVFVGKGITFDSGGISIKAAAN *:**:*..:*:* **: ** ***::* *. 
::.: * :****:**********.: * YA55_SCHPO MKEMRADMGGAAVMLSSIYALEQLSIPVNAVFVTPLTENLPSGSAAKPGDVIFMRNGLSV AMPL_BOVIN MDLMRADMGGAATICSAIVSAAKLDLPINIVGLAPLCENMPSGKANKPGDVVRARNGKTI *. *********.: *:* : :*.:*:* * ::** **:***.* *****: *** :: YA55_SCHPO EIDNTDAEGRLILADAVHYVSSQYKTKAVIEASTLTGAMLVALGNVFTGAFVQGEELWKN AMPL_BOVIN QVDNTDAEGRLILADALCYAHT-FNPKVIINAATLTGAMDIALGSGATGVFTNSSWLWNK ::**************: *. : ::.*.:*:*:****** :***. **.*.:.. **:: YA55_SCHPO LETASHDAGDLFWRMPFHEAYLKQLTSSSNADLCNVSRAG-GGCCTAAAFIKCFLAQKDL AMPL_BOVIN LFEASIETGDRVWRMPLFEHYTRQVIDCQLADVNNIGKYRSAGACTAAAFLKEFVTHP-- * ** ::** .****:.* * :*: ... **: *:.: .*.******:* *::: YA55_SCHPO SFAHLDIAGVMD-KQLNSWDCDGMSGRPVRTIIEVARKY----- AMPL_BOVIN KWAHLDIAGVMTNKDEVPYLRKGMAGRPTRTLIEFLFRFSQDSA .:********* *: .: .**:***.**:**. :: 6) No similarity to any existing entries From the genome sequencing data the majority of proteins translated from predicted open reading frames have no sequence similarity to any existing proteins. In these cases the proteins remain "hypothetical". It should be noted here that we analyze these sequences by a number of programs so that we can at least add some potential information, rather than having just an entry containing submission and sequence data. Again, in these cases, care is taken to show that this information is potential so that it cannot be mixed up with data from classified proteins. The features we currently look for are signal sequences, transmembrane regions, coiled coil domains and a number of conserved domains described in PROSITE, Pfam and SMART. a) Signal sequence prediction We make use of the SignalP program [R1] in its latest implementation (version 3.0). The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks and hidden Markov models. The result in the entry is of the type: FT SIGNAL 1 x Potential. 
FT CHAIN x y

b) Transmembrane region prediction

Transmembrane helices are predicted using the TMHMM (version 2.0) program [R2], which we have found [R3] to give the best results. In some cases we complement the results of this method with predictions obtained with two other programs, ESKM [R4] and MEMSAT [R5]. Predicted transmembrane helices are indicated as:

FT TRANSMEM x y Potential.

c) Coiled coil prediction

We make use of a program based on the algorithm of Lupas et al. [R6] that predicts coiled coil regions within the sequence. A positive result of this program is:

FT DOMAIN x y Coiled coil (Potential).

d) REP

The program REP [R7] is used to annotate a number of well defined, yet very variable protein repeats. The program currently recognizes the following types of repeats: Ankyrin, Armadillo, HAT, HEAT, HEAT_AAA, HEAT_ADB, HEAT_IMB, Kelch, Leucine-rich Repeats, PFTA, PFTB, RCC1, TPR and WD40. Repeats detected by this program are annotated at the level of the feature tables; specific keywords and CC lines are also added to the entry. In the following example the lines that have been indented are those where information has been added following the detection of a repeat:

ID YEX2_SCHPO Reviewed; 361 AA.
AC O13856;
DT 16-AUG-2004, integrated into UniProtKB/Swiss-Prot.
DT 01-JAN-1998, sequence version 1.
DT 23-OCT-2007, entry version 37.
DE Uncharacterized WD repeat-containing protein C1A6.02.
GN ORFNames=SPAC1A6.02, SPAC23C4.21;
OS Schizosaccharomyces pombe (Fission yeast).
OC Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC Schizosaccharomycetes; Schizosaccharomycetales;
OC Schizosaccharomycetaceae; Schizosaccharomyces.
OX NCBI_TaxID=4896;
RN [1]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC STRAIN=ATCC 38366 / 972; RX MEDLINE=21848401; PubMed=11859360; DOI=10.1038/nature724; RA Wood V., Gwilliam R., Rajandream M.A., Lyne M.H., Lyne R., Stewart A., RA Sgouros J.G., Peat N., Hayles J., Baker S.G., Basham D., Bowman S., RA Brooks K., Brown D., Brown S., Chillingworth T., Churcher C.M., RA Collins M., Connor R., Cronin A., Davis P., Feltwell T., Fraser A., RA Gentles S., Goble A., Hamlin N., Harris D.E., Hidalgo J., Hodgson G., RA Holroyd S., Hornsby T., Howarth S., Huckle E.J., Hunt S., Jagels K., RA James K.D., Jones L., Jones M., Leather S., McDonald S., McLean J., RA Mooney P., Moule S., Mungall K.L., Murphy L.D., Niblett D., Odell C., RA Oliver K., O'Neil S., Pearson D., Quail M.A., Rabbinowitsch E., RA Rutherford K.M., Rutter S., Saunders D., Seeger K., Sharp S., RA Skelton J., Simmonds M.N., Squares R., Squares S., Stevens K., RA Taylor K., Taylor R.G., Tivey A., Walsh S.V., Warren T., Whitehead S., RA Woodward J.R., Volckaert G., Aert R., Robben J., Grymonprez B., RA Weltjens I., Vanstreels E., Rieger M., Schaefer M., Mueller-Auer S., RA Gabel C., Fuchs M., Duesterhoeft A., Fritzc C., Holzer E., Moestl D., RA Hilbert H., Borzym K., Langer I., Beck A., Lehrach H., Reinhardt R., RA Pohl T.M., Eger P., Zimmermann W., Wedler H., Wambutt R., Purnelle B., RA Goffeau A., Cadieu E., Dreano S., Gloux S., Lelaure V., Mottier S., RA Galibert F., Aves S.J., Xiang Z., Hunt C., Moore K., Hurst S.M., RA Lucas M., Rochet M., Gaillardin C., Tallada V.A., Garzon A., Thode G., RA Daga R.R., Cruzado L., Jimenez J., Sanchez M., del Rey F., Benito J., RA Dominguez A., Revuelta J.L., Moreno S., Armstrong J., Forsburg S.L., RA Cerutti L., Lowe T., McCombie W.R., Paulsen I., Potashkin J., RA Shpakovski G.V., Ussery D., Barrell B.G., Nurse P.; RT "The genome sequence of Schizosaccharomyces pombe."; RL Nature 415:871-880(2002). CC -!- SIMILARITY: Contains 6 WD repeats. DR EMBL; Z99258; CAB16352.1; -; Genomic_DNA. DR PIR; T38005; T38005. DR KEGG; spo:SPAC1A6.02; -. 
DR GeneDB_Spombe; SPAC1A6.02; -. DR ArrayExpress; O13856; -. DR GO; GO:0005730; C:nucleolus; IDA:GeneDB_SPombe. DR InterPro; IPR001680; WD40. DR Pfam; PF00400; WD40; 1. DR SMART; SM00320; WD40; 5. DR PROSITE; PS00678; WD_REPEATS_1; FALSE_NEG. DR PROSITE; PS50082; WD_REPEATS_2; FALSE_NEG. DR PROSITE; PS50294; WD_REPEATS_REGION; 1. PE 2: Evidence at transcript level; KW Complete proteome; Repeat; WD repeat. FT CHAIN 1 361 Uncharacterized WD repeat-containing FT protein C1A6.02. FT /FTId=PRO_0000051486. FT REPEAT 57 96 WD 1. FT REPEAT 103 142 WD 2. FT REPEAT 146 184 WD 3. FT REPEAT 187 229 WD 4. FT REPEAT 237 275 WD 5. FT REPEAT 280 318 WD 6. SQ SEQUENCE 361 AA; 39780 MW; 38DD785710325C03 CRC64; MGGTINAAIK QKFENEIFDL ACFGENQVLL GFSNGRVSSY QYDVAQISLV EQWSTKRHKK SCRNISVNES GTEFISVGSD GVLKIADTST GRVSSKWIVD KNKEISPYSV VQWIENDMVF ATGDDNGCVS VWDKRTEGGI IHTHNDHIDY ISSISPFEER YFVATSGDGV LSVIDARNFK KPILSEEQDE EMTCGAFTRD QHSKKKFAVG TASGVITLFT KGDWGDHTDR ILSPIRSHDF SIETITRADS DSLYVGGSDG CIRLLHILPN KYERIIGQHS SRSTVDAVDV TTEGNFLVSC SGTELAFWPV DQKEGDESSS SDNLDSDEDS SSDSEFSSPK KKKKVGNQGK KPLGTDFFDG L // e) PROSITE PROSITE (http://www.expasy.org/prosite/), the database of protein domains and families, plays a very big role in the addition of features in Swiss-Prot entries, especially when no other information is available for the sequence. Where patterns are matched this can lead to the addition of comment lines, keywords, features either individually or in any combination. As an example: ID NOP12_SCHPO Reviewed; 438 AA. AC O13741; DT 02-NOV-2001, integrated into UniProtKB/Swiss-Prot. DT 01-JAN-1998, sequence version 1. DT 23-OCT-2007, entry version 52. DE Nucleolar protein 12. GN Name=nop12; ORFNames=SPAC16E8.06c; OS Schizosaccharomyces pombe (Fission yeast). OC Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; OC Schizosaccharomycetes; Schizosaccharomycetales; OC Schizosaccharomycetaceae; Schizosaccharomyces. 
OX NCBI_TaxID=4896; RN [1] RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA]. RC STRAIN=ATCC 38366 / 972; RX MEDLINE=21848401; PubMed=11859360; DOI=10.1038/nature724; RA Wood V., Gwilliam R., Rajandream M.A., Lyne M.H., Lyne R., Stewart A., RA Sgouros J.G., Peat N., Hayles J., Baker S.G., Basham D., Bowman S., RA Brooks K., Brown D., Brown S., Chillingworth T., Churcher C.M., RA Collins M., Connor R., Cronin A., Davis P., Feltwell T., Fraser A., RA Gentles S., Goble A., Hamlin N., Harris D.E., Hidalgo J., Hodgson G., RA Holroyd S., Hornsby T., Howarth S., Huckle E.J., Hunt S., Jagels K., RA James K.D., Jones L., Jones M., Leather S., McDonald S., McLean J., RA Mooney P., Moule S., Mungall K.L., Murphy L.D., Niblett D., Odell C., RA Oliver K., O'Neil S., Pearson D., Quail M.A., Rabbinowitsch E., RA Rutherford K.M., Rutter S., Saunders D., Seeger K., Sharp S., RA Skelton J., Simmonds M.N., Squares R., Squares S., Stevens K., RA Taylor K., Taylor R.G., Tivey A., Walsh S.V., Warren T., Whitehead S., RA Woodward J.R., Volckaert G., Aert R., Robben J., Grymonprez B., RA Weltjens I., Vanstreels E., Rieger M., Schaefer M., Mueller-Auer S., RA Gabel C., Fuchs M., Duesterhoeft A., Fritzc C., Holzer E., Moestl D., RA Hilbert H., Borzym K., Langer I., Beck A., Lehrach H., Reinhardt R., RA Pohl T.M., Eger P., Zimmermann W., Wedler H., Wambutt R., Purnelle B., RA Goffeau A., Cadieu E., Dreano S., Gloux S., Lelaure V., Mottier S., RA Galibert F., Aves S.J., Xiang Z., Hunt C., Moore K., Hurst S.M., RA Lucas M., Rochet M., Gaillardin C., Tallada V.A., Garzon A., Thode G., RA Daga R.R., Cruzado L., Jimenez J., Sanchez M., del Rey F., Benito J., RA Dominguez A., Revuelta J.L., Moreno S., Armstrong J., Forsburg S.L., RA Cerutti L., Lowe T., McCombie W.R., Paulsen I., Potashkin J., RA Shpakovski G.V., Ussery D., Barrell B.G., Nurse P.; RT "The genome sequence of Schizosaccharomyces pombe."; RL Nature 415:871-880(2002). 
CC -!- FUNCTION: Involved in pre-25S rRNA processing (By similarity).
CC -!- SUBCELLULAR LOCATION: Nucleus, nucleolus (By similarity).
CC -!- SIMILARITY: Belongs to the RRM RBM34 family.
CC -!- SIMILARITY: Contains 2 RRM (RNA recognition motif) domains.
DR EMBL; CU329670; CAB11047.1; -; Genomic_DNA.
DR PIR; T37786; T37786.
DR HSSP; P33240; 1P1T.
DR KEGG; spo:SPAC16E8.06c; -.
DR GeneDB_Spombe; SPAC16E8.06c; -.
DR BioCyc; SPOM-XXX-01:SPOM-XXX-01-001830-MONOMER; -.
DR ArrayExpress; O13741; -.
DR GO; GO:0005730; C:nucleolus; IDA:GeneDB_SPombe.
DR InterPro; IPR012677; a_b_plait_nuc_bd.
DR InterPro; IPR000504; RRM_RNP1.
DR Gene3D; G3DSA:3.30.70.330; a_b_plait_nuc_bd; 2.
DR Pfam; PF00076; RRM_1; 2.
DR SMART; SM00360; RRM; 2.
DR PROSITE; PS50102; RRM; 2.
PE 2: Evidence at transcript level;
KW Complete proteome; Nucleus; Repeat; Ribosome biogenesis; RNA-binding;
KW rRNA processing.
FT CHAIN 1 438 Nucleolar protein 12.
FT /FTId=PRO_0000081673.
FT DOMAIN 164 262 RRM 1.
FT DOMAIN 270 348 RRM 2.
FT COMPBIAS 20 23 Poly-Ser.
FT COMPBIAS 81 90 Poly-Lys.
SQ SEQUENCE 438 AA; 49381 MW; 3E943401F95E7C12 CRC64;
MGETNSSLDN ENTSFVGKLS SSSNVDPTLN LLFSQSKPIP KPVAKETTVL TKKDVEVEEA
NGVEEAAETI ESDTKEVQNI KPKSKKKKKK LNDSSDDIEG KYFEELLAEE DEEKDKDSAG
LINDEEDKSP AKQSVLEERT SQEDVKSERE VAEKLANELE KSDKTVFVNN LPARVVTNKG
DYKDLTKHFR QFGAVDSIRF RSLAFSEAIP RKVAFFEKKF HSERDTVNAY IVFRDSSSAR
SALSLNGTMF MDRHLRVDSV SHPMPQDTKR CVFVGNLAFE AEEEPLWRYF GDCGSIDYVR
IVRDPKTNLG KGFAYIQFKD TMGVDKALLL NEKKMPEGRT LRIMRAKSTK PKSITRSKRG
DEKTRTLQGR ARKLIGKAGN ALLQQELALE GHRAKPGENP LAKKKVNKKR KERAAQWRNK
KAESVGKKQK TAAGKKDK
//

In the above example note the PROSITE entries represented in the DR lines. These matches have helped in the addition of the similarity comment and the RNA-binding RRM domains to the feature table. We have a method that automatically annotates a number of sites or domains using PROSITE patterns and profiles.
All features copied into the feature table by using this facility are closely assessed to ensure that they are valid for the particular sequence from that particular organism.

f) Pfam

Pfam [R10] (http://www.sanger.ac.uk/Software/Pfam/) is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. Great use is made of this database, in conjunction with PROSITE, for the automatic addition of annotation to TrEMBL entries. It also provides important information for the curators as they begin to annotate TrEMBL entries by highlighting the type of domain the sequence has.

g) Tyrosine sulfation sites

Tyrosine sulfation sites are predicted using a software tool called the Sulfinator [R12]. The Sulfinator employs four different hidden Markov models. The program is only run on eukaryotic proteins that are predicted or supposed to be secreted or to have at least one extracellular domain. The sulfation sites are indicated as being "Potential". Example:

FT MOD_RES 200 200 Sulfotyrosine (Potential).

Extradom.txt

This file outlines the nomenclature proposal for domains (or modules) found mainly in extracellular proteins of higher eukaryotes. It shows the standard nomenclature applied to these classified domains in Swiss-Prot entries. It can be found via the Web at http://www.expasy.org/cgi-bin/lists?extradom.txt

It is one of numerous documents (all of which are visible from: http://www.expasy.org/sprot/sp-docu.html) that are distributed with Swiss-Prot.

Please note that when a modification or a binding event is annotated, "potential" is added to show that it has not been determined experimentally. Below is an example of such cases.

ID YA9A_SCHPO Reviewed; 530 AA.
AC Q09788;
DT 01-NOV-1995, integrated into UniProtKB/Swiss-Prot.
DT 01-NOV-1995, sequence version 1.
DT 23-OCT-2007, entry version 40.
DE Uncharacterized serine-rich protein C13G6.10c precursor.
GN ORFNames=SPAC13G6.10c;
OS Schizosaccharomyces pombe (Fission yeast).
OC Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; OC Schizosaccharomycetes; Schizosaccharomycetales; OC Schizosaccharomycetaceae; Schizosaccharomyces. OX NCBI_TaxID=4896; RN [1] RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA]. RC STRAIN=ATCC 38366 / 972; RX MEDLINE=21848401; PubMed=11859360; DOI=10.1038/nature724; RA Wood V., Gwilliam R., Rajandream M.A., Lyne M.H., Lyne R., Stewart A., RA Sgouros J.G., Peat N., Hayles J., Baker S.G., Basham D., Bowman S., RA Brooks K., Brown D., Brown S., Chillingworth T., Churcher C.M., RA Collins M., Connor R., Cronin A., Davis P., Feltwell T., Fraser A., RA Gentles S., Goble A., Hamlin N., Harris D.E., Hidalgo J., Hodgson G., RA Holroyd S., Hornsby T., Howarth S., Huckle E.J., Hunt S., Jagels K., RA James K.D., Jones L., Jones M., Leather S., McDonald S., McLean J., RA Mooney P., Moule S., Mungall K.L., Murphy L.D., Niblett D., Odell C., RA Oliver K., O'Neil S., Pearson D., Quail M.A., Rabbinowitsch E., RA Rutherford K.M., Rutter S., Saunders D., Seeger K., Sharp S., RA Skelton J., Simmonds M.N., Squares R., Squares S., Stevens K., RA Taylor K., Taylor R.G., Tivey A., Walsh S.V., Warren T., Whitehead S., RA Woodward J.R., Volckaert G., Aert R., Robben J., Grymonprez B., RA Weltjens I., Vanstreels E., Rieger M., Schaefer M., Mueller-Auer S., RA Gabel C., Fuchs M., Duesterhoeft A., Fritzc C., Holzer E., Moestl D., RA Hilbert H., Borzym K., Langer I., Beck A., Lehrach H., Reinhardt R., RA Pohl T.M., Eger P., Zimmermann W., Wedler H., Wambutt R., Purnelle B., RA Goffeau A., Cadieu E., Dreano S., Gloux S., Lelaure V., Mottier S., RA Galibert F., Aves S.J., Xiang Z., Hunt C., Moore K., Hurst S.M., RA Lucas M., Rochet M., Gaillardin C., Tallada V.A., Garzon A., Thode G., RA Daga R.R., Cruzado L., Jimenez J., Sanchez M., del Rey F., Benito J., RA Dominguez A., Revuelta J.L., Moreno S., Armstrong J., Forsburg S.L., RA Cerutti L., Lowe T., McCombie W.R., Paulsen I., Potashkin J., RA Shpakovski G.V., Ussery D., Barrell B.G., 
Nurse P.;
RT "The genome sequence of Schizosaccharomyces pombe.";
RL Nature 415:871-880(2002).
DR EMBL; CU329670; CAA91103.1; -; Genomic_DNA.
DR PIR; S62439; S62439.
DR KEGG; spo:SPAC13G6.10c; -.
DR GeneDB_Spombe; SPAC13G6.10c; -.
DR BioCyc; SPOM-XXX-01:SPOM-XXX-01-000580-MONOMER; -.
DR ArrayExpress; Q09788; -.
DR GO; GO:0005783; C:endoplasmic reticulum; IDA:GeneDB_SPombe.
DR GO; GO:0005794; C:Golgi apparatus; IDA:GeneDB_SPombe.
DR InterPro; IPR013781; Glyco_hydro_cat.
DR Gene3D; G3DSA:3.20.20.80; Glyco_hydro_cat; 1.
PE 2: Evidence at transcript level;
KW Complete proteome; Glycoprotein; Signal.
FT SIGNAL 1 18 Potential.
FT CHAIN 19 530 Uncharacterized serine-rich protein
FT C13G6.10c.
FT /FTId=PRO_0000014190.
FT CARBOHYD 55 55 N-linked (GlcNAc...) (Potential).
FT CARBOHYD 120 120 N-linked (GlcNAc...) (Potential).
FT CARBOHYD 128 128 N-linked (GlcNAc...) (Potential).
SQ SEQUENCE 530 AA; 54211 MW; 1C6A0261F63DFF02 CRC64;
MRTTFATVAL AFLSTVGALP YAPNHRHHRR DDDGVLTVYE TILETVYVTA VPGANSSSSY
TSYSTGLASV TESSDDGAST ALPTTSTESV VVTTSAPAAS SSATSYPATF VSTPLYTMDN
VTAPVWSNTS VPVSTPETSA TSSSEFFTSY PATSSESSSS YPASSTEVAS SYSASSTEVT
SSYPASSEVA TSTSSYVAPV SSSVASSSEI SAGSATSYVP TSSSSIALSS VVASASVSAA
NKGVSTPAVS SAAASSSAVV SSVVSSATSV AASSTISSAT SSSASASPTS SSVSGKRGLA
WIPGTDLGYS DNFVNKGINW YYNWGSYSSG LSSSFEYVLN QHDANSLSSA SSVFTGGATV
IGFNEPDLSA AGNPIDAATA ASYYLQYLTP LRESGAIGYL GSPAISNVGE DWLSEFMSAC
SDCKIDFIAC HWYGIDFSNL QDYINSLANY GLPIWLTEFA CTNWDDSNLP SLDEVKTLMT
SALGFLDGHG SVERYSWFAP ATELGAGVGN NNALISSSGG LSEVGEIYIS
//

=======
Summary
=======

This has been an introduction to the world of annotation at Swiss-Prot. There are numerous sources of information available to the curators and it is our job to assess these and to add only relevant information to the entries. These sources are primarily publications reporting the isolation of particular genes and proteins. Not only biochemical data are gleaned from them but also molecular biology and genetic information.
For example, thousands of Swiss-Prot entries report alternative splicing
as well as genetic map information. Reading publications is coupled with
examining the data bank itself. To maintain consistency, all new entries
are checked, via alignments, to see whether they belong to a particular
family. If so, information is copied from similar entries and checked at
the same time.

All sources of information are given in Swiss-Prot entries. The reference
blocks show what is represented in the corresponding publication(s) and
therefore act as sources for the information given in the entry. This can
be direct sequencing of the isolated protein (RP SEQUENCE), sequencing of
the gene encoding the protein (RP SEQUENCE FROM N.A.), biochemical studies
(RP CHARACTERIZATION) or 3D studies (e.g. RP X-RAY CRYSTALLOGRAPHY), to
name but a few. It should be noted that in the earlier days of Swiss-Prot
annotation, characterization studies may have been carried out but were
represented only as "SEQUENCE FROM N.A." It would be possible to alter
these retrospectively, but doing so would divert effort from our current,
labor-intensive process of making new sequences available.

The annotation of Swiss-Prot entries requires extensive knowledge of all
types of proteins, a complete understanding of the Swiss-Prot database
itself, and skill in assessing the output of alignment programs and
pattern databases. All of these must be considered together for each
individual sequence, and all information resulting from these sources is
carefully assessed before addition to the entry. Every effort is therefore
made to ensure that the features and comments in Swiss-Prot are complete
and correct and have pointers to the information source.

Note: a short version of this document was originally published as:

Junker V.L., Apweiler R., Bairoch A.
Representation of functional information in the Swiss-Prot data bank.
Bioinformatics 15:1066-1067(1999).
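
As an illustration of how the reference blocks described in the Summary
point back to their sources, the following is a minimal Python sketch that
collects the RP (reference position) text for each RN (reference number)
in a flat-file entry. It assumes the usual layout of a two-letter line code
followed by three blanks; the function name reference_positions and the
variable ENTRY_EXCERPT are hypothetical names chosen for this example and
are not part of any UniProt tool. The excerpt is taken from the sample
entry shown above.

```python
# Excerpt of the reference block from the sample entry (Q09788).
ENTRY_EXCERPT = """\
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=ATCC 38366 / 972;
RX   MEDLINE=21848401; PubMed=11859360; DOI=10.1038/nature724;
RL   Nature 415:871-880(2002).
"""


def reference_positions(entry_text):
    """Map each reference number (RN) to the text of its RP line(s).

    Assumes each line starts with a two-letter line code and that the
    data begins at column 6, as in the sample entry.
    """
    positions = {}
    current = None
    for line in entry_text.splitlines():
        code, data = line[:2], line[5:].strip()
        if code == "RN":
            # RN data looks like "[1]"; keep only the number.
            current = int(data.strip("[]"))
            positions[current] = []
        elif code == "RP" and current is not None:
            # An RP statement may span several lines; collect them all.
            positions[current].append(data)
    # Join continuation lines and drop the terminating period.
    return {ref: " ".join(parts).rstrip(".") for ref, parts in positions.items()}


print(reference_positions(ENTRY_EXCERPT))
```

For the excerpt above, the single reference is reported as a large-scale
genomic DNA sequencing paper, which is consistent with the absence of any
biochemical characterization in the entry.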


==================
Methods references
==================

[R1] Nielsen H., Engelbrecht J., Brunak S., von Heijne G.
     Identification of prokaryotic and eukaryotic signal peptides and
     prediction of their cleavage sites.
     Protein Eng. 10:1-6(1997).
     PubMed=9051728;

[R2] Krogh A., Larsson B., von Heijne G., Sonnhammer E.L.L.
     Predicting transmembrane protein topology with a hidden Markov model:
     application to complete genomes.
     J. Mol. Biol. 305:567-580(2001).
     PubMed=11152613;

[R3] Moeller S., Croning M.D.R., Apweiler R.
     Evaluation of methods for the prediction of membrane spanning regions.
     Bioinformatics 17:646-653(2001).

[R4] Eisenberg D., Schwarz E., Komaromy M., Wall R.
     Analysis of membrane and surface protein sequences with the
     hydrophobic moment plot.
     J. Mol. Biol. 179:125-142(1984).

[R5] Jones D.T., Taylor W.R., Thornton J.M.
     A model recognition approach to the prediction of all-helical membrane
     protein structure and topology.
     Biochemistry 33:3038-3049(1994).

[R6] Lupas A., Van Dyke M., Stock J.
     Predicting coiled coils from protein sequences.
     Science 252:1162-1164(1991).
     PubMed=2031185;

[R7] Andrade M.A., Ponting C., Gibson T., Bork P.
     Identification of protein repeats and statistical significance of
     sequence comparisons.
     J. Mol. Biol. 298:521-537(2000).

[R8] Apweiler R., Attwood T.K., Bairoch A., Bateman A., Birney E.,
     Biswas M., Bucher P., Cerutti L., Corpet F., Croning M.D., Durbin R.,
     Falquet L., Fleischmann W., Gouzy J., Hermjakob H., Hulo N.,
     Jonassen I., Kahn D., Kanapin A., Karavidopoulou Y., Lopez R.,
     Marx B., Mulder N.J., Oinn T.M., Pagni M., Servant F., Sigrist C.J.,
     Zdobnov E.M.
     InterPro -- an integrated documentation resource for protein families,
     domains and functional sites.
     Bioinformatics 16:1145-1150(2000).
     PubMed=11125043;

[R9] Falquet L., Pagni M., Bucher P., Hulo N., Sigrist C.J., Hofmann K.,
     Bairoch A.
     The PROSITE database, its status in 2002.
     Nucleic Acids Res. 30:235-238(2002).
     PubMed=11752303;

[R10] Bateman A., Birney E., Cerruti L., Durbin R., Etwiller L., Eddy S.R.,
      Griffiths-Jones S., Howe K.L., Marshall M., Sonnhammer E.L.
      The Pfam protein families database.
      Nucleic Acids Res. 30:276-280(2002).

[R11] Ponting C.P., Schultz J., Milpetz F., Bork P.
      SMART: identification and annotation of domains from signalling and
      extracellular protein sequences.
      Nucleic Acids Res. 27:229-232(1999).
      PubMed=9847187;

[R12] Monigatti F., Gasteiger E., Bairoch A., Jung E.
      The Sulfinator: predicting tyrosine sulfation sites in protein
      sequences.
      Bioinformatics 18:769-770(2002).
      PubMed=12050077;

-----------------------------------------------------------------------
Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
Distributed under the Creative Commons Attribution-NoDerivs License
-----------------------------------------------------------------------