UniProt
Swiss-ProtTrEMBL
UniProt Knowledgebase
Swiss-Prot Protein Knowledgebase
TrEMBL Protein Database

User Manual
Release 13.5 of 10-Jun-2008

Table of contents

1. What is the UniProt Knowledgebase?
1.1 The Swiss-Prot Protein Knowledgebase
1.2 The computer-annotated supplement TrEMBL
2. Conventions used in the database
2.1 General structure of the database
2.2 Status
2.3 Structure of a sequence entry
2.4 Non-experimental qualifiers
3. The different line types
3.1 The ID line
3.2 The AC line
3.3 The DT line
3.4 The DE line
3.5 The GN line
3.6 The OS line
3.7 The OG line
3.8 The OC line
3.9 The OX line
3.10 The OH line
3.11 The reference (RN, RP, RC, RX, RG, RA, RT, RL) lines
3.12 The CC line
3.13 The DR line
3.14 The PE line
3.15 The KW line
3.16 The FT line
3.17 The SQ line
3.18 The sequence data line
3.19 The // line
Appendix A : Amino-acid codes
Appendix B : Format differences between the Swiss-Prot and EMBL databases
B.1 Generalities
B.2 Differences in line types present in both databases
B.3 Line types defined by Swiss-Prot but currently not used by EMBL
B.4 Line types defined by EMBL but currently not used by Swiss-Prot
Appendix C : Documentation files
Appendix D : FTP access to Swiss-Prot and TrEMBL
D.1 Generalities
D.2 UniProt Knowledgebase
D.3 Biweekly updates of UniProtKB documents
D.4 Biweekly (cumulative) updates of Swiss-Prot
Appendix E : Relationships between Swiss-Prot and some biomolecular databases
1. What is the UniProt Knowledgebase?

Until recently, the EBI/SIB Swiss-Prot + TrEMBL databases and the PIR Protein Sequence Database (PIR-PSD) coexisted as protein databases with differing protein sequence coverage and annotation priorities. In 2002, EBI, SIB, and PIR (at the Georgetown University Medical Center and National Biomedical Research Foundation) joined forces as the UniProt consortium. The primary mission of the consortium is to support biological research by maintaining a high quality database that serves as a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces freely accessible to the scientific community.

The UniProt Knowledgebase (UniProtKB) provides the central database of protein sequences with accurate, consistent, rich sequence and functional annotation.

The UniProt Knowledgebase consists of two sections: Swiss-Prot - a section containing manually-annotated records with information extracted from literature and curator-evaluated computational analysis, and TrEMBL - a section with computationally analyzed records that await full manual annotation.

1.1. The Swiss-Prot Protein Knowledgebase

Swiss-Prot is an annotated protein sequence database. It was established in 1986 and has been maintained collaboratively, since 1987, by the group of Amos Bairoch, first at the Department of Medical Biochemistry of the University of Geneva and now at the Swiss Institute of Bioinformatics (SIB), and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI)). The Swiss-Prot Protein Knowledgebase consists of sequence entries. Sequence entries are composed of different line types, each with its own format. For standardization purposes the format of Swiss-Prot follows as closely as possible that of the EMBL Nucleotide Sequence Database.

Swiss-Prot distinguishes itself from other protein sequence databases by four distinct criteria:

a) Annotation

In Swiss-Prot, as in many sequence databases, two classes of data can be distinguished: the core data and the annotation.

For each sequence entry the core data consists of:

  • The sequence data;
  • The citation information (bibliographical references);
  • The taxonomic data (description of the biological source of the protein).

The annotation consists of the description of the following items:

  • Function(s) of the protein;
  • Posttranslational modification(s) such as carbohydrates, phosphorylation, acetylation and GPI-anchor;
  • Domains and sites, for example, calcium-binding regions, ATP-binding sites, zinc fingers, homeoboxes, SH2 and SH3 domains and kringle;
  • Secondary structure, e.g. alpha helix, beta sheet;
  • Quaternary structure, e.g. homodimer, heterotrimer, etc.;
  • Similarities to other proteins;
  • Disease(s) associated with any number of deficiencies in the protein;
  • Sequence conflicts, variants, etc.

We try to include as much annotation information as possible in Swiss-Prot. To obtain this information we use, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins. We also make use of external experts, who have been recruited to send us their comments and updates concerning specific groups of proteins.

We believe that having systematic recourse both to publications other than those reporting the core data and to subject referees represents a unique and beneficial feature of Swiss-Prot.

In Swiss-Prot, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). Most comments are classified by 'topics'; this approach permits the easy retrieval of specific categories of data from the database.

b) Minimal redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

c) Integration with other databases

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections. Swiss-Prot is currently cross-referenced to more than 50 different databases. Cross-references are provided in the form of pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot. This extensive network of cross-references allows Swiss-Prot to play a major role as a focal point of biomolecular database interconnectivity.

d) Documentation

Swiss-Prot is distributed with a large number of index files and specialized documentation files. Some of these files have been available for a long time (this user manual, the release notes, the various indices for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files. The 'Documentation files' section (Appendix C) contains an up-to-date descriptive list of all distributed document files.

1.2. The computer-annotated supplement TrEMBL

TrEMBL is the computer-annotated section of the UniProt Knowledgebase. It contains translations of all coding regions in the DDBJ/EMBL/GenBank nucleotide databases, and protein sequences extracted from the literature or submitted to UniProtKB, which are not yet integrated into Swiss-Prot. TrEMBL allows these sequences to be made publicly available quickly without diluting the high quality annotation found in Swiss-Prot.

The information in a TrEMBL entry is initially derived directly from the underlying DDBJ/EMBL/GenBank nucleotide entry and the quality of data is directly dependent on the information provided by the submitter of the nucleotide entry. This information may be enhanced later by automatic annotation procedures (see below) but if not, it remains as provided by the submitter until the entry is manually annotated and added to Swiss-Prot.

After creation of a TrEMBL entry, a number of steps are taken to improve the data quality for users:

a) Automatic annotation

Records waiting in TrEMBL for full manual annotation are enhanced by automatic annotation. Information is transferred from well-characterised entries in Swiss-Prot to unannotated entries in TrEMBL which belong to groups defined by InterPro, a database of protein families, domains and functional sites. This process brings the standard of annotation in TrEMBL closer to that found in Swiss-Prot through the addition of accurate, high-quality information to TrEMBL entries, thus improving the quality of data available to the user.

b) Redundancy removal

Sequences from the same organism which are full-length and which have 100% identity are merged into a single entry to reduce redundancy.

c) Evidence attribution

TrEMBL contains data from a variety of sources, including data imported from the underlying nucleotide databases, data from specific programs, data from automatic annotation procedures, and some manual annotation. As it is essential for users to be able to identify the source of individual data items, a system of evidence attribution has been introduced. This system also allows UniProtKB staff to automatically update data if the underlying data source changes. This work is ongoing internally, and the evidence tags are currently visible in the XML version of TrEMBL. For more information, please see http://www.uniprot.org/support/docs/evidence.shtml.

2. Conventions used in the database

The following sections describe the general conventions used in the knowledgebase to achieve uniformity of presentation. Experienced users of the EMBL Database can skip these sections and refer directly to Appendix B, which lists the minor differences in format between the two data collections.

2.1. General structure of the database

The UniProt Knowledgebase is composed of sequence entries. Each entry corresponds to a single contiguous sequence as contributed to the bank or reported in the literature. In some cases, entries have been assembled from several papers that report overlapping sequence regions. Conversely, a single paper can provide data for several entries, e.g. when related sequences from different organisms are reported.

References to positions within a sequence are made using sequential numbering, beginning with 1 at the N-terminal end of the sequence.

The sequence data correspond to the precursor form of a protein before posttranslational modifications and processing.
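
The sequential numbering described above starts at 1, whereas most programming languages index strings from 0. The following is a minimal sketch of the conversion, assuming positions are given as an inclusive start/end pair; the function name feature_slice is hypothetical and not part of any UniProt tool.

```python
def feature_slice(sequence: str, start: int, end: int) -> str:
    """Return residues start..end (1-based, inclusive) of a protein sequence."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError(
            f"positions {start}..{end} fall outside 1..{len(sequence)}"
        )
    # Position 1 is the N-terminal residue, which is Python index 0.
    return sequence[start - 1:end]


# A short illustrative sequence (10 residues).
seq = "MRNSYRFLAS"
print(feature_slice(seq, 1, 3))    # MRN
print(feature_slice(seq, 10, 10))  # S
```

Validating the range up front avoids the silent truncation that plain Python slicing would otherwise allow.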

2.2. Status

To distinguish the fully annotated entries in the Swiss-Prot section of the UniProt Knowledgebase from the computer-annotated entries in the TrEMBL section, the 'status' of each entry is indicated in the first (ID) line of each entry. The two defined classes are:

  • Reviewed: Entries that have been manually reviewed and annotated by UniProtKB curators (Swiss-Prot section of the UniProt Knowledgebase).
  • Unreviewed: Computer-annotated entries that have not been reviewed by UniProtKB curators (TrEMBL section of the UniProt Knowledgebase).

2.3. Structure of a sequence entry

The entries in the UniProt Knowledgebase are structured so as to be usable by human readers as well as by computer programs. The explanations, descriptions, classifications and other comments are in ordinary English. Wherever possible, symbols familiar to biochemists, protein chemists and molecular biologists are used.

Each sequence entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry. A sample sequence entry is shown below.

ID   GRAA_HUMAN              Reviewed;         262 AA.
AC   P12544; Q6IB36;
DT   01-OCT-1989, integrated into UniProtKB/Swiss-Prot.
DT   01-OCT-1989, sequence version 1.
DT   15-JAN-2008, entry version 101.
DE   Granzyme A precursor (EC 3.4.21.78) (Cytotoxic T-lymphocyte proteinase
DE   1) (Hanukkah factor) (H factor) (HF) (Granzyme-1) (CTL tryptase)
DE   (Fragmentin-1).
GN   Name=GZMA; Synonyms=CTLA3, HFSP;
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA].
RC   TISSUE=T-cell;
RX   MEDLINE=88125000; PubMed=3257574;
RA   Gershenfeld H.K., Hershberger R.J., Shows T.B., Weissman I.L.;
RT   "Cloning and chromosomal assignment of a human cDNA encoding a T cell-
RT   and natural killer cell-specific trypsin-like serine protease.";
RL   Proc. Natl. Acad. Sci. U.S.A. 85:1184-1188(1988).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RA   Ebert L., Schick M., Neubert P., Schatten R., Henze S., Korn B.;
RT   "Cloning of human full open reading frames in Gateway(TM) system entry
RT   vector (pDONR201).";
RL   Submitted (JUN-2004) to the EMBL/GenBank/DDBJ databases.
RN   [3]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RC   TISSUE=Blood;
RX   PubMed=15489334; DOI=10.1101/gr.2596504;
RG   The MGC Project Team;
RT   "The status, quality, and expansion of the NIH full-length cDNA
RT   project: the Mammalian Gene Collection (MGC).";
RL   Genome Res. 14:2121-2127(2004).
RN   [4]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA] OF 1-23.
RA   Goralski T.J., Krensky A.M.;
RT   "The upstream region of the human granzyme A locus contains both
RT   positive and negative transcriptional regulatory elements.";
RL   Submitted (NOV-1995) to the EMBL/GenBank/DDBJ databases.
RN   [5]
RP   PROTEIN SEQUENCE OF 29-53.
RX   MEDLINE=88330824; PubMed=3047119;
RA   Poe M., Bennett C.D., Biddison W.E., Blake J.T., Norton G.P.,
RA   Rodkey J.A., Sigal N.H., Turner R.V., Wu J.K., Zweerink H.J.;
RT   "Human cytotoxic lymphocyte tryptase. Its purification from granules
RT   and the characterization of inhibitor and substrate specificity.";
RL   J. Biol. Chem. 263:13215-13222(1988).
RN   [6]
RP   PROTEIN SEQUENCE OF 29-40, AND CHARACTERIZATION.
RX   MEDLINE=89009866; PubMed=3262682;
RA   Hameed A., Lowrey D.M., Lichtenheld M., Podack E.R.;
RT   "Characterization of three serine esterases isolated from human IL-2
RT   activated killer cells.";
RL   J. Immunol. 141:3142-3147(1988).
RN   [7]
RP   PROTEIN SEQUENCE OF 29-39, AND CHARACTERIZATION.
RX   MEDLINE=89035468; PubMed=3263427;
RA   Kraehenbuhl O., Rey C., Jenne D.E., Lanzavecchia A., Groscurth P.,
RA   Carrel S., Tschopp J.;
RT   "Characterization of granzymes A and B isolated from granules of
RT   cloned human cytotoxic T lymphocytes.";
RL   J. Immunol. 141:3471-3477(1988).
RN   [8]
RP   3D-STRUCTURE MODELING OF 29-262.
RX   MEDLINE=89184501; PubMed=3237717; DOI=10.1002/prot.340040306;
RA   Murphy M.E.P., Moult J., Bleackley R.C., Gershenfeld H.,
RA   Weissman I.L., James M.N.G.;
RT   "Comparative molecular model building of two serine proteinases from
RT   cytotoxic T lymphocytes.";
RL   Proteins 4:190-204(1988).
RN   [9]
RP   X-RAY CRYSTALLOGRAPHY (2.4 ANGSTROMS) OF 29-262 IN COMPLEX WITH A
RP   TRIPEPTIDE CMK INHIBITOR.
RX   MEDLINE=22708839; PubMed=12819769; DOI=10.1038/nsb944;
RA   Bell J.K., Goetz D.H., Mahrus S., Harris J.L., Fletterick R.J.,
RA   Craik C.S.;
RT   "The oligomeric structure of human granzyme A is a determinant of its
RT   extended substrate specificity.";
RL   Nat. Struct. Biol. 10:527-534(2003).
RN   [10]
RP   X-RAY CRYSTALLOGRAPHY (2.5 ANGSTROMS) OF 29-262 IN COMPLEX WITH
RP   SUBSTRATE.
RX   MEDLINE=22708840; PubMed=12819770; DOI=10.1038/nsb945;
RA   Hink-Schauer C., Estebanez-Perpina E., Kurschus F.C., Bode W.,
RA   Jenne D.E.;
RT   "Crystal structure of the apoptosis-inducing human granzyme A dimer.";
RL   Nat. Struct. Biol. 10:535-540(2003).
CC   -!- FUNCTION: This enzyme is necessary for target cell lysis in cell-
CC       mediated immune responses. It cleaves after Lys or Arg. May be
CC       involved in apoptosis.
CC   -!- CATALYTIC ACTIVITY: Hydrolysis of proteins, including fibronectin,
CC       type IV collagen and nucleolin. Preferential cleavage: -Arg-|-
CC       Xaa-, -Lys-|-Xaa- >> -Phe-|-Xaa- in small molecule substrates.
CC   -!- SUBUNIT: Homodimer; disulfide-linked.
CC   -!- INTERACTION:
CC       Self; NbExp=1; IntAct=EBI-519800, EBI-519800;
CC   -!- SUBCELLULAR LOCATION: Secreted. Cytoplasmic granule.
CC   -!- SIMILARITY: Belongs to the peptidase S1 family. Granzyme
CC       subfamily.
CC   -!- SIMILARITY: Contains 1 peptidase S1 domain.
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; M18737; AAA52647.1; -; mRNA.
DR   EMBL; CR456968; CAG33249.1; -; mRNA.
DR   EMBL; BC015739; AAH15739.1; -; mRNA.
DR   EMBL; U40006; AAD00009.1; -; Genomic_DNA.
DR   PIR; A31372; A31372.
DR   RefSeq; NP_006135.1; -.
DR   UniGene; Hs.90708; -.
DR   PDB; 1HF1; Model; -; A=29-262.
DR   PDB; 1OP8; X-ray; 2.50 A; A/B/C/D/E/F=29-262.
DR   PDB; 1ORF; X-ray; 2.40 A; A=29-262.
DR   PDBsum; 1HF1; -.
DR   PDBsum; 1OP8; -.
DR   PDBsum; 1ORF; -.
DR   IntAct; P12544; -.
DR   MEROPS; S01.135; -.
DR   Ensembl; ENSG00000145649; Homo sapiens.
DR   GeneID; 3001; -.
DR   KEGG; hsa:3001; -.
DR   H-InvDB; HIX0004862; -.
DR   HGNC; HGNC:4708; GZMA.
DR   MIM; 140050; gene.
DR   PharmGKB; PA29086; -.
DR   HOGENOM; P12544; -.
DR   HOVERGEN; P12544; -.
DR   LinkHub; P12544; -.
DR   ArrayExpress; P12544; -.
DR   CleanEx; HS_GZMA; -.
DR   GermOnline; ENSG00000145649; Homo sapiens.
DR   GO; GO:0001772; C:immunological synapse; TAS:UniProtKB.
DR   GO; GO:0005634; C:nucleus; TAS:UniProtKB.
DR   GO; GO:0004277; F:granzyme A activity; IDA:UniProtKB.
DR   GO; GO:0042803; F:protein homodimerization activity; IDA:UniProtKB.
DR   GO; GO:0006922; P:cleavage of lamin; IDA:UniProtKB.
DR   GO; GO:0006955; P:immune response; TAS:UniProtKB.
DR   InterPro; IPR001254; Peptidase_S1_S6.
DR   InterPro; IPR001314; Peptidase_S1A.
DR   Pfam; PF00089; Trypsin; 1.
DR   PRINTS; PR00722; CHYMOTRYPSIN.
DR   SMART; SM00020; Tryp_SPc; 1.
DR   PROSITE; PS50240; TRYPSIN_DOM; 1.
DR   PROSITE; PS00134; TRYPSIN_HIS; 1.
DR   PROSITE; PS00135; TRYPSIN_SER; 1.
PE   1: Evidence at protein level;
KW   3D-structure; Apoptosis; Cytolysis; Direct protein sequencing;
KW   Glycoprotein; Hydrolase; Polymorphism; Protease; Secreted;
KW   Serine protease; Signal; Zymogen.
FT   SIGNAL        1     26
FT   PROPEP       27     28       Activation peptide.
FT                                /FTId=PRO_0000027393.
FT   CHAIN        29    262       Granzyme A.
FT                                /FTId=PRO_0000027394.
FT   DOMAIN       29    259       Peptidase S1.
FT   ACT_SITE     69     69       Charge relay system.
FT   ACT_SITE    114    114       Charge relay system.
FT   ACT_SITE    212    212       Charge relay system.
FT   CARBOHYD    170    170       N-linked (GlcNAc...) (Potential).
FT   DISULFID     54     70
FT   DISULFID    148    218
FT   DISULFID    179    197
FT   DISULFID    208    234
FT   VARIANT     121    121       T -> M (in dbSNP:rs3104233).
FT                                /FTId=VAR_024291.
FT   STRAND       43     47
FT   STRAND       49     51
FT   STRAND       53     60
FT   STRAND       63     66
FT   STRAND       77     81
FT   STRAND       83     87
FT   STRAND       93     95
FT   STRAND       97    102
FT   TURN        108    111
FT   STRAND      116    122
FT   STRAND      147    153
FT   STRAND      167    174
FT   HELIX       176    179
FT   TURN        182    187
FT   STRAND      195    199
FT   STRAND      215    218
FT   STRAND      221    228
FT   STRAND      241    245
FT   HELIX       248    259
**
**   #################    INTERNAL SECTION    ##################
SQ   SEQUENCE   262 AA;  28969 MW;  DA87363A0D92BAF4 CRC64;
     MRNSYRFLAS SLSVVVSLLL IPEDVCEKII GGNEVTPHSR PYMVLLSLDR KTICAGALIA
     KDWVLTAAHC NLNKRSQVIL GAHSITREEP TKQIMLVKKE FPYPCYDPAT REGDLKLLQL
     TEKAKINKYV TILHLPKKGD DVKPGTMCQV AGWGRTHNSA SWSDTLREVN ITIIDRKVCN
     DRNHYNFNPV IGMNMVCAGS LRGGRDSCNG DSGSPLLCEG VFRGVTSFGL ENKCGDPRGP
     GVYILLSKKH LNWIIMTIKG AV
//

Entries from the TrEMBL section follow the same format. For format differences see the description of the distinct line types.

Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types and line codes, and the order in which they appear in an entry, are shown in the table below.

Line code  Content                       Occurrence in an entry
ID         Identification                Once; starts the entry
AC         Accession number(s)           Once or more
DT         Date                          Three times
DE         Description                   Once or more
GN         Gene name(s)                  Optional
OS         Organism species              Once or more
OG         Organelle                     Optional
OC         Organism classification       Once or more
OX         Taxonomy cross-reference      Once
OH         Organism host                 Optional
RN         Reference number              Once or more
RP         Reference position            Once or more
RC         Reference comment(s)          Optional
RX         Reference cross-reference(s)  Optional
RG         Reference group               Once or more (Optional if RA line)
RA         Reference authors             Once or more (Optional if RG line)
RT         Reference title               Optional
RL         Reference location            Once or more
CC         Comments or notes             Optional
DR         Database cross-references     Optional
PE         Protein existence             Once
KW         Keywords                      Optional
FT         Feature table data            Once or more
SQ         Sequence header               Once
(blanks)   Sequence data                 Once or more
//         Termination line              Once; ends the entry

As shown in the above table, some line types are found in all entries, others are optional. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//).

A detailed description of each line type is given in the next section of this document. It must be noted that, with the exception of GN, all line types exist in the EMBL Database. A description of the format differences between the UniProt Knowledgebase and EMBL databases is given in Appendix B.

The two-character line-type code that begins each line is always followed by three blanks, so that the actual information begins with the sixth character. In general, information is not extended beyond character position 75. There are, however, a few exceptions where lines may be longer (e.g. OH lines and CC lines that contain the 'WEB RESOURCE' topic (see section 3.12)).
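
The fixed layout described above (a two-character code, three blanks, data from the sixth character) makes the flat file straightforward to split. The following is a minimal sketch, not an official parser: the function names iter_entry_lines and looks_complete are hypothetical, and sequence data lines (which carry blanks instead of a code) are reported with an empty code.

```python
from typing import Iterator, Tuple


def iter_entry_lines(entry_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (line_code, data) pairs for the lines of one flat-file entry."""
    for line in entry_text.splitlines():
        if not line.strip():
            continue  # ignore empty lines between entries
        if line.startswith("//"):
            yield "//", ""
        elif line.startswith("     "):
            # Sequence data lines have blanks where the line code would be.
            yield "", line[5:]
        else:
            # Two-character code, three blanks, data from the sixth character.
            yield line[:2], line[5:]


def looks_complete(entry_text: str) -> bool:
    """Check that an entry starts with an ID line and ends with '//'."""
    codes = [code for code, _ in iter_entry_lines(entry_text)]
    return bool(codes) and codes[0] == "ID" and codes[-1] == "//"


sample = """\
ID   GRAA_HUMAN              Reviewed;         262 AA.
AC   P12544; Q6IB36;
//
"""
for code, data in iter_entry_lines(sample):
    print(repr(code), repr(data))
print(looks_complete(sample))
```

Slicing by column position rather than splitting on whitespace keeps leading blanks in the data, which matters for continuation lines such as those in the feature table.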

2.4. Non-experimental qualifiers

Three types of non-experimental qualifiers are used in comment (CC) lines and feature table (FT) lines to indicate that the information given is not based on experimentally proven findings:

  • Potential
  • Probable
  • By similarity

The term 'Potential' indicates that there is some logical or conclusive evidence that the given annotation could apply. This non-experimental qualifier is often used to present the results of protein sequence analysis tools, which are annotated only if the result makes sense in the context of a given protein. A typical example is the annotation of N-glycosylation sites in the entries of non-cytoplasmic domains or proteins.

The term 'Probable' is stronger than the qualifier 'Potential': there must be at least some experimental evidence indicating that the given information is expected to be found in the natural environment of a protein.

'By similarity' is added to facts that were proven for a protein or part of it, and which are then transferred to other protein family members within a certain taxonomic range, dependent on the biological event or characteristic. Non-experimental qualifiers are also assigned to biologically important sites found within conserved domains, e.g. active sites within an enzymatic domain or disulfide bonds that stabilize the structure of extracellular modules.

Examples of the usage of non-experimental qualifiers are described in the document annbioch.txt.

3. The different line types

3.1. The ID line

The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:

ID   EntryName Status; SequenceLength.

3.1.1. Entry name

The first item on the ID line is the entry name of the sequence. This name is a useful means of identifying a sequence, but it is not a stable identifier as is the accession number (see 3.2). The ID tracker ascertains the relevant accession numbers for Swiss-Prot entry names that are no longer in use.

a) Swiss-Prot entry names

The Swiss-Prot entry name consists of up to 11 uppercase alphanumeric characters. Swiss-Prot uses a general purpose naming convention that can be symbolized as X_Y, where:

  • X is a mnemonic code of at most 5 alphanumeric characters representing the protein name. Examples: B2MG is for Beta-2-microglobulin, HBA is for Hemoglobin alpha chain and INS is for Insulin, CAD17 for Cadherin-17;
  • The '_' sign serves as a separator;
  • Y is a mnemonic species identification code of at most 5 alphanumeric characters representing the biological source of the protein. This code is generally made of the first three letters of the genus and the first two letters of the species.

Examples:

PSEPU is for Pseudomonas putida and NAJNI is for Naja nivea.

However, for species most commonly encountered in the database, self-explanatory codes are used. There are 16 of those codes: BOVIN for Bovine, CHICK for Chicken, ECOLI for Escherichia coli, HORSE for Horse, HUMAN for Human, MAIZE for Maize (Zea mays), MOUSE for Mouse, PEA for Garden pea (Pisum sativum), PIG for Pig, RABIT for Rabbit, RAT for Rat, SHEEP for Sheep, SOYBN for Soybean (Glycine max), TOBAC for Common tobacco (Nicotiana tabacum), WHEAT for Wheat (Triticum aestivum), and YEAST for Baker's yeast (Saccharomyces cerevisiae).

As it was not possible to apply the above rules to viruses, they were given arbitrary, but generally easy-to-remember identification codes.

Examples of complete protein sequence entry names are: RL1_ECOLI for ribosomal protein L1 from Escherichia coli, AFTIN_HUMAN for Aftiphilin from human, and SODC_DROME for Superoxide dismutase [Cu-Zn] from Drosophila melanogaster.

The names of all the presently-defined species identification codes are listed in the document file speclist.txt.

b) TrEMBL entry names

The TrEMBL entry name consists of up to 12 uppercase alphanumeric characters. TrEMBL uses a general purpose naming convention similar to that of Swiss-Prot, where:

  • X is identical to the accession number of the entry;
  • The '_' sign serves as a separator;
  • Y is a mnemonic species identification code.

As it is not possible in a reasonable timeframe to manually assign organism codes to all species represented in TrEMBL, "virtual" codes have been defined that regroup organisms at a certain taxonomic level. Such codes are prefixed by the number "9" and generally correspond to a "pool" of organisms, which can be as wide as a kingdom. Here are some examples of such codes:

9BACT B      2: N=Bacteria
9CNID E   6073: N=Cnidaria
9FUNG E   4751: N=Fungi
9REOV V  10880: N=Reoviridae
9TETR E  32523: N=Tetrapoda
9VIRI E  33090: N=Viridiplantae

These types of "virtual" codes are also listed in the document file speclist.txt.

Examples of complete TrEMBL entry names are O95417_HUMAN, Q9VVG0_DROME, P71025_BACSU or Q9SR52_ARATH.
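
The X_Y naming convention can be taken apart mechanically. The sketch below is a heuristic, not an official rule: it assumes that a six-character X is an accession number (TrEMBL style) and that a shorter X is a protein mnemonic (Swiss-Prot style), based only on the length limits stated in this section. The function name classify_entry_name and the dictionary keys are hypothetical.

```python
def classify_entry_name(entry_name: str) -> dict:
    """Split an X_Y entry name and guess which naming style it follows."""
    x, separator, species = entry_name.partition("_")
    if not separator or not x or not species or len(species) > 5:
        raise ValueError(f"not a valid X_Y entry name: {entry_name!r}")
    if len(x) <= 5:
        style = "Swiss-Prot"   # mnemonic protein code, at most 5 characters
    elif len(x) == 6:
        style = "TrEMBL"       # X is the 6-character accession number
    else:
        raise ValueError(f"X part too long in entry name: {entry_name!r}")
    return {
        "x": x,
        "species": species,
        "style": style,
        # "Virtual" codes that pool several organisms start with the digit 9.
        "virtual_species": species.startswith("9"),
    }


print(classify_entry_name("RL1_ECOLI"))
print(classify_entry_name("O95417_HUMAN"))
```

The authoritative status of an entry is the one given on the ID line; a guess based on the name alone should only be used when that line is not available.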

3.1.2. Status

The second item on the ID line indicates the status of the entry (see section 2.2).

3.1.3. Length of the molecule

The third and last item of the ID line is the length of the molecule, which is the total number of amino acids in the sequence. This number includes the positions reported to be present but which have not been determined (coded as 'X'). The length is followed by the letter code 'AA' (Amino Acids).

3.1.4. Examples of identification lines

Two examples of Swiss-Prot ID lines are shown below:

ID   CYC_BOVIN               Reviewed;         104 AA.
ID   GIA2_GIALA              Reviewed;         296 AA.

Example of a TrEMBL ID line:

ID   Q5JU06_HUMAN            Unreviewed;       268 AA.
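
The ID line examples above can be read with a single regular expression. This is a minimal sketch that assumes the three items are separated by runs of blanks as in the examples; the names ID_LINE and parse_id_line are hypothetical.

```python
import re

# ID   EntryName Status; SequenceLength.
ID_LINE = re.compile(
    r"^ID {3}(?P<name>\S+)\s+(?P<status>Reviewed|Unreviewed);\s+(?P<length>\d+) AA\.$"
)


def parse_id_line(line: str):
    """Return (entry_name, status, sequence_length) from an ID line."""
    match = ID_LINE.match(line.rstrip())
    if match is None:
        raise ValueError(f"not a valid ID line: {line!r}")
    return match["name"], match["status"], int(match["length"])


print(parse_id_line("ID   CYC_BOVIN               Reviewed;         104 AA."))
print(parse_id_line("ID   Q5JU06_HUMAN            Unreviewed;       268 AA."))
```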
3.2. The AC line

The AC (ACcession number) line lists the accession number(s) associated with an entry. The format of the AC line is:

AC   AC_number_1;[ AC_number_2;]...[ AC_number_N;]

An example of an accession number line is shown below:

AC   P00321;

Semicolons separate the accession numbers and a semicolon terminates the list. If necessary, more than one AC line can be used. Example:

AC   Q16653; O00713; O00714; O00715; Q13054; Q13055; Q14855; Q92891;
AC   Q92892; Q92893; Q92894; Q92895; Q93053; Q96KU9; Q96KV0; Q96KV1;
AC   Q99605;

The purpose of accession numbers is to provide a stable way of identifying entries from release to release. It is sometimes necessary for reasons of consistency to change the names of the entries, for example, to ensure that related entries have similar names. However, an accession number is always conserved, and therefore allows unambiguous citation of entries.

Researchers who wish to cite entries in their publications should always cite the first accession number. This is commonly referred to as the 'primary accession number'. 'Secondary accession numbers' are sorted alphanumerically.
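
Because the accession list may span several AC lines, a program should collect all of them before picking the primary accession number. The following is a minimal sketch; the function name parse_ac_lines is hypothetical.

```python
def parse_ac_lines(lines):
    """Return (primary_accession, secondary_accessions) from the AC lines."""
    accessions = []
    for line in lines:
        if not line.startswith("AC   "):
            continue
        # Semicolons separate the accession numbers and terminate the list.
        accessions.extend(
            token.strip() for token in line[5:].split(";") if token.strip()
        )
    if not accessions:
        raise ValueError("no AC line found")
    # The first accession number is the primary one, to be used for citation.
    return accessions[0], accessions[1:]


ac_lines = [
    "AC   Q16653; O00713; O00714; O00715; Q13054; Q13055; Q14855; Q92891;",
    "AC   Q92892; Q92893; Q92894; Q92895; Q93053; Q96KU9; Q96KV0; Q96KV1;",
    "AC   Q99605;",
]
primary, secondary = parse_ac_lines(ac_lines)
print(primary)         # Q16653
print(len(secondary))  # 16
```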

We strongly advise those users who have programs performing mappings of Swiss-Prot to another data resource to use Swiss-Prot accession numbers to identify an entry.

Entries will have more than one accession number if they have been merged or split. For example, when two entries are merged into one, the accession numbers from both entries are stored in the AC line(s).

If an existing entry is split into two or more entries (a rare occurrence), the original accession numbers are retained in all the derived entries and a new primary accession number is added to all the entries.

An accession number is dropped only when the data to which it was assigned have been completely removed from the database. Accession numbers deleted from Swiss-Prot are listed in the document file delac_sp.txt and those deleted from TrEMBL are listed in delac_tr.txt.

Accession numbers consist of 6 alphanumerical characters in the following format:

1          2      3           4           5           6
[A-N,R-Z]  [0-9]  [A-Z]       [A-Z, 0-9]  [A-Z, 0-9]  [0-9]
[O,P,Q]    [0-9]  [A-Z, 0-9]  [A-Z, 0-9]  [A-Z, 0-9]  [0-9]

Here are some examples of valid accession numbers: P12345, Q1AAA9, O456A1 and P4A123.
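
The two patterns in the table above translate directly into a regular expression. This sketch covers only the six-character format described in this release of the manual; the names ACCESSION and is_valid_accession are hypothetical.

```python
import re

# First alternative:  [A-N,R-Z] [0-9] [A-Z]      [A-Z,0-9] [A-Z,0-9] [0-9]
# Second alternative: [O,P,Q]   [0-9] [A-Z,0-9]  [A-Z,0-9] [A-Z,0-9] [0-9]
ACCESSION = re.compile(
    r"(?:[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]"
    r"|[OPQ][0-9][A-Z0-9]{3}[0-9])"
)


def is_valid_accession(accession: str) -> bool:
    """Check a string against the six-character accession number format."""
    return ACCESSION.fullmatch(accession) is not None


for candidate in ["P12345", "Q1AAA9", "O456A1", "P4A123", "A12345", "P1234"]:
    print(candidate, is_valid_accession(candidate))
```

Note that A12345 is rejected: for first letters other than O, P and Q, the third position must be a letter.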

3.3. The DT line

The DT (DaTe) lines show the date of creation and last modification of the database entry.

The format of the DT line in Swiss-Prot is:

DT   DD-MMM-YYYY, integrated into UniProtKB/database_name.
DT   DD-MMM-YYYY, sequence version x.
DT   DD-MMM-YYYY, entry version x.

Where 'DD' is the day, 'MMM' the month and 'YYYY' the year. The dates shown in DT lines correspond to the date of the biweekly release at which an entry was integrated or updated. There are always three DT lines in each entry, each associated with a specific comment:

  • The first DT line indicates when the entry first appeared in the database. The associated comment, 'integrated into UniProtKB/database_name', indicates in which section of UniProtKB, Swiss-Prot or TrEMBL, the entry can be found;
  • The second DT line indicates when the sequence data was last modified. The associated comment, 'sequence version', indicates the sequence version number. The sequence version number of an entry is incremented by one when the amino acid sequence shown in the sequence record is modified;
  • The third DT line indicates when data other than the sequence was last modified. The associated comment, 'entry version', indicates the entry version number. The entry version number is incremented by one whenever any data in the flat file representation of the entry is modified.

Example of a block of Swiss-Prot DT lines:

DT   01-OCT-1996, integrated into UniProtKB/Swiss-Prot.
DT   01-OCT-1996, sequence version 1.
DT   07-FEB-2006, entry version 49.

Example of a block of TrEMBL DT lines:

DT   01-FEB-1999, integrated into UniProtKB/TrEMBL.
DT   15-OCT-2000, sequence version 2.
DT   15-DEC-2004, entry version 5.

Whenever the sequence of an entry is updated there is always also an annotation update. The date in the third DT line is thus always at least as recent as the one in the second DT line.

Note that sequence and entry versions are not reset when an entry moves from TrEMBL to Swiss-Prot. The date of integration into Swiss-Prot can therefore be more recent than the last sequence update, as in the following example:

DT   25-OCT-2005, integrated into UniProtKB/Swiss-Prot.
DT   01-NOV-1996, sequence version 1.
DT   07-FEB-2006, entry version 35.
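
The DT conventions described above can be checked mechanically. The following Python sketch is only an illustration and not part of any UniProt tool: the function name parse_dt_block and the dictionary keys are assumptions chosen for this example. It parses the three DT lines of one entry into dates and version numbers.

```python
import re
from datetime import date

# Month abbreviations as they appear in DT lines (locale-independent lookup).
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

DT_RE = re.compile(
    r"^DT   (\d{2})-([A-Z]{3})-(\d{4}), "
    r"(?:integrated into UniProtKB/(?P<section>Swiss-Prot|TrEMBL)"
    r"|sequence version (?P<seq>\d+)"
    r"|entry version (?P<entry>\d+))\.$"
)


def parse_dt_block(lines):
    """Map each of the three DT comments to a (date, value) pair."""
    info = {}
    for line in lines:
        match = DT_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized DT line: {line!r}")
        day, month, year = match.group(1, 2, 3)
        when = date(int(year), MONTHS[month], int(day))
        if match.group("section"):
            info["integrated"] = (when, match.group("section"))
        elif match.group("seq"):
            info["sequence"] = (when, int(match.group("seq")))
        else:
            info["entry"] = (when, int(match.group("entry")))
    return info


block = [
    "DT   25-OCT-2005, integrated into UniProtKB/Swiss-Prot.",
    "DT   01-NOV-1996, sequence version 1.",
    "DT   07-FEB-2006, entry version 35.",
]
info = parse_dt_block(block)
```

For this block, the entry date is at least as recent as the sequence date, while the integration date is later than the sequence date, as permitted by the rules above.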

A comprehensive archive of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL entry versions is available: the UniProtKB Sequence/Annotation Version Database (UniSave). Unlike the UniProt Knowledgebase, which contains only the latest Swiss-Prot and TrEMBL entry and sequence versions, UniSave provides access to all versions of these entries. This makes it possible to track sequence changes and to find out when a given annotation appeared in an entry and how it evolved.

3.4. The DE line

The DE (DEscription) lines contain general descriptive information about the sequence stored. This information is generally sufficient to identify the protein precisely.

The format of the DE line is:

DE   Description.

The description is given in ordinary English (using US-spelling) and is free-format.

In cases where more than one DE line is required, the text is only divided between words and only the last DE line is terminated by a period.

a) The DE line in Swiss-Prot

The description always starts with the recommended name of the protein. Alternative names are indicated in parentheses. Example:

DE   Annexin A5 (Annexin V) (Lipocortin V) (Endonexin II) (Calphobindin I)
DE   (CBP-I) (Placental anticoagulant protein I) (PAP-I) (PP4)
DE   (Thromboplastin inhibitor) (Vascular anticoagulant-alpha) (VAC-alpha)
DE   (Anchorin CII).

Protein naming guidelines are described in the document file nameprot.txt.

If a protein is known to be cleaved into multiple functional components, the description starts with the name of the precursor protein, followed by a section delimited by '[Contains: ...]'. All the individual components are listed in that section and are separated by semi-colons (';'). Synonyms are allowed at the level of the precursor and for each individual component. Example:

DE   Corticotropin-lipotropin precursor (Pro-opiomelanocortin) (POMC)
DE   [Contains: NPP; Melanotropin gamma (Gamma-MSH); Potential peptide;
DE   Corticotropin (Adrenocorticotropic hormone) (ACTH); Melanotropin alpha
DE   (Alpha-MSH); Corticotropin-like intermediary peptide (CLIP);
DE   Lipotropin beta (Beta-LPH); Lipotropin gamma (Gamma-LPH); Melanotropin
DE   beta (Beta-MSH); Beta-endorphin; Met-enkephalin].

If a protein is known to include multiple functional domains each of which is described by a different name, the description starts with the name of the overall protein, followed by a section delimited by '[Includes: ...]'. All the domains are listed in that section and are separated by semi-colons (';'). Synonyms are allowed at the level of the protein and for each individual domain. Example:

DE   CAD protein [Includes: Glutamine-dependent carbamoyl-phosphate
DE   synthase (EC 6.3.5.5); Aspartate carbamoyltransferase (EC 2.1.3.2);
DE   Dihydroorotase (EC 3.5.2.3)].

In rare cases, the functional domains of an enzyme are cleaved, but the catalytic activity can only be observed when the individual chains reorganize in a complex. Such proteins are described in the DE line by a combination of both '[Includes:...]' and '[Contains:...]', in the order given in the following example:

DE   Arginine biosynthesis bifunctional protein argJ [Includes: Glutamate
DE   N-acetyltransferase (EC 2.3.1.35) (Ornithine acetyltransferase)
DE   (Ornithine transacetylase) (OATase); Amino-acid acetyltransferase
DE   (EC 2.3.1.1) (N-acetylglutamate synthase) (AGS)] [Contains: Arginine
DE   biosynthesis bifunctional protein argJ alpha chain; Arginine
DE   biosynthesis bifunctional protein argJ beta chain].

If the complete sequence is not determined, the last information given on the DE lines is '(Fragment)' or '(Fragments)'. Example:

DE   Dihydrodipicolinate reductase (EC 1.3.1.26) (DHPR) (Fragment).
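
The Swiss-Prot DE conventions above can be illustrated with a short Python sketch. It is a simplified reading of the format, not an official parser: the name parse_de and the returned tuple layout are assumptions made for this example, and synonyms in parentheses are deliberately left attached to the names they follow.

```python
import re

# Matches one '[Includes: ...]' or '[Contains: ...]' section.
SECTION_RE = re.compile(r"\s*\[(Includes|Contains): (.*?)\]")


def parse_de(lines):
    """Split a DE block into main description, sections and fragment flag."""
    # DE text is only divided between words, so lines are joined with a space.
    text = " ".join(line[5:].strip() for line in lines).rstrip(".")
    sections = {"Includes": [], "Contains": []}
    for label, body in SECTION_RE.findall(text):
        sections[label] = body.split("; ")
    remainder = SECTION_RE.sub("", text).strip()
    is_fragment = remainder.endswith(("(Fragment)", "(Fragments)"))
    return remainder, sections, is_fragment


cad = [
    "DE   CAD protein [Includes: Glutamine-dependent carbamoyl-phosphate",
    "DE   synthase (EC 6.3.5.5); Aspartate carbamoyltransferase (EC 2.1.3.2);",
    "DE   Dihydroorotase (EC 3.5.2.3)].",
]
name, sections, is_fragment = parse_de(cad)
```

Here name holds the description without the bracketed sections, and sections["Includes"] lists the three domains of the CAD protein.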

b) The DE line in TrEMBL

The format of the DE line in TrEMBL follows closely the format used in Swiss-Prot. However, as TrEMBL is not manually annotated, the description is derived directly from the underlying nucleotide entry and its accuracy relies on the information provided by the submitter of the nucleotide entry. The description may later be improved by automatic annotation procedures (see section Automatic annotation) but if not, it remains as provided by the submitter until the entry is manually annotated and added to Swiss-Prot.

3.5. The GN line

The GN (Gene Name) line indicates the name(s) of the gene(s) that code for the stored protein sequence. The GN line contains three types of information:

  1. Gene names (a.k.a. gene symbols). The name(s) used to represent a gene. As there can be more than one name assigned to a gene, we make a distinction between the one which we believe should be used as the official gene name and the other names which are listed as "Synonyms".
  2. Ordered locus names (a.k.a. OLN, ORF numbers, CDS numbers or Gene numbers). A name used to represent an ORF in a completely sequenced genome or chromosome. It is generally based on a prefix representing the organism and a number which usually represents the sequential ordering of genes on the chromosome. Depending on the genome sequencing center, numbers are only attributed to protein-coding genes, or also to pseudogenes, or also to tRNAs and other features. If two predicted genes have been merged to form a new gene, both gene identifiers are indicated, separated by a slash (see last example). Examples: HI0934, Rv3245c, At5g34500, YER456W, YAR042W/YAR044W.
  3. ORF names (a.k.a. sequencing names or contig names or temporary ORFNames). A name temporarily attributed by a sequencing project to an open reading frame. This name is generally based on a cosmid numbering system. Examples: MtCY277.28c, SYGP-ORF50, SpBC2F12.04, C06E1.1, CG10954.

The format of the GN line is:

GN   Name=<name>; Synonyms=<name1>[, <name2>...]; OrderedLocusNames=<name1>[, <name2>...];
GN   ORFNames=<name1>[, <name2>...];

None of the four tokens above is mandatory, but a "Synonyms" token can only be present if there is a "Name" token.

If there is more than one gene, GN line blocks for the different genes are separated by the following line:

GN   and

Example:

GN   Name=Jon99Cii; Synonyms=SER1, SER5, Ser99Da; ORFNames=CG7877;
GN   and
GN   Name=Jon99Ciii; Synonyms=SER2, SER5, Ser99Db; ORFNames=CG15519;

Wrapping is done preferentially at a semicolon, otherwise at a comma.

It often occurs that more than one name has been assigned to an individual locus, in which case all the synonyms will be listed alphabetically and case-insensitively. Example:

GN   Name=hns; Synonyms=bglY, cur, drdX, hnsA, msyA, osmZ, pilG, topS;
GN   OrderedLocusNames=b1237, c1701, z2013, ECs1739;
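
As an illustration of the GN token format, the following Python sketch splits a GN block into one dictionary per gene. The function name parse_gn is hypothetical, and the sketch assumes that names contain neither ';' nor ','.

```python
def parse_gn(lines):
    """Return one dict per gene described in a block of GN lines."""
    # Group the line bodies per gene; 'GN   and' separates two genes.
    groups, current = [], []
    for line in lines:
        body = line[5:].strip()
        if body == "and":
            groups.append(current)
            current = []
        else:
            current.append(body)
    groups.append(current)

    genes = []
    for chunks in groups:
        gene = {}
        # Lines wrap at ';' or ',', so joining with a space is safe.
        for field in " ".join(chunks).split(";"):
            field = field.strip()
            if not field:
                continue
            token, _, value = field.partition("=")
            if token == "Name":
                gene[token] = value
            else:
                gene[token] = [name.strip() for name in value.split(",")]
        if "Synonyms" in gene and "Name" not in gene:
            raise ValueError("'Synonyms' requires a 'Name' token")
        genes.append(gene)
    return genes


genes = parse_gn([
    "GN   Name=Jon99Cii; Synonyms=SER1, SER5, Ser99Da; ORFNames=CG7877;",
    "GN   and",
    "GN   Name=Jon99Ciii; Synonyms=SER2, SER5, Ser99Db; ORFNames=CG15519;",
])
```

The result contains two genes; the first has the official name Jon99Cii, three synonyms and one ORF name.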

3.6. The OS line

The OS (Organism Species) line specifies the organism which was the source of the stored sequence. In the rare case where all the species information will not fit on a single line, more than one OS line is used. The last OS line is terminated by a period.

The species designation consists, in most cases, of the Latin genus and species designation followed by the English name (in parentheses). For viruses, only the common English name is given.

Examples of OS lines are shown here:

OS   Escherichia coli.
OS   Homo sapiens (Human).
OS   Solanum melongena (Eggplant) (Aubergine).
OS   Rous sarcoma virus (strain Schmidt-Ruppin A) (RSV-SRA) (Avian leukosis
OS   virus-RSA).

The names (official name, common name, synonym) concerning one species are cut across lines when they do not fit into a single line:

OS   Epizootic hemorrhagic disease virus 2 (strain Alberta) (EHDV-2).

3.7. The OG line

The OG (OrGanelle) line indicates if the gene coding for a protein originates from the mitochondria, the chloroplast, the cyanelle, the nucleomorph or a plasmid.

The format of the OG line is:

OG   Hydrogenosome.
OG   Mitochondrion.
OG   Nucleomorph.
OG   Plasmid name.
OG   Plastid.
OG   Plastid; Apicoplast.
OG   Plastid; Chloroplast.
OG   Plastid; Cyanelle.
OG   Plastid; Non-photosynthetic plastid.

Where 'name' is the name of the plasmid.

If an entry reports the sequence of a protein identical in a number of plasmids, the names of these plasmids will all be listed in the OG lines of that entry. The plasmid names are separated by commas, and the last plasmid name is preceded by the word 'and'. Plasmid names are never cut across two lines. Example:

OG   Plasmid R6-5, Plasmid IncFII R100 (NR1), and
OG   Plasmid IncFII R1-19 (R1 drd-19).

The document plasmid.txt lists all the plasmid names that are used in the database in the context of the OG line. The document plastid.txt lists all plastid encoded proteins.

3.8. The OC line

The OC (Organism Classification) lines contain the taxonomic classification of the source organism. The taxonomic classification used is that maintained at the NCBI (see http://www.ncbi.nlm.nih.gov/Taxonomy/) and used by the nucleotide sequence databases (EMBL/GenBank/DDBJ). The NCBI's taxonomy reflects current phylogenetic knowledge. It is a sequence-based taxonomy as much as possible and based on published authorities wherever possible. Because of the inherent ambiguity of evolutionary classification and the specific needs of database users (e.g. trying to track down the phylogenetic history of a group of organisms or to elucidate the evolution of a molecule), this taxonomy strives to accurately reflect current phylogenetic knowledge. The NCBI's taxonomy is intended to be informative and helpful; no claim is made that it is the best or the most exact.

The classification is listed top-down as nodes in a taxonomic tree in which the most general grouping is given first. The classification may be distributed over several OC lines, but nodes are not split or hyphenated between lines. Semicolons separate the individual items and the list is terminated by a period.

The format of the OC line is:

OC   Node[; Node...].

For example the classification lines for a human sequence would be:

OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae;
OC   Homo.
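
Because nodes are never split between lines, a block of OC lines can be turned into an ordered lineage with very little code. The Python sketch below is illustrative only; parse_oc is a name chosen for this example.

```python
def parse_oc(lines):
    """Return the taxonomic nodes of an OC block, most general first."""
    text = " ".join(line[5:].strip() for line in lines)
    # Semicolons separate the nodes; the list ends with a period.
    return [node.strip() for node in text.rstrip(".").split(";")]


lineage = parse_oc([
    "OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;",
    "OC   Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae;",
    "OC   Homo.",
])
```

For the human example, lineage starts with 'Eukaryota' and ends with 'Homo'.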

3.9. The OX line

The OX (Organism taxonomy cross-reference) line is used to indicate the identifier of a specific organism in a taxonomic database. The format of the OX line is:

OX   Taxonomy_database_Qualifier=Taxonomic code;

Currently the cross-references are made to the taxonomy database of NCBI, which is associated with the qualifier 'TaxID' and a one- to six-digit taxonomic code.

Examples:

OX   NCBI_TaxID=9606;
OX   NCBI_TaxID=562;

3.10. The OH line

The OH (Organism Host) line is optional and appears only in viral entries. It indicates the host organism(s) that are susceptible to infection by the virus.

A virus being an inert particle outside its hosts, the virion has neither metabolism, nor any replication capability, nor autonomous evolution. Identifying the host organism(s) is therefore essential, because features like virus-cell interactions and posttranslational modifications depend mostly on the host.

The format of the OH line is:

OH   NCBI_TaxID=TaxID; HostName.

The HostName consists of the official name and, optionally, a common name and/or synonym. The length of an OH line may exceed 75 characters.

Example for Simian hepatitis A virus:

OH   NCBI_TaxID=9481; Callithrix.
OH   NCBI_TaxID=9536; Cercopithecus hamlyni (Owl-faced monkey) (Hamlyn's monkey).
OH   NCBI_TaxID=9539; Macaca (macaques).
OH   NCBI_TaxID=9598; Pan troglodytes (Chimpanzee).
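
The OH format lends itself to a single regular expression. The following Python sketch is an illustration under the assumption that each OH line describes exactly one host on one line; the name parse_oh is hypothetical.

```python
import re

# 'OH   NCBI_TaxID=<digits>; <host name>.'
OH_RE = re.compile(r"^OH   NCBI_TaxID=(\d+); (.+)\.$")


def parse_oh(line):
    """Return (taxonomy identifier, host name) for one OH line."""
    match = OH_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized OH line: {line!r}")
    return int(match.group(1)), match.group(2)


hosts = [parse_oh(line) for line in (
    "OH   NCBI_TaxID=9481; Callithrix.",
    "OH   NCBI_TaxID=9536; Cercopithecus hamlyni (Owl-faced monkey) (Hamlyn's monkey).",
    "OH   NCBI_TaxID=9598; Pan troglodytes (Chimpanzee).",
)]
```

The host name keeps its optional common name and synonyms in parentheses; only the terminating period is removed.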

3.11. The reference (RN, RP, RC, RX, RG, RA, RT, RL) lines

These lines comprise the literature citations. The citations indicate the sources from which the data has been abstracted. The reference lines for a given citation occur in a block, and are always in the order RN, RP, RC, RX, RG, RA, RT and RL. Within each such reference block, the RN line occurs once, the RC, RX and RT lines occur zero or more times, and the RP, RG/RA and RL lines occur one or more times. If several references are given, there will be a reference block for each.

An example of a complete reference is:

RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA] (ISOFORMS A AND C), FUNCTION, INTERACTION
RP   WITH PKC-3, SUBCELLULAR LOCATION, TISSUE SPECIFICITY, DEVELOPMENTAL
RP   STAGE, AND MUTAGENESIS OF PHE-175 AND PHE-221.
RC   STRAIN=Bristol N2;
RX   PubMed=11134024; DOI=10.1074/jbc.M008990200;
RA   Zhang L., Wu S.-L., Rubin C.S.;
RT   "A novel adapter protein employs a phosphotyrosine binding domain and
RT   exceptionally basic N-terminal domains to capture and localize an
RT   atypical protein kinase C: characterization of Caenorhabditis elegans
RT   C kinase adapter 1, a protein that avidly binds protein kinase C3.";
RL   J. Biol. Chem. 276:10463-10475(2001).

The formats of the individual lines are explained below.
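
The ordering and occurrence rules for a reference block can be expressed as a small check. The Python sketch below is an illustration, not a validator used by UniProt; the name reference_block_problems and the wording of the messages are assumptions made for this example.

```python
ORDER = ["RN", "RP", "RC", "RX", "RG", "RA", "RT", "RL"]


def reference_block_problems(lines):
    """Return a list of rule violations; an empty list means the block is fine."""
    codes = [line[:2] for line in lines]
    unknown = [code for code in codes if code not in ORDER]
    if unknown:
        return [f"unexpected line code(s): {unknown}"]
    problems = []
    ranks = [ORDER.index(code) for code in codes]
    if ranks != sorted(ranks):
        problems.append("line types are out of order")
    if codes.count("RN") != 1:
        problems.append("RN must occur exactly once")
    for required in ("RP", "RL"):
        if required not in codes:
            problems.append(f"{required} must occur at least once")
    if "RG" not in codes and "RA" not in codes:
        problems.append("at least one RG or RA line is required")
    return problems


block = [
    "RN   [1]",
    "RP   NUCLEOTIDE SEQUENCE [MRNA] (ISOFORMS A AND C), FUNCTION, INTERACTION",
    "RP   WITH PKC-3, SUBCELLULAR LOCATION, TISSUE SPECIFICITY, DEVELOPMENTAL",
    "RP   STAGE, AND MUTAGENESIS OF PHE-175 AND PHE-221.",
    "RC   STRAIN=Bristol N2;",
    "RX   PubMed=11134024; DOI=10.1074/jbc.M008990200;",
    "RA   Zhang L., Wu S.-L., Rubin C.S.;",
    "RL   J. Biol. Chem. 276:10463-10475(2001).",
]
problems = reference_block_problems(block)
```

The RT lines are omitted from this shortened block, which is allowed because RT lines occur zero or more times.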

3.11.2. The RN line

The RN (Reference Number) line gives a sequential number to each reference citation in an entry. This number is used to indicate the reference in comments and feature table notes. The format of the RN line is:

RN   [n]

where 'n' denotes the nth reference for this entry. The reference number is always between square brackets.

3.11.3. The RP line

The RP (Reference Position) lines describe the extent of the work relevant to the entry carried out by the authors. The format of the RP line is:

RP   COMMENT.

It should contain a description of the information that has been propagated in the Swiss-Prot entry.

A typical comment is "NUCLEOTIDE SEQUENCE". This item might be tagged with a qualifier, indicating the origin of the sequence data. Valid values of this qualifier are:

  • GENOMIC DNA: the individual gene has been sequenced
  • GENOMIC RNA: the individual gene has been sequenced
  • MRNA: the individual cDNA has been sequenced
  • LARGE SCALE GENOMIC DNA: the gene has been sequenced as part of a genome project
  • LARGE SCALE MRNA: the cDNA has been sequenced as part of a large-scale cDNA project

If two qualifiers apply, both are indicated, separated by a '/'.

'LARGE SCALE ANALYSIS' is another typical tag. It is added to references that report the results of large-scale screens, to indicate that the individual results have not been studied extensively.

Typical examples of RP lines are shown below:

RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA / MRNA] (ISOFORM 1).
RP   NUCLEOTIDE SEQUENCE [GENOMIC RNA / MRNA].
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA], AND PROTEIN SEQUENCE OF 21-35.
RP   PROTEIN SEQUENCE OF 39-76; 95-118 AND 125-138, AND DISULFIDE BONDS.
RP   SEQUENCE REVISION TO 76-84 AND 129.
RP   STRUCTURE BY NMR.
RP   X-RAY CRYSTALLOGRAPHY (1.8 ANGSTROMS).
RP   CHARACTERIZATION.
RP   MUTAGENESIS OF TYR-65.
RP   REVIEW.
RP   VARIANT ALA-1368.
RP   VARIANTS HDLD1 ARG-597 AND ARG-1477, AND VARIANT HDLD2 LEU-693 DEL.
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA], PROTEIN SEQUENCE OF 1-22; 2-17;
RP   240-256; 318-339 AND 381-390, AND CHARACTERIZATION.
RP   NUCLEOTIDE SEQUENCE [MRNA], PROTEIN SEQUENCE OF 154-171; 302-308;
RP   312-328; 377-384 AND 419-431, FUNCTION, SUBCELLULAR LOCATION, AND
RP   MUTAGENESIS OF ARG-331; GLY-332 AND ARG-333.
RP   PHOSPHORYLATION [LARGE SCALE ANALYSIS] AT SER-387 AND SER-391, AND
RP   MASS SPECTROMETRY.

3.11.4. The RC line

The RC (Reference Comment) lines are optional lines which are used to store comments relevant to the reference cited. The format of the RC line is:

RC   TOKEN1=Text; TOKEN2=Text; ...

The currently defined tokens and their order in the RC line are:

STRAIN
PLASMID
TRANSPOSON
TISSUE

Reference comment line topics may span lines. Examples of RC lines:

RC   STRAIN=Sprague-Dawley; TISSUE=Liver;
RC   STRAIN=Holstein; TISSUE=Lymph node, and Mammary gland;
RC   STRAIN=301 / Serotype 2a;
RC   STRAIN=cv. SP753012-O; TISSUE=Leaf;
RC   PLASMID=R1 (R7268); TRANSPOSON=Tn3;
RC   STRAIN=AL.012, AZ.026, AZ.180, DC.005, GA.039, GA2181, IL.014, IL2.17,
RC   IN.018, KY.172, KY2.37, LA.013, MI.035, MN.001, MNb027, MS.040,
RC   NY.016, OH.036, TN.173, TN2.38, UT.002, and VA.015;
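
The RC token format can be read with a few lines of Python. This sketch is illustrative only: parse_rc is a hypothetical name, and the code assumes that individual values contain neither ';' nor ','.

```python
def parse_rc(lines):
    """Return a dict mapping each RC token to its list of values."""
    # Topics may span lines, so the whole block is joined first.
    text = " ".join(line[5:].strip() for line in lines)
    comment = {}
    for field in text.split(";"):
        field = field.strip()
        if not field:
            continue
        token, _, value = field.partition("=")
        items = [item.strip() for item in value.split(",")]
        # The last value of a list is introduced by the word 'and'.
        comment[token] = [item[4:] if item.startswith("and ") else item
                          for item in items]
    return comment


rc = parse_rc(["RC   STRAIN=Holstein; TISSUE=Lymph node, and Mammary gland;"])
```

For this line, the TISSUE token yields the two values 'Lymph node' and 'Mammary gland'.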

3.11.5. The RX line

The RX (Reference cross-reference) line is an optional line which is used to indicate the identifier assigned to a specific reference in a bibliographic database. The format of the RX line is:

RX   Bibliographic_db=IDENTIFIER[; Bibliographic_db=IDENTIFIER...];

Where the valid bibliographic database names and their associated identifiers are:

 Name     Identifier
 MEDLINE  Eight-digit MEDLINE Unique Identifier (UI)
 PubMed   PubMed Unique Identifier (PMID)
 DOI      Digital Object Identifier (DOI)

Example of RX lines:

RX   MEDLINE=83283433; PubMed=6688356;
RX   PubMed=15626370; DOI=10.1016/j.toxicon.2004.10.011;
RX   MEDLINE=22709107; PubMed=12788972; DOI=10.1073/pnas.1130426100;

3.11.6. The RG line

The Reference Group (RG) line lists the consortium name associated with a given citation. The RG line is mainly used in submission reference blocks, but can also be used in paper references, if the working group is cited as an author in the paper. RG line and RA line (Reference Author) can be present in the same reference block; at least one RG or RA line is mandatory per reference block. An example of the use of RG lines is shown below:

RG   The mouse genome sequencing consortium;

3.11.7. The RA line

The RA (Reference Author) lines list the authors of the paper (or other work) cited. The RA line is present in most references, but might be missing in references that cite a reference group (see RG line). At least one RG or RA line is mandatory per reference block.

All of the authors are included, and are listed in the order given in the paper. The names are listed surname first followed by a blank, followed by initial(s) with periods. The authors' names are separated by commas and terminated by a semicolon. Author names are not split between lines. An example of the use of RA lines is shown below:

RA   Galinier A., Bleicher F., Negre D., Perriere G., Duclos B.,
RA   Cozzone A.J., Cortay J.-C.;

As many RA lines as necessary are included in each reference. All initials of the author names are indicated and hyphens between initials are kept.

An author's initials can be followed by an abbreviation such as 'Jr' (for Junior), 'Sr' (Senior), 'II', 'III' or 'IV' (2nd, 3rd and 4th). Example:

RA   Nasoff M.S., Baker H.V. II, Wolf R.E. Jr.;
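
Since author names are never split between lines and are separated by commas, a block of RA lines can be converted into an author list as sketched below in Python. The name parse_ra is an assumption made for this example.

```python
def parse_ra(lines):
    """Return the authors of an RA block in the order given in the paper."""
    text = " ".join(line[5:].strip() for line in lines)
    # The list is terminated by a semicolon and separated by commas.
    return [name.strip() for name in text.rstrip(";").split(",") if name.strip()]


authors = parse_ra([
    "RA   Galinier A., Bleicher F., Negre D., Perriere G., Duclos B.,",
    "RA   Cozzone A.J., Cortay J.-C.;",
])
```

Suffixes such as 'Jr' or 'II' stay attached to the author they belong to, because they follow the initials without a comma.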

3.11.8. The RT line

The RT (Reference Title) lines give the title of the paper (or other work) cited as exactly as possible given the limitations of the computer character set. The format of the RT line is:

RT   "Title.";

Example of a set of RT lines:

RT   "New insulin-like proteins with atypical disulfide bond pattern
RT   characterized in Caenorhabditis elegans by comparative sequence
RT   analysis and homology modeling.";

It should be noted that the format of the title is not always identical to that displayed at the top of the published work:

  • Major title words are not capitalized;
  • The text of a title ends with either a period '.', a question mark '?' or an exclamation mark '!';
  • Double quotation marks ' " ' in the text of the title are replaced by single quotation marks;
  • Titles of articles published in a language other than English have been translated into English;
  • Greek letters are written in full (alpha, beta, etc.).

3.11.9. The RL line

The RL (Reference Location) lines contain the conventional citation information for the reference. In general, the RL lines alone are sufficient to find the paper in question.

a) Journal citations

The RL line for a journal citation includes the journal abbreviation, the volume number, the page range and the year. The format for such an RL line is:

RL   Journal_abbrev Volume:First_page-Last_page(YYYY).

Journal names are abbreviated according to the conventions used by the National Library of Medicine (NLM) and are based on the existing ISO and ANSI standards. A list of the abbreviations currently in use is given in the document file jourlist.txt.

An example of an RL line is:

RL   J. Mol. Biol. 168:321-331(1983).

When a reference is made to a paper which is 'in press' at the time the database is released, the page range, and possibly the volume number, are indicated as '0' (zero). An example of such an RL line is shown here:

RL   Int. J. Parasitol. 0:0-0(2005).
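
The journal form of the RL line can be recognized with a regular expression. The following Python sketch is an illustration that assumes the citation fits on a single RL line; the name parse_journal_rl and the 'in_press' key are hypothetical. Other RL forms are simply reported as not matching.

```python
import re

# 'RL   Journal_abbrev Volume:First_page-Last_page(YYYY).'
RL_JOURNAL_RE = re.compile(
    r"^RL   (?P<journal>.+) (?P<volume>[^ :]+):"
    r"(?P<first>[\w.]+)-(?P<last>[\w.]+)\((?P<year>\d{4})\)\.$"
)


def parse_journal_rl(line):
    """Return the parts of a journal citation, or None for other RL forms."""
    match = RL_JOURNAL_RE.match(line)
    if match is None:
        return None
    citation = match.groupdict()
    citation["year"] = int(citation["year"])
    # Papers in press carry '0' in place of the page range.
    citation["in_press"] = citation["first"] == "0" and citation["last"] == "0"
    return citation


published = parse_journal_rl("RL   J. Mol. Biol. 168:321-331(1983).")
in_press = parse_journal_rl("RL   Int. J. Parasitol. 0:0-0(2005).")
```

The first call yields the journal abbreviation, volume, page range and year; the second is flagged as in press.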

b) Electronic publications

The RL line for an electronic publication includes an '(er)' prefix. The format is indicated below:

RL   (er) Free text.

Examples:

RL   (er) Plant Gene Register PGR98-023.
RL   (er) Worm Breeder's Gazette 15(3):34(1998).

c) Book citations

A variation of the RL line format is used for papers found in books or other types of publication, which are then cited using the following format:

RL   (In) Editor_1 I.[, Editor_2 I., Editor_X I.] (eds.);
RL   Book_name, pp.[Volume:]First_page-Last_page, Publisher, City (YYYY).

Examples:

RL   (In) Boyer P.D. (eds.);
RL   The enzymes (3rd ed.), pp.11:397-547, Academic Press, New York (1975).
RL   (In) Rich D.H., Gross E. (eds.);
RL   Proceedings of the 7th American peptide symposium, pp.69-72, Pierce
RL   Chemical Co., Rockford Il. (1981).
RL   (In) Magnusson S., Ottesen M., Foltmann B., Dano K., Neurath H.
RL   (eds.);
RL   Regulatory proteolytic enzymes and their inhibitors, pp.163-172,
RL   Pergamon Press, New York (1978).

d) Unpublished observations

For unpublished observations the format of the RL line is:

RL   Unpublished observations (MMM-YYYY).

Where 'MMM' is the month and 'YYYY' is the year.

We use the 'unpublished observations' RL line to cite communications by scientists to Swiss-Prot of unpublished information concerning various aspects of a sequence entry.

e) Thesis

For Ph.D. theses the format of the RL line is:

RL   Thesis (Year), Institution_name, Country.

An example of such a line is given here:

RL   Thesis (1977), University of Geneva, Switzerland.

f) Patent applications

For patent applications the format of the RL line is:

RL   Patent number Pat_num, DD-MMM-YYYY.

Where 'Pat_num' is the international publication number of the patent, 'DD' is the day, 'MMM' is the month and 'YYYY' is the year. Example:

RL   Patent number WO9010703, 20-SEP-1990.

g) Submissions

The final form that an RL line can take is that used for submissions. The format of such an RL line is:

RL   Submitted (MMM-YYYY) to Database_name.

Where 'MMM' is the month, 'YYYY' is the year and 'Database_name' is one of the following:

the EMBL/GenBank/DDBJ databases
UniProtKB
the PDB data bank
the PIR data bank

Two examples of submission RL lines are given here:

RL   Submitted (OCT-1995) to the EMBL/GenBank/DDBJ databases.
RL   Submitted (APR-2004) to UniProtKB.

3.12. The CC line

The CC lines are free text comments on the entry, and are used to convey any useful information. The comments always appear below the last reference line and are grouped together in comment blocks; a block is made up of 1 or more comment lines. The first line of a block starts with the characters '-!-'.

The format of a comment block is:

CC   -!- TOPIC: First line of a comment block;
CC       second and subsequent lines of a comment block.
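
The block structure of CC lines can be recovered by looking for the '-!-' marker. The Python sketch below is a simplified illustration: split_cc_blocks is a hypothetical name, and continuation lines are joined with a single space, which discards the indentation used by structured topics and leaves a space after a word hyphenated at a line end.

```python
def split_cc_blocks(lines):
    """Group CC lines into (topic, text) pairs, one per comment block."""
    blocks = []
    for line in lines:
        body = line[5:]
        if body.startswith("-!- "):
            # First line of a block: 'TOPIC: text'.
            topic, _, text = body[4:].partition(":")
            blocks.append([topic, text.strip()])
        elif blocks:
            # Continuation line: append to the current block.
            blocks[-1][1] = (blocks[-1][1] + " " + body.strip()).strip()
    return [tuple(block) for block in blocks]


comments = split_cc_blocks([
    "CC   -!- ALLERGEN: Causes an allergic reaction in human. Minor allergen of",
    "CC       bovine dander.",
    "CC   -!- COFACTOR: Pyridoxal phosphate.",
])
```

Each tuple pairs a topic with the full text of its block, so the number of blocks per topic can then be counted.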

The comment blocks are arranged according to what we designate as 'topics'. The current topics and their definitions are listed in the table below.

Topic                          Description
ALLERGEN                       Information relevant to allergenic proteins
ALTERNATIVE PRODUCTS           Description of the existence of related protein sequence(s) produced by alternative splicing of the same gene, alternative promoter usage, ribosomal frameshifting or by the use of alternative initiation codons; see 3.21.16
BIOPHYSICOCHEMICAL PROPERTIES  Description of the information relevant to biophysical and physicochemical data and information on pH dependence, temperature dependence, kinetic parameters, redox potentials, and maximal absorption; see 3.21.8
BIOTECHNOLOGY                  Description of the use of a specific protein in a biotechnological process
CATALYTIC ACTIVITY             Description of the reaction(s) catalyzed by an enzyme [1]
CAUTION                        Warning about possible errors and/or grounds for confusion
COFACTOR                       Description of any non-protein substance required by an enzyme for its catalytic activity
DEVELOPMENTAL STAGE            Description of the developmentally-specific expression of mRNA or protein
DISEASE                        Description of the disease(s) associated with a deficiency of a protein
DOMAIN                         Description of the domain structure of a protein
ENZYME REGULATION              Description of an enzyme regulatory mechanism
FUNCTION                       General description of the function(s) of a protein
INDUCTION                      Description of the compound(s) or condition(s) that regulate gene expression
INTERACTION                    Conveys information relevant to binary protein-protein interaction; see 3.21.12
MASS SPECTROMETRY              Reports the exact molecular weight of a protein or part of a protein as determined by mass spectrometric methods; see 3.21.24
MISCELLANEOUS                  Any comment which does not belong to any of the other defined topics
PATHWAY                        Description of the metabolic pathway(s) with which a protein is associated
PHARMACEUTICAL                 Description of the use of a protein as a pharmaceutical drug
POLYMORPHISM                   Description of polymorphism(s)
PTM                            Description of any chemical alteration of a polypeptide (proteolytic cleavage, amino acid modifications including crosslinks). This topic complements information given in the feature table or indicates polypeptide modifications for which position-specific data is not available.
RNA EDITING                    Description of any type of RNA editing that leads to one or more amino acid changes
SEQUENCE CAUTION               Description of protein sequence reports that differ from the sequence that is shown in UniProtKB due to conflicts that are not described in FT CONFLICT lines, such as frameshifts, erroneous gene model predictions, etc.; see 3.21.35
SIMILARITY                     Description of the similarity(ies) (sequence or structural) of a protein with other proteins
SUBCELLULAR LOCATION           Description of the subcellular location of the chain/peptide/isoform
SUBUNIT                        Description of the quaternary structure of a protein and any kind of interactions with other proteins or protein complexes; except for receptor-ligand interactions, which are described in the topic FUNCTION
TISSUE SPECIFICITY             Description of the tissue-specific expression of mRNA or protein
TOXIC DOSE                     Description of the lethal dose (LD), paralytic dose (PD) or effective dose of a protein
WEB RESOURCE                   Description of a cross-reference to a network database/resource for a specific protein; see 3.21.37

Note:

[1] For the 'CATALYTIC ACTIVITY' topic: To describe the catalytic activity of an enzyme we have used, whenever possible, the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB) as published in Enzyme Nomenclature, NC-IUBMB, Academic Press, New-York, (1992).

Each entry contains a variable number of CC line topics. Most topics can be present more than once in a given entry. Topics that occur only once in an entry are: ALLERGEN, ALTERNATIVE PRODUCTS, BIOPHYSICOCHEMICAL PROPERTIES, BIOTECHNOLOGY, DEVELOPMENTAL STAGE, ENZYME REGULATION, INDUCTION, INTERACTION, PHARMACEUTICAL, SUBUNIT, TISSUE SPECIFICITY, TOXIC DOSE and RNA EDITING.

3.12.1. Examples for each comment line topic

We show here, for each of the defined topics, up to two examples of its usage:

CC   -!- ALLERGEN: Causes an allergic reaction in human. Binds to IgE.
CC       Partially heat-labile allergen that may cause both respiratory and
CC       food-allergy symptoms in patients with the bird-egg syndrome.
CC   -!- ALLERGEN: Causes an allergic reaction in human. Minor allergen of
CC       bovine dander.
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=3;
CC         Comment=Additional isoforms seem to exist. Experimental
CC         confirmation may be lacking for some isoforms;
CC       Name=1; Synonyms=AIRE-1;
CC         IsoId=O43918-1; Sequence=Displayed;
CC       Name=2; Synonyms=AIRE-2;
CC         IsoId=O43918-2; Sequence=VSP_004089;
CC       Name=3; Synonyms=AIRE-3;
CC         IsoId=O43918-3; Sequence=VSP_004089, VSP_004090;
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative initiation; Named isoforms=2;
CC       Name=Alpha;
CC         IsoId=P51636-1; Sequence=Displayed;
CC       Name=Beta;
CC         IsoId=P51636-2; Sequence=VSP_018696;
CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       pH dependence:
CC         Optimum pH is 8-10;
CC       Temperature dependence:
CC         Highly active at low temperatures, even at 0 degree Celsius.
CC         Thermolabile;
CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Kinetic parameters:
CC         KM=98 uM for ATP;
CC         KM=688 uM for pyridoxal;
CC         Vmax=1.604 mmol/min/mg enzyme;
CC       pH dependence:
CC         Optimum pH is 6.0. Active from pH 4.5 to 10.5;
CC   -!- BIOTECHNOLOGY: The effect of PG can be neutralized by introducing
CC       an antisense PG gene by genetic manipulation. The Flavr Savr
CC       tomato produced by Calgene (Monsanto) in such a manner has a
CC       longer shelf life due to delayed ripening.
CC   -!- BIOTECHNOLOGY: Used in the food industry for high temperature
CC       liquefaction of starch-containing mashes and in the detergent
CC       industry to remove starch. Sold under the name Termamyl by
CC       Novozymes.
CC   -!- CATALYTIC ACTIVITY: ATP + L-glutamate + NH(3) = ADP + phosphate +
CC       L-glutamine.
CC   -!- CATALYTIC ACTIVITY: (R)-2,3-dihydroxy-3-methylbutanoate + NADP(+)
CC       = (S)-2-hydroxy-2-methyl-3-oxobutanoate + NADPH.
CC   -!- CAUTION: It is uncertain whether Met-1 or Met-3 is the initiator.
CC   -!- COFACTOR: Pyridoxal phosphate.
CC   -!- COFACTOR: FAD (By similarity).
CC   -!- WEB RESOURCE: Name=Alzheimer Research Forum; Note=APP mutations;
CC       URL="http://www.alzforum.org/res/com/mut/app/default.asp".
CC   -!- DEVELOPMENTAL STAGE: Expressed early during conidial (dormant
CC       spores) differentiation.
CC   -!- DEVELOPMENTAL STAGE: Detected in embryonic skin (E12.5 and E14.5)
CC       during the formation of hair follicles and at E15.5 in the enamel
CC       knot of the developing tooth. Detected in the basal layer of the
CC       epidermis and hair follicles of P2 mice.
CC   -!- DISEASE: Defects in PHKA1 are linked to X-linked muscle
CC       glycogenosis [MIM:311870]. It is a disease characterized by slowly
CC       progressive, predominantly distal muscle weakness and atrophy.
CC   -!- DISEASE: Defects in ABCD1 are the cause of recessive X-linked
CC       adrenoleukodystrophy (X-ALD) [MIM:300100]. X-ALD is a rare
CC       peroxisomal metabolic disorder that occurs in boys and is
CC       characterized by progressive multifocal demyelination of the
CC       central nervous system and by adrenocortical insufficiency. It
CC       produces mental deterioration, corticospinal tract dysfunction,
CC       and cortical blindness. There is laboratory evidence of adrenal
CC       cortical dysfunction. Different clinical manifestations exist
CC       like: cerebral childhood ALD (CALD), adult cerebral ALD (ACALD),
CC       adrenomyeloneuropathy (AMN) and "Addison disease only" (ADO)
CC       phenotype.
CC   -!- DOMAIN: Contains a coiled-coil domain essential for vesicular
CC       transport and a dispensable C-terminal region.
CC   -!- DOMAIN: The B chain is composed of two domains, each domain
CC       consists of 3 homologous subdomains (alpha, beta, gamma).
CC   -!- ENZYME REGULATION: The activity of this enzyme is controlled by
CC       adenylation under conditions of abundant glutamine. The fully
CC       adenylated enzyme complex is inactive (By similarity).
CC   -!- ENZYME REGULATION: Activated by Gram-negative bacterial
CC       lipopolysaccharides and chymotrypsin.
CC   -!- FUNCTION: Binds to actin and affects the structure of the
CC       cytoskeleton. At high concentrations, profilin prevents the
CC       polymerization of actin, whereas it enhances it at low
CC       concentrations. By binding to PIP2, it inhibits the formation of
CC       IP3 and DG.
CC   -!- FUNCTION: Inhibitor of fungal polygalacturonase. It is an
CC       important factor for plant resistance to phytopathogenic fungi.
CC       Substrate preference is polygalacturonase (PG) from A.niger >> PG
CC       of F.oxysporum, A.solani or B.cinerea. Not active on PG from
CC       F.moniliforme.
CC   -!- INDUCTION: By heat shock, salt stress, oxidative stress, glucose
CC       limitation and oxygen limitation.
CC   -!- INDUCTION: By infection, plant wounding, or elicitor treatment of
CC       cell cultures.
CC   -!- INTERACTION:
CC       Self; NbExp=1; IntAct=EBI-123485, EBI-123485;
CC       Q9W158:CG4612; NbExp=1; IntAct=EBI-123485, EBI-89895;
CC       Q9VYI0:fne; NbExp=1; IntAct=EBI-123485, EBI-126770;
CC   -!- INTERACTION:
CC       Q9W1K5-1:CG11299; NbExp=1; IntAct=EBI-133844, EBI-212772;
CC   -!- MASS SPECTROMETRY: Mass=24948; Mass_error=6; Method=MALDI;
CC       Range=1-228; Source=PubMed:11101899;
CC   -!- MASS SPECTROMETRY: Mass=13822; Method=MALDI; Range=19-140 (P15522-
CC       2); Source=PubMed:10531593;
CC   -!- MISCELLANEOUS: Binds to bacitracin.
CC   -!- MISCELLANEOUS: Called DUO because the encoded protein is closely
CC       related to but shorter than TRIO.
CC   -!- PATHWAY: Cofactor biosynthesis; porphyrin biosynthesis; 5-
CC       aminolevulinate from L-glutamyl-tRNA(Glu): step 2/2.
CC   -!- PATHWAY: Nucleotide metabolism; purine metabolism.
CC   -!- PHARMACEUTICAL: Available under the names Avonex (Biogen),
CC       Betaseron (Berlex) and Rebif (Serono). Used in the treatment of
CC       multiple sclerosis (MS). Betaseron is a slightly modified form of
CC       IFNB1 with two residue substitutions.
CC   -!- PHARMACEUTICAL: Available under the name Proleukin (Chiron). Used
CC       in patients with renal cell carcinoma or metastatic melanoma.
CC   -!- POLYMORPHISM: The allelic form of the enzyme with Gln-191
CC       (Allozyme A) hydrolyzes paraoxon with a low turnover number and
CC       the one with Arg-191 (Allozyme B) with a high turnover number.
CC   -!- POLYMORPHISM: In the human populations there are two major allelic
CC       forms, alpha-1 with 83 residues and alpha-2 with 142 residues.
CC       These alleles determine the 3 major phenotypes HP*1F/HP*1S and
CC       HP*2FS. The two main alleles of HP*1 are called HP*1F (fast) and
CC       HP*1S (slow).
CC   -!- PTM: N-glycosylated and probably also O-glycosylated.
CC   -!- PTM: A soluble short 95 kDa form may be released by proteolytic
CC       cleavage from the long membrane-anchored form.
CC   -!- RNA EDITING: Modified_positions=393, 431, 452, 495.
CC   -!- RNA EDITING: Modified_positions=59, 78, 94, 98, 102, 121; Note=The
CC       nonsense codon at position 59 is modified to a sense codon. The
CC       stop codon at position 121 is created by RNA editing.
CC   -!- SEQUENCE CAUTION:
CC       Sequence=CAI24940.1; Type=Erroneous gene model prediction;
CC   -!- SIMILARITY: Belongs to the annexin family.
CC   -!- SIMILARITY: Contains 13 EGF-like domains.
CC   -!- SUBCELLULAR LOCATION: Bacterial cell inner membrane; Multi-pass
CC       membrane protein.
CC   -!- SUBUNIT: Homotetramer.
CC   -!- SUBUNIT: Disulfide-linked heterodimer of a light chain (L) and a
CC       heavy chain (H). The light chain has the pharmacological activity,
CC       while the N- and C-terminal of the heavy chain mediate channel
CC       formation and toxin binding, respectively.
CC   -!- TISSUE SPECIFICITY: Shoots, roots, and cotyledon from dehydrating
CC       seedlings.
CC   -!- TISSUE SPECIFICITY: Expressed at high levels in brain and ovary.
CC       Lower levels in small intestine. In brain regions, detected in all
CC       regions tested. Highest levels in the cerebellum and cerebral
CC       cortex.
CC   -!- TOXIC DOSE: PD(50) is 1.72 mg/kg by injection in blowfly larvae.
CC   -!- TOXIC DOSE: LD(50) is 0.015 mg/kg by intravenous injection for
CC       sarafotoxin-A and sarafotoxin-B, and 0.3 mg/kg for sarafotoxin-C.
CC   -!- WEB RESOURCE: Name=CD40Lbase;
CC       Note=European CD40L defect database (mutation db);
CC       URL="http://www.expasy.org/cd40lbase/".
3.12.2. Syntax of the topic 'BIOPHYSICOCHEMICAL PROPERTIES'
CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Absorption:
CC         Abs(max)=xx nm;
CC         Note=free_text;
CC       Kinetic parameters:
CC         KM=xx unit for substrate [(free_text)];
CC         Vmax=xx unit enzyme [free_text];
CC         Note=free_text;
CC       pH dependence:
CC         free_text;
CC       Redox potential:
CC         free_text;
CC       Temperature dependence:
CC         free_text;

A BIOPHYSICOCHEMICAL PROPERTIES block must contain at least one of the properties Absorption, Kinetic parameters, pH dependence, Redox potential, Temperature dependence and may have any combination of these properties (ordered as indicated above). The meaning of these subtopics is as follows:

Property                 Description
Absorption               Indicates the wavelength at which photoreactive proteins such as opsins and DNA photolyases show maximal absorption
Kinetic parameters       Mentions the Michaelis-Menten constant (KM) and maximal velocity (Vmax) of enzymes
pH dependence            Describes the optimum pH for enzyme activity and/or the variation of enzyme activity with pH variation
Redox potential          Reports the value of the standard (midpoint) oxido-reduction potential(s) for electron transport proteins
Temperature dependence   Indicates the optimum temperature for enzyme activity and/or the variation of enzyme activity with temperature variation; the thermostability/thermolability of the enzyme is also mentioned when it is known
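
The block layout described above can be read mechanically. The following is a minimal Python sketch, not an official parser: the function name parse_biophysicochemical is hypothetical, the input is the syntax template itself (placeholders such as xx and free_text are kept as-is), and it assumes that every item ends with ';' and that free text contains no ';'.

```python
# Subtopic names as listed in the syntax description of this topic.
PROPERTIES = (
    "Absorption",
    "Kinetic parameters",
    "pH dependence",
    "Redox potential",
    "Temperature dependence",
)


def parse_biophysicochemical(cc_lines):
    """Group the items of one BIOPHYSICOCHEMICAL PROPERTIES block by property.

    Hypothetical helper: assumes items end with ';' and contain no other ';'.
    """
    collected = {}
    current = None
    for line in cc_lines:
        text = line[2:].strip()        # drop the 'CC' line code and indentation
        if text.startswith("-!-"):     # topic header line, carries no item
            continue
        if text.endswith(":") and text[:-1] in PROPERTIES:
            current = text[:-1]        # start of a new property subtopic
            collected[current] = []
        elif current is not None:
            collected[current].append(text)
    # Join wrapped lines, then split the joined text into ';'-terminated items.
    return {
        name: [item.strip() for item in " ".join(parts).split(";") if item.strip()]
        for name, parts in collected.items()
    }


# The syntax template from this section, used here as sample input.
TEMPLATE = [
    "CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:",
    "CC       Absorption:",
    "CC         Abs(max)=xx nm;",
    "CC         Note=free_text;",
    "CC       Kinetic parameters:",
    "CC         KM=xx unit for substrate [(free_text)];",
    "CC         Vmax=xx unit enzyme [free_text];",
    "CC         Note=free_text;",
    "CC       pH dependence:",
    "CC         free_text;",
]

for name, items in parse_biophysicochemical(TEMPLATE).items():
    print(name, items)
```

Because properties are kept in the order in which they are encountered, the result also preserves the ordering required by the format.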
3.12.3. The topic 'INTERACTION'

The CC line topic INTERACTION conveys information relevant to binary protein-protein interactions. It is automatically derived from the IntAct database and is updated on a monthly basis. There is one INTERACTION topic per entry, with each binary interaction presented on a separate line. Each data line can be longer than 75 characters.

Interactions can be derived by any appropriate experimental method, but must be confirmed by a second experiment if they result from a single yeast two-hybrid experiment. For large-scale experiments, interactions are considered if a high confidence is assigned by the authors.

The format of the CC line topic INTERACTION is:

CC   -!- INTERACTION:
CC       {{SP_Ac:identifier[ (xeno)]}|Self}; NbExp=n; IntAct=IntAct_Protein_Ac, IntAct_Protein_Ac;

where

SP_Ac is the Swiss-Prot or TrEMBL accession number of the interacting protein. If appropriate, the IsoId is used instead to specify the relevant interacting protein isoform.
identifier serves to describe the interacting protein. It is derived from the Swiss-Prot or TrEMBL GN line and thus presents either a "gene name", an "ordered locus name" or an "ORF name". When no GN line is available, a dash is indicated instead.
(xeno) is an optional qualifier indicating that the interacting proteins are derived from different species. This may be due to the experimental set-up or may reflect a pathogen-host interaction.
Self reflects a self-association; the corresponding current entry's SP_Ac and 'identifier' are not given/repeated.
NbExp=n refers to the number of experiments in IntAct supporting the interaction.
IntAct_Protein_Ac is the IntAct accession number of an interacting protein. The first IntAct_Protein_Ac refers to the protein or an isoform of the current entry, the second refers to the interacting protein or isoform.

Within the CC INTERACTION topic, homomeric interactions are listed before heteromeric interactions; the latter are sorted alphanumerically according to the 'identifier'.

"IntAct=IntAct_Protein_Ac, IntAct_Protein_Ac" identifies the interaction in IntAct by using the two IntAct protein identifiers.

Examples of interaction lines are given below. The CC INTERACTION topics shown are not complete; only the interaction lines being explained are included.

CC   -!- INTERACTION:
CC       P11450:fcp3c; NbExp=1; IntAct=EBI-126914, EBI-159556;

In this typical example, the current protein interacts with P11450, which is further described by "fcp3c". This identifier is derived from the GN line of P11450 and represents its gene name "Fcp3C". The interaction is supported by one experiment stored in IntAct. Experimental details for this interaction can be found by querying IntAct with "EBI-126914, EBI-159556".


CC   -!- INTERACTION:
CC       Q9W1K5-1:CG11299; NbExp=1; IntAct=EBI-133844, EBI-212772;

The current protein interacts with an isoform of Q9W1K5 defined by the IsoId Q9W1K5-1.


CC   -!- INTERACTION:
CC       Q8NI08:-; NbExp=1; IntAct=EBI-80809, EBI-80799;

No gene name information for the interacting protein is available.


CC   -!- INTERACTION:
CC       Self; NbExp=1; IntAct=EBI-123485, EBI-123485;

The protein self-associates.


CC   -!- INTERACTION:
CC       Q8C1S0:2410018M14Rik (xeno); NbExp=1; IntAct=EBI-394562, EBI-398761;

The source organisms of the interacting proteins are different.


CC   -!- INTERACTION:
CC       P51617:IRAK1; NbExp=1; IntAct=EBI-448466, EBI-358664;
CC       P51617:IRAK1; NbExp=1; IntAct=EBI-448472, EBI-358664;

Different isoforms of the current protein are shown to interact with the same protein (P51617). This is reflected by different IntAct_Protein_Acs for the current protein.

Example entry with many interaction lines: Q02821.
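
The interaction line format can be matched with a single regular expression. The following Python sketch is illustrative only: the names INTERACTION_RE and parse_interaction are hypothetical, and the pattern assumes that accession numbers contain no ':' or ';' and that the fields are separated exactly as in the examples above.

```python
import re

# One interaction data line: either 'Self' or 'SP_Ac:identifier[ (xeno)]',
# followed by NbExp and the two IntAct protein accession numbers.
INTERACTION_RE = re.compile(
    r"^CC\s+"
    r"(?:(?P<self>Self)"
    r"|(?P<ac>[^:;\s]+):(?P<identifier>[^;]+?)(?P<xeno> \(xeno\))?)"
    r"; NbExp=(?P<nbexp>\d+)"
    r"; IntAct=(?P<first>[^,;\s]+), (?P<second>[^,;\s]+);$"
)


def parse_interaction(line):
    """Parse one CC INTERACTION data line into a dictionary (sketch)."""
    m = INTERACTION_RE.match(line.rstrip())
    if m is None:
        raise ValueError(f"not an INTERACTION data line: {line!r}")
    is_self = m["self"] is not None
    identifier = m["identifier"]
    return {
        "self": is_self,
        # SP_Ac (or IsoId) of the partner; not given for self-association.
        "partner": m["ac"],
        # A dash means that no GN line is available for the partner.
        "identifier": None if identifier in (None, "-") else identifier,
        "xeno": m["xeno"] is not None,
        "nb_exp": int(m["nbexp"]),
        "intact": (m["first"], m["second"]),
    }


# Lines copied from the examples in this section.
for example in (
    "CC       P11450:fcp3c; NbExp=1; IntAct=EBI-126914, EBI-159556;",
    "CC       Q8NI08:-; NbExp=1; IntAct=EBI-80809, EBI-80799;",
    "CC       Self; NbExp=1; IntAct=EBI-123485, EBI-123485;",
    "CC       Q8C1S0:2410018M14Rik (xeno); NbExp=1; IntAct=EBI-394562, EBI-398761;",
):
    print(parse_interaction(example))
```

The topic header line ("CC   -!- INTERACTION:") is deliberately rejected, so a caller would skip it before passing data lines to the function.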

3.12.4. Syntax of the topic 'SUBCELLULAR LOCATION'

The document subcell.txt lists the controlled vocabularies used in the comment line (CC) topic SUBCELLULAR LOCATION, their definitions and further information such as synonyms or relevant GO terms in the following format:

    ---------  -------------------------------   ----------------------------------------------
    Line code  Content                           Occurrence in an entry
    ---------  -------------------------------   ----------------------------------------------
    ID         Identifier (location)             Once; starts an entry
    IT         Identifier (topology)             Once; starts a 'topology' entry
    IO         Identifier (orientation)          Once; starts an 'orientation' entry
    AC         Accession (SL-xxxx)               Once
    DE         Definition                        Once or more
    SY         Synonyms                          Optional; Once or more
    SL         Content of subc. loc. lines       Once
    HI         Hierarchy ('is-a')                Optional; Once or more
    HP         Hierarchy ('part-of')             Optional; Once or more
    KW         Associated keyword (accession)    Optional; Once or more
    GO         Gene ontology (GO) mapping        Optional; Once or more
    WW         Interesting links or references   Optional; Once or more
    //         Terminator                        Once; ends an entry
    
   

Example:

    ID   Cyanelle.
    AC   SL-0082
    DE   A cyanelle is a photosynthetic organelle of glaucocystophyte algae.
    DE   Cyanelles are surrounded by a double membrane and, in between, a
    DE   peptidoglycan wall. Thylakoid membrane architecture and the presence
    DE   of carboxysomes are cyanobacteria-like. Historically, the term
    DE   cyanelle is derived from a classification as endosymbiotic
    DE   cyanobacteria, and thus is not fully correct.
    SY   Muroplast; Cyanoplast.
    SL   Plastid, cyanelle.
    HI   Plastid.
    KW   KW-0194
    GO   GO:0009842; cyanelle
    //
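
A record in this format can be read by grouping lines on their two-letter line code. The following Python sketch is one possible way to do so; the function name parse_subcell_records is hypothetical, and the sample text is the Cyanelle record shown above.

```python
def parse_subcell_records(text):
    """Split subcell.txt-style text into records keyed by line code (sketch).

    Each record maps a two-letter line code (ID, AC, DE, ...) to the list of
    line contents carrying that code, in order of appearance.
    """
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":               # terminator: ends the current record
            if current:
                records.append(current)
            current = {}
            continue
        code, content = line[:2], line[2:].strip()
        current.setdefault(code, []).append(content)
    return records


SAMPLE = """\
ID   Cyanelle.
AC   SL-0082
DE   A cyanelle is a photosynthetic organelle of glaucocystophyte algae.
DE   Cyanelles are surrounded by a double membrane and, in between, a
DE   peptidoglycan wall. Thylakoid membrane architecture and the presence
DE   of carboxysomes are cyanobacteria-like. Historically, the term
DE   cyanelle is derived from a classification as endosymbiotic
DE   cyanobacteria, and thus is not fully correct.
SY   Muroplast; Cyanoplast.
SL   Plastid, cyanelle.
HI   Plastid.
KW   KW-0194
GO   GO:0009842; cyanelle
//
"""

record = parse_subcell_records(SAMPLE)[0]
print(record["AC"], record["SL"])
```

Keeping every line code as a list reflects the occurrence column of the table, where several codes may appear once or more.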
   

The format of SUBCELLULAR LOCATION is:

       CC   -!- SUBCELLULAR LOCATION:(( Molecule:)?( Location\.)+)?( Note=Free_text( Flag)?\.)?
   
Where:
  • Molecule: Isoform, chain or peptide name
  • Location = Subcellular_location( Flag)?(; Topology( Flag)?)?(; Orientation( Flag)?)?
    • Subcellular_location: SL-line of subcell.txt ID-record
    • Topology: SL-line of subcell.txt IT-record
    • Orientation: SL-line of subcell.txt IO-record
    • Flag = \((By similarity|Probable|Potential)\)

Note: Perl-style multipliers indicate whether a pattern (as delimited by parentheses) is optional (?) or may occur 1 or more times (+).

Examples:

When no chain/peptide/isoform is specified, the subcellular location corresponds to that of the mature protein.

    CC   -!- SUBCELLULAR LOCATION: Cytoplasm. Endoplasmic reticulum membrane;
    CC       Peripheral membrane protein. Golgi apparatus membrane; Peripheral
    CC       membrane protein.
   
    CC   -!- SUBCELLULAR LOCATION: Cell membrane; Peripheral membrane protein
    CC       (By similarity). Secreted (By similarity). Note=The last 22 C-
    CC       terminal amino acids may participate in cell membrane attachment.
    CC   -!- SUBCELLULAR LOCATION: Isoform 2: Cytoplasm (Probable).
   
    CC   -!- SUBCELLULAR LOCATION: Golgi apparatus, trans-Golgi network
    CC       membrane; Multi-pass membrane protein (By similarity).
    CC       Note=Predominantly found in the trans-Golgi network (TGN). Not
    CC       redistributed to the plasma membrane in response to elevated
    CC       copper levels.
    CC   -!- SUBCELLULAR LOCATION: Isoform 2: Cytoplasm.
    CC   -!- SUBCELLULAR LOCATION: WND/140 kDa: Mitochondrion.
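
The SUBCELLULAR LOCATION format above can be decomposed step by step. The following Python sketch is a simplified illustration, not a reference implementation: the function name parse_subcellular_location is hypothetical, the input is the unwrapped text that follows "SUBCELLULAR LOCATION:", and it assumes that controlled-vocabulary terms contain neither '.' nor ';' nor ': '. Since topology and orientation terms cannot be told apart without subcell.txt, each location is returned as an ordered list of (term, flag) pairs.

```python
import re

# Non-experimental qualifier at the end of a term.
FLAG_RE = re.compile(r"\s*\((By similarity|Probable|Potential)\)$")
# Optional molecule name (isoform, chain or peptide) before the first location.
MOLECULE_RE = re.compile(r"^([^.:;]+): ")


def split_flag(term):
    """Return (term_without_flag, flag_or_None)."""
    m = FLAG_RE.search(term)
    if m:
        return term[: m.start()], m.group(1)
    return term, None


def parse_subcellular_location(text):
    """Parse the unwrapped text of one SUBCELLULAR LOCATION comment (sketch)."""
    text = text.strip()
    # The note is free text and may contain periods, so split it off first.
    head, sep, tail = text.partition("Note=")
    note = tail.strip().rstrip(".") if sep else None
    head = head.strip()

    molecule = None
    m = MOLECULE_RE.match(head)
    if m:
        molecule = m.group(1)
        head = head[m.end():]

    locations = []
    for chunk in head.split("."):          # each location ends with a period
        chunk = chunk.strip()
        if chunk:
            locations.append([split_flag(t.strip()) for t in chunk.split(";")])
    return {"molecule": molecule, "locations": locations, "note": note}


print(parse_subcellular_location("Isoform 2: Cytoplasm (Probable)."))
print(parse_subcellular_location(
    "Cytoplasm. Endoplasmic reticulum membrane; Peripheral membrane protein. "
    "Golgi apparatus membrane; Peripheral membrane protein."
))
```

Lines wrapped at a hyphen (for example "C-" followed by "terminal") need to be rejoined without a space before calling the function; that unwrapping step is outside this sketch.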
   
3.12.5. Syntax of the topic 'ALTERNATIVE PRODUCTS'

The format of the CC line topic ALTERNATIVE PRODUCTS is:

 CC   -!- ALTERNATIVE PRODUCTS:
 CC       Event=Event(, Event)*; Named isoforms=Number_of_isoforms;
(CC         Comment=Free_text;)?
(CC       Name=Isoform_name;( Synonyms=Synonym(, Synonym)*;)?
 CC         IsoId=Isoform_identifier(, Isoform_identifier)*;
 CC         Sequence=(Displayed|External|Not described|Feature_identifier(, Feature_identifier)*);
(CC         Note=Free_text;)?)+

Note: Variable values are represented in italics. Perl-style multipliers indicate whether a pattern (as delimited by parentheses) is optional (?), may occur 0 or more times (*), or 1 or more times (+). Alternative values are separated by a pipe symbol (|).

Topic Description
Event Biological process that results in the production of the alternative forms. It lists one or a combination of the following values (Alternative promoter usage, Alternative splicing, Alternative initiation, Ribosomal frameshifting).
Format: Event=controlled vocabulary;
Example: Event=Alternative splicing;
Named isoforms Number of isoforms listed in the 'Name' topics; currently only for 'Event=Alternative splicing'.
Format: Named isoforms=number;
Example: Named isoforms=6;
Comment Any comments concerning one or more isoforms; optional;
Format: Comment=free text;
Example: Comment=Experimental confirmation may be lacking for some isoforms;
Name A common name for an isoform used in the literature or assigned by Swiss-Prot; currently only available for spliced isoforms.
Format: Name=common name;
Example: Name=Alpha;
Synonyms Synonyms for an isoform as used in the literature; optional; currently only available for spliced isoforms.
Format: Synonyms=Synonym_1[, Synonym_n];
Example: Synonyms=B, KL5;
IsoId Unique identifier for an isoform, consisting of the Swiss-Prot accession number, followed by a dash and a number.
Format: IsoId=acc#-isoform_number[, acc#-isoform_number];
Example: IsoId=P05067-1;
Sequence Information on the isoform sequence; the term 'Displayed' indicates that the sequence is shown in the entry; a list of feature identifiers (VSP_#) indicates that the isoform is annotated in the feature table; the FTIds enable programs to create the sequence of a splice variant; if the accession number of the IsoId does not correspond to the accession number of the current entry, this topic contains the term 'External'; 'Not described' points out that the sequence of the isoform is unknown.
Format: Sequence=VSP_#[, VSP_#]|Displayed|External|Not described;
Example: Sequence=Displayed;
Example: Sequence=VSP_000013, VSP_000014;
Example: Sequence=External;
Example: Sequence=Not described;
Note Lists isoform-specific information; optional. It may specify the event(s), if there are several.
Format: Note=Free text;
Example: Note=No experimental confirmation available;

Example of the CC lines and the corresponding FT lines for an entry with alternative splicing:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing, Alternative initiation; Named isoforms=8;
CC         Comment=Additional isoforms seem to exist;
CC       Name=1; Synonyms=Non-muscle isozyme;
CC         IsoId=Q15746-1; Sequence=Displayed;
CC       Name=2;
CC         IsoId=Q15746-2; Sequence=VSP_004791;
CC       Name=3A;
CC         IsoId=Q15746-3; Sequence=VSP_004792, VSP_004794;
CC       Name=3B;
CC         IsoId=Q15746-4; Sequence=VSP_004791, VSP_004792, VSP_004794;
CC       Name=4;
CC         IsoId=Q15746-5; Sequence=VSP_004792, VSP_004793;
CC       Name=Del-1790;
CC         IsoId=Q15746-6; Sequence=VSP_004795;
CC       Name=5; Synonyms=Smooth-muscle isozyme;
CC         IsoId=Q15746-7; Sequence=VSP_018845;
CC         Note=Produced by alternative initiation at Met-923 of isoform 1;
CC       Name=6; Synonyms=Telokin;
CC         IsoId=Q15746-8; Sequence=VSP_018846;
CC         Note=Produced by alternative initiation at Met-1761 of isoform
CC         1. Has no catalytic activity;
...
FT   VAR_SEQ       1   1760       Missing (in isoform 6).
FT                                /FTId=VSP_018846.
FT   VAR_SEQ       1    922       Missing (in isoform 5).
FT                                /FTId=VSP_018845.
FT   VAR_SEQ     437    506       VSGIPKPEVAWFLEGTPVRRQEGSIEVYEDAGSHYLCLLKA
FT                                RTRDSGTYSCTASNAQGQVSCSWTLQVER -> G (in
FT                                isoform 2 and isoform 3B).
FT                                /FTId=VSP_004791.
FT   VAR_SEQ    1433   1439       DEVEVSD -> MKWRCQT (in isoform 3A,
FT                                isoform 3B and isoform 4).
FT                                /FTId=VSP_004792.
FT   VAR_SEQ    1473   1545       Missing (in isoform 4).
FT                                /FTId=VSP_004793.
FT   VAR_SEQ    1655   1705       Missing (in isoform 3A and isoform 3B).
FT                                /FTId=VSP_004794.
FT   VAR_SEQ    1790   1790       Missing (in isoform Del-1790).
FT                                /FTId=VSP_004795.
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing, Alternative initiation; Named isoforms=3;
CC         Comment=Isoform 1 and isoform 2 arise due to the use of two
CC         alternative first exons joined to a common exon 2 at the same
CC         acceptor site but in different reading frames, resulting in two
CC         completely different isoforms;
CC       Name=1; Synonyms=p16INK4a;
CC         IsoId=O77617-1; Sequence=Displayed;
CC       Name=3;
CC         IsoId=O77617-2; Sequence=VSP_018701;
CC         Note=Produced by alternative initiation at Met-35 of isoform 1.
CC         No experimental confirmation available;
CC       Name=2; Synonyms=p19ARF;
CC         IsoId=O77618-1; Sequence=External;
...
FT   VAR_SEQ       1     34       Missing (in isoform 3).
FT                                /FTId=VSP_004099.
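
As noted in the description of the 'Sequence' topic, the FTIds enable programs to create the sequence of a splice variant. The following Python sketch shows one way this could be done, under stated assumptions: the function name apply_var_seq, the toy sequence and the feature identifiers VSP_A and VSP_B are hypothetical, positions are 1-based and inclusive as in the FT lines, and a 'Missing' feature is represented by an empty replacement string.

```python
def apply_var_seq(sequence, features):
    """Build an isoform sequence from the displayed sequence (sketch).

    features: iterable of (start, end, replacement) tuples, one per VAR_SEQ
    feature of the isoform; start and end are 1-based and inclusive, and
    replacement is '' for 'Missing'.
    """
    result = sequence
    # Apply edits from the C-terminus towards the N-terminus so that earlier
    # coordinates are not shifted by later insertions or deletions.
    for start, end, replacement in sorted(features, key=lambda f: f[0], reverse=True):
        result = result[: start - 1] + replacement + result[end:]
    return result


# Hypothetical feature table: FTId -> (start, end, replacement).
VAR_SEQ = {
    "VSP_A": (1, 3, ""),      # Missing
    "VSP_B": (5, 7, "XY"),    # EFG -> XY
}

# Hypothetical isoforms, each listing the FTIds given in its 'Sequence=' field.
ISOFORMS = {
    "1": [],                  # Sequence=Displayed
    "2": ["VSP_A"],
    "3": ["VSP_A", "VSP_B"],
}

DISPLAYED = "ABCDEFGHIJ"      # toy sequence, not a real protein

for name, ftids in ISOFORMS.items():
    print(name, apply_var_seq(DISPLAYED, [VAR_SEQ[ftid] for ftid in ftids]))
```

Isoforms marked 'External' or 'Not described' cannot be built this way, because their sequence is either held in another entry or unknown.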
3.12.6. Syntax of the topic 'MASS SPECTROMETRY'
CC   -!- MASS SPECTROMETRY: Mass=mass(; Mass_error=error)?; Method=method; Range=ranges( (IsoformID))?(; Note=free_text)?; Source=references;

Where:

  • 'Mass=XXX' is the determined molecular weight (MW);
  • 'Mass_error=XX' (optional) is the accuracy or error range of the MW measurement;
  • 'Method=XX' is the ionization method;
  • 'Range=XX-XX[ (Name)]' is used to indicate what part of the protein sequence entry corresponds to the molecular weight. In case of multiple products, the name of the relevant isoform is given in parentheses;
  • 'Note={Free text}' (optional) is a comment in free text format;
  • 'Source=PubMed:/Ref.n' indicates the relevant reference.
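
The fields of a MASS SPECTROMETRY comment can be extracted by splitting on '; '. The following Python sketch is illustrative: the function name parse_mass_spectrometry is hypothetical, the input is the unwrapped text that follows "MASS SPECTROMETRY:", and it assumes that no field value (including the note) contains '; '.

```python
import re

# 'Range' value: the ranges, optionally followed by an isoform in parentheses.
RANGE_RE = re.compile(r"(?P<ranges>.+?)(?: \((?P<isoform>[^)]+)\))?")


def parse_mass_spectrometry(text):
    """Parse the unwrapped text of one MASS SPECTROMETRY comment (sketch)."""
    fields = {}
    for part in text.strip().rstrip(";").split("; "):
        key, _, value = part.partition("=")
        fields[key] = value
    m = RANGE_RE.fullmatch(fields["Range"])
    return {
        "mass": float(fields["Mass"]),
        # Mass_error and Note are optional in the format.
        "mass_error": float(fields["Mass_error"]) if "Mass_error" in fields else None,
        "method": fields["Method"],
        "ranges": m["ranges"],
        "isoform": m["isoform"],
        "note": fields.get("Note"),
        "source": fields["Source"],
    }


# Comments copied from the examples earlier in this section (unwrapped).
print(parse_mass_spectrometry(
    "Mass=24948; Mass_error=6; Method=MALDI; Range=1-228; Source=PubMed:11101899;"
))
print(parse_mass_spectrometry(
    "Mass=13822; Method=MALDI; Range=19-140 (P15522-2); Source=PubMed:10531593;"
))
```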
3.12.7. Syntax of the topic 'SEQUENCE CAUTION'

The format of the SEQUENCE CAUTION topic is:

CC   -!- SEQUENCE CAUTION:
CC       Sequence=Sequence; Type=Type;[ Positions=Positions;][ Note=Note;]

Where:

  • Sequence is the sequence which differs from the UniProtKB sequence. It is described by one of:
    • an EMBL protein identifier (with version number)
    • an EMBL accession number.