Properties:
1) Documentation before Sequence ... fixed format: fields of information ... field names and data
in specific Columns in the entry ...
2) Words (Field Names), beginning in Column 1, used as Delimiters in part ...
3) Much Documentation or Annotation possible ...
FEATURES Table ... detailed features of the Sequence: primary, 2ary, 3ary structure; mutations, ...
4) Sequence in fixed columnar format ... similar to GCG ... residue number ... 60 res per line ...
spaces in fixed columns ... sequence in lower case ...
Delimiters: Word LOCUS begins an Entry; word ORIGIN begins Sequence; // ends Entry
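The three delimiters above are enough to cut a GenBank flat file into entries. What follows is only a minimal sketch in Python, not a full GenBank parser: the function name split_genbank and the tiny sample entry (LOCUS name DEMO1) are made up for illustration, and it assumes that every sequence line carries a residue number followed by blocks of letters.

```python
def split_genbank(text):
    """Split GenBank flat-file text into (documentation, sequence) pairs."""
    entries = []
    doc_lines, seq_parts = [], []
    in_entry = in_sequence = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # The word LOCUS in column 1 begins a new entry.
            doc_lines, seq_parts = [line], []
            in_entry, in_sequence = True, False
        elif not in_entry:
            continue  # ignore anything outside an entry
        elif line.startswith("//"):
            # // ends the entry.
            entries.append(("\n".join(doc_lines), "".join(seq_parts)))
            in_entry = in_sequence = False
        elif line.startswith("ORIGIN"):
            # The word ORIGIN begins the sequence block.
            in_sequence = True
        elif in_sequence:
            # Keep letters only: drops the residue number and the spaces.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        else:
            doc_lines.append(line)
    return entries


# Hypothetical, heavily shortened entry used only to exercise the function.
sample = """\
LOCUS       DEMO1        20 aa
DEFINITION  hypothetical demo entry.
ORIGIN
        1 vqdaveipps llvsatydsq
//
"""

for doc, seq in split_genbank(sample):
    print(doc.splitlines()[0])
    print(seq)
```

The documentation part is returned as raw text; pulling out individual fields (DEFINITION, ACCESSION, FEATURES, ...) would be a second pass keyed on the field names in Column 1.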
GenBank Numbers used to Key or uniquely Identify Entries:
1. SeqID:
Initially, the Entry Name in the LOCUS line was used as the only key to a GenBank entry
This name attempted to mimic the organism and function of the gene encoded
Problem: impossible to do this systematically and uniquely with new knowledge ...
These Entry Names now change over time...
2. Accession Numbers:
The Accession Number was then introduced, to be the primary key to reference an entry in the database ...
Will always stay with the entry, even when entry is updated
Nomenclature: either 1 letter + 5 digits (eg: X79797) or 2 letters + 6 digits (eg: AF028831) ... new one: NC_001140 ... a 'RefSeq' entry
the letter used reflects which of the three databases (GenBank, EMBL, DDBJ) is the primary database ...
Problem:
a. Same Accession Number was assigned to new versions of the same sequence!
Thus, a given Accession Number would stay with a given Entry ... but could be associated with more than one Entry!
Result: Entry retrieved from a given Accession Number was not unique ...
b. now also have Secondary Accession Numbers!
These were introduced to give some notion of the history of the Entry
Example from below:
ACCESSION X79797 X70490 X74327
All Accession Numbers are on a single line, the ACCESSION line
The Primary key is the first number, all others are Secondary keys
Origin of Secondary keys: multiple origins, reflecting the history of GenBank ... some unknown origins ...
1. Two entries may have been merged into one entry, with one of the two AccNums becoming a Secondary
2. Primary Accession Number may have replaced the Secondary AccNum which otherwise no longer exists
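The rule "first number is Primary, all others are Secondary" can be sketched in a few lines of Python. The function name parse_accession_line is hypothetical, and the sketch assumes the whole ACCESSION field sits on one whitespace-separated line, as in the example above.

```python
def parse_accession_line(line):
    """Return (primary, [secondary, ...]) from a GenBank ACCESSION line."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "ACCESSION":
        raise ValueError("not an ACCESSION line: %r" % line)
    # The first number is the Primary key; all others are Secondary keys.
    return fields[1], fields[2:]


primary, secondary = parse_accession_line("ACCESSION X79797 X70490 X74327")
print(primary)    # X79797
print(secondary)  # ['X70490', 'X74327']
```

Note that the line itself says nothing about why a Secondary number exists (merge, replacement, unknown); that history is not recoverable from the ACCESSION line alone.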
3. gi (genInfo) Number:
number assigned by GenBank to each nucleotide and protein sequence
SeqID retained, reflects policy used by the sequence source database: GB, EMBL, DDBJ, SwissProt, PIR, ...
Rationale: any new sequence entering GenBank gets a new gi number, even if it is a new version of an existing sequence
Thus, two different sequences which are different versions of each other have the SAME Accession Number but DIFFERENT gi Numbers ...
4. NID and PID Numbers:
These are the gi numbers for Nucleotide and Protein entries at GenBank
5. More Recently: new SeqID number = Accession . Version
This approach, introduced about 2 yrs ago, is expected to replace the gi number approach
Rationale:
Accession Number: identifies a sequence record ...
Version Number: tracks changes to the sequence itself ...
Advantages:
1. can just use the Accession Number to retrieve the latest version
2. can record the Version Number in publications, to note which sequence was actually used
3. easy to determine the history of changes by noting the Version Number
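A small Python sketch of how the Accession . Version form can be taken apart and compared. The function names are made up for this illustration, and the ".2" identifier in the usage lines is a hypothetical later version, not a real record.

```python
def split_seqid(seqid):
    """Split 'Accession.Version' into (accession, version).

    A bare accession (no '.Version' suffix) yields version None.
    """
    accession, dot, version = seqid.rpartition(".")
    if not dot or not version.isdigit():
        return seqid, None
    return accession, int(version)


def sequence_changed(seqid_a, seqid_b):
    """True if both IDs name the same record but different sequence versions."""
    acc_a, ver_a = split_seqid(seqid_a)
    acc_b, ver_b = split_seqid(seqid_b)
    if acc_a != acc_b:
        raise ValueError("IDs refer to different records")
    return ver_a != ver_b


print(split_seqid("AF028831.1"))   # ('AF028831', 1)
print(split_seqid("AF028831"))     # ('AF028831', None)
# 'AF028831.2' is a hypothetical later version, used only for illustration.
print(sequence_changed("AF028831.1", "AF028831.2"))  # True
```

The accession part alone is what one would hand to a retrieval system to get the latest version; the full Accession.Version string is what one would cite in a publication.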
6. Most Recently: new Entries associated with the RefSeq project ...
These are the NCBI Reference sequences ... built via a rather elaborate scheme.
An attempt to get reliable "biological annotation" into a stable, reference set of GenBank entries ...
More info available at the RefSeq FAQ page ...
Identification of these new GenBank entries:
Entrez and BLAST results both present the following formatted text as part of the returned result:
gi|4557284|ref|NM_000646.1|AGLf| [4557284]
Data Element    Comment
gi              "GenBank Identifier", or sequence ID number. "gi|" denotes that the number which follows is a unique sequence id. Any change to the sequence data will result in a new gi number.
4557284         The gi number.
ref             Indicates that RefSeq is the source database.
NM_000646       The RefSeq accession number.
AGLf            The LOCUS name; this abbreviation is displayed in the LOCUS field of the record. The capitalized portion of this abbreviation is equal to the current gene symbol. The availability of splice variant records can be quickly determined by noting the lowercase alphabetic character appended to the gene symbol.
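The bar-delimited identifier gi|4557284|ref|NM_000646.1|AGLf| can be decomposed mechanically. The Python sketch below is only an illustration of the data elements described above: the function name and the dictionary keys are made up here, and it assumes exactly the five-field layout shown (gi, number, source database, accession.version, LOCUS name).

```python
import string


def parse_gi_identifier(identifier):
    """Break 'gi|<number>|<db>|<accession.version>|<locus>|' into its parts."""
    parts = identifier.strip().split("|")
    if len(parts) < 5 or parts[0] != "gi":
        raise ValueError("unexpected identifier layout: %r" % identifier)
    gi_number, source_db, seqid, locus = parts[1:5]
    accession, _, version = seqid.partition(".")
    # Capitalized portion = gene symbol; trailing lowercase letter(s) mark a
    # splice variant record.
    gene_symbol = locus.rstrip(string.ascii_lowercase)
    return {
        "gi": int(gi_number),
        "source_db": source_db,
        "accession": accession,
        "version": int(version) if version.isdigit() else None,
        "locus": locus,
        "gene_symbol": gene_symbol,
        "variant": locus[len(gene_symbol):],
    }


info = parse_gi_identifier("gi|4557284|ref|NM_000646.1|AGLf|")
print(info["gi"], info["accession"], info["gene_symbol"], info["variant"])
```

Identifiers from other source databases may use a different number of fields, so a real tool would need to branch on the database tag rather than assume this layout.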
Example:
Protein (GenPept) Sequence:
old format:
LOCUS       2599106       842 aa                    05-JAN-1998
DEFINITION  DNA polymerase.
ACCESSION   2599106
PID         g2599106
DBSOURCE    GENBANK: locus AF028831, accession AF028831
KEYWORDS    .
SOURCE      Cenarchaeum symbiosum.
  ORGANISM  Cenarchaeum symbiosum
            Archaea; Crenarchaeota; Cenarchaeum.
REFERENCE   1  (residues 1 to 842)
  AUTHORS   Schleper,C., Swanson,R.V., Mathur,E.J. and DeLong,E.F.
  TITLE     Characterization of a DNA polymerase from the uncultivated
            psychrophilic archaeon Cenarchaeum symbiosum
  JOURNAL   J. Bacteriol. 179 (24), 7803-7811 (1997)
  MEDLINE   98062213
REFERENCE   2  (residues 1 to 842)
  AUTHORS   Schleper,C.M., Swanson,R.V., Mathur,E.J. and DeLong,E.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-OCT-1997) Monterey Bay Aquarium Research Institute,
            PO Box 628, Moss Landing, CA 95039, USA
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..842
                     /organism="Cenarchaeum symbiosum"
                     /db_xref="taxon:46770"
     Protein         <1..842
                     /product="DNA polymerase"
     CDS             1..842
                     /coded_by="AF028831:<1..2529"
                     /transl_table=11
ORIGIN
        1 vqdaveipps llvsatydsq agavvlkfye pesqkivhwt dntghkpycy trqppselge
       61 legredvlgt eqvmrhdlia dkdvpvtkit vadplaiggt nseksirnim dtwesdikyy
      121 enylydkslv vgryysvsgg kviphdmpis devklalksl lwdkvvdegm adrkefrefi
      181 agwadllnqp iprirrlsfd ievdseegri pdpkisdrrv tavgfaatdg lkqvfvlrsg
      241 aeegengvtp gvevvfydke admirdalsv igsypfvlty ngddfdmpym lnrarrlgvs
      301 dsdiplymmr dsatlrhgvh ldlyrtfsnr sfqlyafaak ytdyslnsvt kamlgegkvd
      361 ygvklgdltl yqtanycyhd arltlelstf gneilmdllv vtsriarmpi ddmsrmgvsq
      421 wirsllyyeh rqrnaliprr delegrsrev sndavikdkk frgglvvepe egihfdvtvm
      481 dfaslypsii kvrnlsyetv rcvhaeckkn tipdtnhwvc tknngltsmi igslrdlrvn
      541 yykslsksts iteeqrqqyt visqalkvvl nasygvmgae ifplyflpaa eattavgryi
      601 imqtishceq mgvrvlygdt dslfikdpee rqiheiveha kkehgvelev dkeyryvvls
      661 nrkknyfgvt ragkvdvkgl tgkkshtppf ikelfyslld ilsgvesede fesakmrisk
      721 aiaacgkrle erqiplvdla fnvmiskaps eyvktvpqhi raarllenar evkkgdiisy
      781 vkvmnktgvk pvemaragev dtskylefme stldqltssm gldfdeilgk pkqtgmeqff
      841 fk
//
new format:
LOCUS       AAB94881      842 aa            BCT       05-JAN-1998
DEFINITION  DNA polymerase [Cenarchaeum symbiosum].
ACCESSION   AAB94881
PID         g2599106
VERSION     AAB94881.1  GI:2599106
DBSOURCE    locus AF028831 accession AF028831.1
KEYWORDS    .
SOURCE      Cenarchaeum symbiosum.
  ORGANISM  Cenarchaeum symbiosum
            Archaea; Crenarchaeota; Cenarchaeum.
REFERENCE   1  (residues 1 to 842)
  AUTHORS   Schleper,C., Swanson,R.V., Mathur,E.J. and DeLong,E.F.
  TITLE     Characterization of a DNA polymerase from the uncultivated
            psychrophilic archaeon Cenarchaeum symbiosum
  JOURNAL   J. Bacteriol. 179 (24), 7803-7811 (1997)
  MEDLINE   98062213
REFERENCE   2  (residues 1 to 842)
  AUTHORS   Schleper,C.M., Swanson,R.V., Mathur,E.J. and DeLong,E.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-OCT-1997) Monterey Bay Aquarium Research Institute,
            PO Box 628, Moss Landing, CA 95039, USA
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..842
                     /organism="Cenarchaeum symbiosum"
                     /db_xref="taxon:46770"
     Protein         <1..842
                     /product="DNA polymerase"
                     /name="archaeal family B DNA polymerase"
     CDS             1..842
                     /coded_by="AF028831.1:<1..2529"
                     /transl_table=11
ORIGIN
        1 vqdaveipps llvsatydsq agavvlkfye pesqkivhwt dntghkpycy trqppselge
       61 legredvlgt eqvmrhdlia dkdvpvtkit vadplaiggt nseksirnim dtwesdikyy
      121 enylydkslv vgryysvsgg kviphdmpis devklalksl lwdkvvdegm adrkefrefi
      181 agwadllnqp iprirrlsfd ievdseegri pdpkisdrrv tavgfaatdg lkqvfvlrsg
      241 aeegengvtp gvevvfydke admirdalsv igsypfvlty ngddfdmpym lnrarrlgvs
      301 dsdiplymmr dsatlrhgvh ldlyrtfsnr sfqlyafaak ytdyslnsvt kamlgegkvd
      361 ygvklgdltl yqtanycyhd arltlelstf gneilmdllv vtsriarmpi ddmsrmgvsq
      421 wirsllyyeh rqrnaliprr delegrsrev sndavikdkk frgglvvepe egihfdvtvm
      481 dfaslypsii kvrnlsyetv rcvhaeckkn tipdtnhwvc tknngltsmi igslrdlrvn
      541 yykslsksts iteeqrqqyt visqalkvvl nasygvmgae ifplyflpaa eattavgryi
      601 imqtishceq mgvrvlygdt dslfikdpee rqiheiveha kkehgvelev dkeyryvvls
      661 nrkknyfgvt ragkvdvkgl tgkkshtppf ikelfyslld ilsgvesede fesakmrisk
      721 aiaacgkrle erqiplvdla fnvmiskaps eyvktvpqhi raarllenar evkkgdiisy
      781 vkvmnktgvk pvemaragev dtskylefme stldqltssm gldfdeilgk pkqtgmeqff
      841 fk
//
2. EMBL/Swiss-Prot/TREMBL Format
SwissProt and TREMBL are Protein, EMBL is DNA ... same formats
TREMBL is a "TRanslation of EMBL", i.e. the cognate of GenPept relative to GenBank
Properties:
1) EMBL DNA database is at EBI ...
2) SwissProt main online database is at ExPASy ... usually has links within the entries ...
3) Documentation before Sequence ... fixed format: fields of information ... field names and data
in specific Columns in the entry ...
4) 2 letter symbols (Field Names), beginning in Column 1, used as Delimiters in part ...
5) Much Documentation or Annotation possible ...
FEATURES Table (FT lines)... detailed features of the Sequence: primary, 2ary, 3ary structure ...
mutations, ...
6) Sequence in fixed columnar format ... similar to GCG ... residue number ... 60 res per line ...
spaces in fixed columns ... sequence in UPPER case ...
Delimiters: Word ID begins an Entry; word SQ begins Sequence; // ends Entry
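Because every documentation line starts with a 2 letter symbol in Column 1, an EMBL/SwissProt entry can be read by grouping lines on that symbol. The Python sketch below is a minimal illustration under that assumption; the function name parse_embl_entry is made up, and the sample is a truncated fragment (the SQ header text is a placeholder), not a complete entry.

```python
def parse_embl_entry(text):
    """Return ({two_letter_code: [data, ...]}, sequence) for one entry."""
    fields = {}
    seq_parts = []
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            break  # // ends the entry
        if in_sequence:
            # Keep letters only: drops spaces and any residue counts.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, data = line[:2], line[2:].strip()
        if code == "SQ":
            in_sequence = True  # SQ begins the sequence block
        fields.setdefault(code, []).append(data)
    return fields, "".join(seq_parts)


# Truncated fragment for illustration only; the SQ header text is a placeholder.
sample = """\
ID   DHE2_CLOSY     STANDARD;      PRT;   449 AA.
AC   P24295;
DT   01-MAR-1992 (REL. 21, CREATED)
DT   01-APR-1993 (REL. 25, LAST SEQUENCE UPDATE)
SQ   SEQUENCE (truncated for illustration)
     SKYVDRVIAE VEKKYADEPE
//
"""

fields, seq = parse_embl_entry(sample)
print(fields["AC"])
print(len(fields["DT"]), seq)
```

Repeated symbols (DT, RA, CC, DR, FT ...) accumulate into a list in file order, which keeps multi-line fields together without needing to know each field's internal syntax.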
Example:
SwissProt Protein entry:
ID   DHE2_CLOSY     STANDARD;      PRT;   449 AA.
AC   P24295;
DT   01-MAR-1992 (REL. 21, CREATED)
DT   01-APR-1993 (REL. 25, LAST SEQUENCE UPDATE)
DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE   NAD-SPECIFIC GLUTAMATE DEHYDROGENASE (EC 1.4.1.2) (NAD-GDH).
GN   GDH.
OS   CLOSTRIDIUM SYMBIOSUM (BACTEROIDES SYMBIOSUS).
OC   PROKARYOTA; FIRMICUTES; ENDOSPORE-FORMING RODS AND COCCI; BACILLACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 92267007. [NCBI, Geneva]
RA   TELLER J.K., SMITH R.M., MCPHERSON M.J., ENGEL P.C., GUEST J.R.;
RL   EUR. J. BIOCHEM. 206:151-159(1992).
RN   [2]
RP   PRELIMINARY PARTIAL SEQUENCE.
RX   MEDLINE; 92062694. [NCBI, Geneva]
RA   LILLEY K.S., BAKER P.J., BRITTON K.L., STILLMAN T.J., BROWN P.E.,
RA   MOIR A.J.G., ENGEL P.C., RICE D.W., BELL J.E., BELL E.;
RL   BIOCHIM. BIOPHYS. ACTA 1080:191-197(1991).
RN   [3]
RP   PARTIAL SEQUENCE, AND MODIFICATION OF SOME LYSINES.
RX   MEDLINE; 92339441. [NCBI, Geneva]
RA   LILLEY K.S., ENGEL P.C.;
RL   EUR. J. BIOCHEM. 207:533-540(1992).
RN   [4]
RP   X-RAY CRYSTALLOGRAPHY (1.96 ANGSTROMS).
RX   MEDLINE; 92204934. [NCBI, Geneva]
RA   BAKER P.J., BRITTON K.L., ENGEL P.C., FARRANTS G.W., LILLEY K.S.,
RA   RICE D.W., STILLMAN T.J.;
RL   PROTEINS 12:75-86(1992).
CC   -!- CATALYTIC ACTIVITY: L-GLUTAMATE + H(2)O + NAD(+) = 2-OXOGLUTARATE +
CC       NH(3) + NADH.
CC   -!- PATHWAY: FIRST STEP IN THE HYDROXYGLUTARATE PATHWAY (ROUTE FOR THE
CC       ENERGY-YIELDING GLUTAMATE FERMENTATION).
CC   -!- SUBUNIT: HOMOHEXAMER.
CC   -!- SIMILARITY: BELONGS TO THE GLU/LEU/PHE/VAL DEHYDROGENASES FAMILY.
DR   EMBL; Z11747; G49280; -. [EMBL / GenBank / DDBJ] [CodingSequence]
DR   PIR; S18361; S18361.
DR   PIR; S22403; S22403.
DR   PDB; 1HRD; 12-MAR-97.
DR   [ENTRY / RASMOL / 3D IMAGE / HSSP ENTRY / SCOP]
DR   SWISS-3DIMAGE; DHE2_CLOSY.
DR   PROSITE; PS00074; GLFV_DEHYDROGENASE; 1.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   OXIDOREDUCTASE; NAD; 3D-STRUCTURE.
FT   INIT_MET      0      0
FT   ACT_SITE    125    125
SQ   SEQUENCE   449 AA;  49165 MW;  8E22A020 CRC32;
     SKYVDRVIAE VEKKYADEPE FVQTVEEVLS SLGPVVDAHP EYEEVALLER MVIPERVIEF
     RVPWEDDNGK VHVNTGYRVQ FNGAIGPYKG GLRFAPSVNL SIMKFLGFEQ AFKDSLTTLP
     MGGAKGGSDF DPNGKSDREV MRFCQAFMTE LYRHIGPDID VPAGDLGVGA REIGYMYGQY
     RKIVGGFYNG VLTGKARSFG GSLVRPEATG YGSVYYVEAV MKHENDTLVG KTVALAGFGN
     VAWGAAKKLA ELGAKAVTLS GPDGYIYDPE GITTEEKINY MLEMRASGRN KVQDYADKFG
     VQFFPGEKPW GQKVDIIMPC ATQNDVDLEQ AKKIVANNVK YYIEVANMPT TNEALRFLMQ
     QPNMVVAPSK AVNAGGVLVS GFEMSQNSER LSWTAEEVDS KLHQVMTDIH DGSAAAAERY
     GLGYNLVAGA NIVGFQKIAD AMMAQGIAW
//
3. PIR Format
most recent format of original Dayhoff NBRF ... protein oriented
Properties:
1) Header line: begins with > delimiter ... two symbols, then semicolon ... then EntryName ... no spaces
EntryName often limited to 6 or 8 or 12 characters, usually Letters or Numbers
2) Second line - a Description line: free text ... often: molecule name - hyphen - organism
3) Sequence: multiple lines ... Spaces, upper/lower case ok, multiple symbols, variable number per line
4) Sequence ends with a * - delimiter for end of sequence ... indicates Stop Codon for protein seqs
5) Annotation or documentation follows ... can be free text ...
True PIR uses symbols to distinguish types of annotation:
C; comment ... R; reference ... A; protein annotation ...
Delimiters: > at beginning of Entry; * at end of Sequence; none at end of Entry
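The PIR layout described above (header line, description line, sequence up to *, then annotation) can be read with a short Python sketch. The function name parse_pir and the returned keys are hypothetical, the sample sequence is truncated for illustration, and the sketch assumes a single entry whose * terminator is present.

```python
def parse_pir(text):
    """Parse one PIR-format entry into a dict."""
    lines = text.splitlines()
    header = lines[0]
    if not header.startswith(">") or ";" not in header:
        raise ValueError("PIR header must look like '>P1;EntryName'")
    # Two symbols, then a semicolon, then the EntryName.
    seq_type, entry_name = header[1:].split(";", 1)
    description = lines[1].strip()
    seq_parts, annotation = [], []
    seq_done = False
    for line in lines[2:]:
        if seq_done:
            annotation.append(line)  # everything after * is annotation
        elif "*" in line:
            seq_parts.append(line.partition("*")[0])
            seq_done = True  # * ends the sequence
        else:
            seq_parts.append(line)
    # Spaces are allowed inside the sequence, so squeeze them out.
    sequence = "".join("".join(seq_parts).split())
    return {
        "type": seq_type,
        "name": entry_name.strip(),
        "description": description,
        "sequence": sequence,
        "annotation": annotation,
    }


# Truncated sample (first 20 residues only), for illustration.
sample = """\
>P1;CATPAA
Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmids
M E K K I T G Y T T
V D I S Q W H R K E *
C;Species: Escherichia coli
"""

entry = parse_pir(sample)
print(entry["name"], entry["sequence"])
```

Since there is no delimiter at the end of an Entry, a multi-entry file would have to be split on the next line beginning with > before calling a function like this one.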
Example:
a. The following is an example of a PIR-formatted Protein sequence obtained from the PIR protein library PROTEIN using the COPY command of the program PSQ. The documentation comments following the sequence are in the PIR-NBRF format.
>P1;CATPAA
Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmids
M E K K I T G Y T T V D I S Q W H R K E H F E A F Q S V A Q
C T Y N Q T V Q L D I T A F L K T V K K N K H K F Y P A F I
H I L A R L M N A H P E F R M A M K D G E L V I W D S V H P
C Y T V F H E Q T E T F S S L W S E Y H D D F R Q F L H I Y
S Q D V A C Y G E N L A Y F P K G F I E N M F F V S A N P W
V S F T S F D L N V A N M D N F F A P V F T M G K Y Y T Q G
D K V L M P L A I Q V H H A V C D G F H V G R M L N E L Q Q
Y C D E W Q G G A *
C;Species: Escherichia coli
R;Shaw, W.V., Packman, L.C., Burleigh, B.D., Dell, A., Morris, H.R., and Hartley, B.S.
Nature 282, 870-872, 1979 (Plasmid JR66b, complete sequence with experimental details)
A;The chloramphenicol binding site may include regions near residues 31 and 192-196.
Lys-136 may be involved in the formation of salt bridges between the chains.
R;Alton, N.K., and Vapnek, D.
Nature 282, 864-869, 1979 (Sequence translated from the nucleotide sequence for the transposable genetic element Tn9)
A;Residues 77-219 correspond to a probable fusidic acid resistance protein.
R;Marcoli, R., Iida, S., and Bickle, T.A.
FEBS Lett. 110, 11-14, 1980 (Sequence translated from the nucleotide sequence for the transposon, Tncam204, derived from the R plasmid NR1 [=R100])
C;This enzyme, a type I variant mediated by an R plasmid in E. coli, exists as a tetramer of identical chains.
4. FASTA/Pseudo-FASTA
most commonly used format with a minimum of descriptive information
Properties: ... similar to PIR but abbreviated ...
1) Header line: begins with > delimiter followed by EntryName ... no spaces
EntryName often limited to 6 or 8 or 12 characters, usually Letters or Numbers
After EntryName comes a space followed by free text: Description of entry
NOTE: this first Space is Delimiter between EntryName and entry Description ...
2) Sequence begins on line 2 and continues for as many lines as needed ...
Usually: 80 residues per line or less ... Upper or lower case permitted ... no spaces ...
standard code letters used, no numbers ... no * at end
3) No Annotation or Documentation permitted except in brief line 1 description ...
Delimiters: > at beginning of Entry; none at end of Sequence; none at end of Entry
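With only one delimiter, FASTA is the simplest of these formats to read. The Python sketch below follows the rules listed above (> starts an Entry, the first Space separates EntryName from Description, the sequence runs until the next > or end of file); the function name parse_fasta is made up for this illustration and the sample holds only the first sequence line of the entry.

```python
def parse_fasta(text):
    """Return a list of (entry_name, description, sequence) tuples."""
    records = []
    name = description = None
    chunks = []
    for line in text.splitlines():
        line = line.rstrip()
        if line.startswith(">"):
            # A new > line closes the previous entry, if there was one.
            if name is not None:
                records.append((name, description, "".join(chunks)))
            # The first space separates EntryName from the Description.
            name, _, description = line[1:].partition(" ")
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        # No end-of-entry delimiter: end of input closes the last entry.
        records.append((name, description, "".join(chunks)))
    return records


# Sample holding only the first 60 residues, for illustration.
sample = """\
>CATPAA Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmids
MEKKITGYTTVDISQWHRKEHFEAFQSVAQCTYNQTVQLDITAFLKTVKKNKHKFYPAFI
"""

for name, description, sequence in parse_fasta(sample):
    print(name, len(sequence))  # CATPAA 60
```

Because the EntryName ends at the first space, any space placed inside the name by mistake silently moves the rest of it into the Description, which is why the "no spaces" rule matters.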
Example:
>CATPAA Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmids
MEKKITGYTTVDISQWHRKEHFEAFQSVAQCTYNQTVQLDITAFLKTVKKNKHKFYPAFI
HILARLMNAHPEFRMAMKDGELVIWDSVHPCYTVFHEQTETFSSLWSEYHDDFRQFLHIY
SQDVACYGENLAYFPKGFIENMFFVSANPWVSFTSFDLNVANMDNFFAPVFTMGKYYTQG