Properties:
1) Documentation before Sequence ... fixed format: fields of information ... field names and data in specific Columns in the entry ...
2) Words (Field Names), beginning in Column 1, used as Delimiters in part ...
3) Much Documentation or Annotation possible ... FEATURES Table ... detailed features of the Sequence: primary, 2ary, 3ary structure; mutations, ...
4) Sequence in fixed columnar format ... similar to GCG ... residue number ... 60 res per line ... spaces in fixed columns ... sequence in lower case ...
Delimiters: Word LOCUS begins an Entry; word ORIGIN begins Sequence; // ends Entry
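The three delimiters above are enough to cut a GenBank flat file into entries and to separate documentation from sequence. The following is a minimal sketch under that assumption only; the function names `split_genbank_entries` and `parse_entry` are hypothetical, and no field other than LOCUS, ORIGIN and // is interpreted.

```python
from typing import Dict, Iterator, List


def split_genbank_entries(text: str) -> Iterator[List[str]]:
    """Yield each entry as a list of lines, from LOCUS up to (not including) //."""
    entry: List[str] = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            entry = [line]              # LOCUS in column 1 begins a new entry
        elif line.startswith("//"):
            if entry:
                yield entry             # // ends the entry
            entry = []
        elif entry:
            entry.append(line)


def parse_entry(lines: List[str]) -> Dict[str, str]:
    """Separate the documentation from the sequence at the ORIGIN line."""
    header: List[str] = []
    residues: List[str] = []
    in_sequence = False
    for line in lines:
        if line.startswith("ORIGIN"):
            in_sequence = True
        elif in_sequence:
            # Keep letters only: drops residue numbers and column spacing.
            residues.append("".join(ch for ch in line if ch.isalpha()))
        else:
            header.append(line)
    return {"header": "\n".join(header), "sequence": "".join(residues)}
```

Calling `parse_entry` on each item yielded by `split_genbank_entries` gives the documentation block and the sequence as one unbroken lower-case string.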
GenBank Numbers used to Key or uniquely Identify Entries:
1. SeqID:
Initially, the Entry Name in the LOCUS line was used as the only key to a GenBank entry
This name attempted to reflect the organism and the function of the gene encoded
Problem: impossible to do this systematically and uniquely as new knowledge accumulates ...
As a result, these Entry Names now change over time ...
2. Accession Numbers:
The Accession Number was then introduced, to be the primary key to reference an entry in the database ...
Will always stay with the entry, even when the entry is updated
Nomenclature: one letter + 5 digits (eg: X79797) or two letters + 6 digits (eg: AF028831) ... new one: NC_001140 ... a 'RefSeq' entry
The letter used reflects which of the three databases (GenBank, EMBL, DDBJ) is the primary database ...
Problem:
a. The same Accession Number was assigned to new versions of the same sequence!
Thus, a given Accession Number would stay with a given Entry ... but could be associated with more than one Entry!
Result: the Entry retrieved with a given Accession Number was not unique ...
b. Now also have Secondary Accession Numbers!
These were introduced to give some notion of the history of the Entry
Example from below:
ACCESSION X79797 X70490 X74327
All Accession Numbers are on a single line, the ACCESSION line
The Primary key is the first number, all others are Secondary keys
Origin of Secondary keys: multiple origins, reflecting the history of GenBank ... some unknown origins ...
1. Two entries may have been merged into one entry, with one of the two AccNums becoming a Secondary
2. Primary Accession Number may have replaced the Secondary AccNum, which otherwise no longer exists
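The Primary/Secondary rule for the ACCESSION line can be sketched in a few lines. This is only an illustration of the rule stated above (first number is Primary, the rest are Secondary); the function name `parse_accession_line` is hypothetical.

```python
from typing import List, Tuple


def parse_accession_line(line: str) -> Tuple[str, List[str]]:
    """Return (primary, secondaries) from a single ACCESSION line."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "ACCESSION":
        raise ValueError(f"not an ACCESSION line: {line!r}")
    # First number is the Primary key; all others are Secondary keys.
    return fields[1], fields[2:]


primary, secondary = parse_accession_line("ACCESSION X79797 X70490 X74327")
print(primary)    # X79797
print(secondary)  # ['X70490', 'X74327']
```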
3. gi (GenInfo) Number:
Number assigned by GenBank to each nucleotide and protein sequence
SeqID retained, reflects policy used by the sequence source database: GB, EMBL, DDBJ, SwissProt, PIR, ...
Rationale: any new sequence entering GenBank gets a new gi number, even if it is a new version of an existing sequence
Thus, two sequences which are different versions of each other have the SAME Accession Number but DIFFERENT gi Numbers ...
4. NID and PID Numbers:
These are the gi numbers for Nucleotide and Protein entries at GenBank
5. More Recently: new SeqID number = Accession.Version
This approach, introduced about 2 yrs ago, is expected to replace the gi number approach
Rationale:
Accession Number: identifies a sequence record ...
Version Number: tracks changes to the sequence itself ...
Advantages:
1. can just use the Accession Number to retrieve the latest version
2. can record the Version Number in publications, to note which sequence was actually used
3. easy to determine the history of changes by noting the Version Number
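Splitting an Accession.Version identifier into its two parts might look like the sketch below. The regular expression is an assumption that only covers the identifier shapes quoted in these notes (X79797, AF028831.1, AAB94881.1, NC_001140); the names `SeqId` and `parse_seq_id` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class SeqId(NamedTuple):
    accession: str           # identifies the sequence record
    version: Optional[int]   # tracks changes to the sequence itself


# Assumed shape: letters, optional underscore, digits, optional ".version".
_SEQ_ID = re.compile(r"^([A-Z]+_?\d+)(?:\.(\d+))?$")


def parse_seq_id(text: str) -> SeqId:
    match = _SEQ_ID.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised identifier: {text!r}")
    version = int(match.group(2)) if match.group(2) else None
    return SeqId(match.group(1), version)


print(parse_seq_id("AF028831.1"))  # SeqId(accession='AF028831', version=1)
print(parse_seq_id("NC_001140"))   # SeqId(accession='NC_001140', version=None)
```

Two identifiers with the same `accession` but different `version` then refer to the same record with a changed sequence, which is the case the bare Accession Number could not distinguish.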
6. Most Recently: new Entries associated with the RefSeq project ...
These are the NCBI Reference Sequences ... built via a rather elaborate scheme
An attempt to get reliable "biological annotation" into a stable, reference set of GenBank entries ...
More info available at the RefSeq FAQ page ...
Identification of these new GenBank entries:
Entrez and BLAST results both present the following formatted text as part of the returned result:

gi|4557284|ref|NM_000646.1|AGLf| [4557284]

Data Element   Comment
gi             "GenBank Identifier", or sequence ID number. "gi|" denotes that the number which follows is a unique sequence id. Any change to the sequence data will result in a new gi number.
4557284        The gi number.
ref            Indicates that RefSeq is the source database.
NM_000646      The RefSeq accession number.
AGLf           The LOCUS name; this abbreviation is displayed in the LOCUS field of the record. The capitalized portion of this abbreviation is equal to the current gene symbol. The availability of splice variant records can be quickly determined by noting the lowercase alphabetic character appended to the gene symbol.
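The pipe-delimited identifier can be taken apart field by field. The sketch below assumes exactly the layout of the example above (gi, number, source database tag, Accession.Version, LOCUS name) and is not a general parser for every identifier NCBI produces; `parse_entrez_id` is a hypothetical name.

```python
import string
from typing import Dict, Optional, Union


def parse_entrez_id(text: str) -> Dict[str, Union[str, int, None]]:
    """Split an identifier such as gi|4557284|ref|NM_000646.1|AGLf|."""
    fields = text.strip().split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError(f"unexpected identifier: {text!r}")
    accession, _, version = fields[3].partition(".")
    locus = fields[4]
    # Capitalized portion = gene symbol; trailing lowercase = splice variant.
    symbol = locus.rstrip(string.ascii_lowercase)
    return {
        "gi": int(fields[1]),
        "source_db": fields[2],
        "accession": accession,
        "version": int(version) if version else None,
        "locus": locus,
        "gene_symbol": symbol,
        "variant": locus[len(symbol):],
    }


info = parse_entrez_id("gi|4557284|ref|NM_000646.1|AGLf|")
print(info["accession"], info["version"])    # NM_000646 1
print(info["gene_symbol"], info["variant"])  # AGL f
```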
Example:
Protein (GenPept) Sequence:
old format:
LOCUS       2599106       842 aa                    05-JAN-1998
DEFINITION  DNA polymerase.
ACCESSION   2599106
PID         g2599106
DBSOURCE    GENBANK: locus AF028831, accession AF028831
KEYWORDS    .
SOURCE      Cenarchaeum symbiosum.
  ORGANISM  Cenarchaeum symbiosum
            Archaea; Crenarchaeota; Cenarchaeum.
REFERENCE   1  (residues 1 to 842)
  AUTHORS   Schleper,C., Swanson,R.V., Mathur,E.J. and DeLong,E.F.
  TITLE     Characterization of a DNA polymerase from the uncultivated
            psychrophilic archaeon Cenarchaeum symbiosum
  JOURNAL   J. Bacteriol. 179 (24), 7803-7811 (1997)
  MEDLINE   98062213
REFERENCE   2  (residues 1 to 842)
  AUTHORS   Schleper,C.M., Swanson,R.V., Mathur,E.J. and DeLong,E.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-OCT-1997) Monterey Bay Aquarium Research
            Institute, PO Box 628, Moss Landing, CA 95039, USA
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..842
                     /organism="Cenarchaeum symbiosum"
                     /db_xref="taxon:46770"
     Protein         <1..842
                     /product="DNA polymerase"
     CDS             1..842
                     /coded_by="AF028831:<1..2529"
                     /transl_table=11
ORIGIN
        1 vqdaveipps llvsatydsq agavvlkfye pesqkivhwt dntghkpycy trqppselge
       61 legredvlgt eqvmrhdlia dkdvpvtkit vadplaiggt nseksirnim dtwesdikyy
      121 enylydkslv vgryysvsgg kviphdmpis devklalksl lwdkvvdegm adrkefrefi
      181 agwadllnqp iprirrlsfd ievdseegri pdpkisdrrv tavgfaatdg lkqvfvlrsg
      241 aeegengvtp gvevvfydke admirdalsv igsypfvlty ngddfdmpym lnrarrlgvs
      301 dsdiplymmr dsatlrhgvh ldlyrtfsnr sfqlyafaak ytdyslnsvt kamlgegkvd
      361 ygvklgdltl yqtanycyhd arltlelstf gneilmdllv vtsriarmpi ddmsrmgvsq
      421 wirsllyyeh rqrnaliprr delegrsrev sndavikdkk frgglvvepe egihfdvtvm
      481 dfaslypsii kvrnlsyetv rcvhaeckkn tipdtnhwvc tknngltsmi igslrdlrvn
      541 yykslsksts iteeqrqqyt visqalkvvl nasygvmgae ifplyflpaa eattavgryi
      601 imqtishceq mgvrvlygdt dslfikdpee rqiheiveha kkehgvelev dkeyryvvls
      661 nrkknyfgvt ragkvdvkgl tgkkshtppf ikelfyslld ilsgvesede fesakmrisk
      721 aiaacgkrle erqiplvdla fnvmiskaps eyvktvpqhi raarllenar evkkgdiisy
      781 vkvmnktgvk pvemaragev dtskylefme stldqltssm gldfdeilgk pkqtgmeqff
      841 fk
//

new format:
LOCUS       AAB94881      842 aa            BCT       05-JAN-1998
DEFINITION  DNA polymerase [Cenarchaeum symbiosum].
ACCESSION   AAB94881
PID         g2599106
VERSION     AAB94881.1  GI:2599106
DBSOURCE    locus AF028831 accession AF028831.1
KEYWORDS    .
SOURCE      Cenarchaeum symbiosum.
  ORGANISM  Cenarchaeum symbiosum
            Archaea; Crenarchaeota; Cenarchaeum.
REFERENCE   1  (residues 1 to 842)
  AUTHORS   Schleper,C., Swanson,R.V., Mathur,E.J. and DeLong,E.F.
  TITLE     Characterization of a DNA polymerase from the uncultivated
            psychrophilic archaeon Cenarchaeum symbiosum
  JOURNAL   J. Bacteriol. 179 (24), 7803-7811 (1997)
  MEDLINE   98062213
REFERENCE   2  (residues 1 to 842)
  AUTHORS   Schleper,C.M., Swanson,R.V., Mathur,E.J. and DeLong,E.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-OCT-1997) Monterey Bay Aquarium Research
            Institute, PO Box 628, Moss Landing, CA 95039, USA
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..842
                     /organism="Cenarchaeum symbiosum"
                     /db_xref="taxon:46770"
     Protein         <1..842
                     /product="DNA polymerase"
                     /name="archaeal family B DNA polymerase"
     CDS             1..842
                     /coded_by="AF028831.1:<1..2529"
                     /transl_table=11
ORIGIN
        1 vqdaveipps llvsatydsq agavvlkfye pesqkivhwt dntghkpycy trqppselge
       61 legredvlgt eqvmrhdlia dkdvpvtkit vadplaiggt nseksirnim dtwesdikyy
      121 enylydkslv vgryysvsgg kviphdmpis devklalksl lwdkvvdegm adrkefrefi
      181 agwadllnqp iprirrlsfd ievdseegri pdpkisdrrv tavgfaatdg lkqvfvlrsg
      241 aeegengvtp gvevvfydke admirdalsv igsypfvlty ngddfdmpym lnrarrlgvs
      301 dsdiplymmr dsatlrhgvh ldlyrtfsnr sfqlyafaak ytdyslnsvt kamlgegkvd
      361 ygvklgdltl yqtanycyhd arltlelstf gneilmdllv vtsriarmpi ddmsrmgvsq
      421 wirsllyyeh rqrnaliprr delegrsrev sndavikdkk frgglvvepe egihfdvtvm
      481 dfaslypsii kvrnlsyetv rcvhaeckkn tipdtnhwvc tknngltsmi igslrdlrvn
      541 yykslsksts iteeqrqqyt visqalkvvl nasygvmgae ifplyflpaa eattavgryi
      601 imqtishceq mgvrvlygdt dslfikdpee rqiheiveha kkehgvelev dkeyryvvls
      661 nrkknyfgvt ragkvdvkgl tgkkshtppf ikelfyslld ilsgvesede fesakmrisk
      721 aiaacgkrle erqiplvdla fnvmiskaps eyvktvpqhi raarllenar evkkgdiisy
      781 vkvmnktgvk pvemaragev dtskylefme stldqltssm gldfdeilgk pkqtgmeqff
      841 fk
//
SwissProt and TREMBL are Protein, EMBL is DNA ... same formats
TREMBL is a "TRanslation of EMBL", i.e. the cognate of GenPept relative to GenBank
Properties:
1) EMBL DNA database is at EBI ...
2) SwissProt main online database is at ExPASy ... usually has links within the entries ...
3) Documentation before Sequence ... fixed format: fields of information ... field names and data in specific Columns in the entry ...
4) 2 letter symbols (Field Names), beginning in Column 1, used as Delimiters in part ...
5) Much Documentation or Annotation possible ... FEATURES Table (FT lines) ... detailed features of the Sequence: primary, 2ary, 3ary structure ... mutations, ...
6) Sequence in fixed columnar format ... similar to GCG ... residue number ... 60 res per line ... spaces in fixed columns ... sequence in UPPER case ...
Delimiters: Word ID begins an Entry; word SQ begins Sequence; // ends Entry
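Because every documentation line starts with a 2 letter symbol in Column 1, an entry can be read by grouping lines on that symbol. The sketch below assumes only what is listed above (ID begins the entry, SQ begins the sequence, // ends it); `parse_swissprot_entry` is a hypothetical name and the meaning of individual line codes is not interpreted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_swissprot_entry(lines: Iterable[str]) -> Tuple[Dict[str, List[str]], str]:
    """Group documentation lines by 2-letter code and collect the sequence."""
    fields: Dict[str, List[str]] = defaultdict(list)
    residues: List[str] = []
    in_sequence = False
    for line in lines:
        if line.startswith("//"):
            break                                     # // ends the entry
        if not line.strip():
            continue
        if in_sequence:
            residues.append("".join(line.split()))    # drop the column spacing
        else:
            code, data = line[:2], line[2:].strip()
            fields[code].append(data)
            if code == "SQ":
                in_sequence = True                    # sequence follows SQ line
    return dict(fields), "".join(residues)
```

Repeated codes (for example several DT or DR lines) simply accumulate in order under the same key.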
Example:
SwissProt Protein entry:
ID   DHE2_CLOSY     STANDARD;      PRT;   449 AA.
AC   P24295;
DT   01-MAR-1992 (REL. 21, CREATED)
DT   01-APR-1993 (REL. 25, LAST SEQUENCE UPDATE)
DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE   NAD-SPECIFIC GLUTAMATE DEHYDROGENASE (EC 1.4.1.2) (NAD-GDH).
GN   GDH.
OS   CLOSTRIDIUM SYMBIOSUM (BACTEROIDES SYMBIOSUS).
OC   PROKARYOTA; FIRMICUTES; ENDOSPORE-FORMING RODS AND COCCI; BACILLACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 92267007. [NCBI, Geneva]
RA   TELLER J.K., SMITH R.M., MCPHERSON M.J., ENGEL P.C., GUEST J.R.;
RL   EUR. J. BIOCHEM. 206:151-159(1992).
RN   [2]
RP   PRELIMINARY PARTIAL SEQUENCE.
RX   MEDLINE; 92062694. [NCBI, Geneva]
RA   LILLEY K.S., BAKER P.J., BRITTON K.L., STILLMAN T.J., BROWN P.E.,
RA   MOIR A.J.G., ENGEL P.C., RICE D.W., BELL J.E., BELL E.;
RL   BIOCHIM. BIOPHYS. ACTA 1080:191-197(1991).
RN   [3]
RP   PARTIAL SEQUENCE, AND MODIFICATION OF SOME LYSINES.
RX   MEDLINE; 92339441. [NCBI, Geneva]
RA   LILLEY K.S., ENGEL P.C.;
RL   EUR. J. BIOCHEM. 207:533-540(1992).
RN   [4]
RP   X-RAY CRYSTALLOGRAPHY (1.96 ANGSTROMS).
RX   MEDLINE; 92204934. [NCBI, Geneva]
RA   BAKER P.J., BRITTON K.L., ENGEL P.C., FARRANTS G.W., LILLEY K.S.,
RA   RICE D.W., STILLMAN T.J.;
RL   PROTEINS 12:75-86(1992).
CC   -!- CATALYTIC ACTIVITY: L-GLUTAMATE + H(2)O + NAD(+) = 2-OXOGLUTARATE +
CC       NH(3) + NADH.
CC   -!- PATHWAY: FIRST STEP IN THE HYDROXYGLUTARATE PATHWAY (ROUTE FOR THE
CC       ENERGY-YIELDING GLUTAMATE FERMENTATION).
CC   -!- SUBUNIT: HOMOHEXAMER.
CC   -!- SIMILARITY: BELONGS TO THE GLU/LEU/PHE/VAL DEHYDROGENASES FAMILY.
DR   EMBL; Z11747; G49280; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; S18361; S18361.
DR   PIR; S22403; S22403.
DR   PDB; 1HRD; 12-MAR-97.
DR   [ENTRY / RASMOL / 3D IMAGE / HSSP ENTRY / SCOP]
DR   SWISS-3DIMAGE; DHE2_CLOSY.
DR   PROSITE; PS00074; GLFV_DEHYDROGENASE; 1.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   OXIDOREDUCTASE; NAD; 3D-STRUCTURE.
FT   INIT_MET      0      0
FT   ACT_SITE    125    125
SQ   SEQUENCE   449 AA;  49165 MW;  8E22A020 CRC32;
     SKYVDRVIAE VEKKYADEPE FVQTVEEVLS SLGPVVDAHP EYEEVALLER MVIPERVIEF
     RVPWEDDNGK VHVNTGYRVQ FNGAIGPYKG GLRFAPSVNL SIMKFLGFEQ AFKDSLTTLP
     MGGAKGGSDF DPNGKSDREV MRFCQAFMTE LYRHIGPDID VPAGDLGVGA REIGYMYGQY
     RKIVGGFYNG VLTGKARSFG GSLVRPEATG YGSVYYVEAV MKHENDTLVG KTVALAGFGN
     VAWGAAKKLA ELGAKAVTLS GPDGYIYDPE GITTEEKINY MLEMRASGRN KVQDYADKFG
     VQFFPGEKPW GQKVDIIMPC ATQNDVDLEQ AKKIVANNVK YYIEVANMPT TNEALRFLMQ
     QPNMVVAPSK AVNAGGVLVS GFEMSQNSER LSWTAEEVDS KLHQVMTDIH DGSAAAAERY
     GLGYNLVAGA NIVGFQKIAD AMMAQGIAW
//
3. PIR Format
Most recent format of original Dayhoff NBRF ... protein oriented
Properties:
1) Header line: begins with > delimiter ... two symbols, then semicolon ... then EntryName ... no spaces
EntryName often limited to 6 or 8 or 12 characters, usually Letters or Numbers
2) Second line - a Description line: free text ... often: molecule name - hyphen - organism
3) Sequence: multiple lines ... Spaces, upper/lower case ok, multiple symbols, variable number per line
4) Sequence ends with a * - delimiter for end of sequence ... indicates Stop Codon for protein seqs
5) Annotation or documentation follows ... can be free text ...
True PIR uses symbols to distinguish types of annotation:
C; comment ... R; reference ... A; protein annotation ...
Delimiters: > at beginning of Entry; * at end of Sequence; none at end of Entry
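The PIR layout described above (header, description line, sequence up to *, then annotation) can be read with a short routine. This is a sketch for a single entry, assuming the annotation starts on the line after the one holding the *; the name `parse_pir` and the returned dictionary keys are hypothetical.

```python
from typing import Dict, List, Union


def parse_pir(text: str) -> Dict[str, Union[str, List[str]]]:
    """Parse one PIR entry: >XX;EntryName, description, sequence ending in *."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(">") or ";" not in lines[0]:
        raise ValueError("PIR header must look like >P1;EntryName")
    type_code, _, name = lines[0][1:].partition(";")
    description = lines[1].strip() if len(lines) > 1 else ""
    residues: List[str] = []
    annotation: List[str] = []
    sequence_done = False
    for line in lines[2:]:
        if sequence_done:
            annotation.append(line)               # C; R; A; lines or free text
            continue
        body, star, rest = line.partition("*")
        residues.append("".join(body.split()))    # spaces inside sequence are ok
        if star:
            sequence_done = True                  # * ends the sequence
            if rest.strip():
                annotation.append(rest.strip())
    return {
        "type": type_code,
        "name": name.strip(),
        "description": description,
        "sequence": "".join(residues),
        "annotation": annotation,
    }
```

Since nothing marks the end of a PIR entry, a file holding several entries would have to be cut at each line beginning with > before calling this routine.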
Example:
a. The following is an example of a PIR-formatted Protein sequence obtained from the PIR protein library PROTEIN using the COPY command of the program PSQ. The documentation comments following the sequence are in the PIR-NBRF format.
>P1;CATPAA
Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmids
M E K K I T G Y T T V D I S Q W H R K E H F E A F Q S V A Q
C T Y N Q T V Q L D I T A F L K T V K K N K H K F Y P A F I
H I L A R L M N A H P E F R M A M K D G E L V I W D S V H P
C Y T V F H E Q T E T F S S L W S E Y H D D F R Q F L H I Y
S Q D V A C Y G E N L A Y F P K G F I E N M F F V S A N P W
V S F T S F D L N V A N M D N F F A P V F T M G K Y Y T Q G
D K V L M P L A I Q V H H A V C D G F H V G R M L N E L Q Q
Y C D E W Q G G A *
C;Species: Escherichia coli
R;Shaw, W.V., Packman, L.C., Burleigh, B.D., Dell, A., Morris, H.R., and Hartley, B.S.
Nature 282, 870-872, 1979 (Plasmid JR66b, complete sequence with experimental details)
A;The chloramphenicol binding site may include regions near residues 31 and 192-196. Lys-136 may be involved in the formation of salt bridges between the chains.
R;Alton, N.K., and Vapnek, D.
Nature 282, 864-869, 1979 (Sequence translated from the nucleotide sequence for the transposable genetic element Tn9)
A;Residues 77-219 correspond to a probable fusidic acid resistance protein.
R;Marcoli, R., Iida, S., and Bickle, T.A.
FEBS Lett. 110, 11-14, 1980 (Sequence translated from the nucleotide sequence for the transposon, Tncam204, derived from the R plasmid NR1 [=R100])
C;This enzyme, a type I variant mediated by an R plasmid in E. coli, exists as a tetramer of identical chains.
4. FASTA/Pseudo-FASTA
Properties: ... similar to PIR but abbreviated ...
1) Header line: begins with > delimiter followed by EntryName ... no spaces
EntryName often limited to 6 or 8 or 12 characters, usually Letters or Numbers
After EntryName comes a space followed by free text: Description of entry
NOTE: this first Space is the Delimiter between EntryName and entry Description ...
2) Sequence begins on line 2 and continues for as many lines as needed ...
Usually: 80 residues per line or less ... Upper or lower case permitted ... no spaces ... standard code letters used, no numbers ... no * at end
3) No Annotation or Documentation permitted except in the brief line 1 description ...
Delimiters: > at beginning of Entry; none at end of Sequence; none at end of Entry
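Since > is the only delimiter, a FASTA reader just starts a new record at each header line and splits that header at its first space. A minimal sketch under the rules listed above; `parse_fasta` is a hypothetical name.

```python
from typing import List, Optional, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str, str]]:
    """Return (entry_name, description, sequence) for each record."""
    records: List[Tuple[str, str, str]] = []
    name: Optional[str] = None
    description = ""
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, description, "".join(chunks)))
            # The first space separates EntryName from the Description.
            name, _, description = line[1:].partition(" ")
            chunks = []
        elif line and name is not None:
            chunks.append(line)           # sequence lines are concatenated
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records
```

The final record has to be flushed after the loop because nothing marks the end of a sequence or of an entry.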
Example:
>CATPAA Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmids
MEKKITGYTTVDISQWHRKEHFEAFQSVAQCTYNQTVQLDITAFLKTVKKNKHKFYPAFI
HILARLMNAHPEFRMAMKDGELVIWDSVHPCYTVFHEQTETFSSLWSEYHDDFRQFLHIY
SQDVACYGENLAYFPKGFIENMFFVSANPWVSFTSFDLNVANMDNFFAPVFTMGKYYTQG