Formats | Bioinformatics and Systems Biology Core | University of Nebraska Medical Center

Journal Club
Courses

GCBA815

Training

1. GenBank/GenPept Format

Properties:
1) Documentation before Sequence ... fixed format: fields of information ... field names and data
in specific Columns in the entry ...
2) Words (Field Names), beginning in Column 1, used as Delimiters in part ...
3) Much Documentation or Annotation possible ...
FEATURES Table ... detailed features of the Sequence: primary, secondary, tertiary structure; mutations, ...
4) Sequence in fixed columnar format ... similar to GCG ... residue number ... 60 res per line ...
spaces in fixed columns ... sequence in lower case ...

Delimiters: Word LOCUS begins an Entry; word ORIGIN begins Sequence; // ends Entry
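The delimiter rules above can be sketched as a small reader. This is a minimal illustration, not NCBI's parser: the function name `parse_genbank_entry` is hypothetical, top-level field names are assumed to be the words starting in column 1, and indented sub-fields and FEATURES lines are simply kept as continuation text of their parent field. The sample entry is abbreviated from the GenPept example later in these notes.

```python
def parse_genbank_entry(text):
    """Split one GenBank/GenPept entry into header fields and the sequence.

    Assumptions (illustrative only): a field name is the first word of any
    line that starts in column 1; indented lines continue the current field.
    """
    fields = {}            # field name -> list of text lines
    sequence_parts = []
    current = None
    in_sequence = False

    for line in text.splitlines():
        if line.startswith("//"):            # '//' ends the entry
            break
        if in_sequence:
            # Sequence lines: a residue number, then blocks of residues.
            sequence_parts.extend(
                tok for tok in line.split() if not tok.isdigit()
            )
            continue
        if line.startswith("ORIGIN"):        # sequence starts on the next line
            in_sequence = True
            continue
        if not line.strip():                 # ignore blank lines
            continue
        if not line[0].isspace():            # field name in column 1
            name, _, rest = line.partition(" ")
            current = name
            fields.setdefault(name, []).append(rest.strip())
        elif current is not None:            # indented continuation line
            fields[current].append(line.strip())

    return fields, "".join(sequence_parts)


example = """\
LOCUS       AAB94881      842 aa                    BCT       05-JAN-1998
DEFINITION  DNA polymerase [Cenarchaeum symbiosum].
ACCESSION   AAB94881
ORIGIN
        1 vqdaveipps llvsatydsq agavvlkfye pesqkivhwt dntghkpycy trqppselge
       61 legredvlgt
//
"""

fields, seq = parse_genbank_entry(example)
print(fields["ACCESSION"], len(seq))
```

Dropping purely numeric tokens removes the residue numbers while leaving the lower-case residue blocks intact.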

GenBank Numbers used to Key or uniquely Identify Entries:

1. SeqID:
Initially, the Entry Name in the LOCUS line was used as the only key to a GenBank entry
This name attempted to mimic the organism and function of the gene encoded
Problem: impossible to do this systematically and uniquely with new knowledge ...
These Entry Names now change over time...

2. Accession Numbers:
The Accession Number was then introduced, to be the primary key to reference an entry in the database ...
Will always stay with the entry, even when entry is updated
Nomenclature: either 1 letter + 5 digits (eg: X79797) or 2 letters + 6 digits (eg: AF028831) ... new one: NC_001140 ... a 'RefSeq' entry
the letter used reflects which of the three databases (GenBank, EMBL, DDBJ) is the primary database ...
Problem:
a. Same Accession Number was assigned to new versions of the same sequence!
Thus, a given Accession Number would stay with a given Entry ... but could be associated with more than one version of that Entry!
Result: Entry retrieved from a given Accession Number was not unique ...
b. now also have Secondary Accession Numbers!
These were introduced to give some notion of the history of the Entry
Example from below:
ACCESSION X79797 X70490 X74327
All Accession Numbers are on a single line, the ACCESSION line
The Primary key is the first number, all others are Secondary keys
Origin of Secondary keys: multiple origins, reflecting the history of GenBank ... some unknown origins ...
1. Two entries may have been merged into one entry, with one of the two AccNums becoming a Secondary
2. Primary Accession Number may have replaced the Secondary AccNum which otherwise no longer exists
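As a small sketch of the primary/secondary rule, the ACCESSION line can be split on whitespace, taking the first number as the Primary key and the rest as Secondary keys. The function name `parse_accession_line` is hypothetical, and the sketch assumes all numbers sit on the single ACCESSION line, as described above.

```python
def parse_accession_line(line):
    """Return (primary, secondaries) from a GenBank ACCESSION line."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "ACCESSION":
        raise ValueError("not a usable ACCESSION line: %r" % line)
    # First number is the Primary key; all others are Secondary keys.
    return tokens[1], tokens[2:]


primary, secondaries = parse_accession_line("ACCESSION X79797 X70490 X74327")
print(primary, secondaries)
```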

3. gi (genInfo) Number:
number assigned by GenBank to each nucleotide and protein sequence
SeqID retained, reflects policy used by the sequence source database: GB, EMBL, DDBJ, SwissProt, PIR, ...
Rationale: any new sequence entering GenBank gets a new gi number, even if it is a new version of an existing sequence
Thus, two different sequences which are different versions of each other have the SAME Accession Number but DIFFERENT gi Numbers ...

4. NID and PID Numbers:
These are the gi numbers for Nucleotide and Protein entries at GenBank

5. More Recently: new SeqID number = Accession . Version
This approach, introduced about 2 yrs ago, is expected to replace the gi number approach
Rationale:
Accession Number: identifies a sequence record ...
Version Number: tracks changes to the sequence itself ...
Advantages:
1. can just use the Accession Number to retrieve the latest version
2. can record the Version Number in publications, to note which sequence was actually used
3. easy to determine the history of changes by noting the Version Number
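The Accession . Version scheme can be illustrated with a short sketch that splits an identifier at the dot and reads a VERSION line of the kind shown in the new-format example in these notes. The names `SeqId`, `split_accession_version` and `parse_version_line` are hypothetical; the VERSION line layout (identifier followed by an optional `GI:` token) is assumed from that example.

```python
from typing import NamedTuple, Optional


class SeqId(NamedTuple):
    accession: str           # identifies the sequence record
    version: Optional[int]   # tracks changes to the sequence itself


def split_accession_version(seq_id):
    """Split 'AF028831.1' into accession 'AF028831' and version 1."""
    accession, dot, version = seq_id.partition(".")
    if not dot:
        return SeqId(accession, None)    # bare accession: latest version
    if not version.isdigit():
        raise ValueError("version is not a number: %r" % seq_id)
    return SeqId(accession, int(version))


def parse_version_line(line):
    """Return (SeqId, gi or None) from a line such as
    'VERSION     AAB94881.1  GI:2599106'."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "VERSION":
        raise ValueError("not a VERSION line: %r" % line)
    gi = None
    for tok in tokens[2:]:
        if tok.startswith("GI:"):
            gi = int(tok[3:])
    return split_accession_version(tokens[1]), gi


print(split_accession_version("AF028831.1"))
print(parse_version_line("VERSION     AAB94881.1  GI:2599106"))
```

Two identifiers with the same accession but different version numbers then refer to the same record with different sequence data.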

6. Most Recently: new Entries associated with the RefSeq project ...
These are the NCBI Reference sequences ... built via a rather elaborate scheme.
An attempt to get reliable "biological annotation" into a stable, reference set of GenBank entries ...
More info available at the RefSeq FAQ page ...
Identification of these new GenBank entries:

Entrez and BLAST results both present the following formatted text 
as part of the returned result: 
      gi|4557284|ref|NM_000646.1|AGLf| [4557284] 


Data Element     Comment
gi            "GenInfo Identifier", or sequence ID number. "gi|" denotes that 
              the number which follows is a unique sequence id. 
              Any change to the sequence data will result in a new gi number.
4557284       The gi number.
ref           Indicates that RefSeq is the source database.
NM_000646     The RefSeq accession number.
AGLf          The LOCUS name; this abbreviation is displayed in the LOCUS
              field of the record. The capitalized portion of this abbreviation is
              equal to the current gene symbol. The availability of splice 
              variant records can be quickly determined by noting the 
              lowercase alphabetic character appended to the gene symbol.
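The data elements in the table above can be pulled apart by splitting the identifier on the `|` character. The sketch below is an illustration under stated assumptions: the function name `parse_gi_identifier` and the returned dictionary keys are hypothetical, and only the five-element layout shown above (gi, number, database tag, accession.version, LOCUS name) is handled.

```python
def parse_gi_identifier(text):
    """Parse e.g. 'gi|4557284|ref|NM_000646.1|AGLf| [4557284]'."""
    stripped = text.strip()
    if not stripped:
        raise ValueError("empty identifier")
    token = stripped.split()[0]          # drop the trailing '[4557284]'
    parts = token.split("|")
    if len(parts) < 4 or parts[0] != "gi":
        raise ValueError("unexpected identifier layout: %r" % text)

    accession, _, version = parts[3].partition(".")
    locus = parts[4] if len(parts) > 4 and parts[4] else None

    # Per the table: the capitalized portion is the gene symbol, and a
    # trailing lowercase letter marks a splice-variant record.
    symbol, variant = locus, None
    if locus and locus[-1].islower():
        symbol, variant = locus[:-1], locus[-1]

    return {
        "gi": int(parts[1]),
        "database": parts[2],            # 'ref' means RefSeq
        "accession": accession,
        "version": int(version) if version.isdigit() else None,
        "locus": locus,
        "gene_symbol": symbol,
        "splice_variant": variant,
    }


info = parse_gi_identifier("gi|4557284|ref|NM_000646.1|AGLf| [4557284]")
print(info)
```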

Example:

Protein (GenPept) Sequence:


old format:

LOCUS       2599106       842 aa                              05-JAN-1998
DEFINITION  DNA polymerase.
ACCESSION   2599106
PID         g2599106
DBSOURCE    GENBANK: locus AF028831, accession AF028831
KEYWORDS    .
SOURCE      Cenarchaeum symbiosum.
  ORGANISM  Cenarchaeum symbiosum
            Archaea; Crenarchaeota; Cenarchaeum.
REFERENCE   1  (residues 1 to 842)
  AUTHORS   Schleper,C., Swanson,R.V., Mathur,E.J. and DeLong,E.F.
  TITLE     Characterization of a DNA polymerase from the uncultivated
            psychrophilic archaeon Cenarchaeum symbiosum
  JOURNAL   J. Bacteriol. 179 (24), 7803-7811 (1997)
  MEDLINE   98062213
REFERENCE   2  (residues 1 to 842)
  AUTHORS   Schleper,C.M., Swanson,R.V., Mathur,E.J. and DeLong,E.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-OCT-1997) Monterey Bay Aquarium Research Institute,
            PO Box 628, Moss Landing, CA 95039, USA
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..842
                     /organism="Cenarchaeum symbiosum"
                     /db_xref="taxon:46770"
     Protein         <1..842
                     /product="DNA polymerase"
     CDS             1..842
                     /coded_by="AF028831:<1..2529"
                     /transl_table=11
ORIGIN      
        1 vqdaveipps llvsatydsq agavvlkfye pesqkivhwt dntghkpycy trqppselge
       61 legredvlgt eqvmrhdlia dkdvpvtkit vadplaiggt nseksirnim dtwesdikyy
      121 enylydkslv vgryysvsgg kviphdmpis devklalksl lwdkvvdegm adrkefrefi
      181 agwadllnqp iprirrlsfd ievdseegri pdpkisdrrv tavgfaatdg lkqvfvlrsg
      241 aeegengvtp gvevvfydke admirdalsv igsypfvlty ngddfdmpym lnrarrlgvs
      301 dsdiplymmr dsatlrhgvh ldlyrtfsnr sfqlyafaak ytdyslnsvt kamlgegkvd
      361 ygvklgdltl yqtanycyhd arltlelstf gneilmdllv vtsriarmpi ddmsrmgvsq
      421 wirsllyyeh rqrnaliprr delegrsrev sndavikdkk frgglvvepe egihfdvtvm
      481 dfaslypsii kvrnlsyetv rcvhaeckkn tipdtnhwvc tknngltsmi igslrdlrvn
      541 yykslsksts iteeqrqqyt visqalkvvl nasygvmgae ifplyflpaa eattavgryi
      601 imqtishceq mgvrvlygdt dslfikdpee rqiheiveha kkehgvelev dkeyryvvls
      661 nrkknyfgvt ragkvdvkgl tgkkshtppf ikelfyslld ilsgvesede fesakmrisk
      721 aiaacgkrle erqiplvdla fnvmiskaps eyvktvpqhi raarllenar evkkgdiisy
      781 vkvmnktgvk pvemaragev dtskylefme stldqltssm gldfdeilgk pkqtgmeqff
      841 fk
//


new format:

LOCUS       AAB94881      842 aa                    BCT       05-JAN-1998
DEFINITION  DNA polymerase [Cenarchaeum symbiosum].
ACCESSION   AAB94881
PID         g2599106
VERSION     AAB94881.1  GI:2599106
DBSOURCE    locus AF028831 accession AF028831.1
KEYWORDS    .
SOURCE      Cenarchaeum symbiosum.
  ORGANISM  Cenarchaeum symbiosum
            Archaea; Crenarchaeota; Cenarchaeum.
REFERENCE   1  (residues 1 to 842)
  AUTHORS   Schleper,C., Swanson,R.V., Mathur,E.J. and DeLong,E.F.
  TITLE     Characterization of a DNA polymerase from the uncultivated
            psychrophilic archaeon Cenarchaeum symbiosum
  JOURNAL   J. Bacteriol. 179 (24), 7803-7811 (1997)
  MEDLINE   98062213
REFERENCE   2  (residues 1 to 842)
  AUTHORS   Schleper,C.M., Swanson,R.V., Mathur,E.J. and DeLong,E.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-OCT-1997) Monterey Bay Aquarium Research Institute,
            PO Box 628, Moss Landing, CA 95039, USA
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..842
                     /organism="Cenarchaeum symbiosum"
                     /db_xref="taxon:46770"
     Protein         <1..842
                     /product="DNA polymerase"
                     /name="archaeal family B DNA polymerase"
     CDS             1..842
                     /coded_by="AF028831.1:<1..2529"
                     /transl_table=11
ORIGIN      
        1 vqdaveipps llvsatydsq agavvlkfye pesqkivhwt dntghkpycy trqppselge
       61 legredvlgt eqvmrhdlia dkdvpvtkit vadplaiggt nseksirnim dtwesdikyy
      121 enylydkslv vgryysvsgg kviphdmpis devklalksl lwdkvvdegm adrkefrefi
      181 agwadllnqp iprirrlsfd ievdseegri pdpkisdrrv tavgfaatdg lkqvfvlrsg
      241 aeegengvtp gvevvfydke admirdalsv igsypfvlty ngddfdmpym lnrarrlgvs
      301 dsdiplymmr dsatlrhgvh ldlyrtfsnr sfqlyafaak ytdyslnsvt kamlgegkvd
      361 ygvklgdltl yqtanycyhd arltlelstf gneilmdllv vtsriarmpi ddmsrmgvsq
      421 wirsllyyeh rqrnaliprr delegrsrev sndavikdkk frgglvvepe egihfdvtvm
      481 dfaslypsii kvrnlsyetv rcvhaeckkn tipdtnhwvc tknngltsmi igslrdlrvn
      541 yykslsksts iteeqrqqyt visqalkvvl nasygvmgae ifplyflpaa eattavgryi
      601 imqtishceq mgvrvlygdt dslfikdpee rqiheiveha kkehgvelev dkeyryvvls
      661 nrkknyfgvt ragkvdvkgl tgkkshtppf ikelfyslld ilsgvesede fesakmrisk
      721 aiaacgkrle erqiplvdla fnvmiskaps eyvktvpqhi raarllenar evkkgdiisy
      781 vkvmnktgvk pvemaragev dtskylefme stldqltssm gldfdeilgk pkqtgmeqff
      841 fk
//

2. EMBL/Swiss-Prot/TREMBL Format

SwissProt and TREMBL are Protein, EMBL is DNA ... same formats
TREMBL is a "TRanslation of EMBL", i.e. the cognate of GenPept relative to GenBank

Properties:
1) EMBL DNA database is at EBI ...
2) SwissProt main online database is at ExPASy ... usually has links within the entries ...
3) Documentation before Sequence ... fixed format: fields of information ... field names and data
in specific Columns in the entry ...
4) 2 letter symbols (Field Names), beginning in Column 1, used as Delimiters in part ...
5) Much Documentation or Annotation possible ...
FEATURES Table (FT lines) ... detailed features of the Sequence: primary, secondary, tertiary structure; mutations, ...
6) Sequence in fixed columnar format ... similar to GCG ... residue number ... 60 res per line ...
spaces in fixed columns ... sequence in UPPER case ...

Delimiters: Word ID begins an Entry; word SQ begins Sequence; // ends Entry
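A minimal sketch of reading this format follows, using the two-letter line codes as keys. The function name `parse_embl_entry` is hypothetical; the data portion of each line is assumed to start in column 6, and lines that do not begin with a two-letter upper-case code (for example link text added by a web display) are skipped. The sample entry is abbreviated from the SwissProt example in these notes.

```python
def parse_embl_entry(text):
    """Split one EMBL/Swiss-Prot/TREMBL entry into line-code fields
    and the sequence."""
    fields = {}              # two-letter code -> list of data strings
    sequence_parts = []
    in_sequence = False

    for line in text.splitlines():
        if line.startswith("//"):            # '//' ends the entry
            break
        if in_sequence:
            # Drop spaces and any residue/base counts on the line.
            sequence_parts.extend(
                tok for tok in line.split() if not tok.isdigit()
            )
            continue
        code = line[:2]
        if len(code) < 2 or not (code.isalpha() and code.isupper()):
            continue                         # blank or non-standard line
        if code == "SQ":                     # sequence starts on next line
            in_sequence = True
        fields.setdefault(code, []).append(line[5:].rstrip())

    return fields, "".join(sequence_parts)


example = """\
ID   DHE2_CLOSY     STANDARD;      PRT;   449 AA.
AC   P24295;
DE   NAD-SPECIFIC GLUTAMATE DEHYDROGENASE (EC 1.4.1.2) (NAD-GDH).
SQ   SEQUENCE   449 AA;  49165 MW;  8E22A020 CRC32;
     SKYVDRVIAE VEKKYADEPE FVQTVEEVLS
//
"""

fields, seq = parse_embl_entry(example)
print(fields["AC"], seq)
```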

Example:
SwissProt Protein entry:

ID   DHE2_CLOSY     STANDARD;      PRT;   449 AA.
AC   P24295;
DT   01-MAR-1992 (REL. 21, CREATED)
DT   01-APR-1993 (REL. 25, LAST SEQUENCE UPDATE)
DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE   NAD-SPECIFIC GLUTAMATE DEHYDROGENASE (EC 1.4.1.2) (NAD-GDH).
GN   GDH.
OS   CLOSTRIDIUM SYMBIOSUM (BACTEROIDES SYMBIOSUS).
OC   PROKARYOTA; FIRMICUTES; ENDOSPORE-FORMING RODS AND COCCI; BACILLACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 92267007. [NCBI, Geneva]
RA   TELLER J.K., SMITH R.M., MCPHERSON M.J., ENGEL P.C., GUEST J.R.;
RL   EUR. J. BIOCHEM. 206:151-159(1992).
RN   [2]
RP   PRELIMINARY PARTIAL SEQUENCE.
RX   MEDLINE; 92062694. [NCBI, Geneva]
RA   LILLEY K.S., BAKER P.J., BRITTON K.L., STILLMAN T.J., BROWN P.E.,
RA   MOIR A.J.G., ENGEL P.C., RICE D.W., BELL J.E., BELL E.;
RL   BIOCHIM. BIOPHYS. ACTA 1080:191-197(1991).
RN   [3]
RP   PARTIAL SEQUENCE, AND MODIFICATION OF SOME LYSINES.
RX   MEDLINE; 92339441. [NCBI, Geneva]
RA   LILLEY K.S., ENGEL P.C.;
RL   EUR. J. BIOCHEM. 207:533-540(1992).
RN   [4]
RP   X-RAY CRYSTALLOGRAPHY (1.96 ANGSTROMS).
RX   MEDLINE; 92204934. [NCBI, Geneva]
RA   BAKER P.J., BRITTON K.L., ENGEL P.C., FARRANTS G.W., LILLEY K.S.,
RA   RICE D.W., STILLMAN T.J.;
RL   PROTEINS 12:75-86(1992).
CC   -!- CATALYTIC ACTIVITY: L-GLUTAMATE + H(2)O + NAD(+) = 2-OXOGLUTARATE
CC       + NH(3) + NADH.
CC   -!- PATHWAY: FIRST STEP IN THE HYDROXYGLUTARATE PATHWAY (ROUTE FOR THE
CC       ENERGY-YIELDING GLUTAMATE FERMENTATION).
CC   -!- SUBUNIT: HOMOHEXAMER.
CC   -!- SIMILARITY: BELONGS TO THE GLU/LEU/PHE/VAL DEHYDROGENASES FAMILY.
DR   EMBL; Z11747; G49280; -. [EMBL / GenBank / DDBJ] [CodingSequence]
DR   PIR; S18361; S18361.
DR   PIR; S22403; S22403.
DR   PDB; 1HRD; 12-MAR-97. [ENTRY / RASMOL / 3D IMAGE / HSSP ENTRY / SCOP]
DR   SWISS-3DIMAGE; DHE2_CLOSY.
DR   PROSITE; PS00074; GLFV_DEHYDROGENASE; 1.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   OXIDOREDUCTASE; NAD; 3D-STRUCTURE.
FT   INIT_MET      0      0
FT   ACT_SITE    125    125       
SQ   SEQUENCE   449 AA;  49165 MW;  8E22A020 CRC32;
     SKYVDRVIAE VEKKYADEPE FVQTVEEVLS SLGPVVDAHP EYEEVALLER MVIPERVIEF
     RVPWEDDNGK VHVNTGYRVQ FNGAIGPYKG GLRFAPSVNL SIMKFLGFEQ AFKDSLTTLP
     MGGAKGGSDF DPNGKSDREV MRFCQAFMTE LYRHIGPDID VPAGDLGVGA REIGYMYGQY
     RKIVGGFYNG VLTGKARSFG GSLVRPEATG YGSVYYVEAV MKHENDTLVG KTVALAGFGN
     VAWGAAKKLA ELGAKAVTLS GPDGYIYDPE GITTEEKINY MLEMRASGRN KVQDYADKFG
     VQFFPGEKPW GQKVDIIMPC ATQNDVDLEQ AKKIVANNVK YYIEVANMPT TNEALRFLMQ
     QPNMVVAPSK AVNAGGVLVS GFEMSQNSER LSWTAEEVDS KLHQVMTDIH DGSAAAAERY
     GLGYNLVAGA NIVGFQKIAD AMMAQGIAW
//

3. PIR Format

most recent format of original Dayhoff NBRF ... protein oriented
Properties:
1) Header line: begins with > delimiter ... two symbols, then semicolon ... then EntryName ... no spaces
EntryName often limited to 6 or 8 or 12 characters, usually Letters or Numbers
2) Second line - a Description line: free text ... often: molecule name - hyphen - organism
3) Sequence: multiple lines ... Spaces, upper/lower case ok, multiple symbols, variable number per line
4) Sequence ends with a * - delimiter for end of sequence ... indicates Stop Codon for protein seqs
5) Annotation or documentation follows ... can be free text ...
True PIR uses symbols to distinguish types of annotation:
C; comment ... R; reference ... A; protein annotation ...

Delimiters: > at beginning of Entry; * at end of Sequence; none at end of Entry
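The PIR layout described above can be read with a short sketch like the following. The function name `parse_pir_entry` and the returned keys are hypothetical. One simplification is an assumption of this sketch, not a rule of the format: only letters are kept from the sequence lines, so spaces and any other symbols before the `*` are dropped.

```python
def parse_pir_entry(text):
    """Parse one PIR entry: '>XX;Name', description line, sequence
    terminated by '*', then optional annotation lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("entry needs a header and a description line")
    header = lines[0]
    if not header.startswith(">") or len(header) < 5 or header[3] != ";":
        raise ValueError("bad PIR header: %r" % header)

    residues = []
    annotation = []
    seen_star = False
    for line in lines[2:]:
        if seen_star:
            annotation.append(line)          # everything after the '*'
            continue
        before, star, _ = line.partition("*")
        residues.extend(ch for ch in before if ch.isalpha())
        if star:                             # '*' ends the sequence
            seen_star = True
    if not seen_star:
        raise ValueError("sequence is not terminated by '*'")

    return {
        "type": header[1:3],                 # the two symbols, e.g. 'P1'
        "name": header[4:].strip(),
        "description": lines[1].strip(),
        "sequence": "".join(residues),
        "annotation": annotation,
    }


example = """\
>P1;CATPAA
Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmids
 M E K K I T G Y T T
 Y C D E W Q G G A *
C;Species: Escherichia coli
"""

entry = parse_pir_entry(example)
print(entry["name"], entry["sequence"])
```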

Example:

a. The following is an example of a PIR-formatted Protein sequence obtained from the PIR protein library PROTEIN using the COPY command of the program PSQ. The documentation comments following the sequence are in the PIR-NBRF format.

>P1;CATPAA
Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmids
 M E K K I T G Y T T V D I S Q W H R K E H F E A F Q S V A Q
 C T Y N Q T V Q L D I T A F L K T V K K N K H K F Y P A F I
 H I L A R L M N A H P E F R M A M K D G E L V I W D S V H P
 C Y T V F H E Q T E T F S S L W S E Y H D D F R Q F L H I Y
 S Q D V A C Y G E N L A Y F P K G F I E N M F F V S A N P W
 V S F T S F D L N V A N M D N F F A P V F T M G K Y Y T Q G
 D K V L M P L A I Q V H H A V C D G F H V G R M L N E L Q Q
 Y C D E W Q G G A *
C;Species: Escherichia coli
R;Shaw, W.V., Packman, L.C., Burleigh, B.D., Dell, A., Morris, H.R., and Hartley, B.S.
Nature 282, 870-872, 1979 (Plasmid JR66b, complete sequence with experimental details)
A;The chloramphenicol binding site may include regions near residues 31 and 192-196. Lys-136 may be involved in the formation of salt bridges between the chains.
R;Alton, N.K., and Vapnek, D.
Nature 282, 864-869, 1979 (Sequence translated from the nucleotide sequence for the transposable genetic element Tn9)
A;Residues 77-219 correspond to a probable fusidic acid resistance protein.
R;Marcoli, R., Iida, S., and Bickle, T.A.
FEBS Lett. 110, 11-14, 1980 (Sequence translated from the nucleotide sequence for the transposon, Tncam204, derived from the R plasmid NR1 [=R100])
C;This enzyme, a type I variant mediated by an R plasmid in E. coli, exists as a tetramer of identical chains.

4. FASTA/Pseudo-FASTA
most commonly used format with a minimum of descriptive information

Properties: ... similar to PIR but abbreviated ...
1) Header line: begins with > delimiter followed by EntryName ... no spaces
EntryName often limited to 6 or 8 or 12 characters, usually Letters or Numbers
After EntryName comes a space followed by free text: Description of entry
NOTE: this first Space is Delimiter between EntryName and entry Description ...
2) Sequence begins on line 2 and continues for as many lines as needed ...
Usually: 80 residues per line or less ... Upper or lower case permitted ... no spaces ...
standard code letters used, no numbers ... no * at end
3) No Annotation or Documentation permitted except in brief line 1 description ...

Delimiters: > at beginning of Entry; none at end of Sequence; none at end of Entry
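Because FASTA has no end-of-entry delimiter, a reader has to treat the next `>` line (or the end of the file) as the end of the current entry. The sketch below illustrates that, along with the rule that the first space separates EntryName from Description. The names `read_fasta` and `write_fasta` are hypothetical, and the 60-residue default line width is an arbitrary choice within the "80 or less" guideline.

```python
def read_fasta(text):
    """Return a list of (name, description, sequence) tuples."""
    entries = []
    name = description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:             # previous entry ends here
                entries.append((name, description, "".join(chunks)))
            # First space separates EntryName from the free-text description.
            name, _, description = line[1:].partition(" ")
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:                     # last entry ends at end of text
        entries.append((name, description, "".join(chunks)))
    return entries


def write_fasta(name, description, sequence, width=60):
    """Format one entry with at most `width` residues per line."""
    header = ">" + name + (" " + description if description else "")
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + body) + "\n"


text = (
    ">CATPAA Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmids\n"
    "MEKKITGYTTVDISQWHRKEHFEAFQSVAQ\n"
    "CTYNQTVQLDITAFLKTVKKNKHKFYPAFI\n"
)
for name, description, sequence in read_fasta(text):
    print(name, len(sequence))
```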

Example:

>CATPAA Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmids
MEKKITGYTTVDISQWHRKEHFEAFQSVAQCTYNQTVQLDITAFLKTVKKNKHKFYPAFI
HILARLMNAHPEFRMAMKDGELVIWDSVHPCYTVFHEQTETFSSLWSEYHDDFRQFLHIY
SQDVACYGENLAYFPKGFIENMFFVSANPWVSFTSFDLNVANMDNFFAPVFTMGKYYTQG
DKVLMPLAIQVHHAVCDGFHVGRMLNELQQYCDEWQGGA
