BIOINFORMATICS
In biology,
bioinformatics is defined as “the use of computers to store, retrieve, analyse
or predict the composition or structure of biomolecules”. More broadly, it is
the application of computational techniques and information technology to the organization
and management of biological data. Classical bioinformatics deals primarily
with sequence analysis. The biological
information of nucleic acids is available as sequences, while the data of
proteins is available as both sequences and structures. A sequence is a
one-dimensional string of residues, whereas a structure holds the
three-dimensional coordinates of the molecule that the sequence folds into.
Aims of bioinformatics
· Development of databases containing all biological information.
· Development of better tools for data design, annotation and mining.
· Design and development of drugs using simulation software.
· Design and development of software tools for protein structure prediction, function annotation and docking analysis.
· Creation and development of software to improve tools for analyzing sequences for their function and similarity with other sequences.
Biological Databases
A biological
database is a collection of data that is organized so that its contents can
easily be accessed, managed, and updated. The activity of preparing a database
can be divided into:
· Collection of data in a form which can be easily accessed
· Making it available to a multi-user system (always available to the user)
Importance of biological database
A range of
information like biological sequences, structures, binding sites, metabolic
interactions, molecular action, functional relationships, protein families,
motifs and homologs can be retrieved from biological databases. The main
purpose of a biological database is to store and manage biological data and
information in computer readable forms.
When Sanger
first developed a method to sequence proteins, there was a lot of excitement
in the field of molecular biology. Initial interest in bioinformatics was
propelled by the necessity to create databases of biological sequences.
Biological
databases can be broadly classified into sequence
and structure databases. Sequence databases
cover both nucleic acid sequences and protein sequences, whereas structure databases
hold mainly proteins.
The first database was created within a short
period after the insulin protein sequence was made available in 1956.
Incidentally, insulin was the first protein to be sequenced. Its sequence
consists of just 51 residues (analogous to the letters in a sentence).
Around the mid-1960s, the first nucleic
acid sequence, that of a yeast tRNA with 77 bases (the individual units of nucleic acids),
was determined.
During this
period, three-dimensional structures of proteins were studied and the well-known
Protein Data Bank was established in 1971 as the first protein structure database, with
only 7 entries. It has since grown into a large database with well over
10,000 entries. While the initial databases of protein sequences were maintained
at individual laboratories, the development of a consolidated formal
database known as the SWISS-PROT protein sequence database was initiated in 1986.
It grew to about 70,000 protein sequences from more than 5,000
organisms, a small fraction of all known organisms. This wide variety of
divergent data resources is now available for study and research by both
academic institutions and industries. The resources are made available as public domain
information in the larger interest of the research community through the Internet
(www.ncbi.nlm.nih.gov) and CD-ROMs (on request from www.rcsb.org). These
databases are constantly updated with additional entries.
Databases in general can be
classified into primary, secondary and composite databases.
A primary database contains
information on the sequence or structure alone. Examples include
Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for nucleotide
sequences, and the Protein Data Bank for protein structures.
A secondary database contains
information derived from the primary databases. A secondary sequence database
contains information such as the conserved sequences, signature sequences and active
site residues of protein families, obtained by multiple sequence alignment of
a set of related proteins. A secondary database of structures organizes the entries of
the PDB. Its entries are classified
according to their structure, such as all-alpha proteins, all-beta proteins, etc.,
and it also holds information on the conserved secondary structure motifs of a
particular protein. Secondary databases created and hosted by
various researchers at their individual laboratories include SCOP, developed
at Cambridge University; CATH, developed at University College London;
PROSITE, of the Swiss Institute of Bioinformatics; and eMOTIF, at Stanford.
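As a rough illustration of how conserved positions can be read off a multiple sequence alignment, the sketch below marks the columns of a small alignment that are conserved. The function name, the threshold parameter and the toy sequences are invented for illustration; real secondary databases use far more elaborate procedures.

```python
from collections import Counter

def conserved_columns(alignment, threshold=1.0):
    """Return the dominant residue in each conserved column, '.' elsewhere.

    A column counts as conserved when its most common residue (not a gap)
    makes up at least `threshold` of the column.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    consensus = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        conserved = residue != "-" and count / len(column) >= threshold
        consensus.append(residue if conserved else ".")
    return "".join(consensus)

# Toy, made-up alignment of three related sequences.
toy_alignment = ["GHSLKC", "GHTLRC", "GHSMKC"]
print(conserved_columns(toy_alignment))       # strictly conserved columns only
print(conserved_columns(toy_alignment, 0.6))  # tolerate one deviating sequence
```

Lowering the threshold trades strictness for coverage, which is the kind of choice a curated secondary database has to make explicitly.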
A composite database amalgamates a variety of different
primary database sources, which obviates the need to search multiple resources.
Different composite databases use different primary databases and different criteria
in their search algorithms, and offer various search options.
The National Center for Biotechnology
Information (NCBI), which hosts these nucleotide and protein databases on a
large, highly available, redundant array of computer servers, provides free access
to researchers. It also links to OMIM (Online
Mendelian Inheritance in Man), which contains information about the proteins
involved in genetic diseases.
Protein and Nucleic acid databases
A protein database is one or more
datasets about proteins, which could include a protein’s amino acid sequence,
conformation, structure, and features such as active sites.
Protein
databases are compiled by the translation of DNA sequences from different gene
databases and include structural information. They are an important resource
because proteins mediate most biological functions.
As
biology has increasingly turned into a data-rich science, the need for storing
and communicating large datasets has grown tremendously.
The
obvious examples are the nucleotide sequences, the protein sequences, and the
3D structural data produced by X-ray crystallography and macromolecular NMR.
The
biological information of proteins is available as sequences and structures.
Sequences are represented in a single dimension whereas the structure contains
the three-dimensional data of sequences.
Importance of Protein Databases
Huge
amounts of data on protein structures, functions, and particularly sequences
are being generated. Searching databases is often the first step in the study
of a new protein. Protein databases are useful in the following ways:
1. Comparison between proteins or between protein families provides information about the relationship between proteins within a genome or across different species, and hence offers much more information than can be obtained by studying only an isolated protein.
2. Secondary databases derived from experimental databases are also widely available. These databases reorganize and annotate the data or provide predictions.
3. The use of multiple databases often helps researchers understand the structure and function of a protein.
Primary databases of Protein
The primary databases hold experimentally determined
protein sequences as well as sequences inferred from the conceptual translation of nucleotide
sequences. The latter, of course, are not experimentally derived information; they
arise from interpretation of the nucleotide sequence information and
consequently must be treated as potentially containing misinterpreted
information. There are a number of primary protein sequence databases, and each
requires some specific consideration.
Protein Information Resource (PIR) – Protein
Sequence Database (PIR-PSD):
PIR
is an integrated public bioinformatics resource to support genomic and
proteomic research and scientific studies. Nowadays, PIR offers a wide variety
of resources mainly oriented to assisting the propagation and consistency of
protein annotations, such as PIRSF, iProClass and iProLINK.
The PIR Protein Sequence Database has its origins in the 1960s, in the Atlas of
Protein Sequence and Structure compiled by Margaret Dayhoff at the National Biomedical
Research Foundation (NBRF). Since 1988 it has been maintained by
PIR-International. As of Release 70.0 (September 30, 2001), PIR contained
250,417 entries. It is split into four distinct sections that differ in
the quality of the data and the level of annotation.
SWISS-PROT
Swiss-Prot
was established in 1986. It is maintained collaboratively by SIB (Swiss
Institute of Bioinformatics) and EBI/EMBL. It provides high-level annotations,
including descriptions of protein function, the structure of protein domains,
post-translational modifications, variants, etc. It aims to be minimally
redundant. Swiss-Prot is linked to many other resources, including other
sequence databases.
The annotation contains information on the
function or functions of the protein; post-translational modifications such as
phosphorylation, acetylation, etc.; functional and structural domains and
sites, such as calcium-binding regions, ATP-binding sites, zinc fingers, etc.;
known secondary structural features, for example alpha helices and beta sheets;
the quaternary structure of the protein; similarities to other proteins, if
any; and diseases associated with the protein. Sequence conflicts and variants, which may arise due to different authors publishing different
sequences for the same protein or due to mutations in different strains of an
organism, are also described as part of the annotation.
TrEMBL (for
Translated EMBL) is a computer-annotated protein sequence database that
is released as a supplement to SWISS-PROT. It contains the translation of all
coding sequences present in the EMBL Nucleotide database, which have not been
fully annotated. Thus it may contain the sequence of proteins that are never
expressed and never actually identified in the organisms.
Protein Data Bank (PDB):
· PDB is a primary protein structure database. It archives the three-dimensional structures of large biological molecules, such as proteins, most of them determined by crystallography.
· In spite of the name, PDB archives the three-dimensional structures of not only proteins but also other biologically important molecules, such as nucleic acid fragments, RNA molecules, large peptides such as the antibiotic gramicidin, and complexes of proteins and nucleic acids.
· The database holds data derived mainly from three sources: structures determined by X-ray crystallography, NMR experiments, and molecular modeling.
Secondary Databases of Protein
The
secondary databases are so termed because they contain the results of analysis
of the sequences held in primary databases. Many secondary protein databases
are the result of looking for features that relate different proteins. Some
commonly used secondary databases of sequence and structure are as follows:
a. PROSITE:
· Some databases collect the patterns found in protein sequences rather than the complete sequences. PROSITE is one such pattern database.
· The protein motifs and patterns are encoded as “regular expressions”.
· Each PROSITE entry holds information in two forms: the patterns and the related descriptive text.
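To make the link between PROSITE patterns and regular expressions concrete, the sketch below converts a pattern written in PROSITE syntax into a Python regular expression and scans a sequence with it. The converter is a simplified assumption that covers only the common syntax elements (x, [..], {..}, repeat counts and the < and > anchors); the helper name and the test sequence are invented. The N-glycosylation pattern N-{P}-[ST]-{P} is used as the example.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a Python regex (simplified sketch)."""
    pattern = pattern.rstrip(".")          # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):     # elements are separated by '-'
        element = element.replace("{", "[^").replace("}", "]")  # {P}  -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)

regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)

# Scan a made-up sequence for the motif.
match = re.search(regex, "AANGSA")
if match:
    print(match.start(), match.group())
```

Special cases such as an anchor inside square brackets are not handled here, so patterns taken from the database should be checked before relying on this conversion.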
b. PRINTS:
· In the PRINTS database, the protein sequence patterns are stored as ‘fingerprints’. A fingerprint is a set of motifs or patterns rather than a single one.
· The information contained in a PRINTS entry may be divided into three sections. In addition to the entry name, accession number and number of motifs, the first section contains cross-links to other databases that have more information about the characterized family.
· The second section provides a table showing how many of the motifs that make up the fingerprint occur in how many of the sequences in that family.
· The last section of the entry contains the actual fingerprints, stored as multiple aligned sets of sequences. The alignment is made without gaps, so there is one set of aligned sequences for each motif.
c. MHCPep:
· MHCPep is a database comprising over 13,000 peptide sequences known to bind the Major Histocompatibility Complex (MHC) of the immune system.
· Each entry contains not only the peptide sequence, which may be 8 to 10 amino acids long, but also the specific MHC molecules to which it binds, the experimental method used to assay the peptide, the degree of activity and the binding affinity observed, the source protein that gave rise to this peptide (along with others) when broken down, the positions along the peptide where it anchors on the MHC molecules, and references and cross-links to other information.
d. Pfam:
· Pfam contains profiles built using hidden Markov models (HMMs).
· An HMM models the pattern as a series of match, substitute, insert or delete states, with scores assigned to the alignment for going from one state to another.
· Each family or pattern defined in Pfam consists of four elements. The first is the annotation, which has information on the source used to make the entry, the method used and some numbers that serve as figures of merit.
· The second is the seed alignment, which is used to bootstrap the rest of the sequences into the multiple alignment and then the family.
· The third is the HMM profile.
· The fourth is the complete alignment of all the sequences identified in that family.
The
NCBI Nucleotide database is a
collection of sequences from several sources, including GenBank, RefSeq, TPA
and PDB. Genome, gene and transcript sequence data provide the foundation for
biomedical research and discovery.
The
large DNA databases are GenBank (US), EMBL (Europe, hosted in the UK) and DDBJ (Japan). These
databases are quite similar in content and exchange data with one
another regularly, as a result of the International Nucleotide
Sequence Database Collaboration.
1. Primary databases of nucleotide sequences
a. GenBank
The GenBank sequence
database is an open-access, annotated collection of all publicly
available nucleotide sequences and
their protein translations. It is produced and maintained
by the National Center for Biotechnology Information (NCBI) as part
of the International Nucleotide Sequence Database
Collaboration (INSDC). GenBank and its collaborators receive sequences produced in
laboratories throughout the world from more than 100,000
distinct organisms. GenBank has become an important database for research
in biological fields and has grown exponentially in recent years,
doubling roughly every 18 months.
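To give a feel for what doubling every 18 months implies, the sketch below computes the growth factor over a given period under a constant doubling time. The function name is invented, and the 18-month figure is the rough estimate quoted above, not a measured constant.

```python
def growth_factor(months, doubling_time_months=18.0):
    """Factor by which a quantity grows after `months` of exponential growth."""
    return 2.0 ** (months / doubling_time_months)

# Growth over three years and over ten years at the quoted doubling time.
print(growth_factor(36))
print(round(growth_factor(120), 1))
```

Under this assumption the database would grow roughly a hundredfold in ten years, which is why storage and search tools have to scale with it.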
b. EMBL (European Molecular Biology Laboratory)
The
European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database is a
comprehensive collection of primary nucleotide sequences maintained at the
European Bioinformatics Institute (EBI). Data are received from genome
sequencing centers, individual scientists and patent offices.
c. DDBJ (DNA Data Bank of Japan)
It
is located at the National Institute of Genetics (NIG) in
Shizuoka Prefecture, Japan. It is the only nucleotide sequence
data bank in Asia. Although DDBJ mainly receives its data from Japanese
researchers, it accepts data from contributors in any other country.
2. Secondary databases of nucleotide sequences
· Many of the secondary databases are simply sub-collections of sequences culled from one or the other of the primary databases such as GenBank or EMBL.
· There is also usually a great deal of value addition in terms of annotation, software, presentation of the information and cross-references.
· There are other secondary databases that do not present sequences at all, but only information gathered from sequence databases.
a. Omniome Database:
The Omniome
Database is a comprehensive microbial resource maintained by TIGR (The
Institute for Genomic Research). It has not only the sequence and annotation of
each of the completed genomes, but also associated information about the
organisms (such as taxon and Gram stain pattern), the structure and composition
of their DNA molecules, and many other attributes of the protein sequences
predicted from the DNA sequences.
It
facilitates meaningful multi-genome searches and analyses, for instance
alignment of entire genomes and comparison of the physical properties of proteins
and genes from different genomes.
b. FlyBase Database:
A
consortium sequenced the entire genome of the fruit fly D. melanogaster to
a high degree of completeness and quality. FlyBase is the database that holds the genetic and genomic data for this organism.
c. ACeDB:
It
is a repository of not only the sequence but also the genetic map and
phenotypic information for the nematode worm C. elegans.
DSSP (hydrogen bond estimation algorithm)
The DSSP
algorithm is the standard method for assigning secondary structure to the amino
acids of a protein, given the atomic-resolution coordinates of the protein. The
abbreviation is only mentioned once in the 1983 paper describing this
algorithm,[1] where it is the name of the Pascal program that implements the
algorithm: Define Secondary Structure of Proteins.
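The hydrogen bond estimation at the heart of DSSP can be sketched as follows. The energy formula and constants are those of the electrostatic model in the 1983 DSSP paper (partial charges of 0.42e and 0.20e, a factor of 332, and a cutoff of -0.5 kcal/mol); the function names and the coordinates used below are made up for illustration, and a full DSSP implementation does much more than this single test.

```python
import math

# Constants of the DSSP electrostatic model: partial charges (in units of e)
# and a factor that yields kcal/mol when distances are given in angstroms.
Q1, Q2, F = 0.42, 0.20, 332.0
HBOND_CUTOFF = -0.5  # kcal/mol; energies below this count as a hydrogen bond

def hbond_energy(c, o, n, h):
    """Electrostatic energy between a backbone C=O group and an N-H group.

    Each argument is an (x, y, z) coordinate tuple in angstroms.
    """
    r_on = math.dist(o, n)
    r_ch = math.dist(c, h)
    r_oh = math.dist(o, h)
    r_cn = math.dist(c, n)
    return Q1 * Q2 * F * (1 / r_on + 1 / r_ch - 1 / r_oh - 1 / r_cn)

def is_hbond(c, o, n, h):
    """True when the estimated energy is below the DSSP cutoff."""
    return hbond_energy(c, o, n, h) < HBOND_CUTOFF

# Idealized, made-up linear geometry: C=O ... H-N with an O...H distance of 1.9 A.
c, o = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0)
h, n = (3.13, 0.0, 0.0), (4.13, 0.0, 0.0)
print(round(hbond_energy(c, o, n, h), 2), is_hbond(c, o, n, h))
```

Patterns of such hydrogen bonds along the backbone are what DSSP then interprets as helices, strands and turns.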
Cambridge Structural Database
The Cambridge Structural Database (CSD) is
both a repository and a validated and curated resource for the
three-dimensional structural data of molecules generally containing at least
carbon and hydrogen, comprising a wide range of organic, metal-organic and
organometallic molecules. Its entries are complementary to those of other
crystallographic databases such as the Protein Data Bank (PDB), the Inorganic
Crystal Structure Database and the International Centre for Diffraction Data. The
data, typically obtained by X-ray crystallography and less frequently by
electron diffraction or neutron diffraction, are submitted by crystallographers
and chemists from around the world and are freely accessible (as deposited by
authors) on the Internet via the website of the CSD's parent organization. The CSD
is overseen by a not-for-profit incorporated company
called the Cambridge Crystallographic Data Centre (CCDC). The CSD is a widely
used repository for small-molecule organic and metal-organic crystal structures.
Structures deposited with the CCDC are publicly available for download at the point of publication
or with the consent of the depositor. They are also scientifically enriched and
included in the database used by software offered by the centre. Targeted
subsets of the CSD are also freely available to support teaching and other
activities.
CATH database
The CATH Protein Structure Classification
database is a free, publicly available online resource that provides
information on the evolutionary relationships of protein domains. It was
created in the mid-1990s by Professor Christine Orengo and colleagues including
Janet Thornton and David Jones, and continues to be developed by the Orengo
group at University College London. CATH shares many broad features with the
SCOP resource; however, there are also many areas in which the detailed
classification differs greatly.
Database searching
As the amount of biologically relevant
data is increasing so rapidly, knowing how to access and search this
information is essential. The two main ways of searching are:
· Text based search - Searching the annotations.
Examples: SRS, GCG’s Lookup, Entrez.
· Sequence based search - Searching the sequence itself.
Examples: BLAST, FASTA, Smith-Waterman (SW).
Text based retrieval tools
The retrieval systems listed below allow text
searching in a multitude of molecular biology databases and provide links to
relevant information for entries that match the search criteria. The systems
differ in the databases they search and the links they have to other
information.
SRS (Sequence Retrieval System)
SRS was developed at the EBI. It
provides a homogeneous interface to over 80 biological databases (see SRS help
at [25]). It includes databases of sequences, metabolic pathways, transcription
factors, application results (like BLAST, SSEARCH, FASTA), protein 3-D structures,
genomes, mappings, mutations, and locus-specific mutations. For each of the 80 available
databases, there is a short description, including its last release. Before
entering a query, one selects one or more of the databases to search. It is
possible to send the query results as a batch query to a sequence search tool.
SRS is highly recommended.
Entrez
Entrez is a molecular biology database
and retrieval system developed by the NCBI. It is an entry point for exploring
the NCBI’s integrated databases.
Entrez is easy to use but, unlike SRS,
its search is limited: it does not allow customization with an institute's
preferred databases.
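As an illustration of a text based search, Entrez can also be queried programmatically through the NCBI E-utilities. The sketch below only builds the URL of an ESearch request and does not contact the server; the helper name and the example query are invented, and the parameter set shown is a minimal assumption rather than a complete description of the service.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(database, query, max_results=5):
    """Build an ESearch URL for a text query against one Entrez database."""
    params = {
        "db": database,         # e.g. "protein" or "nucleotide"
        "term": query,          # the text query, with optional field tags
        "retmax": max_results,  # maximum number of identifiers to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"

# Hypothetical query: human insulin entries in the protein database.
url = build_esearch_url("protein", "insulin AND human[Organism]")
print(url)
```

Fetching the URL (for example with urllib.request.urlopen) would return the identifiers of the matching records, which can then be retrieved in a second request.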
Sequence Based Searching
The straightforward technique to search
a DNA sequence is to search it against DNA databases. However, it is possible
to translate a coding DNA sequence into a protein sequence, and then search it
against protein databases. Let us compare the two techniques:
• A DNA sequence is a string of length n
over an alphabet of size 4. Its protein translation is a string of length n/3
over an alphabet of size 20. Statistically, the expected number of random
matches in some arbitrary database is larger for a DNA sequence.
• DNA databases are much larger than
protein databases, and they grow faster. This also means more random hits.
• Translation of a DNA sequence to a
protein sequence causes loss of information.
• Protein sequences are better conserved
in evolution than DNA sequences.
Bottom line: Translating DNA to a protein
yields better search results. When possible (i.e. for a coding DNA sequence),
it is the recommended technique.
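The translation step can be sketched as follows, assuming the standard genetic code and a coding sequence given in the sense orientation. The function name and the example sequence are invented; unknown codons are reported as X, and stop codons as *.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, frame=0):
    """Translate a DNA string into protein, starting at offset `frame` (0, 1 or 2)."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        # Codons with ambiguous or unknown bases are reported as 'X'.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)

# Made-up example: a start codon, one alanine codon and a stop codon.
print(translate("ATGGCCTAA"))
print(translate("ATGGCCTAA", frame=1))
```

When the reading frame is not known, a search tool typically translates all three frames on both strands and searches each translation.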
Protein sequences are always searched
against protein databases. Translating them back to DNA is ambiguous, because most
amino acids are encoded by more than one codon, and results in
a large number of possible DNA sequences. The analysis in the previous
paragraph also discourages translation to DNA.
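The size of this ambiguity is easy to quantify: the number of DNA sequences that encode a given protein is the product of the number of codons available for each residue. The sketch below assumes the standard genetic code; the function name and example peptides are invented.

```python
from collections import Counter
from math import prod

# Amino acid (or '*' for stop) encoded by each of the 64 standard codons.
AMINO_ACIDS_BY_CODON = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)
# Number of synonymous codons for each amino acid.
CODONS_PER_RESIDUE = Counter(AMINO_ACIDS_BY_CODON)

def count_back_translations(protein):
    """Number of distinct DNA sequences encoding `protein`.

    Letters that are not in the standard code contribute a factor of 0.
    """
    return prod(CODONS_PER_RESIDUE[aa] for aa in protein.upper())

# Made-up peptides: the count explodes quickly with length.
for peptide in ("MW", "LS", "MLK"):
    print(peptide, count_back_translations(peptide))
```

Even a short peptide rich in leucine, serine or arginine (six codons each) maps to many DNA sequences, so searching all of them is impractical.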
Accession number = 131435452
An Accession
Number (sometimes called a Document ID) is a unique number assigned by a
particular database as an additional means of locating a specific article. Note
that an Accession Number is distinct from and unrelated to a document’s DOI.
Many library databases assign an Accession Number or Document ID, including
EBSCOhost. The Accession Number included in
an EBSCOhost record is an identifying number of an article in the database. If
you know the accession number of an article, you can search for it by entering
AN as a search tag in EBSCOhost's Boolean search screens, followed by the
number. While Accession Numbers are created as unique identifiers within a
database, searching by Accession Number can be tricky and
unreliable for a number of reasons. First, these numbers are assigned by the
databases and are subject to change without warning. Further, the same article
may be in more than one database, and thus will have different accession
numbers. Finally, you must already know which specific database (e.g., CINAHL,
Business Source Complete, etc.) contains the article in full text to know where
to search. You cannot search by Accession Number in Roadrunner Search.
Accession numbers are not used in APA formatted references.