BIOINFORMATICS

 

In biology, bioinformatics is defined as “the use of computers to store, retrieve, analyse or predict the composition or structure of biomolecules”. More broadly, bioinformatics is the application of computational techniques and information technology to the organization and management of biological data. Classical bioinformatics deals primarily with sequence analysis. The biological information of nucleic acids is available as sequences, while the data for proteins are available as both sequences and structures. Sequences are represented in a single dimension, whereas structures contain the three-dimensional data of the sequences.

Aims of bioinformatics

·       Development of databases containing all biological information.

·       Development of better tools for data design, annotation and mining.

·       Design and development of drugs using simulation software.

·       Design and development of software tools for protein structure prediction, function annotation and docking analysis.

·       Creation and development of software to improve tools for analyzing sequences for their function and similarity to other sequences.

Biological Databases

A biological database is a collection of data that is organized so that its contents can easily be accessed, managed, and updated. The activity of preparing a database can be divided into:

·       Collection of data in a form which can be easily accessed

·       Making it available on a multi-user system (always available to the user)

 



 

Importance of biological database

A range of information, such as biological sequences, structures, binding sites, metabolic interactions, molecular action, functional relationships, protein families, motifs and homologs, can be retrieved by using biological databases. The main purpose of a biological database is to store and manage biological data and information in computer-readable form.

When Sanger first discovered the method to sequence proteins, there was a lot of excitement in the field of molecular biology. Initial interest in bioinformatics was propelled by the necessity to create databases of biological sequences.

Biological databases can be broadly classified into sequence and structure databases. Sequence databases cover both nucleic acid sequences and protein sequences, whereas structure databases apply mainly to proteins.

The first database was created within a short period after the insulin protein sequence was made available in 1956. Incidentally, insulin was the first protein to be sequenced. Its sequence consists of just 51 residues (analogous to the letters in a sentence). Around the mid-1960s, the first nucleic acid sequence, that of a yeast tRNA with 77 bases (the individual units of nucleic acids), was determined.

During this period, three-dimensional structures of proteins were also being studied, and the well-known Protein Data Bank was established in 1971 as the first protein structure database, with only seven entries. It has since grown into a large database with well over 10,000 entries. While the initial protein sequence databases were maintained at individual laboratories, the development of a consolidated formal database, the SWISS-PROT protein sequence database, was initiated in 1986; it has since grown to about 70,000 protein sequences from more than 5000 organisms, a small fraction of all known organisms. This wide variety of data resources is now available for study and research by both academic institutions and industry. The data are made available as public-domain information, in the larger interest of the research community, through the Internet (www.ncbi.nlm.nih.gov) and on CD-ROM (on request from www.rcsb.org). These databases are constantly updated with additional entries.

Databases in general can be classified into primary, secondary and composite databases.

A primary database contains information on the sequence or structure alone. Examples include Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for nucleotide sequences, and the Protein Data Bank for protein structures.

A secondary database contains information derived from the primary databases. A secondary sequence database contains information such as the conserved sequences, signature sequences and active-site residues of protein families, arrived at by multiple sequence alignment of a set of related proteins. A secondary structure database contains the entries of the PDB in an organized way: entries are classified according to their structure, such as all-alpha proteins, all-beta proteins, etc. These databases also contain information on the conserved secondary structure motifs of a particular protein. Secondary databases created and hosted by researchers at their individual laboratories include SCOP, developed at Cambridge University; CATH, developed at University College London; PROSITE, of the Swiss Institute of Bioinformatics; and eMOTIF, at Stanford.

A composite database amalgamates a variety of different primary database sources, which obviates the need to search multiple resources. Different composite databases use different primary databases and different criteria in their search algorithms. Various search options have also been incorporated into the composite databases. The National Center for Biotechnology Information (NCBI), which hosts these nucleotide and protein databases on a large, highly available, redundant array of computer servers, provides free access to researchers. NCBI also links to OMIM (Online Mendelian Inheritance in Man), which contains information about the proteins involved in genetic diseases.

 

Protein and Nucleic acid databases

A protein database is one or more datasets about proteins, which could include a protein’s amino acid sequence, conformation, structure, and features such as active sites.

Protein databases are compiled by the translation of DNA sequences from different gene databases and include structural information. They are an important resource because proteins mediate most biological functions.

As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown tremendously.

The obvious examples are the nucleotide sequences, the protein sequences, and the 3D structural data produced by X-ray crystallography and macromolecular NMR.

The biological information of proteins is available as sequences and structures. Sequences are represented in a single dimension whereas the structure contains the three-dimensional data of sequences.

Importance of Protein Databases

 

Huge amounts of data on protein structures, functions, and particularly sequences are being generated. Searching databases is often the first step in the study of a new protein. Protein databases are useful in the following ways:

1.    Comparison between proteins or between protein families provides information about the relationship between proteins within a genome or across different species, and hence offers much more information than can be obtained by studying an isolated protein.

2.    Secondary databases derived from experimental databases are also widely available. These databases reorganize and annotate the data or provide predictions.

3.    The use of multiple databases often helps researchers understand the structure and function of a protein.

 

Primary databases of Protein

 

The primary databases hold protein sequences, many of which are inferred from the conceptual translation of nucleotide sequences rather than determined experimentally. Such entries are not experimentally derived information; they arise from interpretation of the nucleotide sequence information and consequently must be treated as potentially containing misinterpreted information. There are a number of primary protein sequence databases, and each requires some specific consideration.

Protein Information Resource (PIR) – Protein Sequence Database (PIR-PSD):

PIR is an integrated public bioinformatics resource to support genomic and proteomic research and scientific studies. Nowadays, PIR offers a wide variety of resources, such as PIRSF, iProClass and iProLINK, mainly oriented to assisting the propagation and consistency of protein annotations.

PIR - The Protein Sequence Database was developed in the early 1960s. It is located at the National Biomedical Research Foundation (NBRF). Since 1988 it has been maintained by PIR-International. As of Release 70.0 (September 30, 2001), PIR contained 250,417 entries. It is split into four distinct sections that differ in the quality of the data and the level of annotation.

 SWISS-PROT

Swiss-Prot was established in 1986. It is maintained collaboratively by SIB (Swiss Institute of Bioinformatics) and EBI/EMBL. It provides high-level annotations, including descriptions of protein function, the structure of protein domains, post-translational modifications, variants, etc. It aims to be minimally redundant. Swiss-Prot is linked to many other resources, including other sequence databases.

The annotation contains information on the function or functions of the protein; post-translational modifications such as phosphorylation, acetylation, etc.; functional and structural domains and sites, such as calcium-binding regions, ATP-binding sites, zinc fingers, etc.; known secondary structural features, for example alpha helices, beta sheets, etc.; the quaternary structure of the protein; similarities to other proteins, if any; and associated diseases. Sequence conflicts and variants, which may arise due to different authors publishing different sequences for the same protein or due to mutations in different strains of an organism, are also described as part of the annotation.

TrEMBL (for Translated EMBL) is a computer-annotated protein sequence database that is released as a supplement to SWISS-PROT. It contains the translation of all coding sequences present in the EMBL Nucleotide database, which have not been fully annotated. Thus it may contain the sequence of proteins that are never expressed and never actually identified in the organisms.

Protein Data Bank (PDB):

·         PDB is a primary protein structure database. It is a crystallographic database for the three-dimensional structure of large biological molecules, such as proteins.

·         In spite of the name, PDB archives the three-dimensional structures of not only proteins but also other biologically important molecules, such as nucleic acid fragments, RNA molecules, large peptides such as the antibiotic gramicidin, and complexes of proteins and nucleic acids.

·         The database holds data derived mainly from three sources: structures determined by X-ray crystallography, NMR experiments, and molecular modeling.
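As a rough illustration of how three-dimensional data are stored, the sketch below reads a few fields from one ATOM line of the legacy fixed-column PDB text format. The function name `parse_atom_record` and the sample record are hypothetical: the sample is synthetic, built column by column for this example, and is not taken from any real PDB entry.

```python
def parse_atom_record(line: str) -> dict:
    """Extract selected fixed-width fields from a PDB-format ATOM line.

    Slices are 0-based Python equivalents of the 1-based column ranges
    of the legacy PDB text format (e.g. x occupies columns 31-38).
    """
    return {
        "serial": int(line[6:11]),        # atom serial number
        "atom": line[12:16].strip(),      # atom name, e.g. N, CA, C, O
        "residue": line[17:20].strip(),   # three-letter residue name
        "chain": line[21],                # chain identifier
        "res_seq": int(line[22:26]),      # residue sequence number
        "xyz": (                          # orthogonal coordinates in angstroms
            float(line[30:38]),
            float(line[38:46]),
            float(line[46:54]),
        ),
    }


# Synthetic record (an assumption for illustration only), assembled field
# by field so that every value lands in the expected columns.
sample = (
    "ATOM  " + f"{1:5d}" + " " + " N  " + " " + "MET" + " " + "A"
    + f"{1:4d}" + "    " + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
)

atom = parse_atom_record(sample)
print(atom["residue"], atom["chain"], atom["xyz"])
```

A fixed-width reader like this is enough to show why a structure entry carries more than a sequence: every atom of every residue has its own coordinates. Production code would normally rely on an established parser rather than hand-written slices.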

 

Secondary Databases of Protein

 

The secondary databases are so termed because they contain the results of analysis of the sequences held in primary databases. Many secondary protein databases are the result of looking for features that relate different proteins. Some commonly used secondary databases of sequence and structure are as follows:

a. PROSITE: 

·         A set of databases collects together patterns found in protein sequences rather than the complete sequences. PROSITE is one such pattern database.

·         The protein motifs and patterns are encoded as “regular expressions”.

·         The information corresponding to each entry in PROSITE takes two forms: the pattern and the related descriptive text.
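To make the idea of a pattern stored as a regular expression concrete, here is a minimal sketch that converts a PROSITE-style pattern into a Python regular expression. The function name `prosite_to_regex` is hypothetical, and only the common syntax elements are handled (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends). The test sequences are made up for the example.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")           # patterns end with a period
    parts = []
    for element in pattern.split("-"):      # elements are dash-separated
        prefix = suffix = ""
        if element.startswith("<"):         # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):           # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split an optional repeat count such as (3) or (2,4) from the core.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                     # x means any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {P} means anything but P
        # [ST] and single letters are already valid regex fragments.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


# The N-glycosylation site pattern, as commonly quoted from PROSITE.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)
print(bool(re.search(regex, "AANGSAK")))   # made-up sequence with a site
print(bool(re.search(regex, "AANPSAK")))   # proline blocks the match
```

Because the pattern is short and tolerant, it will also match many sequences by chance, which is one reason each PROSITE entry pairs the pattern with descriptive text.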

b. PRINTS:

·         In the PRINTS database, the protein sequence patterns are stored as ‘fingerprints’. A fingerprint is a set of motifs or patterns rather than a single one.

·         The information contained in a PRINTS entry may be divided into three sections. In addition to the entry name, accession number and number of motifs, the first section contains cross-links to other databases that have more information about the characterized family.

·         The second section provides a table showing how many of the motifs that make up the fingerprint occur in how many of the sequences in that family.

·         The last section of the entry contains the actual fingerprints, which are stored as multiple aligned sets of sequences; the alignment is made without gaps. There is, therefore, one set of aligned sequences for each motif.

c. MHCPep:

·         MHCPep is a database comprising over 13,000 peptide sequences known to bind the Major Histocompatibility Complex (MHC) of the immune system.

·         Each entry in the database contains not only the peptide sequence, which may be 8 to 10 amino acids long, but also the specific MHC molecules to which it binds, the experimental method used to assay the peptide, the degree of activity and the binding affinity observed, the source protein that, when broken down, gave rise to this peptide (along with others), the positions along the peptide where it anchors on the MHC molecules, and references and cross-links to other information.

d. Pfam

·         Pfam contains profiles built using hidden Markov models (HMMs).

·         An HMM models the pattern as a series of match, substitute, insert or delete states, with scores assigned for the alignment to go from one state to another.

·         Each family or pattern defined in Pfam consists of four elements. The first is the annotation, which has information on the source used to make the entry, the method used and some numbers that serve as figures of merit.

·         The second is the seed alignment, which is used to bootstrap the rest of the sequences into the multiple alignment and then the family.

·         The third is the HMM profile.

·         The fourth element is the complete alignment of all the sequences identified in that family.

 

The Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, TPA and PDB. Genome, gene and transcript sequence data provide the foundation for biomedical research and discovery.

The large DNA databases are GenBank (US), EMBL (Europe, based in the UK) and DDBJ (Japan). These databases are quite similar in content and update one another periodically, as a result of the International Nucleotide Sequence Database Collaboration.

a. GenBank

The GenBank sequence database is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced and maintained by the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration (INSDC). GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. GenBank has become an important database for research in biological fields and has grown in recent years at an exponential rate, doubling roughly every 18 months.

b. EMBL (European Molecular Biology Laboratory)

The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database is a comprehensive collection of primary nucleotide sequences maintained at the European Bioinformatics Institute (EBI). Data are received from genome sequencing centers, individual scientists and patent offices. 

c. DDBJ (DNA databank of Japan)

It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is the only nucleotide sequence data bank in Asia. Although DDBJ mainly receives its data from Japanese researchers, it can accept data from contributors from any other country. 

 

2. Secondary databases of nucleotide sequences

·         Many of the secondary databases are simply sub-collections of sequences culled from one or another of the primary databases, such as GenBank or EMBL.

·         There is also usually a great deal of value addition in terms of annotation, software, presentation of the information and cross-references.

·         Other secondary databases do not present sequences at all, but only information gathered from sequence databases.

a. Omniome Database:

Omniome Database is a comprehensive microbial resource maintained by TIGR (The Institute for Genomic Research). It has not only the sequence and annotation of each of the completed genomes, but also has associated information about the organisms (such as taxon and gram stain pattern), the structure and composition of their DNA molecules, and many other attributes of the protein sequences predicted from the DNA sequences.

It facilitates meaningful multi-genome searches and analyses, for instance the alignment of entire genomes and the comparison of the physical properties of proteins and genes from different genomes.

b. FlyBase Database:

A consortium sequenced the entire genome of the fruit fly D. melanogaster to a high degree of completeness and quality. FlyBase is the database that holds the genetic and genomic information for this organism.

c. ACeDB:

It is a repository of not only the sequence but also the genetic map and phenotypic information for the nematode worm C. elegans.

 

DSSP (hydrogen bond estimation algorithm)

The DSSP algorithm is the standard method for assigning secondary structure to the amino acids of a protein, given the atomic-resolution coordinates of the protein. The abbreviation is mentioned only once in the 1983 paper describing this algorithm,[1] where it is the name of the Pascal program that implements the algorithm: Define Secondary Structure of Proteins.
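The hydrogen bond estimate that gives this section its title can be sketched as follows. DSSP uses a simple electrostatic model: the energy between a backbone C=O group and a backbone N-H group is computed from four interatomic distances, and a hydrogen bond is assigned when the energy falls below -0.5 kcal/mol. The constants are those of the published electrostatic model; the function names and the idealized collinear geometry are assumptions made for illustration.

```python
import math

Q1, Q2 = 0.42, 0.20      # partial charges on C=O and N-H, in electron units
F = 332.0                # conversion factor to kcal/mol with distances in angstroms
HBOND_CUTOFF = -0.5      # energies below this (kcal/mol) count as an H-bond


def hbond_energy(c, o, n, h):
    """Electrostatic interaction energy between a C=O and an N-H group.

    Each argument is an (x, y, z) coordinate tuple in angstroms.
    """
    d = math.dist
    return Q1 * Q2 * F * (
        1.0 / d(o, n) + 1.0 / d(c, h) - 1.0 / d(o, h) - 1.0 / d(c, n)
    )


def is_hbond(c, o, n, h):
    """Apply the energy cutoff to decide whether a hydrogen bond exists."""
    return hbond_energy(c, o, n, h) < HBOND_CUTOFF


# Idealized collinear C=O ... H-N arrangement (hypothetical coordinates).
c, o = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0)
h, n = (3.13, 0.0, 0.0), (4.13, 0.0, 0.0)
print(round(hbond_energy(c, o, n, h), 2), is_hbond(c, o, n, h))
```

Once hydrogen bonds are assigned this way, repeating patterns of them along the backbone are what DSSP interprets as helices and sheets; that second stage is not shown here.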

Cambridge Structural Database

 The Cambridge Structural Database (CSD) is both a repository and a validated and curated resource for the three-dimensional structural data of molecules generally containing at least carbon and hydrogen, comprising a wide range of organic, metal-organic and organometallic molecules. Its entries are complementary to those of other crystallographic databases such as the Protein Data Bank (PDB), the Inorganic Crystal Structure Database and the International Centre for Diffraction Data. The data, typically obtained by X-ray crystallography and less frequently by electron diffraction or neutron diffraction, are submitted by crystallographers and chemists from around the world and are freely accessible (as deposited by authors) on the Internet via the website of the CSD's parent organization (CCDC, Repository). The CSD is overseen by a not-for-profit incorporated company, the Cambridge Crystallographic Data Centre (CCDC). It is a widely used repository for small-molecule organic and metal-organic crystal structures. Structures deposited with the CCDC are publicly available for download at the point of publication or with the consent of the depositor. They are also scientifically enriched and included in the database used by software offered by the centre. Targeted subsets of the CSD are also freely available to support teaching and other activities.

CATH database

 The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues, including Janet Thornton and David Jones, and continues to be developed by the Orengo group at University College London. CATH shares many broad features with the SCOP resource; however, there are also many areas in which the detailed classification differs greatly.


Database searching

As the amount of biologically relevant data is increasing so rapidly, knowing how to access and search this information is essential. The two main ways of searching are:

·       Text based search - Searching the annotations. Examples: SRS, GCG’s Lookup, Entrez.

·       Sequence based search - Searching the sequence itself. Examples: BLAST, FASTA, Smith-Waterman (SW).
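To give a feel for what a sequence based search computes, the following is a minimal sketch of Smith-Waterman local alignment scoring, the exhaustive method that heuristic tools approximate. The scoring values (match 2, mismatch -1, linear gap -2) and the function name are arbitrary assumptions for illustration; real searches use substitution matrices and more refined gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment here
                prev[j - 1] + s,      # extend with a match or mismatch
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Made-up query scored against a tiny made-up "database".
database = {"seq1": "GGACGGG", "seq2": "CCCCCCC"}
query = "TTACGTTT"
scores = {name: smith_waterman_score(query, seq) for name, seq in database.items()}
print(scores)
```

A database search amounts to running this comparison against every entry and ranking the scores. The cost grows with the product of the two sequence lengths, which is why faster heuristic programs are preferred for large databases.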

 

Text based retrieval tools

The retrieval systems listed here allow text searching in a multitude of molecular biology databases and provide links to relevant information for entries that match the search criteria. The systems differ in the databases they search and the links they have to other information.

 

SRS (Sequence Retrieval System)

SRS was developed at the EBI. It provides a homogeneous interface to over 80 biological databases (see SRS help at [25]). It includes databases of sequences, metabolic pathways, transcription factors, application results (like BLAST, SSEARCH, FASTA), protein 3-D structures, genomes, mappings, mutations, and locus-specific mutations. For each of the 80 available databases there is a short description, including its last release. Before entering a query, one selects one or more of the databases to search. It is possible to send the query results as a batch query to a sequence search tool. SRS is highly recommended.

 

Entrez

Entrez is a molecular biology database and retrieval system developed by the NCBI. It is an entry point for exploring the NCBI’s integrated databases. Entrez is easy to use, but unlike SRS, its search is limited: it does not allow customization with an institute’s preferred databases.
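A text based query against Entrez can also be expressed programmatically. The sketch below only builds a query URL for the NCBI E-utilities `esearch` endpoint and sends nothing over the network. E-utilities is not described in this article, so its use here is an assumption for illustration, as are the function name and the example search term.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) a text-search query URL.

    db     -- name of the Entrez database to search, e.g. "protein"
    term   -- the text query, optionally with field tags in brackets
    retmax -- maximum number of record identifiers to return
    """
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_ESEARCH + "?" + query


# Example: search annotations, not sequences, for human insulin entries.
url = build_esearch_url("protein", "insulin[Title] AND human[Organism]")
print(url)
```

The query operates on annotation fields such as the title and organism, which is what distinguishes a text based search from a sequence based one.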

 

 Sequence Based Searching

The straightforward technique for searching with a DNA sequence is to search it against DNA databases. However, it is also possible to translate a coding DNA sequence into a protein sequence and then search it against protein databases. Let us compare the two techniques:

• A DNA sequence is a string of length n over an alphabet of size 4. Its protein translation is a string of length n/3 over an alphabet of size 20. A chance match at any single position is therefore more likely for DNA (1 in 4) than for protein (1 in 20), so the expected number of random matches in some arbitrary database is larger for a DNA sequence.

• DNA databases are much larger than protein databases, and they grow faster. This also means more random hits.

• Translation of a DNA sequence to a protein sequence causes loss of information.

• Protein sequences are more conserved biologically than DNA sequences.

Bottom line: Translating DNA to protein yields better search results. When possible (i.e. for a coding DNA sequence), it is the recommended technique.

Protein sequences are always searched against protein databases. Translating them to DNA is ambiguous and results in a large number of possible DNA sequences. The analysis in the previous paragraph also discourages translation to DNA.
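The asymmetry between the two directions can be shown with a short sketch. Translating DNA to protein gives exactly one answer, whereas going back from protein to DNA gives many, because most amino acids are encoded by several codons. The code assumes the standard genetic code and reading frame 1; the function names and example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence (frame 1) up to the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):     # step through complete codons
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":                       # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)


def count_back_translations(protein: str) -> int:
    """Count the distinct DNA sequences that encode the given protein."""
    total = 1
    for aa in protein:
        total *= sum(1 for v in CODON_TABLE.values() if v == aa)
    return total


print(translate("ATGGCCTGGTAA"))
print(count_back_translations("MAW"), count_back_translations("LSR"))
```

Even a three-residue peptide such as Leu-Ser-Arg has hundreds of possible coding sequences, so back-translating a full-length protein for a DNA search is impractical.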

 

Accession number = 131435452

An Accession Number (sometimes called a Document ID) is a unique number assigned by a particular database as an additional means of locating a specific article. Note that an Accession Number is distinct from and unrelated to a document’s DOI number. Many library databases, including EBSCOhost, assign an Accession Number or Document ID. The Accession Number included in an EBSCOhost record is an identifying number for an article in the database. If you know the accession number of an article, you can search for it by entering AN as a search tag in EBSCOhost's Boolean search screens, followed by the number. While Accession Numbers are created as unique identifiers within a database, searching by Accession Number can unfortunately be tricky and unreliable for a number of reasons. First, these numbers are assigned by the databases and are subject to change without warning. Further, the same article may be in more than one database, and will thus have different accession numbers. Finally, you must already know which specific database (e.g., CINAHL, Business Source Complete, etc.) contains the article in full text to know where to search. You cannot search by Accession Number in Roadrunner Search. Accession numbers are not used in APA formatted references.
