SEVENS database

What's SEVENS

We have developed the SEVENS database, which collects G-protein coupled receptor (GPCR) genes with seven transmembrane helices (TMH). The genes are identified with high accuracy from the complete genomes of 68 eukaryotes by a pipeline that integrates a gene finder, a sequence alignment tool, a motif and domain assignment tool, and a TMH predictor. An overview of the "GPCR universe" requires a larger data space than currently available databases cover. It should include not only expressed sequences but also newly identified sequences that cannot be detected by in vivo experiments, although they are present in the genome sequence and may yet express their functions. We introduce this database for that purpose.

Contents

Content Search

Retrieve GPCR candidate sequences by an "AND" combination of (1) keyword in nr.aa database search results, (2) chromosome number, (3) data level, (4) predicted exon number, (5) gene length, (6) protein length, (7) E-value of sequence search against Swiss-Prot, nr.aa, or UniGene, (8) PROSITE motifs, (9) Pfam domains, (10) novel or not, and (11) family.

After selection by some of these criteria, GPCR candidates appear in the chromosomal viewer and in the gene lists, which link to the detailed analysis for each gene.

News

Release Information and news concerning updates of analysis.

What's SEVENS

Introduction to the SEVENS database and its usage.

Statistics

Release information of data statistics.

How we found GPCR sequences.

Candidate GPCR genes were collected from 68 eukaryote genomes with the GPCR gene discovery pipeline, which is composed of two stages:

1) Gene finding stage (i.e., translation of genomic sequences into amino acid sequences).
2) GPCR gene screening stage, which assesses the candidate genes by sequence search, motif and domain assignment, and transmembrane helix (TMH) prediction.

(1) Gene finding stage:

Genomic sequences were obtained from the FTP sites of NCBI, UCSC, Ensembl, Broad Institute, Baylor College of Medicine, dictyBase, and IRGSP. To maximize the number of gene candidates, we collected two kinds of sequence sets:

(a) "6f-sequences": all possible combinations of initiation and stop codons in the 6 reading frames, following the rule of using the most upstream ATG possible.
(b) "ALN-sequences": genomic regions where at least part of a known GPCR sequence hits with a significant TBLASTN score are listed. Around these regions, the full gene structure is constructed with ALN, which performs a dynamic programming alignment of the known GPCR and the genome sequence while taking exon/intron boundaries into account.
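
The 6f-sequence rule in (a) can be illustrated with a minimal Python sketch. It is not the SEVENS pipeline code: the function names are hypothetical, the three standard stop codons are assumed, and each open reading frame is assumed to run from the most upstream in-frame ATG after the previous stop codon to the next stop codon.

```python
# Illustrative sketch only; not the code used to build SEVENS.
STOP_CODONS = {"TAA", "TAG", "TGA"}          # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, frame):
    """Yield (start, end) of ORFs in one forward frame.

    `end` is exclusive and includes the stop codon.
    """
    start = None
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon == "ATG" and start is None:
            start = i                        # keep only the most upstream ATG
        elif codon in STOP_CODONS:
            if start is not None:
                yield start, i + 3
            start = None                     # a stop codon resets the search


def six_frame_orfs(dna):
    """Return (strand, frame, orf_sequence) for all 6 reading frames."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame):
                results.append((strand, frame, seq[start:end]))
    return results
```

In this sketch an internal ATG never starts a second, shorter ORF, which is one way to read "the most upstream ATG possible"; coordinates on the minus strand refer to the reverse-complemented sequence.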

Candidate sequences selected by the above process still contain the following redundancies. (1) Perfect matches or overlaps at the same genomic position (chromosome number, relative position on the genome). We regarded them as the same gene and adjusted the double count accordingly. (2) Multiple sequence copies in different genomic positions. We regarded them as different genes. (3) Separate sequence fragments linked by a known protein sequence. These originate from erroneous predictions by the gene finding programs. We merged them using the linker sequence.
These redundancies were detected by the following clustering method for each level. First, Smith-Waterman sequence alignment was applied to the candidate sequences in an all-against-all fashion. Sequences were then linked only when they hit over more than 50 amino acids with more than 95% identity and shared the same chromosome number and an overlapping genomic position. If the chromosome number was unknown for either or both sequences, they were linked only when the identity exceeded 99%. After the transitive closures of the links were computed, each known human GPCR sequence from Swiss-Prot was aligned against all the candidate sequences, and all clusters that hit over more than 50 amino acids with more than 99% identity were merged. Finally, the longest sequence in each cluster was selected as the representative.
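
The linking and transitive-closure steps can be sketched in Python with a union-find structure. This is a hedged illustration, not the original implementation: the record fields (chrom, start, end, length, aligned_aa, identity) are hypothetical, the 50-residue minimum is assumed to apply also when a chromosome number is unknown, and the later merge against known human GPCR sequences is omitted.

```python
# Illustrative sketch only; field names are hypothetical.

def find(parent, x):
    """Return the cluster root of x, compressing the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def should_link(hit, seqs):
    """Apply the linking rule to one pairwise alignment hit."""
    a, b = seqs[hit["a"]], seqs[hit["b"]]
    if hit["aligned_aa"] <= 50:              # must hit over more than 50 residues
        return False
    if a["chrom"] is None or b["chrom"] is None:
        return hit["identity"] > 99.0        # unknown chromosome: stricter identity
    same_locus = (a["chrom"] == b["chrom"]
                  and a["start"] <= b["end"] and b["start"] <= a["end"])
    return same_locus and hit["identity"] > 95.0


def cluster(seqs, hits):
    """Return {representative: sorted members} after transitive closure."""
    parent = {name: name for name in seqs}
    for hit in hits:
        if should_link(hit, seqs):
            root_a, root_b = find(parent, hit["a"]), find(parent, hit["b"])
            if root_a != root_b:
                parent[root_b] = root_a      # union the two clusters
    groups = {}
    for name in seqs:
        groups.setdefault(find(parent, name), []).append(name)
    # The longest sequence of each cluster is taken as its representative.
    return {max(members, key=lambda n: seqs[n]["length"]): sorted(members)
            for members in groups.values()}
```

Union-find gives the transitive closure directly: if A links to B and B links to C, all three end up under one root even when A and C were never aligned above the thresholds.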

ALN

(Gotoh, O. Bioinformatics 16(3), 190-202 (2000))

Using a new convention for encoding a DNA sequence into a series of 23 possible letters, a dynamic programming algorithm ('aln' written in ANSI-C) was developed to align a DNA sequence and a protein sequence or profile so that the spliced and translated sequence optimally matches the reference the same as the standard protein sequence alignment allowing for long gaps. The objective function also takes account of frame shift errors, coding potentials, and translation initiation, termination and splicing signals. This method was tested on Caenorhabditis elegans genes of known structures. The accuracy of prediction measured in terms of a correlation coefficient was about 95% at the nucleotide level for the 288 genes tested, and 97.0% for the 170 genes whose product and closest homologue share more than 30% identical amino acids.

(2) Gene screening stage:

Each analysis tool was first assessed to determine two threshold settings, best specificity and best sensitivity, with a reference dataset: GPCR sequences and non-GPCR sequences in the Swiss-Prot database. The best specificity threshold is intended to achieve, when applied to the reference dataset, almost 100% specificity with a minimum of false negatives. The best sensitivity threshold, in contrast, is intended to achieve almost 100% sensitivity with a minimum of false positives.
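
How two such thresholds might be chosen can be sketched in Python. The example is an assumption-laden illustration: the scored reference set is invented toy data, the candidate thresholds are supplied by the caller, and ties are broken by the other measure (highest specificity first and then sensitivity, or the reverse), which is one plausible reading of the criteria above.

```python
# Illustrative sketch only; the data in any usage is invented.

def sensitivity_specificity(scored, threshold):
    """scored: list of (e_value, is_gpcr); positive when e_value < threshold.

    Assumes the list contains at least one GPCR and one non-GPCR.
    """
    tp = sum(1 for e, is_gpcr in scored if is_gpcr and e < threshold)
    fn = sum(1 for e, is_gpcr in scored if is_gpcr and e >= threshold)
    fp = sum(1 for e, is_gpcr in scored if not is_gpcr and e < threshold)
    tn = sum(1 for e, is_gpcr in scored if not is_gpcr and e >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_thresholds(scored, candidates):
    """Return (best_specificity_threshold, best_sensitivity_threshold)."""
    stats = {t: sensitivity_specificity(scored, t) for t in candidates}
    # Best specificity: maximise specificity, then sensitivity (fewest false negatives).
    best_spec = max(candidates, key=lambda t: (stats[t][1], stats[t][0]))
    # Best sensitivity: maximise sensitivity, then specificity (fewest false positives).
    best_sens = max(candidates, key=lambda t: (stats[t][0], stats[t][1]))
    return best_spec, best_sens
```

For example, with a toy reference set of three GPCRs and three non-GPCRs, a strict E-value cutoff may reject every non-GPCR while missing one GPCR, and a looser cutoff may recover all GPCRs at the cost of one false positive; the function returns those two cutoffs.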
Using the thresholds shown in Table 1, those GPCR candidates were selected that showed significant sequence similarity or contained characteristic motifs and domains, and transmembrane helices.
Four confidence levels of the datasets were determined by combining the best specificity and best sensitivity thresholds. Level A data, expected to show the best specificity, were obtained by adding together the candidate sequences given by the best specificity thresholds of the sequence similarity search and the motif and domain assignments. To discover remote GPCR homologues, we combined candidates from the three-level thresholds for TMH prediction (see Table 1) with the sequences that were obtained by the best sensitivity thresholds of the sequence search and the motif and domain assignments. Level D data are expected to show the best sensitivity.

Table 1. Thresholds used for GPCR discovery.

	Level A (Best specificity)	Level B	Level C	Level D (Best sensitivity)
Sequence search with BLASTP	E < 10^-80	E < 10^-30	E < 10^-30	E < 10^-30
Domain assignment with Pfam	E < 10^-10	E < 1.0	E < 1.0	E < 1.0
Motif assignment with PROSITE	Not used	Match	Match	Match
TMH Prediction	Not used	TMwindows(7) AND SOSUI(7)	TMwindows(7) AND SOSUI(6-8)	TMwindows(7) OR SOSUI(7)
Sensitivity	99.4%	99.8%	99.9%	99.9%
Specificity	96.6%	70.0%	48.4%	20.0%

Thresholds of the programs are shown.
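
As an illustration of how Table 1 could be applied to a single candidate, the following Python sketch encodes the thresholds and returns the most stringent level the candidate satisfies. Several points are assumptions rather than statements from the text: the criteria of one level are combined by OR (the text speaks of "adding" and "combining" candidate sets), each TMH predictor is treated as reporting a single helix count, and the candidate field names are hypothetical.

```python
# Illustrative sketch only; thresholds are copied from Table 1,
# everything else (field names, OR-combination) is an assumption.
LEVELS = {
    "A": {"blastp_e": 1e-80, "pfam_e": 1e-10, "prosite": False, "tmh": None},
    "B": {"blastp_e": 1e-30, "pfam_e": 1.0, "prosite": True,
          "tmh": lambda tw, so: tw == 7 and so == 7},
    "C": {"blastp_e": 1e-30, "pfam_e": 1.0, "prosite": True,
          "tmh": lambda tw, so: tw == 7 and 6 <= so <= 8},
    "D": {"blastp_e": 1e-30, "pfam_e": 1.0, "prosite": True,
          "tmh": lambda tw, so: tw == 7 or so == 7},
}


def passes_level(candidate, level):
    """True if the candidate meets at least one criterion of the level."""
    rule = LEVELS[level]
    checks = [candidate["blastp_e"] < rule["blastp_e"],
              candidate["pfam_e"] < rule["pfam_e"]]
    if rule["prosite"]:
        checks.append(candidate["prosite_match"])
    if rule["tmh"] is not None:
        checks.append(rule["tmh"](candidate["tmwindows"], candidate["sosui"]))
    return any(checks)


def assign_level(candidate):
    """Return the most stringent level passed ('A'..'D'), or None."""
    for level in "ABCD":
        if passes_level(candidate, level):
            return level
    return None
```

Under these assumptions a candidate with no sequence, domain, or motif support but with 7 helices from TMwindows and 8 from SOSUI would fall into level C.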

Using BLASTP (Altschul, S. F., et al. Nucleic Acids Res. 25, 3389-3402 (1997)), known GPCR sequences were searched against the reference dataset, and the sensitivity and specificity of E values were computed for discriminating correct pairs.
Using HMMER (Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L. & Sonnhammer, E. L. Nucleic Acids Res. 28, 263-266 (2000)), GPCR-specific hidden Markov models (Pfam domains) were assigned to the reference sequences, and the sensitivity and specificity of E values were computed for correct assignment.
Since PROSITE patterns are written as regular expressions, we determined a P value for each pattern, calculated as the product of the residue frequencies in the Swiss-Prot database; the sensitivity and specificity of P values were computed for correct assignment.
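
The P value of a pattern can be sketched in Python as a product over pattern positions. This is only one plausible reading of the description: a position that allows several residues is assumed to contribute the sum of their frequencies, a wildcard contributes 1, and only fixed repeats such as x(2) are handled. The uniform frequency table is a placeholder, not the Swiss-Prot composition, and the pattern in the usage note is a toy pattern, not a real PROSITE entry.

```python
# Illustrative sketch only; frequencies are placeholders, not Swiss-Prot values.
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNIFORM_FREQ = {aa: 0.05 for aa in AMINO_ACIDS}   # placeholder composition

# One pattern element: x, a residue, [allowed] or {excluded}, optional (n) repeat.
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?$")


def position_probability(token, freq):
    """Probability that a random residue matches one pattern position."""
    if token == "x":
        return 1.0
    if token.startswith("["):
        return sum(freq[aa] for aa in token[1:-1])
    if token.startswith("{"):
        return 1.0 - sum(freq[aa] for aa in token[1:-1])
    return freq[token]


def pattern_p_value(pattern, freq=UNIFORM_FREQ):
    """Multiply the per-position probabilities of a PROSITE-style pattern."""
    p = 1.0
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        token, repeat = match.group(1), int(match.group(2) or 1)
        p *= position_probability(token, freq) ** repeat
    return p
```

With the placeholder frequencies, the toy pattern [LIVM]-x(2)-C gives 0.2 x 1 x 1 x 0.05, so shorter or more permissive patterns receive larger P values and are more likely to match by chance.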
For TMH prediction we used TMwindows, our original program, along with SOSUI. We treated the results as GPCR outputs when the predicted helix number fell between n and m. Here we used the n-m ranges 7-7, 6-8, 5-9, and 4-10 and combined the sequences obtained from each range of the two programs. For example, the descriptor {TMwindows(7) OR SOSUI(6-8)} takes the union ("OR") of the sequences within range 7-7 obtained by TMwindows and the sequences within range 6-8 obtained by SOSUI.

SOSUI

A useful tool for secondary structure prediction of membrane proteins from a protein sequence. The basic idea of prediction in this system is based on the physicochemical properties of amino acid sequences such as hydrophobicity and charges. The system deals with three types of prediction: discrimination of membrane proteins from soluble ones, prediction of the existence of transmembrane helices and determination of transmembrane helical regions.
(Hirokawa, T., Boon-Chieng, S. & Mitaku, S. Bioinformatics 14, 378-379 (1998))

TMwindows

Predicts transmembrane helices by the following procedures.
(1) It assigns the Engelman-Steitz-Goldman (Annual Review of Biophysics and Biophysical Chemistry 15, 321-353 (1986)) hydropathy index to amino acid sequences and calculates the average hydrophobicity within a pre-determined window. The index was selected, after comparing all indices in the AAindex database (Protein Eng. 9, 27-36 (1996)), as the most powerful for discriminating membrane proteins from others using total average hydrophobicity.
(2) The window size is changed from 19 to 27, and if the average hydrophobicity within a window exceeds 2.5, the region is regarded as a transmembrane helix. The total number of helices computed for each window size gives the range of the predicted helix number.
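
The two steps above can be sketched in Python as a sliding-window scan. This is not the TMwindows program itself: the hydropathy scale is passed in by the caller (the toy scale in the test is a placeholder, not the Engelman-Steitz-Goldman values), and consecutive windows above the threshold are assumed to belong to one helix.

```python
# Illustrative sketch only; not the TMwindows implementation.

def count_helices(seq, scale, window, threshold=2.5):
    """Count runs of consecutive windows whose mean hydropathy exceeds threshold."""
    values = [scale[aa] for aa in seq]
    count, inside = 0, False
    total = sum(values[:window])
    for start in range(len(values) - window + 1):
        if start > 0:
            # Slide the window by one residue: add the new one, drop the old one.
            total += values[start + window - 1] - values[start - 1]
        above = total / window > threshold
        if above and not inside:
            count += 1                      # a new hydrophobic region begins
        inside = above
    return count


def helix_number_range(seq, scale, sizes=range(19, 28), threshold=2.5):
    """Return (min, max) helix count over window sizes 19..27."""
    counts = [count_helices(seq, scale, w, threshold) for w in sizes]
    return min(counts), max(counts)
```

A sequence whose range is (7, 7) would correspond to the TMwindows(7) descriptor used in Table 1; how the real program merges neighbouring windows may differ from this sketch.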

(3) Databases used for analysis:

Genome DataBases
Homo sapiens	UCSC hg18
Pan troglodytes	UCSC panTro2
Gorilla gorilla	Ensembl 1.56
Pongo abelii	UCSC ponAbe2
Papio hamadryas	Baylor College of Medicine 2008/11/20
Macaca mulatta (rhesus macaque)	UCSC rheMac2
Microcebus murinus	Ensembl 1.56
Tarsius syrichta	Ensembl 1.56
Callithrix jacchus	UCSC calJac1
Otolemur garnettii	UCSC otoGar1
Tupaia belangeri	UCSC tupBel1
Erinaceus europaeus	UCSC eriEur1
Sorex araneus	UCSC sorAra1
Echinops telfairi	UCSC echTel1
Cavia porcellus	UCSC cavPor2
Rattus norvegicus	UCSC rn4
Mus musculus	UCSC mm9
Spermophilus tridecemlineatus	Broad Institute speTri1
Procavia capensis	Ensembl 1.56
Pteropus vampyrus	Ensembl 1.56
Myotis lucifugus	Ensembl 1.56
Equus caballus	UCSC equCab2
Canis familiaris	UCSC canFam2
Felis catus	UCSC felCat3
Bos taurus	UCSC bosTau3
Vicugna pacos	Ensembl 1.56
Tursiops truncatus	Baylor College of Medicine 2007/04/19
Oryctolagus cuniculus	UCSC oryCun1
Ochotona princeps	Broad Institute OchPri2.0
Loxodonta africana	UCSC loxAfr1
Dasypus novemcinctus	UCSC dasNov1
Choloepus hoffmanni	UCSC choHof1
Monodelphis domestica	UCSC monDom5
Macropus eugenii	Ensembl 1.56
Ornithorhynchus anatinus	UCSC ornAna1
Gallus gallus	UCSC galGal3
Taeniopygia guttata	UCSC taeGut1
Anolis carolinensis	UCSC anoCar1
Xenopus tropicalis	UCSC xenTro2
Oryzias latipes	UCSC oryLat1
Gasterosteus aculeatus	UCSC gasAcu1
Tetraodon nigroviridis	UCSC tetNig1
Fugu rubripes	UCSC fr2
Danio rerio	UCSC danRer5
Petromyzon marinus	UCSC petMar1
Ciona intestinalis	UCSC ci2
Drosophila melanogaster	UCSC dm3
Drosophila simulans	UCSC droSim1
Drosophila yakuba	UCSC droYak2
Drosophila pseudoobscura	UCSC dp4
Drosophila ananassae	UCSC droAna3
Drosophila erecta	UCSC droEre2
Drosophila grimshawi	UCSC droGri2
Drosophila mojavensis	UCSC droMoj3
Drosophila persimilis	UCSC droPer1
Drosophila sechellia	UCSC droSec1
Drosophila virilis	UCSC droVir3
Drosophila willistoni	UCSC droWil1
Anopheles gambiae	UCSC anoGam1
Caenorhabditis elegans	UCSC ce4
Caenorhabditis remanei	UCSC caeRem2
Caenorhabditis brenneri	UCSC caePb1
Caenorhabditis briggsae	UCSC cb3
Caenorhabditis japonica	UCSC caeJap1
Arabidopsis thaliana	NCBI 2000/12/14
Oryza sativa	IRGSP build 4.0
Dictyostelium discoideum	dictyBase 2007/12/11
Saccharomyces cerevisiae	UCSC sacCer1
Sequence DataBases
Swiss-Prot	ver.54.2
GPCRDB	release 10.0
nr.aa	release at 2007/10/25
UniGene	release at 2007/10/25
Motif and Domain DataBases
Pfam	release 22.0
PROSITE	release 20.19

The signal transduction pathway starting from a GPCR in the cell.

We added signal transduction pathway information to about 333 kinds of human GPCRs (olfactory receptors excluded) out of 890 kinds. This includes the following information: ❶ ligand, ❷ binding G protein, ❸ signal transduction pathway of downstream proteins, ❹ biological phenomenon caused by this pathway, and ❺ literature information relating to this pathway. (Please refer to the cited literature for experimental conditions and disease information.) Starting from one combination of a ligand, a GPCR, and a kind of G protein, the signal flow to the end point (the cell nucleus) is drawn as a picture showing the kinds of proteins, their interactions, and the substances produced. Therefore, for one GPCR entry, the number of pictures is the product of the number of different ligands and the number of different kinds of G proteins.
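
The picture count for one entry can be checked with a short Python sketch; the ligand and G protein names used with it are placeholders, not data from the database.

```python
# Illustrative sketch only; names passed in are placeholders.
from itertools import product


def pathway_pictures(ligands, g_proteins):
    """Return one (ligand, G protein) pair per pathway picture of a GPCR entry."""
    return list(product(ligands, g_proteins))
```

For a hypothetical entry with 3 ligands and 2 kinds of G proteins, the function returns 6 pairs, matching the rule that the number of pictures is the product of the two counts.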

(a) "Various arrows" indicate control relations between proteins. An ordinary arrow means that the next protein is activated, and a T-shaped arrow means that the next protein is inhibited. Substances produced by proteins are indicated by circles.
(b) The word "null" means that some protein or product exists at that position, but it is not yet clear what it is.

Comments or questions to [email protected]
Last revised on 2018/1/29.