Protein test case study ========================== .. topic:: Application: retrieving information about a given protein This section uses BioServices to demonstrate the interest of combining several services together within a single framework using the Python language as a glue language. In this tutorial we are interested in using **BioServices** to obtain information about a specific protein. Let us focus on ZAP70 protein (homo sapiens). .. testsetup:: protein from bioservices import UniProt u = UniProt(verbose=False) Get a unique identifier and gene names from a name ---------------------------------------------------- From **Uniprot**, we can obtain the unique accession number of ZAP70, which may be useful later on. Let us try to use the :meth:`~bioservices.uniprot.UniProt.search` method:: >>> from bioservices import UniProt >>> u = UniProt(verbose=False) >>> u.search("ZAP70_HUMAN") # could be lower case The default format of the returned answer is "tabulated":: >>> res = u.search("ZAP70_HUMAN", frmt="tab") >>> print(res) Entry Entry name Status Protein names Gene names Organism Length P43403 ZAP70_HUMAN reviewed Tyrosine-protein kinase ZAP-70 (EC 2.7.10.2) (70 kDa zeta-chain associated protein) (Syk-related tyrosine kinase) ZAP70 SRK Homo sapiens (Human) 619 It is better, but let us simplify even further. In **BioServices**, the output of the tabulated format contains several columns but we can select only a subset such as the Entry (accession number) and the gene names, which are coded as "id" and "genes" in uniprot database:: >>> res = u.search("ZAP70_HUMAN", frmt="tab", columns="id,genes") >>> print(res) Entry Gene names P43403 ZAP70 SRK So here we got the Entry P43403. Entry and Gene names can be saved in two variables as follows:: >>> res = u.search("ZAP70_HUMAN", frmt="tab", columns="id,genes") >>> entry, gene_names = res.split("\n")[1].split("\t") Getting the fasta sequence --------------------------- It is then straightforward to obtain the FASTA sequence of ZAP70 using another method from the UniProt class called :meth:`~bioservices.uniprot.UniProt.retrieve`: .. doctest:: protein >>> sequence = u.retrieve("P43403", "fasta") >>> print(sequence) >sp|P43403|ZAP70_HUMAN Tyrosine-protein kinase ZAP-70 OS=Homo sapiens OX=9606 GN=ZAP70 PE=1 SV=1 MPDPAAHLPFFYGSISRAEAEEHLKLAGMADGLFLLRQCLRSLGGYVLSLVHDVRFHHFP IERQLNGTYAIAGGKAHCGPAELCEFYSRDPDGLPCNLRKPCNRPSGLEPQPGVFDCLRD AMVRDYVRQTWKLEGEALEQAIISQAPQVEKLIATTAHERMPWYHSSLTREEAERKLYSG AQTDGKFLLRPRKEQGTYALSLIYGKTVYHYLISQDKAGKYCIPEGTKFDTLWQLVEYLK LKADGLIYCLKEACPNSSASNASGAAAPTLPAHPSTLTHPQRRIDTLNSDGYTPEPARIT SPDKPRPMPMDTSVYESPYSDPEELKDKKLFLKRDNLLIADIELGCGNFGSVRQGVYRMR KKQIDVAIKVLKQGTEKADTEEMMREAQIMHQLDNPYIVRLIGVCQAEALMLVMEMAGGG PLHKFLVGKREEIPVSNVAELLHQVSMGMKYLEEKNFVHRDLAARNVLLVNRHYAKISDF GLSKALGADDSYYTARSAGKWPLKWYAPECINFRKFSSRSDVWSYGVTMWEALSYGQKPY KKMKGPEVMAFIEQGKRMECPPECPPELYALMSDCWIYKWEDRPDFLTVEQRMRACYYSL ASKVEGPPGSTQKAEAACA .. note:: There are many services that provides access to the FASTA sequence. We chose uniprot but you could use the Entrez utilities as well as other services. Using BLAST on the sequence ------------------------------ You can then analyse this sequence with your favorite tool. As an example, within **BioServices** you can use :class:`~bioservices.services.nciblast.NCIBlast` but first let us extract the sequence itself (without the header):: sequence = sequence.split("\n", 1)[1].strip("\n") then, :: >>> s = NCBIblast(verbose=False) >>> jobid = s.run(program="blastp", sequence=sequence, stype="protein", \ ... database="uniprotkb", email="cokelaer@ebi.ac.uk") >>> print(s.getResult(jobid, "out")[0:1000]) BLASTP 2.2.26 [Sep-21-2011] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= EMBOSS_001 (619 letters) Database: uniprotkb 32,727,302 sequences; 10,543,978,207 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value SP:ZAP70_HUMAN P43403 Tyrosine-protein kinase ZAP-70 OS=Homo sap... 1279 0.0 TR:H2QIE3_PANTR H2QIE3 Tyrosine-protein kinase OS=Pan troglodyte... 1278 0.0 TR:G3QGN8_GORGO G3QGN8 Tyrosine-protein kinase OS=Gorilla gorill... 1278 0.0 TR:G1QLX3_NOMLE G1QLX3 Tyrosine-protein kinase OS=Nomascus leuco... 1249 0.0 TR:F6SWY7_CALJA F6SWY7 Tyrosin The last command waits for the job to be finised before printing the results, which may be quite long. We could look at the beginnin of the reported results and select only HUMAN sequences to see that the best sequence found is indeed ZAP70_HUMAN as expected:: >>> [x for x in s.getResult(jobid, "out").split("\n") if "HUMAN" in x] ['SP:ZAP70_HUMAN P43403 Tyrosine-protein kinase ZAP-70 OS=Homo sap... 1279 0.0 ', 'SP:KSYK_HUMAN P43405 Tyrosine-protein kinase SYK OS=Homo sapiens... 691 0.0 ', 'TR:A8K4G2_HUMAN A8K4G2 Tyrosine-protein kinase OS=Homo sapiens P... 691 0.0 ', ... Searching for relevant pathways ------------------------------------------ The KEGG services provides pathways, so let try to find pathways that contains our targetted protein. First we need to know the KEGG Id that corresponds to ZAP70. We can use the **find** method form KEGG service:: >>> from bioservices import KEGG >>> k = KEGG(verbose=False) >>> k.find("hsa", "zap70") # "hsa" stands for homo sapiens hsa:7535 ZAP70, SRK, STCD, STD, TZK, ZAP-70; zeta-chain (TCR) associated protein kinase 70kDa (EC:2.7.10.2); K07360 tyrosine-protein kinase ZAP-70 [EC:2.7.10.2 There are other ways to perform this conversion using the :meth:`bioservices.uniprot.UniProt.mapping` or :meth:`bioeservices.KEGG.conv` methods (e.g., \textit{k.conv("hsa", "up:P43403")}). Now, let us get the pathways that contains this ID:: >>> k.get_pathway_by_gene("7535", "hsa") {'hsa04064': ' NF-kappa B signaling pathway', 'hsa04650': 'Natural killer cell mediated cytotoxicity', 'hsa04660': 'T cell receptor signaling pathway', 'hsa05340': 'Primary immunodeficiency'} We can look at the first pathway in a browser (highlighting the ZAP70 node):: >>> k.show_pathway("hsa04064", keggid={"7535": "red"}) Searching for binary Interactions ----------------------------------- For this purpose, we could use PSICQUIC services to find the interactions that involve ZAP70 in the **mint** database:: >>> from bioservices import PSICQUIC >>> s = PSICQUIC(verbose=False) >>> data = s.query("mint", "ZAP70 AND species:9606") where 9606 is the taxonomy Id for homo sapiens. We could also figure out how many interactions could be found in each dabase for this particular query:: >>> s.getInteractionCounter("zap70 AND species:9606") {'apid': 82, 'bar': 0, 'bind': 4, 'bindingdb': 29, 'biogrid': 73, 'chembl': 161, 'dip': 0, 'i2d-imex': 0, 'innatedb': 13, 'innatedb-imex': 0, 'intact': 11, 'interoporc': 0, 'irefindex': 273, 'matrixdb': 0, 'mbinfo': 0, 'mint': 34, 'molcon': 0, 'mpidb': 0, 'reactome': 0, 'reactome-fis': 134, 'spike': 47, 'string': 319, 'topfind': 0, 'uniprot': 0} We see for instance that the **mint** database has 34 interactions. Coming back to the interactions returned by s.query, we find indeed 34 intercations between ZAP70 and another component:: >>> len(data) 34 Let us look at the first one:: >>> for x in data[0]: print(x) uniprotkb:P15498 uniprotkb:P43403 - - uniprotkb:VAV1(gene name)|uniprotkb:VAV(gene name synonym) uniprotkb:ZAP70(gene name)|uniprotkb:SRK(gene name synonym)|uniprotkb:70 kDa zeta-associated protein(gene name synonym)|uniprotkb:Syk-related tyrosine kinase(gene name synonym) psi-mi:"MI:0019"(coimmunoprecipitation) - pubmed:9151714 taxid:9606(Homo sapiens) taxid:9606(Homo sapiens) psi-mi:"MI:0914"(association) psi-mi:"MI:0471"(mint) mint:MINT-8035351 mint-score:0.28(free-text)|homomint-score:0.28(free-text)'intact-miscore:0.60'] The First two elements are the entries for specy A and B. The last element is the score. The 11th element is the type of interaction and so on. What could be useful is to convert these elements into uniprot ID only. With mint database it is irrelevant for this particular entry but with other DBs or entries, it may be useful (e.g., biogrid). BioServices provides such a function called :meth:`~bioservices.services.psicquic.convert`:: >>> data = s.query("biogrid", "ZAP70 AND species:9606") >>> data2 = s.convert(data, "biogrid") .. warning:: some databases may be offline. If so, try we another database. Type "s.activeDBs". **convert** method converts all entries from data into uniprot ID. If this is not possible, the entry is removed. The **query** and **convert** works on a single database but you we could query all or a subset of all databases using the queryAll and convertAll functions:: >>> data = s.queryAll("ZAP70 AND species:9606", databases=["mint", "biogrid"]) >>> data2 = s.convertAll(data) However, extra cleaning is required to remove entries that are not relevant (no match to uniprot ID, redundant, not a protein, self interactions, ...). In order to ease this tast, the psicquic.AppsPPI class is very useful. .. fails on sphinx .. .. plot:: :width: 80% :include-source: from bioservices import psicquic s = psicquic.AppsPPI() s.queryAll("ZAP70 AND species:9606", databases=["mint", "biogrid", "intact", "reactome-fis"]) s.summary() s.show_pie() The summary function print a useful summary about the number of found interactions and overlap between databases: .. doctest:: :options: +SKIP >>> s.summary() Found 8 interactions within intact database Found 124 interactions within reactome-fis database Found 19 interactions within mint database Found 67 interactions within biogrid database ------------- Found 152 interactions in 1 common databases Found 14 interactions in 2 common databases Found 0 interactions in 3 common databases Found 1 interactions in 4 common databases This may be different depending on the available databases. Finally, you can obtain the relation that was found in the 4 databases: .. doctest:: :options: +SKIP >>> s.relevant_interactions[4] [['LCK_HUMAN', 'ZAP70_HUMAN']] What's next ? ------------------- There are lots of other services that could be usefule. An example is the wikipathway (see :class:`~bioservices.wikipathway.Wikipathway`) to retrieve even more pathways that include the ZAP70 protein. Another example is the BioMart portal. You could use it to retrieve pathways from REACTOME (see :class:`~bioservices.biomart.BioMart`). You can also retrieve target from ChEMBL given the uniprot ID ( get_target_by_uniprotId("P43403") ) and so on. .. For instance all human interactions reported in MArch 2013 ---------------------------------------------------------------- =========== =============== =================================== Status name number of interactions =========== =============== =================================== ONLINE APID 123,427 ONLINE BAR 0 ONLINE BIND 38,419 ONLINE BindingDB 74,082 ONLINE BioGrid 182,911 ONLINE ChEMBL 399,482 ONLINE DIP 18,434 OFFLINE DrugBank OFFLINE GeneMANIA OFFLINE I2D ONLINE I2D-IMEx 915 ONLINE InnateDB 14,734 ONLINE InnateDB-IMEx 352 ONLINE IntAct 84,692 ONLINE Interoporc 17,284 ONLINE iRefIndex 396,368 ONLINE MatrixDB 604 ONLINE MBInfo 307 ONLINE MINT 36,741 ONLINE MolCon 242 ONLINE MPIDB 28 ONLINE Reactome 113,204 ONLINE Reactome-FIs 209,988 ONLINE Spike 36,248 ONLINE STRING 656,493 ONLINE TopFind 4,986 ONLINE UniProt 5,564 OFFLINE VirHostNet =========== =============== =================================== res = p.queryAll("species:9606", databases=["uniprot", "apid"]) data1 = res['uniprot'] data2 = p.preCleaning(data1) mapping = p.convertUniprot(data2) len(set(p.postCleaning(mapping))) ('Before removing anything: ', 5558) ('After removing the None: ', 5545) ('Before removing the !: ', 5107) ("Before removing entries that don't match HUMAN : ", 4242) Finally, a set can be use to extract unique entries a further cleanup: A-B is same as B-A >>> p = psicquic.AppsPPI() >>> p.queryAll("ZAP70 AND species:9606") >>> p.summary() >>> for i in range(1,p.N+1): ... print(i, len(p.relevant_interactions[i])) 1 265 2 62 3 31 4 12 5 11 6 7 7 4 8 2 9 1 >>> labels = range(1, p.N + 1 ) >>> counting = [len(p.relevant_interactions[i]) for i in labels] >>> pie(counting, labels = [str(x) for x in labels])