Source code for bioservices.psicquic

#
#  This file is part of bioservices software
#
#  Copyright (c) 2013-2014 - EMBL-EBI
#
#  File author(s):
#
#
#  Distributed under the GPLv3 License.
#  See accompanying file LICENSE.txt or copy at
#      http://www.gnu.org/licenses/gpl-3.0.html
#
#  website: https://github.com/cokelaer/bioservices
#  documentation: http://packages.python.org/bioservices
#
##############################################################################
"""Interface to the PSICQUIC web service

.. topic:: What is PSICQUIC ?

    :URL: http://code.google.com/p/psicquic/
    :REST: http://code.google.com/p/psicquic/wiki/PsicquicSpec_1_3_Rest

    .. highlights::

        "PSICQUIC is an effort from the HUPO Proteomics Standard Initiative
        (HUPO-PSI) to standardise the access to molecular interaction databases
        programmatically. The PSICQUIC View web interface shows that PSICQUIC
        provides access to 25 active service "

        -- Dec 2012


.. _about_queries:

About queries
================

.. rubric:: source: PSICQUIC View web page

The idea behind PSICQUIC is to retrieve information related to protein
interactions from various databases. Note that protein interactions does not
necesseraly mean protein-protein interactions. In order to be effective, the
query format has been standarised.

To do a search you can use the Molecular Interaction Query Language which is
based on Lucene's syntax. Here are some rules

* Use OR or space ' ' to search for ANY of the terms in a field
* Use AND if you want to search for those interactions where ALL of your terms are found
* Use quotes (") if you look for a specific phrase (group of terms that must
  be searched together) or terms containing special characters that may otherwise
  be interpreted by our query engine (eg. ':' in a GO term)
* Use parenthesis for complex queries (e.g. '(XXX OR YYY) AND ZZZ')
* Wildcards (`*`,?) can be used between letters in a term or at the end of terms to do fuzzy queries,
   but never at the beginning of a term.
* Optionally, you can prepend a symbol in front of your term.
    *  + (plus): include this term. Equivalent to AND. e.g. +P12345
    *  - (minus): do not include this term. Equivalent to NOT. e.g. -P12345
    *    Nothing in front of the term. Equivalent to OR. e.g. P12345
* Implicit fields are used when no field is specified (simple search). For
  instance, if you put 'P12345' in the simple query box, this will mean the same
  as identifier:P12345 OR pubid:P12345 OR pubauth:P12345 OR species:P12345 OR
  type:P12345 OR detmethod:P12345 OR interaction_id:P12345

About the MITAB output
=========================

The output returned by a query contains a list of entries. Each entry is
formatted following the MITAB output.

Here below are listed the name of the field returned ordered as they would
appear in one entry. The first item is always idA whatever version of MITAB is
used. The version 25 of MITAB contains the first 15 fields in the table below.
Newer version may incude more fields but always include the 15 from MITAB 25 in
the same order.  See the link from **irefindex**
`about mitab <http://irefindex.uio.no/wiki/README_MITAB2.6_for_iRefIndex_8.0#What_each_line_represents>`_
for more information.

=============== =========================================== =============== ======================
Field Name      Searches on                                 Implicit*       Example
=============== =========================================== =============== ======================
idA             Identifier A                                No              idA:P74565
idB             Identifier B                                No              idB:P74565
id              Identifiers (A or B)                        No              id:P74565
alias           Aliases (A or B)                            No              alias:(KHDRBS1 HCK)
identifiers     Identifiers and Aliases undistinctively     Yes             identifier:P74565
pubauth         Publication 1st author(s)                   Yes             pubauth:scott
pubid           Publication Identifier(s) OR                Yes             pubid:(10837477 12029088)
taxidA          Tax ID interactor A: the tax ID or
                the species name                            No              taxidA:mouse
taxidB          Tax ID interactor B: the tax ID or
                species name                                No              taxidB:9606
species         Species. Tax ID A or Tax ID B               Yes             species:human
type            Interaction type(s)                         Yes             type:"physical interaction"
detmethod       Interaction Detection method(s)             Yes             detmethod:"two hybrid*"
interaction_id  Interaction identifier(s)                   Yes             interaction_id:EBI-761050
pbioroleA       Biological role A                           Yes             pbioroleA:ancillary
pbioroleB       Biological role B                           Yes             pbioroleB:"MI:0684"
pbiorole        Biological roles (A or B)                   Yes             pbiorole:enzyme
ptypeA          Interactor type A                           Yes             ptypeA:protein
ptypeB          Interactor type B                           Yes             ptypeB:"gene"
ptype           Interactor types (A or B)                   Yes             pbiorole:"small molecule"
pxrefA          Interactor xref A (or Identifier A)         Yes             pxrefA:"GO:0003824"
pxrefB          Interactor xref B (or Identifier B)                         Yes pxrefB:"GO:0003824"
pxref           Interactor xrefs (A or B or Identifier
                A or Identifier B)                          Yes             pxref:"catalytic activity"
xref            Interaction xrefs (or Interaction
                identifiers)                                Yes             xref:"nuclear pore"
annot           Interaction annotations and tags            Yes             annot:"internally curated"
udate           Update date                                 Yes             udate:[20100101 TO 20120101]
negative        Negative interaction boolean                Yes             negative:true
complex         Complex expansion                           Yes             complex:"spoke expanded"
ftypeA          Feature type of participant A               Yes             ftypeA:"sufficient to bind"
ftypeB          Feature type of participant B               Yes             ftypeB:mutation
ftype           Feature type of participant A or B          Yes             ftype:"binding site"
pmethodA        Participant identification method A         Yes             pmethodA:"western blot"
pmethodB        Participant identification method B         Yes             pmethodB:"sequence tag identification"
pmethod         Participant identification methods
                 (A or B)                                   Yes             pmethod:immunostaining
stc             Stoichiometry (A or B). Only true or
                false, just to be able to filter
                interaction having stoichiometry available  Yes             stc:true
param           Interaction parameters. Only true or
                false, just to be able to filter
                interaction having parameters available     Yes             param:true
=============== =========================================== =============== ======================



"""

from bioservices import REST, UniProt


# http://code.google.com/p/psicquic/wiki/PsicquicSpec_1_3_Rest

# http://www.biocatalogue.org/services/2078#operations

__all__ = ["PSICQUIC"]


[docs]class PSICQUIC:
    """Interface to the `PSICQUIC <http://code.google.com/p/psicquic/>`_ service

    There are 2 interfaces to the PSICQUIC service (REST and WSDL) but we used
    the REST only.


    This service provides a common interface to more than 25 other services
    related to protein. So, we won't detail all the possiblity of this service.
    Here is an example that consists of looking for interactors of the
    protein ZAP70 within the IntAct database::

        >>> from bioservices import *
        >>> s = PSICQUIC()
        >>> res = s.query("intact", "zap70")
        >>> len(res) # there are 11 interactions found
        11
        >>> for x in res[1]:
        ...     print(x)
        uniprotkb:O95169
        uniprotkb:P43403
        intact:EBI-716238
        intact:EBI-1211276
        psi-mi:ndub8_human(display_long)|uniprotkb:NADH-ubiquinone oxidoreductase ASHI
        .
        .

    Here we have a list of entries. There are 15 of them (depending on
    the *output* parameter). The meaning of the entries is described on PSICQUIC
    website: https://code.google.com/p/psicquic/wiki/MITAB25Format . In short:


    #. Unique identifier for interactor A
    #. Unique identifier for interactor B.
    #. Alternative identifier for interactor A, for example the official gene
    #. Alternative identifier for interactor B.
    #. Aliases for A, separated by "|
    #. Aliases for B.
    #. Interaction detection methods, taken from the corresponding PSI-MI
    #. First author surname(s) of the publication(s)
    #. Identifier of the publication
    #. NCBI Taxonomy identifier for interactor A.
    #. NCBI Taxonomy identifier for interactor B.
    #. Interaction types,
    #. Source databases and identifiers,
    #. Interaction identifier(s) i
    #. Confidence score. Denoted as scoreType:value.

    Another example with reactome database::

        res = s.query("reactome", "Q9Y266")

    .. warning:: PSICQUIC gives access to 25 other services. We cannot create
        a dedicated parsing for all of them. So, the ::`query` method returns
        the raw data. Addition class may provide dedicated parsing in the
        future.

    .. seealso:: :class:`bioservices.biogrid.BioGRID`
    """

    _formats = [
        "tab25",
        "tab26",
        "tab27",
        "xml25",
        "count",
        "biopax",
        "xgmml",
        "rdf-xml",
        "rdf-xml-abbrev",
        "rdf-n3",
        "rdf-turtle",
    ]

    # note the typo in "genbank indentifier from bind DB
    _mapping_uniprot = {
        "genbank indentifier": "P_GI",
        "entrezgene/locuslink": "P_ENTREZGENEID",
        "uniprotkb": "ACC+ID",
        "rcsb pdb": "PDB_ID",
        "ensembl": "ENSEMBL_ID",
        "refseq": "P_REFSEQ_AC",
        "hgnc": "HGNC_ID",
        "kegg": "KEGG_ID",
        "entrez gene/locuslink": "P_ENTREZGENEID",
        "chembl": "CHEMBL_ID",
        "ddbj/embl/genbank": "EMBL_ID",
        "dip": "DIP_ID",
        "ensemblgenomes": "ENSEMBLGENOME_ID",
        "omim": "MIM_ID",
        "chebi": None,
        "chembl": None,
        #        "intact": None
    }

    # unknown: hprd, omim, bind, bind complexid, mdl,

    def __init__(self, verbose=True):
        """.. rubric:: Constructor

        :param bool verbose: print informative messages

        .. doctest::

            >>> from bioservices import PSICQUIC
            >>> s = PSICQUIC()

        """
        self.services = REST(
            "PSICQUIC",
            verbose=verbose,
            url="https://www.ebi.ac.uk/Tools/webservices/psicquic",
            url_defined_later=True,
        )  # this prevent annoying warning

        self._registry = None

        try:
            self.uniprot = UniProt(verbose=False)
        except:
            self.services.logging.warning("UniProt service could be be initialised")
        self.buffer = {}

    def _get_formats(self):
        return PSICQUIC._formats

    formats = property(_get_formats, doc="Returns the possible output formats")

    def _get_active_db(self):
        names = self.registry_names[:]
        actives = self.registry_actives[:]
        names = [x.lower() for x, y in zip(names, actives) if y == "true"]
        return names

    activeDBs = property(_get_active_db, doc="returns the active DBs only")

[docs]    def read_registry(self):
        """Reads and returns the active registry"""
        url = "registry/registry?action=ACTIVE&format=txt"
        res = self.services.http_get(url, frmt="txt")
        return res.split()

[docs]    def print_status(self):
        """Prints the services that are available

        :return: Nothing

        The output is tabulated. The columns are:

        * names
        * active
        * count
        * version
        * rest URL
        * soap URL
        * rest example
        * restricted

        .. seealso:: If you want the data into lists, see all attributes
            starting with registry such as :meth:`registry_names`
        """
        url = "registry/registry?action=STATUS&format=xml"
        res = self.services.http_get(url, frmt="txt")

        names = self.registry_names
        counts = self.registry_counts
        versions = self.registry_versions
        actives = self.registry_actives
        resturls = self.registry_resturls
        soapurls = self.registry_soapurls
        restexs = self.registry_restexamples
        restricted = self.registry_restricted
        N = len(names)

        indices = sorted(range(0, N), key=lambda k: names[k])

        for i in range(0, N):
            print(
                "%s\t %s\t %s\t %s\t %s %s %s %s\n"
                % (
                    names[i],
                    actives[i],
                    counts[i],
                    versions[i],
                    resturls[i],
                    soapurls[i],
                    restexs[i],
                    restricted[i],
                )
            )

    # todo a property for the version of PISCQUIC

    def _get_registry(self):
        if self._registry is None:
            url = "registry/registry?action=STATUS&format=xml"
            res = self.services.http_get(url, frmt="xml")
            res = self.services.easyXML(res)
            self._registry = res
        return self._registry

    registry = property(_get_registry, doc="returns the registry of psicquic")

    def _get_registry_names(self):
        res = self.registry
        return [x.findAll("name")[0].text for x in res.findAll("service")]

    registry_names = property(_get_registry_names, doc="returns all services available (names)")

    def _get_registry_restricted(self):
        res = self.registry
        return [x.findAll("restricted")[0].text for x in res.findAll("service")]

    registry_restricted = property(_get_registry_restricted, doc="returns restricted status of services")

    def _get_registry_resturl(self):
        res = self.registry
        data = [x.findAll("resturl")[0].text for x in res.findAll("service")]
        return data

    registry_resturls = property(_get_registry_resturl, doc="returns URL of REST services")

    def _get_registry_restex(self):
        res = self.registry
        data = [x.findAll("restexample")[0].text for x in res.findAll("service")]
        return data

    registry_restexamples = property(_get_registry_restex, doc="retuns REST example for each service")

    def _get_registry_soapurl(self):
        res = self.registry
        return [x.findAll("soapurl")[0].text for x in res.findAll("service")]

    registry_soapurls = property(_get_registry_soapurl, doc="returns URL of WSDL service")

    def _get_registry_active(self):
        res = self.registry
        return [x.findAll("active")[0].text for x in res.findAll("service")]

    registry_actives = property(_get_registry_active, doc="returns active state of each service")

    def _get_registry_count(self):
        res = self.registry
        return [x.findAll("count")[0].text for x in res.findAll("service")]

    registry_counts = property(_get_registry_count, doc="returns number of entries in each service")

    def _get_registry_version(self):
        res = self.registry
        names = [x.findAll("name")[0].text for x in res.findAll("service")]
        N = len(names)
        version = [0] * N
        for i in range(0, N):
            x = res.findAll("service")[i]
            if x.findAll("version"):
                version[i] = x.findAll("version")[0].text
            else:
                version[i] = None
        return version

    registry_versions = property(_get_registry_version, doc="returns version of each service")

[docs]    def query(
        self,
        service,
        query,
        output="tab25",
        version="current",
        firstResult=None,
        maxResults=None,
    ):
        """Send a query to a specific database

        :param str service: a registered service. See :attr:`registry_names`.
        :param str query: a valid query. Can be `*` or a protein name.
        :param str output: a valid format. See s._formats

        ::

            s.query("intact", "brca2", "tab27")
            s.query("intact", "zap70", "xml25")
            s.query("matrixdb", "*", "xml25")

        This is the programmatic approach to this website:

        http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml


        Another example consist in accessing the *string* database for fetching
        protein-protein interaction data of a particular model organism. Here we
        restrict the query to 100 results::

            s.query("string", "species:10090", firstResult=0, maxResults=100, output="tab25")

        # spaces are automatically converted

            s.query("biogrid", "ZAP70 AND species:9606")

        .. warning:: AND must be in big caps. Some database are ore permissive
            than other (e.g., intact accepts "and"). species must be a valid ID number. Again, some DB are more
            permissive and may accept the name (e.g., human)

        To obtain the number of interactions in intact for the human specy::

            >>> len(p.query("intact", "species:9606"))


        """
        if service not in self.activeDBs:
            raise ValueError("database %s not in active databases" % service)

        params = {}
        if output is not None:
            self.services.devtools.check_param_in_list(output, self.formats)
            params["format"] = output
        else:
            output = "none"

        names = [x.lower() for x in self.registry_names]
        try:
            index = names.index(service)
        except ValueError:
            self.logging.error("The service you gave (%s) is not registered. See self.registery_names" % service)
            raise ValueError

        # get the base url according to the service requested
        resturl = self.registry_resturls[index]

        if firstResult is not None:
            params["firstResult"] = firstResult
        if maxResults is not None:
            params["maxResults"] = maxResults

        url = resturl + "query/" + query

        if "xml" in output:
            res = self.services.http_get(url, frmt="xml", params=params)
        else:
            res = self.services.http_get(url, frmt="txt", params=params)
            res = res.strip().split("\n")

        if output.startswith("tab"):
            res = self._convert_tab2dict(res)

        return res

    def _convert_tab2dict(self, data):
        """

        https://code.google.com/p/psicquic/wiki/MITAB26Format
        """
        results = []
        for line in data:
            results.append(line.split("\t"))

        return results

[docs]    def queryAll(
        self,
        query,
        databases=None,
        output="tab25",
        version="current",
        firstResult=None,
        maxResults=None,
    ):
        """Same as query but runs on all active database

        :param list databases: database to query. Queries all active DB if not provided
        :return: dictionary where keys correspond to databases and values to the output of the query.

        ::

            res = s.queryAll("ZAP70 AND species:9606")
        """

        results = {}
        if databases is None:
            databases = [x.lower() for x in self.activeDBs]

        for x in databases:
            if x not in self.activeDBs:
                raise ValueError("database %s not in active databases" % x)

        for name in databases:
            self.logging.warning("Querying %s" % name),
            res = self.query(
                name,
                query,
                output=output,
                version=version,
                firstResult=firstResult,
                maxResults=maxResults,
            )
            if output.startswith("tab25"):
                results[name] = [x for x in res if x != [""]]
            else:
                import copy

                results[name] = copy.copy(res)
        for name in databases:
            self.logging.info("Found %s in %s" % (len(results[name]), name))
        return results

[docs]    def getInteractionCounter(self, query):
        """Returns a dictionary with database as key and results as values

        :param str query: a valid query
        :return: a dictionary which key as database and value as number of entries

        Consider only the active database.

        """
        # get the active names only
        activeDBs = self.activeDBs[:]
        res = [(str(name), int(self.query(name, query, output="count")[0])) for name in activeDBs]
        return dict(res)

[docs]    def getName(self, data):
        idsA = [x[0] for x in data]
        idsB = [x[1] for x in data]
        return idsA, idsB

[docs]    def knownName(self, data):
        """Scan all entries (MITAB) and returns simplified version


        Each item in the input list of mitab entry
        The output is made of 2 lists corresponding to
        interactor A and B found in the mitab entries.

        elements in the input list takes the following forms::

            DB1:ID1|DB2:ID2
            DB3:ID3

        The | sign separates equivalent IDs from different databases.

        We want to keep only one. The first known databae is kept. If in the list of DB:ID pairs no known
        database is found, then we keep the first one whatsover.

        known databases are those available in the uniprot mapping tools.

        chembl and chebi IDs are kept unchanged.


        """
        self.logging.info("converting data into known names")
        idsA = [x[0].replace('"', "") for x in data]
        idsB = [x[1].replace('"', "") for x in data]
        # extract the first and second ID but let us check if it is part of a
        # known uniprot mapping.Otherwise no conversion will be possible.
        # If so, we set the ID to "unknown"
        # remove the " character that can be found in a few cases (e.g,
        # chebi:"CHEBI:29036")
        # idsA = [x.replace("chebi:CHEBI:","chebi:") for x in idsA]
        # idsB = [x.replace("chebi:CHEBI:", "chebi:") for x in idsB]

        # special case:
        # in mint, there is an entry that ends with a | uniprotkb:P17844|
        idsA = [x.strip("|") for x in idsA]
        idsB = [x.strip("|") for x in idsB]

        # the first ID
        for i, entry in enumerate(idsA):
            try:
                dbs = [x.split(":")[0] for x in entry.split("|")]
                IDs = [x.split(":")[1] for x in entry.split("|")]
                valid_dbs = [(db, ID) for db, ID in zip(dbs, IDs) if db in self._mapping_uniprot.keys()]
                # search for an existing DB
                if len(valid_dbs) >= 1:
                    idsA[i] = valid_dbs[0][0] + ":" + valid_dbs[0][1]
                else:
                    self.logging.debug("none of the DB for this entry (%s) are available" % (entry))
                    idsA[i] = "?" + dbs[0] + ":" + IDs[0]
            except:
                self.logging.info("Could not extract name from %s" % entry)
                idsA[i] = "??:" + entry  # we add a : so that we are sure that a split(":") will work
        # the second ID
        for i, entry in enumerate(idsB):
            try:
                dbs = [x.split(":")[0] for x in entry.split("|")]
                IDs = [x.split(":")[1] for x in entry.split("|")]
                valid_dbs = [(db, ID) for db, ID in zip(dbs, IDs) if db in self._mapping_uniprot.keys()]
                # search for an existing DB
                if len(valid_dbs) >= 1:
                    idsB[i] = valid_dbs[0][0] + ":" + valid_dbs[0][1]
                else:
                    self.logging.debug("none of the DB (%s) for this entry are available" % (entry))
                    idsB[i] = "?" + dbs[0] + ":" + IDs[0]
            except:
                self.logging.info("Could not extract name from %s" % entry)
                idsB[i] = "??:" + entry

        countA = len([x for x in idsA if x.startswith("?")])
        countB = len([x for x in idsB if x.startswith("?")])
        if countA + countB > 0:
            self.logging.warning("%s ids out of %s were not identified" % (countA + countB, len(idsA) * 2))
            print(set([x.split(":")[0] for x in idsA if x.startswith("?")]))
            print(set([x.split(":")[0] for x in idsB if x.startswith("?")]))
        self.logging.info("knownName done")
        return idsA, idsB

[docs]    def preCleaning(self, data):
        """remove entries ehre IdA or IdB is set to "-" """
        ret = [x for x in data if x[0] != "-" and x[1] != "-"]
        return ret

[docs]    def postCleaningAll(self, data, keep_only="HUMAN", flatten=True, verbose=True):
        """

        even more cleaing by ignoring score, db and interaction
        len(set([(x[0],x[1]) for x in retnew]))
        """
        results = {}
        for k in data.keys():
            self.logging.info("Post cleaning %s" % k)
            ret = self.postCleaning(data[k], keep_only="HUMAN", verbose=verbose)
            if len(ret):
                results[k] = ret
        if flatten:
            results = [x for k in results.keys() for x in results[k]]
        return results

[docs]    def postCleaning(
        self,
        data,
        keep_only="HUMAN",
        remove_db=["chebi", "chembl"],
        keep_self_loop=False,
        verbose=True,
    ):
        """Remove entries with a None and keep only those with the keep pattern"""
        if verbose:
            print("Before removing anything: ", len(data))

        data = [x for x in data if x[0] is not None and x[1] is not None]
        if verbose:
            print("After removing the None: ", len(data))

        data = [x for x in data if x[0].startswith("!") is False and x[1].startswith("!") is False]
        if verbose:
            print("After removing the !: ", len(data))

        for db in remove_db:
            data = [x for x in data if x[0].startswith(db) is False]
            data = [x for x in data if x[1].startswith(db) is False]
            if verbose:
                print("After removing entries that match %s : " % db, len(data))

        data = [x for x in data if keep_only in x[0] and keep_only in x[1]]
        if verbose:
            print("After removing entries that don't match %s : " % keep_only, len(data))

        if keep_self_loop is False:
            data = [x for x in data if x[0] != x[1]]
            if verbose:
                print("After removing self loop : ", len(data))

        data = list(set(data))
        if verbose:
            print("After removing identical entries", len(data))

        return data

[docs]    def convertAll(self, data):
        results = {}
        for k in data.keys():
            self.logging.info("Analysing %s" % k)
            results[k] = self.convert(data[k], db=k)
        return results

[docs]    def convert(self, data, db=None):
        self.logging.debug("converting the database %s" % db)
        idsA, idsB = self.knownName(data)
        mapping = self.mappingOneDB(data)
        results = []
        for i, entry in enumerate(data):
            x = idsA[i].split(":", 1)[1]
            y = idsB[i].split(":", 1)[1]
            xp = mapping[x]
            yp = mapping[y]
            try:
                ref = entry[8]
            except:
                ref = "?"
            try:
                score = entry[14]
            except:
                score = "?"
            try:
                interaction = entry[11]
            except:
                interaction = "?"
            results.append((xp, yp, score, interaction, ref, db))
        return results

[docs]    def mappingOneDB(self, data):
        query = {}
        self.logging.debug("converting IDs with proper DB name (knownName function)")
        entriesA, entriesB = self.knownName(data)  # idsA and B contains list of a single identifier of the form db:id
        # the db is known from _mapping.uniprot otherwise it is called "unknown"

        # get unique DBs to build the query dictionary
        dbsA = [x.split(":")[0] for x in entriesA]
        dbsB = [x.split(":")[0] for x in entriesB]
        for x in set(dbsA):
            query[x] = set()

        for x in set(dbsB):
            query[x] = set()

        #
        query_copy = query.copy()
        for k in query_copy.keys():
            if k.startswith("?"):
                del query[k]

        # the data to store
        mapping = {}
        N = len(data)

        # scan all entries
        counter = 0
        for entryA, entryB in zip(entriesA, entriesB):
            counter += 1
            dbA, idA = entryA.split(":")
            try:
                dbB, idB = entryB.split(":")
            except:
                print(entryB)
            if idA not in mapping.keys():
                if dbA.startswith("?"):
                    mapping[idA] = entryA
                else:
                    query[dbA].add(idA)
            if idB not in mapping.keys():
                if dbB.startswith("?"):
                    mapping[idB] = entryB
                else:
                    query[dbB].add(idB)

            for k in query.keys():
                if len(query[k]) > 2000 or counter == N:
                    this_query = list(query[k])
                    DBname = self._mapping_uniprot[k]

                    if DBname is not None:
                        self.logging.warning("Request sent to uniprot for %s database (%s/%s)" % (DBname, counter, N))
                        res = self.uniprot.mapping(fr=DBname, to="ID", query=" ".join(this_query))
                        for x in this_query:
                            if x not in res:  # was not found
                                mapping[x] = "!" + k + ":" + x
                            else:
                                # we should be here since the queries are populated
                                # if not already in the mapping dictionary
                                if x not in res.keys():
                                    raise ValueError(x)
                                if len(res[x]) == 1:
                                    mapping[x] = res[x][0]
                                else:
                                    self.logging.warning("psicquic mapping found more than 1 id. keep first one")
                                    mapping[x] = res[x][0]
                    else:
                        for x in this_query:
                            mapping[x] = k + ":" + x
                    query[k] = set()

        for k in query.keys():
            assert len(query[k]) == 0
        return mapping


class AppsPPI(object):
    """This is an application based on PPI that search for relevant interactions

    Interctions between proteins may have a score provided by each database.
    However, scores are sometimes ommited. Besides, they may have different
    meaning for different databases. Another way to score an interaction is to
    count in how many database it is found.

    This class works as follows. First, you query a protein:


        p = AppsPPI()
        p.query("ZAP70 AND species:9606")

    This, is going to call the PSICQUIC queryAll method to send this query to
    all active databases. Then, it calls the convertAll functions to convert all
    interactors names into uniprot name if possible. If not, interactions are
    not taken into account. Finally, it removes duplicated and performs some
    cleaning inside the postCleaningall method.

    Then, you can call the summary method that counts the interactions. The
    count is stored in the attrbiute relevant_interactions.

        p.summary()

    Let us see how many intercations where found with. THe number of databases
    that contains at least one interactions is

        >>> p.N
        >>> p.relevant_interactions[N]
        [['ZAP70_HUMAN', 'DBNL_HUMAN']]

    So, there was 1 interaction found in all databases.

    """

    def __init__(self, verbose=False):
        """.. rubric:: constructor"""
        self.psicquic = PSICQUIC(verbose=False)
        self.verbose = verbose

    def queryAll(self, query, databases=None):
        """

        :param str query: a valid query. See :class:`~bioservices.psicquic.PSICQUIC`
            and :mod:`bioservices.psicquic` documentation.
        :param str databases: by default, queries are sent to each active database.
            you can overwrite this behavious by providing your own list of
            databses
        :return: nothing but the interactions attributes is populated with a
            dictionary where keys correspond to each database that returned a non empty list
            of interactions. The item for each key is a list of interactions containing the
            interactors A and B, the score, the type of intercations and the score.


        """
        # self.results_query = self.psicquic.queryAll("ZAP70 AND species:9606")
        print("Requests sent to psicquic. Can take a while, please be patient...")
        self.results_query = self.psicquic.queryAll(query, databases)
        self.interactions = self.psicquic.convertAll(self.results_query)
        self.interactions = self.psicquic.postCleaningAll(self.interactions, flatten=False, verbose=self.verbose)
        self.N = len(self.interactions.keys())
        self.counter = {}
        self.relevant_interactions = {}

    def summary(self):
        """Build some summary related to the found interactions from queryAll

        :return: nothing but the relevant_interactions and counter attribute

            p = AppsPPI()
            p.queryAll("ZAP70 AND species:9606")
            p.summary()

        """
        for k, v in self.interactions.items():
            print("Found %s interactions within %s database" % (len(v), k))

        counter = {}
        for k in self.interactions.keys():
            # scan each dabase
            for v in self.interactions[k]:
                interaction = v[0] + "++" + v[1]
                db = v[5]
                if interaction in counter.keys():
                    counter[interaction].append(db)
                else:
                    counter[interaction] = [db]
        for k in counter.keys():
            counter[k] = list(set(counter[k]))

        N = len(self.interactions.keys())

        print("-------------")
        summ = {}
        for i in range(1, N + 1):
            res = [(x.split("++"), counter[x]) for x in counter.keys() if len(counter[x]) == i]
            print("Found %s interactions in %s common databases" % (len(res), i))
            res = [x.split("++") for x in counter.keys() if len(counter[x]) == i]
            if len(res):
                summ[i] = [x for x in res]
            else:
                summ[i] = []

        self.counter = counter.copy()
        self.relevant_interactions = summ.copy()

    def get_reference(self, idA, idB):
        key = idA + "++" + idB
        uniq = len(self.counter[key])
        ret = [x for k in self.interactions.keys() for x in self.interactions[k] if x[0] == idA and x[1] == idB]
        N = len(ret)
        print("Interactions %s -- %s has %s entries in %s databases (%s):" % (idA, idB, N, uniq, self.counter[key]))
        for r in ret:
            print(r[5], " reference", r[4])

    def show_pie(self):
        """a simple example to demonstrate how to visualise number of
        interactions found in various databases

        """
        try:
            from pylab import pie, clf, title, show, legend
        except ImportError:
            from bioservices import BioServicesError

            raise BioServicesError("You must install pylab/matplotlib to use this functionality")
        labels = range(1, self.N + 1)
        print(labels)
        counting = [len(self.relevant_interactions[i]) for i in labels]

        clf()
        # pie(counting, labels=[str(int(x)) for x in labels], shadow=True)
        pie(counting, labels=[str(x) for x in counting], shadow=True)
        title("Number of interactions found in N databases")
        legend([str(x) + " database(s)" for x in labels])
        show()