Source code for bioservices.pdb

#
#  This file is part of bioservices software
#
#  Copyright (c) 2013-2020 - EBI-EMBL - Institut Pasteur
#
#  File author(s):
#      Thomas Cokelaer <thomas.cokelaer@pasteur.fr>
#
#  Distributed under the GPLv3 License.
#  See accompanying file LICENSE.txt or copy at
#      http://www.gnu.org/licenses/gpl-3.0.html
#
#  website: https://github.com/cokelaer/bioservices
#  documentation: http://packages.python.org/bioservices
#
##############################################################################
# $Id$
"""Interface to the PDB web Service (New API Jan 2021).

.. topic:: What is PDB ?

    :URL: http://www.rcsb.org/pdb/
    :REST: http://search.rcsb.org/#search-api

    .. highlights::

        An Information Portal to Biological Macromolecular Structures

        -- PDB home page, Jan 2021


"""
from bioservices.services import REST


__all__ = ["PDB"]


[docs]class PDB: """Interface to `PDB <http://search.rcsb.org/>`_ service (new API Jan 2021) With the new API, one method called :meth:`~bioservices.pdb.PDB.search` is provided by PDB. To perform a search you need to define a query. Here is an example .. doctest:: >>> from bioservices import PDB >>> s = PDB() >>> query = {"query": ... {"type": "terminal", ... "service": "text", ... "parameters": { ... "value": "thymidine kinase" ... } ... }, ... "return_type": "entry"} >>> res = s.search(query, return_type=return_type) .. note:: as of December 2020, a new API has be set up by PDB. some prevous functionalities such as return list of Ligand are not supported anymore (Jan 2021). However, many more powerful searches as available. I encourage everyone to look at the PDB page for complex examples: http://search.rcsb.org/#examples As mentionnaed above, the PDB service provide one method called search available in :meth:`~bioservices.pdb.PDB.search`. We will not cover all the power and capability of this search function. User should refer to the official PDB help for that. Yet, given examples from PDB should all work with this method. When possible, we will add convenient aliases function in this class. For now we have for example the :meth:`~bioservices.pdb.PDB.get_current_ids` and :meth:`~bioservices.pdb.PDB.get_similarity_sequence` that users may find useful. The main idea behind the PDB API is to create queries that can access to different type of services. A query will need to at least two keys: - **query** - **return_type** Consider this basic example that searches for the text *thymidine kinase*:: { "query": { "type": "terminal", "service": "text", "parameters": { "value": "thymidine kinase" } }, "return_type": "entry" } Here the query is defined by a **query** and a **return_type** indeed. The return type is a simple value such as **entry**. The query itself is composed of 3 pairs of key/value. Here we have the type service and parameters as defined below. The query can have several fields: - **type**: the clause type can be either **terminal** or **group** - **terminal**: performs an atomic search operation, e.g. searches for a particular value in a particular field. - **group**: wraps other terminal or group nodes and is used to combine multiple queries in a logical fashion. - **service**: - **text**: linguistic searches against textual annotations. - **sequence**: uses MMSeq2 to perform sequence matching searches (blast-like). following targets that are available: - pdb_protein_sequence, - pdb_dna_sequence, - pdb_na_sequence - **seqmotif**: performs short motif searches against nucleotide or protein sequences using 3 different inputs: - simple (e.g., CXCXXL) - prosite (e.g., C-X-C-X(2)-[LIVMYFWC]) - regex (e.g., CXCX{2}[LIVMYFWC]) - **structure**: searches matching a global 3D shape of assemblies or chains of a given entry (identified by PDB ID), in either strict (strict_shape_match) or relaxed (relaxed_shape_match) modes - strucmotif: Performs structural motif searches on all available PDB structures. - chemical: queries of small-molecule constituents of PDB structures, based on chemical formula and chemical structure. Queries for matching and similar chemical structures can be performed using SMILES and InChI descriptors as search targets. - graph-strict: atom type, formal charge, bond order, atom and bond chirality, aromatic assignment are used as matching criteria for this search type. - graph-relaxed: atom type, formal charge and bond order are used as matching criteria for this search type. - graph-relaxed-stereo: atom type, formal charge, bond order, atom and bond chirality are used as matching criteria for this search type. - fingerprint-similarity: Tanimoto similarity is used as the matching criteria Concerning the **return_type** key, it can be one of : - entry: a list of PDB IDs. - assembly: list of PDB IDs appended with assembly IDs in the format of a [pdb_id]-[assembly_id], corresponding to biological assemblies. - polymer_entity: list of PDB IDs appended with entity IDs in the format of a [pdb_id]_[entity_id], corresponding to polymeric molecular entities. - non_polymer_entity: list of PDB IDs appended with entity IDs in the format of a [pdb_id]_[entity_id], corresponding to non-polymeric entities (or ligands). - polymer_instance: list of PDB IDs appended with asym IDs in the format of a [pdb_id].[asym_id], corresponding to instances of certain polymeric molecular entities, also known as chains. **Optional arguments** There are many optional arguments. Let us see a couple of them. Pagination can be set (default is 10 entries) using the **request_options** (optional) key. Consider this query example:: { "query": { "type": "terminal", "service": "text", "parameters": { "attribute": "rcsb_polymer_entity.formula_weight", "operator": "greater", "value": 500 } }, "request_options": { "pager": { "start": 0, "rows": 100 } }, "return_type": "polymer_entity" } Here, the query searches for the polymer_entity that have a formula weight above 500. Withe request_options pager set to 100, we will get the first 100 hits. To return all hits, set this field in the request_options:: "return_all_hits": true Coming back at the first basic example, we can reuse it to illustrate how to refine the search using attribute and operators:: { "query": { "type": "terminal", "service": "text", "parameters": { "value": "thymidine kinase", "attribute": "exptl.method", "operator": "exact_match", } }, "return_type": "entry" } All valid combo of operators and attributes can be found here: http://search.rcsb.org/search-attributes.html For instance, in the example above only in, exact_match and exists can be used with exptl.method attribute. This is not checked in bioservices. Sorting is determined by the sort object in the request_options context. It allows you to add one or more sorting conditions to control the order of the search result hits. The sort operation is defined on a per field level, with special field name for score to sort by score (the default)< By default sorting is done in descending order ("desc"). The sort can be reversed by setting direction property to "asc". This example demonstrates how to sort the search results by release date:: { "query": { "type": "terminal", "service": "text", "parameters": { "attribute": "struct.title", "operator": "contains_phrase", "value": "\"hiv protease\"" } }, "request_options": { "sort": [ { "sort_by": "rcsb_accession_info.initial_release_date", "direction": "desc" } ] }, "return_type": "entry" } Again, many more complex examples can be found on PDB page. """ _url = "http://search.rcsb.org/rcsbsearch/v1/" def __init__(self, verbose=False, cache=False): """.. rubric:: Constructor :param bool verbose: prints informative messages (default is off) """ self.services = REST(name="PDB", verbose=verbose, cache=cache, url_defined_later=True) self.services.url = PDB._url
[docs] def search(self, query, request_options=None, request_info=None, return_type=None): """search request represented as a JSON object. This is the only function in PDB API. You should be able to perform any valid PDB searches here (see the :class:`bioservices.pdb.PDB` documentation for details. Note, however, that we have aliases methods in BioServices that will be added on demand for common searches. :param str query: the search expression. Can be omitted if, instead of IDs retrieval, facets or count operation should be performed. In this case the request must be configured via the request_options context. :param str request_options: (optional) controls various aspects of the search request including pagination, sorting, scoring and faceting. :param str request_info: additional information about the query, e.g. query_id. (optional) :param str return_type: type of results to return. :return: json results You must define a query as defined in the PDB web page. For example the following query search for macromolecular PDB entities that share 90% sequence identity with GTPase HRas protein from Gallus gallus (Chicken):: query = { "query": { "type": "terminal", "service": "sequence", "parameters": { "evalue_cutoff": 1, "identity_cutoff": 0.9, "target": "pdb_protein_sequence", "value": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS" } }, "request_options": { "scoring_strategy": "sequence" }, "return_type": "polymer_entity" } What is important is that the dictionary called **query** contains 2 compulsary keys namely **query** and **return_type**. The two other optional keys are **request_options** and **return_info** You would then call the PDB search as follows:: from bioservices import PDB p = PDB() results = p.search(query) Now, in BioServices, you can also decompose the query as follows:: query = { "type": "terminal", "service": "sequence", "parameters": { "evalue_cutoff": 1, "identity_cutoff": 0.9, "target": "pdb_protein_sequence", "value": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS" }} request_options = { "scoring_strategy": "sequence"} return_type= "polymer_entity" and then use PDB search again:: from bioservices import PDB p = PDB() results = p.search(query, request_options=request_options, return_type=return_type) or even simpler for the Pythonic lovers:: results = p.search(**query) """ if "query" in query: pass else: query = {"query": query} if request_options: query["request_options"] = request_options if request_info: query["request_info"] = request_info if return_type: query["return_type"] = return_type if "return_type" not in query: # pragma: no cover raise ValueError("Yourr query must have a return_type key") print(query) res = self.services.http_post("query", frmt="json", json=query) return res
[docs] def get_current_ids(self): """Get a list of all current PDB IDs.""" # first query returns 10 entries by default request_options = {"return_all_hits": True} # second requests all entries res = self.search( query={"type": "terminal", "service": "text"}, request_options=request_options, return_type="entry", ) identifiers = [x["identifier"] for x in res["result_set"]] return identifiers
[docs] def get_similarity_sequence(self, seq): """Search of seauence similarity search with protein sequence seq = "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTAVAHVDDMPNAL" results = p.get_similarity_sequence(seq) """ res = self.search( { "query": { "type": "terminal", "service": "sequence", "parameters": {"target": "pdb_protein_sequence", "value": seq}, }, "return_type": "polymer_entity", } ) return res