8. Services

8.1. ArrayExpress

Interface to the ArrayExpress web service.

class ArrayExpress(verbose=False, cache=False)[source]

Interface to the ArrayExpress service.

ArrayExpress data is now hosted via the BioStudies platform at EBI. This class provides access to the ArrayExpress collection using the BioStudies REST API.

Quick start:

>>> from bioservices import ArrayExpress
>>> s = ArrayExpress()
>>> results = s.search("breast cancer")
>>> results["totalHits"]  # total experiments found
>>> study = s.get_study("E-MEXP-31")
>>> files = s.get_files("E-MEXP-31")

You can also search by keyword and retrieve accessions:

>>> accessions = s.queryAE("pneumonia homo sapiens")

Note

ArrayExpress was migrated from http://www.ebi.ac.uk/arrayexpress to the BioStudies platform in 2021. The new API base URL is https://www.ebi.ac.uk/biostudies.

See also

search() for the primary search method.

Constructor

Parameters:
  • verbose (bool) – prints informative messages

  • cache (bool) – use HTTP cache

getAE(accession)[source]

Download all files from an experiment and save them locally.

Parameters:

accession (str) – Experiment accession (e.g., "E-MEXP-31").

Files are written to the current working directory. Binary files (e.g., .zip) are written in binary mode; text files in text mode.

Example:

>>> from bioservices import ArrayExpress
>>> s = ArrayExpress()
>>> s.getAE("E-MEXP-31")  
get_files(accession)[source]

Retrieve the list of file paths for a specific study.

Parameters:

accession (str) – Study accession number (e.g., "E-MEXP-31").

Returns:

list of file path strings.

Example:

>>> from bioservices import ArrayExpress
>>> s = ArrayExpress()
>>> files = s.get_files("E-MEXP-31")
>>> "E-MEXP-31.idf.txt" in files  
True
get_study(accession)[source]

Retrieve full metadata for a specific ArrayExpress study.

Parameters:

accession (str) – Study accession number (e.g., "E-MEXP-31").

Returns:

dict containing study metadata, sections, and file listings.

The returned dict has keys accno, attributes, section, and type. Files are nested within section["subsections"].

Example:

>>> from bioservices import ArrayExpress
>>> s = ArrayExpress()
>>> study = s.get_study("E-MEXP-31")
>>> study["accno"]
'E-MEXP-31'
>>> # Get the study title
>>> [a["value"] for a in study["attributes"] if a["name"] == "Title"][0]  
'Transcription profiling of mammalian male germ cells...'
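The attribute lookup shown in the example can be wrapped in a small helper. This is a minimal sketch that relies only on the structure described above (an attributes list of dicts with name and value keys). The name get_attribute is hypothetical and not part of bioservices, and the sample dict, including the ReleaseDate entry, is made up for illustration.

```python
def get_attribute(study, name, default=None):
    """Return the first value of the attribute called `name`, or `default`."""
    for attribute in study.get("attributes", []):
        if attribute.get("name") == name:
            return attribute.get("value")
    return default


# Hand-written stand-in for the dict returned by get_study(); values are invented.
sample_study = {
    "accno": "E-MEXP-31",
    "attributes": [
        {"name": "Title", "value": "Example title"},
        {"name": "ReleaseDate", "value": "2004-01-01"},
    ],
}

title = get_attribute(sample_study, "Title")
missing = get_attribute(sample_study, "Organism", default="unknown")
```

Returning a default avoids the IndexError that the list-comprehension form raises when the attribute is absent.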
queryAE(query, **kargs)[source]

Search ArrayExpress and return a list of experiment accessions.

Parameters:
  • query (str) – Search query (keywords, species, etc.).

  • kargs – Additional arguments passed to search() (page, page_size, sort_by, sort_order).

Returns:

list of accession strings.

Example:

>>> from bioservices import ArrayExpress
>>> s = ArrayExpress()
>>> accessions = s.queryAE("pneumonia homo sapiens")  
>>> accessions[:3]  
['E-GEOD-12345', 'E-MEXP-67890', 'E-MTAB-11111']
queryExperiments(**kargs)[source]

Search ArrayExpress experiments.

This method accepts the same keyword arguments as the original ArrayExpress v2 API for backward compatibility and maps them to the current BioStudies API.

Parameters:
  • accession (str) – Experiment accession (e.g., "E-MEXP-31").

  • keywords (str) – Search keywords (e.g., "cancer breast"). Separate multiple terms with + or spaces.

  • species (str) – Species filter (e.g., "homo sapiens").

  • expdesign (str) – Experiment design type (e.g., "dose response").

  • exptype (str) – Experiment type (e.g., "RNA-seq").

  • array (str) – Array design accession (e.g., "A-AFFY-33").

  • pmid (str) – PubMed identifier (e.g., "16553887").

  • sa (str) – Sample attribute value (e.g., "fibroblast").

  • ef (str) – Experimental factor name (e.g., "CellType").

  • efv (str) – Experimental factor value (e.g., "HeLa").

  • sortby (str) – Sort field. One of: accession, name, assays, species, releasedate, fgem, raw, atlas.

  • sortorder (str) – Sort direction: ascending or descending.

  • pagesize (int) – Number of results per page (default: 20).

Returns:

dict with keys page, pageSize, totalHits, hits.

Example:

>>> from bioservices import ArrayExpress
>>> s = ArrayExpress()
>>> res = s.queryExperiments(keywords="breast cancer")  
>>> res["totalHits"]  
1152
>>> res = s.queryExperiments(array="A-AFFY-33", species="Homo sapiens")  
>>> res = s.queryExperiments(keywords="pneumonia", sortby="releasedate",
...                          sortorder="ascending")  

See also

search() for the primary search interface.

queryFiles(**kargs)[source]

Search ArrayExpress experiments and return results including file counts.

Accepts the same keyword arguments as queryExperiments(). Each hit in the result includes a files field with the file count. Use get_files() to retrieve the actual file paths for a specific experiment.

Returns:

dict with keys page, pageSize, totalHits, hits.

Example:

>>> from bioservices import ArrayExpress
>>> s = ArrayExpress()
>>> res = s.queryFiles(keywords="breast cancer")  
>>> res["hits"][0]["files"]  # number of files in first hit  
78

See also

get_files() to retrieve file paths for a study.

retrieveExperiment(experiment)[source]

Retrieve metadata for a specific experiment by accession.

This is an alias for get_study().

Parameters:

experiment (str) – Experiment accession (e.g., "E-MEXP-31").

Returns:

dict with full study metadata.

Example:

>>> from bioservices import ArrayExpress
>>> s = ArrayExpress()
>>> study = s.retrieveExperiment("E-MEXP-31")  
>>> study["accno"]  
'E-MEXP-31'
retrieveFile(experiment, filename, save=False)[source]

Download a specific file from an experiment.

This is an alias for retrieve_file().

Parameters:
  • experiment (str) – Experiment accession (e.g., "E-MEXP-31").

  • filename (str) – Name of the file to download.

  • save (bool) – If True, save the file to disk.

Returns:

file content as str or bytes, or None if save is True.

Example:

>>> from bioservices import ArrayExpress
>>> s = ArrayExpress()
>>> content = s.retrieveFile("E-MEXP-31", "E-MEXP-31.idf.txt")  
retrieveFilesFromExperiment(experiment)[source]

Return the list of file paths for a given experiment.

This is an alias for get_files().

Parameters:

experiment (str) – Experiment accession (e.g., "E-MEXP-31").

Returns:

list of file path strings.

Example:

>>> from bioservices import ArrayExpress
>>> s = ArrayExpress()
>>> files = s.retrieveFilesFromExperiment("E-MEXP-31")  
>>> "E-MEXP-31.idf.txt" in files  
True
retrieve_file(accession, filename, save=False)[source]

Download a specific file from an ArrayExpress study.

Files are served via the BioStudies file store (redirecting to the EBI FTP). For large files such as .zip archives the content is returned as bytes; plain-text files are returned as strings.

Parameters:
  • accession (str) – Study accession number (e.g., "E-MEXP-31").

  • filename (str) – Name of the file to download (e.g., "E-MEXP-31.idf.txt").

  • save (bool) – If True, write the file to disk in the current working directory (default: False).

Returns:

file content (str or bytes), or None when save is True.

Example:

>>> from bioservices import ArrayExpress
>>> s = ArrayExpress()
>>> content = s.retrieve_file("E-MEXP-31", "E-MEXP-31.idf.txt")  
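Because the returned content may be either str or bytes, code that writes it to disk has to pick the file mode accordingly. The sketch below shows one way to do that; save_content is a hypothetical helper, not a bioservices function, and the file names and contents are invented.

```python
import os
import tempfile


def save_content(content, path):
    """Write `content` to `path`, choosing binary or text mode from its type."""
    if isinstance(content, bytes):
        with open(path, "wb") as handle:
            handle.write(content)
    else:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(content)
    return path


# Exercise both branches in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    text_path = save_content("Investigation Title\tDemo\n",
                             os.path.join(workdir, "demo.idf.txt"))
    zip_path = save_content(b"PK\x03\x04", os.path.join(workdir, "demo.zip"))
    text_size = os.path.getsize(text_path)
    zip_size = os.path.getsize(zip_path)
```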
search(query, page=1, page_size=20, sort_by='relevance', sort_order='descending')[source]

Search ArrayExpress experiments.

Parameters:
  • query (str) – Free-text search query. Supports keywords, accession numbers, species names, and boolean operators (AND, OR, NOT).

  • page (int) – Page number for paginated results (default: 1).

  • page_size (int) – Number of results per page (default: 20).

  • sort_by (str) – Field to sort by. One of: relevance, release_date, views (default: relevance).

  • sort_order (str) – Sort direction. One of: ascending, descending (default: descending).

Returns:

dict with keys page, pageSize, totalHits, hits.

Each entry in hits contains: accession, title, author, release_date, files (count), links (count).

Example:

>>> from bioservices import ArrayExpress
>>> s = ArrayExpress()
>>> res = s.search("breast cancer")
>>> res["totalHits"]  
1152
>>> res["hits"][0]["accession"]  
'E-GEOD-17155'
>>> res2 = s.search("Homo sapiens", sort_by="release_date", sort_order="ascending")
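Since results are paginated, collecting every hit means requesting successive pages until totalHits entries have been seen. The following sketch assumes only the documented page and page_size parameters and the totalHits and hits keys. iter_hits is a hypothetical helper, and fake_search is an offline stand-in with invented accessions so the example runs without network access.

```python
def iter_hits(search, query, page_size=20, **kwargs):
    """Yield every hit for `query`, requesting one page at a time."""
    page = 1
    seen = 0
    while True:
        result = search(query, page=page, page_size=page_size, **kwargs)
        hits = result.get("hits", [])
        if not hits:
            break  # empty page: nothing more to fetch
        for hit in hits:
            yield hit
        seen += len(hits)
        if seen >= result.get("totalHits", 0):
            break  # all reported hits have been yielded
        page += 1


def fake_search(query, page=1, page_size=20, **kwargs):
    """Offline stand-in that mimics the documented return structure."""
    accessions = ["E-TEST-%d" % i for i in range(1, 6)]
    start = (page - 1) * page_size
    chunk = accessions[start:start + page_size]
    return {
        "page": page,
        "pageSize": page_size,
        "totalHits": len(accessions),
        "hits": [{"accession": acc} for acc in chunk],
    }


all_accessions = [hit["accession"]
                  for hit in iter_hits(fake_search, "demo", page_size=2)]
```

With a real instance, one would presumably pass the bound method instead, for example iter_hits(s.search, "breast cancer").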

8.2. Biocontainers

Interface to BioContainers.

class Biocontainers(verbose=True, cache=False)[source]

Interface to the BioContainers service.

BioContainers exposes a GA4GH Tool Registry Service (TRS) v2 API for discovering bioinformatics containers (Docker, Singularity, Conda).

Example:

>>> from bioservices import Biocontainers
>>> b = Biocontainers()
>>> b.get_tools(limit=5)
>>> b.get_tool("samtools")
>>> b.get_tool_classes()

Constructor

Parameters:
  • verbose (bool) – set to False to suppress informative messages

  • cache (bool) – use HTTP cache

get_tool(tool_id)[source]

Return metadata for a single tool.

Parameters:

tool_id (str) – the BioContainers tool identifier (e.g., "samtools").

Returns:

dict with keys id, name, description, organization, toolclass, versions, pulls, etc.

Example:

>>> from bioservices import Biocontainers
>>> b = Biocontainers()
>>> tool = b.get_tool("samtools")
>>> tool["name"]
'samtools'
>>> tool["pulls"]  
381303353
get_tool_classes()[source]

Return all tool classes defined in BioContainers.

Returns:

list of dicts, each with keys id, name, description. Current classes are CommandLineTool, Workflow, CommandLineMultiTool, and Service.

Example:

>>> from bioservices import Biocontainers
>>> b = Biocontainers()
>>> classes = b.get_tool_classes()
>>> [c["name"] for c in classes]
['CommandLineTool', 'Workflow', 'CommandLineMultiTool', 'Service']
get_tool_version(tool_id, version_id)[source]

Return metadata for a specific version of a tool.

Parameters:
  • tool_id (str) – the tool identifier (e.g., "samtools").

  • version_id (str) – the version identifier, typically in the form "<tool>-<version>" (e.g., "samtools-1.17").

Returns:

dict with keys id, name, meta_version, images (list of container image records).

Each image entry includes image_name, image_type (Docker, Singularity, or Conda), registry_host, size, and updated.

Example:

>>> from bioservices import Biocontainers
>>> b = Biocontainers()
>>> v = b.get_tool_version("samtools", "samtools-1.17")
>>> v["meta_version"]
'1.17'
>>> [img["image_type"] for img in v["images"]]  
['Conda', 'Docker', 'Singularity', ...]
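To pick out the images of one container type, the images list can be grouped by image_type. This is a minimal sketch based on the keys listed above; images_by_type is a hypothetical helper, and the image names in the sample dict are invented.

```python
from collections import defaultdict


def images_by_type(version):
    """Group image names by their image_type."""
    grouped = defaultdict(list)
    for image in version.get("images", []):
        grouped[image.get("image_type", "unknown")].append(image.get("image_name"))
    return dict(grouped)


# Hand-written stand-in for the dict returned by get_tool_version().
sample_version = {
    "id": "samtools-1.17",
    "meta_version": "1.17",
    "images": [
        {"image_type": "Conda", "image_name": "demo-conda-image"},
        {"image_type": "Docker", "image_name": "demo-docker-image-a"},
        {"image_type": "Docker", "image_name": "demo-docker-image-b"},
    ],
}

grouped = images_by_type(sample_version)
```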
get_tool_versions(tool_id)[source]

Return all versions of a given tool.

Parameters:

tool_id (str) – the tool identifier (e.g., "samtools").

Returns:

pandas.DataFrame with one version per row, or the raw list if the response cannot be converted.

Each row contains image information (Docker, Singularity, Conda) and metadata such as id, name, meta_version.

Example:

>>> from bioservices import Biocontainers
>>> b = Biocontainers()
>>> df = b.get_tool_versions("samtools")
>>> df["id"].tolist()[:3]  
['samtools-0.1.19', 'samtools-0.1.20', 'samtools-0.1.21']
get_tools(limit=1000, search=None, toolname=None, sort_field='id', sort_order='asc')[source]

Return a list of available tools.

Parameters:
  • limit (int) – maximum number of tools to return (default: 1000).

  • search (str) – free-text search filter applied across tool names, descriptions and tags (e.g., "alignment").

  • toolname (str) – filter by exact tool name (e.g., "samtools").

  • sort_field (str) – field to sort results by (default: "id").

  • sort_order (str) – sort direction — "asc" or "desc" (default: "asc").

Returns:

pandas.DataFrame with one tool per row, or the raw list if the response cannot be converted.

Example:

>>> from bioservices import Biocontainers
>>> b = Biocontainers()
>>> df = b.get_tools(limit=10)
>>> df.columns.tolist()  
['id', 'name', 'organization', 'toolclass', 'versions', ...]
>>> b.get_tools(limit=5, search="alignment")  
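Because the return value is either a pandas.DataFrame or a raw list, calling code may want one uniform shape. The sketch below converts both cases to a list of dicts using duck typing, so pandas does not need to be imported. to_records is a hypothetical helper and the sample list is invented.

```python
def to_records(result):
    """Return a list of dicts whether `result` is a DataFrame or a raw list."""
    if hasattr(result, "to_dict"):
        # pandas.DataFrame.to_dict("records") yields one dict per row.
        return result.to_dict("records")
    return list(result)


# Raw-list case, as returned when the response cannot be converted.
raw = [{"id": "samtools", "name": "samtools"}, {"id": "bwa", "name": "bwa"}]
names = [tool["name"] for tool in to_records(raw)]
```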
get_versions_one_tool(tool_id)[source]

Return all versions of a given tool.

This is an alias for get_tool_versions().

Parameters:

tool_id (str) – the tool identifier (e.g., "samtools").

Returns:

pandas.DataFrame or raw list.

Example:

>>> from bioservices import Biocontainers
>>> b = Biocontainers()
>>> b.get_versions_one_tool("samtools")  

8.3. BiGG

Interface to the BiGG Models API Service.

class BiGG(verbose=False, cache=False)[source]

Interface to the BiGG Models API.

BiGG Models is a knowledgebase of genome-scale metabolic network reconstructions with standardised BiGG identifiers.

Example:

>>> from bioservices import BiGG
>>> bigg = BiGG()
>>> bigg.search("e coli", "models")
[{'bigg_id': 'e_coli_core', 'gene_count': 137, ...}, ...]

Constructor

Parameters:
  • verbose (bool) – print informative messages

  • cache (bool) – use HTTP cache

download(model_id, format_='json', gzip=True, target=None)[source]

Download a model file and save it locally.

Parameters:
  • model_id (str) – BiGG model identifier (e.g., "e_coli_core").

  • format_ (str) – file format, one of "xml", "json", "mat" (default: "json").

  • gzip (bool) – download the gzip-compressed version (default: True).

  • target (str) – local file path to write to. Defaults to "<model_id>.<format_>[.gz]" in the current directory.

Raises:

TypeError – if format_ is not one of the accepted values.

Example:

>>> from bioservices import BiGG
>>> bigg = BiGG()
>>> bigg.download("e_coli_core", format_="json", target="/tmp/e_coli_core.json.gz")  
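The default file name and the format check described above can be illustrated as follows. This is a sketch of the documented rule only, not the library's actual implementation; default_target and VALID_FORMATS are hypothetical names.

```python
VALID_FORMATS = ("xml", "json", "mat")


def default_target(model_id, format_="json", gzip=True):
    """Build "<model_id>.<format_>[.gz]" after checking the format."""
    if format_ not in VALID_FORMATS:
        raise TypeError("format_ must be one of: %s" % ", ".join(VALID_FORMATS))
    target = "%s.%s" % (model_id, format_)
    if gzip:
        target += ".gz"
    return target


name_gz = default_target("e_coli_core")
name_plain = default_target("e_coli_core", format_="xml", gzip=False)
```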
genes(model_id, ids=None)[source]

Retrieve genes from a model.

Parameters:
  • model_id (str) – BiGG model identifier (e.g., "e_coli_core").

  • ids – a single gene BiGG ID string or a list of IDs. If None, returns all genes for the model.

Returns:

a list of gene dicts when ids is None or a list; a single dict when ids is a single string.

Example:

>>> from bioservices import BiGG
>>> bigg = BiGG()
>>> bigg.genes("e_coli_core")           # list all  
>>> bigg.genes("e_coli_core", "b0351")  # single detail  
get_model(model_id)[source]

Retrieve metadata for a specific model.

Parameters:

model_id (str) – BiGG model identifier (e.g., "e_coli_core").

Returns:

dict with keys model_bigg_id, organism, metabolite_count, reaction_count, gene_count, reference_id, reference_type, escher_maps, last_updated, and download-size fields.

Example:

>>> from bioservices import BiGG
>>> bigg = BiGG()
>>> m = bigg.get_model("e_coli_core")
>>> m["organism"]
'Escherichia coli str. K-12 substr. MG1655'
>>> m["reaction_count"]
95
metabolites(model_id=None, ids=None)[source]

Retrieve metabolites from a model or the universal database.

Parameters:
  • model_id (str) – BiGG model identifier (e.g., "e_coli_core"). If None, queries the universal metabolite database.

  • ids – a single metabolite BiGG ID string or a list of IDs. If None, returns the full list for the model (or universal DB).

Returns:

a list of metabolite dicts when ids is None or a list; a single dict when ids is a single string.

Model metabolites (list and detail):

>>> from bioservices import BiGG
>>> bigg = BiGG()
>>> bigg.metabolites("e_coli_core")          # list all  
>>> bigg.metabolites("e_coli_core", "atp_c") # single detail  
>>> bigg.metabolites("e_coli_core", ids=["atp_c", "adp_c"])  

Universal metabolites:

>>> bigg.metabolites()               # list all universal  
>>> bigg.metabolites(ids="atp")      # single universal detail  
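Since the return type depends on whether ids is a single string (one dict) or None or a list (a list of dicts), downstream code is often simpler if it always receives a list. The sketch below shows one way to normalise the result. as_list is a hypothetical helper, and the bigg_id key in the sample dicts is an assumption about the entry layout.

```python
def as_list(result):
    """Wrap a single dict in a list; pass lists through unchanged."""
    if result is None:
        return []
    if isinstance(result, dict):
        return [result]
    return list(result)


# Stand-ins for a single-detail result and a list result (keys are assumed).
single = {"bigg_id": "atp_c"}
many = [{"bigg_id": "atp_c"}, {"bigg_id": "adp_c"}]

ids_single = [entry["bigg_id"] for entry in as_list(single)]
ids_many = [entry["bigg_id"] for entry in as_list(many)]
```

The same normalisation applies to the results of genes() and reactions(), which follow the same return rule.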
property models

Return the list of all models in BiGG.

Returns:

list of dicts, each with keys bigg_id, organism, metabolite_count, reaction_count, gene_count.

Example:

>>> from bioservices import BiGG
>>> bigg = BiGG()
>>> models = bigg.models
>>> models[0]["bigg_id"]
'e_coli_core'
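The list of model dicts lends itself to simple ranking and filtering. As a minimal sketch, the helper below returns the models with the most reactions, using the reaction_count key listed above. largest_models is a hypothetical name and the sample entries are invented.

```python
def largest_models(models, key="reaction_count", n=1):
    """Return the `n` models with the highest value for `key`."""
    return sorted(models, key=lambda model: model[key], reverse=True)[:n]


# Invented entries that follow the documented key layout.
sample_models = [
    {"bigg_id": "model_a", "organism": "Organism A",
     "metabolite_count": 10, "reaction_count": 12, "gene_count": 5},
    {"bigg_id": "model_b", "organism": "Organism B",
     "metabolite_count": 50, "reaction_count": 80, "gene_count": 30},
    {"bigg_id": "model_c", "organism": "Organism C",
     "metabolite_count": 20, "reaction_count": 25, "gene_count": 9},
]

top_two = [model["bigg_id"] for model in largest_models(sample_models, n=2)]
```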
reactions(model_id=None, ids=None)[source]

Retrieve reactions from a model or the universal database.

Parameters:
  • model_id (str) – BiGG model identifier (e.g., "e_coli_core"). If None, queries the universal reaction database.

  • ids – a single reaction BiGG ID string or a list of IDs. If None, returns the full list for the model (or universal DB).

Returns:

a list of reaction dicts when ids is None or a list; a single dict when ids is a single string.

Model reactions (list and detail):

>>> from bioservices import BiGG
>>> bigg = BiGG()
>>> bigg.reactions("e_coli_core")         # list all  
>>> bigg.reactions("e_coli_core", "PFK")  # single detail  

Universal reactions:

>>> bigg.reactions()            # list all universal  
>>> bigg.reactions(ids="PFK")   # single universal detail  
search(query, type_)[source]

Search BiGG Models by keyword.

Parameters:
  • query (str) – search term (e.g., "e coli", "atp", "phosphate").

  • type_ (str) – resource type to search. One of: "models", "metabolites", "reactions", "genes".

Returns:

list of matching result dicts.

Raises:

TypeError – if type_ is not one of the accepted values.

Example:

>>> from bioservices import BiGG
>>> bigg = BiGG()
>>> models = bigg.search("e coli", "models")
>>> models[0]["bigg_id"]  
'e_coli_core'
>>> bigg.search("atp", "metabolites")   
>>> bigg.search("gap", "genes")         
>>> bigg.search("phosphate", "reactions")  
property version

Return the current BiGG database and API version.

Returns:

dict with keys bigg_models_version, api_version, last_updated.

Example:

>>> from bioservices import BiGG
>>> bigg = BiGG()
>>> bigg.version["bigg_models_version"]
'1.6.0'

8.4. BioDBNet

Interface to the BioDBNet REST web service.

class BioDBNet(verbose=True, cache=False)[source]

Interface to the BioDBNet service.

BioDBNet converts biological identifiers between databases (Ensembl, UniProt, Entrez Gene, KEGG, Reactome, and many more).

Example:

>>> from bioservices import BioDBNet
>>> b = BioDBNet()
>>> b.getInputs()[:5]
>>> df = b.db2db("UniProt Accession", ["Gene ID", "Gene Symbol"], "P43403")

Use db2db() to convert identifiers from one database to others. Use dbReport() to convert to all possible output databases at once. Use dbOrtho() for cross-species identifier conversion. Use dbFind() when the identifier type is unknown. Use dbWalk() to follow a custom path through the database network.

Constructor

Parameters:
  • verbose (bool) – set to False to suppress informative messages

  • cache (bool) – use HTTP cache

db2db(input_db, output_db, input_values, taxon=9606)[source]

Convert identifiers from one database to one or more output databases.

Parameters:
  • input_db (str) – input database name (e.g., "UniProt Accession").

  • output_db – output database name or list of names (e.g., ["Gene ID", "Gene Symbol"]).

  • input_values – single identifier string or list of identifiers.

  • taxon (int) – NCBI taxonomy ID (default: 9606 for human).

Returns:

pandas.DataFrame indexed by the input identifier, with one column per output database.

Example:

>>> from bioservices import BioDBNet
>>> b = BioDBNet()
>>> df = b.db2db("UniProt Accession", ["Gene ID", "Gene Symbol"], "P43403")
>>> df.loc["P43403", "Gene Symbol"]
'ZAP70'
>>> df = b.db2db("Ensembl Gene ID", ["Gene Symbol"],
...              ["ENSG00000121410", "ENSG00000171428"], taxon=9606)
dbFind(output_db, input_values, taxon='9606')[source]

Find identifiers of unknown type and convert to an output database.

Use when you do not know the identifier type, or when you have a mixture of different identifier types. BioDBNet detects the type automatically and converts to output_db.

Parameters:
  • output_db (str) – output database name (e.g., "Gene ID").

  • input_values – single identifier string or list of identifiers.

  • taxon (str) – NCBI taxonomy ID as string (default: "9606").

Returns:

pandas.DataFrame indexed by the input value, with columns output_db and Input Type.

Example:

>>> from bioservices import BioDBNet
>>> b = BioDBNet()
>>> df = b.dbFind("Gene ID", ["ZMYM6_HUMAN", "NP_710159", "ENSP00000305919"])
>>> df.loc["ZMYM6_HUMAN", "Gene ID"]
'9204'
dbOrtho(input_db, output_db, input_values, input_taxon, output_taxon)[source]

Convert identifiers from one species to identifiers of another species.

Parameters:
  • input_db (str) – input database name (e.g., "Gene Symbol").

  • output_db (str) – output database name (e.g., "Gene ID").

  • input_values – single identifier string or list of identifiers.

  • input_taxon (int) – NCBI taxonomy ID for the input species (e.g., 9606 for human).

  • output_taxon (int) – NCBI taxonomy ID for the output species (e.g., 10090 for mouse).

Returns:

pandas.DataFrame indexed by the input identifier with a column for the output database.

Example:

>>> from bioservices import BioDBNet
>>> b = BioDBNet()
>>> df = b.dbOrtho("Gene Symbol", "Gene ID", ["MYC", "MTOR", "A1BG"],
...                input_taxon=9606, output_taxon=10090)
>>> df.loc["MYC", "Gene ID"]
'17869'
dbReport(input_db, input_values, taxon=9606)[source]

Convert identifiers to all available output databases at once.

Same as db2db() but automatically uses every output database reachable from input_db, making it convenient for exploratory mapping.

Parameters:
  • input_db (str) – input database name (e.g., "Ensembl Gene ID").

  • input_values – single identifier string or list of identifiers.

  • taxon (int) – NCBI taxonomy ID (default: 9606 for human).

Returns:

pandas.DataFrame indexed by the input identifier, with one column per output database.

Example:

>>> from bioservices import BioDBNet
>>> b = BioDBNet()
>>> df = b.dbReport("UniProt Accession", ["P43403"])
>>> "Gene Symbol" in df.columns
True
dbWalk(db_path, input_values, taxon=9606)[source]

Walk through the biological database network along a custom path.

Gives full control over the conversion path. Useful when the same database appears at both ends of the path (e.g., converting human Ensembl Gene IDs to mouse Ensembl Gene IDs via Homologene).

Parameters:
  • db_path (str) – "->"-separated path of database names (e.g., "Ensembl Gene ID->Gene ID->Homolog - Mouse Gene ID->Ensembl Gene ID").

  • input_values – single identifier string or list of identifiers.

  • taxon (int) – NCBI taxonomy ID (default: 9606).

Returns:

pandas.DataFrame with columns corresponding to each node in the path.

Example:

>>> from bioservices import BioDBNet
>>> b = BioDBNet()
>>> path = "Ensembl Gene ID->Gene ID->Homolog - Mouse Gene ID->Ensembl Gene ID"
>>> df = b.dbWalk(path, ["ENSG00000121410"])
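Building the path string from a list of database names avoids typos in the "->" separators. This is a minimal sketch; build_db_path is a hypothetical helper, and the two validation checks are design choices of the sketch rather than rules enforced by BioDBNet.

```python
def build_db_path(databases):
    """Join database names into a "->"-separated dbWalk path."""
    names = [name.strip() for name in databases]
    if len(names) < 2:
        raise ValueError("a path needs at least two databases")
    if any("->" in name for name in names):
        raise ValueError("database names must not contain '->'")
    return "->".join(names)


path = build_db_path([
    "Ensembl Gene ID",
    "Gene ID",
    "Homolog - Mouse Gene ID",
    "Ensembl Gene ID",
])
```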
getDirectOutputsForInput(input_db)[source]

Return databases reachable from input_db by a single edge.

Unlike getOutputsForInput(), which returns all transitively reachable databases, this returns only those connected by a direct single-hop edge in the BioDBNet graph.

Parameters:

input_db (str) – input database name or normalised alias (e.g., "Gene Symbol" or "genesymbol").

Returns:

list of directly connected output database name strings.

Example:

>>> from bioservices import BioDBNet
>>> b = BioDBNet()
>>> b.getDirectOutputsForInput("Gene Symbol")  
>>> b.getDirectOutputsForInput("genesymbol")   # normalised alias  
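The difference between direct (single-edge) and transitively reachable outputs can be illustrated on a toy graph. The edges below are invented for illustration and do not reflect the real BioDBNet network; direct_outputs and reachable_outputs are hypothetical helpers that only mirror the distinction described above.

```python
from collections import deque

# Toy adjacency list; these edges are made up for the example.
toy_graph = {
    "Gene Symbol": ["Gene ID"],
    "Gene ID": ["UniProt Accession", "Ensembl Gene ID"],
    "UniProt Accession": [],
    "Ensembl Gene ID": [],
}


def direct_outputs(graph, start):
    """Databases connected to `start` by a single edge."""
    return sorted(graph.get(start, []))


def reachable_outputs(graph, start):
    """Databases reachable from `start` through any number of edges."""
    seen = set()
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen or node == start:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return sorted(seen)


direct = direct_outputs(toy_graph, "Gene Symbol")
reachable = reachable_outputs(toy_graph, "Gene Symbol")
```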
getInputs()[source]

Return the list of all valid input database names.

Returns:

list of database name strings.

Example:

>>> from bioservices import BioDBNet
>>> b = BioDBNet()
>>> inputs = b.getInputs()
>>> "UniProt Accession" in inputs
True
getOutputsForInput(input_db)[source]

Return all output databases reachable from a given input database.

Parameters:

input_db (str) – input database name (e.g., "UniProt Accession").

Returns:

list of output database name strings.

Example:

>>> from bioservices import BioDBNet
>>> b = BioDBNet()
>>> outputs = b.getOutputsForInput("UniProt Accession")
>>> "Gene Symbol" in outputs
True

8.5. BioMart

This module provides a class BioMart that allows easy access to the BioMart service.

Note

SOAP and REST are available. We use REST for the wrapping.

class BioMart(host=None, verbose=False, cache=False, secure=False)[source]

Interface to the BioMart service

BioMart is made of different views. Each view corresponds to a specific MART. For instance, the UniProt service has a BioMart view.

The registry can help to find the different services available through BioMart.

>>> from bioservices import BioMart
>>> s = BioMart()
>>> ret = s.registry() # to get information about existing services

The registry is a list of dictionaries. Some aliases are available to get all the names or databases:

>>> s.names      # alias to list of valid service names from registry
>>> "unimart" in s.names
True

Once you have selected a view, you will want to select a database associated with this view and then a dataset. The datasets can be retrieved as follows:

>>> s.datasets("prod-intermart_1")  # retrieve datasets available for this mart

The main difficulty is figuring out the database name (here prod-intermart_1). The web site only shows the displayName, so you must introspect the registry to find it. BioServices provides the lookfor() method to help. For instance, to retrieve the database name of interpro, type:

>>> s = BioMart(verbose=False)
>>> s.lookfor("interpro")
Candidate:
     database: intermart_1
    MART name: prod-intermart_1
  displayName: INTERPRO (EBI UK)
        hosts: www.ebi.ac.uk

The display name (INTERPRO) corresponds to the MART name prod-intermart_1. Let us use it to retrieve the datasets:

>>> s.datasets("prod-intermart_1")
['protein', 'entry', 'uniparc']

Now that we have the dataset names, we can select one and build a query. Queries are XML documents that contain the dataset name, some attributes and filters. The dataset name is one of the elements returned by the datasets method. Suppose that we want to query protein; we need to add this dataset to the query:

>>> s.add_dataset_to_xml("protein")

Then, you can add attributes (one of the keys of the dictionary returned by attributes("protein")):

>>> s.add_attribute_to_xml("protein_accession")

Optional filters can be used:

>>> s.add_filter_to_xml("protein_length_greater_than", 1000)

Finally, you can retrieve the XML query:

>>> xml_query = s.get_xml()

and send the request to biomart:

>>> res = s.query(xml_query)
>>> len(res)
12801
>>> # show the first 10 accession numbers
>>> res = res.split("\n")
>>> res[0:10]
['P18656',
 'Q81998',
 'O09585',
 'O77624',
 'Q9R3A1',
 'E7QZH5',
 'O46454',
 'Q9T3F4',
 'Q9TCA3',
 'P72759']
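Splitting on "\n" leaves a trailing empty string when the response ends with a newline, and it keeps multi-attribute rows as single strings. The sketch below turns the response text into a list of rows. parse_rows is a hypothetical helper, the tab separator is an assumption about the output format, and the second column of the sample text is invented.

```python
def parse_rows(text, sep="\t"):
    """Split response text into rows of fields, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(line.split(sep))
    return rows


# Invented response text with two attributes per row.
sample = "P18656\t1200\nQ81998\t1050\n"
rows = parse_rows(sample)
accessions = [row[0] for row in rows]
```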

REACTOME example:

>>> s.lookfor("reactome")
>>> s.datasets("REACTOME")
['interaction', 'complex', 'reaction', 'pathway']

>>> s.new_query()
>>> s.add_dataset_to_xml("pathway")
>>> s.add_filter_to_xml("species_selection", "Homo sapiens")
>>> s.add_attribute_to_xml("pathway_db_id")
>>> s.add_attribute_to_xml("_displayname")
>>> xmlq = s.biomartQuery.get_xml()
>>> res = s.query(xmlq)

Note

The BioMart service is slow (in my experience, 2013-2014), so please be patient.

Constructor

The URLs required to use BioMart change quite often, and experience has shown that the BioMart class in BioServices may fail as a result. This is not a BioServices issue; it is due to API changes on the server side.

For that reason, the host is no longer filled in by default and must be set manually.

Let us take the example of the ensembl biomart. The host is

www.ensembl.org

Note that there is no prefix http and that the actual URL looked for internally is http://www.ensembl.org/biomart/martview

(It used to be martservice in 2012-2016)

Another reason not to set a default host is that servers may be busy or take a long time to initialise (if many MARTs are available). Usually, one knows which MART to look at, in which case a specific host (e.g., www.ensembl.org) will significantly speed up initialisation.

Parameters:

host (str) – a valid host (e.g., "www.ensembl.org", "gramene.org")

A list of databases is available at http://www.biomart.org/community.html

add_attribute_to_xml(name, dataset=None)[source]
add_dataset_to_xml(dataset)[source]
add_filter_to_xml(name, value, dataset=None)[source]
attributes(dataset)[source]

Retrieve the attributes available for a dataset.

Parameters:

dataset (str) – e.g. oanatinus_gene_ensembl

configuration(dataset)[source]

Retrieve the configuration available for a dataset.

Parameters:

dataset (str) – e.g. oanatinus_gene_ensembl

create_attribute(name, dataset=None)[source]
create_filter(name, value, dataset=None)[source]
custom_query(**args)[source]
property databases

list of valid database names from the registry

datasets(mart, raw=False)[source]

Retrieve the datasets available for a mart.

Parameters:

mart (str) – MART name (e.g., ensembl). See names for the list of valid MART names. To find the MART name for a given database, see the lookfor() method or the databases attribute.

>>> s = BioMart(verbose=False)
>>> s.datasets("prod-intermart_1")
['protein', 'entry', 'uniparc']
property displayNames

list of valid display names from the registry

filters(dataset)[source]

Retrieve the filters available for a dataset.

Parameters:

dataset (str) – e.g. oanatinus_gene_ensembl

>>> s.filters("uniprot").split("\n")[1].split("\t")
>>> s.filters("pathway")["species_selection"]
[Arabidopsis thaliana,Bos taurus,Caenorhabditis elegans,Canis familiaris,Danio
rerio,Dictyostelium discoideum,Drosophila melanogaster,Escherichia coli,Gallus
gallus,Homo sapiens,Mus musculus,Mycobacterium tuberculosis,Oryza
sativa,Plasmodium falciparum,Rattus norvegicus,Saccharomyces
cerevisiae,Schizosaccharomyces pombe,Staphylococcus aureus N315,Sus
scrofa,Taeniopygia guttata ,Xenopus tropicalis]
get_datasets(mart)[source]

Retrieve datasets with description

get_xml()[source]
property host
property hosts

list of valid hosts

lookfor(pattern, verbose=True)[source]

Search the registry for MARTs whose name, database, or displayName matches pattern.

Parameters:
  • pattern (str) – case-insensitive substring to search for (e.g., "interpro", "ensembl").

  • verbose (bool) – if True (default), print each matching entry.

Example:

>>> from bioservices import BioMart
>>> s = BioMart(host="www.ensembl.org", verbose=False)
>>> s.lookfor("ensembl")
property marts

list of marts

property names

list of valid service (MART) names from the registry

new_query()[source]

Reset the current query, clearing all datasets, filters and attributes.

Call this before building a new BioMart XML query from scratch.

Example:

>>> from bioservices import BioMart
>>> s = BioMart(host="www.ensembl.org", verbose=False)
>>> s.new_query()
>>> s.add_dataset_to_xml("mmusculus_gene_ensembl")
>>> s.add_attribute_to_xml("ensembl_gene_id")
>>> xml = s.get_xml()
query(xmlq)[source]

Send a query to biomart

The query must be an XML document that looks like the following (example from https://gist.github.com/keithshep/7776579):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="CSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6">
    <Dataset name="mmusculus_gene_ensembl" interface="default">
        <Filter name="ensembl_gene_id" value="ENSMUSG00000086981"/>
        <Attribute name="ensembl_gene_id"/>
        <Attribute name="ensembl_transcript_id"/>
        <Attribute name="transcript_start"/>
        <Attribute name="transcript_end"/>
        <Attribute name="exon_chrom_start"/>
        <Attribute name="exon_chrom_end"/>
    </Dataset>
</Query>

Warning

The input XML must be valid. There is no validation in this method.
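Since the method does not validate its input, a quick well-formedness check before sending the query can catch unbalanced tags early. The sketch below uses the standard library parser; is_well_formed is a hypothetical helper, and it only checks XML syntax, not whether the dataset, filter or attribute names exist on the server.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if `xml_text` parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text.strip())
    except ET.ParseError:
        return False
    return True


good_query = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="CSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6">
    <Dataset name="mmusculus_gene_ensembl" interface="default">
        <Attribute name="ensembl_gene_id"/>
    </Dataset>
</Query>"""

# Dropping the closing Dataset tag makes the document ill-formed.
bad_query = good_query.replace("</Dataset>", "")

good_ok = is_well_formed(good_query)
bad_ok = is_well_formed(bad_query)
```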

registry()[source]

Retrieve registry information.

The XML contains a list of children called MartURLLocation, each made of attributes. We parse the XML to return a list of dictionaries. Each dictionary corresponds to one MART.

Aliases to some keys are provided: names, databases, displayNames.
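Because the registry is a list of dictionaries, a lookfor()-style search can be sketched as a plain filter over that list. find_marts is a hypothetical helper, not the library's implementation. The key names name, database, displayName and host are assumptions about the dictionary layout, and the second sample entry is invented.

```python
def find_marts(registry, pattern):
    """Return registry entries whose name, database or displayName match."""
    pattern = pattern.lower()
    matches = []
    for entry in registry:
        fields = (
            entry.get("name", ""),
            entry.get("database", ""),
            entry.get("displayName", ""),
        )
        if any(pattern in field.lower() for field in fields):
            matches.append(entry)
    return matches


# Hand-written stand-in for the list returned by registry().
sample_registry = [
    {"name": "prod-intermart_1", "database": "intermart_1",
     "displayName": "INTERPRO (EBI UK)", "host": "www.ebi.ac.uk"},
    {"name": "demo_mart", "database": "demo_db",
     "displayName": "DEMO MART", "host": "example.org"},
]

interpro_names = [entry["name"] for entry in find_marts(sample_registry, "interpro")]
```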

property valid_attributes

list of valid datasets

version(mart)[source]

Return the version of a mart.

Parameters:

mart (str) – e.g. ensembl

8.6. BioModels

This module provides a class BioModels to access the BioModels web service.

class BioModels(verbose=True)[source]

Interface to the BioModels service

from bioservices import BioModels
bm = BioModels()
model = bm.get_model('BIOMD0000000299')

Previous API had several functions such as getAuthorsByModelId. This is easy to mimic with the new API:

bm = BioModels()
models = bm.get_all_models()
# assumes each model dict stores its identifier under the "id" key
[x['submitter'] for x in models if x['id'] == "MODEL1204280003"][0]

The same approach replaces getDateLastModifByModelId and getModelNameById if one uses the fields lastModified or name. Searching for models by their ChEBI identifiers is no longer supported; this concerns the functions getModelsIdByChEBI, getModelsIdByChEBIId, getSimpleModelsByChEBIIds and getSimpleModelsRelatedWithChEBI. For other searches related to Reactome, UniProt identifiers or GO terms, the search() method should work:

bm.search("P10113")
bm.search("REACT_33")
bm.search("GO:0006919")
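As a sketch of how the removed per-field helpers could be mimicked, the function below looks up one field of one model in a list of records. It assumes that each record is a dictionary holding the identifier under an "id" key; the helper name field_for_model and the sample records are made up for illustration.

```python
def field_for_model(models, model_id, field):
    """Return `field` of the first record whose 'id' equals model_id, else None."""
    for record in models:
        if record.get("id") == model_id:
            return record.get(field)
    return None


# Made-up records standing in for the output of get_all_models().
models = [
    {"id": "MODEL0000000001", "name": "toy model A", "submitter": "Alice"},
    {"id": "MODEL0000000002", "name": "toy model B", "submitter": "Bob"},
]

print(field_for_model(models, "MODEL0000000002", "submitter"))
print(field_for_model(models, "MODEL0000000001", "name"))
```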

constructor

Parameters:

verbose (bool) –

get_all_models(chunk=100)[source]

Return all models

get_model(model_id, frmt='json')[source]

Fetch information about a given model at a particular revision.

get_model_download(model_id, filename=None, output_filename=None)[source]

Download a particular file associated with a given model or all its files as a COMBINE archive.

Parameters:
  • model_id – a valid BioModels identifier

  • filename (str) – this is the requested filename to be found in the model

  • output_filename (str) – if you request a different output filename, use this parameter

  • frmt – format of the output (json, xml, html)

Returns:

Nothing. This function saves the model into a ZIP file named after the model identifier. If the filename parameter is specified, the output file is the requested file (if found).

bm.get_model_download("BIOMD0000000100", filename="BIOMD0000000100.png")
bm.get_model_download("BIOMD0000000100")

This function can retrieve all files in a ZIP archive or a single image. In the example below, we retrieve the PNG and plot it using matplotlib. Your favorite image viewer should give a better resolution. Alternatively, download the SVG version of the model.

from bioservices import BioModels
bm = BioModels()
from easydev import TempFile
with TempFile(suffix=".png") as fout:
    bm.get_model_download("BIOMD0000000100",
            filename="BIOMD0000000100.png",
            output_filename=fout.name)
    from pylab import imshow, imread
    imshow(imread(fout.name), aspect="auto")
get_model_files(model_id, frmt='json')[source]

Extract metadata information of model files of a particular model

Parameters:
  • model_id – a valid BioModels identifier

  • frmt – format of the output (json, xml)

get_p2m_missing(frmt='json')[source]

Retrieve all models in Path2Models that are now only available indirectly, through the representative model for the corresponding genus

Parameters:

frmt (str) – the format of the result (xml, csv, json)

Returns:

list of model identifiers

get_p2m_representative(model, frmt='json')[source]

Retrieve a representative model in Path2Models

Get the representative model identifier for a given missing model in Path2Models. This endpoint accepts as parameters a mandatory model identifier and an optional response format.

Parameters:
  • model (str) – The identifier of a model of interest

  • frmt (str) – the format of the result (xml, csv, json)

get_p2m_representatives(models, frmt='json')[source]

Find the replacement accessions for a set of Path2Models entries

Get the representative model identifiers of a set of given missing models in Path2Models. This endpoint expects a comma-separated list of model identifiers (without any surrounding whitespace) and an optional response format. Example identifiers: BMID000000112902, BMID000000009880 and BMID000000027397.

Parameters:
  • models (str) – The model identifiers separated by commas, or as a list.

  • frmt (str) – the format of the result (xml, csv, json)

from bioservices import BioModels
bm = BioModels()
bm.get_p2m_representatives("BMID000000112902,BMID000000009880,BMID000000027397")
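Because the endpoint expects identifiers without surrounding whitespace, it can help to normalise the input before building the request. The helper below is a hypothetical sketch (join_identifiers is not a bioservices function) that accepts either a comma-separated string or a list.

```python
def join_identifiers(models):
    """Return identifiers as one comma-separated string without whitespace."""
    if isinstance(models, str):
        models = models.split(",")
    # Strip each item and drop empty ones left by stray commas.
    cleaned = [item.strip() for item in models if item.strip()]
    return ",".join(cleaned)


print(join_identifiers("BMID000000112902, BMID000000009880 ,BMID000000027397"))
print(join_identifiers(["BMID000000112902", " BMID000000009880"]))
```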
get_pdgsmm_missing(frmt='json')[source]

Retrieve the identifiers of all PDGSMM entries that are no longer directly accessible

Parameters:

frmt (str) – the format of the result (xml, csv, json)

Returns:

list of model identifiers

get_pdgsmm_representative(model, frmt='json')[source]

Retrieve a representative model in PDGSMM

Get the representative model identifier for a given missing model in PDGSMM. This endpoint accepts as parameters a mandatory model identifier and an optional response format.

Parameters:
  • model (str) – The identifier of a model of interest

  • frmt (str) – the format of the result (xml, csv, json)

get_pdgsmm_representatives(models, frmt='json')[source]

Find the replacement accessions for a set of PDGSMM entries

Get the representative model identifiers of a set of given missing models in PDGSMM. This endpoint expects a comma-separated list of model identifiers (without any surrounding whitespace) and an optional response format. Example: MODEL1707110145,MODEL1707112456,MODEL1707115900.

Parameters:
  • models (str) – The model identifiers separated by commas, or as a list.

  • frmt (str) – the format of the result (xml, csv, json)

search(query, offset=None, numResults=None, sort=None, frmt='json')[source]

Search models of interest via keywords.

Example: PUBMED:"27869123" searches for models associated with the PubMed record identified by 27869123.

Parameters:
  • query (str) – search query. colon character must be escaped

  • offset (int) – number of items to skip before starting to collect the result set

  • numResults (int) – number of items to return

  • sort (str) – sort criteria in {id-asc, relevance-asc, relevance-desc, first_author-asc, first_author, name-asc, name-desc, publication_year-asc, publication_year-desc}

  • frmt (str) – format of the output (json, xml)

search_download(models, output_filename='models.zip', force=False)[source]

Returns models (XML) corresponding to a list of model identifiers.

Parameters:
  • models (str) – model identifiers, either as a comma-separated string or as a list of strings (e.g. 'BIOMD1,BIOMD2' or ['BIOMD1', 'BIOMD2'])

  • output_filename (str) – file used to save the models. This is a zipped file. If the file exists, you must use the force parameter

Todo

If no models are found (e.g. because of typos), an error message is printed. If only one of several models is not found, there is no warning or error. It would be nice to emit a warning by inspecting the number of models in the output file.
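The check suggested in this note could look like the following sketch. It assumes that the archive holds one file per model, which is not confirmed here; check_archive is a hypothetical helper, and the archive is built in memory only to keep the example self-contained.

```python
import io
import warnings
import zipfile


def check_archive(archive, requested):
    """Warn if the ZIP holds fewer entries than requested; return the entry count."""
    with zipfile.ZipFile(archive) as zf:
        found = len(zf.namelist())
    if found < len(requested):
        warnings.warn(f"requested {len(requested)} models, archive holds {found}")
    return found


# Build a small in-memory archive standing in for models.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("BIOMD0000000001.xml", "<sbml/>")
buffer.seek(0)

count = check_archive(buffer, ["BIOMD0000000001", "BIOMD0000000002"])
print(count)
```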

search_parameter(query, start=0, size=10, sort=None, frmt='json')[source]

Search for parameters of a model

BioModels Parameters is a resource that facilitates search and retrieval of parameter values used in the SBML models stored in the BioModels repository. Users can search for a model entity (e.g. a protein or drug) to retrieve the rate equations describing it, the associated parameter values, and the initial concentrations from the SBML models in BioModels. Although these data are extracted directly from the curated SBML models, they are not individually curated or validated; they are presented as is. Hence BioModels Parameters only provides a quick overview of available parameter values for guidance, and the original model should be consulted to understand the complete context of the parameter usage.

Parameters:
  • query (str) – A query to search against the model parameter values.

  • start (int) – offset of the result set (default 0)

  • size (int) – number of items to display per page

  • sort (str) – model or entity

  • frmt (str) – the format of the result (xml, csv, json)

bm.search_parameter("MAPK", size=100, sort="entity")

8.7. ChEBI

This module provides a class ChEBI

class ChEBI(verbose=False, cache=False)[source]

Interface to the ChEBI REST API.

ChEBI (Chemical Entities of Biological Interest) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds.

The REST API is documented at https://www.ebi.ac.uk/chebi/backend/api/docs/

Example usage:

>>> from bioservices import ChEBI
>>> ch = ChEBI()
>>> res = ch.getCompleteEntity("CHEBI:27732")
>>> res.smiles
'Cn1cnc2c1c(=O)n(c(=O)n2C)C'

Constructor

Parameters:
  • verbose (bool) –

  • cache (bool) –

conv(chebiId, target)[source]

Return the cross-reference accession number(s) for a given database.

Calls getCompleteEntity() internally and filters the DatabaseLinks by target.

Parameters:
  • chebiId (str) – a valid ChEBI identifier (e.g. "CHEBI:10102")

  • target (str) – source database name (e.g. "KEGG COMPOUND accession")

Returns:

list of accession number strings

>>> ch.conv("CHEBI:10102", "KEGG COMPOUND accession")
['C07484']
getAllOntologyChildrenInPath(chebiId, relationshipType, onlyWithChemicalStructure=False)[source]

Retrieve ontology children connected by a specific relationship type.

Parameters:
  • chebiId (str) – a valid ChEBI identifier (string)

  • relationshipType (str) – one of "is a", "has part", "has role", "is conjugate base of", "is conjugate acid of", "is tautomer of", "is enantiomer of", "has functional parent", "has parent hydride", "is substituent group of" (see module-level _RELATION_TYPE_MAP for the full list)

  • onlyWithChemicalStructure (bool) – filter to entities with a chemical structure (default False)

Returns:

list of ontology relation dicts

>>> ch.getAllOntologyChildrenInPath("CHEBI:27732", "has part")
getCompleteEntity(chebiId)[source]

Retrieve the complete entity for a ChEBI identifier.

Parameters:

chebiId (str) – a valid ChEBI identifier (e.g. "CHEBI:27732")

Returns:

a ChebiEntity dict-like object

>>> from bioservices import ChEBI
>>> ch = ChEBI()
>>> res = ch.getCompleteEntity("CHEBI:27732")
>>> float(res.mass)
194.19076
getCompleteEntityByList(chebiIdList=None)[source]

Retrieve complete entities for a list of ChEBI identifiers.

Parameters:

chebiIdList (list) – list of ChEBI identifiers (maximum 50 entries recommended)

Returns:

list of ChebiEntity objects
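To stay within the recommended maximum of 50 identifiers per call, a longer list can be split into batches first. The generator below is a minimal sketch; chunked is a hypothetical helper and the identifiers are made up.

```python
def chunked(items, size=50):
    """Yield successive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 120 made-up identifiers, split into batches of at most 50.
ids = [f"CHEBI:{n}" for n in range(1, 121)]
batches = list(chunked(ids))
print([len(batch) for batch in batches])
```

Each batch could then be passed to getCompleteEntityByList() in turn and the results concatenated.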

getLiteEntity(search, searchCategory='ALL', maximumResults=200, stars='ALL')[source]

Retrieve a list of lite entities matching a search term.

Parameters:
  • search (str) – search string (ChEBI name, identifier, SMILES, etc.)

  • searchCategory (str) – filter category (default "ALL"); currently unused by the REST backend – all categories are searched

  • maximumResults (int) – maximum number of results (default 200)

  • stars (str) – star filter – "ALL", "TWO ONLY", or "THREE ONLY" (default "ALL"); currently unused by the REST backend

Returns:

list of ChebiEntity objects

>>> res = ch.getLiteEntity("caffeine", maximumResults=10)
>>> len(res)
10
getOntologyChildren(chebiId)[source]

Retrieve the ontology children of a ChEBI entity.

Parameters:

chebiId (str) – a valid ChEBI identifier (string)

Returns:

dict with ontology children information

getOntologyParents(chebiId)[source]

Retrieve the ontology parents of a ChEBI entity.

Parameters:

chebiId (str) – a valid ChEBI identifier (string)

Returns:

dict with ontology parent information

getStructureSearch(structure, mode='MOLFILE', structureSearchCategory='SIMILARITY', totalResults=50, tanimotoCutoff=0.25)[source]

Perform a substructure, similarity, or identity search.

Parameters:
  • structure (str) – input structure string

  • mode (str) – structure format – "MOLFILE", "SMILES", or "CML"

  • structureSearchCategory (str) – search type – "SIMILARITY", "SUBSTRUCTURE", or "IDENTITY"

  • totalResults (int) – maximum number of results (default 50)

  • tanimotoCutoff (float) – minimum Tanimoto score (default 0.25, only used for "SIMILARITY" searches)

Returns:

list of matching entities

>>> ch = ChEBI()
>>> smiles = ch.getCompleteEntity("CHEBI:27732").smiles
>>> ch.getStructureSearch(smiles, "SMILES", "SIMILARITY", 3, 0.25)
getUpdatedPolymer(chebiId)[source]

Return compound data for a polymer ChEBI entry.

In the REST API this is equivalent to getCompleteEntity().

Parameters:

chebiId (str) – a valid ChEBI identifier (string)

Returns:

a ChebiEntity dict-like object

8.8. ChEMBL

This module provides a class ChEMBL

class ChEMBL(verbose=False, cache=False)[source]

New ChEMBL API bioservices 1.6.0

Resources

The ChEMBL database is made of a set of resources. We recommend reading https://arxiv.org/pdf/1607.00378.pdf

Here we first create an instance and retrieve the first 1000 molecules from the database using the limit parameter.

>>> from bioservices import ChEMBL
>>> c = ChEMBL()
>>> res = c.get_molecule(limit=1000)

The returned object is a list of 1000 records, each of them being a dictionary. The molecule resource is actually a very large one and one may want to skip some entries. This is possible using the offset parameter as follows:

# Retrieve 1000 molecules skipping the first 50
res = c.get_molecule(limit=1000, offset=50)

If you want to know all available resources and the number of entries in each resource, use:

status = c.get_status_resources()

For instance, you should find that the total number of entries in the mechanism resource is about 5,000:

print(status['mechanism'])

To retrieve all entries from the mechanism resource, you can either set limit to a value large enough:

res = c.get_mechanism(limit=1000000)

or simply set it to -1:

res = c.get_mechanism(limit=-1)

All resource methods behave in the same way.

Those resources methods are: get_activity(), get_assay(), get_atc_class(), get_binding_site(), get_biotherapeutic(), get_cell_line(), get_chembl_id_lookup(), get_compound_record(), get_compound_structural_alert(), get_document(), get_document_similarity(), get_document_term(), get_drug(), get_drug_indication(), get_go_slim(), get_mechanism(), get_metabolism(), get_molecule(), get_molecule_form(), get_protein_class(), get_source(), get_target(), get_target_component(), get_target_prediction(), get_target_relation(), get_tissue().

3 ways of getting items

  1. Retrieve everything:

    c.get_molecule(limit=-1)
    
  2. Retrieve a specific entry:

    c.get_molecule("CHEMBL24")
    
  3. Retrieve a set of entries:

    c.get_molecule(["CHEMBL24","CHEMBL2"])
    

Filtering and Ordering

To order the results, we provide a simple method order_by() that sorts the records according to the values of a specific key.

Any data returned by a resource method (a list of dictionaries) can be processed with this method:

c = ChEMBL()
data = c.get_drug(limit=100)
ordered_data = c.order_by(data, 'chirality')

If you want to order using a key within a key, for instance by the molecular weight stored in the molecule_properties key, use the double underscore notation as follows:

c = ChEMBL()
data = c.get_drug(limit=100)
ordered_data = c.order_by(data, 'molecule_properties__mw_freebase')

For filtering, it is possible to apply search filters to any resources. For example, it is possible to return all ChEMBL targets that contain the term ‘kinase’ in the pref_name attribute:

c.get_target(filters='pref_name__contains=kinase')

The pattern for applying a filter is as follows:

[field]__[filter_type]=[value]

where field has to be found by the user. Simply introspect the content of an item returned by the resource. For instance:

c.get_target(limit=1) # to get one entry

Let us consider the case of the molecule resource. You can retrieve the first 10 molecules using e.g.:

res = c.get_molecule(limit=10)

If you look at the first entry using res[0], you will get about 38 keys. For instance molecule_properties or molecule_chembl_id.

You can filter the molecules to keep only the molecule_chembl_id that match either CHEMBL25 or CHEMBL1000 using:

res = c.get_molecule(filters='molecule_chembl_id__in=CHEMBL25,CHEMBL1000')

The molecule_properties field is itself a dictionary. For instance, it contains the molecular weight (mw_freebase). To filter on it and keep molecules with a molecular weight greater than or equal to 300, use the following code:

res = c.get_molecule(filters='molecule_properties__mw_freebase__gte=300')

Here are the different types of filtering:

Filter Type       Description
exact (iexact)    Exact match with query
contains          Wild card search with query
startswith        Starts with query
endswith          Ends with query
regex             Regular expression query
gt (gte)          Greater than (or equal)
lt (lte)          Less than (or equal)
range             Within a range of values
in                Appears within list of query values
isnull            Field is null
search            Special type of filter allowing a full text search based on Solr queries.

Several filters can be applied at the same time using a list:

filters = ['molecule_properties__mw_freebase__gte=300']
filters += ['molecule_properties__alogp__gte=3']
res = c.get_molecule(filters=filters)
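Filter strings following the [field]__[filter_type]=[value] pattern can also be assembled programmatically. The helper below is a hypothetical sketch (make_filter is not part of bioservices); it only builds strings and does not contact the service.

```python
def make_filter(field, value, filter_type=None):
    """Build a '[field]__[filter_type]=[value]' string.

    The filter type is omitted for plain equality. Lists and tuples are
    joined with commas, as used by the 'in' filter type.
    """
    if isinstance(value, (list, tuple)):
        value = ",".join(str(item) for item in value)
    key = f"{field}__{filter_type}" if filter_type else field
    return f"{key}={value}"


filters = [
    make_filter("molecule_properties__mw_freebase", 300, "gte"),
    make_filter("molecule_chembl_id", ["CHEMBL25", "CHEMBL1000"], "in"),
    make_filter("molecule_chembl_id", "CHEMBL25"),
]
for item in filters:
    print(item)
```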

Use cases (inspired by the ChEMBL documentation):

Search molecules by synonym:

>>> from bioservices import ChEMBL
>>> c = ChEMBL()
>>> res = c.search_molecule('aspirin')

or by SMILES, InChIKey, or ChEMBL ID:

>>> res = c.get_molecule("CC(=O)Oc1ccccc1C(=O)O")
>>> res = c.get_molecule("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
>>> res = c.get_molecule('CHEMBL25')

Several molecules at the same time can also be retrieved using lists:

>>> res = c.get_molecule(['CHEMBL25', 'CHEMBL2'])

Search target by gene name:

>>> res = c.search_target("GABRB2")
>>> len(res['targets'])
18

or directly in the target synonym field:

>>> res = c.get_target(filters='target_synonym__icontains=GABRB2')

Note

icontains is the case-insensitive variant of contains, so it usually returns more entries than contains.

Given a list of molecule ChEMBL IDs, get the UniProt accession numbers that map to those compounds:

from collections import defaultdict
from bioservices import ChEMBL

# First, get some IDs of approved drugs (about 2000 molecules)
c = ChEMBL()
drugs = c.get_approved_drugs()
IDs = [x['molecule_chembl_id'] for x in drugs]

# We jump from compounds to targets through activities. This is a
# one-to-many mapping, so we initialise a default dictionary of sets.
compounds2targets = defaultdict(set)

# Query the activities in chunks of 50 compounds; the "in" filter
# expects a comma-separated list of identifiers.
template = "molecule_chembl_id__in={}"
for i in range(0, len(IDs), 50):
    activities = c.get_activity(filters=template.format(",".join(IDs[i:i+50])))
    # get target ChEMBL IDs from activities
    for act in activities:
        compounds2targets[act['molecule_chembl_id']].add(act['target_chembl_id'])

# We now need the target records for all targets found in the previous
# step. Each compound/drug may have hundreds of targets, so calling
# get_target for each list of hundreds of targets would take forever.
# Instead, because there are *only* about 12,000 targets, let us download
# all of them. This took about 4 minutes in this test, but with the cache
# enabled the next call is much quicker. This is not done at the
# activities level because there are too many entries.

targets = c.get_target(limit=-1)

# list all target ChEMBL IDs to easily retrieve an entry later on
target_names = [target['target_chembl_id'] for target in targets]

# retrieve all UniProt accessions for all targets of each compound
for compound, targs in compounds2targets.items():
    accessions = set()
    for target in targs:
        index = target_names.index(target)
        accessions = accessions.union([comp['accession']
            for comp in targets[index]['target_components']])
    compounds2targets[compound] = accessions

In version 1.6.0 of bioservices, you can simply use:

res = c.compounds2accession(IDs)

Get Target type count for all targets:

import collections
collections.Counter([x['target_type'] for x in targets])

Find compounds similar to a given SMILES query with a similarity threshold of 85%:

>>> SMILE = "CN(CCCN)c1cccc2ccccc12"
>>> c.get_similarity(SMILE, similarity=85)

Find compounds similar to aspirin (CHEMBL25) with similarity threshold of 70%:

>>> # search for aspirin in all molecules and take the ChEMBL ID
>>> # of the first hit
>>> molecules = c.search_molecule("aspirin")['molecules']
>>> chembl_id = molecules[0]['molecule_chembl_id']
>>> # now use get_similarity() with that ID
>>> res = c.get_similarity(chembl_id, similarity=70)

Perform a substructure search using SMILES or a ChEMBL ID:

>>> res = c.get_substructure("CN(CCCN)c1cccc2ccccc12")
>>> res = c.get_substructure("CHEMBL25")

Obtain the pChEMBL value for compound:

res = c.get_activity(filters=['pchembl_value__isnull=False',
                              'molecule_chembl_id=CHEMBL25'])

Obtain the pChEMBL value for compound and target:

res = c.get_activity(filters=['pchembl_value__isnull=False',
                              'molecule_chembl_id=CHEMBL25',
                              'target_chembl_id=CHEMBL612545'])

Get all approved drugs:

c.get_approved_drugs(max_phase=4)

Get approved drugs for lung cancer

The ChEMBL API changed significantly in 2018, so bioservices 1.6.0 had to change its own API as well; the new interface is simpler.

Below are some correspondences between the previous and the new API.

bioservices before 1.6.0                    After 1.6.0
get_compounds_substructure                  get_substructure
get_compounds_similar_to_SMILES             get_similarity(SMILE)
get_compounds_by_chemblId(ID)               get_similarity(ID)
get_individual_compounds_by_inChiKey        get_molecule(inchikey)
get_compounds_by_chemblId_form              get_molecule_form
get_compounds_by_chemblId_drug_mechanism    get_mechanism(ID)
get_target_by_chemblId(ID)                  get_target(ID)
get_image_of_compounds_by_chemblId          get_image
etc

Constructor

Parameters:
  • verbose (bool) – set to True to get more logging output

  • cache (bool) – set to True to enable HTTP caching

compounds2accession(compounds)[source]

For each compound, identifies the target and corresponding UniProt accession number

This is not part of ChEMBL API

# we recommend enabling the cache if you use this method regularly
from bioservices import ChEMBL
c = ChEMBL(cache=True)
drugs = c.get_approved_drugs()

# to speed up example
drugs = drugs[0:20]
IDs = [x['molecule_chembl_id'] for x in drugs]

c.compounds2accession(IDs)
get_ATC(limit=20, offset=0, filters=None)[source]

WHO ATC Classification for drugs

res = c.get_ATC()
res['atc']

Note

get_molecule returns 'molecules', and likewise all methods return a dictionary whose key is the plural of the resource name. This is consistent throughout the API except for this method, because ATC is an acronym.

get_activity(query=None, limit=20, offset=0, filters=None)[source]

Activity values recorded in an Assay

get_approved_drugs(max_phase=4, maxdrugs=1000000)[source]

Return all approved drugs

Parameters:
  • max_phase (int) – development phase filter (default 4 = approved drugs)

  • maxdrugs (int) – upper cap on results returned (default 1000000, effectively all)

get_assay(query=None, limit=20, offset=0, filters=None)[source]

Assay details as reported in source Document/Dataset

>>> c.get_assay("CHEMBL1217643")
get_binding_site(limit=20, offset=0, filters=None)[source]

Target binding site definition

get_biotherapeutic(limit=20, offset=0, filters=None)[source]

Biotherapeutic molecules, which includes HELM notation and sequence data

get_cell_line(limit=20, offset=0, filters=None)[source]

Cell line information

get_chembl_id_lookup(query=None, limit=20, offset=0, filters=None)[source]

Look up ChEMBL Id entity type

get_compound_record(query=None, limit=20, offset=0, filters=None)[source]

Occurrence of a given compound in a specific document

get_compound_structural_alert(query=None, limit=20, offset=0, filters=None)[source]

Indicates certain anomaly in compound structure

get_document(query=None, limit=20, offset=0, filters=None)[source]

Document/Dataset from which Assays have been extracted

get_document_similarity(query=None, limit=20, offset=0, filters=None)[source]

Provides documents similar to a given one

get_document_term(query=None, limit=20, offset=0, filters=None)[source]

Provides keywords extracted from a document using the TextRank algorithm

get_drug(query=None, limit=20, offset=0, filters=None)[source]

Approved drugs information, including (but not limited to) applicants, patent numbers and research codes

get_drug_indication(query=None, limit=20, offset=0, filters=None)[source]

Joins drugs with diseases providing references to relevant sources

get_go_slim(query=None, limit=20, offset=0, filters=None)[source]

GO slim ontology

get_image(query, dimensions=500, format='png', save=True, view=True, engine='indigo')[source]

Get the image of a given compound in PNG or SVG format.

Parameters:
  • query (str) – a valid compound ChEMBL ID or a list/tuple of valid compound ChEMBL IDs.

  • dimensions (int) – size of the image in pixels; an integer between 1 and 500.

  • format – png or svg; json is not supported.

  • save (bool) –

  • view (bool) – show the image if set to True.

  • engine – rendering engine; "rdkit" or "indigo" (default "indigo")

Returns:

the path (list of paths) used to save the figure (figures) (different from Chembl API)

from pylab import imread, imshow
from bioservices import ChEMBL
s = ChEMBL(verbose=False)
res = s.get_image(31863)
imshow(imread(res['filenames'][0]))

Todo

ignorecoords option

get_mechanism(query=None, limit=20, offset=0, filters=None)[source]

Mechanism of action information for FDA-approved drugs

get_metabolism(query=None, limit=20, offset=0, filters=None)[source]

Metabolic pathways with references

get_molecule(query=None, limit=20, offset=0, filters=None)[source]

Returns some molecules

Parameters:
  • limit (int) – number of molecules to retrieve

  • offset (int) – number of molecules to skip before retrieving

Returns:

a list of molecule dictionaries (or a single dict when querying by ID)

You can only retrieve 1,000 molecules at most per request using the limit parameter. Use a loop with offset to retrieve molecules in batches.

c.get_molecule('QFFGVLORLPOAEC-SNVBAGLBSA-N')
c.get_molecule("CC(=O)Oc1ccccc1C(=O)O")
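The batching loop mentioned above can be sketched as a generator. This is an illustration under assumptions: iter_batches is a hypothetical helper, the fetch callable is expected to accept limit and offset and to return a list of records, and a fake in-memory table replaces the real service so that the example runs offline.

```python
def iter_batches(fetch, batch_size=1000):
    """Yield records from fetch(limit=..., offset=...) until a short batch arrives."""
    offset = 0
    while True:
        batch = fetch(limit=batch_size, offset=offset)
        yield from batch
        # A batch shorter than batch_size means the resource is exhausted.
        if len(batch) < batch_size:
            break
        offset += batch_size


# Stand-in for the service: serves slices of a fake 2500-record table.
FAKE = [{"molecule_chembl_id": f"CHEMBL{n}"} for n in range(2500)]


def fake_fetch(limit, offset):
    return FAKE[offset:offset + limit]


records = list(iter_batches(fake_fetch))
print(len(records))
```

With a real client, a small wrapper around get_molecule() that returns the list of records could take the place of fake_fetch.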
get_molecule_form(query=None, limit=20, offset=0, filters=None)[source]

Relationships between molecule parents and salts

>>> s.get_molecule_form("CHEMBL2")['molecule_forms']
[{'is_parent': 'True',
  'molecule_chembl_id': 'CHEMBL2',
  'parent_chembl_id': 'CHEMBL2'},
 {'is_parent': 'False',
  'molecule_chembl_id': 'CHEMBL1558',
  'parent_chembl_id': 'CHEMBL2'},
 {'is_parent': 'False',
  'molecule_chembl_id': 'CHEMBL1347191',
  'parent_chembl_id': 'CHEMBL2'}]
get_organism(query=None, limit=20, offset=0, filters=None)[source]

Organism information for targets

get_similarity(structure, similarity=80, limit=20, offset=0, filters=None)[source]

Molecule similarity search

Parameters:
  • structure – SMILES, InChIKey, or ChEMBL ID of the query molecule

  • similarity – must be an integer greater than or equal to 70 and less than or equal to 100

Returns:

list of molecules corresponding to the search
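The bounds on similarity can be checked on the client side before a request is sent. The function below is a hypothetical sketch of such a check (check_similarity is not part of bioservices); how the library itself reacts to out-of-range values is not stated here.

```python
def check_similarity(similarity):
    """Return similarity unchanged if it is an integer in [70, 100], else raise."""
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(similarity, bool) or not isinstance(similarity, int):
        raise TypeError("similarity must be an integer")
    if not 70 <= similarity <= 100:
        raise ValueError("similarity must be between 70 and 100")
    return similarity


print(check_similarity(80))
```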

>>> from bioservices import ChEMBL
>>> c = ChEMBL()
>>> res = c.get_similarity("CC(=O)Oc1ccccc1C(=O)O", 80)
>>> res['molecules']

Here are more examples:

# Similarity search (80% cut-off) against ChEMBL using the
# aspirin SMILES string
c.get_similarity("CC(=O)Oc1ccccc1C(=O)O") # 80 by default

# Similarity search (80% cut-off) against ChEMBL using the
# aspirin ChEMBL ID
c.get_similarity("CHEMBL25")

# Similarity search (80% cut-off) against ChEMBL using the
# aspirin InChI Key
c.get_similarity("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")

The 'Substructure' and 'Similarity' web service resources allow the chemical content of ChEMBL to be searched. Like the other resources, these search-based resources accept filtering, paging and ordering arguments. These methods accept SMILES, InChI Key and molecule ChEMBL_ID as arguments; similarity searches additionally need an identity cut-off. Example molecule searches are shown above.

Searching with InChI key is only possible for InChI keys found in the ChEMBL database. The system does not try and convert InChI key to a chemical representation.

get_source(query=None, limit=20, offset=0, filters=None)[source]

Document/Dataset source

get_status()[source]

Return version of the DB and number of entries

Returns the number of entries for activities, compound_records, distinct_compounds (molecule), publications (document), targets, etc…

get_status_resources()[source]

Return number of entries for all resources

Note

not in the ChEMBL API.

Changed in version 1.7.3: (removed target_prediction and document_term)

get_substructure(structure, limit=20, offset=0, filters=None)[source]

Molecule substructure search

Parameters:

structure – a valid, existing substructure in SMILES format to look for in all molecules. A ChEMBL ID or an InChIKey is also accepted, as shown in the examples.

Returns:

list of molecules corresponding to the search

>>> from bioservices import ChEMBL
>>> c = ChEMBL()
>>> res = c.get_substructure("CC(=O)Oc1ccccc1C(=O)O")

Other examples:

# Substructure search against ChEMBL using the aspirin
# SMILES string
c.get_substructure("CC(=O)Oc1ccccc1C(=O)O")

# Substructure search against ChEMBL using the aspirin
# ChEMBL ID
c.get_substructure("CHEMBL25")

# Substructure search against ChEMBL using the aspirin
# InChIKey
c.get_substructure("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")

The 'Substructure' and 'Similarity' web service resources allow the chemical content of ChEMBL to be searched. Like the other resources, these search-based resources accept filtering, paging and ordering arguments. These methods accept SMILES, InChI Key and molecule ChEMBL_ID as arguments; similarity searches additionally need an identity cut-off. Example molecule searches are shown above.

Searching with InChI key is only possible for InChI keys found in the ChEMBL database. The system does not try and convert InChI key to a chemical representation.

get_target(query=None, limit=20, offset=0, filters=None)[source]

Targets (protein and non-protein) defined in Assay

>>> from bioservices import ChEMBL
>>> s = ChEMBL(verbose=False)
>>> resjson = s.get_target('CHEMBL240')
get_target_component(query=None, limit=20, offset=0, filters=None)[source]

Target sequence information (A Target may have 1 or more sequences)

res = c.get_target_component(1)
res['sequence']
get_target_prediction(query=None, limit=20, offset=0, filters=None)[source]

Predicted binding of a molecule to a given biological target

>>> res = c.get_target_prediction(1)
>>> res['molecule_chembl_id']
'CHEMBL2'
get_target_relation(query=None, limit=20, offset=0, filters=None)[source]

Describes relations between targets

>>> c.get_target_relation('CHEMBL261')
{'related_target_chembl_id': 'CHEMBL2095180',
 'relationship': 'SUBSET OF',
 'target_chembl_id': 'CHEMBL261'}
get_tissue(query=None, limit=20, offset=0, filters=None)[source]

Tissue classification

c.get_tissue(filters=['pref_name__contains=cervix'])

get_xref_source(query=None, limit=20, offset=0, filters=None)[source]

Cross-reference source information

order_by(data, name, ascending=True)[source]

Ordering data

We use the same convention as the ChEMBL API, where a double underscore indicates a hierarchy in the dictionary. So to access d['a']['b'], use a__b as the name parameter. Only 3 levels are allowed, e.g. a__b__c.

data = c.get_molecule()
data1 = c.order_by(data, 'molecule_chembl_id')
data2 = c.order_by(data, 'molecule_properties__alogp')

Note

the ChEMBL API allows for ordering but we do not use that API. Instead, we provide this generic function.
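The double-underscore lookup can be illustrated with plain dictionaries. This is a sketch of the idea, not the bioservices implementation: the functions below are written for this example, do not enforce the three-level limit, and do not handle missing keys or None values.

```python
def nested_get(record, name):
    """Follow a double-underscore path such as 'a__b' through nested dictionaries."""
    value = record
    for key in name.split("__"):
        value = value[key]
    return value


def order_by(data, name, ascending=True):
    """Return a new list sorted by the (possibly nested) key `name`."""
    return sorted(data, key=lambda rec: nested_get(rec, name), reverse=not ascending)


# Made-up records shaped like molecule entries.
data = [
    {"molecule_chembl_id": "CHEMBL3", "molecule_properties": {"alogp": 2.1}},
    {"molecule_chembl_id": "CHEMBL1", "molecule_properties": {"alogp": 3.5}},
    {"molecule_chembl_id": "CHEMBL2", "molecule_properties": {"alogp": 0.7}},
]

by_alogp = order_by(data, "molecule_properties__alogp")
print([rec["molecule_chembl_id"] for rec in by_alogp])
```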

search_activity(query, limit=20, offset=0)[source]

Activity values recorded in an Assay

search_assay(query, limit=20, offset=0)[source]

Assay details as reported in source document

search_chembl_id_lookup(query, limit=20, offset=0)[source]

Look up ChEMBL Id entity type

search_document(query, limit=20, offset=0)[source]

Document/Dataset from which Assays have been extracted

search_molecule(query, limit=20, offset=0)[source]

Search molecules by synonym, SMILES, InChIKey, or ChEMBL ID

search_target(query, limit=20, offset=0)[source]

Targets (protein and non-protein) defined in Assay

8.9. COG

Interface to the COG (Clusters of Orthologous Genes) web service

class COG(verbose=False, cache=False)[source]

Interface to the COG service

Note that in addition to the original COG service from NCBI, this interface also helps you in searching for organisms, and retrieves all pages in a single command (rather than paginating manually).

Here is an example of getting the COGs for E. coli. You first need the exact matching name. Bioservices provides a helper to search for the organism name understood by the COG service (e.g. Escherichia_coli_K-12_sub_MG1655 — not easy to guess):

from bioservices import COG
c = COG()
c.search_organism('coli')

# the output of the previous command gives you the name
c.get_cogs_by_organism('Escherichia_coli_K-12_sub_MG1655')

Constructor

get_all_cogs_definition(page=None)[source]

Get all COG definitions

get_cog_definition_by_cog_id(cog_id)[source]

Get specific COG Definitions by COG: COG0003

get_cog_definition_by_name(cog, page=None)[source]

Get specific COG Definitions by name: Thiamin-binding stress-response protein YqgV, UPF0045 family

get_cogs(**kwargs)[source]

Get COGs. Unfortunately, the API sends 10 COGS at a time given a specific page.

The dictionary returned contains the results, count, previous and next page.
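
Because results arrive one page at a time, collecting everything means following the next field until it is empty. The sketch below shows that loop against an in-memory stand-in rather than the real service. The names collect_all_pages and fake_fetch_page, the cogid key, and the use of page numbers for next and previous are all assumptions made for illustration.

```python
def collect_all_pages(fetch_page):
    """Call fetch_page(page) for page 1, 2, ... until 'next' is empty."""
    results = []
    page = 1
    while True:
        response = fetch_page(page)
        results.extend(response["results"])
        if not response.get("next"):
            break
        page += 1
    return results


# Stand-in for a real request: 25 fake records served 10 at a time.
FAKE_RECORDS = [{"cogid": "COG%04d" % i} for i in range(1, 26)]


def fake_fetch_page(page, page_size=10):
    """Return one page shaped like the dictionary described above."""
    start = (page - 1) * page_size
    has_next = start + page_size < len(FAKE_RECORDS)
    return {
        "count": len(FAKE_RECORDS),
        "previous": page - 1 if page > 1 else None,
        "next": page + 1 if has_next else None,
        "results": FAKE_RECORDS[start:start + page_size],
    }


all_cogs = collect_all_pages(fake_fetch_page)
```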

get_cogs_by_assembly_id(assembly_id, page=None)[source]

Filter COGs by assembly ID: GCA_000007185.1

get_cogs_by_category(category, page=None)[source]

Filter COGs by Taxonomic Category: ACTINOBACTERIA

get_cogs_by_category_id(category, page=None)[source]

Filter COGs by Taxonomic Category taxid: 651137

get_cogs_by_gene(gene, page=None)[source]

Filter COGs by gene tag: MK0280

get_cogs_by_id(cog_id, page=None)[source]

Filter COGs by COG ID tag: COG0003

get_cogs_by_id_and_category(cog_id, category, page=None)[source]

Filter COGs by COG id and Taxonomy Categories: COG0004 and CYANOBACTERIA

get_cogs_by_id_and_organism(cog_id, organism, page=None)[source]

Filter COGs by COG id and organism: COG0004 and Escherichia_coli_K-12_sub_MG1655

get_cogs_by_organism(name, page=None)[source]

Filter COGs by organism name: Nitrosopumilus_maritimus_SCM1

get_cogs_by_protein_name(protein, page=None)[source]

Filter COGs by Protein name: AJP49128.1

get_cogs_by_taxon_id(taxon_id, page=None)[source]

Filter COGs by taxid: 1229908

get_taxonomic_categories(page=None)[source]

Get all Taxonomic Categories.

If page is set, only that page is returned. There are 10 entries per page. If page is unset (default), all results are returned.

from bioservices import COG
c = COG()
names = [x['name'] for x in c.get_taxonomic_categories()['results']]
get_taxonomic_category_by_name(name, page=None)[source]

Get specific Taxonomic Category by name

c.get_taxonomic_category_by_name("ALPHAPROTEOBACTERIA")
search_organism(name)[source]

Return candidates that match the input name.

Parameters:

name (str) – search string matched case-insensitively against genome names

Returns:

list of items. Each item is a dictionary with genome name, assembly identifier and taxon identifier.
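
A case-insensitive match of this kind can be sketched as a substring test on lower-cased names. The function match_organisms, the dictionary keys, and the placeholder assembly and taxon values are hypothetical; only the two genome names come from this page.

```python
def match_organisms(genomes, name):
    """Return the records whose genome name contains `name`, ignoring case."""
    needle = name.lower()
    return [g for g in genomes if needle in g["genome_name"].lower()]


# Placeholder records: the assembly and taxon values are not real identifiers.
genomes = [
    {"genome_name": "Escherichia_coli_K-12_sub_MG1655",
     "assembly_id": "ASSEMBLY_PLACEHOLDER_1", "taxid": None},
    {"genome_name": "Nitrosopumilus_maritimus_SCM1",
     "assembly_id": "ASSEMBLY_PLACEHOLDER_2", "taxid": None},
]
hits = match_organisms(genomes, "COLI")
```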

8.10. ENA

This module provides a class ENA

New in version 1.4.4.

class ENA(verbose=False, cache=False)[source]

Interface to the ENA (European Nucleotide Archive)

>>> from bioservices import ENA
>>> e = ENA(verbose=False)

Retrieve read domain metadata in XML format:

print(e.get_data('ERA000092', 'xml'))

Retrieve assembled and annotated sequences in FASTA format:

print(e.get_data('A00145', 'fasta'))

The fasta_range parameter can be used to retrieve a subsequence, here bases 3 to 63 of sequence entry A00145:

e.get_data('A00145', 'fasta', fasta_range=[3, 63])

Retrieve assembled and annotated subsequences in HTML format:

e.view_data('A00145')

Retrieve expanded CON records:

To retrieve expanded CON records use the expanded=True parameter. For example, the expanded CON entry AL513382 in flat file format can be obtained as follows:

e.get_data('AL513382', frmt='text', expanded=True)

Expanded CON records differ from CON records in two ways: firstly, they contain the full sequence in addition to the contig assembly instructions; secondly, if a CON record contains only source or gap features, the expanded CON records will also display all features from the segment records.

Retrieve assembled and annotated sequence header in flat file format using the header=True parameter:

e.get_data('BN000065', 'text', header=True)

Retrieve assembled and annotated sequence records using sequence versions:

e.get_data('AM407889.1', 'fasta')
e.get_data('AM407889.2', 'fasta')

Constructor

Parameters:

verbose – set to False to prevent informative messages

data_warehouse()[source]
get_data(identifier, frmt, fasta_range=None, expanded=None, header=None, download=None)[source]

Retrieve an ENA entry in the specified format.

Parameters:
  • identifier (str) – ENA accession or identifier (e.g. 'AL513382')

  • frmt (str) – output format — one of xml, text, fasta, fastq, html, embl (availability depends on entry type)

  • fasta_range (list) – [start, end] base positions for subsequence retrieval (FASTA only)

  • expanded (bool) – if True, return expanded CON records

  • header (bool) – if True, return only the sequence header

  • download (bool) – if True, return data as a downloadable file

get_data("AL513382", "embl")

Note

The ENA API changed in 2020; this method wraps the current REST API.

get_taxon(taxon)[source]

Deprecated since version 7.8: removed due to ENA API update.

url = 'http://www.ebi.ac.uk/ena/browser/api'

8.11. Ensembl

Interface to Ensembl web service

class Ensembl(verbose=False, cache=False)[source]

Interface to the Ensembl service

See the documentation of each method for its list of parameters. The API follows the Ensembl REST API (http://rest.ensembl.org).

All methods have been tested using this BioServices notebook

Todo

There are 3 methods out of 50 that are not implemented so far.

Todo

Some methods have a parameter called feature. The official Ensembl API accepts several features at the same time. This is not yet implemented; only one feature at a time is accepted.

Note

Some functions use SQL wildcards (see e.g. http://www.w3schools.com/sql/sql_wildcards.asp). In brief, '_' can be used to substitute a single character and '%' a set of characters.
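
To check locally what a wildcard pattern would match, the two wildcards can be translated into a regular expression. This is a minimal sketch with hypothetical helper names (like_to_regex, like_match). The matching itself happens on the Ensembl server, whose case sensitivity and escaping rules may differ from this approximation.

```python
import re


def like_to_regex(pattern):
    """Translate an SQL LIKE pattern ('_' and '%') into a compiled regex."""
    parts = []
    for char in pattern:
        if char == "_":
            parts.append(".")       # exactly one character
        elif char == "%":
            parts.append(".*")      # any run of characters, possibly empty
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.DOTALL)


def like_match(pattern, text):
    """Return True if the whole text matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(text) is not None


single = like_match("BRCA_", "BRCA2")
many = like_match("%kinase%", "protein kinase activity")
```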

Constructor

Parameters:

verbose – set to False to prevent informative messages

check_nh_format(value)[source]
check_sequence(value)[source]
get_alignment_by_region(region, species, frmt='json', aligned=True, compact=True, compara='multi', display_species_set=None, mask=None, method='EPO', species_set=None, species_set_group='mammals')[source]

Retrieves genomic alignments as separate blocks based on a region and species

Parameters:
  • region (str) – Query region. A maximum of 10Mb is allowed to be requested at any one time (e.g., ‘X:1000000..1000100:1’, ‘X:1000000..1000100:-1’, ‘X:1000000..1000100’)

  • species (str) – Species name/alias (e.g., human)

  • aligned (bool) – Return the aligned string if true. Otherwise, return the original sequence (no insertions)

  • compact (bool) – Applicable to EPO_LOW_COVERAGE alignments. If true, concatenate the low coverage species sequences together to create a single sequence. Otherwise, separates out all sequences.

  • compara (str) – Name of the compara database to use. Multiple comparas can exist on a server if you are accessing Ensembl Genomes data (defaults to multi)

  • display_species_set (str) – Subset of species in the alignment to be displayed (multiple values). All the species in the alignment will be displayed if this is not set. Any valid alias may be used (e.g., human, chimp, gorilla)

  • mask (str) – Request the sequence masked for repeat sequences. Hard will mask all repeats as N’s and soft will mask repeats as lowercased characters.

  • method (str) – The alignment method amongst Enum(EPO, EPO_LOW_COVERAGE, PECAN, LASTZ_NET, BLASTZ_NET, TRANSLATED_BLAT_NET)

  • species_set (str) – the set of species used to define the pairwise alignment (multiple values). Should not be used with the species_set_group parameter. Use get_info_compara_by_method() with one of the methods listed above to obtain a valid list of species sets. Any valid alias may be used. (e.g., musc_musculus, homo_sapiens)

  • species_set_group (str) – The species set group name of the multiple alignment. Should not be used with the species_set parameter. Use /info/compara/species_sets/:method with one of the methods listed above to obtain a valid list of group names. (Defaults to mammals. e.g. mammals, amniotes, fish, sauropsids)
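
The region syntax used above (name:start..end with an optional :1 or :-1 strand) can be validated before a request is sent. The sketch below is one way to do that. parse_region is a hypothetical helper, and reading "10Mb" as 10,000,000 bases with inclusive coordinates is an assumption.

```python
import re

# Assumption: "10Mb" is read as 10,000,000 bases and coordinates are inclusive.
MAX_REGION_LENGTH = 10_000_000

REGION_PATTERN = re.compile(
    r"^(?P<name>[^:]+):(?P<start>\d+)\.\.(?P<end>\d+)(?::(?P<strand>-?1))?$"
)


def parse_region(region):
    """Split 'name:start..end[:strand]' into its parts and check its length."""
    match = REGION_PATTERN.match(region)
    if match is None:
        raise ValueError("not a valid region: %r" % region)
    start = int(match.group("start"))
    end = int(match.group("end"))
    strand = match.group("strand")
    length = end - start + 1
    if length < 1 or length > MAX_REGION_LENGTH:
        raise ValueError("region length %d is outside 1..%d"
                         % (length, MAX_REGION_LENGTH))
    return {
        "name": match.group("name"),
        "start": start,
        "end": end,
        "strand": int(strand) if strand is not None else None,
        "length": length,
    }


region = parse_region("X:1000000..1000100:-1")
```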

get_archive(identifier, frmt='json')[source]

Uses the given identifier to return the archived sequence

Parameters:
  • identifier (str) – An Ensembl stable ID

  • frmt (str) – output format (json, xml or jsonp)

>>> from bioservices import Ensembl
>>> s = Ensembl()
>>> res = s.get_archive("ENSG00000157764")
get_genetree_by_id(identifier, aligned=False, frmt='json', nh_format='simple', sequence='protein', compara='multi')[source]

Retrieves a gene tree dump for a gene tree stable identifier

Parameters:
  • identifier (str) – An Ensembl genetree ID

  • frmt (str) – response formats: json, jsonp, nh, phyloxml

  • aligned (bool) – if true, return the aligned string otherwise return the original sequence (no insertions). Can be True/1 or False/0 and defaults to 0

  • compara (str) – Name of the compara database to use. Multiple comparas can exist on a server if you are accessing Ensembl Genomes data

  • nh_format – The format of a NH (New Hampshire) request. Valid values are ‘full’, ‘display_label_composite’, ‘simple’, ‘species’, ‘species_short_name’, ‘ncbi_taxon’, ‘ncbi_name’, ‘njtree’, ‘phylip’

  • sequence – The type of sequence to bring back. Setting it to none results in no sequence being returned. Valid values are ‘none’, ‘cdna’, ‘protein’.

>>> from bioservices import Ensembl
>>> s = Ensembl()
>>> s.get_genetree_by_id('ENSGT00390000003602', frmt='nh', nh_format='simple')
>>> s.get_genetree_by_id('ENSGT00390000003602', frmt='phyloxml')
>>> s.get_genetree_by_id('ENSGT00390000003602', frmt='phyloxml', aligned=True, sequence='cdna')
>>> s.get_genetree_by_id('ENSGT00390000003602', frmt='phyloxml', sequence='none')
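
With frmt='nh' the tree comes back as a New Hampshire (Newick) string. As a rough sketch, leaf labels can be pulled out of such a string with a regular expression, assuming plain unquoted labels and at least one pair of parentheses. The helper leaf_labels and the sample string are made up for illustration and are not real Ensembl output. A dedicated tree parser is the safer choice for real data.

```python
import re


def leaf_labels(newick):
    """Return leaf labels of a Newick string with plain, unquoted labels."""
    # A leaf label directly follows '(' or ',' and ends at ':', ',', ')' or ';'.
    # Labels of internal nodes follow ')' and are therefore not captured.
    found = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [label.strip() for label in found]


sample_tree = "((geneA:0.1, geneB:0.2)ancestor1:0.3,geneC:0.4);"
leaves = leaf_labels(sample_tree)
```
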
get_genetree_by_member_id(identifier, species, frmt='json', aligned=False, db_type='core', object_type=None, nh_format='simple', sequence='protein', compara='multi')[source]

Retrieves a gene tree containing the gene identified by its member ID

Parameters:
  • compara (str) – Name of the compara database to use. Multiple comparas can exist on a server if you are accessing Ensembl Genomes data. Defaults to ‘multi’

  • db_type (str) – Restrict the search to a database other than the default. Useful if you need to use a DB other than core. Defaults to core

  • object_type (str) – Filter by feature type. Defaults to None; examples are gene, transcript.

get_genetree_by_member_id('ENSG00000157764', 'human', frmt='phyloxml')
get_genetree_by_member_symbol(species, symbol, frmt='json', aligned=False, db_type='core', object_type=None, nh_format='simple', sequence='protein', compara='multi')[source]

Retrieves a gene tree containing the gene identified by a symbol

get_homology_by_species_and_id(identifier, species, frmt='json', aligned=True, compara='multi', format=None, sequence=None, target_species=None, target_taxon=None, type='all')[source]

Retrieves homology information (orthologs) by Ensembl gene id

get_homology_by_symbol(species, symbol, frmt='json', aligned=True, compara=None)[source]

Retrieves homology information (orthologs) by symbol

get_info_analysis(species, frmt='json')[source]

List the names of analyses involved in generating Ensembl data.

Parameters:
  • species (str) – Species name/alias (e.g., homo_sapiens)

  • frmt (str) – response formats: json, jsonp, xml

get_info_assembly(species, frmt='json', bands=False)[source]

List the currently available assemblies for a species.

Parameters:
  • species (str) – Species name/alias (e.g., homo_sapiens)

  • frmt (str) – response formats: json, jsonp, xml

  • bands (bool) – if set to 1, include karyotype band information. Only display if band information is available

get_info_assembly_by_region(species, region, frmt='json', bands=0)[source]

Returns information about the specified toplevel sequence region for the given species.

get_info_biotypes(species, frmt='json')[source]

List the functional classifications of gene models that Ensembl associates with a particular species. Useful for restricting the type of genes/transcripts retrieved by other endpoints.

Parameters:
  • species (str) – Species name/alias (e.g., homo_sapiens)

  • frmt (str) – response formats: json, jsonp, xml

get_info_compara_by_method(method, frmt='json', compara='multi')[source]

List all collections of species analysed with the specified compara method.

Parameters:
  • method (str) – Filter by compara method. Use one of the methods returned by the /info/compara/methods endpoint (e.g., EPO)

  • frmt (str) – response formats: json, jsonp, xml

  • compara (str) – Name of the compara database to use. Multiple comparas may exist on a server when accessing Ensembl Genomes data. Defaults to ‘multi’

get_info_compara_methods(frmt='json', compara='multi', method_class=None)[source]

List all compara analyses available (an analysis defines the type of comparative data).

Parameters:
  • frmt (str) – response formats: json, yaml, jsonp, xml

  • method_class (str) – The class of the method to query for. Regular expression patterns are supported. (Defaults to GenomicAlign)

  • compara (str) – Name of the compara database to use. Multiple comparas may exist on a server when accessing Ensembl Genomes data.

Note

The Ensembl API argument is named class; it is exposed here as method_class because class is a reserved word in Python.
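
The renaming can be pictured as a small mapping step between the Python argument and the REST parameter. The sketch below uses a hypothetical helper, build_method_params, and is not the BioServices code.

```python
import keyword

# 'class' is a reserved word, so it cannot be a Python argument name.
class_is_reserved = keyword.iskeyword("class")


def build_method_params(method_class=None, compara="multi"):
    """Map Python-friendly argument names onto the REST parameter names."""
    params = {"compara": compara}
    if method_class is not None:
        # The dictionary key can be 'class' even though the argument cannot.
        params["class"] = method_class
    return params


params = build_method_params(method_class="GenomicAlign")
```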

get_info_comparas(frmt='json')[source]

Lists all available comparative genomics databases and their data release.

Parameters:

frmt (str) – response formats: json, jsonp, xml

get_info_data(frmt='json')[source]

Shows the data releases available on this REST server. May return more than one release (an infrequent, non-standard Ensembl configuration).

Parameters:

frmt (str) – response formats: json, jsonp, xml

get_info_external_dbs(species, frmt='json', filter=None)[source]

Lists all available external sources for a species.

Parameters:
  • frmt (str) – response formats: json, jsonp, xml

  • species (str) – Species name/alias

  • filter (str) – Restrict external DB searches to a single source or pattern. SQL-LIKE patterns are supported. See Ensembl doc.

get_info_ping(frmt='json')[source]

Checks if the service is alive.

get_info_rest(frmt='json')[source]

Shows the current version of the Ensembl REST API.

Parameters:

frmt (str) – response formats: json, jsonp, xml

get_info_software(frmt='json')[source]

Shows the current version of the Ensembl API used by the REST server.

Parameters:

frmt (str) – response formats: json, jsonp, xml

get_info_species(frmt='json')[source]

Lists all available species, their aliases, available adaptor groups and data release.

Parameters:

frmt (str) – response formats: json, jsonp, xml

get_lookup_by_id(identifier, frmt='json', db_type=None, expand=False, format='full', species=None)[source]

Find the species and database for a single identifier

Parameters:
  • identifier (str) – An Ensembl stable ID (e.g., ENSG00000157764)

  • frmt (str) – response formats in json, xml, jsonp

  • db_type (str) – Restrict the search to a database other than the default. Useful if you need to use a DB other than core. Defaults to core

  • expand (bool) – Expands the search to include any connected features. e.g. If the object is a gene, its transcripts, translations and exons will be returned as well.

  • format (str) – Specify the formats to emit from this endpoint

  • species (str) – Species name/alias (e.g., human)

get_lookup_by_id('ENSG00000157764', expand=True)
get_lookup_by_symbol(species, symbol, frmt='json', expand=False, format='full')[source]

Find the species and database for a symbol in a linked external database

Parameters:
  • species (str) – Species name/alias (e.g., human)

  • symbol (str) – A name or symbol from an annotation source that has been linked to a genetic feature (e.g., BRCA2)

  • frmt (str) – response formats in json, xml, jsonp

  • expand (bool) – Expands the search to include any connected features. e.g. If the object is a gene, its transcripts, translations and exons will be returned as well.

  • format (str) – Specify the formats to emit from this endpoint

get_lookup_by_symbol('homo_sapiens', 'BRCA2', expand=True)
get_map_assembly_one_to_two(first, second, region, species='human', frmt='json')[source]

Convert the co-ordinates of one assembly to another

Parameters:
  • first (str) – version of the input assembly (e.g., GRCh37)

  • second (str) – version of the output assembly (e.g., GRCh38)

  • region (str) – query region (e.g., X:1000000..1000100:1)

  • species (str) – species name/alias (default human)

e.get_map_assembly_one_to_two(species='human',
    first='GRCh37', region='X:1000000..1000100:1', second='GRCh38')
get_map_cdna_to_region(identifier, region, frmt='json', species=None)[source]

Convert from cDNA coordinates to genomic coordinates.

Parameters:
  • identifier (str) – a stable Ensembl transcript ID (e.g., ENST00000288602)

  • region (str) – query region in the form start..end (e.g., 100..300)

  • species (str) – species name/alias (default human)

Output reflects forward orientation coordinates as returned from the Ensembl API.

get_map_cdna_to_region('ENST00000288602', '100..300')
get_map_cds_to_region(identifier, region, frmt='json', species=None)[source]

Convert from CDS coordinates to genomic coordinates.

Parameters:
  • identifier – Ensembl ID e.g. ENST00000288602

  • region – Query region e.g., 100..300

Output reflects forward orientation coordinates as returned from the Ensembl API.

get_map_cds_to_region('ENST00000288602', '1..1000')
get_map_translation_to_region(identifier, region, frmt='json', species=None)[source]

Convert from protein (translation) coordinates to genomic coordinates.

Output reflects forward orientation coordinates as returned from the Ensembl API.

Parameters:
  • identifier (str) – a stable Ensembl translation ID (e.g., ENSP00000288602)

  • region (str) – query region in the form start..end (e.g., 100..300)

  • species (str) – species name/alias (e.g., homo_sapiens)

get_map_translation_to_region('ENSP00000288602', '100..300')
get_ontology_ancestors_by_id(identifier, frmt='json', ontology=None)[source]

Reconstruct the entire ancestry of a term from is_a and part_of relationships

Parameters:
  • identifier (str) – An ontology term identifier (e.g., GO:0005667)

  • frmt (str) – json, xml, yaml, jsonp

  • ontology (str) – Filter by ontology. Used to disambiguate terms which are shared between ontologies such as GO and EFO (e.g., GO)

get_ontology_ancestors_chart_by_id(identifier, frmt='json', ontology=None)[source]

Reconstruct the entire ancestry of a term from is_a and part_of relationships.

Parameters:
  • identifier (str) – an ontology term identifier (GO:0005667)

  • frmt (str) – json, xml, yaml, jsonp

  • ontology (str) – Filter by ontology. Used to disambiguate terms which are shared between ontologies such as GO and EFO

get_ontology_by_id(identifier, frmt='json', relation=None, simple=False)[source]

Search for an ontological term by its namespaced identifier

Parameters:
  • identifier (str) – An ontology term identifier (e.g., GO:0005667)

  • simple (bool) – If set the API will avoid the fetching of parent and child terms

  • frmt (str) – response formats in json, xml, yaml, jsonp

  • relation (str) – The types of relationships to include in the output. Fetches all relations by default (e.g., is_a, part_of)

>>> from bioservices import Ensembl
>>> e = Ensembl()
>>> res = e.get_ontology_by_id('GO:0005667')
get_ontology_by_name(name, frmt='json', ontology=None, relation=None, simple=False)[source]

Search for a list of ontological terms by their name

Parameters:
  • name (str) – An ontology name. SQL wildcards are supported; see the Ensembl documentation.

  • frmt (str) – response formats in json, xml, yaml, jsonp

  • simple (bool) – If set the API will avoid the fetching of parent and child terms

  • relation (str) – The types of relationships to include in the output. Fetches all relations by default (e.g., is_a, part_of)

  • ontology (str) – Filter by ontology. Used to disambiguate terms which are shared between ontologies such as GO and EFO (e.g., GO)

>>> from bioservices import Ensembl
>>> e = Ensembl()
>>> e.get_ontology_by_name('transcription factor')
400
>>> res = e.get_ontology_by_name('transcription factor complex')
>>> res[0]['children']
get_ontology_descendants_by_id(identifier, frmt='json', closest_term=None, ontology=None, subset=None, zero_distance=None)[source]

Find all the terms descended from a given term. By default searches are conducted within the namespace of the given identifier

Parameters:
  • identifier (str) – an ontology term identifier (GO:0005667)

  • frmt (str) – json, xml, jsonp

  • closest_term (bool) – If true return only the closest terms to the specified term

  • ontology (str) – Filter by ontology. Used to disambiguate terms which are shared between ontologies such as GO and EFO

  • subset (str) – Filter terms by the specified subset

  • zero_distance (bool) – Return terms with a distance of 0

get_overlap_by_id(identifier, feature=None, frmt='json', biotype=None, db_type=None, logic_name=None, misc_set=None, object_type=None, so_term=None, species=None, species_set='mammals')[source]

Retrieves features (e.g. genes, transcripts, variations etc.) that overlap a region defined by the given identifier.

Parameters:
  • identifier (str) – An Ensembl stable ID

  • feature (str) – The type of feature to retrieve. Multiple values are accepted. Value in Enum(gene, transcript, cds, exon, repeat, simple, misc, variation, somatic_variation, structural_variation, somatic_structural_variation, constrained, regulatory)

  • biotype (str) – The functional classification of the gene or transcript to fetch. Cannot be used in conjunction with logic_name when querying transcripts. (e.g., protein_coding)

  • db_type (str) – Restrict the search to a database other than the default. Useful if you need to use a DB other than core

  • logic_name (str) – Limit retrieval of genes, transcripts and exons by a given name of an analysis.

  • misc_set (str) – Miscellaneous set which groups together feature entries. Consult the DB or returned data sets to discover what is available. (e.g., cloneset_30k)

  • object_type (str) – Filter by feature type (e.g., gene)

  • so_term (str) – Sequence Ontology term to narrow down the possible variations returned. (e.g., SO:0001650)

  • species (str) – Species name/alias.

  • species_set (str) – Filter by species set for retrieving constrained elements. (e.g. mammals)

get_overlap_by_region(region, species, feature=None, frmt='json', biotype=None, cell_type=None, db_type=None, logic_name=None, misc_set=None, object_type=None, so_term=None, species_set=None, trim_downstream=False, trim_upstream=False)[source]

Retrieves multiple types of features for a given region.

Parameters:
  • region (str) – Query region. A maximum of 5Mb is allowed to be requested at any one time. e.g., X:1..1000:1, X:1..1000:-1, X:1..1000

  • species (str) – Species name/alias.

  • feature (str) – The type of feature to retrieve. Multiple values are accepted: gene, transcript, cds, exon, repeat, simple, misc, variation, somatic_variation, structural_variation, somatic_structural_variation, constrained, regulatory

  • biotype (str) – The functional classification of the gene or transcript to fetch. Cannot be used in conjunction with logic_name when querying transcripts. (e.g., protein_coding)

  • cell_type – Cell type name in Ensembl’s Regulatory Build, required for segmentation feature, optional for regulatory elements. e.g., K562

  • db_type (str) – Restrict the search to a database other than the default. Useful if you need to use a DB other than core

  • logic_name (str) – Limit retrieval of genes, transcripts and exons by a given name of an analysis.

  • misc_set (str) – Miscellaneous set which groups together feature entries. Consult the DB or returned data sets to discover what is available. (e.g., cloneset_30k)

  • so_term (str) – Sequence Ontology term to narrow down the possible variations returned. (e.g., SO:0001650)

  • species_set (str) – Filter by species set for retrieving constrained elements. (e.g. mammals)

  • trim_downstream (bool) – Do not return features which overlap the downstream end of the region.

  • trim_upstream (bool) – Do not return features which overlap upstream end of the region.

Todo

feature can take several values; how to support this is still to be decided.
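
One common way of encoding several values for the same parameter is to repeat the key in the query string, which the standard library does with urlencode(..., doseq=True). The sketch below only builds such a string. overlap_query is a hypothetical helper, and how BioServices should forward multiple features remains the open question noted above.

```python
from urllib.parse import urlencode


def overlap_query(features, **options):
    """Build a query string with one feature=... pair per requested feature."""
    params = {"feature": list(features)}
    params.update(options)
    # doseq=True expands the list into repeated key=value pairs.
    return urlencode(params, doseq=True)


query = overlap_query(["gene", "transcript"], biotype="protein_coding")
```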

get_overlap_by_translation(identifier, frmt='json', db_type=None, feature='protein_feature', so_term=None, species=None, type='none')[source]

Retrieve features related to a specific Translation as described by its stable ID (e.g. domains, variations).

Parameters:
  • identifier (str) – a stable Ensembl translation ID

  • frmt (str) – response formats in json, xml, jsonp

  • db_type (str) – Restrict the search to a database other than the default. Useful if you need to use a DB other than core

  • feature (str) – requested feature in: transcript_variation, protein_feature, residue_overlap, translation_exon, somatic_transcript_variation

  • so_term (str) – Sequence Ontology term to restrict the variations found. Its descendants are also included in the search. (e.g., SO:0001650)

  • species (str) – species name/alias

  • type (str) – Type of data to filter by. By default, all features are returned. Can specify a domain or consequence type. (e.g., low_complexity)

get_regulatory_by_id(identifier, species, frmt='json')[source]

Returns a RegulatoryFeature given its stable ID

Parameters:
  • identifier (str) – a stable Ensembl regulatory feature ID

  • species (str) – species name/alias (e.g., homo_sapiens)

get_sequence_by_id(identifier, frmt='fasta', db_type=None, expand_3prime=None, expand_5prime=None, format=None, mask=None, mask_feature=False, multiple_sequences=False, object_type=None, species=None, type='genomic')[source]

Request multiple types of sequence by stable identifier.

Parameters:
  • identifier (str) – a stable Ensembl ID

  • frmt (str) – response formats: fasta, json, text, yaml, jsonp

  • db_type (str) – Restrict the search to a database other than the default. Useful if you need to use a DB other than core (e.g., core)

  • expand_3prime (int) – Expand the sequence downstream of the sequence by this many basepairs. Only available when using genomic sequence type.

  • expand_5prime (int) – Expand the sequence upstream of the sequence by this many basepairs. Only available when using genomic sequence type.

  • format (str) – Format of the data (e.g., fasta)

  • mask (str) – Request the sequence masked for repeat sequences. Hard will mask all repeats as N’s and soft will mask repeats as lowercased characters. Only available when using genomic sequence type. (hard/soft)

  • mask_feature (bool) – Mask features on the sequence. If sequence is genomic, mask introns. If sequence is cDNA, mask UTRs. Incompatible with the ‘mask’ option

  • multiple_sequences (bool) – Allow the service to return more than 1 sequence per identifier. This is useful when querying for a gene but using a type such as protein.

  • object_type (str) – Filter by feature type (e.g., gene)

  • species (str) – Species name/alias (e.g., homo_sapiens)

  • type (str) – type of sequence to return: genomic, cds, cdna or protein. Requesting a gene with a type other than genomic may return multiple sequences, which requires the parameter multiple_sequences to be set to True

Example:

>>> from bioservices import Ensembl
>>> e = Ensembl()
>>> # Default format is fasta, let us use parameter frmt to overwrite it
>>> sequence = e.get_sequence_by_id('ENSG00000157764', frmt='text')
>>> print(sequence[0:20])
CGCCTCCCTTCCCCCTCCCC

>>> # complex request for a different database and sequence type
>>> res = e.get_sequence_by_id('CCDS5863.1', frmt='fasta',
...         object_type='transcript', db_type='otherfeatures',
...         type='cds', species='human')
>>> print(res[0:100])
>CCDS5863.1
ATGGCGGCGCTGAGCGGTGGCGGTGGTGGCGGCGCGGAGCCGGGCCAGGCTCTGTTCAAC
GGGGACATGGAGCCCGAGGCCGGCGCC
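
When frmt='fasta' is used, the result is plain FASTA text whose sequence is wrapped over several lines. A minimal sketch for joining those lines back into one string per record follows. parse_fasta is a hypothetical helper, and the sample text is the truncated output shown above rather than a complete record.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


fasta = (
    ">CCDS5863.1\n"
    "ATGGCGGCGCTGAGCGGTGGCGGTGGTGGCGGCGCGGAGCCGGGCCAGGCTCTGTTCAAC\n"
    "GGGGACATGGAGCCCGAGGCCGGCGCC\n"
)
records = parse_fasta(fasta)
```
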
get_sequence_by_region(region, species, frmt='json', coord_system=None, coord_system_version=None, expand_3prime=None, expand_5prime=None, format=None, mask=None, mask_feature=False)[source]

Returns the genomic sequence of the specified region of the given species.

Parameters:
  • region (str) – Query region. A maximum of 10Mb is allowed to be requested at any one time. e.g., X:1000000..1000100:1

  • species (str) – Species name/alias

  • coord_system (str) – Filter by coordinate system name (e.g., contig, seqlevel)

  • coord_system_version (str) – Filter by coordinate system version (e.g., GRCh37)

  • expand_3prime (int) – Expand the sequence downstream of the sequence by this many basepairs. Only available when using genomic sequence type.

  • expand_5prime (int) – Expand the sequence upstream of the sequence by this many basepairs. Only available when using genomic sequence type.

  • format (str) – Format of the data. (e.g., fasta)

  • mask (str) – Request the sequence masked for repeat sequences. Hard will mask all repeats as N’s and soft will mask repeats as lowercased characters. Only available when using genomic sequence type. (hard/soft)

  • mask_feature (bool) – Mask features on the sequence. If sequence is genomic, mask introns. If sequence is cDNA, mask UTRs. Incompatible with the ‘mask’ option

get_taxonomy_by_id(identifier, frmt='json', simple=False)[source]

Search for a taxonomic term by its identifier or name

Parameters:
  • identifier (str) – A taxon identifier. Can be an NCBI taxon id or a name (e.g., 9606 or Homo sapiens)

  • simple (bool) – If set the API will avoid the fetching of parent and child terms

  • frmt (str) – response formats in json, xml, yaml, jsonp

get_taxonomy_by_name(name, frmt='json')[source]

Search for a taxonomic id by a non-scientific name

Parameters:
  • name (str) – A non-scientific species name. Can include SQL wildcards; see the Ensembl documentation.

  • frmt (str) – response formats in json, xml, yaml, jsonp

>>> from bioservices import Ensembl
>>> e = Ensembl()
>>> res = e.get_taxonomy_by_name('homo')
get_taxonomy_classification_by_id(identifier, frmt='json')[source]

Return the taxonomic classification of a taxon node

Parameters:
  • identifier (str) – A taxon identifier. Can be an NCBI taxon id or a name (e.g., 9606, Homo sapiens)

  • frmt (str) – json, xml, yaml, jsonp

>>> from bioservices import Ensembl
>>> e = Ensembl()
>>> res = e.get_taxonomy_classification_by_id('9606')
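get_variation_by_id(identifier, species, frmt='json', genotypes=False, phenotypes=False, pops=False)[source]

Retrieves a variation by its identifier, optionally including genotype, phenotype and population data

Parameters: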
  • identifier (str) – variation identifier (e.g., rs56116432)

  • species (str) – Species name/alias (e.g., homo_sapiens)

  • frmt (str) – response format (json, xml, jsonp)

  • genotypes (bool) – Include genotypes

  • phenotypes (bool) – Include phenotypes

  • pops (bool) – Include populations

get_vep_by_id(identifier, species, frmt='json', canonical=False, ccds=False, domains=False, hgvs=False, numbers=False, protein=False, xref_refseq=False)[source]

Fetch variant consequences based on a variation identifier

Parameters:
  • identifier (str) – Query ID. Supports dbSNP, COSMIC and HGMD identifiers (e.g., rs116035550, COSM476)

  • species (str) – Species name/alias

  • canonical (bool) – Include a flag indicating the canonical transcript for a gene

  • ccds (bool) – Include CCDS transcript identifiers

  • domains (bool) – Include names of overlapping protein domains

  • hgvs (bool) – Include HGVS nomenclature based on Ensembl stable identifiers

  • numbers (bool) – Include affected exon and intron positions within the transcript

  • protein (bool) – Include Ensembl protein identifiers

  • xref_refseq (bool) – Include aligned RefSeq mRNA identifiers for transcript. NB: the RefSeq and Ensembl transcripts aligned in this way MAY NOT, AND FREQUENTLY WILL NOT, match exactly in sequence, exon structure and protein product
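
The boolean options above are simple on/off switches. As a sketch of how such flags might be turned into request parameters, the hypothetical helper below keeps only the enabled ones and encodes them as 1. Omitting disabled flags, and using 1 rather than another truthy value, are assumptions and not a description of the BioServices code.

```python
def flags_to_params(**flags):
    """Keep only the flags that are switched on and encode them as 1."""
    return {name: 1 for name, enabled in flags.items() if enabled}


params = flags_to_params(canonical=True, ccds=False, hgvs=True, protein=False)
```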

get_vep_by_region(region, allele, species, frmt='json', canonical=False, ccds=False, domains=False, hgvs=False, numbers=False, protein=False, xref_refseq=False)[source]

Fetch variant consequences

Parameters:
  • region (str) – Query region (e.g., 9:22125503-22125502:1)

  • allele (str) – Variation allele (e.g., C, DUP)

  • species (str) – Species name/alias

  • canonical (bool) – Include a flag indicating the canonical transcript for a gene

  • ccds (bool) – Include CCDS transcript identifiers

  • domains (bool) – Include names of overlapping protein domains

  • hgvs (bool) – Include HGVS nomenclature based on Ensembl stable identifiers

  • numbers (bool) – Include affected exon and intron positions within the transcript

  • protein (bool) – Include Ensembl protein identifiers

  • xref_refseq (bool) – Include aligned RefSeq mRNA identifiers for transcript. NB: the RefSeq and Ensembl transcripts aligned in this way MAY NOT, AND FREQUENTLY WILL NOT, match exactly in sequence, exon structure and protein product
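
The region string used by get_vep_by_region() follows a chromosome:start-end:strand layout. The sketch below shows one way to assemble it; format_region is a hypothetical helper and is not part of bioservices. Note that the documented example has a start one greater than its end, which, as far as we know, is how Ensembl's VEP input denotes an insertion.

```python
def format_region(chromosome, start, end, strand=1):
    """Build a 'chromosome:start-end:strand' string (hypothetical helper)."""
    if strand not in (1, -1):
        raise ValueError("strand must be 1 (forward) or -1 (reverse)")
    # Coordinates are cast to int so that numeric strings are accepted too.
    return f"{chromosome}:{int(start)}-{int(end)}:{strand}"


# Reproduces the region shown in the parameter description.
print(format_region("9", 22125503, 22125502))
```

The resulting string could then be passed as the region argument, for example e.get_vep_by_region(format_region("9", 22125503, 22125502), "C", "human").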

get_xrefs_by_id(identifier, frmt='json', all_levels=False, db_type='core', external_db=None, object_type=None, species=None)[source]

Perform lookups of Ensembl Identifiers and retrieve their external references in other databases

Parameters:
  • identifier (str) – An Ensembl Stable ID (ENSG00000157764)

  • frmt (str) – response formats: json, jsonp, nh, phyloxml

  • all_levels (bool) – Set to find all genetic features linked to the stable ID, and fetch all external references for them. Specifying this on a gene will also return values from its transcripts and translations.

  • db_type (str) – Restrict the search to a database other than the default. Useful if you need to use a DB other than core

  • external_db (str) – Filter by external database (e.g., HGNC)

  • object_type (str) – filter by feature type (e.g., gene, transcript)

  • species (str) – Species name/alias (human)

get_xrefs_by_name(name, species, frmt='json', db_type='core', external_db=None)[source]

Performs a lookup based upon the primary accession or display label of an external reference and returns the information we hold about the entry

Parameters:
  • name (str) – Symbol or display name of a gene (e.g., BRCA2)

  • species (str) – Species name/alias (e.g., human)

  • frmt (str) – response formats: json, jsonp, xml

  • db_type (str) – Restrict the search to a database other than the default. Useful if you need to use a DB other than core

  • external_db (str) – Filter by external database (e.g., HGNC)

get_xrefs_by_symbol(symbol, species, frmt='json', db_type='core', external_db=None, object_type=None)[source]

Looks up an external symbol and returns all Ensembl objects linked to it. This can be a display name for a gene/transcript/translation, a synonym or an externally linked reference. If a gene’s transcript is linked to the supplied symbol the service will return both gene and transcript (it supports transient links).

Parameters:
  • species (str) – Species name/alias (e.g., human)

  • symbol (str) – Symbol or display name of a gene (BRCA2)

  • frmt (str) – response formats: json, jsonp, xml

  • db_type (str) – Restrict the search to a database other than the default. Useful if you need to use a DB other than core

  • external_db (str) – Filter by external database (e.g., HGNC)

  • object_type (str) – filter by feature type (e.g., gene, transcript)

nh_format_to_frmt(value)[source]
post_archive(identifiers, frmt='json')[source]

Retrieve the archived sequence for a set of identifiers

returned by the requested JSONP response. Required ONLY when using JSONP as the serialisation method. Please see the user guide.

post_lookup_by_id(identifiers, frmt='json', db_type=None, expand=False, format='full', object_type=None, species=None)[source]

Find the species and database for a set of identifiers

Parameters:
  • identifiers (list) – A list of Ensembl stable IDs (e.g., ENSG00000157764)

  • frmt (str) – response formats in json, xml, jsonp

  • db_type (str) – Restrict the search to a database other than the default. Useful if you need to use a DB other than core. Defaults to core

  • expand (bool) – Expands the search to include any connected features. For example, if the object is a gene, its transcripts, translations and exons will be returned as well.

  • format (str) – Specify the formats to emit from this endpoint

  • object_type (str) – Filter by feature type (e.g., gene, transcript)

  • species (str) – Species name/alias (e.g., human)

>>> res = e.post_lookup_by_id(["ENSG00000157764", "ENSG00000248378"])
post_lookup_by_symbol(species, symbols, frmt='json', expand=False, format='full')[source]

Find the species and database for a set of symbols

Parameters:
  • species (str) – Species name/alias (e.g., human)

  • symbols (list) – A list of names or symbols from an annotation source that have been linked to a genetic feature (e.g., BRCA2)

  • frmt (str) – response formats in json, xml, jsonp

  • expand (bool) – Expands the search to include any connected features. For example, if the object is a gene, its transcripts, translations and exons will be returned as well.

  • format (str) – Specify the formats to emit from this endpoint

>>> res = e.post_lookup_by_symbol('homo_sapiens', ['BRCA2', 'BRAF'], expand=True)
post_vep_by_id(species, identifiers)[source]
post_vep_by_region(species, region)[source]

8.12. EVA

Interface to part of the EVA web service

class EVA(verbose=False, cache=False)[source]

Interface to the EVA service

  • version: indicates the version of the API, this defines the available filters and JSON schema to be returned. Currently there is only version ‘v1’.

  • category: this defines what objects we want to query. Currently there are five different categories: variants, segments, genes, files and studies.

  • resource: specifies the resource to be returned, therefore the JSON data model.

  • filters: each specific endpoint allows different filters.

Constructor

Parameters:

verbose – set to False to prevent informative messages

fetch_allinfo(name)[source]

Fetch summary information for a study by its accession.

Parameters:

name (str) – study accession (e.g., "PRJEB4019")

Returns:

study summary data

8.13. EUtils

Interface to the EUtils web Service.

class EUtils(verbose=False, email='unknown', cache=False, xmlparser='EUtilsParser')[source]

Interface to NCBI Entrez Utilities service

Note

Technical note: the WSDL interface was dropped in July 2015, so we now use the REST service.

Warning

Read the guidelines before sending requests. Send no more than 3 requests per second, otherwise your IP address may be banned. You should provide your email address through the email parameter so that you can be contacted before being banned.
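
One way to stay under the limit of 3 requests per second is to pause between calls. The sketch below is a minimal client-side throttle; the Throttle class is hypothetical and is not provided by bioservices. The clock and sleep functions are injectable only to make the behaviour easy to check.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls (hypothetical helper)."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Block until at least min_interval has elapsed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Intended usage (not executed here, since it needs network access):
#   throttle = Throttle(max_per_second=3)
#   for uid in ["34577063", "352"]:
#       throttle.wait()
#       s.EFetch("protein", uid, rettype="fasta")
```

Calling wait() before every request keeps consecutive calls at least one third of a second apart with the default setting.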

There are a few methods such as ELink() and EFetch(). Here is an example of how to use the EFetch() method to retrieve the FASTA sequence of a given identifier (34577063):

>>> from bioservices import EUtils
>>> s = EUtils()
>>> print(s.EFetch("protein", "34577063", rettype="fasta"))
>gi|34577063|ref|NP_001117.2| adenylosuccinate synthetase isozyme 2 [Homo sapiens]
MAFAETYPAASSLPNGDCGRPRARPGGNRVTVVLGAQWGDEGKGKVVDLLAQDADIVCRCQGGNNAGHTV
VVDSVEYDFHLLPSGIINPNVTAFIGNGVVIHLPGLFEEAEKNVQKGKGLEGWEKRLIISDRAHIVFDFH
QAADGIQEQQRQEQAGKNLGTTKKGIGPVYSSKAARSGLRMCDLVSDFDGFSERFKVLANQYKSIYPTLE
IDIEGELQKLKGYMEKIKPMVRDGVYFLYEALHGPPKKILVEGANAALLDIDFGTYPFVTSSNCTVGGVC
TGLGMPPQNVGEVYGVVKAYTTRVGIGAFPTEQDNEIGELLQTRGREFGVTTGRKRRCGWLDLVLLKYAH
MINGFTALALTKLDILDMFTEIKVGVAYKLDGEIIPHIPANQEVLNKVEVQYKTLPGWNTDISNARAFKE
LPVNAQNYVRFIEDELQIPVKWIGVGKSRESMIQLF

Most of the methods take a database name as input. You can obtain the valid list by checking the databases attribute.

A few functions take identifier(s) as input. These can be a list of strings, a list of numbers, or a string in which identifiers are separated by commas or spaces.

A few functions take an argument called term. You can use the AND keyword with spaces or + signs as separators:

Correct:   term=biomol mrna[properties] AND mouse[organism]
Correct:   term=biomol+mrna[properties]+AND+mouse[organism]

Other special characters, such as quotation marks (") or the # symbol used to refer to a query key on the History server, can be given either by their URL encodings (%22 for "; %23 for #) or verbatim:

Correct: term=#2+AND+"gene in genomic"[properties]
Correct: term=%232+AND+%22gene+in+genomic%22[properties]
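
The encoded form shown above can be produced with the standard library. This is a minimal sketch for the case where a URL is built by hand; encode_term is a hypothetical helper, and when the term is passed to the EUtils methods it is usually safer to give it verbatim, since the HTTP layer is expected to encode parameters itself.

```python
from urllib.parse import quote_plus


def encode_term(term):
    """URL-encode an Entrez term, keeping the [field] brackets readable.

    Hypothetical helper: spaces become '+', '#' becomes %23 and
    double quotes become %22, while '[' and ']' are left untouched.
    """
    return quote_plus(term, safe="[]")


print(encode_term('#2 AND "gene in genomic"[properties]'))
```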

For information about retmode and rettype, please see:

http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly

ECitMatch(bdata, **kargs)[source]

Retrieve PubMed IDs that correspond to a set of input citation strings.

Parameters:

bdata

Citation strings. Each input citation must be represented by a citation string in the following format:

journal_title|year|volume|first_page|author_name|your_key|

Multiple citation strings may be provided by separating the strings with a carriage return character (%0D) or simply \r or \n.

The your_key value is an arbitrary label provided by the user that may serve as a local identifier for the citation, and it will be included in the output.

Note that all spaces must be replaced by + symbols and that citation strings should end with a final vertical bar |.

Only xml supported at the time of this implementation.

from bioservices import EUtils
s = EUtils()
print(s.ECitMatch("proc+natl+acad+sci+u+s+a|1991|88|3248|mann+bj|Art1|%0Dscience|1987|235|182|palmenberg+ac|Art2|"))
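
Writing the bdata string by hand is error-prone, so the formatting rules above can be wrapped in a small function. The sketch below is illustrative only; build_citation and build_bdata are hypothetical names that do not exist in bioservices.

```python
def build_citation(journal, year, volume, first_page, author, key):
    """Format one citation as journal|year|volume|first_page|author|key|."""
    fields = [journal, year, volume, first_page, author, key]
    # Spaces are replaced by '+' and the string ends with a vertical bar.
    return "|".join(str(field).replace(" ", "+") for field in fields) + "|"


def build_bdata(citations):
    """Join several citations with a carriage return, as ECitMatch expects."""
    return "\r".join(build_citation(*citation) for citation in citations)


bdata = build_bdata([
    ("proc natl acad sci u s a", 1991, 88, 3248, "mann bj", "Art1"),
    ("science", 1987, 235, 182, "palmenberg ac", "Art2"),
])
print(repr(bdata))
```

The resulting string could then be passed as s.ECitMatch(bdata).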
EFetch(db, id, retmode='text', **kargs)[source]

Access to the EFetch E-Utilities

Parameters:
  • db (str) – database from which to retrieve UIDs.

  • id – identifier(s): a string (optionally comma-separated), an integer, or a list of strings or integers.

  • retmode – defaults to text (xml is possible but not recommended).

  • rettype – retrieval type (e.g., fasta, summary, docsum)

Returns:

depends on retmode parameter.

Note

In addition to the values accepted by NCBI, setting retmode to "dict" returns a dictionary.

>>> ret = s.EFetch("omim", "269840")  # ZAP70
>>> ret = s.EFetch("taxonomy", "9606", retmode="xml")
>>> [x.text for x in ret.getchildren()[0].getchildren() if x.tag=="ScientificName"]
['Homo sapiens']

>>> s = EUtils()
>>> s.EFetch("protein", "34577063", retmode="text", rettype="fasta")
>gi|34577063|ref|NP_001117.2| adenylosuccinate synthetase isozyme 2 [Homo sapiens]
MAFAETYPAASSLPNGDCGRPRARPGGNRVTVVLGAQWGDEGKGKVVDLLAQDADIVCRCQGGNNAGHTV
VVDSVEYDFHLLPSGIINPNVTAFIGNGVVIHLPGLFEEAEKNVQKGKGLEGWEKRLIISDRAHIVFDFH
QAADGIQEQQRQEQAGKNLGTTKKGIGPVYSSKAARSGLRMCDLVSDFDGFSERFKVLANQYKSIYPTLE
IDIEGELQKLKGYMEKIKPMVRDGVYFLYEALHGPPKKILVEGANAALLDIDFGTYPFVTSSNCTVGGVC
TGLGMPPQNVGEVYGVVKAYTTRVGIGAFPTEQDNEIGELLQTRGREFGVTTGRKRRCGWLDLVLLKYAH
MINGFTALALTKLDILDMFTEIKVGVAYKLDGEIIPHIPANQEVLNKVEVQYKTLPGWNTDISNARAFKE
LPVNAQNYVRFIEDELQIPVKWIGVGKSRESMIQLF

Identifiers can be provided as a single string of comma-separated values, a list of strings, a list of integers, or a single string or integer. Types must not be mixed within a list:

>>> e.EFetch("protein", "352, 234", retmode="text", rettype="fasta")
>>> e.EFetch("protein", 352, retmode="text", rettype="fasta")
>>> e.EFetch("protein", [352], retmode="text", rettype="fasta")
>>> e.EFetch("protein", [352, 234], retmode="text", rettype="fasta")
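
To make the accepted identifier forms concrete, the sketch below normalises each of them to the comma-separated string that E-utilities expects. It is not the library's internal implementation; normalize_ids is a hypothetical name used for illustration.

```python
def normalize_ids(ids):
    """Return identifiers as one comma-separated string (hypothetical helper)."""
    if isinstance(ids, (str, int)):
        # A single value, or a string with comma/space separated values.
        items = str(ids).replace(",", " ").split()
    else:
        items = list(ids)
        kinds = {type(item) for item in items}
        if len(kinds) > 1:
            raise TypeError("do not mix types in the identifier list")
        if kinds - {str, int}:
            raise TypeError("identifiers must be strings or integers")
        items = [str(item).strip() for item in items]
    return ",".join(items)


print(normalize_ids("352, 234"))
print(normalize_ids([352, 234]))
```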

retmode should be xml or text depending on the database. For instance, xml for pubmed:

>>> e.EFetch("pubmed", "20210808", retmode="xml")
>>> e.EFetch('nucleotide', id=15, retmode='xml')
>>> e.EFetch('nucleotide', id=15, retmode='text', rettype='fasta')
>>> e.EFetch('nucleotide', 'NT_019265', rettype='gb')

Other special characters, such as quotation marks (") or the # symbol used to refer to a query key on the History server, should be represented by their URL encodings (%22 for "; %23 for #).

A useful command is the following, which returns the GI identifier for an accession number shared by NCBI and EMBL:

e.EFetch(db="nuccore",id="AP013055", rettype="seqid", retmode="text")

Changed in version 1.5.0: instead of “xml”, retmode can now be set to dict, in which case an XML is retrieved and converted to a dictionary if possible.

EGQuery(term, **kargs)[source]

Provides the number of records retrieved in all Entrez databases by a text query.

Parameters:

term (str) – Entrez text query. Spaces may be replaced by ‘+’ signs. For very long queries (more than several hundred characters long), consider using an HTTP POST call. See the PubMed or Entrez help for information about search field descriptions and tags. Search fields and tags are database specific.

Returns:

returns a json data structure

>>> ret = s.EGQuery("asthma")
>>> [(x.DbName, x.Count) for x in ret.eGQueryResult.ResultItem if x.Count!='0']

>>> ret = s.EGQuery("asthma")
>>> ret.eGQueryResult.ResultItem[0]
{'Count': '115241',
 'DbName': 'pmc',
 'MenuName': 'PubMed Central',
 'Status': 'Ok'}
EInfo(db=None, **kargs)[source]

Provides information about a database (e.g., number of records)

Parameters:

db (str) – target database about which to gather statistics. Value must be a valid Entrez database name. See databases or don’t provide any value to obtain the entire list

Returns:

a json data structure that depends on the value of db (retmode defaults to json)

>>> all_database_names = s.EInfo()
>>> # specific info about one database:
>>> ret = s.EInfo("taxonomy")
>>> ret[0]['count']
'1445358'
>>> ret = s.EInfo('pubmed')
>>> ret[0]['fieldlist'][2]['fullname']
'Filter'

You can also set the retmode parameter to 'xml'. In that case, you will need an XML parser.

>>> ret = s.EInfo("taxonomy", retmode="xml")

Note

Note that the names in the XML and json outputs differ (some are lower case, some are upper case). This is inherent to the output of EUtils.

The Entrez links utility

Responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database.

Parameters:
  • db (str) – valid database from which to retrieve UIDs.

  • dbfrom (str) – Database containing the input UIDs. The value must be a valid database name (default = pubmed). This is the origin database of the link operation. If db and dbfrom are set to the same database value, then ELink will return computational neighbors within that database. Computational neighbors have linknames that begin with dbname_dbname (examples: protein_protein, pcassay_pcassay_activityneighbor).

  • id (str) – UID list. Either a single UID or a comma-delimited list. Limited to 200 identifiers.

  • cmd (str) – ELink command mode. The command mode specifies which function ELink will perform. Some optional parameters only function for certain values of cmd (see http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ELink). Examples are neighbor, prlinks.

>>> # Example: Find related articles to PMID 20210808
>>> ret = s.ELink("pubmed", id="20210808", cmd="neighbor_score")

>>> ret = s.parse_xml(ret, 'EUtilsParser')
>>> ret.eLinkResult.LinkSet.LinkSetDb[0].Link[1]
{'Id': '16539535'}

>>> s.ELink(dbfrom="nucleotide", db="protein",
...         id="48819,7140345")
>>> s.ELink(dbfrom='nuccore', id='21614549,219152114',
...         cmd='ncheck')

Convert GI numbers to taxon identifiers:

>>> s.ELink(dbfrom='nuccore', db="taxonomy", id='21614549,219152114')
EPost(db, id, **kargs)[source]

Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset.

Parameters:
  • db (str) – a valid database

  • id – list of identifiers given as strings

Returns:

a dictionary with a Web Environment string and a QueryKey that can be reused in another EUtils call.

ESearch(db, term, **kargs)[source]

Responds to a query in a given database

The response can be used later in ESummary, EFetch or ELink, along with the term translations of the query.

Parameters:
  • db – a valid database

  • term – an Entrez text query

Note

see _get_esearch_params() for the list of valid parameters.

>>> ret = e.ESearch('protein', 'human', retmax=5)
>>> ret = e.ESearch('taxonomy', 'Staphylococcus aureus[all names]')
>>> ret = e.ESearch('pubmed', "cokelaer AND BioServices")

>>> ret = e.ESearch('protein', '15718680')
>>> # Let us show the first pubmed identifier in a browser
>>> identifiers = e.pubmed(ret['idlist'][0])

More complex requests can be used. We will not cover all the possibilities (see the NCBI website). Here is an example to tune the search term to look into PubMed for the journal PNAS Volume 16:

>>> e.ESearch("pubmed", "PNAS[ta] AND 16[vi]")

You can then look more closely at a specific identifier using EFetch:

>>> e = EUtils()
>>> e.EFetch("pubmed", identifiers)

ESpell(db, term, **kargs)[source]

Retrieve spelling suggestions for a text query in a given database.

Parameters:
  • db (str) – database to search. Value must be a valid Entrez database name (default = pubmed).

  • term (str) – Entrez text query. All special characters must be URL encoded.

>>> ret = e.ESpell(db="pubmed", term="asthmaa+OR+alergies")
>>> ret = ret['eSpellResult']
>>> ret['Query']
'asthmaa OR alergies'
>>> ret['CorrectedQuery']
'asthma or allergy'
>>> ret = e.ESpell(db="pubmed", term="biosservices")
>>> ret = ret['eSpellResult']
>>> ret['CorrectedQuery']
'bioservices'
ESummary(db, id=None, **kargs)[source]

Returns document summaries for a list of input UIDs

Parameters:
  • db – a valid database

  • id (str) – list of identifiers (or a comma-separated string). All of the UIDs must be from the database specified by db. Limited to 200 identifiers.

>>> from bioservices import EUtils
>>> s = EUtils()
>>> ret = s.ESummary("snp","7535")
>>> ret = s.ESummary("snp","7535,7530")
>>> ret = s.ESummary("taxonomy", "9606,9913")
>>> proteins = s.ESearch("protein", "bacteriorhodopsin",
...         retmax=20)
>>> ret = s.ESummary("protein", 449301857)
>>> ret['result']['449301857']['extra']
'gi|449301857|gb|EMC97866.1||gnl|WGS:AEIF|BAUCODRAFT_31870'
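
Because ESummary accepts at most 200 identifiers per call, longer lists have to be split into batches. The sketch below shows one way to do this; chunked is a hypothetical helper and not part of bioservices.

```python
def chunked(ids, size=200):
    """Split a sequence of identifiers into batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    ids = list(ids)
    return [ids[start:start + size] for start in range(0, len(ids), size)]


batches = chunked(range(450))
print([len(batch) for batch in batches])

# Intended usage (not executed here, since it needs network access):
#   for batch in chunked(all_ids):
#       ret = s.ESummary("protein", ",".join(str(uid) for uid in batch))
```

Combined with a pause between calls, this keeps each request within the documented limits.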
property databases

Returns list of valid databases

email

fill this with your email address

help()[source]

Open EUtils help page

parse_xml(ret, method=None)[source]
snp_summary(id)[source]

Alias to EFetch for the SNP database

Returns:

a json data structure

>>> ret = s.snp_summary("123")
taxonomy_summary(id)[source]

Alias to EFetch for the taxonomy database

>>> s = EUtils()
>>> ret = s.taxonomy_summary("9606")
>>> ret['9606']['species']
'sapiens'
>>> ret = s.taxonomy_summary("9606,9605,111111111,9604")
>>> ret['9604']['taxid']
9604
class EUtilsParser(xml)[source]

Convert xml returned by EUtils into a structure easier to manipulate

Used by EUtils.EGQuery(), EUtils.ELink().

8.14. GEO (NCBI Gene Expression Omnibus)

Interface to the NCBI Gene Expression Omnibus (GEO) web service.

class GEO(verbose=False, cache=False)[source]

Interface to the NCBI Gene Expression Omnibus (GEO) database.

GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. This class provides programmatic access to search and retrieve GEO records using the NCBI E-utilities REST API.

The Bioconductor R package GEOquery provides an equivalent R interface to the same database.

Example usage:

from bioservices import GEO
g = GEO()

# Search GEO for datasets related to a topic
results = g.search("breast cancer AND Homo sapiens[organism]")

# Get a GEO record summary by accession
summary = g.get_summary("GSE10")

# Fetch detailed information for a GEO accession
record = g.fetch("GSE10")

Note

Some methods use the NCBI E-utilities which may require an API key for high-volume access. See https://www.ncbi.nlm.nih.gov/account/ for API key registration.

Constructor

Parameters:
  • verbose (bool) – print informative messages (default False)

  • cache (bool) – use caching (default False)

fetch(uid, db='gds', rettype='summary', retmode='text')[source]

Fetch detailed information for a GEO record.

Parameters:
  • uid – GEO UID (integer or string) or accession string. Can also be a comma-separated list or Python list of UIDs.

  • db (str) – GEO database. Either "gds" (default) or "geo".

  • rettype (str) – retrieval type. For gds: "summary" (default), "full", "brief", "uilist". For sequence records other types may apply.

  • retmode (str) – retrieval mode. Either "text" (default) or "xml".

Returns:

record text or parsed data

Return type:

str or dict

>>> from bioservices import GEO
>>> g = GEO()
>>> record = g.fetch("200000010")
get_accession_info(accession)[source]

Get information about a GEO record by its accession number.

Accepts GEO accession numbers like GSE10, GSM12, GPL96, or GDS1234. First searches for the UID corresponding to the accession, then retrieves the summary.

Parameters:

accession (str) – GEO accession number (e.g., "GSE10", "GSM12", "GPL96", "GDS1234")

Returns:

dict with summary information for the accession, or None if not found

Return type:

dict or None

>>> from bioservices import GEO
>>> g = GEO()
>>> info = g.get_accession_info("GSE10")
get_geo_datasets(query, organism=None, dataset_type=None, retmax=20)[source]

Search for GEO DataSets (GDS) matching a query.

Parameters:
  • query (str) – search query term

  • organism (str) – optional organism filter (e.g., "Homo sapiens", "Mus musculus")

  • dataset_type (str) – optional dataset type filter. Common values: "Expression profiling by array", "Expression profiling by high throughput sequencing", "Genome binding/occupancy profiling by high throughput sequencing"

  • retmax (int) – maximum number of results to return (default: 20)

Returns:

dict with search results

Return type:

dict

>>> from bioservices import GEO
>>> g = GEO()
>>> results = g.get_geo_datasets("breast cancer", organism="Homo sapiens")
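
The organism and dataset_type filters correspond to Entrez field tags such as [organism] and [DataSet Type], as used in the query examples for search(). The sketch below shows how such a query string could be assembled by hand; it is an assumption about the general approach, not a copy of what get_geo_datasets() does internally, and build_geo_query is a hypothetical name.

```python
def build_geo_query(query, organism=None, dataset_type=None):
    """Combine a free-text query with optional Entrez field filters."""
    parts = [query]
    if organism:
        parts.append(f"{organism}[organism]")
    if dataset_type:
        parts.append(f"{dataset_type}[DataSet Type]")
    # Clauses are joined with the Entrez AND operator.
    return " AND ".join(parts)


print(build_geo_query("breast cancer", organism="Homo sapiens"))
```

The string produced this way could be passed directly to g.search().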
get_geo_platforms(query, organism=None, retmax=20)[source]

Search for GEO Platforms (GPL) matching a query.

Parameters:
  • query (str) – search query term

  • organism (str) – optional organism filter (e.g., "Homo sapiens")

  • retmax (int) – maximum number of results to return (default: 20)

Returns:

dict with search results

Return type:

dict

>>> from bioservices import GEO
>>> g = GEO()
>>> results = g.get_geo_platforms("Affymetrix", organism="Homo sapiens")
get_geo_samples(query, organism=None, retmax=20)[source]

Search for GEO Samples (GSM) matching a query.

Parameters:
  • query (str) – search query term

  • organism (str) – optional organism filter (e.g., "Homo sapiens")

  • retmax (int) – maximum number of results to return (default: 20)

Returns:

dict with search results

Return type:

dict

>>> from bioservices import GEO
>>> g = GEO()
>>> results = g.get_geo_samples("breast cancer", organism="Homo sapiens")
get_geo_series(query, organism=None, retmax=20)[source]

Search for GEO Series (GSE) matching a query.

Parameters:
  • query (str) – search query term

  • organism (str) – optional organism filter (e.g., "Homo sapiens")

  • retmax (int) – maximum number of results to return (default: 20)

Returns:

dict with search results

Return type:

dict

>>> from bioservices import GEO
>>> g = GEO()
>>> results = g.get_geo_series("BRCA1 expression", organism="Homo sapiens")
get_summary(uid, db='gds')[source]

Get summary information for one or more GEO records by UID or accession.

Parameters:
  • uid – GEO UID (integer or string) or accession (e.g., "GSE10"). Can also be a comma-separated list or a Python list of UIDs.

  • db (str) – GEO database. Either "gds" (default) or "geo".

Returns:

dict with summary data for the requested record(s)

Return type:

dict

>>> from bioservices import GEO
>>> g = GEO()
>>> summary = g.get_summary("200000010")  # GDS UID for GSE10
>>> summary = g.get_summary(["200000010", "200000011"])
search(query, db='gds', retmax=20, retstart=0)[source]

Search GEO for datasets matching a query term.

Parameters:
  • query (str) –

    search query. Supports NCBI Entrez query syntax. Examples:

    • "breast cancer"

    • "breast cancer AND Homo sapiens[organism]"

    • "GSE10[ACCN]"

    • "expression profiling[DataSet Type]"

  • db (str) – GEO database to search. Either "gds" (GEO DataSets, default) or "geo" (all GEO records).

  • retmax (int) – maximum number of results to return (default: 20, max: 10000)

  • retstart (int) – index of first result to return (default: 0, used for pagination)

Returns:

dict with search results containing count, idlist, and translationset

Return type:

dict

>>> from bioservices import GEO
>>> g = GEO()
>>> results = g.search("breast cancer AND Homo sapiens[organism]")
>>> print(results["count"])
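
The retmax and retstart parameters allow results to be paged. The sketch below walks through successive pages and yields the UIDs found in each idlist; iter_geo_ids is a hypothetical helper, and it only assumes that the search callable accepts retmax and retstart and returns a dict with an idlist key, as described above.

```python
def iter_geo_ids(search, query, page_size=20, max_results=100):
    """Yield UIDs page by page until max_results or the end of the results."""
    fetched = 0
    while fetched < max_results:
        page = search(
            query,
            retmax=min(page_size, max_results - fetched),
            retstart=fetched,
        )
        ids = page.get("idlist", [])
        if not ids:
            return  # no more results
        for uid in ids:
            yield uid
        fetched += len(ids)


# Intended usage (not executed here, since it needs network access):
#   from bioservices import GEO
#   g = GEO()
#   uids = list(iter_geo_ids(g.search, "breast cancer", max_results=60))
```

Passing the bound method g.search keeps the helper independent of the GEO class, which also makes it easy to exercise with a stand-in function.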

8.15. GeneProf

Removed from the main API as of version 1.6.0. The code is still available in earlier versions and in the attic/ directory of the GitHub repository.

8.16. QuickGO

Interface to the QuickGO service

class QuickGO(verbose=False, cache=False)[source]

Interface to the QuickGO service

Retrieve information given a GO identifier:

>>> from bioservices import QuickGO
>>> go = QuickGO()
>>> res = go.get_go_terms("GO:0003824")

Changed in version 1.5.0: we use the new QuickGO API. To use the old API, please use a version of bioservices below 1.5.

Constructor

Parameters:
  • verbose (bool) – print informative messages.

  • cache (bool) – set to True to enable HTTP caching

Annotation(assignedBy=None, includeFields=None, limit=100, page=1, aspect=None, reference=None, geneProductId=None, evidenceCode=None, goId=None, qualifier=None, withFrom=None, taxonId=None, taxonUsage=None, goUsage=None, goUsageRelationships=None, evidenceCodeUsage=None, evidenceCodeUsageRelationships=None, geneProductType=None, targetSet=None, geneProductSubset=None, extension=None)[source]

Calling the Annotation service

Changed in version 1.4.18: due to service API changes, we refactored this method completely

Parameters:
  • assignedBy (str) – The database from which this annotation originates. Accepts comma separated values. E.g., BHF-UCL,Ensembl.

  • includeFields (str) – Optional fields retrieved from external services. Accepts comma separated values. accepted values: goName, taxonName, name, synonyms.

  • limit (int) – maximum number of results returned per page (default 100).

  • page (int) – page of results to return (default 1). Results may span several pages; to retrieve more than 100 results, call this method several times with increasing page values.

  • aspect (str) – use this to limit the annotations returned to a specific ontology or ontologies (Molecular Function, Biological Process or Cellular Component). The valid character can be F,P,C.

  • reference (str) – PubMed or GO reference supporting annotation. Can refer to a specific reference identifier or category (for category level, use * after ref type). Can be ‘PUBMED:*’, ‘GO_REF:0000002’.

  • geneProductId (str) – The id of the gene product annotated with the GO term. Accepts comma separated values. E.g., URS00000064B1_559292.

  • evidenceCode (str) – Evidence code indicating how the annotation is supported. Accepts comma separated values. E.g., ECO:0000255.

  • goId (str) – The GO id of an annotation. Accepts comma separated values. E.g., GO:0070125.

  • qualifier (str) – Aids the interpretation of an annotation. Accepts comma separated values. E.g., enables,involved_in.

  • withFrom (str) – Additional ids for an annotation. Accepts comma separated values. E.g., P63328.

  • taxonId (str) – The taxonomic id of the species encoding the gene product associated to an annotation. Accepts comma separated values. E.g., 1310605.

  • taxonUsage (str) – Indicates how the taxonomic ids within the annotations should be used. E.g., exact.

  • goUsage (str) – Indicates how the GO terms within the annotations should be used. Used in conjunction with ‘goUsageRelationships’ filter. E.g., descendants.

  • goUsageRelationships (str) – The relationship between the ‘goId’ values found within the annotations. Allows comma separated values. E.g., is_a,part_of.

  • evidenceCodeUsage (str) – Indicates how the evidence code terms within the annotations should be used. Is used in conjunction with ‘evidenceCodeUsageRelationships’ filter. E.g., descendants, exact

  • evidenceCodeUsageRelationships (str) – The relationship between the provided ‘evidenceCode’ identifiers. Allows comma separated values. E.g., is_a,part_of.

  • geneProductType (str) – The type of gene product. Accepts comma separated values. E.g., protein,RNA. can be protein, RNA and/or complex

  • targetSet (str) – Gene product set. Accepts comma separated values. E.g., KRUK,BHF-UCL,Exosome.

  • geneProductSubset (str) – A database that provides a set of gene products. Accepts comma separated values. E.g., TrEMBL.

  • extension (str) – Extensions to annotations, where each extension can be: EXTENSION(DB:ID) / EXTENSION(DB) / EXTENSION.

Returns:

a dictionary

>>> print(go.Annotation(geneProductId='UniProtKB:P12345', reference='PMID:*'))
>>> print(go.Annotation(geneProductId='UniProtKB:P12345,UniProtKB:Q4VCS5',
...     reference='PMID:,Reactome:'))
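
Since results are paged, collecting more than one page means calling Annotation() repeatedly with an increasing page value. The sketch below illustrates such a loop. It assumes that the returned dictionary carries a results list and a pageInfo entry with a total page count, which mirrors the QuickGO REST JSON but is not confirmed here for the bioservices return value; collect_annotations is a hypothetical helper.

```python
def collect_annotations(annotation, max_pages=5, **filters):
    """Gather results from successive pages of an Annotation-like callable."""
    results = []
    for page in range(1, max_pages + 1):
        data = annotation(page=page, **filters)
        batch = data.get("results", [])  # assumed key
        if not batch:
            break
        results.extend(batch)
        total_pages = data.get("pageInfo", {}).get("total")  # assumed keys
        if total_pages is not None and page >= total_pages:
            break
    return results


# Intended usage (not executed here, since it needs network access):
#   rows = collect_annotations(go.Annotation, max_pages=3,
#                              goId="GO:0070125", limit=100)
```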
Annotation_from_goid(goId, max_number_of_pages=25, **kargs)[source]

Returns a DataFrame containing annotation on a given GO identifier

Parameters:
  • goId (str) – a GO identifier (e.g., "GO:0003824")

  • max_number_of_pages (int) – maximum number of result pages to fetch

Returns:

a pandas.DataFrame containing the annotation data, or a list if pandas is not installed

All parameters from Annotation() are also valid except format that is set to tsv and cols that is made of all possible column names.

Search for gene products matching a query string.

Parameters:
  • query (str) – search term

  • taxonID (str) – NCBI taxonomy ID to filter results (optional)

  • page (int) – page number for paginated results (default 1)

  • limit (int) – maximum number of results per page (max 100)

  • type (str) – gene product type filter (optional)

  • dbSubSet (str) – database subset filter (optional)

  • proteome (str) – proteome filter (optional)

Returns:

dict with gene product search results

get_go_ancestors(query, relations='is_a,part_of,occurs_in,regulates')[source]

Retrieve ancestors of given GO term(s).

Parameters:
  • query (str) – GO term ID(s) as a comma-separated string

  • relations (str) – comma-separated relationship types to traverse (default "is_a,part_of,occurs_in,regulates")

Returns:

list of ancestor GO term results

get_go_chart(query)[source]

Return a PNG chart image for the given GO term(s).

Parameters:

query (str) – GO term ID(s) as a comma-separated string

Returns:

raw PNG image bytes

res = go.get_go_chart("GO:0022804")
with open("temp.png", "wb") as fout:
    fout.write(res)
get_go_children(query)[source]

Retrieve direct children of given GO term(s).

Parameters:

query (str) – GO term ID(s) as a comma-separated string

Returns:

list of child GO term results

get_go_paths(_from, _to, relations='is_a,part_of,occurs_in,regulates')[source]

Retrieve paths between two specified sets of ontology terms.

Each path is formed from a list of (term, relationship, term) triples.

Parameters:
  • _from (str) – source GO term ID (e.g., "GO:0005215")

  • _to (str) – target GO term ID (e.g., "GO:0003674")

  • relations (str) – comma-separated relationship types to traverse

Returns:

dict with "results" key containing a list of paths

paths = go.get_go_paths("GO:0005215", "GO:0003674")
# First path is found as the first item in the "results"
paths["results"][0]
get_go_terms(query, max_number_of_pages=None)[source]

Get information on all terms and page through the result

Parameters:
  • query (str) – GO term ID(s) as a comma-separated string (e.g., "GO:0003824")

  • max_number_of_pages – maximum number of pages to retrieve

Returns:

list of GO term result dictionaries

Searches a simple user query, e.g., query=apopto

Parameters:
  • query (str) – search term (e.g., "apopto")

  • limit (int) – maximum number of results to return (max 600)

  • page (int) – page number for paginated results (default 1)

Returns:

list of matching GO term results

8.17. Kegg

This module provides a class KEGG to access the KEGG REST interface. BioServices adds further methods and functionality on top of it.

Note

A previous interface to the KEGG WSDL service was designed, but the WSDL service closed in Dec 2012.

8.17.1. Some terminology

The following list is a simplified list of terminology taken from KEGG API pages.

  • organisms (org) are identified by a three-letter (or four-letter) code used in KEGG (e.g., hsa stands for Homo sapiens); see organismIds.

  • db is a database name. See databases attribute and KEGG Databases Names and Abbreviations section.

  • entry_id is a unique identifier. It is a combination of the database name and the identifier of an entry joined by a colon sign (e.g. ‘embl:J00231’).

    Here are some examples of entry Ids:

    • genes_id: A KEGG organism and a gene name (e.g. ‘eco:b0001’).

    • enzyme_id: ‘ec’ and an enzyme code. (e.g. ‘ec:1.1.1.1’). See enzymeIds.

    • compound_id: ‘cpd’ and a compound number (e.g. ‘cpd:C00158’). Some compounds also have ‘glycan_id’ and both IDs are accepted and converted internally. See compoundIds.

    • drug_id: ‘dr’ and a drug number (e.g. ‘dr:D00201’). See drugIds.

    • glycan_id: ‘gl’ and a glycan number (e.g. ‘gl:G00050’). Some glycans also have ‘compound_id’ and both IDs are accepted and converted internally. See glycanIds attribute.

    • reaction_id: ‘rn’ and a reaction number (e.g. ‘rn:R00959’ is a reaction that converts cpd:C00103 into cpd:C00668). See reactionIds attribute.

    • pathway_id: ‘path’ and a pathway number. Pathway numbers prefixed by ‘map’ specify the reference pathway and pathways prefixed by a KEGG organism specify pathways specific to the organism (e.g. ‘path:map00020’, ‘path:eco00020’) See pathwayIds attribute.

    • motif_id: a motif database name (‘ps’ for prosite, ‘bl’ for blocks, ‘pr’ for prints, ‘pd’ for prodom, and ‘pf’ for pfam) and a motif entry name (e.g. ‘pf:DnaJ’ means a Pfam database entry ‘DnaJ’).

    • ko_id: identifier made of ‘ko’ and a ko number (e.g. ‘ko:K02598’). See koIds attribute.
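To make the entry_id convention above concrete, here is a minimal sketch that splits an identifier into its database part and its entry part. The helper name split_entry_id is hypothetical (it is not a bioservices function); the sample identifiers are those quoted in the list.

```python
def split_entry_id(entry_id):
    """Split 'db:identifier' into (db, identifier).

    Hypothetical helper: returns (None, entry_id) when no database
    prefix is present.
    """
    db, sep, identifier = entry_id.partition(":")
    if not sep:
        return None, entry_id
    return db, identifier


examples = ["eco:b0001", "ec:1.1.1.1", "cpd:C00158", "path:map00020", "C00158"]
parsed = [split_entry_id(e) for e in examples]
```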

8.17.2. KEGG Databases Names and Abbreviations

Here is a list of databases used in KEGG API with their name and abbreviation:

Database Name    Abbrev    kid
pathway          path      map number
brite            br        br number
module           md        M number
disease          ds        H number
drug             dr        D number
environ          ev        E number
orthology        ko        K number
genome           genome    T number
genomes          gn        T number
genes
ligand           ligand
compound         cpd       C number
glycan           gl        G number
reaction         rn        R number
rpair            rp        RP number
rclass           rc        RC number
enzyme           ec

8.17.3. Database Entries

Database entries can be written in one of the following ways:

<dbentries> = <dbentry>1[+<dbentry>2...]
<dbentry> = <db:entry> | <kid> | <org:gene>

Each database entry is identified by:

db:entry

where “db” is the database name or its abbreviation shown above and “entry” is the entry name or the accession number that is uniquely assigned within the database. In practice “db” may be omitted, because the entry name, called the KEGG object identifier (kid), is unique across KEGG:

kid = database-dependent prefix + five-digit number

In the KEGG GENES database the db:entry combination must be specified. This is more specifically written as:

org:gene

where “org” is the three- or four-letter KEGG organism code or the T number genome identifier and “gene” is the gene identifier, usually locus_tag or ncbi GeneID, or the primary gene name.
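The kid rule (a database-dependent prefix followed by a five-digit number) can be sketched as a lookup based on the table of abbreviations above. This is an illustration only: kid_database and split_dbentries are hypothetical names, and the prefix table is transcribed from this page rather than taken from the library.

```python
import re

# Prefixes from the "kid" column of the table above
KID_PREFIXES = {
    "map": "pathway", "br": "brite", "M": "module", "H": "disease",
    "D": "drug", "E": "environ", "K": "orthology", "T": "genome",
    "C": "compound", "G": "glycan", "R": "reaction",
    "RP": "rpair", "RC": "rclass",
}
# kid = database-dependent prefix + five-digit number
KID = re.compile(r"^(map|br|RP|RC|[MHDEKTCGR])(\d{5})$")


def kid_database(kid):
    """Return the database name implied by a kid, or None if not a kid."""
    match = KID.match(kid)
    return KID_PREFIXES[match.group(1)] if match else None


def split_dbentries(dbentries):
    """Split '<dbentry>1+<dbentry>2...' into its individual entries."""
    return dbentries.split("+")


entries = split_dbentries("C01290+G00092")
databases = [kid_database(e) for e in entries]
```

Gene entries such as hsa:10458 are not kids, so kid_database returns None for them; they must always be written as org:gene.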

class KEGG(verbose=False, cache=False)[source]

Interface to the KEGG service

This class provides an interface to the KEGG REST API. The weblink tools are partially accessible. All dbentries can be parsed into dictionaries using the KEGGParser.

Here are some examples. In order to retrieve the entry of the gene identifier 7535 of the hsa organism, type:

from bioservices import KEGG
s = KEGG()
print(s.get("hsa:7535"))

The output is the raw output sent by the KEGG API. See KEGGParser to parse this output.

See also

The Database Entries section to learn more about the dbentries format.

Another example here below shows how to print the list of pathways of the human organism:

print(s.list("pathway", organism="hsa"))

Further post processing would allow you to retrieve the pathway Ids. However, we provide additional functions to the KEGG API so the previous code and post processing to extract the pathway Ids can be written as:

s.organism = "hsa"
s.pathwayIds

Similarly, you can easily get the output of any of the databases() and the corresponding database Ids. For example, for the reaction database:

s.reaction   # equivalent to s.list("reaction")
s.reactionIds

Other methods of interest are conv(), find(), get().

Constructor

Parameters:
  • verbose (bool) – prints informative messages

  • cache (bool) – set to True to enable HTTP caching

Tnumber2code(Tnumber)[source]

Converts organism T number to its code

>>> from bioservices import KEGG
>>> s = KEGG()
>>> s.Tnumber2code("T01001")
'hsa'
property briteIds

returns list of brite Ids.

See also

list()

code2Tnumber(code)[source]

Converts organism code to its T number

>>> from bioservices import KEGG
>>> s = KEGG()
>>> s.code2Tnumber("hsa")
'T01001'
property compoundIds

returns list of compound Ids

See also

list()

conv(target, source)[source]

convert KEGG identifiers to/from outside identifiers

Parameters:
  • target (str) – the target database (e.g., a KEGG organism).

  • source (str) – the source database (e.g., uniprot) or a valid dbentries; see below for details.

Returns:

a dictionary with keys being the source and values being the target.

Here are the rules to set the target and source parameters.

If the second argument is not a dbentries, source and target parameters can be of two types:

  1. gene identifiers. If the target is a KEGG Id, then the source must be one of ncbi-gi, ncbi-geneid or uniprot.

    Note

    source and target can be swapped.

  2. chemical substance identifiers. If the target is one of the following KEGG databases: drug, compound, glycan, then the source must be one of pubchem or chebi.

    Note

    again, source and target can be swapped

If the second argument is a dbentries, it can be again of two types:

  1. gene identifiers. The database used can be one of ncbi-gi, ncbi-geneid, uniprot or any KEGG organism

  2. chemical substance identifiers. The database used can be one of drug, compound, glycan, pubchem or chebi only.

Note

if the second argument is a dbentries, target and dbentries cannot be swapped.

# conversion from NCBI GeneID to KEGG ID for E. coli genes
s.conv("eco", "ncbi-geneid")
# inverse of the above example
s.conv("ncbi-geneid", "eco")
# conversion from KEGG ID to NCBI GI
s.conv("ncbi-gi", "hsa:10458+ece:Z5100")

As another example, you can either convert an entire database to another (e.g., all human gene IDs from UniProt to KEGG Ids):

mapping = s.conv("hsa", "uniprot")

or a subset by providing a valid dbentries:

s.conv("hsa","up:Q9BV86+")

Warning

Calls to this function may take a long time. conv("hsa", "uniprot") takes about a minute whereas, surprisingly, conv("uniprot", "hsa") takes just a few seconds.

Changed in version 1.1: the output is now a dictionary, not a list of tuples
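Because the result is a dictionary keyed by source identifier, several source Ids may point to the same target Id. The sketch below groups them by target. It is illustrative only: invert_mapping is a hypothetical helper and the sample dictionary contains made-up placeholder identifiers, not real conv() output.

```python
from collections import defaultdict


def invert_mapping(mapping):
    """Group source Ids by target Id (hypothetical helper)."""
    inverted = defaultdict(list)
    for source_id, target_id in mapping.items():
        inverted[target_id].append(source_id)
    return dict(inverted)


# Placeholder data shaped like the documented {source: target} output
sample = {"up:X00001": "hsa:1", "up:X00002": "hsa:1", "up:X00003": "hsa:2"}
by_gene = invert_mapping(sample)
```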

property databases

Returns list of valid KEGG databases.

dbinfo(database='kegg')[source]

Displays the current statistics of a given database

Parameters:

database (str) – can be one of: kegg (default), brite, module, disease, drug, environ, ko, genome, compound, glycan, reaction, rpair, rclass, enzyme, genomes, genes, ligand or any organismIds.

from bioservices import KEGG
s = KEGG()
s.dbinfo("hsa") # human organism
s.dbinfo("T01001") # same as above
s.dbinfo("pathway")

Changed in version 1.4.1: renamed the info method to dbinfo() because info clashed with the info() method of the logging framework.

property drugIds

returns list of drug Ids

See also

list()

entry(dbentries)[source]

Retrieve entry

There is a weblink service (see http://www.genome.jp/kegg/rest/weblink.html). Since it is equivalent to get(), it is not implemented for now.

property enzymeIds

returns list of enzyme Ids

See also

list()

find(database, query, option=None)[source]

finds entries with matching query keywords or other query data in a given database

Parameters:
  • database (str) – can be one of pathway, module, disease, drug, environ, ko, genome, compound, glycan, reaction, rpair, rclass, enzyme, genes, ligand or an organism code (see organismIds attributes) or T number (see organismTnumbers attribute).

  • query (str) – See examples

  • option (str) – If option provided, database can be only ‘compound’ or ‘drug’. Option can be ‘formula’, ‘exact_mass’ or ‘mol_weight’

Note

Keyword search against brite is not supported. Use /list/brite to retrieve a short list.

# search for pathways that contain Viral in the definition
s.find("pathway", "Viral")
# for keywords "shiga" and "toxin"
s.find("genes", "shiga+toxin")
# for keywords "shiga toxin"
s.find("genes", '"shiga toxin"')
# for chemical formula "C7H10O5"
s.find("compound", "C7H10O5", "formula")
# for chemical formula containing "O5" and "C7"
s.find("compound", "O5C7","formula")
# for 174.045 <= exact mass < 174.055
s.find("compound", "174.05","exact_mass")
# for 300 <= molecular weight <= 310
s.find("compound", "300-310","mol_weight")
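The exact_mass example above suggests that the query is matched after rounding to the number of decimals given. Assuming that rule holds for other precisions too (an extrapolation from the single documented example), the implied interval can be computed as follows; exact_mass_window is a hypothetical helper.

```python
from decimal import Decimal


def exact_mass_window(query):
    """Return (low, high) such that low <= mass < high matches the query.

    Assumes matching by rounding to the precision of the query string.
    """
    value = Decimal(query)
    # Half of one unit in the last decimal place of the query
    half_step = Decimal(1).scaleb(value.as_tuple().exponent) / 2
    return value - half_step, value + half_step


low, high = exact_mass_window("174.05")
```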
get(dbentries, option=None, parse=False)[source]

Retrieves given database entries

Parameters:
  • dbentries (str) – KEGG database entries involving the following databases: pathway, brite, module, disease, drug, environ, ko, genome, compound, glycan, reaction, rpair, rclass, enzyme or any organism using the KEGG organism code (see organismIds attributes) or T number (see organismTnumbers attribute).

  • option (str) – one of: aaseq, ntseq, mol, kcf, image, kgml

Note

You can add the option at the end of dbentries, in which case the parameter option must not be used (see example).

from bioservices import KEGG
s = KEGG()
# retrieves a compound entry and a glycan entry
s.get("cpd:C01290+gl:G00092")
# same as above
s.get("C01290+G00092")
# retrieves a human gene entry and an E.coli O157 gene entry
s.get("hsa:10458+ece:Z5100")
# retrieves amino acid sequences of a human gene and an E.coli O157 gene
s.get("hsa:10458+ece:Z5100/aaseq")
# retrieves the image file of a pathway map
s.get("hsa05130/image")
# same as above
s.get("hsa05130", "image")

# to retrieve a genome, you must precede the entry with gn:
s.get('gn:T01001')
# to retrieve a network, you must precede it with network:
s.get('network:nt06214')

Another example here below shows how to save the image of a given pathway:

res = s.get("hsa05130/image")
# same as: res = s.get("hsa05130", "image")
# the image is binary data, so open the file in binary mode
with open("test.png", "wb") as f:
    f.write(res)

Note

The input is limited to 10 entries (KEGG restriction).
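To stay within the 10-entry restriction, a longer list of identifiers can be cut into batches before calling get(). A minimal sketch, where batch_dbentries is a hypothetical helper and the identifiers are generated only to show the batching:

```python
def batch_dbentries(ids, size=10):
    """Join identifiers with '+' in groups of at most `size` entries."""
    return ["+".join(ids[i:i + size]) for i in range(0, len(ids), size)]


# 23 illustrative compound identifiers: C00001 ... C00023
ids = ["C%05d" % n for n in range(1, 24)]
batches = batch_dbentries(ids)
```

Each element of batches is a valid dbentries string, so the entries could then be retrieved one batch at a time, for instance with [s.get(b) for b in batches].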

get_pathway_by_gene(gene, organism)[source]

Search for pathways that contain a specific gene

Parameters:
  • gene (str) – a valid gene Id

  • organism (str) – a valid organism (e.g., hsa)

Returns:

list of pathway Ids that contain the gene

>>> s.get_pathway_by_gene("7535", "hsa")
['path:hsa04064', 'path:hsa04650', 'path:hsa04660', 'path:hsa05340']
property glycanIds

Returns list of glycan Ids

See also

list()

isOrganism(org)[source]

Checks if org is a KEGG organism

Parameters:

org (str) – a KEGG organism code or T number

Returns:

True if org is in the KEGG organism list (code or Tnumber)

>>> from bioservices import KEGG
>>> s = KEGG()
>>> s.isOrganism("hsa")
True
property koIds

returns list of ko Ids

See also

list()

link(target, source)[source]

Find related entries by using database cross-references

Parameters:
  • target (str) – the target KEGG database or organism (see below for the list).

  • source (str) – the source KEGG database or organism (see below for the list) or a valid dbentries involving one of the database; see below for details.

The valid list of databases is pathway, brite, module, disease, drug, environ, ko, genome, compound, glycan, reaction, rpair, rclass, enzyme

# KEGG pathways linked from each of the human genes
s.link("pathway", "hsa")
# human genes linked from each of the KEGG pathways
s.link("hsa", "pathway")
# KEGG pathways linked from a human gene and an E. coli O157 gene.
s.link("pathway", "hsa:10458+ece:Z5100")
list(query, organism=None)[source]

Returns a list of entry identifiers and associated definition for a given database or a given set of database entries

Parameters:
  • query (str) – can be one of pathway, brite, module, disease, drug, environ, ko, genome, compound, glycan, reaction, rpair, rclass, enzyme, organism or an organism from the organismIds attribute or a valid dbentry (see below). If a dbentry query is provided, organism should not be used!

  • organism (str) – a valid organism identifier that can be provided. If so, database can be only “pathway” or “module”. If not provided, the default value is chosen (organism)

Returns:

A string with a structure that depends on the query

Here is an example that shows how to extract the pathways IDs related to the hsa organism:

>>> s = KEGG()
>>> res = s.list("pathway", organism="hsa")
>>> pathways = [x.split()[0] for x in res.strip().split("\n")]
>>> len(pathways)  # as of Dec 2012
261

Note, however, that there are convenient aliases to some of the databases. For instance, the pathway Ids can also be retrieved as a list from the pathwayIds attribute (after defining the organism attribute).

Note

If you set the query to a valid organism, then the second argument organism is irrelevant and ignored.

Note

If the query is neither a database nor an organism, it is assumed to be a valid dbentries string, and the maximum number of entries is 100.

Other examples:

s.list("pathway")             # returns the list of reference pathways
s.list("pathway", "hsa")      # returns the list of human pathways
s.list("organism")            # returns the list of KEGG organisms with taxonomic classification
s.list("hsa")                 # returns the entire list of human genes
s.list("T01001")              # same as above
s.list("hsa:10458+ece:Z5100") # returns the list of a human gene and an E.coli O157 gene
s.list("cpd:C01290+gl:G00092")# returns the list of a compound entry and a glycan entry
s.list("C01290+G00092")       # same as above
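Since list() returns a plain string, a common follow-up step is to turn it into a dictionary of identifiers and definitions. The sketch below assumes one entry per line with a tab between identifier and definition; parse_listing is a hypothetical helper and the sample text is placeholder data, not real KEGG output.

```python
def parse_listing(text):
    """Convert 'id<TAB>definition' lines into a {id: definition} dict."""
    entries = {}
    for line in text.strip().split("\n"):
        if not line:
            continue
        identifier, _, definition = line.partition("\t")
        entries[identifier.strip()] = definition.strip()
    return entries


# Placeholder text shaped like the assumed tab-separated output
sample = "path:hsa00001\tExample pathway A\npath:hsa00002\tExample pathway B\n"
pathways = parse_listing(sample)
```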
lookfor_organism(query)[source]

Look for a specific organism

Parameters:

query (str) – your search term; the search is case-insensitive

Returns:

a list of definitions that match the query

lookfor_pathway(query)[source]

Look for a specific pathway

Parameters:

query (str) – your search term; the search is case-insensitive

Returns:

a list of definitions that match the query

property moduleIds

returns list of module Ids for the default organism.

organism must be set.

s = KEGG()
s.organism = "hsa"
s.moduleIds
property organism

returns the current default organism

property organismIds

Returns list of organism Ids

property organismTnumbers

returns list of organisms (T numbers)

See also

list()

parse(entry)[source]

See KEGGParser for details

Parse entry returned by get()

k = KEGG()
res = k.get("hsa04150")
d = k.parse(res)
parse_kgml_pathway(pathwayId, res=None)[source]

Parses the pathway in KGML format and returns a dictionary (relations and entries)

Parameters:
  • pathwayId (str) – a valid pathwayId e.g. hsa04660

  • res (str) – if you already have the output of the query get(pathwayId), you can provide it, otherwise it is queried.

Returns:

a dictionary with relations and entries as keys. The value of relations is a list of relations, each relation being a dictionary with the keys entry1, entry2, link, value and name. The value of entries is a list of dictionaries as well; each entry gives more details about an entry referenced in the relations. See the example below for details.

Relation name values include e.g. activation, inhibition, phosphorylation, binding/association. Relations that carry no subtype information have name and value set to None.

Entries with type equal to "map" represent links to sub-pathways embedded in the current map. Their name field contains the KEGG pathway identifier (e.g. "path:hsa04010").

>>> res = s.parse_kgml_pathway("hsa04660")

>>> # inspect the name field of each relation
>>> set([x['name'] for x in res['relations']])

>>> res['relations'][-1]
{'entry1': u'15',
 'entry2': u'13',
 'link': u'PPrel',
 'name': u'phosphorylation',
 'value': u'+p'}

>>> set([x['link'] for x in res['relations']])
set([u'PPrel', u'PCrel'])

>>> # get information about an entry
>>> res['entries'][4]

>>> # look up the gene names for entry1/entry2 in a relation
>>> entry_map = {e['id']: e for e in res['entries']}
>>> rel = res['relations'][0]
>>> entry_map[rel['entry1']]['gene_names']

>>> # find sub-pathway (map) entries embedded in this pathway
>>> sub_pathways = [e for e in res['entries'] if e['type'] == 'map']
>>> [e['name'] for e in sub_pathways]
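The relations and entries can be combined into a simple edge list. The sketch below relies only on the keys documented above (entry1, entry2, name for relations; id and name for entries). relations_to_edges is a hypothetical helper and the sample dictionary is made-up data in the documented shape.

```python
def relations_to_edges(res):
    """Return (source name, relation name, target name) triples."""
    entry_map = {e["id"]: e for e in res["entries"]}
    edges = []
    for rel in res["relations"]:
        source = entry_map[rel["entry1"]]["name"]
        target = entry_map[rel["entry2"]]["name"]
        edges.append((source, rel["name"], target))
    return edges


# Made-up data mimicking the structure returned by parse_kgml_pathway()
sample = {
    "entries": [
        {"id": "1", "type": "gene", "name": "hsa:0001"},
        {"id": "2", "type": "gene", "name": "hsa:0002"},
    ],
    "relations": [
        {"entry1": "1", "entry2": "2", "link": "PPrel",
         "name": "activation", "value": "-->"},
    ],
}
edges = relations_to_edges(sample)
```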

See also

KEGG API

pathway2sif(pathwayId, uniprot=True)[source]

Extract protein-protein interaction from KEGG pathway to a SIF format

Warning

Experimental: not tested on all pathways. This method should be moved to another package such as cellnopt.

Parameters:
  • pathwayId (str) – a valid pathway Id

  • uniprot (bool) – convert to uniprot Id or not (default is True)

Returns:

a list of relations (A 1 B) for activation and (A -1 B) for inhibitions

This can be slow because of the conversion from KEGG Ids to UniProt Ids.

This method can be useful to provide prior knowledge network to software such as CellNOpt (see http://www.cellnopt.org)
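The sign convention described above (1 for activation, -1 for inhibition) can be illustrated with a short sketch. It is not the implementation of pathway2sif(): to_sif_lines is a hypothetical helper, it takes (source, relation name, target) triples, and it drops every relation that is neither an activation nor an inhibition.

```python
# Relation names mapped to SIF signs, following the convention above
SIGNS = {"activation": 1, "inhibition": -1}


def to_sif_lines(edges):
    """Format (source, relation name, target) triples as 'A 1 B' lines."""
    lines = []
    for source, name, target in edges:
        sign = SIGNS.get(name)
        if sign is not None:
            lines.append("%s %d %s" % (source, sign, target))
    return lines


sif = to_sif_lines([
    ("A", "activation", "B"),
    ("B", "inhibition", "C"),
    ("C", "binding/association", "D"),  # ignored: carries no sign
])
```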

property pathwayIds

returns list of pathway Ids for the default organism.

organism must be set.

s = KEGG()
s.organism = "hsa"
s.pathwayIds
property reactionIds

returns list of reaction Ids

save_pathway(pathId, filename, scale=None, keggid={}, params={})[source]

Save KEGG pathway in PNG format

Parameters:
  • pathId (str) – a valid pathway identifier (e.g., "hsa00010")

  • filename (str) – output PNG file path; if None, defaults to "{pathId}.png"

  • scale (float) – optional scale factor for the pathway image

  • keggid (dict) – mapping of KEGG IDs to highlight on the pathway

  • params (dict) – additional POST parameters passed to the KEGG pathway viewer

show_entry(entry)[source]

Opens URL corresponding to a valid entry

s.show_entry("path:hsa05416")
show_module(modId)[source]

Show a given module inside a web browser

Parameters:

modId (str) – a valid module Id. See moduleIds()

Validity of modId is not checked but if wrong the URL will not open a proper web page.

show_pathway(pathId, scale=None, dcolor='pink', keggid={}, show=True)[source]

Show a given pathway inside a web browser

Parameters:
  • pathId (str) – a valid pathway Id. See pathwayIds()

  • scale (int) – you can scale the image with a value between 0 and 100

  • dcolor (str) – set the default background color of nodes

  • keggid (dict) – set color of entries contained in the pathway as key/value pairs; can also be a list, in which case all nodes have the same default color (red)

Note

if scale is provided, dcolor and keggid are ignored.

# show a pathway in the browser
s.show_pathway("path:hsa05416", scale=50)

# Same as above but also highlights some KEGG Ids (red for all)
s.show_pathway("path:hsa05416", dcolor="white",
    keggid=['1525', '1604', '2534'])

# You can refine the colors using a dictionary:
s.show_pathway("path:hsa05416", dcolor="white",
    keggid={'1525':'yellow,red', '1604':'blue,green', '2534':"blue"})
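The keggid argument accepts either a list or a dictionary. One way to picture the documented behavior (a list gives every node the default color red) is the following sketch; normalize_keggid is a hypothetical helper and is not claimed to be how show_pathway() works internally.

```python
def normalize_keggid(keggid, default="red"):
    """Return a {kegg Id: color} dict from a list or a dict of Ids."""
    if isinstance(keggid, dict):
        return dict(keggid)
    return {key: default for key in keggid}


from_list = normalize_keggid(["1525", "1604", "2534"])
from_dict = normalize_keggid({"1525": "yellow,red", "2534": "blue"})
```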
class KEGGParser(verbose=False)[source]

This is an extension of the KEGG class to ease parsing of dbentries

This class provides a generic method parse() that will read the output of a dbentry returned by KEGG.get() and converts it into a dictionary ready to use.

The parse() method parses any entry. It can be a pathway, a gene, a compound…

from bioservices import KEGG
s = KEGG()

# Retrieve a KEGG entry
res = s.get("hsa04150")

# parse it
d = s.parse(res)

As a pedagogical example, you can then further process this dictionary. Here below, we convert the gene Ids found in the pathway into UniProt Ids:

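# Get the KEGG Ids in the pathway
kegg_geneIds = [x for x in d['GENE']]

# Convert them; since version 1.1 conv() returns a dictionary
# (keys are the source Ids, values are the target Ids)
up2kegg = s.conv("hsa", "uniprot")

# Invert the mapping to look up the UniProt Id of each KEGG gene Id
# (if several UniProt Ids map to one gene, the last one is kept)
kegg2up = {kegg: up for up, kegg in up2kegg.items()}
uniprot_geneIds = [kegg2up["hsa:%s" % x] for x in kegg_geneIds]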

However, you could also have done it simply as follows:

kegg_geneIds = [x for x in d['GENE']]
uprot_geneIds = [s.parse(s.get("hsa:"+str(e)))['DBLINKS']["UniProt:"] for e in d['GENE']]

Note

The 2 outputs are slightly different.

parse(res)[source]

Parse any output returned by KEGG.get()

Parameters:

res (str) – output of a KEGG.get().

Returns:

a dictionary. Keys are those found in the KEGG entry (e.g., REACTION, ENTRY, EQUATION, …). The type of each value depends on the key: it can be a string, a list (generally of strings), a dictionary, or a float. Depending on the type of the entry (e.g., module, pathway), the type of the value may also differ (e.g., REACTION can be either a list of reactions or a dictionary depending on the content).

>>> # Parses a drug entry
>>> res = s.get("dr:D00001")
>>> # Parses a pathway entry
>>> res = s.get("path:hsa10584")
>>> # Parses a module entry
>>> res = s.get("md:hsa_M00554")
>>> # Parses a disease entry
>>> res = s.get("ds:H00001")
>>> # Parses an environ entry
>>> res = s.get("ev:E00001")
>>> # Parses Orthology entry
>>> res = s.get("ko:K00001")
>>> # Parses a Genome entry
>>> res = s.get('genome:T00001')
>>> # Parses a gene entry
>>> res = s.get("hsa:1525")
>>> # Parses a compound entry
>>> res = s.get("cpd:C00001")
>>> # Parses a glycan entry
>>> res = s.get("gl:G00001")
>>> # Parses a reaction entry
>>> res = s.get("rn:R00001")
>>> # Parses a rpair entry
>>> res = s.get("rp:RP00001")
>>> # Parses a rclass entry
>>> res = s.get("rc:RC00001")
>>> # Parses an enzyme entry
>>> res = s.get('ec:1.1.1.1')

>>> d = s.parse(res)
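To give an idea of what such parsing involves, here is a deliberately simplified sketch of reading a KEGG flat-file entry. It is not the KEGGParser implementation: parse_flat_entry is a hypothetical function, it assumes that a field starts with an upper-case key in the first column and that continuation lines are indented, and it keeps every value as a list of strings. The sample text is an abbreviated, illustrative entry.

```python
def parse_flat_entry(text):
    """Group the lines of a flat-file entry by field name."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line.startswith("///"):  # end-of-entry marker
            break
        if not line.strip():
            continue
        if not line[0].isspace():
            # New field: the key is the first word of the line
            key, _, value = line.partition(" ")
            fields[key] = [value.strip()]
        elif key is not None:
            # Indented line: continuation of the current field
            fields[key].append(line.strip())
    return fields


sample = (
    "ENTRY       C00001\n"
    "NAME        H2O;\n"
    "            Water\n"
    "FORMULA     H2O\n"
    "///\n"
)
fields = parse_flat_entry(sample)
```

The real parse() method goes further and converts each field into a type suited to its content, as described above.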

8.18. HGNC

Interface to HUGO/HGNC web services

class HGNC(verbose=False, cache=False)[source]

Wrapper to the genenames web service

See details at http://www.genenames.org/help/rest-web-service-help

Constructor

Parameters:
  • verbose (bool) – set to True to get more logging output

  • cache (bool) – set to True to enable HTTP caching

fetch(database, query, frmt='json')[source]

Retrieve particular records from a searchable field.

Parameters:
  • database (str) – a valid searchable field name (see searchable_fields)

  • query (str) – the exact value to look up; no wildcards accepted

  • frmt (str) – response format (default "json")

Returns:

JSON object with fields as listed in stored_fields

>>> h = HGNC()
>>> h.fetch('symbol', 'ZNF3')
>>> h.fetch('alias_name', 'A-kinase anchor protein, 350kDa')
get_info(frmt='json')[source]

Request information about the service.

Returns metadata including when the server was last updated (lastModified), the number of documents (numDoc), which fields can be queried using search and fetch (searchableFields), and which fields may be returned by fetch (storedFields).

Parameters:

frmt (str) – response format (default "json")

Returns:

dict with service metadata

search(database_or_query=None, query=None, frmt='json')[source]

Search a searchable field (database) for a pattern

The search request is more powerful than fetch for querying the database, but search returns only the fields hgnc_id, symbol and score. This is because the tool is mainly intended to query the server to find possible entries of interest or to check data (such as your own symbols) rather than to fetch information about the genes. If you want to retrieve all the data for a set of genes from the search result, you can use the hgnc_id returned by search to issue a fetch request by hgnc_id.

Parameters:
  • database_or_query (str) – field name to search (see searchable_fields), or a free-text query if query is omitted (searches all fields)

  • query (str) – the pattern to search for; supports wildcards (*, ?) and boolean operators (AND, OR, NOT)

  • frmt (str) – response format (default "json")

Returns:

JSON object with hgnc_id, symbol, and score for each hit

# Search all searchable fields for the term BRAF
h.search('BRAF')

# Return all records that have symbols that start with ZNF
h.search('symbol', 'ZNF*')

# Return all records that have symbols that start with ZNF
# followed by one and only one character (e.g. ZNF3)
# As of Nov 2015, this works neither here nor in the
# official documentation
h.search('symbol', 'ZNF?')

# search for symbols starting with ZNF that have been approved
# by HGNC
h.search('symbol', 'ZNF*+AND+status:Approved')

# return ZNF3 and ZNF12
h.search('symbol', 'ZNF3+OR+ZNF12')

# Return all records that have symbols that start with ZNF which
# are not approved (ie entry withdrawn)
h.search('symbol', 'ZNF*+NOT+status:Approved')
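The search-then-fetch workflow described above can be sketched as follows. Several things here are assumptions rather than documented facts: the helper fetch_hits is hypothetical, the hits are assumed to sit under response → docs in the returned JSON, and the StubClient class only stands in for an HGNC instance so that the example runs offline with placeholder data.

```python
def fetch_hits(client, field, pattern):
    """Search a field, then fetch the full record of every hit."""
    found = client.search(field, pattern)
    records = []
    # Assumed layout: hits are listed under found["response"]["docs"]
    for doc in found["response"]["docs"]:
        full = client.fetch("hgnc_id", doc["hgnc_id"])
        records.extend(full["response"]["docs"])
    return records


class StubClient:
    """Offline stand-in for HGNC returning placeholder data."""

    def search(self, field, pattern):
        hit = {"hgnc_id": "HGNC:0", "symbol": "DEMO1", "score": 1.0}
        return {"response": {"docs": [hit]}}

    def fetch(self, field, value):
        record = {"hgnc_id": value, "symbol": "DEMO1", "name": "placeholder"}
        return {"response": {"docs": [record]}}


records = fetch_hits(StubClient(), "symbol", "DEMO*")
```

With a real connection, an HGNC() instance would be passed instead of StubClient(), provided the response layout matches the assumption above.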

8.19. Intact (complex)

This module provides a class IntactComplex

class IntactComplex(verbose=False, cache=False)[source]

Interface to the Intact service

>>> from bioservices import IntactComplex
>>> u = IntactComplex()

Constructor IntactComplex

Parameters:

verbose – set to False to prevent informative messages

details(query)[source]

Return details about a complex

Parameters:

query (str) – identifier of the complex (e.g., EBI-1163476)

search(query, frmt='json', facets=None, first=None, number=None, filters=None)[source]

Search for a complex inside intact complex.

Parameters:
  • query (str) – the query (e.g., ndc80)

  • frmt (str) – Defaults to json (could be a Pandas data frame if Pandas is installed; set frmt to ‘pandas’)

  • facets (str) – lists of facets as a string (separated by comma)

  • first (int) – offset into results for pagination (default None = 0)

  • number (int) – number of results to return (default None = server default)

  • filters (str) – filter expression string (e.g., 'species_f:("Homo sapiens")')

s = IntactComplex()
# search for ndc80
s.search('ndc80')

#  Search for ndc80 and facet with the species field:
s.search('ndc80', facets='species_f')

# Search for ndc80 and facet with the species and biological role fields:
s.search('ndc80', facets='species_f,pbiorole_f')

# Search for ndc80, facet with the species and biological role
# fields and filter the species using human:
s.search('Ndc80', first=0, number=10,
    filters='species_f:("Homo sapiens")',
    facets='species_f,ptype_f,pbiorole_f')

# Search for ndc80, facet with the species and biological role
# fields and filter the species using human or mouse:
s.search('Ndc80', first=0, number=10,
    filters='species_f:("Homo sapiens" "Mus musculus")',
    facets='species_f,ptype_f,pbiorole_f')

# Search with a wildcard to retrieve all the information:
s.search('*')

# Search with a wildcard to retrieve all the information and facet
# with the species, biological role and interactor type fields:
s.search('*', facets='species_f,pbiorole_f,ptype_f')

# Search with a wildcard to retrieve all the information, facet with
# the species, biological role and interactor type fields and filter
# the interactor type using small molecule:
s.search('*', facets='species_f,pbiorole_f,ptype_f',
    filters='ptype_f:("small molecule")')

# Search with a wildcard to retrieve all the information, facet with
# the species, biological role and interactor type fields and filter
# the interactor type using small molecule and the species using human:
s.search('*', facets='species_f,pbiorole_f,ptype_f',
    filters='ptype_f:("small molecule"),species_f:("Homo sapiens")')

# Search for GO:0016491 and paginate (first is for the offset and number
# is how many do you want):
s.search('GO:0016491', first=10, number=10)

The organism name used in the filter must be exact. Here is the list found by typing:

res = set(s.search('*', frmt='pandas')['organismName'])
'Bos taurus; 9913',
'Caenorhabditis elegans; 6239',
'Canis familiaris; 9615',
'Drosophila melanogaster; 7227',
'Escherichia coli (strain K12); 83333',
'Gallus gallus; 9031',
'Homo sapiens; 9606',
'Mus musculus; 10090',
'Oryctolagus cuniculus; 9986',
'Rattus norvegicus; 10116',
'Saccharomyces cerevisiae (strain ATCC 204508 / S288c);559292',
'Schizosaccharomyces pombe (strain 972 / ATCC 24843);284812',
'Xenopus laevis; 8355'
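The filter expressions used in the search examples follow a regular pattern: a field name, a colon, and a parenthesized list of quoted values, with several filters joined by commas. A small sketch for building them, where facet_filter and combine_filters are hypothetical helpers modeled on those examples:

```python
def facet_filter(field, values):
    """Build an expression such as species_f:("Homo sapiens" "Mus musculus")."""
    quoted = " ".join('"%s"' % v for v in values)
    return "%s:(%s)" % (field, quoted)


def combine_filters(*filters):
    """Join several filter expressions with commas."""
    return ",".join(filters)


species = facet_filter("species_f", ["Homo sapiens", "Mus musculus"])
both = combine_filters(
    facet_filter("ptype_f", ["small molecule"]),
    facet_filter("species_f", ["Homo sapiens"]),
)
```

The resulting strings can be passed as the filters argument of search().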

8.20. InterPro

Interface to the InterPro web service

class InterPro(verbose=False, cache=False)[source]

Interface to the InterPro service

InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites.

from bioservices import InterPro
i = InterPro()

# Get information about an InterPro entry
entry = i.get_entry("IPR000001")

# Get all entries for a protein
entries = i.get_protein_entries("P00734")

# Search entries by name
results = i.search_entries("kinase")

InterPro integrates signatures from the following member databases:

  • CATH-Gene3D

  • CDD

  • HAMAP

  • MobiDB Lite

  • NCBIfam

  • Panther

  • Pfam

  • PIRSF

  • PRINTS

  • ProSite

  • SFLD

  • SMART

  • SUPFAM

  • TIGRFAMs

Constructor

Parameters:
  • verbose (bool) – prints informative messages (default is off)

  • cache (bool) – use cache (default is off)

>>> from bioservices import InterPro
>>> i = InterPro(verbose=False)
get_entries(page_size=20, page=1)[source]

Retrieve a paginated list of all InterPro entries

Parameters:
  • page_size (int) – number of results per page (default 20)

  • page (int) – page number (default 1)

Returns:

dictionary with results and pagination info

i = InterPro()
results = i.get_entries(page_size=10)
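The paginated methods of this class can be wrapped in a generator that walks through the pages. This is a sketch under assumptions: iter_results is a hypothetical helper, the items of a page are assumed to be stored under a "results" key, and fake_fetch only simulates a paginated method so that the example runs offline.

```python
def iter_results(fetch_page, page_size=20, max_pages=None):
    """Yield items page by page until an empty page is returned."""
    page = 1
    while max_pages is None or page <= max_pages:
        payload = fetch_page(page_size=page_size, page=page)
        # Assumed layout: the items of a page are under "results"
        results = payload.get("results") or []
        if not results:
            break
        for item in results:
            yield item
        page += 1


def fake_fetch(page_size=20, page=1):
    """Simulate a paginated method over 45 placeholder items."""
    data = list(range(45))
    start = (page - 1) * page_size
    return {"results": data[start:start + page_size]}


items = list(iter_results(fake_fetch, page_size=20))
first_page = list(iter_results(fake_fetch, page_size=20, max_pages=1))
```

If the assumption about the "results" key holds, a call such as iter_results(i.get_entries, page_size=10, max_pages=3) would iterate over the first three pages of InterPro entries.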
get_entries_by_member_database(database, page_size=20, page=1)[source]

Retrieve entries from a specific member database

Parameters:
  • database (str) – member database name (e.g. “pfam”, “prosite”)

  • page_size (int) – number of results per page (default 20)

  • page (int) – page number (default 1)

Returns:

dictionary with results and pagination info

i = InterPro()
results = i.get_entries_by_member_database("pfam")
get_entries_by_type(entry_type, page_size=20, page=1)[source]

Retrieve InterPro entries filtered by type

Parameters:
  • entry_type (str) – entry type. One of: family, domain, homologous_superfamily, repeat, site, active_site, binding_site, conserved_site, ptm

  • page_size (int) – number of results per page (default 20)

  • page (int) – page number (default 1)

Returns:

dictionary with results and pagination info

i = InterPro()
results = i.get_entries_by_type("domain")
get_entry(accession)[source]

Retrieve a specific InterPro entry by accession

Parameters:

accession (str) – an InterPro accession (e.g. “IPR000001”)

Returns:

dictionary with entry information

i = InterPro()
entry = i.get_entry("IPR000001")
print(entry["metadata"]["name"])
get_entry_proteomes(accession, page_size=20, page=1)[source]

Retrieve proteomes containing proteins annotated with a given InterPro entry

Parameters:
  • accession (str) – an InterPro accession (e.g. “IPR000001”)

  • page_size (int) – number of results per page (default 20)

  • page (int) – page number (default 1)

Returns:

dictionary with proteomes and pagination info

i = InterPro()
proteomes = i.get_entry_proteomes("IPR000001")
get_entry_structures(accession, page_size=20, page=1)[source]

Retrieve structures associated with a given InterPro entry

Parameters:
  • accession (str) – an InterPro accession (e.g. “IPR000001”)

  • page_size (int) – number of results per page (default 20)

  • page (int) – page number (default 1)

Returns:

dictionary with structures and pagination info

i = InterPro()
structures = i.get_entry_structures("IPR000001")
get_entry_taxonomy(accession, page_size=20, page=1)[source]

Retrieve taxonomy distribution of proteins annotated with an InterPro entry

Parameters:
  • accession (str) – an InterPro accession (e.g. “IPR000001”)

  • page_size (int) – number of results per page (default 20)

  • page (int) – page number (default 1)

Returns:

dictionary with taxonomy distribution

i = InterPro()
taxons = i.get_entry_taxonomy("IPR000001")
get_member_database_entry(database, accession)[source]

Retrieve a specific entry from a member database

Parameters:
  • database (str) – member database name (e.g. “pfam”, “prosite”)

  • accession (str) – accession in the member database (e.g. “PF00001”)

Returns:

dictionary with entry information

The supported member databases are: cathgene3d, cdd, hamap, mobidblt, ncbifam, panther, pfam, pirsf, prints, profile, prosite, sfld, smart, ssf, tigrfam.

i = InterPro()
entry = i.get_member_database_entry("pfam", "PF00001")
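A request with a misspelled database name costs a round trip to the server, so it can be worth checking the name locally first. The following sketch validates a name against the list of supported member databases given above. check_member_database is a hypothetical helper; lower-casing the input is a convenience added here, and whether the service itself is case-sensitive is not stated in this documentation.

```python
SUPPORTED_MEMBER_DATABASES = frozenset({
    "cathgene3d", "cdd", "hamap", "mobidblt", "ncbifam", "panther", "pfam",
    "pirsf", "prints", "profile", "prosite", "sfld", "smart", "ssf", "tigrfam",
})


def check_member_database(name):
    """Return the lower-cased database name, or raise ValueError if it is
    not in the list of supported member databases."""
    key = name.strip().lower()
    if key not in SUPPORTED_MEMBER_DATABASES:
        raise ValueError(
            f"unsupported member database {name!r}; expected one of "
            f"{sorted(SUPPORTED_MEMBER_DATABASES)}"
        )
    return key
```

The returned value can then be passed as the database argument of get_member_database_entry().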
get_protein(accession, database='uniprot')[source]

Retrieve information about a protein

Parameters:
  • accession (str) – a UniProt accession (e.g. “P00734”)

  • database (str) – protein database, currently only “uniprot” is supported (default: “uniprot”)

Returns:

dictionary with protein information

i = InterPro()
protein = i.get_protein("P00734")
print(protein["metadata"]["name"])
get_protein_entries(accession, database='uniprot')[source]

Retrieve InterPro entries associated with a protein

Parameters:
  • accession (str) – a UniProt accession (e.g. “P00734”)

  • database (str) – protein database (default: “uniprot”)

Returns:

dictionary with entries annotated on the protein

i = InterPro()
entries = i.get_protein_entries("P00734")
get_proteins_by_entry(accession, page_size=20, page=1)[source]

Retrieve proteins annotated with a given InterPro entry

Parameters:
  • accession (str) – an InterPro accession (e.g. “IPR000001”)

  • page_size (int) – number of results per page (default 20)

  • page (int) – page number (default 1)

Returns:

dictionary with proteins and pagination info

i = InterPro()
proteins = i.get_proteins_by_entry("IPR000001")
get_proteome(accession, database='uniprot')[source]

Retrieve information about a proteome

Parameters:
  • accession (str) – a UniProt proteome accession (e.g. “UP000005640”)

  • database (str) – proteome database (default: “uniprot”)

Returns:

dictionary with proteome information

i = InterPro()
proteome = i.get_proteome("UP000005640")
get_set(database, accession)[source]

Retrieve information about a set (e.g. a Pfam clan)

Parameters:
  • database (str) – member database (e.g. “pfam” for Pfam clans)

  • accession (str) – set accession (e.g. “CL0001” for a Pfam clan)

Returns:

dictionary with set information

i = InterPro()
pfam_clan = i.get_set("pfam", "CL0001")
get_structure(accession, database='pdb')[source]

Retrieve information about a structure

Parameters:
  • accession (str) – a PDB accession (e.g. “1t2v”)

  • database (str) – structure database, currently only “pdb” is supported (default: “pdb”)

Returns:

dictionary with structure information

i = InterPro()
structure = i.get_structure("1t2v")
get_taxonomy(taxon_id, database='uniprot')[source]

Retrieve taxonomy information

Parameters:
  • taxon_id (str) – NCBI taxonomy ID (e.g. “9606” for human)

  • database (str) – taxonomy database (default: “uniprot”)

Returns:

dictionary with taxonomy information

i = InterPro()
taxon = i.get_taxonomy("9606")
print(taxon["metadata"]["scientific_name"])
search_entries(search, page_size=20, page=1)[source]

Search InterPro entries by name or description

Parameters:
  • search (str) – search term

  • page_size (int) – number of results per page (default 20)

  • page (int) – page number (default 1)

Returns:

dictionary with results and pagination info

i = InterPro()
results = i.search_entries("kinase")

8.21. MUSCLE

Interface to the MUSCLE web service

class MUSCLE(verbose=False)[source]

Interface to the MUSCLE service.

>>> from bioservices import MUSCLE
>>> m = MUSCLE(verbose=False)
>>> with open('filename', 'r') as fh:
...     sequences_fasta = fh.read()
>>> jobid = m.run(frmt="fasta", sequence=sequences_fasta,
...               email="name@provider")
>>> m.get_result(jobid, "out")

Warning

It is very important to provide a real e-mail address; otherwise your job will very likely be killed and your IP address, organisation or entire domain may be blacklisted.

Here is a similar example that uses the UniProt class provided in bioservices to fetch the FASTA sequences:

>>> from bioservices import UniProt, MUSCLE
>>> u = UniProt(verbose=False)
>>> f1 = u.get_fasta("P18413")
>>> f2 = u.get_fasta("P18412")
>>> m = MUSCLE(verbose=False)
>>> jobid = m.run(frmt="fasta", sequence=f1+f2, email="name@provider")
>>> m.get_result(jobid, "out")
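Concatenating records with f1+f2 only yields a valid multi-FASTA input if each record ends with a newline. The sketch below is one way to make that explicit before submitting a job. join_fasta is a hypothetical helper, not part of bioservices.

```python
def join_fasta(*records):
    """Concatenate FASTA records into one multi-FASTA string.

    Each record is stripped of surrounding whitespace and terminated with
    exactly one newline, so that the header of the next record always
    starts on its own line.
    """
    cleaned = [record.strip() for record in records if record.strip()]
    for record in cleaned:
        if not record.startswith(">"):
            raise ValueError("each FASTA record must start with '>'")
    return "".join(record + "\n" for record in cleaned)
```

With the variables from the example above, the sequence argument of run() would become join_fasta(f1, f2).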
get_parameter_details(parameterId)[source]

Get detailed information about a parameter.

Parameters:

parameterId (str) – a valid parameter name (see parameters)

Returns:

a dict with parameter details including name, description, and allowed values

For example:

>>> m.get_parameter_details("format")
get_parameters()[source]

List parameter names.

Returns:

An XML document containing a list of parameter names.

>>> from bioservices import MUSCLE
>>> m = MUSCLE()
>>> res = m.get_parameters()
>>> print(res)

See also

parameters to get a list of the parameters without need to process the XML output.

get_result(jobid, result_type)[source]

Get the job result of the specified type.

Parameters:
  • jobid (str) – a job identifier returned by run().

  • result_type (str) – type of result to retrieve. See get_result_types().

get_result_types(jobid)[source]

Get available result types for a finished job.

Parameters:

jobid (str) – a job identifier returned by run().

Returns:

a list of result type identifier strings (e.g., ["out", "sequence", "aln-fasta"])

get_status(jobid)[source]

Get status of a submitted job

Parameters:

jobid (str) – a job identifier returned by run().

Returns:

A string giving the job status (e.g. "FINISHED").

The values for the status are:

  • RUNNING: the job is currently being processed.

  • FINISHED: job has finished, and the results can then be retrieved.

  • ERROR: an error occurred attempting to get the job status.

  • FAILURE: the job failed.

  • NOT_FOUND: the job cannot be found.
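The status values above lend themselves to a simple polling loop. The class already offers wait() for this purpose; the sketch below only illustrates what such a loop can look like when a timeout is also wanted. poll_until_done and the "TIMEOUT" return value are inventions of this example. Treating ERROR as terminal is a design choice made here, since an error while fetching the status could in principle be transient.

```python
import time

# Status strings taken from the list above.
TERMINAL_STATUSES = {"FINISHED", "ERROR", "FAILURE", "NOT_FOUND"}


def poll_until_done(service, jobid, interval=5, timeout=600, sleep=time.sleep):
    """Poll service.get_status(jobid) until a terminal status is reached.

    Returns the last status seen. "TIMEOUT" is a value invented for this
    sketch; the service itself does not report it.
    """
    waited = 0
    while True:
        status = service.get_status(jobid)
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout:
            return "TIMEOUT"
        sleep(interval)
        waited += interval
```

The sleep argument exists so that the loop can be exercised without real delays; in normal use the default time.sleep is sufficient.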

property parameters
run(frmt=None, sequence=None, tree='none', email=None)[source]

Submit a job with the specified parameters.

Compulsory arguments

Parameters:
  • frmt (str) – input format (e.g., fasta)

  • sequence (str) – query sequence. The use of fasta formatted sequence is recommended.

  • tree (str) – tree type (‘none’,’tree1’,’tree2’)

  • email (str) – a valid email address. Will be checked by the service itself.

Returns:

A jobid that can be analysed with get_result(), get_status(), …

The up-to-date values accepted for each of these parameters can be retrieved from get_parameter_details().

For instance:

from bioservices import MUSCLE
m = MUSCLE()
m.get_parameter_details("tree")

Example:

jobid = m.run(frmt="fasta",
     sequence=sequence_example,
     email="test@yahoo.fr")

frmt can be a list of formats:

frmt=['fasta','clw','clwstrict','html','msf','phyi','phys']

The returned object is a jobid whose status can be checked. The job must be finished before its results can be retrieved or analysed.

See also

get_result()

wait(jobId, checkInterval=5, verbose=True)[source]

Checks the status of a job repeatedly while it is running.

Parameters:
  • jobId (str) – a job identifier returned by run().

  • checkInterval (int) – interval between status checks in seconds (default 5).

8.22. MyGeneInfo

Interface to the mygeneinfo web Service.

class MyGeneInfo(verbose=False, cache=False)[source]

Interface to mygene.info service

>>> from bioservices import MyGeneInfo
>>> s = MyGeneInfo()

Constructor

Parameters:
  • verbose (bool) – prints informative messages (default is off)

  • cache (bool) – set to True to enable HTTP caching

get_genes(ids, fields='symbol,name,taxid,entrezgene,ensemblgene', species=None, dotfield=True, email=None)[source]

Get matching gene objects for a list of gene ids

Parameters:
  • ids – list of geneinfo IDs

  • fields (str) – a comma-separated list of fields to return for each matching gene hit. The supported field names can be found in any gene object (e.g. http://mygene.info/v3/gene/1017). Dot notation is also supported, e.g., you can pass “refseq.rna”. If “fields=all”, all available fields are returned. Default: “symbol,name,taxid,entrezgene,ensemblgene”.

  • species (str) – limits the gene hits to the given species. You can use common names for nine common species (human, mouse, rat, fruitfly, nematode, zebrafish, thale-cress, frog and pig). For all other species, provide their taxonomy IDs. Multiple species can be passed using a comma as separator. Default: human,mouse,rat.

  • dotfield – controls the format of the returned fields when the “fields” parameter contains dot notation, e.g. “fields=refseq.rna”. If True, the returned data object contains a single “refseq.rna” field; if False, it contains a single “refseq” field with an “rna” sub-field. Default: True.

  • email (str) – if you are a regular user of this service, the mygeneinfo maintainers encourage you to provide an email address so that they can better track usage or follow up with you.

mgi = MyGeneInfo()
mgi.get_genes("301345,22637")
# 301345 is a rat gene and 22637 is a mouse gene. Restricting the search
# to mouse returns a 'notfound' entry for the first ID and the expected
# entry for the second.
mgi.get_genes("301345,22637", species="mouse")
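The dotfield parameter only changes the shape of the returned data, not its content. The sketch below illustrates the two shapes with plain dictionaries by converting flat dotted keys (as described for dotfield=True) into nested fields (as described for dotfield=False). nest_dotted_keys is a hypothetical helper and is not taken from bioservices; it does not contact the service.

```python
def nest_dotted_keys(flat):
    """Convert {"refseq.rna": value} into {"refseq": {"rna": value}}.

    This mirrors the difference described for dotfield=True (flat keys)
    versus dotfield=False (nested fields). It is only an illustration of
    the two shapes.
    """
    nested = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = nested
        for name in parents:
            # Create intermediate dictionaries on demand.
            node = node.setdefault(name, {})
        node[leaf] = value
    return nested
```

Keys without a dot, such as "symbol", are copied unchanged.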
get_metadata()[source]

Return metadata about the MyGeneInfo service (e.g., species, build dates).

Returns:

a dict with service metadata

get_one_gene(geneid, fields='symbol,name,taxid,entrezgene,ensemblgene', dotfield=True, email=None)[source]

Get matching gene objects for one gene id

Parameters:
  • geneid – a valid gene ID

  • fields (str) – a comma-separated list of fields to return for each matching gene hit. The supported field names can be found in any gene object (e.g. http://mygene.info/v3/gene/1017). Dot notation is also supported, e.g., you can pass “refseq.rna”. If “fields=all”, all available fields are returned. Default: “symbol,name,taxid,entrezgene,ensemblgene”.

  • dotfield – controls the format of the returned fields when the “fields” parameter contains dot notation, e.g. “fields=refseq.rna”. If True, the returned data object contains a single “refseq.rna” field; if False, it contains a single “refseq” field with an “rna” sub-field. Default: True.

  • email (str) – if you are a regular user of this service, the mygeneinfo maintainers encourage you to provide an email address so that they can better track usage or follow up with you.

mgi = MyGeneInfo()
mgi.get_one_gene("301345")
get_one_query(query, email=None, dotfield=True, fields='symbol,name,taxid,entrezgene,ensemblgene', species='human,mouse,rat', size=10, _from=0, sort=None, facets=None, entrezonly=False, ensemblonly=False)[source]

Make a gene query and return the list of matching genes. The service also supports JSONP and CORS.

Parameters:
  • query (str) – Query string. Examples: “CDK2”, “NM_052827”, “204639_at”, “chr1:151,073,054-151,383,976”, “hg19.chr1:151073054-151383976”. The detailed query syntax is described in the mygene.info documentation.

  • fields (str) – a comma-separated list of fields to return for each matching gene hit. The supported field names can be found in any gene object (e.g. http://mygene.info/v3/gene/1017). Dot notation is also supported, e.g., you can pass “refseq.rna”. If “fields=all”, all available fields are returned. Default: “symbol,name,taxid,entrezgene,ensemblgene”.

  • species (str) – limits the gene hits to the given species. You can use common names for nine common species (human, mouse, rat, fruitfly, nematode, zebrafish, thale-cress, frog and pig). For all other species, provide their taxonomy IDs. Multiple species can be passed using a comma as separator. Default: human,mouse,rat.

  • size (int) – the maximum number of matching gene hits to return (with a cap of 1000 at the moment). Default: 10.

  • _from (int) – the number of matching gene hits to skip, starting from 0. Combining with “size” parameter, this can be useful for paging. Default: 0.

  • sort – the comma-separated fields to sort on. Prefix with “-” for descending order, otherwise in ascending order. Default: sort by matching scores in descending order.

  • facets (str) – a single field or comma-separated fields to return facets, for example, “facets=taxid”, “facets=taxid,type_of_gene”.

  • entrezonly (bool) – when passed as True, the query returns only the hits with valid Entrez gene ids. Default: False.

  • ensemblonly (bool) – when passed as True, the query returns only the hits with valid Ensembl gene ids. Default: False.

  • dotfield – controls the format of the returned fields when the “fields” parameter contains dot notation, e.g. “fields=refseq.rna”. If True, the returned data object contains a single “refseq.rna” field; if False, it contains a single “refseq” field with an “rna” sub-field. Default: True.

  • email (str) – if you are a regular user of this service, the mygeneinfo maintainers encourage you to provide an email address so that they can better track usage or follow up with you.

Returns:

a dict with total, max_score, took, and hits fields
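The size and _from parameters can be combined to page through a large result set. The sketch below shows one way to do this, using only the "total" and "hits" fields named above. iter_query_hits is a hypothetical helper, not part of bioservices, and service stands for a MyGeneInfo instance or any object with a compatible get_one_query method.

```python
def iter_query_hits(service, query, size=10, limit=None, **kwargs):
    """Yield hits from service.get_one_query, one page at a time.

    Relies on the "total" and "hits" fields of the returned dict and on
    the size/_from parameters documented above.
    """
    offset = 0
    yielded = 0
    while True:
        page = service.get_one_query(query, size=size, _from=offset, **kwargs)
        hits = page.get("hits", [])
        for hit in hits:
            if limit is not None and yielded >= limit:
                return
            yield hit
            yielded += 1
        offset += len(hits)
        # Stop when the page is empty or every reported hit has been seen.
        if not hits or offset >= page.get("total", 0):
            return
```

Extra keyword arguments such as species or fields are forwarded unchanged to get_one_query().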

get_queries(query, email=None, dotfield=True, scopes='all', species='human,mouse,rat', fields='symbol,name,taxid,entrezgene,ensemblgene')[source]

Make a gene query and return the list of matching genes. The service also supports JSONP and CORS.

Parameters:
  • query (str) – Query string. Examples: “CDK2”, “NM_052827”, “204639_at”, “chr1:151,073,054-151,383,976”, “hg19.chr1:151073054-151383976”. The detailed query syntax is described in the mygene.info documentation.

  • fields (str) – a comma-separated list of fields to return for each matching gene hit. The supported field names can be found in any gene object (e.g. http://mygene.info/v3/gene/1017). Dot notation is also supported, e.g., you can pass “refseq.rna”. If “fields=all”, all available fields are returned. Default: “symbol,name,taxid,entrezgene,ensemblgene”.

  • species (str) – limits the gene hits to the given species. You can use common names for nine common species (human, mouse, rat, fruitfly, nematode, zebrafish, thale-cress, frog and pig). For all other species, provide their taxonomy IDs. Multiple species can be passed using a comma as separator. Default: human,mouse,rat.

  • dotfield – controls the format of the returned fields when the “fields” parameter contains dot notation, e.g. “fields=refseq.rna”. If True, the returned data object contains a single “refseq.rna” field; if False, it contains a single “refseq” field with an “rna” sub-field. Default: True.

  • email (str) – if you are a regular user of this service, the mygeneinfo maintainers encourage you to provide an email address so that they can better track usage or follow up with you.

  • scopes (str) – not documented. Set to ‘all’

get_taxonomy()[source]

Return the taxonomy information from the MyGeneInfo service metadata.

Returns:

a dict mapping species names to their taxonomy IDs

8.23. NCBIblast

Interface to the NCBIBLAST web service

class NCBIblast(verbose=False)[source]

Interface to the NCBIblast service.

>>> from bioservices import NCBIblast
>>> s = NCBIblast(verbose=False)
>>> jobid = s.run(program="blastp", sequence=s._sequence_example,
...     stype="protein", database="uniprotkb", email="name@provider")
>>> s.get_result(jobid, "out")

Warning

It is very important to provide a real e-mail address; otherwise your job will very likely be killed and your IP address, organisation or entire domain may be blacklisted.

When running a BLAST request, a program is required. You can obtain the list using:

>>> s.get_parameter_details("program")
['blastp', 'blastx', 'blastn', 'tblastx', 'tblastn']
  • blastn: Search a nucleotide database using a nucleotide query

  • blastp: Search protein database using a protein query

  • blastx: Search protein database using a translated nucleotide query

  • tblastn: Search translated nucleotide database using a protein query

  • tblastx: Search translated nucleotide database using a translated nucleotide query
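The five programs differ only in the type of the query and of the database, and in whether a translation is involved. The sketch below encodes the list above as a small lookup. choose_program is a hypothetical helper, not part of bioservices; its translated flag only matters when both query and database are nucleotide, since that is the one case where two programs apply.

```python
def choose_program(query_type, database_type, translated=False):
    """Pick a BLAST program name from the query and database types.

    query_type, database_type: "nucleotide" or "protein".
    translated: only relevant when both are nucleotide; True selects
    tblastx (both sides translated), False selects blastn.
    """
    pair = (query_type, database_type)
    if pair == ("nucleotide", "nucleotide"):
        return "tblastx" if translated else "blastn"
    programs = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }
    if pair not in programs:
        raise ValueError(f"unsupported combination: {pair}")
    return programs[pair]
```

The returned string can be passed as the program argument of run().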

NCBIblast constructor

Parameters:

verbose (bool) – prints informative messages

property databases

Returns accepted databases.

get_parameter_details(parameterId)[source]

Get detailed information about a parameter.

Parameters:

parameterId (str) – a valid parameter name (see parameters)

Returns:

a list of accepted values for the parameter

For example:

>>> s.get_parameter_details("matrix")
['BLOSUM45',
 'BLOSUM50',
 'BLOSUM62',
 'BLOSUM80',
 'BLOSUM90',
 'PAM30',
 'PAM70',
 'PAM250']
get_parameters()[source]

List parameter names.

Returns:

An XML document containing a list of parameter names.

Returns:

a list of parameter name strings

>>> from bioservices import NCBIblast
>>> n = NCBIblast()
>>> res = n.get_parameters()
>>> print(res)

See also

parameters to get a list of the parameters without need to process the XML output.

get_result(jobid, result_type)[source]

Get the job result of the specified type.

Parameters:
  • jobid (str) – a job identifier returned by run().

  • result_type (str) – type of result to retrieve. See get_result_types().

Returns:

the raw result content for the given type.

Use the format parameter to retrieve output in different formats and compressed=true to retrieve XML output in compressed form. Format options:

0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored showing identities,
4 = flat query-anchored no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1).

See the NCBI BLAST documentation for details. For example, ‘?format=5&compressed=true’ requests the XML output in compressed form.
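The format codes and the compressed flag end up as a query-string suffix such as the one quoted above. The sketch below builds that suffix and rejects unknown codes. result_query is a hypothetical helper; how get_result() itself forwards these options is not described in this documentation. Restricting compressed to format 5 is this sketch's reading of the text, which mentions compression only for XML output.

```python
from urllib.parse import urlencode

# Format codes as listed above.
BLAST_FORMATS = {
    0: "pairwise",
    1: "query-anchored showing identities",
    2: "query-anchored no identities",
    3: "flat query-anchored showing identities",
    4: "flat query-anchored no identities",
    5: "XML Blast output",
    6: "tabular",
    7: "tabular with comment lines",
    8: "Text ASN.1",
    9: "Binary ASN.1",
    10: "Comma-separated values",
    11: "BLAST archive format (ASN.1)",
}


def result_query(format_code, compressed=False):
    """Build a query-string suffix such as '?format=5&compressed=true'."""
    if format_code not in BLAST_FORMATS:
        raise ValueError(f"unknown format code: {format_code!r}")
    if compressed and format_code != 5:
        # Assumption of this sketch: compression applies to XML output only.
        raise ValueError("compressed output is only described for format 5")
    params = {"format": format_code}
    if compressed:
        params["compressed"] = "true"
    return "?" + urlencode(params)
```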

get_result_types(jobid)[source]

Get available result types for a finished job.

Parameters:

jobid (str) – a job identifier returned by run().

Returns:

a list of result type identifier strings

get_status(jobid)[source]

Get status of a submitted job

Parameters:

jobid (str) – a job identifier returned by run().

Returns:

a string giving the job status (e.g. "FINISHED").

The values for the status are:

  • RUNNING: the job is currently being processed.

  • FINISHED: job has finished, and the results can then be retrieved.

  • ERROR: an error occurred attempting to get the job status.

  • FAILURE: the job failed.

  • NOT_FOUND: the job cannot be found.

property parameters
run(program=None, database=None, sequence=None, stype='protein', email=None, **kargs)[source]

Submit a job with the specified parameters.

Compulsory arguments

Parameters:
  • program (str) – BLAST program to use to perform the search (e.g., blastp)

  • sequence (str) – query sequence. The use of fasta formatted sequence is recommended.

  • database (list) – list of database names to search, or a single string for one database. The output of get_parameter_details(“database”) does not always match the accepted values. For instance, UniProt Knowledgebase should be given as “uniprotkb”.

  • email (str) – a valid email address. Will be checked by the service itself.

Optional arguments. If not provided, a default value will be used

Parameters:
  • stype (str) – query sequence type in ‘dna’, ‘rna’ or ‘protein’ (default is protein).

  • matrix (str) – scoring matrix to be used in the search (e.g., BLOSUM45).

  • gapalign (bool) – perform gapped alignments.

  • alignments (int) – maximum number of alignments displayed in the output.

  • exp – E-value threshold.

  • filter (bool) – low complexity sequence filter to process the query sequence before performing the search.

  • scores (int) – maximum number of scores displayed in the output.

  • dropoff (int) – amount score must drop before extension of hits is halted.

  • match_scores – match/mismatch scores to generate a scoring matrix for nucleotide searches.

  • gapopen (int) – penalty for the initiation of a gap.

  • gapext (int) – penalty for each base/residue in a gap.

  • seqrange – region of the query sequence to use for the search. Default: whole sequence.

Returns:

A jobid that can be analysed with get_result(), get_status(), …

The up-to-date values accepted for each of these parameters can be retrieved from get_parameter_details().

For instance:

from bioservices import NCBIblast
n = NCBIblast()
n.get_parameter_details("program")

Example:

jobid = n.run(program="blastp",
     sequence=n._sequence_example,
     stype="protein",
     database="uniprotkb",
     email="test@yahoo.fr")

Database can be a list of databases:

database=["uniprotkb", "uniprotkb_swissprot"]

The returned object is a jobid whose status can be checked. The job must be finished before its results can be retrieved or analysed.

See also

get_result()

Warning

Database names are not case-sensitive. Spaces in a database name should be replaced by underscores.

Note

The database names returned by the server are not meaningful because they do not map to the values expected by the service. For example, “ENA Sequence Release” should be provided as em_rel.

http://www.ebi.ac.uk/Tools/sss/ncbiblast/help/index-nucleotide.html
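The naming rules above can be applied in code before calling run(). The sketch below lower-cases a name, replaces spaces with underscores, and handles the two mismatches mentioned in this section through an explicit alias table. normalize_database and KNOWN_ALIASES are hypothetical names; the alias table contains only the two pairs given in the text and would need extending for other databases.

```python
# Display name -> accepted value; both pairs come from the text above.
KNOWN_ALIASES = {
    "uniprot knowledgebase": "uniprotkb",
    "ena sequence release": "em_rel",
}


def normalize_database(name):
    """Turn a database display name into the form expected by run()."""
    # Collapse runs of whitespace and ignore case.
    key = " ".join(name.split()).lower()
    if key in KNOWN_ALIASES:
        return KNOWN_ALIASES[key]
    # General rule: case-insensitive, spaces replaced by underscores.
    return key.replace(" ", "_")
```

A name that follows neither the general rule nor an alias will still be rejected by the service, so the helper reduces typos rather than guaranteeing a valid value.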

wait(jobId)[source]

This function checks the status of a jobid while it is running

Parameters:

jobId (str) – a job identifier returned by run().

8.24. NCBIBlastAPI

Interface to the NCBI BLAST URL API

class NCBIBlastAPI(verbose=False, api_key=None)[source]

Interface to NCBI BLAST via NCBI’s own URL API.

Jobs are submitted with run(), polled with get_status() or wait(), and results retrieved with get_result().

Parameters:
  • verbose (bool) – print debug messages (default False).

  • api_key – NCBI API key. Raises the rate limit from 3 to 10 requests per second. Obtain one at https://www.ncbi.nlm.nih.gov/account/

get_result(rid, format_type='XML')[source]

Retrieve results for a finished job.

Parameters:
  • rid (str) – request ID returned by run().

  • format_type (str) – output format. "XML" (default) returns standard BLAST XML; "Text" returns the pairwise text report; "Tabular" returns tab-separated hits; "JSON2" returns JSON.

Returns:

result content as a string.

Return type:

str

Raises:

RuntimeError – if the job is not yet ready.

get_status(rid)[source]

Return the current status of a submitted job.

Parameters:

rid (str) – request ID returned by run().

Returns:

one of "WAITING", "READY", "FAILED", "UNKNOWN".

Return type:

str

run(program, database, sequence, email, evalue='1e-10', hitlist_size=100, **kwargs)[source]

Submit a BLAST job to NCBI and return the request identifier.

Parameters:
  • program (str) – BLAST program — one of blastn, blastp, blastx, tblastn, tblastx.

  • database (str) – target database (e.g. "nt", "nr").

  • sequence (str) – query sequence in FASTA or bare sequence format.

  • email (str) – contact address forwarded to NCBI (required by their usage policy).

  • evalue (str) – E-value threshold (default "1e-10").

  • hitlist_size (int) – maximum number of hits to return (default 100).

  • kwargs – additional NCBI BLAST parameters forwarded verbatim, e.g. WORD_SIZE, FILTER, GAPCOSTS, MATRIX_NAME, MEGABLAST.

Returns:

(rid, rtoe) — the NCBI request ID and estimated wait time in seconds.

Return type:

tuple[str, int]

Example:

from bioservices import NCBIBlastAPI

b = NCBIBlastAPI()
rid, rtoe = b.run(
    program="blastn",
    database="nt",
    sequence="ATGAAAGCAATTTTCGTACTGAAAGGTTTT",
    email="you@example.org",
)
wait(rid, rtoe=None, timeout=600)[source]

Block until the job identified by rid is finished.

Parameters:
  • rid (str) – request ID returned by run().

  • rtoe (int) – estimated wait time in seconds returned by run(). When provided, the first poll is delayed by rtoe seconds so NCBI is not hit unnecessarily early.

  • timeout (int) – maximum number of seconds to wait before giving up and returning "TIMEOUT" (default: 600 s / 10 min). Set to None to wait indefinitely.

Returns:

final status string ("READY", "FAILED", "UNKNOWN", or "TIMEOUT").

Return type:

str
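The three steps described for this class (submit with run(), block with wait(), fetch with get_result()) can be chained in a small wrapper. The sketch below uses only the signatures documented above. blast_and_fetch is a hypothetical helper, and raising RuntimeError for every status other than "READY" is a choice made for this example.

```python
def blast_and_fetch(client, program, database, sequence, email,
                    format_type="XML", timeout=600, **kwargs):
    """Submit a job, wait for it, and return the result as a string.

    client is expected to be an NCBIBlastAPI instance (or any object with
    the same run/wait/get_result methods).
    """
    rid, rtoe = client.run(program, database, sequence, email, **kwargs)
    # Passing rtoe delays the first poll, as described for wait().
    status = client.wait(rid, rtoe=rtoe, timeout=timeout)
    if status != "READY":
        # "FAILED", "UNKNOWN" and "TIMEOUT" all mean there is nothing to fetch.
        raise RuntimeError(f"BLAST job {rid} ended with status {status}")
    return client.get_result(rid, format_type=format_type)
```

Additional keyword arguments are forwarded to run(), so options such as evalue or hitlist_size can be set through the wrapper.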

8.25. OmniPath Commons

Interface to OmniPath web service

class OmniPath(verbose=False, cache=False)[source]

Interface to the OmniPath service

>>> from bioservices import OmniPath
>>> o = OmniPath()
>>> net = o.get_network()
>>> interactions = o.get_interactions('P00533')

Constructor OmniPath

Parameters:
  • verbose (bool) – set to False to prevent informative messages

  • cache (bool) – set to True to enable HTTP caching

get_about()[source]

Information about the version

get_info()[source]

Currently returns HTML page

get_interactions(query='', frmt='json', fields=[])[source]

Interactions of proteins

Parameters:
  • query (str) – a valid uniprot identifier (e.g. P00533). It can also be a list of uniprot identifiers, or a string with comma-separated identifiers.

  • fields (list) – additional fields to be added to the output (e.g., ["sources", "references"])

  • frmt (str) – format of the output ("json" or "tsv")

Example:

res_one = o.get_interactions('P00533')
res_many = o.get_interactions('P00533,O15117,Q96FE5')
res_many = o.get_interactions(['P00533','O15117','Q96FE5'])

res_one = o.get_interactions('P00533', fields='sources')
res_one = o.get_interactions('P00533', fields=['sources'])
res_one = o.get_interactions('P00533', fields=['sources', 'references'])

You may also leave query as an empty string, but the entire database will then be downloaded. This may take time, and the timeout may need to be increased manually.

If frmt is set to "tsv", the output is a TSV table with a header. If set to "json", a dictionary is returned.
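A TSV table with a header row can be turned into a list of dictionaries with the standard csv module. The sketch below assumes the TSV output arrives as a single string. parse_tsv is a hypothetical helper, and the column names in the usage note are placeholders, since the actual header is not listed in this documentation.

```python
import csv
import io


def parse_tsv(text):
    """Parse a TSV table with a header row into a list of dicts.

    Each dict maps a column name from the header to the value found in
    the corresponding row.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)
```

For a two-column table with a header such as "source" and "target" (placeholder names), each returned row is a dictionary with those two keys.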

get_network(frmt='json')[source]

Get basic statistics about the whole network including sources

get_ptms(query='', ptm_type=None, frmt='json', fields=[])[source]

List enzymes, substrates and PTMs

Parameters:
  • query (str) – a valid uniprot identifier (e.g. P00533). It can also be a list of uniprot identifiers, or a string with comma-separated identifiers.

  • ptm_type (str) – restrict the output to this type of PTM (e.g., "phosphorylation")

  • fields (list) – additional fields to be added to the output (e.g., ["sources", "references"])

  • frmt (str) – format of the output ("json" or "tsv")

get_resources(frmt='json')[source]

Return statistics about the databases and their contents

8.26. Panther

Interface to some part of the Panther web service

class Panther(verbose=True, cache=False)[source]

Interface to Panther pages

>>> from bioservices import Panther
>>> p = Panther()
>>> p.get_supported_genomes()
>>> p.get_ortholog("zap70", 9606)

>>> from bioservices import Panther
>>> p = Panther()
>>> taxon = [x['taxon_id'] for x in p.get_supported_genomes() if "coli" in x['name'].lower()]
>>> # you may also use the get_taxon_id method
>>> taxon = p.get_taxon_id(pattern="coli")
>>> res = p.get_mapping("abrB,ackA,acuI", taxon)

The get_mapping method returns, for each gene ID, the corresponding GO terms. These GO terms may belong to different categories (see get_annotation_datasets()):

  • MF for molecular function

  • BP for biological process

  • PC for protein class

  • CC for cellular component

  • Pathway

Note that results from the web application at http://pantherdb.org/ do not agree with the output of the get_mapping service. Try the dgt gene from E. coli, for example.

Constructor

Parameters:
  • verbose (bool) – set to False to prevent informative messages

  • cache (bool) – set to True to enable HTTP caching

get_annotation_datasets()[source]

Retrieve the list of supported annotation data sets

get_enrichment(gene_list, organism, annotation, enrichment_test='Fisher', correction='FDR', ref_gene_list=None)[source]

Returns over-represented genes

Compares a test gene list to a reference gene list, and determines whether a particular class (e.g. molecular function, biological process, cellular component, PANTHER protein class, the PANTHER pathway or Reactome pathway) of genes is overrepresented or underrepresented.

Parameters:
  • gene_list (str) – comma-delimited gene identifiers to test for enrichment

  • organism (int) – a valid taxon ID

  • enrichment_test – either Fisher or Binomial test

  • correction – correction for multiple testing. Either FDR, Bonferroni, or None.

  • annotation – one of the supported PANTHER annotation data types. See get_annotation_datasets() to retrieve a list of supported annotation data types

  • ref_gene_list – if not specified, the system will use all the genes for the specified organism. Otherwise, a list delimited by comma (maximum of 100000). Identifiers can be any of the following: Ensembl gene identifier, Ensembl protein identifier, Ensembl transcript identifier, Entrez gene id, gene symbol, NCBI GI, HGNC Id, International protein index id, NCBI UniGene id, UniProt accession and UniProt id.

Returns:

a dictionary with the following keys. ‘reference’ contains the organism, ‘input_list’ is the input gene list with unmapped genes. ‘result’ contains the list of candidates.

>>> from bioservices import Panther
>>> p = Panther()
>>> res = p.get_enrichment('zap70,mek1,erk', 9606, "GO:0008150")
>>> # For molecular function, use:
>>> res = p.get_enrichment('zap70,mek1,erk', 9606,
        "ANNOT_TYPE_ID_PANTHER_GO_SLIM_MF")
get_family_msa(family, taxon_list=None)[source]

Returns MSA information for the specified family.

Parameters:
  • family – family ID

  • taxon_list – Zero or more taxon IDs separated by ‘,’.

get_family_ortholog(family, taxon_list=None)[source]

Search for matching orthologs in target organisms

Also return the corresponding position in the target organism sequence. The system searches for matching orthologs in the gene family that contains the search gene associated with the search term.

Parameters:
  • family – Family ID

  • taxon_list – Zero or more taxon IDs separated by ‘,’.

get_homolog_position(gene, organism, position, ortholog_type='all')[source]

Return the homolog at a given position in the family tree.

Parameters:
  • gene (str) – a gene identifier — can be any of: Ensembl gene/protein/transcript ID, Entrez gene id, gene symbol, NCBI GI, HGNC Id, International protein index id, NCBI UniGene id, UniProt accession or UniProt id

  • organism (int) – a valid taxon ID

  • position (int) – 1-based position in the gene family tree

  • ortholog_type (str) – ortholog type of target organism ("LDO" or "all")

get_mapping(gene_list, taxon)[source]

Map identifiers

Parameters:
  • gene_list (str) – comma-delimited gene identifiers (max 1000). Can be any of: Ensembl gene/protein/transcript ID, Entrez gene id, gene symbol, NCBI GI, HGNC Id, International protein index id, NCBI UniGene id, UniProt accession or UniProt id.

  • taxon – one taxon ID. See get_supported_genomes()

If an identifier is not found, information can be found in the unmapped_genes key while found identifiers are in the mapped_genes key.

Warning

Identifiers are dispatched into mapped and unmapped genes depending on whether they were found. If some identifiers are not found, the input gene list and the mapped genes list do not have the same length. The input names are not stored in the output. Developers should be aware of this behaviour.
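Since gene_list accepts at most 1000 identifiers, a longer list has to be split into several requests. The sketch below produces comma-delimited batches of the right size. gene_batches is a hypothetical helper, not part of bioservices; merging the per-batch results is left to the caller because the response layout is only partly described here.

```python
def gene_batches(genes, batch_size=1000):
    """Yield comma-delimited strings of at most batch_size identifiers.

    genes: an iterable of identifier strings; blank entries are skipped.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    cleaned = [g.strip() for g in genes if g.strip()]
    for start in range(0, len(cleaned), batch_size):
        yield ",".join(cleaned[start:start + batch_size])
```

Each yielded string can be passed as the gene_list argument of get_mapping() together with a single taxon ID.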

get_ortholog(gene_list, organism, target_organism=None, ortholog_type='all')[source]

Search for matching orthologs in target organisms.

Searches for matching orthologs in the gene family that contains the search gene associated with the search terms. Returns ortholog genes in target organisms given a search organism, the search terms and a list of target organisms.

Parameters:
  • gene_list (str) – comma-delimited gene identifiers

  • organism (int) – a valid taxon ID

  • target_organism – zero or more taxon IDs separated by ‘,’. See get_supported_genomes()

  • ortholog_type – optional parameter to specify ortholog type of target organism

Returns:

a dictionary with “mapped” and “unmapped” keys, each of them being a list. For each unmapped gene, a dictionary with id and organism is returned. For each mapped gene, a list of orthologs is returned.

get_pathways()[source]

Returns all pathways from pantherdb

get_supported_families(N=1000, progress=True)[source]

Returns the list of supported PANTHER family IDs

This service returns only 1000 items per request, selected by an index. For instance, an index of 1 returns the first 1000 families, an index of 2 returns families 1001 to 2000, and so on. As of 20 Feb 2020, there were about 15,000 families.

This function simplifies your life by calling the service as many times as required. Therefore it returns all families in one go.
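The repeated calls described above follow a common pagination pattern. The sketch below illustrates that pattern only and is not the bioservices implementation: fetch_page is a hypothetical callable standing in for one request to the service, and the family IDs are fake.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect items from a paginated source.

    fetch_page is any callable that takes a 1-based index and returns a list.
    """
    items = []
    index = 1
    while True:
        page = fetch_page(index)
        items.extend(page)
        if len(page) < page_size:  # a short or empty page means we are done
            return items
        index += 1


# Stand-in for the remote service: 2500 fake family IDs served 1000 at a time.
FAKE_FAMILIES = ["FAM%05d" % i for i in range(2500)]


def fake_fetch_page(index, page_size=1000):
    start = (index - 1) * page_size
    return FAKE_FAMILIES[start:start + page_size]


families = fetch_all(fake_fetch_page)
print(len(families))
```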

get_supported_genomes(type=None)[source]

Returns list of supported organisms.

Parameters:

type – can be chrLoc to restrict the search

get_taxon_id(pattern=None)[source]

Return all taxon IDs supported by the service.

If pattern is provided, the names are filtered to keep those that contain the pattern. If only one match is found, we return the name itself, otherwise a list of candidates.

get_tree_info(family, taxon_list=None)[source]

Returns tree topology information and node attributes for the specified family.

Parameters:
  • family – Family ID

  • taxon_list – Zero or more taxon IDs separated by ‘,’.

8.27. Pathway Commons

This module provides a class PathwayCommons

Data is freely available, under the license terms of each contributing database.

class PathwayCommons(verbose=True, cache=False)[source]

Interface to the PathwayCommons service

>>> from bioservices import *
>>> pc2 = PathwayCommons(verbose=False)
>>> res = pc2.get("http://identifiers.org/uniprot/Q06609")

Todo

traverse() method not implemented.

Constructor

Parameters:

verbose (bool) – prints informative messages

property default_extension

Sets the extension of the requests (default is json). Can be ‘json’ or ‘xml’.

get(uri, frmt='BIOPAX')[source]

Retrieves full pathway information for a set of elements

elements can be for example pathway, interaction or physical entity given the RDF IDs. Get commands only retrieve the BioPAX elements that are directly mapped to the ID. Use the traverse() query to traverse BioPAX graph and obtain child/owner elements.

Parameters:
  • uri (str) – valid/existing BioPAX element’s URI (RDF ID). For utility classes that were “normalized”, such as entity references and controlled vocabularies, it is usually an Identifiers.org URL. Multiple IDs can be provided as a list: uri=['http://identifiers.org/uniprot/Q06609', 'http://identifiers.org/uniprot/Q549Z0']. See also about MIRIAM and Identifiers.org.

  • frmt (str) – output format (default is 'BIOPAX')

Returns:

a complete BioPAX representation for the record pointed to by the given URI is returned. Other output formats are produced by converting the BioPAX record on demand and can be specified by the optional frmt parameter. Please be advised that some output formats might return a “no result found” error if the conversion is not applicable for the BioPAX result. For example, BINARY_SIF output usually works if there are some interactions, complexes, or pathways in the retrieved set and not only physical entities.

>>> from bioservices import PathwayCommons
>>> pc2 = PathwayCommons(verbose=False)
>>> res = pc2.get("col5a1")
>>> res = pc2.get("http://identifiers.org/uniprot/Q06609")
get_sifgraph_common_stream(source, limit=1, direction='DOWNSTREAM', pattern=None)[source]

Finds the common stream of the source genes; extracts a sub-network from the loaded Pathway Commons SIF model.

Parameters:
  • source – set of gene identifiers (HGNC symbol). Can be a list of identifiers or just one string(if only one identifier)

  • limit (int) – Graph traversal depth. Limit > 1 value can result in very large data or error.

  • direction (str) – Graph traversal direction, one of “BOTHSTREAM”, “UPSTREAM”, “DOWNSTREAM”, “UNDIRECTED”. Use UNDIRECTED if you want to see interacts-with relationships too.

  • pattern (str) – Filter by binary relationship (SIF edge) type(s).

Returns:

the graph in SIF format. The output should be stripped of surrounding whitespace; it contains one line per relation, with items separated by a tab. You can save the text with a .sif extension and it should be ready to use, e.g. in the Cytoscape viewer.

res = pc.get_sifgraph_common_stream(['BRD4', 'MYC'])
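The tab-separated layout described above is straightforward to parse. Here is a minimal sketch, assuming the method returns the SIF text as a single string; the helper name parse_sif and the sample text are hypothetical, and any columns beyond the first three are ignored.

```python
def parse_sif(text):
    """Return a list of (source, relation, target) tuples from SIF text."""
    edges = []
    for line in text.strip().splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        edges.append((fields[0], fields[1], fields[2]))
    return edges


# Hypothetical response text, for illustration only.
sample = "BRD4\tcontrols-expression-of\tMYC\nMYC\tin-complex-with\tMAX\n"
edges = parse_sif(sample)
print(edges[0])
```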
get_sifgraph_neighborhood(source, limit=1, direction='BOTHSTREAM', pattern=None)[source]

Finds the neighborhood sub-network in the Pathway Commons Simple Interaction Format (extended SIF) graph (see http://www.pathwaycommons.org/pc2/formats#sif)

Parameters:
  • source – set of gene identifiers (HGNC symbol). Can be a list of identifiers or just one string(if only one identifier)

  • limit (int) – Graph traversal depth. Limit > 1 value can result in very large data or error.

  • direction (str) – Graph traversal direction, one of “BOTHSTREAM”, “UPSTREAM”, “DOWNSTREAM”, “UNDIRECTED”. Use UNDIRECTED if you want to see interacts-with relationships too.

  • pattern (str) – Filter by binary relationship (SIF edge) type(s).

Returns:

the graph in SIF format. The output should be stripped of surrounding whitespace; it contains one line per relation, with items separated by a tab. You can save the text with a .sif extension and it should be ready to use, e.g. in the Cytoscape viewer.

res = pc.get_sifgraph_neighborhood('BRD4')
get_sifgraph_pathsbetween(source, limit=1, directed=False, pattern=None)[source]

Finds the paths between the source genes; extracts a sub-network from the Pathway Commons SIF graph.

Parameters:
  • source – set of gene identifiers (HGNC symbol). Can be a list of identifiers or just one string(if only one identifier)

  • limit (int) – Graph traversal depth. Limit > 1 value can result in very large data or error.

  • directed (bool) – Directionality: ‘true’ is for DOWNSTREAM/UPSTREAM, ‘false’ - UNDIRECTED

  • pattern (str) – Filter by binary relationship (SIF edge) type(s).

Returns:

the graph in SIF format. The output should be stripped of surrounding whitespace; it contains one line per relation, with items separated by a tab. You can save the text with a .sif extension and it should be ready to use, e.g. in the Cytoscape viewer.

get_sifgraph_pathsfromto(source, target, limit=1, pattern=None)[source]

Finds the paths from the source genes to the target genes; extracts a sub-network from the Pathway Commons SIF graph.

Parameters:
  • source – set of gene identifiers (HGNC symbol). Can be a list of identifiers or just one string(if only one identifier)

  • target – A target set of gene identifiers.

  • limit (int) – Graph traversal depth. Limit > 1 value can result in very large data or error.

  • pattern (str) – Filter by binary relationship (SIF edge) type(s).

Returns:

the graph in SIF format. The output should be stripped of surrounding whitespace; it contains one line per relation, with items separated by a tab. You can save the text with a .sif extension and it should be ready to use, e.g. in the Cytoscape viewer.

graph(kind, source, target=None, direction=None, limit=1, frmt=None, datasource=None, organism=None)[source]

Finds connections and neighborhoods of elements

Connections can be for example the shortest path between two proteins or the neighborhood for a particular protein state or all states.

Graph searches take detailed BioPAX semantics such as generics or nested complexes into account and traverse the graph accordingly. The starting points can be either physical entities or entity references.

In the case of the latter the graph search starts from ALL the physical entities that belong to that particular entity reference, i.e. all of its states. Note that we integrate BioPAX data from multiple databases based on our proteins and small molecules data warehouse and consistently normalize UnificationXref, EntityReference, Provenance, BioSource, and ControlledVocabulary objects when we are absolutely sure that two objects of the same type are equivalent. We, however, do not merge physical entities and reactions from different sources as matching and aligning pathways at that level is still an open research problem. As a result, graph searches can return several similar but disconnected sub-networks that correspond to the pathway data from different providers (though some physical entities often refer to the same small molecule or protein reference or controlled vocabulary).

Parameters:
  • kind (str) – graph query

  • source (str) – source object’s URI/ID. Multiple source URIs/IDs must be encoded as list of valid URI source=[‘http://identifiers.org/uniprot/Q06609’, ‘http://identifiers.org/uniprot/Q549Z0’].

  • target (str) – required for PATHSFROMTO graph query. target URI/ID. Multiple target URIs must be encoded as list (see source parameter).

  • direction (str) – graph search direction in [BOTHSTREAM, DOWNSTREAM, UPSTREAM] see _valid_directions attribute.

  • limit (int) – graph query search distance limit (default = 1).

  • frmt (str) – output format. see _valid-format

  • datasource (str) – datasource filter (same as for ‘search’).

  • organism (str) – organism filter (same as for ‘search’).

Returns:

By default, graph queries return a complete BioPAX representation of the subnetwork matched by the algorithm. Other output formats are available as specified by the optional frmt parameter. Please be advised that some output format choices might cause a “no result found” error if the conversion is not applicable for the BioPAX result (e.g., BINARY_SIF output fails if there are no interactions, complexes, or pathways in the retrieved set).

>>> from bioservices import PathwayCommons
>>> pc2 = PathwayCommons(verbose=False)
>>> res = pc2.graph(source="http://identifiers.org/uniprot/P20908",
...         kind="neighborhood", frmt="EXTENDED_BINARY_SIF")
search(q, page=0, datasource=None, organism=None, type=None)[source]

Text search in PathwayCommons using Lucene query syntax

Some of the parameters are BioPAX properties, others are composite relationships.

All index fields are (case-sensitive): comment, ecnumber, keyword, name, pathway, term, xrefdb, xrefid, dataSource, and organism.

The pathway field maps to all participants of pathways that contain the keyword(s) in any of its text fields.

Finally, keyword is a transitive aggregate field that includes all searchable keywords of that element and its child elements.

All searches can also be filtered by data source and organism.

It is also possible to restrict the domain class using the ‘type’ parameter.

This query can be used standalone or to retrieve starting points for graph searches.

Parameters:
  • q (str) – a keyword, name, external identifier, or a Lucene query string (required).

  • page (int) – (N>=0, default is 0), search result page number.

  • datasource (str) – filter by data source (use names or URIs of pathway data sources or of any existing Provenance object). If multiple data source values are specified, a union of hits from specified sources is returned. datasource=[reactome,pid] returns hits associated with Reactome or PID.

  • organism (str) – The organism can be specified either by official name, e.g. “homo sapiens” or by NCBI taxonomy id, e.g. “9606”. Similar to data sources, if multiple organisms are declared a union of all hits from specified organisms is returned. For example organism=[9606, 10090] returns results for both human and mouse.

  • type (str) – BioPAX class filter. (e.g., ‘pathway’, ‘proteinreference’)

>>> from bioservices import PathwayCommons
>>> pc2 = PathwayCommons(verbose=False)
>>> pc2.search("Q06609")
>>> pc2.search("brca2", type="proteinreference",
...         organism="homo sapiens", datasource="pid")
>>> pc2.search("name:'col5a1'", type="proteinreference", organism=9606)
>>> pc2.search("a*", page=3)

Find the FGFR2 keyword:

pc2.search("FGFR2")

Find pathways by FGFR2 keyword in any index field:

pc2.search("FGFR2", type="pathway")

Finds control interactions that contain the word binding but not transcription in their indexed fields:

pc2.search("binding NOT transcription", type="control")

Find all interactions that directly or indirectly participate in a pathway that has a keyword match for “immune” (Note the star after immune):

pc2.search("pathway:immune*", type="conversion")

Find all Reactome pathways:

pc2.search("*", type="pathway", datasource="reactome")
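Since the index fields listed for search() are case-sensitive, a small check before sending a field-restricted query can catch typos early. This is a sketch only: the helper name field_query is hypothetical, and it covers just the simple field:value form of the Lucene syntax.

```python
# Index fields as listed in the search() description (case-sensitive).
INDEX_FIELDS = {
    "comment", "ecnumber", "keyword", "name", "pathway",
    "term", "xrefdb", "xrefid", "dataSource", "organism",
}


def field_query(field, value):
    """Build a 'field:value' query string after validating the field name."""
    if field not in INDEX_FIELDS:
        raise ValueError("unknown index field: %r" % field)
    return "%s:%s" % (field, value)


q = field_query("pathway", "immune*")
print(q)
```

The resulting string could be passed as the q argument of search().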
top_pathways(query='*', datasource=None, organism=None)[source]

This command returns all top pathways

Top pathways are pathways that are neither ‘controlled’ nor a ‘pathwayComponent’ of another process.

Parameters:
  • query – a keyword, name, external identifier or Lucene query string, as in search(). Default is “*”.

  • datasource (str) – filter by data source (same as search())

  • organism (str) – organism filter, e.g. 9606 for human.

Returns:

dictionary with information about top pathways. Check the “searchHit” key for information about “dataSource”, for instance.

>>> from bioservices import PathwayCommons
>>> pc2 = PathwayCommons(verbose=False)
>>> res = pc2.top_pathways()

https://www.pathwaycommons.org/pc2/top_pathways?q=TP53

traverse(uri, path)[source]

Provides XPath-like access to the PC.

The format of the path query is in the form:

[InitialClass]/[property1]:[classRestriction(optional)]/[property2]...

A "*" sign after the property instructs the path accessor to transitively traverse that property. For example, the following path accessor will traverse through all physical entity components within a complex:

"Complex/component*/entityReference/xref:UnificationXref"

The following will list display names of all participants of interactions, which are components (pathwayComponent) of a pathway (note: pathwayOrder property, where same or other interactions can be reached, is not considered here):

"Pathway/pathwayComponent:Interaction/participant*/displayName"

The optional parameter classRestriction restricts the returned property values to a certain subclass of the range of that property. In the first example above, this is used to get only the Unification Xrefs. Path accessors can use all the official BioPAX properties as well as additional derived classes and parameters in paxtools such as inverse parameters and interfaces that represent anonymous union classes in OWL. (See Paxtools documentation for more details).
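The path format above can be assembled from its parts. A minimal sketch, assuming each step is given as a (property, class restriction) pair with None for no restriction, and that any "*" is written as part of the property name; the helper name build_path is hypothetical.

```python
def build_path(initial_class, steps):
    """Join an initial class and (property, restriction) steps into a path."""
    parts = [initial_class]
    for prop, restriction in steps:
        # Append ':Restriction' only when a class restriction is given.
        parts.append(prop if restriction is None else "%s:%s" % (prop, restriction))
    return "/".join(parts)


path = build_path(
    "Complex",
    [("component*", None), ("entityReference", None), ("xref", "UnificationXref")],
)
print(path)
```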

Parameters:
  • uri (str) – a biopax element URI - specified similar to the ‘GET’ command. multiple IDs are allowed as a list of strings.

  • path (str) – a BioPAX property path in the form of property1[:type1]/property2[:type2]; see above, inverse properties, Paxtools, org.biopax.paxtools.controller.PathAccessor.

See also

properties

Returns:

XML result that follows the Search Response XML Schema (TraverseResponse type; pagination is disabled: returns all values at once)

from bioservices import PathwayCommons
pc2 = PathwayCommons(verbose=False)
res = pc2.traverse(uri=['http://identifiers.org/uniprot/P38398','http://identifiers.org/uniprot/Q06609'], path="ProteinReference/organism")
res = pc2.traverse(uri="http://identifiers.org/uniprot/Q06609",
    path="ProteinReference/entityReferenceOf:Protein/name")
res = pc2.traverse("http://identifiers.org/uniprot/P38398",
    path="ProteinReference/entityReferenceOf:Protein")
res = pc2.traverse(uri=["http://identifiers.org/uniprot/P38398",
    "http://identifiers.org/taxonomy/9606"], path="Named/name")

8.28. PDB/PDBe modules

Interface to the PDB web Service (New API Jan 2021).

class PDB(verbose=False, cache=False)[source]

Interface to PDB service (new API Jan 2021)

With the new API, one method called search() is provided by PDB. To perform a search you need to define a query. Here is an example

>>> from bioservices import PDB
>>> s = PDB()
>>> query = {"query":
...              {"type": "terminal",
...               "service": "text",
...               "parameters": {
...                 "value": "thymidine kinase"
...                 }
...             },
...          "return_type": "entry"}
>>> res = s.search(query)

Note

as of December 2020, a new API has been set up by PDB. Some previous functionalities such as returning a list of Ligands are not supported anymore (Jan 2021). However, many more powerful searches are available. I encourage everyone to look at the PDB page for complex examples: http://search.rcsb.org/#examples

As mentioned above, the PDB service provides a single search method, available here as search(). We will not cover all the power and capability of this search function; users should refer to the official PDB help for that. Yet, the examples given by PDB should all work with this method.

When possible, we will add convenient alias functions in this class. For now we have, for example, get_current_ids() and get_similarity_sequence(), which users may find useful.

The main idea behind the PDB API is to create queries that can access different types of services. A query needs at least two keys:

  • query

  • return_type

Consider this basic example that searches for the text thymidine kinase:

{
  "query": {
    "type": "terminal",
    "service": "text",
    "parameters": {
      "value": "thymidine kinase"
    }
  },
  "return_type": "entry"
}

Here the request is indeed defined by a query and a return_type. The return type is a simple value such as entry. The query itself is composed of three key/value pairs: type, service and parameters, as described below.

The query can have several fields:

  • type: the clause type can be either terminal or group

    • terminal: performs an atomic search operation, e.g. searches for a particular value in a particular field.

    • group: wraps other terminal or group nodes and is used to combine multiple queries in a logical fashion.

  • service:

    • text: linguistic searches against textual annotations.

    • sequence: uses MMseqs2 to perform sequence matching searches (BLAST-like). The following targets are available:

      • pdb_protein_sequence,

      • pdb_dna_sequence,

      • pdb_na_sequence

    • seqmotif: performs short motif searches against nucleotide or protein sequences using 3 different inputs:

      • simple (e.g., CXCXXL)

      • prosite (e.g., C-X-C-X(2)-[LIVMYFWC])

      • regex (e.g., CXCX{2}[LIVMYFWC])

    • structure: searches matching a global 3D shape of assemblies or chains of a given entry (identified by PDB ID), in either strict (strict_shape_match) or relaxed (relaxed_shape_match) modes

    • strucmotif: Performs structural motif searches on all available PDB structures.

    • chemical: queries of small-molecule constituents of PDB structures, based on chemical formula and chemical structure. Queries for matching and similar chemical structures can be performed using SMILES and InChI descriptors as search targets.

      • graph-strict: atom type, formal charge, bond order, atom and bond chirality, aromatic assignment are used as matching criteria for this search type.

      • graph-relaxed: atom type, formal charge and bond order are used as matching criteria for this search type.

      • graph-relaxed-stereo: atom type, formal charge, bond order, atom and bond chirality are used as matching criteria for this search type.

      • fingerprint-similarity: Tanimoto similarity is used as the matching criteria
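The terminal node described in the list above can be built with a small helper that checks the service name against the documented values. This is a sketch under that assumption only; the helper name terminal_node is hypothetical and the parameters are passed through unchecked.

```python
# Service names as listed above.
SERVICES = {"text", "sequence", "seqmotif", "structure", "strucmotif", "chemical"}


def terminal_node(service, **parameters):
    """Return a terminal query node for the given service."""
    if service not in SERVICES:
        raise ValueError("unknown service: %r" % service)
    return {"type": "terminal", "service": service, "parameters": parameters}


query = {
    "query": terminal_node("text", value="thymidine kinase"),
    "return_type": "entry",
}
print(query["query"]["service"])
```

The resulting dictionary has the same shape as the basic thymidine kinase example.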

Concerning the return_type key, it can be one of:

  • entry: a list of PDB IDs.

  • assembly: list of PDB IDs appended with assembly IDs in the format of a [pdb_id]-[assembly_id], corresponding to biological assemblies.

  • polymer_entity: list of PDB IDs appended with entity IDs in the format of a [pdb_id]_[entity_id], corresponding to polymeric molecular entities.

  • non_polymer_entity: list of PDB IDs appended with entity IDs in the format of a [pdb_id]_[entity_id], corresponding to non-polymeric entities (or ligands).

  • polymer_instance: list of PDB IDs appended with asym IDs in the format of a [pdb_id].[asym_id], corresponding to instances of certain polymeric molecular entities, also known as chains.
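The identifier formats listed above differ only in their separator, so a hit can be split back into its PDB ID and suffix. A minimal sketch, assuming identifiers follow exactly those formats; the helper name split_identifier and the sample identifiers are hypothetical.

```python
# Separator used by each return type, following the formats listed above.
SEPARATORS = {
    "assembly": "-",
    "polymer_entity": "_",
    "non_polymer_entity": "_",
    "polymer_instance": ".",
}


def split_identifier(identifier, return_type):
    """Split a search hit into (pdb_id, suffix); suffix is None for entries."""
    if return_type == "entry":
        return identifier, None
    separator = SEPARATORS[return_type]
    # Split on the last separator so that the suffix is always the final part.
    pdb_id, found, suffix = identifier.rpartition(separator)
    if not found:
        raise ValueError("%r lacks the %r separator" % (identifier, separator))
    return pdb_id, suffix


print(split_identifier("1ABC-1", "assembly"))
```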

Optional arguments

There are many optional arguments. Let us see a couple of them. Pagination can be set (default is 10 entries) using the request_options (optional) key. Consider this query example:

{
  "query": {
    "type": "terminal",
    "service": "text",
    "parameters": {
        "attribute": "rcsb_polymer_entity.formula_weight",
        "operator": "greater",
        "value": 500
    }
  },
  "request_options": {
    "pager": {
      "start": 0,
      "rows": 100
    }
  },
  "return_type": "polymer_entity"
}

Here, the query searches for polymer entities that have a formula weight above 500. With rows set to 100 in the request_options pager, we get the first 100 hits.
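To walk through later pages, the pager can be advanced in steps of rows. The sketch below only builds the request dictionaries and sends nothing; it assumes that start is a zero-based offset of the first hit, and the helper names with_pager and page_requests are hypothetical.

```python
import copy


def with_pager(query, start, rows):
    """Return a copy of query with request_options.pager set."""
    paged = copy.deepcopy(query)
    paged.setdefault("request_options", {})["pager"] = {"start": start, "rows": rows}
    return paged


def page_requests(query, total, rows=100):
    """Yield one request per page needed to cover total hits."""
    for start in range(0, total, rows):
        yield with_pager(query, start, rows)


base = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_polymer_entity.formula_weight",
            "operator": "greater",
            "value": 500,
        },
    },
    "return_type": "polymer_entity",
}
pages = list(page_requests(base, total=250, rows=100))
print([p["request_options"]["pager"]["start"] for p in pages])
```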

To return all hits, set this field in the request_options:

"return_all_hits": true

Coming back to the first basic example, we can reuse it to illustrate how to refine the search using attributes and operators:

{
  "query": {
    "type": "terminal",
    "service": "text",
    "parameters": {
      "value": "thymidine kinase",
      "attribute": "exptl.method",
      "operator": "exact_match"
    }
  },
  "return_type": "entry"
}

All valid combinations of operators and attributes can be found here: http://search.rcsb.org/search-attributes.html

For instance, in the example above only in, exact_match and exists can be used with exptl.method attribute. This is not checked in bioservices.

Sorting is determined by the sort object in the request_options context. It allows you to add one or more sorting conditions to control the order of the search result hits. The sort operation is defined on a per field level, with special field name for score to sort by score (the default).

By default sorting is done in descending order (“desc”). The sort can be reversed by setting direction property to “asc”. This example demonstrates how to sort the search results by release date:

{
  "query": {
    "type": "terminal",
    "service": "text",
    "parameters": {
      "attribute": "struct.title",
      "operator": "contains_phrase",
      "value": "\"hiv protease\""
    }
  },
  "request_options": {
    "sort": [
      {
        "sort_by": "rcsb_accession_info.initial_release_date",
        "direction": "desc"
      }
    ]
  },
  "return_type": "entry"
}

Again, many more complex examples can be found on the PDB page.

Constructor

Parameters:
  • verbose (bool) – prints informative messages (default is off)

  • cache (bool) – set to True to enable HTTP caching

get_current_ids()[source]

Get a list of all current PDB IDs.

get_similarity_sequence(seq)[source]

Search for sequence similarity with a protein sequence.

Parameters:

seq (str) – protein sequence in single-letter amino acid code

seq = "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTAVAHVDDMPNAL"
results = p.get_similarity_sequence(seq)
search(query, request_options=None, request_info=None, return_type=None)[source]

search request represented as a JSON object.

This is the only function in the PDB API. You should be able to perform any valid PDB search here (see the bioservices.pdb.PDB documentation for details). Note, however, that alias methods will be added to BioServices on demand for common searches.

Parameters:
  • query (dict) – the search expression. Can be omitted if, instead of IDs retrieval, facets or count operation should be performed. In this case the request must be configured via the request_options context.

  • request_options (dict) – (optional) controls various aspects of the search request including pagination, sorting, scoring and faceting.

  • request_info (dict) – additional information about the query, e.g. query_id. (optional)

  • return_type (str) – type of results to return (e.g. "entry", "polymer_entity").

Returns:

json results

You must define a query as defined in the PDB web page. For example, the following query searches for macromolecular PDB entities that share 90% sequence identity with the GTPase HRas protein from Gallus gallus (Chicken):

query = {
  "query": {
    "type": "terminal",
    "service": "sequence",
    "parameters": {
      "evalue_cutoff": 1,
      "identity_cutoff": 0.9,
      "target": "pdb_protein_sequence",
      "value": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS"
    }
  },
  "request_options": {
    "scoring_strategy": "sequence"
  },
  "return_type": "polymer_entity"
}

What is important is that the dictionary called query contains two compulsory keys, namely query and return_type. The two other, optional, keys are request_options and request_info.

You would then call the PDB search as follows:

from bioservices import PDB
p = PDB()
results = p.search(query)

Now, in BioServices, you can also decompose the query as follows:

query = {
    "type": "terminal",
    "service": "sequence",
    "parameters": {
      "evalue_cutoff": 1,
      "identity_cutoff": 0.9,
      "target": "pdb_protein_sequence",
      "value": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS"
    }}
request_options =  { "scoring_strategy": "sequence"}
return_type= "polymer_entity"

and then use PDB search again:

from bioservices import PDB
p = PDB()
results = p.search(query, request_options=request_options, return_type=return_type)

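or even simpler for Python lovers, if query is the complete dictionary from the first example (with the query, request_options and return_type keys):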

results = p.search(**query)
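The decomposed pieces and the complete dictionary carry the same information. The sketch below shows how one maps onto the other; it illustrates the relationship only and is not necessarily what search() does internally. The helper name compose_request is hypothetical.

```python
def compose_request(query, request_options=None, request_info=None, return_type=None):
    """Assemble a complete request dictionary from its decomposed parts."""
    request = {"query": query}
    # Optional keys are only added when a value was supplied.
    if request_options is not None:
        request["request_options"] = request_options
    if request_info is not None:
        request["request_info"] = request_info
    if return_type is not None:
        request["return_type"] = return_type
    return request


node = {"type": "terminal", "service": "text", "parameters": {"value": "thymidine kinase"}}
request = compose_request(node, return_type="entry")
print(sorted(request))
```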

Interface to the PDBe web Service (v2 API).

class PDBe(verbose=False, cache=False)[source]

Interface to part of the PDBe service

>>> from bioservices import PDBe
>>> s = PDBe()
>>> res = s.get_files("1FBV")

Constructor

Parameters:
  • verbose (bool) – prints informative messages (default is off)

  • cache (bool) – set to True to enable HTTP caching

get_assembly(query)[source]

Provides information for each assembly of a given PDB ID.

This information is broken down at the entity level for each assembly. The information given includes the molecule name, type and class, the chains where the molecule occurs, and the number of copies of each entity in the assembly.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_assembly('1cbs')
p.get_assembly('1cbs,4v5j')
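The query argument accepts a single ID, a comma-separated string, or a Python list. A minimal sketch of how these three forms could be reduced to one comma-separated string, assuming only 4-character IDs are valid; the helper name normalize_pdb_ids is hypothetical and not part of bioservices.

```python
def normalize_pdb_ids(query):
    """Return a comma-separated string of PDB IDs from a str or a list."""
    ids = query.split(",") if isinstance(query, str) else list(query)
    ids = [pdb_id.strip() for pdb_id in ids]
    for pdb_id in ids:
        if len(pdb_id) != 4:
            raise ValueError("not a 4-character PDB id: %r" % pdb_id)
    return ",".join(ids)


print(normalize_pdb_ids(["1cbs", "4v5j"]))
```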
get_binding_sites(query, entity_id)[source]

Provides details on binding sites for a specific entity in the entry.

The details are taken from STRUCT_SITE records in PDB files (or the mmCIF equivalent thereof) and include the ligand, residues in the site, description of the site, etc.

Parameters:
  • query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

  • entity_id – an entity ID (integer or string, e.g., 1)

p.get_binding_sites('1cbs', 1)
get_branched_entities(query)[source]

Provides data for branched carbohydrate entities within an entry.

Overall information about each unique branched carbohydrate is returned, along with detailed information about each carbohydrate monomer within the branched entity.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_branched_entities('3d12')
p.get_branched_entities('3d12,7v7u')
get_carbohydrate_polymer(query)[source]

Provides data for carbohydrate polymers within an entry.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_carbohydrate_polymer('3d12')
p.get_carbohydrate_polymer('3d12,7v7u')
get_drugbank_annotation(query)[source]

Provides DrugBank annotation of all ligands, i.e. ‘bound’ molecules.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_drugbank_annotation('5hht')
get_electron_density_statistics(query)[source]

Provides statistics for electron density.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_electron_density_statistics('1cbs')
p.get_electron_density_statistics('1cbs,4v5j')
get_entities(query)[source]

Return details of entities modelled in the entry

This is an alias for get_molecules() using the entities endpoint.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_entities('1cbs')
p.get_entities('1cbs,2kv8')
get_experiment(query)[source]

Provides details of experiment(s) carried out in determining the structure of the entry.

Each experiment is described in a separate dictionary. For X-ray diffraction, the description consists of resolution, spacegroup, cell dimensions, R and Rfree, refinement program, etc. For NMR, details of spectrometer, sample, spectra, refinement, etc. are included. For EM, details of specimen, imaging, acquisition, reconstruction, fitting etc. are included.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_experiment('1cbs')
p.get_experiment('1cbs,2kv8')
get_files(query)[source]

Provides URLs and brief descriptions (labels) for PDB entry files

Also covered are mmCIF files, biological assembly files, FASTA files for sequences, SIFTS cross-reference XML files, validation XML files, X-ray structure factor files, NMR experimental constraints files, etc.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_files('1cbs')
p.get_files('1cbs,4v5j')
get_functional_annotation(query)[source]

Provides functional annotation of all ligands, i.e. ‘bound’ molecules.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_functional_annotation('1cbs')
get_ligand_monomers(query)[source]

Provides a list of modelled instances of ligands, i.e. ‘bound’ molecules that are not waters.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_ligand_monomers('1cbs')
p.get_ligand_monomers('1cbs,2kv8')
get_modified_residues(query)[source]

Provides a list of modelled instances of modified amino acids or nucleotides in protein, DNA or RNA chains.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_modified_residues('4v5j')
p.get_modified_residues('4v5j,1cbs')
get_molecules(query)[source]

Return details of molecules (or entities in mmcif-speak) modelled in the entry

This can be entity id, description, type, polymer-type (if applicable), number of copies in the entry, sample preparation method, source organism(s) (if applicable), etc.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_molecules('1cbs')
p.get_molecules('1cbs,2kv8')
get_mutated_residues(query)[source]

Provides a list of modelled instances of mutated amino acids or nucleotides in protein, DNA or RNA chains.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_mutated_residues('1bgj')
p.get_mutated_residues('1bgj,4v5j')
get_observed_ranges(query)[source]

Provides observed ranges, i.e., segments of structural coverage of polymeric molecules that are modelled fully or partly.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_observed_ranges('1cbs')
p.get_observed_ranges('1cbs,4v5j')
get_observed_ranges_in_pdb_chain(query, chain_id)[source]

Provides observed ranges, i.e., segments of structural coverage of polymeric molecules in a particular chain.

Parameters:
  • query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

  • chain_id (str) – a PDB chain ID (e.g., "A")

p.get_observed_ranges_in_pdb_chain('1cbs', 'A')
get_observed_residues_ratio(query)[source]

Provides the ratio of observed residues for each chain in each molecule.

The list of chains within an entity is sorted by observed_ratio (descending order), partial_ratio (ascending order), and number_residues (descending order).

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_observed_residues_ratio('1cbs')
p.get_observed_residues_ratio('1cbs,4v5j')
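The sort order described above can be reproduced locally with a composite key. This is only an illustrative sketch: the field names observed_ratio, partial_ratio, and number_residues come from the description, while the sample records and the helper name sort_chains are hypothetical and not part of the library.

```python
def sort_chains(chains):
    """Sort by observed_ratio (desc), partial_ratio (asc), number_residues (desc)."""
    return sorted(
        chains,
        key=lambda c: (-c["observed_ratio"], c["partial_ratio"], -c["number_residues"]),
    )


# Hypothetical chain records, not real output from the service.
chains = [
    {"chain_id": "B", "observed_ratio": 0.9, "partial_ratio": 0.1, "number_residues": 120},
    {"chain_id": "A", "observed_ratio": 1.0, "partial_ratio": 0.0, "number_residues": 137},
    {"chain_id": "C", "observed_ratio": 0.9, "partial_ratio": 0.0, "number_residues": 120},
]
ordered = [c["chain_id"] for c in sort_chains(chains)]
print(ordered)
```

Negating the two descending fields lets a single ascending sort express all three criteria.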
get_publications(query)[source]

Return publications associated with the entry

Provides details of publications associated with an entry, such as title of the article, journal name, year of publication, volume, pages, doi, pubmed_id, etc. Primary citation is listed first.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_publications('1cbs')
p.get_publications('1cbs,2kv8')
get_related_dataset(query)[source]

Provides DOIs for related raw experimental datasets.

Includes diffraction image data, small-angle scattering data and electron micrographs.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_related_dataset('5o8b')
p.get_related_dataset('5o8b,5o8b')
get_related_publications(query)[source]

Return publications obtained from both EuroPMC and UniProt.

These are articles which cite the primary citation of the entry, or open-access articles which mention the entry id without explicitly citing the primary citation of an entry.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_related_publications('1cbs')
p.get_related_publications('1cbs,2kv8')
get_release_status(query)[source]

Provides status of a PDB entry (released, obsoleted, on-hold etc) along with some other information such as authors, title, experimental method, etc.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_release_status('1cbs')
p.get_release_status('1cbs,4v5j')
get_residue_listing(query)[source]

Lists all residues (modelled or otherwise) in the entry, except waters, along with details of the fraction of expected atoms modelled for each residue and any alternate conformers.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_residue_listing('1cbs')
get_residue_listing_in_pdb_chain(query, chain_id)[source]

Lists all residues (modelled or otherwise) in a particular chain, except waters, along with details of the fraction of expected atoms modelled for each residue and any alternate conformers.

Parameters:
  • query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

  • chain_id (str) – a PDB chain ID (e.g., "A")

p.get_residue_listing_in_pdb_chain('1cbs', 'A')
get_secondary_structure(query)[source]

Provides residue ranges of regular secondary structure (alpha helices and beta strands) found in protein chains of the entry. For strands, the sheet id can be used to identify a beta sheet.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_secondary_structure('1cbs')
p.get_secondary_structure('1cbs,4v5j')
get_summary(query)[source]

Returns summary of a PDB entry

This can be the title of the entry, list of depositors, date of deposition, date of release, date of latest revision, experimental method, list of related entries in the case of split entries, etc.

Parameters:

query (str) – a 4-character PDB id code, comma-separated list of IDs, or Python list of IDs

p.get_summary('1cbs')
p.get_summary('1cbs,2kv8')
p.get_summary(['1cbs', '2kv8'])
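The three accepted input forms (single ID, comma-separated string, Python list) can be reduced to one canonical string before a request is built. The sketch below is a hypothetical helper written for illustration; the name normalize_query and the lowercasing step are assumptions, and it does not show how the library normalises input internally.

```python
def normalize_query(query):
    """Return a comma-separated string of PDB IDs from a string or a list."""
    if isinstance(query, str):
        ids = query.split(",")
    else:
        ids = list(query)
    # Trim whitespace, drop empty entries, and lowercase (an assumption).
    ids = [str(i).strip().lower() for i in ids if str(i).strip()]
    bad = [i for i in ids if len(i) != 4]
    if bad:
        raise ValueError(f"not 4-character PDB IDs: {bad}")
    return ",".join(ids)


print(normalize_query("1cbs"))
print(normalize_query("1cbs, 2kv8"))
print(normalize_query(["1CBS", "2kv8"]))
```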

8.29. PRIDE module

Interface to PRIDE web service

class PRIDE(verbose=False, cache=False)[source]

Interface to the PRIDE service

from bioservices import PRIDE
p = PRIDE()
p.get_peptide_evidence(project_accession="PRD000001")

Changed in version 1.10.1: Due to the new API:

  • the method project_count was dropped.

  • get_project_list was renamed to get_project_files.

  • get_assays, get_assay_count, get_assay_count_project_accession, and get_assay_list were dropped in v2.

  • get_protein_list, get_protein_count, get_protein_count_assay, and get_protein_list_assay were replaced by the get_protein_evidences method.

  • get_peptide_list_assay, get_peptide_count, get_peptide_list, get_peptide_list_sequence, and get_peptide_count_assay were replaced by get_peptide_evidence.

Constructor

Parameters:
  • verbose (bool) – set to False to prevent informative messages

  • cache (bool) – set to True to use caching. Not recommended for this service, which changes frequently

get_peptide_evidence(project_accession=None, assay_accession=None, protein_accession=None, peptide_evidence_accession=None, peptide_sequence=None, pageSize=100, page=0, sortDirection='DESC', sortConditions='projectAccession')[source]

Get all the peptide evidences for a specific protein evidence.

Parameters:
  • project_accession (str) – filter by PRIDE project accession (optional)

  • assay_accession (str) – filter by assay accession (optional)

  • protein_accession (str) – filter by protein accession (optional)

  • peptide_evidence_accession (str) – filter by peptide evidence accession (optional)

  • peptide_sequence (str) – filter by peptide sequence (optional)

  • pageSize (int) – how many results to return per page (default 100)

  • page (int) – which page (starting from 0) of the result to return

  • sortConditions (str) – field(s) to sort by, comma-separated (default "projectAccession")

  • sortDirection (str) – the sorting order ("ASC" or "DESC")

Retrieving data by protein accession should be fast:

p.get_peptide_evidence(protein_accession="Q8IX30")

but other filters may be slow:

p.get_peptide_evidence(peptide_sequence="CQGSPGASKAMLSCNR")
get_project(identifier)[source]

Retrieve project information by accession

List of PRIDE Archive projects. This method does not perform a search; for search functionality, use the search/projects endpoint. The result list is paginated using pageSize and page.

Parameters:

identifier (str) – a valid PRIDE identifier e.g., PRD000001

Returns:

if identifier is invalid, returns an empty dictionary {}

>>> from bioservices import PRIDE
>>> p = PRIDE()
>>> res = p.get_project("PRD000001")
>>> res['title']
'COFRADIC proteome of unstimulated human blood platelets'
get_project_files(accession, pageSize=100, page=0, sortConditions=None, sortDirection='DESC', filters='')[source]

List the files of a project, optionally restricted by the given criteria.

Parameters:
  • accession (str) – the accession number to look for

  • pageSize (int) – how many results to return per page

  • page (int) – which page (starting from 0) of the result to return

  • sortConditions (str) – default is submission_date but more fields can be separated by comma and passed. Example: submission_date,project_title

  • sortDirection (str) – the sorting order (ASC or DESC)

  • filters (str) – Parameters to filter the search results. The structure of the filter is: field1==value1, field2==value2. Example accession==PRD000001

>>> p = PRIDE()
>>> results = p.get_project_files(accession="PRD000001", pageSize=10, page=1)

In v1.10.1, due to the new PRIDE API, the get_file_count method was dropped. You can use:

len(results['_embedded']['files'])

Similarly, the get_file_list method was dropped, since all results are stored in the output of this method.
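A defensive variant of the file count is sketched below. It assumes only the nesting shown above (an _embedded key holding a files list). The helper name count_files and the sample record fields are hypothetical, and the guard for non-dict results is an assumption about how failed requests might be reported.

```python
def count_files(results):
    """Count file entries in one page of a get_project_files-style response."""
    if not isinstance(results, dict):
        # Assumption: a failed request may yield something other than a dict.
        return 0
    return len(results.get("_embedded", {}).get("files", []))


# Hypothetical response fragment with made-up file names.
sample = {"_embedded": {"files": [{"fileName": "a.raw"}, {"fileName": "b.xml"}]}}
print(count_files(sample))
print(count_files({}))
```

Note that this counts the entries on the returned page only, so the value depends on pageSize and page.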

get_projects(pageSize=100, max_pages=1000000000.0)[source]

Retrieve all PRIDE projects, paginating automatically.

Parameters:
  • pageSize (int) – number of results per page (default 100)

  • max_pages – maximum number of pages to fetch (default: all pages)

Returns:

a list of project dictionaries
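Automatic pagination of this kind generally follows the loop sketched below. This is not the library's implementation: fetch_all and fake_fetch are hypothetical names, and the stopping rule (stop when a page comes back shorter than pageSize) is an assumption chosen for illustration.

```python
def fetch_all(fetch_page, page_size=100, max_pages=None):
    """Collect items page by page until a short page or max_pages is reached."""
    items = []
    page = 0
    while max_pages is None or page < max_pages:
        batch = fetch_page(page, page_size)
        items.extend(batch)
        if len(batch) < page_size:
            break  # a short (or empty) page means there is nothing left
        page += 1
    return items


# Stand-in for a remote call: 250 fake project records served in slices.
_FAKE = [{"accession": f"FAKE{n:06d}"} for n in range(250)]


def fake_fetch(page, page_size):
    start = page * page_size
    return _FAKE[start:start + page_size]


projects = fetch_all(fake_fetch, page_size=100)
first_page_only = fetch_all(fake_fetch, page_size=100, max_pages=1)
print(len(projects), len(first_page_only))
```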

get_projects_count()[source]

Return total number of projects.

Note

When the API returns a paginated list (new format), this method returns the count for the first page only, not the total across all pages.

get_protein_evidences(project_accession=None, assay_accession=None, reported_accession=None, pageSize=100, page=0, sortDirection='DESC', sortConditions='projectAccession')[source]

Get all protein evidences.

Parameters:
  • project_accession (str) – filter by PRIDE project accession (optional)

  • assay_accession (str) – filter by assay accession (optional)

  • reported_accession (str) – filter by reported protein accession (optional)

  • pageSize (int) – how many results to return per page (default 100)

  • page (int) – which page (starting from 0) of the result to return

  • sortConditions (str) – field(s) to sort by, comma-separated (default "projectAccession")

  • sortDirection (str) – the sorting order ("ASC" or "DESC")

p.get_protein_evidences()['_embedded']['proteinevidences']
get_stats(name)[source]

Retrieve statistics by name.

Parameters:

name (str) – statistics name (e.g., "SUBMISSIONS_PER_YEAR")

Returns:

statistics data for the given name

p.get_stats("SUBMISSIONS_PER_YEAR")

8.30. Pfam

Interface to part of the Pfam web service

class Pfam(verbose=True)[source]

Interface to Pfam pages

This is not a REST interface but rather a parser for some of the HTML pages related to Pfam families.

One can retrieve protein family information and associated sequences.

>>> from bioservices import Pfam
>>> p = Pfam()

Constructor

Parameters:

verbose (bool) – set to False to prevent informative messages

get_protein(ID, output='json')[source]

Retrieve protein information from Pfam.

Parameters:
  • ID (str) – a UniProt accession (e.g., "P43403")

  • output (str) – response format (default "json")

Returns:

raw response content

show(Id)[source]

Open the Pfam protein page for a UniProt ID in a web browser.

Parameters:

Id (str) – a UniProt accession (e.g., "P43403")

p = Pfam()
p.show("P43403")

8.31. PubChem

Interface to the PubChem PUG REST web service

COMPOUND_PROPERTIES = ['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'IsomericSMILES', 'InChI', 'InChIKey', 'IUPACName', 'Title', 'XLogP', 'ExactMass', 'MonoisotopicMass', 'TPSA', 'Complexity', 'Charge', 'HBondDonorCount', 'HBondAcceptorCount', 'RotatableBondCount', 'HeavyAtomCount', 'IsotopeAtomCount', 'AtomStereoCount', 'DefinedAtomStereoCount', 'UndefinedAtomStereoCount', 'BondStereoCount', 'DefinedBondStereoCount', 'UndefinedBondStereoCount', 'CovalentUnitCount', 'Volume3D', 'XStericQuadrupole3D', 'YStericQuadrupole3D', 'ZStericQuadrupole3D', 'FeatureCount3D', 'FeatureAcceptorCount3D', 'FeatureDonorCount3D', 'FeatureAnionCount3D', 'FeatureCationCount3D', 'FeatureRingCount3D', 'FeatureHydrophobeCount3D', 'ConformerDependentDescriptorCount', 'ConformerCount3D', 'Fingerprint2D']

Properties available via the /property/ endpoint of the PUG REST API.

class PubChem(verbose=False, cache=False)[source]

Interface to the PubChem PUG REST service.

The PubChem PUG REST API provides access to compound, substance and assay data stored in PubChem. URL structure follows the pattern:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/{domain}/{namespace}/{identifier}/{operation}/{format}

Example usage:

from bioservices import PubChem
p = PubChem()

# Get CIDs for aspirin by name
cids = p.get_cids_by_name("aspirin")

# Get compound record by CID
record = p.get_compound_by_cid(2244)

# Get specific properties for aspirin (CID 2244)
props = p.get_properties(2244, properties=["MolecularFormula", "MolecularWeight"])

# Get synonyms for aspirin
synonyms = p.get_synonyms(2244)
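To make the URL pattern above concrete, the following sketch assembles a PUG REST URL from its parts using only the standard library. The helper name build_pug_url is hypothetical and is not part of bioservices; the class builds its requests internally, and this sketch does not claim to mirror that code.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pug_url(domain, namespace, identifier, operation=None, frmt="JSON"):
    """Assemble {domain}/{namespace}/{identifier}/{operation}/{format}."""
    # Percent-encode the identifier but keep commas for ID lists.
    parts = [BASE, domain, namespace, quote(str(identifier), safe=",")]
    if operation:
        parts.append(operation)
    parts.append(frmt)
    return "/".join(parts)


url = build_pug_url("compound", "cid", 2244, "property/MolecularFormula,MolecularWeight")
print(url)
print(build_pug_url("compound", "name", "acetic acid", "cids"))
```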

Constructor

Parameters:
  • verbose (bool) – set to False to prevent informative messages

  • cache (bool) – set to True to cache requests

get_aids_by_cid(cid, frmt='json')[source]

Return assay IDs (AIDs) that tested a given compound CID.

Parameters:
  • cid – PubChem compound identifier

  • frmt (str) – response format (default "json")

Returns:

dict with IdentifierList key containing AID list

Example:

p.get_aids_by_cid(2244)
get_assay(aid, frmt='json')[source]

Return the full assay record for an AID.

Parameters:
  • aid – PubChem assay identifier

  • frmt (str) – response format (default "json")

Returns:

full assay record

Example:

p.get_assay(1)
get_assay_description(aid, frmt='json')[source]

Return the description section of an assay.

Parameters:
  • aid – PubChem assay identifier

  • frmt (str) – response format (default "json")

Returns:

dict containing assay description

Example:

p.get_assay_description(1)
get_assay_summary(cid, frmt='json')[source]

Return a bioactivity summary for a compound.

Parameters:
  • cid – PubChem compound identifier

  • frmt (str) – response format (default "json")

Returns:

dict containing assay summary data

Example:

p.get_assay_summary(2244)
get_cids_by_aid(aid, frmt='json')[source]

Return CIDs tested in a given assay.

Parameters:
  • aid – PubChem assay identifier

  • frmt (str) – response format (default "json")

Returns:

dict with IdentifierList key containing CID list

Example:

p.get_cids_by_aid(1)
get_cids_by_formula(formula, frmt='json')[source]

Return CIDs for a molecular formula.

Parameters:
  • formula (str) – molecular formula (e.g. "C9H8O4")

  • frmt (str) – response format (default "json")

Returns:

dict with IdentifierList key containing CID list

Example:

p.get_cids_by_formula("C9H8O4")
get_cids_by_inchi(inchi, frmt='json')[source]

Return CIDs for an InChI string.

Uses a POST request to safely transmit InChI strings that contain special characters.

Parameters:
  • inchi (str) – InChI string

  • frmt (str) – response format (default "json")

Returns:

dict with IdentifierList key containing CID list

get_cids_by_inchikey(inchikey, frmt='json')[source]

Return CIDs for an InChIKey.

Parameters:
  • inchikey (str) – InChIKey (e.g. "BSYNRYMUTXBXSQ-UHFFFAOYSA-N")

  • frmt (str) – response format (default "json")

Returns:

dict with IdentifierList key containing CID list

Example:

p.get_cids_by_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
get_cids_by_name(name, frmt='json')[source]

Return CIDs for a compound name.

Parameters:
  • name (str) – compound name (e.g. "aspirin")

  • frmt (str) – response format (default "json")

Returns:

dict with IdentifierList key containing CID list

Example:

p.get_cids_by_name("aspirin")
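The get_cids_by_* methods return a dict with an IdentifierList key. A small hypothetical helper such as the one below can pull out the plain list of CIDs. It assumes the parsed JSON has the shape {"IdentifierList": {"CID": [...]}} and that a failed call may return something other than a dict; both are assumptions to check against a real response.

```python
def extract_cids(response):
    """Return the list of CIDs from an IdentifierList-style response."""
    if not isinstance(response, dict):
        # Assumption: errors may be reported as a non-dict value.
        return []
    return list(response.get("IdentifierList", {}).get("CID", []))


# Sample payload typed by hand for illustration.
sample = {"IdentifierList": {"CID": [2244]}}
print(extract_cids(sample))
```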
get_cids_by_sid(sid, frmt='json')[source]

Return compound CIDs standardised from a given substance SID.

Parameters:
  • sid – PubChem substance identifier

  • frmt (str) – response format (default "json")

Returns:

dict with IdentifierList key containing CID list

Example:

p.get_cids_by_sid(100)
get_cids_by_smiles(smiles, frmt='json')[source]

Return CIDs for a SMILES string.

Uses a POST request so that special characters in the SMILES are handled correctly.

Parameters:
  • smiles (str) – SMILES string (e.g. "CC(=O)Oc1ccccc1C(=O)O")

  • frmt (str) – response format (default "json")

Returns:

dict with IdentifierList key containing CID list

Example:

p.get_cids_by_smiles("CC(=O)Oc1ccccc1C(=O)O")
get_compound_by_cid(cid, frmt='json')[source]

Return the full compound record for a CID.

Parameters:
  • cid – PubChem compound identifier (integer or string)

  • frmt (str) – response format (default "json")

Returns:

full compound record

Example:

p.get_compound_by_cid(2244)   # aspirin
get_compound_by_name(name, frmt='json')[source]

Return the full compound record for a compound name.

Parameters:
  • name (str) – compound name (e.g. "aspirin")

  • frmt (str) – response format (default "json")

Returns:

full compound record

Example:

p.get_compound_by_name("aspirin")
get_compound_by_smiles(identifier, frmt='json')[source]

Return CIDs for a SMILES string.

Deprecated: use get_cids_by_smiles() instead. This method is kept for backward compatibility.

Parameters:
  • identifier (str) – SMILES string

  • frmt (str) – response format (default "json")

Returns:

dict with IdentifierList key containing CID list

get_description(identifier, namespace='cid', frmt='json')[source]

Return the description for a compound.

Parameters:
  • identifier – compound identifier

  • namespace (str) – identifier type (default "cid")

  • frmt (str) – response format (default "json")

Returns:

dict containing InformationList with description text

Example:

p.get_description(2244)
p.get_description("aspirin", namespace="name")
get_properties(identifier, namespace='cid', properties=None, frmt='json')[source]

Return computed properties for a compound.

Parameters:
  • identifier – compound identifier (e.g. CID 2244 or name "aspirin")

  • namespace (str) – identifier type – one of "cid", "name", "smiles", "inchikey" (default "cid")

  • properties – property name(s) to retrieve. Either a comma-separated string or a list of names from COMPOUND_PROPERTIES. Defaults to all properties when None.

  • frmt (str) – response format (default "json")

Returns:

dict containing PropertyTable with the requested properties

Example:

p.get_properties(2244, properties=["MolecularFormula", "MolecularWeight"])
p.get_properties("aspirin", namespace="name", properties="InChIKey,XLogP")
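Because properties may be given either as a comma-separated string or as a list, it can help to normalise and validate the names before sending a request. The sketch below is a hypothetical helper, not library code; KNOWN is a deliberately short subset of COMPOUND_PROPERTIES used to keep the example small.

```python
# Short subset of COMPOUND_PROPERTIES, for illustration only.
KNOWN = ["MolecularFormula", "MolecularWeight", "InChIKey", "XLogP"]


def normalize_properties(properties=None, known=KNOWN):
    """Return a comma-separated property string, rejecting unknown names."""
    if properties is None:
        names = list(known)  # mirror the "all properties" default
    elif isinstance(properties, str):
        names = [p.strip() for p in properties.split(",") if p.strip()]
    else:
        names = list(properties)
    unknown = [n for n in names if n not in known]
    if unknown:
        raise ValueError(f"unknown properties: {unknown}")
    return ",".join(names)


print(normalize_properties(["MolecularFormula", "MolecularWeight"]))
print(normalize_properties("InChIKey, XLogP"))
```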
get_sids_by_aid(aid, frmt='json')[source]

Return SIDs tested in a given assay.

Parameters:
  • aid – PubChem assay identifier

  • frmt (str) – response format (default "json")

Returns:

dict with IdentifierList key containing SID list

Example:

p.get_sids_by_aid(1)
get_sids_by_cid(cid, frmt='json')[source]

Return substance IDs (SIDs) deposited for a given compound CID.

Parameters:
  • cid – PubChem compound identifier

  • frmt (str) – response format (default "json")

Returns:

dict with IdentifierList key containing SID list

Example:

p.get_sids_by_cid(2244)
get_substance_by_sid(sid, frmt='json')[source]

Return the full substance record for a SID.

Parameters:
  • sid – PubChem substance identifier

  • frmt (str) – response format (default "json")

Returns:

full substance record

Example:

p.get_substance_by_sid(100)
get_synonyms(identifier, namespace='cid', frmt='json')[source]

Return synonyms for a compound.

Parameters:
  • identifier – compound identifier

  • namespace (str) – identifier type (default "cid")

  • frmt (str) – response format (default "json")

Returns:

dict containing InformationList with synonym lists

Example:

p.get_synonyms(2244)
get_xrefs(identifier, xref_type, namespace='cid', frmt='json')[source]

Return cross-references for a compound.

Parameters:
  • identifier – compound identifier

  • xref_type (str) – cross-reference type, one of "RegistryID", "RN", "PubMedID", "MMDBID", "PatentID", "WikipediaURL", "GeneID", etc. See XREF_TYPES for the full list.

  • namespace (str) – identifier type (default "cid")

  • frmt (str) – response format (default "json")

Returns:

dict containing cross-reference list

Example:

p.get_xrefs(2244, "PatentID")
XREF_TYPES = ['RegistryID', 'RN', 'PubMedID', 'MMDBID', 'DBURL', 'SBURL', 'AmericanChemicalSocietyID', 'WikipediaURL', 'PatentID', 'GeneID', 'ProteinGI', 'TaxonomyID', 'MIMID', 'BioSystemID', 'ReactomeID', 'BioCycID']

Valid cross-reference types for the /xrefs/ endpoint of the PUG REST API.

8.32. Rhea

Interface to the Rhea web services

class Rhea(verbose=True, cache=False)[source]

Interface to the Rhea service

You can search by compound name, ChEBI ID, reaction ID, cross reference (e.g., EC number) or citation (author name, title, abstract text, publication ID). You can use double quotes - to match an exact phrase - and the following wildcards:

  • ? (question mark = one character),

  • * (asterisk = several characters).

Searching for caffe* will find reactions with participants such as caffeine, trans-caffeic acid or caffeoyl-CoA:

from bioservices import Rhea
r = Rhea()
response = r.search("caffe*")

Searching for a?e?o* will find reactions with participants such as acetoin, acetone or adenosine:

from bioservices import Rhea
r = Rhea()
response = r.search("a?e?o*")
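The wildcard semantics can be explored offline with the standard library. The sketch below only approximates the behaviour described above: it assumes matching is case-insensitive and applied to individual words of a participant name, which is an assumption made to reproduce the documented examples and may differ from what the Rhea server actually does.

```python
import re
from fnmatch import fnmatchcase


def wildcard_match(pattern, name):
    """True if any word of `name` matches `pattern` (? = one char, * = several)."""
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", name.lower()) if t]
    return any(fnmatchcase(tok, pattern.lower()) for tok in tokens)


names = ["caffeine", "trans-caffeic acid", "caffeoyl-CoA",
         "acetoin", "acetone", "adenosine"]
caffe_hits = [n for n in names if wildcard_match("caffe*", n)]
aeo_hits = [n for n in names if wildcard_match("a?e?o*", n)]
print(caffe_hits)
print(aeo_hits)
```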

The search() and query() methods accept a list of valid columns. By default all columns are used but you can restrict to only a few. Here is the description of the columns:

rhea-id :   reaction identifier (with prefix RHEA)
equation :  textual description of the reaction equation
chebi :     comma-separated list of ChEBI names used as reaction participants
chebi-id :  comma-separated list of ChEBI identifiers used as reaction participants
ec :        comma-separated list of EC numbers (with prefix EC)
uniprot :   number of proteins (UniProtKB entries) annotated with the Rhea reaction
pubmed :    comma-separated list of PubMed identifiers (without prefix)

and 5 cross-references:

reaction-xref(EcoCyc)
reaction-xref(MetaCyc)
reaction-xref(KEGG)
reaction-xref(Reactome)
reaction-xref(M-CSA)

Rhea constructor

Parameters:
  • verbose (bool) – set to True to get informative messages (default True)

  • cache (bool) – set to True to enable HTTP caching

>>> from bioservices import Rhea
>>> r = Rhea()
get_metabolites(rxn_id)[source]

Given a Rhea (http://www.rhea-db.org/) reaction id, returns its participant metabolites as a dict: {metabolite: stoichiometry}, e.g. '2 H + 1 O2 = 1 H2O' would be represented as {'H': -2, 'O2': -1, 'H2O': 1}.

Parameters:

rxn_id (str) – Rhea reaction ID (e.g., "RHEA:10661")

Returns:

dict with "reactants" and "products" keys, each a list of metabolite names
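The signed stoichiometry convention (negative for the left-hand side, positive for the right-hand side) can be illustrated with a small parser. This is a sketch under strong assumptions, not the library's implementation: parse_equation is a hypothetical helper that expects every term to carry an explicit integer coefficient and the two sides to be separated by a single "=".

```python
def parse_equation(equation):
    """Map each participant to a signed coefficient (left < 0, right > 0)."""
    left, right = equation.split("=")
    stoich = {}
    for side, sign in ((left, -1), (right, 1)):
        for term in side.split(" + "):
            coeff, name = term.strip().split(" ", 1)
            stoich[name] = stoich.get(name, 0) + sign * int(coeff)
    return stoich


result = parse_equation("2 H + 1 O2 = 1 H2O")
print(result)
```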

query(query, columns=None, frmt='tsv', limit=None)[source]

Query Rhea and retrieve the matching reactions in the given format.

Parameters:
  • query (str) – the query string (e.g., "uniprot:*", or "" for all entries)

  • columns (str) – comma-separated column names to include in the result. Defaults to all columns (see _valid_columns).

  • frmt (str) – result format (default "tsv"; only TSV is currently supported)

  • limit (int) – maximum number of results to retrieve

Returns:

a pandas DataFrame if pandas is installed, otherwise the raw TSV string

Retrieve Rhea reaction identifiers and equation text:

r.query("", columns="rhea-id,equation", limit=10)

Retrieve Rhea reactions with enzymes curated in UniProtKB (only first 10 entries):

r.query("uniprot:*", columns="rhea-id,equation", limit=10)

To retrieve a specific entry:

df = r.get_entry("rhea:10661")

Changed in version 1.8.0: the entry() method was renamed to query(), and a separate format argument is no longer required. The format must be given in the entry name, e.g. query("10281.rxn") instead of entry(10281, format="rxn"). The frmt option now refers to the result format.

search(query, columns=None, limit=None, frmt='tsv')[source]

Search Rhea (mimics the search on https://www.rhea-db.org/)

Parameters:
  • query (str) – the search term (e.g., "caffeine", "caffe*")

  • columns (str) – comma-separated column names to include in the result. Defaults to all columns (see _valid_columns).

  • limit (int) – maximum number of results to return

  • frmt (str) – result format (default "tsv")

Returns:

a pandas DataFrame if pandas is installed, otherwise the raw TSV string

>>> r = Rhea()
>>> df = r.search("caffeine")
>>> df = r.search("caffeine", columns='rhea-id,equation')
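When pandas is not installed and the raw TSV string is returned, the standard csv module is enough to turn it into records. The sketch below is illustrative: tsv_to_records is a hypothetical helper, and the sample text with its column headers is made up rather than copied from a real response.

```python
import csv
import io


def tsv_to_records(text):
    """Parse a TSV string with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Hypothetical TSV payload; real column headers may differ.
sample = "Reaction identifier\tEquation\nRHEA:10000\tA + B = C\n"
records = tsv_to_records(sample)
print(records[0]["Equation"])
```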

8.33. Reactome

Interface to the Reactome web services

class Reactome(verbose=True, cache=False)[source]

Interface to the Reactome knowledgebase.

Reactome is an open-source, manually curated and peer-reviewed pathway database. This class wraps the Reactome ContentService REST API.

>>> from bioservices import Reactome
>>> r = Reactome()
>>> r.get_species_main()

Todo

interactors, orthology, participants, person, query, references, schema

Constructor

Parameters:
  • verbose (bool) – set to False to prevent informative messages

  • cache (bool) – set to True to enable HTTP caching

get_complex_subunits(identifier, excludeStructuresSpecifies=False)[source]

A list with the entities contained in a given complex

Retrieves the list of subunits that constitute any given complex. If the complex comprises other complexes, this method recursively traverses the content, returning each contained PhysicalEntity. Contained complexes and entity sets can be excluded by setting the optional excludeStructuresSpecifies parameter to True.

Parameters:
  • identifier (str) – a Reactome stable identifier for the complex

  • excludeStructuresSpecifies (bool) – if True, exclude contained complexes and entity sets from the response

r.get_complex_subunits("R-HSA-5674003")
get_complexes(resources, identifier)[source]

A list of complexes containing the pair (identifier, resource)

Retrieves the list of complexes that contain a given (identifier, resource). The method deconstructs the complexes into all its participants to do so.

Parameters:
  • resources (str) – the resource of the identifier (e.g., "UniProt")

  • identifier (str) – the identifier for which complexes are requested

r.get_complexes("UniProt", "P43403")
get_discover(identifier)[source]

The schema.org for an Event in Reactome knowledgebase

For each event (reaction or pathway) this method generates a json file representing the dataset object as defined by schema.org (http). This is mainly used by search engines in order to index the data

Parameters:

identifier (str) – a Reactome stable identifier (e.g., "R-HSA-446203")

Returns:

schema.org JSON-LD representation of the event

r.get_discover("R-HSA-446203")
get_diseases()[source]

List of disease objects.

get_diseases_doid()[source]

retrieves the list of disease DOIDs annotated in Reactome

Returns:

dictionary with DOID contained in the values()

get_entity_componentOf(identifier)[source]

A list of larger structures containing the entity

Retrieves the list of structures (Complexes and Sets) that include the given entity as their component. It should be mentioned that the list includes only simplified entries (type, names, ids) and not full information about each item.

r.get_entity_componentOf("R-HSA-199420")
get_entity_otherForms(identifier)[source]

All other forms of PhysicalEntity

Retrieves a list containing all other forms of the given PhysicalEntity. These other forms are PhysicalEntities that share the same ReferenceEntity identifier, e.g. PTEN H93R[R-HSA-2318524] and PTEN C124R[R-HSA-2317439] are two forms of PTEN.

r.get_entity_otherForms("R-HSA-199420")
get_event_ancestors(identifier)[source]

The ancestors of a given event

The Reactome definition of events includes pathways and reactions. Although events are organised in a hierarchical structure, a single event can be in more than one location, i.e. a reaction can take part in different pathways while, in the same way, a sub-pathway can take part in many pathways. Therefore, this method retrieves a list of all possible paths from the requested event to the top level pathway(s).

Parameters:

identifier (str) – a Reactome stable identifier for the event

r.get_event_ancestors("R-HSA-5673001")
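The idea of "all possible paths from the requested event to the top level pathway(s)" can be sketched on a toy hierarchy. Everything here is hypothetical: the parents mapping, the event labels, and the helper paths_to_top are made up for illustration and say nothing about how the service stores or returns its data. The sketch also assumes the hierarchy has no cycles.

```python
def paths_to_top(event, parents):
    """Return every path from `event` up to an event that has no parent."""
    ups = parents.get(event, [])
    if not ups:
        return [[event]]
    paths = []
    for parent in ups:
        for tail in paths_to_top(parent, parents):
            paths.append([event] + tail)
    return paths


# Toy hierarchy: reaction R1 belongs to two sub-pathways,
# and sub-pathway P2 belongs to two top-level pathways.
parents = {"R1": ["P1", "P2"], "P1": ["TOP_A"], "P2": ["TOP_A", "TOP_B"]}
paths = paths_to_top("R1", parents)
for path in paths:
    print(" -> ".join(path))
```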
get_eventsHierarchy(species)[source]

The full event hierarchy for a given species

Events (pathways and reactions) in Reactome are organised in a hierarchical structure for every species. By following all ‘hasEvent’ relationships, this method retrieves the full event hierarchy for any given species. The result is a list of tree structures, one for each TopLevelPathway. Every event in these trees is represented by a PathwayBrowserNode. The latter contains the stable identifier, the name, the species, the url, the type, and the diagram of the particular event.

Parameters:

species – taxonomy ID (e.g., 9606) or species name (e.g., "Homo sapiens")

r.get_eventsHierarchy(9606)
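Since the result is a list of trees, a short recursive generator is a convenient way to visit every event. The sketch below assumes each node is a dict with an stId key and an optional children list; the children key and the sample identifiers are assumptions for illustration and should be checked against a real response.

```python
def iter_events(nodes):
    """Yield every node of a list of trees, depth-first."""
    for node in nodes:
        yield node
        yield from iter_events(node.get("children", []))


# Made-up hierarchy with placeholder identifiers.
hierarchy = [
    {"stId": "R-HSA-0000001", "name": "Top A", "children": [
        {"stId": "R-HSA-0000002", "name": "Sub A1", "children": [
            {"stId": "R-HSA-0000003", "name": "Reaction x"},
        ]},
    ]},
    {"stId": "R-HSA-0000004", "name": "Top B"},
]
ids = [node["stId"] for node in iter_events(hierarchy)]
print(ids)
```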
get_exporter_diagram(identifier, ext='png', quality=5, diagramProfile='Modern', analysisProfile='Standard', filename=None)[source]

Export a given pathway diagram to an image file

This method accepts identifiers for Event class instances. When a diagrammed pathway is provided, the diagram is exported to the specified format. When a subpathway is provided, the diagram for the parent is exported and the events that are part of the subpathways are selected. When a reaction is provided, the diagram containing the reaction is exported and the reaction is selected.

Parameters:
  • identifier (str) – event identifier (pathway with diagram, subpathway, or reaction)

  • ext (str) – file extension / image format — one of "png", "jpeg", "jpg", "svg", "gif"

  • quality (int) – result image quality between 1 and 10 (default 5)

  • diagramProfile (str) – diagram color profile ("Modern" or "Standard")

  • analysisProfile (str) – analysis color profile

  • filename (str) – if given, save the result to this file path

Returns:

raw image data if filename is not set; None after saving otherwise
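The parameter constraints and the return convention described above can be sketched as two small helpers. These are hypothetical functions written for illustration, not the library's code: check_diagram_options enforces the documented extension list and the 1 to 10 quality range, and save_or_return mimics the "raw data or None after saving" behaviour.

```python
VALID_EXT = ("png", "jpeg", "jpg", "svg", "gif")


def check_diagram_options(ext="png", quality=5):
    """Validate the image format and quality before making a request."""
    if ext not in VALID_EXT:
        raise ValueError(f"ext must be one of {VALID_EXT}, got {ext!r}")
    if not 1 <= quality <= 10:
        raise ValueError(f"quality must be between 1 and 10, got {quality}")
    return ext, quality


def save_or_return(data, filename=None):
    """Return the raw bytes, or write them to `filename` and return None."""
    if filename is None:
        return data
    with open(filename, "wb") as fh:
        fh.write(data)
    return None


print(check_diagram_options("svg", 10))
print(save_or_return(b"fake image bytes"))
```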

get_exporter_fireworks()[source]
get_exporter_reaction()[source]
get_exporter_sbml(identifier)[source]

Export given Pathway to SBML

Parameters:

identifier (str) – DbId or StId of the requested pathway

r.get_exporter_sbml("R-HSA-68616")
get_interactors_psicquic_molecule_details()[source]

Retrieve clustered interaction, sorted by score, of a given accession by resource.

get_interactors_psicquic_molecule_summary()[source]

Retrieve a summary of a given accession by resource

get_interactors_psicquic_resources()[source]

Retrieve a list of all Psicquic Registries services

get_interactors_static_molecule_details()[source]

Retrieve detailed interaction information for a given accession

get_interactors_static_molecule_pathways()[source]

Retrieve a list of lower level pathways where the interacting molecules can be found

get_interactors_static_molecule_summary()[source]

Retrieve a summary of a given accession

get_mapping_identifier_pathways(resource, identifier)[source]

Retrieve pathways containing a mapped identifier.

Parameters:
  • resource (str) – the external resource (e.g., "UniProt")

  • identifier (str) – the identifier to map (e.g., "P43403")

Returns:

list of pathway objects

get_mapping_identifier_reactions(resource, identifier)[source]

Retrieve reactions containing a mapped identifier.

Parameters:
  • resource (str) – the external resource (e.g., "UniProt")

  • identifier (str) – the identifier to map (e.g., "P43403")

Returns:

list of reaction objects

get_pathway_containedEvents(identifier)[source]

All the events contained in the given event

Events are the building blocks used in Reactome to represent all biological processes, and they include pathways and reactions. Typically, an event can contain other events. For example, a pathway can contain smaller pathways and reactions. This method recursively retrieves all the events contained in any given event.

res = r.get_pathway_containedEvents("R-HSA-5673001")
get_pathway_containedEvents_by_attribute(identifier, attribute)[source]

A single property for each event contained in the given event

Events are the building blocks used in Reactome to represent all biological processes, and they include pathways and reactions. Typically, an event can contain other events. For example, a pathway can contain smaller pathways (subpathways) and reactions. This method recursively retrieves a single attribute for each of the events contained in the given event.

Parameters:
  • identifier (str) – the event for which the contained events are requested

  • attribute (str) – attribute to filter (e.g., "stId")

r.get_pathway_containedEvents_by_attribute("R-HSA-5673001", "stId")
get_pathways_low_diagram_entity(identifier)[source]

A list of lower level pathways with diagram containing a given entity or event

This method traverses the event hierarchy and retrieves the list of all lower level pathways that have a diagram and contain the given PhysicalEntity or Event.

Parameters:

identifier (str) – the entity that has to be present in the pathways

r.get_pathways_low_diagram_entity("R-HSA-199420")
get_pathways_low_diagram_entity_allForms(identifier)[source]

A list of lower level pathways with diagram containing any form of a given entity.

Parameters:

identifier (str) – a Reactome stable identifier or accession

r.get_pathways_low_diagram_entity_allForms("R-HSA-199420")
get_pathways_low_entity(identifier)[source]

A list of lower level pathways containing a given entity or event

This method traverses the event hierarchy and retrieves the list of all lower level pathways that contain the given PhysicalEntity or Event.

r.get_pathways_low_entity("R-HSA-199420")
get_pathways_low_entity_allForms(identifier)[source]

A list of lower level pathways containing any form of a given entity

This method traverses the event hierarchy and retrieves the list of all lower level pathways that contain the given PhysicalEntity in any of its variant forms. These variant forms include for example different post-translationally modified versions of a single protein, or the same chemical in different compartments.

r.get_pathways_low_entity_allForms("R-HSA-199420")
get_pathways_top(species)[source]

Retrieve the list of top-level pathways for a given species.

Parameters:

species – taxonomy ID (e.g., 9606) or species name (e.g., "Homo sapiens")

Returns:

list of top-level pathway objects

get_references(identifier)[source]

All referenceEntities for a given identifier

Retrieves a list containing all the reference entities for a given identifier.

r.get_references(15377)
get_species_all()[source]

The list of all species in Reactome

get_species_main()[source]

The list of main species in Reactome

r.get_species_main()
property name
search_facet()[source]

A list of facets corresponding to the whole Reactome search data

This method retrieves faceting information on the whole Reactome search data.

search_facet_query(query)[source]

A list of facets corresponding to a specific query

This method retrieves faceting information on a specific query

search_query(query)[source]

Queries Solr against the Reactome knowledgebase

This method performs a Solr query on the Reactome knowledgebase. Results can be provided in a paginated format.

search_spellcheck(query)[source]

Spell-check suggestions for a given query

This method retrieves a list of spell-check suggestions for a given search term.

search_suggest(query)[source]

Autosuggestions for a given query.

This method retrieves a list of suggestions for a given search term.

Parameters:

query (str) – search term (e.g., "apopt")

Returns:

list of suggestion strings

>>> r.search_suggest("apopt")
['apoptosis', 'apoptosome', 'apoptosome-mediated', 'apoptotic']
property version

8.34. Readseq

This module provides the Seqret class, an interface to the Seqret web service.

class Seqret(verbose=True)[source]

Interface to the Seqret service

>>> from bioservices import Seqret
>>> s = Seqret()

The ReadSeq service was replaced by the Seqret service (2015).

Changed in version 0.15.

Constructor

Parameters:

verbose (bool) – set to True to get informative messages

get_parameter_details(parameterId)[source]

Get details of a specific parameter.

Parameters:

parameterId (str) – identifier/name of the parameter to fetch details of.

Returns:

a data structure describing the parameter and its values.

s = Seqret()
print(s.get_parameter_details("stype"))
get_parameters()[source]

Get a list of the parameter names.

Returns:

a list of strings giving the names of the parameters.

get_result(jobid, result_type='out')[source]

Get the result of a job of the specified type.

Parameters:
  • jobid (str) – job identifier returned by run().

  • result_type (str) – result type to retrieve (default "out"). See get_result_types() for available types.

Returns:

the result as a string, or None if the job is not finished

get_result_types(jobid)[source]

Get the available result types for a finished job.

Parameters:

jobid (str) – job identifier.

Returns:

a list of wsResultType data structures describing the available result types.

get_status(jobid=None)[source]

Get the status of a submitted job.

Parameters:

jobid (str) – job identifier.

Returns:

string containing the status.

The values for the status are:

  • RUNNING: the job is currently being processed.

  • FINISHED: job has finished, and the results can then be retrieved.

  • ERROR: an error occurred attempting to get the job status.

  • FAILURE: the job failed.

  • NOT_FOUND: the job cannot be found.
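
A job is usually polled until its status is no longer RUNNING. The sketch below shows one possible polling loop, assuming only the status values listed above. The helper name wait_for_job and its parameters are hypothetical, and a stand-in callable replaces get_status() so that the example runs offline.

```python
import time


def wait_for_job(get_status, jobid, interval=1.0, max_wait=60.0):
    """Poll get_status(jobid) until the job leaves the RUNNING state."""
    waited = 0.0
    while True:
        status = get_status(jobid)
        if status != "RUNNING":
            return status
        if waited >= max_wait:
            raise TimeoutError(f"job {jobid} still running after {max_wait}s")
        time.sleep(interval)
        waited += interval


# Stand-in for Seqret.get_status: reports RUNNING twice, then FINISHED.
_answers = iter(["RUNNING", "RUNNING", "FINISHED"])
final = wait_for_job(lambda jobid: next(_answers), "toy-job", interval=0.01)
```

With a real service you would pass s.get_status and the identifier returned by run(), and only call get_result() when the returned status is FINISHED.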

property parameters

Get list of parameter names

run(email, title, **kargs)[source]

Submit a job to the service.

Parameters:
  • email (str) – user e-mail address.

  • title (str) – job title.

  • kargs – additional tool parameters (e.g., sequence, stype, inputformat, outputformat). See get_parameter_details().

Returns:

string containing the job identifier (jobId).

Deprecated format values from the old ReadSeq service:

Format Name     Value
Auto-detected   0
EMBL            4
GenBank         2
Fasta(Pearson)  8
Clustal/ALN     22
ACEDB           25
BLAST           20
DNAStrider      6
FlatFeat/FFF    23
GCG             5
GFF             24
IG/Stanford     1
MSF             15
NBRF            3
PAUP/NEXUS      17
Phylip(Phylip4) 12
Phylip3.2       11
PIR/CODATA      14
Plain/Raw       13
SCF             21
XML             19

One additional value is available as an output format only:

Pretty          18

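from bioservices import Seqret
s = Seqret()
# fasta is assumed to hold the input sequence as a FASTA-formatted string;
# 8 (Fasta) and 2 (GenBank) are values from the deprecated table above
jobid = s.run("cokelaer@test.co.uk", "test", sequence=fasta, inputformat=8,
    outputformat=2)
genbank = s.get_result(jobid)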

8.35. STRING

Interface to the STRING protein interaction database web service.

class STRING(verbose=True, cache=False)[source]

Interface to the STRING database.

STRING is a database of known and predicted protein-protein interactions. It covers both direct (physical) and indirect (functional) associations derived from genomic context, high-throughput experiments, co-expression, and the literature.

>>> from bioservices import STRING
>>> s = STRING()
>>> interactions = s.get_interactions("ZAP70", species=9606)
>>> partners = s.get_interaction_partners("ZAP70", species=9606)

Constructor

Parameters:
  • verbose (bool) – set to False to prevent informative messages

  • cache (bool) – set to True to enable caching of requests

get_enrichment(identifiers, species=None, background_string_identifiers=None, caller_identity=None)[source]

Perform functional enrichment analysis on a set of proteins.

Tests whether the input proteins are significantly enriched for Gene Ontology (GO) terms, KEGG pathways, Pfam domains, InterPro signatures, and other annotation categories.

Parameters:
  • identifiers – gene/protein name(s). Separate multiple identifiers with %0d or provide a list.

  • species (int) – NCBI taxonomy ID (e.g. 9606 for human). Required when identifiers are gene symbols.

  • background_string_identifiers – optional set of proteins to use as the statistical background. Defaults to the entire proteome.

  • caller_identity (str) – optional application name for tracking.

Returns:

list of dicts, each representing an enriched annotation term with fields such as category, term, description, number_of_genes, p_value, and fdr.

>>> from bioservices import STRING
>>> s = STRING()
>>> res = s.get_enrichment(["ZAP70", "LCK", "CD3E", "CD3D"], species=9606)
>>> len(res) > 0
True
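
Since the result is a list of dicts, it can be filtered with plain Python. The sketch below keeps the terms under a false discovery rate threshold. It uses made-up records that only borrow the field names listed above; the helper name significant_terms and the 0.05 default are assumptions.

```python
# Hypothetical records using the field names listed above; values are made up.
records = [
    {"category": "Process", "term": "TERM:A", "description": "toy term A",
     "number_of_genes": 4, "p_value": 1e-6, "fdr": 1e-4},
    {"category": "KEGG", "term": "TERM:B", "description": "toy term B",
     "number_of_genes": 2, "p_value": 0.03, "fdr": 0.20},
]


def significant_terms(records, max_fdr=0.05):
    """Keep records whose FDR is at or below max_fdr, most significant first."""
    kept = [r for r in records if float(r["fdr"]) <= max_fdr]
    return sorted(kept, key=lambda r: float(r["fdr"]))


top = significant_terms(records)
```

The float() conversion is a precaution in case the service returns numbers as strings.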
get_functional_annotation(identifiers, species=None, allow_pubmed=0, caller_identity=None)[source]

Get functional annotations for a set of proteins.

Returns GO terms, KEGG pathway membership, and other annotations for the queried proteins.

Parameters:
  • identifiers – gene/protein name(s). Separate multiple identifiers with %0d or provide a list.

  • species (int) – NCBI taxonomy ID (e.g. 9606 for human).

  • allow_pubmed (int) – include PubMed references (0 or 1, default: 0).

  • caller_identity (str) – optional application name for tracking.

Returns:

list of functional annotation records.

Return type:

list

>>> from bioservices import STRING
>>> s = STRING()
>>> res = s.get_functional_annotation("TP53", species=9606)
get_homology(identifiers, species=None, species_b=None, required_score=None, caller_identity=None)[source]

Retrieve homology data for a set of proteins.

Returns homologous protein pairs between the query species and species_b (or within the query species if species_b is not given).

Parameters:
  • identifiers – gene/protein name(s). Separate multiple identifiers with %0d or provide a list.

  • species (int) – NCBI taxonomy ID of the query species.

  • species_b (int) – NCBI taxonomy ID of the second species. If None, homologs are retrieved within species.

  • required_score (int) – minimum combined interaction score (0–1000).

  • caller_identity (str) – optional application name for tracking.

Returns:

list of dicts describing homology relationships.

>>> from bioservices import STRING
>>> s = STRING()
>>> res = s.get_homology("ZAP70", species=9606, species_b=10090)
get_interaction_partners(identifiers, species=None, required_score=None, limit=None, network_type='functional', caller_identity=None)[source]

Retrieve interaction partners for the given proteins.

Returns proteins that interact with the query proteins. Compared to get_interactions(), this method returns partners even if they are not in the original query set.

Parameters:
  • identifiers – gene/protein name(s). Separate multiple identifiers with %0d or provide a list.

  • species (int) – NCBI taxonomy ID (e.g. 9606 for human).

  • required_score (int) – minimum combined interaction score (0–1000).

  • limit (int) – maximum number of interaction partners to return per input protein.

  • network_type (str) – either "functional" (default) or "physical".

  • caller_identity (str) – optional application name for tracking.

Returns:

list of dicts, each representing one interaction.

>>> from bioservices import STRING
>>> s = STRING()
>>> partners = s.get_interaction_partners("ZAP70", species=9606, limit=5)
>>> len(partners) > 0
True
get_interactions(identifiers, species=None, required_score=None, network_type='functional', add_nodes=0, show_query_node_labels=0, caller_identity=None)[source]

Retrieve protein-protein interactions for the given identifiers.

Returns the STRING interaction network for a set of proteins. Each interaction record includes scores for different evidence channels (neighbourhood, co-occurrence, co-expression, experimental, database, text-mining) as well as a combined interaction score.

Parameters:
  • identifiers – gene/protein name(s). Use %0d as separator for multiple identifiers, or provide a list.

  • species (int) – NCBI taxonomy ID (e.g. 9606 for human). Required when identifiers are gene symbols.

  • required_score (int) – minimum combined interaction score (0–1000). Interactions below this threshold are excluded.

  • network_type (str) – either "functional" (default) or "physical".

  • add_nodes (int) – number of additional white-list nodes to add to the network.

  • show_query_node_labels (int) – set to 1 to display labels for input nodes even when they are not directly connected.

  • caller_identity (str) – optional application name for tracking.

Returns:

list of dicts, each representing one interaction with scores.

>>> from bioservices import STRING
>>> s = STRING()
>>> res = s.get_interactions("ZAP70", species=9606)
>>> len(res) > 0
True
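
The %0d separator mentioned in the parameter description is the percent-encoded carriage return. The sketch below shows how a Python list could be turned into such a string with the standard library. The helper name join_identifiers is hypothetical, and whether bioservices performs the conversion internally in this exact way is an assumption.

```python
from urllib.parse import quote


def join_identifiers(identifiers):
    """Return one URL-encoded string with identifiers separated by %0D."""
    if isinstance(identifiers, str):
        identifiers = [identifiers]
    # A carriage return is percent-encoded as %0D; percent-encoding is
    # case-insensitive, so %0D and %0d are equivalent.
    return quote("\r".join(identifiers))


encoded = join_identifiers(["ZAP70", "LCK", "CD3E"])
```

Passing a list to the methods of this class avoids having to build the separator by hand.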
get_network(identifiers, species=None, required_score=None, network_type='functional', add_nodes=0, show_query_node_labels=0, caller_identity=None)[source]

Retrieve protein-protein interactions for the given identifiers.

This is an alias for get_interactions().

Parameters:
  • identifiers – gene/protein name(s). Use %0d as separator for multiple identifiers, or provide a list.

  • species (int) – NCBI taxonomy ID (e.g. 9606 for human).

  • required_score (int) – minimum combined interaction score (0–1000).

  • network_type (str) – either "functional" (default) or "physical".

  • add_nodes (int) – number of additional white-list nodes to add to the network.

  • show_query_node_labels (int) – set to 1 to display labels for input nodes.

  • caller_identity (str) – optional application name for tracking.

Returns:

list of dicts, each representing one interaction with scores.

>>> from bioservices import STRING
>>> s = STRING()
>>> res = s.get_network(["TP53", "BRCA1"], species=9606)
get_ppi_enrichment(identifiers, species=None, required_score=None, background_string_identifiers=None, caller_identity=None)[source]

Test whether the input proteins are enriched in interactions.

Returns a single record indicating the observed number of interactions, expected number, p-value, and the average interaction score for the input protein set.

Parameters:
  • identifiers – gene/protein name(s). Separate multiple identifiers with %0d or provide a list.

  • species (int) – NCBI taxonomy ID (e.g. 9606 for human).

  • required_score (int) – minimum combined interaction score (0–1000). If None, uses STRING default.

  • background_string_identifiers – optional background gene set for enrichment calculation.

  • caller_identity (str) – optional application name for tracking.

Returns:

dict with keys number_of_nodes, number_of_edges, average_node_degree, local_clustering_coefficient, expected_number_of_edges, and p_value.

>>> from bioservices import STRING
>>> s = STRING()
>>> res = s.get_ppi_enrichment(["ZAP70", "LCK", "CD3E"], species=9606)
>>> "p_value" in res
True
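
One way to read the returned record is to compare the observed and expected edge counts. The sketch below does this on a made-up dictionary that only borrows the key names listed above; the helper name summarize_ppi and the 0.05 threshold are assumptions.

```python
# Hypothetical result with the keys listed above; the numbers are made up.
res = {
    "number_of_nodes": 3,
    "number_of_edges": 3,
    "average_node_degree": 2.0,
    "local_clustering_coefficient": 1.0,
    "expected_number_of_edges": 1,
    "p_value": 0.004,
}


def summarize_ppi(res, alpha=0.05):
    """Compare observed and expected edge counts and flag significance."""
    observed = float(res["number_of_edges"])
    expected = float(res["expected_number_of_edges"])
    return {
        "extra_edges": observed - expected,
        "significant": float(res["p_value"]) < alpha,
    }


summary = summarize_ppi(res)
```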
get_string_ids(identifiers, species=None, limit=1, echo_query=True, caller_identity=None)[source]

Resolve identifiers to STRING identifiers.

Maps gene/protein names or other identifiers to their STRING IDs.

Parameters:
  • identifiers – identifier(s) to resolve. Multiple identifiers should be separated by %0d or provided as a list.

  • species (int) – NCBI taxonomy ID. For example, 9606 for Homo sapiens. If None, STRING will search across all species.

  • limit (int) – maximum number of results per input identifier. Default is 1 (best match).

  • echo_query (bool) – if True, include the query identifier in the response.

  • caller_identity (str) – optional application name for tracking.

Returns:

list of dicts with STRING identifier mappings.

>>> from bioservices import STRING
>>> s = STRING()
>>> res = s.get_string_ids("ZAP70", species=9606)
>>> res[0]["stringId"]
'9606.ENSP00000379990'
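
The identifier in the output above appears to consist of the NCBI taxonomy ID, a dot, and a protein identifier. Assuming that layout holds for the identifiers you work with, a minimal sketch to split it could look like this (the helper name split_string_id is hypothetical):

```python
def split_string_id(string_id):
    """Split 'taxon.protein' into (int taxonomy ID, protein identifier)."""
    taxon, protein = string_id.split(".", 1)
    return int(taxon), protein


taxon, protein = split_string_id("9606.ENSP00000379990")
```

Splitting on the first dot only keeps any further dots inside the protein identifier intact.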
get_version()[source]

Return the current STRING API version information.

Returns:

dict with version details.

>>> from bioservices import STRING
>>> s = STRING()
>>> ver = s.get_version()
>>> "string_version" in ver
True

8.36. UniChem

This module provides the UniChem class, an interface to the UniChem web service.

class UniChem(verbose=False, cache=False)[source]

Interface to the UniChem service

>>> from bioservices import UniChem
>>> u = UniChem()

UniChem covers many sources, such as ChEMBL and ChEBI, and you will probably need their identifiers. You can get all the information about a source using these methods:

# Get information about a source
u.get_source_info_by_name('chembl')
u.get_source_info_by_id(10)
u.get_id_from_name('chembl')
u.get_all_src_ids()

For developers, all of this information is also available in the source_ids dictionary.

The first important method provided by the UniChem API is get_compounds(). For example, you can request all compounds related to the CHEMBL12 identifier from ChEMBL using:

res = u.get_compounds('CHEMBL12', 'chembl')
compounds = res['compounds'][0]

Note that the second argument is ‘chembl’ and that names are case-sensitive. All names are stored in source_ids together with their identifiers.

You can also use get_id_from_name() and get_name_from_id() if needed.

Legacy methods are available:

  • get_compound_ids_from_src_id –> use get_compounds()

  • get_src_compound_ids_from_inchikey –> replaced by get_compounds()

  • get_all_src_ids() –> uses the new API

  • get_src_compound_ids_all_from_inchikey –> get_sources_by_inchikey()

  • get_verbose_src_compound_ids_from_inchikey –> get_sources_by_inchikey_verbose()

  • get_structure –> uses the new API get_compounds() and bioservices code

  • get_structure_all –> dropped

  • get_src_compound_id_url –> dropped; get_compounds() can be used instead

  • get_src_compound_ids_all_from_obsolete –> removed

  • get_src_compound_ids_from_src_compound_id –> removed; was obsolete

  • get_src_compound_ids_all_from_src_compound_id –> removed; was already obsolete

  • get_all_compound_ids_from_all_src_id –> removed; no longer in the API

  • get_mapping –> removed; no longer in the API

  • get_auxiliary_mappings –> removed; no longer in the API

Most old functions can be replaced by a syntax such as:

res = u.get_compounds('CHEMBL12', 'chembl')
res['compounds'][0]

Changed in version 1.9: dropped the XML parser.

Constructor UniChem

Parameters:

verbose (bool) – set to False to prevent informative messages

get_all_src_ids()[source]

Obtain all src_ids of sources available in UniChem

Returns:

list of ‘src_id’s.

u.get_all_src_ids()
get_compounds(compound, source_type)[source]

Get matched compounds information

Parameters:
  • compound (str) – InChI, InChIKey, Name, UCI or Compound Source ID

  • source_type (str) – uci, inchi, inchikey, sourceID (e.g. chembl)

  • sourceID (str) – ID for the source assigned in UniChem when the type is “sourceID”

Returns:

a list of matched compounds and their assigned sources

A legacy function allows you to retrieve a compound from its inchikey:

u.get_sources_by_inchikey('GZUITABIAKMVPG-UHFFFAOYSA-N')

However, the newer get_compounds() method does the same and is presumably faster:

res = u.get_compounds('GZUITABIAKMVPG-UHFFFAOYSA-N', 'inchikey')
res['compounds']

You can get the first element, from which inchi, sources, standardInchikey, uci can be extracted. The sources key contains all compound identifiers for each source:

res['compounds'][0]['uci']
res['compounds'][0]['sources']

There appears to always be a single element in res[‘compounds’], but since it is a list, you must access that first (unique) element using the [0] syntax.
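
Because the single-element behaviour is an observation rather than a guarantee, a defensive accessor may be useful. The sketch below runs on a made-up response that only borrows the key names mentioned above; the helper name first_compound and the content of sources are hypothetical.

```python
# Hypothetical response; only the key names mentioned above come from the text.
res = {
    "compounds": [
        {
            "inchi": "InChI=toy",
            "standardInchikey": "TOY-KEY",
            "uci": 1,
            "sources": [{"name": "chembl"}, {"name": "chebi"}],
        }
    ]
}


def first_compound(res):
    """Return the first entry of res['compounds'], or None if there is none."""
    compounds = res.get("compounds") or []
    return compounds[0] if compounds else None


compound = first_compound(res)
uci = compound["uci"]
n_sources = len(compound["sources"])
```

Returning None for an empty list avoids an IndexError when no compound matches.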

get_connectivity(compound, source_type)[source]

Fetch data from multiple sources for compounds that share connectivity with a given compound, identified by a source compound ID, InChI, InChIKey or UCI

Parameters:
  • compound (str) – InChI, InChIKey, Name, UCI or Compound Source ID

  • source_type (str) – uci, inchi, inchikey, sourceID (e.g. chembl)

The returned dictionary contains 5 keys:

  • response: service response (‘Success’ if everything is right)

  • searchedCompound: the summary in terms of inchi, standardInchikey and uci

  • sources: a dictionary with e.g. compoundID and name of the source.

    A ‘comparison’ dictionary is also provided.

  • totalCompounds: number of searchedCompound entries

  • totalSources: number of sources entries
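
A returned dictionary of this kind can be checked before use. The sketch below runs on a made-up dictionary that only borrows the five key names listed above; the inner content of searchedCompound and sources, and the helper name check_connectivity, are hypothetical.

```python
# Hypothetical dictionary with the five keys listed above; values are made up.
result = {
    "response": "Success",
    "searchedCompound": {"inchi": "InChI=toy", "standardInchikey": "TOY-KEY", "uci": 1},
    "sources": [{"name": "chembl"}, {"name": "chebi"}],
    "totalCompounds": 1,
    "totalSources": 2,
}


def check_connectivity(result):
    """Raise if the service did not report success; otherwise return counts."""
    if result.get("response") != "Success":
        raise ValueError(f"unexpected response: {result.get('response')!r}")
    return result["totalCompounds"], result["totalSources"]


counts = check_connectivity(result)
```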

get_id_from_name(name)[source]

Return the ID of a source given its name.

Parameters:

name (str) – a valid database name (e.g., chembl)

u.get_id_from_name("chembl")
get_images(uci, filename=None)[source]

Return / create compound image

Parameters:
  • uci (str) – the UCI of the compound

  • filename – optional file name to save the SVG+XML output

Returns:

the SVG+XML string

get_inchi_from_inchikey(inchikey)[source]

Get a list of inchis given a valid inchikey.

Parameters:

inchikey – InChI Key to search. Unlike the REST API, you can also provide a list.

Returns:

a list of InChIs matching the InChI Key provided. If the input is a list, a dictionary is returned whose keys are the input InChI Keys.

from bioservices import UniChem
u = UniChem()
res = u.get_inchi_from_inchikey("AAOVKJBEBIDNHE-UHFFFAOYSA-N")

Note

This is a legacy function. It was introduced in v1.9 after the UniChem API update.
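
The "string in, list out; list in, dictionary out" convention described above can be sketched as follows. The helper name query_one_or_many and the stand-in fake_fetch are hypothetical; the example does not contact UniChem and is not the library's implementation.

```python
def query_one_or_many(fetch, inchikey):
    """Call fetch once for a string, or once per item for a list."""
    if isinstance(inchikey, str):
        return fetch(inchikey)
    return {key: fetch(key) for key in inchikey}


# Stand-in lookup so the sketch runs offline; it does not contact UniChem.
def fake_fetch(key):
    return [f"inchi-for-{key}"]


single = query_one_or_many(fake_fetch, "KEY-A")
many = query_one_or_many(fake_fetch, ["KEY-A", "KEY-B"])
```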

get_source_info_by_id(ID)[source]

Obtain all information on a source by querying with a source ID.

Parameters:

ID (int) – valid source ID (see get_all_src_ids())

Returns:

dictionary with source information (see get_source_info_by_name() for keys)

u.get_source_info_by_id(1)
get_source_info_by_name(src_name)[source]

Obtain all information on a source by querying with a source name.

Parameters:

src_name (str) – a valid source name (e.g. chebi, chembl); valid names can be found in source_ids

Returns:

dictionary (or list of dictionaries) with following keys:

  • UCICount: number of entries

  • baseIdUrl: URL of the source

  • created: date of creation

  • description: a description of the content of the source

  • lastUpdated: last date of the update

  • name: the unique name for the source in UniChem, always lower case

  • nameLabel: A name for the source suitable for use as a ‘label’ for the source

  • nameLong: the full name of the source, as defined by the source

  • private: whether the source is private

  • sourceID: the src_id for this source

  • srcDetails: details about the source

  • srcReleaseDate: release date of the source database

  • srcReleaseNumber: release number of the source

  • srcUrl: src_url (the main home page of the source)

  • updateComments: possible updates from this source

>>> res = u.get_source_info_by_name("chebi")
get_sources()[source]

Returns all information about all sources used in UniChem

from bioservices import UniChem
u = UniChem()
res = u.get_sources()
res[0]
get_sources_by_inchikey(inchikey)[source]

Get sources by inchikey

Parameters:

inchikey – InChI Key to search. Unlike the REST API, you can also provide a list.

Returns:

a list of sources for the provided InChIKey if the input is a single string, or a dictionary keyed by InChIKey if the input is a list.

Note

This is a legacy function. It was introduced in v1.9 after the UniChem API update.

get_sources_by_inchikey_verbose(inchikey)[source]

Get sources by inchikey

Parameters:

inchikey – InChI Key to search. Unlike the REST API, you can also provide a list.

Returns:

a list of sources for the provided InChIKey if the input is a single string, or a dictionary keyed by InChIKey if the input is a list.

Note

This is a legacy function. It was introduced in v1.9 after the UniChem API update.

get_structure(compound_id, src_id)[source]

Obtain structure(s) CURRENTLY assigned to a query src_compound_id.

Parameters:
  • compound_id (str) – a valid compound identifier

  • src_id (int) – corresponding database identifier (name or id).

Returns:

dictionary with ‘standardinchi’ and ‘standardinchikey’ keys

>>> u.get_structure("CHEMBL12", "chembl")

8.37. UniProt

Interface to part of the UniProt web service

class UniProt(verbose=False, cache=False)[source]

Interface to the UniProt service

>>> from bioservices import UniProt
>>> u = UniProt(verbose=False)
>>> u.mapping("UniProtKB_AC-ID", "KEGG", query='P43403')
{'results': [{'from': 'P43403', 'to': 'hsa:7535'}]}
>>> res = u.search("P43403")

# Return the sequence of the ZAP70_HUMAN entry
>>> sequence = u.search("ZAP70_HUMAN", columns="sequence")

Changed in version 1.10: UniProt updated its service in June 2022. The bioservices API was adapted accordingly, and the user-facing API is largely unchanged. The main issue you may face is that output column names have changed; see _legacy_names for the corresponding changes.

Some notes about searches: the AND and OR operators must now be upper case, and the organism and taxonomy fields are now organism_id and taxonomy_id.

Constructor

Parameters:
  • verbose (bool) – set to False to prevent informative messages

  • cache (bool) – set to True to cache request

get_df(entries, nChunk=100, organism=None, limit=10, columns=None, progress=False)[source]

Given a list of uniprot entries, returns a dataframe with all possible columns

Parameters:
  • entries – list of valid entry names. If the list is too large (more than about 200 entries), you need to split it

  • nChunk (int) – queries are processed by chunks of this size

  • limit – maximum number of entries per identifier (10 by default). Set it to None to keep all entries, but this will be very slow

Returns:

dataframe with indices being the uniprot id (e.g. DIG1_YEAST)

To get about 100 columns related to the accession P62988, type:

df = u.get_df('P62988')

Note that you may precede the accession by the keyword sec_acc to access secondary accessions numbers:

df = u.get_df('sec_acc:P62988')
get_fasta(uniprot_id)[source]

Returns FASTA string given a valid identifier

Parameters:

uniprot_id (str) – a valid identifier (e.g. P12345)

This is just an alias to retrieve() when setting the format to ‘fasta’. Method kept for legacy.

mapping(fr='UniProtKB_AC-ID', to='KEGG', query='P13368', polling_interval_seconds=3, max_waiting_time=100, progress=True)[source]

This is an interface to the UniProt mapping service

Parameters:
  • fr (str) – the source database identifier. See valid_mapping.

  • to (str) – the target database identifier. See valid_mapping.

  • query – a string containing one or more IDs separated by commas. It can also be a list of strings.

  • polling_interval_seconds – the number of seconds between each status check of the current job

  • max_waiting_time – the maximum number of seconds to wait for the final answer.

Returns:

a dictionary with two possible keys. The first one is ‘results’ with the from / to answers and the second one ‘failedIds’ with Ids that were not found

>>> u.mapping("UniProtKB_AC-ID", "KEGG", 'P43403')
{'results': [{'from': 'P43403', 'to': 'hsa:7535'}]}

The output is a dictionary. Identifiers that were not found are stored under the ‘failedIds’ key. Successful queries are stored under the ‘results’ key, which is a list of dictionaries with two keys, ‘from’ and ‘to’. The ‘from’ value should be in your input list. The ‘to’ value is the result. Here we have the KEGG identifier, recognised by its prefix ‘hsa:’, which is for human. Sometimes the ‘to’ output is more complicated. Consider the following example:

u.mapping("UniParc", "UniProtKB", 'UPI0000000001,UPI0000000002')

You will see that the UniParc result is more complex than just an identifier.

See valid_mapping attribute for list of valid mapping identifiers.
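
The ‘results’ list can be regrouped into a plain dictionary for easier lookup. The sketch below reuses the output shown above; the ‘failedIds’ value and the helper name mapping_to_dict are made up for illustration.

```python
from collections import defaultdict

# The 'results' entry is the output shown above; 'failedIds' is a made-up example.
answer = {
    "results": [{"from": "P43403", "to": "hsa:7535"}],
    "failedIds": ["NOT_AN_ID"],
}


def mapping_to_dict(answer):
    """Group 'to' values by their 'from' identifier."""
    grouped = defaultdict(list)
    for item in answer.get("results", []):
        grouped[item["from"]].append(item["to"])
    return dict(grouped)


mapped = mapping_to_dict(answer)
failed = answer.get("failedIds", [])
```

Each value is a list because one input identifier may map to several targets. This simple grouping assumes the ‘to’ value is a plain identifier; richer results such as the UniParc case would need their own handling.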

Note that according to Uniprot (June 2022), there are various limits on ID Mapping Job Submission:

Limit     Details
100,000   Total number of ids allowed in comma separated param ids in /idmapping/run api
500,000   Total number of “mapped to” ids allowed
100,000   Total number of “mapped to” ids allowed to be enriched by UniProt data
10,000    Total number of “mapped to” ids allowed with filtering
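
If a list of identifiers exceeds the submission limit, it can be split into batches before calling mapping(). This is a minimal sketch, assuming the 100,000 limit quoted above; the helper name chunk_ids is hypothetical and the identifiers are placeholders.

```python
def chunk_ids(ids, max_ids=100_000):
    """Yield successive comma-separated batches of at most max_ids identifiers."""
    for start in range(0, len(ids), max_ids):
        yield ",".join(ids[start:start + max_ids])


# Small batch size so the behaviour is easy to see.
batches = list(chunk_ids(["P1", "P2", "P3", "P4", "P5"], max_ids=2))
```

Each batch is a comma-separated string, which is one of the forms accepted by the query parameter.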

Changed in version 1.1.1: to return a dictionary instead of a list

Changed in version 1.1.2: the value for each key is now a list instead of a string, so that more than one value can be stored.

Changed in version 1.2.0: input query can also be a list of strings instead of just a string

Changed in version 1.3.1: use http_post instead of http_get. This is 3 times faster and allows queries with more than 600 entries in one go.

Changed in version 1.10.0: new API due to uniprot website update

Changed in version 1.11.0: implement batch to prevent limit of 25 results.

A specialised version of search().

This is equivalent to:

u = UniProt()
u.search(query, frmt='tsv', sort="score", limit=1)

Returns:

a dictionary.

retrieve(uniprot_id, frmt='json', database='uniprot', include=False)[source]

Search for a uniprot ID in UniProtKB database

Parameters:
  • uniprot_id (str) – a valid UniProtKB ID, or a UniRef, UniParc or taxonomy identifier.

  • frmt (str) – expected output format amongst json (default), xml, txt, fasta, gff, rdf

  • database (str) – database name in (uniprot, uniparc, uniref, taxonomy)

  • include (bool) – include data with RDF format.

Returns:

if uniprot_id is a string, returns the entry directly; if a list of identifiers is provided, returns a list of results. The content depends on the value of frmt.

>>> u = UniProt()
>>> res = u.retrieve("P09958", frmt="txt")
>>> fasta = u.retrieve(['P29317', 'Q5BKX8', 'Q8TCD6'], frmt='fasta')
>>> print(fasta[0])

Changed in version 1.10: the xml format is now returned as raw XML. It is not interpreted anymore. The RDF format now has an additional option to include data from referenced data sets directly in the returned data (set the include=True parameter). The default output format is now json.

search(query, frmt='tsv', columns=None, include_isoforms=False, sort='score', compress=False, limit=None, size=25, database='uniprotkb', progress=False)[source]

Provide some interface to the uniprot search interface.

Parameters:
  • query (str) – query must be a valid uniprot query. See https://www.uniprot.org/help/query-fields and examples below

  • frmt (str) – a valid format amongst xlsx, fasta, gff, tsv and json (default is tsv). Other formats (rss, obo, rdf, xml) are not available within bioservices.

  • columns (str) – comma-separated list of values. Works only if format is tsv or xlsx. For UniProtKB, some possible columns are: id, entry name, length, organism. See also valid_mapping for the full list of column keywords.

  • include_isoforms (bool) – include isoform sequences when the frmt parameter is fasta. Include description when frmt is rdf.

  • sort (str) – sort order; results are sorted by score by default. Set to None to bypass this behaviour

  • compress (bool) – gzip the results

  • limit (int) – stops the download of results once this limit is crossed. Results are fetched in pages of size entries, so if size is 25 and limit is set to 30, 25+25 results will be returned; users need to filter the results afterwards.

  • size (int) – chunk of results (25 by default on uniprot website).

Returns:

depends on the value of frmt. The UniProt API returns results in several pages of size elements each. If frmt is set to xlsx, the output is a list of Excel-like pages, one item per page. If frmt is set to tsv, bioservices concatenates all pages into a single string. Similarly, for gff, fasta or json, bioservices concatenates all pages into a single variable (text or dictionary, depending on the requested format).
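
The interplay between limit and size, and the filtering left to the user, can be sketched as follows. The helper names pages_needed and truncate_tsv are hypothetical, the TSV text is made up, and the sketch assumes the concatenated string has a single header line.

```python
import math


def pages_needed(limit, size=25):
    """Number of pages fetched before the limit is reached or crossed."""
    return math.ceil(limit / size)


def truncate_tsv(tsv, limit):
    """Keep the header line plus at most `limit` data rows."""
    lines = tsv.splitlines()
    return "\n".join(lines[: limit + 1])


# Made-up TSV standing in for a search result with 4 rows.
tsv = "Entry\tLength\nA1\t10\nA2\t20\nA3\t30\nA4\t40"
trimmed = truncate_tsv(tsv, limit=3)
```

With size=25 and limit=30, two pages (50 rows) are downloaded, and truncate_tsv(tsv, 30) would then keep the first 30 rows.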

The list of UniProt IDs returned by a search for zap70 can be retrieved as follows:

>>> u.search('zap70+AND+organism_id:9606')
>>> u.search("zap70+AND+taxonomy_id:9606", frmt="tsv", limit=3,
...    columns="accession,length,id, gene_names")
Entry       Length  Entry Name      Gene Names
P43403      619     ZAP70_HUMAN     ZAP70 SRK
P22681      906     CBL_HUMAN       CBL CBL2 RNF55
P20963      164     CD3Z_HUMAN      CD247 CD3Z T3Z TCRZ
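
A tsv result is a single string, so it can be parsed with the standard csv module. The sketch below reproduces the first two rows of the output above as tab-separated text; it assumes the real output uses tabs between columns and a single header line.

```python
import csv
import io

# Tab-separated text reproducing the first two rows of the output shown above.
tsv = (
    "Entry\tLength\tEntry Name\tGene Names\n"
    "P43403\t619\tZAP70_HUMAN\tZAP70 SRK\n"
    "P22681\t906\tCBL_HUMAN\tCBL CBL2 RNF55\n"
)

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
accessions = [row["Entry"] for row in rows]
```

All values are read as strings, so numeric columns such as Length need an explicit int() conversion.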

other examples:

>>> u.search("ZAP70+AND+organism_id:9606", limit=3, columns="id,xref_pdb")

You can also do a search on several keywords. This is especially useful if you have a list of known entry names:

>>> u.search("ZAP70_HUMAN+OR+CBL_HUMAN", frmt="tsv", limit=3,
...    columns="entry name,length,id, genes")
Entry name  Length  Entry   Gene names

Finally, note that when you search for a query, you may have several hits:

>>> u.search("P12345")

including the ID P12345 but also related entries. If you need only the entry that exactly matches the query, use:

>>> u.search("accession:P12345")

This tip comes from a user issue that was resolved here: https://github.com/cokelaer/bioservices/issues/122

Warning

Some columns, although valid, may not return anything, not even in the header: ‘score’, ‘taxonomy’, ‘tools’. This is a UniProt feature, not a bioservices one.

Changed in version 1.10: Due to uniprot API changes in June 2022:

  • parameter ‘include’ is now named ‘include_isoforms’

  • default parameter ‘tab’ is now ‘tsv’ but does not change the results

Changed in version 1.11:

  • removed the offset argument

  • add size parameter and keep limit parameter

  • add progress bar option (True by default)

  • dropped the following frmt values: rdf, obo, xml, html

uniref(query)[source]

Calls UniRef service

This is an alias to retrieve()

>>> u = UniProt()
>>> u.uniref("Q03063")

Another example from https://github.com/cokelaer/bioservices/issues/121 is the combination of uniprot and uniref filters:

u.uniref("uniprot:(ec:1.1.1.282 taxonomy_name:bacteria reviewed:true)")

Changed in version 1.10: due to uniprot API changes in June 2022, we now return a json instead of a pandas dataframe.

property valid_mapping

8.38. DBFetch

Interface to DBFetch web service

class DBFetch(verbose=False)[source]

Interface to DBFetch service

>>> from bioservices import DBFetch
>>> w = DBFetch()
>>> data = w.fetchBatch("uniprot", "zap70_human", "xml", "raw")

For more information about the API, check this page: http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp

Constructor

Parameters:

verbose (bool) – print informative messages

fetch(query, db='ena_sequence', format='default', style='raw', pageHtml=False)[source]

Fetch an entry in a defined format and style.

Parameters:
  • query (str) – the entry identifier in db:id format (e.g. 'UniProtKB:WAP_RAT')

  • db (str) – database name (default "ena_sequence")

  • format (str) – the name of the format required (default "default")

  • style (str) – the name of the style required: "raw", "default", or "html"

  • pageHtml (bool) – if True, return the result wrapped in an HTML page

Returns:

entry data; format depends on the format/style parameters

from bioservices import DBFetch
u = DBFetch()
u.fetch(db="ena_sequence", format="fasta", query="L12344,L12345")
u.fetch(db="uniprot", format="fasta", query="P53503")

If db is omitted, the default is ena_sequence. If format is omitted, the default is EMBL format. The default style is raw data.
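As a rough illustration of how these defaults combine, the sketch below assembles the query string of a dbfetch-style request without contacting the server. The helper name build_dbfetch_url and the base URL are assumptions made for this example; they are not part of the bioservices API.

```python
from urllib.parse import urlencode

# Assumed base URL of the EBI dbfetch REST endpoint (not taken from bioservices).
DBFETCH_URL = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def build_dbfetch_url(query, db="ena_sequence", format="default", style="raw"):
    """Return a dbfetch-style URL; the defaults mirror the fetch() signature."""
    # Accept a list of identifiers and join them with commas, as in "L12344,L12345".
    if not isinstance(query, str):
        query = ",".join(query)
    params = {"db": db, "id": query, "format": format, "style": style}
    # safe="," keeps the comma-separated identifier list unescaped.
    return DBFETCH_URL + "?" + urlencode(params, safe=",")


url = build_dbfetch_url(["L12344", "L12345"], format="fasta")
print(url)
```

Only the parameters that differ from the defaults need to be passed, which is the same convention fetch() follows.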

get_all_database_info()[source]

Get details of all available databases, including formats and result styles.

Returns:

a dict of data structures describing the databases. See get_database_info() for a description of each entry.

get_database_format_styles(db, format)[source]

Get a list of style names available for a given database and format.

Parameters:
  • db (str) – database name to get available styles for (e.g. uniprotkb).

  • format (str) – the data format to get available styles for (e.g. fasta).

Returns:

list of style name strings

>>> u.get_database_format_styles("uniprotkb", "fasta")
['default', 'raw', 'html']
get_database_formats(db)[source]

Get list of format names for a given database.

Parameters:

db (str) – valid database name

>>> u.get_database_formats("uniprotkb")
['default',
 'annot',
 'entrysize',
 'fasta',
 'gff3',
 'seqxml',
 'uniprot',
 'uniprotrdfxml',
 'uniprotxml',
 'dasgff',
 'gff2']
get_database_info(db=None)[source]

Get details describing a specific database (data formats, styles).

Parameters:

db (str) – a valid database.

Returns:

dict describing the database; can be introspected for formats, styles, etc.

>>> res = u.get_database_info('uniprotkb')
>>> print(res['description'])
The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. Search UniProtKB to retrieve everything that is known about a particular sequence.
property supported_databases

Alias to getSupportedDBs.

8.39. Wikipathway

Interface to the WikiPathways service

class WikiPathways(verbose=True, cache=False)[source]

Interface to the WikiPathways service

>>> from bioservices import WikiPathways
>>> s = WikiPathways()
>>> s.organism  # default organism
'Homo sapiens'

Examples:

s.findPathwaysByText('MTOR')
s.getPathway('WP1471')
s.getPathwaysByOntologyTerm('DOID:344')
s.findPathwaysByXref('P45985')

The methods that require a login are not implemented (login(), updatePathway(), removeCurationTag(), saveCurationTag(), createPathway())

Methods not implemented at all:

  • getCurationTagHistory: no API found on the WikiPathways web page

  • getRelations: no API found on the WikiPathways web page

Constructor

Parameters:

verbose (bool) –

createPathway(gpmlCode, authInfo)[source]

Create a new pathway on the WikiPathways website with a given GPML code.

Warning

Interface not exposed in bioservices.

Note

To create/modify pathways via the web service, you need to have an account with web service write permissions. Please contact us to request write access for the web service.

Parameters:
  • gpmlCode (str) – The GPML code.

  • authInfo (object WSAuth) – The authentication info.

Returns:

WSPathwayInfo. The pathway info for the created pathway (containing identifier, revision, etc.).

findInteractions(query)[source]

Find interactions defined in WikiPathways pathways.

Parameters:

query (str) – The name of an entity to find interactions for (e.g. ‘P53’)

Returns:

list of dictionaries

res = w.findInteractions("P53")
findPathwaysByLiterature(query)[source]

Find pathways by their literature references.

Parameters:

query (str) – The query, can be a pubmed id, author name or title keyword.

Returns:

dictionary with Pathway as keys

res = s.findPathwaysByLiterature(18651794)
findPathwaysByText(query, species=None)[source]

Find pathways using a textual search on the description and text labels of the pathway objects.

The query syntax offers several options:

  • Combine terms with AND and OR. Combining terms with a space is equal to using OR (‘p53 OR apoptosis’ gives the same result as ‘p53 apoptosis’).

  • Group terms with parentheses, e.g. ‘(apoptosis OR mapk) AND p53’

  • You can use wildcards * and ?. * searches for one or more characters, ? searches for only one character.

  • Use quotes to escape special characters. E.g. ‘“apoptosis*”’ will include the * in the search and not use it as wildcard.

This function supports REST-style invocation. Example: http://www.wikipathways.org/wpi/webservice/webservice.php/findPathwaysByText?query=apoptosis

Parameters:
  • query (str) – The search query (e.g. ‘apoptosis’ or ‘p53’).

  • species (str) – The species to limit the search to (leave blank to search on all species).

Returns:

Array of WSSearchResult. An array of search results.

s.findPathwaysByText(query="p53 OR mapk", species='Homo sapiens')

Warning

The AND and OR operators must be written in uppercase.
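The query syntax described above can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper build_text_query (not part of bioservices) that joins terms with an uppercase operator so the case requirement is always met.

```python
def build_text_query(terms, operator="OR"):
    """Join search terms with an uppercase boolean operator."""
    op = operator.upper()
    if op not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {op} ".join(terms)


# A lowercase operator is normalised to uppercase.
query = build_text_query(["p53", "mapk"], operator="or")

# Parentheses group a sub-expression, as in the syntax notes above.
grouped = f"({build_text_query(['apoptosis', 'mapk'])}) AND p53"

print(query)
print(grouped)
```

The resulting strings can be passed as the query argument of findPathwaysByText().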

findPathwaysByXref(ids, codes=None)[source]

Find pathways by searching on the external references of DataNodes.

Parameters:
  • ids (str or list) – One or more DataNode identifier(s) (e.g. ‘P45985’). DataNodes can be gene, protein, or metabolite identifiers. For a single node, you can pass a string (or number) or a one-element list. You can also provide a list of several identifiers.

  • codes (str) – You can restrict the search to a specific database. See http://developers.pathvisio.org/wiki/DatabasesMapps#Supporteddatabasesystems for details. Examples are “L” for entrez gene, “En” for ensembl. See also the note below for multiple identifiers/codes.

Returns:

a dictionary

>>> s.findPathwaysByXref(ids="P45985")
>>> s.findPathwaysByXref(ids="P45985", codes="L")
>>> s.findPathwaysByXref(ids=["P45985"], codes=["L"])
>>> s.findPathwaysByXref(ids=["P45985", "ENSG00000130164"], codes=["L", "En"])

Note that in the last example, we specify multiple ids and codes to query for multiple xrefs at once. In that case, the number of ids and codes should match. Moreover, they are paired to form xrefs, so “P45985” is searched for in the “L” database only, while “ENSG00000130164” is searched for in the “En” database only.
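The pairing rule can be pictured with a short sketch. The helper pair_xrefs below is hypothetical and only illustrates the documented behaviour (equal lengths, position-wise pairing); it is not the code bioservices runs internally.

```python
def pair_xrefs(ids, codes=None):
    """Pair identifiers with database codes position by position."""
    # A single identifier (string or number) is treated as a one-element list.
    if isinstance(ids, (str, int)):
        ids = [ids]
    if codes is None:
        return [(str(i), None) for i in ids]
    if isinstance(codes, str):
        codes = [codes]
    if len(ids) != len(codes):
        raise ValueError("ids and codes must have the same length")
    return [(str(i), c) for i, c in zip(ids, codes)]


pairs = pair_xrefs(["P45985", "ENSG00000130164"], ["L", "En"])
print(pairs)
```

Each tuple corresponds to one xref: the identifier and the single database in which it is looked up.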

getColoredPathway(pathwayId, filetype='svg', revision=0, color=None, graphId=None)[source]

Get a colored image version of the pathway.

Parameters:
  • pathwayId (str) – The pathway identifier.

  • revision (int) – The revision number of the pathway (use ‘0’ for most recent version).

  • filetype (str) – The image type (one of ‘svg’, ‘pdf’ or ‘png’). Not yet implemented; svg is returned for now.

Returns:

Binary form of the image.

Todo

graphId, color parameters

getCurationTags(pathwayId)[source]

Get all curation tags for the given pathway.

Parameters:

pathwayId (str) – the pathway identifier.

Returns:

Array of WSCurationTag. The curation tags.

s.getCurationTags("WP4")
getCurationTagsByName(name)[source]

Get all curation tags for the given tag name.

Use this method if you want to find all pathways that are tagged with a specific curation tag.

Parameters:

name (str) – The tag name.

Returns:

Array of WSCurationTag. The curation tags (one instance for each pathway that has been tagged).

s.getCurationTagsByName("Curation:FeaturedPathway")
getOntologyTermsByPathway(pathwayId)[source]

Get a list of ontology terms for a given pathway.

Parameters:

pathwayId (str) – the pathway identifier.

Returns:

Array of WSOntologyTerm. The ontology terms.

s.getOntologyTermsByPathway("WP4")
getPathway(pathwayId, revision=0)[source]

Download a pathway from WikiPathways.

Parameters:
  • pathwayId (str) – the pathway identifier.

  • revision (int) – the revision number of the pathway (use ‘0’ for most recent version).

Returns:

The pathway as a dictionary. The pathway is stored in gpml format.

s.getPathway("WP2320")
getPathwayHistory(pathwayId, date)[source]

Get the revision history of a pathway.

Parameters:
  • pathwayId (str) – the pathway identifier.

  • date (str) – limit the results by date, only history items after the given date (timestamp format) will be included. Can be a string or number of the form YYYYMMDDHHMMSS.

Returns:

The revision history.

Warning

This method seems unstable: it does not return results consistently.

s.getPathwayHistory("WP4", 20110101000000)
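The YYYYMMDDHHMMSS value expected by the date parameter can be produced from a standard datetime object. A minimal sketch, where the helper name to_wikipathways_timestamp is an assumption made for illustration:

```python
from datetime import datetime


def to_wikipathways_timestamp(moment):
    """Format a datetime as a YYYYMMDDHHMMSS string."""
    return moment.strftime("%Y%m%d%H%M%S")


stamp = to_wikipathways_timestamp(datetime(2011, 1, 1))
print(stamp)  # 20110101000000
```

The same string format applies to the timestamp argument of getRecentChanges().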
getPathwayInfo(pathwayId)[source]

Get some general info about the pathway.

Parameters:

pathwayId (str) – the pathway identifier.

Returns:

The pathway info.

>>> from bioservices import WikiPathways
>>> s = WikiPathways()
>>> s.getPathwayInfo("WP2320")
getPathwaysByOntologyTerm(terms)[source]

Get a list of pathways tagged with a given ontology term.

Parameters:

terms (str) – the ontology term identifier.

Returns:

dataframe with pathway information.

>>> from bioservices import WikiPathways
>>> s = WikiPathways()
>>> s.getPathwaysByOntologyTerm('PW:0000724')
getPathwaysByParentOntologyTerm(term)[source]

Get a list of pathways tagged with any ontology term that is the child of the given Ontology term.

Parameters:

term (str) – the ontology term identifier.

Returns:

List of WSPathwayInfo. The pathway information.

getRecentChanges(timestamp)[source]

Get the recently changed pathways.

Parameters:

timestamp (str) – Only get changes from after this time. Timestamp format: YYYYMMDDHHMMSS (string or number)

Returns:

The changed pathways in XML format

s.getRecentChanges(20110101000000)

Todo

interpret XML

getXrefList(pathwayId, code)[source]
listOrganisms()[source]
listPathways(organism=None)[source]

Get a list of all available pathways.

Parameters:

organism (str) – If provided, the data is filtered to keep only the organism provided, which must be a valid name (check out organism attribute)

Returns:

dataframe. The index contains the pathway identifiers (e.g. WP1).

from bioservices import WikiPathways
w = WikiPathways()
df = w.listPathways()
df.groupby("species").count()['name'].sort_values().plot(kind="barh")
login(usrname, password)[source]

Start a logged in session using an existing WikiPathways account.

Warning

Interface not exposed in bioservices.

This function will return an authentication code that can be used to execute methods that need authentication (e.g. updatePathway).

Parameters:
  • usrname (str) – The username of the WikiPathways account.

  • password (str) – The password of the WikiPathways account.

Returns:

The authentication code for this session.

property organism

Read/write attribute for the organism

organisms

Get a list of all available organisms.

removeCurationTag(pathwayId, name)[source]

Remove a curation tag from a pathway.

Warning

Interface not exposed in bioservices.

saveCurationTag(pathwayId, name, revision)[source]

Apply a curation tag to a pathway. This operation will overwrite any existing tag with the same name.

Warning

Interface not exposed in bioservices.

Parameters:

pathwayId (str) – the pathway identifier.

savePathwayAs(pathwayId, filename, revision=0, display=True)[source]

Save a pathway.

Parameters:
  • pathwayId (str) – the pathway identifier.

  • filename (str) – the name of the file. If a filename extension is not provided the pathway will be saved as a png (default).

  • revision (int) – deprecated, kept for backwards compatibility.

  • display (bool) – if True the pathway will be displayed in your browser.

Note

Method from bioservices. Not a WikiPathways function

Changed in version 1.7: return PNG by default instead of PDF. PDF not working as of 20 Feb 2020 even on wikipathway website.

Changed in version 1.10: fetch from wikipathways-assets instead of the retired getPathwayAs web service endpoint.
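The default-extension rule for filename can be sketched as follows. The helper with_default_extension is hypothetical and only illustrates the documented behaviour (no extension means PNG); how bioservices implements it internally is not shown here.

```python
import os


def with_default_extension(filename, default=".png"):
    """Append the default extension when the filename has none."""
    root, ext = os.path.splitext(filename)
    return filename if ext else root + default


print(with_default_extension("WP4"))      # WP4.png
print(with_default_extension("WP4.svg"))  # WP4.svg
```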

showPathwayInBrowser(pathwayId)[source]

Show a given pathway in your favorite browser.

Parameters:

pathwayId (str) – the pathway identifier.

updatePathway(pathwayId, describeChanges, gpmlCode, revision=0)[source]

Update a pathway on WikiPathways website with a given GPML code.

Warning

Interface not exposed in bioservices.

Note

To create/modify pathways via the web service, you need to have an account with web service write permissions. Please contact us to request write access for the web service.

Parameters:
  • pathwayId (str) – The pathway identifier.

  • describeChanges (str) – A description of the modifications.

  • gpmlCode (str) – The updated GPML code.

  • revision (int) – The revision number of the version this GPML code was based on. This is used to prevent edit conflicts in case another client edited the pathway after this client downloaded it.

  • WSAuth_auth (object) – The authentication info.

Returns:

Boolean. True if the pathway was updated successfully.

9. Applications and extra tools

Web services overlap considerably. For instance, fetching a FASTA sequence can be done using many different services. Yet, once a FASTA is retrieved, one may want to perform additional tasks, such as saving it into a file, or other repetitive operations that the Web Services themselves do not provide.

The goal of this sub-package is to provide convenient tools that are not web services per se but make use of one or several Web Services already available within BioServices.

Warning

This sub-package is experimental and was added in version 1.2.0, so it may change quite a lot.

9.1. Peptides

class Peptides(verbose=False)[source]
>>> p = Peptides()
>>> p.get_peptide_position("Q8IYB3", "VPKPEPIPEPKEPSPE")
189
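Conceptually, the lookup above is a substring search within the protein sequence. The sketch below works on a made-up toy sequence and returns 0-based offsets; whether bioservices reports 0-based or 1-based positions is not stated here, so treat the indexing as an assumption. The function name peptide_positions is hypothetical.

```python
def peptide_positions(sequence, peptide):
    """Return every 0-based offset at which peptide occurs in sequence."""
    positions = []
    start = sequence.find(peptide)
    while start != -1:
        positions.append(start)
        # Restart one residue later so overlapping matches are also found.
        start = sequence.find(peptide, start + 1)
    return positions


toy_sequence = "MKPEPKPEPS"  # made-up sequence, not a real protein
print(peptide_positions(toy_sequence, "PEP"))
```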

Sometimes, peptides are provided with a pattern indicating the phospho site. e.g.,

>>>
get_fasta_sequence(uniprot_name)[source]
get_phosphosite_position(uniprot_name, peptide)[source]

9.2. FASTA

class FASTA[source]

Dedicated class to manipulate FASTA sequence(s)

Here is a FASTA file example:

>sp|P43408|KADA_METIG Adenylate kinase OS=Methanotorris igneus GN=adkA PE=1 SV=2
MKNKVVVVTGVPGVGGTTLTQKTIEKLKEEGIEYKMVNFGTVMFEVAKEEGLVEDRDQMR
KLDPDTQKRIQKLAGRKIAEMAKESNVIVDTHSTVKTPKGYLAGLPIWVLEELNPDIIVI
VETSSDEILMRRLGDATRNRDIELTSDIDEHQFMNRCAAMAYGVLTGATVKIIKNRDGLL
DKAVEELISVLK

The format is made of a header and a sequence. Any FASTA can be read, and the header/sequence pair retrieved from the header and sequence attributes. However, headers differ from one database to another, and their interpretation is implemented only for SWISS-PROT. Identifiers can be retrieved in all cases.
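To make the header/sequence split concrete, here is a minimal stand-alone sketch. The function split_fasta is an illustration written for this page and is not the parser used by the FASTA class.

```python
def split_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    header = lines[0][1:]          # drop the leading '>'
    sequence = "".join(lines[1:])  # sequence lines are concatenated
    return header, sequence


record = """>sp|P43408|KADA_METIG Adenylate kinase OS=Methanotorris igneus GN=adkA PE=1 SV=2
MKNKVVVVTGVPGVGGTTLTQKTIEKLKEEGIEYKMVNFGTVMFEVAKEEGLVEDRDQMR
KLDPDTQKRIQKLAGRKIAEMAKESNVIVDTHSTVKTPKGYLAGLPIWVLEELNPDIIVI
VETSSDEILMRRLGDATRNRDIELTSDIDEHQFMNRCAAMAYGVLTGATVKIIKNRDGLL
DKAVEELISVLK
"""
header, sequence = split_fasta(record)
print(header)
print(sequence[:10])
```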

You can read a FASTA sequence from a local file or download one from UniProt:

>>> from bioservices.apps.fasta import FASTA
>>> f = FASTA()
>>> f.load("P43403")
>>> acc = f.accession    # the accession (P43403)
>>> fasta = f.fasta      # raw FASTA string
>>> seq = f.sequence     # the sequence itself
>>> header = f.header    # the header itself
>>> identifier = f.identifier

You can also get a dataframe using the Pandas library:

>>> f.df

The columns stored in the dataframe are:

  • Accession, which is taken from the header (e.g., P43403 from uniprot)

  • Sequence, a copy of the FASTA sequence

  • Size, the length of the sequence.

  • Database, the database type found in the header (e.g., sp for SWISS-PROT; see below for a list of databases and their header formats).

  • Some columns, such as Organism, are filled only for some databases

  • Identifiers, the beginning of the header.

See also

MultiFASTA for multi FASTA manipulation.

List of identifiers corresponding to different databases.

GenBank

gi|gi-number|gb|accession|locus

EMBL Data Library

gi|gi-number|emb|accession|locus

DDBJ, DNA Database of Japan

gi|gi-number|dbj|accession|locus

NBRF PIR

pir||entry

Protein Research Foundation

prf||name

SWISS-PROT

sp|accession|name

Brookhaven Protein Data Bank (1)

pdb|entry|chain

Brookhaven Protein Data Bank (2)

entry:chain|PDBID|CHAIN|SEQUENCE

Patents

pat|country|number

GenInfo Backbone Id

bbs|number

General database identifier

gnl|database|identifier

NCBI Reference Sequence

ref|accession|locus

Local Sequence identifier

lcl|identifier
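For the SWISS-PROT layout listed above (sp|accession|name), the header can be taken apart with a few string operations. This is a minimal sketch under the assumption that keywords follow the two-letter KEY=value form seen in the example header (OS, GN, PE, SV); the function parse_swissprot_header is hypothetical and is not the parser used by the FASTA class.

```python
import re


def parse_swissprot_header(header):
    """Parse 'sp|accession|name description OS=... GN=... PE=... SV=...'."""
    header = header.lstrip(">")
    identifier, _, rest = header.partition(" ")
    dbtype, accession, name = identifier.split("|")
    # Each value runs until the next ' XX=' keyword or the end of the header.
    keywords = dict(re.findall(r"([A-Z]{2})=(.+?)(?= [A-Z]{2}=|$)", rest))
    return {"dbtype": dbtype, "accession": accession, "name": name, **keywords}


info = parse_swissprot_header(
    ">sp|P43408|KADA_METIG Adenylate kinase OS=Methanotorris igneus GN=adkA PE=1 SV=2"
)
print(info["accession"], info["OS"], info["GN"])
```

The keys OS, GN, PE and SV correspond to the organism, gene_name, PE and SV properties documented below.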

The load_fasta() method relies on the UniProt service.

property PE

returns PE keyword found in the header if any

property SV

returns SV keyword found in the header if any

property accession
property dbtype
property df
property entry

returns entry only

property fasta

returns FASTA content

property gene_name

returns gene name from GN keyword found in the header if any

get_fasta(id_)[source]

Fetches FASTA from uniprot and loads it into the attribute fasta

Parameters:

id_ (str) – a given uniprot identifier

Returns:

the FASTA contents

property header

returns header only

property identifier
known_dbtypes = ['sp', 'gi']
load(id_)[source]
load_fasta(id_)[source]

Fetches FASTA from uniprot and loads into attribute fasta

Parameters:

id_ (str) – a given uniprot identifier

Returns:

nothing

Note

same as get_fasta() but returns nothing

property name
property organism

returns organism from OS keyword found in the header if any

read_fasta(filename)[source]

Reads a FASTA file and loads it

Example:

>>> f = FASTA()
>>> f.read_fasta(filename)
>>> f.fasta
Returns:

nothing

Warning

If more than one FASTA is contained in the file, an error is raised

save_fasta(filename)[source]

Save the FASTA content into a file

Parameters:
  • data (str) – the FASTA contents

  • filename (str) – where to save it

property sequence

returns the sequence only

class MultiFASTA[source]

Class to manipulate several FASTA items

Here, we load some FASTA using UniProt web service:

>>> from bioservices import MultiFASTA
>>> mf = MultiFASTA()
>>> mf.load_fasta("P43408")
>>> mf.load_fasta("P21318")

You can then get back to your accession entries as follows:

>>> mf.ids
['P43408', 'P21318']

The number of FASTA entries loaded is given by len():

>>> len(mf)
2

Each FASTA is stored in fasta, which is a dictionary where each value is an instance of FASTA:

>>> print(mf._fasta["P43408"].fasta)
>sp|P43408|KADA_METIG Adenylate kinase OS=Methanotorris igneus GN=adkA PE=1 SV=2
MKNKVVVVTGVPGVGGTTLTQKTIEKLKEEGIEYKMVNFGTVMFEVAKEEGLVEDRDQMR
KLDPDTQKRIQKLAGRKIAEMAKESNVIVDTHSTVKTPKGYLAGLPIWVLEELNPDIIVI
VETSSDEILMRRLGDATRNRDIELTSDIDEHQFMNRCAAMAYGVLTGATVKIIKNRDGLL
DKAVEELISVLK

The most convenient way to access all data is to use the dataframe attribute:

>>> mf.df.Sequence

>>> from bioservices.apps import MultiFASTA
>>> f = MultiFASTA()
>>> f.load_fasta(["P43403", "P43410"])
>>> f.df.Size.hist()

property df
property fasta

Returns all FASTA instances

hist_size(**kwds)
property ids

returns list of keys/accession identifiers

load_fasta(ids)[source]

Loads one or more FASTA entries (a single identifier or a list of identifiers) into the dictionary

read_fasta(filename)[source]

Load several FASTA entries from a file
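Reading a file that holds several records boils down to splitting on the '>' marker. The sketch below is a simplified illustration that works on an in-memory string; split_multifasta is a hypothetical helper, the second record is a toy entry, and the accession is assumed to be the second '|' field of the header.

```python
def split_multifasta(text):
    """Return {accession: (header, sequence)} for every record in text."""
    records = {}
    # Each record starts with '>'; the chunk before the first '>' is empty.
    for chunk in text.split(">")[1:]:
        lines = chunk.strip().splitlines()
        header = lines[0].strip()
        sequence = "".join(line.strip() for line in lines[1:])
        # Assume the 'db|accession|name' layout; otherwise use the first word.
        fields = header.split()[0].split("|")
        accession = fields[1] if len(fields) > 1 else fields[0]
        records[accession] = (header, sequence)
    return records


text = (
    ">sp|P43408|KADA_METIG Adenylate kinase\nMKNK\nVVVV\n"
    ">sp|P21318|TOY_ENTRY toy record\nACDE\n"
)
records = split_multifasta(text)
print(list(records))
```

A description that itself contains '>' would break this simple split; a real parser should only treat '>' at the start of a line as a record marker.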

save_fasta(filename)[source]

Save all FASTA into a file