# Accessing Ensembl with BioServices

This notebook illustrates some of the functionality of the Ensembl service that is accessible through the BioServices `ensembl` module.

- [Introductory example](#introduction)
- [Archive](#archive)
- [Comparative genomics](#comparative)
- [Cross References](#reference)
- [Information](#information)
- [Lookup](#lookup)
- [Mapping](#mapping)
- [Ontology and Taxonomy](#ontology)
- [Overlap](#overlap)
- [Regulation](#regulation)
- [Sequences](#sequences)
- [Variation](#variation)

**References**: http://rest.ensembl.org/

In [1]:
from bioservices import ensembl
# for debugging
reload(ensembl)

<module 'bioservices.ensembl' from '/home/cokelaer/Work/github/bioservices/src/bioservices/ensembl.pyc'>

In [2]:
e = ensembl.Ensembl()

In [3]:
e.TIMEOUT = 60

## <a name="introduction"></a> Introductory example

- Most methods take one or two compulsory arguments.
- One argument that is not part of the Ensembl API is **frmt**. It can be set to one of the Ensembl output formats:
    - json
    - jsonp
    - xml
    - phyloxml
- By default, the output is in json format, which is transformed into a Python dictionary.
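
The json-to-dictionary step mentioned in the last bullet can be illustrated with the standard library alone. This is only a sketch: the `raw` string is an abridged text modeled on the `get_archive` result used in this notebook, and it is not meant to show how BioServices performs the conversion internally.

```python
import json

# Abridged JSON text modeled on the get_archive result for ENSG00000157764
# (an illustration, not a captured server response).
raw = ('{"id": "ENSG00000157764", "type": "Gene", "version": "12", '
       '"is_current": "1", "peptide": null, "possible_replacement": []}')

record = json.loads(raw)  # JSON object -> Python dict

print(record["type"])              # Gene
print(record["peptide"] is None)   # JSON null becomes None
print(int(record["version"]) + 1)  # values sent as strings need an explicit cast
```

Note that fields such as `version` and `is_current` arrive as strings in this result, so they have to be cast before any numeric use.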

In [5]:
res = e.get_archive('ENSG00000157764')
res

{u'assembly': u'GRCh38',
 u'id': u'ENSG00000157764',
 u'is_current': u'1',
 u'latest': u'ENSG00000157764.12',
 u'peptide': None,
 u'possible_replacement': [],
 u'release': u'83',
 u'type': u'Gene',
 u'version': u'12'}

In [7]:
# you can set the format explicitly; json (the default) is used here, since phyloxml would not make sense in this context
print(e.get_archive('ENSG00000157764', frmt='json'))

{u'peptide': None, u'possible_replacement': [], u'version': u'12', u'is_current': u'1', u'release': u'83', u'assembly': u'GRCh38', u'type': u'Gene', u'id': u'ENSG00000157764', u'latest': u'ENSG00000157764.12'}


In [8]:
res = e.get_genetree_by_member_id('ENSG00000157764', frmt='json', nh_format='phylip')
print(res[0:100])

<?xml version="1.0" encoding="UTF-8"?>

<phyloxml xsi:schemaLocation="http://www.phyloxml.org http:/


> Here, the requested frmt (json) is overridden: the result above is phyloxml, not json.
> Some methods take a parameter called nh_format that may override the value of **frmt**, even when **frmt** is provided.
> The same call is shown again in the Comparative genomics section, with nh_format set to phylip and an XML result.
> Setting frmt to json does not make sense in that case, so that argument is ignored.
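
Because nh_format can override frmt, the type of the returned value is not always what frmt suggests. A minimal sketch of a defensive check is given here; `response_kind` is a hypothetical helper written for this notebook and is not part of BioServices.

```python
def response_kind(res):
    """Classify a returned value as 'parsed' (dict or list), 'xml' or 'text'."""
    if isinstance(res, (dict, list)):
        return "parsed"
    if res.lstrip().startswith("<?xml"):
        return "xml"
    return "text"


# Sample values copied from the outputs shown in this notebook.
as_dict = {"id": "ENSG00000157764", "type": "Gene"}
as_xml = ('<?xml version="1.0" encoding="UTF-8"?>\n\n'
          '<phyloxml xsi:schemaLocation="http://www.phyloxml.org http:/')

print(response_kind(as_dict))  # parsed
print(response_kind(as_xml))   # xml
```

Checking the kind before indexing avoids treating an XML string as a dictionary.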

In [9]:
# If your identifier is incorrect, you will get a 500 error code returned (most probably)
wrong = e.get_map_cds_to_region('ENST0000288602', '1..1000')
good = e.get_map_cds_to_region('ENST00000288602', '1..1000')
wrong, good['mappings'][0]


(500,
 {u'assembly_name': u'GRCh38',
  u'coord_system': u'chromosome',
  u'end': 140924703,
  u'gap': 0,
  u'rank': 0,
  u'seq_region_name': u'7',
  u'start': 140924566,
  u'strand': -1})
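
Since a failed call returns a status code rather than a dictionary, results are worth checking before they are indexed. The sketch below uses two hypothetical helpers (`looks_like_human_transcript_id` and `first_mapping`); the 11-digit pattern is an assumption based on the identifiers used in this notebook, not a rule taken from BioServices.

```python
import re

# Assumption: human transcript identifiers are 'ENST' followed by 11 digits,
# as in ENST00000288602.
TRANSCRIPT_ID = re.compile(r"^ENST\d{11}$")


def looks_like_human_transcript_id(identifier):
    """Cheap local check that catches a dropped or extra digit."""
    return TRANSCRIPT_ID.match(identifier) is not None


def first_mapping(res):
    """Return the first mapping, or None when the call returned a status code."""
    if isinstance(res, int):  # e.g. 500 for the malformed identifier above
        return None
    mappings = res.get("mappings", [])
    return mappings[0] if mappings else None


# Values copied from the cell above.
wrong = 500
good = {"mappings": [{"seq_region_name": "7", "start": 140924566,
                      "end": 140924703, "strand": -1}]}

print(looks_like_human_transcript_id("ENST0000288602"))   # False (10 digits)
print(looks_like_human_transcript_id("ENST00000288602"))  # True
print(first_mapping(wrong))                               # None
print(first_mapping(good)["start"])                       # 140924566
```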

## <a name="archive"></a> Archive

In [10]:
# Get archived sequence given an identifier
e.get_archive('ENSG00000157764')

{u'assembly': u'GRCh38',
 u'id': u'ENSG00000157764',
 u'is_current': u'1',
 u'latest': u'ENSG00000157764.12',
 u'peptide': None,
 u'possible_replacement': [],
 u'release': u'83',
 u'type': u'Gene',
 u'version': u'12'}

## <a name="comparative"></a> Comparative genomics

### Gene tree by identifier

In [11]:
res = e.get_genetree_by_id('ENSGT00390000003602', nh_format='simple')
res['id'], res.keys()

(u'ENSGT00390000003602', [u'type', u'tree', u'rooted', u'id'])

In [12]:
res = e.get_genetree_by_id('ENSGT00390000003602', frmt='phyloxml')
print(res[0:200])

<?xml version="1.0" encoding="UTF-8"?>

<phyloxml xsi:schemaLocation="http://www.phyloxml.org http://www.phyloxml.org/1.10/phyloxml.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="ht


Retrieve a gene tree by member id, requesting nh_format='phylip'.
This takes a few seconds and the XML output is large.

In [14]:
# Here, the requested frmt (json) is overridden because nh_format is set; the output is phyloxml
res = e.get_genetree_by_member_id('ENSG00000157764', frmt='json', nh_format='phylip')

In [15]:
len(res)

2204716

In [16]:
print(res[0:500])

<?xml version="1.0" encoding="UTF-8"?>

<phyloxml xsi:schemaLocation="http://www.phyloxml.org http://www.phyloxml.org/1.10/phyloxml.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true" type="gene tree">
    <clade branch_length="0">
      <confidence type="duplication_confidence_score">1.0000</confidence>
      <taxonomy>
        <id>33213</id>
        <scientific_name>Bilateria</scientific_name>
      </taxonomy>
      <events>
 


In [17]:
res = e.get_genetree_by_member_symbol('human', 'BRCA2', nh_format='simple')

In [18]:
print(res[0:200])

(((((((ENSTRUP00000015030:0.074146,ENSTNIP00000002435:0.10955):0.221789,ENSGACP00000015199:0.154409):0.017205,((ENSXMAP00000006983:0.039547,ENSPFOP00000001575:0.039869):0.385538,ENSONIP00000006940:0.3
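
The `simple` format is a Newick-style string, so leaf identifiers can be pulled out with a regular expression. This is a rough sketch rather than a full Newick parser: `leaf_names` is a hypothetical helper, and it assumes every leaf is written as `name:length`. The input is the truncated prefix printed above.

```python
import re


def leaf_names(newick):
    """Return labels that directly follow '(' or ',' and precede ':'."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)\s*:", newick)


# Start of the output printed above (truncated, which the regex tolerates).
prefix = ("(((((((ENSTRUP00000015030:0.074146,ENSTNIP00000002435:0.10955)"
          ":0.221789,ENSGACP00000015199:0.154409):0.017205")

for name in leaf_names(prefix):
    print(name)
# ENSTRUP00000015030
# ENSTNIP00000002435
# ENSGACP00000015199
```

Branch lengths attached to internal nodes follow a closing parenthesis, so they are not picked up as leaves.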


## <a name="reference"></a> Cross references

In [13]:
# get_xrefs_by_id requires an identifier; calling it without one raises a TypeError
res = e.get_xrefs_by_id()

TypeError: get_xrefs_by_id() takes at least 2 arguments (1 given)

## <a name="information"></a> Information

In [20]:
e.get_info_ping()


1

In [21]:
e.get_info_rest()


{u'release': u'4.3'}

In [22]:
res = e.get_info_software()
res

{u'release': 82}

In [34]:
res = e.get_info_species()
[x['name'] for x in res['species']]

[u'saccharomyces_cerevisiae',
 u'ciona_savignyi',
 u'myotis_lucifugus',
 u'taeniopygia_guttata',
 u'sorex_araneus',
 u'otolemur_garnettii',
 u'macropus_eugenii',
 u'erinaceus_europaeus',
 u'anolis_carolinensis',
 u'gadus_morhua',
 u'dasypus_novemcinctus',
 u'chlorocebus_sabaeus',
 u'tursiops_truncatus',
 u'mus_musculus',
 u'bos_taurus',
 u'monodelphis_domestica',
 u'choloepus_hoffmanni',
 u'sus_scrofa',
 u'rattus_norvegicus',
 u'caenorhabditis_elegans',
 u'pteropus_vampyrus',
 u'microcebus_murinus',
 u'sarcophilus_harrisii',
 u'ovis_aries',
 u'papio_anubis',
 u'pelodiscus_sinensis',
 u'equus_caballus',
 u'xiphophorus_maculatus',
 u'macaca_mulatta',
 u'astyanax_mexicanus',
 u'latimeria_chalumnae',
 u'ficedula_albicollis',
 u'gasterosteus_aculeatus',
 u'gorilla_gorilla',
 u'oryctolagus_cuniculus',
 u'oreochromis_niloticus',
 u'echinops_telfairi',
 u'nomascus_leucogenys',
 u'homo_sapiens',
 u'dipodomys_ordii',
 u'lepisosteus_oculatus',
 u'anas_platyrhynchos',
 u'canis_familiaris',
 u'call

## <a name="sequences"></a> Sequences

### Get a sequence

In [36]:
sequence = e.get_sequence_by_id('ENSG00000157764', frmt='text')
print(sequence[0:60])

CGCCTCCCTTCCCCCTCCCCGCCCGACAGCGGCCGCTCGGGCCCCGGCTCTCGGTTATAA


## <a name="variation"></a> Variation

In [14]:
e.get_variation_by_id()

TypeError: get_variation_by_id() takes at least 3 arguments (1 given)

## <a name="lookup"></a> Lookup

In [39]:
res = e.get_lookup_by_id('ENSG00000157764', expand=True)
res.keys()

[u'assembly_name',
 u'display_name',
 u'description',
 u'seq_region_name',
 u'logic_name',
 u'object_type',
 u'start',
 u'id',
 u'source',
 u'db_type',
 u'version',
 u'biotype',
 u'end',
 u'Transcript',
 u'species',
 u'strand']

In [40]:
res = e.post_lookup_by_id(["ENSG00000157764", "ENSG00000248378" ], expand=0)
res['ENSG00000157764']


{u'assembly_name': u'GRCh38',
 u'biotype': u'protein_coding',
 u'db_type': u'core',
 u'description': u'B-Raf proto-oncogene, serine/threonine kinase [Source:HGNC Symbol;Acc:HGNC:1097]',
 u'display_name': u'BRAF',
 u'end': 140924764,
 u'id': u'ENSG00000157764',
 u'logic_name': u'ensembl_havana_gene',
 u'object_type': u'Gene',
 u'seq_region_name': u'7',
 u'source': u'ensembl_havana',
 u'species': u'homo_sapiens',
 u'start': 140719327,
 u'strand': -1,
 u'version': 12}
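
The coordinates in a lookup record are enough to derive a few basic quantities. The sketch below copies a subset of the fields shown above and assumes the usual Ensembl convention of 1-based, inclusive coordinates; `gene_length` and `location` are hypothetical helpers, not BioServices functions.

```python
# Subset of the lookup record shown above for ENSG00000157764.
gene = {"display_name": "BRAF", "seq_region_name": "7",
        "start": 140719327, "end": 140924764, "strand": -1}


def gene_length(entry):
    """Length in base pairs, assuming 1-based inclusive coordinates."""
    return entry["end"] - entry["start"] + 1


def location(entry):
    """Format the record as 'chromosome:start-end (strand)'."""
    strand = "+" if entry["strand"] == 1 else "-"
    return "{0}:{1}-{2} ({3})".format(
        entry["seq_region_name"], entry["start"], entry["end"], strand)


print(gene["display_name"], gene_length(gene))  # BRAF 205438
print(location(gene))                           # 7:140719327-140924764 (-)
```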

In [41]:
res = e.get_lookup_by_symbol('homo_sapiens', 'BRCA2', expand=True)
len(res['Transcript'])

7

In [42]:
res = e.post_lookup_by_symbol('human', ["BRCA2", "BRAF" ], expand=True)
len(res['BRCA2']['Transcript'])

easydev tolist deprecated since 0.8.0. use to_list() instead


7

## <a name="mapping"></a> Mapping

| Resource | Description |
|---|---|
| `GET map/cdna/:id/:region` | Convert from cDNA coordinates to genomic coordinates. Output reflects forward orientation coordinates as returned from the Ensembl API. |
| `GET map/cds/:id/:region` | Convert from CDS coordinates to genomic coordinates. Output reflects forward orientation coordinates as returned from the Ensembl API. |
| `GET map/:species/:asm_one/:region/:asm_two` | Convert the coordinates of one assembly to another. |
| `GET map/translation/:id/:region` | Convert from protein (translation) coordinates to genomic coordinates. Output reflects forward orientation coordinates as returned from the Ensembl API. |
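
To make the `:name` placeholders in these resource templates concrete, the sketch below fills them in and prepends the base URL from the References line. `fill_template` is a hypothetical helper for illustration only; BioServices builds its requests internally and may do so differently.

```python
import re

BASE_URL = "http://rest.ensembl.org/"


def fill_template(template, **values):
    """Replace each ':name' placeholder with the matching keyword value."""
    def substitute(match):
        return str(values[match.group(1)])  # KeyError if a value is missing
    return re.sub(r":([a-z_]+)", substitute, template)


print(BASE_URL + fill_template("map/cds/:id/:region",
                               id="ENST00000288602", region="1..1000"))
# http://rest.ensembl.org/map/cds/ENST00000288602/1..1000

print(fill_template("map/:species/:asm_one/:region/:asm_two",
                    species="human", asm_one="GRCh37",
                    region="X:1000000..1000100:1", asm_two="GRCh38"))
# map/human/GRCh37/X:1000000..1000100:1/GRCh38
```

The colons inside the region value are left untouched because only the template, not the substituted text, is scanned for placeholders.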

In [43]:
# the commented statement does not work (note that its region start, 10000000, is larger than its end, 1000100)
# res = e.get_map_assembly_one_to_two('GRCh37', 'NCBI36', region='X:10000000..1000100:1', species='human')
res = e.get_map_assembly_one_to_two('GRCh37', 'GRCh38', region='X:1000000..1000100:1')
res

{u'mappings': [{u'mapped': {u'assembly': u'GRCh38',
    u'coord_system': u'chromosome',
    u'end': 1039365,
    u'seq_region_name': u'X',
    u'start': 1039265,
    u'strand': 1},
   u'original': {u'assembly': u'GRCh37',
    u'coord_system': u'chromosome',
    u'end': 1000100,
    u'seq_region_name': u'X',
    u'start': 1000000,
    u'strand': 1}}]}
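
Region strings are easy to mistype, so a local sanity check before calling the service can help. The sketch below parses the `name:start..end:strand` form used in this cell and computes the coordinate shift between the two assemblies from the values shown above. `parse_region` is a hypothetical helper, and treating a missing strand as 1 is an assumption.

```python
import re

REGION = re.compile(
    r"^(?P<name>[^:]+):(?P<start>\d+)\.\.(?P<end>\d+)(?::(?P<strand>-?1))?$")


def parse_region(region):
    """Split 'name:start..end[:strand]' and reject start > end."""
    match = REGION.match(region)
    if match is None:
        raise ValueError("unrecognised region: %r" % region)
    start, end = int(match.group("start")), int(match.group("end"))
    if start > end:
        raise ValueError("start %d is larger than end %d" % (start, end))
    strand = int(match.group("strand") or 1)  # assumption: default strand is 1
    return match.group("name"), start, end, strand


print(parse_region("X:1000000..1000100:1"))  # ('X', 1000000, 1000100, 1)

# Coordinates copied from the mapping printed above.
pair = {"original": {"start": 1000000, "end": 1000100},
        "mapped": {"start": 1039265, "end": 1039365}}
shift = pair["mapped"]["start"] - pair["original"]["start"]
print(shift)  # 39265
```

With this check, a region such as `X:10000000..1000100:1` is rejected locally because its start exceeds its end.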

In [44]:
res = e.get_map_translation_to_region('ENSP00000288602', '100..300')
res['mappings'][0]  # bioservices API may change to res[0] to simplify the output ?

{u'assembly_name': u'GRCh38',
 u'coord_system': u'chromosome',
 u'end': 140834815,
 u'gap': 0,
 u'rank': 0,
 u'seq_region_name': u'7',
 u'start': 140834609,
 u'strand': -1}

In [45]:
res = e.get_map_cds_to_region('ENST00000288602', '1..1000')
res['mappings'][0]

{u'assembly_name': u'GRCh38',
 u'coord_system': u'chromosome',
 u'end': 140924703,
 u'gap': 0,
 u'rank': 0,
 u'seq_region_name': u'7',
 u'start': 140924566,
 u'strand': -1}

In [46]:
res = e.get_map_cdna_to_region('ENST00000288602', '100..300')
res['mappings'][0]

{u'assembly_name': u'GRCh38',
 u'coord_system': u'chromosome',
 u'end': 140924665,
 u'gap': 0,
 u'rank': 0,
 u'seq_region_name': u'7',
 u'start': 140924566,
 u'strand': -1}
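
Each item in `res['mappings']` describes one genomic block, and only the first block is printed in the cells above. A small sketch for measuring blocks follows; `block_length` and `total_length` are hypothetical helpers, the coordinates are copied from the three outputs above, and 1-based inclusive coordinates are assumed.

```python
def block_length(mapping):
    """Length of one mapped block, assuming 1-based inclusive coordinates."""
    return mapping["end"] - mapping["start"] + 1


def total_length(mappings):
    """Sum of block lengths over a full 'mappings' list."""
    return sum(block_length(m) for m in mappings)


# First block of each result shown above.
first_blocks = [
    ("translation 100..300", {"start": 140834609, "end": 140834815}),
    ("cds 1..1000", {"start": 140924566, "end": 140924703}),
    ("cdna 100..300", {"start": 140924566, "end": 140924665}),
]

for label, block in first_blocks:
    print(label, block_length(block))
# translation 100..300 207
# cds 1..1000 138
# cdna 100..300 100
```

In all three cases the first block is shorter than the requested range (for example, 100 bases for a 201-base cDNA range), which suggests that the remaining bases sit in further entries of `res['mappings']`; `total_length(res['mappings'])` would add them up.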