Posted on June 21, 2011, in Bioinformatics APIs

Accessing Ensembl using Ruby

This tutorial describes how to use the Ruby API to the Ensembl Core and Variation databases. It is intended as an introduction to, and demonstration of, the general API concepts. It is not comprehensive, but it should enable the reader to become productive quickly and to understand the underlying systems. Some familiarity with Ruby is assumed. This is the first of a three-part tutorial: an overview of the Ruby API system, installation and a minimal script (part 1); the API to the Ensembl Core database, the Slice concept and coordinate projections (part 2); and the API to the Ensembl Variation database (part 3).

The Ruby API provides a level of abstraction over the Ensembl Core and Variation databases. To external users the API may be useful to automate the extraction of particular data. This API is only one of many ways of accessing the data stored in Ensembl. Additionally there is a genome browser web interface, and the BioMart system. BioMart may be a more appropriate tool for certain types of data mining. The API is for read-only querying of the database.

This text was written by Jan Aerts, but copy-paste-modified from the excellent Perl API tutorial at www.ensembl.org/info/software/core/core_tutorial.html (with permission of the core Ensembl team). It is based on release 60 of the Ensembl database, but is completely compatible with other releases (see below).

For feedback concerning this API (ideas, hints, bugs), please go to the UserEcho website.

The Ensembl Core and Variation APIs come with a decent set of code documentation in the form of standard Ruby RDoc. This documentation is mixed in with the actual code, but can be extracted and formatted automatically with the standard RDoc tools.

If you have your RUBYLIB environment variable set correctly, you can use the command ri. For example the following command will bring up some documentation about the Slice class and each of its methods:

[sourcecode language=”ruby”] ri Ensembl::Core::Slice[/sourcecode]

For additional information you can contact Jan Aerts (jan.aerts@esat.kuleuven.be) or preferably send an email to the bioruby mailing list (see www.bioruby.org).

Referencing

When using the Ruby API to the Ensembl database, please use the following reference:

Strozzi F & Aerts J. A Ruby API to query the Ensembl database for genomic features. Bioinformatics 27(7):1013-1014 (2011)

Code conventions used in this tutorial

Several naming conventions are used throughout the API. Knowing these conventions will aid in your understanding of the code.

Variable names are underscore-separated all-lower-case words.

  slice_1
  exon_1
  my_gene

Class and package names are CamelCase words that begin with capital letters.

  Ensembl::Core::Gene
  Ensembl::Core::Exon
  Ensembl::Core::CoordSystem
  Ensembl::Core::SeqRegion

Method names are entirely lower-case, underscore separated words. Methods are called on an object or class by appending a period to that object or class and adding the method name.

[sourcecode language=”ruby”] Ensembl::Core::Slice.genes
transcript_a.five_prime_utr_seq[/sourcecode]

Class methods are responsible for the creation of various objects. Most of this is standard ActiveRecord behaviour and will be discussed below.

Obtaining and installing the API

The Ensembl Ruby API is made available as a gem. See the project's GitHub page for more information.

Basically, it comes down to:

  sudo gem install ruby-ensembl-api

ActiveRecord

Most of the API is based on ActiveRecord to get data from the database. In general, each table is described by a class with the same name and each class instance (object) corresponds to a single row in that table: the coord_system table is covered by the Ensembl::Core::CoordSystem class, the seq_region table is covered by the Ensembl::Core::SeqRegion class, etc. As a result, accessors are available for all columns in each table. For example, the seq_region table has the following columns: seq_region_id, name, coord_system_id and length. Through ActiveRecord, these column names become available as attributes of Ensembl::Core::SeqRegion objects:

[sourcecode language=”ruby”] puts my_seq_region.seq_region_id
puts my_seq_region.name
puts my_seq_region.coord_system_id
puts my_seq_region.length.to_s[/sourcecode]
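The naming rule described above (a snake_case table becomes a CamelCase class, and each column becomes an accessor) can be tried out without a database. The following is only a sketch in plain Ruby: table_to_class_name, SeqRegionRow and the example values are hypothetical and not part of the Ensembl API, which gets this behaviour from ActiveRecord.

```ruby
# Hypothetical helper, not part of the Ensembl API: derive the CamelCase
# class name that corresponds to a snake_case table name.
def table_to_class_name(table_name)
  table_name.split('_').map(&:capitalize).join
end

# Columns of the seq_region table, as listed in the text above.
SEQ_REGION_COLUMNS = %i[seq_region_id name coord_system_id length].freeze

# A Struct provides one reader per column, similar to what ActiveRecord
# offers for each row object.
SeqRegionRow = Struct.new(*SEQ_REGION_COLUMNS)

# Illustrative values only; they are not taken from a real Ensembl release.
EXAMPLE_ROW = SeqRegionRow.new(1, '4', 2, 1000)

puts table_to_class_name('seq_region')    # SeqRegion
puts table_to_class_name('coord_system')  # CoordSystem
puts EXAMPLE_ROW.name                     # 4
puts EXAMPLE_ROW.length.to_s              # 1000
```

With the real API the class name is additionally nested in the Ensembl::Core module, so seq_region maps to Ensembl::Core::SeqRegion.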

ActiveRecord makes it easy to extract data from those tables using the collection of find methods. Let’s for example take the coord_system table. SELECT * FROM coord_system returns:

coord_system_id  species_id  name         version  rank  attrib                        
---------------  ----------  -----------  -------  ----  ------------------------------
1                1           contig       NULL     4     default_version,sequence_level
2                1           chromosome   GRCh37   1     default_version               
3                1           supercontig  NULL     2     default_version               
4                1           clone        NULL     3     default_version               
27               1           chromosome   NCBI36   5                                   
101              1           chromosome   NCBI35   6                                   
1001             1           chromosome   NCBI34   7                                   
1003             1           lrg          NULL     8     default_version

There are three types of find methods (e.g. for the Ensembl::Core::CoordSystem class):

  • find based on primary key in table:
    [sourcecode language=”ruby”] my_coord_system = CoordSystem.find(2) # the GRCh37 chromosome row in the table above[/sourcecode]
  • find_by_sql (returns an array of matching objects):
    [sourcecode language=”ruby”] my_coord_systems = CoordSystem.find_by_sql("SELECT * FROM coord_system WHERE name = 'chromosome'")[/sourcecode]
  • find_by_ followed by a column name:
    [sourcecode language=”ruby”] my_coord_system1 = CoordSystem.find_by_name('chromosome')
    my_coord_system2 = CoordSystem.find_by_rank(3) # the 'clone' row in the table above[/sourcecode]

To find out which find_by_ methods are available, you can list the column names using the column_names class method:

[sourcecode language=”ruby”] puts Ensembl::Core::CoordSystem.column_names.join("\t")[/sourcecode]

For more information on the find methods, see the ActiveRecord Base documentation.
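To see what the different find styles return for the coord_system rows listed above, here is a small in-memory sketch that needs no database connection. SketchCoordSystem and CoordSystemRow are hypothetical stand-ins written for this tutorial; the real classes get these methods from ActiveRecord, which builds SQL queries instead of scanning an array.

```ruby
# One Struct member per column of the coord_system table shown above.
CoordSystemRow = Struct.new(:coord_system_id, :species_id, :name,
                            :version, :rank, :attrib)

# Hypothetical in-memory stand-in for Ensembl::Core::CoordSystem.
class SketchCoordSystem
  ROWS = [
    CoordSystemRow.new(1, 1, 'contig', nil, 4, 'default_version,sequence_level'),
    CoordSystemRow.new(2, 1, 'chromosome', 'GRCh37', 1, 'default_version'),
    CoordSystemRow.new(3, 1, 'supercontig', nil, 2, 'default_version'),
    CoordSystemRow.new(4, 1, 'clone', nil, 3, 'default_version'),
    CoordSystemRow.new(27, 1, 'chromosome', 'NCBI36', 5, ''),
    CoordSystemRow.new(101, 1, 'chromosome', 'NCBI35', 6, ''),
    CoordSystemRow.new(1001, 1, 'chromosome', 'NCBI34', 7, ''),
    CoordSystemRow.new(1003, 1, 'lrg', nil, 8, 'default_version')
  ].freeze

  def self.column_names
    CoordSystemRow.members.map(&:to_s)
  end

  # find: look up exactly one row by its primary key, or raise.
  def self.find(id)
    ROWS.find { |row| row.coord_system_id == id } ||
      raise(ArgumentError, "no row with coord_system_id=#{id}")
  end

  # find_by_<column>: resolved at call time from the column names.
  # Returns the first matching row, or nil if there is none.
  def self.method_missing(method_name, *args)
    column = method_name.to_s[/\Afind_by_(\w+)\z/, 1]
    return super unless column && column_names.include?(column)

    ROWS.find { |row| row[column] == args.first }
  end

  def self.respond_to_missing?(method_name, include_private = false)
    column = method_name.to_s[/\Afind_by_(\w+)\z/, 1]
    (!column.nil? && column_names.include?(column)) || super
  end
end

puts SketchCoordSystem.find(2).version                             # GRCh37
puts SketchCoordSystem.find_by_rank(3).name                        # clone
puts SketchCoordSystem.find_by_name('chromosome').coord_system_id  # 2
```

Note that find_by_name('chromosome') gives back a single object even though four rows match. In this sketch it is simply the first row in the array; against the real database, which of the matching rows comes back depends on the query that ActiveRecord generates.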

The relationships between different tables are accessible through the classes as well. For example, to loop over all seq_regions belonging to a coord_system (a coord_system “has many” seq_regions):

[sourcecode language=”ruby”] chr_coord_system = CoordSystem.find_by_name('chromosome')
chr_coord_system.seq_regions.each do |seq_region|
  puts seq_region.name
end[/sourcecode]

Of course, you can go the other way as well (a seq_region “belongs to” a coord_system):

[sourcecode language=”ruby”] chr4 = SeqRegion.find_by_name('4')
puts chr4.coord_system.name # --> 'chromosome'[/sourcecode]
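Both directions of this relationship boil down to following the coord_system_id foreign key in the seq_region table. The sketch below mimics that with plain Ruby structs; the record classes, helper methods and row values are all hypothetical and only serve to show the idea, not how ActiveRecord implements it.

```ruby
# Hypothetical, simplified rows; the values are made up for illustration.
CoordSystemRec = Struct.new(:coord_system_id, :name)
SeqRegionRec   = Struct.new(:seq_region_id, :name, :coord_system_id)

SKETCH_COORD_SYSTEMS = [
  CoordSystemRec.new(1, 'contig'),
  CoordSystemRec.new(2, 'chromosome')
].freeze

SKETCH_SEQ_REGIONS = [
  SeqRegionRec.new(10, '4', 2),
  SeqRegionRec.new(11, 'X', 2),
  SeqRegionRec.new(12, 'example_contig', 1)
].freeze

# "has many": all seq_regions whose foreign key points at this coord_system.
def seq_regions_of(coord_system)
  SKETCH_SEQ_REGIONS.select do |seq_region|
    seq_region.coord_system_id == coord_system.coord_system_id
  end
end

# "belongs to": the one coord_system that the foreign key points at.
def coord_system_of(seq_region)
  SKETCH_COORD_SYSTEMS.find do |coord_system|
    coord_system.coord_system_id == seq_region.coord_system_id
  end
end

chromosome = SKETCH_COORD_SYSTEMS.find { |cs| cs.name == 'chromosome' }
puts seq_regions_of(chromosome).map(&:name).join(', ')  # 4, X
puts coord_system_of(SKETCH_SEQ_REGIONS.first).name     # chromosome
```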

To find out what relationships exist for a given class, you can use the reflect_on_all_associations class method:

[sourcecode language=”ruby”] puts SeqRegion.reflect_on_all_associations(:has_many).collect{|a| a.name.to_s}.join("\n")
puts SeqRegion.reflect_on_all_associations(:has_one).collect{|a| a.name.to_s}.join("\n")
puts SeqRegion.reflect_on_all_associations(:belongs_to).collect{|a| a.name.to_s}.join("\n")[/sourcecode]

This method is typically used when working in an interactive environment (irb, or using the ensembl command if you installed this API).

Connecting to the Ensembl database and a minimal script

All data used and created by Ensembl is stored in MySQL relational databases. If you want to access this database the first thing you have to do is to connect to it. This is done behind the scenes using the ActiveRecord module.

First, we need to tell Ruby where it can find the API code. This information is contained in the RUBYLIB environment variable. Suppose you have saved the API in /usr/local/lib/ruby/ensembl-api (with subdirectories lib/, test/, samples/, …); you could then set the environment variable in a bash shell like this:

  export RUBYLIB=$RUBYLIB:/usr/local/lib/ruby/ensembl-api/lib

Next, we need to load the Ruby modules that we will be using. Every Ensembl script that you write will contain a require statement like the following:

[sourcecode language=”ruby”] require 'ensembl'[/sourcecode]

Alternatively, if you installed the API as a gem, you would write:

[sourcecode language=”ruby”] require 'rubygems'
gem 'ruby-ensembl-api' # activate the installed gem (require_gem is no longer available in RubyGems)
require 'ensembl'[/sourcecode]

Ensembl stores its data in a separate database for each species and each release of that species. The Ruby Ensembl API does a lot automatically, so you only have to know the species name to connect to the release 60 version of its core database. This name should be provided in snake_case (all lowercase connected by underscore):

[sourcecode language=”ruby”] Ensembl::Core::DBConnection.connect('homo_sapiens')[/sourcecode]
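If you start from a species name as it is usually written ("Homo sapiens"), you need to convert it to snake_case before passing it to connect. A minimal sketch follows; species_to_snake_case is a hypothetical helper written for this tutorial and not part of the Ensembl API.

```ruby
# Hypothetical helper, not part of the Ensembl API: turn a species name
# such as "Homo sapiens" into the snake_case form used above.
def species_to_snake_case(species)
  species.strip.downcase.gsub(/\s+/, '_')
end

puts species_to_snake_case('Homo sapiens')      # homo_sapiens
puts species_to_snake_case('  Mus   musculus ') # mus_musculus
```

The result can then be handed to DBConnection.connect in place of the literal string.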

With the connection established, you’ll be able to get objects from the database, e.g.

[sourcecode language=”ruby”] chromosome_4 = Ensembl::Core::SeqRegion.find_by_name('4')[/sourcecode]

You have to prefix every class name with Ensembl::Core::. However, if you include the line

[sourcecode language=”ruby”] include Ensembl::Core[/sourcecode]

just after require 'ensembl', you no longer have to. The rest of this tutorial assumes that you have added this include line. A very short but complete Ruby script could look like this:

[sourcecode language=”ruby”] require 'ensembl'
include Ensembl::Core
DBConnection.connect('homo_sapiens')
chromosome_4 = SeqRegion.find_by_name('4')
puts chromosome_4.name[/sourcecode]


5 Comments

  1. Accessing Ensembl using Ruby – the Core database » The Bioinformatics Knowledgeblog

    June 21, 2011 @ 1:23 pm

    […] database. It is part of a collection of 3 tutorials on the Ruby API to Ensembl. Please see this introductory tutorial for more information on the Ruby API system, installation and a minimal script. […]

  2. Bert Overduin

    June 21, 2011 @ 2:31 pm

    How can one specify to which version of Ensembl to connect?

  3. Jan Aerts

    June 21, 2011 @ 2:35 pm

    By default it connects to release 60. This will be changed in the future though to be the last version. You can connect to a specific release by adding the release number to the DBConnection, e.g.
    DBConnection.connect('homo_sapiens', 62)

  4. Tim Booth

    June 21, 2011 @ 3:24 pm

    Hi – this is good stuff. As someone with minimal Ruby knowledge and fairly rusty on EnsEMBL (I’m a prokaryote man!) I found it totally readable. Just some little suggestions to prove I read it all.

    In code conventions, under variable names, I read “gene_a” as being an array of genes (sorry, too much C exposure). Maybe slightly different examples would be better – eg. “my_gene”.

    “each table is described by a class with the same name” … is it just stating the obvious to extend this to “each table is described by a class with the same name, and each class instance represents a single row from that table”?

    The find… methods are introduced with examples based on the coord_system table, but I’m not sure exactly what this table contains (I guess it’s described in part 2). Is there another table that could be used here that might be more obvious, or at least could you quickly say what rows in the coord_system table represent? For example I have no idea what I expect to get back from
    my_coord_system2 = CoordSystem.find_by_rank(3)

    I’m not clear on the use of “reflect_on_all_associations” – it looks like this is a way to query properties of the API, in which case can the information not be found in the documentation? I’d not normally expect to have to write code to find out how to use an API, though reflection methods are always nice to have. Or is it actually telling you something dynamic about the database?

    Cheers,
    TIM

  5. Jan Aerts

    June 21, 2011 @ 3:40 pm

    Thanks for the comments. All of them are now incorporated into the text.
