Friday 18 September 2009

Gene Expression Atlas DAS Source

Hello!

The Gene Expression Atlas now provides a DAS track that can be viewed in any compatible DAS client. The best known of these is the Ensembl Genome Browser.

Here's a screenshot of what the Atlas DAS annotations look like in Ensembl Gene View:


For every gene, the Atlas provides a brief one-sentence summary of the gene's differential expression in various tissues, diseases, cell types, cell lines and other conditions. In addition, a per-condition breakdown of gene activity is provided for every biological site or condition where the gene was observed. Links are provided to the Atlas gene pages.

The Gene Expression Atlas DAS is registered in the DAS Registry, where further details about it are available. For step-by-step instructions for adding Atlas annotations to your view of Ensembl, check out our Atlas DAS Help page.

We hope you find this useful. Over time we'll add more functionality to the Atlas DAS Source.

Cheers,

--Misha Kapushesky and the Atlas Team
Gene Expression Atlas Project Coordinator

Monday 14 September 2009

Gene Expression Atlas 1.1.3

Hello!
We released Atlas 1.1.3 today. The main highlights of this release are:
There are also a number of bug fixes. You can now link to genes using any number of identifiers, including synonyms and alternate names:
and non-specific identifiers:
Now, a little more detail on the Atlas REST APIs. These will return JSON or XML, and provide a comprehensive programmatic interface to everything you see in the Atlas Web interface. Here are some examples (lots more detail in the API documentation).
Gene and Condition Queries




These queries return gene results as in the Atlas Heatmap or List Views, with complete information on which conditions the matching genes are over/under-expressed in, including the EFO information.
Try adding "&indent" to the end of these queries to see pretty-printed output.
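If you build query URLs in a script, the "&indent" tip is easy to automate. Here is a minimal Python sketch; the helper name `with_indent` and the example.org URL in the usage note are placeholders of ours, not part of the Atlas API, so substitute a real query from the API documentation.

```python
from urllib.parse import urlsplit, urlunsplit


def with_indent(query_url):
    """Return query_url with the bare "indent" flag added to its query string.

    Hypothetical helper: it only edits the URL text and sends no request.
    """
    parts = urlsplit(query_url)
    flags = parts.query.split("&") if parts.query else []
    if "indent" not in flags:  # avoid appending the flag twice
        flags.append("indent")
    return urlunsplit(parts._replace(query="&".join(flags)))
```

For example, `with_indent("http://example.org/api?gene=abc")` gives back the same URL with `&indent` on the end, and calling it again on the result changes nothing.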


Experiment Queries

These queries execute searches for experiments, either listing general experiment information, or, if one or more genes are specified, the full experiment/assay/sample relationship matrix is produced together with gene expression values and, if available, the differential expression statistics.
Again, "&indent" will pretty-print the results.
We urge you to check out the docs, try these APIs out and get back to us with feedback and requests for further functionality. We'll post separate instructions here for enabling the Gene Expression Atlas DAS Track in Ensembl. For now, here's a screenshot:

Yours,

--Misha Kapushesky and the Atlas Team
Gene Expression Atlas Project Coordinator

Tuesday 11 August 2009

Gene Expression Atlas 1.1.2

Hello!

Together with the monthly data release, we have put out Gene Expression Atlas 1.1.2 for your enjoyment. This is mostly a bug-fix release, though it does feature some visible changes:
More importantly, however, this is a preparatory release for what's coming next (end of August): a fairly extensive REST API to Atlas, a dedicated Atlas DAS track in Ensembl, and (if we can squeeze it in) Java Remoting and Web Services APIs to Atlas.

Yours,

--Misha Kapushesky and the Atlas Team
Gene Expression Atlas Project Coordinator

Tuesday 16 June 2009

Gene Expression Atlas - 1.1 Released!

Hello!

A major Atlas release, 1.1, happened yesterday! Here's what's new:
  1. Ontology-driven interface
    EFO is now used not only for query expansion, but also actively in the user interface. You can choose terms from the ontology to construct your queries; the queries use the ontology to display hits, and the ontology is used to organise the results display. A simple query for "brain" will produce a detailed listing of gene expression activity in all brain compartments, from cortex to hypothalamus, and will also give you a broad overview upwards through the semantic hierarchy to the central nervous system.
  2. List view of search results
    In addition to viewing search results as a heatmap, you can view them as an expandable list, viewing the condition/experiment hits for each gene in the result set. Here's a list view of transcriptional activity in zebrafish brain.
  3. Download results
    You can download your search results as tab-delimited files. The download link is at the top of every list view. This is an experimental feature - let us know how you find it. It's a bit on the slow side right now but we'll work on improving it if people find it useful.
  4. Experiment pages with gene plots and similarity search
    Each experiment has a page where a large, detailed gene expression plot can be viewed. You can plot multiple lines on the plot, selecting from top differentially expressed genes, genes similar to ones displayed, or any gene that matches your search criteria.

... and lots of small changes and bug fixes under the hood - improved index, some of the underlying mechanics and so on.
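To give a feel for what the ontology-driven "brain" query in item 1 does, here is a small Python sketch of expanding a term downwards to its sub-terms and upwards to its broader terms. The tiny hierarchy is a toy we made up from the terms mentioned above (plus one invented sibling); it is not real EFO content, and this is not the Atlas implementation.

```python
from collections import deque

# Toy child -> parent links. Hypothetical fragment for illustration only.
PARENT = {
    "cortex": "brain",
    "hypothalamus": "brain",
    "brain": "central nervous system",
    "spinal cord": "central nervous system",  # invented sibling of "brain"
}


def descendants(term, parent=PARENT):
    """All narrower terms below `term`, breadth-first, siblings in sorted order."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    found, queue = [], deque([term])
    while queue:
        current = queue.popleft()
        for child in sorted(children.get(current, [])):
            found.append(child)
            queue.append(child)
    return found


def ancestors(term, parent=PARENT):
    """Broader terms above `term`, nearest first."""
    path = []
    while term in parent:
        term = parent[term]
        path.append(term)
    return path


def expand(term):
    """Bundle the query term with its narrower and broader terms."""
    return {"query": term, "narrower": descendants(term), "broader": ancestors(term)}
```

With this toy data, `expand("brain")` lists cortex and hypothalamus as narrower terms and the central nervous system as the broader one, while the sibling "spinal cord" is left out.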

Additionally, we've graduated to a top-level EBI resource. We are now called Gene Expression Atlas, and the URL is http://www.ebi.ac.uk/gxa. All the old URLs will continue to work, but do update your bookmarks.

We are keen to hear your comments and requests - write to us here or email us at our mailing list (sign up on the Atlas homepage). There are lots of new and interesting things we are planning to roll out in the next several months!

Thanks,

--Misha Kapushesky and the Atlas team.

Thursday 26 March 2009

Linking to the Atlas!

Hello!

As the previous post says, we released Atlas 1.0.2. In addition to the memory usage fix discussed below and major autocompletion speed-ups (we switched from using Solr's prefix faceting to our own prefix tree-based implementation), we have improved how you can link to the Atlas.
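For the curious, here is roughly what prefix tree-based autocompletion looks like. This is a generic Python sketch of the data structure, not the Atlas code; the class name `PrefixTree`, its methods and the sample terms in the usage note are our own illustration.

```python
class PrefixTree:
    """A minimal prefix tree (trie) for autocompletion."""

    def __init__(self):
        self.children = {}    # next character -> PrefixTree node
        self.is_term = False  # True if a stored term ends at this node

    def insert(self, term):
        node = self
        for ch in term.lower():
            node = node.children.setdefault(ch, PrefixTree())
        node.is_term = True

    def complete(self, prefix, limit=10):
        """Return up to `limit` stored terms starting with `prefix`, in sorted order."""
        prefix = prefix.lower()
        node = self
        for ch in prefix:  # walk down to the node for the prefix
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_term:
                results.append(text)
            # Push in reverse order so the smallest character is popped first.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results
```

After inserting, say, "brain", "brca1", "brca2" and "bone", asking for completions of "br" walks two nodes down and then only visits the subtree under them, which is why lookups stay fast however many terms are indexed.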

To link to Atlas Gene Pages, you can use the following URL format:

http://www.ebi.ac.uk/microarray-as/atlas/gene?gid=IDENTIFIER

Valid identifiers are, at the moment, the following:
  • embl - EMBL IDs
  • ensgene - Ensembl Gene IDs
  • ensprotein - Ensembl Protein IDs
  • enstranscript - Ensembl Transcript IDs
  • goid - Gene Ontology IDs
  • interproid - InterPro IDs
  • locuslink - Entrez Gene IDs (formerly LocusLink)
  • omimid - OMIM IDs
  • refseq - RefSeq IDs
  • unigene - UniGene IDs
  • uniprot - UniProt Accessions
  • dbxref - A mixed bag of various external identifiers, primarily MGI IDs.
You can download a large file, listing all indexed gene identifiers, to help you link to Atlas:



This file is automatically generated with every data release. Where an identifier links to a single gene, e.g., UniProt accession Q8IZT6, the link http://www.ebi.ac.uk/microarray-as/atlas/gene?gid=Q8IZT6 goes directly to the respective gene page. If the chosen identifier is linked to a group of genes (an InterPro or a Gene Ontology ID, say), the link goes to the gene expression summary Heatmap View. For example, for "phosphopyruvate hydratase complex", GO:0000015, the link is http://www.ebi.ac.uk/microarray-as/atlas/gene?gid=GO:0000015
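If you are generating links in bulk, a few lines of Python will do. This sketch simply fills in the URL format given above; the function name `gene_link` is ours, and leaving ":" unescaped is a choice made to match the GO example, not a documented requirement.

```python
from urllib.parse import quote

# URL format quoted in this post.
GENE_PAGE = "http://www.ebi.ac.uk/microarray-as/atlas/gene"


def gene_link(identifier):
    """Build an Atlas gene link for one identifier (hypothetical helper)."""
    # Keep ":" literal so GO-style identifiers look like the example in the post.
    return f"{GENE_PAGE}?gid={quote(identifier, safe=':')}"
```

`gene_link("Q8IZT6")` and `gene_link("GO:0000015")` reproduce the two example links in the paragraph above.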

Atlas Architecture: Optimizing Memory Usage

Hello!

It's been two weeks since we released Atlas 1.0. We knew when releasing that Atlas was a memory-intensive application (the Solr index underneath is quite large), but we did not expect it to run out of memory every 24 hours. So, we had to fix that quickly. Atlas 1.0.2 was quietly rolled out about a week ago, and since then we have not seen any more problems.

We thought you might be interested to find out what was wrong. We oblige.

The Atlas index is gene-centric. For every gene, a number of fields are indexed. Some of these are gene attributes (e.g., identifier, synonyms, keywords), and some are numeric expression summaries (e.g., the count of studies associated with under/over-expression in a given condition). Each gene has only a limited number of fields, since most genes show differential expression in a small subset of all (~5000) conditions in the Atlas. All the same, the total number of distinct fields associated with genes is quite large: several tens of thousands.

Underneath Solr there is the Lucene library managing the index. To be able to sort search results by field values, as we need to (by study count or by statistical significance of differential expression), Lucene keeps a special cache of term values, called the fieldCache. As search requests cover more and more of the total Atlas content, this cache fills up.

This fieldCache is a static in-memory data store with no eviction policy at all. In fact, it is really a rectangular array, as long as there are genes in the index and as wide as the total number of fields we have. Since most genes have very few fields, the fieldCache is actually sparse, but the underlying implementation forces it to have pretty much the maximal memory footprint you can imagine.
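A back-of-the-envelope calculation shows why the rectangular layout hurts. All the numbers in this Python sketch are invented for illustration (we only said "several tens of thousands" of fields above); they are not measurements from the Atlas, and the sparse estimate ignores the overhead of storing field keys.

```python
def dense_cells(n_genes, n_fields):
    """Cells in a rectangular genes x fields array: one per gene per field."""
    return n_genes * n_fields


def sparse_cells(n_genes, avg_fields_per_gene):
    """Cells actually holding a value if only populated fields were stored."""
    return n_genes * avg_fields_per_gene


# Hypothetical figures, chosen only to show the orders of magnitude involved.
N_GENES = 100_000
N_FIELDS = 30_000
AVG_FIELDS_PER_GENE = 50
BYTES_PER_VALUE = 4  # e.g. a 32-bit int

dense_bytes = dense_cells(N_GENES, N_FIELDS) * BYTES_PER_VALUE
sparse_bytes = sparse_cells(N_GENES, AVG_FIELDS_PER_GENE) * BYTES_PER_VALUE

print(f"dense:  {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

Under these made-up assumptions the fully populated dense array needs hundreds of times more memory than the values it actually contains.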

So, what was happening was that we were exhausting memory very rapidly through this fieldCache. Apparently, this is a well-known issue for Solr/Lucene users (see, for instance, this JIRA ticket).

What are our options? One way out is to implement our own fieldCache, which could exploit the sparse nature of the indexed data, be disk-backed and/or use some sort of LRU eviction strategy. Alternatively, we could just plug in a third-party cache, such as ehcache. Either takes a bit of work. Another, simpler, strategy is to see whether there is room for optimizing the index.
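To make the LRU idea concrete, here is a minimal Python sketch of a least-recently-used cache. It is a generic illustration only: it is not Lucene's fieldCache, not ehcache, and not something we have built; the class name `LRUCache` and the `load` callback are assumptions made for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Keep at most `capacity` entries, evicting the least recently used one."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key, load):
        """Return the cached value for `key`, calling load(key) on a miss."""
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        value = load(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used
        return value

    def __contains__(self, key):
        return key in self._entries
```

The point of such a cache is that memory use is bounded by `capacity` rather than by the total number of fields ever sorted on, at the price of reloading values that have been evicted.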

For now we chose the latter option. It turns out that there are several fields for which we could use a smaller type (short instead of int, float instead of double, and so on) and, with some more pre-computing, several fields that we could compress into one. This allowed us to shrink the index approximately three-fold, and memory usage is now well under control. That is, until our data content grows significantly (in terms of experimental conditions indexed, not in terms of studies or genes). When that happens, we'll have to either increase memory usage or write our own fieldCache. But by that time lots of things might be different: Lucene's implementation, our model, and so on. For now, we are happy that Atlas runs comfortably in under 2GB allocated to a Tomcat servlet container.
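Here is a small Python sketch of the two kinds of saving. The sizes are the standard fixed-width sizes reported by the `struct` module. The bit-packing of two counts into one value is just one plausible way to "compress several fields into one"; it is an assumption for illustration, not a description of what the Atlas index actually does.

```python
import struct

# Standard (platform-independent) sizes of the types mentioned above, in bytes.
INT_BYTES = struct.calcsize("<i")     # 32-bit int
SHORT_BYTES = struct.calcsize("<h")   # 16-bit short
DOUBLE_BYTES = struct.calcsize("<d")  # 64-bit double
FLOAT_BYTES = struct.calcsize("<f")   # 32-bit float


def pack_counts(up, down):
    """Combine two study counts (each 0..65535) into a single 32-bit value.

    Hypothetical scheme: `up` goes in the high 16 bits, `down` in the low 16.
    """
    if not (0 <= up <= 0xFFFF and 0 <= down <= 0xFFFF):
        raise ValueError("counts must fit in 16 bits")
    return (up << 16) | down


def unpack_counts(packed):
    """Recover the (up, down) pair from a value made by pack_counts."""
    return packed >> 16, packed & 0xFFFF
```

Switching a field from int to short, or from double to float, halves its footprint, and packing two 16-bit counts into one 32-bit value turns two fields into one without losing information.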

Hope you found this interesting.

Misha Kapushesky
ArrayExpress Atlas Coordinator

P. S. One of the developers, Pavel Kurnosov, prepared a general presentation about the Atlas architecture. You can view the slides below.

Thursday 12 March 2009

ArrayExpress Atlas 1.0 Released!

Hello!

After a long, productive silence, we have released the first real production version of ArrayExpress Atlas. Take a look at http://www.ebi.ac.uk/microarray-as/atlas.

What's new? Well, under the hood, just about everything:
We hope you like the new software. There are lots of other things we are preparing and we'll try to keep this blog updated more frequently now.

Cheers,

Misha Kapushesky & the Atlas Team.