Thursday, 26 March 2009

Linking to the Atlas!


As the previous post says, we have released Atlas 1.0.2. In addition to the memory usage fix discussed below and major autocompletion speed-ups (we switched from Solr's prefix faceting to our own prefix-tree-based implementation), this release improves how you can link to the Atlas.

To link to Atlas Gene Pages, you can use the following URL format:

Valid identifiers are, at the moment, the following:
  • embl - EMBL IDs
  • ensgene - Ensembl Gene IDs
  • ensprotein - Ensembl Protein IDs
  • enstranscript - Ensembl Transcript IDs
  • goid - Gene Ontology IDs
  • interproid - InterPro IDs
  • locuslink - Entrez Gene IDs (formerly LocusLink)
  • omimid - OMIM IDs
  • refseq - RefSeq IDs
  • unigene - UniGene IDs
  • uniprot - UniProt Accessions
  • dbxref - A mixed bag of various external identifiers, primarily MGI IDs.

You can download a large file listing all indexed gene identifiers to help you link to the Atlas:

This file is automatically generated with every data release. Where an identifier links to a single gene, e.g., UniProt accession Q8IZT6, the link goes directly to the respective gene page. If the chosen identifier is linked to a group of genes, as with an InterPro or a Gene Ontology ID, the link goes to the gene expression summary Heatmap View instead. For example, for "phosphopyruvate hydratase complex", GO:0000015, the link is

Atlas Architecture: Optimizing Memory Usage


It's been two weeks since we released Atlas 1.0. We knew when releasing that Atlas was a memory-intensive application (the Solr index underneath is quite large), but we did not expect it to run out of memory every 24 hours. So, we had to fix that quickly. Atlas 1.0.2 was quietly rolled out about a week ago, and since then we have not seen any more problems.

We thought you might be interested to find out what was wrong. We oblige.

The Atlas index is gene-centric. For every gene, a number of fields are indexed. Some of these are gene attributes (e.g., identifier, synonyms, keywords), and some are numeric expression summaries (e.g., the count of studies associated with under- or over-expression in a given condition). Each gene therefore has only a limited number of fields, since most genes show differential expression in a small subset of all (~5000) conditions in the Atlas. All the same, the total number of distinct fields associated with genes is quite large: several tens of thousands.

Underneath Solr, the Lucene library manages the index. To sort search results by field values, which we need in order to rank by study count or by statistical significance of differential expression, Lucene keeps a special cache of term values called the fieldCache. As search requests cover more and more of the total Atlas content, this cache fills up.

This fieldCache is a static in-memory data store with no eviction policy at all. In effect, it is a rectangular array with one row per gene in the index and one column per field. Since most genes have very few fields, the data in the fieldCache is actually sparse, but the underlying implementation stores it densely, so it ends up with close to the maximal possible memory footprint.
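To get a feel for the numbers, here is a rough sketch in Python. It is a toy model of the behaviour described above, not Lucene's actual code. The class and function names are made up for illustration, and the gene count (200,000), field count (30,000) and per-gene field count (50) are assumptions chosen only to show the order of magnitude.

```python
from array import array


def dense_cache_bytes(num_genes, num_fields, bytes_per_value=4):
    """Footprint when every cached field gets one slot per gene."""
    return num_genes * num_fields * bytes_per_value


def sparse_cache_bytes(total_values, bytes_per_value=4, bytes_per_key=4):
    """Footprint when only the values that exist are stored, each with a key."""
    return total_values * (bytes_per_value + bytes_per_key)


class DenseFieldCache:
    """Toy model: one full-length array per field, filled on first use, never evicted."""

    def __init__(self, num_genes):
        self.num_genes = num_genes
        self._columns = {}

    def get(self, field, values_by_gene):
        # On the first request for a field, allocate a slot for every gene,
        # even though only a handful of genes carry a value for it.
        if field not in self._columns:
            column = array("i", [0]) * self.num_genes
            for gene_index, value in values_by_gene.items():
                column[gene_index] = value
            self._columns[field] = column
        return self._columns[field]

    def slots(self):
        """Total number of allocated slots across all cached fields."""
        return len(self._columns) * self.num_genes


if __name__ == "__main__":
    # Assumed, illustrative sizes (not measured Atlas figures).
    genes, fields, fields_per_gene = 200_000, 30_000, 50
    print("dense :", dense_cache_bytes(genes, fields), "bytes")
    print("sparse:", sparse_cache_bytes(genes * fields_per_gene), "bytes")
```

With these assumed sizes the dense layout needs about 24 GB once every field has been touched, while storing only the existing values would need about 80 MB.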

So, what was happening was that we were exhausting memory very rapidly through this fieldCache. Apparently, this is a well-known issue for Solr/Lucene users (see, for instance, this JIRA ticket).

What are our options? One way out is to implement our own fieldCache, which could exploit the sparse nature of the indexed data, be disk-backed, and/or use some sort of LRU eviction strategy. Alternatively, we could plug in a third-party cache, such as ehcache. Either takes a bit of work. Another, simpler, strategy is to see whether there is room to optimize the index itself.
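For the curious, the first option could look roughly like the sketch below: a cache that keeps sparse per-field columns and evicts the least recently used field once a limit is reached. This is an illustration in Python, not something we have implemented. The class name LRUFieldCache, the max_fields limit and the loader callback are all hypothetical.

```python
from collections import OrderedDict


class LRUFieldCache:
    """Keeps at most max_fields sparse columns, evicting the least recently used."""

    def __init__(self, max_fields, loader):
        self.max_fields = max_fields
        # loader(field) is assumed to return a sparse column for the field,
        # e.g. a dict mapping gene index to value, read from the index.
        self._loader = loader
        self._columns = OrderedDict()

    def get(self, field):
        if field in self._columns:
            # Cache hit: mark the field as most recently used.
            self._columns.move_to_end(field)
            return self._columns[field]
        # Cache miss: load the column, then evict the oldest one if over the limit.
        column = self._loader(field)
        self._columns[field] = column
        if len(self._columns) > self.max_fields:
            self._columns.popitem(last=False)
        return column

    def __contains__(self, field):
        return field in self._columns

    def __len__(self):
        return len(self._columns)
```

Memory is then bounded by max_fields times the size of the largest column, at the cost of reloading a column whenever an evicted field is sorted on again.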

For now we chose the latter option. It turns out that several fields could use a smaller type (short instead of int, float instead of double, and so on), and that with some more pre-computing we could compress several of these fields into one. This allowed us to shrink the index approximately three-fold, and memory usage is now well under control. That holds until our data content grows significantly (in terms of experimental conditions indexed, not in terms of studies or genes); when that happens, we'll have to either increase the memory allocation or write our own fieldCache. But by that time lots of things might be different: Lucene's implementation, our model, and so on. For now, we are happy that Atlas runs comfortably in under 2GB allocated to a Tomcat servlet container.
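To make the idea concrete, here is a small Python sketch of the two tricks: narrower types and folding two values into one field. The specific packing shown (an "up" and a "down" study count, each assumed to fit in 8 bits, sharing one 16-bit value) is a hypothetical example and not necessarily the encoding Atlas uses.

```python
import struct

# Standard sizes of the fixed-width types mentioned above ("<" = standard sizes).
INT_BYTES = struct.calcsize("<i")      # 32-bit integer
SHORT_BYTES = struct.calcsize("<H")    # 16-bit unsigned integer
DOUBLE_BYTES = struct.calcsize("<d")   # 64-bit float
FLOAT_BYTES = struct.calcsize("<f")    # 32-bit float


def pack_counts(up, down):
    """Fold two small counts into one 16-bit value (hypothetical layout)."""
    if not (0 <= up <= 0xFF and 0 <= down <= 0xFF):
        raise ValueError("each count must fit in 8 bits")
    return (up << 8) | down


def unpack_counts(packed):
    """Recover the (up, down) pair from a packed 16-bit value."""
    return packed >> 8, packed & 0xFF


if __name__ == "__main__":
    packed = pack_counts(3, 7)
    print(packed, unpack_counts(packed))
    # Two separate int fields versus one packed 16-bit field, per gene.
    print(2 * INT_BYTES, "bytes ->", SHORT_BYTES, "bytes")
```

Note that a value packed this way sorts by the high byte first, so the layout would have to match the sort order the search actually needs.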

Hope you found this interesting.

Misha Kapushesky
ArrayExpress Atlas Coordinator

P. S. One of the developers, Pavel Kurnosov, prepared a general presentation about the Atlas architecture. You can view the slides below.

Thursday, 12 March 2009

ArrayExpress Atlas 1.0 Released!


After a long, productive silence, we have released the first real production version of ArrayExpress Atlas. Take a look at

What's new? Well, under the hood, just about everything:

We hope you like the new software. There are lots of other things we are preparing and we'll try to keep this blog updated more frequently now.


Misha Kapushesky & the Atlas Team.