Thursday 26 March 2009

Atlas Architecture: Optimizing Memory Usage

Hello!

It's been two weeks since we released Atlas 1.0. We knew when releasing that Atlas was a memory-intensive application (the Solr index underneath is quite large), but we did not expect it to run out of memory every 24 hours. So, we had to fix that quickly. Atlas 1.0.2 was quietly rolled out about a week ago, and since then we have not seen any more problems.

We thought you might be interested to find out what was wrong. We oblige.

The Atlas index is gene-centric. For every gene, a number of fields are indexed. Some of these are gene attributes (e.g., identifier, synonyms, keywords), and some are numeric expression summaries (e.g., the count of studies associated with under/over expression in a given condition). Each gene therefore has only a limited number of fields, since most genes show differential expression in a small subset of all (~5000) conditions in the Atlas. All the same, the total number of distinct fields associated with genes is quite large: several tens of thousands.

Underneath Solr, the Lucene library manages the index. To sort search results by field values, which we need in order to rank by study count or by statistical significance of differential expression, Lucene keeps a special cache of term values, called the fieldCache. As search requests cover more and more of the total Atlas content, this cache fills up.

This fieldCache is a static in-memory data store with no eviction policy at all. In fact, it is really a rectangular array, with as many rows as there are genes in the index and as many columns as the total number of fields we have. Since most genes have very few fields, the data in the fieldCache is actually sparse, but the underlying implementation forces it to have pretty much the maximal memory footprint you can imagine.
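To make the arithmetic concrete, here is a minimal sketch of the difference between a dense genes-by-fields array and a sparse layout. It is not Lucene's actual code: the class name, the `SparseCounts` helper and the gene and field counts are all made up for illustration, and only the "tens of thousands of fields" order of magnitude comes from the text above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: hypothetical names and sizes, not Lucene's implementation.
public class FieldCacheFootprint {

    // Bytes taken by the int values of a dense genes x fields array
    // (object and array headers are ignored).
    static long denseBytes(long genes, long fields) {
        return genes * fields * Integer.BYTES;
    }

    // A sparse alternative: per gene, keep only the fields that are actually set.
    static final class SparseCounts {
        private final Map<Integer, Map<Integer, Integer>> rows = new HashMap<>();

        void put(int gene, int field, int value) {
            rows.computeIfAbsent(gene, g -> new HashMap<>()).put(field, value);
        }

        // Missing entries read as 0, the same default a dense int array would give.
        int get(int gene, int field) {
            Map<Integer, Integer> row = rows.get(gene);
            return row == null ? 0 : row.getOrDefault(field, 0);
        }

        // Number of values actually held in memory.
        long storedValues() {
            long n = 0;
            for (Map<Integer, Integer> row : rows.values()) {
                n += row.size();
            }
            return n;
        }
    }

    public static void main(String[] args) {
        long genes = 100_000;  // assumed gene count, for illustration
        long fields = 20_000;  // assumed field count ("several tens of thousands")
        System.out.println("dense cells: " + genes * fields);
        System.out.println("dense bytes: " + denseBytes(genes, fields));

        SparseCounts sparse = new SparseCounts();
        sparse.put(42, 7, 3);   // gene 42 has 3 studies for field 7
        sparse.put(42, 19, 1);
        System.out.println("sparse values stored: " + sparse.storedValues());
    }
}
```

With these assumed sizes the dense layout needs room for two billion cells, whether or not a gene has a value in them, while the sparse one only pays for the entries that exist (plus per-entry map overhead, which this sketch does not measure).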

So we were exhausting memory very rapidly through this fieldCache. Apparently, this is a well-known issue for Solr/Lucene users (see, for instance, this JIRA ticket).

What are our options? One way out is to implement our own fieldCache, which could make use of the sparse nature of the indexed data, be disk-backed and/or use some sort of LRU eviction strategy. Alternatively, we could plug in a third-party cache, such as ehcache. Either takes a bit of work. Another, simpler, strategy is to see whether there is room for optimizing the index itself.
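For the curious, the LRU idea mentioned above can be sketched in a few lines with the JDK's `LinkedHashMap`. This is only an illustration of the eviction strategy, not something we have plugged into Lucene; the class name `LruFieldCache` and the field names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a bounded cache that evicts the least recently used entry.
// Not thread-safe; a real cache shared between requests would need synchronization.
public class LruFieldCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    public LruFieldCache(int maxEntries) {
        // accessOrder = true: iteration order goes from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruFieldCache<String, int[]> cache = new LruFieldCache<>(2);
        cache.put("field_a", new int[] {1});
        cache.put("field_b", new int[] {2});
        cache.get("field_a");                 // field_a is now the most recently used
        cache.put("field_c", new int[] {3});  // over capacity: field_b is evicted
        System.out.println(cache.keySet());   // prints [field_a, field_c]
    }
}
```

The trade-off is that an evicted field has to be reloaded from the index the next time someone sorts on it, so the cache size would need tuning against real query patterns.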

For now we chose the latter option. It turns out that there are several fields for which we could use a smaller type (short instead of int, float instead of double, and so on), and, with some more pre-computing, we could also compress several of these fields into one. This allowed us to shrink the index approximately three-fold, and now memory usage is well under control. Until, of course, our data content grows significantly (in terms of experimental conditions indexed, not in terms of studies or genes). When that happens, we'll have to either increase memory usage or write our own fieldCache. But by that time lots of things might be different: Lucene's implementation, our model, and so on. For now, we are happy that Atlas runs comfortably in under 2GB allocated to a Tomcat servlet container.
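As a rough illustration of what "compressing several fields into one" can look like, the sketch below packs two small counts into a single int. This is an assumed example, not the actual Atlas encoding: the class name, the choice of up/down study counts and the 16-bit limit are all hypothetical.

```java
// Illustration only: hypothetical encoding, not the one used in Atlas.
public class PackedCounts {

    // Pack two non-negative counts (each in 0..65535) into one int:
    // "up" goes in the high 16 bits, "down" in the low 16 bits.
    static int pack(int up, int down) {
        if (up < 0 || up > 0xFFFF || down < 0 || down > 0xFFFF) {
            throw new IllegalArgumentException("count out of 16-bit range");
        }
        return (up << 16) | down;
    }

    // Unsigned shift, so a set top bit does not turn the result negative.
    static int up(int packed) {
        return packed >>> 16;
    }

    static int down(int packed) {
        return packed & 0xFFFF;
    }

    // Bytes needed for the values of one cached column, ignoring array headers.
    static long columnBytes(long genes, int bytesPerValue) {
        return genes * bytesPerValue;
    }

    public static void main(String[] args) {
        int packed = pack(12, 3);  // 12 studies up, 3 studies down
        System.out.println(up(packed) + " up, " + down(packed) + " down");

        long genes = 100_000;  // assumed gene count, for illustration
        System.out.println("two int columns: " + 2 * columnBytes(genes, Integer.BYTES) + " bytes");
        System.out.println("one packed int column: " + columnBytes(genes, Integer.BYTES) + " bytes");
        System.out.println("one short column: " + columnBytes(genes, Short.BYTES) + " bytes");
    }
}
```

Packing halves the number of cached columns for such pairs, and moving a column from int to short halves its size again; the actual saving depends on which fields can be narrowed or merged.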

Hope you found this interesting.

Misha Kapushesky
ArrayExpress Atlas Coordinator

P. S. One of the developers, Pavel Kurnosov, prepared a general presentation about the Atlas architecture. You can view the slides below.

5 comments:

Anonymous said...

I enjoyed your presentation on ArrayExpress Atlas architecture. It was very interesting to learn of your use of SOLR for AE Atlas. I was left a bit puzzled, though, as to why you have chosen a document-per-gene instead of a document-per-experiment approach. Superficially, it seems doc per experiment is more effective, since adding an experiment becomes simpler. What am I missing?

Thank you for your great job with AE Atlas. It is definitely way ahead of GEO profiles :)

Best regards,
Zufar

Misha Kapushesky said...

Hi Zufar,

Thanks for the kind words!

The main reason for keeping things at gene-level granularity is that all our searches are for genes, hence it is the most sensible thing - we always aim to return gene sets, we do faceting on genes (e.g., how many genes are annotated under a particular Gene Ontology category, how many genes are active in a particular condition, etc.), and so on.

That said, what you may not see directly from the presentation is that the Atlas architecture actually consists not of one index but of two: a gene index and an experiment index. The latter is used for some advanced functionality (such as gene pages and experiment pages, coming in the next Atlas version, out in a couple of weeks).

Keep commenting - all the feedback is very welcome and very useful!

Thanks,

Misha.

Misha Kapushesky said...

P.S. By the way, the ArrayExpress Archive, which is undergoing a major redesign, will also be running using a full-text experiment-centric index.

Anonymous said...

Hi, Misha,
Thank you for the quick reply. So do you mean search in ArrayExpress Archive will also be done via SOLR? I'm looking forward to seeing your new version of the Archive.

I just started looking at SOLR for our in-house tool for gene-centric searches of processed expression data (say, results of diff expression analysis). The usual big-table-in-a-DB approach seems to be prohibitively slow as the table grows too large. The fact that you used SOLR for such a major project as AE Atlas sends a very encouraging message to me.

Thank you very much for your blog!
--Zufar

Misha Kapushesky said...

Hi Zufar,

As far as I know, search in ArrayExpress Archive will be done via Lucene directly, without Solr.

Good luck with your project! By the way, hopefully before the end of this year, we'll release a "stand-alone" version of Atlas that you'll be able to run locally, on your own data.

--Misha