It's been two weeks since we released Atlas 1.0. We knew when releasing that Atlas was a memory-intensive application (the Solr index underneath is quite large), but we did not expect it to run out of memory every 24 hours. So, we had to fix that quickly. Atlas 1.0.2 was quietly rolled out about a week ago, and since then we have not seen any more problems.
We thought you might be interested to find out what was wrong. We oblige.
The Atlas index is gene-centric. For every gene, a number of fields are indexed. Some of these are gene attributes (e.g., identifier, synonyms, keywords), and some are numeric expression summaries (e.g., the count of studies associated with under/over expression in a given condition). Each gene has only a limited number of fields, since most genes show differential expression in a small subset of all (~5000) conditions in the Atlas. All the same, the total number of distinct fields associated with genes is quite large: several tens of thousands.
Underneath Solr, the Lucene library manages the index. To sort search results by field values (in our case, by study count or statistical significance of differential expression), Lucene keeps a special cache of term values, called the fieldCache. As search requests cover more and more of the total Atlas content, this cache fills up.
This fieldCache is a static in-memory data store with no eviction policy at all. In fact, it is really a rectangular array, as long as there are genes in the index and as wide as the total number of fields we have. Since most genes have very few fields, the fieldCache is actually sparse, but the underlying implementation forces it to have pretty much the maximal memory footprint you can imagine.
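To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Java that compares a dense genes-by-fields array with a layout that stores only populated cells. It follows the simplified picture above and does not touch Lucene itself. The gene count, the average number of fields per gene and the per-entry key overhead are illustrative assumptions, not Atlas measurements; only the "tens of thousands of fields" order of magnitude comes from this post.

```java
public class FieldCacheFootprint {

    /** Bytes needed for a dense genes x fields array of fixed-width values. */
    static long denseBytes(long genes, long fields, int bytesPerValue) {
        return genes * fields * bytesPerValue;
    }

    /**
     * Bytes needed if only populated cells are stored, each one paying
     * for its value plus a key that says which field it belongs to.
     */
    static long sparseBytes(long genes, long avgFieldsPerGene,
                            int bytesPerValue, int bytesPerKey) {
        return genes * avgFieldsPerGene * (bytesPerValue + bytesPerKey);
    }

    public static void main(String[] args) {
        long genes = 100_000;        // assumption, not the real Atlas gene count
        long fields = 20_000;        // "several tens of thousands" of distinct fields
        long avgFieldsPerGene = 50;  // assumption: most genes have very few fields

        long dense = denseBytes(genes, fields, Integer.BYTES);
        long sparse = sparseBytes(genes, avgFieldsPerGene, Integer.BYTES, Integer.BYTES);

        System.out.printf("dense:  %,d bytes%n", dense);
        System.out.printf("sparse: %,d bytes%n", sparse);
    }
}
```

With these made-up inputs the dense layout needs gigabytes while the sparse one needs tens of megabytes, which is the gap the paragraph above describes.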
So we were exhausting memory very rapidly through this fieldCache. Apparently, this is a well-known issue for Solr/Lucene users (see, for instance, this JIRA ticket).
What are our options? One way out is to implement our own fieldCache, which could exploit the sparse nature of the indexed data, be disk-backed, and/or use some sort of LRU eviction strategy. Alternatively, we could plug in a third-party cache, such as ehcache. Either route takes a bit of work. Another, simpler strategy is to check whether there is room for optimizing the index itself.
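For readers unfamiliar with LRU eviction, the idea can be sketched with nothing but the Java standard library. This is a minimal illustration, not our implementation and not a drop-in replacement for Lucene's fieldCache: the class name LruFieldCache and the field names in the usage are hypothetical, the cache is bounded by entry count instead of bytes, and it is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** A size-bounded map that evicts the least recently used entry. */
public class LruFieldCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruFieldCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently accessed
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after every insertion; returning true drops the eldest entry
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruFieldCache<String, int[]> cache = new LruFieldCache<>(2);
        cache.put("field_a", new int[] {1, 2, 3}); // hypothetical field names
        cache.put("field_b", new int[] {4, 5, 6});
        cache.get("field_a");                      // touch field_a so field_b becomes eldest
        cache.put("field_c", new int[] {7, 8, 9}); // evicts field_b
        System.out.println(cache.keySet());
    }
}
```

A real replacement would also have to reload evicted values from the index on demand, which is where most of the "bit of work" would go.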
For now we chose the latter option. It turns out that there are several fields for which we could use a smaller type (short instead of int, float instead of double, and so on), and with some more pre-computing we could also compress several of these fields into one. This allowed us to shrink the index approximately three-fold, and now memory usage is well under control. That holds until our data content grows significantly (in terms of experimental conditions indexed, not in terms of studies or genes). When that happens, we'll have to either increase memory usage or write our own fieldCache. But by that time lots of things might be different - Lucene's implementation, our model, and so on. For now, we are happy that Atlas runs comfortably in under 2GB allocated to a Tomcat servlet container.
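As an illustration of what compressing several fields into one can look like, the sketch below packs two small counts into a single int. It is a hypothetical example of the general technique, not the actual Atlas field layout: the class PackedCounts, the choice of "up" and "down" study counts, and the bit widths are all assumptions.

```java
public class PackedCounts {
    private static final int MAX_UP = 0x7FFF;   // 15 bits, keeps the packed int non-negative
    private static final int MAX_DOWN = 0xFFFF; // 16 bits

    /** Packs two small non-negative counts into one int: up in the high bits, down in the low. */
    static int pack(int up, int down) {
        if (up < 0 || up > MAX_UP || down < 0 || down > MAX_DOWN) {
            throw new IllegalArgumentException("count out of range");
        }
        return (up << 16) | down;
    }

    /** Recovers the count stored in the high bits. */
    static int up(int packed) {
        return packed >>> 16;
    }

    /** Recovers the count stored in the low 16 bits. */
    static int down(int packed) {
        return packed & 0xFFFF;
    }

    public static void main(String[] args) {
        int packed = pack(12, 3);
        System.out.println(up(packed) + " up, " + down(packed) + " down");
    }
}
```

Because the first count sits in the high bits and the packed value never goes negative, sorting by the packed int orders entries by the first count and breaks ties on the second, so one sortable field can stand in for two.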
Hope you found this interesting.
ArrayExpress Atlas Coordinator
P. S. One of the developers, Pavel Kurnosov, prepared a general presentation about the Atlas architecture. You can view the slides below.