Thursday, 26 March 2009

Linking to the Atlas!

Hello!

As the previous post says, we have released Atlas 1.0.2. In addition to the memory usage fix discussed below and major autocompletion speed-ups (we switched from Solr's prefix faceting to our own prefix tree-based implementation), we have improved the way you can link to the Atlas.
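For the curious, here is roughly what prefix tree-based autocompletion looks like. This is only a minimal sketch in Python, not the Atlas implementation; the class name, the `limit` parameter and the sample terms are made up for illustration.

```python
class PrefixTree:
    """A toy prefix tree (trie): one node per character of each stored term."""

    def __init__(self):
        self.children = {}    # character -> child PrefixTree node
        self.is_term = False  # True if a stored term ends at this node

    def insert(self, term):
        node = self
        for ch in term:
            node = node.children.setdefault(ch, PrefixTree())
        node.is_term = True

    def complete(self, prefix, limit=10):
        """Return up to `limit` stored terms starting with `prefix`, alphabetically."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no stored term has this prefix
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_term:
                results.append(text)
            # Push children in reverse order so they are popped alphabetically.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


tree = PrefixTree()
for term in ["sarcoma", "sarcoidosis", "carcinoma"]:  # illustrative terms only
    tree.insert(term)
suggestions = tree.complete("sar")
```

Finding the node for a prefix costs one step per typed character, regardless of how many terms are stored, which is what makes this structure attractive for suggest-as-you-type.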

To link to Atlas Gene Pages, you can use the following URL format:

http://www.ebi.ac.uk/microarray-as/atlas/gene?gid=IDENTIFIER

Valid identifiers are, at the moment, the following:
  • embl - EMBL IDs
  • ensgene - Ensembl Gene IDs
  • ensprotein - Ensembl Protein IDs
  • enstranscript - Ensembl Transcript IDs
  • goid - Gene Ontology IDs
  • interproid - InterPro IDs
  • locuslink - Entrez Gene IDs (formerly LocusLink)
  • omimid - OMIM IDs
  • refseq - RefSeq IDs
  • unigene - UniGene IDs
  • uniprot - UniProt Accessions
  • dbxref - A mixed bag of various external identifiers, primarily MGI IDs.
You can download a large file, listing all indexed gene identifiers, to help you link to Atlas:



This file is generated automatically with every data release. Where an identifier maps to a single gene, e.g., UniProt accession Q8IZT6, the link http://www.ebi.ac.uk/microarray-as/atlas/gene?gid=Q8IZT6 goes directly to the respective gene page. If the chosen identifier is linked to a group of genes (an InterPro or a Gene Ontology ID, say), then the link goes to the gene expression summary Heatmap View; e.g., for "phosphopyruvate hydratase complex", GO:0000015, the link is http://www.ebi.ac.uk/microarray-as/atlas/gene?gid=GO:0000015
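If you generate links programmatically, something along these lines should do. It is a small Python sketch; the function name is our own invention, and we assume the colon in identifiers such as GO:0000015 can stay unescaped, as in the example link above.

```python
from urllib.parse import quote

ATLAS_GENE_URL = "http://www.ebi.ac.uk/microarray-as/atlas/gene"


def atlas_gene_link(identifier):
    """Build an Atlas gene link from an identifier (hypothetical helper)."""
    # Escape anything unusual, but keep ':' readable, as in GO:0000015.
    return f"{ATLAS_GENE_URL}?gid={quote(identifier, safe=':')}"


single_gene_link = atlas_gene_link("Q8IZT6")
gene_group_link = atlas_gene_link("GO:0000015")
```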

Atlas Architecture: Optimizing Memory Usage

Hello!

It's been two weeks since we released Atlas 1.0. We knew when releasing that Atlas was a memory-intensive application (the Solr index underneath is quite large), but we did not expect it to run out of memory every 24 hours. So, we had to fix that quickly. Atlas 1.0.2 was quietly rolled out about a week ago, and since then we have not seen any more problems.

We thought you might be interested to find out what was wrong. We oblige.

The Atlas index is gene-centric. For every gene, a bunch of fields are indexed. Some of these are gene attributes (e.g., identifier, synonyms, keywords, etc.), and some are numeric expression summaries (e.g., the count of studies associated with under/over-expression in a given condition). Each gene has only a limited number of fields, since most genes show differential expression in a small subset of all (~5000) conditions in the Atlas. All the same, the total number of distinct fields associated with genes is quite large: several tens of thousands.

Underneath Solr, the Lucene library manages the index. To sort search results by field values, which we need in order to rank by study count or statistical significance of differential expression, Lucene keeps a special cache of term values, called the fieldCache. As search requests cover more and more of the total Atlas content, this cache fills up.

This fieldCache is a static in-memory data store with no eviction policy at all. In fact, it is really a rectangular array, as long as there are genes in the index and as wide as the total number of fields we have. Since most genes have very few fields, the fieldCache is actually sparse, but the underlying implementation forces it to have pretty much the maximal memory footprint you can imagine.
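To get a feel for the difference, here is a back-of-the-envelope calculation in Python. The gene count, the number of populated fields per gene and the byte sizes are assumptions picked purely for illustration; only the "tens of thousands of fields" figure comes from our index.

```python
def dense_cache_bytes(n_genes, n_fields, bytes_per_value=4):
    """Footprint of a fully populated genes x fields array."""
    return n_genes * n_fields * bytes_per_value


def sparse_cache_bytes(n_genes, fields_per_gene, bytes_per_entry=8):
    """Footprint if only populated (field id, value) pairs were stored."""
    return n_genes * fields_per_gene * bytes_per_entry


N_GENES = 200_000      # assumption, not the real Atlas gene count
N_FIELDS = 30_000      # "several tens of thousands" of distinct fields
FIELDS_PER_GENE = 50   # assumption: most genes populate very few fields

dense = dense_cache_bytes(N_GENES, N_FIELDS)
sparse = sparse_cache_bytes(N_GENES, FIELDS_PER_GENE)
print(f"dense: {dense / 1e9:.1f} GB, sparse: {sparse / 1e6:.1f} MB")
```

Under these made-up numbers, the rectangular layout is a few hundred times larger than a sparse one would be, which is the gap described above.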

So, what was happening was that we were exhausting memory very rapidly through this fieldCache. Apparently, this is a well-known issue for Solr/Lucene users (see, for instance, this JIRA ticket).

What are our options? One way out is to implement our own fieldCache, which could exploit the sparse nature of the indexed data, be disk-backed and/or use some sort of LRU eviction strategy; alternatively, we could just plug in a third-party cache, such as ehcache. That takes a bit of work. Another, simpler strategy is to look for room to optimize the index.
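To illustrate the first option, here is what an LRU eviction strategy could look like in miniature. This is a Python sketch of the general idea only; it is not Lucene's API and not something we have built, and the class, the `loader` callback and the field names are hypothetical.

```python
from collections import OrderedDict


class LruFieldCache:
    """Keeps values for at most `max_fields` fields, evicting the least recently used."""

    def __init__(self, max_fields, loader):
        self.max_fields = max_fields
        self.loader = loader           # hypothetical callback: field name -> values
        self._entries = OrderedDict()  # ordered from least to most recently used

    def get(self, field):
        if field in self._entries:
            self._entries.move_to_end(field)  # mark as most recently used
            return self._entries[field]
        values = self.loader(field)           # cache miss: load the field's values
        self._entries[field] = values
        if len(self._entries) > self.max_fields:
            self._entries.popitem(last=False)  # evict the least recently used field
        return values

    def __contains__(self, field):
        return field in self._entries


loads = []


def load_field(field):
    loads.append(field)  # record each load so cache misses are visible
    return {"field": field}


cache = LruFieldCache(max_fields=2, loader=load_field)
for name in ["a", "b", "a", "c"]:  # "b" is the least recently used when "c" arrives
    cache.get(name)
```

The trade-off is that an evicted field has to be reloaded the next time someone sorts by it, so memory is bounded at the cost of occasional slower queries.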

For now we chose the latter option. It turns out that there are several fields for which we could use a smaller type (short instead of int, float instead of double, and so on) and, with some more pre-computing, several fields that we could compress into one. This allowed us to shrink the index approximately three-fold, and now memory usage is well under control. Until, of course, our data content grows significantly (in terms of experimental conditions indexed, not in terms of studies or genes). When that happens, we'll have to either increase memory usage or write our own fieldCache. But by that time lots of things might be different: Lucene's implementation, our model, and so on. For now, we are happy that Atlas runs comfortably in under 2GB allocated to a Tomcat servlet container.
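As an illustration of both tricks, here is a Python sketch. The particular packing scheme (an up-count and a down-count squeezed into one 32-bit value) is an assumption made up for this example, not a description of the actual Atlas index fields.

```python
import struct

# Standard sizes in bytes of the types mentioned above.
SIZES = {
    "short": struct.calcsize("<h"),
    "int": struct.calcsize("<i"),
    "float": struct.calcsize("<f"),
    "double": struct.calcsize("<d"),
}


def pack_counts(up, down):
    """Combine two counts (each 0..65535) into a single 32-bit value."""
    if not (0 <= up <= 0xFFFF and 0 <= down <= 0xFFFF):
        raise ValueError("each count must fit in 16 bits")
    return (up << 16) | down


def unpack_counts(packed):
    """Recover the (up, down) counts from a packed value."""
    return packed >> 16, packed & 0xFFFF


packed = pack_counts(12, 3)  # illustrative counts only
restored = unpack_counts(packed)
```

One packed field instead of two means one cached column instead of two, so the saving applies to every gene in the index.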

Hope you found this interesting.

Misha Kapushesky
ArrayExpress Atlas Coordinator

P. S. One of the developers, Pavel Kurnosov, prepared a general presentation about the Atlas architecture. You can view the slides below.

Thursday, 12 March 2009

ArrayExpress Atlas 1.0 Released!

Hello!

After a long, productive silence, we have released the first real production version of ArrayExpress Atlas. Take a look at http://www.ebi.ac.uk/microarray-as/atlas.

What's new? Well, under the hood, just about everything:
We hope you like the new software. There are lots of other things we are preparing and we'll try to keep this blog updated more frequently now.

Cheers,

Misha Kapushesky & the Atlas Team.

Monday, 23 June 2008

Atlas Web Services Alpha

Hello!

A new build is up - 4866. No major visible changes but several important reworkings under the hood:
  • Improved caching - hopefully you should notice faster responses
  • Better experimental factor ontology (structured vocabulary of experimental variables) expansion - now expands down through all available levels instead of just one as before, and is on by default. Example: search for tumor.
  • Auto-suggest drop-down works a bit more intuitively, esp. on conditions. You'll see what I mean, just start typing a query.
The big deal with this release is, however, the very limited, initial SOAP Web Services API to the Atlas. See http://www.ebi.ac.uk/microarray/doc/atlas/api.html for further detail on this. Among other things it's a step towards batch querying of data.

As always, your feedback is welcome! More interesting and wonderful things are in the works.

--Misha
P.S. It would be easy for us to offer expansion by lots of other ontologies. Would that be useful to users, too?

Wednesday, 4 June 2008

Hello!

A brief update on the ArrayExpress Atlas project. Build 4720 is up! It turns out the mailing list subscription form on the front page wasn't working. If you tried to subscribe and didn't hear anything back from us, please subscribe again.

ArrayExpress Experimental Factor Ontology

An experimental feature has been added. Now that our curation team has released an updated version of the ArrayExpress Experimental Factor Ontology (read the news item or browse the EFO ontology in EBI's OLS) we can use this to help your queries across experimental factor values (conditions) work better. For example, in EFO the "cancer" term has three children, "sarcoma", "chordoma" and "carcinoma". If you just query for cancer, you will get a certain number of hits, including "breast cancer" and "gastric cancer" and so forth.

Now run the same query with the EFO expansion checkbox turned on. You will see that the conditions found now include such hits as "clear cell sarcoma of the kidney" and "bladder carcinoma". These conditions come up because we expanded your original "cancer" query with the terms one level below it in EFO. Try it out.

Of course, this is just a first attempt at using EFO in this project. Many issues still need to be addressed, such as: how far into the ontology should we dig to expand our queries? 1 level? 2 or more? Also, what is the best way to present the results of an expanded query?
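For those who like to see things in code, the expansion described above can be sketched as follows. This is a toy Python model with a configurable depth, not the Atlas implementation; the "adenocarcinoma" entry is a hypothetical deeper term added only to show multi-level expansion, and the substring matching is a simplification.

```python
# Toy fragment of an ontology: parent term -> child terms.
EFO_CHILDREN = {
    "cancer": ["sarcoma", "chordoma", "carcinoma"],
    "carcinoma": ["adenocarcinoma"],  # hypothetical, for illustration only
}


def expand_term(term, children, max_levels=1):
    """Return the term plus its descendants down to max_levels (None = all levels)."""
    expanded = [term]
    frontier = [term]
    level = 0
    while frontier and (max_levels is None or level < max_levels):
        next_frontier = []
        for parent in frontier:
            for child in children.get(parent, []):
                if child not in expanded:
                    expanded.append(child)
                    next_frontier.append(child)
        frontier = next_frontier
        level += 1
    return expanded


def matching_conditions(terms, conditions):
    """Conditions whose name contains any of the query terms."""
    return [c for c in conditions if any(t in c for t in terms)]


CONDITIONS = [
    "breast cancer",
    "gastric cancer",
    "clear cell sarcoma of the kidney",
    "bladder carcinoma",
]

plain_hits = matching_conditions(["cancer"], CONDITIONS)
expanded_hits = matching_conditions(expand_term("cancer", EFO_CHILDREN), CONDITIONS)
```

Here `plain_hits` contains only the two conditions with "cancer" in the name, while `expanded_hits` also picks up the sarcoma and carcinoma conditions.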

In any case, while we are working on improving this, we thought we'd let you try out this first approach already. Do let us know what you think!


All the best,

Misha Kapushesky
ArrayExpress Atlas Coordinator

Tuesday, 27 May 2008

Introducing the ArrayExpress Atlas of Gene Expression!

Hello, everyone!

We are launching a new little project, the ArrayExpress Atlas of Gene Expression - do take a look. At the moment it is a basic query engine over a curated subset of microarray data in ArrayExpress, capable of ranking genes in order of their strength of differential expression in various tissues, disease states, and other factor variables in the database.

It is already a pretty useful tool. A recent Science paper by Bouatia-Naji et al. reported that a polymorphism in a gene called G6PC2 is associated with Fasting Plasma Glucose (FPG) levels, a finding important for understanding glucose homeostasis in the general human population. Curious about this gene, we queried the atlas database to see where it is over-expressed. The pancreatic islet organism part came up high on the list, as did, more generally, pancreas. This computational finding agrees with the fact that this gene encodes a protein selectively expressed in pancreatic islets. It is interesting, perhaps, to observe that it was also found over-expressed in several brain tissues.


We hope that eventually the atlas will become a platform for interesting research and a tool for extracting interesting biological data from the large corpus of public microarray studies. We will update you via this blog and our mailing list and hope to hear your feedback, requests and ideas as well.

All the best,

Misha Kapushesky
ArrayExpress Atlas Coordinator