We increasingly need efficient data storage and retrieval for the large data projects that we work on. To meet these needs we have added the ElasticSearch search engine to our toolset, in addition to Apache’s CouchDB as our NoSQL database, and have had excellent results using them together.
What is ElasticSearch?
ElasticSearch is a new front end to the Lucene search index, which puts it in the same product category as the widely used Solr project. Where ElasticSearch differs from Solr is in ease of use, flexibility and, very importantly, simplicity. Unlike Solr, there is nary an XML configuration file in sight!
To index a document with ElasticSearch you simply PUT a JSON object to the ElasticSearch web service. To query the generated index, you build a JSON query object using the Query DSL and send it to the web service, which returns the relevant documents. This query language is very expressive, unlocking most of the functionality available in Lucene.
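As a rough illustration of the two calls described above, here is a minimal Python sketch that builds (but does not send) an index request and a search request. The node address, the index and type names, and the sample document are all made up for the example, and the `/index/type/id` URL layout is an assumption that may not match every ElasticSearch version.

```python
import json
from urllib.request import Request

# Assumption: a single ElasticSearch node on its default local port.
ES_URL = "http://localhost:9200"


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT that indexes one JSON document."""
    return Request(
        f"{ES_URL}/{index}/{doc_type}/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_search_request(index, query):
    """Build (but do not send) a search request carrying a Query DSL body."""
    return Request(
        f"{ES_URL}/{index}/_search",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical document and query, purely for illustration.
index_req = build_index_request("datasets", "row", "1", {"title": "Hello", "views": 3})
search_req = build_search_request("datasets", {"query_string": {"query": "title:hello"}})
```

Passing either object to `urllib.request.urlopen` would send it to a running node; the sketch stops short of that so it can be read and run without a server.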
ElasticSearch and CouchDB share many philosophies and practical similarities, and because of this they complement each other very well. CouchDB can also be used to index your data, and ElasticSearch could be used on its own as a key-value data store, but when they are used together each makes up for the other’s deficiencies wonderfully.
Because of some solid design decisions on both projects’ parts, integrating the two systems is just about painless. The very innovative River system in ElasticSearch allows you to accomplish this by simply running a few REST requests against the server, with almost no additional configuration requirements. Personally I knew it was something special when I realized that every time an ElasticSearch node starts it gives itself a unique nickname (for clustering purposes), and I was now the owner of a search node called Algrim the Strong.
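To give a feel for what one of those REST requests might look like, here is a hedged Python sketch that builds the request registering a CouchDB river. It assumes the CouchDB river plugin is installed, and the settings layout (`type`, `couchdb`, `index`) reflects our understanding of that plugin rather than anything guaranteed, so check it against the plugin version you run. Using the database name as both index and type name is purely our choice for the example.

```python
import json
from urllib.request import Request

# Assumption: ElasticSearch and CouchDB both run locally on default ports.
ES_URL = "http://localhost:9200"


def build_couchdb_river_request(db_name, couch_host="localhost", couch_port=5984):
    """Build (but do not send) the PUT that registers a CouchDB river."""
    meta = {
        "type": "couchdb",
        # Where the river should listen for CouchDB changes.
        "couchdb": {"host": couch_host, "port": couch_port, "db": db_name},
        # Where the synced documents should be indexed (names are our choice).
        "index": {"index": db_name, "type": db_name},
    }
    return Request(
        f"{ES_URL}/_river/{db_name}/_meta",
        data=json.dumps(meta).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


river_req = build_couchdb_river_request("datasets")
```

Rivers are specific to the ElasticSearch releases that ship them, so this only applies where the `_river` endpoint is available.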
Which limitations in CouchDB make this solution attractive?
There is a lot to love about CouchDB, such as its very intuitive REST API and its blistering speed at storing and retrieving documents. What we quickly ran into issues with, however, was how indexing and searching work with Couch views. Quite simply, they are not nearly flexible enough, and the additional processing they require caused severe performance problems for our use cases.
On a recent project, our usage patterns involved large sets of data being imported regularly and running concurrently with previous versions of the same data. With very little control or pre-knowledge of the data itself, we found that the time it took to import the data was growing almost exponentially, due to the time it took to rebuild the necessary views.
While CouchDB might be ‘schema-less’, the views actually need to be built up front, according to how you want to query the data. The result is that in a dynamic system, where a user may define the structure of the content as well as how to display it, you end up having to index every field just in case it is needed later. This drastically increases the number of items in the index, and each additional facet multiplies it further.
CouchDB views also behave in a way that made the application potentially unstable, in that Couch will only start building the views once you actually query them. What this meant for our application is that after a user uploaded a CSV file, the entire system would grind to a halt until Couch managed to catch up with its view generation. The obvious answer was to import in batches and manually hit all views in between batches. While this worked, we still found that each new dataset would take longer than the previous one, which posed long-term usability issues.
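The batching workaround can be sketched roughly as follows. This is not our production code, just an illustration: it plans the sequence of CouchDB requests (a `_bulk_docs` upload per batch, followed by a cheap `limit=0` query against each view to force an incremental rebuild). The database, design document and view names are hypothetical.

```python
import json
from urllib.request import Request

# Assumption: a local CouchDB on its default port.
COUCH_URL = "http://localhost:5984"


def batched(rows, size):
    """Yield consecutive slices of `rows`, each at most `size` long."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_import_requests(db, rows, views, batch_size=500):
    """Plan the requests for a batched import that warms views as it goes.

    `views` is a list of (design_doc, view_name) pairs.
    """
    planned = []
    for batch in batched(rows, batch_size):
        # Upload one batch of documents in a single request.
        planned.append(Request(
            f"{COUCH_URL}/{db}/_bulk_docs",
            data=json.dumps({"docs": batch}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        ))
        # Touch every view so its index catches up before the next batch.
        for design_doc, view_name in views:
            planned.append(Request(
                f"{COUCH_URL}/{db}/_design/{design_doc}/_view/{view_name}?limit=0",
                method="GET",
            ))
    return planned


# Hypothetical data: five rows, one view, batches of two.
rows = [{"n": i} for i in range(5)]
plan = build_import_requests("datasets", rows, [("stats", "by_n")], batch_size=2)
```

Keeping the index updates small and frequent spreads the view-building cost over the import instead of paying it all on the first query afterwards.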
How does ElasticSearch help with this?
By adjusting our approach to be more in line with what is known as a search-oriented architecture, we were able to take a lot of the load off Couch and move it to ElasticSearch. On a practical level this meant setting up ElasticSearch river indexes to automatically sync with our Couch instances and replacing as many existing views as we could with ElasticSearch queries. In SQL terms, we replaced our SELECT statements with queries against the ES index and kept the INSERT/UPDATE statements against CouchDB.
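The read/write split can be pictured with a small Python sketch. The class and method names are invented for illustration, and it assumes the river indexes each CouchDB database into an ElasticSearch index of the same name. Writes are built as requests to CouchDB and reads as requests to ElasticSearch; nothing is actually sent.

```python
import json
from urllib.request import Request


class SearchOrientedStore:
    """Hypothetical helper: writes go to CouchDB, reads go to ElasticSearch."""

    def __init__(self, couch_url, es_url, name):
        self.couch_url = couch_url
        self.es_url = es_url
        self.name = name  # used as both database and index name (assumption)

    def save_request(self, doc_id, document):
        """The INSERT/UPDATE side: a PUT against CouchDB.

        Updating an existing document also needs its current `_rev`.
        """
        return Request(
            f"{self.couch_url}/{self.name}/{doc_id}",
            data=json.dumps(document).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )

    def find_request(self, query):
        """The SELECT side: a Query DSL search against ElasticSearch."""
        return Request(
            f"{self.es_url}/{self.name}/_search",
            data=json.dumps({"query": query}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )


store = SearchOrientedStore("http://localhost:5984", "http://localhost:9200", "datasets")
write_req = store.save_request("row-1", {"title": "hello"})
read_req = store.find_request({"term": {"title": "hello"}})
```

Because the river syncs asynchronously, a document written through `save_request` may take a moment to show up in search results.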
The extra flexibility of this approach allows for additional functionality in the application as well. Whereas previously facets would need to be pre-determined and have views generated for them, they could now be determined on the fly.
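As an illustration of choosing facets at query time, the sketch below builds a search body with one terms facet per requested field. The field names are hypothetical, and the `facets` section follows the facet API of the ElasticSearch versions we worked with, which other versions may express differently.

```python
def build_facet_search(fields, query=None, size=10):
    """Build a search body that asks for a terms facet on each given field."""
    return {
        # Default to matching everything when no query is supplied.
        "query": query if query is not None else {"match_all": {}},
        # Only the facet counts are wanted here, not the matching documents.
        "size": 0,
        "facets": {
            field: {"terms": {"field": field, "size": size}}
            for field in fields
        },
    }


# The fields can come straight from user input; no view has to exist first.
body = build_facet_search(["country", "status"])
```

The resulting dictionary would be serialised to JSON and sent to the index's `_search` endpoint.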
What do we wish we could change about ElasticSearch?
Documentation! As ElasticSearch is a very young project, there is a lack of documentation around it. The official guide is very good reference material, but what is missing is the necessary context and examples of how the various APIs and components are used. Existing documentation only becomes really useful once you have climbed the learning curve to some extent. We would also have benefited from documentation about the configuration process, which involves settings such as sharding and replicas that are not properly explained.
The default dynamic field mappings are pretty good, but the mapping system in general is not very well explained and may lead to data just being ignored. It also makes it a bit frustrating to work with highly dynamic data, as a number quoted in a string will run afoul of the ‘long’ restriction on a field and not allow you to do range queries on it.
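One possible workaround for the quoted-number problem, offered only as a sketch and not something we have described above, is to normalise numeric-looking strings into real numbers before the document is indexed, so that the mapping sees a consistent type. The function name is made up, and the approach has an obvious caveat: values such as zero-padded identifiers would lose their formatting.

```python
import re

# Plain decimal integers and simple decimals only; anything else stays a string.
_INT_RE = re.compile(r"-?[0-9]+")
_FLOAT_RE = re.compile(r"-?[0-9]+\.[0-9]+")


def coerce_numeric_strings(value):
    """Recursively turn strings like "42" or "3.5" into numbers."""
    if isinstance(value, dict):
        return {key: coerce_numeric_strings(item) for key, item in value.items()}
    if isinstance(value, list):
        return [coerce_numeric_strings(item) for item in value]
    if isinstance(value, str):
        text = value.strip()
        if _INT_RE.fullmatch(text):
            return int(text)
        if _FLOAT_RE.fullmatch(text):
            return float(text)
    # Non-strings and non-numeric strings pass through unchanged.
    return value


# Hypothetical row as it might arrive from a CSV import.
cleaned = coerce_numeric_strings(
    {"price": "42", "ratio": "3.5", "name": "abc", "tags": ["1", "x"], "n": 7}
)
```

Whether to coerce every field or only a known subset is a judgment call that depends on the data.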
The debugging tools for ElasticSearch are also not as mature as we would have liked, even though elasticsearch-head does help a lot. It is also not always clear from the server’s debugging output what the actual error is, and this leads to a fair amount of trial and error that could be lessened by better error messages and troubleshooting documentation.
It should be noted that these are all solvable problems that will improve over time.
We have found that ElasticSearch and CouchDB are a really powerful combination, due to the ease of setting up ElasticSearch ‘Rivers’, and the incredible flexibility of the underlying Lucene search index. We are very excited about this project and expect it to accomplish many wonderful things.