February 4, 2013

Key-Value Indexation on Cassandra

During my work at Inria Sophia-Antipolis, there was an exciting challenge to add a search logic on top of Cassandra database, which is a bit 'raw' and designed for efficiency.
Cassandra tables are structured in a key-value scheme like DHTs, at the difference that values are themselves ordered* maps of inner (key-value) tuples called columns, the table is called column-family. (*specified at their creation)
Cql3 Index creation is an alternative way, but would not be very adequate, with any possible index name.

A realistic way is for example tree structures such as B-trees, however in our use case, we chose to explore a very simple method: inverted indexes that acts simply like a flat directory lookup.
Let's illustrate it with an example:
{  
  'a5e9c146f': {
    'text': 'The quick brown fox jumps over lazy dog',
    'title': 'Fox'
    'author': 'mike123@gmail.com',
    'date': '2012-11-12'
  }
}

For indexing this small data chunk, we will store in a different table the so-called 'inverted indexes', which are rows pointing to a particular data id:
{
  'author=mike123%40gmail.com'                   : {'a5e9c146f': null}, 
  'text=quick'                                   : {'a5e9c146f': null},
  'text=brown'                                   : {'a5e9c146f': null},
  'date=2012-11-12'                              : {'a5e9c146f': null},
  //...
key=value pairs are generated from the original data, either with the exact value or post-processed in special cases (in this example, 'text' value is split in single words).
Column entries are simply formed with a unique data id and an empty value (we don't need it).
Row-keys can also be compositions of name-value pairs sorted alphabetically and concatenated:
  //...
  'author=mike123%40gmail.com&text=quick'        : {'a5e9c146f': null},
  'author=mike123%40gmail.com&text=brown'        : {'a5e9c146f': null},
  'text=quick&title=Fox'                         : {'a5e9c146f': null},
  //..
  'date=2012-11-12&text=fox&text=quick'          : {'a5e9c146f': null},
  //...
  //...
}

You can see how it grows up, we would end up with 2^k insertions. This exponential growth is not acceptable, we can limit the length of composites with min(3, k), for instance. This way, we don't exceed Sum(i=0..min(3,k) of BinomialCoefficient(i, k)) insertions, which scales better, in O(n³). Actually there are high chances that rows like text=the or text=dog already exist in database, so we are just appending this data id column to these rows.

Search queries

{
  'title': ['Fox'], 'text': ['brown', 'quick']
}
will be a query matching our example above, thanks to the index row 'text=brown&text=quick&title=Fox'.

For larger composites exceeding 3 elements, the first 3 elements will be queried directly against the database, the next 3 also, untill the last 3 (in order to reduce results count); before intersecting these different result sets.

We can also form 'OR' queries, that perform parallel queries and merge results in the same map:
[
  {'title':['Fox'], 'text':['brown','quick']},
  {'title':['Fox'], 'text':['test']},
  {'title':['Fox'], 'date':['2012-11-12']}
]

Range queries


With numbers, we rather want to search over a range. Cassandra has a built-in column-keys comparator (range queries are allowed within a row), that could serve for 1-Dimension ranges. But we preferred to use a DB-agnostic method, with k-ary trees (in particular decimal trees).
For instance, for indexing '2012-11-12' (1352678400 unix time in seconds), we add all starting subprefixes of this value to the rest of indexed pairs of this data:
{
  '_t=1'                                         : {'a5e9c146f': null},
  //..
  '_t=1352678400'                                : {'a5e9c146f': null},
  //..
  '_t=1&_t=13&text=quick'                        : {'a5e9c146f': null},
  '_t=13&_t=135&text=quick'                      : {'a5e9c146f': null},
  '_t=135&_t=1352&text=quick'                    : {'a5e9c146f': null},
  //...
  '_t=135267&text=quick'                         : {'a5e9c146f': null},
  //...
  '_t=131&_t=13526&_t=1352678400'                : {'a5e9c146f': null},
  //...

Searching a range, the whole 2012 year, [1325376000, 1356998400] is done by finding the 'best' covering indexes. The more accurate the covering is, the larger the OR query will be.
[132,133,134,135] has a resolution of only 3 (representing an interval of 1e7 seconds=115 days), [1325,1326,1327,1328,1329,133,134,1350,1351,1352,1353,1354,1355,1356] has a resolution of 4 (thus at worse 1e6 seconds=11days of error), etc..

For 2D, we can continue to apply this method for each coordinate:
{
  '_lat=8e21&_lat=8e21f&_lon=1a1242'             : {'a5e9c146f': null},
  //...
Searching in a bounding box is similar, by covering ranges of each dimension.
note: there exists alternatives like space-filling curves, to go down to 1D. (see this blog)

Conclusion

  • Searches are fast (mainly under 3 'AND' searched terms) 
Management is more complex:
  • Insertion, update and deletion will need to regenerate those indexes each time. 
  • We index (and search) differently K-V terms: either directly (exact value of V), or by range, or with a semantic parsing of V (e.g. stemming, de-conjugate verbs, eliminate plurals ... similarly to Lucene text-processing).
  • An optional object in the query maps a specific field in data to a defined handler in our backend, ex: range1|range2|...|range15|text|binary (rangeN generates indexes with the first N subprefixes)
ex:
 {type: 'create', data: {t: 958, name:'Bolt'}, handler: {t: 'range4'}} // generates indexes [0,09,095,0958]
 {type: 'search', data: {t: [900,1000]}, handler: {t: 'range4'}}  //finds [09] as the optimal list of indexes to search, "handler: {t: 'range4'}" is optional, it will enforce a maximal range value of length 4