NoSQL Query

IndexedDB is a key-document database and does not have a SQL processor. Query in IndexedDB is scanning keys over a range. For more complex query, application is responsible to join or place in buffer to get necessary result. SQL processor performs the same thing in the lower level during its query execution.

In javascript use case, such low level key scanning gives perceive fast response since we are getting immediate result, record-by-record. Whereas in WebSQL, all processing are performed in the database and return to javascript only after all finished. Consequently, memory and CPU times may have been wasted if the result are no longer require by the UI process.

Test data

Before we start, let us populate the database with some data base on previous schema. Schema definition and random data generation are defined in data-seeding.js file.

db = new ydn.db.Storage('nosql-query', author_article_topic_schema);

db.count('topic').done(function(cnt_topic) {
  var n_topics = 10;
  var n_authors = 100;
  var article_per_author = 10;
  if (cnt_topic < n_topics) {
    var topics = topics = genTopics(n_topics);
    db.put('topic', topics).done(function(t_keys) {
      console.log(t_keys.length + ' topics added.');
      var authors = genAuthors(n_authors);
      var emails = authors.map(function(x) {return x.email;});   
      var articles = genArticles(article_per_author*n_authors, t_keys);
      var article_keys = articles.map(function(x) {return pickOne(emails) + '/' + x.publish;});
      db.put('author', authors).done(function(x) {console.log(x.length + ' authors added.');});
      db.put('article', articles, article_keys).done(function(x) {console.log(x.length + ' articles added.');});
    });
  } else {
    console.log('db ready.');
  }
});

Above code snippet will generate three stores and randomly populate them with 10 topics, 100 authors and 1000 articles. An example of author record and article records is as follow:

author_1 = {
  email: 'me@aaronsw.com',
  born: 531763200000,
  first: 'Aaron',
  last: 'Swartz',
  company: 'Reddit',
  hobby: ['programming', 'blogging', 'politics']
};
article_1 = {
  title:"Database process integrated library system into a class",
  content:"...",
  license:"NC",
  publisher:"Springer",
  publish:640022400000,
  topics:["object-relational mapper","website wireframe","distributed computing","dynamic systems development method","open source","integrated library system"]} 

Article store use out-off-line primary key as composite key of parent key, author email, and publish date. Ideally composite key will use array key, but here we just concat them so that it is compatible with IE11, which does not support array key.

Filtering

Filtering or retrieving records in a key range of indexed files is as efficient as primary key query. The query field must be indexed. In this library, it is performed as follow:

var key_range = ydn.db.KeyRange.only('SA');
req = db.get(new ydn.db.IndexValueIterator('article', 'license', key_range));
req.then(function(record) {
    console.log(record);
  }, function(e) {
    throw e;
  }
);

However instead of giving ydn.db.KeyRange, to get method, an iterator is used. This is to avoid confusion and high light the fact that we are iterating. Unlike primary key, secondary key can reference to multiple records. If multiple results are expected, list method can be use as follow:

var limit = 10;
req = db.values(new ydn.db.IndexValueIterator('article', 'license', key_range), limit);
req.done(function(records) {
    console.log(records);
  }
);

An iterator is essentially a combination of store name, index name and key range to specify a continuous stream of records. There are two type of iterators: key iterator and value iterator. Value iterator reference to record value whereas key iterator to index key and hence avoid serialization on retrieval. For example, we can retrieve all name starting with 'M' as follow without serialization cost of the records:

var key_range = ydn.db.KeyRange.starts('M');
req = db.values(new ydn.db.IndexIterator('article', 'title', key_range), 10);
req.done(function(titles) {
    console.log(titles);
  }
);

If the field is not indexed, all records in the store must be iterated. This is achieved by using open method.

var count = 0;
req = db.open(new ydn.db.ValueIterator('author'), function(cursor) {
  var record = cursor.value();
  console.log(record.first + ' ' + record.last);
  count++;
});
req.done(function() {
  console.log(count + ' authors found.');
}, 'readonly');

Opening can be invoked by readonly or readwrite mode.

In general, query involves filtering multiple fields, paging and the results are to be sorted. Such complex queries are solved by using index, compound index and algorithmic joining.

Compound index

Suppose if we want to filter 'license' and 'publisher' fields of the 'article' store, we can index both fields into a composite index by specifying keyPath into array as follow:

index_schema = {
  keyPath: ['license', 'publisher']
};

The index value is array of two elements having first element as 'publisher' field value and second element as 'license' field value of the record. The key comparison algorithm ensure that  records are sorted by license followed by publisher by iterating this index value. If we filter 'license' to 'SA', we get the result sorted by publisher as follow:

key_range = ydn.db.KeyRange.starts(['SA']);
db.values(new ydn.db.IndexValueIterator('article', 'license, publisher', key_range), 10).done(function(articles) {
    console.log(articles); 
  }
);

It is notice that to get 'publisher' field sorted, we keep 'license' field constant. However we can filter 'publisher' by range.

key_range = ydn.db.KeyRange.bound(['SA', 'A'], ['SA', 'B']);
db.values(new ydn.db.IndexValueIterator('article', 'license, publisher', key_range)).done(function(articles) {
    console.log(articles); 
  }
);

Only the last field can be filtered by a range in index query. Keep in mind this limitation.

With above index, it is not possible to filter 'publisher' and sorted by 'license'. That require another index of ['publisher', 'license']. This limit compound index to use in ad hoc query.

Note that, when iterator is used, list method do not have offset. It is for performance and robustness reasons. Skipping a number of records is not fast for large offset value. At the same time, we cannot sure that, the number of records are not change during the time. Iterator save current position and hence it can resume from the last position. For paging next view, invoke the list again with the existing iterator. It will page to the next view. Position can be be persist to storage and resume on later time. To reverse the direction, use reverse method of the iterator.

Query using compound index take only logarithmic time on object store records and hence it is very fast. But it requires more CPU time on write and extra storage for indexes. Query have to be plan before for indexing. Query are limited to one range filter. Multiple sort order is supported.

Algorithmic join

Filtering is, in fact, joining on primary key. The join process will scan index keys of desire filter and the matching primary keys are result. Numerous join algorithms can be found in database books and review articles.

scan database operation method is used for key scanning process. The method accepts array of iterators (usually key iterator for performance) and a callback for each iteration. The callback is invoked with two argument of arrays having primary keys and secondary keys (record values if value iterator is used). The callback return an advancement array. Each advancement element refer to the respective iterator, true to continue next iteration, false to restart the iteration. If any value is given, it is taken as primary key and advance to it within current index key range. If a primary key larger than given key is found, it was returned on the first occurrence.

In general, the callback can return an object having fields of advance, next_index_keys and next_primary_keys. All three fields are arrays, with each element represent to respective iterator. If element of next_index_keys is a valid key, the cursor advance to it. If element of next_primary_keys is given, the cursor advance to it within given index key or current index key. If element of advance array is given, it must be boolean. true refer to next position and false to rewind the iterator. If all three are given, it starts with rewind, positioning and advancement.

Available algorithms can be found in ydn.db.algo module.

Nested-loop join

A naive join algorithm iterates each iterator forming nested loops. And hence it is called nested-loop join algorithm.

The following snippet filter 'license' field to 'SA' and 'publisher' to 'Nature'.

match_keys = [];
var nested_loop = function(keys, values) {
  // here: keys = index keys, values = primary keys
  // index 0 refer to license field and index 1 to publisher
  if (keys[0] != null) {
    if (values[1] != null && ydn.db.cmp(values[0], values[1]) == 0) {
      match_keys.push(values[0]); // we got the matching primary key.
    } 
    if (keys[1] != null) {
      return [null, true]; // iterate on publisher
    } else {
      return [true, false]; // iterate on license and restart on publisher
    }
  } else {
    return []; // no more iteration. we are done.
  }
};

license_filter = new ydn.db.IndexIterator('article', 'license', ydn.db.KeyRange.only('SA'));
publisher_filter = new ydn.db.IndexIterator('article', 'publisher', ydn.db.KeyRange.only('Nature'));
var req = db.scan([license_filter, publisher_filter], nested_loop);
req.then(function() {
  db.values('article', match_keys).done(function(results) {
    console.log(results);
  });
}, function(e) {
  throw e;
});

The inner loop, publisher, iterate the number of licenses times. Whereas the outer loop, license iterate only once. The result is the same if we flip the loop order since the joins are commutative. However run time is not the same. This is the subject of query optimization.

The inner loop result can be cached into the memory. This lead to hash merge algorithm.

Sorted-merge join

In the above example, we noticed that the results of both iterators are sorted by ascending order of primary key since we held constant on index key. If the results of all filters are sorted, we can advance the iterator skipping keys not in the filter. This make sorted join algorithm independent of number of records in the object store.

match_keys = [];
var sorted_join = function(keys, values) {
  if (keys[0] != null && keys[1] != null) {
    var cmp = ydn.db.cmp(values[0], values[1]);
    if (cmp === 1) {
      return {continuePrimary: [null, values[0]]}; // move publisher to match license
    } else if (cmp === -1) {
      return {continuePrimary: [values[1], null]}; // move license to match publisher
    } else {
      match_keys.push(values[0]); // we got the matching primary key.
      return [true, true]; // move forward both
    } 
  } else {
    return []; // no more iteration. we are done.
  }
};

license_filter.reset();
publisher_filter.reset();
var req = db.scan([license_filter, publisher_filter], sorted_join);
req.then(function() {
  db.values('article', match_keys).done(function(results) {
    console.log(results);
  });
}, function(e) {
  throw e;
});

Notice that sorted merge out perform nested loop in order magnitude. Using naive nested loop join in a practical web app is catastrophic.

We can limit the result by existing the loop any by returning empty array to the iteration callback. Iterators are resume if use again. That is why we need to reset the iterators since they haven been used in nested loop. This how paging is handle in scanning.

The result of sorted merge join is sorted by primary key. To sort by a specific column field other than primary key, a zigzag merge join algorithm is used.

Zigzag merge join

In sorted merge join, effective key is held constant while joining sorted list of values. In Zigzag merge join, the effective itself comprise both parts. The first part, known as pre-fix is held constant consisting what we want to be filsterd. The second parts, known as post-fix is sorted value. The composite index is used to construct post-fix and prefix as required by the query. 

To query SELECT * FROM article WHERE license = 'SA' AND publisher = 'Nature' ORDER BY 'title', we construct two iterator using composite index of ['license', 'title'] and ['publisher', 'title']. 'title' is the postfix we are joining. The 'license' prefix is held constant to 'SA' and 'publisher' prefix is held constant to 'Nature' as follow.

var license_sa_iter = ydn.db.IndexIterator.where('article', 'license, title', '^', ['SA']);
var publisher_nature_iter = ydn.db.IndexIterator.where('article', 'publisher, title', '^', ['Nature']);
var match_keys = [];
var solver = new ydn.db.algo.ZigzagMerge(match_keys);
var req = db.scan([license_sa_iter, publisher_nature_iter], solver);
req.then(function() {
  db.values('article', match_keys).done(function(results) {
    console.log(results);
  });
}, function(e) {
  throw e;
});

Performance analysis of query

To analyze the performance of various query strategy, a demo with performance data recording is created.

NoSQL demonstration application

References

  1. Héctor García-Molina, Jeffrey D. Ullman, Jennifer Widom Database System Implementation Prentice Hall, 2000.