Deep Learning for High Performance Time-series Databases

05/27/2014 - 12:20 to 13:00
long talk (40 min)

Session abstract: 

Recent developments in deep learning make it possible to improve time-series databases. I will show how these methods work and how to implement them using Apache Mahout.

Systems such as the Open Time Series Database (OpenTSDB) make good use of the ability of HBase and related databases to store columns sparsely.  This allows a single row to store many time samples and allows raw scans to retrieve a large number of samples very quickly for visualization or analysis.  Typically, older data points are batched together and compressed to save space.  At high insertion rates, this approach falters, largely because of the limited insert/update rate of HBase.  In such situations, it is often better to compress short segments of data and insert batches that span short time ranges rather than inserting individual data points.
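
The batching idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 60-second batch width, the (metric, timestamp, value) sample layout, the function names, and the use of zlib as a placeholder codec are all hypothetical, and no HBase or OpenTSDB API is involved. Each batch becomes one opaque blob, so one insert replaces many single-point writes.

```python
import struct
import zlib
from collections import defaultdict

BATCH_SECONDS = 60  # hypothetical batch width; one blob per metric per minute


def encode_batches(samples):
    """Group (metric, timestamp, value) samples into short time ranges.

    Returns {(metric, batch_start): compressed_blob}. Each blob would be
    written with a single insert instead of one insert per data point.
    """
    groups = defaultdict(list)
    for metric, ts, value in samples:
        base = ts - ts % BATCH_SECONDS
        groups[(metric, base)].append((ts - base, value))

    blobs = {}
    for key, points in groups.items():
        points.sort()
        # Pack each point as a 2-byte offset within the batch plus a double.
        raw = b"".join(struct.pack("<Hd", off, val) for off, val in points)
        blobs[key] = zlib.compress(raw)  # placeholder general-purpose codec
    return blobs


def decode_batch(key, blob):
    """Recover the (timestamp, value) pairs stored in one batch."""
    _metric, base = key
    raw = zlib.decompress(blob)
    return [(base + off, val) for off, val in struct.iter_unpack("<Hd", raw)]


if __name__ == "__main__":
    samples = [("cpu", 120 + i, float(i)) for i in range(90)]
    blobs = encode_batches(samples)
    print(sorted(blobs))  # two batches for 90 one-second samples
```

A general-purpose codec is used here only to keep the sketch self-contained; it is the baseline the rest of the abstract argues can be improved upon.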

When inserting compressed batches in this fashion, there are a number of obvious strategies that can be used.  General compression utilities such as gzip do not normally provide particularly high compression ratios.  Hand-crafted compression schemes may provide point solutions with high compression ratios, but they are generally fairly time-intensive to develop.  I will describe how deep learning and sparse-coding techniques can be used to build systems that achieve very high compression ratios (50x or more is typical) and that have the very interesting property that the resulting compressed data can often be queried or analyzed directly, without ever decompressing it.  Moreover, it is possible to selectively decompress signals from only the desired time ranges within a compressed batch.
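
To make the sparse-coding idea concrete, here is a toy sketch in plain Python. It is not the method presented in the talk and uses no Mahout API: the dictionary is a fixed set of cosine atoms rather than one learned from data, the encoder is simple greedy matching pursuit, and all function names are hypothetical. What it illustrates is the shape of the approach: a batch is stored as a handful of (atom, coefficient) pairs, an aggregate such as the mean can be computed from those pairs alone, and any sub-range can be reconstructed without decoding the rest.

```python
import math


def make_dictionary(n):
    """Hypothetical fixed dictionary: unit-norm cosine (DCT-II) atoms."""
    atoms = []
    for k in range(n):
        atom = [math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in atom))
        atoms.append([a / norm for a in atom])
    return atoms


def sparse_encode(signal, atoms, max_atoms):
    """Greedy matching pursuit; returns {atom_index: coefficient}."""
    residual = list(signal)
    code = {}
    for _ in range(max_atoms):
        best_k, best_c = 0, 0.0
        for k, atom in enumerate(atoms):
            c = sum(r * a for r, a in zip(residual, atom))
            if abs(c) > abs(best_c):
                best_k, best_c = k, c
        if abs(best_c) < 1e-12:
            break  # nothing left worth encoding
        code[best_k] = code.get(best_k, 0.0) + best_c
        residual = [r - best_c * a for r, a in zip(residual, atoms[best_k])]
    return code


def decode_range(code, atoms, start, stop):
    """Reconstruct only samples start..stop-1 from the sparse code."""
    return [sum(c * atoms[k][i] for k, c in code.items())
            for i in range(start, stop)]


def mean_from_code(code, atoms):
    """Mean of the reconstructed batch, computed without reconstructing it.

    In a real system the per-atom sums would be precomputed once.
    """
    n = len(atoms[0])
    return sum(c * sum(atoms[k]) for k, c in code.items()) / n


if __name__ == "__main__":
    n = 64
    atoms = make_dictionary(n)
    signal = [3.0 + 2.0 * math.cos(math.pi * (i + 0.5) * 4 / n) for i in range(n)]
    code = sparse_encode(signal, atoms, max_atoms=2)
    print(len(code), "coefficients stored for", n, "samples")
    print(mean_from_code(code, atoms))
```

The test signal is deliberately built from two dictionary atoms, so two coefficients reproduce it almost exactly; real signals and learned dictionaries will behave differently, and the compression figures quoted in the abstract should not be read off this toy.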

These new techniques for building time-series databases enable some exciting capabilities. The benefits include the ability to do query push-down into the time-series database from systems like Apache Drill, better visualization systems, and the ability to build an interesting form of anomaly detector on top of the time-series database.

I will describe how to build these systems using Apache Mahout and illustrate the results with several real examples.