There are many mechanisms for storing and processing collections of data so large and complex that we collectively refer to them as Big Data. From NoSQL data stores to distributed file systems, computation engines, columnar stores, and flat files, it is all about capture, storage, analysis, and search. We want it all, and we want it fast, yet traditional data processing applications can no longer keep up with our demands. And while technologies such as Hadoop and its ecosystem derivatives paved an initial path to solving Big Data problems, the approaches and assumptions they are built on are starting to show limitations that can only be overcome by radically changing the way we think about storing and accessing data in general. In the end, it is all about I/O and how to make it more efficient.
The following is a small subset of the questions that will set the scope and drive this presentation:
- How do we capture high data volumes (1+ million events per second)?
- How should we store and organize the data? Unstructured does not mean unorganized.
- Compress, encode, or pack? What are the differences, pros, and cons?
- Data-type patterns: what are they, how can we spot them during data capture, and what are the benefits?
- Analytical data is available (for free) during capture but easily lost. What is lost, why, what are the implications, and how do we deal with them?
- Is disk speed the limit on how fast data can be captured and accessed?
- What role do CPU and RAM play in I/O-intensive environments?
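To give a flavor of the encode-versus-compress distinction, here is a minimal sketch (plain Python, with a hypothetical event stream) of spotting a data-type pattern at capture time: dictionary-encoding repetitive event names and delta-encoding monotonically increasing timestamps before handing the bytes to a general-purpose compressor. The data, names, and layout are illustrative only, not the presentation's actual implementation.

```python
import zlib

# Hypothetical event stream: highly repetitive categorical values and
# monotonically increasing timestamps -- a common data-type pattern.
events = [("login", 1_700_000_000 + i) for i in range(10_000)]

# Naive storage: one text line per event, then general-purpose compression.
raw = "\n".join(f"{name},{ts}" for name, ts in events).encode()
raw_compressed = zlib.compress(raw)

# Encoding first: dictionary-encode names, delta-encode timestamps.
dictionary = {name: i for i, name in enumerate(sorted({n for n, _ in events}))}
deltas = [events[0][1]] + [b[1] - a[1] for a, b in zip(events, events[1:])]
encoded = bytes(dictionary[name] for name, _ in events) + \
          b"".join(d.to_bytes(4, "big") for d in deltas)
enc_compressed = zlib.compress(encoded)

# The encoded form is both smaller and far more compressible.
print(len(raw), len(raw_compressed), len(encoded), len(enc_compressed))
```

The point is not zlib itself: recognizing the pattern turns opaque text into narrow, low-entropy columns, which is where ratios like 10:1 and better become realistic.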
In the end, using live demos and code, we'll show how simple yet well-known and very powerful techniques can help you optimize:
- CAPTURE of data in high-volume environments (1+ million events per second).
- STORAGE of captured data, making it much smaller (10:1 to 20:1) and thus more efficient for general read/write.
- ACCESS of stored data, building on the optimization techniques applied during capture and storage to further increase I/O read speeds (e.g., searching 1B records in just a few seconds on a single laptop).
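As a hint at how capture-time organization speeds up later access, here is a small illustrative sketch (plain Python, hypothetical key column): if records are kept sorted and fixed-width at capture time, a lookup becomes a binary search over a packed column, with no parsing at query time. This is a toy stand-in for the ideas behind the billion-record demo, not the demo code itself.

```python
import bisect
from array import array

# Hypothetical packed key column: sorted, fixed-width 64-bit integers,
# as produced by an (assumed) capture pipeline that keeps data ordered.
keys = array("q", range(0, 2_000_000, 2))  # even keys only, sorted

def contains(sorted_keys, key):
    """O(log n) membership test over the packed, sorted key column."""
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

print(contains(keys, 123_456))  # present (even key)
print(contains(keys, 123_457))  # absent (odd key)
```

A binary search touches only ~30 of a billion keys per lookup, which is why ordering and packing decisions made during capture pay off so heavily at read time.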