Adding Search as a First Class Citizen to Hadoop

05/26/2014 - 17:30 to 18:10
Frannz Club
long talk (40 min)

Session abstract: 

So far, Search has largely been missing as a first-class citizen from the Hadoop ecosystem. We describe how Cloudera Search deeply integrates SolrCloud/Lucene with Hadoop. This enables rich, user-friendly, low-latency Search and Analytics over Big Data stored in HDFS and HBase, as well as Near Real Time Search and Analytics over streaming data such as logs, social media, and other structured and unstructured data, all in a manner that is flexible, scalable, reliable, cost-effective and easy to operate.

GFS, MapReduce and BigTable were originally built to store and index the web. Apache Hadoop, HDFS and HBase implement these concepts in open source. Proprietary Google Search sits on top of this infrastructure, and we wanted to build something similar in open source for Hadoop.

This talk starts with an overview of the system from the user's perspective, followed by a presentation of the architecture and implementation. You will learn how SolrCloud/Lucene is integrated with HDFS and how Near Real Time ingestion works from Flume and HBase into Solr. We will dive into the scalability aspects of batch MapReduce ingestion from HDFS and HBase into Solr, and describe the role of the corresponding embedded streaming ETL flows built with Morphlines.

The talk concludes with a summary of the system's benefits, current limitations and exciting future directions.