Sai Geetha M N

Hadoop for Analysts - Apache Druid, Apache Kylin and Interactive Query Tools

Updated: Mar 22, 2021

Introduction

Traditional data warehouses have existed in the industry for quite some time now and have served data storage and data analytics needs well. But with the exponential growth of data, and the potential intelligence lying within it, enterprises are scrambling to adopt Big Data technologies.

Towards this goal, most have started adopting Hadoop. The first step in this journey is, of course, to start getting your data onto Hadoop, after deciding whether the enterprise is ready to go to the cloud or will have an on-prem installation.


The data movement itself has many challenges, and there are several architectural patterns that can help; I will talk about those in another post. Here, however, I want to focus on one aspect of the Hadoop adoption journey.


The Analyst Ask

If all your data moves onto Hadoop, your analysts need to move onto Hadoop as well.


While Hadoop provides reliable, scalable and distributed computing power, it was initially designed for batch processing of large data sets, not for interactive use. But if you want your analysts to use Hadoop and unleash the power of insights from large data, you need to enable, at a bare minimum:

  • Interactive querying capability

  • Reporting and dashboard development through BI Tools


The Tools

The ecosystem around Hadoop is developing rapidly, adding tools for various data needs. No single tool provides both of these capabilities, so we have to go with a combination of tools.


The solution could be a combination of one interactive querying tool and one OLAP (Online Analytical Processing) tool on Hadoop. Let us look at what’s available in these two areas.


Part 1: Interactive Querying Tools on Hadoop

Over time, a few tools have become available that allow you to query large data sets with response times in the range of a few seconds.


In the erstwhile Cloudera stack (typically called CDH), Impala is a very useful tool for interactive querying. In competition, the HDP stack from the erstwhile Hortonworks came up with Hive LLAP (Live Long and Process, a.k.a. Low Latency Analytical Processing). Even though the two companies have since merged and the CDP platform has taken shape, the earlier tool sets continue to be supported for customers who are still on them.


So, in short, Impala or Hive LLAP are good tools to go with for the on-prem versions of Hadoop.

Spark SQL can also be used, although it requires somewhat in-depth knowledge of Spark in order to write queries efficiently and use its power.


Presto is another interactive SQL query engine; through its connectors it can query data in Hive, HBase and various relational databases. It is another contender in this domain.
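All of these engines present a broadly ANSI-style SQL interface, so analysts can largely carry over their existing SQL skills. As a rough sketch of the kind of interactive aggregate an analyst would run, here is a typical query executed against Python's built-in sqlite3, used purely as a stand-in for the actual engine (the table and column names are made up):

```python
import sqlite3

# In-memory database standing in for a large table on Hadoop;
# on Impala / Hive LLAP / Presto the same SQL would be sent to the engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "widget", 120.0), ("north", "gadget", 80.0),
     ("south", "widget", 200.0), ("south", "gadget", 50.0)],
)

# A typical interactive aggregate: revenue per region, highest first
rows = conn.execute(
    "SELECT region, SUM(amount) AS revenue "
    "FROM sales GROUP BY region ORDER BY revenue DESC"
).fetchall()
print(rows)  # [('south', 250.0), ('north', 200.0)]
```

The point of the interactive engines is that queries of exactly this shape come back in seconds even over billions of rows.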


Based on an HDP stack that I have tried this on, Hive LLAP proved to be a decently fast query engine, and most queries gave response times comparable to their equivalents on a Teradata database. A few cases did not do well, such as fetching the 'Top N' records out of 300 million, because all the data is loaded into memory to achieve this. So one may have to figure out these edge cases and find workarounds for them.
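The 'Top N' pitfall is essentially a memory problem: a naive global sort pulls everything into memory. As a minimal sketch of the usual workaround, here is a bounded-memory Top-N in Python using a fixed-size heap, which is conceptually what a query engine does when it pushes the limit down into the scan (the data set and N here are illustrative):

```python
import heapq
import random

def top_n(records, n, key):
    """Keep only the n largest records seen so far in a size-n min-heap,
    so memory stays O(n) no matter how large the input stream is."""
    heap = []  # min-heap of (key, tiebreak, record)
    for i, rec in enumerate(records):
        item = (key(rec), i, rec)  # i breaks ties so records never compare
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            heapq.heapreplace(heap, item)
    # Return records sorted from largest to smallest key
    return [rec for _, _, rec in sorted(heap, reverse=True)]

# Simulate a large stream of (user_id, spend) rows
random.seed(42)
stream = ((f"user{i}", random.uniform(0, 1000)) for i in range(100_000))
best = top_n(stream, 3, key=lambda r: r[1])
print([u for u, _ in best])
```

Memory stays proportional to N rather than to the full table, which is why a pushed-down limit scales where a global sort does not.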


Commercially, IBM’s Big SQL is available in the same space.


Part 2: OLAP Tools on Hadoop


Trying to build reports directly on data stored on Hadoop as files, be it Parquet, ORC or Avro, is not a good idea. Even building reports on Hive is not recommended, as you will not get the interactivity you expect from your reports and dashboards in BI tools like Tableau, MicroStrategy or Power BI.


The typical asks here are:

  • Drill down and roll up data in reports

  • Slice and dice data

  • Support for star schemas at a bare minimum

  • Joins between large fact tables

  • Connectivity from various BI tools

  • Support for near-real-time data (a good-to-have, for real-time analytics)
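To make the first two asks concrete, here is a toy sketch of roll-up and slicing over a tiny fact table in plain Python; OLAP engines like Druid and Kylin effectively pre-compute and index aggregates of exactly this shape (the dimensions and measure here are made up):

```python
from collections import defaultdict

# Toy fact table: (year, quarter, region) dimensions, sales measure
facts = [
    (2020, "Q1", "north", 100), (2020, "Q1", "south", 150),
    (2020, "Q2", "north", 120), (2021, "Q1", "north", 130),
]

def roll_up(facts, dims):
    """Aggregate the sales measure up to the given dimension indices."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dims)
        totals[key] += row[3]  # last column is the measure
    return dict(totals)

# Roll up from (year, quarter, region) detail to year level
print(roll_up(facts, dims=(0,)))   # {(2020,): 370, (2021,): 130}

# Slice: fix region = "north", then roll up by year
north = [r for r in facts if r[2] == "north"]
print(roll_up(north, dims=(0,)))   # {(2020,): 220, (2021,): 130}
```

Drill-down is just the reverse direction: re-aggregating with more dimension indices, e.g. `dims=(0, 1)` for year and quarter.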

There are two open-source tools competing in this space:

  1. Apache Druid

  2. Apache Kylin

Both are purpose-built for OLAP, and each has its own strengths and areas to watch out for.
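To give a flavour of Druid's ingestion mechanism, here is a minimal sketch of a native batch ingestion spec (structured per the Druid 0.20 native batch format; the datasource, paths and column names are hypothetical):

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/data/sales", "filter": "*.json" },
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "sales",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["region", "product"] },
      "metricsSpec": [ { "type": "longSum", "name": "total_sales", "fieldName": "sales" } ],
      "granularitySpec": { "segmentGranularity": "day", "queryGranularity": "hour", "rollup": true }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```

Kylin, by contrast, takes its definition as a cube model over Hive tables, built through its web UI or REST API rather than a spec file like this.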

I have evaluated both these tools (Apache Druid 0.20.1 and Apache Kylin 3.1) through a set of POCs, on the following parameters:

  1. OLAP features and Ingestion Mechanism

  2. Architecture and Enterprise Deployment

  3. Performance

  4. User Experience

  5. Community Support

Here is a summary of my findings:


OLAP features and Ingestion Mechanism


Architecture and Enterprise Deployment


Performance Comparison


User Experience Comparison


Community Support and Documentation

Based on which features you are more keen on, you could choose between these two.


PS: There could be a bit of subjectivity based on the specific use cases I was interested in, and hence these results could vary if you look at a different set of use cases for your evaluation.


If you are open to commercial tools, here are a few others you can look at:

  • AtScale – claims to accelerate and simplify business intelligence, resulting in faster time-to-insight, better business decisions, and more ROI on your cloud analytics investment.

  • Kyvos Insights – claims to be a Smart OLAP technology that can deep dive into trillions of data points effortlessly to uncover insights that were simply impossible before

  • Dremio – it says that it is a next-generation data lake engine that liberates your data with live, interactive queries directly on cloud data lake storage

  • Jethro – it claims to provide a SQL engine on Hadoop for BI that automatically indexes data as soon as it is written to Hadoop, and can deliver very fast responses compared to Impala or Hive.


Conclusion:


If you can enable analysts with a combination of the above tools, providing both interactive query and OLAP capabilities, you are ready to move them off traditional data warehouses and onto the Hadoop ecosystem, with the required re-skilling of course, which may not be too much.

3 Comments


Mamta Sharma
Feb 08, 2022

What is Apache Druid Architecture? https://www.decipherzone.com/blog-detail/apache-druid-architecture

Apache Druid is among the most popular open-source real-time solutions for Online Analytical Processing (OLAP). The need for Druid arose because pre-existing open-source databases, such as relational database management systems (RDBMS) and NoSQL stores, were incapable of offering a low-latency data ingestion and query platform for interactive applications.


Roopa Kushtagi
Mar 30, 2021

Great insights Sai. Thanks. Any inputs around where Apache Presto stands w.r.t. Druid and Kylin? Also, which Big data Data Model you are using in your team?

Sai Geetha M N
Mar 30, 2021

Roopa, Presto would be more comparable on the lines of Hive LLAP and Impala as a very quick query engine, more than an OLAP tool. It was the leading product outside of Cloudera (Impala) till Hive LLAP came into the picture.
