In its most recent release, version 25.0, Apache Druid delivers a number of improvements to its high-performance real-time datastore. The main new features are: the multi-stage query (MSQ) task engine used for SQL-based ingestion is now production-ready; Kubernetes can be used to launch and manage tasks, eliminating the need for MiddleManagers; deployment has been simplified; and a new dedicated binary is available for Hadoop 3.x users.
Druid’s design incorporates concepts from data warehouses, time-series databases, and search systems in order to produce real-time analytics and reduce time to insight.
The architecture is microservices-based and cloud-ready, and it consists of several types of services: the Coordinator service, which maintains data availability across the cluster; the Overlord service, which assigns data-ingestion workloads; the Broker service, which handles queries from external clients; and the MiddleManager service, which ingests data into the cluster.
During ingestion, Druid reads data from the source system and stores it in data files called segments, each typically containing a few million rows. Segments are partitioned by time, and within each segment the data is organized in a columnar layout with every column stored separately, so a query scans only the columns it actually needs, which reduces query latency.
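A toy sketch can show why the columnar layout described above pays off. This is purely illustrative (plain Python lists, not Druid's actual segment format): each column is held as its own array, so an aggregation reads only the column it needs rather than whole rows.

```python
# Row-oriented layout: every row carries every field, so even a
# single-column aggregation would touch all of this data.
rows = [
    {"timestamp": "2023-01-01T00:00:00Z", "channel": "#en.wikipedia", "added": 57},
    {"timestamp": "2023-01-01T00:01:00Z", "channel": "#de.wikipedia", "added": 31},
    {"timestamp": "2023-01-01T00:02:00Z", "channel": "#en.wikipedia", "added": 12},
]

# Column-oriented layout: one array per column, analogous to how a
# Druid segment stores each column separately.
columns = {
    "timestamp": [r["timestamp"] for r in rows],
    "channel": [r["channel"] for r in rows],
    "added": [r["added"] for r in rows],
}

# SUM(added) scans a single column; "timestamp" and "channel" are never read.
total_added = sum(columns["added"])
print(total_added)  # 100
```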
Druid supports both streaming and batch ingestion. It typically connects to a source of raw data, such as a message bus like Apache Kafka (for streaming loads), a distributed file system such as HDFS, or cloud storage such as Amazon S3 or Azure Blob Storage (for batch loads), and converts the raw data into a read-optimized format (segments) through a process known as “indexing.” Druid can ingest denormalized data in JSON, CSV, Parquet, Avro, and other custom formats.
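As an illustration of what configuring such an ingestion looks like, the following is a minimal sketch of a Kafka supervisor spec of the kind submitted to the Overlord; the topic name `wikipedia-events`, the datasource `wikipedia`, and the dimension names are hypothetical placeholders.

```json
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "topic": "wikipedia-events",
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["channel", "page", "user"] },
      "granularitySpec": { "segmentGranularity": "hour", "queryGranularity": "none" }
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

The `segmentGranularity` setting controls the time partitioning of the resulting segment files described earlier.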
Druid SQL can be used to query Druid data sources. Druid translates SQL queries into its native query language.
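A short sketch of what issuing a Druid SQL query looks like from a client: the broker exposes an HTTP SQL endpoint (`POST /druid/v2/sql`) that accepts a JSON payload containing the query string. The broker URL and the `wikipedia` datasource below are illustrative assumptions.

```python
import json
from urllib import request

# Illustrative broker address; adjust for a real deployment.
BROKER_URL = "http://localhost:8082/druid/v2/sql"

def build_sql_request(sql: str) -> request.Request:
    """Wrap a Druid SQL string in the JSON payload the broker expects."""
    payload = json.dumps({"query": sql}).encode("utf-8")
    return request.Request(
        BROKER_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_sql_request(
    "SELECT channel, COUNT(*) AS edits "
    "FROM wikipedia GROUP BY channel ORDER BY edits DESC LIMIT 5"
)
# With a broker running, the query would be sent with:
# rows = json.loads(request.urlopen(req).read())
```

Druid translates the SQL into its native JSON-based query language before execution, so the same query could also be expressed as a native `groupBy` query.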
Druid ships with a web console through which you can load data, manage datasources and tasks, and view server status and segment information. You can also run SQL queries and native Druid queries from the console.
Apache Druid is frequently used when real-time ingest, fast query performance, and high uptime are essential.
As a result, Druid is commonly used as a backend for APIs that need fast aggregations, or to power analytical applications. Druid works best with event-oriented data.
Applications typically include clickstream analytics (web and mobile analytics), risk/fraud analysis, network telemetry analytics (network performance monitoring), application performance metrics, and business intelligence / OLAP.