Why Use AWS Redshift Spectrum with Data Lake

by L. Peng · September 9, 2018

A data lake is a centralized repository that lets you store all your structured and unstructured data at any scale. On AWS, the data lake is typically built on S3, which stores data in any format, securely, and at massive scale. However, this creates a “Dark Data” problem: most of the data that is generated never becomes available for analysis. To solve this Dark Data issue, AWS introduced Redshift Spectrum, an extra layer between Redshift data warehouse clusters and the data lake in S3.

Key Features

Redshift Spectrum has the following key features:

A group of managed nodes in a private VPC
Independent of any single Redshift cluster, and available to every Redshift cluster that is Spectrum-enabled
A single query can span open file formats in S3 and data in Redshift without any ETL
Fully managed and priced per query, based on the amount of S3 data scanned, so you pay only for the queries you run
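
To make the per-query pricing concrete, here is a rough sketch of a cost estimate. It assumes the charge is proportional to the bytes scanned in S3; the default rate of 5.0 USD per terabyte and the binary definition of a terabyte are illustrative assumptions, not figures from this post, so check the current AWS pricing page before relying on them.

```python
# Toy cost estimate for a scan-based, per-query pricing model.
# ASSUMPTION: 5.0 USD per TB scanned is an illustrative default only.
# ASSUMPTION: a terabyte is counted as 1024**4 bytes.
BYTES_PER_TB = 1024 ** 4


def estimate_query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Return the estimated cost in USD of a query scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


# Compare a full scan of a 2 TB data set with a query that reads one quarter of it.
full_scan = estimate_query_cost(2 * BYTES_PER_TB)
partial_scan = estimate_query_cost(BYTES_PER_TB // 2)
print(f"full scan: ${full_scan:.2f}, partial scan: ${partial_scan:.2f}")
```

Under this model, anything that reduces the data scanned, such as partitioning or a columnar file format, reduces the cost of the query in proportion.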

Life of Query

There are three key components to run a query with Redshift Spectrum:

The external data catalog contains the schema definitions for the data you access in S3. It is a central metadata repository for your data assets. The options for the external data catalog are AWS Athena (default), AWS Glue, or an Apache Hive metastore (either from your own Hadoop ecosystem or from AWS EMR).
The external schema references a database in the external data catalog and contains your external tables.
External tables let you query data in S3 and join it with local Redshift tables in the same query. The external tables are read-only.
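
As a rough illustration of how the three components fit together, the sketch below builds the SQL statements you might run against a Redshift cluster. The statement shapes follow Redshift's `CREATE EXTERNAL SCHEMA` and `CREATE EXTERNAL TABLE` syntax, but every name here (the `spectrum` schema, the `spectrumdb` catalog database, the IAM role ARN, the bucket, and the `sales` and `event` tables) is hypothetical. The helpers only format strings from trusted input; they do not connect to a cluster.

```python
def external_schema_ddl(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """Build a CREATE EXTERNAL SCHEMA statement backed by a data catalog database."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_ddl(schema, table, columns, s3_location, partition_cols=None) -> str:
    """Build a CREATE EXTERNAL TABLE statement for Parquet files in S3."""
    col_defs = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    lines = [f"CREATE EXTERNAL TABLE {schema}.{table} (", col_defs, ")"]
    if partition_cols:
        parts = ", ".join(f"{name} {sql_type}" for name, sql_type in partition_cols)
        lines.append(f"PARTITIONED BY ({parts})")
    lines.append("STORED AS PARQUET")
    lines.append(f"LOCATION '{s3_location}';")
    return "\n".join(lines)


# Hypothetical names throughout; replace them with your own.
schema_sql = external_schema_ddl(
    "spectrum", "spectrumdb", "arn:aws:iam::123456789012:role/ExampleSpectrumRole"
)
table_sql = external_table_ddl(
    "spectrum",
    "sales",
    [("salesid", "integer"), ("eventid", "integer"), ("pricepaid", "decimal(8,2)")],
    "s3://example-bucket/sales/",
    partition_cols=[("saledate", "date")],
)
# One query that joins the external table in S3 with a local Redshift table.
join_sql = (
    "SELECT e.eventname, SUM(s.pricepaid)\n"
    "FROM spectrum.sales s JOIN event e ON s.eventid = e.eventid\n"
    "GROUP BY e.eventname;"
)
print(schema_sql, table_sql, join_sql, sep="\n\n")
```

The external schema is created once and points at the catalog; each external table then describes the layout and S3 location of one data set, after which it can be queried like any other table.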

Here is the life of the query on Redshift Spectrum:

The query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum.
The query plan is sent to all compute nodes.
Compute nodes obtain partition information from the data catalog and dynamically prune partitions.
Each compute node issues multiple requests to the Redshift Spectrum layer.
Amazon Redshift Spectrum nodes scan your S3 data.
Amazon Redshift Spectrum projects, filters, and aggregates the scanned data.
Final aggregations and joins with local Amazon Redshift tables are done in the cluster.
The result is sent back to the client.
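
The steps above can be sketched as a toy, in-memory model. This is not how Redshift Spectrum is implemented; it only mirrors the division of labor: partitions are pruned first, each surviving partition is scanned, filtered, and partially aggregated, and the final aggregation and join with a local table happen last. The data, the partition keys, and all function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical S3 data: one list of rows per partition, keyed by sale date.
S3_PARTITIONS = {
    "2018-09-01": [{"eventid": 1, "pricepaid": 10.0}, {"eventid": 2, "pricepaid": 20.0}],
    "2018-09-02": [{"eventid": 1, "pricepaid": 5.0}],
    "2018-09-03": [{"eventid": 2, "pricepaid": 7.5}, {"eventid": 1, "pricepaid": 2.5}],
}
# Stands in for a local Redshift table that maps event ids to names.
LOCAL_EVENTS = {1: "Concert", 2: "Opera"}


def prune_partitions(partitions, start, end):
    """Compute-node step: keep only partition keys inside [start, end]."""
    return [key for key in sorted(partitions) if start <= key <= end]


def spectrum_scan(rows, min_price):
    """Spectrum-layer step: scan rows, filter them, and partially aggregate."""
    partial = defaultdict(float)
    for row in rows:
        if row["pricepaid"] >= min_price:
            partial[row["eventid"]] += row["pricepaid"]
    return dict(partial)


def run_query(start, end, min_price=0.0):
    """Cluster step: merge partial aggregates, then join with the local table."""
    totals = defaultdict(float)
    for key in prune_partitions(S3_PARTITIONS, start, end):
        for eventid, subtotal in spectrum_scan(S3_PARTITIONS[key], min_price).items():
            totals[eventid] += subtotal
    return {LOCAL_EVENTS[eventid]: total for eventid, total in totals.items()}


print(run_query("2018-09-02", "2018-09-03"))
```

Because pruning happens before any scan, the partition for 2018-09-01 is never read in this example, which is the same reason a well-partitioned data set reduces the amount of S3 data a real query touches.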

Lab on Redshift Spectrum

The lab on Redshift Spectrum is covered in my online course.
