Why Use AWS Redshift Spectrum with Data Lake
A data lake is a centralized repository that lets you store all your structured and unstructured data at any scale. On AWS, S3 serves as the data lake storage, holding data in any format, securely, and at massive scale. However, this creates a “Dark Data” problem: most of the generated data is never made available for analysis. To solve this Dark Data issue, AWS introduced Redshift Spectrum, an extra layer between the Redshift data warehouse clusters and the data lake in S3.
Redshift Spectrum has the following key features:
- A group of managed nodes in a private VPC
- Independent of any single Redshift cluster and available to every Redshift cluster that is Spectrum-enabled
- A single query can span open file formats in S3 and data in Redshift, without any ETL
- Fully managed and priced on a per-query basis, which aims to provide higher performance at lower cost
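To make the "single query without ETL" point concrete, here is a minimal sketch in Python that only assembles the SQL text for such a query. The schema name `spectrum`, the external table `sales`, the local table `event`, and all column names are hypothetical; actually running the statement would need a Spectrum-enabled cluster and a SQL client, which are not shown.

```python
def build_join_query(external_table: str, local_table: str) -> str:
    """Return SQL that joins an S3-backed external table with a local Redshift table.

    All table and column names are hypothetical placeholders.
    """
    return (
        "SELECT e.eventname, SUM(s.pricepaid) AS total_paid\n"
        f"FROM {external_table} AS s\n"          # data that lives in S3
        f"JOIN {local_table} AS e ON s.eventid = e.eventid\n"  # data stored in Redshift
        "GROUP BY e.eventname\n"
        "ORDER BY total_paid DESC;"
    )


if __name__ == "__main__":
    print(build_join_query("spectrum.sales", "public.event"))
```

The external table is referenced like any other table; the only visible difference is the schema prefix.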
Life of a Query
There are three key components to run a query with Redshift Spectrum:
- The external data catalog contains the schema definitions for the data to access in S3. It is a central metadata repository for the data assets. The options for the external data catalog are Amazon Athena (default), AWS Glue, or an Apache Hive metastore (either from your own Hadoop ecosystem or from Amazon EMR).
- The external schema is the Redshift schema that references a database in the external data catalog. It contains your external tables.
- External tables let you query data in S3 and join it with local Redshift tables in the same query. External tables are read-only.
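The three components above can be sketched as two DDL statements. The Python below only builds the SQL text; it is a sketch under stated assumptions, not a complete setup. The schema, database, table, and column names, the IAM role ARN, and the S3 path are all hypothetical, and Parquet is assumed as the file format.

```python
def external_schema_ddl(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """SQL that maps a Redshift external schema to a database in the data catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        "FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_ddl(schema, table, columns, partition_columns, s3_location) -> str:
    """SQL that defines an external table over files in S3 (Parquet assumed)."""
    cols = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    parts = ", ".join(f"{name} {sql_type}" for name, sql_type in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n{cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )


if __name__ == "__main__":
    # Every value below is a made-up placeholder.
    print(external_schema_ddl(
        "spectrum", "spectrum_db", "arn:aws:iam::123456789012:role/MySpectrumRole"))
    print(external_table_ddl(
        "spectrum", "sales",
        columns=[("eventid", "INTEGER"), ("pricepaid", "DECIMAL(8,2)")],
        partition_columns=[("saledate", "DATE")],
        s3_location="s3://my-example-bucket/sales/"))
```

Declaring a partition column matters later in the query flow, because partition metadata is what lets the compute nodes skip S3 prefixes that a query does not need.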
Here is the life of a query on Redshift Spectrum:
- The leader node optimizes and compiles the query, and determines what runs locally and what goes to Amazon Redshift Spectrum.
- The query plan is sent to all compute nodes.
- Compute nodes obtain partition information from the data catalog and dynamically prune partitions.
- Each compute node issues multiple requests to the Redshift Spectrum layer.
- Amazon Redshift Spectrum nodes scan your S3 data.
- Amazon Redshift Spectrum projects, filters, and aggregates the data.
- Final aggregations and joins with local Amazon Redshift tables are done in the cluster.
- The result is sent back to the client.
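The division of work in the steps above can be illustrated with a toy model in plain Python. This is not how Redshift is implemented; it only mimics the order of operations (prune partitions, scan and partially aggregate per partition, then do the final aggregation and join). The in-memory dictionaries standing in for S3 and for a local table, and all function names, are hypothetical.

```python
from collections import defaultdict

# Toy stand-in for S3: partition value -> rows stored under that prefix.
S3_PARTITIONS = {
    "2018-01": [{"eventid": 1, "pricepaid": 100.0}, {"eventid": 2, "pricepaid": 50.0}],
    "2018-02": [{"eventid": 1, "pricepaid": 70.0}, {"eventid": 3, "pricepaid": 20.0}],
    "2018-03": [{"eventid": 2, "pricepaid": 30.0}, {"eventid": 2, "pricepaid": 5.0}],
}

# Toy stand-in for a local Redshift table: eventid -> eventname.
LOCAL_EVENTS = {1: "Concert", 2: "Opera", 3: "Musical"}


def prune_partitions(partitions, wanted):
    """Compute-node step: keep only the partitions the query needs."""
    return [p for p in partitions if p in wanted]


def spectrum_scan(partition, min_price):
    """Spectrum-layer step: scan one partition, filter rows, partially aggregate."""
    partial = defaultdict(float)
    for row in S3_PARTITIONS[partition]:
        if row["pricepaid"] >= min_price:               # filter
            partial[row["eventid"]] += row["pricepaid"]  # project + aggregate
    return dict(partial)


def run_query(wanted_partitions, min_price):
    """Cluster step: merge partial results, then join with the local table."""
    pruned = prune_partitions(sorted(S3_PARTITIONS), wanted_partitions)
    totals = defaultdict(float)
    for partition in pruned:  # one request per surviving partition
        for eventid, amount in spectrum_scan(partition, min_price).items():
            totals[eventid] += amount                    # final aggregation
    return {LOCAL_EVENTS[eid]: amt for eid, amt in totals.items()}  # join


if __name__ == "__main__":
    print(run_query({"2018-01", "2018-02"}, min_price=50.0))
```

In this model the "2018-03" partition is never scanned, which is the point of pruning: less data read from S3 means less work for the Spectrum layer.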
Lab on Redshift Spectrum
The lab on Redshift Spectrum is covered in my online course.