Data Storage for Big Data: Aurora, Redshift or Hadoop?
I mentioned in my post “Which is Right Hadoop Solution for You?” that if your Big Data is below 1.6PB, you may want to take a look at the Redshift data warehouse option. When choosing a fault-tolerant, self-healing storage system for big data, Aurora, Redshift, and Hadoop are all worth considering. Let’s first go through Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) before choosing among these three products.
OLTP vs. OLAP
If you’re familiar with OLTP and OLAP, you can skip this section. To explain OLTP and OLAP simply, I divide data into two categories: transactional data and behavioral data.
Transactional Data vs. Behavioral Data
Transactional data describes events or actions. It always has a time dimension and a numerical value, and it refers to one or more objects. For example, a direct deposit transaction has the transaction date/time, the total amount of money, your bank/account information, the originator information, and so on.
Behavioral data refers to the information produced as a result of the events or actions. Raw transactional data is captured over a period of time, and you can then use the resulting massive volumes of behavioral data to predict future actions. For example, a bank decides on your personal loan based on the behavioral data it has collected, such as your current income/expenses, your employment history, your repayment history, and your equated monthly installment (EMI).
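As a rough illustration of the two categories, the sketch below models a direct deposit as a transactional record and then derives a small behavioral summary from several such records. The class name `DepositTransaction`, the function `income_profile`, and all sample values are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class DepositTransaction:
    """One transactional record: a time, a numerical value, and the objects involved."""
    timestamp: datetime
    amount: float
    account: str
    originator: str


# Hypothetical raw transactions captured over a period of time.
transactions = [
    DepositTransaction(datetime(2019, 1, 15), 3000.0, "acct-1", "Employer Inc"),
    DepositTransaction(datetime(2019, 2, 15), 3000.0, "acct-1", "Employer Inc"),
    DepositTransaction(datetime(2019, 3, 15), 3300.0, "acct-1", "Employer Inc"),
]


def income_profile(txns):
    """Summarize raw transactions into a behavioral view of the account holder."""
    amounts = [t.amount for t in txns]
    return {
        "deposit_count": len(amounts),
        "average_deposit": mean(amounts),
        "originators": sorted({t.originator for t in txns}),
    }


profile = income_profile(transactions)
print(profile)
```

A summary like this is the kind of input a lender might feed into a loan decision, while the individual records stay in the transactional system.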
Online Transaction Processing (OLTP) vs. Online Analytical Processing (OLAP)
The management of transactional data using computer systems is referred to as Online Transaction Processing (OLTP). OLTP supports transaction-oriented applications, typically for data entry and retrieval in real time. There are four key goals of OLTP applications: availability, speed, concurrency, and recoverability. OLTP systems process all kinds of queries (e.g. read, insert, update, and delete).
Online Analytical Processing (OLAP) is a technology that organizes large business databases and supports complex analysis of behavioral data. OLAP allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies. Data is typically loaded in batches, or stream processing frameworks ingest real-time data into the OLAP system. The main purpose of OLAP is business intelligence or reporting rather than transaction processing. OLAP is generally optimized for read-only queries and might not support other kinds of queries.
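To make the contrast concrete, here is a minimal sketch that runs both styles of query against an in-memory SQLite database. SQLite only stands in for a real OLTP or OLAP system here, and the `deposits` table and its rows are made up for the example.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deposits ("
    "id INTEGER PRIMARY KEY, account TEXT, amount REAL, ts TEXT)"
)

# OLTP-style work: short transactions that insert or update individual rows.
with conn:
    conn.executemany(
        "INSERT INTO deposits (account, amount, ts) VALUES (?, ?, ?)",
        [
            ("acct-1", 1200.0, "2019-01-15"),
            ("acct-1", 1200.0, "2019-02-15"),
            ("acct-2", 800.0, "2019-01-20"),
        ],
    )
with conn:
    conn.execute("UPDATE deposits SET amount = ? WHERE id = ?", (1250.0, 1))

# OLAP-style work: a read-only query that scans and summarizes many rows.
monthly_totals = conn.execute(
    "SELECT account, substr(ts, 1, 7) AS month, SUM(amount) "
    "FROM deposits GROUP BY account, month ORDER BY account, month"
).fetchall()

for row in monthly_totals:
    print(row)
```

The inserts and the update touch one row at a time, while the final query reads the whole table to produce a summary per account and month.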
Now let’s go through the details on each product:
Aurora is a fully managed, MySQL- and PostgreSQL-compatible relational database engine. Aurora is available as part of Amazon Relational Database Service (RDS).
Aurora has the following key features:
- Storage engine
  - A distributed, fault-tolerant, self-healing storage system that auto-scales up to 64TB per database instance.
- High performance
  - Aurora MySQL delivers up to five times the performance of MySQL without requiring any changes to most MySQL applications.
  - Similarly, Amazon Aurora PostgreSQL delivers up to three times the performance of PostgreSQL.
- High availability
  - Up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across three Availability Zones (AZs).
- Fully managed
  - AWS RDS manages your Aurora databases, handling time-consuming tasks such as provisioning, patching, backup, recovery, failure detection, and repair.
According to AWS, Aurora provides the security, availability, and reliability of commercial-grade databases at one-tenth the cost.
You can use Aurora for OLTP applications like ERP, CRM, and eCommerce to log transactions and store structured data.
Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools.
Redshift has the following key features:
- Storage engine
  - Redshift supports clusters ranging from a single node up to 100 nodes, with up to 1.6 PB of storage.
  - Moreover, with Redshift Spectrum, you can query an almost unlimited amount of data stored in S3.
- High performance
  - Massively parallel processing (MPP) query execution enables Redshift to distribute and parallelize queries across multiple nodes.
  - Columnar storage and data compression on high-performance disks reduce the amount of I/O required for queries.
  - Data sharding partitions the tables across different servers for better performance.
  - AWS claims that Redshift delivers 10x better performance than other data warehouses.
- Fully managed
  - Redshift automates most of the common administrative tasks to set up, scale, manage and maintain your data warehouse.
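The columnar storage point in the list above can be illustrated with a toy comparison. This is only a sketch of the general idea, not how Redshift is implemented internally; the sample rows and function names are hypothetical.

```python
from itertools import groupby

# Hypothetical order rows, as a row-oriented store would keep them.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "east", "amount": 80.0},
    {"order_id": 3, "region": "west", "amount": 200.0},
    {"order_id": 4, "region": "west", "amount": 50.0},
]


def sum_amount_row_store(rows):
    """Summing one field still reads every field of every row."""
    values_read = 0
    total = 0.0
    for row in rows:
        values_read += len(row)
        total += row["amount"]
    return total, values_read


# The same data laid out column by column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_amount_column_store(columns):
    """Summing one field reads only that column."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


def run_length_encode(values):
    """Compress runs of repeated values, which are common within a single column."""
    return [(value, len(list(group))) for value, group in groupby(values)]


print(sum_amount_row_store(rows))            # (450.0, 12)
print(sum_amount_column_store(columns))      # (450.0, 4)
print(run_length_encode(columns["region"]))  # [('east', 2), ('west', 2)]
```

Both layouts return the same total, but the column layout reads a third of the values for this query, and the repetitive `region` column compresses well.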
Redshift clusters currently support only single-AZ deployments.
According to AWS, Redshift is the most cost-effective cloud data warehouse, at less than one-tenth the cost of traditional on-premises data warehouses. There are no upfront costs with Redshift, and you pay only for what you use. You can start small for just $0.25 per hour with no commitments, and scale out for just $250 per terabyte per year.
Use Redshift for analytic (OLAP) applications such as operational reporting and querying, for data from terabyte scale up to 1.6PB in the cluster; Redshift Spectrum extends queries to exabyte-scale data in S3.
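For a rough sense of scale, the two prices quoted above can be turned into yearly figures. The rates are the ones stated in this post and may not match current AWS pricing; the function names are made up for the example, and the calculation assumes the per-terabyte rate applies uniformly at any size.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def small_cluster_cost_per_year(rate_per_hour=0.25):
    """Yearly cost of running the smallest option around the clock."""
    return rate_per_hour * HOURS_PER_YEAR


def storage_cost_per_year(terabytes, rate_per_tb_year=250):
    """Yearly cost at the quoted per-terabyte rate."""
    return terabytes * rate_per_tb_year


print(small_cluster_cost_per_year())  # 2190.0
print(storage_cost_per_year(100))     # 25000
print(storage_cost_per_year(1600))    # 400000 (1.6 PB expressed as 1,600 TB)
```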
Apache Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage, enormous MapReduce processing power, and the ability to handle virtually limitless concurrent tasks or jobs. The Hadoop ecosystem provides storage, data processing, query access, security, and more. For example, integrating Hive with HBase allows a single Hive query to perform complex data operations such as join, union, and aggregation across combinations of HBase and native Hive tables. Moreover, Hive’s INSERT statement can be used to move data between HBase and native Hive tables or to reorganize data within HBase itself.
The Hadoop ecosystem has the following key features:
- Hadoop Distributed File System (HDFS) is highly fault-tolerant. Data stored in HDFS is automatically replicated to two other locations.
- Flexibility in data processing
  - Supports both structured and unstructured data such as social media, email conversations or clickstream data.
  - MapReduce is well suited to processing data in batches.
- Highly scalable storage
  - There are virtually no limits to scale Hadoop.
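The MapReduce model mentioned in the list above can be sketched in a few lines of plain Python. This runs in a single process and only mimics the map, shuffle, and reduce phases; it does not use Hadoop itself, and the clickstream records are invented for the example.

```python
from collections import defaultdict

# Hypothetical clickstream records in "user,page" form.
clicks = ["u1,/home", "u2,/home", "u1,/pricing", "u3,/home"]


def map_phase(record):
    """Emit a (page, 1) pair for each click record."""
    _user, page = record.split(",")
    yield page, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values for one key into a single result."""
    return key, sum(values)


mapped = [pair for record in clicks for pair in map_phase(record)]
page_views = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(page_views)  # {'/home': 3, '/pricing': 1}
```

In a real cluster the map and reduce functions would run in parallel on many nodes, each working on the blocks of data stored locally.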
Please review my previous post on Hadoop performance. When the data is less than 1.6PB, Redshift is faster than Hadoop because Redshift is a database while Hadoop is built on a file system.
Please review my previous post on Hadoop cost. When the data is less than 1.6PB, Redshift is cheaper than Hadoop.
Hadoop provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs. It is a great OLAP choice for data larger than 1.6PB, limitless report viewing, complex BI analytics, and predictive analytics.
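The rule of thumb running through this post can be written down as a small helper. It is a deliberate simplification that looks only at workload type and data size; the function name `suggest_store` is hypothetical, and a real decision should also weigh cost, growth, and availability needs.

```python
REDSHIFT_LIMIT_PB = 1.6  # cluster storage ceiling quoted in this post


def suggest_store(workload, data_size_pb):
    """Return a first-pass suggestion based on workload type and data size."""
    workload = workload.lower()
    if workload == "oltp":
        return "Aurora"
    if workload == "olap":
        return "Redshift" if data_size_pb < REDSHIFT_LIMIT_PB else "Hadoop"
    raise ValueError("workload must be 'oltp' or 'olap'")


print(suggest_store("OLTP", 0.01))  # Aurora
print(suggest_store("OLAP", 0.5))   # Redshift
print(suggest_store("OLAP", 3.0))   # Hadoop
```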
In conclusion, all three products offer a great deal for processing big data, with high availability, scalability, performance, and security. You should consider the purpose of your applications, the total data size, the potential data growth, and the cost holistically to choose the right product(s) for your business. If you still have any questions about your Big Data system architecture decision, please contact me for a free 30-minute initial consultation.