MapReduce and RDBMS

Ahsan Ijaz

Distributed File system

For very large files (TBs, PBs)
Each file is partitioned into chunks (64 MB)
Each chunk is replicated several times ($\geq 3$)
Implementations: Hadoop Distributed File System (HDFS, open source), Google File System (GFS, proprietary)
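
The chunking and replication arithmetic can be sketched in a few lines. This is a minimal illustration, not how HDFS or GFS actually place data: the round-robin placement, the node names, and the function names `num_chunks` and `place_replicas` are assumptions made for the example, and the replication factor of 3 is an assumed setting.

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB chunks, as on the slide
REPLICATION = 3           # assumed replication factor

def num_chunks(file_size_bytes, chunk_size=CHUNK_SIZE):
    """Number of chunks needed for a file; the last chunk may be partial."""
    return (file_size_bytes + chunk_size - 1) // chunk_size

def place_replicas(n_chunks, nodes, replication=REPLICATION):
    """Hypothetical round-robin placement: each chunk lands on
    `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(n_chunks)
    }

one_tb = 2**40
chunks = num_chunks(one_tb)            # 2**40 / 2**26 = 2**14 chunks
stored_copies = chunks * REPLICATION   # chunk replicas held by the cluster
placement = place_replicas(4, ["n0", "n1", "n2", "n3", "n4"])
```

Under these assumptions a 1 TB file (taken as 2^40 bytes) splits into 16,384 chunks, and the cluster stores 49,152 chunk replicas in total.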

Implementation

One master node
Master partitions file into M splits by key
Master assigns workers to the M map tasks and keeps track of their progress.
Map workers write their output to local disk, partitioned into R regions.
Master assigns workers to the R reduce tasks.
Reduce workers read their regions from the map workers' local disks.
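
The steps above can be imitated in a single process. The sketch below is only an illustration under stated assumptions: word count stands in for the user's map and reduce functions, splits are formed by line position, in-memory lists stand in for local disks, and the names `map_fn`, `reduce_fn`, `region_of`, and `run_mapreduce` are hypothetical.

```python
import zlib
from collections import defaultdict

def map_fn(_offset, line):
    # Example user map function (word count): emit (word, 1) pairs.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Example user reduce function: sum the counts for one word.
    return word, sum(counts)

def region_of(key, R):
    # Deterministic hash partitioner: decides which of the R regions gets a key.
    return zlib.crc32(key.encode()) % R

def run_mapreduce(lines, M=3, R=2):
    # "Master" step: divide the input into M splits.
    splits = [lines[i::M] for i in range(M)]

    # Map phase: each map task writes R regions (lists stand in for local disk).
    map_outputs = []
    for split in splits:
        regions = [defaultdict(list) for _ in range(R)]
        for offset, line in enumerate(split):
            for key, value in map_fn(offset, line):
                regions[region_of(key, R)][key].append(value)
        map_outputs.append(regions)

    # Reduce phase: reduce task r reads region r from every map task.
    result = {}
    for r in range(R):
        grouped = defaultdict(list)
        for regions in map_outputs:
            for key, values in regions[r].items():
                grouped[key].extend(values)
        for key in sorted(grouped):
            word, total = reduce_fn(key, grouped[key])
            result[word] = total
    return result

counts = run_mapreduce(["a b a", "b c", "a"], M=2, R=2)
```

Because every occurrence of a key hashes to the same region, each reduce task sees all values for its keys, which is what makes the per-key sum correct.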

MapReduce Complete cycle

Large-scale Data Processing

Many tasks process big data.
Want to use hundreds or thousands of CPUs
- Parallel databases exist, but they are expensive and difficult to set up.
MapReduce is a lightweight framework, providing:
- Automatic parallelization and distribution
- Fault-tolerance
- I/O scheduling
- Status and monitoring

Design space for Big data

Parallel query processing

Distributed Query (Microsoft SQL server)
- Rewrite the query as a union of subqueries
- Workers communicate through standard interfaces
- Similar to MR, but all results are sent back to the head node
Parallel Query
- Each operator is implemented with a parallel algorithm
- Same as MR

Distributed Query

CREATE VIEW SALES AS
SELECT * FROM janSales
UNION ALL
SELECT * FROM febSales
UNION ALL
SELECT * FROM marchSales;
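
To see what a query against such a view does, the sketch below builds the three monthly tables in an in-memory SQLite database and aggregates over the view. It runs on one machine, so it only illustrates the "union of subqueries" rewrite, not the distribution itself. The column names `item` and `amount` and the sample rows are assumptions; the slide does not give a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: the slides do not say what the sales tables contain.
for table in ("janSales", "febSales", "marchSales"):
    cur.execute(f"CREATE TABLE {table} (item TEXT, amount REAL)")
cur.execute("INSERT INTO janSales VALUES ('widget', 10.0)")
cur.execute("INSERT INTO febSales VALUES ('widget', 12.5), ('gadget', 7.0)")
cur.execute("INSERT INTO marchSales VALUES ('gadget', 3.0)")

# The view is a union of per-table subqueries, as on the slide.
cur.execute("""
    CREATE VIEW sales AS
    SELECT * FROM janSales
    UNION ALL
    SELECT * FROM febSales
    UNION ALL
    SELECT * FROM marchSales
""")

# A query on the view touches every underlying table.
rows = cur.execute(
    "SELECT item, SUM(amount) FROM sales GROUP BY item ORDER BY item"
).fetchall()
totals = dict(rows)
conn.close()
```

In a distributed setting each `SELECT` branch could run on the node that holds the corresponding table, with the head node combining the results.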

Parallel Query

Teradata Example

SELECT *
FROM Orders o, Lines i
WHERE o.item = i.item
AND o.date = today();
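
One common way to run such a join in parallel is a partitioned hash join: filter `Orders` by date, hash both relations on `item` so that matching rows meet on the same worker, then join locally. The sketch below simulates that in one process. It is not a description of Teradata's actual plan; the row layout, the worker count, and the names `partition_by_item`, `local_join`, and `parallel_join` are assumptions.

```python
import zlib
from collections import defaultdict
from datetime import date

def partition_by_item(rows, n_workers):
    # Hash-partition rows on the join key so equal items share a worker.
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[zlib.crc32(row["item"].encode()) % n_workers].append(row)
    return parts

def local_join(orders, lines):
    # Ordinary hash join on one worker: build on Lines, probe with Orders.
    index = defaultdict(list)
    for line in lines:
        index[line["item"]].append(line)
    return [(o, l) for o in orders for l in index[o["item"]]]

def parallel_join(orders, lines, today, n_workers=3):
    # Selection first (o.date = today), so fewer rows are shuffled.
    todays_orders = [o for o in orders if o["date"] == today]
    order_parts = partition_by_item(todays_orders, n_workers)
    line_parts = partition_by_item(lines, n_workers)
    result = []
    for w in range(n_workers):  # each iteration stands in for one worker
        result.extend(local_join(order_parts[w], line_parts[w]))
    return result

today = date(2024, 1, 15)  # fixed date so the example is reproducible
orders = [
    {"id": 1, "item": "a", "date": today},
    {"id": 2, "item": "b", "date": date(2024, 1, 14)},
    {"id": 3, "item": "c", "date": today},
]
lines = [
    {"line": 10, "item": "a"},
    {"line": 11, "item": "a"},
    {"line": 12, "item": "b"},
    {"line": 13, "item": "c"},
]
pairs = sorted((o["id"], l["line"]) for o, l in parallel_join(orders, lines, today))
```

The result does not depend on the number of workers, since partitioning only decides where each matching pair is produced.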

MapReduce Extensions

Pig
- Relational algebra over Hadoop
Hive
- SQL over Hadoop
Impala
- SQL over HDFS; builds on HIVE code

MapReduce vs RDBMS

RDBMS

Declarative query language (Pig, HIVE)
Schemas (HIVE)
Logical data independence
Indexing (HBase)
Algebraic optimization (Pig, HIVE)
Caching Views
ACID/Transactions

MapReduce

High scalability ($> 1000$ nodes)
Fault tolerance

Hadoop vs. RDBMS

Comparison of 3 systems
- Hadoop
- Vertica (column oriented)
- Oracle (row oriented)
Qualitative
- Programming model, ease of setup, features
Quantitative
- Data loading, different types of queries

Grep Task

Find a 3-byte pattern in 100-byte records
- 1 match per 10,000 records
Data set:
- 10-byte unique key, 90-byte value
- 1 TB spread across 25, 50, or 100 nodes
- 10 billion records
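
The data set figures are mutually consistent, which a short check makes explicit, and the task itself is a plain substring scan. The sketch below is a scaled-down illustration only: the pattern `XYZ`, the filler bytes, and the helper names `make_records` and `grep` are assumptions, and the benchmark's real data generator is not reproduced here.

```python
RECORD_SIZE = 100              # bytes: 10-byte key + 90-byte value
N_RECORDS = 10_000_000_000     # 10 billion records
RECORDS_PER_MATCH = 10_000     # 1 match per 10,000 records

total_bytes = RECORD_SIZE * N_RECORDS               # 10**12 bytes, i.e. 1 TB
expected_matches = N_RECORDS // RECORDS_PER_MATCH   # matches in the full data set

def records_per_node(n_nodes):
    # Records each node scans when the data is spread evenly.
    return N_RECORDS // n_nodes

def make_records(n, pattern=b"XYZ", every=RECORDS_PER_MATCH):
    # Hypothetical generator: numeric 10-byte key, 90-byte filler value,
    # with the pattern planted in one record out of every `every`.
    records = []
    for i in range(n):
        key = f"{i:010d}".encode()
        value = bytearray(b"." * 90)
        if i % every == 0:
            value[:3] = pattern
        records.append(key + bytes(value))
    return records

def grep(records, pattern):
    # The task itself: keep the records that contain the pattern.
    return [r for r in records if pattern in r]

sample = make_records(20_000)
hits = grep(sample, b"XYZ")
```

At full scale the same scan would return about one million records, so the task is dominated by reading the input rather than by writing output.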

Grep task loading results

Selection and filter task

Aggregate tasks