MapReduce and RDBMS

Ahsan Ijaz

Distributed File system

  • For very large files (TBs, PBs)
  • Each file is partioned into chunks (64MB)
  • Each chunk is replicated several times ($ > 3 $)
  • Implementations - Hadoop file system (HDFS- open source), Google distributed File system (GFS- proprietary)

Implementation

  • One master node
  • Master partitions file into M splits by key
  • Master assigns workers to the M map tasks and keep track of their progress.
  • Workers write output into local disks with R regions.
  • Master assigns workers to R reduced tasks.
  • Reduce workers read regions from the map worker's local disk.

MapReduce Complete cycle

Large-scale Data Processing

  • Many tasks process big data.
  • Want to use hundreds or thousands of CPUs
    • Parallel databases exist - Expensive! Difficult to set up.
  • MapReduce is a lightweight framework, providing:
    • Automatic parallelization and distribution
    • Fault-tolerance
    • I/O scheduling
    • Status and monitoring

Design space for Big data

Parallel query processing

  • Distributed Query (Microsoft SQL server)
    • Rewrite the query as a union of subqueries
    • Workers communicate through standard interfaces
    • Same as MR, BUT all results are sent back to head node
  • Parallel Query
    • Each operator is implemented with a parallel query
    • Same as MR

Distributed Query

CREATE VIEW SALES AS
SELECT * from janSales
UNION ALL
SELECT * from febSales
UNION ALL
SELECT * from marchSales
UNION ALL

Parallel Query

TeraData Example

SELECT *
    FROM Orders o, Lines i
WHERE o.item = i.item
AND o.date = today()

TeraData Example

TeraData Example

MapReduce Extensions

  • Pig
    • Relational algebra over Hadoop
  • Hive
    • SQL over hadoop
  • Impala
    • SQL over HDFS; builds on HIVE code

MapReduce vs RDBMS

RDBMS

  • Declarative query language (Pig, HIVE)
  • Schemas (HIVE)
  • Logical data independence
  • Indexing (Hbase)
  • Algebraic optimization (Pig, HIVE)
  • Caching Views
  • ACID/Transactions

MapReduce

  • High Scalability ( \(>\) 1000 Nodes)
  • Fault tolerance

Hadoop vs. RDBMS

  • Comparison of 3 systems
    • Hadoop
    • Vertica (column oriented)
    • Oracle (row oriented)
  • Qualitative
    • Programming model, ease of setup, features
  • Quantitative
    • Data loading, different types of queries

Grep Task

  • Find 3-byte pattern in 100 byte record
    • 1 match per 10,000 records
  • Data set:
    • 10 byte unique key, 90 byte value
    • 1TB spread across 25, 50 or 100 nodes
    • 10 billion records

Grep task loading results

Grep Task

Selection and filter task

Aggregate tasks