Manipulation of Big Data

Ahsan Ijaz

Four dimensions of Data Science

  • Breadth
  • Depth
  • Scale
  • Target

Breadth

Tools and their abstractions:

  • Hadoop: MapReduce
  • PostgreSQL: relational algebra
  • glm in R: logistic regression
  • Tableau: InfoVis

Depth

Structures vs. statistics:

  • Management vs. linear algebra
  • Relational algebra vs. analysis
  • Standards vs. ad hoc files

Scale

Desktop vs. cloud:

  • Main memory vs. distributed
  • R vs. Hadoop
  • Local files vs. S3, Azure, ...

Target

Hackers vs. analysts:

  • Hackers: assume proficiency in R, Python
  • Analysts: no programming knowledge assumed

What is big data?

Volume

  • The size of data

Velocity

  • The speed of data processing, from high-latency batch jobs to interactive response

Variety

  • The diversity of sources, formats, quality, structures

Databases

  • Sharing
    • Concurrent access for multiple users
  • Data model enforcement
    • Make sure all applications see clean, organized data
  • Scale
    • Work with datasets too large to fit in memory
    • Complexity-hiding interface (see the sketch below)
  • Flexibility
    • Use the data in new, unexpected ways
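
A minimal sketch of the data-model-enforcement and complexity-hiding points above, using Python's built-in sqlite3 module; the readings table and its columns are invented for illustration:

```python
import sqlite3

# Open (or create) a database file; the data lives on disk,
# so it does not have to fit in main memory.
conn = sqlite3.connect("measurements.db")

# Data model enforcement: every application sees the same schema.
conn.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id INTEGER NOT NULL,
        taken_at  TEXT    NOT NULL,
        value     REAL    NOT NULL
    )
""")
conn.execute("INSERT INTO readings VALUES (?, ?, ?)", (1, "2024-01-01", 3.7))
conn.commit()

# Complexity-hiding interface: a declarative query; the engine decides
# how to scan, filter, and aggregate the rows on disk.
for sensor_id, avg_value in conn.execute(
    "SELECT sensor_id, AVG(value) FROM readings GROUP BY sensor_id"
):
    print(sensor_id, avg_value)

conn.close()
```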

Selection of a database

  • How is data physically organized on disk?
  • What kinds of queries are efficient on this organization?
  • How hard is it to update or add new data? (An organization that is good for reading may not be good for fast writes; see the sketch below.)
  • What happens with ad hoc queries?
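
As a toy illustration of that read/write trade-off (both containers here are simplifications, not real storage engines): a sorted array answers lookups quickly but pays on every insert, while an append-only log does the opposite.

```python
import bisect

# Read-optimized organization: keep the data sorted.
# Lookups are O(log n) via binary search, but every insert must
# shift elements to preserve the order.
sorted_keys = []

def insert_sorted(key):
    bisect.insort(sorted_keys, key)           # O(n) worst case

def lookup_sorted(key):
    i = bisect.bisect_left(sorted_keys, key)  # O(log n)
    return i < len(sorted_keys) and sorted_keys[i] == key

# Write-optimized organization: just append.
# Inserts are O(1), but an ad hoc lookup has to scan everything.
log = []

def insert_log(key):
    log.append(key)                           # O(1)

def lookup_log(key):
    return key in log                         # O(n)
```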

Relational databases

  • Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representations are changed.

  • Key idea: Programs that manipulate tabular data exhibit an algebraic structure that allows reasoning and manipulation independently of the physical data representation. This is physical data independence (see the sketch below).
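
A small sketch of that algebraic structure, with relations represented as Python lists of dicts (the relations and attribute names are invented): the operators compose without ever mentioning how the rows are stored.

```python
# Relations are just collections of rows; the operators below speak only the
# logical language of attributes, never byte offsets or file layouts.
employees = [{"name": "Ada", "dept": 1}, {"name": "Boris", "dept": 2}]
depts     = [{"dept": 1, "dept_name": "Research"}, {"dept": 2, "dept_name": "Sales"}]

def select(relation, predicate):              # sigma
    return [row for row in relation if predicate(row)]

def project(relation, attributes):            # pi
    return [{a: row[a] for a in attributes} for row in relation]

def join(r, s):                               # natural join on shared attributes
    return [{**x, **y} for x in r for y in s
            if all(x[k] == y[k] for k in x.keys() & y.keys())]

# "Names of everyone in Research", independent of physical representation:
result = project(select(join(employees, depts),
                        lambda row: row["dept_name"] == "Research"),
                 ["name"])
print(result)   # [{'name': 'Ada'}]
```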

Scalability

Operationally

  • In the past: Works even if data doesn't fit in memory
  • Now: Can spread the work over 1000s of cheap computers

Algorithmically

  • In the past: Find a polynomial time algorithm that requires no more than \(N^m\) operations
  • Now: If you have \(N\) data items, you should need no more than \(\frac{N^m}{k}\) operations for some very large \(k\) (e.g., the number of machines available)
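
For a rough sense of the difference: with \(N = 10^9\) items and a linear-time (\(m = 1\)) algorithm, a single machine performs \(10^9\) operations, whereas spreading the work over \(k = 1000\) machines leaves roughly \(\frac{10^9}{1000} = 10^6\) operations per machine. The numbers are purely illustrative.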

Data parallel algorithms

  • Convert all images from TIFF to PNG.
  • Run 1000s of simulations.
  • Most frequent word in each document.
  • Histogram of words in each document.
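
Each of these tasks applies an independent function to each input, so a simple pool of workers is enough. A minimal sketch of the per-document word histogram case, assuming the documents are plain strings already in memory:

```python
from collections import Counter
from multiprocessing import Pool

def word_histogram(document):
    """Map one document to its word frequencies."""
    return Counter(document.split())

if __name__ == "__main__":
    documents = [
        "the quick brown fox",
        "the lazy dog",
        "the quick dog",
    ]
    with Pool() as pool:
        # One independent histogram per document; no communication needed.
        per_document = pool.map(word_histogram, documents)
    for counts in per_document:
        print(counts)
```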

So what can we see?

  • A function that maps TIFF into PNG images.
  • A function that maps parameters to simulation.
  • A function that maps a document to its most common word.
  • A function that maps a document to word frequencies.
  • What if we want to compute word frequencies across all documents?

MapReduce Idea:

Computing frequencies across all documents.
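
A minimal single-machine sketch of the idea (a real system such as Hadoop distributes the same phases over many machines): map each document to (word, 1) pairs, group the pairs by word, then reduce each group by summing.

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, as the MapReduce framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Sum the counts for one word across all documents.
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]
all_pairs = [pair for doc in documents for pair in map_phase(doc)]
totals = dict(reduce_phase(w, c) for w, c in shuffle(all_pairs).items())
print(totals)   # e.g. {'the': 3, 'quick': 2, ...}
```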