Manipulation of Big Data

Ahsan Ijaz

Four dimensions of Data Science

  • Breadth
  • Depth
  • Scale
  • Target

Breadth

Tools and their abstractions:

  • Hadoop: MapReduce
  • PostgreSQL: relational algebra
  • glm in R: logistic regression
  • Tableau: InfoVis

Depth

Structures vs. statistics:

  • Management vs. linear algebra
  • Relational algebra vs. analysis
  • Standards vs. ad hoc files

Scale

Desktop vs. cloud:

  • Main memory vs. distributed
  • R vs. Hadoop
  • Local files vs. S3, Azure, ...

Target

Hackers vs. analysts:

  • Hackers: assume proficiency in R, Python
  • Analysts: no programming knowledge assumed

What is big data?

Volume

  • The size of data

Velocity

  • The speed of data processing, from high-latency batch jobs to interactive response

Variety

  • The diversity of sources, formats, quality, structures

Databases

  • Sharing
    • Concurrent access for multiple users
  • Data model enforcement
    • Make sure all applications see clean, organized data
  • Scale
    • Work with datasets too large to fit in memory
    • Complexity-hiding interface (see the sketch below)
  • Flexibility
    • Use the data in new, unexpected ways
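
A minimal sketch of the data-model-enforcement and complexity-hiding points above, using Python's built-in sqlite3 module; the readings table and its columns are invented for illustration:

```python
import sqlite3

# Open (or create) a database file; the data lives on disk,
# so it does not have to fit in main memory.
conn = sqlite3.connect("measurements.db")

# Data model enforcement: every application sees the same schema.
conn.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id INTEGER NOT NULL,
        taken_at  TEXT    NOT NULL,
        value     REAL    NOT NULL
    )
""")
conn.execute("INSERT INTO readings VALUES (?, ?, ?)", (1, "2024-01-01", 3.7))
conn.commit()

# Complexity-hiding interface: a declarative query; the engine decides
# how to scan, filter, and aggregate the rows on disk.
for sensor_id, avg_value in conn.execute(
    "SELECT sensor_id, AVG(value) FROM readings GROUP BY sensor_id"
):
    print(sensor_id, avg_value)

conn.close()
```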

Selection of a database

  • How is data physically organized on disk?
  • What kinds of queries are efficient on this organization?
  • How hard is it to update or add new data? (An organization that is good for reading may not be good for fast writes; see the sketch below.)
  • What happens with ad hoc queries?
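
As a toy illustration of that read/write trade-off (both containers here are simplifications, not real storage engines): a sorted array answers lookups quickly but pays on every insert, while an append-only log does the opposite.

```python
import bisect

# Read-optimized organization: keep the data sorted.
# Lookups are O(log n) via binary search, but every insert must
# shift elements to preserve the order.
sorted_keys = []

def insert_sorted(key):
    bisect.insort(sorted_keys, key)           # O(n) worst case

def lookup_sorted(key):
    i = bisect.bisect_left(sorted_keys, key)  # O(log n)
    return i < len(sorted_keys) and sorted_keys[i] == key

# Write-optimized organization: just append.
# Inserts are O(1), but an ad hoc lookup has to scan everything.
log = []

def insert_log(key):
    log.append(key)                           # O(1)

def lookup_log(key):
    return key in log                         # O(n)
```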

Relational databases

  • Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representations are changed.

  • Key idea: Programs that manipulate tabular data exhibit an algebraic structure that allows reasoning and manipulation independently of the physical data representation. This is physical data independence (see the sketch below).
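
A small sketch of that algebraic structure, with relations represented as Python lists of dicts (the relations and attribute names are invented): the operators compose without ever mentioning how the rows are stored.

```python
# Relations are just collections of rows; the operators below speak only the
# logical language of attributes, never byte offsets or file layouts.
employees = [{"name": "Ada", "dept": 1}, {"name": "Boris", "dept": 2}]
depts     = [{"dept": 1, "dept_name": "Research"}, {"dept": 2, "dept_name": "Sales"}]

def select(relation, predicate):              # sigma
    return [row for row in relation if predicate(row)]

def project(relation, attributes):            # pi
    return [{a: row[a] for a in attributes} for row in relation]

def join(r, s):                               # natural join on shared attributes
    return [{**x, **y} for x in r for y in s
            if all(x[k] == y[k] for k in x.keys() & y.keys())]

# "Names of everyone in Research", independent of physical representation:
result = project(select(join(employees, depts),
                        lambda row: row["dept_name"] == "Research"),
                 ["name"])
print(result)   # [{'name': 'Ada'}]
```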

Scalability

Operationally

  • In the past: Works even if data doesn't fit in memory
  • Now: Can spread the work over 1000s of cheap computers

Algorithmically

  • In the past: Find a polynomial time algorithm that requires no more than \(N^m\) operations
  • Now: If you have \(N\) data items, you should need no more than \(\frac{N^m}{k}\) operations for some very large \(k\) (e.g., the number of machines available)
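
For a rough sense of the difference: with \(N = 10^9\) items and a linear-time (\(m = 1\)) algorithm, a single machine performs \(10^9\) operations, whereas spreading the work over \(k = 1000\) machines leaves roughly \(\frac{10^9}{1000} = 10^6\) operations per machine. The numbers are purely illustrative.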

Data parallel algorithms

  • Convert all images from TIFF to PNG.
  • Run 1000s of simulations.
  • Most frequent word in each document.
  • Histogram of words in each document.
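
Each of these tasks applies an independent function to each input, so a simple pool of workers is enough. A minimal sketch of the per-document word histogram case, assuming the documents are plain strings already in memory:

```python
from collections import Counter
from multiprocessing import Pool

def word_histogram(document):
    """Map one document to its word frequencies."""
    return Counter(document.split())

if __name__ == "__main__":
    documents = [
        "the quick brown fox",
        "the lazy dog",
        "the quick dog",
    ]
    with Pool() as pool:
        # One independent histogram per document; no communication needed.
        per_document = pool.map(word_histogram, documents)
    for counts in per_document:
        print(counts)
```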

So what can we see?

  • A function that maps TIFF into PNG images.
  • A function that maps parameters to simulation.
  • A function that maps a document to its most common word.
  • A function that maps a document to word frequencies.
  • What if we want to compute word frequencies across all documents?

MapReduce Idea:

Computing frequencies across all documents.
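
A minimal single-machine sketch of the idea (a real system such as Hadoop distributes the same phases over many machines): map each document to (word, 1) pairs, group the pairs by word, then reduce each group by summing.

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, as the MapReduce framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Sum the counts for one word across all documents.
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]
all_pairs = [pair for doc in documents for pair in map_phase(doc)]
totals = dict(reduce_phase(w, c) for w, c in shuffle(all_pairs).items())
print(totals)   # e.g. {'the': 3, 'quick': 2, ...}
```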