Introduction to Data Science

Ahsan Ijaz

Topics

  • Linear Algebra
  • Statistics
  • Programming (R/Python/Scala)
  • Visualization (ggplot2, D3.js, Tableau)
  • Feature Selection
  • Hypothesis testing
  • Machine learning (Regression, Classification, Recommender Systems, Clustering, Deep learning)
  • Reproducible documentation
  • Big data (MapReduce, Hadoop, Spark, Mesos/YARN, ...)

Data Science Demand (LinkedIn Top skills)

Data Science Demand (Gartner 2013)

Data Science Demand (Gartner 2014)

Data Science Demand (Gartner 2015)

Data Science Demand (Gartner 2016)

General Skill set

Types of Data Scientists

Roadmap to Data Science

Data Science ToolBox

  • R programming language
  • Python Language
  • R Markdown, knitr, Slidify, Shiny
  • IPython notebooks

Jeff Hammerbacher's Model for Data problems

  • Identify problem
  • Instrument data sources
  • Collect data
  • Prepare data (integrate, transform, clean, filter, aggregate)
  • Build model
  • Evaluate model
  • Communicate results

Our first Data science project!


question -> input data -> features -> algorithm -> parameters -> evaluation

SPAM Example


question -> input data -> features -> algorithm -> parameters -> evaluation


Start with a general question

Can I automatically distinguish emails that are SPAM from those that are not?

Make it concrete

Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?

SPAM Example


question -> input data -> features -> algorithm -> parameters -> evaluation


Dear Sir,

Can you send me your address so I can send you the invitation?

Thanks,

Ahsan

Frequency of 'you' \(= 2/17 \approx 0.118\)
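
The count above can be reproduced in a few lines of R. This is a minimal sketch: the variable names (`email`, `words`, `freq_you`) are illustrative, and the tokenization (lowercase, split on anything that is not a letter) is an assumption, not necessarily how the features in the spam dataset were computed.

```r
# Email text from the slide
email <- "Dear Sir, Can you send me your address so I can send you the invitation? Thanks, Ahsan"

# Assumed tokenization: lowercase, then split on runs of non-letter characters
words <- strsplit(tolower(email), "[^a-z]+")[[1]]
words <- words[words != ""]   # drop empty tokens, if any

# Relative frequency of the exact word "you" ("your" is a different token)
freq_you <- sum(words == "you") / length(words)

length(words)        # 17
round(freq_you, 3)   # 0.118
```

Matching whole tokens keeps 'you' and 'your' apart, which matters because both appear as separate features in the dataset used in the following slides.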

SPAM Example


question -> input data -> features -> algorithm -> parameters -> evaluation

library(kernlab)
data(spam)
head(spam,3)
  make address  all num3d  our over remove internet order mail receive will people report addresses
1 0.00    0.64 0.64     0 0.32 0.00   0.00     0.00  0.00 0.00    0.00 0.64   0.00   0.00      0.00
2 0.21    0.28 0.50     0 0.14 0.28   0.21     0.07  0.00 0.94    0.21 0.79   0.65   0.21      0.14
3 0.06    0.00 0.71     0 1.23 0.19   0.19     0.12  0.64 0.25    0.38 0.45   0.12   0.00      1.75
  free business email  you credit your font num000 money hp hpl george num650 lab labs telnet
1 0.32     0.00  1.29 1.93   0.00 0.96    0   0.00  0.00  0   0      0      0   0    0      0
2 0.14     0.07  0.28 3.47   0.00 1.59    0   0.43  0.43  0   0      0      0   0    0      0
3 0.06     0.06  1.03 1.36   0.32 0.51    0   1.16  0.06  0   0      0      0   0    0      0
  num857 data num415 num85 technology num1999 parts pm direct cs meeting original project   re  edu
1      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
2      0    0      0     0          0    0.07     0  0   0.00  0       0     0.00       0 0.00 0.00
3      0    0      0     0          0    0.00     0  0   0.06  0       0     0.12       0 0.06 0.06
  table conference charSemicolon charRoundbracket charSquarebracket charExclamation charDollar
1     0          0          0.00            0.000                 0           0.778      0.000
2     0          0          0.00            0.132                 0           0.372      0.180
3     0          0          0.01            0.143                 0           0.276      0.184
  charHash capitalAve capitalLong capitalTotal type
1    0.000      3.756          61          278 spam
2    0.048      5.114         101         1028 spam
3    0.010      9.821         485         2259 spam

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

plot(density(spam$your[spam$type=="nonspam"]),
     col="blue",main="",xlab="Frequency of 'your'")
lines(density(spam$your[spam$type=="spam"]),col="red")
Figure: density of the frequency of 'your' for nonspam (blue) and spam (red) emails

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation



Our algorithm

  • Find a value \(C\).
  • If the frequency of 'your' \(> C\), predict "spam"; otherwise predict "nonspam".
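
One way to "find a value \(C\)" is to try a grid of candidate cutoffs and keep the one with the highest accuracy. The sketch below assumes made-up toy data (`freq_your`, `type`) rather than the real spam dataset, and the helper names `predict_spam` and `accuracy` are hypothetical.

```r
# Hypothetical toy data (NOT the kernlab spam dataset): frequency of 'your'
# and the true label for eight made-up emails
freq_your <- c(0.02, 0.11, 0.27, 0.41, 0.58, 0.93, 1.24, 2.06)
type      <- c("nonspam", "nonspam", "nonspam", "spam",
               "nonspam", "spam", "spam", "spam")

# The rule from the slide: predict "spam" when the frequency exceeds C
predict_spam <- function(freq, C) ifelse(freq > C, "spam", "nonspam")

# Fraction of emails whose predicted label matches the true label
accuracy <- function(C) mean(predict_spam(freq_your, C) == type)

# Try a grid of candidate cutoffs and keep the first most accurate one
candidates <- seq(0, 2, by = 0.05)
acc        <- sapply(candidates, accuracy)
best_C     <- candidates[which.max(acc)]

max(acc)   # 0.875: no single cutoff separates these toy emails perfectly
```

On the real data, the same search could be run over `spam$your` and `spam$type` once `kernlab` is loaded.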

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

plot(density(spam$your[spam$type=="nonspam"]),
     col="blue",main="",xlab="Frequency of 'your'")
lines(density(spam$your[spam$type=="spam"]),col="red")
abline(v=0.5,col="black")
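Figure: density of the frequency of 'your' for nonspam (blue) and spam (red) emails, with a vertical black line at the cutoff 0.5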

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

prediction <- ifelse(spam$your > 0.5,"spam","nonspam")
table(prediction,spam$type)

prediction nonspam spam
   nonspam    2112  468
   spam        676 1345

Accuracy \(= (2112 + 1345)/4601 \approx 0.459 + 0.292 = 0.751\)
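
The accuracy figure can be checked directly from the counts in the table. This is a minimal sketch that re-enters the four counts by hand, so it runs without the dataset; the name `confusion` is illustrative.

```r
# Counts copied from the table: rows are predictions, columns are true labels
# (matrix() fills column by column)
confusion <- matrix(c(2112, 676, 468, 1345), nrow = 2,
                    dimnames = list(prediction = c("nonspam", "spam"),
                                    truth      = c("nonspam", "spam")))

n        <- sum(confusion)             # 4601 emails in total
accuracy <- sum(diag(confusion)) / n   # correct predictions lie on the diagonal

round(confusion / n, 3)   # each cell as a proportion of all emails
round(accuracy, 3)        # 0.751
```

The off-diagonal cells are the errors: 676 nonspam emails flagged as spam and 468 spam emails that were missed.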

Applications of Machine Learning

Case Studies

A case study approach to machine learning and data science.

Case Study 1, Regression:

Prediction of House prices

Case Study 2, Classification:

Sentiment Analysis

Case Study 3, Clustering:

Document Retrieval

Case Study 4, Recommender System (Matrix factorization):

Case Study 5, Deep learning:

Visual recommender system.

Regression techniques

  • Linear Regression
  • Multiple linear Regression
  • Polynomial Regression
  • Ridge Regression
  • Lasso Regression
  • Local/Kernel Regression

Classification methods

  • Linear Classifier
  • Logistic Regression
  • Decision Trees
  • Bagging
  • Boosting
  • Discussion about scaling of algorithms

Feature Selection

  • Subset selection
  • Greedy passes
  • Basis selection
  • Bias variance trade-off

Document retrieval and Clustering

  • Nearest Neighbor
  • K-means clustering
  • DBSCAN
  • Hierarchical Clustering
  • Latent Dirichlet allocation

Recommender System

  • Collaborative Filtering
  • SVD and PCA
  • Matrix factorization

Deep learning

  • Examples of cross-learning
  • Using deep features for image, text matching

Hypothesis Testing

  • t-distributions
  • A/B testing scenarios
  • Real world examples

Reproducible Documents

  • R Markdown example, embedding code and text
  • IPython

Big Data Paradigm

  • MapReduce
  • Hadoop
  • Spark
  • Mesos/YARN
  • SQL to NoSQL databases