Helpful Resources for Workshops
Installations
- Installation of R and RStudio (you don’t need to install SDSFoundations)
Tutorials for R
- Swirl is an excellent tutorial for learning R interactively in RStudio. Highly recommended.
Data science examples and technology landscape
- (example) Yong-Yeol Ahn, Sebastian E. Ahnert, James P. Bagrow, Albert-László Barabási, Flavor network and the principles of food pairing, Scientific Reports 1, Article number: 196 doi:10.1038/srep00196
- (example) Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030
- (example) Google Flu Trends
- (example) Eigenfactor, and publications
- (example) L’Aquila quake: Italy scientists guilty of manslaughter, BBC
- eScience: The Fourth Paradigm (Foreword and Introduction, pages xi-xxxi; Gray’s Laws, pages 5-12)
- Chris Anderson, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Wired magazine, 2008
- Responses to Chris Anderson, 2008
Data at scale
Databases and the relational algebra
- How Vertica Was the Star of the Obama Campaign, and Other Revelations
- E. F. Codd, 1981 Turing Award Lecture, “Relational Database: A Practical Foundation for Productivity”, 1981
- [Advanced] Cohen et al., “MAD Skills: New Analysis Practices for Big Data”, 2009
- [Advanced] Erik Meijer, Gavin Bierman, “A co-Relational Model of Data for Large Shared Data Banks”, Communications of the ACM, 2011
MapReduce, Hadoop, relationship to databases, algorithms, extensions, language; key-value stores and NoSQL; tradeoffs of SQL and NoSQL (readings)
- Ullman, Rajaraman, Mining of Massive Datasets, Chapter 2
- Stonebraker et al., “MapReduce and Parallel DBMSs: Friends or Foes?”, Communications of the ACM, January 2010.
- Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM, January 2010.
- Rick Cattell, “Scalable SQL and NoSQL Data Stores”, SIGMOD Record, December 2010 (39:4)
- Optional Technical Background: The Hadoop Distributed File System
Data cleaning, entity resolution, data integration, information extraction
- Elmagarmid et al., Duplicate Record Detection: A Survey
- Koudas et al., Record Linkage: Similarity Measures and Algorithms
Machine Learning resources
- Statistics is Easy! Dennis Shasha, Manda Wilson, Morgan and Claypool
- Chapter 3 of A Handbook of Statistical Analyses Using R
- Gregory Park on overfitting to the leaderboard in a Kaggle Competition
- Xindong Wu et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(1): 1-37, 2008.
- Ullman, Rajaraman, Mining of Massive Datasets, Chapter 1
- Pedro Domingos, A Few Useful Things to Know about Machine Learning, CACM 55(10), 2012
- The Art of Data Science
Visualization and communicating results
- Hans Rosling, The Joy of Stats
- Pat Hanrahan, Tools for Data Enthusiasts
- Jeffrey Heer, Michael Bostock, Vadim Ogievetsky, A Tour through the Visualization Zoo, Communications of the ACM, Volume 53 Issue 6, June 2010
- Alberto Cairo, Graphics Lies, Misleading Visuals
Interesting Data blogs and reading links
- Simply Statistics
- OpenAI blog
- Kaggle blog
- Machine learning blog
- KDnuggets
- Dataists
- A comprehensive list of Data blogs
- Visual information theory
- Deep learning reading list