The Internet is replete with posts on ‘how to become a data scientist’, and unfortunately I’m adding another one. The breadth of knowledge expected of a practicing data scientist can be quite daunting, but one can limit the depth in some of these skills. Here, I’ve put together the bare minimum resources that a data scientist should go through, attempting to cover what is needed to sharpen each skill set and how to approach the learning path. The data scientist role has morphed into a concoction of skill sets, so what follows is a catalog of resources divided by skill set.
Mathematical Background
There is no way around it: you need strong mathematics to be a data scientist. I usually start the interview process by asking candidates to draw the equations of commonly used functions. Why is this needed? Once you start on the machine learning journey, you can’t get around sigmoid functions, trend analysis, radial basis kernels and estimators on a log scale. Therefore, it is important to familiarize yourself with the geometry of these functions (a small code sketch of their shapes follows the list below):
- An intuition of exponential functions
- sin and cos functions
- Common mathematical functions
- Activation functions in deep learning
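To make this concrete, here is a minimal Python sketch (my own illustration, assuming NumPy and Matplotlib are installed) that plots the shapes of a few such functions; the particular functions chosen are just examples, not taken from the linked resources.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)

# a few functions whose shapes are worth knowing by heart
curves = {
    "sigmoid": 1.0 / (1.0 + np.exp(-x)),   # squashes values into (0, 1)
    "tanh": np.tanh(x),                     # squashes values into (-1, 1)
    "relu": np.maximum(0.0, x),             # zero for negative inputs, linear after
    "gaussian": np.exp(-x ** 2),            # radial basis / bell shape
}

for name, y in curves.items():
    plt.plot(x, y, label=name)
plt.legend()
plt.title("Shapes of commonly used functions")
plt.show()
```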
Calculus review
Since most machine learning models lend themselves to a geometric interpretation, I’d recommend building a geometric intuition for the mathematical concepts as well. The resources suggested here were chosen because their explanations are framed geometrically. The links that follow help in visualizing derivatives and critical points, and in using them to sketch functions. You’ll see that this interpretation helps you develop a deeper appreciation of optimization algorithms like gradient descent, RMSprop and stochastic gradient descent:
Also, don’t forget to review the chain rule needed for backpropagation algorithms:
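As a toy illustration of how the geometry of derivatives feeds into optimization, here is a minimal sketch of plain gradient descent on a one-dimensional function; the function, starting point and learning rate are arbitrary choices of mine.

```python
# minimize f(x) = (x - 3)^2 with plain gradient descent;
# the derivative f'(x) = 2 * (x - 3) follows from the chain rule
# applied to the outer square and the inner shift
def grad(x):
    return 2.0 * (x - 3.0)

x = 10.0   # arbitrary starting point
lr = 0.1   # learning rate
for _ in range(50):
    x -= lr * grad(x)

print(x)   # converges towards the minimizer x = 3
```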
What follows is an article on using integrals to compute areas. I find this interpretation important when working on Monte Carlo simulations:
Here is another article that discusses integration as area and relates it to the dot product (the mother of all machine learning and feature engineering).
Finally, it is important to understand the interplay of integration and probability distributions:
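As a small illustration of the integration-as-area idea, here is a hedged Monte Carlo sketch that estimates a simple integral by averaging the integrand over uniform random samples; the integrand is my own choice, not one from the linked articles.

```python
import numpy as np

rng = np.random.default_rng(0)

# estimate the area under f(x) = x^2 on [0, 1]; the exact value is 1/3
n = 1_000_000
samples = rng.uniform(0.0, 1.0, size=n)
estimate = (samples ** 2).mean()   # E[f(U)] for U ~ Uniform(0, 1) equals the integral

print(estimate)   # should be close to 0.3333
```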
If it is easier for you to follow videos instead of reading, here is a link to some exceptional, visually explained lectures on calculus: Cal lectures.
Linear Algebra
Linear algebra is the core of feature engineering, basis selection, dimensionality reduction, matrix factorization, spectral clustering, deep learning and generalized linear models, to name a few. Almost all algorithms have a linear algebra interpretation to them; therefore, it is the backbone of machine learning. I haven’t found a better resource than Gilbert Strang’s lecture series on linear algebra. It can be arduous to follow through all of them, but it will be highly rewarding if you do. If you just want the essentials for machine learning from the complete course, take the following of his lectures:
- Lec 6: Column space and null space
- Lec 9: Independence, basis and Dimension
- Lec 14: Orthogonal vectors
- Lec 15: Projections onto subspaces
- Lec 16: Projection matrices and least squares
- Lec 21: Eigenvalues and Eigenvectors
- Lec 29: Singular Value Decomposition
- Lec 31: Change of basis
The mathematical background provided by Ian Goodfellow is quite thorough, although somewhat dry. It is a good review nevertheless, so I’m referring to it here. If you are taking only the selected lectures, do go through this small series of linear algebra videos before embarking on them. It is one of the best visually explained lecture series that I’ve come across.
Finally, these are some beautiful JavaScript visualizations for understanding eigenvalues, eigenvectors, PCA and least squares.
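If you prefer to see these ideas in code, here is a minimal NumPy sketch (my own illustration, not taken from the resources above) that touches least squares, eigendecomposition and SVD/PCA on a small synthetic matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))                          # tall data matrix
b = A @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

# least squares: project b onto the column space of A
coef, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print("least-squares coefficients:", coef)

# eigendecomposition of the symmetric matrix A^T A
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
print("eigenvalues of A^T A:", eigvals)

# SVD of the centered data; the right singular vectors are the PCA directions
U, s, Vt = np.linalg.svd(A - A.mean(axis=0), full_matrices=False)
print("principal directions:\n", Vt)
```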
Probability and Statistics
Read the first two parts of Larry Wasserman’s All of Statistics (you can skip Chapters 5, 6, 12 and 13). In addition, do look through the following articles:
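As a taste of the kind of tool the book covers, here is a small sketch of a nonparametric bootstrap confidence interval for a mean; the data, sample size and number of resamples are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)   # a skewed sample

# nonparametric bootstrap: resample with replacement and recompute the mean
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(5000)
])
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print("95% bootstrap CI for the mean: (%.2f, %.2f)" % (lower, upper))
```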
Programming
The usual candidates for data science programming languages are Python, R and Scala. As this post is directed towards industry folks, my recommendation is to go with Python as your primary language. Python is already used by many organizations and is a complete, production-ready programming language. Integration is also easier, as it has good support for web frameworks and the larger engineering environment. Furthermore, TensorFlow provides a rich environment for production-ready machine learning deployments and has good high-level API support for Python. PyTorch (albeit a bit behind in terms of production-grade pipelines) is also Python-first. Also, if you look into cloud deployment solutions (AWS and GCP), Python support is far stronger than R’s (Azure being the exception).
You might use R in the following cases:
- You have small data sets and need to make beautiful visualizations.
- You are working on a cutting-edge statistical problem; as the statistics research community works heavily in R, it is easier to find R packages accompanying recent papers than Python implementations.
- You are working purely on a research problem; in my experience, it is faster to prototype a machine learning pipeline in R than in Python.
A final note on statistical packages: I’ve started working with TensorFlow, and especially the Edward2 probabilistic programming language, for statistical computation. You can write Bayesian models, MCMC problems, variational approximation methods and trainable distributions. There are numerous online resources available for learning Python. I’d suggest developing a thorough understanding of data structures and algorithms in Python instead of diving directly into the world of ‘dataframes’ and ‘scikit-learn’. This free online book is a great resource, replete with excellent examples on data structures and algorithms; I’d suggest doing the first seven chapters with all the sample problems: Problem solving with algorithms and data structures.
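To give a flavour of the data structures and algorithms emphasis, here is a short sketch of binary search in plain Python, the kind of exercise the book’s early chapters drill; it is my own example rather than one from the book.

```python
def binary_search(items, target):
    """Return the index of target in a sorted list, or -1 if it is absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1    # discard the left half
        else:
            hi = mid - 1    # discard the right half
    return -1

print(binary_search([1, 3, 5, 7, 11, 13], 7))   # 3
print(binary_search([1, 3, 5, 7, 11, 13], 4))   # -1
```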
Finally, I cannot stress enough how important it is to consistently sharpen your programming skills through practice. LeetCode and Coding games are great platforms that provide interesting programming problems and games to solve, so get into the habit of solving them.
Data Engineering
This might be a digression from pure data science. At best, you need a high-level knowledge of data engineering pipelines so that it is easier for you to integrate your work with real-time, big data systems in a secure manner. There is no alternative to real-world experience, but to get basic knowledge of data engineering, I’d suggest taking one of these data engineering certifications and following the coursework needed for it. Preparing for any one of them will give you a significant understanding of the state-of-the-art technologies and infrastructure needed for data engineering pipelines.
I’d recommend the CCP certification if most of your work is on on-premise systems. Otherwise, both the AWS and GCP certifications are equally challenging and rewarding. Once certified, you’ll know how to build and maintain databases; analyze data to enable machine learning; and, most importantly, design for reliability, security and compliance.
Google Cloud Certified Data Engineer
As I’ve only done the GCP certification, what follows are some useful resources for it. Start off with the five-course Data Engineering specialization on Coursera. There is no need to give it too much time; play around with the Qwiklab links that you find personally interesting. Other than that, the key takeaway should be learning about all the major GCP components for data engineering; a small Apache Beam sketch follows the list of links below.
- Data lifecycle on GCP
- Transferring Big DataSets to cloud Platform
- BigQuery Basics
- BigQuery Cost optimization
- BigQuery Performance
- BigQuery Storage
- BigQuery access control
- BigQuery nested and repeated columns
- BigQuery streaming inserts
- Authorized Views in BigQuery
- Query plan explanation
- Handling sensitive Data for machine learning
- Why Quizlet chose GCP
- Large Scale Ingestion
- Financial Time Series analysis
- Real Time Processing IOT
- Processing Logs at Scale using DataFlow
- Transitioning from Data Warehousing in Teradata to GCP Big Data Services
- Financial Time Series on GCP
- Encryption at rest
- Encryption in transit
- Best Practices for datastore
- Cloud DataFlow Use Cases
- Cloud Dataflow use cases part 2
- 12 Myths regarding BigQuery
- Bigtable overview
- Bigtable schema design
- Bigtable performance optimization
- Bigtable time series schema design
- Bigtable performance
- Choosing between SSD and HDD for Bigtable
- Apache beam programming guide
- Streaming 101
- Streaming 102
- Using cloud dataprep and cloud ml in a unified timeline
- Spotify’s Event delivery
- Data Science on the Google Cloud Platform
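Here is the Apache Beam sketch promised above: a minimal word-count-style pipeline using the Beam Python SDK on the local DirectRunner (assuming the apache-beam package is installed). Cloud Dataflow runs essentially the same code with a different runner; the input strings are made up for illustration.

```python
import apache_beam as beam

# runs on the local DirectRunner by default; swap in DataflowRunner
# (plus GCP options) to execute the same pipeline on Cloud Dataflow
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["gcp,bigquery", "gcp,dataflow", "aws,kinesis"])
        | "Split" >> beam.FlatMap(lambda line: line.split(","))
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```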
Big Data Paradigm
I’ve suggested one of the data engineering certifications because most of the components used by one cloud provider have equivalent services offered by the others. What follows is a table mapping these services:
Data Engineering Product | AWS | GCP | Azure | On-premise |
---|---|---|---|---|
Object Storage | Amazon Simple Storage Service | Cloud Storage | Azure Blob Storage | HDFS |
Block Storage | Amazon Elastic Block Store | Persistent Disk | Disk Storage | Hard disks |
File Storage | Amazon Elastic File System | Cloud Filestore | Azure File Storage | HDFS |
RDBMS | Amazon Relational Database Service, Amazon Aurora | Cloud SQL, Cloud Spanner | SQL Database | PostgreSQL |
Reduced-availability Storage | Amazon S3 Standard-Infrequent Access, Amazon S3 One Zone-Infrequent Access | Cloud Storage Nearline | Azure Cool Blob Storage | HDFS |
Archival Storage | Amazon Glacier | Cloud Storage Coldline | Azure Archive Blob Storage | HDFS |
NoSQL: Key-value | Amazon DynamoDB | Cloud Datastore, Cloud Bigtable | Table Storage | HBase/Cassandra |
NoSQL: Indexed | Amazon SimpleDB | Cloud Datastore | Cosmos DB | MongoDB |
Batch Data Processing | Amazon Elastic MapReduce, AWS Batch | Cloud Dataproc, Cloud Dataflow | HDInsight, Batch | Hadoop/Spark |
Stream Data Processing | Amazon Kinesis | Cloud Dataflow | Stream Analytics | Apache Beam/Spark stream |
Stream Data Ingest | Amazon Kinesis | Cloud Pub/Sub | Event Hubs, Service Bus | Kafka |
Analytics | Amazon Redshift, Amazon Athena | BigQuery | Data Lake Analytics, Data Lake Store | Kudu/GreenPlum |
Workflow Orchestration | Amazon Data Pipeline, AWS Glue | Cloud Composer | | Apache Airflow |
Monitoring | Amazon CloudWatch | StackDriver Monitoring | Application Insights | Elastic Search/Solr |
Fully Managed ML | Amazon SageMaker | Cloud Machine Learning Engine | ML Studio | KubeFlow/Seldon.io |
In addition to this, I’d suggest taking the first course of the Data Science at Scale specialization to get a cursory understanding of Hadoop, Spark and the big data design space.
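To get a first hands-on feel for Spark before (or alongside) that course, here is a minimal PySpark sketch of a word count over a text file; the input path is a hypothetical placeholder and the snippet assumes the pyspark package is installed.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# "data/sample.txt" is a placeholder path; point it at any text file
lines = spark.read.text("data/sample.txt")

counts = (
    lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
         .groupBy("word")
         .count()
         .orderBy(F.desc("count"))
)

counts.show(10)
spark.stop()
```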
Machine Learning
For machine learning, I highly recommend taking the Machine Learning specialization on Coursera. It is a four-course specialization that touches on all the important topics. The first course gives an overview of ML techniques and use cases. The second course has you implement regression and gradient descent, modify it for ridge and lasso, perform feature selection and get an introduction to kernel regression. The third course has implementation exercises on logistic regression, maximum likelihood, regularization, decision trees, boosting and scaling. While implementing these methods, you get to know quite a lot about different optimization techniques and machine learning scalability. The final course starts with information retrieval and which similarity metrics are suitable for which use cases, then follows with implementation details on KD-trees and locality-sensitive hashing. The next section addresses k-means, k-means++ and their implementation using MapReduce, followed by Gaussian mixture models and expectation maximization. The next week covers the Latent Dirichlet Allocation model and Gibbs sampling, and the final week covers hierarchical clustering. Just as in the previous courses, you’ll have to implement all of these algorithms. This is not an easy specialization and will require considerable time and attention to complete, but it stands out as one of the best machine learning specializations out there. The link is given below:
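As a small taste of what the second course has you build, here is a hedged scikit-learn sketch contrasting ridge and lasso on synthetic data; it uses library implementations rather than the from-scratch versions the specialization asks for, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# synthetic data where only 5 of the 20 features carry signal
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("ridge", Ridge(alpha=1.0)), ("lasso", Lasso(alpha=1.0))]:
    model.fit(X_train, y_train)
    # lasso tends to drive irrelevant coefficients exactly to zero;
    # ridge only shrinks them towards zero
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"{name}: R^2 = {model.score(X_test, y_test):.3f}, zero coefficients = {n_zero}")
```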
In parallel with this specialization, you can read “An Introduction to Statistical Learning with Applications in R”. Please read it thoroughly: it is an easy-to-read book that covers machine learning from a statistical perspective and covers a few topics not addressed in the ML specialization (confidence intervals in regression, splines, SVMs, QDA and linear discriminant analysis, resampling methods and random forests, to name a few). As it is an introductory book, it is imperative to know each and every concept discussed in it. The good part is that it won’t take more than a week to finish.
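To connect the book’s material to code, here is a minimal scikit-learn sketch of a random forest evaluated with cross-validation, two of the topics the book adds; the dataset and hyperparameters are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# a small classification dataset that ships with scikit-learn
X, y = load_breast_cancer(return_X_y=True)

model = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```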
Deep Learning
If you want a decent introduction to deep learning, the Deep Learning specialization by Andrew Ng is the best resource out there. The exercises have you implement the key components of deep learning models (the code snippets are quite simple and the comments almost provide the needed code, so they are quite easy to complete). The second and third courses of the specialization cover practical issues as well, including hyperparameter tuning, weight initialization, faster optimization algorithms, gradient checking, vanishing and exploding gradients and batch normalization. Once you’ve completed this specialization, I’d highly recommend taking the Advanced Machine Learning with TensorFlow on Google Cloud Platform specialization. It covers the practical aspects of machine learning quite well, including machine learning DevOps, hybrid deployments (Kubeflow), adaptable machine learning, the use of managed services and TPUs/GPUs/CPUs. If you just want the essentials from this specialization that generalize beyond GCP, take the second course: Production Machine Learning Systems. I’ve found that this specialization provides better advice on the practical aspects of deep learning than Andrew Ng’s Deep Learning specialization.
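As a tiny illustration of some of the practical choices covered there (He initialization, batch normalization, an adaptive optimizer), here is a hedged Keras sketch of a small classifier; the layer sizes and input dimension are made up, and it assumes a recent TensorFlow 2.x install.

```python
import tensorflow as tf

# a small fully connected binary classifier
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),                  # 20 input features (arbitrary)
    tf.keras.layers.Dense(64, kernel_initializer="he_normal", use_bias=False),
    tf.keras.layers.BatchNormalization(),                # normalize pre-activations
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```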
Machine learning frameworks
For industry and production-ready systems, go with TensorFlow if you are programming in Python, and H2O if you are working in R. TensorFlow is, by far, the best machine learning framework for going into production, whether you are doing cloud or on-premise deployment. Its portability means that models can be deployed on almost any operating system and computing device. Whenever you are judging a machine learning platform, be mindful of the following watch points:
- Does it provide capability of fast prototyping?
- Does the framework provide model check pointing?
- Can it handle out of memory datasets?
- Does it provide a good method for monitoring training and validation errors during training sessions?
- Can it provide distributed training?
- Does it provide a method for hyperparameter tuning?
- Does it provide a framework for serving predictions from a trained model?
TensorFlow addresses all of these watch points and is, therefore, the most production-ready machine learning framework on the market.
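To make a few of these watch points concrete, here is a hedged TensorFlow/Keras sketch showing a tf.data input pipeline (the mechanism for handling out-of-memory datasets when it streams from files), per-epoch checkpointing and TensorBoard logging for monitoring; the toy data, file names and hyperparameters are my own placeholders, and it assumes a recent TensorFlow 2.x install.

```python
import numpy as np
import tensorflow as tf

# toy in-memory data; in a real project the Dataset would stream from
# files (TFRecords, CSV, ...) so the full dataset never has to fit in memory
X = np.random.rand(1000, 8).astype("float32")
y = (X.sum(axis=1) > 4).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(1000).batch(32)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    # model checkpointing: persist the model after every epoch
    tf.keras.callbacks.ModelCheckpoint("model_{epoch:02d}.keras"),
    # monitoring: write metrics that TensorBoard can visualize
    tf.keras.callbacks.TensorBoard(log_dir="logs"),
]
model.fit(dataset, epochs=3, callbacks=callbacks)
```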