This might seem like a digression from pure data science. But one of the essential requirements for real-world data science and end-to-end machine learning is setting up data engineering pipelines. At times, you might just need to pull data in from a database, split it into training and testing sets, and validate your models. But in production, you'll need to engineer the data pipelines yourself, or at least advise the data engineers on how to design the upstream systems so that your machine learning system works as expected. There is no alternative to real-world experience; but to build basic knowledge of data engineering, I'd suggest taking one of these data engineering certifications and following the coursework it requires. Preparing for any one of them will give you a significant understanding of the state-of-the-art technologies and infrastructure needed for data engineering pipelines.
I’d recommend the CCP certification if most of your work is on on-premises systems. Otherwise, both the AWS and GCP certifications are equally challenging and rewarding. Once certified, you’d know how to build and maintain databases; analyze data to enable machine learning; and, most importantly, design for reliability, security, and compliance.
Google Cloud Certified Data Engineer
As I’ve only done the GCP certification, what follows are some useful resources for it. Start off with the five-course Data Engineering specialization on Coursera. There is no need to give it too much time; play around with the Qwiklab links that you find personally interesting. Other than that, the key takeaway should be learning about all the major GCP components for data engineering. In addition to that, here is a curation of some links that I found useful during my preparation:
- Data lifecycle on GCP
- Transferring big data sets to Cloud Platform
- BigQuery Basics
- BigQuery Cost optimization
- BigQuery Performance
- BigQuery Storage
- BigQuery access control
- BigQuery nested and repeated columns
- BigQuery streaming inserts
- Authorized Views in BigQuery
- Query plan explanation
- Handling sensitive Data for machine learning
- Why Quizlet chose GCP
- Large Scale Ingestion
- Financial Time Series analysis
- Real-time processing for IoT
- Processing logs at scale using Dataflow
- Transitioning from Data Warehousing in Teradata to GCP Big Data Services
- Financial Time Series on GCP
- Encryption at rest
- Encryption in transit
- Best practices for Datastore
- Cloud Dataflow use cases
- Cloud Dataflow use cases, part 2
- 12 Myths regarding BigQuery
- Bigtable overview
- Bigtable schema design
- Bigtable performance optimization
- Bigtable time series schema design
- Bigtable performance
- Choosing between SSD and HDD for Bigtable
- Apache beam programming guide
- Streaming 101
- Streaming 102
- Using Cloud Dataprep and Cloud ML in a unified timeline
- Spotify’s Event delivery
- Data Science on the Google Cloud Platform
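To give a flavor of one of the topics above, BigQuery's nested and repeated columns let you store an array inside a row and flatten it at query time with `UNNEST`. Here is a minimal stdlib-only Python sketch of that flattening semantics; the `events` field and the sample rows are hypothetical, purely for illustration:

```python
# A BigQuery table with a repeated (ARRAY) field, modeled as Python dicts.
# The equivalent BigQuery standard SQL would be roughly:
#   SELECT name, event FROM `dataset.table`, UNNEST(events) AS event
rows = [
    {"name": "alice", "events": ["login", "purchase"]},
    {"name": "bob", "events": ["login"]},
]

def unnest(rows, repeated_field, alias):
    """Emit one flat row per array element, mirroring UNNEST's
    cross-join semantics: rows with empty arrays produce nothing."""
    for row in rows:
        for item in row[repeated_field]:
            flat = {k: v for k, v in row.items() if k != repeated_field}
            flat[alias] = item
            yield flat

flat_rows = list(unnest(rows, "events", "event"))
# alice contributes two flat rows, bob one
```

The point of the nested layout is that related facts travel with the row (no join needed at read time), and you only pay the flattening cost in queries that actually need element-level granularity.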