This might seem like a digression from pure data science, but one of the essential requirements for real-world data science and end-to-end machine learning is setting up data engineering pipelines. At times, you might just need to pull data in from a database, split it into training and testing sets, and validate your models. But in production, you'll need to engineer the data pipelines yourself, or at least advise the data engineers on how to design the upstream systems so your machine learning system works as expected. There is no substitute for real-world experience; but to build basic knowledge of data engineering, I'd suggest taking one of these data engineering certifications and following the coursework needed for it. Preparing for any one of these will give you a significant understanding of the state-of-the-art technologies and infrastructure needed for data engineering pipelines.
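The "pull data from a database, then split it" step above can be sketched in a few lines. This is a minimal, hedged example using an in-memory SQLite table as a stand-in for a real upstream database; the table and column names are hypothetical, and in practice you'd query your warehouse and likely use a library such as scikit-learn for the split:

```python
import sqlite3
import random

# In-memory table standing in for a real upstream database
# (table and column names here are made up for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (feature REAL, label INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(float(i), i % 2) for i in range(100)],
)

# Pull the rows with plain SQL, then make a reproducible 80/20 split.
rows = conn.execute("SELECT feature, label FROM samples").fetchall()
rng = random.Random(42)  # fixed seed so the split is repeatable
rng.shuffle(rows)
cut = int(0.8 * len(rows))
train, test = rows[:cut], rows[cut:]
print(len(train), len(test))  # 80 20
```

Once your data lives in a managed warehouse rather than a single database, the pulling step becomes its own pipeline, which is exactly where the data engineering skills below come in.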

I’d recommend the CCP certification if most of your work is on on-premises systems. Otherwise, the AWS and GCP certifications are equally challenging and rewarding. Once certified, you’d know how to build and maintain databases; analyze data to enable machine learning; and, most importantly, design for reliability, security, and compliance.

Google Cloud Certified Data Engineer

As I’ve only done the GCP certification, what follows are some useful resources for it. Start off with the five-course Data Engineering specialization on Coursera. There is no need to give it too much time; play around with the Qwiklab links that you find personally interesting. Beyond that, the key takeaway should be learning about all the major GCP components for data engineering. In addition, here is a curation of links that I found useful during my preparation: