Data Engineering Life Cycle


The data engineering life cycle (lifecycle) helps us to look at the bigger picture. It’s tempting to dive right into the technologies themselves before understanding the landscape. However, the data engineering lifecycle shifts the conversation away from technology and toward the data itself and the organization’s objectives.

Simply put, data engineers get data, store it and prepare it for data analysts, data scientists, machine learning analysts, business intelligence analysts and others. Data engineering sits “upstream” from data science and analytics. However, some people argue that data engineering is a subdiscipline of data science. Joe Reis and Matt Housley, the authors of the book Fundamentals of Data Engineering, would disagree. I agree with them: data engineering is separate from data science.

Data engineers are increasingly in demand in the job market, partly because of the growing volume of data and partly because of the large amount of time that data analysts and data scientists spend gathering and cleaning data before they can find insights and run machine learning programs. In a perfect world, data engineers would take some of that work away from the data analysts and scientists. Why? Data engineers are trained to do that work and data scientists are not.

Below is the data engineering lifecycle diagram from the book Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Joe Reis and Matt Housley, published by O’Reilly in 2022. It currently rates a high 4.7 at amazon.com.


The data engineering lifecycle is a subset of the data lifecycle. It starts with generation, then moves into ingestion, transformation and serving, with storage underpinning all three. The data is served to three different areas: data analytics, machine learning and reverse ETL. Running beneath the whole lifecycle are the undercurrents: security, data management, DataOps, data architecture, orchestration and software engineering.
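To make the stages concrete, here is a minimal Python sketch of the flow described above. It is only an illustration, not something taken from the book: the Storage class, the function names and the sample order records are all made up for this post, and a real system would use actual source systems, storage and serving layers.

```python
from dataclasses import dataclass, field


@dataclass
class Storage:
    """Hypothetical in-memory store standing in for the storage layer."""
    tables: dict = field(default_factory=dict)

    def write(self, name, rows):
        self.tables[name] = list(rows)

    def read(self, name):
        return self.tables[name]


def generate():
    # Generation: a source system emits raw records (made-up sample data).
    return [
        {"order_id": 1, "amount": "19.99"},
        {"order_id": 2, "amount": "5.00"},
        {"order_id": 3, "amount": None},  # incomplete record
    ]


def ingest(storage, records):
    # Ingestion: land the raw records in storage unchanged.
    storage.write("raw_orders", records)


def transform(storage):
    # Transformation: drop incomplete rows and cast amounts to numbers.
    clean = [
        {"order_id": r["order_id"], "amount": float(r["amount"])}
        for r in storage.read("raw_orders")
        if r["amount"] is not None
    ]
    storage.write("clean_orders", clean)


def serve(storage):
    # Serving: expose a result for analytics, machine learning or reverse ETL.
    rows = storage.read("clean_orders")
    return {
        "order_count": len(rows),
        "total_amount": sum(r["amount"] for r in rows),
    }


storage = Storage()
ingest(storage, generate())
transform(storage)
summary = serve(storage)
print(summary)
```

Notice that every stage after generation reads from or writes to the same Storage object. That is the sense in which storage underpins ingestion, transformation and serving in this sketch.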

The Data Engineer

The skill set of a data engineer encompasses the data engineering lifecycle and its undercurrents. The conversation used to center on the technologies; however, because of the rapid pace of change, it’s better to approach the topic from a lifecycle perspective. The data-tooling world is now easier to work in thanks to advances in the tools themselves. Data engineers should have a good understanding of what the people they work closely with do, so that they can work better with them. Data engineers should understand, but not do, the following: create reports, build machine learning models, perform data analysis, create dashboards, or develop software applications.

Other Lifecycle Models

We have a post on the Data Analytics Life Cycle. It is a more general, higher-level framework than the data engineering lifecycle presented here.
