Data Ingestion

Data ingestion is the process of moving data from one place to another: from source systems into storage. It is the second stage of the data life cycle, and likewise the second stage of the data engineering life cycle.

There are many questions to ask when you are planning to build a data ingestion system.

You get data from source systems, which are likely not under your control. However, you can ask questions and learn about the data so you are prepared. How reliable are the source and ingestion systems? What are the use cases for the data I am ingesting? What is the data's format and volume? How frequently will I be ingesting the data? Many more questions need to be answered.
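One lightweight way to make the answers concrete is to record them alongside the pipeline code. The sketch below is a minimal, hypothetical Python example; the class name, fields, and values are assumptions for illustration, not a standard schema.

```python
from dataclasses import dataclass

@dataclass
class IngestionSpec:
    """Hypothetical record of answers to the planning questions above."""
    source_name: str           # e.g. "orders_db"
    data_format: str           # e.g. "json", "csv", "avro"
    expected_volume_gb: float  # rough daily volume
    frequency: str             # e.g. "hourly batch", "continuous stream"
    source_reliability: str    # e.g. "99.9% uptime SLA"
    downstream_use_cases: list[str]

spec = IngestionSpec(
    source_name="orders_db",
    data_format="json",
    expected_volume_gb=5.0,
    frequency="hourly batch",
    source_reliability="99.9% uptime SLA",
    downstream_use_cases=["daily sales report", "inventory forecasting"],
)
```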

Batch vs Streaming

Essentially, all data is streaming. Orders or sales at a retail store, for example, stream in all the time (hopefully). Batch processing is just a convenient way of breaking the stream into manageable chunks. A chunk might represent a day's sales, or a day's production on an assembly line.

Batch data is ingested either when the data reaches a certain size threshold or on a predetermined time interval.
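To make the size-or-time trigger concrete, here is a minimal sketch of chunking a stream into batches. The function name and thresholds are arbitrary choices for illustration, not recommendations.

```python
import time

def batch_stream(records, max_size=1000, max_wait_seconds=60.0):
    """Group a record stream into batches, flushing when either the size
    threshold or the time interval is reached. (The time check only runs
    when a record arrives, which is fine for a sketch.)"""
    batch, deadline = [], time.monotonic() + max_wait_seconds
    for record in records:
        batch.append(record)
        if len(batch) >= max_size or time.monotonic() >= deadline:
            yield batch
            batch, deadline = [], time.monotonic() + max_wait_seconds
    if batch:  # flush whatever is left when the stream ends
        yield batch

# Example: chunk a simulated stream of sales events into batches of 3.
for chunk in batch_stream(range(10), max_size=3, max_wait_seconds=5.0):
    print(chunk)  # [0, 1, 2], [3, 4, 5], [6, 7, 8], [9]
```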

Push vs Pull

In the push model, the source system writes data out to a target system, which might be a file system, a database, or an object store. In the pull model, the roles are reversed: the ingesting system takes the initiative and retrieves data from the source system.
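The difference is easiest to see side by side. The sketch below uses tiny in-memory stand-ins for the source and target; both classes and their methods are hypothetical.

```python
class Target:
    """Stand-in for a file system, database, or object store."""
    def __init__(self):
        self.stored = []
    def write(self, event):
        self.stored.append(event)

class Source:
    """Stand-in for a source system holding timestamped events."""
    def __init__(self, events):
        self.events = events  # list of (timestamp, payload) tuples
    def read_new(self, since):
        return [e for e in self.events if e[0] > since]

# Push: the source takes the initiative, writing each event to the target.
target = Target()
for event in [(1, "order-a"), (2, "order-b")]:
    target.write(event)

# Pull: the ingesting system takes the initiative, asking the source
# for everything it has not seen yet.
source = Source([(1, "order-a"), (2, "order-b"), (3, "order-c")])
new_events = source.read_new(since=1)  # -> [(2, "order-b"), (3, "order-c")]
print(target.stored, new_events)
```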

In the extract, transform, and load (ETL) process, the extract step by definition implies a pull model: the ingestion system reaches into the source and retrieves the data itself.
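As a concrete pull-style extract, the sketch below has the ingestion job connect to a SQLite source database and retrieve new rows itself. The table, columns, and path are hypothetical placeholders.

```python
import sqlite3

def extract_orders(db_path, since_id):
    """Pull new rows from a (hypothetical) orders table in the source."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(
            "SELECT id, amount FROM orders WHERE id > ?", (since_id,)
        )
        return cursor.fetchall()
    finally:
        conn.close()

# Usage: the ingestion job initiates the read, e.g. on an hourly schedule.
# rows = extract_orders("source_orders.db", since_id=1000)
```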
