The Growth of Data



Some computing history is necessary to understand the big data landscape and where it might go from here.

For most of their history, computers became faster every year through processor speed increases: each year's new processors could run more instructions per second than the previous year's. Moore's law, stated by Gordon Moore in 1965, is the observation (and projection of a historical trend) that the number of transistors in an integrated circuit doubles about every two years.

Meanwhile, in the 1990s we often measured large volumes of data in terabytes. A terabyte (TB) is 1000 gigabytes. Large organizations stored their data as structured rows and columns in relational database management systems (RDBMS) and data warehouses. In the 2000s we saw new kinds of data sources, such as text, documents, spreadsheets, PDF files, and others. Now we measure large data volumes in petabytes (1 PB = 1000 TB).

Collectively we are generating about 2.5 million terabytes of data per day, according to a course on SQL at Coursera. The same course estimates that by 2025 (next year at the time of this writing) we will have 175 zettabytes of data in total, which is 175 billion terabytes. An exabyte (EB) is one quintillion (10^18) bytes, or one billion gigabytes; one billion gigabytes is one million terabytes (TB).
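To keep these unit conversions straight, here is a quick sketch in Python (chosen only for illustration) that checks the arithmetic above using decimal, SI-style units:

```python
# Decimal (SI) storage units, as used in the figures quoted above.
GB, TB, PB, EB, ZB = 10**9, 10**12, 10**15, 10**18, 10**21

# One exabyte is one quintillion bytes = one billion gigabytes = one million terabytes.
assert EB == 10**18
assert EB // GB == 10**9
assert EB // TB == 10**6

# 175 zettabytes expressed in terabytes: 175 billion TB.
print(175 * ZB // TB)    # 175000000000

# Roughly 2.5 million TB generated per day, expressed in exabytes.
print(2.5e6 * TB / EB)   # 2.5
```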

The trend of hardware speed growth stopped around 2005 due to hard limits on heat dissipation: hardware developers stopped making individual processors faster and switched to adding more parallel CPU cores, all running at the same speed. Yet we still need growing computing power to handle all of the big data we collect.

In the 2010s and beyond, large volumes of data are measured in exabytes (1 EB = 1000 PB). Now we have massive amounts of data in YouTube, social media, pictures, medical information, and more.

The cost to store 1 TB of data continues to roughly halve every 14 months, meaning that it is very inexpensive for organizations of all sizes to store large amounts of data. Moreover, many of the technologies for collecting data (sensors, cameras, public datasets, etc.) continue to drop in cost and improve in resolution. Collecting data is extremely inexpensive.
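As a rough illustration of what halving every 14 months implies, the price after t months scales by 0.5^(t/14). The sketch below uses an assumed $50-per-TB starting price purely as a placeholder, not a figure from this article:

```python
# Hypothetical cost-per-TB projection, assuming the price halves every 14 months.
# The $50 starting price is an assumed placeholder, not a real quote.
def cost_per_tb(start_price: float, months: float, halving_months: float = 14.0) -> float:
    """Return the price after `months`, given a halving period in months."""
    return start_price * 0.5 ** (months / halving_months)

for years in (0, 1, 2, 5):
    print(f"{years} years: ${cost_per_tb(50.0, years * 12):.2f} per TB")
# After 5 years (60 months), the price has fallen by about 2**(60/14) ≈ 19x.
```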

These days big data processing requires large, parallel computations, often on clusters of machines. Software developed over the past 50 years cannot automatically scale up, and neither can the traditional programming models for data processing applications, creating the need for new programming models. This is the world that Apache Spark was built for.
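As a minimal sketch of the kind of parallel programming model Spark offers (assuming a local PySpark installation; this word count is a standard illustrative example, not code from this series):

```python
# Minimal PySpark word count, run locally across all available CPU cores.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("WordCount").getOrCreate()
sc = spark.sparkContext

# A tiny in-memory dataset; in practice this would be files on distributed storage.
lines = sc.parallelize([
    "big data needs parallel processing",
    "spark distributes the work across a cluster",
])

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum counts per word in parallel

print(counts.collect())
spark.stop()
```

The same program can run on a cluster by pointing the master at a cluster manager instead of local[*]; that ability to scale out without rewriting the application is what these newer programming models provide.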

