SQL Data Introduction


This post is based on the stairway series at SQL Server Central called Stairway to Data, by Joe Celko.

Data are facts, but we want facts represented in such a way that they can be manipulated by a computer. Even more than that, we want a digital computer doing that manipulation. As Joe Celko says, “This narrowed definition disallows maps, pictures, video, music, dance, speeches, literature and a lot of good things that give us information. Life is full of trade-offs. The trade-off here is that a machine can do the heavy lifting for us.”

The most basic distinction we can make is between things that are continuous versus those that are discrete. A discrete set has individual members that you can pick out. We like discrete data because it can be made digital instantly. Each value gets a symbol that we can manipulate. In theory, a set can be countably infinite, but we do not actually find infinite sets lying around in the database world.

In a continuum, you can find individual members, but there are always more members. Look at an analog wall clock; what time is it exactly? The second hand is always moving so the best you can do is to round the time to the nearest minute, or second, or smaller depending on the tools you have access to.
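To make the clock example concrete, here is a minimal Python sketch of rounding a reading to a chosen granularity. The function name `round_to` and the sample timestamp are made up for illustration; they are not from Celko's series.

```python
from datetime import datetime, timedelta

def round_to(moment: datetime, step: timedelta) -> datetime:
    """Round a timestamp to the nearest multiple of `step` since midnight."""
    midnight = moment.replace(hour=0, minute=0, second=0, microsecond=0)
    # Work in whole microseconds so no floating-point error creeps in.
    step_us = step // timedelta(microseconds=1)
    elapsed_us = (moment - midnight) // timedelta(microseconds=1)
    rounded_us = ((elapsed_us + step_us // 2) // step_us) * step_us
    return midnight + timedelta(microseconds=rounded_us)

# A hypothetical reading of the wall clock: 10:15:42.7
reading = datetime(2024, 1, 1, 10, 15, 42, 700000)
nearest_second = round_to(reading, timedelta(seconds=1))
nearest_minute = round_to(reading, timedelta(minutes=1))
```

The same reading becomes a different stored value depending on the step we pick, which is the point: the continuum has to be cut into discrete values before a digital computer can hold it.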

Quantitative data allows for mathematical operations: it makes sense to add kilograms to kilograms and get kilograms. Qualitative data assumes things are not uniform. Individual elements in the set have attributes that vary from one element to the next. For example, given two eggs, one is a Grade ‘A’ egg and the other is a Grade ‘B’ egg.
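A small Python sketch of the difference, using made-up egg weights: the weights can be summed, while the grades can only be counted or compared as labels.

```python
from collections import Counter
from decimal import Decimal

# Quantitative: egg weights in kilograms (illustrative values).
weights_kg = [Decimal("0.060"), Decimal("0.055")]
total_kg = sum(weights_kg, Decimal("0"))  # kilograms + kilograms = kilograms

# Qualitative: grades are labels, so we count them rather than add them.
grades = ["A", "B"]
grade_counts = Counter(grades)
```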

Information is what you get when you distill data. A collection of raw facts does not help anyone to make a decision until it is reduced to a higher level abstraction. You could keep track of how much money you have at any point in time and then track the amount over a period of years. That history gives you a long time horizon into the past and a basis for making predictions about the future. The information is qualitative and not just quantitative.
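As a rough illustration of distilling data, the sketch below loads a few balance readings into an in-memory SQLite database and reduces them to one average per year. The table name, column names and amounts are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE balances ("
    " reading_date TEXT NOT NULL PRIMARY KEY,"
    " amount NUMERIC NOT NULL)"
)
# Raw facts: how much money we had on a given date (made-up figures).
conn.executemany(
    "INSERT INTO balances VALUES (?, ?)",
    [
        ("2021-06-30", 1000),
        ("2021-12-31", 1200),
        ("2022-06-30", 1500),
        ("2022-12-31", 1700),
    ],
)
# Distilled: one row per year instead of one row per reading.
yearly = conn.execute(
    """
    SELECT substr(reading_date, 1, 4) AS reading_year,
           AVG(amount) AS avg_amount
    FROM balances
    GROUP BY reading_year
    ORDER BY reading_year
    """
).fetchall()
```

Four raw facts become two summary rows, and it is the summary, not the individual readings, that shows the upward trend.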

SQL and the relational database model are based on sets and logic. This makes SQL very good at finding set relations, but very weak at finding statistical and other relations.
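Set relations are where SQL is at home. A hedged sketch, again with SQLite and two invented tables of city names, showing intersection and difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE suppliers (city TEXT NOT NULL);
    CREATE TABLE customers (city TEXT NOT NULL);
    INSERT INTO suppliers VALUES ('Austin'), ('Boston'), ('Chicago');
    INSERT INTO customers VALUES ('Boston'), ('Chicago'), ('Denver');
    """
)
# Cities that appear in both sets.
both = [
    row[0]
    for row in conn.execute(
        "SELECT city FROM suppliers INTERSECT "
        "SELECT city FROM customers ORDER BY city"
    )
]
# Cities with a supplier but no customer.
only_suppliers = [
    row[0]
    for row in conn.execute(
        "SELECT city FROM suppliers EXCEPT "
        "SELECT city FROM customers ORDER BY city"
    )
]
```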

Range, Granularity, Accuracy and Precision are properties of a measurement, whether it is discrete or continuous.

Range

Measurements (and the tools that take the measurement) have a range — what are the highest and lowest values which can appear on the scale? Database designers do not have infinite storage, so we have to pick a subrange to use in the database when we have no upper or lower bound.
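One common way to pin down a subrange is a CHECK constraint. The sketch below uses SQLite; the table name and the bounds of -90 to 60 degrees Celsius are arbitrary assumptions chosen only to show the mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE readings (
        temperature_c REAL NOT NULL
            CHECK (temperature_c BETWEEN -90.0 AND 60.0)  -- assumed subrange
    )
    """
)
conn.execute("INSERT INTO readings VALUES (21.5)")  # inside the range

try:
    conn.execute("INSERT INTO readings VALUES (1000.0)")  # outside the range
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

The out-of-range value never reaches the table, so the range becomes part of the schema rather than something every application has to remember.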

Granularity

Look at a ruler and a micrometer. They both measure length, using the same SI unit scale, but there is a difference. A micrometer can report smaller differences because it has a finer granularity of units. Granularity is a static property of the scale itself: how many notches there are on your ruler.
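Here is a small sketch of the same length read at two granularities, using Python's `decimal` module. The length and the step sizes (1 mm for the ruler, 0.001 mm for the micrometer) are assumptions for illustration, not specifications of real instruments.

```python
from decimal import Decimal, ROUND_HALF_UP

length_mm = Decimal("12.3456")  # a made-up "true" length

# The ruler only has millimetre notches.
ruler_reading = length_mm.quantize(Decimal("1"), rounding=ROUND_HALF_UP)

# The micrometer resolves thousandths of a millimetre.
micrometer_reading = length_mm.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)
```

In SQL terms this is roughly the choice of scale in a declaration such as DECIMAL(p, s): the scale fixes how fine the notches are before any measurement is taken.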

Accuracy

Accuracy is how close the measurement, repeated again and again, comes to the actual value.

Precision

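Precision is a measure of how repeatable a measurement is. This is different from accuracy: a scale that always reads two kilograms too heavy is precise, because it gives the same answer every time, but it is not accurate.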
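Accuracy and precision can be told apart numerically. In this sketch, with invented readings of a known 100.0 kg weight, the mean error stands in for accuracy and the standard deviation stands in for precision; both are common choices, but they are my simplification rather than definitions from the series.

```python
from statistics import mean, pstdev

TRUE_VALUE = 100.0  # the known weight in kilograms (assumed)

# Hypothetical repeated readings from two scales.
scale_a = [102.0, 102.1, 101.9, 102.0]  # tightly clustered, but off target
scale_b = [97.0, 103.0, 99.0, 101.0]    # centered on target, but scattered

def bias(readings):
    """Mean error: how far the average reading is from the true value."""
    return mean(readings) - TRUE_VALUE

def spread(readings):
    """Population standard deviation: how much the readings disagree."""
    return pstdev(readings)

a_bias, a_spread = bias(scale_a), spread(scale_a)
b_bias, b_spread = bias(scale_b), spread(scale_b)
```

Scale A is precise but not accurate (small spread, large bias), while scale B is accurate on average but not precise (no bias, large spread).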