SQL Database Design Introduction


This entry is part 1 of 1 in the series SQL Database Design
  • SQL Database Design Introduction

Over at the website SQL Server Central they have series of articles that they group into what they call “Stairways”. There is a series called Stairway to Database Design.

In this post I will just highlight a few points from that article. The article was written in 2013 by Joe Celko, the author of several books on SQL.

The article says “Before you start to think about your database schema or tables, you need to consider your data: The type of data it is, the scale you use for values. It needs to be unique, precise and unambiguous. Then you need to name it in such a way that it can be generally understood.”

It also says “SQL handles is structured data as opposed to unstructured data like text or pictures. One of the basic concepts in RDBMS is what Dr. Codd called the Information Principle. This rule states that all data in an RDBMS is modeled as scalar values in columns in rows in tables.”

This means that all the rows in a table have exactly the same structure, and each column in a row is a value for the same data element. Choosing the scales and data types is important. It makes sense to do math and comparisons with some scales and not with others. SQL is a strongly-typed language.

The term “scalar” comes from linear algebra, where it is used to differentiate a single number from a vector or matrix. The meaning in computing is similar. It distinguishes a single value like an integer or float from a data structure like an array.

Scales, however, refer to units of measure. If we are recording temperature in a column, we need to be unambiguous and consistent. If the scale in in Celsius, then all values should be in Celsius.

Naming

The basic rule is that a data element name tells us what it is. Do not put any prefixes or suffixes that describe the data type (i, str, flt, dbl…) or how the data is stored (pk, fk for primary key or foreign key).

Data element names that are too short and too long are both bad. Avoid all special characters in names. Stick to letters in the Latin-1 alphabet, digits and underscores. They will port to other programming languages.

Enforce capitalization rules to avoid case sensitivity problems. My rules are that SQL keywords are uppercase, scalar data elements are lowercase and schema objects are capitalized. Don’t use “id” as a column name. It is too generic and ambiguous. For example, vehicles could have an identifier called vehicle_id, but even better than that is vin, which is Vehicle Identification Number. Do not name columns with a series of adjective without something to modify (noun) such as “type_code_id”, or even worse “type_code_str_pk_id”. We still need to know what it is.

You can use suffixes and there is a list of standard ones, such as _id, _date, _nbr, _name, _code, _size, _seq, _tot, _tally, _status, _cat, _class, _type, _addr and _img. You do have some flexibility here and you can create your own if necessary.

Joe Celko says: “‘_id’ = identifier. It is unique in the schema and refer to one entity anywhere it appears in the schema. Uniqueness is not the same thing as being an identifier. The number pi is unique, but it does not identify an entity (a thing or noun). Never use “

_id”; that is a name based on location and tell you this is probably not a real key at all. Just plain “id” is too vague and screws up your data dictionary when you have a zillion of them. Obviously, auto-incrementing values are not identifiers.” This is a quote from the website SQL Server Central.

We cannot ever have, say, a generic, magical, universal “id” as a data element. It has to be an exact name with a precise meaning in the schema.

Entity Framework

Microsoft’s Entity Framework relies on every entity having a key value that it uses for tracking entities. One of the conventions that code first depends on is how it implies which property is the key in each of the code first classes. That convention is to look for a property named “Id” or one that combines the class name and “Id”, such as “BlogId”. The property will map to a primary key column in the database.

Domain

A domain has legal operations that can be done on its values in addition to what comes with a data type. For example, I can use INTEGER to record temperature in my database; that is the data type. When I see 100, that is a value. While I can add, subtract, multiply and divide integers, it makes no sense to do math like that on temperatures. The type of scale determines what operations are allowed and a scale is the unit of measurement (like degrees Celsius or Fahrenheit).

Data Types

The next stair discusses data types. There are three main data types in SQL. You will also find more data types, but these are the “Big Three” that will do most of your work.

  • Numeric
  • String
  • Temporal

Numeric types are broken down into exact and approximate. The exact numerics are INTEGER, SMALLINT, DECIMAL, NUMERIC and BIGINT. They hold exact numeric values and have well-defined mathematical operations. Approximate numerics are REAL, FLOAT and DOUBLE PRECISION. These are floating point numbers and they have some rounding issues; these days the IEEE Floating Standard is universally used.

String types are broken down into fixed and varying length. The fixed length strings are CHAR (n) and NCHAR(n), where (n) is their length. The NCHAR(n) is a shorthand for “National Character”, which really means any character from any language that Unicode has in actual implementations. Likewise, CHAR(n) is the local ASCII character set. Varying length strings are not padded out with blanks like the fixed length strings. Picking the length and the character set for a column is a constraint that you really need to think about before you do it.

Temporal types are broken down into datetime and interval types. The datetime types break down into dates and times, which makes sense. It models a point in time. A date type includes years, months and days. A time type includes hours, minutes, seconds and decimal sub-seconds. Put together, they make a TIMESTAMP data type in Standard SQL, which SQL server calls DATETIME. Interval data types are durations of time like days, hours, minutes and seconds. SQL Server does not represent them as a special type, but uses functions with integers to get similar results.

Rounding and truncation for all data types are implementation defined. All data types allow for a NULL.

Celko says: “Use temporal data types for temporal data. Yes, that sounds so obvious that I should not have to say it. But one of the most common design errors is to use strings for date and time data. Of course people who do this never write constraints to prevent dates like “2010-02-31” or functions to do simple temporal math. They have committed the design error of putting display formatting into the database instead of the front end.”

Real numbers and Time are examples of a continuum, while the other data types are discrete. A discrete scale has a finite number (possibly zero) between between any two distinct values. Think about the integers {4, 9}, which have {5, 6, 7, 8} between them. A continuum is a mathematical structure that has an infinite number of data values between any two distinct values. You can always add more and more decimal places to a real number or a time without any limit. Floating point numbers have built-in functions to handle rounding and computational problems, but temporal data does not.

Constraints

Constraints are one reason that columns in a table are nothing like fields in a record. A constraint is a declarative clause that restricts the values in a column. The most important one is NOT NULL. Declare every column with it the first time and then if you decide to allow NULLs, comment the row declaration to explain what it means in context. Besides data integrity, constraints do two other nice things for you. They are available to the optimizer to improve your queries, inserts, updates and deletes. They save you a lot of front end coding; do it once here and not over and over in thousands of application programs, present and future. They guarantee that all the front end programs are using the same definitions for the data elements.

CHECK is the simplest row-level constraint. The CHECK constraint is used to limit the value range that can be placed in a column. If you define a CHECK constraint on a single column it allows only certain values for this column. If you define a CHECK constraint on a table it can limit the values in certain columns based on values in other columns in the row. It involves two or more columns. In standard SQL, you can also have a CHECK() constraint that references other tables, but it is not widely implemented or part of SQL Server.

Here is an example from w3schools.com. The following SQL creates a CHECK constraint on the “Age” column when the “Persons” table is created. The CHECK constraint ensures that you can not have any person below 18 years (applies to SQL Server / Oracle / MS Access::

CREATE TABLE Persons (
    ID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Age int CHECK (Age>=18)
);

To allow naming of a CHECK constraint, and for defining a CHECK constraint on multiple columns, use the following SQL syntax (applies to MySQL / SQL Server / Oracle / MS Access):

CREATE TABLE Persons (
    ID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Age int,
    City varchar(255),
    CONSTRAINT CHK_Person CHECK (Age>=18 AND City='Sandnes')
);

Another option in a row declaration is the DEFAULT clause.