Structured Data 1


Data is either structured or unstructured. If we are to analyze data using statistics, we must process the unstructured raw data into a structured form.

The vast majority of data in the world today is unstructured. Unstructured data includes photographs, videos, text documents and PDF files. In machine learning, such data is usually also unlabeled, and unlabeled data is handled with unsupervised machine learning techniques.

There are two basic types of structured data.

  • Numeric (quantitative) – data that are expressed on a numeric scale. Tableau calls these “measures”.
  • Categorical (qualitative) – data that can take on only a specific set of values. Tableau calls these “dimensions”.

Numeric (quantitative) Data

Numeric (quantitative) data is of two types:

  • Continuous – data that can take on any value in an interval. Also called float, numeric, or interval.
  • Discrete – data that can take on only integer values, such as counts. Also called integer or count.

Numeric data has two levels of measurement:

  • Interval – do not have a true zero, so differences between values are meaningful but ratios are not. Temperature in degrees Celsius is a common example. The term is also used for a sequential series of numerical ranges that subdivide a larger range of numerical values into smaller ranges. For example, from 0 to 500, 500 to 1000 and so on.
  • Ratio – have a true zero, so one value can be meaningfully divided by another. A ratio is a relationship that compares two quantitative values by dividing one by the other. It might be expressed as a percent. For example, sales of housewares may be 14% of the total sales for the month.
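As a rough illustration of why the true zero matters, the sketch below uses made-up figures (the sales amounts and temperatures are assumptions, not data from this post). A share of sales is a meaningful ratio. A ratio of Celsius temperatures is not, because 0 °C is an arbitrary point on the scale.

```python
# Ratio scale: sales have a true zero, so dividing one value by another is meaningful.
total_sales = 50_000.00       # hypothetical monthly total
housewares_sales = 7_000.00   # hypothetical housewares figure
housewares_share = housewares_sales / total_sales * 100
print(f"Housewares share of sales: {housewares_share:.0f}%")

# Interval scale: Celsius has no true zero, so differences are meaningful but ratios are not.
morning_c, afternoon_c = 10.0, 20.0
difference = afternoon_c - morning_c     # valid statement: 10 degrees warmer
naive_ratio = afternoon_c / morning_c    # 2.0, but "twice as hot" is not a valid claim

# Kelvin does have a true zero, and the ratio there is close to 1, not 2.
kelvin_ratio = (afternoon_c + 273.15) / (morning_c + 273.15)
print(difference, naive_ratio, round(kelvin_ratio, 3))
```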

Categorical Data

Categorical data can take on only a specific set of values representing a set of possible categories. Categories put labels on numbers. Our numbers might be sales and the category might be department or region.

  • Binary. A special case of categorical. Can be 0 or 1, Yes or No, True or False.
  • Nominal. Categorical data that doesn’t have an order. North, south, east, west. Sporting goods, housewares, automotive, etc.
  • Ordinal. Categorical data that has an order. First, second, third. Small, medium, large.
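One way to express these three kinds of categorical data is with pandas' Categorical type. This is only a sketch: the sample values are made up, and pandas is one possible tool rather than the only one.

```python
import pandas as pd

# Binary: a categorical with exactly two possible values.
in_stock = pd.Categorical(["Yes", "No", "Yes"], categories=["No", "Yes"])

# Nominal: categories with no inherent order.
regions = pd.Categorical(["north", "south", "east", "west", "north"])

# Ordinal: categories with an explicitly declared order.
sizes = pd.Categorical(
    ["medium", "small", "large", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(list(in_stock.categories))
print(regions.ordered, sizes.ordered)
# min, max and sorting follow the declared order, not alphabetical order.
print(sizes.min(), sizes.max())
print(pd.Series(sizes).sort_values().tolist())
```

Declaring the order is what lets the software compare or sort "small", "medium" and "large" sensibly. Calling min() or max() on an unordered Categorical such as regions raises a TypeError.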

Two further relationships among categories are interval and hierarchical. In an interval relationship, the categorical items are a sequence of numerical ranges that we create, often called “bins”. The bins subdivide a larger numerical range, and they have an order.
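Binning can be sketched with pandas' cut function. The sale amounts, bin edges and labels below are hypothetical, chosen only to show ranges in steps of 500.

```python
import pandas as pd

# Hypothetical sale amounts and bin edges in steps of 500.
sales = pd.Series([120, 480, 510, 999, 1500])
edges = [0, 500, 1000, 1500, 2000]
labels = ["0-500", "500-1000", "1000-1500", "1500-2000"]

# right=False makes each bin include its left edge and exclude its right edge,
# so a value of exactly 500 would fall in "500-1000".
binned = pd.cut(sales, bins=edges, labels=labels, right=False)

print(binned.tolist())
print(binned.cat.ordered)  # the bins form an ordered categorical
```

Whether a boundary value belongs to the lower or the upper bin is a design choice. The right parameter controls it, and the default (right=True) puts a boundary value in the lower bin.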

In computer software, the type of the data acts as a signal to the software on how to process the data. Numeric data can be averaged, for example, while categorical data is counted.
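The idea of a type acting as a signal can be sketched in a few lines. The summarize function below is a hypothetical helper, and its rule (numbers get a mean, everything else gets counts) is a deliberate simplification.

```python
from collections import Counter
from statistics import mean


def summarize(values):
    """Pick a summary based on the type of the values (a simplified rule)."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        return {"kind": "numeric", "mean": mean(values)}
    return {"kind": "categorical", "counts": dict(Counter(values))}


print(summarize([250.0, 400.0, 550.0]))
print(summarize(["housewares", "automotive", "housewares"]))
```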

When you have multiple closely associated categories, they may form a hierarchical relationship, that is, a parent-child relationship. A company might have divisions, each of which contains departments.
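A parent-child relationship like this can be represented as a nested mapping. The division names, department names and sales figures below are invented for illustration.

```python
# Hypothetical company: each division (parent) contains departments (children).
company = {
    "Retail": {"Housewares": 7_000, "Sporting goods": 12_000},
    "Services": {"Automotive": 9_000, "Installation": 4_000},
}

# Roll department sales up to their parent division.
division_totals = {
    division: sum(departments.values())
    for division, departments in company.items()
}
company_total = sum(division_totals.values())

print(division_totals)
print(company_total)
```

The hierarchy is what allows the same numbers to be reported at the department, division or company level.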

Time is continuous, not discrete.

Rectangular Data

Rectangular data is like a database table or a spreadsheet. It is a two-dimensional matrix in which the rows are records and the columns are features or variables. In Python's pandas library, rectangular data is a DataFrame. In R, it is a data frame, that is, a data.frame object.
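A minimal pandas sketch of rectangular data follows. The column names and values are assumptions made up for the example.

```python
import pandas as pd

# Each dict key becomes a column (feature); each position across the lists is a row (record).
df = pd.DataFrame(
    {
        "region": ["north", "south", "east"],   # categorical
        "units_sold": [12, 7, 9],               # discrete numeric
        "revenue": [1200.50, 640.00, 910.25],   # continuous numeric
    }
)

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # the type pandas inferred for each column
```

The inferred types connect back to the earlier sections: counts are stored as integers, continuous values as floats, and the text column would need to be converted explicitly if it should be treated as a categorical type.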

A correlation compares two columns of numerical data to determine whether increases in one correspond to increases or decreases in the other. A positive correlation means that increases in one correspond to increases in the other. You might wonder whether increases in the outside temperature in the summer correlate with increases in ice cream sales. A negative correlation means that increases in one correspond to decreases in the other. Correlation does not imply causation.
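The post does not say which correlation measure it has in mind. The sketch below assumes the Pearson correlation coefficient, a common choice, and computes it by hand on invented temperature and sales figures. The pearson_r function and all of the data are illustrative assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up daily observations.
temperature = [20, 24, 27, 30, 33]
ice_cream_sales = [110, 150, 170, 210, 240]   # rises with temperature
hot_cocoa_sales = [90, 70, 60, 35, 20]        # falls as temperature rises

print(round(pearson_r(temperature, ice_cream_sales), 3))  # close to +1
print(round(pearson_r(temperature, hot_cocoa_sales), 3))  # close to -1
```

The coefficient ranges from -1 to +1. A value near +1 indicates a strong positive correlation, a value near -1 a strong negative one, and a value near 0 little linear relationship.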

