What is metadata? Wikipedia defines metadata as “data [information] that provides information about other data.” Wikipedia also gives a really good example of metadata that many of us will easily understand: “In the 2010s, metadata typically refers to digital forms; however, even traditional card catalogs from the 1960s and 1970s are an example of metadata, as the cards contain information about the books in the library (author, title, subject, etc.).”
There are three types of metadata: descriptive metadata, structural metadata, and administrative metadata.
- Descriptive metadata
- Descriptive metadata describes a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, ISBN, and keywords. ID numbers are descriptive metadata.
- Structural metadata
- Structural metadata is metadata about containers of metadata; it indicates how compound objects are put together, for example, how pages are ordered to form chapters. (What would a book’s index be?) Structural metadata also indicates exactly how many collections a piece of data lives in.
- Administrative metadata
- Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it. The date and time a photo was taken are one example of administrative metadata.
Elements of Metadata
Metadata stores different types of information about the data.
- Title – the name of the file, website, or other source
- Description – what type of data it includes
- Tags – keywords that describe the data
- Categories – what type of data it is
- Who – the person or group that created the data
- When – when the data was gathered
- Modified – who last modified or updated it, and when
- Access – who can access the data and who can update it
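To make these elements concrete, here is a minimal sketch of a metadata record as a Python dataclass. The class name, field names, and example values are purely illustrative; they are not part of any metadata standard.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class MetadataRecord:
    """Illustrative metadata record mirroring the elements listed above."""
    title: str            # name of the file, website, or other source
    description: str      # what type of data it includes
    tags: List[str]       # keywords that describe the data
    categories: List[str] # what type of data it is
    created_by: str       # who created the data
    created_at: datetime  # when the data was gathered
    modified_by: str      # who last modified it
    modified_at: datetime # when it was last modified
    access: List[str] = field(default_factory=list)  # who can read or update it

# Hypothetical record for a CSV file of survey responses
record = MetadataRecord(
    title="customer_survey_2023.csv",
    description="Responses from the 2023 customer satisfaction survey",
    tags=["survey", "customers", "satisfaction"],
    categories=["marketing"],
    created_by="analytics-team",
    created_at=datetime(2023, 6, 1, 9, 30),
    modified_by="jdoe",
    modified_at=datetime(2023, 6, 15, 14, 0),
    access=["analytics-team", "marketing-leads"],
)
print(record.title, record.tags)
```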
Metadata is Everywhere in the Digital World
Our post The Data Ecosystem listed a few sources of data in the data ecosystem from a big data perspective: cell phones, GPS, internet games, ATMs, RFID, computers, IoT devices, cable boxes, medical imaging, video surveillance, and loyalty cards. On a more personal level, metadata comes from our photos, emails, documents, websites, books, and any other computer file. In the case of photos, any photo we take with a digital camera has metadata automatically attached to it.
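As a small example of that photo metadata, the snippet below reads the EXIF tags a camera embeds in a JPEG. It assumes the Pillow imaging library is installed (pip install Pillow), and "vacation.jpg" is just a placeholder file name.

```python
from PIL import Image, ExifTags

img = Image.open("vacation.jpg")   # placeholder path to any digital photo
exif = img.getexif()               # EXIF metadata embedded by the camera, if any

for tag_id, value in exif.items():
    # Translate numeric EXIF tag IDs into readable names like "DateTime" or "Model"
    tag_name = ExifTags.TAGS.get(tag_id, tag_id)
    print(f"{tag_name}: {value}")
```

Typical output includes the camera model, the date and time the photo was taken, and the image dimensions, exactly the kind of administrative and descriptive metadata discussed above.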
Metadata puts data into context. It creates a single source of truth by keeping things consistent and uniform. Metadata also makes data more reliable by ensuring it is accurate, precise, relevant, and timely, which in turn makes it easier for data analysts to identify the root causes of any problems that come up. One of the ways data analysts keep their data consistent and reliable is by using something called a metadata repository.
Metadata Repository
A metadata repository is a database specifically created to store metadata. Metadata repositories make it easier and faster to bring together multiple sources for data analysis. They do this by describing the state and location of the metadata, the structure of the tables inside, and how data flows through the repository. They even keep track of who accesses the metadata and when.
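Here is a minimal sketch of what a metadata repository might look like, using Python's built-in sqlite3 module. The table layout, dataset names, and locations are made-up examples rather than a standard schema; a real repository would track far more detail and live in a shared database.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory database for the sketch; a real repository would be a shared, managed database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metadata_repository (
        dataset_name  TEXT PRIMARY KEY,
        source_system TEXT,   -- e.g. accounting, marketing, manufacturing
        location      TEXT,   -- where the data set lives
        table_schema  TEXT,   -- structure of the table inside
        owner         TEXT,   -- who is responsible for the data
        last_accessed TEXT    -- who accessed it and when would be tracked here
    )
""")

conn.execute(
    "INSERT INTO metadata_repository VALUES (?, ?, ?, ?, ?, ?)",
    (
        "sales_orders",                                   # hypothetical data set
        "ERP",
        "s3://company-data-lake/raw/sales_orders/",       # hypothetical location
        "order_id INT, customer_id INT, amount DECIMAL, order_date DATE",
        "sales-analytics-team",
        datetime.now(timezone.utc).isoformat(),
    ),
)

# An analyst can now discover where a data set lives before querying it.
for row in conn.execute("SELECT dataset_name, location FROM metadata_repository"):
    print(row)
```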
Metadata Generation
Data engineers divide metadata into two main categories: autogenerated and human-generated.
Organizing Metadata
In larger companies it can be a challenge to organize large amounts of data from many different sources. Data can span many different processes and systems, such as accounting, production (manufacturing), marketing, and human resources, and it may be stored locally, in the cloud, or both. Each of these systems has its own rules and requirements, so each organizes the data in a completely different way, adding even more complexity. Metadata records where each system is located and where each data set lives, and it describes how all of the data is connected across the various systems. Data governance is the process of ensuring the formal management of a company’s data assets.
Data Lake
A data lake is a centralized repository that ingests and stores large volumes of data in its original form; you don’t have to structure the data first. It is designed to store, process, and secure large amounts of structured, semi-structured, and unstructured data, and it can process data in real-time or batch mode. Data in the data lake can be analyzed using SQL, Python, R, or any other language, as well as third-party data and analytics applications. You can run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning.
The key difference between a data lake and a data warehouse is that a data lake tends to ingest data very quickly and prepare it later, on the fly, as people access it. With a data warehouse, on the other hand, you prepare the data very carefully up front, before you ever allow it into the warehouse.
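A toy sketch of that difference, often described as “schema on read” (data lake) versus “schema on write” (data warehouse). The record shapes and field names below are invented for illustration.

```python
import json
from datetime import datetime, timezone

# Data lake style ("schema on read"): store raw records exactly as they arrive.
raw_events = [
    '{"user": "a1", "action": "click", "ts": "2023-06-01T10:00:00"}',
    '{"user": "b2", "action": "purchase", "amount": 19.99}',  # shape may vary
]
# Parsing and cleanup happen only later, when someone actually analyzes the data.
parsed = [json.loads(line) for line in raw_events]

# Data warehouse style ("schema on write"): validate and shape records before loading.
def to_warehouse_row(event: dict) -> tuple:
    # Reject or fill in records up front so every stored row matches the table schema.
    if "user" not in event or "action" not in event:
        raise ValueError(f"rejected record: {event}")
    ts = event.get("ts", datetime.now(timezone.utc).isoformat())
    return (event["user"], event["action"], event.get("amount", 0.0), ts)

warehouse_rows = [to_warehouse_row(e) for e in parsed]
print(warehouse_rows)
```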
The first post in this series is called An Introduction to Data.