What is metadata? Wikipedia defines metadata as “data [information] that provides information about other data.” Wikipedia also gives a really good example of metadata that many of us will easily understand: “In the 2010s, metadata typically refers to digital forms; however, even traditional card catalogs from the 1960s and 1970s are an example of metadata, as the cards contain information about the books in the library (author, title, subject, etc.).”
There are three types of metadata: descriptive metadata, structural metadata, and administrative metadata.
- Descriptive metadata
- Descriptive metadata describes a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, ISBN, and keywords. ID numbers are descriptive metadata.
- Structural metadata
- Structural metadata is metadata about containers of metadata; it indicates how compound objects are put together, for example, how pages are ordered to form chapters. (What would a book’s index be?) Structural metadata also indicates exactly how many collections a piece of data lives in.
- Administrative metadata
- Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it. The date and time a photo was taken are one example of administrative metadata.
Elements of Metadata
Metadata stores different types of information about the data.
- Title – the name of the file, website, or other source
- Description – what type of data it includes
- Tags – keywords that describe the data
- Categories – what type of data it is
- Who – the person or group that created the data
- When – when the data was gathered
- Modified – who last modified or updated it, and when
- Access – who can access the data and who can update it
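To make these elements concrete, here is a minimal sketch of a metadata record as a Python dataclass. The class name, field names, and example values are purely illustrative; they are not part of any metadata standard.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class MetadataRecord:
    """Illustrative metadata record mirroring the elements listed above."""
    title: str            # name of the file, website, or other source
    description: str      # what type of data it includes
    tags: List[str]       # keywords that describe the data
    categories: List[str] # what type of data it is
    created_by: str       # who created the data
    created_at: datetime  # when the data was gathered
    modified_by: str      # who last modified it
    modified_at: datetime # when it was last modified
    access: List[str] = field(default_factory=list)  # who can read or update it

# Hypothetical record for a CSV file of survey responses
record = MetadataRecord(
    title="customer_survey_2023.csv",
    description="Responses from the 2023 customer satisfaction survey",
    tags=["survey", "customers", "satisfaction"],
    categories=["marketing"],
    created_by="analytics-team",
    created_at=datetime(2023, 6, 1, 9, 30),
    modified_by="jdoe",
    modified_at=datetime(2023, 6, 15, 14, 0),
    access=["analytics-team", "marketing-leads"],
)
print(record.title, record.tags)
```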
Metadata is Everywhere in the Digital World
Our post The Data Ecosystem listed a few sources of data in the data ecosystem from a big data perspective: cell phones, GPS, internet games, ATMs, RFID, computers, IoT devices, cable boxes, medical imaging, video surveillance, and loyalty cards. On a more personal level, metadata comes from our photos, emails, documents, websites, books, and any other computer file. In the case of photos, any photo we take with a digital camera has metadata automatically attached to it.
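As a small example of that photo metadata, the snippet below reads the EXIF tags a camera embeds in a JPEG. It assumes the Pillow imaging library is installed (pip install Pillow), and "vacation.jpg" is just a placeholder file name.

```python
from PIL import Image, ExifTags

img = Image.open("vacation.jpg")   # placeholder path to any digital photo
exif = img.getexif()               # EXIF metadata embedded by the camera, if any

for tag_id, value in exif.items():
    # Translate numeric EXIF tag IDs into readable names like "DateTime" or "Model"
    tag_name = ExifTags.TAGS.get(tag_id, tag_id)
    print(f"{tag_name}: {value}")
```

Typical output includes the camera model, the date and time the photo was taken, and the image dimensions, exactly the kind of administrative and descriptive metadata discussed above.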
Metadata puts data into context. It creates a single source of truth by keeping things consistent and uniform. Metadata also makes data more reliable by ensuring it is accurate, precise, relevant, and timely, which in turn makes it easier for data analysts to identify the root causes of any problems that come up. One of the ways data analysts keep their data consistent and reliable is by using something called a metadata repository.
Metadata Repository
A metadata repository is a database specifically created to store metadata. Metadata repositories make it easier and faster to bring together multiple sources for data analysis. They do this by describing the state and location of the metadata, the structure of the tables inside, and how data flows through the repository. They even keep track of who accesses the metadata and when.
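Here is a minimal sketch of what a metadata repository might look like, using Python's built-in sqlite3 module. The table layout, dataset names, and locations are made-up examples rather than a standard schema; a real repository would track far more detail and live in a shared database.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory database for the sketch; a real repository would be a shared, managed database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metadata_repository (
        dataset_name  TEXT PRIMARY KEY,
        source_system TEXT,   -- e.g. accounting, marketing, manufacturing
        location      TEXT,   -- where the data set lives
        table_schema  TEXT,   -- structure of the table inside
        owner         TEXT,   -- who is responsible for the data
        last_accessed TEXT    -- who accessed it and when would be tracked here
    )
""")

conn.execute(
    "INSERT INTO metadata_repository VALUES (?, ?, ?, ?, ?, ?)",
    (
        "sales_orders",                                   # hypothetical data set
        "ERP",
        "s3://company-data-lake/raw/sales_orders/",       # hypothetical location
        "order_id INT, customer_id INT, amount DECIMAL, order_date DATE",
        "sales-analytics-team",
        datetime.now(timezone.utc).isoformat(),
    ),
)

# An analyst can now discover where a data set lives before querying it.
for row in conn.execute("SELECT dataset_name, location FROM metadata_repository"):
    print(row)
```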
Metadata Generation
Data engineers divide metadata into two main categories: autogenerated and human-generated.
Organizing Metadata
In larger companies it can be a challenge to organize large amounts of data from many different sources. Data can span many different processes and systems, such as accounting, production (manufacturing), marketing, and human resources, and it may be stored locally, in the cloud, or both. Each of these systems has its own rules and requirements, so each organizes the data in a completely different way, adding even more complexity. Metadata records where each system is located and where each data set lives, and it describes how all of the data is connected across the various systems. Data governance is the process of ensuring the formal management of a company’s data assets.
Data Lake
A data lake is a centralized repository that ingests and stores large volumes of data in its original form; you don’t have to structure the data first. It is designed to store, process, and secure large amounts of structured, semi-structured, and unstructured data, and it can process data in real-time or batch mode. Data in the data lake can be analyzed using SQL, Python, R, or any other language, as well as third-party data and analytics applications. You can run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning.
The key difference between a data lake and a data warehouse is that a data lake tends to ingest data very quickly and prepare it later, on the fly, as people access it. With a data warehouse, on the other hand, you prepare the data very carefully up front, before you ever allow it into the warehouse.
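A toy sketch of that difference, often described as “schema on read” (data lake) versus “schema on write” (data warehouse). The record shapes and field names below are invented for illustration.

```python
import json
from datetime import datetime, timezone

# Data lake style ("schema on read"): store raw records exactly as they arrive.
raw_events = [
    '{"user": "a1", "action": "click", "ts": "2023-06-01T10:00:00"}',
    '{"user": "b2", "action": "purchase", "amount": 19.99}',  # shape may vary
]
# Parsing and cleanup happen only later, when someone actually analyzes the data.
parsed = [json.loads(line) for line in raw_events]

# Data warehouse style ("schema on write"): validate and shape records before loading.
def to_warehouse_row(event: dict) -> tuple:
    # Reject or fill in records up front so every stored row matches the table schema.
    if "user" not in event or "action" not in event:
        raise ValueError(f"rejected record: {event}")
    ts = event.get("ts", datetime.now(timezone.utc).isoformat())
    return (event["user"], event["action"], event.get("amount", 0.0), ts)

warehouse_rows = [to_warehouse_row(e) for e in parsed]
print(warehouse_rows)
```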
The first post in this series is called An Introduction to Data.