Distributed Database


What is a distributed database? Here is what Wikipedia says: “A distributed database is a database in which storage devices are not all attached to a common processor. It may be stored in multiple computers, located in the same physical location; or may be dispersed over a network of interconnected computers.”

Wikipedia continues: “System administrators can distribute collections of data (e.g. in a database) across multiple physical locations. A distributed database can reside on organized network servers or decentralized independent computers on the Internet, on corporate intranets or extranets, or on other organization networks.”

What is a distributed ledger? Wikipedia says: “A distributed ledger (also called shared ledger) is a consensus of replicated, shared, and synchronized digital data geographically spread across multiple sites, countries, or institutions. There is no central administrator or centralized data storage. A peer-to-peer network is required as well as consensus algorithms to ensure replication across nodes is undertaken. One distributed ledger design is through implementation of a public or private blockchain system. But all distributed ledgers do not have to necessarily employ a chain of blocks to successfully provide secure and valid achievement of distributed consensus, a Blockchain is only one type of data structure considered to be a distributed ledger.”
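
To make the "chain of blocks" idea concrete, here is a minimal sketch of the data structure only. It leaves out the peer-to-peer network and the consensus algorithm that a real distributed ledger needs, and the names block_hash, append_block and is_valid are made up for this illustration, not taken from any blockchain library. Each block stores the hash of the block before it, so changing an old entry breaks every later link.

```python
import hashlib
import json


def block_hash(index, data, prev_hash):
    """Hash a block's contents together with the hash of its predecessor."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, data):
    """Append a new block that points at the current tip of the chain."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64  # placeholder for the first block
    index = len(chain)
    chain.append({
        "index": index,
        "data": data,
        "prev_hash": prev_hash,
        "hash": block_hash(index, data, prev_hash),
    })


def is_valid(chain):
    """Recompute every hash and check each link to the previous block."""
    prev_hash = "0" * 64
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["data"], block["prev_hash"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


ledger = []
append_block(ledger, "Alice pays Bob 5")
append_block(ledger, "Bob pays Carol 2")
print(is_valid(ledger))  # True

ledger[0]["data"] = "Alice pays Bob 500"  # tamper with an old entry
print(is_valid(ledger))  # False
```

In a real ledger every participating node would hold its own copy of such a chain and agree on new blocks through consensus; this sketch only shows why tampering with a single copy is detectable.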

What is Hyperledger? Wikipedia says: “Hyperledger (or the Hyperledger project) is an umbrella project of open source blockchains and related tools, started in December 2015 by the Linux Foundation, to support the collaborative development of blockchain-based distributed ledgers. The objective of the project is to advance cross-industry collaboration by developing blockchains and distributed ledgers, with a particular focus on improving the performance and reliability of these systems (as compared to comparable cryptocurrency designs) so that they are capable of supporting global business transactions by major technological, financial and supply chain companies. The project will integrate independent open protocols and standards by means of a framework for use-specific modules, including blockchains with their own consensus and storage routines, as well as services for identity, access control and smart contracts.”

If you want to dig a little deeper into the world of blockchain and build your own applications on top of one, Hyperledger Fabric is a good place to start. More information can be found in the Hyperledger Fabric documentation.

Stepping back to the more general picture, what is distributed computing? Wikipedia says: “Distributed computing is a field of computer science that studies distributed systems. A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs. There are many alternatives for the message passing mechanism, including pure HTTP, RPC-like connectors and message queues.”
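
The message-passing idea can be sketched in a few lines. This is only an illustration: two threads in one process stand in for components on networked computers, and Python's standard queue.Queue stands in for a real message queue. The worker function and the squaring task are invented for the example. The point is that the two sides share no state and coordinate only through messages.

```python
import queue
import threading


def worker(inbox, outbox):
    """Component that only reacts to messages; it shares no state with the sender."""
    while True:
        message = inbox.get()
        if message is None:  # sentinel value: stop the worker
            break
        outbox.put(message * message)


requests = queue.Queue()
replies = queue.Queue()
thread = threading.Thread(target=worker, args=(requests, replies))
thread.start()

# The sender coordinates with the worker purely by passing messages.
for n in (2, 3, 4):
    requests.put(n)
requests.put(None)
thread.join()

results = [replies.get() for _ in range(3)]
print(results)  # [4, 9, 16]
```

Across real machines the queue would be replaced by HTTP calls, RPC or a message broker, and the program would also have to cope with the independent failures mentioned in the definition.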

Replication and Duplication

Two processes keep distributed databases up to date: replication and duplication. Either one can keep the data current in all distributed locations.

Replication uses specialized software that looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same. Depending on the size and number of the distributed databases, this process can be complex and can consume a lot of time and computing resources.
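
As a rough sketch of the replication process described above, assume each site is a plain dictionary that maps a key to a (value, version) pair, and that the highest version wins when sites disagree. Both the version counter and the replicate function are assumptions made for this example. Real replication software typically relies on change logs, timestamps or more careful conflict resolution.

```python
def replicate(replicas):
    """Make all replicas identical using a simple highest-version-wins rule."""
    # Step 1: look for changes by collecting the newest version of every key.
    newest = {}
    for replica in replicas:
        for key, (value, version) in replica.items():
            if key not in newest or version > newest[key][1]:
                newest[key] = (value, version)

    # Step 2: copy the merged state to every replica that is behind.
    updates = 0
    for replica in replicas:
        for key, record in newest.items():
            if replica.get(key) != record:
                replica[key] = record
                updates += 1
    return updates


# Three sites that have drifted apart; each entry is (value, version).
site_a = {"stock:widget": (10, 1), "stock:gadget": (4, 2)}
site_b = {"stock:widget": (7, 3)}
site_c = {}

print(replicate([site_a, site_b, site_c]))  # 4 records had to be written
print(site_a == site_b == site_c)           # True
```

Even in this toy version every key on every site has to be examined, which hints at why replication grows expensive as the number and size of the databases increase.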

Duplication, on the other hand, is less complex. It designates one database as the master and then copies that database to every other location. Duplication normally runs at a set time after hours, so that each distributed location ends up with the same data. Users may change only the master database, which ensures that no local changes are overwritten.
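
The duplication scheme can be sketched in a similar way. The DuplicatedDatabase class and its method names are hypothetical, the databases are plain dictionaries, and the after-hours schedule is left to the caller (for example a nightly job that calls duplicate()). Writes go only to the master, and local sites get read-only views.

```python
import copy
from types import MappingProxyType


class DuplicatedDatabase:
    """One writable master plus read-only copies at each site."""

    def __init__(self, site_names):
        self.master = {}
        self._copies = {name: {} for name in site_names}

    def write(self, key, value):
        """All changes go to the master, never to a local copy."""
        self.master[key] = value

    def local_view(self, site):
        """Return a read-only view of a site's current copy."""
        return MappingProxyType(self._copies[site])

    def duplicate(self):
        """Overwrite every local copy with a full copy of the master."""
        for name in self._copies:
            self._copies[name] = copy.deepcopy(self.master)


db = DuplicatedDatabase(["paris", "tokyo"])
db.write("price:widget", 9.99)

print(db.local_view("paris").get("price:widget"))  # None: not duplicated yet
db.duplicate()                                      # e.g. run by a nightly job
print(db.local_view("paris").get("price:widget"))  # 9.99
```

Copying the whole master is simple but means local sites can be stale until the next run, which is the trade-off against the more involved replication approach.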

Today the distributed DBMS market is evolving rapidly, with new entrants and incumbents alike supporting the growing use of unstructured data through NoSQL engines, XML databases and NewSQL databases. These systems increasingly support distributed architectures that provide high availability and fault tolerance through replication, along with the ability to scale out. Examples include Aerospike, Cassandra, Clusterpoint, ClustrixDB, Couchbase, Druid (an open-source data store), FoundationDB, NuoDB, Riak and OrientDB.