数据库历史导读和笔记

原文：What goes around comes around pdf slides

TLDR

This is a history of data models, where the following eras are represented. There's an evolution of data models from complex hierarchical ones to simpler relational ones. CODASYL was a complex hierarchical model, which morphed into SQL, and then into XML (another hierarchical model). Evolutions in database models seem to be driven mostly by performance.

Hierarchical (IMS): late 1960’s and 1970’s
Network (CODASYL): 1970’s
Relational: 1970’s and early 1980’s
Entity-Relationship: 1970’s
Extended Relational: 1980’s
Semantic: late 1970’s and 1980’s
Object-oriented: late 1980’s and early 1990’s
Object-relational: late 1980’s and early 1990’s
Semi-structured (XML): late 1990’s to the present

Quotes

The debate between the XML advocates and the relational crowd bears a suspicious resemblance to the first “Great Debate” from a quarter of a century ago. A simple data model is being compared to a complex one.

The only ideas that got market traction were user-defined functions and user-defined access methods, and these were performance constructs not data model constructs.

Key Concepts

Logical data independence
Physical data independence
Semantic heterogeneity
Schema last
Schema first
Record at a time language

Guiding Questions

Why do people keep re-inventing a new "hierarchical" database (CODASYL, XML, JSON)?
How would we view the NoSQL movement and document stores in light of history?

Slightly more conceptual questions

What types of database use cases are B trees suited for and unsuited for, why?

Business transactions.
Geographic data. Better treated with quad trees or R trees.

What happened to the persistent programming language movement? How does this relate to the user-defined functions and access methods?

What does it mean that XML can "go through a firewall"? Do the same principles apply to JSON and REST? (I could also be overthinking this... as it might literally just mean that this stuff is easier to transmit between different computers).

Why did SQL become the de facto database standard?

Why is a "general purpose hierarchical data format" a bad idea?

Notes and Quotes

IMS

Takeaways

Lesson 1: Physical and logical data independence are highly desirable
Lesson 2: Tree structured data models are very restrictive
Lesson 3: It is a challenge to provide sophisticated logical reorganizations of tree
structured data
Lesson 4: A record-at-a-time user interface forces the programmer to do manual query optimization, and this is often hard.

Notes

Hierarchical data model (tree)
Record type: collection of named fields with associated datatypes.
Instance of record type obeys definition.
Each record type has a key (some subset of named fields).
Each record type has unique parent record key.
Each record has a hierarchical sequence key (concatenate keys of ancestors and key of current record).
Data manipulation language.

IMS supported four different storage formats for hierarchical data. Basically root records can either be:

Stored sequentially
Indexed in a B-tree using the key of the record
Hashed using the key of the record

Dependent records are found from the root using either

Physical sequentially
Various forms of pointers.

The ability of a data base application to continue to run, regardless of what tuning is
performed at the physical level will be called physical data independence.

IMS supports a certain level of logical data independence, because DL/1 is actually defined on a logical data base, not on the actual physical data base that is stored.

CODASYL

Takeaways

Lesson 5: Networks are more flexible than hierarchies but more complex
Lesson 6: Loading and recovering networks is more complex than hierarchies

Notes

Network data model. Not tree. Graph.
Record-at-a-time data manipulation language. Enters database at entry point, navigates via sets.
No physical data independence or logical data independence.
Trades increased complexity for the possibility of easily representing non-hierarchical data. CODASYL offers poorer logical and physical data independence than IMS.

Relational

Takeaways

Lesson 7: Set-a-time languages are good, regardless of the data model, since they offer much improved physical data independence.
Lesson 8: Logical data independence is easier with a simple data model than with a
complex one.
Lesson 9: Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology.
Lesson 10: Query optimizers can beat all but the best record-at-a-time DBMS application programmers.

Notes

Relational algebra
SQL Won due to:
a) the success of the VAX
b) the non-portability of CODASYL engines
c) the complexity of IMS logical data bases

Entity-Relationship Model

Takeaways

Lesson 11: Functional dependencies are too difficult for mere mortals to understand.

Notes

Databases are collections of instances of entities.
Entities have attributes.
Entities have relationships with each other.
Do database design by constructing an initial collection of tables and then normalizing them.

First, real DBAs immediately asked “How do I get an initial set of tables?” Normalization theory had no answer to this important question. Second, and perhaps more serious, normalization theory was based on the concept of functional dependencies, and real world DBAs could not understand this construct.

R++

Takeaways

Lesson 12: Unless there is a big performance or functionality advantage, new constructs will go nowhere

Notes

In the 1980's, database papers looked like this:
- Consider an application, call it X
- Try to implement X on a relational DBMS
- Show why the queries are difficult or why poor performance is observed
- Add a new “feature” to the relational model to correct the problem
Here are the significant features that came up:
- set-valued attributes
- aggregation (tuple-reference as a data type): pointers instead of foreign keys. weird.
- generalization: i.e. inheritance hierarchies

Semantic Data Model

Notes

Focus on classes and inheritance.
They failed because "they were a lot of machinery that was easy to simulate on relational systems".
Also offered no performance improvement.

OO (Object Oriented)

Takeaways

Lesson 13: Packages will not sell to users unless they are in “major pain”
Lesson 14: Persistent languages will go nowhere without the support of the programming language community.

Notes

To address the engineering market, an implementation of persistent C++ had the following requirements:

no need for a declarative query language. All one needed was a way to reference large disk-based engineering objects in C++.
no need for fancy transaction management. This market is largely one user-at-a-time processing large engineering objects. Rather, some sort of versioning system would be nice.
The run-time system had to be competitive with conventional C++ when operating on the object. In this market, the performance of an algorithm using persistent C++ had to be competitive with that available from a custom load program and conventional C++

In our opinion, there are a number of reasons for this market failure.

absence of leverage. The OODB vendors presented the customer with the opportunity to avoid writing a load program and an unload program. This is not a major service, and customers were not willing to pay big money for this feature.
No standards. All of the OODB vendor offerings were incompatible.
Relink the world. In anything changed, for example a C++ method that operated on persistent data, then all programs which used this method had to be relinked. This was a noticeable management problem.
No programming language Esperanto. If your enterprise had a single application not written in C++ that needed to access persistent data, then you could not use one of the OODB products

What is the idea of a persistent programming language?

"one where the variables in the language could represent disk-based data as well as main memory data and where data base search criteria were also language constructs"
it requires the compiler for the programming language to be extended with DBMS-oriented functionality
What does this mean?

O2 supported an object-oriented data model, but it was not C++. Also, they embedded a high level declarative language called OQL into a programming language. Hence, they proposed what amounted to a semantic data model with a declarative query language, but marketed it as an OODB.

Object-Relational

Takeaways

Lesson 14: The major benefits of OR is two-fold: putting code in the data base (and thereby blurring the distinction between code and data) and user-defined access methods.
Lesson 15: Widespread adoption of new technology requires either standards and/or an elephant pushing hard.

Notes

Motivated by storing geographic data. Searching for all points within a rectangle is a 2-dimensional search.
GIS queries are difficult to say in SQL and perform badly on B-Trees.
Basically the outcome of the whole OR era was better support for UDFs.
the OR proposal added user defined things to SQL: data types, operators, functions, access methods.

Semi Structured Data

Schema Last

This means having a self-describing schema.
semantic heterogeneity: information on a common object does not conform to a common representation; difficult for query processing as there is no structure on which to base indexing decisions.
designed for semi-structured data

XML Data Model

XML records can be hierarchical, as in IMS
XML records can have “links” (references to) other records, as in CODASYL, Gem and SDM
XML records can have set-based attributes, as in SDM
XML records can inherit from other records in several ways, as in SDM

Vocabulary

union types: an attribute in a record can be of one of a set of possible types
set-at-a-time query language
Logical data independence: The ability to change the logical (conceptual) schema without changing the External schema (User View) is called logical data independence.
Physical data independence: The ability to change the physical schema without changing the logical schema

Other resources

PostGres was originally created as a research project for geolocation data.
A comparison of SOAP vs. REST slides article

数据库历史导读和笔记