page size: typically ~100 KB, extreme 256 MB
First question: row store or column store?
row store--whole records are stored contiguously --- classical method
page:
| header | rec1 | rec2 | rec3 .....
most useful for transaction processing.
Downside: when there are a large number of columns, much of the information in a row is useless if you only need to access a subset of data.
column store: ---increasingly common organization for databases.
Page:
header|||||||||| <---at I for Rec1, at I for Rec2, .....at I for RecN
Down side: records are at different pages. have to access multiple pages to get a row.
Up side: analytical process: tables are really really wide.
Useful for different circumstances.
What should be in a header?
- Info about table : 1. name, 2. schema (attributes)
- Timestamp (for Concurrency Control and Recover CCR)
- Info about Storage format
- Pointers to start of each record. (bad: complicated. Good: may reduce CPU) -- useful as there might be variable length and you cannot simply get record by offset * unit_length
- Number of recs
- Pointer to first unused slot ...
What should be included in a record header
- Time stamp for CCR
- Pointer to next record (do this to trade space for CPU, otherwise you will need to parse the next record and parsing is expensive)
- Pointers to attributes
- Valid or not? A bit or byte to tell if the record is valid or not. --- To delete, simply put a bit there. --valid or not byte
In next assignment, use provided code (page storage) to store bunch of records
"Pointers" in DBMS
C pointer is a memory address.
smart pointer is the same with a counter associated with it.
Types of pointers:
- Classic pointers: memory locations. --DBMS support code (Binary Tree to store the data etc.). In header, not using classical pointer ()move one page to another place and a classical pointer breaks.
- "Offset pointer": not absolute locations; store a distance from the specific memory location. --> assuming having the location of the header. an attribute is 28 bytes away.
- Off page pointer (points to DB data not on the same page). Why? Big reason: in DB structures (e.g. B tree) requires off page pointer
2)3) need to translate to the memory locations. 3) is bit difficult will be covered later.
What info do we like to include in an Off-page pointer / what data goes into an off-page pointer
- ID (identifier) for page
- IP address (could be a distributed system), file, page position) (drawback: this might change)
- Logical: file, page ID. --- you have to have some sort of server allow you to look up actual location with page ID. --- slower because you will need two searches.
- location of data on page. (might be an offset Ora "logical" location)
Jan 31
Swizzling - refers to process of converting logical pointers to C-style memory addresses
when to swizzle and when to "un" swizzle?
Fully on demand swizzling(every time we defer, we swizzle)
pro - easy! QUIZ some one want to access a page, they call getBytes, what they have is a logical pointer, in our system, finalize with c-style pointer with a swizzling process
Con:expensive (repeatedly go and lookup pointer)-
Automatic swizzling: automatic means the system swizzles every pointer when loaded into RAM
- pro: easy from application programmer's view. I just dereference things, don't need to look up (the logical 2 physical table). Also looks efficient.
To implement:
- need to handle eviction (swizzle ptr to null)
- need to handle buffer cache misses (looks like on demand)
- also need a lookup table to keep track of all the pointer in the RAM: so can handle page movements and evictions. third solution: Hybrid (Once a pointer is dereferenced, its management is automatic )
Organizing records and pages into files:Big topic that going to spend a week to cover
go through several file organizing
(1) Heap (as database people, we really refers an unordered "pile" of records)
- inserts: simplest implementation: write to end of last page in file.
- Finds: iterate or scan from start to finish
- Deletes: remove rec(s) from page, periodically (Garbage Collection)
* Pros:
1. heap is simple; easy to iterate them;
2.can have good performance for large file;
3. good for adding data
* Cons:
1. Bad for "point" finds that access with a specific keys
(2) Sorted file: at records sorted according to some "search key" (one or more user supplied attributes)
- inserts: Buffer a large batch, then periodically re-sort.
- Question: how to sort a large (bigger than RAM) file? --answered by student (merge sort)
TPMMS (two-phase multi-way merge sort)
TWO phases: sort phase and merge phase
sort phase: while some portion of file is still unsorted:
1) Read in R pages from input file (R < B # numb buffered pages available)
2) Sort all of the records on these pages.
3) Write new pages (w. sorted records -- called a "run") to disk