Monday, February 8, 2010

Type 7 database pages and Firebird 2.x


Norman Dunbar posted the following on the Firebird Development list:

"I'm documenting the internal page formats of a database for the Doc project. I've checked in (to cvs) the document so far, but obviously a half finished document is no good in real life. To this end, I have a couple of questions on the internal formats of the type 7 database pages for Firebird 2.x, if you don't mind:

As before I have read Ann's (1.5?) documentation over at IBPhoenix/R&D, however, Firebird 2 seems to have changed things (slightly).

What compression is used in the page/btree_nod entries? I've got a database with a couple of known entries and I cannot decode the btree_nods sensibly. Happy to be pointed at the appropriate code file in the source.

Is there a difference in the page layout at all if the page I'm examining is not a leaf page? (btr_level != 0)"

Dimitry Yemanov replied.

"This isn't going to be easy to answer, as the page layout is much different between ODS10 and ODS11.

In all ODS incarnations, prefix compression is used for index keys. It means that the first key is stored "as is" and subsequent keys are stored as "deltas" represented by three values:
(1) length of the data that should be taken from the prior key (aka prefix),
(2) length of the data that is stored in our key (aka suffix), and
(3) the suffix itself which length is described by (2).

In ODS11, key internals (prefix length, suffix length, page number, record number) are also kinda compressed to store only the significant part of an appropriate integer value.

In ODS10, you'll see them of the fixed size:

struct btree_nod
{
UCHAR btn_prefix; --> prefix length
UCHAR btn_length; --> suffix length
UCHAR btn_number[4]; --> page or record number
UCHAR btn_data[1]; --> suffix data of btn_length bytes
};

Also, ODS11 has some special flags in the first byte of the index entry which allows to avoid storing prefix/suffix values at all in some cases.

Relevant source code: jrd/btr.cpp and jrd/btn.cpp.

> Is there a difference in the page layout at all if the page I'm examining is not a > leaf page? (btr_level != 0)

They're nearly identical. IIRC keys on non-leaf pages contain both page numbers and record numbers.

Also beware about the jump nodes introduced in ODS11. It's a sparse lookup table (key -> offset) which is stored on the page along with the keys themselves."

Ann Harrison also replied.

> They're nearly identical. IIRC keys on non-leaf pages contain both page numbers
> and record numbers.

That's right. Leaf pages contain nodes that have a header, data - possibly compressed, and the record number of the record that corresponds to the data. Upper level pages in ODS11 contain nodes that have a header, data, record number that contains that data, and the page number of the lower level page that starts with that
data value. The reason for "promoting" the record number to the upper level is not to give faster access to the record during a normal search, it's to avoid a garbage collection problem with indexes with lots of duplicates.

In ODS-10 and earlier, the algorithm for storing duplicates put the newest record first. So if you stored 100,000 records with the same key value and a generated primary key, the records would be stored in primary key order (more or less) and the index entries would be stored in inverse primary key order. When you delete
those records, you delete the lowest primary key value first. Unfortunately, the index entry for the key with duplicates will be at the end of the chain of duplicates, so you have to read 99,999 entries before you find the one you want.

Then you delete the second record and read 99,998 entries, etc. It's called
thrashing the cache.

In ODS-11 and higher, the record number becomes part of the key. Duplicates are stored in record number order. In effect, every key is unique.

> Also beware about the jump nodes introduced in ODS11. It's a sparse lookup table
> (key -> offset) which is stored on the page along with the keys themselves.

Jump nodes are like an index into an index page. Doesn't seem to make sense, but with pages larger than about 4K, Firebird was spending an inordinate amount of time reading across index pages - reading on average 1/2 the page size at each level. Jump nodes reduce the average read to about 500 bytes. As it turns out, the optimal size for an index page is different from the optimal size for data and jump nodes are a way of getting the best performance for both.

Note that systems that don't compress keys can use a binary search on an index page, so the size of an index page doesn't matter as much. On the other hand, they pay for that binary search in I/O.

No comments: