SQLite the hard way

Published: 1/9/2025

I recently wrote a sqlite file parser in python from scratch to learn more about SQLite’s internals. I just started this blog, so I figured it would be fun to write everything up in a blogpost! I’ve chosen to keep things as simple as possible, as my goal isn’t to write a production-ready sqlite library (the stdlib already has one), rather it’s to gain a more concrete understanding of how b-trees are implemented in practice.

All the code from this project is available in this github repo.

Creating a test database

Keeping the goal in mind, I decided to stick with a fairly database schema:

1
CREATE TABLE users (
2
    id INTEGER PRIMARY KEY AUTOINCREMENT,
3
    username TEXT NOT NULL UNIQUE,
4
    email TEXT NOT NULL UNIQUE,
5
    password_hash TEXT NOT NULL,
6
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
7
);

It’s a pretty basic users table, however there are 2 things worth noting:

First is the id INTEGER PRIMARY KEY AUTOINCREMENT. In most circumstances, creating a column as a PRIMARY KEY does the same as adding a NOT NULL UNIQUE constraint on the column. As we’ll see, sqlite tables can only actually be (efficiently) queried based on a special hidden column called rowid, which is a unique 64 bit unsigned int value that’s always unique to each row automatically. Defining a UNIQUE column implicitly creates an index for that column. Indexes are really just an efficient means of converting the value in a column into a rowid. From there we always use the rowid to look up the actual database record(s). One exception to this, however, is for INTEGER PRIMARY KEY AUTOINCREMENT columns. In this particular case, that column basically just becomes an alias for the rowid column. For the purposes of this blogpost, this means that I can look up records by id before I need to add support for resolving data using an index.

The other notable thing is the UNIQUE constraints on the username/email columns. The only reason this matters is that the unique constrains implicitly cause sqlite to create indexes for those columns, which we’ll be using later when searching for records by email address.

Besides creating the schema, I need to populate it with some records that we can search for. To make everything as easy as possible during debugging, my dummy records are all written like this:

id	username	email	password_hash	created_at
1	user_1	user_1@example.com	password_1	NOW()
2	user_2	user_2@example.com	password_2	NOW()
3	user_3	user_3@example.com	password_3	NOW()

Here’s a simple script that bootstraps the database with 1k test users:

1
import sqlite3
2

3
# Connect to SQLite database (or create it if it doesn't exist)
4
conn = sqlite3.connect("example.db")
5
cursor = conn.cursor()
6

7
# Create the users table if it doesn't exist
8
cursor.execute("""
9
CREATE TABLE IF NOT EXISTS users (
10
    id INTEGER PRIMARY KEY AUTOINCREMENT,
11
    username TEXT NOT NULL UNIQUE,
12
    email TEXT NOT NULL UNIQUE,
13
    password_hash TEXT NOT NULL,
14
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
15
)
16
""")
17

18
# Insert 1000 test users
19
users_to_insert = []
20
for i in range(1, 1001):
21
    username = f"user_{i}"
22
    email = f"user_{i}@example.com"
23
    password = f"password_{i}"  # Predictable test value for password
24
    users_to_insert.append((username, email, password))
25

26
# Use executemany for batch insertion
27
cursor.executemany("""
28
INSERT INTO users (username, email, password_hash)
29
VALUES (?, ?, ?)
30
""", users_to_insert)
31

32
# Commit changes and close connection
33
conn.commit()
34
conn.close()

Challenge

I always like to have an objective in mind when I’m learning things. For this blogpost, my goal is to be able to look up user records from their id:

1
SELECT * FROM users WHERE id = 450;

Or by their email address:

1
SELECT * FROM users where email = 'user_450@example.com';

Overview

Before jumping into writing code, I think it’ll make sense to have a high level understanding of what the process will be. The high level steps necessary reading a record from a table based on its rowid (which, remember, is the same thing as the id column for this table) are:

Parse the file header to figure out the page size for this database
Read the sqlite_schema table to figure out what page the users starts on
Read the users table to find the record corresponding to our target rowid

Pages and B-trees

A sqlite database is composed of pages. All pages in a database are the same size (a power of 2 between 512 and 65536). The pages are simply concatenated together, so if you know the index of a page (sqlite indexes the pages starting from 1, for whatever reason) and the page size for your database, it’s easy to find the page itself.

For our purposes, sqlite has 2 categories of pages: table pages and index pages. The table pages store the actual contents of the tables (the username/password/email/etc, in our case), but table pages can only be searched efficiently by a special column called rowid that’s normally not visible. Index pages are used to efficiently map the values of other columns to the rowid, which can then be looked up in the table pages. So to find a user by email, first we’d use index pages to convert the email into a rowid, and then we’ll use that rowid to find their row in the users table.

Sqlite (and most other relational databases) are fundamentally built around a type of tree called a b-tree (sadly nobody knows what b stands for). Each table/index is internally represented as a own b-tree, where each node in the tree is stored as one page in the file.

Besides categorizing pages based on the type of tree they’re part of, pages are also grouped based on whether they’re a leaf (at the bottom layer of a b-tree) or an interior page (any of the other ones).

Each page in a sqlite database is composed of cells. The details of the cells depend a bit on what kind of page we’re looking at (internal/leaf, and index/table), but for the most part cells have a key and a payload. Within each page, the cells are sorted based on their key (so we can use a binary search to find the cell we want). For interior pages, each cell contains a pointer to a child page. For leaf pages, the payload of the cell is either a rowid (for index leaf pages), or the contents of the database row (in the case of table leaf pages).

I think the easiest way to learn how this works is with an example. Here’s a (conceptual) diagram of how our users table is stored, lets walk through finding the user with id (which in our case is the same thing as rowid) 5:

btree example

We’ll start off at the root of the tree. Notice how the cells (green boxes) in the root of the tree are in ascending order (left to right) based on their key. Also, all the cells in the child pages of a cell have keys that are less than or equal to the cell in the parent page. The cell with Key: 5 in the root page points to page 8, and page 8 only contains cells with keys that are less than or equal to 5.

So in our example, the smallest key that’s >= our target of 5 is 5, and it’s got a pointer to page 8. This implies that our target record is page 8, or one of page 8’s children (if page 8 turns out to to be another interior page).

In our case, page 8 is a leaf table page, so its cells contain the actual data for the table, rather than just pointers to other pages. Once again, the cells are in order based on their key so we can do another binary search to quickly find cell 5.

And that’s pretty much how databases work! There’s a lot of extra implementation details to deal with, but conceptually that’s pretty much it!

On to the implementation details.

The File Header

Sqlite files (like so many file formats) start with a header. It’s just a bunch of big endian (very important!) fields, concatenated together. Here’s a class I wrote that handles parsing the database header out of a file.

1
@dataclass
2
class FileHeader:
3
    magic_bytes: bytes
4
    page_size: int
20 collapsed lines
5
    format_write_version: int
6
    format_read_version: int
7
    reserved_bytes: int
8
    max_embedded_payload_frac: int
9
    min_embedded_payload_frac: int
10
    leaf_payload_frac: int
11
    file_change_counter: int
12
    file_size_pages: int
13
    freelist_trunk_ptr: int
14
    num_freelist_pgs: int
15
    schema_cookie: int
16
    schema_format_number: int
17
    default_page_cache_size: int
18
    largest_root_b_tree_ptr: int
19
    text_encoding: str
20
    user_version: int
21
    vacuum_mode: bool
22
    app_id: int
23
    version_valid_for: int
24
    sqlite_version_number: int
25

26
    def __init__(self, file: BinaryIO):
27
        (
28
            self.magic_bytes,
29
            self.page_size,
20 collapsed lines
30
            self.format_write_version,
31
            self.format_read_version,
32
            self.reserved_bytes,
33
            self.max_embedded_payload_frac,
34
            self.min_embedded_payload_frac,
35
            self.leaf_payload_frac,
36
            self.file_change_counter,
37
            self.file_size_pages,
38
            self.freelist_trunk_ptr,
39
            self.num_freelist_pgs,
40
            self.schema_cookie,
41
            self.schema_format_number,
42
            self.default_page_cache_size,
43
            self.largest_root_b_tree_ptr,
44
            text_encoding,
45
            self.user_version,
46
            vacuum_mode,
47
            self.app_id,
48
            self.version_valid_for,
49
            self.sqlite_version_number,
50
        ) = struct.unpack(">16sH6B12i20x2i", file.read(100))
51
        self.text_encoding = {
52
            1: 'utf8',
53
            2: 'utf16-le',
54
            3: 'utf16-be'
55
        }[text_encoding]
56
        self.vacuum_mode = vacuum_mode != 0

Most of the heavy lifting here is happening on line 50. I’m using the python struct.unpack module to parse the first 100 bytes of the file and break it up into a list of separate values that can be manipulated easily. The format string I’m passing (>16sH6B12i20x2i) can be read like this:

code	meaning
>	Use little endian
16s	An array of 16 bytes of raw data
H	A single unsigned short (2 byte) value
6B	6 unsigned 1-byte integers
12i	12 signed 4-byte integers
20x	A gap of 20 bytes (reserved in the sqlite spec for future use)
2i	2 integer values

The only field of importance right now is the pageSize field, but I parsed everything out to make life a little simpler later. Pages are the highest level structure in a sqlite file, so knowing how big they are is pretty useful. We’ll skip the rest of the fields for now, but some of them will be coming up a little later.

The `sqlite_schema` table

Now that we know the page size, we need to parse the special sqlite_schema table. This is a special table that gives us information about the other tables in the database. The first page of the database file is always the root of the b-tree for this table, so it’s very easy to find.

Conveniently for us, the example database I set up has only 1 table and a few indexes, so the entire sqlite_schema table fits in one page. That means that the root page (i.e. page 1) is also a leaf page, which means (for now) we can put off dealing with interior pages, and just implement leaf table pages.

Each page has its own header, and parsing that is the first step to reading the page. I used an extremely similar pattern to what I did to parse the file level header:

1
class BTreePageType(Enum):
2
    TABLE_INTERIOR = 0x05
3
    TABLE_LEAF = 0x0D
4
    INDEX_INTERIOR = 0x02
5
    INDEX_LEAF = 0x0A
6

7
@dataclass
8
class PageHeader:
9
    type_id: BTreePageType
10
    first_free_block_offset: int
11
    num_cells_in_page: int
12
    cell_content_area_offset: int
13
    fragmented_free_bytes: int
14
    right_ptr: Optional[int]
15
    header_size: int
16

17
    def __init__(self, file: BinaryIO):
18
        self.type_id = BTreePageType(file.read(1)[0])
19
        (self.first_free_block_offset,
20
         self.num_cells_in_page,
21
         self.cell_content_area_offset,
22
         self.fragmented_free_bytes) = struct.unpack(">3HB", file.read(7))
23
        if self.type_id in (BTreePageType.TABLE_INTERIOR, BTreePageType.INDEX_INTERIOR):
24
            self.right_ptr = struct.unpack(">I", file.read(4))[0]
25
            self.header_size = 12
26
        else:
27
            self.right_ptr = None
28
            self.header_size = 8

You may have noticed in the diagram earlier that there’s one pointer at the end of the internal page that’s not part of a cell:

btree example 2

This special pointer is used when we’re searching for a value that’s greater than all the keys in the list. So in that example, if we were looking for a key of 1000, we’d binary search all the way to the right end of the cells, and if we don’t find anything >= 1000 then we’d traverse follow that right pointer to get to the appropriate child page. Since that last pointer doesn’t fit in any of the cells, it’s stored in the header (line 13 in the above code snippet). This is only relevant for internal pages though, so we only set that rightPtr field on interior page types.

In my example database, the page header for the sqlite_schema table looks like this:

Field	Value
type_id	TABLE_LEAF
first_free_block_offset	4088
num_cells_in_page	4
cell_content_area_offset	3680
fragmented_free_bytes	0
right_ptr	None
header_size	8

As expected, this is a TABLE_LEAF page. For reading data, we don’t really have to care about free blocks or fragmented free bytes, so we’ll ignore those fields (they’re important for efficiently adding data).

In a table leaf page like this one, each cell corresponds to one row in the table. So num_cells_in_page tells us that this table has 4 items in it. Reassuringly, this lines up with what we get from reading the table using the sqlite CLI:

sqlite> SELECT COUNT(1) FROM sqlite_schema;
4

Incidentally, this explains why SELECT COUNT(1) FROM some_big_table queries are so slow! The database engine would need to walk the b-tree to find every leaf page, and add up the num_cells_in_page for every one to get the answer.

Anyway, we now have everything we need to read the cells from the table! Here’s how I implemented that:

1
class TableLeafCell(Cell):
2
    row_id: int
3
    payload_size: int
4
    record: "Record"
5

6
    def __init__(self): # TODO (we'll come back to this later)
7
        pass
8

9
@dataclass
10
class TableLeafPage:
11
    offset: int
12
    pageHeader: PageHeader
13
    cells: List[TableLeafCell]
14

15
    def __init__(self, file: BinaryIO):
16
        self.offset = file.tell()
17
        self.cells = []
18
        self.pageHeader = PageHeader(file)
19

20
        num_cells = self.pageHeader.numCellsInPage
21
        cell_ptrs = struct.unpack(f'>{num_cells}H', file.read(num_cells * 2))
22
        for ptr in cell_ptrs:
23
            file.seek(self.offset + ptr)
24
            self.cells.append(TableLeafCell(file, database.header.text_encoding))

This is pretty strait forward. First we read the page header to figure out how many cells to expect. Immediately after the header ends there’s an array of offsets (relative to the beginning of the page) which tell us where to find the cells. We iterate over those offsets and parse each cell! Since this is a table leaf page, the cells hold the actual contents of the table. Each cell contains its key, and array of values, one value per column in the table!

Here’s what we get by dumping all the cell payloads in the sqlite_schema table:

1
['table', 'users', 'users', 2, 'CREATE TABLE users (\n    id INTEGER PRIMARY KEY AUTOINCREMENT,\n    username TEXT NOT NULL UNIQUE,\n    email TEXT NOT NULL UNIQUE,\n    password_hash TEXT NOT NULL,\n    created_at DATETIME DEFAULT CURRENT_TIMESTAMP\n)']
2
['index', 'sqlite_autoindex_users_1', 'users', 3, None]
3
['index', 'sqlite_autoindex_users_2', 'users', 4, None]
4
['table', 'sqlite_sequence', 'sqlite_sequence', 5, 'CREATE TABLE sqlite_sequence(name,seq)']

This is identical (albeit not quite so nicely formatted) to the output of selecting the rows of the table:

sqlite> SELECT * FROM sqlite_schema;

table|users|users|2|CREATE TABLE users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    username TEXT NOT NULL UNIQUE,
    email TEXT NOT NULL UNIQUE,
    password_hash TEXT NOT NULL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
)
index|sqlite_autoindex_users_1|users|3|
index|sqlite_autoindex_users_2|users|4|
table|sqlite_sequence|sqlite_sequence|5|CREATE TABLE sqlite_sequence(name,seq)

Paraphrasing the docs, the rows here each represent an object in the database. The columns represent (in order):

The type of object (table/index/trigger/etc)
The name of the object (in the case of a table, it’s the name of the table)
The table that the object references
- For a table, this is just the table’s name again
- For an index, this is the table that the index is on
The root page number for the object
The sql used to create the object. This is empty for the implicit indexes that sqlite created as a result of the UNIQUE constraints on the username & email columns

For now, all we need is the root page number of the users table: 2.

Reading the `users` table

Now that we know that the root of the users table is page 2, we can repeat this process for that tree! Except there’s a hitch. Since I populated this table with 1000 records (as opposed to the 4 records in the sqlite_schema table), this table doesn’t fit on one page. So we’ve got slightly more work to do here to find the leaf page.

Since we’re about to implement several page types, this seems like a good time to do a little refactoring so that the code can stay organized. Specifically, I’m splitting things out into an abstract base class called Page, which will have different implementations for each type of page we’ll encounter.

1
@dataclass
2
class Page(ABC):
3
    offset: int
4
    pageHeader: PageHeader
5
    cells: List[Cell]
6

7
    def __init__(self, file: BinaryIO):
8
        self.offset = file.tell()
9
        # If this is the first page of the database, the file-level header shifts everything by 100 bytes
10
        if self.offset <= 100:
11
            self.offset = 0
12
        self.cells = []
13
        self.pageHeader = PageHeader(file)
14

15
    @staticmethod
16
    def build_page(file: BinaryIO) -> "Page":
17
        page_type = BTreePageType(file.read(1)[0])
18
        file.seek(-1, os.SEEK_CUR)
19
        match page_type:
20
            case BTreePageType.TABLE_LEAF:
21
                return TableLeafPage(file)
22
            case BTreePageType.TABLE_INTERIOR:
23
                return TableInteriorPage(file)
24
            case _:
25
                raise ValueError(f"Unsupported page type: {page_type}")
26

27
    @abstractmethod
28
    def get_record(self, key: any) -> Optional["Record"]:
29
        pass
30

31

32
@dataclass
33
class TableLeafPage(Page):
34
    cells: List[TableLeafCell]
35

36
    def __init__(self, file: BinaryIO):
37
        super().__init__(file)
38

39
        self.offset = file.tell()
40
        self.cells = []
41
        self.pageHeader = PageHeader(file)
42

43
        num_cells = self.pageHeader.num_cells_in_page
44
        cell_ptrs = struct.unpack(f'>{num_cells}H', file.read(num_cells * 2))
45
        for ptr in cell_ptrs:
46
            file.seek(self.offset + ptr)
47
            self.cells.append(TableLeafCell(file, db.header.text_encoding))

I’ll also add a get_record function to the TableLeaf class. For the sqlite_schema table it was fine to do a linear scan to find the record we were looking for, but going forward we should be using a binary search to find things in $O(log(n))$ time:

1
@dataclass
2
class TableLeafPage(Page):
15 collapsed lines
3
    cells: List[TableLeafCell]
4

5
    def __init__(self, file: BinaryIO):
6
        super().__init__(file)
7

8
        self.offset = file.tell()
9
        self.cells = []
10
        self.pageHeader = PageHeader(file)
11

12
        num_cells = self.pageHeader.num_cells_in_page
13
        cell_ptrs = struct.unpack(f'>{num_cells}H', file.read(num_cells * 2))
14
        for ptr in cell_ptrs:
15
            file.seek(self.offset + ptr)
16
            self.cells.append(TableLeafCell(file, db.header.text_encoding))
17

18
    def get_record(self, row_id: int) -> Optional[Record]:
19
      # Use the stdlib bisect_left function to find the first cell with a key that's >= row_id
20
      idx = bisect_left(self.cells, row_id, key=lambda cell: cell.row_id)
21
      # Check if we actually found the cell we're looking for
22
      if idx < len(self.cells) and self.cells[idx].row_id == row_id:
23
          # ... and return it
24
          return self.cells[idx].record
25
      else:
26
          # or return None if the cell doesn't exist
27
          return None

While we’re at it, there are a few other housekeeping additions that should make life a lot easier. I’ll add a Database abstraction that makes it easy to actually get a Page object given a page id:

1
class Database:
2
    def __init__(self, file: BinaryIO):
3
        self.file = file
4
        self.header = FileHeader(file)
5
        self.schema_page = self.get_page(1)
6

7
    def get_root_page_num(self, target_object_name, target_object_type="table"):
8
        for record in self.schema_page.records:
9
            object_type, object_name, object_table, object_page, schema = record.values
10
            if object_name == target_object_name and object_type == target_object_type:
11
                return object_page
12

13
    def get_page(self, page_number) -> Page:
14
        page_offset = (page_number - 1) * self.header.page_size
15
        if page_offset == 0:
16
            page_offset += 100
17
        self.file.seek(page_offset)
18
        return Page.build_page(self, self.file)

And I’ll extend the Page abstract class to add a constructor for easily creating pages:

1
@dataclass
2
class Page(ABC):
3
    offset: int
4
    pageHeader: PageHeader
5
    cells: List[Cell]
6

7
    def __init__(self, file: BinaryIO):
8
        self.offset = file.tell()
9
        # If this is the first page of the database, the file-level header shifts everything by 100 bytes
10
        if self.offset <= 100:
11
            self.offset = 0
12
        self.cells = []
13
        self.pageHeader = PageHeader(file)
14

15
    @staticmethod
16
    def build_page(database: "Database", file: BinaryIO) -> "Page":
17
        page_type = BTreePageType(file.read(1)[0])
18
        file.seek(-1, os.SEEK_CUR)
19
        match page_type:
20
            case BTreePageType.TABLE_LEAF:
21
                return TableLeafPage(file, database)
22
            # canonically, I haven't written these other classes yet... just ignore them for now :P
23
            case BTreePageType.TABLE_INTERIOR:
24
                return TableInteriorPage(file, database)
25
            case BTreePageType.INDEX_INTERIOR:
26
                return IndexInteriorPage(file, database)
27
            case BTreePageType.INDEX_LEAF:
28
                return IndexLeafPage(file, database)
29
            case _:
30
                raise ValueError(f"Unsupported page type: {page_type}")

With this structure in place, it’s easy to extend the pattern to handle other page types. For now, the only additional one we need is a TableInteriorPage:

1
@dataclass
2
class TableInteriorPage(Page):
3
    cells: List[TableInteriorCell]
4

5
    def __init__(self, file: BinaryIO):
6
        super().__init__(file)
7

8
        num_cells = self.pageHeader.num_cells_in_page
9
        cell_ptrs = struct.unpack(f'>{num_cells}H', file.read(num_cells * 2))
10
        for ptr in cell_ptrs:
11
            file.seek(self.offset + ptr)
12
            self.cells.append(TableInteriorCell(file))
13

14
    def get_record(self, row_id) -> Optional[Record]:
15
      idx = bisect_left(self.cells, row_id, key=lambda cell: cell.key)
16
      if idx < len(self.cells) and self.cells[idx].key >= row_id:
17
          child_page = self.database.get_page(self.cells[idx].child_page_ptr)
18
      else:
19
          child_page = self.database.get_page(self.pageHeader.right_ptr)
20
      return child_page.get_record(row_id)

There are 2 main differences between a TableInteriorPage and a TableLeafPage. First, they contain TableInteriorCells and TableLeafCells, respectively. This is because the interior cells contain pointers to other pages, compared to leaf cells which contain the actual data inside the database records. As such, they’ve got somewhat different internal structures.

The more important difference here is how the get_record function works. When we find the cell we’re looking for, we get a page_ptr out of the cell instead of the actual database record. This pointer is then used with the Database abstraction I just added to find a new page. This new page might be another interior page, or it might be a leaf page.

Either way, we delegate the get_record call into the child page, and let the child handle it. Either it’ll be a leaf & we’ll get the record back, or we’ll recurse down interior pages until we finally get to a leaf.

And that’s all we need to read records out of the database if we know their rowid!

The original goal I set was to query the database for user id 450. Here’s how we can do that:

1
with open("./example.db", "rb") as file:
2
    db = Database(file)
3
    users_table_page_num = db.get_root_page_num("users")
4
    users_table_root = db.get_page(users_table_page_num)
5
    record = users_table_root.get_record(row_id)
6
    print(record.values)

which prints:

1
[None, 'user_450', 'user_450@example.com', 'password_450', '2025-01-02 05:44:00']

It’s worth mentioning what’s going on with the None value at the beginning. This has to do with that column being an INTEGER PRIMARY KEY AUTOINCREMENT. Like I mentioned earlier, that causes sqlite to use the rowid as the value of that column, rather than doing another level of indirection with an index. In this situation, sqlite still puts an entry in the database records for that column, it just sets it to null (which is what we’re seeing).

Using Indexes

But what if we didn’t use an INTEGER PRIMARY KEY AUTOINCREMENT? Well lets work through an example like that, extending the code to handle this situation.

If we didn’t start out knowing the rowid for our record, we’d first need to use an index to find it. An index is structured almost the same as the table. Each cell in an index leaf page corresponds to one row in the corresponding table, and the interior b-tree pages store pointers to child pages until we reach leaf pages. There are 2 main differences between the representation of a table and that of an index:

Instead of the keys of the cells being rowid’s, the key is a concatenation of the value in the column that’s indexed, followed by the rowid

So our user 450, the index for the email row has a key of user_450@example.com450

Instead of the values in the cells in the leaves holding the data for the table, the value is the rowid of a record

So the process for looking up a record based on an indexed column goes like this:

Use the sqlite_schema table to find the root page for an index on the column we need
Use the internal pages to locate the leaf page containing a cell for the record we want
Read the rowid for the record out of that leaf cell
Use that rowid to find the actual record, just like we did before when we looked up user number 450

Since we already have most of the abstractions necessary to do this, it’s fairly strait forward. I’ll need to add additional classes to represent the index internal/leaf pages, and additional classes to represent how cells are structured in index internal/leaf pages. Here’s what that looks like:

1
class IndexInteriorCell(Cell):
2
  child_page_ptr: int
3
  payload_size: int
4
  record: "Record"
5

6
  def __init__(self, file: BinaryIO, encoding: str):
7
      self.child_page_ptr = struct.unpack(">I", file.read(4))[0]
8
      self.payload_size = read_varint(file)
9
      self.record = Record(file, encoding)
10

11

12
class IndexLeafCell(Cell):
13
  payload_size: int
14
  record: "Record"
15

16
  def __init__(self, file: BinaryIO, encoding: str):
17
      self.payload_size = read_varint(file)
18
      self.record = Record(file, encoding)
19

20
@dataclass
21
class IndexInteriorPage(Page):
22
  database: "Database"
23
  cells: List[IndexInteriorCell]
24

25
  def __init__(self, file: BinaryIO, db: "Database"):
26
      super().__init__(file)
27

28
      self.database = db
29
      num_cells = self.pageHeader.num_cells_in_page
30
      cell_ptrs = struct.unpack(f'>{num_cells}H', file.read(num_cells * 2))
31
      for ptr in cell_ptrs:
32
          file.seek(self.offset + ptr)
33
          self.cells.append(IndexInteriorCell(file, db.header.text_encoding))
34

35
  def get_record(self, key: List[any]) -> Optional[Record]:
36
      idx = bisect_left(self.cells, key, key=lambda cell: cell.record.values)
37
      if idx < len(self.cells) and self.cells[idx].record.values >= key:
38
          child_page = self.database.get_page(self.cells[idx].child_page_ptr)
39
      else:
40
          child_page = self.database.get_page(self.pageHeader.right_ptr)
41
      return child_page.get_record(key)
42

43

44
class IndexLeafPage(Page):
45
  cells: List[IndexLeafCell]
46

47
  def __init__(self, file: BinaryIO, db: "Database"):
48
      super().__init__(file)
49
      self.database = db
50
      num_cells = self.pageHeader.num_cells_in_page
51
      cell_ptrs = struct.unpack(f'>{num_cells}H', file.read(num_cells * 2))
52
      for ptr in cell_ptrs:
53
          file.seek(self.offset + ptr)
54
          self.cells.append(IndexLeafCell(file, db.header.text_encoding))
55

56
  def get_record(self, key: List[any]) -> Optional[Record]:
57
      idx = bisect_left(self.cells, key, key=lambda cell: cell.record.values)
58
      if idx >= len(self.cells) or self.cells[idx].record.values[:len(key)] != key:
59
          return None
60
      cell_row_id = self.cells[idx].record.values[-1]
61
      return get_user_info_by_id(self.database, cell_row_id)
62

63
def get_user_info_by_id(db, row_id) -> Record | None:
64
  users_table_page_num = db.get_root_page_num("users")
65
  users_table_root = db.get_page(users_table_page_num)
66
  record = users_table_root.get_record(row_id)
67
  return record

This should all look pretty familiar from the first part, since it’s all nearly the same. The biggest difference is line 61. When we get all the way down to a leaf page and find the correct record, instead of just returning the payload of that cell (after some parsing) we pass the payload of that cell into get_user_info_by_id, using the value of the leaf index cell as the id for looking up the real database record.

Parsing records

I glossed over one important detail above, which is the actual format that the data is stored inside the payload of the cells.

The first part of decoding cells is parsing integers. Sqlite uses a format for storing integers that they call a varint (variable length int). This format supports storing ints that are between 1 and 8 bytes long. They use the first bit of each byte to signify whether there are more bytes to come, or if you’re done reading the number. If you reach the 9th byte, then you automatically know that this is the last byte, so the full byte is used. Here’s a (probably not very efficient) implementation of that:

1
def read_varint(file):
2
  ans = 0
3
  for i in range(9):            # Process at most 9 bytes of data
4
      val = file.read(1)[0]      # Read one byte
5
      if i < 8:                  # If we're on the first 8 bytes
6
          ans <<= 7                # shift the answer left 7 bits
7
          ans += val & ~0x80       # and add the low 7 bits to the answer
8
      else:                      # If we're on the final byte
9
          ans <<= 8                # shift a full 8 bytes
10
          ans += val              # and add the full 8 bits to the answer
11
      if not (val & 0x80):
12
          break
13
  return ans

Another detail I skipped is the payload of the cells in the table leaf pages are actually structured as a record, with its own header. The header for a cell defines the data type for each column in the record, and then the data for the columns follows immediately after that. The reason they save a seperate header for each cell instead of just having one per table is that if a new column is added to the table after records already exist, new cells might get created with columns that don’t exist in the previous cells. Or put differently, rows in a sqlite table don’t necessarily have the same number of columns.

The header is a series of varints. The first varint tells us how long the header is, and the rest of the varints are basically an enum that corresponds to the data type for the column (see the docs for the specifics). Here’s what the process of parsing a record looks like:

1
class Record:
2
  def __init__(self, file: BinaryIO, encoding: str):
3
      self.values = []
4
      record_start_offset = file.tell()
5
      header_size = read_varint(file)
6
      dtypes = []
7
      while file.tell() - record_start_offset < header_size: # varints until we get to the end of the header
8
          dtypes.append(read_varint(file))
9

10
      for i, dtype in enumerate(dtypes): # use the values we got in the header to interpret each value in the record
11
          field_offset = file.tell()
12
          match dtype:
13
              case 0:
14
                  val = None
15
              case 1:
16
                  val = file.read(1)[0]
17
              case 2:
18
                  val = struct.unpack(">h", file.read(2))[0]
19
              case 3:
20
                  val = struct.unpack(">i", b'\x00' + file.read(3))[0]
21
              case 4:
22
                  val = struct.unpack(">i", file.read(4))[0]
23
              case 5:
24
                  val = struct.unpack(">q", b'\x00\x00' + file.read(6))
25
              case 6:
26
                  val = struct.unpack(">q", file.read(8))
27
              case 7:
28
                  # Not sure if this is the right binary IEEE 754 float representation. We'll see...
29
                  val = struct.unpack(">d", file.read(8))
30
              case 8:
31
                  val = 0
32
              case 9:
33
                  val = 1
34
              case x if x >= 12 and x % 2 == 0:  # blob
35
                  size = (x - 12) // 2
36
                  val = file.read(size)
37
              case x if x >= 13 and x % 2 == 1:  # text
38
                  size = (x - 13) // 2
39
                  val = file.read(size)
40
                  val = val.decode(encoding)
41
              case _:
42
                  raise ValueError(
43
                      f"Unexpected dtype: {dtype} at record starting at offset: 0x{record_start_offset:08x}, field: {i} (of {len(dtypes)} fields) at offset: {field_offset:08x}")
44

45
          self.values.append(val)

The very last implementation detail I skipped was the actual definitions for the different types of cells. I don’t think they’re all that interesting, but for completeness here’s what they look like:

1
class Cell(ABC):
2
    pass
3

4

5
class TableLeafCell(Cell):
6
    row_id: int
7
    payload_size: int
8
    record: "Record"
9

10
    def __init__(self, file, encoding):
11
        self.payload_size = read_varint(file)
12
        self.row_id = read_varint(file)
13
        self.record = Record(file, encoding)
14

15

16
class TableInteriorCell(Cell):
17
    child_page_ptr: int
18
    key: int
19

20
    def __init__(self, file: BinaryIO):
21
        self.child_page_ptr = struct.unpack(">I", file.read(4))[0]
22
        self.key = read_varint(file)
23

24

25
class IndexInteriorCell(Cell):
26
    child_page_ptr: int
27
    payload_size: int
28
    record: "Record"
29

30
    def __init__(self, file: BinaryIO, encoding: str):
31
        self.child_page_ptr = struct.unpack(">I", file.read(4))[0]
32
        self.payload_size = read_varint(file)
33
        self.record = Record(file, encoding)
34

35

36
class IndexLeafCell(Cell):
37
    payload_size: int
38
    record: "Record"
39

40
    def __init__(self, file: BinaryIO, encoding: str):
41
        self.payload_size = read_varint(file)
42
        self.record = Record(file, encoding)

And that’s it! If you got this far, thanks for reading!