Abstraction for Vanilla GC

This section presents an initial abstraction of the entities in Kopia and the interactions between them. It is the starting point I arrived at from reading the architecture document and the code. Later sections explain some techniques used to further simplify or modify the abstraction; they are general enough to be used in specifications for other systems.

Kopia stores its data in a data structure called a repository, which resides on remote cloud storage. The atomic unit of data in a repository is called a content. Each content has a unique content id, which is a deterministic hash of the content data. A group of contents (or a group of index entries that point to contents; explained later) is stored together in a blob. Contents are stored in data blobs and index entries are stored in index blobs. Each blob has a randomly generated blob id. A new blob can be written to the repository only as a whole, never partially. Once written, a blob can’t be modified.

A client specifies a filesystem root to snapshot and upload to the repository. When a snapshotting process writes contents, they are packed (i.e., appended) over time into a local data blob¹, and the data blob is written to the repository once its size crosses some threshold. Alongside the contents, which reach the repository in batches of approximately blob size, index blobs are written to the repository; they contain the information needed to find the data corresponding to a content id. When contents are packed into a data blob, the corresponding index entries (with information about where to find the content later) are packed into a local index blob, which is periodically flushed to the repository and then reset to be empty.

Each index blob is a set of entries, each consisting of: the content id of the newly written content, the data blob id and the offset in the data blob at which the content was written, a timestamp of when the content was packed into the local data blob, and a flag indicating whether the entry marks the content for that content id as deleted (more on how this flag works shortly). Each index blob usually contains entries for contents from multiple data blobs that have been written to the repository since the last index blob flush. At any given time, only one local index blob and one local data blob are waiting to be written to the remote repository.
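The packing and flushing behaviour described above can be sketched in a few lines of Python. This is only a toy model of the abstraction, not Kopia's implementation: the names `IndexEntry`, `Repository`, `Writer`, and `max_blob_size` are hypothetical, SHA-256 stands in for whatever hash the repository is configured with, and flushing the pending data blob before each index flush is an assumption of the sketch.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IndexEntry:
    content_id: str        # deterministic hash of the content data
    blob_id: str           # data blob that holds the content
    offset: int            # position of the content inside that data blob
    timestamp: int         # when the content was packed into the local data blob
    deleted: bool = False  # deletion marker (semantics not modelled here)


@dataclass
class Repository:
    # Blobs are written whole and never modified afterwards.
    data_blobs: dict = field(default_factory=dict)   # blob id -> bytes
    index_blobs: dict = field(default_factory=dict)  # blob id -> tuple of IndexEntry


class Writer:
    """One snapshotting process with one local data blob and one local index blob."""

    def __init__(self, repo, max_blob_size):
        self.repo = repo
        self.max_blob_size = max_blob_size  # hypothetical size threshold
        self.local_index = []
        self._new_data_blob()

    def _new_data_blob(self):
        # The blob id is random and chosen up front so index entries can refer to it.
        self.data_blob_id = uuid.uuid4().hex
        self.local_data = bytearray()

    def write_content(self, data, now):
        content_id = hashlib.sha256(data).hexdigest()  # stand-in hash
        entry = IndexEntry(content_id, self.data_blob_id, len(self.local_data), now)
        self.local_index.append(entry)
        self.local_data += data  # pack (append) into the local data blob
        if len(self.local_data) >= self.max_blob_size:
            self.flush_data()
        return content_id

    def flush_data(self):
        if self.local_data:
            self.repo.data_blobs[self.data_blob_id] = bytes(self.local_data)
            self._new_data_blob()

    def flush_index(self):
        # Assumption of this sketch: write the pending data blob first, so an
        # index blob never points at a data blob that is not in the repository.
        self.flush_data()
        if self.local_index:
            self.repo.index_blobs[uuid.uuid4().hex] = tuple(self.local_index)
            self.local_index = []  # reset the local index blob


repo = Repository()
writer = Writer(repo, max_blob_size=8)
writer.write_content(b"aaaa", now=1)
writer.write_content(b"bbbb", now=2)  # threshold crossed: first data blob is written
writer.write_content(b"cc", now=3)
writer.flush_index()                  # writes the second data blob, then one index blob
```

After this run the toy repository holds two data blobs and a single index blob whose three entries refer to both of them, which matches the observation that one index blob usually covers several data blobs.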

Tying it all together, a repository contains index blobs and data blobs. The data blobs contain contents, which are referenced by entries in the index blobs. Keep in mind that in all figures, the content id (such as C4) is some hash of the content data stored in the content. The content id provides content-addressability, i.e., any snapshot process can reuse a content written earlier (perhaps by another snapshot process) by looking up the hash of the content data it is about to write. All index entries for a data blob are found in the same index blob. Below is a sample depiction of three data blobs and two index blobs in a repository.
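Content-addressable reuse can also be sketched in code. The snippet below is a minimal illustration under stated assumptions, not Kopia's lookup path: the blob ids (`D1` to `D3`, `I1`, `I2`) and payloads are made up, SHA-256 is a stand-in hash, `build_lookup` and `needs_write` are hypothetical helpers, and the timestamp and deleted flag of each index entry are left out for brevity.

```python
import hashlib


def content_id(data):
    # Stand-in for the repository's deterministic content hash.
    return hashlib.sha256(data).hexdigest()


# Hypothetical repository state: two index blobs describing three data blobs.
# Each entry is (content id, data blob id, offset in the data blob).
index_blobs = {
    "I1": [
        (content_id(b"alpha"), "D1", 0),
        (content_id(b"beta"), "D1", 5),
        (content_id(b"gamma"), "D2", 0),
    ],
    "I2": [
        (content_id(b"delta"), "D3", 0),
    ],
}


def build_lookup(index_blobs):
    """Merge all index blobs into one map: content id -> (data blob id, offset)."""
    lookup = {}
    for entries in index_blobs.values():
        for cid, blob_id, offset in entries:
            lookup[cid] = (blob_id, offset)
    return lookup


def needs_write(data, lookup):
    """A content is written only if its hash is not already in the index."""
    return content_id(data) not in lookup


lookup = build_lookup(index_blobs)
reuse_alpha = not needs_write(b"alpha", lookup)  # already present, can be reused
write_epsilon = needs_write(b"epsilon", lookup)  # not present, must be written
```

Because the key is a hash of the data itself, a second snapshot process that encounters the same bytes computes the same content id and finds the existing entry without reading any data blob.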

¹ By local data/index blobs I mean blobs maintained locally in a process’s memory.