350 lines
16 KiB
Rust
350 lines
16 KiB
Rust
// Copyright 2019-2020 Google LLC
|
|
//
|
|
// Licensed under the Apache License, Version 2.0 (the "License");
|
|
// you may not use this file except in compliance with the License.
|
|
// You may obtain a copy of the License at
|
|
//
|
|
// http://www.apache.org/licenses/LICENSE-2.0
|
|
//
|
|
// Unless required by applicable law or agreed to in writing, software
|
|
// distributed under the License is distributed on an "AS IS" BASIS,
|
|
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
// See the License for the specific language governing permissions and
|
|
// limitations under the License.
|
|
|
|
// TODO(ia0): Add links once the code is complete.
|
|
//! Store abstraction for flash storage
|
|
//!
|
|
//! # Specification
|
|
//!
|
|
//! The store provides a partial function from keys to values on top of a storage
|
|
//! interface. The store total capacity depends on the size of the storage. Store
|
|
//! updates may be bundled in transactions. Mutable operations are atomic, including
|
|
//! when interrupted.
|
|
//!
|
|
//! The store is flash-efficient in the sense that it uses the storage lifetime
|
|
//! efficiently. For each page, all words are written at least once between erase
|
|
//! cycles and all erase cycles are used. However, not all written words are user
|
|
//! content: lifetime is also consumed with metadata and compaction.
|
|
//!
|
|
//! The store is extendable with other entries than key-values. It is essentially a
|
|
//! framework providing access to the storage lifetime. The partial function is
|
|
//! simply the most common usage and can be used to encode other usages.
|
|
//!
|
|
//! ## Definitions
|
|
//!
|
|
//! An _entry_ is a pair of a key and a value. A _key_ is a number between 0
|
|
//! and 4095. A _value_ is a byte slice with a length between 0 and 1023 bytes (for
|
|
//! large enough pages).
|
|
//!
|
|
//! The store provides the following _updates_:
|
|
//! - Given a key and a value, `Insert` updates the store such that the value is
|
|
//! associated with the key. The value for other keys are left unchanged.
|
|
//! - Given a key, `Remove` updates the store such that no value is associated for
|
|
//! the key. The value for other keys are left unchanged. Additionally, if there
|
|
//! was a value associated with the key, the value is wiped from the storage
|
|
//! (all its bits are set to 0).
|
|
//!
|
|
//! The store provides the following _read-only operations_:
|
|
//! - `Iter` iterates through the store returning all entries exactly once. The
|
|
//! iteration order is not specified but stable between mutable operations.
|
|
//! - `Capacity` returns how many words can be stored before the store is full.
|
|
//! - `Lifetime` returns how many words can be written before the storage lifetime
|
|
//! is consumed.
|
|
//!
|
|
//! The store provides the following _mutable operations_:
|
|
//! - Given a set of independent updates, `Transaction` applies the sequence of
|
|
//! updates.
|
|
//! - Given a threshold, `Clear` removes all entries with a key greater or equal
|
|
//! to the threshold.
|
|
//! - Given a length in words, `Prepare` makes one step of compaction unless that
|
|
//! many words can be written without compaction. This operation has no effect
|
|
//! on the store but may still mutate its storage. In particular, the store has
|
|
//! the same capacity but a possibly reduced lifetime.
|
|
//!
|
|
//! A mutable operation is _atomic_ if, when power is lost during the operation, the
|
|
//! store is either updated (as if the operation succeeded) or left unchanged (as if
|
|
//! the operation did not occur). If the store is left unchanged, lifetime may still
|
|
//! be consumed.
|
|
//!
|
|
//! The store relies on the following _storage interface_:
|
|
//! - It is possible to read a byte slice. The slice won't span multiple pages.
|
|
//! - It is possible to write a word slice. The slice won't span multiple pages.
|
|
//! - It is possible to erase a page.
|
|
//! - The pages are sequentially indexed from 0. If the actual underlying storage
|
|
//! is segmented, then the storage layer should translate those indices to
|
|
//! actual page addresses.
|
|
//!
|
|
//! The store has a _total capacity_ of `C = (N - 1) * (P - 4) - M - 1` words, where
|
|
//! `P` is the number of words per page, `N` is the number of pages, and `M` is the
|
|
//! maximum length in words of a value (256 for large enough pages). The capacity
|
|
//! used by each mutable operation is given below (a transient word only uses
|
|
//! capacity during the operation):
|
|
//! - `Insert` uses `1 + ceil(len / 4)` words where `len` is the length of the
|
|
//! value in bytes. If an entry was replaced, the words used by its insertion
|
|
//! are freed.
|
|
//! - `Remove` doesn't use capacity if alone in the transaction and 1 transient
|
|
//! word otherwise. If an entry was deleted, the words used by its insertion are
|
|
//! freed.
|
|
//! - `Transaction` uses 1 transient word. In addition, the updates of the
|
|
//! transaction use and free words as described above.
|
|
//! - `Clear` doesn't use capacity and frees the words used by the insertion of
|
|
//! the deleted entries.
|
|
//! - `Prepare` doesn't use capacity.
|
|
//!
|
|
//! The _total lifetime_ of the store is below `L = ((E + 1) * N - 1) * (P - 2)` and
|
|
//! above `L - M` words, where `E` is the maximum number of erase cycles. The
|
|
//! lifetime is used when capacity is used, including transiently, as well as when
|
|
//! compaction occurs. The more the store is loaded (few remaining words of
|
|
//! capacity), the more compactions are frequent, and the more lifetime is used.
|
|
//!
|
|
//! It is possible to approximate the cost of transient words in terms of capacity:
|
|
//! `L` transient words are equivalent to `C - x` words of capacity where `x` is the
|
|
//! average capacity (including transient) of operations.
|
|
//!
|
|
//! ## Preconditions
|
|
//!
|
|
//! The store may behave in unexpected ways if the following assumptions don't hold:
|
|
//! - A word can be written twice between erase cycles.
|
|
//! - A page can be erased `E` times after the first boot of the store.
|
|
//! - When power is lost while writing a slice or erasing a page, the next read
|
|
//! returns a slice where a subset (possibly none or all) of the bits that
|
|
//! should have been modified have been modified.
|
|
//! - Reading a slice is deterministic. When power is lost while writing a slice
|
|
//! or erasing a slice (erasing a page containing that slice), reading that
|
|
//! slice repeatedly returns the same result (until it is overwritten or its
|
|
//! page is erased).
|
|
//! - To decide whether a page has been erased, it is enough to test if all its
|
|
//! bits are equal to 1.
|
|
//! - When power is lost while writing a slice or erasing a page, that operation
|
|
//! does not count towards the limits. However, completing that write or erase
|
|
//! operation would count towards the limits, as if the number of writes per
|
|
//! word and number of erase cycles could be fractional.
|
|
//! - The storage is only modified by the store. Note that completely erasing the
|
|
//! storage is supported, essentially losing all content and lifetime tracking.
|
|
//! It is preferred to use `Clear` with a threshold of 0 to keep the lifetime
|
|
//! tracking.
|
|
//!
|
|
//! The store properties may still hold outside some of those assumptions but with
|
|
//! weaker probabilities as the usage diverges from them.
|
|
//!
|
|
//! # Implementation
|
|
//!
|
|
//! We define the following constants:
|
|
//! - `E < 65536` the number of times a page can be erased.
|
|
//! - `3 <= N < 64` the number of pages in the storage.
|
|
//! - `8 <= P <= 1024` the number of words in a page.
|
|
//! - `Q = P - 2` the number of words in a virtual page.
|
|
//! - `K = 4096` the maximum number of keys.
|
|
//! - `M = min(Q - 1, 256)` the maximum length in words of a value.
|
|
//! - `V = (N - 1) * (Q - 1) - M` the virtual capacity.
|
|
//! - `C = V - N` the user capacity.
|
|
//!
|
|
//! We build a virtual storage from the physical storage using the first 2 words of
|
|
//! each page:
|
|
//! - The first word contains the number of times the page has been erased.
|
|
//! - The second word contains the starting word to which this page is being moved
|
|
//! during compaction.
|
|
//!
|
|
//! The virtual storage has a length of `(E + 1) * N * Q` words and represents the
|
|
//! lifetime of the store. (We reserve the last `Q + M` words to support adding
|
|
//! emergency lifetime.) This virtual storage has a linear address space.
|
|
//!
|
|
//! We define a set of overlapping windows of `N * Q` words at each `Q`-aligned
|
|
//! boundary. We call `i` the window spanning from `i * Q` to `(i + N) * Q`. Only
|
|
//! those windows actually exist in the underlying storage. We use compaction to
|
|
//! shift the current window from `i` to `i + 1`, preserving the content of the
|
|
//! store.
|
|
//!
|
|
//! For a given state of the virtual storage, we define `h_i` as the position of the
|
|
//! first entry of the window `i`. We call it the head of the window `i`. Because
|
|
//! entries are at most `M + 1` words, they can overlap on the next page only by `M`
|
|
//! words. So we have `i * Q <= h_i <= i * Q + M` . Since there are no entries
|
|
//! before the first page, we have `h_0 = 0`.
|
|
//!
|
|
//! We define `t_i` as one past the last entry of the window `i`. If there are no
|
|
//! entries in that window, we have `t_i = h_i`. We call `t_i` the tail of the
|
|
//! window `i`. We define the compaction invariant as `t_i - h_i <= V`.
|
|
//!
|
|
//! We define `|x|` as the capacity used before position `x`. We have `|x| <= x`. We
|
|
//! define the capacity invariant as `|t_i| - |h_i| <= C`.
|
|
//!
|
|
//! Using this virtual storage, entries are appended to the tail as long as there is
|
|
//! both virtual capacity to preserve the compaction invariant and capacity to
|
|
//! preserve the capacity invariant. When virtual capacity runs out, the first page
|
|
//! of the window is compacted and the window is shifted.
|
|
//!
|
|
//! Entries are identified by a prefix of bits. The prefix has to contain at least
|
|
//! one bit set to zero to differentiate from the tail. Entries can be one of:
|
|
//! - Padding: A word whose first bit is set to zero. The rest is arbitrary. This
|
|
//! entry is used to mark words partially written after an interrupted operation
|
|
//! as padding such that they are ignored by future operations.
|
|
//! - Header: A word whose second bit is set to zero. It contains the following fields:
|
|
//! - A bit indicating whether the entry is deleted.
|
|
//! - A bit indicating whether the value is word-aligned and has all bits set
|
|
//! to 1 in its last word. The last word of an entry is used to detect that
|
|
//! an entry has been fully written. As such it must contain at least one
|
|
//! bit equal to zero.
|
|
//! - The key of the entry.
|
|
//! - The length in bytes of the value. The value follows the header. The
|
|
//! entry is word-aligned if the value is not.
|
|
//! - The checksum of the first and last word of the entry.
|
|
//! - Erase: A word used during compaction. It contains the page to be erased and
|
|
//! a checksum.
|
|
//! - Clear: A word used during the `Clear` operation. It contains the threshold
|
|
//! and a checksum.
|
|
//! - Marker: A word used during the `Transaction` operation. It contains the
|
|
//! number of updates following the marker and a checksum.
|
|
//! - Remove: A word used during the `Transaction` operation. It contains the key
|
|
//! of the entry to be removed and a checksum.
|
|
//!
|
|
//! Checksums are the number of bits equal to 0.
|
|
//!
|
|
//! # Proofs
|
|
//!
|
|
//! ## Compaction
|
|
//!
|
|
//! It should always be possible to fully compact the store, after what the
|
|
//! remaining capacity should be available in the current window (restoring the
|
|
//! compaction invariant). We consider all notations on the virtual storage after
|
|
//! the full compaction. We will use the `|x|` notation although we update the state
|
|
//! of the virtual storage. This is fine because compaction doesn't change the
|
|
//! status of an existing word.
|
|
//!
|
|
//! We want to show that the next `N - 1` compactions won't move the tail past the
|
|
//! last page of their window, with `I` the initial window:
|
|
//!
|
|
//! ```
|
|
//! forall 1 <= i <= N - 1, t_{I + i} <= (I + i + N - 1) * Q
|
|
//! ```
|
|
//!
|
|
//! We assume `i` between `1` and `N - 1`.
|
|
//!
|
|
//! One step of compaction advances the tail by how many words were used in the
|
|
//! first page of the window with the last entry possibly overlapping on the next
|
|
//! page.
|
|
//!
|
|
//! ```
|
|
//! forall j, t_{j + 1} = t_j + |h_{j + 1}| - |h_j| + 1
|
|
//! ```
|
|
//!
|
|
//! By induction, we have:
|
|
//!
|
|
//! ```
|
|
//! t_{I + i} <= t_I + |h_{I + i}| - |h_I| + i
|
|
//! ```
|
|
//!
|
|
//! We have the following properties:
|
|
//!
|
|
//! ```
|
|
//! t_I <= h_I + V
|
|
//! |h_{I + i}| - |h_I| <= h_{I + i} - h_I
|
|
//! h_{I + i} <= (I + i) * Q + M
|
|
//! ```
|
|
//!
|
|
//! Replacing into our previous equality, we can conclude:
|
|
//!
|
|
//! ```
|
|
//! t_{I + i} = t_I + |h_{I + i}| - |h_I| + i
|
|
//! <= h_I + V + (I + i) * Q + M - h_I + i
|
|
//! = (N - 1) * (Q - 1) - M + (I + i) * Q + M + i
|
|
//! = (N - 1) * (Q - 1) + (I + i) * Q + i
|
|
//! = (I + i + N - 1) * Q + i - (N - 1)
|
|
//! <= (I + i + N - 1) * Q
|
|
//! ```
|
|
//!
|
|
//! We also want to show that after `N - 1` compactions, the remaining capacity is
|
|
//! available without compaction.
|
|
//!
|
|
//! ```
|
|
//! V - (t_{I + N - 1} - h_{I + N - 1}) >= // The available words in the window.
|
|
//! C - (|t_{I + N - 1}| - |h_{I + N - 1}|) // The remaining capacity.
|
|
//! + 1 // Reserved for Clear.
|
|
//! ```
|
|
//!
|
|
//! We can replace the definition of `C` and simplify:
|
|
//!
|
|
//! ```
|
|
//! V - (t_{I + N - 1} - h_{I + N - 1}) >= V - N - (|t_{I + N - 1}| - |h_{I + N - 1}|) + 1
|
|
//! iff t_{I + N - 1} - h_{I + N - 1} <= |t_{I + N - 1}| - |h_{I + N - 1}| + N - 1
|
|
//! ```
|
|
//!
|
|
//! We have the following properties:
|
|
//!
|
|
//! ```
|
|
//! t_{I + N - 1} = t_I + |h_{I + N - 1}| - |h_I| + N - 1
|
|
//! |t_{I + N - 1}| - |h_{I + N - 1}| = |t_I| - |h_I| // Compaction preserves capacity.
|
|
//! |h_{I + N - 1}| - |t_I| <= h_{I + N - 1} - t_I
|
|
//! ```
|
|
//!
|
|
//! From which we conclude:
|
|
//!
|
|
//! ```
|
|
//! t_{I + N - 1} - h_{I + N - 1} <= |t_{I + N - 1}| - |h_{I + N - 1}| + N - 1
|
|
//! iff t_I + |h_{I + N - 1}| - |h_I| + N - 1 - h_{I + N - 1} <= |t_I| - |h_I| + N - 1
|
|
//! iff t_I + |h_{I + N - 1}| - h_{I + N - 1} <= |t_I|
|
|
//! iff |h_{I + N - 1}| - |t_I| <= h_{I + N - 1} - t_I
|
|
//! ```
|
|
//!
|
|
//!
|
|
//! ## Checksum
|
|
//!
|
|
//! The main property we want is that all partially written/erased words are either
|
|
//! the initial word, the final word, or invalid.
|
|
//!
|
|
//! We say that a bit sequence `TARGET` is reachable from a bit sequence `SOURCE` if
|
|
//! both have the same length and `SOURCE & TARGET == TARGET` where `&` is the
|
|
//! bitwise AND operation on bit sequences of that length. In other words, when
|
|
//! `SOURCE` has a bit equal to 0 then `TARGET` also has that bit equal to 0.
|
|
//!
|
|
//! The only written entries start with `101` or `110` and are written from an
|
|
//! erased word. Marking an entry as padding or deleted is a single bit operation,
|
|
//! so the property trivially holds. For those cases, the proof relies on the fact
|
|
//! that there is exactly one bit equal to 0 in the 3 first bits. Either the 3 first
|
|
//! bits are still `111` in which case we expect the remaining bits to bit equal
|
|
//! to 1. Otherwise we can use the checksum of the given type of entry because those
|
|
//! 2 types of entries are not reachable from each other.
|
|
//!
|
|
//! To show that valid entries of a given type are not reachable from each other, we
|
|
//! show 3 lemmas:
|
|
//!
|
|
//! 1. A bit sequence is not reachable from another if its number of bits equal to
|
|
//! 0 is smaller.
|
|
//!
|
|
//! 2. A bit sequence is not reachable from another if they have the same number of
|
|
//! bits equals to 0 and are different.
|
|
//!
|
|
//! 3. A bit sequence is not reachable from another if it is bigger when they are
|
|
//! interpreted as numbers in binary representation.
|
|
//!
|
|
//! From those lemmas we consider the 2 cases. If both entries have the same number
|
|
//! of bits equal to 0, they are either equal or not reachable from each other
|
|
//! because of the second lemma. If they don't have the same number of bits equal to
|
|
//! 0, then the one with less bits equal to 0 is not reachable from the other
|
|
//! because of the first lemma and the one with more bits equal to 0 is not
|
|
//! reachable from the other because of the third lemma and the definition of the
|
|
//! checksum.
|
|
//!
|
|
//! # Fuzzing
|
|
//!
|
|
//! For any sequence of operations and interruptions starting from an erased
|
|
//! storage, the store is checked against its model and some internal invariant at
|
|
//! each step.
|
|
//!
|
|
//! For any sequence of operations and interruptions starting from an arbitrary
|
|
//! storage, the store is checked not to crash.
|
|
|
|
#![cfg_attr(not(feature = "std"), no_std)]
|
|
|
|
#[macro_use]
|
|
extern crate alloc;
|
|
|
|
#[macro_use]
|
|
mod bitfield;
|
|
mod format;
|
|
mod storage;
|
|
mod store;
|
|
|
|
pub use self::storage::{Storage, StorageError, StorageIndex, StorageResult};
|
|
pub use self::store::{StoreError, StoreResult};
|