The Loom
Vere's main memory model is called the loom. (Presumably this is from the roads shuttling back and forth, and perhaps mixing a metaphor.) A contiguous block of memory, formerly fixed at 2GB but now specified by the `--loom` runtime flag, is allocated for the loom. This is the noun arena, and to work on it we need to use special `u3`-specific allocators (`u3a`).
One standard contemporary memory model afforded by an operating system has a heap for manual dynamic memory allocation (C `malloc()`) and a stack for local variable data (last-in-first-out, C `alloca()` but also implicit). The heap grows from the bottom up and the stack from the top down. (The alternative model today is to use SunOS-style `mmap()` to allow virtual memory paging anywhere in memory.)
```
0           brk                                          ffff
|  heap      |                                  stack       |
|------------#################################+++++++++++++|
|                                             |             |
0                                             sp         ffff
```
- `brk` is the `brk()` System V call, which marked the limit of the heap arena.
- `sp` is the stack pointer.
`u3` differs from this model in one particular: by permitting the heap and stack to point either way, we can efficiently nest pairs of stack and heap. We call such a pair (and their free memory arena) a road. The outermost road is the surface road, and inner roads are created in alternating directions when dependent calculations are embarked upon.
When a new inner road is created, it switches direction from the outer road. This puts its heap up against the outer road's stack, and its stack up against the outer road's heap. But when the road is terminated, its stack is freed while its heap becomes part of the outer road's stack.
A conventional heap-low-stack-high road is a north road:
```
0            rut    hat                                 ffff
|             |      |                                     |
|~~~~~~~~~~~~-------##########################+++++++$~~~~~|
|                                             |      |     |
0                                            cap    mat ffff
```
while a reversed heap-high-stack-low road is a south road:
```
0            mat    cap                                 ffff
|             |      |                                     |
|~~~~~~~~~~~~$++++++##########################--------~~~~~|
|                                             |      |     |
0                                            hat    rut ffff
```
- `cap` is the top of the stack.
- `mat` is the bottom of the stack (`ffff` in the example surface road).
- `rut` is the bottom of the heap arena (not `0` because of immutable storage).
- `hat` is the top of the heap arena.
- `~` is deep storage (immutable).
- `-` is durable storage (heap).
- `+` is temporary storage (stack).
- `$` is the allocation frame.
- `#` is free memory.
The motivation for the road model is that you need not update refcounts in senior memory. This diminishes the downsides of reference counting. (The upside of refcounting is deterministic finalization: automatic memory management comes from the properties of the computation itself rather than from external preemptive events. Refcounting gives eager determinism.)
When we need to process an event or perform any kind of complicated calculation, we process it using an inner road. Because the roads alternate direction, any data from an inner heap that needs to be preserved must be copied back out. The advantages, though, are:
- The surface road is left read-only by `u3` and thus clean; when snapshotting, clean pages are kept clean.
- An inner calculation can be aborted without affecting the surface.
- The surface is not fragmented, because inner results are copied in only when necessary.
Vere ends up in operation with nested roads during a computation:
```
0                                                      ffff
|-------                                   ++++++++++++|     surface road
|~~~~~~~$+++++++++            ---------~~~~~~~~~~~~~~~~|     inner road 1
|~~~~~~~~~~~~~~~~~---------        +++++$~~~~~~~~~~~~~~|     inner road 2
|                         ##&##                        |     free memory
```
The Vere interpreter runs in a road, and you can check whether you're on the surface road or in an inner road. (Most of the time you should just assume you're on an inner road.) This distinction matters for functions such as `c3_assert()`, which produces an exception with a stack trace on an inner road but kills the process on the surface road.
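A minimal sketch of such a check, assuming the `u3R` (current road) and `u3H` (home road) globals from the `u3` headers:

```c
/* Sketch: are we on the surface road?  The surface road is the
** home road embedded in u3H; u3R points at the current road.
*/
static c3_o
_on_surface_road(void)
{
  return ( u3R == &(u3H->rod_u) ) ? c3y : c3n;
}
```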
The current road is `u3R`, a global. Within Arvo, a new road is currently begun in the following cases:
- every event
- every read from the namespace (the Arvo scry gate into a vane doesn't, but userspace `.^` dotket does)
- every call to `++mink` (and `++mock`, etc.)
Any work where you only want to keep a portion of the result is a good candidate for a road.
The loom is organized into pages, each 16 KB in size. There is a guard page (`&` in the diagram above) in the middle of free memory to make sure that the stack and heap do not overwrite each other. The guard page is adjusted in `u3e_ward()` when necessary:
When a fault is detected in the guard page, the guard page is recentered in the free space of the current road. If the guard page cannot be recentered, then memory exhaustion has occurred.
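A sketch of that recentering logic, with hypothetical helper and variable names; the real version is `u3e_ward()` in `noun/events.c`, and page alignment is simplified here (a north road is assumed, with `hat` below `cap`):

```c
#include <stdint.h>
#include <sys/mman.h>

#define PAG_SIZ 16384  /* one loom page: 16 KB */

/* Recenter the guard page in the free region between the heap
** limit (hat) and stack limit (cap) of the current road.
** Returns -1 when no new position exists: memory is exhausted.
*/
static int
_ward_recenter(uint8_t* hat_y, uint8_t* cap_y, uint8_t** gar_y)
{
  /* midpoint of the free region, rounded down to a page boundary */
  uint8_t* mid_y = hat_y + (((cap_y - hat_y) / 2) & ~(intptr_t)(PAG_SIZ - 1));

  if ( mid_y == *gar_y ) {
    return -1;  /* cannot recenter: memory exhaustion */
  }

  mprotect(*gar_y, PAG_SIZ, PROT_READ | PROT_WRITE);  /* release old guard */
  mprotect(mid_y,  PAG_SIZ, PROT_NONE);               /* arm the new one   */
  *gar_y = mid_y;
  return 0;
}
```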
There is hope of adding both a raw hint to suggest that the system use a new road, and a bump-allocation mode to permit turning off refcounting on inner roads when practical (cf. #6805).
`u3a` Allocator
To work with `u3` memory, use the `u3a` memory allocation functions:

- `u3a_malloc()`
- `u3a_free()`
- `u3a_realloc()`
You should never call `malloc()` in the loom (but can, of course, in the Vere layer above `u3`).
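For example, a trivial sketch (the `u3a` functions have `malloc()`-style signatures; error handling elided):

```c
/* Sketch: a scratch byte buffer allocated on the loom rather
** than the C heap.
*/
c3_y* buf_y = u3a_malloc(42);

memset(buf_y, 0, 42);
/* ... use buf_y ... */
u3a_free(buf_y);
```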
Of course, we don't always know how large our atom will be. Therefore, the standard way of building large atoms is to allocate a block of raw space with `u3i_slab_init()`, then chop off the end with `u3i_slab_malt()` (which does the measuring itself) or `u3i_slab_mint()` if you've measured it yourself.
Keep in mind that atoms do not retain leading zeros.
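Sketched usage, assuming the slab API as declared in `noun/imprison.h` (the helper name is hypothetical; error handling elided):

```c
#include <string.h>

/* Sketch: build an atom from an arbitrary byte buffer via a slab.
*/
static u3_atom
_atom_from_bytes(const c3_y* dat_y, c3_d len_d)
{
  u3i_slab sab_u;

  u3i_slab_init(&sab_u, 3, len_d);    /* raw space: len_d bloq-3 (byte) units */
  memcpy(sab_u.buf_y, dat_y, len_d);  /* fill the backing buffer              */
  return u3i_slab_malt(&sab_u);       /* measure the length and mint the atom */
}
```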
The reference counters introduced last time, `u3a_gain()` = `u3k()` and `u3a_lose()` = `u3z()`, are also part of `u3a`. However, other than these you typically use `u3a` indirectly through `u3i` and `u3r`/`u3x`.
Some details of the allocator are in flux right now: “As an experiment, [~master-morzod has] rewritten the serf to a) stop allocating events, effects, and IPC messages on the home road, and b) keep the Arvo kernel on an inner road for as long as possible (i.e. until we need to save/pack/meld/&c.).”
- “Land of Nouns”, section “`u3`: the road model”
- “API overview by prefix”, section “`u3a`: allocation functions”
The King (Urth)
The king process is in charge of:
- IPC
- Event log
- Unix effects including I/O
- Stateless Nock interpreter (the “ghost ship” ivory pill material from `ca05`)
(We discussed the serf/Mars in `ca05`.)
Event Log & Snapshotting
The event log is the ordered list of all of the Arvo events (completed moves) resulting in the present state of Arvo. Events are handled at two levels: the event log, which is written consistently at each event, and the snapshot of the loom, which allows rapid recovery of the current state.
In practice, event logs become large and unwieldy over time. Periodically a snapshot of the permanent state is taken so the entire event log needn't be replayed on reboot. You're still able to rebuild your state down to the last keystroke because of persistence: in the context of storing data in a computer system, persistence means that data is stored in a non-volatile manner and that input must be recorded before the output result is performed. Thus, every event must be written to disk (persisted) before the event's effects actually take place.
The snapshot of the loom allows the last few events from the event log to be replayed to recover the present state.
In fact, with current usage patterns (~2023.10.12), there's a problem:
Every few minutes, the runtime applies a patch to its on-disk snapshot. This pauses the process, so the size is important. In particular, for many large ships (~nibset-napwyn, ~wicdev-wisryt, ~natnex-ronret), this is around 600MB. If the maximum acceptable pause is 1 second, this requires a disk which can handle 600MB/s of throughput (or maybe twice that, because it writes the patchfile, then applies it?), which is extremely high. (#6805)
`noun/error.h`
Before we look at the particulars of the event log and snapshotting code, let's take a brief detour to see how assertion errors are handled.
```c
#define u3_assert(x)                          \
  do {                                        \
    if (!(x)) {                               \
      fflush(stderr);                         \
      fprintf(stderr, "\rAssertion '%s' "     \
                      "failed in %s:%d\r\n",  \
              #x, __FILE__, __LINE__);        \
      u3m_bail(c3__oops);                     \
      abort();                                \
    }                                         \
  } while(0)
```
An error here then triggers `u3m_bail()`, the primary crash handler.
- Review `u3m_bail()` in `noun/manage.c`.
Event Log
What is an event on disk? Urbit maintains an LMDB transactional database for key–value pairs, with a `META` table for metadata and an `EVENTS` table for event number–event data pairs, sequentially ordered.
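For a feel of the on-disk layout, here is a sketch that walks the `EVENTS` table with the stock LMDB C API. The table names come from the text above; treat the integer-key detail and everything else as assumptions rather than Vere's actual code (which lives in `vere/db/lmdb.c`):

```c
#include <lmdb.h>
#include <stdio.h>

/* Sketch: print the event number and size of every event in a
** pier's log directory (error handling elided).
*/
void
list_events(const char* dir_c)
{
  MDB_env*    env_u;
  MDB_txn*    txn_u;
  MDB_dbi     dbi_u;
  MDB_cursor* cur_u;
  MDB_val     key_u, val_u;

  mdb_env_create(&env_u);
  mdb_env_set_maxdbs(env_u, 2);                /* META and EVENTS */
  mdb_env_open(env_u, dir_c, MDB_RDONLY, 0664);
  mdb_txn_begin(env_u, 0, MDB_RDONLY, &txn_u);
  mdb_dbi_open(txn_u, "EVENTS", MDB_INTEGERKEY, &dbi_u);
  mdb_cursor_open(txn_u, dbi_u, &cur_u);

  while ( 0 == mdb_cursor_get(cur_u, &key_u, &val_u, MDB_NEXT) ) {
    printf("event %llu: %zu bytes\n",
           *(unsigned long long*)key_u.mv_data, val_u.mv_size);
  }

  mdb_cursor_close(cur_u);
  mdb_txn_abort(txn_u);
  mdb_env_close(env_u);
}
```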
An event is a `u3_fact`, a `struct` including the timestamp and event `ovum`:

```c
/* u3_fact: completed event
*/
  typedef struct _u3_fact {
    c3_d             eve_d;               //  event number
    c3_l             mug_l;               //  kernel mug after
    u3_noun          job;                 //  (pair date ovum)
    struct _u3_fact* nex_u;               //  next in queue
  } u3_fact;
```
Event log replay thus refers to retrieving the sequence of events from the pier's database instance and playing back each sequential event.
- Examine `u3_mars_play()` for details of how playback works:
  - `_mars_play_batch()`
  - `_mars_poke_play()`
  - `u3v_poke_raw()`, and we're back to a conventional Arvo poke as we saw in `ca05`.
The introduction of epochs will enable finer-grained system recovery when necessary:
Historically, Vere has stored a single event log and snapshot. To make replay across different binary versions more convenient and less error-prone, an improved design is the "epoch" system. In the epoch system, Vere breaks up the event log into "epochs", where an epoch represents a snapshot and some events after that snapshot. An epoch lives in its own folder, named after the first event in that epoch. In addition to storing a snapshot and a log of events, each epoch folder also stores a version file indicating which version of Vere originally ran these events; this makes replay across different binary versions much easier, especially in the case of a jet mismatch in an old binary.
Snapshots
Replay is how Vere computes the state of a ship's Arvo instance from the event log after a ship reboots. In order to avoid replaying the entire event log, Replay takes a snapshot of the current state of the ship approximately once every ten minutes. Then when a ship reboots, Replay loads the most recent snapshot and replays events from the event log up to the most recent event.
`u3e_save()` saves the loom (snapshots); it is often called via `u3m_save()`.
Demand Paging
Demand paging refers to the ability to load only needed pages of memory into RAM, leaving other pages on disk, to reduce memory use.
Pages are marked as clean (`PROT_READ`), dirty (`PROT_READ|PROT_WRITE`), or guard (`PROT_NONE`). The access pattern assumes all pages are accessed from the outside inwards (another advantage of the loom model).
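As a toy illustration, the three states correspond to `mprotect()` calls like the following (the real state tracking lives in the loom's pager in `noun/events.c`; names here are hypothetical):

```c
#include <sys/mman.h>

#define PAG_SIZ 16384  /* one loom page: 16 KB */

void mark_clean(void* pag_v) { mprotect(pag_v, PAG_SIZ, PROT_READ); }
void mark_dirty(void* pag_v) { mprotect(pag_v, PAG_SIZ, PROT_READ | PROT_WRITE); }
void mark_guard(void* pag_v) { mprotect(pag_v, PAG_SIZ, PROT_NONE); }
```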
- `urbit/vere` #402 (merged)
- `urbit/vere` #410 (merged)
Unix I/O
The king is responsible for the I/O operations of the communicating vanes: Ames, Behn, Clay, Dill, Eyre, Iris, Khan, Lick. (The other two vanes, Gall and Jael, are landlocked and only interact within Urbit.)
We will cover the I/O drivers in a later lesson (`ca11`) after we have covered the major vanes which need to interface with the host OS.
IPC
The Urbit runtime has two categories of IPC:
- King/serf interprocess communication
- Vane-driven interprocess communication:
  - `%khan`/`conn.c`-based sockets
  - `%lick`-based communications
In general, POSIX IPC is “a mechanism that allows processes to communicate with each other and synchronize their actions.” This can be done by sharing memory directly between the processes or by passing messages. Vere does a little of both: the loom is the shared memory arena, and sometimes messages are used.
For instance, in Vere's pier management (mainly `vere/pier.c`), the lord coordinates the king and the serf through messages. Like Arvo, the king and the serf thus need to have the right API shape to connect to each other. The lord coordinates using `writ` to pass a value from the king to the serf, and `plea` to pass a value from the serf to the king. (These are rather like Arvo passes and gifts, but can be initiated from either side rather than just the pass/give pattern.) Whimsically, this is defined in Hoon inside a C comment in `vere/lord.c`:
```hoon
|%
::  +writ: from king to serf
::
+$  writ
  $%  $:  %live
          $%  [%cram eve=@]
              [%exit cod=@]
              [%save eve=@]
              [%meld ~]
              [%pack ~]
      ==  ==
      [%peek mil=@ sam=*]
      [%play eve=@ lit=(list ?((pair @da ovum) *))]
      [%work mil=@ job=(pair @da ovum)]
  ==
::  +plea: from serf to king
::
+$  plea
  $%  [%live ~]
      [%ripe [pro=%1 hon=@ nok=@] eve=@ mug=@]
      [%slog pri=@ tank]
      [%flog cord]
      $:  %peek
          $%  [%done dat=(unit (cask))]
              [%bail dud=goof]
      ==  ==
      $:  %play
          $%  [%done mug=@]
              [%bail eve=@ mug=@ dud=goof]
      ==  ==
      $:  %work
          $%  [%done eve=@ mug=@ fec=(list ovum)]
              [%swap eve=@ mug=@ job=(pair @da ovum) fec=(list ovum)]
              [%bail lud=(list goof)]
      ==  ==
  ==
--
```
The procedure for IPC needs to establish communications, which is a follow-on from the king starting the serf. (Vere is single-threaded but runs two processes.)
`+$  writ`, from king to serf:
- `%live` is a request to start up the serf.
- `%peek` is a request for data from Arvo in the serf.
- `%play` is a request to the serf to play an event (in an event playback).
- `%work` is a request to the serf to carry out a computation in Arvo.
`+$  plea`, from serf to king:
- `%live` tells if the serf is alive.
- `%ripe` tracks the serf startup state.
- `%slog` is an output request.
- `%flog` is a debug output request.
- `%peek` is a response to the king with a scry result.
- `%play` is a response to an event playback.
- `%work` is a response to an injected `ovum` event.
Other parts of IPC depend on the atom-framing implementations in `vere/newt.c`, another key part of king–serf IPC. `newt.c` produces noun blobs that have a five-byte header and a variable-length payload. The header has a one-byte version tag, typically `0x0`, followed by a four-byte little-endian message byte count. The payload is the `++jam`med noun. This is used by `urbit eval`, for instance, which computes `stdin` against the ivory pill ghost ship:
$ echo "(add 1 41)" | urbit evalloom: mapped 2048MBlite: arvo formula 2a2274c9lite: core 4bb376f0lite: final state 4bb376f0eval (run):42
`eval` supports several options for processing Hoon nouns as input to or output from `conn.c`:
- `-j`, `--jam`: output the result as a jammed noun
- `-c`, `--cue`: read the input as a jammed noun
- `-n`, `--newt`: write output / read input as a newt-encoded jammed noun, when paired with `-j` or `-c` respectively
- `-k`: treat the input as the jammed noun input of a `%fyrd` request to `conn.c`; if the result is a `goof`, pretty-print it to `stderr` instead of returning it
In `vere/newt.c`, see particularly:
- `u3_newt_send()` transmits a jammed noun (produced with `u3s_jam_xeno()`, for instance) to a task buffer for `libuv`. (Recall that `libuv` is the main event loop driver for the king process.)
- `u3_newt_read()` pulls the jammed noun out of the buffer.
Khan and Lick both use `newt.c`.
- Demonstrate invoking a statement at the CLI with the `urbit` executable.
Example: `|meld` Trace
- Let's walk through the lifecycle of a command-line-initiated `./zod/.run meld`:
  - `vere/main.c`
  - `noun/urth.c`
- Compare the lifecycle of `|meld`:
  - `/gen/hood/meld`
  - `/lib/helm/kiln`
  - `/lib/hood/kiln`
  - `/sys/clay`
  - `vere/lord.c`
  - `vere/serf.c`
We will discuss Khan and Lick in detail during the next lesson, but here's a quick recap of their functionality.
Khan
Khan is the "control plane" and thread-runner vane. Its main purpose is to allow external applications to run threads via a Unix Socket and receive the result.
A socket is an endpoint for data communications. In the runtime, it is implemented by `conn.c`, the runtime counterpart to `%khan`.
Lick
Lick manages IPC ports, and the communication between Urbit applications and POSIX applications via these ports. Other vanes and applications ask Lick to open an IPC port, notify it when something is connected or disconnected, and transfer data between itself and the Unix application.
Debugging the Runtime
To conclude today's material, we would like to briefly demonstrate several debugging principles with Vere.
printf
`fprintf`-based output should be done using `fprintf()` to `stderr`. Use both `\n` and `\r` to achieve line feed (move the cursor down one line) and carriage return (move it to the left). You can also use `u3l_log()`, which does not require `\r\n` but should not be used in cases where the I/O drivers have not yet been initialized or can no longer be relied upon, e.g. during crash or shutdown.
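For example (names here are hypothetical):

```c
/* both carriage return and line feed, since the terminal may be raw */
fprintf(stderr, "my-driver: state %d\r\n", val_i);
```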
gdb
For C, make heavy use of `gdb`. `lldb` is far worse than `gdb` for debugging Urbit, so it's worth developing on a Linux box even if that means `ssh`ing into a server. (~wicdev-wisryt)
`gdb` works best when you build with debugging symbols.
```
bazel build :urbit --compilation_mode=dbg
- OR -
bazel build :urbit --copt=-DU3_CPU_DEBUG
- OR -
bazel build :urbit --copt=-DU3_MEM_DEBUG
```
When using GDB, before attaching to the process you should set the following:
```
set follow-fork-mode child
handle SIGSEGV nostop noprint
```
If you are debugging jets or the serf, then you want to attach to the serf at `urbit-worker`.
```
gdb --args ./bazel-bin/pkg/vere/urbit zod
- OR -
gdb attach <PID>

break jet_file:jet_name
```
Valgrind
Valgrind↗ is a memory profiling tool for diagnosing memory usage and memory leaks. As a memory tracking tool, Valgrind uses several times more memory than the native application requires.
A possible expedient to make it viable:
- Decrease the loom size with `--loom` (e.g. `--loom 29` or something similarly constrained).
`urbit/urbit` #5161 has some discussion about using Valgrind. (The memory leak in question was ultimately resolved by #5614, which presents an instructive insight.)
Changing the Serf
The process call to invoke the serf is hard-coded in the king. To run Sword (née Ares) as a serf, for instance, you need to change the `protocol` entry in `vere/lord.c:u3_lord_init()` to the Sword executable as a full-path string literal.
Exercises
- Turn on `LORD_TRACE_CUE` and boot a new fake ship.
- Examine the new `/ted/runtime-version` thread which is in the latest release (412 K). How does it work? Trace the types and values back through to see how the values are recovered from the runtime.