20 August 2023

What I would like to see in a modern Virtual Machine

As I was gathering inspiration (or doing research, if you want to be fancy about it) for the fully-introspectable computing platform introduced in the previous post, I figured it might be worth taking the idea of abstracting the hardware, to make user-facing software easier to program, to its limit.

One of primary roles of an operating system kernel is to provide a unified interface over hardware, so applications and developers can concentrate on their task without having to care about intricacies and particularities of the hardware the entire stack sits upon. (The other role of the kernel is to multiplex access over hardware.) If you stop to think about it, that's the same goal of a Virtual Machine, which instead of abstracting over your peripherals, it provides an abstraction over the CPU itself.

Effectively, there is no conceptual difference between a kernel and a virtual machine, if not for historical reasons. One abstracts over one thing, the other over everything else. What happens if we unify them?

Let's imagine what characteristics a potential kernel/VM hybrid might have, and what I wish it would have to truly shine over the status quo.

Language independence

Language independence is by itself a novel idea, as most VMs are tailor-made for a specific language, often little more than an afterthought added when a naive AST interpreter is deemed to slow.

x86, ARM, etc. are a common target for thousands of languages, but in the VM world you have the Java virtual machine, the CLR virtual machine for .NET languages, the Python bytecode interpreter and so on. Why is that?

WASM might be the first serious step towards a language-independent, generic VM implementation whose sole reason is to exist as a target for native, interpreted, high- and low-level languages alike.

A kernel/VM hybrid should allow multiple languages and environments to flourish, focusing on being as generic as possible.

Memory allocation primitives

The most annoying part of writing any sort of VM, yet a step required by any language a little more complex than Brainfuck, is implementing a garbage collector. In fact, the speed of a language is to a large extent determined by the quality of its garbage collection algorithm. Is it using a naive mark-and-sweep collector hacked together in an afternoon, or is it a concurrent, generational GC like the state-of-the-art one found in the Java world? I wonder how many languages have died because of their memory allocation subsystem.

Garbage collection also dictates how the applications interact with the outside world, especially when sitting on an existing operating system, which might mandate a particular memory layout and ABI. For example, the Go runtime has to jump through hoops (CGo) in order to interact with the rest of the C and Linux environment because of its different memory allocation semantics.

There is no reason why every single language running on this hypothetical VM should have to reimplement yet another terrible garbage collector. WASM still hasn't standardised its GC primitives that would make writing higher-level-than-Rust languages a great deal easier.

A kernel/VM hybrid should take care of allocating memory and freeing when it is not in use anymore. Then solve the "how do we make garbage collection really, really fast?" once and for all.

First class pointers

We all know by now of Tony Hoare's billion-dollar (I'd even say trillion-dollar) mistake, the invention of the null pointer.

A serious kernel/VM hybrid worth its code should make invalid memory references just impossible to construct in the first place. The VM should offer no way of turning a number into a pointer, basically.

So how is pointer formed, and how the heck do you interact with memory then? This is an area of active research for me, but here's what I've got so far:

VM words have a 1-bit tag, to distinguish between pointer to an array of memory and not-a-pointer. Pointers have a length (stored in the immediately previous word in memory), are allocated by the primitives I mentioned in the previous section, thus cannot be constructed from a raw number. Pointer arithmetic, the type commonly found in C, is completely disallowed. When you want to interact with memory you can only do it by using this pointer and an index, which is bounds-checked to disallow writing outside your allotted space. In other words, you're only allowed to do p[index++] = ... but not *p++ = ...

This design avoids all the nastiness associated with invalid memory access patterns, which are often silent on regular bare-metal systems. Here, if you try to access an array outside its bounds, that's a bug, just kill the process. If you try to allocate memory and there's no memory available, you have bigger issues at hand, and you might as well turn the machine off, so effectively, memory allocation is guaranteed to succeed.

Of course, you wouldn't be able to implement C on top of this platform — but honestly, so what?

If you allow me a small rant, I do believe there is something wrong in our assessment of computer performance if we have to resort to manual memory management as seen in C, or complex memory semantics as seen in Rust, to be able to squeeze out decent performance and to avoid wasting memory. I'm writing this post on a machine with 64,000,000,000 bytes of memory, that runs more than 4,000,000,000 instructions every single second. If one wants to avoid wasting resources, it might be worth to review the entire stack so an editor on such an absurdly fast machine doesn't open slower than EDIT.COM did on DOS. Our computers are not slow because of VM-level array-bounds checking and garbage collection, but in fact, our quest for performance at all costs is what has allowed us to build this unnecessary tower of complexity in the first place, and it's the reason software today feels slower than it did yesterday.

So, good riddance to manual memory allocation. We don't need it where we're going.

First class concurrency

Even the most established VM designs today completely paper over the fact that we live in a multicore world, and so hide any potential to leverage this power behind runtime functions and ceremony the end-user itself has to officiate, just to be able to split some of that workload on the remaining cores. I bought a Raspberry Pico W for less than $10, it runs at 133 MHz, has 256 kilobyte of RAM, yet ships with two whole cores.

It makes no sense in this day and age to spend a lot of time figuring out the ideal number of registers or optimal opcode encoding system, without providing an instruction to run a function on a separate thread/core/coroutine. With the same ease I can do CALL function, I want to be able to do SPAWN function, and have it run elsewhere.

Of course, to remain memory-safe, nothing should be shared across cores, and communication should be handled at the low level. The good thing of having first-class pointers is that if you send a pointer to another core (which require a copy on NUMA systems), reusing that pointer is easily detected, the program summarily killed, which you'd catch early in development. Hopefully no more data races.

This share-nothing design would open compatibility with asymmetric architectures, where only some cores have access to a specific piece of hardware or subsystem. A forward-facing platform cannot expect we'll keep using ginormous CPUs with a modest amount of cores and uniform memory in the near future. What's certain is that compute is increasingly connected, and on the single core cooperative multitasking is fast enough for many applications. Concurrent programming should be as accessible as integer multiplication.

Optimised for simplicity

There are a lot of fancy features I've discussed here, but as with many things, the work is not done until the design is made as simple as possible, and then more.

We don't need registers. A modern stack machine is good enough. Yes, there are multiple benchmarks that show that register machines can be 30% faster (very debatable) than equivalent stack machines. Nonsense, and if it's true, it doesn't matter anyway.

Register machines are too tied to the processors they're designed to run on. How many registers is good enough? 4? Then you need more work to make use of the 16 registers on x86-64, or the 32 registers on RISC-V. 16 you say? So you need to do more work if you want to port this to an embedded system. Registers are a hardware implementation detail, the exact thing we want to abstract over in this platform.

On the other hand, stack machines are simple, and that's their most important quality. They can be made very fast, and research in the matter has died down since the 1980s if not for Chuck Moore himself and his work on Forth processors. A decent kernel/VM hybrid would translate its opcodes to machine code anyway, so its relative simplicity is irrelevant.

If I were to write a Virtual Machine today, to explore a novel avenue of computation, these are the things I wish it would natively support. The only negative shared by most of these features is how they might not be as performant as the real hardware they're running on. But my entire hypothesis is that computers are already more than fast enough, they're just terribly mismanaged, because we're still in the Stone Age of computing.

To make computers fast again, I suggest rebuilding from something slow and naive. Maybe a little less powerful. As long as it's simple to understand, this time around.

See also / Prior art

View all posts