Good Module System Design — or — What Makes Module Systems Good, Anyway?

Jul 2 2016

Module systems provide three things to a programming environment:

Safe code reuse
Encapsulation and Data Hiding
Namespacing and collision avoidance

In this essay, I expound on these topics, and elucidate my own thoughts (as both a professional programmer and aspiring language designer) on what constitutes good module system design.

Safe Code Reuse

Consider the following function, which implements a reservoir sampling algorithm:

(fn reservoir (ll i value)
    (let (r (rand 0 1)
          l (len ll))
      (if (< i l)
          (set (nth ll i) value)
          (if (< r (/ l i))
              (set (nth ll (rand 0 l)) value)))))

Ideally, we'd like to not have to copy that code from program to program (especially since it has a bug that we eventually need to fix). Instead, we'd rather do something like this:

(use stats/sample)

(fn main (args)
    (let (ll '())
      (each (i v args)
            (reservoir ll i v))
      (printf "R = [%v]\n" ll)))

A good module system lets you do that. Later, when we find that bug I alluded to earlier, we can fix the module and distribute a new version. Calling programs benefit when they update to that version, without having to patch the bug directly.
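For comparison, here is a minimal Python sketch of the textbook formulation of reservoir sampling (often called Algorithm R). It is not a translation of the Snook function above, and the name `reservoir_sample` and its parameters are assumptions made for illustration. The point is that this is exactly the kind of small, easy-to-get-wrong routine you want to fix once, in a module, rather than in every program that copied it.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of up to k items from an iterable."""
    sample = []
    for i, value in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            sample.append(value)
        else:
            # Item i (0-based) is kept with probability k / (i + 1):
            # pick a slot in [0, i] and keep the item only if the slot
            # falls inside the reservoir.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = value
    return sample


# Keep 10 samples out of 2000 readings.
kept = reservoir_sample(range(2000), 10, random.Random(42))
```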

Encapsulation & Data Hiding

Encapsulation is very important. The module system must let module authors "hold back" some functionality as pertinent only to the implementation of the module, not its interface. Simply put, the interface is what the outside world sees; the implementation is how the job gets done.

Consider what happens if we refactor our function a little, introducing a small utility that decides whether a new sample replaces an older one:

(module stats/sample)

(fn replace? (l i)
    (< (rand 0 1) (/ l i)))

(fn reservoir (ll i value)
    (let (l (len ll))
      (if (< i l)
          (set (nth ll i) value)
          (if (replace? l i)
              (set (nth ll (rand 0 l)) value)))))

Now we have a problem. Our new (replace?) function is an implementation detail of our (reservoir) function, and we don't want callers to use it directly.

For Snook, the completely 100% hypothetical Lisp-like language I made up for this essay, we can introduce a decorator form that signals to the compiler that a given function ought not be seen by callers:

(module stats/sample)

(private
  (fn replace? (l i)
      (< (rand 0 1) (/ l i))))

;; ... etc. ...

With (replace?) marked private, any attempt to call it directly from programs or other modules will result in an error saying that the function is not defined.
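One way to picture what (private ...) asks of the compiler is a module that keeps two tables: everything it defines, and the subset that callers may see. The Python sketch below is a toy model under that assumption, not how Snook (which does not exist) does it. The `Module` class and the `replace_p` and `reservoir_step` stand-ins are hypothetical names.

```python
class Module:
    """A toy module: a name plus a table of functions, some marked private."""

    def __init__(self, name):
        self.name = name
        self._functions = {}
        self._private = set()

    def define(self, fn, private=False):
        """Register a function, optionally hiding it from callers."""
        self._functions[fn.__name__] = fn
        if private:
            self._private.add(fn.__name__)
        return fn

    def exports(self):
        """Return only the functions visible to callers."""
        return {n: f for n, f in self._functions.items()
                if n not in self._private}

    def call(self, fn_name, *args):
        """Resolve a call coming from outside the module."""
        visible = self.exports()
        if fn_name not in visible:
            # From outside, private and nonexistent names look the same.
            raise NameError(f"{self.name}/{fn_name} is not defined")
        return visible[fn_name](*args)


sample = Module("stats/sample")


def replace_p(l, i, r):
    """Stand-in for (replace?); r is passed in instead of drawn at random."""
    return r < l / i


def reservoir_step(l, i, r):
    """Stand-in for the part of (reservoir) that consults (replace?)."""
    return replace_p(l, i, r)


sample.define(replace_p, private=True)
sample.define(reservoir_step)
```

Calling `sample.call("reservoir_step", 10, 20, 0.25)` works, and internally still reaches `replace_p`. Calling `sample.call("replace_p", 10, 20, 0.25)` raises `NameError`, just as if the function had never been written.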

Namespacing & Collision Avoidance

Assume for the sake of illustration that we are writing a sensor reading collection program for a water treatment facility, in Snook. Because it's 2016, we're using an HTTP-enabled SCADA system for managing our treatment facility data. So we write up the following (inside of its own module, of course):

(module scada)
(use net/http)

(fn reservoir (ip username password)
    (connect
      (format "https://%s:8443" ip)    ;; address of HTTP server
      (basic-auth username password))) ;; HTTP BasicAuth header

(fn metric (endpoint name)
    (let (r (get endpoint
                 (format "/v1/metric/%s" name)))
      (if (= 200 (status-code r))
          (body r)
          (panic "request failed"))))

That neatly encapsulates how we connect to one of the reservoirs via its SCADA endpoint, and how we pull metrics out of it. However, we have set ourselves up for almost certain failure.

What happens when we try to use these two modules together, à la:

(use scada)        ;;          defines (reservoir)
(use stats/sample) ;; ... also defines (reservoir)

Chaos. Havoc. Uncertainty.

A good module system mandates namespacing to sidestep this issue entirely. Namespaces disambiguate which function, in which module, you intend to call. Because namespacing is mandatory, module authors don't have to be as vigilant about name reuse.

Snook does it by prefixing the imported symbols (the functions) with the full name of the module.

(use scada)
(use stats/sample)

(fn main (args)
    (let (metrics (listof 10 nil)
          ip      (nth args 0)
          user    (nth args 1)
          pass    (nth args 2)
          resv    (scada/reservoir ip user pass))

      ;; collect lots of metrics (keep 10 samples)
      (repeat (i 2000)
        (let (v (scada/metric resv "pressure"))
          (stats/sample/reservoir metrics i v)))

      ;; print the 10 samples we kept
      (printf "[%s] pressure: %v\n" ip metrics)))

A slight improvement on this scheme is to allow the programmer to specify their own namespace, to be used for the scope of the file:

(use scada)
(use stats/sample st)

(fn main (args)
    (let (metrics (listof 10 nil)
          ip      (nth args 0)
          user    (nth args 1)
          pass    (nth args 2)
          resv    (scada/reservoir ip user pass))

      ;; collect lots of metrics (keep 10 samples)
      (repeat (i 2000)
        (let (v (scada/metric resv "pressure"))
          (st/reservoir metrics i v)))       ;; simpler!

      ;; print the 10 samples we kept
      (printf "[%s] pressure: %v\n" ip metrics)))

(Of course, this kind of namespace feature means we have to rewrite our scada module so that it uses the net/http/ prefix appropriately.)
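The prefixing scheme, including the optional alias, can be sketched as a function that copies a module's exports into the caller's environment under qualified names. This is a Python sketch under the assumption that an environment is just a dictionary. The `use` function here and the stand-in module tables are hypothetical.

```python
def use(env, module_name, exports, alias=None):
    """Bind each exported function into env as 'prefix/name'.

    The prefix is the alias if one is given, else the full module name.
    """
    prefix = alias if alias is not None else module_name
    for name, fn in exports.items():
        qualified = f"{prefix}/{name}"
        if qualified in env:
            # Two modules imported under the same prefix would collide.
            raise NameError(f"{qualified} is already bound")
        env[qualified] = fn
    return env


# Stand-in export tables; both modules define a function named "reservoir".
scada = {
    "reservoir": lambda ip, user, password: f"endpoint for {ip}",
    "metric": lambda endpoint, name: 0.0,
}
stats_sample = {
    "reservoir": lambda ll, i, value: ll,
}

env = {}
use(env, "scada", scada)
use(env, "stats/sample", stats_sample, alias="st")
print(sorted(env))  # ['scada/metric', 'scada/reservoir', 'st/reservoir']
```

Both reservoir functions now coexist, because neither one is ever bound to the bare name.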

Some Further Thoughts

Here are elaborations on some notes I took while thinking about module systems.

Symbols, Calling Environments & Rebinding

The (use ...) construct augments the calling environment by defining new symbols (using the prefix notation) for all exported functions — that is, those that have not been explicitly marked as private. Similar provisions can be made for variable bindings and constants.

To safeguard the integrity of modules, their bindings are fixed at compilation, and cannot be monkey-patched at runtime. To safeguard programmer sanity, Snook forbids rebinding of symbols imported via a (use ...) construct. If it were to allow it, it would only be a shadowing rebinding; it would have absolutely no bearing on the original module.

A side-effect of this decision is that exported variables are effectively read-only. This is good, since module-level variables are usually abused as a form of "acceptable" global variables. Module-level constants are unaffected by the no-rebinding rule.
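The no-rebinding rule can be modeled with an environment that remembers which of its names arrived through an import. The Python sketch below is one possible shape for this, not a description of a real implementation. `Environment`, `max-samples`, and `limit` are hypothetical names, and `MappingProxyType` stands in for "bindings fixed at compilation".

```python
from types import MappingProxyType


class Environment:
    """A calling environment that forbids rebinding imported symbols."""

    def __init__(self):
        self._bindings = {}
        self._imported = set()

    def use(self, prefix, exports):
        """Import every exported binding under 'prefix/name'."""
        for name, value in exports.items():
            qualified = f"{prefix}/{name}"
            self._bindings[qualified] = value
            self._imported.add(qualified)

    def define(self, name, value):
        """Bind a local name; imported names may not be rebound."""
        if name in self._imported:
            raise NameError(f"cannot rebind imported symbol {name}")
        self._bindings[name] = value

    def lookup(self, name):
        return self._bindings[name]


# A read-only view of the module's exports: callers cannot patch it.
exports = MappingProxyType({"max-samples": 10})

env = Environment()
env.use("stats/sample", exports)
env.define("limit", 5)
env.define("limit", 6)  # local names can be rebound freely
```

Here `env.define("stats/sample/max-samples", 99)` raises `NameError`, and `exports["max-samples"] = 99` raises `TypeError`, so the exported variable is read-only from the caller's side in both senses.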

Dependencies

Dependency tracking and resolution, while not explicitly part of the module system proper, is important to the utility and viability of the module system, and indeed of the language itself. If no one can find or reliably source a module, what good is the module system?

I have several thoughts on this that I will be committing to prose before long.

Load- & Link-time Optimization

One of the design goals of Snook (keeping in mind that it is entirely fictional) is to facilitate small, self-contained, static executable binaries for a variety of target processors. The module system must support this endeavor by intelligently allowing unused functions to be skipped during compilation and assembly. This leads to smaller, trimmer binaries.

My original thought on this was to introduce an additional level of segmentation below the module level: the unit. A module is composed of one or more units, each of which is a self-contained aspect of the module. Tightly-integrated modules would have fewer units; looser modules, more.

The more I thought about it, the more I realized that explicit (not to mention manual) segmentation of a module into units would be awkward and unwieldy for a module author. Are units allowed to call functions in other units of the same module? Doesn't that make them part of the same unit?

Maybe we can shift the burden of segmentation to the compiler...

Given the static call graph of a module, its disconnected subgraphs (the connected components) are the units! Compile each unit separately, caching the results to speed up future compilation, and then at link time, just link in what you need, based on the program's call graph.
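As a rough Python sketch of that idea: treat the call graph as undirected, collect its connected components as units, and then keep only the units that contain something the program calls. The function names `find_units` and `units_to_link` are assumptions, as are the extra `mean`, `sum`, and `median` functions added to make the example module interesting.

```python
from collections import defaultdict


def find_units(call_graph):
    """Split a module's call graph into units (connected components).

    call_graph maps each function name to the names it calls.
    """
    neighbors = defaultdict(set)
    for caller, callees in call_graph.items():
        neighbors[caller]  # touch the key so isolated functions appear
        for callee in callees:
            # Direction doesn't matter for grouping, so link both ways.
            neighbors[caller].add(callee)
            neighbors[callee].add(caller)

    units, seen = [], set()
    for fn in sorted(neighbors):
        if fn in seen:
            continue
        # Depth-first walk collecting everything reachable from fn.
        stack, unit = [fn], set()
        while stack:
            node = stack.pop()
            if node in unit:
                continue
            unit.add(node)
            stack.extend(neighbors[node] - unit)
        seen |= unit
        units.append(frozenset(unit))
    return units


def units_to_link(units, called):
    """Keep only the units containing a function the program calls."""
    called = set(called)
    return [unit for unit in units if unit & called]


graph = {
    "reservoir": ["replace?"],
    "replace?": [],
    "mean": ["sum"],
    "sum": [],
    "median": [],
}
units = find_units(graph)
needed = units_to_link(units, ["reservoir"])
```

With this graph there are three units, and a program that only calls reservoir links in just the one holding reservoir and replace?. Note the caveat hiding in the sketch: a helper called by everything in the module would merge all of its callers into a single unit.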

It works in theory, at least.

Conclusion

Module systems are important. As a programmer working in a language, more than half of my time is spent finding, learning and using modules (versus taking advantage of specific language features). To design a good language in the modern era, you have to design a good module system to go with it.

Happy Hacking!

James works on the Internet, spends his weekends developing new and interesting bits of software and his nights trying to make sense of research papers.

Currently exploring just how much data you can shove through DuckDB before it explodes.