Dependencies

What are dependencies, when you’re developing and running software? Depends :).

A super simple Python script might not really have dependencies. You can pull a .py file from the Internet, slap python3 script.py into your terminal, and run the script. If the application is a bit more complex, it might have multiple files, requiring that you download a few Python files and co-locate them in a folder before you can run them. So, okay, maybe instead of downloading a single file from a website, now you’re downloading a zip file or pulling a git repository to fetch that entire folder of files.

Other scripts will require that you install packages to your system from some dependency repository (which you might do with pip, for example). You might just figure out what packages to install from a README, or maybe there’s a file in the repository that says which packages need to be installed (hopefully with fixed version numbers, so you can make sure you’re getting the same version of each library that the original author was using). What this looks like probably depends on your language’s ecosystem; NodeJS/npm uses a package.json, Python/pip uses a requirements.txt, Rust/Cargo uses a Cargo.toml, and so on. You may also have an option to install a package locally or globally. NodeJS’s npm, for example, will install packages to a local folder by default, so that projects won’t affect each other. Python’s package installations tend to be global, deferring to additional ecosystem tools like virtualenvs or Conda to isolate and choose between environments.
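
For example, a Python project's requirements.txt might pin exact versions like this (the package names and version numbers here are purely illustrative):

    # requirements.txt -- pin exact versions so everyone installs the same thing
    numpy==1.24.2
    pillow==9.4.0
    requests==2.28.2

Anyone who grabs the project can then recreate that exact set with pip install -r requirements.txt.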

Already, this amount of dependency management can be a pain. Sometimes it’s hard to tell which Python packages you’re using – you might have one dependency version listed in your requirements.txt but actually have another version installed (say, from another project you were working on yesterday). Running the project on another computer you own, or even worse, trying to guide some other researcher across the planet through installing your janky code, can be hell when you realize you don’t actually know what dependencies you’ve been using this whole time.
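
One thing you can do is ask pip what's actually installed right now, rather than trusting your memory (or your requirements.txt):

    # list every package, with its exact version, in the current environment
    pip freeze

    # or snapshot that list into a file you can commit
    pip freeze > requirements.txt

Of course, if you've been installing packages globally, that list includes everything from every project you've ever touched, which is kind of the problem.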

Dependency hell doesn’t stop there, though. Some libraries will depend on additional software installed on your computer. For example, you’ll need libjpeg installed to use JPEG support in Python’s Pillow image processing library. If you’re lucky, there’ll be a guide in a project’s README explaining how to install the required libraries for your operating system (or maybe the project maintainer will just have built statically-linked releases that include all of the dependencies in a single binary – but don’t hold your breath). Otherwise, you’re stuck tracking down whichever package is available for your operating system that will bind correctly to the software you’re trying to run. Let’s hope it’s the right version.
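
To make that concrete, getting Pillow's JPEG support working on a Debian/Ubuntu machine might look something like this (package names differ between distros, and you typically only need the system library if pip ends up building Pillow from source, so treat this as a sketch):

    # system-level dependency, installed with the OS package manager
    sudo apt-get install libjpeg-dev

    # language-level dependency, installed with pip
    pip install Pillow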

The actual language you’re running is probably versioned, too, of course. Not only will Python 3.8 behave a little differently from Python 3.9 when running your code, but your library dependencies might themselves have an opinion about which language/interpreter version you should be running.
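
Libraries can state that opinion explicitly. A package built with setuptools, for instance, can declare which interpreter versions it supports, and pip will refuse to install it anywhere else (the package name and version bounds here are made up):

    # setup.py -- a setuptools package declaring its supported Python versions
    from setuptools import setup

    setup(
        name="my-research-tool",        # hypothetical package name
        version="0.1.0",
        python_requires=">=3.8,<3.10",  # only installable on Python 3.8 or 3.9
    )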

Okay, I get it. What's the point?

One thing you can do to solve this – to make sure your code runs instead of breaking someday – is to arrange everything just right, make sure you’ve installed exactly the dependencies that are necessary to run the code, and freeze the computer in a block of ice. Maybe not ice. Something freeze-y.

We totally do this in the BiD lab. That’s how we print posters. We’ve got some old-ass Mac Mini that’s got just the right driver configuration installed, a USB port in the front, and a login account Jeremy made years ago. And we do not touch it, because we may never figure out how to connect to that damn printer again.

It doesn’t have to be a literal single physical machine, of course. You could make a bit-for-bit copy of the machine’s hard drive, and if the printer computer ever dies, buy another identical Mac Mini and splat the disk image onto the hard drive, and it’d probably work.

You can take a step towards reproducibility, though, if you’re able to simulate that exact computer on other hardware. That’s what a virtual machine would do. Splat the hard drive into a simulated Mac Mini, and as long as it’s able to poke through the simulation to talk to the printer over USB, you’d still have a probably-not-fragile way to print posters. If you ever did screw up the VM, you could just grab the original disk image, make a new VM, and print from there.

Virtual machines really do try to completely simulate what it’d be like to be running software on the specified hardware. You can almost imagine them as circuit simulators, working out the exact logic that the real computer processor would do to run the software you’re building. That’s not actually what they’re doing – they’re approximations, not truly simulating every 1 and 0 that flows through real wires in a real processor – but they do try to behave the same way. If you had some research code that you wanted me to be able to run, you could hand me a disk image of your computer’s drive and some details about your hardware configuration, and with the right VM software, I’d probably be able to run the software by simulating your computer.

Two problems there.

First, I’d also have access to all the personal crap you keep on your research computer. Remember, you handed me a copy of your entire hard drive. One way to fix this is to build up a VM from scratch, starting with a fresh install of your operating system and then installing nothing but your research software (alongside all the relevant dependencies, of course). Of course, to do this well, you have to start early in your project’s development, before your dependencies get so complicated that you’ve lost track of how you got everything working.

This can be tricky to keep up to date as your project gets more complex, because if you’re manually creating your VM image, you can either keep updating the same image over time (which might get bloated after a while, if you never uninstall anything) or you can build a new VM from scratch every time your configuration changes, which keeps your VM lean but is time-consuming.

A better way might be to write a script that, starting from a fresh install of your operating system, installs everything you need and populates your project code. Now, if you want to remove an older dependency or add a new one, you can just update the script and rerun it to recreate your VM. If your script is deterministic, you can even just check the script into version control instead of trying to juggle these multi-gigabyte VM images.
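
That script doesn't have to be fancy. A hypothetical provision.sh for a Python project on a fresh Ubuntu install might look roughly like this (the package names, repository URL, and paths are all just illustrative):

    #!/usr/bin/env bash
    # provision.sh -- hypothetical setup script for a fresh Ubuntu machine or VM
    set -euo pipefail

    # system-level dependencies
    sudo apt-get update
    sudo apt-get install -y python3 python3-pip git libjpeg-dev

    # fetch the project code (illustrative URL)
    git clone https://github.com/example/my-research-project.git
    cd my-research-project

    # language-level dependencies, pinned in requirements.txt
    pip3 install -r requirements.txt

The script is small, readable, and can live in version control right next to your code.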

But the second problem with this approach is that VMs are just really expensive to run. They’ve gotten better than they used to be, but at the end of the day, you’re simulating an entire sub-computer and operating system inside your own. Sometimes that’s overkill and causes too much overhead to really be worth it. Operations that “break out” of the VM, like writing to a file on the host computer, are especially expensive, and they normally require special configuration inside the VM to coordinate with the host’s VM software.

Docker addresses both of these issues by offering configurable OS-level virtualization. When you run software with Docker, the software isn’t running in a VM; it runs on your computer like any other software would. You’ll see it in your process manager, for example. Docker isolates applications not by simulating an entire world for them like a VM does, but by intercepting interactions between the application and the outside world.

I’ll explain what I mean by that, but first, let’s clarify the limitations of this approach:

  • Unlike virtual machines, Docker won’t translate CPU instructions for other architectures. For example, software compiled for an ARM CPU won’t run on an x86-64 CPU through Docker.
  • Software running in Docker shares your host machine’s Linux kernel. That spares Docker from having to run a whole nested operating system underneath the virtualized software, but it also means you’re stuck with the kernel you have on the host machine (and, for example, with only the device drivers installed into that kernel). There’s a quick way to see this for yourself right after this list.
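
On a Linux host with Docker installed (the small public alpine image is used here just as a convenient example), you can check the shared-kernel point directly; the kernel inside the container is the host's kernel:

    # ask a container which kernel it's running on...
    docker run --rm alpine uname -r

    # ...and compare with the host; the two should match
    uname -r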

Of course, you could consider these two factors – platform architecture and Linux kernel version – to be “dependencies” of your software, things that you’d prefer to be able to control precisely in deployment. In practice, though, Docker’s OS-level virtualization seems to offer us enough control to wrangle most dependency issues we face in deploying software, even though we’re stuck using the host’s platform and kernel. The big speedups we get over using full virtualization mean we often say that Docker offers “lightweight” containerization.
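
And to see the “it’s just a regular process on your machine” point from earlier, you can start a container on a Linux host and look for it in the host’s process list (the public nginx image is used here purely as an example):

    # start a containerized web server in the background
    docker run --rm -d --name demo-nginx nginx

    # its processes show up on the host like any other program's
    ps aux | grep nginx

    # clean up
    docker stop demo-nginx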