How should I properly document my homelab?

enchantedgoldapple@sopuli.xyz · 6 hours ago

cecilkorik@lemmy.ca · 5 hours ago

You’re on the right track. Like everything else in self-hosting, you will learn and develop new strategies and scale things up to an appropriate level as you go and as your homelab grows. I think the key is to start with something immediately achievable and iterate fast, aiming for continuous improvement.

My first idea was much like yours: very traditional documentation, with words, in a document. I quickly found the same thing you did: it’s half-baked and insufficient. There’s simply no way to make it match the actual state of the system perfectly, and English alone is inadequate for explaining what I did because it ends up too vague to be useful in a technical sense.

My next realization was that in most cases what I really wanted was to know every single command I had ever run, basically without exception. So I started documenting that instead of focusing on the wording and the explanations. Then I started to feel like I wasn’t capturing every command reliably, because I would get distracted trying to figure out a problem and forget to write commands down, and copying and pasting commands between the console and the document was duplicated effort. That turned into the idea of collecting bunches of commands together into a script that I could just run, which would at least reduce the risk of gaps and missing steps. I could put the commands I wanted to run right into the script, run the script, and then save it for posterity in version control, knowing I’d accurately captured both the commands I ran and the changes I made to get things working.
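To make that concrete, here is a minimal sketch of a script that records each command before running it. It is written in Python rather than shell purely for illustration; the `run` helper, the `commands.log` filename, and the placeholder command are all assumptions of mine, not anything from an existing tool.

```python
import shlex
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical log location; any file kept next to the script would do.
LOG_FILE = Path("commands.log")


def run(cmd: list[str], log_file: Path = LOG_FILE) -> str:
    """Record a command in the log, run it, and return its standard output."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with log_file.open("a", encoding="utf-8") as log:
        # shlex.join quotes arguments so the logged line can be pasted into a shell.
        log.write(f"{stamp} {shlex.join(cmd)}\n")
    # check=True raises an exception at the first failing command.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Placeholder command; replace with the real setup steps.
    print(run([sys.executable, "--version"]))
```

Committing the script itself to version control then gives both the list of commands and the history of how that list changed.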

But upon attempting to do so, I found that long lists of commands on their own aren’t terribly useful, so I started to group the lists, looking for commonalities by things like server or service, and then started to organize them into scripts for different roles and intents that I could apply to any server or service. Over time this developed into quite a library of scripts. As I was organizing, I realized that as long as I made sure each script was functionally idempotent (running it repeatedly doesn’t change behavior or duplicate work; it’s an important concept), I could guarantee that all my commands were properly documented and that they had all been run. If they hadn’t, or I wasn’t sure, I could just run the script again, since it’s supposed to always be safe to re-run no matter what state the system is in. So I moved more and more to this strategy, until I realized that if I organized this well enough, and made the scripts run automatically whenever they are changed or updated, I could not only improve my guarantee that all these commands reliably run, but also quickly run them on many different servers and services at once without even having to think about it.
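As a small illustration of what idempotent means in practice, here is a sketch of one step that adds a line to a config file only if it is missing. The `ensure_line` name, the `example.conf` file, and the setting are hypothetical, and a real script would need many steps like this.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already correct.
    """
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    if line in existing:
        return False  # Already in the desired state, so do nothing.
    existing.append(line)
    path.write_text("\n".join(existing) + "\n", encoding="utf-8")
    return True


if __name__ == "__main__":
    # Hypothetical file and setting, for illustration only.
    config = Path("example.conf")
    print(ensure_line(config, "PermitRootLogin no"))
    print(ensure_line(config, "PermitRootLogin no"))
```

However many times this runs, the line appears in the file once, and the second call always reports that nothing changed. Compare that with a blind append, which would add a duplicate line on every run.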

There are some downsides, of course. This leaves the potential for bugs in the scripts that make them not idempotent or not safe to re-run, and the only thing I can do is try to prevent them, and identify and fix them when they do happen. The next step is probably some kind of testing process and environment (preferably automated), but now I’m really getting into the weeds. At least I don’t have any real concerns that my system is undocumented anymore: I can quickly reference almost anything it’s doing or how it’s set up. That said, one other risk is that the scripts and automation become too complex to quickly untangle, and at that point I’ll need better documentation for them. And ultimately you get into a circle: how do you validate that your scripts are actually working, doing what you expect, and missing nothing? Usually you run back into the same problems that doomed your documentation from the start: consistency and accuracy.
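One possible shape for that kind of automated test, sketched under heavy assumptions: run a step twice in a scratch directory and check that the second run changes nothing. The `snapshot` and `is_idempotent` helpers and the two example steps are hypothetical, and this only looks at file contents, not packages, services, or anything else a real script touches.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def snapshot(root: Path) -> dict[str, str]:
    """Map every file under `root` to a SHA-256 digest of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def is_idempotent(step: Callable[[Path], None]) -> bool:
    """Run `step` twice in a scratch directory; the second run must change nothing."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        step(root)
        after_first = snapshot(root)
        step(root)
        return snapshot(root) == after_first


def good_step(root: Path) -> None:
    # Overwrites the file with the same content every time.
    (root / "motd").write_text("welcome\n", encoding="utf-8")


def bad_step(root: Path) -> None:
    # Appends on every run, so repeated runs keep changing the file.
    with (root / "motd").open("a", encoding="utf-8") as f:
        f.write("welcome\n")


if __name__ == "__main__":
    print(is_idempotent(good_step))  # True
    print(is_idempotent(bad_step))   # False
```

A check like this catches the most common mistake (appending instead of ensuring), but it can’t tell you that the step does the right thing in the first place.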

It also opens an attack vector, where somebody gaining access to these scripts not only gains all the most detailed knowledge of how your system is configured but also the potential to inject commands into those scripts and run them anywhere, so you have to make sure to treat these scripts and systems like the crown jewels they are. If they are compromised, you are in serious trouble.

By now I have of course realized (and you all probably have too) that I have independently re-invented infrastructure-as-code. There are tools and systems to help you do this (Ansible and Terraform come to mind), and at some point I may decide to take advantage of them, but personally I’m not there yet. Maybe soon. If you want to skip the intermediate steps I went through, you might even be able to jump directly to that approach. But personally I think there is value in the process: it helps define your needs and builds your understanding that there really isn’t anything magical going on behind the scenes, which may help prevent these tools from turning into a black box that doesn’t actually help you understand your system.

Do I have a perfect system? Of course not. In a lot of ways it’s probably horrific and I’m sure there are more experienced professionals out there cringing or perhaps already furiously warming up their keyboards. But I learned a lot, understand a lot more than I did when I started, and you can too. Maybe you’ll follow the same path I did, maybe you won’t. But you’ll get there.