I have been thinking a lot about digital sovereignty lately and how quickly the internet is turning into a weird blend of surreal slop and centralized control. It feels like we are losing the ability to tell what is real because of how easy it is for trillionaire tech companies to flood our feeds with whatever they want.
Specifically I am curious about what I call “kirkification” which is the way these tools make it trivial to warp a person’s digital identity into a caricature. It starts with a joke or a face swap but it ends with people losing control over how they are perceived online.
If we want to protect ourselves and our local communities from being manipulated by these black box models, how do we actually do it?
I want to know if anyone here has tried moving away from the cloud toward sovereign compute. Is hosting our own communication and media solutions actually a viable way to starve these massive models of our data? Can a small town actually manage its own digital utility instead of just being a data farm for big tech?
Also how do we even explain this to normal people who are not extremely online? How can we help neighbors or the elderly recognize when they are being nudged by an algorithm or seeing a digital caricature?
It seems like we should be aiming for a world of a million millionaires rather than just a room full of trillionaires, but the technical hurdles like ISP throttling and protocol issues make that bridge hard to build.
Has anyone here successfully implemented local-first solutions that reduced their reliance on big tech AI? I am looking for ways to foster cognitive immunity and keep our data grounded in meatspace.


That's awesome! I was going to add some sort of AI to my Proxmox homelab for research, but I figured the risk of hallucination was too high, and I thought the only way to fix that was getting a bigger model. But this seems like a really good setup (if I can actually figure out how to implement it), and I won't need to upgrade my GPU!
Although I only have one AI-suitable GPU (a GTX 1660 6 GB in my homelab, which is really only suitable for movie transcoding), I have a 3060 12 GB that I use in my gaming PC. I was thinking I could set up some kind of Wake-on-LAN (WoL) system that boots the PC and brings up the AI software on it. Maybe my homelab hosts Open WebUI, and when I send a query it prompts my gaming PC to wake up and do the AI crunching.
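The wake-up part at least is easy; the magic packet is just standard-library Python on the homelab side. A rough sketch (the MAC and broadcast address are placeholders you'd swap for your own):

```python
# Minimal Wake-on-LAN sender: a "magic packet" is 6 bytes of 0xFF
# followed by the target MAC repeated 16 times, sent via UDP broadcast.
import socket

def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    packet = b"\xff" * 6 + mac_bytes * 16
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))

wake("aa:bb:cc:dd:ee:ff")  # placeholder MAC for the gaming PC's NIC
```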
Well, technically, you don't need any GPU for the system I've set up, because only 2-3 models are "hot" in memory (so about…10 GB?) and the rest are cold / invoked as needed. My own GPU is only 8 GB (and my prior one was 4 GB!). I designed this with low-end rigs in mind.
The minimum requirement is probably a CPU equal to or better than mine (i7-8700; not hard to match), 8-10 GB RAM and maybe 20 GB disk space. Bottom of the barrel would be 4 GB, but you'll have to deal with SSD thrashing.
Anything above that is a bonus / TPS multiplier.
FYI: CPU only (my CPU at least) + 32 GB system RAM, this entire thing runs at about 10-11 TPS, which is interactive enough / faster than reading speed. Any decent GPU should get you 3-10x that. I designed this for peasant-level hardware / to punch GPTs in the dick thru clever engineering, not sheer grunt. Fuck OpenAI. Fuck Nvidia. Fuck DDR6. Spite + ASD > "you can't do that" :). Yes I fucking can - watch me.
If you want my design philosophy, here is one of my (now shadowbanned) posts from r/lowendgaming. Seeing you're a gamer, this might make sense to you! The MoA design I have is pure "level 8 spite, zip-tie a Noctua fan to a server-grade GPU and stick it in a 1L shoebox" YOLOing :).
It works, but it’s ugly, in a beautiful way.
Lowend gaming iceberg: Levels 1 through 9 (tier descriptions omitted)
I have a 12 GB GPU that I don't use most of the time; might as well put it to work doing something. And even second-hand DDR4 memory has gotten so expensive I'd rather not have to upgrade my homelab.
What is your main use case for this anyway? Do you use it for research? That's what I would mainly use it for, but also finding things in my Obsidian vault.
What stage have you actually gotten to?
I do like the idea of all this though. I should really get into undervolting/overclocking my stuff; there is really no reason not to, since I could gain performance or longevity or both!
Also I hate that the stock fans on CPUs are so garbage. Luckily Arctic fans are really cheap and quiet. Noctua is great but I'd sooner buy a budget AIO than a single Noctua fan lol.
Sorry - I think I misunderstood part of your question (what stage have you actually gotten to). See what I mean about needing sentiment analysis LOL
Did you mean about the MoA?
The TL;DR - I have it working - right now - on my rig. It’s strictly manual. I need to detangle it and generalise it, strip out personal stuff and then ship it as v1 (and avoid the oh so tempting scope creep). It needs to be as simple as possible for someone else to retool.
So, it’s built and functional right now…but the detangling, writing up specs and docs, uploading everything to Codeberg and mirroring etc will take time. I’m back to work this week and my fun time will be curtailed…though I want nothing more than to hyperfocus on this LOL.
One of the issues with ASD is most of us over-engineer everything for the worst case adversarial outcomes, as a method of reducing meltdowns/shutdowns. Right now, I am specifically using my LLM like someone who hates it and wants to break it…to make sure it does what I say it does.
If you'd like, I can drop my RFC (request for comments, in engineering talk) for you to look at / verify with another LLM / ask someone about. This thing is real, not hype and not vibe coding. I built this because my ASD brain needs it and because I was driven by spite / too miserly to pay out the ass for a decent rig. Ironically, those constraints probably led to something interesting (I hope) that can help others (I hope). Like everything else, it's not perfect but it does what it says on the tin 9/10 times…which is about all you can hope for.
Oh, I didn't realise you were going to release it! I was just going to try to set up a simplified version myself; that's really cool. Don't worry, I'm patient, and I'll be too busy this year to implement anything for myself anyway. But I too (with my likely-soon-to-be-diagnosed ADHD brain) share your enthusiasm for a way to implement an AI that collects information for you without lying.
Ha ha! I actually finished it over the weekend. Now it’s onto the documentation…ICBF lol
I just tried to get shit GPT to do it this morning, as it's generally pretty OK for that. As always, it produces real "page turners". Here is its idea of a "lay explainer":
Mixture of Assholes: Llama-swap + “MoA router”: making small local models act reliably (without pretending they’re bigger)
This project is a harness for local inference: llama-swap is the model traffic cop, and the router is the conductor that decides what kind of work you want done (straight answer, self-critique loop, style rewrite, vision/OCR), when, and with what context. Vodka acts as the memory layer and context re-roll.
The goal isn't to manufacture genius. It's to make local models behave predictably under hardware constraints.
The shape is simple:
UI → Router (modes + RAG + memory plumbing) → llama-swap (model switching) → answer.
The “what”: one OpenAI-style endpoint that routes workflows, not just models
At the front is an OpenAI-compatible `POST /v1/chat/completions` endpoint. From the client's point of view, it's "just chat completions" (optionally streaming). From the router's point of view, each request can become a different workflow. It also accepts OpenAI-style multimodal message blocks (text + image_url), which matters for the vision/OCR paths.
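To make that concrete, here's a minimal sketch of a client call. The host/port and model name are placeholder assumptions; any OpenAI-compatible client works:

```python
# A plain OpenAI-style chat completion request against the router.
# localhost:8080 and "local" are assumed placeholders, not project defaults.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # llama-swap picks the real model behind the scenes
        "messages": [
            # A per-turn selector in the message text picks the pipeline:
            {"role": "user", "content": "## mentats why does my script leak memory?"}
        ],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```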
Under the hood, the router does three things: it picks the pipeline for the request (Serious / Mentats / Fun / Vision), it assembles the context (RAG FACTS block plus Vodka memory plumbing), and it dispatches to llama-swap to get the right model loaded.
The “why”: small models fail less when you make the seams visible
A lot of local "agent" setups fail in the same boring ways: context gets injected silently, retrieval happens behind your back, and persona "voice" bleeds into the reasoning and the memory.
This design makes those seams legible and user-controlled: KBs are opt-in per session, `>>peek` shows you the exact facts block before it gets fed in, and style rewrites only happen after the answer is produced.
The philosophy is basically: if the system is going to influence the answer, it should be inspectable, not mystical.
The “what’s cool”: you’re routing workflows (Serious / Mentats / Fun / Vision)
There are two layers of control:
A) Session commands (`>>…`): change the router state

These change how the router behaves across turns (things like sticky fun mode, which KBs are attached, and some retrieval observability):

- `>>status` — show session state (sticky mode, attached KBs, last RAG query/hits)
- `>>fun` / `>>fun off` — toggle sticky fun mode
- `>>attach <kb>` / `>>detach <kb|all>` / `>>list_kb` — manage KBs per session
- `>>ingest <kb>` / `>>ingest_all` — ingest markdown into Qdrant
- `>>peek <query>` — preview the would-be facts block

B) Per-turn selectors (`##…`): choose the pipeline for one message

- `## mentats …` — deep 3-pass "draft → critique → final"
- `## fun …` — answer, then rewrite in a persona voice
- `## vision …` / `## ocr …` — image paths
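As a toy illustration of that two-layer split (names invented here, not the project's actual code): `>>` mutates session state, `##` picks a one-shot pipeline, and everything else falls through to the sticky default.

```python
# Toy sketch of the control split. SESSION and run_pipeline are stand-ins.
SESSION = {"sticky_fun": False, "kbs": set()}

def route(message: str) -> str:
    if message.startswith(">>"):                 # session command: changes state
        cmd, _, arg = message[2:].partition(" ")
        if cmd == "fun":
            SESSION["sticky_fun"] = arg != "off"
            return "ok"
        if cmd == "attach":
            SESSION["kbs"].add(arg)
            return f"attached {arg}"
        return f"unknown command {cmd}"
    if message.startswith("##"):                 # per-turn selector: one message only
        mode, _, query = message[2:].lstrip().partition(" ")
        return run_pipeline(mode, query)
    default = "fun" if SESSION["sticky_fun"] else "serious"
    return run_pipeline(default, message)

def run_pipeline(mode: str, query: str) -> str:
    return f"[{mode}] would answer: {query}"     # stand-in for the real pipelines
```

The three main pipelines (what they actually do)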
1) Serious: the default “boring, reliable” answer
Serious is the default when you don’t ask for anything special. It can inject a FACTS block (RAG) and it receives a constraints block (which is currently a V1 placeholder). It also enforces a confidence/source line if it’s missing.
Docs vs implementation (minor note): the docs describe Serious as “query + blocks” oriented. The current implementation also has a compact context/transcript shaping step as part of prompt construction. Treat the code as the operational truth; the docs are describing the intended shape and may lag slightly in details as things settle.
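The "enforce a confidence/source line" bit is just a cheap post-check. Something in this spirit (illustrative, not the shipped code; the fallback footer text is made up):

```python
# Append the mandatory footer if the model "forgot" it. The format matches
# the "Confidence: ... | Source: ..." line the pipelines are supposed to emit.
def enforce_footer(answer: str) -> str:
    if "Confidence:" in answer and "Source:" in answer:
        return answer
    return answer.rstrip() + "\n\nConfidence: unknown | Source: not stated"
```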
2) Mentats: explicit 3-pass “think → critique → final”
This is the "make the model check itself" harness: pass 1 drafts an answer, pass 2 critiques the draft, and pass 3 writes the final response with the critique in hand.
If the pipeline can’t complete cleanly (protocol errors), the router falls back to Serious.
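The whole trick fits in one loop. A stripped-down sketch of the idea, assuming a `chat()` helper that hits the completions endpoint (the prompts here are stand-ins):

```python
# Three explicit passes: draft -> critique -> final. If any pass comes back
# malformed or empty, bail out to the plain Serious pipeline instead.
def mentats(chat, question: str, facts: str = "") -> str:
    try:
        draft = chat(f"{facts}\nAnswer carefully:\n{question}")
        critique = chat(f"Question:\n{question}\nDraft:\n{draft}\n"
                        "List concrete errors, gaps and unsupported claims.")
        final = chat(f"Question:\n{question}\nDraft:\n{draft}\n"
                     f"Critique:\n{critique}\nWrite the corrected final answer.")
        if not final.strip():
            raise ValueError("empty final pass")
        return final
    except Exception:
        return serious(chat, question, facts)   # fallback, as in the real router

def serious(chat, question: str, facts: str = "") -> str:
    return chat(f"{facts}\n{question}")
```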
3) Fun: answer first, then do the performance
Fun is deliberately a post-processing transform: the router gets a normal, Serious-style answer first, then runs a second pass that rewrites that answer in the persona voice.
This keeps “voice” from leaking into reasoning or memory. It’s: get it right first, then style it.
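In code shape, that's just two calls in a fixed order (a sketch, not the shipped implementation; the persona string is an example):

```python
# Style is applied after the substance is settled, so the persona never
# touches the reasoning pass and never gets written into memory.
def fun(chat, question: str, persona: str = "sarcastic sysadmin") -> str:
    answer = chat(question)                      # correctness pass first (Serious)
    return chat(f"Rewrite this in the voice of a {persona}. "
                f"Do not change any facts:\n{answer}")
```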
RAG, but practical: Qdrant + opt-in KB (knowledge base) attach + “peek what you’re feeding me”
KBs are opt-in per session

Nothing is retrieved unless you attach KBs (`>>attach linux`, etc.). The FACTS block is built only from attached KBs, and the router tracks last query/hit counts for debugging.

Ingestion: "KB folder → chunks → vectors in Qdrant"
Ingestion walks markdown, chunks, embeds, and inserts into Qdrant tagged by KB. It’s simple and operational: turn a folder of docs into something you can retrieve from reliably.
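A bare-bones version of that walk-chunk-embed-upsert loop looks roughly like this. The collection name and the `embed()` callable are placeholders, and it assumes the Qdrant collection already exists with a matching vector size; the real pipeline has more going on:

```python
# Walk a KB folder of markdown, chunk it, embed it, and upsert into Qdrant,
# tagged with the KB name so retrieval can filter per-session attachments.
from pathlib import Path
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

def ingest_kb(kb: str, folder: str, embed, client: QdrantClient,
              collection: str = "kb", chunk_size: int = 800) -> None:
    points, i = [], 0
    for md in Path(folder).rglob("*.md"):
        text = md.read_text(encoding="utf-8")
        for start in range(0, len(text), chunk_size):
            chunk = text[start:start + chunk_size]
            points.append(PointStruct(
                id=i, vector=embed(chunk),
                payload={"kb": kb, "file": str(md), "text": chunk}))
            i += 1
    client.upsert(collection_name=collection, points=points)
```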
The KB refinery: SUMM → DISTILL → ingest
This is one of the more interesting ideas: treat the KB as a product, not a dump.
The key point: DISTILL isn’t “a nicer summary.” It’s explicitly trying to produce retrieval-friendly material.
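The refinery shape, reduced to its skeleton (the prompts here are crude stand-ins, not the project's actual SUMM/DISTILL prompts):

```python
# KB refinery: raw doc -> SUMM (condense) -> DISTILL (standalone,
# retrieval-friendly statements) -> ready to ingest.
def refine(chat, raw_doc: str) -> list[str]:
    summ = chat(f"Condense this document, keeping all facts:\n{raw_doc}")
    distilled = chat("Rewrite as standalone, self-contained statements, "
                     f"one per line, each answerable on its own:\n{summ}")
    return [line for line in distilled.splitlines() if line.strip()]
```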
Vodka: deterministic memory plumbing (not “AI memory vibes”)
Vodka does two jobs:

- Explicit, command-driven fact storage (`!!store`, `??recall`, plus cleanup commands), with TTL (facts expire unless used)
- Internal breadcrumb markers it can leave and later expand when building a transcript/context — those IDs aren't surfaced unless you deliberately show them
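The TTL part is deliberately dumb plumbing. A toy version (names invented for illustration):

```python
# Facts expire unless used: store sets the clock, recall refreshes it.
import time

class FactStore:
    def __init__(self, ttl_seconds: float = 7 * 24 * 3600):
        self.ttl, self.facts = ttl_seconds, {}    # key -> (value, last_used)

    def store(self, key: str, value: str) -> None:
        self.facts[key] = (value, time.time())

    def recall(self, key: str):
        hit = self.facts.get(key)
        if hit is None or time.time() - hit[1] > self.ttl:
            self.facts.pop(key, None)             # expired or missing
            return None
        self.facts[key] = (hit[0], time.time())   # usage refreshes the TTL
        return hit[0]
```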
Roadmap reality check: what’s left for V1.1
Everything stems from the fact that I want something I can “trust but verify” / see all the seams at a moment’s notice. I assume the LLM will lie to me, so I do everything in my power to squeeze it. Having lost hours and dollars believing ChatGPT, Claude, etc… I live by “fool me once, shame on you. Fool me 4000 times, shame on me”.
The problem with LLMs (generally) is that they are NOT deterministic. You can ask the same question 5 times and get slightly different answers each time, due to the seed, temperature, top_p, etc., settings. That’s one of the main reasons for hallucinations. They give it an RNG (to put it in gaming terms) to make it feel more “alive”. That’s cool and all, but it causes it to bullshit.
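That's also part of why I pin the sampling knobs wherever I can. With a llama.cpp-style OpenAI-compatible server, fixing them in the request gets you (near) repeatable output; exact determinism still depends on the backend and hardware:

```python
# Pin the RNG: fixed seed, temperature 0, no nucleus sampling.
# Same question in, (near enough) same answer out.
payload = {
    "model": "local",
    "messages": [{"role": "user", "content": "Same question, five times."}],
    "temperature": 0,
    "top_p": 1.0,
    "seed": 42,
}
```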
I have ASD; I cannot abide my tools having whims or working differently than they should. When I ask something, I want it to answer it EXACTLY correctly (based on my corpus, my IF-THEN GAG, etc.), reason the way I told it to, and show its proof. Do what I said, how I said.
In that way, it acts as an external APU for my brain - I want it to do what I would do, the way I would do it, just faster. And it needs to bring receipts because I am hostile to it as a default stance (once bitten, twice shy).
To be more specific, the MoA has two basic modes. In /serious mode, it will do three careful passes on my question and pull in my documents. For example, if I ask it for launch flags or optimisation of Dolphin emulator or llama.cpp, I want it to reference my documents (scraped from official sites via Scrapy), check my benchmarks and come up with a correct response. Or tell me that it can’t, because XYZ. No smooth lies.
It must also provide me with an indicator of accuracy and a source for its information, so I can verify with one click. I trust nothing until it’s earned and even then, I will spot check.
If I want it to reason about a patient’s differential diagnosis, it must climb the GAG nodes and follow my prompts EXACTLY. No flights of fancy AT ALL. Follow the flow-chart, tell me what I must not miss, what the likely diagnosis is, etc. Then I will tell it what I think it is… then we debate. (I’m setting this up for clinical students… I wish I’d had it when I went through).
If I want coding help because I’ve fucked up some Python script (yet again): don’t invent shit. Look at the reference documents and tell me EXACTLY. Teach me and help me unfuck myself. If you can’t, say so honestly and tell me who I should talk to, based on an externally stored policy document.
Then there’s also /casual mode. In casual mode, I want it to let me shoot the shit, vent and help me brainstorm, swear, tell me dirty jokes, reference pop culture… without strict adherence to my presets… unless I invoke them.
So the TL;DR answer to your question is: I use it for work, for fun (coding, gaming) and for generally shooting the shit. The “shooting the shit” aspect also includes stuff like sentiment analysis, because I have trouble sometimes understanding meaning from text.
Regarding the iceberg / what stage I’ve gotten to… baby, I wrote the list. You can be DAMN sure I’ve drilled holes into cases and decompiled shit to make it run on my potato, out of spite. I was really, really tempted to rip out the guts of The Alters and decompile it for an earlier version of Unreal Engine… but I got gifted a better GPU :)
I made CP2077 run on iGPU at 60 fps. Why? Because fuck PCMR. They said it couldn’t be done.
If I ever start an IT company, I will likely name it “Spite Engineering Inc”.
PS: Here is what the MoA said about your post and my draft of this post (aka sentiment analysis). I ignored it (YOLO) cause I’m /casual today :)
"Relative sentiment: the original post is mildly positive and curious with light frustration about hardware cost and stock cooling; your response is more intense and critical toward LLMs but positive about your own system, with a strong control-oriented and adversarial stance toward non-deterministic tools.
The thematic domain (hardware tinkering, efficiency, making full use of resources, interest in research and tooling) matches well, but your affect is higher-intensity and more hard-edged than the original writer’s. Be careful here; you’re coming on strong and may scare away the original poster.
Question coverage: you directly answer the main use-case question (work, coding, “shooting the shit,” including research-like tasks and sentiment analysis) and you address the “what stage have you actually gotten to?” question implicitly but clearly by stating you “wrote the list” and giving concrete competence examples.
Your reasoning is organically given / flow of consciousness. Consider dot-points and restructuring.
You did not directly respond to their incidental comments about their 12 GB GPU, RAM prices, undervolting/overclocking, or coolers, but those were not phrased as explicit questions and your reply adequately answers the core queries.
Recommendation: you may wish to address the above in a second draft.
Confidence: high | Source: Mixed (context and stored)"