I have been thinking a lot about digital sovereignty lately and how quickly the internet is turning into a weird blend of surreal slop and centralized control. It feels like we are losing the ability to tell what is real because of how easy it is for trillionaire tech companies to flood our feeds with whatever they want.

Specifically, I am curious about what I call “kirkification”: the way these tools make it trivial to warp a person’s digital identity into a caricature. It starts with a joke or a face swap, but it ends with people losing control over how they are perceived online.

If we want to protect ourselves and our local communities from being manipulated by these black-box models, how do we actually do it?

I want to know if anyone here has tried moving away from the cloud toward sovereign compute. Is hosting our own communication and media solutions actually a viable way to starve these massive models of our data? Can a small town actually manage its own digital utility instead of just being a data farm for big tech?

Also, how do we even explain this to normal people who are not extremely online? How can we help neighbors or the elderly recognize when they are being nudged by an algorithm or seeing a digital caricature?

It seems like we should be aiming for a world of a million millionaires rather than a room full of trillionaires, but technical hurdles like ISP throttling and protocol issues make that bridge hard to build.

Has anyone here successfully implemented local-first solutions that reduced their reliance on big-tech AI? I am looking for ways to foster cognitive immunity and keep our data grounded in meatspace.

  • SuspciousCarrot78@lemmy.world · edited · 4 hours ago
    Ha ha! I actually finished it over the weekend. Now it’s on to the documentation… ICBF lol

    I just tried to get shit GPT to do it this morning, as it’s generally pretty ok for that. As always, it produces real “page turners”. Here is its idea of a “lay explainer”:

    Mixture of Assholes: Llama-swap + “MoA router”: making small local models act reliably (without pretending they’re bigger)

    This project is a harness for local inference: llama-swap is the model traffic-cop, and the router is the conductor that decides what kind of work you want done (straight answer, self-critique loop, style rewrite, vision/OCR), when, and with what context. Vodka acts as the memory layer and context re-roll.

    The goal isn’t to manufacture genius. It’s to make local models behave predictably under hardware constraints by:

    • making retrieval explicit (no “mystery memory”),
    • keeping “fancy modes” opt-in,
    • and making the seams inspectable when something goes wrong.

    The shape is simple:

    UI → Router (modes + RAG + memory plumbing) → llama-swap (model switching) → answer. ([GitHub][1])


    The “what”: one OpenAI-style endpoint that routes workflows, not just models

    At the front is an OpenAI-compatible POST /v1/chat/completions endpoint. From the client’s point of view, it’s “just chat completions” (optionally streaming). From the router’s point of view, each request can become a different workflow.

    It also accepts OpenAI-style multimodal message blocks (text + image_url), which matters for the vision/OCR paths.
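
    For a sense of what the client side looks like, here is a minimal request sketch. The endpoint path and the text + image_url block shape come from the description above; the host, port, model name, and image data are placeholders.

    ```python
    import requests

    # Hypothetical host/port for the router's OpenAI-compatible endpoint.
    ROUTER_URL = "http://localhost:8080/v1/chat/completions"

    # Plain text turn; a per-turn selector like "## mentats" (see below) just rides along in the content.
    text_request = {
        "model": "local-default",   # placeholder model name
        "stream": False,
        "messages": [
            {"role": "user", "content": "## mentats Summarise my notes on systemd timers."}
        ],
    }

    # Multimodal turn using OpenAI-style content blocks (text + image_url),
    # which is what the vision/OCR paths expect.
    vision_request = {
        "model": "local-default",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "## ocr What does this receipt say?"},
                    {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}},
                ],
            }
        ],
    }

    for payload in (text_request, vision_request):
        resp = requests.post(ROUTER_URL, json=payload, timeout=120)
        resp.raise_for_status()
        print(resp.json()["choices"][0]["message"]["content"])
    ```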

    Under the hood, the router does three things:

    1. Decides the pipeline (Serious / Mentats / Fun / Vision / OCR)
    2. Builds an explicit FACTS block (RAG) if you’ve attached any KBs
    3. Calls llama-swap, which routes the request to the chosen local model backend behind an OpenAI-like interface ([GitHub][1])
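
    That dispatch is easy to picture. A rough sketch, assuming llama-swap is just another OpenAI-compatible HTTP backend; the port, model names, and function names here are illustrative, not the project’s actual identifiers:

    ```python
    import requests

    LLAMA_SWAP_URL = "http://localhost:9000/v1/chat/completions"    # assumed llama-swap address
    MODEL_FOR_MODE = {"serious": "local-7b", "mentats": "local-7b", "fun": "local-7b",
                      "vision": "local-vlm", "ocr": "local-vlm"}    # illustrative model names

    def pick_mode(text: str, session: dict) -> str:
        """1. Decide the pipeline: a per-turn ## selector wins, then sticky fun mode, then Serious."""
        if text.startswith("## "):
            return text[3:].split()[0].lower()
        return "fun" if session.get("sticky_fun") else "serious"

    def build_facts_block(query: str, session: dict, search) -> str:
        """2. Build an explicit FACTS block, but only from KBs attached in this session."""
        kbs = session.get("attached_kbs", [])
        if not kbs:
            return ""
        hits = search(query, kbs)                                   # e.g. a Qdrant lookup
        session["last_rag"] = {"query": query, "hits": len(hits)}   # surfaced by >>status
        return "FACTS:\n" + "\n".join(f"- {h}" for h in hits)

    def call_backend(mode: str, prompt: str) -> str:
        """3. Hand the assembled prompt to llama-swap, which swaps in the chosen local model."""
        resp = requests.post(LLAMA_SWAP_URL, json={
            "model": MODEL_FOR_MODE.get(mode, "local-7b"),
            "messages": [{"role": "user", "content": prompt}],
        }, timeout=300)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
    ```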

    The “why”: small models fail less when you make the seams visible

    A lot of local “agent” setups fail in the same boring ways:

    • they silently change behaviour,
    • they smuggle half-remembered context,
    • they hallucinate continuity.

    This design makes those seams legible and user-controlled:

    • You pick the mode explicitly (no silent “auto-escalation”).
    • Retrieval is explicit and inspectable.
    • There’s a “peek” path that can show what the RAG facts block would look like without answering — which is unbelievably useful for debugging.

    The philosophy is basically: if the system is going to influence the answer, it should be inspectable, not mystical.


    The “what’s cool”: you’re routing workflows (Serious / Mentats / Fun / Vision)

    There are two layers of control:

    A) Session commands (>>…): change the router state

    These change how the router behaves across turns (things like sticky fun mode, which KBs are attached, and some retrieval observability):

    • >>status — show session state (sticky mode, attached KBs, last RAG query/hits)
    • >>fun / >>fun off — toggle sticky fun mode
    • >>attach <kb> / >>detach <kb|all> / >>list_kb — manage KBs per session
    • >>ingest <kb> / >>ingest_all — ingest markdown into Qdrant
    • >>peek <query> — preview the would-be facts block

    B) Per-turn selectors (##…): choose the pipeline for one message

    • ## mentats … — deep 3-pass “draft → critique → final”
    • ## fun — answer, then rewrite in a persona voice
    • ## vision … / ## ocr … — image paths
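
    To show how thin that control layer can be, here is a toy dispatcher for the two prefixes. The command names match the lists above; the session shape and helper names are invented for illustration, and a few commands (>>ingest, >>list_kb) are omitted:

    ```python
    def run_pipeline(mode: str, text: str, session: dict) -> str:
        ...  # entry point for the Serious / Mentats / Fun / Vision / OCR pipelines

    def build_facts_block(query: str, session: dict) -> str:
        ...  # RAG lookup over attached KBs (see the retrieval sketch further down)

    def handle_turn(text: str, session: dict):
        """>> commands mutate session state; ## selectors pick a pipeline for one message."""
        if text.startswith(">>"):
            cmd, *args = text[2:].split()
            if cmd == "status":
                return session                              # sticky mode, attached KBs, last RAG query/hits
            if cmd == "fun":
                session["sticky_fun"] = (args != ["off"])   # >>fun / >>fun off
                return f"fun mode: {session['sticky_fun']}"
            if cmd == "attach":
                session.setdefault("attached_kbs", []).append(args[0])
                return f"attached {args[0]}"
            if cmd == "detach":
                if args == ["all"]:
                    session["attached_kbs"] = []
                else:
                    session["attached_kbs"] = [k for k in session.get("attached_kbs", []) if k != args[0]]
                return "detached"
            if cmd == "peek":
                return build_facts_block(" ".join(args), session)   # preview the would-be FACTS block
            return f"unknown or omitted command: {cmd}"

        if text.startswith("## "):
            selector, _, rest = text[3:].partition(" ")
            return run_pipeline(selector.lower(), rest, session)    # mentats / fun / vision / ocr

        return run_pipeline("serious", text, session)
    ```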

    The three main pipelines (what they actually do)

    1) Serious: the default “boring, reliable” answer

    Serious is the default when you don’t ask for anything special. It can inject a FACTS block (RAG) and it receives a constraints block (which is currently a V1 placeholder). It also enforces a confidence/source line if it’s missing.
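
    One plausible reading of that enforcement is a plain post-check plus explicit prompt assembly; this is a guess at the shape, not the project’s code:

    ```python
    def build_serious_prompt(query: str, facts_block: str, transcript: str) -> str:
        """Query plus explicit blocks; the constraints block is still the V1 placeholder."""
        constraints_block = "CONSTRAINTS:\n(none yet)"   # empty placeholder per the roadmap
        parts = [p for p in (transcript, facts_block, constraints_block, f"QUERY:\n{query}") if p]
        return "\n\n".join(parts)

    def ensure_confidence_line(answer: str) -> str:
        """Append a confidence/source footer when the model forgets to emit one."""
        if not any(line.lower().startswith("confidence:") for line in answer.splitlines()):
            answer += "\n\nConfidence: medium | Sources: none cited"
        return answer
    ```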

    Docs vs implementation (minor note): the docs describe Serious as “query + blocks” oriented. The current implementation also has a compact context/transcript shaping step as part of prompt construction. Treat the code as the operational truth; the docs are describing the intended shape and may lag slightly in details as things settle.

    2) Mentats: explicit 3-pass “think → critique → final”

    This is the “make the model check itself” harness:

    1. Thinker drafts using QUERY + FACTS + constraints
    2. Critic checks for overreach / violations
    3. Thinker produces the final, carrying forward a “FACTS_USED / CONSTRAINTS_USED” discipline

    If the pipeline can’t complete cleanly (protocol errors), the router falls back to Serious.
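
    A minimal version of that loop, where chat() stands in for the backend call and fallback is the Serious pipeline; the prompts are paraphrased and the protocol check is simplified to a single marker:

    ```python
    def mentats(query: str, facts: str, constraints: str, chat, fallback) -> str:
        """Explicit draft → critique → final, with a fallback to Serious on protocol errors."""
        draft = chat(
            f"QUERY:\n{query}\n\n{facts}\n\n{constraints}\n\nDraft an answer using only the material above."
        )
        critique = chat(
            "Check this draft for overreach and constraint violations.\n\n"
            f"DRAFT:\n{draft}\n\n{facts}\n\n{constraints}"
        )
        final = chat(
            "Produce the final answer. End with FACTS_USED: and CONSTRAINTS_USED: lines.\n\n"
            f"DRAFT:\n{draft}\n\nCRITIQUE:\n{critique}"
        )
        # Protocol check: if the discipline lines are missing, fall back to the Serious pipeline.
        if "FACTS_USED" not in final:
            return fallback(query)
        return final
    ```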

    3) Fun: answer first, then do the performance

    Fun is deliberately a post-processing transform:

    • pass 1: generate the correct content (lower temperature)
    • pass 2: rewrite in a persona voice (higher temperature), explicitly instructed not to change the technical meaning

    This keeps “voice” from leaking into reasoning or memory. It’s: get it right first, then style it.
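
    As a sketch (the temperatures and the persona instruction are invented; only the two-pass shape comes from the description above):

    ```python
    def fun(query: str, chat) -> str:
        """Pass 1: get the content right at low temperature. Pass 2: restyle it without changing meaning."""
        answer = chat(query, temperature=0.3)
        return chat(
            "Rewrite the following answer in the persona voice. "
            "Do not change any technical claims, numbers, or commands.\n\n" + answer,
            temperature=0.9,
        )
    ```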


    RAG, but practical: Qdrant + opt-in KB (knowledge base) attach + “peek what you’re feeding me”

    KBs are opt-in per session

    Nothing is retrieved unless you attach KBs (>>attach linux, etc.). The FACTS block is built only from attached KBs and the router tracks last query/hit counts for debugging.
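
    With Qdrant, that per-session scoping is just a payload filter on the KB tag. A minimal lookup might look like this; the collection name, payload fields, and the embed() function are assumptions:

    ```python
    from qdrant_client import QdrantClient, models

    client = QdrantClient(url="http://localhost:6333")    # assumed local Qdrant instance

    def search_attached_kbs(query: str, kbs: list[str], embed, limit: int = 5) -> list[str]:
        """Return FACTS candidates drawn only from the KBs attached in this session."""
        hits = client.search(
            collection_name="kb_chunks",                   # assumed collection name
            query_vector=embed(query),                     # embed() = whatever local embedding model you run
            query_filter=models.Filter(
                must=[models.FieldCondition(key="kb", match=models.MatchAny(any=kbs))]
            ),
            limit=limit,
        )
        return [hit.payload["text"] for hit in hits]
    ```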

    Ingestion: “KB folder → chunks → vectors in Qdrant”

    Ingestion walks markdown, chunks, embeds, and inserts into Qdrant tagged by KB. It’s simple and operational: turn a folder of docs into something you can retrieve from reliably.
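
    A hand-wavy version of that walk, with naive chunking and the same assumed collection as above (the real chunker and embedder are stand-ins):

    ```python
    import uuid
    from pathlib import Path

    from qdrant_client import QdrantClient, models

    client = QdrantClient(url="http://localhost:6333")    # assumed local Qdrant instance

    def ingest_kb(kb_name: str, folder: str, embed, chunk_size: int = 800) -> int:
        """Walk a folder of markdown, chunk it, embed it, and upsert into Qdrant tagged by KB."""
        points = []
        for md in Path(folder).rglob("*.md"):
            text = md.read_text(encoding="utf-8")
            chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]  # naive fixed-size chunks
            for chunk in chunks:
                points.append(models.PointStruct(
                    id=str(uuid.uuid4()),
                    vector=embed(chunk),
                    payload={"kb": kb_name, "source": md.name, "text": chunk},
                ))
        # Assumes the "kb_chunks" collection already exists with the right vector size.
        client.upsert(collection_name="kb_chunks", points=points)
        return len(points)
    ```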


    The KB refinery: SUMM → DISTILL → ingest

    This is one of the more interesting ideas: treat the KB as a product, not a dump.

    • SUMM produces a human-readable summary (strict: no fabrication, no silent renaming) from base text
    • DISTILL produces dense, retrieval-shaped atoms (embedding-friendly headings/bullets, minimal noise)
    • then ingest the distilled output

    The key point: DISTILL isn’t “a nicer summary.” It’s explicitly trying to produce retrieval-friendly material.
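
    As a two-call sketch (the prompts are paraphrased from the rules above, not the actual ones):

    ```python
    def refine_kb_doc(base_text: str, chat) -> tuple[str, str]:
        """SUMM then DISTILL; ingest the distilled output, keep the summary for humans."""
        summ = chat(
            "Summarise the following for a human reader. Do not fabricate anything and do not rename things.\n\n"
            + base_text
        )
        distill = chat(
            "Rewrite the following as dense, retrieval-friendly atoms: short headings, one fact per bullet, "
            "minimal noise.\n\n" + base_text
        )
        return summ, distill
    ```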


    Vodka: deterministic memory plumbing (not “AI memory vibes”)

    Vodka does two jobs:

    1. context reduction / stability: keep the effective context small and consistent
    2. explicit notes: store/retrieve nuggets on demand (!! store, ?? recall, plus cleanup commands), with a TTL so facts expire unless they are used

    It can also leave internal breadcrumb markers and later expand them when building a transcript/context — those IDs aren’t surfaced unless you deliberately show them.
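
    The explicit-notes half is essentially a key-value store with a TTL. A toy version, with the field names and the default TTL invented:

    ```python
    import time

    class NoteStore:
        """Deterministic notes: !! store writes, ?? recall reads, and unused facts expire."""

        def __init__(self, ttl_seconds: float = 7 * 24 * 3600):   # arbitrary one-week default
            self.ttl = ttl_seconds
            self.notes: dict[str, dict] = {}

        def store(self, key: str, text: str) -> None:             # backs the "!! store" command
            self.notes[key] = {"text": text, "last_used": time.time()}

        def recall(self, key: str) -> str | None:                 # backs the "?? recall" command
            note = self.notes.get(key)
            if note is None:
                return None
            note["last_used"] = time.time()                       # using a fact refreshes its TTL
            return note["text"]

        def sweep(self) -> None:                                  # cleanup command: drop expired notes
            cutoff = time.time() - self.ttl
            self.notes = {k: v for k, v in self.notes.items() if v["last_used"] >= cutoff}
    ```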


    Roadmap reality check: what’s left for V1.1

    • Constraints/GAG: placeholder in V1 (constraints block currently empty)
    • Coder role: present in config but not wired yet