I have a db with a lot of data that all needs precise summarisation. I would do it myself if it weren't 20 thousand fields long.

It is about 300k tokens, and Gemini 2.5 struggles with it, missing points and making up facts.

Separating the data into smaller sections is not an option, because even when separated the sections can take up 30k tokens, and the info that needs summarising may span 100k-token ranges.

I learnt that fine-tuning may give better results than general-purpose models, and now I'm wondering if there is anything with a high token count suited to summarisation.

Any help would be appreciated, even if it's to suggest another general-purpose model with better coherence.

  • Omega@discuss.onlineOP · 9 days ago

    I have attempted those solutions. R1 was best, but even then I would have to chunk it; it may be possible to feed it an extensive summary of the previous information for better summaries (maybe).

    Gemini is good until 200k. Scout is good until 100k. R1 was always good, up to its context limit.
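
    For reference, a minimal chunking sketch in Python (the file name and the 30k limit are placeholders, and tiktoken's cl100k_base counts are only a rough approximation for non-OpenAI tokenizers):

    import tiktoken

    def chunk_text(text: str, max_tokens: int = 30_000) -> list[str]:
        # Encode once, slice the token stream into fixed-size windows,
        # and decode each window back to text.
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)
        return [
            enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)
        ]

    with open("export.txt", encoding="utf-8") as f:
        chunks = chunk_text(f.read())
    print(f"{len(chunks)} chunks")

    A smarter splitter would cut on field/record boundaries so a single entry never straddles two chunks.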

    • pepperfree@sh.itjust.works · 9 days ago

      So something like

      Previously the text talked about [last summary]
      [The instruction prompt]...
      [Current chunk/paragraphs]
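
      In code, that loop might look roughly like this (a sketch assuming an OpenAI-compatible endpoint; R1 is served that way by DeepSeek, but the base_url, model name, and instruction wording here are placeholders):

      from openai import OpenAI

      client = OpenAI(base_url="https://api.deepseek.com", api_key="...")  # placeholder endpoint/key

      INSTRUCTION = "Summarise the current chunk precisely. Do not add facts that are not in the text."

      def rolling_summary(chunks: list[str]) -> str:
          last_summary = ""
          for chunk in chunks:
              prompt = (
                  f"Previously the text talked about: {last_summary or '(start of document)'}\n\n"
                  f"{INSTRUCTION}\n\n"
                  f"{chunk}"
              )
              resp = client.chat.completions.create(
                  model="deepseek-reasoner",  # placeholder model name
                  messages=[{"role": "user", "content": prompt}],
              )
              # Carry the newest summary forward into the next chunk's prompt.
              last_summary = resp.choices[0].message.content
          return last_summary

      The carried-forward summary grows as you go, so you may also want to cap or re-summarise it between chunks to keep the prompt small.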
      
    • SmokeyDope@lemmy.worldM · 9 days ago (edited)

      You can try VSCode + Roo to chunk it intelligently and autonomously. Get an API key from your LLM provider of choice, put your data into a text file, and edit the Roo agent personas (set to a coding persona by default): add and select a custom summarizer persona for Roo to use, then tell it to summarize the text file.