• 9 Posts
  • 209 Comments
Joined 1 month ago
Cake day: May 14th, 2026



  • Probably that plus a higher quant solves it. Thing is, most of us default to Q4_K_M as “precise enough”… and that seems to be kryptonite for the new Qwens.

    That’s another thing with hosting AI that’s not often discussed. Sure, you can maybe run that 27B model…but if it’s at Q3_XS it’s going to be … “mentally challenged”.

    I’ve heard the Gemma models with QAT are meant to be near full precision at Q4 size. Haven’t tried em yet.

    Actually, on that topic - I’ve heard there’s a different architecture (RWKV) that’s supposed to be much more efficient for long context, because it swaps the ever-growing KV cache for a fixed-size recurrent state.

    Sadly, there are few RWKV native models and retraining a standard transformer to RWKV seems like a pain in the ass. I’d need to hire a cloud GPU, distill into a different architecture, mess with datasets … honestly ICBF.
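To put rough numbers on the quant trade-off mentioned above, here is a minimal sketch that estimates weight-only memory for a 27B model. The bits-per-weight values are approximate assumptions (they vary by model and llama.cpp version), "Q3_XS" is simply the label used above mapped to an assumed ~3.3 bits, and `weight_gib` is a hypothetical helper name.

```python
# Rough weight-only memory estimate for a quantized model.
# Bits-per-weight values are approximate assumptions, not official figures.
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,  # assumed average; mixed-precision quant
    "Q3_XS": 3.3,    # assumed; label taken from the comment above
}


def weight_gib(params_billion: float, quant: str) -> float:
    """Approximate weight size in GiB (excludes KV cache and runtime overhead)."""
    total_bits = params_billion * 1e9 * APPROX_BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"27B @ {quant}: ~{weight_gib(27, quant):.1f} GiB")
```

This only shows what fits where. It says nothing about how much quality a given model loses at each level, which is the part that seems to bite with the newer Qwens.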


  • Yeah, I’ve heard the B70 is good bang for buck. My kids love using ChatGPT to generate images, and I’m aware that there are some really capable local models that can do that as well now - the B70 should make short work of it.

    That may be something for me to look at later on if I decide to keep self hosting.

    OTOH, I’m also aware that I may end up building something that they don’t actually use. Been there, done that, and I don’t want to do it again.

    Actually, on that topic, one interesting use case for me is my youngest one wants to have a YouTube channel.

    So obviously, I’m not going to let her become a YouTuber, but what I’m thinking of doing is providing her my old phone (properly locked down) so that she can video record clips of what she wants.

    Then - have those clips sent automatically to our jellyfin server so it appears like a channel. Code a fake YT plugin so that AI can do likes, positive comments etc.

    It’s… work. I dunno…maybe a good enough AI can vibe code the entire project for me.



  • Pretty simple. People keep going on about how useful these local models are for coding. So what I wanted to do was to create a standardized test for myself to see if that was true before committing to anything.

    (I think the various benchmarks out there are a bit fluffy, so I wanted to try it against a real workload.)

    What I did was throw a bunch of money at OpenRouter and then use Roo to call in different models, one at a time.

    I gave each the same task - that is, here is a piece of code, here is my ticket, do what my ticket says.

    I already knew what was wrong with the code, but I wanted to see how obedient the models are at sticking to a scoped ticket and what they would find.

    By far the best bang for buck was GPT 5.4 mini. It is exceptionally obedient at doing exactly what you tell it as long as you tell it exactly what to do.

    It won’t go off piste if properly constrained.

    I think for light to medium workloads, $20 on ChatGPT is a criminal steal. Chat and Codex have separate usage pools.

    I’m also aware that this is OpenAI’s lock-in phase, where they provide the samples of crack for free to get you hooked. And, yes, they are crack dealers in every sense of the word.

    Anyway, it’s good to know that with a little bit of elbow grease and some smarts, the smaller models, which could reasonably be self-hosted, could do a decent enough job if they are narrowly scoped.

    You’re probably not going to be able to yeet an entire code base at them and go “figure out what’s wrong and fix it” while you snooze tho, but I think that’s probably a good thing from a human-in-the-loop perspective.
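One way to score "obedience to a scoped ticket" mechanically is to check which files a model's diff touches against the files the ticket allows. This is a minimal sketch, not the harness actually used above: the function names and the glob-based allow-list are assumptions, and it only looks at `+++ b/<path>` header lines in a unified git diff.

```python
import fnmatch


def changed_files_from_diff(diff_text):
    """Collect target paths from '+++ b/<path>' lines of a unified git diff."""
    prefix = "+++ b/"
    return [
        line[len(prefix):]
        for line in diff_text.splitlines()
        if line.startswith(prefix)
    ]


def out_of_scope(changed_files, allowed_patterns):
    """Return changed files that match none of the ticket's allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch.fnmatchcase(path, pat) for pat in allowed_patterns)
    ]


if __name__ == "__main__":
    sample_diff = (
        "--- a/src/app.py\n+++ b/src/app.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n"
        "--- a/README.md\n+++ b/README.md\n@@ -1 +1 @@\n-old\n+new\n"
    )
    touched = changed_files_from_diff(sample_diff)
    print("out of scope:", out_of_scope(touched, ["src/*.py"]))
```

An empty result means the model stayed inside the ticket. Anything else is a sign it went off piste, whatever the quality of the change itself.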







  • SuspiciousCarrot78@aussie.zone (OP) to Selfhosted@lemmy.world · Do you host your own AI? · English · 3 up, 2 down · 8 hours ago (edited)

    I actually ran a series of A/B split tests (using GPT, Claude, Qwen 27B, Qwen 35B, GLM) on some code I’d written.

    The Qwen models managed to find issues the others missed and offer useful suggestions.

    Coding wise, they’re a little too eager to take the next step / be a helpful assistant, and context collapse is a real thing with them. I would say yes, they are capable, and probably even more so in the Qwen specific coding harness.

    The thing is, small models can only hold so much in their latent space. If you give them a big job or a free-range task, they will find a way to monkey paw it. They need a short leash and test gates.
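A "test gate" in the sense above can be as simple as: run the test command after each small model-made step, and keep the step only if the command exits cleanly. The sketch below assumes that reading; `passes_gate` and `apply_step` are hypothetical names, and the command you pass in would be whatever test runner your project uses.

```python
import subprocess
import sys


def passes_gate(command, timeout_s=300):
    """Run a test command; the gate passes only on exit code 0 within the timeout."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def apply_step(step_name, command):
    """Report whether a model-produced step should be kept or reverted."""
    verdict = "keep" if passes_gate(command) else "revert"
    return f"{step_name}: {verdict}"


if __name__ == "__main__":
    # Stand-in commands; a real gate would run the project's test suite.
    print(apply_step("step-1", [sys.executable, "-c", "pass"]))
    print(apply_step("step-2", [sys.executable, "-c", "import sys; sys.exit(1)"]))
```

The short leash is the other half: keep each step small enough that a failed gate costs one cheap revert and not an afternoon of untangling.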




  • SuspiciousCarrot78@aussie.zone (OP) to Selfhosted@lemmy.world · Do you host your own AI? · English · 1 up, 1 down · 8 hours ago (edited)

    Agreed. It will be ironic if 1.58-bit models (Microsoft’s BitNet) turn out to be the great white hope.

    I looked at the recent Steam stats (which is a GPU sample of convenience); the most common GPU size was 6GB. Meanwhile you probably need what…64GB unified memory or a 5090 to drive a decent model at a decent speed/context?

    There’s a real gap between the haves and the have nots and it’s widening.
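For a sense of why context alone strains a 6GB card, here is a rough KV cache estimate for a standard full-attention transformer: two tensors (K and V) per layer, each `n_kv_heads * head_dim` values per token. The model dimensions in the example are hypothetical, not taken from any specific model, and hybrid or recurrent architectures would not follow this formula.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Approximate KV cache size in GiB for a full-attention transformer.

    The factor of 2 covers the separate K and V tensors; bytes_per_value=2
    assumes an fp16 cache.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * n_tokens / 2**30


if __name__ == "__main__":
    # Hypothetical dimensions chosen for illustration only.
    for ctx in (4096, 32768, 131072):
        size = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, n_tokens=ctx)
        print(f"{ctx:>6} tokens: ~{size:.2f} GiB of KV cache")
```

That is on top of the weights, so under these assumptions a long context can eat a small card's memory before the model itself is even loaded.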





  • SuspiciousCarrot78@aussie.zone (OP) to Selfhosted@lemmy.world · Do you host your own AI? · English · 3 up, 3 down · 10 hours ago (edited)

    Myself - I’ve self-hosted LLMs before, but with only 4-8GB of VRAM (depending on which card is in place), I can’t run the good stuff at acceptable speeds.

    (Don’t @ me - I know all the tricks with turbo quants, spec decoding, MoE etc. 192GB/s is 192GB/s)

    I do use Handy (STT) which is amazing (my fingers are arthritic and typing hurts after a while).

    My personal use case for LLMs is quite simple - a souped-up super Google and / or self-reflection / journalling / sounding board. Despite being glib about it, that’s actually very useful to me.

    Work wise, I use the big winking orange asshole (Claude) when I have to. I have moral tension with it, so I’m seriously looking at other options. I hear good things about GLM 5.2, but if I can’t run Qwen 35B at any kind of decent speed, well…self-hosted GLM is a pipe dream.
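The "192GB/s is 192GB/s" remark can be turned into a back-of-envelope ceiling: during decoding, each new token has to read roughly all active weights once, so tokens per second cannot exceed bandwidth divided by active weight bytes. This sketch assumes the weights sit entirely in the memory that has that bandwidth and ignores KV cache reads and compute; the parameter counts and bits-per-weight in the example are illustrative assumptions.

```python
def max_decode_tokens_per_s(bandwidth_gb_s, active_params_billion, bits_per_weight):
    """Upper bound on decode speed when generation is memory-bandwidth bound."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Illustrative numbers only: a dense 27B model vs a MoE with ~3B active
    # parameters, both at an assumed ~4.85 bits per weight.
    for label, active_b in (("dense 27B", 27), ("MoE, ~3B active", 3)):
        ceiling = max_decode_tokens_per_s(192, active_b, 4.85)
        print(f"{label}: at most ~{ceiling:.0f} tokens/s at 192 GB/s")
```

Real throughput lands below this ceiling, and a MoE only gets its better number if the whole model fits in that memory in the first place.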