Qwen3-Coder-Next

(qwen.ai)

687 points | by danielhanchen 22 hours ago

45 comments

  • simonw 21 hours ago
    This GGUF is 48.4GB - https://huggingface.co/Qwen/Qwen3-Coder-Next-GGUF/tree/main/... - which should be usable on higher end laptops.

    I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful.

    Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next

    • kristopolous 17 hours ago
      We need a new word: not "local model" but "my own computer's model", i.e. CapEx-based.

      This distinction is important because some "we support local model" tools have things like ollama orchestration or use the llama.cpp libraries to connect to models on the same physical machine.

      That's not my definition of local. Mine is "local network", so call it the "LAN model" until we come up with something better. "Self-host" exists, but that usually means "open-weights" rather than saying anything about capping the hardware the model runs on.

      It should be defined as ~sub-$10k, using Steve Jobs' megapenny unit.

      Essentially, classify models by how many megapennies of spend it takes to get a machine that won't OOM on them.

      That's what I mean when I say local: running inference for 'free' somewhere on hardware I control that costs at most single-digit thousands of dollars. And, if I were feeling fancy, something I could potentially fine-tune on a timescale of days.

      A modern 5090 build-out with a Threadripper, NVMe, and 256GB RAM will run you about $10k +/- $1k. The MLX route is about $6,000 out the door after tax (M3 Ultra, 60-core, with 256GB).

      Lastly it's not just "number of parameters". Not all 32B Q4_K_M models load at the same rate or use the same amount of memory. The internal architecture matters and the active parameter count + quantization is becoming a poorer approximation given the SOTA innovations.

      What might be needed is some standardized eval benchmark against standardized hardware classes, with basic real-world tasks like tool calling, code generation, and document processing. There's plenty of "good enough" models out there for a large category of everyday tasks; now I want to find out what runs the best.

      Take a gen6 ThinkPad P14s/MacBook Pro and a 5090/Mac Studio, run the benchmark, and then we can say something like "time-to-first-token / tokens-per-second / memory-used / total-time-of-test" and rate that independently of how accurate the model was.

      • openclawai 15 hours ago
        For context on what cloud API costs look like when running coding agents:

        With Claude Sonnet at $3/$15 per 1M tokens, a typical agent loop with ~2K input tokens and ~500 output per call, 5 LLM calls per task, and 20% retry overhead (common with tool use): you're looking at roughly $0.05-0.10 per agent task.

        At 1K tasks/day that's ~$1.5K-3K/month in API spend.
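
        Rough back-of-envelope for those numbers (a sketch in Python; the token counts, call count, and retry rate are the assumptions stated above):

          # All inputs are assumptions from the estimate above.
          IN_PRICE, OUT_PRICE = 3 / 1e6, 15 / 1e6    # Sonnet $ per token (input/output)
          IN_TOK, OUT_TOK = 2_000, 500               # tokens per LLM call
          CALLS_PER_TASK, RETRY_OVERHEAD = 5, 1.20   # 5 calls per task, +20% retries

          per_call = IN_TOK * IN_PRICE + OUT_TOK * OUT_PRICE      # ~$0.0135
          per_task = per_call * CALLS_PER_TASK * RETRY_OVERHEAD   # ~$0.081
          monthly = per_task * 1_000 * 30                         # 1K tasks/day
          print(f"per task ${per_task:.3f}, per month ${monthly:,.0f}")  # ~$2,430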

        The retry overhead is where the real costs hide. Most cost comparisons assume perfect execution, but tool-calling agents fail parsing, need validation retries, etc. I've seen retry rates push effective costs 40-60% above baseline projections.

        Local models trading 50x slower inference for $0 marginal cost start looking very attractive for high-volume, latency-tolerant workloads.

        • jychang 11 hours ago
          On the other hand, Deepseek V3.2 is $0.38 per million tokens output. And on openrouter, most providers serve it at 20 tokens/sec.

          At 20t/s over 1 month, that's... $19something running literally 24/7. In reality it'd be cheaper than that.
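
          The arithmetic, roughly (a quick sketch; it prices every token at the output rate, so if anything it overstates the cost):

            tokens = 20 * 60 * 60 * 24 * 30       # 20 tok/s, 24/7 for 30 days: ~51.8M tokens
            print(f"${tokens * 0.38 / 1e6:.2f}")  # ~$19.70 at $0.38 per 1M output tokens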

          I bet you'd burn more than $20 in electricity with a beefy machine that can run Deepseek.

          The economics of batch>1 inference does not go in favor of consumers.

          • selcuka 7 hours ago
            > At 20t/s over 1 month, that's... $19something running literally 24/7.

            You can run agents in parallel, but yeah, that's a fair comparison.

        • taneq 14 hours ago
          At this point isn’t the marginal cost based on power consumption? At 30c/kWh and with a beefy desktop pc pulling up to half a kW, that’s 15c/hr. For true zero marginal cost, maybe get solar panels. :P
          • EGreg 12 hours ago
            This is an interesting question actually!

            Marginal cost includes energy usage, but I also burned out a MacBook GPU with vanity-eth last year, so wear-and-tear is a cost too.

        • pstuart 12 hours ago
          Might there be a way to leverage local models just to help minimize the retries -- handling the tool calling and giving the agent "perfect execution"?

          I'm a noob and am asking as wishful thinking.

          • jermaustin1 0 minutes ago
            > I'm a noob and am asking as wishful thinking.

            Don't minimize your thoughts! Outside voices and naive questions sometimes provide novel insights that might be dismissed, but someone might listen.

            I've not done this exactly, but I have set up "chains" that create a fresh context for tool calls, so their call chains don't fill the main context. There's no reason the tool calls couldn't be redirected to another LLM endpoint (local, for instance) - especially with something like gpt-oss-20b, where I've found tool execution succeeds at a higher rate than with Claude Sonnet via OpenRouter.
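
            A minimal sketch of that redirect idea, assuming both endpoints speak the OpenAI chat-completions API; the URL, model names, and helper here are placeholders, not anything these tools ship with:

              from openai import OpenAI

              # Hosted planner model plus a local OpenAI-compatible server
              # (e.g. llama-server or LM Studio) for the tool-call handling.
              planner = OpenAI()  # reads OPENAI_API_KEY from the environment
              local = OpenAI(base_url="http://localhost:1234/v1", api_key="local")

              def condense_tool_result(tool_name: str, raw_output: str) -> str:
                  """Run a tool result through the local model in its own short,
                  fresh context so the raw output never bloats the main context."""
                  resp = local.chat.completions.create(
                      model="gpt-oss-20b",  # placeholder local model name
                      messages=[
                          {"role": "system", "content": "Condense this tool output to only what the caller needs."},
                          {"role": "user", "content": f"{tool_name} returned:\n{raw_output}"},
                      ],
                  )
                  return resp.choices[0].message.content

              # The main loop then appends only the condensed string as the tool message,
              # so the planner's context stays small while the local model eats the bulk.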

      • zozbot234 16 hours ago
        You can run plenty of models on a $10K machine, or even a lot less than that; it all depends how long you're willing to wait for results. Streaming weights from SSD storage using mmap() is already a reality when running the largest and sparsest models. You can save even more on memory by limiting KV caching at the cost of extra compute, and there may be ways to push RAM savings even higher simply by tweaking the extent to which model activations are recomputed as needed.
        • kristopolous 15 hours ago
          Yeah, there are a lot of people who advocate for really slow inference on cheap infra. That's something else that should be expressed at this level of fidelity.

          Because honestly I don't care about 0.2 tps for my use cases although I've spoken with many who are fine with numbers like that.

          At least the people I've talked to say that if they have a very high confidence score that the model will succeed, they don't mind the wait.

          Essentially, if task failure is 1 in 10, I want to monitor and retry.

          If it's 1 in 1000, then I can walk away.

          The reality is most people don't have a good read on what this order of magnitude actually is for a given task. So unless you have high confidence in your confidence score, slow is useless.

          But sometimes you do...

          • zozbot234 15 hours ago
            If you launch enough tasks in parallel you aren't going to care that 1 in 10 failed, as long as the other 9 are good. Just rerun the failed job whenever you get around to it, the infra will still be getting plenty of utilization on the rest.
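
            In code that's basically just a work queue where failures go back on the pile (a trivial sketch; run_task stands in for whatever the agent actually does):

              from concurrent.futures import ThreadPoolExecutor, as_completed

              def run_task(task):
                  """Placeholder: call the slow/cheap local model here; raise on failure."""
                  ...

              def run_batch(tasks, workers=8):
                  failed = []
                  with ThreadPoolExecutor(max_workers=workers) as pool:
                      futures = {pool.submit(run_task, t): t for t in tasks}
                      for fut in as_completed(futures):
                          try:
                              fut.result()
                          except Exception:
                              failed.append(futures[fut])  # rerun whenever you get around to it
                  return failed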
      • estimator7292 1 hour ago
        Local as in localhost
      • echelon 16 hours ago
        I don't even need "open weights" to run on hardware I own.

        I am fine renting an H100 (or whatever), as long as I theoretically have access to and own everything running.

        I do not want my career to become dependent upon Anthropic.

        Honestly, the best thing for "open" might be for us to build open pipes and services and models where we can rent cloud. Large models will outpace small models: LLMs, video models, "world" models, etc.

        I'd even be fine time-sharing a running instance of a large model in a large cloud. As long as all the constituent pieces are open where I could (in theory) distill it, run it myself, spin up my own copy, etc.

        I do not deny that big models are superior. But I worry about the power the large hyperscalers are getting while we focus on small "open" models that really can't match the big ones.

        We should focus on competing with large models, not artisanal homebrew stuff that is irrelevant.

        • Aurornis 15 hours ago
          > I do not want my career to become dependent upon Anthropic

          As someone who switches between Anthropic and ChatGPT depending on the month and has dabbled with other providers and some local LLMs, I think this fear is unfounded.

          It's really easy to switch between models. The different models have some differences that you notice over time but the techniques you learn in one place aren't going to lock you into a provider anywhere.

          • mrklol 7 hours ago
            Because they make it easy. Imagine they limited their models to their own tooling - suddenly switching is real work.
          • echelon 15 hours ago
            > It's really easy to switch between models. The different models have some differences that you notice over time but the techniques you learn in one place aren't going to lock you into a provider anywhere.

            We have two cell phone providers. Google is removing the ability to install binaries, and the other one has never allowed freedom. All computing is taxed, defaults are set to the incumbent monopolies. Searching, even for trademarks, is a forced bidding war. Businesses have to shed customer relationships, get poached on brand relationships, and jump through hoops week after week. The FTC/DOJ do nothing, and the EU hasn't done much either.

            I can't even imagine what this will be like for engineering once this becomes necessary to do our jobs. We've been spoiled by not needing many tools - other industries, like medical or industrial research, tie their employment to a physical location and set of expensive industrial tools. You lose your job, you have to physically move - possibly to another state.

            What happens when Anthropic and OpenAI ban you? Or decide to only sell to industry?

            This is just the start - we're going to become more dependent upon these tools to the point we're serfs. We might have two choices, and that's demonstrably (with the current incumbency) not a good world.

            Computing is quickly becoming a non-local phenomenon. Google and the platforms broke the dream of the open web. We're about to witness the death of the personal computer if we don't do anything about it.

            • pseudony 4 hours ago
              I just don’t see it.

              I mean, the long arc of computing history has had us wobble back and forth in regards to how closed down it all was, but it seems we are almost at a golden age again with respect to good enough (if not popular) hardware.

              On the software front, we definitely swung back from the age of Microsoft. Sure, Linux is a lot more corporate than people admit, but it’s a lot more open than Microsoft’s offerings and it’s capable of running on practically everything except the smallest IOT device.

              As for LLMs. I know people have hyped themselves up to think that if you aren’t chasing the latest LLM release and running swarms of agents, you are next in the queues for the soup kitchens, but again, I don’t see why it HAS to play out that way, partly because of history (as referenced), partly because open models are already so impressive and I don’t see any reason why they wouldn’t continue to do well.

              In fact, I do my day-to-day work using an open-weight model. Beyond that, I can only say I know employers who will probably never countenance using commercially hosted LLMs, but who are already setting up self-hosted ones based on open-weight releases.

              • Orygin 1 hour ago
                > but it seems we are almost at a golden age again with respect to good enough (if not popular) hardware.

                I don't think we're in any golden age since the GPU shortages started, and now memory and disks are becoming super expensive too.

                Hardware vendors have shown they don't have an interest in serving consumers and will sell out to hyperscalers the moment they show some green bills. I fear a day where you won't be able to purchase powerful (enough) machines and will be forced to subscribe to a commercial provider to get some compute to do your job.

          • airstrike 15 hours ago
            right, but ChatGPT might not exist at some point, and if we don't force feed the open inference ecosystem and infrastructure back into the mouths of the AI devourer that is this hype cycle, we'll simply be accepting our inevitable, painful death
            • Aurornis 12 hours ago
              > right, but ChatGPT might not exist at some point

              There are multiple frontier models to choose from.

              They’re not all going to disappear.

              • Bukhmanizer 9 hours ago
                This seems absurdly naive to me with the path big tech has taken in the last 5 years. There’s literally infinite upside and almost no downside to constraining the ecosystem for the big players.

                You don’t think that eventually Google/OpenAI are going to go to the government and say, “it’s really dangerous to have all these foreign/unreglated models being used everywhere could you please get rid of them?”. Suddenly they have an oligopoly on the market.

              • airstrike 11 hours ago
                right, and the less we rely on ChatGPT and Claude, the more we give power to "all other frontier models", which right now have very, very little market share
              • hahajk 10 hours ago
                the companies could merge or buy each other
            • christkv 14 hours ago
              If they die there will be so much hardware released to do other tasks.
              • echelon 14 hours ago
                Perhaps not tasks you get the opportunity to do.

                Your job might be assigned to some other legal entity renting some other compute.

                If this goes as according to some of their plans, we might all be out of the picture one day.

                If these systems are closed, you might not get the opportunity to hire them yourself to build something you have ownership in. You might be cut out.

      • christkv 16 hours ago
        I won't need a heater with that running in my room.
        • wincy 8 hours ago
          Haha, running OSS-120B on my 5090 with most of the layers in video memory and some in RAM with LM Studio, I was hard-pressed to get it to actually use anywhere near the full 600W. Gaming in 4K playing a modern game generates substantially more sustained heat.
        • hedora 13 hours ago
          This looks like it’ll run easily on a Strix Halo (180W TDP), and be a little sluggish on previous gen AMDs (80W TDP).

          I can’t be bothered to check TDPs on 64GB macbooks, but none of these devices really count as space heaters.

      • mrklol 7 hours ago
        I mean if it’s running in your lan, isn’t it local? :D
      • bigyabai 16 hours ago
        OOM is a pretty terrible benchmark too, though. You can build a DDR4 machine that "technically" loads 256gb models for maybe $1000 used, but then you've got to account for the compute aspect and that's constrained by a number of different variables. A super-sparse model might run great on that DDR4 machine, whereas a 32b model would cause it to chug.

        There's just not a good way to visualize the compute needed, with all the nuance that exists. I think that trying to create these abstractions is what leads to people impulse-buying resource-constrained hardware and getting frustrated. The autoscalers have a huge advantage in this field that homelabbers will never be able to match.

        • FrenchTouch42 16 hours ago
          > time-to-first-token/token-per-second/memory-used/total-time-of-test

          Would it not help with the DDR4 example though if we had more "real world" tests?

          • bigyabai 16 hours ago
            Maybe, but even that fourth-order metric is missing key performance details like context length and model size/sparsity.

            The bigger takeaway (IMO) is that there will never really be hardware that scales like Claude or ChatGPT does. I love local AI, but it stresses the fundamental limits of on-device compute.

    • 1dom 21 hours ago
      I run Qwen3-Coder-30B-A3B-Instruct gguf on a VM with 13gb RAM and a 6gb RTX 2060 mobile GPU passed through to it with ik_llama, and I would describe it as usable, at least. It's running on an old (5 years, maybe more) Razer Blade laptop that has a broken display and 16gb RAM.

      I use opencode and have done a few toy projects and little changes in small repositories, and can get a pretty speedy and stable experience up to a 64k context.

      It would probably fall apart if I wanted to use it on larger projects, but I've often set tasks running on it, stepped away for an hour, and had a solution when I return. It's definitely useful for smaller projects, scaffolding, basic bug fixes, extra UI tweaks, etc.

      I don't think "usable" a binary thing though. I know you write lot about this, but it'd be interesting to understand what you're asking the local models to do, and what is it about what they do that you consider unusable on a relative monster of a laptop?

      • codedokode 1 hour ago
        The 30B-A3B model gives 13 t/s without a GPU (I've noticed that tokens/sec * # of active params roughly matches memory bandwidth).
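
        Back-of-envelope version of that rule of thumb (a sketch; the bytes-per-parameter figure is an assumption for a roughly Q8-class quant):

          # tokens/sec ~= memory bandwidth / bytes read per token,
          # where bytes per token ~= active params * bytes per param.
          active_params = 3e9      # A3B: ~3B active parameters
          bytes_per_param = 1.0    # assumption: ~1.0 at Q8, ~0.55 at Q4_K_M, 2.0 at fp16
          tok_per_s = 13

          bandwidth = tok_per_s * active_params * bytes_per_param
          print(f"implied bandwidth ~ {bandwidth / 1e9:.0f} GB/s")  # ~39 GB/s, dual-channel DDR4 territory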
      • regularfry 20 hours ago
        I've had usable results with qwen3:30b, for what I was doing. There's definitely a knack to breaking the problem down enough for it.

        What's interesting to me about this model is how good it allegedly is with no thinking mode. That's my main complaint about qwen3:30b, how verbose its reasoning is. For the size it's astonishing otherwise.

      • simonw 19 hours ago
        Honestly I've been completely spoiled by Claude Code and Codex CLI against hosted models.

      I'm hoping for an experience where I can tell my computer to do a thing - write some code, check for logged errors, find something in a bunch of files - and get an answer a few moments later.

        Setting a task and then coming back to see if it worked an hour later is too much friction for me!

    • embedding-shape 21 hours ago
      > I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful

      I've had mild success with GPT-OSS-120b (MXFP4, ends up taking ~66GB of VRAM for me with llama.cpp) and Codex.

      I'm wondering if one could crowdsource chat logs from GPT-OSS-120B running with Codex, then seed another post-training run that fine-tunes the 20B variant on the good runs from the 120B, and whether that would make a big difference. Both models with reasoning_effort set to high are actually quite good compared to other downloadable models, but the 120B is just about out of reach for 64GB, so making the 20B better for specific use cases seems like it'd be useful.

      • andai 18 hours ago
        Are you running 120B agentic? I tried using it in a few different setups and it failed hard in every one. It would just give up after a second or two every time.

        I wonder if it has to do with the message format, since it should be able to do tool use afaict.

        • nekitamo 9 hours ago
          This is a common problem for people trying to run the GPT-oss models themselves. Reposting my comment here:

          GPT-oss-120B was also completely failing for me, until someone on reddit pointed out that you need to pass back in the reasoning tokens when generating a response. One way to do this is described here:

          https://openrouter.ai/docs/guides/best-practices/reasoning-t...

          Once I did that it started functioning extremely well, and it's the main model I use for my homemade agents.

          Many LLM libraries/services/frontends don't pass these reasoning tokens back to the model correctly, which is why people complain about this model so much. It also highlights the importance of rolling these things yourself and understanding what's going on under the hood, because there's so many broken implementations floating around.
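
          The gist of the fix, as I understand it, is simply not to strip those fields when you build the next request. A rough sketch against a generic OpenAI-compatible endpoint (the URL and model name are placeholders; the exact reasoning field name - "reasoning", "reasoning_content", etc. - varies by server):

            import requests

            URL = "http://localhost:8080/v1/chat/completions"  # placeholder endpoint
            messages = [{"role": "user", "content": "List the files, then summarize them."}]

            resp = requests.post(URL, json={"model": "gpt-oss-120b", "messages": messages}).json()
            assistant_msg = resp["choices"][0]["message"]

            # The important part: append the assistant message exactly as returned,
            # including any reasoning fields and tool_calls, instead of copying only
            # the "content" string into a fresh dict (which silently drops them).
            messages.append(assistant_msg)

            # ...execute any tool calls, append their results as "tool" messages, then
            # POST again with the full history so the model sees its own prior reasoning.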

      • pocksuppet 16 hours ago
        You are describing distillation. There are better ways to do it, and it has been done in the past: DeepSeek was distilled onto Qwen.
      • gigatexal 21 hours ago
        I’ve a 128GB m3 max MacBook Pro. Running the gpt oss model on it via lmstudio once the context gets large enough the fans spin to 100 and it’s unbearable.
        • pixelpoet 20 hours ago
          Laptops are fundamentally a poor form factor for high performance computing.
        • embedding-shape 20 hours ago
          Yeah, Apple hardware doesn't seem ideal for large LLMs. Give it a go with a dedicated GPU if you're so inclined and you'll see a big difference :)
          • marci 5 hours ago
            Their issue with the Mac was the sound of the fans spinning. I doubt a dedicated GPU will resolve that.
          • politelemon 17 hours ago
            What are some good GPUs to look for if you're getting started?
            • wincy 8 hours ago
              If you want to actually run models on a computer at home? The RTX 6000 Blackwell Pro Workstation, hands down. 96GB of VRAM, fits into a standard case (I mean, it’s big, as it’s essentially the same form factor as an RTX 5090 just with a lot denser VRAM).

              My RTX 5090 can fit OSS-20B but it's a bit underwhelming, and at $3000, if I didn't also use it for gaming, I'd have been pretty disappointed.

    • mark_l_watson 15 hours ago
      I configured Claude Code to use a local model (ollama run glm-4.7-flash) that runs really well on a 32GB M2 Pro Mac mini. Maybe my standards are too low, but I was using that combination to clean up the code, make improvements, and add docs and tests to a bunch of old git repo experiment projects.
      • redundantly 12 hours ago
        Did you have to do anything special to get it to work? I tried and it would just bug out - doing things like responding with JSON strings summarizing what I asked of it, or just outright getting things wrong. For example, I asked it to summarize what a specific .js file did and it provided me with new code it made up based on the file name...
        • mark_l_watson 12 hours ago
          Yes, I had to set the Ollama context size to 32K
          • redundantly 10 hours ago
            Thank you, it's working as expected now!
    • dehrmann 20 hours ago
      I wonder if the future in ~5 years is almost all local models? High-end computers and GPUs can already do it for decent models, but not sota models. 5 years is enough time to ramp up memory production, consumers to level-up their hardware, and models to optimize down to lower-end hardware while still being really good.
      • johnsmith1840 18 hours ago
        Open-source or local models will always lag heavily behind the frontier.

        Who pays for a free model? GPU training isn't free!

        I remember early on people saying 100B+ models will run on your phone like nowish. They were completely wrong and I don't think it's going to ever really change.

        People always will want the fastest, best, easiest setup method.

        "Good enough" massively changes when your marketing team is managing k8s clusters with frontier systems in the near future.

        • margalabargala 17 hours ago
          I don't think this is as true as you think.

          People do not care about the fastest and best past a point.

          Let's use transportation as an analogy. If all you have is a horse, a car is a massive improvement. And when cars were just invented, a car with a 40mph top speed was a massive improvement over one with a 20mph top speed and everyone swapped.

          While cars with 200mph top speeds exist, most people don't buy them. We all collectively decided that for most of us, most of the time, a top speed of 110-120 was plenty, and that envelope stopped being pushed for consumer vehicles.

          If what currently takes Claude Opus 10 minutes to do can be done in 30ms, then making something that can do it in 20ms isn't going to be enough to get everyone to pay a bunch of extra money for it.

          Companies will buy the cheapest thing that meets their needs. SOTA models right now are much better than the previous generation but we have been seeing diminishing returns in the jump sizes with each of the last couple generations. If the gap between current and last gen shrinks enough, then people won't pay extra for current gen if they don't need it. Just like right now you might use Sonnet or Haiku if you don't think you need Opus.

          • johnsmith1840 14 hours ago
            This assumes a hard plateau we can effectively optimize toward forever; while that's possible, we haven't seen it.

            Again my point is "good enough" changes as possibilities open. Marketing teams running entire infra stacks is an insane idea today but may not be in the future.

            You could easily code with a local model similar to GPT-4 or GPT-3 now, but I will 10-100x your performance with a frontier model, and that will fundamentally not change.

            Hmm, but maybe there's an argument for static tasks. Once a model hits the required ability for a specific task, you can optimize it into a smaller model. So I guess I buy the argument for people working on tasks of statically capped complexity?

            PII detection, for example: a <500M model will outperform a 1-8B param model on that narrow task. But at the same time, a PII-detection bot by itself is not a product anymore. So yes, an open-source one does it, but as a result it's fundamentally less valuable, and I need to build higher and larger products to capture the value?

        • kybernetikos 17 hours ago
          GPT-3.5, as used in the first commercially available ChatGPT, is believed to have had hundreds of billions of parameters. There are now models I can run on my phone that feel like they have similar levels of capability.

          Phones are never going to run the largest models locally because they just don't have the size, but we're seeing improvements in capability at small sizes over time that mean that you can run a model on your phone now that would have required hundreds of billions of parameters less than 6 years ago.

          • johnsmith1840 14 hours ago
            Sure, but the moment you can use that small model locally, its capabilities are no longer differentiated or valuable, no?

            I suppose the future will look exactly like now. Some mixture of local and non-local.

            I guess my argument is that a market dominated by local doesn't seem right, and I think the balance will look similar to what it is right now.

          • onion2k 16 hours ago
            The G in GPT stands for Generalized. You don't need that for specialist models, so the size can be much smaller. Even coding models are quite general as they don't focus on a language or a domain. I imagine a model specifically for something like React could be very effective with a couple of billion parameters, especially if it was a distill of a more general model.
            • MzxgckZtNqX5i 16 hours ago
              I'll be that guy: the "G" in GPT stands for "Generative".
            • christkv 14 hours ago
              That's what I want: an orchestrator model that operates with a small context, and then very specialized small models for React etc.
        • __MatrixMan__ 16 hours ago
          I think we'll eventually find a way to make the cycle smaller, so instead of writing a stackoverflow post in 2024 and using a model trained on it in 2025 I'll be contributing to the expertise of a distributed-model-ish-thing on Monday and benefitting from that contribution on Tuesday.

          When that happens, the most powerful AI will be whichever has the most virtuous cycles going with as wide a set of active users as possible. Free will be hard to compete with because raising the price will exclude the users that make it work.

          Until then though, I think you're right that open will lag.

        • torginus 16 hours ago
          I don't know about frontier. I code a lot nowadays using Opus 4.5, in a way where I instruct it to do something (like a complex refactor, etc.) - I like that it's really good at actually doing what it's told, and only occasionally do I have to fight it when it goes off the rails. It also does not hallucinate all that much in my experience (I'm writing JS, YMMV with other languages), and is good at spotting dumb mistakes.

          That said, I'm not sure if this capability is only achievable in huge frontier models, I would be perfectly content using a model that can do this (acting as a force multiplier), and not much else.

        • Vinnl 16 hours ago
          > People always will want the fastest, best, easiest setup method

          When there are no other downsides, sure. But when the frontier companies start tightening the thumbscrews, price will influence what people consider good enough.

        • bee_rider 15 hours ago
          The calculation will probably get better for locally hosted models once investor generosity runs out for the remotely hosted models.
      • enlyth 14 hours ago
        I'm hoping so. What's amazing is that with local models you don't suffer from what I call "usage anxiety" where I find myself saving my Claude usage for hypothetical more important things that may come up, or constantly adjusting prompts and doing some manual work myself to spare token usage.

        Having this power locally means you can play around and experiment more without worries, it sounds like a wonderful future.

      • manbitesdog 20 hours ago
        Plus a long queue of yet-undiscovered architectural improvements
        • vercaemert 18 hours ago
          I'm surprised there isn't more "hope" in this area. Even things like the GPT Pro models; surely that sort of reasoning/synthesis will eventually make its way into local models. And that's something that's already been discovered.

          Just the other day I was reading a paper about ANNs whose connections aren't strictly feedforward but, rather, circular connections proliferate. It increases expressiveness at the (huge) cost of eliminating the current gradient descent algorithms. As compute gets cheaper and cheaper, these things will become feasible (greater expressiveness, after all, equates to greater intelligence).

          • bigfudge 16 hours ago
            It seems like a lot of the benefits of SOTA models are from data though, not architecture? Won't the moat of the big 3/4 players in getting data only grow as they are integrated deeper into businesses workflows?
            • vercaemert 16 hours ago
              That's a good point. I'm not familiar enough with the various moats to comment.

              I was just talking at a high level. If transformers are HDD technology, maybe there's SSD right around the corner that's a paradigm shift for the whole industry (but for the average user just looks like better/smarter models). It's a very new field, and it's not unrealistic that major discoveries shake things up in the next decade or less.

      • infinitezest 20 hours ago
        A lot of manufacturers are bailing on consumer lines to focus on enterprise from what I've read. Not great.
      • regularfry 20 hours ago
        Even without leveling up hardware, 5 years is a loooong time to squeeze the juice out of lower-end model capability. Although in this specific niche we do seem to be leaning on Qwen a lot.
    • kristianp 13 hours ago
      Why don't you try it out in Opencode? It's possible to hook up the openrouter api, and some providers have started to host it there [1]. It's not yet available in opencode's model list [2].

      Opencode's /connect command has a big list of providers, openrouter is on there.

      [1] https://openrouter.ai/qwen/qwen3-coder-next

      [2] https://opencode.ai/docs/zen/#endpoints

      • simonw 10 hours ago
        Oh good! OpenRouter didn't have it this morning when I first checked.
    • dcastm 19 hours ago
      I have the same experience with local models. I really want to use them, but right now, they're not on par with proprietary models in capability or speed (at least if you're using a Mac).
      • bityard 19 hours ago
        Local models on your laptop will never be as powerful as the ones that take up a rack of datacenter equipment. But there is still a surprising amount of overlap if you are willing to understand and accept the limitations.
    • vessenes 21 hours ago
      I'm thinking the next step would be to include this as a 'junior dev' and let Opus farm simple stuff out to it. It could be local, but also if it's on cerebras, it could be realllly fast.
      • ttoinou 21 hours ago
        Cerebras already has GLM 4.7 in the code plans
        • vessenes 21 hours ago
          Yep. But this is like 10x faster; 3B active parameters.
          • ttoinou 21 hours ago
            Cerebras is already 200-800 tps, do you need even faster ?
            • overfeed 20 hours ago
              Yes! I don't try to read agent tokens as they are generated, so if code generation decreases from 1 minute to 6 seconds, I'll be delighted. I'll even accept 10s -> 1s speedups. Considering how often I've seen agents spin wheels with different approaches, faster is always better, until models can 1-shot solutions without the repeated "No, wait..." / "Actually..." thinking loops
              • pqtyw 16 hours ago
                > until models can 1-shot solutions without the repeated "No, wait..." / "Actually..." thinking loops

                That would imply they'd have to be actually smarter than humans, not just faster and able to scale infinitely. IMHO that's still very far away.

    • dust42 20 hours ago
      Unfortunately Qwen3-next is not well supported on Apple silicon, it seems the Qwen team doesn't really care about Apple.

      On a 64GB M1, Q4_K_M on llama.cpp gives only 20 tok/s, while MLX is more than twice as fast. However, MLX has problems with KV cache consistency and especially with branching. So while in theory it is twice as fast as llama.cpp, it often does the prompt processing all over again, which completely trashes performance, especially with agentic coding.

      So the agony is deciding whether to endure half the possible speed but get much better KV caching in return, or to have twice the speed but then often sit through prompt processing again.

      But who knows, maybe Qwen gives them a hand? (hint,hint)

      • ttoinou 20 hours ago
        I can run nightmedia/qwen3-next-80b-a3b-instruct-mlx at 60-74 tps using LM Studio. What did you try? What benefit do you get from KV caching?
        • dust42 20 hours ago
          KV caching means that when you have a 10k-token prompt, all follow-up questions return immediately - this is standard with all inference engines.

          Now, if you are not happy with the last answer, maybe you want to simply regenerate it or change your last question - this is branching of the conversation. Llama.cpp is capable of re-using the KV cache up to that point, while MLX is not (I am using the MLX server from the MLX community project). I haven't tried with LM Studio. Maybe worth a try, thanks for the heads-up.

      • cgearhart 16 hours ago
        Any notes on the problems with MLX caching? I’ve experimented with local models on my MacBook and there’s usually a good speedup from MLX, but I wasn’t aware there’s an issue with prompt caching. Is it from MLX itself or LMstudio/mlx-lm/etc?
        • anon373839 14 hours ago
        • dust42 15 hours ago
          It is the buffer implementation. [u1 10kTok]->[a1]->[u2]->[a2]. If you branch between the assistant1 answer and the user2 message, then MLX reprocesses the u1 prompt of, let's say, 10k tokens, while llama.cpp does not.

          I just tested the GGUF and MLX versions of Qwen3-Coder-Next with llama.cpp and now with LM Studio. As I branch very often, it is highly annoying for me, to the point of being unusable. Q3-30B is then much more usable on the Mac - but by far not as powerful.

    • codazoda 16 hours ago
      I can't get Codex CLI or Claude Code to use small local models and to use tools. This is because those tools use XML and the small local models have JSON tool use baked into them. No amount of prompting can fix it.

      In a day or two I'll release my answer to this problem. But, I'm curious, have you had a different experience where tool use works in one of these CLIs with a small local model?

      • zackify 15 hours ago
        I'm using this model right now in claude code with LM Studio perfectly, on a macbook pro
        • codazoda 15 hours ago
          You mean Qwen3-Coder-Next? I haven't tried that model itself, yet, because I assume it's too big for me. I have a modest 16GB MacBook Air so I'm restricted to really small stuff. I'm thinking about buying a machine with a GPU to run some of these.

          Anywayz, maybe I should try some other models. The ones that haven't worked for tool calling, for me are:

          Llama3.1

          Llama3.2

          Qwen2.5-coder

          Qwen3-coder

          All these in 7b, 8b, or sometimes 30b (painfully) models.

          I should also note that I'm typically using Ollama. Maybe LM Studio or llama.cpp somehow improve on this?

          • vessenes 7 hours ago
            I’m mostly out of the local model game, but I can say confidently that Llama will be a waste of time for agentic workflows - it was trained before agentic fine tuning was a thing, as far as I know. It’s going to be tough for tool calling, probably regardless of format you send the request in. Also 8b models are tiny. You could significantly upgrade your inference quality and keep your privacy with say a machine at lambda labs, or some cheaper provider, though. Probably for $1/hr - where an hour is a many times more inference than an hour on your MBA.
      • regularfry 15 hours ago
        Surely the answer is a very small proxy server between the two?
        • codazoda 15 hours ago
          That might work, but I keep seeing people talk about this, so there must be a simple solution that I'm overlooking. My solution is to write my own minimal, experimental CLI that talks JSON tools.
    • organsnyder 20 hours ago
      They run fairly well for me on my 128GB Framework Desktop.
      • mittermayr 18 hours ago
        What do you run this on, if I may ask? LM Studio, Ollama, llama.cpp? Which CLI?
        • MrDrMcCoy 14 hours ago
          Can't speak for parent, but I've had decent luck with llama.cpp on my triple Ryzen AI Pro 9700 XTs.
        • redwood_ 12 hours ago
          I run Qwen3-Coder-Next (Qwen3-Coder-Next-UD-Q4_K_XL) on the Framework ITX board (Max+ 395 - 128GB) custom build. Avg. eval at 200-300 t/s and output at 35-40 t/s running with llama.cpp using rocm. Prefer Claude Code for cli.
    • brianjking 9 hours ago
      TFW 48gb M4 Pro isn't going to run it.
    • danielhanchen 21 hours ago
      It works reasonably well for general tasks, so we're definitely getting there! Probably Qwen3 CLI might be better suited, but haven't tested it yet.
    • segmondy 20 hours ago
      You do realize Claude Opus/GPT-5 are probably 1000B-2000B models? So trying to have a model that's <60B offer the same level of performance would be a miracle...
      • jrop 19 hours ago
        I don't buy this. I've long wondered if the larger models, while exhibiting more useful knowledge, are not more wasteful as we greedily explore the frontier of "bigger is getting us better results, make it bigger". Qwen3-Coder-Next seems to be a point for that thought: we need to spend some time exploring what smaller models are capable of.

        Perhaps I'm grossly wrong -- I guess time will tell.

        • bityard 19 hours ago
          You are not wrong, small models can be trained for niche use cases and there are lots of people and companies doing that. The problem is that you need one of those for each use case whereas the bigger models can cover a bigger problem space.

          There is also the counter-intuitive phenomenon where training a model on a wider variety of content than apparently necessary for the task makes it better somehow. For example, models trained only on English content exhibit measurably worse performance at writing sensible English than those trained on a handful of languages, even when controlling for the size of the training set. It doesn't make sense to me, but it probably does to credentialed AI researchers who know what's going on under the hood.

          • dagss 15 hours ago
            Not an AI researcher and I don't really know, but intuitively it makes a lot of sense to me.

            To do well as an LLM, you want to end up with the weights that get furthest in the direction of "reasoning".

            So assume that with just one language there's a possibility of getting stuck in local optima: weights that do well on the English test set but don't reason well.

            If the same-sized model instead has to learn several languages with the same number of weights, that would eliminate a lot of those local optima, because unless the weights land in a regime where real reasoning / deeper concepts are "understood", it's not possible to do well across several languages with that number of weights.

            And if you speak several languages that would naturally bring in more abstraction, that the concept of "cat" is different from the word "cat" in a given language, and so on.

          • abraae 16 hours ago
            Is that counterintuitive? If I had a model trained on 10 different programming languages, including my target language, I would expect it to do better than a model trained only on my target language, simply because it has access to so much more code/algorithms/examples than my language alone.

            i.e. there is a lot of commonality between programming languages just as there is between human languages, so training on one language would be beneficial to competency in other languages.

            • dagss 15 hours ago
              > simply because it has access to so much more code/algorithms/examples then my language alone

              I assumed that is what was catered for with "even when controlling for the size of the training set".

              I.e., assuming I am reading it right: that it is better to get the same amount of data as 25% in each of 4 languages than 100% in one language.

        • segmondy 18 hours ago
          Eventually we will have smarter small models, but as of now, larger models are smarter by far. Time and experience have already answered that.
          • adastra22 16 hours ago
            Eventually we might have smaller but just as smart models. There is no guarantee. There are information limits to smaller models of course.
      • epolanski 14 hours ago
        Aren't both latest opus and sonnet smaller than the previous versions?
      • regularfry 15 hours ago
        There is (must be - information theory) a size/capacity efficiency frontier. There is no particular reason to think we're anywhere near it right now.
  • danielhanchen 21 hours ago
    For those interested, made some Dynamic Unsloth GGUFs for local deployment at https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF and made a guide on using Claude Code / Codex locally: https://unsloth.ai/docs/models/qwen3-coder-next
    • genpfault 19 hours ago
      Nice! Getting ~39 tok/s @ ~60% GPU util. (~170W out of 303W per nvtop).

      System info:

          $ ./llama-server --version
          ggml_vulkan: Found 1 Vulkan devices:
          ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
          version: 7897 (3dd95914d)
          built with GNU 11.4.0 for Linux x86_64
      
      llama.cpp command-line:

          $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
          -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
          --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
          --ctx-size 32768
      • halcyonblue 19 hours ago
        What am I missing here? I thought this model needs 46GB of unified memory for 4-bit quant. Radeon RX 7900 XTX has 24GB of memory right? Hoping to get some insight, thanks in advance!
        • coder543 19 hours ago
          MoEs can be efficiently split between dense weights (attention/KV/etc) and sparse (MoE) weights. By running the dense weights on the GPU and offloading the sparse weights to slower CPU RAM, you can still get surprisingly decent performance out of a lot of MoEs.

          Not as good as running the entire thing on the GPU, of course.

      • danielhanchen 13 hours ago
        Super cool! Also with `--fit on` you don't need `--ctx-size 32768` technically anymore - llama-server will auto determine the max context size!
        • genpfault 10 hours ago
          Nifty, thanks for the heads-up!
    • bityard 18 hours ago
      Hi Daniel, I've been using some of your models on my Framework Desktop at home. Thanks for all that you do.

      Asking from a place of pure ignorance here, because I don't see the answer on HF or in your docs: Why would I (or anyone) want to run this instead of Qwen3's own GGUFs?

      • danielhanchen 13 hours ago
        Thanks! Oh, Qwen3's own GGUFs also work, but ours are dynamically quantized and calibrated with a reasonably large, diverse dataset, whilst Qwen's are not - see https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
        • bityard 12 hours ago
          I've read that page before and although it all certainly sounds very impressive, I'm not an AI researcher. What's the actual goal of dynamic quantization? Does it make the model more accurate? Faster? Smaller?
          • itake 10 hours ago
            More accurate and smaller.

            quantization = process to make the model smaller (lossy)

            dynamic = being smarter about the information loss, so less information is lost
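
            If it helps to see the "lossy" part concretely, here's a toy 4-bit symmetric quantization of a few weights (nothing like the real K-quants or Unsloth's dynamic scheme, just the basic idea):

              weights = [0.12, -0.07, 0.55, -0.90, 0.03]

              # Map each weight onto 15 integer levels (-7..7) plus one shared scale.
              scale = max(abs(w) for w in weights) / 7
              quantized = [round(w / scale) for w in weights]   # stored as tiny ints
              restored = [q * scale for q in quantized]         # what inference actually sees

              print(quantized)                          # [1, -1, 4, -7, 0]
              print([round(r, 3) for r in restored])    # close to the originals, but lossy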

    • ranger_danger 21 hours ago
      What is the difference between the UD and non-UD files?
      • danielhanchen 21 hours ago
        UD stands for "Unsloth-Dynamic" which upcasts important layers to higher bits. Non UD is just standard llama.cpp quants. Both still use our calibration dataset.
        • CamperBob2 20 hours ago
          Please consider authoring a single, straightforward introductory-level page somewhere that explains what all the filename components mean, and who should use which variants.

          The green/yellow/red indicators for different levels of hardware support are really helpful, but far from enough IMO.

          • danielhanchen 20 hours ago
            Oh good idea! In general UD-Q4_K_XL (Unsloth Dynamic 4bits Extra Large) is what I generally recommend for most hardware - MXFP4_MOE is also ok
            • Keats 18 hours ago
              Is there some indication of how the different bit quantizations affect performance? I.e., I have a 5090 + 96GB, so I want to get the best possible model, but I don't care about getting 2% better perf if I only get 5 tok/s.
              • mirekrusin 17 hours ago
                It takes the download time plus a minute to test speed yourself, and you can try different quants. It's hard to write down a table because it depends on your system, i.e. RAM clock etc., if you spill out of the GPU.

                I guess it would make sense to have something like the max context size / quants that fit fully on common configs: single GPU, dual GPUs, unified RAM on Macs, etc.

          • segmondy 20 hours ago
            The green/yellow/red indicators are based on what you set for your hardware on huggingface.
        • ranger_danger 15 hours ago
          What is your definition of "important" in this context?
    • MrDrMcCoy 15 hours ago
      Still hoping IQuest-Coder gets the same treatment :)
    • binsquare 21 hours ago
      How did you do it so fast?

      Great work as always btw!

      • danielhanchen 20 hours ago
        Thanks! :) We're early access partners with them!
    • bytesandbits 9 hours ago
      how are you so fast man
    • CamperBob2 16 hours ago
      Good results with your Q8_0 version on 96GB RTX 6000 Blackwell. It one-shotted the Flappy Bird game and also wrote a good Wordle clone in four shots, all at over 60 tps. Thanks!

      Is your Q8_0 file the same as the one hosted directly on the Qwen GGUF page?

      • danielhanchen 13 hours ago
        Nice! Yes Q8_0 is similar - the others are different since they use a calibration dataset.
  • simonw 19 hours ago
    I got this running locally using llama.cpp from Homebrew and the Unsloth quantized model like this:

      brew upgrade llama.cpp # or brew install if you don't have it yet
    
    Then:

      llama-cli \
        -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
        --fit on \
        --seed 3407 \
        --temp 1.0 \
        --top-p 0.95 \
        --min-p 0.01 \
        --top-k 40 \
        --jinja
    
    That opened a CLI interface. For a web UI on port 8080 along with an OpenAI chat completions compatible endpoint do this:

      llama-server \
        -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
        --fit on \
        --seed 3407 \
        --temp 1.0 \
        --top-p 0.95 \
        --min-p 0.01 \
        --top-k 40 \
        --jinja
    
    It's using about 28GB of RAM.
    • technotony 17 hours ago
      what are your impressions?
      • simonw 14 hours ago
        I got Codex CLI running against it and was sadly very unimpressed - it got stuck in a loop running "ls" for some reason when I asked it to create a new file.
        • danielhanchen 13 hours ago
          Yes, sadly that sometimes happens - the issue is Codex CLI / Claude Code were designed specifically for GPT / Claude models, so it's hard for OSS models to directly utilize the full spec / tools etc., and they might get into loops sometimes. I would maybe try the MXFP4_MOE quant to see if it helps, and maybe try Qwen CLI (I was planning to make a guide for it as well).

          I guess once we see the day OSS models truly utilize Codex / CC very well, local models will really take off.

    • nubg 16 hours ago
      What's the tokens-per-second speed?
  • skhameneh 21 hours ago
    It’s hard to elaborate just how wild this model might be if it performs as claimed. The claims are this can perform close to Sonnet 4.5 for assisted coding (SWE bench) while using only 3B active parameters. This is obscenely small for the claimed performance.
    • Aurornis 17 hours ago
      I experimented with the Q2 and Q4 quants. First impression is that it's amazing we can run this locally, but it's definitely not at Sonnet 4.5 level at all.

      Even for my usual toy coding problems it would get simple things wrong and require some poking to get there.

      A few times it got stuck in thinking loops and I had to cancel prompts.

      This was using the recommended settings from the unsloth repository. It's always possible that there are some bugs in early implementations that need to be fixed later, but so far I don't see any reason to believe this is actually a Sonnet 4.5 level model.

      • Kostic 17 hours ago
        I would not go below q8 if comparing to sonnet.
        • anon373839 1 hour ago
          Yeah. Q2 in any model is just severely damaged, unfortunately. Wish it weren’t so.
      • margalabargala 17 hours ago
        Wonder where it falls on the Sonnet 3.7/4.0/4.5 continuum.

        3.7 was not all that great. 4 was decent for specific things, especially self contained stuff like tests, but couldn't do a good job with more complex work. 4.5 is now excellent at many things.

        If it's around the perf of 3.7, that's interesting but not amazing. If it's around 4, that's useful.

        • Computer0 12 hours ago
          I still have yet to find a "Small" model that can use function calls consistently enough to not be frustrating. That is the most noticeable difference I consistently see between even older "SOTA" models and the best performing "SMALL" models (<70b).
      • cubefox 17 hours ago
        > I experimented with the Q2 and Q4 quants.

        Of course you get degraded performance with this.

        • Aurornis 15 hours ago
          Obviously. That's why I led with that statement.

          Those are the quant thresholds where people with mid-high end hardware can run this locally at reasonable speed, though.

          In my experience Q2 is flakey, but Q4 isn't dramatically worse.

          • cubefox 4 hours ago
            > Obviously. That's why I led with that statement.

            Then why did you write this?

            > It's always possible that there are some bugs in early implementations that need to be fixed later, but so far I don't see any reason to believe this is actually a Sonnet 4.5 level model.

    • cmrdporcupine 16 hours ago
      It feels more like Haiku level than Sonnet 4.5 from my playing with it.
    • cirrusfan 21 hours ago
      If it sounds too good to be true…
      • theshrike79 21 hours ago
        Should be possible with optimised models, just drop all "generic" stuff and focus on coding performance.

        There's no reason for a coding model to contain all of ao3 and wikipedia =)

        • jstummbillig 19 hours ago
          There is: It works (even if we can't explain why right now).

          If we knew how to create a SOTA coding model by just putting coding stuff in there, that is how we would build SOTA coding models.

        • noveltyaccount 20 hours ago
          I think I like coding models that know a lot about the world. They can disambiguate my requirements and build better products.
          • regularfry 20 hours ago
            I generally prefer a coding model that can google for the docs, but separate models for /plan and /build is also a thing.
            • noveltyaccount 19 hours ago
              > separate models for /plan and /build

              I had not considered that, seems like a great solution for local models that may be more resource-constrained.

              • regularfry 19 hours ago
                You can configure aider that way. You get three, in fact: an architect model, a code editor model, and a quick model for things like commit messages. Although I'm not sure if it's got doc searching capabilities.
        • moffkalast 19 hours ago
          That's what Meta thought initially too, training codellama and chat llama separately, and then they realized they're idiots and that adding the other half of data vastly improves both models. As long as it's quality data, more of it doesn't do harm.

          Besides, programming is far from just knowing how to autocomplete syntax, you need a model that's proficient in the fields that the automation is placed in, otherwise they'll be no help in actually automating it.

          • theshrike79 16 hours ago
            But as far as I know, that was way before tool calling was a thing.

            I'm more bullish about small and medium sized models + efficient tool calling than I'm about LLMs too large to be run at home without $20k of hardware.

            The model doesn't need to have the full knowledge of everything built into it when it has the toolset to fetch, cache and read any information available.

        • MarsIronPI 20 hours ago
          But... but... I need my coding model to be able to write fanfiction in the comments...
        • wongarsu 15 hours ago
          Now I wonder how strong the correlation between coding performance and ao3 knowledge is in human programmers. Maybe we are on to something here /s
      • FuckButtons 17 hours ago
        There have been advances recently (last year) in scaling deep rl by a significant amount, their announcement is in line with a timeline of running enough experiments to figure out how to leverage that in post training.

        Importantly, this isn't just throwing more data at the problem in an unstructured way. AFAIK companies are getting as many git histories as they can and doing something along the lines of: get an LLM to checkpoint pull requests, features, etc., convert those into plausible input prompts, then run deep RL with passing the acceptance criteria / tests as the reward signal.

      • Der_Einzige 17 hours ago
        It literally always is. HN thought DeepSeek and every version of Kimi would finally dethrone the bigger models from Anthropic, OpenAI, and Google. They're literally always wrong, and the average knowledge of LLMs here is shockingly low.
        • cmrdporcupine 16 hours ago
          Nobody has been saying they'd be dethroned. We're saying they're often "good enough" for many use cases, and that they're doing a good job of stopping the Big Guys from creating a giant expensive moat around their businesses.

          Chinese labs are acting as a disruption against Altman et al.'s attempt to create big tech monopolies, and that's why some of us cheer for them.

          • Der_Einzige 8 hours ago
            "Nobody says X" is as presumptuous and wrong (both metaphorically and literally) as "LLMs can't do X". It is one of the worst thought terminating cliches.

            Thousands have been saying this, you aren't paying attention.

            • cmrdporcupine 8 hours ago
              As thought terminating as "HN Thought [insert strawman here]"

              C'mon.

  • tommyjepsen 19 hours ago
    I got the Qwen3 Coder 30B running locally on a Mac M4 Max 36GB. It was slow, but it worked and did do some decent stuff: https://www.youtube.com/watch?v=7mAPaRbsjTU

    The video is sped up. I ran it through LM Studio and then OpenCode. Wrote a bit about how I set it all up here: https://www.tommyjepsen.com/blog/run-llm-locally-for-coding

  • vessenes 21 hours ago
    3B active parameters, and slightly worse than GLM 4.7. On benchmarks. That's pretty amazing! With better orchestration tools being deployed, I've been wondering if faster, dumber coding agents paired with wise orchestrators might be overall faster than using, say, Opus 4.5 at the bottom for coding. At least we might want to delegate simple tasks to these guys.
    • markab21 21 hours ago
      It's getting a lot easier to do this using sub-agents with tools in Claude. I have a fleet of Mastra agents (TypeScript). I use those agents inside my project as CLI tools to do repetitive tasks that gobble tokens such as scanning code, web search, library search, and even SourceGraph traversal.

      Overall, it's allowed me to maintain more consistent workflows as I'm less dependent on Opus. Now that Mastra has introduced the concept of Workspaces, which allow for more agentic development, this approach has become even more powerful.

      • solumunus 20 hours ago
        Are you just exposing mastra cli commands to Claude Code in md context? I’d love you to elaborate on this if you have time.
      • IhateAI 20 hours ago
        [flagged]
        • mrandish 19 hours ago
          > just (expensive) magic trick

          Related: as an actual magician, although no longer performing professionally, I was telling another magician friend the other day that IMHO, LLMs are the single greatest magic trick ever invented judging by pure deceptive power. Two reasons:

          1. Great magic tricks exploit flaws in human perception and reasoning by seeming to be something they aren't. The best leverage more than one. By their nature, LLMs perfectly exploit the ways humans assess intelligence in themselves and others - knowledge recall, verbal agility, pattern recognition, confident articulation, etc. No other magic trick stacks so many parallel exploits at once.

          2. But even the greatest magic tricks don't fool their inventors. David Copperfield doesn't suspect the lady may be floating by magic. Yet, some AI researchers believe the largest, most complex LLMs actually demonstrate emergent thinking and even consciousness. It's so deceptive it even fools people who know how it works. To me, that's a great fucking trick.

          • a_wild_dandan 16 hours ago
            Speaking of tricks, does anyone here know how many angels can dance on the head of a pin?
          • IhateAI 18 hours ago
            Also, just like in centuries past, when rulers and governments bet their entire empires on the predictions of the magicians and seers they consulted: machine learning engineers are the new seers, and their models are their magic tricks. It seems like history really is a circle.
    • doctorpangloss 21 hours ago
      Time will tell. All this stuff will get more adoption when Anthropic, Google and OpenAI raise prices.
      • Alifatisk 20 hours ago
        They can only raise prices as long as people buy their subscriptions / pay for their API. The Chinese labs are closing in on the SOTA models (I would say they are already there) and offer insanely cheap prices for their subscriptions. Vote with your wallet.
  • 0cf8612b2e1e 16 hours ago
    What is the best place to see local model rankings? The benchmarks seem so heavily gamed that I am willing to believe the “objective” rankings are a lie and personal reviews are more meaningful.

    Are there any clear winners per domain? Code, voice-to-text, text-to-voice, text editing, image generation, text summarization, business-text-generation, music synthesis, whatever.

  • predkambrij 14 hours ago
    17t/s on a laptop with 6GB VRAM and DDR5 system memory. Maximum of 100k context window (then it saturates VRAM). Quite amazing, but tbh I'll still use inference providers, because it's too slow and it's my only machine with "good" specs :)

        cat docker-compose.yml
        services:
          llamacpp:
            volumes:
              - llamacpp:/root
            container_name: llamacpp
            restart: unless-stopped
            image: ghcr.io/ggml-org/llama.cpp:server-cuda
            network_mode: host
            command: |
              -hf unsloth/Qwen3-Coder-Next-GGUF:Q4_K_XL --jinja --cpu-moe --n-gpu-layers 999 --ctx-size 102400 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on
            # unsloth/gpt-oss-120b-GGUF:Q2_K
            deploy:
              resources:
                reservations:
                  devices:
                    - driver: nvidia
                      count: all
                      capabilities: [gpu]
    
        volumes:
           llamacpp:
  • Tepix 18 hours ago
    Using lmstudio-community/Qwen3-Coder-Next-GGUF:Q8_0 I'm getting up to 32 tokens/s on Strix Halo, with room for 128k of context (out of 256k that the model can manage).

    From very limited testing, it seems to be slightly worse than MiniMax M2.1 Q6 (a model about twice its size). I'm impressed.

    • dimgl 18 hours ago
      How's the Strix Halo? I'd really like to get a local inference machine so that I don't have to use quantized versions of local models.
      • evilduck 16 hours ago
        Works great for these types of MoE models. The large amount of VRAM lets you run different models in parallel easily, or have actually useful context sizes. Dense models can get sluggish though. AMD's ROCm support has been a little rough for Stable Diffusion stuff (memory issues leading to application stability problems) but it's worked well with LLMs, as does Vulkan.

        I wish AMD would get around to adding NPU support in Linux for it though, it has more potential that could be unlocked.

      • Tepix 16 hours ago
        Prompt processing is slow; the rest is pretty great.
    • cmrdporcupine 18 hours ago
      I'm getting similar numbers on an NVIDIA Spark: around 25-30 tokens/sec output, 251 tokens/sec prompt processing... but I'm running with the Q4_K_XL quant. I'll try the Q8 next, but that would leave less room for context.

      I tried FP8 in vLLM and it used 110GB and then my machine started to swap when I hit it with a query. Only room for 16k context.

      I suspect there will be some optimizations over the next few weeks that will pick up the performance on these types of machines.

      I have it writing some Rust code and it's definitely slower than using a hosted model but it's actually seeming pretty competent. These are the first results I've had on a locally hosted model that I could see myself actually using, though only once the speed picks up a bit.

      I suspect the API providers will offer this model for nice and cheap, too.

      • aseipp 18 hours ago
        llama.cpp is giving me ~35 tok/sec with the Unsloth quants (UD-Q4_K_XL, elsewhere in this thread) on my Spark. FWIW my understanding and experience is that llama.cpp seems to give slightly better performance for "single user" workloads, but I'm not sure why.

        I'm asking it to do some analysis/explain some Rust code in a rather large open source project and it's working nicely. I agree this is a model I could possibly, maybe use locally...

        • cmrdporcupine 17 hours ago
          Yeah I got 35-39tok/sec for one shot prompts, but for real-world longer context interactions through opencode it seems to be averaging out to 20-30tok/sec. I tried both MXFP4 and Q4_K_XL, no big difference, unfortunately.

          --no-mmap --fa on options seemed to help, but not dramatically.

          As with everything Spark, memory bandwidth is the limitation.

          I'd like to be impressed with 30tok/sec but it's sort of a "leave it overnight and come back to the results" kind of experience, wouldn't replace my normal agent use.

          However I suspect in a few days/weeks DeepInfra.com and others will have this model (maybe Groq, too?), and will serve it faster and for fairly cheap.

  • Alifatisk 20 hours ago
    As always, the Qwen team is pushing out fantastic content

    Hope they update the model page soon https://chat.qwen.ai/settings/model

    • getcrunk 15 hours ago
      That’s a perfectly fine usage of content (primary substance offered by a “website”)
    • smallerfish 19 hours ago
      > "content"

      Sorry, but we're talking about models as content now? There's almost always a better word than "content" if you're describing something that's in tech or online.

      • Alifatisk 15 hours ago
        I wasn't only referring to their new model, I meant their blog post and the research behind their progress; it's always a joyride to read.

        I didn’t know it was this serious with the vocabulary, I’ll be more cautious in the future.

      • Havoc 17 hours ago
        Not everyone on HN is a native English speaker...
  • adefa 15 hours ago
    Benchmarks using DGX Spark on vLLM 0.15.1.dev0+gf17644344

      FP8: https://huggingface.co/Qwen/Qwen3-Coder-Next-FP8
    
      Sequential (single request)
    
        Prompt     Gen     Prompt Processing    Token Gen
        Tokens     Tokens  (tokens/sec)         (tokens/sec)
        ------     ------  -----------------    -----------
           521        49            3,157            44.2
         1,033        83            3,917            43.7
         2,057        77            3,937            43.6
         4,105        77            4,453            43.2
         8,201        77            4,710            42.2
    
      Parallel (concurrent requests)
    
        pp4096+tg128 (4K context, 128 gen):
    
         n    t/s
        --    ----
         1    28.5
         2    39.0
         4    50.4
         8    57.5
        16    61.4
        32    62.0
    
        pp8192+tg128 (8K context, 128 gen):
    
         n    t/s
        --    ----
         1    21.6
         2    27.1
         4    31.9
         8    32.7
        16    33.7
        32    31.7
    • cmrdporcupine 14 hours ago
      I tried the FP8 in vLLM on my Spark and although it fit in memory, I started swapping once I actually tried to run any queries, and, yeah, could not have a context larger than 8k.

      I figured out later this is because vLLM apparently de-quantizes to BF16 at runtime, so pointless to run the FP8?

      I get about 30-35 tok/second using llama.cpp and a 4-bit quant. And a 200+k context, using only 50GB of RAM.

      • justaboutanyone 13 hours ago
        Running llama.cpp rather than vLLM, it's happy enough to run the FP8 variant with 200k+ context using about 90GB vram
        • cmrdporcupine 11 hours ago
          yeah, what did you get for tok/sec there though? Memory bandwidth is the limitation with these devices. With 4 bit I didn't get over 35-39 tok/sec, and averaged more like 30 when doing actual tool use with opencode. I can't imagine fp8 being faster.
  • codedokode 1 hour ago
    It's sad they only have an 80B version, given current RAM prices.
  • cedws 20 hours ago
    I kind of lost interest in local models. Then Anthropic started saying I’m not allowed to use my Claude Code subscription with my preferred tools and it reminded me why we need to support open tools and models. I’ve cancelled my CC subscription, I’m not paying to support anticompetitive behaviour.
    • Aurornis 18 hours ago
      > Then Anthropic started saying I’m not allowed to use my Claude Code subscription with my preferred tools

      To be clear, since this confuses a lot of people in every thread: Anthropic will let you use their API with any coding tools you want. You just have to go through the public API and pay the same rate as everyone else. They have not "blocked" or "banned" any coding tools from using their API, even though a lot of the clickbait headlines have tried to insinuate as much.

      Anthropic never sold subscription plans as being usable with anything other than their own tools. They were specifically offered as a way to use their own apps for a flat monthly fee.

      They obviously set the limits and pricing according to typical use patterns of these tools, because the typical users aren't maxing out their credits in every usage window.

      Some of the open source tools reverse engineered the protocol (which wasn't hard) and people started using the plans with other tools. This situation went on for a while without enforcement until it got too big to ignore, and they began protecting the private endpoints explicitly.

      The subscription plans were never sold as a way to use the API with other programs, but I think they let it slide for a while because it was only a small number of people doing it. Once the tools started getting more popular they started closing loopholes to use the private API with other tools, which shouldn't really come as a surprise.

      • ericd 18 hours ago
        The anticompetitive part is setting a much lower price for typical usage of Claude Code vs. typical usage of another CLI dev tool.
        • gehsty 18 hours ago
          Anticompetitive with themselves? It's not like Claude / Anthropic have any kind of monopoly, and services companies are allowed to charge different rates for different kinds of access to said service?
        • rhgraysonii 17 hours ago
          The anticompetitive move would be refusing to run their software if 'which codex' showed a binary, and then not letting you use it due to its presence. Companies are allowed to set pricing and not let you borrow the jet to fly to a non-approved destination. This distortion is just wrong as a premise. They are being competitive by making a superior tool, and their business model is "no one else sells Claude"; they are pretty right to do this IMO.
          • ericd 17 hours ago
            Anticompetitive behavior has been normalized in our industry, doesn't make it not anticompetitive. It's a restriction that's meant to make it harder to compete with other parts of their offering. The non-anticompetitive approach would be to offer their subscription plans with a certain number of tokens every month, and then make Claude Code the most efficient with the tokens, to let it compete on its own merits.
      • falloutx 14 hours ago
        > Anthropic will let you use their API with any coding tools you want

        No, in 2026, even with their API plan, creating keys is disabled for most orgs; you basically have to ask your admin to give you a key to use something other than Claude Code. You can imagine how that would be a problem.

        • CryptoBanker 13 hours ago
          That’s not an Anthropic problem, that’s a problem with whomever you work for.
          • falloutx 4 hours ago
            Have talked to engineers in at least 5 more companies and they have the same issue; apparently it's part of the deal Anthropic is giving to companies, and they are happily taking it. I have never seen companies so compliant to an external vendor.
      • huevosabio 18 hours ago
        Yes, exactly. The discourse has been so far off the rails now.
      • cedws 17 hours ago
        The question I pose is this: if they're willing to start building walls this early in the game while they've still got plenty of viable competitors, and are at most 6 months ahead, how will they treat us if they achieve market dominance?

        Some people think LLMs are the final frontier. If we just give in and let Anthropic dictate the terms to us we're going to experience unprecedented enshittification. The software freedom fight is more important than ever. My machine is sovereign; Anthropic provides the API, everything I do on my machine is my concern.

      • 8note 17 hours ago
        From what I remember, I couldn't actually use Claude Code with the subscription when I subscribed. I could only use it with third-party tools.

        Eventually they added subscription support, and that worked better than Cline or Kilo, but I'm still not clear what Anthropic tools the subscription was actually useful for.

      • Draiken 17 hours ago
        I don't get why so much mental gymnastics is done to avoid the fact that locking their lower prices to effectively subsidize their shitty product is the anticompetitive behavior.

        They simply don't want to compete, they want to force the majority of people that can't spend a lot on tokens to use their inferior product.

        Why build a better product if you control the cost?

    • aljgz 19 hours ago
      You gave up some convenience to avoid voting for a bad practice with your wallet. I admire this and try to consistently do the same when reasonably feasible.

      Problem is, most people don't do this, choosing convenience at any given moment without thinking about longer-term impact. This hurts us collectively by letting governments/companies, etc tighten their grip over time. This comes from my lived experience.

      • gloomyday 19 hours ago
        Society is lacking people that stand up for something. My efforts to consume less are seen as being cheap by my family, which I find so sad. I much prefer donating my money to exchanging superfluous gifts on Christmas.
      • pluralmonad 18 hours ago
        As I get older I more and more view convenience as the enemy of good. Luckily (or unluckily for some) a lot of the tradeoffs we are asked to make in the name of convenience are increasingly absurd. I have an easier and easier time going without these Faustian bargains.
        • aljgz 17 hours ago
          IMHO, the question is: who is in control? The user, or the profit-seeking company / control-seeking government? There is nothing we can do to prevent companies from seeking profit. What we can do is prefer tools that we control; if that choice is not available, then tools that we can abandon when we want, over tools that both remove our control AND would be prohibitively difficult to abandon.
    • skapadia 19 hours ago
      Claude Opus 4.5 by far is the most capable development model. I've been using it mainly via Claude Code, and with Cursor.

      I agree anticompetitive behavior is bad, but the productivity gains to be had by using Anthropic models and tools are undeniable.

      Eventually the open tools and models will catch up, so I'm all for using them locally as well, especially if sensitive data or IP is involved.

      • vercaemert 19 hours ago
        I'd encourage you to try the -codex family with the highest reasoning.

        I can't comment on Opus in CC because I've never bit the bullet and paid the subscription, but I have worked my way up to the $200/month Cursor subscription and the 5.2 codex models blow Opus out of the water in my experience (obviously very subjective).

        I arrived at making plans with Opus and then implementing with the OpenAI model. The speed of Opus is much better for planning.

        I'm willing to believe that CC/Opus is truly the overall best; I'm only commenting because you mentioned Cursor, where I'm fairly confident it's not. I'm basing my judgement on "how frequently does it do what I want the first time".

        • skapadia 18 hours ago
          Thanks, I'll try those out. I've used Codex CLI itself on a few small projects as well, and fired it up on a feature branch where I had it implement the same feature that Claude Code did (they didn't see each other's implementations). For that specific case, the implementation Codex produced was simpler, and better for the immediate requirements. However, Claude's more abstracted solution may have held up better to changing requirements. Codex feels more reserved than Claude Code, which can be good or bad depending on the task.
        • eadwu 18 hours ago
          I've tried nearly all the models; they all work best if and only if you will never handle the code ever again. They suck if you have a solution and want them to implement that solution.

          I've tried explaining the implementation word for word and it still prefers to create a whole new implementation, reimplementing some parts, instead of just doing what I tell it to. The only time it works is if I actually give it the code, but at that point there's no reason to use it.

          There's nothing wrong with this approach if it actually had guarantees, but current models are an extremely bad fit for it.

          • vercaemert 18 hours ago
            Yes, I only plan/implement on fully AI projects where it's easy for me to tell whether or not they're doing the thing I want regardless of whether or not they've rewritten the codebase.

            For actual work that I bill for, I go in with instructions to do minimal changes, and then I carefully review/edit everything.

            That being said, the "toy" fully-AI projects I work with have evolved to the point where I regularly accomplish things I never (never ever) would have without the models.

          • teaearlgraycold 18 hours ago
            There are domains of programming (web front end) where lots of requests can be done pretty well even when you want them done a certain way. Not all, but enough to make it a great tool.
      • Uehreka 17 hours ago
        > Claude Opus 4.5 by far is the most capable development model.

        At the moment I have a personal Claude Max subscription and ChatGPT Enterprise for Codex at work. Using both, I feel pretty definitively that gpt-5.2-codex is strictly superior to Opus 4.5. When I use Opus 4.5 I’m still constantly dealing with it cutting corners, misinterpreting my intentions and stopping when it isn’t actually done. When I switched to Codex for work a few months ago all of those problems went away.

        I got the personal subscription this month to try out Gas Town and see how Opus 4.5 does on various tasks, and there are definitely features of CC that I miss with Codex CLI (I can’t believe they still don’t have hooks), but I’ve cancelled the subscription and won’t renew it at the end of this month unless they drop a model that really brings them up to where gpt-5.2-codex is at.

        • Der_Einzige 17 hours ago
          I have literally the opposite experience, and so does most of AI-pilled Twitter and the AI research community at the top conferences (NeurIPS, ICLR, ICML, AAAI). Why does this FUD keep appearing on this site?

          Edit: It's very true that the big 4 labs silently mess with their models and any action of that nature is extremely user hostile.

          • CamperBob2 17 hours ago
            Probably because all of the major providers are constantly screwing around with their models, regardless of what they say.
      • skippyboxedhero 18 hours ago
        It feels very close to a trade-off point.

        I agree with all posts in the chain: Opus is good, Anthropic have burned good will, I would like to use other models...but Opus is too good.

        What I find most frustrating is that I am not sure if it is even actual model quality that is the blocker with other models. Gemini just goes off the rails sometimes with strange bugs like writing random text continuously and burning output tokens, Grok seems to have system prompts that result in odd behaviour... no bugs, just doing weird things, Gemini Flash models seem to output massive quantities of text for no reason... it often feels like very stupid things.

        Also, there are huge issues with adopting some of these open models in terms of IP. Third parties are running these models and you are just sending them all your code...with a code of conduct promise from OpenRouter?

        I also don't think there needs to be a huge improvement in models. Opus feels somewhat close to the reasonable limit: useful, still outputs nonsense, misses things sometimes...there are open models that can reach the same 95th percentile but the median is just the model outputting complete nonsense and trying to wipe your file system.

        The day for open models will come but it still feels so close and so far.

    • giancarlostoro 20 hours ago
      I do wonder if they locked things down due to people abusing their CC token.
      • simonw 20 hours ago
        I buy the theory that Claude Code is engineered to use things like token caching efficiently, and their Claude Max plans were designed with those optimizations in mind.

        If people start using the Claude Max plans with other agent harnesses that don't use the same kinds of optimizations the economics may no longer have worked out.

        (But I also buy that they're going for horizontal control of the stack here and banning other agent harnesses was a competitive move to support that.)

        • mirekrusin 19 hours ago
          It should just burn quota faster then. Instead of blocking, they should just mention that if you use other tools your quota may drain at 3x the speed compared to CC. People would switch.
        • andai 18 hours ago
          When I last checked a few months ago, Anthropic was the only provider that didn't have automatic prompt caching. You had to do it manually (and you could only set checkpoints a few times per context?), and most 3rd party stuff does not.

          They seem to have started rejecting 3rd party usage of the sub a few weeks ago, before Claw blew up.

          By the way, does anyone know about the Agents SDK? Apparently you can use it with an auth token, is anyone doing that? Or is it likely to get your account in trouble as well?

        • volkercraig 20 hours ago
          Absolutely. I installed clawdbot for just long enough to send a single message, and it burned through almost a quarter of my session allowance. That was enough for me. Meanwhile I can use CC comfortably for a few hours and I've only hit my token limit a few times.

          I've had a similar experience with opencode, but I find that works better with my local models anyway.

          • andai 18 hours ago
            I used it for a few mins and it burned 7M tokens. Wish there was a way to see where it's going!

            (There probably is, but I found it very hard to make sense of the UI and how everything works. Hard to change models, no chat history etc.?)

            • giancarlostoro 13 hours ago
              I have a feeling the different harnesses create new context windows instead of using one. The more context windows you open up with Claude the quicker your usage goes poof.
          • giancarlostoro 19 hours ago
            Wow, that is very surprising and alarming. I wish Anthropic would have made a more public statement as to why they blocked other harnesses.
        • pluralmonad 18 hours ago
          I would be surprised if the primary reason for banning third party clients isn't because they are collecting training data via telemetry and analytics in CC. I know CC needlessly connects to google infrastructure, I assume for analytics.
        • ImprobableTruth 19 hours ago
          If that was the real reason, why wouldn't they just make it so that if you don't correctly use caching you use up more of your limit?
      • segmondy 19 hours ago
        Nah, their "moat" is CC; they are afraid that as other folks build effective coding agents, they are going to lose market share.
      • cedws 20 hours ago
        In what way would it be abused? The usage limits apply all the same, they aren't client side, and hitting that limit is within the terms of the agreement with Anthropic.
        • bri3d 20 hours ago
          The subscription services have assumptions baked in about the usage patterns; they're oversubscribed and subsidized. If 100% of subscriber customers use 100% of their tokens 100% of the time, their business model breaks. That's what wholesale / API tokens are for.

          > hitting that limit is within the terms of the agreement with Anthropic

          It's not, because the agreement says you can only use CC.

          • Nemi 20 hours ago
            > The subscription services have assumptions baked in about the usage patterns; they're oversubscribed and subsidized.

            Selling dollars for $.50 does that. It sounds like they have a business model issue to me.

            • bri3d 20 hours ago
              This is how every cloud service and every internet provider works. If you want to get really edgy you could also say it's how modern banking works.

              Without knowing the numbers it's hard to tell if the business model for these AI providers actually works, and I suspect it probably doesn't at the moment, but selling an oversubscribed product with baked in usage assumptions is a functional business model in a lot of spaces (for varying definitions of functional, I suppose). I'm surprised this is so surprising to people.

              • Tossrock 19 hours ago
                Don't forget gyms and other physical-space subscriptions. It's right up there with razor-and-blades for bog standard business models. Imagine if you got a gym membership and then were surprised when they cancelled your account for reselling gym access to your friends.
              • muyuu 19 hours ago
                If they rely on this to be competitive, I have serious doubts they will survive much longer.

                There are already many serious concerns about sharing code and information with 3rd parties, and those Chinese open models are dangerously close to destroying their entire value proposition.

              • Nemi 18 hours ago
                > selling an oversubscribed product with baked in usage assumptions is a functional business model in a lot of spaces

                Being a common business model and it being functional are two different things. I agree they are prevalent, but they are actively user hostile in nature. You are essentially saying that if people use your product at the advertised limit, then you will punish them. I get why the business does it, but it is an adversarial business model.

              • djeastm 16 hours ago
                >Without knowing the numbers it's hard to tell if the business model for these AI providers actually works

                It'll be interesting to see what OpenAI and Anthropic will tell us about this when they go public (seems likely late this year--along with SpaceX, possibly)

              • cyanydeez 17 hours ago
                The business model is Uber's. It doesn't work unless you corner the market and provide a distinct value replacement.

                The problem is, there's not a clear every-man value like Uber has. The stories I see of people finding value are sparse and seem to be from the POV of either technosexuals or already-strong developer whales leveraging the bootstrappy power.

                If AI was seriously providing value, orgs like Microsoft wouldn't be pushing out versions of windows that can't restart.

                It clearly is a niche product unlike Uber, but it's definitely being invested in like it is a universal product.

          • cedws 20 hours ago
            That's on Anthropic for selling a mirage of limits they don't want people to actually reach for.

            It's within their capability to provision for higher usage by alternative clients. They just don't want to.

          • behnamoh 20 hours ago
            > It's not, because the agreement says you can only use CC.

            it's like Apple: you can use macOS only on our Macs, iOS only on iPhones, etc. but at least in the case of Apple, you pay (mostly) for the hardware while the software it comes with is "free" (as in free beer).

      • whywhywhywhy 20 hours ago
        Taking umbrage over how I use the compute I'm paying for, or which harness I use it through, seems like such a waste of their time to be focusing on, as long as I'm just doing personal tasks for myself and not trying to power an app's API with it. It only causes brand perception damage with their customers.

        Could have just turned a blind eye.

      • echelon 20 hours ago
        The loss of access shows the kind of power they'll have in the future. It's just a taste of what's to come.

        If a company is going to automate our jobs, we shouldn't be giving them money and data to do so. They're using us to put ourselves out of work, and they're not giving us the keys.

        I'm fine with non-local, open weights models. Not everything has to run on a local GPU, but it has to be something we can own.

        I'd like a large, non-local Qwen3-Coder that I can launch in a RunPod or similar instance. I think on-demand non-local cloud compute can serve as a middle ground.

        • derac 9 hours ago
          Kimi k2.5 is a good choice.
      • CamperBob2 20 hours ago
        How do I "abuse" a token? I pass it to their API, the request executes, a response is returned, I get billed for it. That should be the end of the conversation.

        (Edit due to rate-limiting: I see, thanks -- I wasn't aware there was more than one token type.)

        • bri3d 20 hours ago
          You can buy this product, right here: https://platform.claude.com/docs/en/about-claude/pricing

          That's not the product you buy when you buy a Claude Code token, though.

          • s5fs 19 hours ago
            Claude Code supports using API credits, and you can turn on Extra Usage and use API credits automatically once your session limit is reached.

            This confused me for a while, having two separate "products" which are sold differently, but can be used by the same tool.

    • dirkc 19 hours ago
      Access is one of my concerns with coding agents - on the one hand I think they make coding much more accessible to people who aren't developers - on the other hand this access is managed by commercial entities and can be suspended for any reason.

      I can also imagine a dysfunctional future where developers spend half their time convincing their AI agents that the software they're writing is actually aligned with the model's set of values.

    • tomashubelbauer 20 hours ago
      Anthropic banned my account when I whipped up a solution to control Claude Code running on my Mac from my phone when I'm out and about. No commercial angle, just a tool I made for myself since they wouldn't ship this feature (and still haven't). I wasn't their biggest fanboy to begin with, but it gave me the kick in the butt needed to go and explore alternatives until local models get good enough that I don't need to use hosted models altogether.
      • darkwater 20 hours ago
        I control it with ssh and sometimes tmux (but termux+wireguard leads to a surprisingly stable connection). Why did you need more than that?
        • tomashubelbauer 20 hours ago
          I didn't like the existing SSH applications for iOS and I already have a local app that I made that I have open 24/7, so I added a screen that used xterm.js and Bun.spawn with Bun.Terminal to mirror the process running on my Mac to my phone. This let me add a few bells and whistles that a generic SSH client wouldn't have, like notifications when Claude Code was done working etc.
          • pluralmonad 18 hours ago
            How did they even know you did this? I cannot imagine what cause they could have for the ban. They actively want folks building tooling around and integrating with Claude Code.
            • tomashubelbauer 18 hours ago
              I have no idea. The alternative is that my account just happened to be on the wrong side of their probably slop-coded abuse detection algorithm. Not really any better.
      • redblacktree 20 hours ago
        How did this work? The ban, I mean. Did you just wake up to find an email and discover your creds no longer worked? Were you doing things to sub-process out to the Claude Code CLI, or something else?
        • tomashubelbauer 20 hours ago
          I left a sibling comment detailing the technical side of things. I used the `Bun.spawn` API with the `terminal` key to give CC a PTY and mirrored it to my phone with xterm.js. I used SSE to stream CC data to xterm.js and a regular request to send commands out from my phone. In my mind, this is no different than using CC via SSH from my phone - I was still bound by the same limits and wasn't trying to bypass them, Anthropic is entitled to their different opinion of course.

          And yeah, I got three (for some reason) emails titled "Your account has been suspended" whose content said "An internal investigation of suspicious signals associated with your account indicates a violation of our Usage Policy. As a result, we have revoked your access to Claude.". There is a link to a Google Form which I filled out, but I don't expect to hear back.

          I did nothing even remotely suspicious with my Anthropic subscription so I am reasonably sure this mirroring is what got me banned.

          Edit: BTW I have since iterated on doing the same mirroring using OpenCode with Codex, then Codex with Codex and now Pi with GPT-5.2 (non-Codex) and OpenAI hasn't banned me yet and I don't think they will as they decided to explicitly support using your subscription with third party coding agents following Anthropic's crackdown on OpenCode.

          • fc417fc802 19 hours ago
            > Anthropic is entitled to their different opinion of course.

            I'm not so sure. It doesn't sound like you were circumventing any technical measures meant to enforce the ToS which I think places them in the wrong.

            Unless I'm missing some obvious context (I don't use Mac and am unfamiliar with the Bun.spawn API) I don't understand how hooking a TUI up to a PTY and piping text around is remotely suspicious or even unusual. Would they ban you for using a custom terminal emulator? What about a custom fork of tmux? The entire thing sounds absurd to me. (I mean the entire OpenCode thing also seems absurd and wrong to me but at least that one is unambiguously against the ToS.)

          • eptcyka 18 hours ago
            > Anthropic is entitled to their different opinion of course.

            It’d be cool if Anthropic were bound by their terms of use that you had to sign. Of course, they may well be broad enough to fire customers at will. Not that I suggest you expend any more time fighting this behemoth of a company though. Just sad that this is the state of the art.

            • tomashubelbauer 18 hours ago
              It sucks and I wish it were different, but it is not so different from trying to get support at Meta or Google. If I was an AI grifter I could probably just DM a person on Twitter and get this sorted, but as a paying customer, it's wisest to go where they actually want my money.
      • RationPhantoms 20 hours ago
        There is a weaponized malaise employed by these frontier model providers, and I feel like that dark pattern, what you pointed out, and others are employed to rate-limit certain subscriptions.
        • bri3d 20 hours ago
          They have two products:

          * Subscription plans, which are (probably) subsidized and definitely oversubscribed (ie, 100% of subscribers could not use 100% of their tokens 100% of the time).

          * Wholesale tokens, which are (probably) profitable.

          If you try to use one product as the other product, it breaks their assumptions and business model.

          I don't really see how this is weaponized malaise; capacity planning and some form of over-subscription is a widely accepted thing in every industry and product in the universe?

          • tomashubelbauer 20 hours ago
            I am curious to see how this will pan out long-term. Is the quality gap of Opus-4.5 over GPT-5.2 large enough to overcome the fact that OpenAI has merged these two bullet points into one? I think Anthropic might have bet on no other frontier lab daring to disconnect their subscription from their in-house coding agent and OpenAI called their bluff to get some free marketing following Anthropic's crackdown on OpenCode.
            • bri3d 20 hours ago
              It will also be interesting to see which model is more sustainable once the money fire subsidy musical chairs start to shake out; it all depends on how many whales there are in both directions I think (subscription customers using more than expected vs large buys of profitable API tokens).
          • Propelloni 20 hours ago
            So, if I rent out my bike to you for an hour a day for really cheap money, and I do so 50 more times to 50 others, so that my bike is oversubscribed and you and others don't get your hours, that's OK because it is just capacity planning on my side and widely accepted? Good to know.
            • bri3d 19 hours ago
              Let me introduce you to Citibike?

              Also, this is more like "I sell a service called take a bike to the grocery store" with a clause in the contract saying "only ride the bike to the grocery store." I do this because I am assuming that most users will ride the bike to the grocery store 1 mile away a few times a week, so they will remain available, even though there is an off chance that some customers will ride laps to the store 24/7. However, I also sell a separate, more expensive service called Bikes By the Hour.

              My customers suddenly start using the grocery store plan to ride to a pub 15 miles away, so I kick them off of the grocery store plan and make them buy Bikes By the Hour.

            • elzbardico 18 hours ago
              As others pointed out, every business that sells capacity does this, including your ISP.

              They could, of course, price your 10GB plan under the assumption that you would max out your connection 24 hours a day.

              I fail to see how this would be advantageous to the vast majority of the customers.

              • pluralmonad 18 hours ago
                Well, if the service price were in any way tied to the cost of transmitting bytes, then even the 24hr scenarios would likely see a reduction in cost to customers. Instead we have overage fees and data caps to help with "network congestion", which tells us all how little they think of their customers.
            • dehugger 19 hours ago
              Yes, correct. Essentially every single industry and tool which rents out capacity of any system or service does this. Your ISP does this. The airline does this. Cruise lines. Cloud computing environments. Restaurants. Rental cars. The list is endless.
            • pyvpx 19 hours ago
              I have some bad news for you about your home internet connection.
      • Tossrock 19 hours ago
        They did ship that feature, it's called "&" / teleport from web. They also have an iOS app.
        • tomashubelbauer 19 hours ago
          That's non-local. I am not interested in coding assistants that work on cloud-based workspaces. That's what motivated me to develop this feature for myself.
          • Tossrock 18 hours ago
            But... Claude Code is already cloud-based. It relies on the Anthropic API. Your data is all already being ingested by them. Seems like a weird boundary to draw, trusting the company's model with your data but not their convenience web ui. Being local-only (ie OpenCode & open weights model running on your own hw) is consistent, at least.
            • tomashubelbauer 18 hours ago
              It is not a moral stance. I just prefer to have my files of my personal projects in one place. Sure I sync them to GitHub for backup, but I don't use GitHub for anything else in my personal projects. I am not going to use a workflow which relies on checking out my code to some VM where I have to set everything up in a way where it has access to all the tools and dependencies that are already there on my machine. It's slower, clunkier. IMO you can't beat the convenience of working on your local files. When I used my CC mirror for the brief period where it worked, when I came back to my laptop, all my changes were just already there, no commits, no pulls, no sync, nothing.
              • Tossrock 18 hours ago
                Ah okay, that makes sense. Sorry they pulled the plug on you!
    • disiplus 20 hours ago
      I'm downloading it as we speak to try to run it on a 32GB 5090 + 128GB DDR5. I will compare it to GLM 4.7-Flash, which was my local model of choice.
      • gitpusher 19 hours ago
        Likewise curious to hear how it goes! 80B seems too big for a 5090, I'd be surprised if it runs well un-quantized.
      • wilkystyle 20 hours ago
        Interested to hear how this goes!
    • rschachte 19 hours ago
      Easy to use a local proxy to use other models with CC. Wrote a basic working one using Claude. LiteLLM is also good. But I agree, fuck their mindset
    • _ink_ 19 hours ago
      What setup comes close to Claude Code? I am willing to rent cloud GPUs.
    • wahnfrieden 20 hours ago
      OpenAI committed to allowing it btw. I don't know why Anthropic gets so much love here
      • rustyhancock 20 hours ago
        Cause they make the best coding model.

        It's that simple. Everyone else is trying to compete in other ways, and Anthropic are pushing to dominate the market.

        They'll eventually lose their performance edge, and suddenly they'll be back to being cute and fluffy.

        I've cancelled a Claude sub, but still have one.

        • bheadmaster 20 hours ago
          Agreed.

          I've tried all of the models available right now, and Claude Opus is by far the most capable.

          I had an assertion failure triggered in a fairly complex open-source C library I was using, and Claude Opus not only found the cause, but wrote a self-contained reproduction code I could add to a GitHub issue. And it also added tests for that issue, and fixed the underlying issue.

          I am sincerely impressed by the capabilities of Claude Opus. Too bad its usage is so expensive.

      • jmathai 20 hours ago
        Probably because the alternatives are OpenAI, Google, Meta. Not throwing shade at those companies but it's not hard to win the hearts of developers when that's your competition.
      • cedws 20 hours ago
        Thanks, I’ll try out Codex to bridge until local models get to the level I need.
      • falloutx 14 hours ago
        Anthropic is astroturfing most of the programming forums including this one.
      • teratron27 20 hours ago
        Because OpenAI is on the back foot at the moment, they need the retention
      • varispeed 20 hours ago
        On the other hand, I feel like 5.2 is getting progressively dumbed down. It used to work well, but now the first few prompts go in the right direction and then it goes off the rails, reminding me more of GPT-3.5.

        I wonder what they are up to.

    • thedangler 19 hours ago
      How are you using the huge models locally?
    • Alxc1 20 hours ago
      I must have missed it, but what did Claude disable access for? Last I checked Cline and Claude Max still worked.
      • hnrodey 20 hours ago
        OpenCode
        • tshaddox 19 hours ago
          Yes, although OpenCode works great with official Claude API keys that are on normal API pricing.

          What Anthropic blocked is using OpenCode with the Claude "individual plans" (like the $20/month Pro or $100/month Max plan), which Anthropic intends to be used only with the Claude Code client.

          OpenCode had implemented some basic client spoofing so that this was working, but Anthropic updated to a more sophisticated client fingerprinting scheme which blocked OpenCode from using these individual plans.

        • nullbyte 19 hours ago
          Protip for Mac people: If OpenCode looks weird in your terminal, you need to use a terminal app with truecolor support. It looks very janky on ANSI terminals but it's beautiful on truecolor.

          I recommend Ghostty for Mac users. Alacritty probably works too.

          • mayhemducks 19 hours ago
            Thank you for this comment! I knew it was something like this. I've been using it in the VSCode terminal, but you're right, the ANSI terminal just doesn't work. I wasn't quite sure why!
        • stevejb 19 hours ago
          Is this still the case? Is Anthropic still not allowing access to OpenCode?
          • cedws 19 hours ago
            Officially, it's against TOS. I'm told you can still make it work by adding this to ~/.config/opencode/opencode.json but it risks a ban and you definitely shouldn't do it.

              {
                "plugin": [
                  "opencode-anthropic-auth@latest"
                ]
              }
            • stevejb 19 hours ago
              Ah interesting. I have been using OpenCode more and more and I prefer it to Claude Code. I use OpenCode with Sonnet and/or Opus (among other models) with Bedrock, but paying metered rates for Opus is a way to go bankrupt fast!
            • fc417fc802 19 hours ago
              Just like I shouldn't use an unofficial play store client, right? No one would ever do that.
      • illusive4080 19 hours ago
        They had a public spat with Opencode
    • throwup238 19 hours ago
      Did they actually say that? I thought they rolled it back.

      OpenCode et al continue to work with my Max subscription.

    • logicallee 18 hours ago
      What do you require local models to do? The State of Utopia[1] is currently busy porting a small model to run in a zero-trust environment - your web browser. It's finished the port in JavaScript and is going to WASM now for the CPU path. You can see it being livecoded by Claude right now[2] (this is day 2; on day 1 it ported the C++ code to JavaScript successfully). We are curious to know what permissions you would like to grant such a model and how you would like it served to you. (For example, we consider that you wouldn't trust a Go build - especially if it's built by a nation state, regardless of our branding, practices, members or contributors.)

      Please list what capabilities you would like our local model to have and how you would like to have it served to you.

      [1] a sovereign digital nation built on a national framework rather than a for-profit or even non-profit framework, will be available at https://stateofutopia.com (you can see some of my recent posts or comments here on HN.)

      [2] https://www.youtube.com/live/0psQ2l4-USo?si=RVt2PhGy_A4nYFPi

    • ad 20 hours ago
      which tools?
    • jstummbillig 20 hours ago
      > I’m not paying to support anticompetitive behaviour

      You are doing that all the time. You just draw the line, arbitrarily.

      • tclancy 20 hours ago
        The enemy of done is perfect, etc. What is the point of comments like this?
        • jstummbillig 19 hours ago
          What is the point of any of this? To exchange how we think about things. I think virtue signaling is boring and uncandid.
          • InsideOutSanta 19 hours ago
            But you are virtue-signalling, too, based on your own definition of virtuous behavior. In fact, you're doing nothing else. You're not contributing anything of value to the discussion.
          • tclancy 19 hours ago
            Unclench and stop seeing everything as virtue signaling. What about all those White Knight SJWs in the 70s who were against leaded gas? Still virtue signaling?
      • mannanj 20 hours ago
        That's great, yes. We all draw the line somewhere, subjectively. We all pretend we follow logic and reason and lets all be more honest and truthfully share how we as humans are emotionally driven not logically driven.

        It's like this old adage "Our brains are poor masters and great slaves". We are basically just wanting to survive and we've trained ourselves to follow the orders of our old corporate slave masters who are now failing us, and we are unfortunately out of fear paying and supporting anticompetitive behavior and our internal dissonance is stopping us from changing it (along with fear of survival and missing out and so forth).

        The global marketing by the slave master class isn't helping. We can draw a line, however arbitrarily we'd like, and it's still better and more helpful than complaining "you drew a line arbitrarily" and not actually doing any of the hard, courageous work of drawing lines of any kind in the first place.

  • featherless 15 hours ago
    I got openclaw to pit Qwen3-Coder-Next against Minimax M2.1 simultaneously on my Mac Studio 512GB: https://clutch-assistant.github.io/model-comparison-report/
  • Robdel12 20 hours ago
    I really really want local or self hosted models to work. But my experience is they’re not really even close to the closed paid models.

    Does anyone have any experience with these, and is this release actually workable in practice?

    • littlestymaar 17 hours ago
      > But my experience is they’re not really even close to the closed paid models.

      They are usually as good as the flagship models from 12-18 months ago. Which may sound like a massive difference, because in some ways it is, but it's also fairly reasonable; you don't need to live on the bleeding edge.

      • cmrdporcupine 16 hours ago
        And it's worth pointing out that Claude Code now dispatches "subagents" from Opus->Sonnet and Opus->Haiku ... all the time, depending on the problem.

        Running this thing locally on my Spark with 4-bit quant I'm getting 30-35 tokens/sec in opencode but it doesn't feel any "stupider" than Haiku, that's for sure. Haiku can be dumb as a post. This thing is smarter than that.

        It feels somewhere around Sonnet 4 level, and I am finding it genuinely useful at 4-bit even. Though I have paid subscriptions elsewhere, so I doubt I'll actually use it much.

        I could see configuring OpenCode somehow to use paid Kimi 2.5 or Gemini for the planning/analysis & compaction, and this for the task execution. It seems entirely competent.

  • mmaunder 15 hours ago
    These guys are setting up to absolutely own the global south market for AI. Which is in line with the belt and road initiative.
  • SamDc73 11 hours ago
    This is model #12,188 to claim it rivals SOTA models while not even being in the same league.

    In terms of intelligence per compute, it’s probably the best model I can realistically run locally on my laptop for coding. It’s solid for scripting and small projects.

    I tried it on a mid-size codebase (~50k LOC), and the context window filled up almost immediately, making it basically unusable unless you're extremely explicit about which files to touch. I tested it with an 8k context window but will try again with 32k and see if it becomes more practical.

    I think the main blocker for using local coding models more is the context window. A lot of work is going into making small models “smarter,” but for agentic coding that only gets you so far. No matter how smart the model is, an agent will blow through the context as soon as it reads a handful of files.

    • halJordan 11 hours ago
      The small context window has been a recognized problem for a while now. Really only Google has the ability to use a good long context window
    • mirekrusin 2 hours ago
      What are you talking about? Qwen3-Coder-Next supports 256k context. Did you mean that you don't have enough memory to run it locally yourself?
    • anhner 4 hours ago
      you should look into using subagents, which each have their own context window and don't pollute the main one
  • gitpusher 19 hours ago
    Pretty cool that they are advertising OpenClaw compatibility. I've tried a few locally-hosted models with OpenClaw and did not get good results – that tool is a context monster... it would completely overwhelm the models with erroneous / old instructions.

    Granted these 80B models are probably optimized for H100/H200 which I do not have. Here's to hoping that OpenClaw compat. survives quantization

  • macmac_mac 15 hours ago
    I just tried Qwen 3 TTS and it was mind-blowingly good; you can even provide directions for the overall tone, etc. That wasn't the case when I used super-expensive commercial products like play.ht (now closed after being bought by Meta).

    Does anyone see a reason to still use ElevenLabs etc.?

  • zokier 18 hours ago
    For someone who is very out of the loop with these AI models: can someone explain what I can actually run on my 3080 Ti (12GB)? Is this model in that range, or is it still too big? Is there anything remotely useful runnable on my GPU? I have 64GB of RAM if that helps.
    • AlbinoDrought 18 hours ago
      This model does not fit in 12G of VRAM - even the smallest quant is unlikely to fit. However, portions can be offloaded to regular RAM / CPU with a performance hit.

      I would recommend trying llama.cpp's llama-server with models of increasing size until you hit the best quality / speed tradeoff with your hardware that you're willing to accept.

      The Unsloth guides are a great place to start: https://unsloth.ai/docs/models/qwen3-coder-next#llama.cpp-tu...

      • zokier 16 hours ago
        Thanks for the pointers!

        one more thing, that guide says:

        > You can choose UD-Q4_K_XL or other quantized versions.

        I see eight different 4-bit quants (I assume that is the size I want?)... how do I pick which one to use?

            IQ4_XS
            Q4_K_S
            Q4_1
            IQ4_NL
            MXFP4_MOE
            Q4_0
            Q4_K_M
            Q4_K_XL
        • MrDrMcCoy 15 hours ago
          The I prefix denotes the newer i-quants, built with an importance matrix ("imatrix"); they squeeze more accuracy out of the same bit budget at the cost of some speed. The _0 and _1 quants are older, simpler schemes: straightforward and fast, but less accurate than the newer styles at the same size. The K quants, in my limited understanding, quantize at the specified bit depth but bump certain important tensors higher and less-used parts lower; they generally give a better accuracy/size trade-off than the _0/_1 quants. MXFP4 is a microscaling FP4 format that's hardware-accelerated on the newest Nvidia GPUs rather than on my AMD hardware; it's supposed to be very efficient. The UD prefix marks Unsloth's "dynamic" quants, which keep the most sensitive layers at higher precision.

          Also, depending on how much regular system RAM you have, you can offload parts of mixture-of-experts models like this one, keeping only the most important layers on your GPU. This may let you use larger, more accurate quants. That functionality is supported by llama.cpp and other frameworks and is worth looking into.
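
          As a rough sketch of what that looks like (launched from Python here; the flag names come from recent llama.cpp builds, so double-check them against llama-server --help):

              import subprocess

              subprocess.run([
                  "llama-server",
                  "-m", "Qwen3-Coder-Next-UD-Q4_K_XL.gguf",  # placeholder filename
                  "-c", "32768",        # context window
                  "-ngl", "99",         # put all layers on the GPU...
                  "--n-cpu-moe", "30",  # ...but keep the expert tensors of ~30 layers in system RAM
              ])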

    • cirrusfan 18 hours ago
      This model is exactly what you'd want for your resources: GPU for prompt processing, RAM for the model weights and context, and being MoE makes it fairly zippy. Q4 is decent; Q5-Q6 is even better, assuming you can spare the resources. Going past Q6 gets into heavily diminishing returns.
  • zamadatix 21 hours ago
    Can anyone help me understand the "Number of Agent Turns" vs "SWE-Bench Pro (%)" figure? I.e. what does the spread of Qwen3-Coder-Next from ~50 to ~280 agent turns represent for a fixed score of 44.3%: that sometimes it takes that spread of agent turns to achieve said fixed score for the given model?
    • yorwba 20 hours ago
      SWE-Bench Pro consists of 1865 tasks. https://arxiv.org/abs/2509.16941 Qwen3-Coder-Next solved 44.3% (826 or 827) of these tasks. To solve a single task, it took between ≈50 and ≈280 agent turns, ≈150 on average. In other words, a single pass through the dataset took ≈280000 agent turns. Kimi-K2.5 solved ≈84 fewer tasks, but also only took about a third as many agent turns.
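
      Quick sanity check on those numbers:

          tasks, score, avg_turns = 1865, 0.443, 150
          print(round(tasks * score))   # ~826 solved tasks
          print(tasks * avg_turns)      # 279,750, i.e. ~280k agent turns for a full pass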
      • regularfry 20 hours ago
        If this is genuinely better than K2.5 even at a third the speed then my openrouter credits are going to go unused.
      • zamadatix 19 hours ago
        Ah, a spread of the individual tests makes plenty of sense! Many thanks (same goes to the other comments).
    • edude03 21 hours ago
      Essentially, the more turns you have, the more likely the agent is to fail, since errors compound per turn. Agentic models are tuned for "long-horizon tasks", i.e. being able to go many, many turns on the same problem without failing.
      • zamadatix 21 hours ago
        Much appreciated, but I mean more around "what do the error bars in the figure represent" than what the turn scaling itself is.
        • esafak 21 hours ago
          For the tasks in SWE-Bench Pro they obtained a distribution of agent turns, summarized as the box plot. The box likely describes the inter-quartile range while the whiskers describe some other range. You'd have to read their report to be sure. https://en.wikipedia.org/wiki/Box_plot
        • jsnell 21 hours ago
          That's a box plot, so those are not error bars but a visualization of the distribution of a metric (min, max, median, 25th percentile, 75th percentile).

          The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
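
          A toy illustration of the five numbers such a plot summarizes (the turn counts here are made up):

              import numpy as np

              turns = np.array([48, 62, 90, 120, 150, 155, 170, 210, 260, 281])
              print(np.percentile(turns, [0, 25, 50, 75, 100]))  # min, Q1, median, Q3, max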

  • alexellisuk 21 hours ago
    Is this going to need 1x or 2x of those RTX PRO 6000s to allow for a decent KV cache at an active context length of 64-100k?

    It's one thing running the model without any context, but coding agents build it up close to the max and that slows down generation massively in my experience.
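
    For a rough feel of why long contexts hurt, here's the standard KV-cache estimate for a plain-attention transformer. The architecture numbers below are placeholders, not Qwen3-Coder-Next's real config (its hybrid linear-attention layers should need considerably less):

        n_layers, n_kv_heads, head_dim = 48, 8, 128   # hypothetical numbers
        ctx, bytes_per_elt = 100_000, 2               # fp16 cache; a q8_0 KV cache roughly halves this
        kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt  # 2 = K and V
        print(f"{kv_bytes / 2**30:.1f} GiB")          # ~18.3 GiB with these made-up numbers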

    • redrove 18 hours ago
      I have a 3090 and a 4090 and it all fits in VRAM with Q4_0 and quantized KV at 96k ctx: ~1400 t/s prompt processing, 80 t/s generation.
    • segmondy 20 hours ago
      One RTX PRO 6000 should be fine; a Q6_K_XL GGUF will be almost on par with the raw weights and should let you have 128k-256k of context.
  • ionwake 20 hours ago
    Will this run on an Apple M4 Air with 32GB of RAM?

    I'm currently using Qwen 2.5 16B, and it works really well.

    • segmondy 20 hours ago
      No, at Q2 you are looking at a size of about 26-30GB, and Q3 exceeds your RAM. You might manage to run the Q2, but results may vary. Best to run a smaller model like Qwen3 32B/30B at Q6.
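
      A rough way to estimate GGUF sizes yourself, using approximate bits-per-weight figures for each quant type:

          params = 80e9   # Qwen3-Coder-Next total parameters
          for name, bpw in [("Q2_K", 2.6), ("Q3_K_M", 3.9), ("Q4_K_M", 4.8)]:
              print(name, f"~{params * bpw / 8 / 1e9:.0f} GB")  # ignores metadata/embedding overhead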
      • ionwake 19 hours ago
        Thank you for your advice have a good evening
  • StevenNunez 15 hours ago
    Not crazy about it. It keeps getting stuck in a loop and filling up the context window (131k, run locally). Kimi's been nice, even if a bit slow.
    • lostmsu 11 hours ago
      Did you apply RoPE scaling?
  • orliesaurus 21 hours ago
    how can anyone keep up with all these releases... what's next? Sonnet 5?
    • gessha 20 hours ago
      Tune it out, come back in 6 months, the world is not going to end. In 6 months, you’re going to change your API endpoint and/or your subscription and then spend a day or two adjusting. Off to the races you go.
    • Squarex 20 hours ago
      Well there are rumors sonnet 5 is coming today, so...
    • Havoc 17 hours ago
      Pretty much every lab you can think of has something scheduled for February. Gonna be a wild one.
    • cmrdporcupine 16 hours ago
      This is going to be a crazy month because the Chinese labs are all trying to get their releases out prior to their holidays (Lunar New Year / Spring Festival).

      So we've seen a series of big ones already -- GLM 4.7 Flash, Kimi 2.5, StepFun 3.5, and now this. Still to come is likely a new DeepSeek model, which could be exciting.

      And then I expect the Big3, OpenAI/Google/Anthropic will try to clog the airspace at the same time, to get in front of the potential competition.

    • bigyabai 19 hours ago
      Relatively, it's not that hard. There are like 4-5 "real" AI labs, who altogether manage to announce maybe 3 products max per month.

      Compared to RISC core designs or IC optimization, the pace of AI innovation is slow and easy to follow.

  • storus 20 hours ago
    Does Qwen3 allow adjusting the context during an LLM call, or does the housekeeping need to be done before/after each call rather than while a single LLM call with multiple tool calls is in progress?
    • segmondy 20 hours ago
      Not applicable... the models just process whatever context you provide to them; context management happens outside the model and depends on your inference tool / coding agent.
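
      In other words, the "housekeeping" is just you deciding what goes into the next request. A naive sketch:

          def trim(messages, max_messages=40):
              # keep the system prompt plus only the most recent turns
              system, rest = messages[:1], messages[1:]
              return system + rest[-max_messages:]

          # call trim(messages) on your history before every request you send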
      • cyanydeez 17 hours ago
        It's interesting how people can be so into LLMs but don't, at the end of the day, understand that they're just passing "well formatted" text to a text processor, and everything else is built around encoding/decoding it into familiar or novel interfaces, and the rest.

        The instability of the tooling outside of the LLM is what keeps me from building anything on the cloud, because you're attaching your knowledge and workflow to a tool that can both change dramatically based on context, cache, and model changes and can arbitrarily raise prices as "adaptable whales" push the cost up.

        It's akin to learning everything about Beanie Babies in the early 1990s: right when you think you understand the value proposition, suddenly they're all worthless.

        • storus 16 hours ago
          That's why you can use the latest open coding models locally, which have reportedly reached the performance of Sonnet 4.5, so almost SOTA. And then you can use tricks like the one I mentioned above to directly manipulate GPU RAM for context cleanup when needed, which is not possible with cloud models unless their provider enables it.
  • dk8996 12 hours ago
    Is there a good way to use this model within VS Code? I'm looking for something like Copilot.
  • valcron1000 20 hours ago
    Still nothing to compete with GPT-OSS-20B for local use with 16GB of VRAM.
  • blurbleblurble 16 hours ago
    So dang exciting! There are a bunch of interesting new small models out lately, by the way; this is just one of them...
  • endymion-light 21 hours ago
    Looks great - I'll try to check it out on my gaming PC.

    On a misc note: What's being used to create the screen recordings? It looks so smooth!

  • ossicones 20 hours ago
    What browser use agent are they using here?
    • novaray 18 hours ago
      Yes, the general-purpose version is already supported and should have the identical architecture.
  • StevenNunez 19 hours ago
    Going to try this over Kimi k2.5 locally. It was nice but just a bit too slow and a resource hog.
  • throwaw12 21 hours ago
    We are getting there. As a next step, please release something that outperforms Opus 4.5 and GPT 5.2 in coding tasks.
    • gordonhart 21 hours ago
      By the time that happens, Opus 5 and GPT-5.5 will be out. At that point will a GPT-5.2 tier open-weights model feel "good enough"? Based on my experience with frontier models, once you get a taste of the latest and greatest it's very hard to go back to a less capable model, even if that less capable model would have been SOTA 9 months ago.
      • cirrusfan 21 hours ago
        I think it depends on what you use it for. Coding, where time is money? You probably want the Good Shit, but you also want decent open-weights models to keep prices sane, rather than sama's 20k/month nonsense. Something like basic sentiment analysis? You can get good results out of a 30B MoE that runs at a good pace on a midrange laptop. Researching things online with many sources and decent results I'd expect to be doable locally by the end of 2026 if you have 128GB of RAM, although it'll take a while to resolve.
        • bwestergard 20 hours ago
          What does it mean for U.S. AI firms if the new equilibrium is devs running open models on local hardware?
          • selectodude 20 hours ago
            OpenAI isn’t cornering the market on DRAM for kicks…
      • yorwba 20 hours ago
        When Alibaba succeeds at producing a GPT-5.2-equivalent model, they won't be releasing the weights. They'll only offer API access, like for the previous models in the Qwen Max series.

        Don't forget that they want to make money in the end. They release small models for free because the publicity is worth more than they could charge for them, but they won't just give away models that are good enough that people would pay significant amounts of money to use them.

      • tosh 21 hours ago
        It feels like the gap between open weight and closed weight models is closing though.
        • theshrike79 21 hours ago
          More like open local models are becoming "good enough".

          I got stuff done with Sonnet 3.7 just fine; it did need a bunch of babysitting, but it was still a net positive for productivity. Now local models are at that level, closing in on the current SOTA.

          When "anyone" can run an Opus 4.5-level model at home, we're going to be getting diminishing returns from closed, online-only models.

          • cyanydeez 17 hours ago
            See, the market is investing like _that will never happen_.
            • theshrike79 16 hours ago
              I'm just riding the VC powered wave of way-too-cheap online AI services and building tools and scaffolding to prepare for the eventual switch to local models =)
      • thepasch 20 hours ago
        If an open weights model is released that’s as capable at coding as Opus 4.5, then there’s very little reason not to offload the actual writing of code to open weight subagents running locally and stick strictly to planning with Opus 5. Could get you masses more usage out of your plan (or cut down on API costs).
      • rglullis 20 hours ago
        I'm going in the opposite direction: with each new model, I try harder to optimize my existing workflows by breaking tasks down so that I can delegate them to the less powerful models and only rely on the newer ones if the results are not acceptable.
      • rubslopes 18 hours ago
        I used to say that Sonnet 4.5 was all I would ever need, but now I exclusively use Opus...
      • littlestymaar 17 hours ago
        > Based on my experience with frontier models, once you get a taste of the latest and greatest it's very hard to go back to a less capable model, even if that less capable model would have been SOTA 9 months ago.

        That's the tyranny of comfort. Same for high-end cars, living in a big place, etc.

        There's a good workaround though: just don't try the luxury in the first place, so you can stay happy with the 9-month delay.

    • Keyframe 20 hours ago
      I'd be happy with something that's close to or the same as Opus 4.5, that I can run locally at a reasonable (comparable) speed to Claude CLI, and at a reasonable budget (within $10-30k).
    • segmondy 20 hours ago
      Try KimiK2.5 and DeepSeekv3.2-Speciale
    • IhateAI 20 hours ago
      Just code it yourself, you might surprise yourself :)
  • fudged71 19 hours ago
    I'm thrilled. Picked up a used M4 Pro 64GB this morning. Excited to test this out
  • syntaxing 21 hours ago
    Is the Qwen Next architecture ironed out in llama.cpp?
  • ltbarcly3 13 hours ago
    Here's a tip: Never name anything new, next, neo, etc. You will have a problem when you try to name the thing after that!
  • dzonga 17 hours ago
    The Qwen website doesn't work for me in Safari :( I had to read the announcement in Chrome.
  • jtbaker 18 hours ago
    any way to run these via ollama yet?
  • kylehotchkiss 16 hours ago
    Is there any online resource tracking local model capability on, say, a $2000 64GB Mac Mini? I'm getting increasingly excited about the local model space because it offers a future where we can benefit from LLMs without having to listen to tech CEOs saber-rattle about ridding America of its jobs so they can get the next fundraising round sorted.
  • moron4hire 19 hours ago
    My IT department is convinced these "ChInEsE cCcP mOdElS" are going to exfiltrate our entire corporate network of its essential fluids and vita.. erh, I mean data. I've tried explaining to them that it's physically impossible for model weights to make network requests on their own. Also, what happened to their MitM-style, extremely intrusive network monitoring that they insisted we absolutely needed?
  • cpill 16 hours ago
    I wonder if we could have much smaller models if they trained on fewer languages? I.e. Python + YAML + JSON only, or even a single language each, with a cluster of models loaded into memory dynamically...?
  • lysace 15 hours ago
    Is it censored according to the wishes of the CCP?
    • mirekrusin 15 hours ago
      Who cares? If you don't like it, you can fine tune.
      • asdfss674564 3 hours ago
        Is the censorship applied in post-training, or is it baked into the dataset?
      • lysace 15 hours ago
        I think a lot of people care. Most decidedly not you.
        • mirekrusin 1 hour ago
          I think people care about open weights so they can use the models locally, including fine-tuning them, e.g. to remove the alignment.

          There are, of course, people who, when you give them something for free that cost millions of dollars to build, will complain and share with the world exactly what they're entitled to.

  • Soerensen 21 hours ago
    The agent orchestration point from vessenes is interesting - using faster, smaller models for routine tasks while reserving frontier models for complex reasoning.

    In practice, I've found the economics work like this:

    1. Code generation (boilerplate, tests, migrations) - smaller models are fine, and latency matters more than peak capability.

    2. Architecture decisions, debugging subtle issues - worth the cost of frontier models.

    3. Refactoring existing code - the model needs to "understand" before changing, so context and reasoning matter more.

    The 3B active parameters claim is the key unlock here. If this actually runs well on consumer hardware with reasonable context windows, it becomes the obvious choice for category 1 tasks. The question is whether the SWE-Bench numbers hold up for real-world "agent turn" scenarios where you're doing hundreds of small operations.
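
    A hypothetical routing shim for that split (model names are placeholders):

        ROUTES = {
            "boilerplate": "qwen3-coder-next-local",  # category 1: cheap and fast
            "tests": "qwen3-coder-next-local",
            "architecture": "frontier-model",         # categories 2-3: pay for the reasoning
            "debugging": "frontier-model",
        }

        def pick_model(task_kind: str) -> str:
            return ROUTES.get(task_kind, "frontier-model")  # default to the stronger model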

    • cirrusfan 21 hours ago
      I find it really surprising that you're fine with low-end models for coding - I went through a lot of open-weights models, local and "local", and I consistently found the results underwhelming. GLM-4.7 was the smallest model I found to be somewhat reliable, but that's a sizable 350B and stretches the definition of local-as-in-at-home.
      • NitpickLawyer 21 hours ago
        You're replying to a bot, fyi :)
        • CamperBob2 20 hours ago
          If it weren't for the single em-dash (really an en-dash, used as if it were an em-dash), how am I supposed to know that?

          And at the end of the day, does it matter?

          • axus 17 hours ago
            Some people reply for their own happiness, some reply to communicate with another person. The AI won't remember or care about the reply.
        • IhateAI 20 hours ago
          "Is they key unlock here"
          • mrandish 19 hours ago
            Yeah, that hits different.