Inferencing our custom "Apply" LLM at over 3000 tokens per second
Our custom fine-tuned apply model outperforms Qwen 2.5 Coder.
Integrating code changes from ChatGPT/Claude is one of the most important steps in a great AI IDE experience.
If you’re unfamiliar, most AI-powered IDEs allow you to one-click “apply” the changes from the LLM into your codebase.
Here’s what that looks like in Sweep:
Doing this reliably is key to making sure our users can seamlessly review and integrate the AI’s changes.
It’s a big step up from copying and pasting from ChatGPT!
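Under the hood, “apply” can be framed as a full-file rewrite: the model receives the original file plus the assistant’s suggested snippet and emits the complete updated file. Here is a rough sketch of that framing (the prompt wording, tags, and helper below are illustrative, not our production prompt):

```python
# Illustrative sketch of the "apply" task: given the original file and an
# LLM-suggested edit, ask a small model to emit the full rewritten file.
# The prompt wording and tags here are hypothetical, not a production prompt.
APPLY_PROMPT = """Merge the suggested change into the original file.
Return the complete updated file and nothing else.

<original_file>
{original_file}
</original_file>

<suggested_change>
{suggested_change}
</suggested_change>
"""

def build_apply_prompt(original_file: str, suggested_change: str) -> str:
    return APPLY_PROMPT.format(
        original_file=original_file,
        suggested_change=suggested_change,
    )
```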
Finetuning for a 50% error reduction
For our first version of Apply, we used the Qwen 2.5 Coder family of models. We want models that are small (1B–32B parameters) so that we can inference them quickly on our own hardware.
We also do not want reasoning models, as the task is fairly simple and any extra tokens just make it slower.
The Qwen family is perfect, as its open-source models in this size range are stronger than comparable models from other families like Mistral, Llama, or DeepSeek.
Qwen 2.5 Coder 7B does quite well out of the box, with a 90%+ success rate!
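As a rough illustration of that baseline (not our production serving stack), you could run an apply-style prompt through Qwen 2.5 Coder 7B locally with vLLM, reusing the hypothetical prompt helper from above; the model name and generation settings are assumptions:

```python
# Illustrative baseline (not our production stack): run an apply-style prompt
# through Qwen 2.5 Coder 7B locally with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct")

# Greedy decoding: the apply task has a single correct output, so no sampling.
params = SamplingParams(temperature=0.0, max_tokens=8192)

# build_apply_prompt is the hypothetical helper from the sketch above;
# original_file and suggested_change are placeholder inputs.
prompt = build_apply_prompt(original_file, suggested_change)
outputs = llm.generate([prompt], params)
rewritten_file = outputs[0].outputs[0].text
```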
Finetuning models without regressions is difficult. Typically, you’ll find that decreasing the chance of one error increases the chance of another.
For example, early in our training runs we focused on decreasing the rate of extra changes. But soon, we noticed these models would generate invalid syntax!
This is called “catastrophic forgetting”: the model learns a new task but forgets what it had previously learned.
We needed to carefully tune our evaluations and training data in order to fix this.
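Concretely, the eval has to catch both failure modes at once: extra changes and invalid syntax. Here is a simplified sketch of what such checks can look like, using Python’s ast module for syntax validity and difflib to flag unexpected edits (the helper names and metrics below are illustrative, not our full harness):

```python
# Simplified sketch of an apply eval that checks both failure modes at once:
# (1) the output must parse, (2) it must not contain changes beyond the expected ones.
# Function names and metrics are illustrative, not the actual eval harness.
import ast
import difflib

def is_valid_python(source: str) -> bool:
    """Catch the 'invalid syntax' regression."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def has_extra_changes(model_output: str, expected_output: str) -> bool:
    """Catch the 'extra changes' regression: any diff vs. the expected file counts."""
    diff = list(difflib.unified_diff(
        expected_output.splitlines(),
        model_output.splitlines(),
        lineterm="",
    ))
    return len(diff) > 0

def score(model_output: str, expected_output: str) -> dict:
    return {
        "valid_syntax": is_valid_python(model_output),
        "extra_changes": has_extra_changes(model_output, expected_output),
    }
```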
We’re excited to share that our finetuned apply model reduces the rate of extra changes by 52.9% over the default Qwen model that we used as our base model.
This is a meaningful difference, and our later models have been continuously improving!
Inferencing at over 3000 tokens per second
By integrating speculative decoding, we’ve been able to boost performance well past hosted LLM providers. This means a 1000-line file gets rewritten in under 2 seconds.
For reference, the fastest hosted model (Gemini 2.0 Flash) runs at 250 tokens per second. This is still too slow for a great experience.
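Speculative decoding pays off especially well here because the apply output largely copies the input file, so a cheap draft can propose long runs of tokens that the main model accepts. As a simplified illustration (not our serving stack), Hugging Face transformers supports this via assisted generation with a small draft model; the model choices and parameters below are assumptions:

```python
# Illustrative speculative (assisted) decoding with Hugging Face transformers,
# not a production serving setup: a small draft model proposes tokens and the
# 7B target model verifies them in parallel. Because the apply output largely
# copies the input file, most drafted tokens are accepted, which is where the
# speedup comes from.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "Qwen/Qwen2.5-Coder-7B-Instruct"   # a fine-tuned apply model would go here
draft_name = "Qwen/Qwen2.5-Coder-0.5B-Instruct"  # cheap draft model, same tokenizer

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(
    target_name, torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    draft_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# build_apply_prompt is the hypothetical helper from the earlier sketch.
prompt = build_apply_prompt(original_file, suggested_change)
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)

output_ids = target.generate(
    **inputs,
    assistant_model=draft,  # enables speculative (assisted) decoding
    max_new_tokens=4096,
    do_sample=False,        # greedy: verification accepts or rejects draft tokens exactly
)
rewritten_file = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```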
In addition, because some of our customers require on-prem deployments, we can’t expect them to have access to all of the hosted models we’d like. We support BYOGPU (bring your own GPU), which lets them get the best of both worlds!
Looking forward
We’ve learned a lot from building our near-instantaneous “apply”. We’re excited to unlock new use cases, like smarter and more capable autocomplete!
To try out Sweep, check out our installation guide here. If you’d like to work with us, reach out!