vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
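A minimal sketch of what "inference and serving engine" means in practice, using vLLM's documented `LLM` and `SamplingParams` entry points for offline generation; the model name below is only an illustrative placeholder, not something prescribed by this page.

```python
# Minimal offline-inference sketch with vLLM's Python API.
# "facebook/opt-125m" is an example; any model vLLM supports can be substituted.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")        # load weights and allocate the KV cache
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each output carries the original prompt and one or more completions.
    print(output.prompt, "->", output.outputs[0].text)
```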
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
3 Subscribers
Help out
- Issues
- [Bug]: could not broadcast input array from shape (944,) into shape (512,)
- [Prototype][WIP] Prefix Cache Aware Scheduling for V0
- [Bug]: Running on a single machine with multiple GPUs error
- [Bug]: You are using a model of type qwen2_vl to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
- Concurrency meta-llama/Llama-3.1-8B doesn't change with access to more GPUs
- [Feature]: Integrate Writing in the Margins inference pattern ($5,000 Bounty)
- [Misc]: Eagle reformat checkpoint compatible with Vllm
- [Bug]: Sampling parameter fixed issue while doing speculative sampling verification step
- [Bug] params Type is not right?
- [Bug]: Engine iteration timed out. This should never happen!
- Docs
- Python not yet supported