vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
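As context for the issues listed further down, vLLM's typical offline-inference entry point is the `LLM` class together with `SamplingParams`. A minimal sketch, assuming the `meta-llama/Llama-3.1-8B` model mentioned in the issue list and illustrative sampling values:

```python
from vllm import LLM, SamplingParams

# Illustrative prompt and sampling settings; adjust for your use case.
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Loading the model downloads weights on first use and requires a GPU by default.
llm = LLM(model="meta-llama/Llama-3.1-8B")

# generate() returns one RequestOutput per prompt; each holds the generated text.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```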
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
7 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Feature]: Support guided decoding with multistep decoding
- [Bug]: Interference of Tokens in Concurrent Requests Causing Result Confusion in Version 0.6.3
- [Kernels] Add an inductor pass to rewrite and fuse collective communication ops with gemms
- [Installation]: pynvml.NVMLError_InvalidArgument: Invalid Argument
- [New Model]: NV-Embed-v2
- [Bug]: Function calling with Qwen & Streaming ('NoneType' object has no attribute 'get')
- Concurrency of meta-llama/Llama-3.1-8B doesn't change with access to more GPUs
- [Feature]: Integrate Writing in the Margins inference pattern ($5,000 Bounty)
- [Bug]: Engine iteration timed out. This should never happen!
- [Bug]: Attempting to profile VLLM with TPU errors
- Docs
- Python not yet supported