vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
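A minimal offline-inference sketch of the API behind that description, assuming vllm is installed; the model ID facebook/opt-125m is used purely for illustration and can be swapped for any model you have access to:

```python
# Minimal sketch of vLLM's offline inference API (illustrative, not from this page).
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# Loads the model and allocates the paged KV cache that gives vLLM its memory efficiency.
llm = LLM(model="facebook/opt-125m")

# Batches the prompts and runs continuous-batching generation.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```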
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
3 Subscribers
Help out
- Issues
- [Feature]: Support for NVIDIA Unified memory
- [Bug]: LLM initialization time increases significantly with larger tensor parallel size and Ray
- [Bug]: With the same input, a Qwen2.5 server behaves differently across vLLM versions: SSE output is correct on 0.6.1.post2 but wrong on 0.6.3.post1?
- [Feature]: Support mixture of experts (MoE) in the CPU backend
- [Bug]: Internal Server Error when echoing logprobs with sampling
- [Feature]: Is it possible for VLLM to support inference with dynamic activation sparsity?
- [Core] Reduce TTFT with concurrent partial prefills
- [V1] TPU Prototype
- [Bug]: DeepSeek V2 Coder 236B AWQ error
- [Installation]: GPU vLLM install fails with "No module named 'triton'"
- Docs
- Python not yet supported