vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
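For context on what the project does, here is a minimal offline-inference sketch using vLLM's Python API; the model name, prompts, and sampling values are illustrative, not taken from this page:

```python
from vllm import LLM, SamplingParams

# Illustrative model and sampling settings; any Hugging Face model supported by vLLM works.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Batched generation: vLLM schedules all prompts together for high-throughput inference.
outputs = llm.generate(["Hello, my name is", "The capital of France is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```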
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported · 2 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [CI/Build] Add e2e correctness in oai
- [Bug]: extra_body doesn't work when response_format is also sent for serving (see the sketch after this list)
- [Feature]: Beam Search also requires diversity
- [Bug]: vllm hangs after upgrade to v0.5.4
- [Usage]: Acceptance rate for Speculative Decoding
- [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method
- [Bug]: Provided example for loading GGUF model is not working
- [CI/Build] Allow building for CUDA compute capability 8.7
- [Usage]: add multiple LoRAs in Docker
- [Bug]: Empty prompt kills vllm server (AsyncEngineDeadError: Background loop is stopped.)
- Docs
- Python not yet supported
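The extra_body/response_format bug report above concerns vLLM's OpenAI-compatible serving mode. A minimal sketch of that request pattern, assuming a vLLM server is already running locally on port 8000 and the official openai Python client is installed; the model name and the top_k value are illustrative:

```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running vLLM server (assumed address).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model name
    messages=[{"role": "user", "content": "Return a JSON object describing a cat."}],
    response_format={"type": "json_object"},
    # vLLM-specific sampling parameters are passed through the client's extra_body field.
    extra_body={"top_k": 20},
)
print(response.choices[0].message.content)
```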