vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
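In addition to its OpenAI-compatible server, vLLM exposes an offline batch-inference API. A minimal sketch, assuming the `vllm` package is installed; the model name and sampling settings below are illustrative:

```python
from vllm import LLM, SamplingParams

# Illustrative model; any Hugging Face model supported by vLLM can be used here.
llm = LLM(model="facebook/opt-125m")

# Sampling settings are examples, not recommended defaults.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each RequestOutput holds the original prompt and its generated completions.
    print(output.prompt, output.outputs[0].text)
```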
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
2 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bug]: Phi-3-small-128k-instruct on 1 A100 GPUs - Assertion error: Does not support prefix-enabled attention.
- [Usage]: Is there any way to hook features inside vision-language model?
- Request support for the deepseek-gptq version
- [Usage]: When debugging with vLLM, a CUDA error occurs.
- [TPU] Enable neural-magic pre-quantized W8A8/16 checkpoint for TPU backend
- [Bug]: Running mistral-large results in an error related to NCCL
- [Usage]: Wait for the response for each prediction
- [Feature]: Integrate with `Formatron`
- [Bug]: for mistral-7B, local batch inference mode causes OOM error, while serving mode does not cause error
- [Bug]: gpu-memory-utilization does not pickup enough GPU memory
- Docs
- Python not yet supported