vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
2 Subscribers
Help out
- Issues
- [Bug]: Llama 3 answers starting with <|start_header_id|>assistant<|end_header_id|>
- [Bug]: Docker build for ROCm fails for latest release and main branch
- [Bug]: Phi-3-small-128k-instruct on 1 A100 GPUs - Assertion error: Does not support prefix-enabled attention.
- [Usage]: Is there any way to hook features inside vision-language model?
- Request support for the deepseek-gptq version
- [Usage]: When debugging with vLLM, a CUDA error occurs.
- [TPU] Enable neural-magic pre-quantized W8A8/16 checkpoint for TPU backend
- [Bug]: Running mistral-large results in an error related to NCCL
- [Usage]: Wait for the response for each prediction
- [Feature]: Integrate with `Formatron`
- Docs
- Python not yet supported