vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
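For a sense of what the engine does, here is a minimal offline-inference sketch using vLLM's Python API; the model name, prompt, and sampling settings are illustrative placeholders, not project-recommended values:

```python
from vllm import LLM, SamplingParams

# Illustrative prompt and sampling settings.
prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# LLM loads the model weights and manages KV-cache memory for high-throughput batching.
# "facebook/opt-125m" is just a small example model.
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in one batched call.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

vLLM also ships an OpenAI-compatible serving entrypoint, which several of the issues listed below touch on.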
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
4 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support
- [Bug]: ValueError: There is no module or parameter named 'lm_head.qweight_type' in Qwen2ForCausalLM when using GGUF and a draft model
- [Usage]: Running OpenAI Swarm with vLLM-hosted models
- [Bug]: prompt logprobs are different with batch_size > 1 compared to batch_size=1
- [Bug]: Multiple tool calls for llama3.2-11b-vision-instruct
- [Bug]: Running Pixtral-Large-Instruct-2411 raises an error: Attempted to assign 1 x 2074 = 2074 multimodal tokens to 2040 placeholders
- [Feature]: Llama 3.3 tool-calling support, or generic and extensible Llama tool-calling support
- [Bug]: preemption mode recompute
- [Bug]: data_parallel_size=4 or 2 not working for lighteval with vllm backend.
- [AMD][Build] Porting dockerfiles from the ROCm/vllm fork
- Docs
- Python not yet supported