vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Python not yet supported · 7 Subscribers
Help out
- Issues
- [Bug]: Multiple tool calls for llama3.2-11b-vision-instruct
- [Bug]: Running Pixtral-Large-Instruct-2411 raised an error: Attempted to assign 1 x 2074 = 2074 multimodal tokens to 2040 placeholders
- [Feature]: Llama3.3 tool calling support, or a generic and extensible Llama tool calling support
- [Bug]: preempt mode recompute
- [Bug]: data_parallel_size=4 or 2 not working for lighteval with vllm backend.
- [Frontend] Disaggregate prefill decode with zmq
- Update run_cluster.sh
- [Feature]: prototype a support for non divisible attention heads
- [Bug]: AttributeError: 'Int8Params' object has no attribute 'bnb_shard_offsets'; it seems vllm's bnb pre-quantization support for cls models is not yet complete
- [Bug]: Some weights are not initialized from checkpoints for Gemma2ForSequenceClassification
- Docs