vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Python not yet supported
5 Subscribers
Help out
- Issues
- [Frontend] Disaggregate prefill decode with zmq
- Update run_cluster.sh
- [Feature]: prototype a support for non divisible attention heads
- [Bug]: AttributeError: 'Int8Params' object has no attribute 'bnb_shard_offsets' — it seems that vllm's bnb pre-quantization support for classification models is not yet complete.
- [Bug]: Some weights are not initialized from checkpoints For Gemma2ForSequenceClassification
- [Doc]: Why NGramWorker does not support cache operations
- [Bug]: Cutlass 2:4 Sparsity + FP8/Int8 Quant RuntimeError: Error Internal
- [Misc]: Finetuned llama3.2 vision instruct model is failing during VLLM weight_loader
- [Bug]: After successfully loading the LoRA module with load_lora_adapter, the result returned by v1/models does not include this LoRA module.
- [Misc]: Very High GPU RX/TX using vllm
- Docs: Python not yet supported