vllm
https://github.com/vllm-project/vllm
Language: Python
A high-throughput and memory-efficient inference and serving engine for LLMs
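
For orientation, vLLM exposes an offline inference API built around an `LLM` engine object and `SamplingParams`. The sketch below is a minimal illustration, not taken from this page; the checkpoint name (`facebook/opt-125m`) and sampling values are assumptions chosen only to keep the example small and runnable.

```python
from vllm import LLM, SamplingParams

# Load a model into the vLLM engine. The checkpoint name is an
# arbitrary small model used purely for illustration.
llm = LLM(model="facebook/opt-125m")

# Illustrative sampling settings, not project-recommended values.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches prompts through vLLM's paged KV cache and
# returns one RequestOutput per prompt.
outputs = llm.generate(["The capital of France is"], params)
for output in outputs:
    print(output.outputs[0].text)
```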
Issues
- [Hardware][NV] Fix Modelopt model loading for k-v-scales
- [Frontend] Disaggregate prefill/decode with ZMQ
- Update run_cluster.sh
- [Feature]: prototype support for non-divisible attention heads
- [Model] Added Google T5 model support to vLLM
- [Bug]: AttributeError: 'Int8Params' object has no attribute 'bnb_shard_offsets'; vLLM's bitsandbytes pre-quantization support for classification models appears incomplete
- [Bug]: Some weights are not initialized from checkpoints for Gemma2ForSequenceClassification
- [Doc]: Why NGramWorker does not support cache operations
- [Bug]: Cutlass 2:4 Sparsity + FP8/Int8 Quant RuntimeError: Error Internal
- [Misc]: Fine-tuned Llama 3.2 Vision Instruct model fails during vLLM weight_loader