vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
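As a quick orientation before the triage list, here is a minimal sketch of offline inference with vLLM's Python API. The model name below is only an illustrative placeholder, and the sampling settings are arbitrary example values.

```python
# Minimal offline-inference sketch with vLLM.
# "facebook/opt-125m" is an illustrative placeholder model; swap in any
# model the engine supports.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# Loads the model weights and allocates the KV cache.
llm = LLM(model="facebook/opt-125m")

# Runs batched generation for all prompts.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```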
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
7 Subscribers
Help out
- Issues
- Fix the torch version parsing logic
- [Bug]: CPU offload not working for DeepSeek-V2-Lite-Chat
- [Doc]: Which versions of vllm and lmcache does the example at https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/cpu_offload_lmcache.py use?
- [V1] Add `disable_chunked_mm_input` arg to disable partial mm input prefill
- [Bugfix] Fix client socket timeout when serving a multi-node model in Ray
- [Bug]: CI flake - v1/engine/test_llm_engine.py::test_parallel_sampling[True]
- [Feature]: Fused MoE config for Nvidia RTX 3090
- [Bug]: Mismatch between the supported models listed in the documentation and actual test results
- [MODEL ADDITION] Ovis2 Model Addition
- [Bug]: Gemma-3 (27B) can't load save_pretrained() checkpoint: AssertionError: expected size 5376==2560, stride 1==1 at dim=0
- Docs: Python not yet supported