vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
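A minimal sketch of what the engine described above looks like in use, assuming the `vllm` package is installed; the model name and prompt are only examples, not tied to any issue below.

```python
# Minimal offline-inference sketch with vLLM's Python API.
# Assumes `pip install vllm`; "facebook/opt-125m" is just an example model.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                      # load a small example model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)                            # generated continuation
```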
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
2 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bug] [BlockManagerV2]: Prefill for sliding window models can allocate more blocks than sliding window size
- [New Model]: LLaVA-OneVision
- Simplify Jamba state management
- [Core] More-efficient cross-attention parallel QKV computation
- [Bug]: (VllmWorkerProcess pid=3253) WARNING 08-13 11:31:37 shm_broadcast.py:386] No available block found in 60 second
- [Misc]: I want to run Llama 3.1 405B using speculative. Can you give me a guide?
- [Usage]: release notes, best practice, abort-actively
- [Bug]: FP8 Quantization support for AMD GPUs
- [Bug]: Does not work on MacOS
- [Core][Model][Frontend] Model architecture plugins
- Docs
- Python not yet supported