vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
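For context on what the engine does, here is a minimal sketch of vLLM's offline inference API as shown in its quickstart; the model name, prompt, and sampling values below are placeholders, not part of this page.

```python
# Minimal offline-inference sketch with vLLM (model name and prompt are placeholders).
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Loads the model weights and starts the engine.
llm = LLM(model="facebook/opt-125m")

# Batched generation happens inside generate(); each result carries the prompt and completions.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```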
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
3 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
  - [Bug]: KV Cache Quantization with GGUF turns out quite poorly.
  - Support Cross encoder models
  - [Bug]: KV Cache Error with KV_cache_dtype=FP8 and Large Sequence Length: Losing Context Length of Model
  - [Misc]: Snowflake Arctic out of memory error with TP-8
  - [Bug]: Torch profiling does not stop and cannot get traces for all workers
  - [Bug]: Guided Decoding Broken in Streaming mode
  - [Misc]: Ask for the roadmap of async output processing support for speculative decoding
  - [Bug]: v0.6.4.post1 crashed: Error in model execution: CUDA error: an illegal memory access was encountered
  - [Kernel] Add CUTLASS sparse support, heuristics, and torch operators
  - [Feature]: Allow head_size smaller than 128 on TPU with Pallas backend
- Docs
  - Python not yet supported