vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
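For orientation before triaging, here is a minimal sketch of offline batch inference with vLLM's Python API; the prompts and the model name (`facebook/opt-125m`) are only examples, and any model supported by vLLM can be substituted.

```python
# Minimal offline-inference sketch using vLLM's Python API.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Example model only; weights are downloaded on first run.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```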
Help out
Issues
- [Bug]: KeyError: 'model.layers.45.block_sparse_moe.gate.g_idx'
- [Kernel][Core][WIP] Tree attention and parallel decoding
- [Bug]: mistralai/Mixtral-8x22B-Instruct-v0.1 fails to load 2/3 times on aae08249acca69060d0a8220cab920e00520932c
- [Model] Qwen 1.5 moe support lora
- [Hardware][Nvidia] Enable support for Pascal GPUs
- [Bug]: WSL2 nccl issue with 2 GPUs?
- [Bug]: OpenAI API request doesn't go through with 'guided_json'
- [Bug]: 1-card deployment and 2-card deployment yield inconsistent output logits.
- [Usage]: It seems that vllm doesn't perform well under high concurrency
- [Bug]: Failing to find LoRA adapter for MultiLoRA Inference
Docs
- Python not yet supported