vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
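For context, a minimal sketch of vLLM's offline inference API is shown below; the model name and sampling settings are illustrative assumptions, not part of this listing.

```python
# Minimal offline-inference sketch; model id and sampling values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any Hugging Face-compatible model id
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches prompts internally for high throughput
outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```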
- Issues
- [Performance]: vLLM becomes too slow when the number of requests is too high
- [Feature]: Support for Diff-Transformer to limit noise in attention calculation at runtime
- [Usage]: Question Regarding VLLM Rate Limit
- [Frontend]: Add sampler_priority and repetition_penalty_range
- [Usage]: When to use flashinfer as the default backend
- [Model] Update MPT model with GLU and rope and add low precision layer norm
- [Bug]: [Performance] 100% performance drop using multiple LoRA vs no LoRA (qwen-chat model)
- [Installation]: Release Assets Wheels (.whl) are missing since v0.6.2
- [Bug]: I want to integrate vllm into LLaMA-Factory, a transformers-based LLM training framework. However, I encountered two errors: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, and RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
- [Performance]: InternVL multi-image speed is not improved compared to the original