vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
4 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bug]: [Performance] 100% performance drop using multiple LoRA vs. no LoRA (Qwen-chat model)
- [Installation]: Release asset wheels (.whl) are missing since v0.6.2
- [Bug]: I want to integrate vllm into LLaMA-Factory, a transformers-based LLM training framework, but I hit two errors: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method; and RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details). (A minimal workaround sketch for the first error follows this list.)
- [Performance]: InternVL multi-image speed is not improved compared to the original
- [Performance]: speed regression 0.6.2 => 0.6.3?
- [Feature]: LoRA support for InternVLChatModel
- [Feature]: Allow setting tool_choice="none" in LLM calls if the OpenAI-compatible vllm server is started with --enable-auto-tool-choice
- [Bug]: aborting streaming and non-streaming requests does not abort the vllm request
- [Bug]: guided_grammar fails only on Mixtral models
- [Feature]: Support for quantized models for tensorized model weights
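The first error in the LLaMA-Factory integration report above points at its own fix: CUDA cannot be re-initialized in a process created with fork, so the parent process must use the 'spawn' start method before any CUDA work happens. Below is a minimal sketch of that workaround, not vllm's official fix; it assumes vllm is driven from a plain Python script, and the model name is only a placeholder, not taken from the issue.

```python
# Minimal sketch (assumption, not vllm's documented fix): select the 'spawn'
# start method before any CUDA work, as the error message itself suggests.
import multiprocessing as mp


def main():
    # Import vllm only after the start method has been set.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen1.5-0.5B-Chat")  # placeholder model name
    params = SamplingParams(max_tokens=32)
    outputs = llm.generate(["Hello, world"], params)
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    # 'spawn' avoids "Cannot re-initialize CUDA in forked subprocess".
    mp.set_start_method("spawn", force=True)
    main()
```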
- Docs
- Python not yet supported