vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
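For context on what the project does, below is a minimal offline-inference sketch using vLLM's `LLM` and `SamplingParams` entry points. The model name (facebook/opt-125m), prompts, and sampling settings are illustrative placeholders, not taken from this page.

```python
from vllm import LLM, SamplingParams

# Sampling settings for generation (placeholder values).
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a small Hugging Face-compatible model into the vLLM engine.
llm = LLM(model="facebook/opt-125m")

prompts = [
    "The capital of France is",
    "Large language models are",
]

# generate() batches the prompts and runs them through the engine.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```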
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
3 Subscribers
Help out
- Issues
- [Usage]: Segmentation fault (core dumped) while testing asynchronous high concurrency
- [Usage]: Model Qwen2ForCausalLM does not support LoRA, but LoRA is enabled. Support for this model may be added in the future. If this is important to you, please open an issue on github
- [Usage]: Can vLLM be hosted offline, or does it need an internet connection?
- [Installation]: Could you please upload the newest Docker version to support LoRA loading for Qwen?
- [Bug]: CUDA error: invalid argument
- [Bug]: Starting vLLM with Docker with host_IP configured, but still getting [W socket.cpp:663] [c10d] The client socket has failed to connect to [::ffff:172.16.8.232]:39623 (errno: 110 - Connection timed out)
- [Bug]: n_inner divisible by number of GPUs
- [Usage]: Can vLLM be used together with TensorRT? Has anyone done an example?
- [Bug]: distributed model example with num_gpus does not use all GPUs provided by the Ray actor
- [Feature]: Does vLLM support sequence parallelism?
- Docs
- Python not yet supported