vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
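As a quick sketch of what the engine does, here is a minimal offline-inference example using vLLM's public Python API; the model checkpoint named below is an arbitrary placeholder, not one the project specifically recommends.

```python
from vllm import LLM, SamplingParams

# Load a model into the engine; the checkpoint name is an arbitrary example.
llm = LLM(model="facebook/opt-125m")

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# The engine batches prompts internally for high-throughput generation.
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    sampling_params,
)

for output in outputs:
    print(output.outputs[0].text)
```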
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
3 Subscribers
Help out
- Issues
- [Bugfix] Fix the LoRA weight sharding in ColumnParallelLinearWithLoRA
- [Doc]: Migrate to Markdown
- [Bug]: RuntimeError: CUDA error: operation not permitted when stream is capturing when serving llama 3.2 90b
- [New Model]: fishaudio/fish-speech-1.4
- [Bug]: Deploying Qwen2vl with vllm and transformer, the same image yields inconsistent outputs
- [Bug]: NCCL error with 2-way pipeline parallelism.
- [Doc]: Compare LMDeploy vs vLLM AWQ Triton kernels
- [core] Bump ray to use _overlap_gpu_communication in compiled graph t…
- [Feature]: Add Support for Specifying Local CUTLASS Source Directory via Environment Variable
- [Bug]: (Program crashes after increasing --tensor-parallel-size) with error pynvml.NVMLError_InvalidArgument: Invalid Argument
- Docs
- Python not yet supported