vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
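For context on what the project does, here is a minimal offline-inference sketch using vLLM's `LLM` and `SamplingParams` classes; the model name, prompt, and sampling settings below are illustrative assumptions, not taken from this page.

```python
# Minimal vLLM offline-inference sketch (model name and sampling values are
# illustrative assumptions, not part of this page).
from vllm import LLM, SamplingParams

prompts = ["vLLM is a high-throughput inference engine because"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a small model; swap in any Hugging Face model id you have access to.
llm = LLM(model="facebook/opt-125m")

# generate() batches the prompts and returns one RequestOutput per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```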
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
4 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Misc]: Enable dependabot to help managing known vulnerabilities in dependencies
- [Kernel][Hardware][AMD][ROCm] Fix rocm/attention.cu compilation on ROCm 6.0.3
- [Core] Disaggregated prefilling supports valkey
- [Bug]: stuck at "generating GPU P2P access cache in /home/luban/.cache/vllm/gpu_p2p_access_cache_for_0,1.json"
- [Usage]: speculative OutOfMemoryError:
- [Misc] Add conftest plugin for applying forking decorator
- [Usage]: Is there any difference between max_tokens and max_model_len?
- Feature 'f16 arithemetic and compare instructions' requires .target sm_53 or higher
- [Bug]: vllm 0.6.0 on a 4208 CPU serving qwen-vl-7b raises the exception shown in the image below; the model outputs normally at first, but after multiple calls it returns no results
- [Bug]: Slow Inference Speed with llama 3.1 70B GGUF Q4 on A100 80G (8.7 tokens/s)
- Docs
- Python not yet supported