vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
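For context, a minimal sketch of offline inference with vLLM's documented Python API (the model name here is only an example placeholder):

```python
from vllm import LLM, SamplingParams

# Prompts to complete and the sampling configuration.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load a model and run batched generation.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```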
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
3 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Feature]: The flash_attn branch installed for gfx1100 is not supported; after switching to the howiejay/navi_support branch, flash_attn_varlen_func() got an unexpected keyword argument 'window_size'. How can this be resolved?
- [Bug]: vllm starts, but function calling through OpenAI's Swarm does not work correctly
- [Misc]: How to organize a large number of requests for invocation?
- [Bug]: CRITICAL 11-05 12:03:03 launcher.py:99] MQLLMEngine is already dead, terminating server process
- [Performance]: latency of Medusa is longer than naive inference even when concurrency = 2
- [Installation]: Missing v0.6.3.post1-cu118-cp310.whl. Could someone share it? Thanks so much
- [Core] Enhance memory profiling in determine_num_available_blocks with error handling and fallback
- [Bugfix] Free cross attention block table for preempted-for-recompute sequence group.
- [Bug]: RuntimeError: Engine loop has died with larger context lengths (>32k)
- Adding cascade inference to vLLM
- Docs
- Python not yet supported