vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
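To make the one-line description above concrete, here is a minimal sketch of offline batch inference with vLLM's Python API. The model name (facebook/opt-125m) and sampling settings are illustrative assumptions, not taken from this page.

```python
# Minimal offline batch inference with vLLM (illustrative sketch).
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# LLM loads the model weights and manages GPU KV-cache memory.
# "facebook/opt-125m" is just a small example model, assumed for this sketch.
llm = LLM(model="facebook/opt-125m")

# generate() batches all prompts through the engine and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```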
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
3 Subscribers
Help out
- Issues
- [Bug]: vllm crashes when preemption of priority scheduling is triggered on vllm-0.6.3.dev173+g36ea7907.d20241011
- [Not for review] Test multimodal with adag
- [Bug]: Exception in worker VllmWorkerProcess while processing method init_device: NCCL error: unhandled cuda error
- [Hardware] [Intel GPU] Add multistep scheduler for xpu device
- [Usage]: --cpu-offload-gb has no effect
- [help wanted]: write tests for python-only development
- [Bug]: KeyError during loading of Mixtral 8x22B in FP8
- [Bug]: Installed vllm successfully for AMD MI60 but inference is failing
- [Bug]: vLLM worked without issues, but under heavier recent usage it throws an error on a particular request and stops working entirely; even nvidia-smi returns no output
- [Misc]: remove dropout related stuff from triton flash attention kernel
- Docs
- Python not yet supported