vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
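For context on what the project does, the sketch below shows vLLM's basic offline-generation API in Python. It is a minimal, illustrative example only; the model name and prompts are placeholders and are not taken from this page.

```python
# Minimal sketch of vLLM offline inference (model name and prompts are
# illustrative placeholders).
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# LLM loads the model and manages KV-cache memory to batch and serve
# many requests with high throughput.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```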
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
4 Subscribers
Help out
- Issues
- [Hardware] [Intel GPU] Add multistep scheduler for xpu device
- [Usage]: --cpu-offload-gb has no effect
- [help wanted]: write tests for python-only development
- [Bug]: KeyError during loading of Mixtral 8x22B in FP8
- [Bug]: vLLM installed and ran without issues, but under heavier recent usage it suddenly throws an error on a particular request and stops working entirely; even nvidia-smi returns no output
- [Misc]: remove dropout related stuff from triton flash attention kernel
- [Feature]: Quantization support for LLaVA OneVision
- [Bug]: When vLLM is deployed as an OpenAI-compatible API server and Llama 3 8B Instruct is used as the generation model for a RAG task, generation never stops
- [Bug]: Quantization example outdated (Ammo -> ModelOpt)
- [Bug]: Hermes 2 Pro Tool parser could not locate tool call start/end tokens in the tokenizer!
- Docs
- Python not yet supported