vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
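For context, the engine exposes a simple offline-generation API. The sketch below is a minimal, illustrative usage example and is not taken from this page; it assumes a local vllm install, and the model name and prompt are placeholders.

    from vllm import LLM, SamplingParams

    # Load a small model and set basic sampling options (values are illustrative).
    llm = LLM(model="facebook/opt-125m")
    sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

    # Generate completions for a batch of prompts and print the text.
    outputs = llm.generate(["Hello, my name is"], sampling_params)
    for output in outputs:
        print(output.outputs[0].text)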
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
3 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bug]: awq marlin error for deepseek v2 lite
- [New Model]: SparseLLM/prosparse-llama-2-7b
- [Bug]: I trained the Qwen/Qwen2.5-1.5B-Instruct model with QLoRA using the factory_llama tool, then started vLLM with the LoRA adapter loaded, and got the error: AttributeError: Model Qwen2ForCausalLM does not support BitsAndBytes quantization yet. Does anyone know where the problem is?
- [Feature]: automatically release graphics card memory
- [Kernels] Add an inductor pass to rewrite and fuse collective communication ops with gemms (WIP not for review)
- [Kernels] Add an inductor pass to rewrite and fuse collective communication ops with gemms
- [Misc]: Will the kv-cache be computed and stored if max_tokens=1?
- [Misc]: Remove max_tokens field for chat completion requests when not supported anymore by the OpenAI client
- [Installation]: pynvml.NVMLError_InvalidArgument: Invalid Argument
- [New Model]: NV-Embed-v2
- Docs
- Python not yet supported