vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
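For readers new to the project, here is a minimal offline-inference sketch using vLLM's Python API; the model name, prompt, and sampling values are illustrative assumptions, not something taken from this page.

```python
# Minimal vLLM offline-inference sketch (model, prompt, and sampling
# values are illustrative assumptions).
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # any Hugging Face model vLLM supports

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```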
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
3 Subscribers
Help out
- Issues
- [Frontend] Don't block the event loop during tokenization (preprocess) in the OpenAI-compatible server
- [Bug]: The parameter gpu_memory_utilization does not take effect
- [Misc] Allow LoRA to adaptively increase rank and remove possible_max_ranks
- [Bug]: Memory allocation with echo=True
- [Usage]: While loading a model, getting 'layers.0.mlp.down_proj.weight' after merge_and_unload()
- [Bug]: GGUF Model Output Repeats Nonsensically
- [Usage]: How to make model response information appear in the vllm backend logs
- [Feature]: When applying prompt_logprobs for the OpenAI server, the prompt_logprobs field in the response does not show which token is chosen
- [doc] update the code to add models
- [fix] Correct num_accepted_tokens counting
- Docs
- Python not yet supported
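Several of the issues above touch on the gpu_memory_utilization engine argument and the prompt_logprobs sampling option; the sketch below shows where they sit in vLLM's Python API. The specific values chosen here are illustrative assumptions.

```python
# Where gpu_memory_utilization and prompt_logprobs are set in vLLM's
# Python API (the values used here are illustrative assumptions).
from vllm import LLM, SamplingParams

# Fraction of GPU memory the engine may preallocate for weights and
# KV cache; the default is 0.9.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)

# prompt_logprobs=N also returns log-probabilities for the prompt tokens,
# not just for the generated ones.
params = SamplingParams(max_tokens=16, prompt_logprobs=1)

outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].prompt_logprobs)  # per-prompt-token logprobs (first entry is None)
```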