vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
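To give a sense of what the engine does, below is a minimal sketch of offline inference with vLLM's Python API; the model name, prompts, and sampling settings are illustrative placeholders, not project defaults.

```python
# Minimal offline-inference sketch with vLLM's Python API.
# Model name, prompts, and sampling values are illustrative examples only.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "High-throughput LLM serving works by",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# The LLM class loads the model weights and manages KV-cache memory internally;
# batching of requests happens inside generate().
llm = LLM(model="facebook/opt-125m")

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```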
Issues
- [Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server
- [V1] Fix Compilation config & Enable CUDA graph by default
- [Bug]: Authorization ignored when root_path is set
- [CI][Installation] Avoid uploading CUDA 11.8 wheel
- [Bugfix] Allow token ID-only inputs in Qwen2-Audio
- [Feature]: Support for Registering Model-Specific Default Sampling Parameters
- [Frontend] Add Command-R and Llama-3 chat template
- [V1] VLM prefix caching: Add hashing of images
- Setting default for EmbeddingChatRequest.add_generation_prompt to False
- [core] overhaul memory profiling and fix backward compatibility