vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
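vLLM is typically used either as a Python library for offline batch inference or as an OpenAI-compatible server. For orientation, here is a minimal offline-inference sketch; it assumes the `vllm` package is installed and uses `facebook/opt-125m` purely as a small stand-in model ID.

```python
# Minimal vLLM offline-inference sketch (facebook/opt-125m is just a small stand-in model).
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "High-throughput LLM serving requires",
]

# Generation settings: temperature/top_p sampling, capped at 64 new tokens.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load the model once, then run batched generation over all prompts.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")
```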
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported · 2 Subscribers
Help out
- Issues
- [Core][Model] Add support for recursively loading weights by model ID
- [Performance]: vLLM version issue.
- Remove request.max_tokens assertion in serving_completion.py
- [Performance]: 5x slower throughput with OpenAI client/server than native one
- [Installation]: Failed to import transformers.models.clip.modeling_clip because of the following error (look up to see its traceback): libcudart.so.12: cannot open shared object file: No such file or directory
- [Usage]: how do I pass in the JSON content-type for ASYNC Mistral 7B offline inference
- [Feature]: Lora for MiniCPM_2_6
- [Usage]: Confirm tool calling is not supported and this is the closest thing that can be done
- [Bug]: Requests larger than 75k input tokens cause `Input prompt (512 tokens) is too long and exceeds the capacity of block_manager` error
- [Bug]: Intermittent model load failure with error `Got async event : local catastrophic error` on A100
- Docs
- Python not yet supported
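Several of the issues above concern the OpenAI-compatible server (throughput versus the native engine, tool calling, content types). For reference, a minimal serve-and-query sketch, assuming a locally started server on the default port 8000 and `mistralai/Mistral-7B-Instruct-v0.2` as a stand-in model:

```python
# Start the server separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
# Then query it with the standard OpenAI client pointed at the local endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```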