vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
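As a quick illustration of what the engine described above does, here is a minimal sketch of offline batched inference with vLLM's Python API; the model name and sampling settings are arbitrary examples, not part of this page.

```python
# Minimal sketch: offline batched inference with vLLM.
# The model name and sampling settings are arbitrary examples.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load the model and generate completions for all prompts in one batch.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```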
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
3 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- Option to change the temporary storage location for Ray logs
- add policy
- Error when trying to run the latest Docker container
- Stopping an indefinitely running request
- AWQ-quantized Qwen-72B-Chat returns an empty string for long input text
- Improve CUDA compatibility of the vllm-openai image
- Yi-34B-200K has empty output under the default config (max_position_embedding=20000)
- Refactor the OpenAI completion API w.r.t. Prefix Cache
- Avoid decode_sequence when the input is only prompt_token_ids
- When using tp=2 for inference in greedy mode with vllm 0.3, results are random
- Docs
- Python not yet supported