vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
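For a sense of what the engine does, here is a minimal sketch of offline text generation with vLLM's Python API (`LLM`, `SamplingParams`, `generate`); the model name, prompts, and sampling values are illustrative assumptions, not details taken from this page.

```python
from vllm import LLM, SamplingParams

# Example model; substitute any model you actually have access to.
llm = LLM(model="facebook/opt-125m")

# Illustrative sampling settings, not recommendations from this page.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "The capital of France is",
    "High-throughput LLM serving works by",
]
outputs = llm.generate(prompts, sampling_params)

# Each result carries the original prompt and its generated completion(s).
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```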
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
3 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bug]: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 t transpose_mat2 n m 9216 n 3398 k 7168 mat1_ld 7168 mat2_ld 7168 result_ld 9216 computeType 68 scaleType 0
- [Usage]: has vllm supported encoder-only model such as bge-m3?
- [Bug]: VLLM usage on AWS Inferentia instances
- [Bug]: "Triton Error [CUDA]: device kernel image is invalid" when loading Mixtral-8x7B-Instruct-v0.1 in fused_moe.py
- [Misc]: When chatting through the OpenAI API served by vllm, the following situation occurs
- [RFC]: Add runtime weight update API
- [Bug]: asyncio.exceptions.CancelledError asyncio.exceptions.TimeoutError
- [RFC]: proper resource cleanup for LLM class with file-like usage
- [Installation]: poetry add vllm not working on my Mac -- xformers (0.0.26.post1) not supporting PEP 517 builds.
- [Misc] optimize sampler with top_p=1 and top_k>0
- Docs
- Python not yet supported