vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
- Issues
- inference with AWQ quantization
- When starting a second vllm.entrypoints.api_server with tensor parallelism on a single node, the second api_server gets stuck at "Started a local Ray instance." or fails with "Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory"
- ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory
- [Bug]: DynamicNTKScalingRotaryEmbedding implementation is different from Transformers
- [Usage]: /tmp/ray PermissionDenied
- [Bug]: vllm slows down after a long run
- [Bug]: (raylet) file_system_monitor.cc:111: /tmp/ray/session_ is over 95% full, Object creation will fail if spilling is required.
- [Usage]: How to use vLLM with `Tensor` input (customized tokenizer).
- Enable mypy type checking
- [Usage]: Inquiry about minimal test code for vLLM paged attention in vllm/csrc/attention