vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
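
For context, a minimal offline-inference sketch using vLLM's Python API; the model name and sampling values here are illustrative, not specific to any issue below:

```python
from vllm import LLM, SamplingParams

# Load a small model for demonstration (any HF-compatible model id works).
llm = LLM(model="facebook/opt-125m")

# Sampling settings are example values only.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Generate completions for a batch of prompts.
outputs = llm.generate(["Hello, my name is"], params)
for out in outputs:
    print(out.outputs[0].text)
```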
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
7 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bug]: vLLM was installed and used without issues, but recently, under heavier usage, it suddenly throws an error on a particular request and stops working entirely; even nvidia-smi returns no output
- [Bug]: Quantization example outdated (Ammo -> ModelOpt)
- [Bug]: Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
- [Bug]: latest Docker build (0.6.2) fails with an error due to VLLM_MAX_SIZE_MB
- [Bug]: Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
- [Bugfix] Update grafana dashboard
- [Bug]: An error occurred while using H20 to perform multi-machine 405B inference through the Ray cluster, causing inference to crash
- [RFC]: Make device agnostic for diverse hardware support
- [Bugfix] fix error due to an uninitialized tokenizer when using `skip_tokenizer_init` with `num_scheduler_steps`
- [Feature]: Simple Data Parallelism in vLLM
- Docs
- Python not yet supported