lorax
https://github.com/predibase/lorax
Python
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
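LoRAX serves one shared base model and applies a fine-tuned LoRA adapter chosen per request, which is how a single deployment can multiplex thousands of fine-tunes. As a minimal sketch of that usage pattern: the snippet below posts to the server's `/generate` endpoint with an `adapter_id` request parameter; the local URL, port, and the adapter ID are placeholder assumptions for a locally running server, not values from this page.

```python
# Minimal sketch: query a running LoRAX server, selecting a fine-tuned
# LoRA adapter per request. Assumes a server on 127.0.0.1:8080 and uses
# "some-org/some-lora-adapter" as a placeholder adapter ID.
import requests

LORAX_URL = "http://127.0.0.1:8080/generate"

def generate(prompt: str, adapter_id: str | None = None,
             max_new_tokens: int = 64) -> str:
    """Send one generation request; adapter_id picks the LoRA to apply."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        # Per-request adapter selection lets many fine-tuned variants
        # share one base-model deployment.
        parameters["adapter_id"] = adapter_id
    response = requests.post(
        LORAX_URL,
        json={"inputs": prompt, "parameters": parameters},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["generated_text"]

if __name__ == "__main__":
    print(generate("What is LoRAX?", adapter_id="some-org/some-lora-adapter"))
```

Omitting `adapter_id` falls back to the base model, so the same client code covers both adapted and unadapted requests.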
Open issues:
- Efficient implementation of all_reduce and all_gather for collect_lora_a
- Add support for fp8 (H100)
- feat: support loading eetq quantized model
- Upgrade to AWQ kernels v0.0.6
- With 2 GPUs and multiple adapters, the LoRAX server becomes permanently faster at adapter swapping only after parallel execution of requests
- Support loading `.pt` weights
- Retrieve all LoRA models from the Hugging Face Hub by base model setting
- Extend adapters to support MLP head for embeddings, classification
- How to use --master-addr <MASTER_ADDR>|--master-port <MASTER_PORT>?
- an illegal memory access was encountered for Mixtral with 1700 tokens