sentencepiece
https://github.com/google/sentencepiece
C++
Unsupervised text tokenizer for Neural Network-based text generation.
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
C++ not yet supported
0 Subscribers
Help out
- Issues
  - Python from source on armv7l raises ' undefined symbol: __atomic_fetch_add_8 '
  - Only support 64bit?
  - tokens listed in user_defined_symbols tokenized as unknowns when using the "word" model_type
  - split_by_number doesn't match documentation?
  - bazel support for C++ API
  - user defined char set
  - Would plan to support BBPE
  - Sentencepiece with pre-defined vocabulary
  - How to create new model file with restricted vocabulary?
  - can we train by Parallel Computing or Multithreading or multi-Progress
- Docs
  - C++ not yet supported