sentencepiece
https://github.com/google/sentencepiece
C++
Unsupervised text tokenizer for Neural Network-based text generation.
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
C++ not yet supported
0 Subscribers
Help out
- Issues
  - With unigram algorithm, constant piece at end of each sentences does not become a token
  - Error Attribute Error: type object 'SentencePieceTrainer' has no attribute 'train'. Did you mean: 'Train'?
  - Zero Width Joiner issue for Sinhala Language
  - No typings in Python package
  - When I set SPM_PROTOBUF_PROVIDER to "package" in CMakeLists.txt, the compilation fails.
  - trainer_interface.cc: Integer value -1 is outside the valid range of values [0, 255] for the enumeration type 'ScriptType'
  - Wrong calculation of max_score in unigram_model.cc
  - How to deal with id
  - debloat the cmakelists.txt and add a bunch of customization for building
  - coredump when build with CXXFLAGS `-Wp,-D_GLIBCXX_ASSERTIONS`
- Docs
  - C++ not yet supported