Build multilingual tokenizers for CJK
sentencepiece · skill · setup · L2 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Tokenize multilingual text with language-agnostic subword segmentation
Best for
Multilingual models or custom vocabularies where a language-agnostic approach beats language-specific tokenizers.
Inputs
- · raw text
- · vocabulary size
- · language (optional)
Outputs
- · BPE/Unigram subword tokens
- · .model file
- · .vocab file
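
The inputs and outputs above map fairly directly onto a training call. The following is a minimal sketch, not the skill's own implementation: the helper names `build_training_args` and `train_and_load`, the file names, and the default values are assumptions chosen for illustration. `character_coverage=0.9995` follows the SentencePiece recommendation for languages with large character sets such as Chinese and Japanese.

```python
def build_training_args(corpus_path, model_prefix, vocab_size=8000,
                        model_type="unigram", character_coverage=0.9995):
    """Collect keyword arguments for SentencePieceTrainer.train.

    Hypothetical helper: it only assembles options, so it can be
    inspected without sentencepiece being installed.
    """
    # SentencePiece also accepts "char" and "word"; this sketch is limited
    # to the two subword algorithms named in the Outputs list.
    if model_type not in ("unigram", "bpe"):
        raise ValueError("model_type must be 'unigram' or 'bpe'")
    return {
        "input": str(corpus_path),          # raw text, one sentence per line
        "model_prefix": str(model_prefix),  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": int(vocab_size),
        "model_type": model_type,
        "character_coverage": character_coverage,
    }


def train_and_load(corpus_path, model_prefix, **overrides):
    """Train a model, then load the resulting .model file."""
    import sentencepiece as spm  # lazy import: only needed when training

    args = build_training_args(corpus_path, model_prefix, **overrides)
    spm.SentencePieceTrainer.train(**args)
    return spm.SentencePieceProcessor(model_file=f"{model_prefix}.model")
```

Assuming a corpus file named `corpus.txt` exists, `sp = train_and_load("corpus.txt", "cjk_tok", vocab_size=8000)` followed by `sp.encode("今日は良い天気です。", out_type=str)` would return the subword pieces as strings, and `sp.decode(...)` reverses the encoding.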
Requires
- · sentencepiece
- · protobuf
Preconditions
- · sentencepiece installed
- · text data available
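
Both preconditions can be checked before any training is attempted. This is a small sketch under stated assumptions: `check_preconditions` is a hypothetical helper, and treating an empty file as "no data available" is a choice made here rather than something the card specifies.

```python
import importlib.util
from pathlib import Path


def check_preconditions(corpus_path):
    """Return a list of problems; an empty list means training can start."""
    problems = []
    # find_spec returns None when the top-level package cannot be imported.
    if importlib.util.find_spec("sentencepiece") is None:
        problems.append("sentencepiece is not installed (pip install sentencepiece)")
    path = Path(corpus_path)
    if not path.is_file():
        problems.append(f"corpus file not found: {path}")
    elif path.stat().st_size == 0:
        problems.append(f"corpus file is empty: {path}")
    return problems
```

Returning a list rather than raising on the first problem lets a caller report everything that is missing in one pass.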
Failure modes
- · out-of-memory (OOM) errors during training if the vocabulary size is too large
- · tokenization artifacts if the training corpus is not representative of the target text
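
One rough way to spot an unrepresentative corpus before training is to measure how many characters of the target text also occur in the training text. The sketch below is a plain-Python heuristic, not a SentencePiece feature; `char_coverage` and `unseen_chars` are hypothetical names, and what counts as "too low" a score is left to the reader.

```python
def char_coverage(training_text, target_text):
    """Fraction of non-space characters in target_text that occur in training_text."""
    seen = set(training_text)
    chars = [c for c in target_text if not c.isspace()]
    if not chars:
        return 1.0  # nothing to cover
    return sum(c in seen for c in chars) / len(chars)


def unseen_chars(training_text, target_text):
    """Sorted list of non-space characters in target_text absent from training_text."""
    seen = set(training_text)
    return sorted({c for c in target_text if not c.isspace() and c not in seen})
```

A low score on CJK text suggests that many characters would fall outside the learned vocabulary and be mapped to the unknown token or split awkwardly. For the memory failure mode, SentencePiece's `input_sentence_size` and `shuffle_input_sentence` training options can subsample a large corpus, which may help in addition to lowering the vocabulary size.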
Trust signals
- · used in Google models (T5, mT5, ALBERT)
- · handles 100+ languages
- · no language-specific rules