Build multilingual tokenizers for CJK

sentencepiece · skill · setup · L29,423
Orchestra-Research/AI-Research-SKILLs
What it does

Tokenize multilingual text with language-agnostic subword segmentation

Best for

Multilingual models or custom vocabularies where a language-agnostic approach beats language-specific tokenizers.

Inputs
  • raw text
  • vocabulary size
  • language (optional)
Outputs
  • BPE/Unigram subword tokens
  • .model file
  • .vocab file
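
As a minimal sketch of how the inputs above map to the outputs, the snippet below collects trainer arguments and trains a model through the SentencePiece Python wrapper. The helper names `trainer_args` and `train_and_load` are hypothetical, as are the example file names; the 0.9995 character coverage follows the SentencePiece README's suggestion for languages with large character sets such as Japanese or Chinese and is an assumption for this card's CJK use case.

```python
def trainer_args(corpus: str, model_prefix: str, vocab_size: int,
                 model_type: str = "unigram") -> dict:
    """Collect keyword arguments for spm.SentencePieceTrainer.train (hypothetical helper)."""
    if model_type not in {"unigram", "bpe"}:
        raise ValueError(f"unsupported model_type: {model_type}")
    return {
        "input": corpus,               # raw text, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": model_type,
        # Assumption: coverage suggested for rich character sets (CJK).
        "character_coverage": 0.9995,
    }


def train_and_load(args: dict):
    """Train a model and return a processor loaded from the resulting .model file."""
    # Imported here so the argument helper works without sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**args)
    return spm.SentencePieceProcessor(model_file=args["model_prefix"] + ".model")
```

With a corpus at a hypothetical path `corpus.txt`, calling `train_and_load(trainer_args("corpus.txt", "cjk", 32000))` should leave `cjk.model` and `cjk.vocab` in the working directory, and the returned processor's `encode(text, out_type=str)` yields the subword pieces.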
Requires
  • sentencepiece
  • protobuf
Preconditions
  • sentencepiece installed
  • text data available
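
The two preconditions above can be checked before training with a short sketch like the following. The function name `check_preconditions` and its messages are hypothetical; it only uses the standard library to test whether the `sentencepiece` package is importable and whether the corpus file exists and is non-empty.

```python
import importlib.util
from pathlib import Path


def check_preconditions(corpus_path: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready to train."""
    problems = []
    # Precondition 1: the sentencepiece package can be imported.
    if importlib.util.find_spec("sentencepiece") is None:
        problems.append("sentencepiece is not installed (try: pip install sentencepiece)")
    # Precondition 2: text data is available at the given path.
    path = Path(corpus_path)
    if not path.is_file() or path.stat().st_size == 0:
        problems.append(f"no text data found at {corpus_path}")
    return problems
```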
Failure modes
  • OOM if the training corpus or vocab size is too large
  • tokenization artifacts if the training corpus is not representative of the target text
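
One way to reduce the risk of the failure modes above is to cap how many sentences the trainer loads and to sanity-check the corpus before choosing a vocabulary size. This is a sketch under assumptions: `capped_trainer_args` and `distinct_characters` are hypothetical helpers, and the 5,000,000 sentence cap is an arbitrary illustrative value. `input_sentence_size` and `shuffle_input_sentence` are existing SentencePiece trainer options that sample the corpus rather than loading all of it.

```python
def capped_trainer_args(base: dict, max_sentences: int = 5_000_000) -> dict:
    """Return a copy of the trainer arguments with corpus sampling enabled."""
    capped = dict(base)  # leave the caller's dict untouched
    capped["input_sentence_size"] = max_sentences  # sample at most this many sentences
    capped["shuffle_input_sentence"] = True        # sample randomly, not just the head
    return capped


def distinct_characters(lines) -> int:
    """Count distinct non-whitespace characters across an iterable of text lines."""
    seen = set()
    for line in lines:
        seen.update(ch for ch in line if not ch.isspace())
    return len(seen)
```

A very small distinct-character count for a corpus that is supposed to cover CJK text is a rough hint that the sample is not representative; it is a heuristic, not a check that SentencePiece itself performs.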
Trust signals
  • Google production use (T5, mT5, ALBERT)
  • handles 100+ languages
  • no language-specific rules