Tokenize 1 GB of text in under 20 seconds
huggingface-tokenizers · skill · setup L2 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Tokenize text efficiently using the HuggingFace tokenizers library (Rust-backed).
Best for
Fast, parallel tokenization when model-specific vocabularies and special tokens matter.
Inputs
- · tokenizer name or model
- · text input
- · optional special tokens
Outputs
- · token IDs
- · attention mask
- · decoded text (token IDs converted back to a string)
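A minimal sketch of the inputs and outputs listed above, assuming the `tokenizers` package is installed. To stay offline it builds a tiny WordLevel tokenizer from a made-up four-entry vocabulary; in practice the tokenizer would more likely be loaded by name, for example with `Tokenizer.from_pretrained(...)`. The vocabulary and sample sentences are illustrative assumptions, not part of the skill.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy vocabulary, used only so the example runs without a download.
vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}

tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Optional special tokens: mark them so decode() can skip them.
tokenizer.add_special_tokens(["[PAD]", "[UNK]"])

# Pad every sequence in a batch to the longest one, which also fills the attention mask.
tokenizer.enable_padding(pad_id=vocab["[PAD]"], pad_token="[PAD]")

encodings = tokenizer.encode_batch(["hello world", "hello"])

token_ids = [enc.ids for enc in encodings]                  # [[2, 3], [2, 0]]
attention_masks = [enc.attention_mask for enc in encodings]  # [[1, 1], [1, 0]]

# Decoding maps IDs back to text; special tokens are skipped by default.
decoded = tokenizer.decode(encodings[0].ids)
```

The attention mask is 1 for real tokens and 0 for padding, so the second sequence ends with a masked `[PAD]` position.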
Requires
- · transformers
- · tokenizers (Rust)
Preconditions
- · tokenizer available (HF hub or local)
Failure modes
- · OOM if batch too large
- · vocabulary mismatch if wrong tokenizer loaded
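One common way to sidestep the out-of-memory failure mode is to cap how many texts are encoded per call. The sketch below is a generic, standard-library-only helper; `iter_batches` and `encode_in_batches` are hypothetical names, and `encode_batch` stands for any callable that maps a list of strings to a list of results (for instance a tokenizer's batch-encode method).

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence


def iter_batches(texts: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def encode_in_batches(
    encode_batch: Callable[[List[str]], Sequence],
    texts: Iterable[str],
    batch_size: int = 1000,
) -> Iterator:
    """Encode texts one bounded batch at a time and stream the results."""
    for batch in iter_batches(texts, batch_size):
        # Only one batch of inputs is held in memory at any moment.
        yield from encode_batch(batch)
```

Because both functions are generators, the input can be a lazy iterable such as a file handle, and a smaller `batch_size` trades some throughput for a lower peak memory footprint.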
Trust signals
- · Rust backend → reported up to 100× faster than pure-Python tokenizers
- · compatible with all HF models