Fast, state-of-the-art tokenizers, optimized for both research and production. Tokenizers provides implementations of today’s most used tokenizers, with a focus on performance and versatility. These tokenizers are also used in Transformers. The library can train new vocabularies and tokenize text, and it is extremely fast at both thanks to its Rust implementation: tokenizing a GB of text takes less than 20 seconds on a server’s CPU. It is easy to use, but also extremely versatile. It offers full alignment tracking: even with destructive normalization, it’s always possible to get the part of the original sentence that corresponds to any token. It also does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
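
The alignment tracking described above can be illustrated with a small sketch. This is not the Tokenizers API or its implementation; it is a standard-library Python example under simple assumptions (lowercasing plus accent stripping as the destructive normalization, whitespace splitting as the tokenizer), and the function names `normalize_with_alignment` and `tokenize_with_offsets` are hypothetical. The idea is to record, for every normalized character, the index of the original character it came from, so each token can be mapped back to a span of the original text.

```python
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, recording for every output character
    the index of the original character it was produced from."""
    chars, origins = [], []
    for i, ch in enumerate(text):
        for piece in unicodedata.normalize("NFD", ch.lower()):
            if unicodedata.combining(piece):
                continue  # drop the accent mark: the destructive step
            chars.append(piece)
            origins.append(i)
    return "".join(chars), origins


def tokenize_with_offsets(text):
    """Split the normalized text on whitespace and return
    (token, (start, end)) pairs, where start/end index the ORIGINAL text."""
    normalized, origins = normalize_with_alignment(text)
    tokens, start = [], None
    # A trailing space acts as a sentinel so the last token is flushed.
    for pos, ch in enumerate(normalized + " "):
        if ch.isspace():
            if start is not None:
                span = (origins[start], origins[pos - 1] + 1)
                tokens.append((normalized[start:pos], span))
                start = None
        elif start is None:
            start = pos
    return tokens


text = "H\u00e9llo  W\u00f6rld"  # "Héllo  Wörld"
for token, (begin, end) in tokenize_with_offsets(text):
    # Each normalized token is shown next to the original slice it came from.
    print(token, (begin, end), text[begin:end])
```

The normalized tokens lose the accents and capitalization, but the stored offsets still recover the exact original substring. The sketch assumes precomposed characters in the input; a production implementation would also need to handle inputs where accents are separate combining characters.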

Features

  • Train new vocabularies and tokenize, using today’s most used tokenizers
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server’s CPU
  • Easy to use, but also extremely versatile
  • Designed for both research and production
  • Full alignment tracking
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs
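
To make the pre-processing feature concrete, here is a minimal standard-library Python sketch of truncation, special-token insertion, and padding applied to a list of token ids. It is not the Tokenizers API: the function name `prepare` is hypothetical, and the ids used for the start, separator, and padding tokens (101, 102, 0) are illustrative placeholders that depend on the model's vocabulary.

```python
def prepare(ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Truncate, wrap in special tokens, and pad to exactly max_length.

    Returns (input_ids, attention_mask). The special-token ids are
    placeholder assumptions, not values taken from any particular model.
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    body = ids[: max_length - 2]             # truncation
    input_ids = [cls_id] + body + [sep_id]   # special tokens
    attention_mask = [1] * len(input_ids)    # 1 marks real tokens
    padding = max_length - len(input_ids)    # padding (0 if already full)
    return input_ids + [pad_id] * padding, attention_mask + [0] * padding


short_ids, short_mask = prepare([7, 8, 9], max_length=6)
long_ids, long_mask = prepare([1, 2, 3, 4, 5, 6], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask marks which positions hold real tokens and which hold padding, so a model can ignore the padded positions.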

License

Apache License V2.0

Additional Project Details

Programming Language

Rust

Related Categories

Rust Artificial Intelligence Software, Rust Machine Learning Software

Registered

2023-03-23