## About

A tokenizer divides text into a sequence of tokens, which roughly correspond to "words". We provide a class suitable for tokenizing English, called PTBTokenizer. It was initially designed to largely mimic Penn Treebank 3 (PTB) tokenization, hence its name, though over time the tokenizer has added quite a few options and a fair amount of Unicode compatibility, so in general it works well on text from many different sources.
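To give a feel for what PTB-style tokenization does, here is a minimal Python sketch of two of its best-known conventions: splitting clitic contractions ("can't" becomes "ca" + "n't", "she's" becomes "she" + "'s") and separating punctuation from adjoining words. This is an illustrative toy, not the actual PTBTokenizer, which is a JFlex-generated Java class handling far more cases (quotes, hyphens, URLs, abbreviations, and so on).

```python
import re

def ptb_style_tokenize(text):
    # Toy sketch of a few Penn Treebank tokenization conventions.
    # The real PTBTokenizer handles many more rules and options.

    # Split off "n't" contractions: "can't" -> "ca n't"
    text = re.sub(r"n't\b", " n't", text)
    # Split off other clitics: "she's" -> "she 's", "they'll" -> "they 'll"
    text = re.sub(r"'(s|re|ve|ll|d|m)\b", r" '\1", text)
    # Separate sentence punctuation from adjoining words
    text = re.sub(r"([.,!?;:])", r" \1 ", text)
    return text.split()

print(ptb_style_tokenize("She's saying they can't handle it."))
# -> ['She', "'s", 'saying', 'they', 'ca', "n't", 'handle', 'it', '.']
```

The real tokenizer makes these decisions with a finite-state lexer rather than regex substitutions, which is both faster and far more precise about context.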
The Stanford Natural Language Processing Group