Charformer Turkish base model (character-level T5)
Most Turkish-language models you’ll see are built on subword tokenizers. This one takes a different route: it’s an encoder-decoder (T5-style) transformer that works directly on characters, with a fixed char→id mapping (vocab.json) instead of a learned BPE/unigram tokenizer. That can be a good fit for messy inputs (typos, mixed casing, slang, OCR artifacts), where subword segmentations get brittle. The author also links the architecture back to the Charformer paper, which makes character-level transformers practical via gradient-based subword tokenization (GBST), a learned, lightweight downsampling of the character sequence.
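The repo’s actual vocab.json layout isn’t reproduced here, but assuming it is a flat JSON object mapping single characters to integer ids (the simplest reading of “fixed char→id mapping”), encoding and decoding need no tokenizer at all. The vocabulary below is a hypothetical stand-in, not the model’s real one:

```python
# Hypothetical stand-in for the model's vocab.json: a flat mapping from
# single characters (plus special tokens) to integer ids. The real file
# may use different special tokens and a much larger character set.
vocab = {"<pad>": 0, "<unk>": 1, "m": 2, "e": 3, "r": 4, "h": 5, "a": 6, "b": 7}

def encode(text: str, vocab: dict) -> list:
    """Map each character to its id, falling back to <unk> for unseen chars."""
    unk = vocab["<unk>"]
    return [vocab.get(ch, unk) for ch in text]

def decode(ids: list, vocab: dict) -> str:
    """Invert the mapping; ids outside the vocab render as <unk>."""
    inv = {i: ch for ch, i in vocab.items()}
    return "".join(inv.get(i, "<unk>") for i in ids)

print(encode("merhaba", vocab))  # "hello" in Turkish; every char is in-vocab
print(encode("şey", vocab))      # chars missing from this toy vocab map to <unk>
```

The upside of this scheme is that there is no out-of-vocabulary problem at the subword level: a typo just changes a few ids rather than shattering the segmentation.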
The author notes it’s still in active development, with weights updated regularly, so treat it as a “try it and see” research artifact rather than a stable dependency. A reasonable first experiment is a simple text-to-text task like normalization (cleaning up noisy Turkish text) or short summarization, comparing outputs against a tokenizer-based baseline on the same prompts to see where character-level modeling helps or hurts. If you decide to push further, keep a small evaluation set of deliberately misspelled or code-mixed inputs — that’s where character-level approaches tend to have the most obvious upside.
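One simple way to score such an evaluation set is character error rate (CER): the edit distance between a model’s output and a clean reference, normalized by the reference length. A minimal sketch, with a hypothetical pair of noisy Turkish inputs and hand-written references (the model call itself is omitted):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edits needed, normalized by reference length."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

# Hypothetical noisy->clean Turkish pairs; in a real run the first element
# would be the model's output for the noisy input, not the input itself.
eval_set = [
    ("slm naber", "selam naber"),          # texting-style abbreviation
    ("gelicem yarin", "geleceğim yarın"),  # missing diacritics + slang spelling
]
for hyp, ref in eval_set:
    print(f"{hyp!r} vs {ref!r}: CER = {cer(hyp, ref):.2f}")
```

Scoring the raw noisy input this way gives a “do nothing” baseline; a useful normalization model should push the CER of its outputs below that number.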
Source listing: https://huggingface.co/models?sort=modified