Aknazar Janibek

nlp text normalization

08 Jan 2025

Text representation and classification

corpus: a computer-readable collection of texts
examples: tweets, wikipedia articles, facebook posts
token: any occurrence of a word in the corpus; the number of tokens is a measure of the length of the corpus
type: a distinct word among the tokens; the number of types is an estimate of the vocabulary size
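
To make the token/type distinction concrete, here is a minimal sketch that counts both for a tiny corpus. The function name `count_tokens_and_types` and the sample sentences are made up for illustration, and splitting on whitespace after lowercasing is a simplifying assumption; a real tokenizer would also handle punctuation, contractions, and so on.

```python
from collections import Counter


def count_tokens_and_types(corpus):
    """Return (number of tokens, number of types) for a list of texts.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    # Every word occurrence in every text is one token.
    tokens = [word for text in corpus for word in text.lower().split()]
    # Each distinct word is one type; Counter keys are the types.
    type_counts = Counter(tokens)
    return len(tokens), len(type_counts)


corpus = ["the cat sat on the mat", "the dog sat"]
n_tokens, n_types = count_tokens_and_types(corpus)
print(n_tokens, n_types)  # 9 6
```

The corpus has 9 tokens but only 6 types, because "the" occurs three times and "sat" twice.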