Steps to normalize and clean textual data.

  1. Cleaning:
    • Removing noise such as emojis, URLs, hashtags (#), user mentions (@), etc.
    • Fixing elongated words, spell checking, expanding contractions, etc.
    • Preparing text as model input: removing punctuation, lower-casing all words, etc.
  2. Segmentation (Tokenization):
    • Word Tokenization
    • Sentence Segmentation
  3. Root words:
    • Stemming: extracting the base (stem) of a word by removing prefixes and suffixes.
    • Lemmatizing: extracting the lemma (dictionary form) of a word using lexical knowledge.
  4. Part-of-speech Tagging (POS)
  5. Dependency Parsing
  6. Constituency Parsing
  7. Coreference Resolution
  8. Lexical Normalization: Lexical normalization is the task of transforming non-standard text into a standard register.
  9. Missing Elements: Missing elements are a collection of phenomena that deal with things that are meant but not explicitly mentioned in the text.
  10. Entity Linking (EL)
  11. Word Sense Disambiguation (WSD): WSD is the task of associating words in context with their most suitable entry in a pre-defined sense inventory.
  12. Semantic Parsing: Semantic parsing is the task of translating natural language into a formal meaning representation on which a machine can act.
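
Steps 1 and 2 above (cleaning and segmentation) can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the regular expressions, the tiny CONTRACTIONS table, and the function names clean, split_sentences, and tokenize are all assumptions made for this example. Spell checking and the later steps (stemming, lemmatizing, tagging, parsing, and so on) are left out, since in practice they usually rely on a dedicated NLP library.

```python
import re

# Illustrative patterns only; real-world text usually needs broader coverage.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
ELONGATED_RE = re.compile(r"(\w)\1{2,}")         # same character 3+ times in a row
PUNCT_RE = re.compile(r"[^\w\s']")               # also strips emojis and symbols
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")   # naive sentence boundary

# Hypothetical, deliberately tiny contraction table.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}


def clean(text: str) -> str:
    """Step 1 (partial): drop URLs and mentions, unwrap hashtags, shorten elongations."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)       # keep the tag word, drop the '#'
    text = ELONGATED_RE.sub(r"\1\1", text)   # "sooooo" -> "soo"; a spell checker would finish the job
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Step 2: sentence segmentation on whitespace that follows '.', '!' or '?'."""
    return [s for s in SENTENCE_END_RE.split(text) if s]


def tokenize(sentence: str) -> list[str]:
    """Step 2: word tokenization with lower-casing, punctuation removal, contraction expansion."""
    tokens = []
    for raw in sentence.lower().split():
        word = PUNCT_RE.sub("", raw)          # remove punctuation but keep apostrophes
        word = CONTRACTIONS.get(word, word)   # expand contractions found in the table
        tokens.extend(word.split())           # an expansion may produce two tokens
    return tokens


if __name__ == "__main__":
    raw_text = "I can't believe it's sooooo good!!! Check https://example.com @bob #NLP. It works."
    for sentence in split_sentences(clean(raw_text)):
        print(tokenize(sentence))
```

The order matters in this sketch: URLs and mentions are removed before punctuation is stripped, because the patterns depend on characters such as ':' , '/' and '@' still being present. Sentence segmentation likewise runs before punctuation removal, since it uses the sentence-final marks as boundaries.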