TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a Natural Language Processing (NLP) technique used for feature extraction and for term weighting in information retrieval. It assigns each term a weight per document, based on how often the term appears in that document and how rare it is across the rest of the corpus. That is, a term that appears frequently in one document but in few other documents in the corpus receives a higher weight.


The formulas use the following notation:

  • t stands for term
  • f stands for frequency
  • d stands for document
  • D stands for set of documents
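
As a reference point, the overall weight is commonly written as the product of the two components described in the sections below. This is the usual textbook form, not necessarily the exact variant the original formula showed; here t is a term, d a document, and D the set of documents:

```latex
\operatorname{tfidf}(t, d, D) = \operatorname{tf}(t, d) \cdot \operatorname{idf}(t, D)
```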

Concerns:

  • TF-IDF can be computationally expensive if the vocabulary is large.
  • TF-IDF doesn't capture the semantics of terms; for example, synonyms are treated as unrelated features.

Term Frequency (TF)

Term Frequency (TF) measures how frequently a term (t) appears in a document (d), where f_{t,d} is the number of times the term appears in the document.
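
One common way to turn the raw count into a TF value is to divide it by the total number of terms in the document, so that long documents are not favored. The sketch below assumes that length-normalized variant and a naive whitespace tokenizer; the function name term_frequency and its parameters are illustrative, not taken from any library.

```python
from collections import Counter


def term_frequency(term, document_tokens):
    """Relative frequency of `term` in one tokenized document.

    Assumption: TF is the raw count divided by the document length.
    Other variants (raw count, log-scaled count) are also in use.
    """
    if not document_tokens:
        return 0.0  # an empty document contains no terms
    counts = Counter(document_tokens)  # raw count for every term in the document
    return counts[term] / len(document_tokens)


# Naive whitespace tokenization, for illustration only.
doc = "the cat sat on the mat".split()
print(term_frequency("the", doc))  # 2 occurrences out of 6 tokens
print(term_frequency("cat", doc))  # 1 occurrence out of 6 tokens
```

Note that the very common word "the" gets the highest TF here, which is the motivation for combining TF with IDF.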

Inverse Document Frequency (IDF)

Inverse Document Frequency (IDF) measures how informative a term is, based on how common or uncommon it is across the corpus. IDF is needed because TF alone is not sufficient to judge the importance of a word: a very common word can have a high TF in almost every document.
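
A widely used definition, given here as an assumption about the intended formula, takes the logarithm of the total number of documents N = |D| divided by the number of documents that contain the term. Some implementations add smoothing terms to avoid division by zero:

```latex
\operatorname{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|},
\qquad N = |D|
```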

IDF gives uncommon words a higher value than very common words (such as ‘the’, ‘a’, ‘is’, …).
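
To make this concrete, here is a minimal sketch of IDF and the combined TF-IDF weight. It assumes the unsmoothed form log(N / n_t), where n_t is the number of documents containing the term, a length-normalized TF, and a weight of 0 for terms that occur in no document. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def inverse_document_frequency(term, corpus):
    """log(N / n_t), where n_t is the number of documents containing `term`."""
    n_docs_with_term = sum(1 for doc in corpus if term in doc)
    if n_docs_with_term == 0:
        return 0.0  # assumption: unseen terms get weight 0 instead of dividing by zero
    return math.log(len(corpus) / n_docs_with_term)


def tf_idf(term, doc, corpus):
    """Length-normalized term frequency multiplied by IDF."""
    if not doc:
        return 0.0
    tf = Counter(doc)[term] / len(doc)
    return tf * inverse_document_frequency(term, corpus)


# Toy corpus, tokenized by whitespace for illustration only.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for term in ("the", "cat", "bird"):
    print(term, inverse_document_frequency(term, corpus))
```

In this toy corpus "the" occurs in every document, so its IDF is log(3/3) = 0 and its TF-IDF weight vanishes no matter how often it appears, while "bird", which occurs in a single document, receives the largest IDF.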