Artem's answer pretty much sums up the difference.
To make things clearer, here is a worked example.
TfidfTransformer can be used as follows:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

train_set = ["The sky is blue.", "The sun is bright."]

# Extract raw term counts
vectorizer = CountVectorizer(stop_words='english')
trainVectorizerArray = vectorizer.fit_transform(train_set)
print(trainVectorizerArray.todense())

# Re-weight the counts with TF-IDF and apply L2 normalization
transformer = TfidfTransformer()
res = transformer.fit_transform(trainVectorizerArray)
print(res.todense())
## RESULT:
[[1 0 1 0]
 [0 1 0 1]]
[[0.70710678 0.         0.70710678 0.        ]
 [0.         0.70710678 0.         0.70710678]]
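As a side note, you can see which term each column corresponds to by inspecting the fitted vocabulary. A minimal sketch, assuming scikit-learn >= 1.0 (where get_feature_names_out is available):

# Map matrix columns to vocabulary terms; stop words like "the" and "is"
# have already been filtered out by stop_words='english'
print(vectorizer.get_feature_names_out())
## RESULT:
['blue' 'bright' 'sky' 'sun']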
Extraction of count features, TF-IDF weighting, and row-wise Euclidean (L2) normalization can be done in one operation with TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer

# Counts, TF-IDF weighting and L2 normalization in a single step
tfidf = TfidfVectorizer(stop_words='english')
res1 = tfidf.fit_transform(train_set)
print(res1.todense())
## RESULT:
[[0.70710678 0.         0.70710678 0.        ]
 [0.         0.70710678 0.         0.70710678]]
Both approaches produce a sparse matrix containing the same values.
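If you want to convince yourself of that numerically, here is a minimal check, assuming the res and res1 matrices from the snippets above:

import numpy as np

# Both pipelines should yield numerically identical TF-IDF matrices
print(np.allclose(res.todense(), res1.todense()))
## RESULT:
True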
Other useful references are the documentation for TfidfTransformer.fit_transform, CountVectorizer.fit_transform, and TfidfVectorizer.