Use relative term frequency instead of absolute #317

BlackbitDevs · 2024-08-22T16:18:52Z

In

Lines 113 to 122 in a763e66

    
           $df            = $this->totalMatchingDocuments($term, $isLastKeyword); 
        
           $idf           = log($count / max(1, $df)); 
        
           foreach ($this->getAllDocumentsForKeyword($term, $noLimit, $isLastKeyword) as $document) { 
        
               $docID = $document['doc_id']; 
        
               $tf    = $document['hit_count']; 
        
               $num   = ($tfWeight + 1) * $tf; 
        
               $denom = $tfWeight 
        
                    * ((1 - $dlWeight) + $dlWeight) 
        
                    + $tf; 
        
               $score             = $idf * ($num / $denom);

the score gets calculated. The problem here is that $document['hit_count'] returns the absolute number how often a term is contained in the document. This leads to the problem that for a spam document which simply contains all words 10 times, the score will be higher than for a document which has some terms more often than others.

Example:
Let's take the example from https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Example_of_tf%E2%80%93idf but with a little changed hit counts
We have 4 documents:
Document 1

Term	Hit count
this	1
is	1
a	2
sample	1

Document 2

Term	Hit count
this	1
is	1
another	2
example	3

Spam document

Term	Hit count
this	10
is	10
a	10
sample	10
another	10
example	10

We search for "example":
idf = log( 3 / 2 ) = 0.1761 -> the term "example" occurs in 2 / 3 documents.

tf how it is currently calculated in TNT search:

// document 1 does not get found because it does not contain term "example"

// document 2
$tf    = $document['hit_count'];  // = 3
$num   = ($tfWeight + 1) * $tf;  // = (1 + 1) * 3 = 6
$denom = $tfWeight 
    * ((1 - $dlWeight) + $dlWeight) // this calculation is already weird to me because (1 - $dlWeight) + $dlWeight) = 1 -> so what is this good for?
    + $tf;                                     // = 1 * ((1 - 0.5) + 0.5) + 3 = 4
$score             = $idf * ($num / $denom); // = 0.1761 * (6/4) = 0.26415

// spam document
$tf    = $document['hit_count'];  // = 10
$num   = ($tfWeight + 1) * $tf;  // = (1 + 1) * 10 = 20
$denom = $tfWeight 
    * ((1 - $dlWeight) + $dlWeight)  // this calculation is already weird to me because (1 - $dlWeight) + $dlWeight) = 1 -> so what is this good for?
    + $tf;                                     // = 1 * ((1 - 0.5) + 0.5) + 10 = 11
$score             = $idf * ($num / $denom); // = 0.1761 * (20/11) = 0.32

So the spam document has a higher score than document 2, although this document simply contains all words with the same frequency. On the other hand, document 2's most occurring word is "example", so imho this should get higher score.

According to Wikipedia's calculation term frequency is

Term frequency, tf(t,d), is the relative frequency of term t within document d,
where ft,d [the numerator] is the raw count of a term in a document, i.e., the number of times that term t occurs in document d. Note the denominator is simply the total number of terms in document d (counting each occurrence of the same term separately).

(There are also some other definitions for the denominator there but I can't find the one used in TNT Search)

So with Wikipedia's definition of term frequency, there would be the following results:

tf(Document 2, 'example') = hit count of term / sum(hit_count) = 3 / 7 = 0.4286
tf(Spam document, 'example') = 10 / 60 = 0.1666

score(Document 2) = 0.1761 * 0.4286 = 0.0755
score(Spam document) = 0.1761 * 0.1666 = 0.0293

nticaric · 2024-08-22T17:25:30Z

The algorithm we are using here is BM25. You can learn more about it here

The $dlWeight variable represents the b coefficient which serves for ranking. In practice, the optimal range for this coefficient is typically between 0.5 and 0.8

In your extreme case, the spam document would rank better, but not to much. In practice a document where a word occurs more times is more relevant. By doing a lot of search against documents, we decided to keep the coefficient 0.5

Therefore, changing hit_count to float does not make sense, since this tells us the number how many times a token occurs in a document which can only be an integer

BlackbitDevs · 2024-08-25T19:43:31Z

Ok, will read about BM25 later.

But how can ((1 - $dlWeight) + $dlWeight) in

$denom = $tfWeight 
    * ((1 - $dlWeight) + $dlWeight)
    + $tf;

make any difference as this is always 1? In BM25 it calculates (1 - b + b * something) - but this something is missing in TNT Search's formula or the brackets are wrong.

nticaric · 2024-09-03T07:08:53Z

Sorry for the late reply! You are correct; the expression always equals 1. The formula has been modified over the years, and this was overlooked. I remember that we changed the BM25 formula to ignore the document length, which is why a parameter is missing.

BlackbitDevs added 3 commits August 22, 2024 17:36

relative term frequency

b429014

relative term frequency

9597e72

relative term frequency

555b15e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use relative term frequency instead of absolute #317

Use relative term frequency instead of absolute #317

BlackbitDevs commented Aug 22, 2024 •

edited

Loading

nticaric commented Aug 22, 2024

BlackbitDevs commented Aug 25, 2024

nticaric commented Sep 3, 2024

	$df = $this->totalMatchingDocuments($term, $isLastKeyword);
	$idf = log($count / max(1, $df));
	foreach ($this->getAllDocumentsForKeyword($term, $noLimit, $isLastKeyword) as $document) {
	$docID = $document['doc_id'];
	$tf = $document['hit_count'];
	$num = ($tfWeight + 1) * $tf;
	$denom = $tfWeight
	* ((1 - $dlWeight) + $dlWeight)
	+ $tf;
	$score = $idf * ($num / $denom);

Use relative term frequency instead of absolute #317

Are you sure you want to change the base?

Use relative term frequency instead of absolute #317

Conversation

BlackbitDevs commented Aug 22, 2024 • edited Loading

nticaric commented Aug 22, 2024

BlackbitDevs commented Aug 25, 2024

nticaric commented Sep 3, 2024

BlackbitDevs commented Aug 22, 2024 •

edited

Loading