Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use relative term frequency instead of absolute #317

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

BlackbitDevs
Copy link
Contributor

@BlackbitDevs BlackbitDevs commented Aug 22, 2024

In

tntsearch/src/TNTSearch.php

Lines 113 to 122 in a763e66

$df = $this->totalMatchingDocuments($term, $isLastKeyword);
$idf = log($count / max(1, $df));
foreach ($this->getAllDocumentsForKeyword($term, $noLimit, $isLastKeyword) as $document) {
$docID = $document['doc_id'];
$tf = $document['hit_count'];
$num = ($tfWeight + 1) * $tf;
$denom = $tfWeight
* ((1 - $dlWeight) + $dlWeight)
+ $tf;
$score = $idf * ($num / $denom);

the score gets calculated. The problem here is that $document['hit_count'] returns the absolute number how often a term is contained in the document. This leads to the problem that for a spam document which simply contains all words 10 times, the score will be higher than for a document which has some terms more often than others.

Example:
Let's take the example from https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Example_of_tf%E2%80%93idf but with a little changed hit counts
We have 4 documents:
Document 1

Term Hit count
this 1
is 1
a 2
sample 1

Document 2

Term Hit count
this 1
is 1
another 2
example 3

Spam document

Term Hit count
this 10
is 10
a 10
sample 10
another 10
example 10

We search for "example":
idf = log( 3 / 2 ) = 0.1761 -> the term "example" occurs in 2 / 3 documents.

tf how it is currently calculated in TNT search:

// document 1 does not get found because it does not contain term "example"

// document 2
$tf    = $document['hit_count'];  // = 3
$num   = ($tfWeight + 1) * $tf;  // = (1 + 1) * 3 = 6
$denom = $tfWeight 
    * ((1 - $dlWeight) + $dlWeight) // this calculation is already weird to me because (1 - $dlWeight) + $dlWeight) = 1 -> so what is this good for?
    + $tf;                                     // = 1 * ((1 - 0.5) + 0.5) + 3 = 4
$score             = $idf * ($num / $denom); // = 0.1761 * (6/4) = 0.26415

// spam document
$tf    = $document['hit_count'];  // = 10
$num   = ($tfWeight + 1) * $tf;  // = (1 + 1) * 10 = 20
$denom = $tfWeight 
    * ((1 - $dlWeight) + $dlWeight)  // this calculation is already weird to me because (1 - $dlWeight) + $dlWeight) = 1 -> so what is this good for?
    + $tf;                                     // = 1 * ((1 - 0.5) + 0.5) + 10 = 11
$score             = $idf * ($num / $denom); // = 0.1761 * (20/11) = 0.32

So the spam document has a higher score than document 2, although this document simply contains all words with the same frequency. On the other hand, document 2's most occurring word is "example", so imho this should get higher score.

According to Wikipedia's calculation term frequency is

Term frequency, tf(t,d), is the relative frequency of term t within document d,
where ft,d [the numerator] is the raw count of a term in a document, i.e., the number of times that term t occurs in document d. Note the denominator is simply the total number of terms in document d (counting each occurrence of the same term separately).

(There are also some other definitions for the denominator there but I can't find the one used in TNT Search)

So with Wikipedia's definition of term frequency, there would be the following results:

tf(Document 2, 'example') = hit count of term / sum(hit_count) = 3 / 7 = 0.4286
tf(Spam document, 'example') = 10 / 60 = 0.1666

score(Document 2) = 0.1761 * 0.4286 = 0.0755
score(Spam document) = 0.1761 * 0.1666 = 0.0293

@nticaric
Copy link
Contributor

The algorithm we are using here is BM25. You can learn more about it here

The $dlWeight variable represents the b coefficient which serves for ranking. In practice, the optimal range for this coefficient is typically between 0.5 and 0.8

In your extreme case, the spam document would rank better, but not to much. In practice a document where a word occurs more times is more relevant. By doing a lot of search against documents, we decided to keep the coefficient 0.5

Therefore, changing hit_count to float does not make sense, since this tells us the number how many times a token occurs in a document which can only be an integer

@BlackbitDevs
Copy link
Contributor Author

Ok, will read about BM25 later.

But how can ((1 - $dlWeight) + $dlWeight) in

$denom = $tfWeight 
    * ((1 - $dlWeight) + $dlWeight)
    + $tf;

make any difference as this is always 1? In BM25 it calculates (1 - b + b * something) - but this something is missing in TNT Search's formula or the brackets are wrong.

@nticaric
Copy link
Contributor

nticaric commented Sep 3, 2024

Sorry for the late reply! You are correct; the expression always equals 1. The formula has been modified over the years, and this was overlooked. I remember that we changed the BM25 formula to ignore the document length, which is why a parameter is missing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants