Is it possible to get fulltext search score as float in 0-1?

lingvisa · March 19, 2021, 5:00am

https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-full-text-search/

The score is useful, but it can be well above or below 1. Is it possible to get it as a float between 0 and 1? That would make it easier to set a threshold for keeping or discarding results based on relevance.

Joel · March 19, 2021, 5:35pm

You could create a normalized score: just divide all the scores by the maximum score in the result set. Note that this is a "search relative" normalization. From my fairly limited (but very recent) experience with fulltext indexes, I believe the maximum possible score depends on the specific fulltext index design and is also data dependent.

I quickly googled it, and this page appears to confirm this.

That page also links to the Lucene page on how the score is calculated; I'll include it here for convenience:

https://lucene.apache.org/core/3_0_3/scoring.html

lingvisa · March 20, 2021, 6:21am

Maybe you mean dividing all the scores by the sum of all scores to normalize? My purpose is to set a threshold value to decide whether the results should be kept or discarded. If I divide by the max score, the top score will always normalize to 1, which doesn't serve my purpose. I need a way to compress all the scores into the range 0<x<1. For example, if the normalized score is >0.8, I want to keep the result, and discard all others.
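To make the sum-based idea concrete, here is a minimal Python sketch. The function name `normalize_by_sum` and the input scores are made up for illustration; they are not output from a real index.

```python
def normalize_by_sum(scores):
    """Divide each score by the sum of all scores in the result set."""
    total = sum(scores)
    if total <= 0:
        # Nothing meaningful to normalize against; return zeros.
        return [0.0 for _ in scores]
    return [s / total for s in scores]


# Hypothetical raw scores, chosen only for illustration.
raw = [6.0, 3.0, 1.0]
print([round(x, 4) for x in normalize_by_sum(raw)])
```

With this approach the normalized values always add up to 1, so each value also depends on how many results were returned.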

Joel · March 20, 2021, 9:03pm

The normalized range is 0.0 to 1.0, but if you really want it to be 0.0 to <1.0, I guess there are a variety of ways to fudge that.

I'll give an example (similar to the scores I see for my index). Say I return only the top 4 scores and they are
7.0, 4.0, 1.0, 1.0

The max score is 7.0, so then the normalized scores would be

1.0, 0.5714, 0.1429, 0.1429 (rounded to four digits...)
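The same arithmetic as a small Python sketch, assuming the scores have already been fetched from the index into a plain list (the function name `normalize_by_max` is just for illustration):

```python
def normalize_by_max(scores):
    """Divide each score by the largest score in this result set."""
    top = max(scores, default=0.0)
    if top <= 0:
        # Empty result set or no positive score: nothing to scale by.
        return [0.0 for _ in scores]
    return [s / top for s in scores]


normalized = normalize_by_max([7.0, 4.0, 1.0, 1.0])
print([round(x, 4) for x in normalized])  # [1.0, 0.5714, 0.1429, 0.1429]
```

Because the divisor comes from the result set itself, the top hit is always 1.0 regardless of how well it matches the query.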

As a fudge (though I don't understand why you want to do this), you could simply multiply those scores by 0.99, yielding these scores, now forced into the range 0 to <1.0:

0.99, 0.5657, 0.1414, 0.1414 (rounded to four digits...)

lingvisa · March 21, 2021, 4:57am

That works as a transformation, but my real point is how to decide whether a result's relevance is strong enough to keep it. In your example, suppose the max score of 7.0 corresponds to this query and node:

aaa abacad

I may not want to keep it, because the similarity is not good enough. If instead the 7.0 corresponds to:

aaa aaab

Then this result is a lot better in terms of similarity. The scores only rank the results; they don't say how relevant (similar) each one is to the query. Even if the very top result has little similarity to the query, it is still the top-ranked result, which is correct, but its rank isn't indicative of its relevance. In your example, suppose the 4 scores could be transformed to one of these two sets:

0.45, 0.31, 0.11, 0.09
0.87, 0.47, 0.3, 0.12

I would certainly want to discard the first set of results and keep the first result of the 2nd set, since it is more than 0.8. That's the 'threshold' I was talking about.
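In Python terms, the filtering I have in mind looks roughly like the sketch below. It assumes the scores are already on an absolute 0-1 relevance scale, which is exactly the part I don't know how to get; `keep_relevant` is a made-up helper name.

```python
def keep_relevant(scores, threshold=0.8):
    """Keep only the scores strictly above the relevance threshold."""
    return [s for s in scores if s > threshold]


first_set = [0.45, 0.31, 0.11, 0.09]
second_set = [0.87, 0.47, 0.3, 0.12]

print(keep_relevant(first_set))   # []
print(keep_relevant(second_set))  # [0.87]
```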

How do you use the scores you modified down the road?
