
@mschoch
Created July 4, 2016 20:17
Introduction to Bleve Scoring

Scoring

Scoring of complex queries involves a few different pieces. To understand how they fit together, it's helpful to first look at a simple query, then add in complexities one at a time until we have the complete picture.

Term Query

The simplest query we can run is a term query for a single term. Consider a term search for light in the field description of the beer-search dataset. The top scoring hit is iron_city_brewing_co-ic_light and when we ask Bleve to explain the scoring we get:

"explanation": {
  "value": 2.024174152548743,
  "message": "fieldWeight(description:light in iron_city_brewing_co-ic_light), product of:",
  "children": [
    {
      "value": 1.4142135623730951,
      "message": "tf(termFreq(description:light)=2"
    },
    {
      "value": 0.3333333432674408,
      "message": "fieldNorm(field=description, doc=iron_city_brewing_co-ic_light)"
    },
    {
      "value": 4.293921680740409,
      "message": "idf(docFreq=270, maxDocs=7303)"
    }
  ]
}

Reproduce: curl -XPOST http://localhost:8094/api/search -d '{"size":10,"explain":true,"highlight":{},"query":{"term":"light","field":"description"}}'

As you can see there are 3 components to the score of this document:

  • TF (term frequency) - how often does the term appear in this field of the document? The more frequent the term, the higher the score. In this implementation, the actual value used is the square root of the term frequency. In this example, the term frequency was 2, thus the value used in the computation is 1.414.

  • Norm (field normalization) - how many tokens are in the field? If the field has many tokens, we want to lower the score. Consider 2 documents that both use the term "light" twice: one has 10 tokens, the other has 1000. We want the document using it twice out of 10 tokens to score higher than the document that used it twice out of 1000 tokens. In this implementation, the actual value used is 1/sqrt(number_of_tokens_in_the_field). In this example, the field contained 9 tokens, thus the value used is 0.333. (NOTE: in the future, index time field boosting would be included in this computation, but bleve currently only supports query time boosting)

  • IDF (inverse document frequency) - out of the whole population of documents, how many of them used this term? The idea is that a term appearing in only a few documents is more significant than a term which appears in many/all documents. In this implementation, the actual value used is 1 + ln(total_num_docs/(doc_freq + 1)). In this example, the total number of docs was 7303 and the document frequency was 270, thus the computed value is 4.294. (NOTE: the +1's just avoid divide by 0 cases and don't significantly alter the computation)

So, with these 3 pieces calculated, we compute the final score by multiplying them: 1.414*0.333*4.294=2.024
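To make the arithmetic concrete, here is a minimal Go sketch of this computation. The function and variable names (termScore and friends) are illustrative, not Bleve's actual API:

package main

import (
	"fmt"
	"math"
)

// termScore is a hypothetical helper reproducing the TF * fieldNorm * IDF
// product described above.
//   freq:      occurrences of the term in the field (2 for "light")
//   numTokens: total tokens in the field (9 for this description)
//   docFreq:   number of documents containing the term (270)
//   maxDocs:   total number of documents in the index (7303)
func termScore(freq, numTokens, docFreq, maxDocs float64) float64 {
	tf := math.Sqrt(freq)                    // 1.414
	fieldNorm := 1 / math.Sqrt(numTokens)    // 0.333
	idf := 1 + math.Log(maxDocs/(docFreq+1)) // 4.294
	return tf * fieldNorm * idf
}

func main() {
	fmt.Println(termScore(2, 9, 270, 7303)) // prints ~2.024
}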

Searching Multiple Terms

Now that we've seen how to calculate the score of a search for a single term, we can look at the more common case where a user searches for multiple terms of the form A or B or C...

Consider a match search for the terms light and water in the field description of the beer-search dataset. We also want to boost the importance of water by a factor of 3. The top scoring hit is iron_city_brewing_co-ic_light and when we ask Bleve to explain the scoring we get:

"explanation": {
  "value": 2.2412700681905235,
  "message": "product of:",
  "children": [
    {
      "value": 2.2412700681905235,
      "message": "sum of:",
      "children": [
        {
          "value": 0.5248131710762932,
          "message": "weight(description:light^1.000000 in iron_city_brewing_co-ic_light), product of:",
          "children": [
            {
              "value": 0.2592727361998342,
              "message": "queryWeight(description:light^1.000000), product of:",
              "children": [
                {
                  "value": 1,
                  "message": "boost"
                },
                {
                  "value": 4.293921680740409,
                  "message": "idf(docFreq=270, maxDocs=7303)"
                },
                {
                  "value": 0.060381337964955,
                  "message": "queryNorm"
                }
              ]
            },
            {
              "value": 2.024174152548743,
              "message": "fieldWeight(description:light in iron_city_brewing_co-ic_light), product of:",
              "children": [
                {
                  "value": 1.4142135623730951,
                  "message": "tf(termFreq(description:light)=2"
                },
                {
                  "value": 0.3333333432674408,
                  "message": "fieldNorm(field=description, doc=iron_city_brewing_co-ic_light)"
                },
                {
                  "value": 4.293921680740409,
                  "message": "idf(docFreq=270, maxDocs=7303)"
                }
              ]
            }
          ]
        },
        {
          "value": 1.7164568971142304,
          "message": "weight(description:water^3.000000 in iron_city_brewing_co-ic_light), product of:",
          "children": [
            {
              "value": 0.9658041459133684,
              "message": "queryWeight(description:water^3.000000), product of:",
              "children": [
                {
                  "value": 3,
                  "message": "boost"
                },
                {
                  "value": 5.331692310152274,
                  "message": "idf(docFreq=95, maxDocs=7303)"
                },
                {
                  "value": 0.060381337964955,
                  "message": "queryNorm"
                }
              ]
            },
            {
              "value": 1.7772308230163623,
              "message": "fieldWeight(description:water in iron_city_brewing_co-ic_light), product of:",
              "children": [
                {
                  "value": 1,
                  "message": "tf(termFreq(description:water)=1"
                },
                {
                  "value": 0.3333333432674408,
                  "message": "fieldNorm(field=description, doc=iron_city_brewing_co-ic_light)"
                },
                {
                  "value": 5.331692310152274,
                  "message": "idf(docFreq=95, maxDocs=7303)"
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "value": 1,
      "message": "coord(2/2)"
    }
  ]
}

Reproduce: curl -XPOST http://localhost:8094/api/search -d '{"size":10,"explain":true,"highlight":{},"query":{"disjuncts":[{"term":"light","field":"description"},{"term":"water","field":"description","boost":3.0}]}}'

At first this looks very complicated, but the first thing to recognize is that two of the sub-sections (the ones with messages starting "fieldWeight") are the kind we just looked at (note that the 2.024 value matches the one we computed above). This value, and the corresponding 1.777 for the other term, are the raw scores for the two individual terms.

But when searching multiple terms, there are 2 other factors that come into play.

  • Boosting - the idea of boosting is that when searching for more than one term, sometimes you want to increase the relative importance of (or boost) one term over another. You could just add a step where you multiply by the boost factor, but that would make scores involving boost wildly different from those that don't. To account for this, a normalization is done across all the boost factors. This allows boost factors to alter the relative importance of documents without drastically altering the final scores.

So, first as an intermediate step we compute the sum of the squared weights of all the queries at this level. In this case, there are 2 term queries, one for light and one for water. Each of them computes its boost * IDF, squares that value, and returns it. Then these values are summed for all queries at this level.

  • light - has IDF of 4.293 and boost of 1.0 - multiplied and squared this is 18.437
  • water - has IDF of 5.331 and boost of 3.0 - multiplied and squared this is 255.842

Now, we sum these, take the square root, then take the inverse, and get 0.060. This value is known as the queryNorm.
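Continuing the Go sketch from above (queryNorm is again an illustrative name, not Bleve's API), this step looks like:

// queryNorm normalizes across the boost*IDF weights of all queries at
// this level: 1 / sqrt(sum of squared weights).
func queryNorm(boosts, idfs []float64) float64 {
	sumSquares := 0.0
	for i := range boosts {
		w := boosts[i] * idfs[i]
		sumSquares += w * w
	}
	return 1 / math.Sqrt(sumSquares)
}

// queryNorm([]float64{1, 3}, []float64{4.294, 5.332})
// => 1 / sqrt(18.437 + 255.842) = ~0.060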

Now, when each of the underlying term queries computes its score, it also calculates a weight based on its boost, its IDF, and the queryNorm that we just calculated.

  • light - boost of 1.0, IDF of 4.293, and queryNorm 0.060 => 0.259
  • water - boost of 3.0, IDF of 5.331, and queryNorm 0.060 => 0.965

So, the final score for each term is then this new weight multiplied by the simple term score that we learned how to compute above.

  • light - weight 0.259, term score 2.024 => 0.524
  • water - weight 0.965, term score 1.777 => 1.716

At this stage, these adjusted term scores are summed, yielding the value 2.241.
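A sketch of these last two steps, again with illustrative names:

// weightedSum combines each term's query weight (boost * idf * queryNorm)
// with its raw field weight (the simple term score), then sums the results.
func weightedSum(boosts, idfs, fieldWeights []float64, qNorm float64) float64 {
	sum := 0.0
	for i := range boosts {
		queryWeight := boosts[i] * idfs[i] * qNorm
		sum += queryWeight * fieldWeights[i]
	}
	return sum
}

// For light (boost 1.0, IDF 4.294, field weight 2.024) and
// water (boost 3.0, IDF 5.332, field weight 1.777) with qNorm 0.060:
// 0.259*2.024 + 0.966*1.777 = 0.525 + 1.716 = ~2.241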

  • Coordination Factor - the other major consideration when scoring searches of this form is a factor to account for how many of the terms you searched for matched. The idea is that no matter how well any single term matched, you want to reward documents that match a larger number of the terms you searched. To do this we multiply the score by what we call the coordination factor. It is simply the number of search terms that were satisfied out of the total number of search terms (at this level). In this example, the document satisfied both of the terms we searched for, thus 2/2 or a coordination factor of 1.0. Had only one of the terms matched, the score would have been cut in half due to a coordination factor of 0.5.
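The final multiplication is simple. As a sketch (coord is an illustrative name; the explain output above labels this value "coord(2/2)"):

// coord rewards documents that satisfy more of the searched terms.
func coord(matched, total int) float64 {
	return float64(matched) / float64(total)
}

// finalScore := weightedSum(...) * coord(2, 2) // 2.241 * 1.0 = 2.241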

Background Material

Lucene TFIDF Similarity

Introduction To Information Retrieval, Chapter 6


mschoch commented Jul 31, 2020

@vijaygopal-couchbase the wording of this documentation is wrong/confusing. It is not the length of the token, but the number of tokens in the field. Clearly, using the phrase "token length" seems wrong, but the rest of the surrounding text is clearly talking about the number of tokens.


vijaygopal-couchbase commented Jul 31, 2020

Then it would mean the word "light" appears 9 times in the document "iron_city_brewing_co-ic_light". Hence 1/sqrt(9) results in 0.33. Is my understanding correct?


mschoch commented Jul 31, 2020

No, the number of times the word occurs in the document is the frequency. The norm is calculated from the length of the field, as measured by the number of tokens. Let's look at an example. I believe this is the original document corresponding to the example above:

{
  "name": "IC Light",
  "abv": 4.15,
  "ibu": 0,
  "srm": 0,
  "upc": 0,
  "type": "beer",
  "brewery_id": "iron_city_brewing_co",
  "updated": "2010-07-22 20:00:20",
  "description": "IC Light is an original, not a watered down copy, and brewed to be light from start to finish.",
  "style": "American-Style Light Lager",
  "category": "North American Lager"
}

After the analyzer removes stop words, the tokens in the description field are:

Pos  Term      Start  End
1    ic        1      3
2    light     4      9
5    original  16     24
8    watered   32     39
10   copy      45     49
12   brewed    55     61
15   light     68     73
17   start     79     84
19   finish    88     94

You can run this yourself here: http://analysis.blevesearch.com/analysis

As you can see, light has frequency 2, and the field has length 9.
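Plugging those two numbers into the formulas from the write-up above (a trivial Go fragment, using the math and fmt packages):

fmt.Println(math.Sqrt(2))     // tf for frequency 2: ~1.414
fmt.Println(1 / math.Sqrt(9)) // fieldNorm for a 9-token field: ~0.333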

I hope this helps.

@vijaygopal-couchbase

Thanks a lot Marty for solving the mystery :-)
