Add diminishing returns for notes in Solr scoring #412

Description

@duckduckgrayduck

Logged-in premium users have access to note search.
An example query:
https://www.documentcloud.org/documents/?q=Higganum+complaint

The first few documents rank so highly not because their content is especially relevant, but because they have a lot of notes containing the word "complaint". I tested in the Django shell with my user (which has note search enabled) and with users who do not have it. The scores for the top 5 documents Solr returns are as follows:

WITHOUT notes:
  id: 2704406 score: 51.575443
  id: 3535180 score: 45.707706
  id: 336686 score: 44.942238
  id: 26948698 score: 41.152092
  id: 6953986 score: 41.038723

WITH notes:
  id: 321087 score: 397.05875
  id: 3034115 score: 186.30943
  id: 28167531 score: 147.98016
  id: 23789873 score: 147.8687
  id: 1369433 score: 132.84674

Looking deeper into the score explanation:

doc 321087 has 106 notes matching 'complaint' in title:
note: 47030 title: Telephone-related complaint score: 3.3627408
note: 47031 title: Telephone-related complaint score: 3.3627408
note: 47032 title: Telephone-related complaint score: 3.3627408
note: 47033 title: Telephone-related complaint score: 3.3627408
note: 47036 title: Telephone-related complaint score: 3.3627408

We should add some kind of diminishing-returns mechanism here, so that the first few note hits affect the score but each additional hit contributes less.
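One possible shape for this, as a rough sketch only: rank a document's matching note scores from highest to lowest and weight each one by a geometrically shrinking factor before summing. The function name `damped_note_score` and the `decay` parameter are hypothetical, and this is plain Python illustrating the arithmetic, not an existing Solr feature or how the scoring is currently wired up. The 106 identical scores below are an assumed input based on the five notes listed above.

```python
def damped_note_score(note_scores, decay=0.5):
    """Sum note scores with geometrically shrinking weights.

    The highest-scoring note counts in full, the next one is multiplied
    by `decay`, the one after that by `decay ** 2`, and so on.
    `decay` is a hypothetical tuning knob, not an existing setting.
    """
    if not 0 <= decay < 1:
        raise ValueError("decay must be in [0, 1)")
    ranked = sorted(note_scores, reverse=True)
    return sum(score * decay ** rank for rank, score in enumerate(ranked))


# Illustrative input: 106 notes, each ASSUMED to score 3.3627408
# (only five of the matching notes are listed in this issue).
scores = [3.3627408] * 106
plain_total = sum(scores)                # every note counts in full
damped_total = damped_note_score(scores)  # later notes count less and less
```

With equal note scores `s`, the damped total can never exceed `s / (1 - decay)`, so with `decay=0.5` a document with 106 such notes would get at most about twice the contribution of a single note, while a document with one matching note is unaffected. Other curves (for example a logarithm of the note count) would work too; the right choice and how to express it in the Solr query still need discussion.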

Metadata

Assignees

No one assigned

    Labels

    planning (Review at project planning meeting), solr
