
@Joe631
Last active December 22, 2024 01:56

Joe C
if i have a distribution of numbers from 0 to 800, some less than 1, what is the best way to put them on a scale from 0 to 1

To normalize a distribution of numbers from 0 to 800 to a scale from 0 to 1, you can use the min-max normalization technique. This method scales the data based on the minimum and maximum values in the dataset. Here's the formula for min-max normalization:

normalized_value = (value - min) / (max - min)

In your case, the minimum value (min) is 0, and the maximum value (max) is 800. So the formula simplifies to:

normalized_value = value / 800

For each number in your dataset, divide it by 800 to get the corresponding normalized value between 0 and 1.
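Because the minimum here is 0, the division-by-800 shortcut works; if the minimum were not 0, you would keep the full formula. A quick sketch of the general case (the helper name is illustrative):

```python
def min_max_normalize(values):
    # Min-max normalization: maps min(values) -> 0 and max(values) -> 1.
    # Assumes at least two distinct values (otherwise max - min is 0).
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 410, 810]))  # [0.0, 0.5, 1.0]
```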

Here's an example in Python:

data = [0, 1, 800, 400, 0.5]
normalized_data = [x / 800 for x in data]

print(normalized_data)

Output:


[0.0, 0.00125, 1.0, 0.5, 0.000625]

In this example, the numbers in the data list are normalized to a scale between 0 and 1 in the normalized_data list using a list comprehension.

Joe C
what trigram similarity score would 'skarsgard' and 'bill skarsgard' have?

Trigram similarity, also known as trigram similarity index, is a measure of the similarity between two strings based on the number of trigrams (three-character sequences) they share. To calculate the trigram similarity score, you can use the Jaccard index, which is the ratio of the number of common trigrams to the total unique trigrams in both strings.

Let's calculate the trigram similarity score between skarsgard and bill skarsgard:

Extract trigrams from both strings:
    skarsgard: ['ska', 'kar', 'ars', 'rsg', 'sga', 'gar', 'ard']
    bill skarsgard: ['bil', 'ill', 'll ', 'l s', ' sk', 'ska', 'kar', 'ars', 'rsg', 'sga', 'gar', 'ard']

Find the number of common trigrams:
    Common trigrams: ['ska', 'kar', 'ars', 'rsg', 'sga', 'gar', 'ard']
    Number of common trigrams: 7

Find the total number of unique trigrams in both strings:
    Unique trigrams: ['bil', 'ill', 'll ', 'l s', ' sk', 'ska', 'kar', 'ars', 'rsg', 'sga', 'gar', 'ard']
    Number of unique trigrams: 12

Calculate the Jaccard index:

Jaccard index = (Number of common trigrams) / (Number of unique trigrams) = 7 / 12 = 0.5833 (rounded to four decimal places)
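These steps can be reproduced with a small sketch (simplified: unlike pg_trgm, this does no padding or lowercasing, so scores can differ slightly from Postgres):

```python
def trigrams(s):
    # All three-character windows of s (assumes len(s) >= 3).
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_jaccard(a, b):
    # Jaccard index: shared trigrams over the union of both trigram sets.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(round(trigram_jaccard('skarsgard', 'bill skarsgard'), 4))  # 0.5833
```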

The trigram similarity score between 'skarsgard' and 'bill skarsgard' is approximately 0.5833.

Joe C
is 'alex skarsgard' closer in trigram similarity to 'bill skarsgard' or 'alexander skarsgard'?

Let's calculate the trigram similarity scores between 'alex skarsgard' and 'bill skarsgard', and 'alex skarsgard' and 'alexander skarsgard':

Extract trigrams from all three strings:
    alex skarsgard: ['ale', 'lex', 'ex ', 'x s', ' sk', 'ska', 'kar', 'ars', 'rsg', 'sga', 'gar', 'ard']
    bill skarsgard: ['bil', 'ill', 'll ', 'l s', ' sk', 'ska', 'kar', 'ars', 'rsg', 'sga', 'gar', 'ard']
    alexander skarsgard: ['ale', 'lex', 'exa', 'xan', 'and', 'nde', 'der', 'er ', 'r s', ' sk', 'ska', 'kar', 'ars', 'rsg', 'sga', 'gar', 'ard']

Find the number of common trigrams between 'alex skarsgard' and 'bill skarsgard':
    Common trigrams: [' sk', 'ska', 'kar', 'ars', 'rsg', 'sga', 'gar', 'ard']
    Number of common trigrams: 8

Find the number of common trigrams between 'alex skarsgard' and 'alexander skarsgard':
    Common trigrams: ['ale', 'lex', ' sk', 'ska', 'kar', 'ars', 'rsg', 'sga', 'gar', 'ard']
    Number of common trigrams: 10

Find the total number of unique trigrams between 'alex skarsgard' and 'bill skarsgard':
    Unique trigrams: ['ale', 'lex', 'ex ', 'x s', 'bil', 'ill', 'll ', 'l s', ' sk', 'ska', 'kar', 'ars', 'rsg', 'sga', 'gar', 'ard']
    Number of unique trigrams: 16

Find the total number of unique trigrams between 'alex skarsgard' and 'alexander skarsgard':
    Unique trigrams: ['ale', 'lex', 'ex ', 'x s', 'exa', 'xan', 'and', 'nde', 'der', 'er ', 'r s', ' sk', 'ska', 'kar', 'ars', 'rsg', 'sga', 'gar', 'ard']
    Number of unique trigrams: 19

Calculate the Jaccard indices:
    Jaccard index (alex skarsgard vs. bill skarsgard) = 8 / 16 = 0.5
    Jaccard index (alex skarsgard vs. alexander skarsgard) = 10 / 19 ≈ 0.5263 (rounded to four decimal places)

Based on the trigram similarity scores, 'alex skarsgard' is closer in trigram similarity to 'alexander skarsgard' (0.5263) than to 'bill skarsgard' (0.5).

Joe C
it seems like 'alex skarsgard' should be a much closer match. is there another way we could compare the closeness of two strings in Postgres that might be better?

In PostgreSQL, you can use the pg_trgm extension to calculate trigram similarity between strings. However, as you mentioned, trigram similarity may not always provide the best results for comparing the closeness of two strings.

An alternative method is using the Levenshtein distance, which calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. The smaller the Levenshtein distance, the more similar the two strings are.

PostgreSQL provides the levenshtein function as part of the fuzzystrmatch module. To use it, you first need to enable the module by running:

sql

CREATE EXTENSION fuzzystrmatch;

After enabling the fuzzystrmatch module, you can use the levenshtein function to compare strings:

sql

SELECT levenshtein('alex skarsgard', 'bill skarsgard') AS distance1, levenshtein('alex skarsgard', 'alexander skarsgard') AS distance2;

This query will return the Levenshtein distances between 'alex skarsgard' and 'bill skarsgard', and between 'alex skarsgard' and 'alexander skarsgard'. The smaller distance indicates a closer match.

To find the similarity based on Levenshtein distance, you can normalize it by dividing it by the maximum possible distance (length of the longer string) and subtracting the result from 1:

sql

SELECT (1.0 - (levenshtein('alex skarsgard', 'bill skarsgard')::float / GREATEST(length('alex skarsgard'), length('bill skarsgard')))) AS similarity1, (1.0 - (levenshtein('alex skarsgard', 'alexander skarsgard')::float / GREATEST(length('alex skarsgard'), length('alexander skarsgard')))) AS similarity2;

This query will return the Levenshtein similarity scores between 'alex skarsgard' and 'bill skarsgard', and between 'alex skarsgard' and 'alexander skarsgard'. The higher score indicates a closer match.
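As an offline sanity check of what those queries compute, here is a pure-Python sketch of the same metric (not the Postgres function itself):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute, cost 1 each).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def norm_similarity(a, b):
    # 1 - distance / length of the longer string, as in the SQL above.
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein('alex skarsgard', 'bill skarsgard'))       # 4
print(levenshtein('alex skarsgard', 'alexander skarsgard'))  # 5
print(round(norm_similarity('alex skarsgard', 'bill skarsgard'), 4))       # 0.7143
print(round(norm_similarity('alex skarsgard', 'alexander skarsgard'), 4))  # 0.7368
```

Note that the raw distances alone would rank 'bill skarsgard' closer (4 vs 5); it is the length normalization that ranks 'alexander skarsgard' higher (0.7368 vs 0.7143).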

Using the Levenshtein distance can provide a better comparison of string closeness when trigram similarity is not sufficient.

Joe C
can you show me how to edit this code to use levenshtein distance instead of TrigramSimilarity to calculate similarity

class PersonFilter(django_filters.FilterSet):
    name = django_filters.CharFilter(method='filter_by_name')

    class Meta:
        model = Person
        fields = (
            'searchable_name',
        )

    def filter_by_name(self, queryset, name, value):
        query_len = len(value)
        query = unidecode(value.replace(' ', ''))

        if query_len <= 3:
            popularity_weight = 1.0
        elif query_len <= 6:
            popularity_weight = 0.5
        else:
            popularity_weight = 0.25

        similarity_weight = 1.0 - popularity_weight

        return queryset\
            .filter(
                searchable_name__trigram_similar=query)\
            .annotate(
                similarity=TrigramSimilarity('searchable_name', query),
                popularity_normalized=Cast(
                    Case(
                        When(popularity__lt=1, then=Value(1)),
                        When(popularity__gt=100, then=Value(100)),
                        default=F('popularity')),
                    output_field=FloatField(),
                ),
                search_score=(
                    (F('similarity') * similarity_weight) +
                    (F('popularity_normalized') / 100 * popularity_weight))
            )

To use the Levenshtein distance instead of trigram similarity in the provided code, you'll need to make the following changes:

Import the Func, Greatest, and Length classes from django.db.models.
Replace the TrigramSimilarity annotation with a custom Levenshtein function using the Func class.
Calculate the similarity score by normalizing the Levenshtein distance and subtracting the result from 1.

Here's the modified code:

python

from django.db.models import Case, FloatField, F, Func, Value, When, Cast, Length, Greatest
from django_filters import rest_framework as filters
import django_filters
from unidecode import unidecode

class Levenshtein(Func):
    function = 'levenshtein'
    arity = 2

class PersonFilter(filters.FilterSet):
    name = django_filters.CharFilter(method='filter_by_name')

    class Meta:
        model = Person
        fields = (
            'searchable_name',
        )

    def filter_by_name(self, queryset, name, value):
        query_len = len(value)
        query = unidecode(value.replace(' ', ''))

        if query_len <= 3:
            popularity_weight = 1.0
        elif query_len <= 6:
            popularity_weight = 0.5
        else:
            popularity_weight = 0.25

        similarity_weight = 1.0 - popularity_weight

        return queryset\
            .annotate(
                levenshtein_distance=Levenshtein('searchable_name', query),
                max_distance=Greatest(Length('searchable_name'), Length(Value(query))),
                similarity=1 - (F('levenshtein_distance') / F('max_distance')),
                popularity_normalized=Cast(
                    Case(
                        When(popularity__lt=1, then=Value(1)),
                        When(popularity__gt=100, then=Value(100)),
                        default=F('popularity')),
                    output_field=FloatField(),
                ),
                search_score=(
                    (F('similarity') * similarity_weight) +
                    (F('popularity_normalized') / 100 * popularity_weight))
            )\
            .filter(similarity__gt=0.1)  # Adjust the threshold to filter out low similarity results

In this code, I've added a custom Levenshtein function using the Func class, which will call PostgreSQL's levenshtein function when executed. The similarity is calculated by normalizing the Levenshtein distance and subtracting the result from 1. The final queryset is filtered based on a similarity threshold (0.1 in this example) to remove low similarity results. You can adjust this threshold according to your requirements.

Joe C
what range will similarity be?

The similarity value in the modified code will be in the range of 0 to 1, inclusive.

The similarity is calculated as follows:


similarity = 1 - (levenshtein_distance / max_distance)

Since the Levenshtein distance is the minimum number of single-character edits required to transform one string into another, it will be at least 0. The maximum possible Levenshtein distance is the length of the longer string (max_distance).

When the two strings are identical, the Levenshtein distance will be 0, and the similarity will be 1 (perfect similarity). When there's no overlap between the two strings and the Levenshtein distance is equal to the length of the longer string, the similarity will be 0 (no similarity).

As the Levenshtein distance ranges from 0 to the length of the longer string, the similarity value will range from 0 to 1, where 0 means no similarity and 1 means a perfect match.

Joe C
I can't import Greatest or Length from django.db.models

My apologies for the confusion. I realized that I made a mistake in my previous response. In Django, you don't need to import Greatest or Length from django.db.models. Instead, you can use the Length function from django.db.models.functions and the Greatest function from django.db.models.functions.comparison.

Here's the corrected import statement:

python

from django.db.models.functions import Length
from django.db.models.functions.comparison import Greatest

Now you can use Length and Greatest in your code as shown in the previous example:

python

max_distance=Greatest(Length('searchable_name'), Length(Value(query))),
similarity=1 - (F('levenshtein_distance') / F('max_distance')),

With these updated imports, the code should work as intended.

Joe C
after running this code with the query 'rachel' I get this error: Cannot resolve keyword 'rachel' into field. can you help me fix this?

My apologies for the oversight in my previous response. The issue occurs because the query variable is being used directly in the Length function, which is not allowed. Instead, you should wrap the query value with the Value function from django.db.models to pass it as a constant value.

Here's the corrected code snippet:

python

...

return queryset\
    .annotate(
        levenshtein_distance=Levenshtein('searchable_name', query),
        max_distance=Greatest(Length('searchable_name'), Length(Value(query))),
        similarity=1 - (F('levenshtein_distance') / F('max_distance')),
        popularity_normalized=Cast(
            Case(
                When(popularity__lt=1, then=Value(1)),
                When(popularity__gt=100, then=Value(100)),
                default=F('popularity')),
            output_field=FloatField(),
        ),
        search_score=(
            (F('similarity') * similarity_weight) +
            (F('popularity_normalized') / 100 * popularity_weight))
    )\
    .filter(similarity__gt=0.1)  # Adjust the threshold to filter out low similarity results

Now, the query variable is properly wrapped in the Length function with the Value function, and the code should work without raising an error.

Joe C
what does Length('searchable_name') do?

The Length function in Django is an SQL function that returns the length of the provided text field. In this case, Length('searchable_name') is used to get the length of the searchable_name field for each record in the queryset.

Length('searchable_name') is used in combination with the Greatest function to calculate the maximum possible Levenshtein distance between the query string and the searchable_name field, which is necessary for normalizing the Levenshtein distance to calculate the similarity score. The normalization is done by dividing the Levenshtein distance by the maximum possible distance (the length of the longer string) and subtracting the result from 1.

Here's the part of the code that uses Length('searchable_name'):

python

max_distance=Greatest(Length('searchable_name'), Length(Value(query))),
similarity=1 - (F('levenshtein_distance') / F('max_distance')),

This code snippet calculates the max_distance by taking the greater value between the length of the searchable_name field and the length of the query string. Then it calculates the similarity score by normalizing the Levenshtein distance.

Joe C
it still doesn't seem right, let me post the code again:

def filter_by_name(self, queryset, name, value):
    query_len = len(value)
    query = unidecode(value.replace(' ', ''))

    if query_len <= 3:
        popularity_weight = 1.0
    elif query_len <= 6:
        popularity_weight = 0.5
    else:
        popularity_weight = 0.25

    similarity_weight = 1.0 - popularity_weight

    return queryset\
        .filter(
            searchable_name__trigram_similar=query)\
        .annotate(
            levenshtein_distance=Levenshtein('searchable_name', query),
            max_distance=Greatest(Length('searchable_name'), Length(Value(query))),
            similarity=1 - (F('levenshtein_distance') / F('max_distance')),
            popularity_normalized=Cast(
                Case(
                    When(popularity__lt=1, then=Value(1)),
                    When(popularity__gt=100, then=Value(100)),
                    default=F('popularity')),
                output_field=FloatField(),
            ),
            search_score=(
                (F('similarity') * similarity_weight) +
                (F('popularity_normalized') / 100 * popularity_weight))
        )

I apologize for the confusion in my previous responses. I realized that I misunderstood your question initially. The provided code still uses the trigram similarity for filtering the queryset. To update the code to use the Levenshtein distance for filtering, you should remove the .filter(searchable_name__trigram_similar=query) line and filter based on the calculated similarity score instead.

Here's the corrected code:

python

def filter_by_name(self, queryset, name, value):
    query_len = len(value)
    query = unidecode(value.replace(' ', ''))

    if query_len <= 3:
        popularity_weight = 1.0
    elif query_len <= 6:
        popularity_weight = 0.5
    else:
        popularity_weight = 0.25

    similarity_weight = 1.0 - popularity_weight

    return queryset\
        .annotate(
            levenshtein_distance=Levenshtein('searchable_name', query),
            max_distance=Greatest(Length('searchable_name'), Length(Value(query))),
            similarity=1 - (F('levenshtein_distance') / F('max_distance')),
            popularity_normalized=Cast(
                Case(
                    When(popularity__lt=1, then=Value(1)),
                    When(popularity__gt=100, then=Value(100)),
                    default=F('popularity')),
                output_field=FloatField(),
            ),
            search_score=(
                (F('similarity') * similarity_weight) +
                (F('popularity_normalized') / 100 * popularity_weight))
        )\
        .filter(similarity__gt=0.1)  # Adjust the threshold to filter out low similarity results

In this code, I removed the .filter(searchable_name__trigram_similar=query) line and added a filter based on the similarity score after the annotate call. Now the queryset is filtered based on the calculated Levenshtein similarity instead of trigram similarity. You can adjust the similarity threshold (0.1 in this example) according to your requirements.

Joe C
can I keep the trigram filter so I can use the GIN index?

Yes, you can keep the trigram filter to take advantage of the GIN index for better performance. In that case, you can use the trigram filter for an initial rough filtering and then apply the Levenshtein distance calculation for more accurate similarity scoring.

Here's the code with the trigram filter and Levenshtein distance calculation combined:

python

def filter_by_name(self, queryset, name, value):
    query_len = len(value)
    query = unidecode(value.replace(' ', ''))

    if query_len <= 3:
        popularity_weight = 1.0
    elif query_len <= 6:
        popularity_weight = 0.5
    else:
        popularity_weight = 0.25

    similarity_weight = 1.0 - popularity_weight

    return queryset\
        .filter(searchable_name__trigram_similar=query)\
        .annotate(
            levenshtein_distance=Levenshtein('searchable_name', query),
            max_distance=Greatest(Length('searchable_name'), Length(Value(query))),
            similarity=1 - (F('levenshtein_distance') / F('max_distance')),
            popularity_normalized=Cast(
                Case(
                    When(popularity__lt=1, then=Value(1)),
                    When(popularity__gt=100, then=Value(100)),
                    default=F('popularity')),
                output_field=FloatField(),
            ),
            search_score=(
                (F('similarity') * similarity_weight) +
                (F('popularity_normalized') / 100 * popularity_weight))
        )\
        .filter(similarity__gt=0.1)  # Adjust the threshold to filter out low similarity results

In this code, the trigram filter is kept for an initial filtering based on the GIN index. The Levenshtein distance calculation and similarity scoring are applied after the trigram filtering. The queryset is then filtered based on the similarity score calculated using the Levenshtein distance.

Joe C
that code still produces this error: Exception Value: Cannot resolve keyword 'rachel' into field.

I apologize for the confusion. It seems the issue is related to using the query variable directly in the annotate() method. You should wrap the query variable using the Value function from django.db.models when calculating the Levenshtein distance.

Here's the corrected code snippet:

python

return queryset\
    .filter(searchable_name__trigram_similar=query)\
    .annotate(
        levenshtein_distance=Levenshtein('searchable_name', Value(query)),
        max_distance=Greatest(Length('searchable_name'), Length(Value(query))),
        similarity=1 - (F('levenshtein_distance') / F('max_distance')),
        popularity_normalized=Cast(
            Case(
                When(popularity__lt=1, then=Value(1)),
                When(popularity__gt=100, then=Value(100)),
                default=F('popularity')),
            output_field=FloatField(),
        ),
        search_score=(
            (F('similarity') * similarity_weight) +
            (F('popularity_normalized') / 100 * popularity_weight))
    )\
    .filter(similarity__gt=0.1)  # Adjust the threshold to filter out low similarity results

Now, the query variable is properly wrapped with the Value function when calculating the Levenshtein distance, and the code should work without raising an error.

Joe C
now I got this error: Exception Value: Expression contains mixed types. You must set output_field.

I apologize for the confusion. The error occurs because the similarity expression in the annotate() method involves mixed types. You need to set the output_field for the expression. In this case, you can use the ExpressionWrapper function from django.db.models to set the output field as FloatField.

Here's the corrected code snippet:

python

from django.db.models import ExpressionWrapper, FloatField

...

return queryset\
    .filter(searchable_name__trigram_similar=query)\
    .annotate(
        levenshtein_distance=Levenshtein('searchable_name', Value(query)),
        max_distance=Greatest(Length('searchable_name'), Length(Value(query))),
        similarity=ExpressionWrapper(
            1 - (F('levenshtein_distance') / F('max_distance')),
            output_field=FloatField()
        ),
        popularity_normalized=Cast(
            Case(
                When(popularity__lt=1, then=Value(1)),
                When(popularity__gt=100, then=Value(100)),
                default=F('popularity')),
            output_field=FloatField(),
        ),
        search_score=(
            (F('similarity') * similarity_weight) +
            (F('popularity_normalized') / 100 * popularity_weight))
    )\
    .filter(similarity__gt=0.1)  # Adjust the threshold to filter out low similarity results

Now, the similarity expression is wrapped with ExpressionWrapper and the output field is set to FloatField. The code should work without raising an error.

Joe C
what's the Levenshtein distance of 'rachel winter' and 'rachel weiss'?

The Levenshtein distance between two strings is the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other. Let's calculate the Levenshtein distance between 'rachel winter' and 'rachel weiss':

Substitute 'i' with 'e': 'rachel winter' → 'rachel wenter'
Substitute 'n' with 'i': 'rachel wenter' → 'rachel weiter'
Substitute 't' with 's': 'rachel weiter' → 'rachel weiser'
Substitute 'e' with 's': 'rachel weiser' → 'rachel weissr'
Delete the trailing 'r': 'rachel weissr' → 'rachel weiss'

It takes 5 operations to transform 'rachel winter' into 'rachel weiss'. So, the Levenshtein distance between 'rachel winter' and 'rachel weiss' is 5.

Joe C
in the above code why do we find the max length of searchable_name?

My apologies for the confusion in my previous responses. To clarify: the code doesn't use the length of searchable_name on its own. For each record, it takes the greater of the length of that record's searchable_name and the length of the query string. The purpose of finding this maximum length is to normalize the Levenshtein distance when calculating the similarity score.

In the code snippet, the max_distance is calculated as the greater value between the length of the searchable_name field and the length of the query string:

python

max_distance=Greatest(Length('searchable_name'), Length(Value(query))),

This is used to normalize the Levenshtein distance when calculating the similarity score:

python

similarity=1 - (F('levenshtein_distance') / F('max_distance')),

Normalizing the Levenshtein distance helps to ensure that the similarity score is on a scale between 0 and 1, where 0 means completely dissimilar and 1 means identical. By dividing the Levenshtein distance by the maximum possible distance (the length of the longer string) and subtracting the result from 1, we get a similarity score that takes the length of the strings into account.

Joe C
show me then how you would calculate similarity for 'rachel winter' and 'rachel weiss'

To calculate the similarity between 'rachel winter' and 'rachel weiss' using the normalized Levenshtein distance, follow these steps:

Calculate the Levenshtein distance.
Determine the maximum length between the two strings.
Normalize the Levenshtein distance by dividing it by the maximum length.
Subtract the normalized Levenshtein distance from 1 to get the similarity score.

We have already calculated the Levenshtein distance between 'rachel winter' and 'rachel weiss' as 5.

Now, let's determine the maximum length between the two strings:

Length of 'rachel winter': 13
Length of 'rachel weiss': 12
Maximum length: 13

Next, normalize the Levenshtein distance:

Normalized Levenshtein distance = Levenshtein distance / Maximum length = 5 / 13 ≈ 0.3846

Finally, calculate the similarity score:

Similarity score = 1 - Normalized Levenshtein distance = 1 - 0.3846 ≈ 0.6154

So, the similarity between 'rachel winter' and 'rachel weiss' is approximately 0.6154, where 0 means completely dissimilar and 1 means identical.

Joe C
show me then how you would calculate similarity for 'rachel winter' and 'rachelwinder'

To calculate the similarity between 'rachel winter' and 'rachelwinder'

Joe C
show me then how you would calculate similarity for 'rachel winter' and 'rachelwinter'

To calculate the similarity between 'rachel winter' and 'rachelwinter' using the normalized Levenshtein distance, follow these steps:

Calculate the Levenshtein distance.
Determine the maximum length between the two strings.
Normalize the Levenshtein distance by dividing it by the maximum length.
Subtract the normalized Levenshtein distance from 1 to get the similarity score.

First, let's calculate the Levenshtein distance between 'rachel winter' and 'rachelwinter':

Remove the space in 'rachel winter': 'rachelwinter'

It takes 1 operation (deletion) to transform 'rachel winter' into 'rachelwinter'. So, the Levenshtein distance between 'rachel winter' and 'rachelwinter' is 1.

Now, let's determine the maximum length between the two strings:

Length of 'rachel winter': 13
Length of 'rachelwinter': 12
Maximum length: 13

Next, normalize the Levenshtein distance:

Normalized Levenshtein distance = Levenshtein distance / Maximum length = 1 / 13 ≈ 0.0769

Finally, calculate the similarity score:

Similarity score = 1 - Normalized Levenshtein distance = 1 - 0.0769 ≈ 0.9231
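Both worked examples can be double-checked with a pure-Python sketch of the metric (the distance helper is repeated here so the snippet is self-contained):

```python
def levenshtein(a, b):
    # Dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def norm_similarity(a, b):
    # 1 - distance / length of the longer string.
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(round(norm_similarity('rachel winter', 'rachel weiss'), 4))  # 0.6154
print(round(norm_similarity('rachel winter', 'rachelwinter'), 4))  # 0.9231
```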

So, the similarity between 'rachel winter' and 'rachelwinter' is approximately 0.9231, where 0 means completely dissimilar and 1 means identical.

Joe C
can you write some tests for the filter_by_name method?

To write tests for the filter_by_name method, you can create a test file for the PersonFilter class and use Django's built-in test framework. Here's a simple example:

Create a new file named test_filters.py in the same directory where your filters.py is located.
Add the following code to test_filters.py:

python

from django.test import TestCase

from .models import Person
from .filters import PersonFilter

class PersonFilterTestCase(TestCase):
    def setUp(self):
        self.person1 = Person.objects.create(searchable_name='rachel winter', popularity=10)
        self.person2 = Person.objects.create(searchable_name='rachel weiss', popularity=50)
        self.person3 = Person.objects.create(searchable_name='rachelwinter', popularity=30)
        self.person4 = Person.objects.create(searchable_name='alex winter', popularity=20)

    def test_filter_by_name_exact_match(self):
        filter = PersonFilter({'name': 'rachel winter'})
        results = filter.qs
        self.assertEqual(len(results), 1)
        self.assertEqual(results.first(), self.person1)

    def test_filter_by_name_partial_match(self):
        filter = PersonFilter({'name': 'rachel w'})
        results = filter.qs
        self.assertEqual(len(results), 2)
        self.assertIn(self.person1, results)
        self.assertIn(self.person2, results)

    def test_filter_by_name_no_match(self):
        filter = PersonFilter({'name': 'does not exist'})
        results = filter.qs
        self.assertEqual(len(results), 0)

    def test_filter_by_name_similar(self):
        filter = PersonFilter({'name': 'rachelwinter'})
        results = filter.qs
        self.assertEqual(len(results), 2)
        self.assertIn(self.person1, results)
        self.assertIn(self.person3, results)

    # You can add more test cases as needed.

These test cases cover some basic scenarios:

test_filter_by_name_exact_match: Test if the filter returns the correct person when the exact name is given.
test_filter_by_name_partial_match: Test if the filter returns all persons with a partially matching name.
test_filter_by_name_no_match: Test if the filter returns no results when there is no matching name.
test_filter_by_name_similar: Test if the filter returns similar names based on the similarity score.

Make sure to adjust the test cases as necessary to fit your specific use case and requirements. Run the tests using the following command:

bash

python manage.py test yourapp.tests.test_filters

Replace yourapp with the name of your Django app containing the PersonFilter class.

Joe C
if 'rachel winter' has a popularity of 0.6 and 'rachel weiss' has a popularity of 38, when I search 'rachel weiss', which record would have a higher search score?

To answer this question, we need to calculate the search scores for both 'rachel winter' and 'rachel weiss' using the provided popularity values and the similarity scores based on the Levenshtein distance. In this case, I'll use the search score formula provided in the initial code snippet:

python

search_score = (similarity * similarity_weight) + (popularity_normalized / 100 * popularity_weight)

First, let's calculate the similarity scores based on the Levenshtein distance, using the query 'rachel weiss':

'rachel winter' -> Levenshtein distance: 5 (previously calculated)
'rachel weiss' -> Levenshtein distance: 0 (identical strings)

Now, for each record, take the greater of the record's length and the query's length (the query 'rachel weiss' is 12 characters long):

'rachel winter' (13) vs. query (12) → max length: 13
'rachel weiss' (12) vs. query (12) → max length: 12

Next, normalize the Levenshtein distance and calculate the similarity scores:

'rachel winter' -> Normalized distance: 5 / 13 ≈ 0.3846 -> Similarity score: 1 - 0.3846 ≈ 0.6154
'rachel weiss' -> Normalized distance: 0 / 12 = 0 -> Similarity score: 1 - 0 = 1

Now, let's calculate the search scores using the given popularity values and the similarity scores. We need to determine the popularity_weight and similarity_weight using the query length:

Query length: 12
Since the query length is > 6, popularity_weight = 0.25 and similarity_weight = 0.75

Finally, let's calculate the search scores for both records:

'rachel winter':
    popularity_normalized: min(max(0.6, 1), 100) = 1
    search_score = (0.6154 * 0.75) + (1 / 100 * 0.25) ≈ 0.4616 + 0.0025 = 0.4641

'rachel weiss':
    popularity_normalized: min(max(38, 1), 100) = 38
    search_score = (1 * 0.75) + (38 / 100 * 0.25) = 0.75 + 0.095 = 0.845
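The arithmetic above can be sketched as follows (similarity values carried over from the Levenshtein calculations; the helper name is illustrative):

```python
def search_score(similarity, popularity, popularity_weight):
    # Mirrors the annotate() expression: clamp popularity to [1, 100],
    # then mix similarity and normalized popularity by their weights.
    similarity_weight = 1.0 - popularity_weight
    popularity_normalized = min(max(popularity, 1), 100)
    return similarity * similarity_weight + popularity_normalized / 100 * popularity_weight

# Query 'rachel weiss' is 12 characters long, so popularity_weight = 0.25.
print(search_score(1 - 5 / 13, 0.6, 0.25))  # 'rachel winter' -> ~0.464
print(search_score(1.0, 38, 0.25))          # 'rachel weiss'  -> ~0.845
```

(The 0.4641 above differs in the last digit because it was computed from the similarity already rounded to 0.6154.)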

In this case, 'rachel weiss' has a higher search score (0.845) compared to 'rachel winter' (0.4641) when searching for 'rachel weiss'.

Joe C
if 'rachel winter' has a popularity of 0.6 and 'rachel weiss' has a popularity of 38, when I search 'rachel winter', which record would have a higher search score?

To answer this question, we'll calculate the search scores for both 'rachel winter' and 'rachel weiss' using the provided popularity values and the similarity scores based on the Levenshtein distance, using the query 'rachel winter':

First, let's calculate the similarity scores based on the Levenshtein distance:

'rachel winter' -> Levenshtein distance: 0 (identical strings)
'rachel weiss' -> Levenshtein distance: 5 (previously calculated)

Now, for each record, take the greater of the record's length and the query's length (the query 'rachel winter' is 13 characters long):

'rachel winter' (13) vs. query (13) → max length: 13
'rachel weiss' (12) vs. query (13) → max length: 13

Next, normalize the Levenshtein distance and calculate the similarity scores:

'rachel winter' -> Normalized distance: 0 / 13 = 0 -> Similarity score: 1 - 0 = 1
'rachel weiss' -> Normalized distance: 5 / 13 ≈ 0.3846 -> Similarity score: 1 - 0.3846 ≈ 0.6154

Now, let's calculate the search scores using the given popularity values and the similarity scores. We need to determine the popularity_weight and similarity_weight using the query length:

Query length: 13
Since the query length is > 6, popularity_weight = 0.25 and similarity_weight = 0.75
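As a side note, this length-based weighting rule can be sketched as a small helper (weights is a hypothetical name used only for illustration; the thresholds mirror the filter_by_name code later in the thread):

```python
def weights(query):
    # Returns (popularity_weight, similarity_weight) based on query length:
    # short queries lean on popularity, long queries lean on similarity
    n = len(query)
    if n <= 3:
        popularity_weight = 1.0
    elif n <= 6:
        popularity_weight = 0.5
    else:
        popularity_weight = 0.25
    return popularity_weight, 1.0 - popularity_weight

print(weights('rachel winter'))  # (0.25, 0.75)
```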

Finally, let's calculate the search scores for both records:

'rachel winter':
    popularity_normalized: min(max(0.6, 1), 100) = 1
    search_score = (1 * 0.75) + (1 / 100 * 0.25) = 0.75 + 0.0025 = 0.7525

'rachel weiss':
    popularity_normalized: min(max(38, 1), 100) = 38
    search_score = (0.6154 * 0.75) + (38 / 100 * 0.25) ≈ 0.4616 + 0.095 = 0.5566

In this case, when searching for 'rachel winter', 'rachel winter' has a higher search score (0.7525) compared to 'rachel weiss' (0.5566).

Joe C
can you show me the postgres execution plan for this django code

def filter_by_name(self, queryset, name, value):
    query_len = len(value)

    if query_len <= 3:
        popularity_weight = 1.0
    elif query_len <= 6 and query_len > 3:
        popularity_weight = 0.5
    else:
        popularity_weight = 0.25

    similarity_weight = 1.0 - popularity_weight

    return queryset\
        .filter(searchable_name__trigram_similar=unidecode(value.replace(' ', '')))\
        .annotate(
            levenshtein_distance=Levenshtein('searchable_name', Value(value)),
            max_distance=Greatest(Length('searchable_name'), Length(Value(value))),
            similarity=ExpressionWrapper(
                1 - (F('levenshtein_distance') / F('max_distance')),
                output_field=FloatField()
            ),
            popularity_normalized=Cast(
                Case(
                    When(popularity__lt=1, then=Value(1)),
                    When(popularity__gt=100, then=Value(100)),
                    default=F('popularity')),
                output_field=FloatField(),
            ),
            search_score=(
                (F('similarity') * similarity_weight)
                + ((F('popularity_normalized') / 100) * popularity_weight))
        )

To show the PostgreSQL execution plan for the Django code, you'll need to execute the resulting SQL query with the EXPLAIN or EXPLAIN ANALYZE command in PostgreSQL. You can use the queryset.query attribute to see the generated SQL, but you might need to adjust it to make it compatible with PostgreSQL's EXPLAIN command.

Here's what you can do:

In your Django view or a separate script, execute the filter_by_name method with some example parameters, and print the resulting SQL query:

python

# Replace the parameters with appropriate values
queryset = Person.objects.all()
filtered_queryset = filter_by_name(queryset, "name", "rachel winter")

# Print the SQL query
print(str(filtered_queryset.query))

Copy the printed SQL query, remove any extra quotes, and wrap it with EXPLAIN or EXPLAIN ANALYZE:

sql

EXPLAIN SELECT ... FROM ... WHERE ...

Run the wrapped SQL query in your PostgreSQL database, using a SQL client or the psql command-line tool:

sh

psql your_database_name -c 'EXPLAIN SELECT ... FROM ... WHERE ...'

Replace your_database_name with the name of your PostgreSQL database.

This will give you the execution plan for the Django code you provided. Keep in mind that the plan may vary depending on your PostgreSQL version, database schema, indexes, and data distribution.

If you have difficulty with any of the steps, please let me know so I can provide further assistance.

Joe C
it seems like the results are coming back sorted by popularity instead of search_score. any idea why ?

It's possible that the results are not sorted by search_score because the order_by() method was not used in the queryset. To ensure that the results are sorted by search_score, you need to add the order_by() method with the appropriate field name.

In the provided filter_by_name method, update the return statement to include the order_by() method:

python

return queryset\
    .filter(searchable_name__trigram_similar=unidecode(value.replace(' ', '')))\
    .annotate(
        levenshtein_distance=Levenshtein('searchable_name', Value(value)),
        max_distance=Greatest(Length('searchable_name'), Length(Value(value))),
        similarity=ExpressionWrapper(
            1 - (F('levenshtein_distance') / F('max_distance')),
            output_field=FloatField()
        ),
        popularity_normalized=Cast(
            Case(
                When(popularity__lt=1, then=Value(1)),
                When(popularity__gt=100, then=Value(100)),
                default=F('popularity')),
            output_field=FloatField(),
        ),
        search_score=(
            (F('similarity') * similarity_weight)
            + ((F('popularity_normalized') / 100) * popularity_weight))
    )\
    .order_by('-search_score')  # Add this line

The -search_score parameter in the order_by() method ensures that the results are sorted in descending order by the search_score field. This should return the results sorted by search_score instead of popularity.

Joe C
i created a new column 'search_score' to the queryset. is it possible to add this to the Person Model even though search_score does not exist in the db ?

Yes, you can add a dynamically calculated search_score field to the Person model without storing it in the database. You can achieve this using Python's @property decorator for a model method. This method will not be stored in the database but will be accessible as a property of the model instance.

First, modify your Person model to include the search_score property:

python

from django.db import models

class Person(models.Model):
    # Your existing fields go here, like searchable_name, popularity, etc.

    @property
    def search_score(self):
        # You can calculate the search_score here based on the similarity and popularity values
        # However, you will need to pass the search query to this method or store it in the instance
        pass

Then, in your filter_by_name method, remove the search_score annotation and use the search_score property instead. However, keep in mind that you will need to pass the search query to the property or store it in the instance to calculate the search_score.

Please note that using a property like this may lead to performance issues since the search_score will be calculated in Python instead of the database. The performance hit will be more noticeable when working with large datasets. If possible, it's better to use the annotate method as you did earlier to calculate the search_score in the database.

Additionally, keep in mind that sorting by the search_score property will also need to be done in Python, rather than in the database. This can further affect performance when working with large datasets.

Joe C
can you show me what the new search_score property should look like based on the previous code?

Here's an example of how to implement the search_score property based on the previous code. Note that you will need to pass the search query as a parameter when initializing the Person model.

python

from django.db import models
# Note: Levenshtein here must be a plain-Python function that takes two strings
# and returns an int - the database Levenshtein function from the earlier
# queryset code cannot be called on plain Python values
from .utils import Levenshtein

class Person(models.Model):
    # Your existing fields go here, like searchable_name, popularity, etc.

    def __init__(self, *args, search_query=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.search_query = search_query

    @property
    def search_score(self):
        if not self.search_query:
            return None

        query_len = len(self.search_query)
        query = self.search_query.replace(' ', '')

        if query_len <= 3:
            popularity_weight = 1.0
        elif query_len <= 6:
            popularity_weight = 0.5
        else:
            popularity_weight = 0.25

        similarity_weight = 1.0 - popularity_weight

        levenshtein_distance = Levenshtein(self.searchable_name, query)
        max_distance = max(len(self.searchable_name), len(query))
        similarity = 1 - (levenshtein_distance / max_distance)

        popularity_normalized = min(max(self.popularity, 1), 100)

        search_score = (similarity * similarity_weight) + ((popularity_normalized / 100) * popularity_weight)
        return search_score

Now you can access the search_score property of a Person instance like this:

python

person = Person(searchable_name='John Doe', popularity=50, search_query='John')
print(person.search_score)

However, keep in mind that this implementation calculates the search_score in Python and not in the database. This can lead to performance issues when working with large datasets. It's recommended to use the annotate method to calculate the search_score in the database if possible.

Joe C
is it possible to sort by search_score without adding it to the model ?

Yes, it's possible to sort by search_score without adding it to the model. You can continue using the annotate() method to calculate the search_score in the database, and then use the order_by() method to sort the queryset by search_score.

Here's the modified filter_by_name method based on your previous code:

python

def filter_by_name(self, queryset, name, value):
    query_len = len(value)

    if query_len <= 3:
        popularity_weight = 1.0
    elif query_len <= 6 and query_len > 3:
        popularity_weight = 0.5
    else:
        popularity_weight = 0.25

    similarity_weight = 1.0 - popularity_weight

    return queryset\
        .filter(searchable_name__trigram_similar=unidecode(value.replace(' ', '')))\
        .annotate(
            levenshtein_distance=Levenshtein('searchable_name', Value(value)),
            max_distance=Greatest(Length('searchable_name'), Length(Value(value))),
            similarity=ExpressionWrapper(
                1 - (F('levenshtein_distance') / F('max_distance')),
                output_field=FloatField()
            ),
            popularity_normalized=Cast(
                Case(
                    When(popularity__lt=1, then=Value(1)),
                    When(popularity__gt=100, then=Value(100)),
                    default=F('popularity')),
                output_field=FloatField(),
            ),
            search_score=(
                (F('similarity') * similarity_weight)
                + ((F('popularity_normalized') / 100) * popularity_weight))
        )\
        .order_by('-search_score')  # Sorting by search_score

In this code, you're calculating the search_score using the annotate() method and then sorting the queryset by the search_score using the order_by() method. This approach will allow you to sort by search_score without adding it to the model while also maintaining performance, as the calculations are done in the database.

Joe C
would calling that function like this also work :

def get_person_list(*, filters=None):
    qs = Person.objects.all()

if filters is not None:
    qs = PersonFilter(filters, qs).qs

return qs.order_by("-search_score")

Yes, your get_person_list function would work, but with a small adjustment. Since search_score is annotated (and the ordering applied) inside the filter_by_name method of the PersonFilter class, it's already part of the queryset returned by PersonFilter(filters, qs).qs, so you don't need to call order_by again. More importantly, if filters is None the search_score annotation never gets added, and qs.order_by("-search_score") would raise a FieldError when the queryset is evaluated. (Also note the signature should give filters a default value, filters=None, rather than the annotation filters: None.)

Here's the updated get_person_list function:

python

def get_person_list(*, filters=None):
    qs = Person.objects.all()

if filters is not None:
    qs = PersonFilter(filters, qs).qs

return qs

Now when you call get_person_list(filters=filters), it will return the queryset with the filters applied, including the search_score annotation and sorting by -search_score.

Joe C
is there anything in this code that would reorder the results

class PersonListApi(ServiceExceptionHandlerMixin, views.APIView):
    class Pagination(PageNumberPagination):
        pass

    class FilterSerializer(serializers.Serializer):
        name = serializers.CharField(required=False)

    def get(self, request, *args, **kwargs):
        filter_serializer = self.FilterSerializer(data=request.query_params)
        filter_serializer.is_valid(raise_exception=True)

        movies = get_person_list(
            filters=filter_serializer.validated_data
        )

        return paginate_queryset(
            pagination_class=self.Pagination,
            serializer_class=PersonBasicInfoSerializer,
            queryset=movies,
            request=request,
            view=self
        )

In the provided code, there's nothing that would reorder the results. The get_person_list function returns a queryset with the search_score annotation and the sorting by -search_score.

However, when you pass the queryset to the paginate_queryset function, it's possible that the pagination process may alter the order if the PageNumberPagination class or any other code inside the paginate_queryset function does any additional ordering.

To ensure that the order is preserved, you should check the implementation of the paginate_queryset function and the PageNumberPagination class to see if there's any code that might reorder the queryset. If you find any such code, you can adjust it to preserve the order by -search_score that you set in the get_person_list function.

In general, Django pagination should not change the order of the queryset, but it's always a good idea to check the specific implementation you're using to ensure the correct behavior.

Joe C
does this look ok:

def paginate_queryset(*, pagination_class, serializer_class, queryset, request, view):
    paginator = pagination_class()

    page = paginator.paginate_queryset(queryset, request, view=view)
    context = {'request': request}

    if page is not None:
        serializer = serializer_class(page, many=True, context=context)
        return paginator.get_paginated_response(serializer.data)

    serializer = serializer_class(queryset, many=True, context=context)
    return Response(data=serializer.data)

Yes, your paginate_queryset function looks okay. It doesn't seem to modify the order of the queryset. The function takes the queryset as a parameter, applies pagination using the provided pagination_class, and then serializes the paginated results using the provided serializer_class. The order of the queryset should be preserved throughout this process.

This implementation should work correctly with your PersonListApi view and return the paginated results maintaining the order set by the -search_score.

Joe C
the results are returned in popularity order. can you make sure the levenshtein function we are using looks correct

Based on the code you shared earlier, here's the filter_by_name method that calculates search_score using the Levenshtein distance:

python

def filter_by_name(self, queryset, name, value):
    query_len = len(value)

    if query_len <= 3:
        popularity_weight = 1.0
    elif query_len <= 6 and query_len > 3:
        popularity_weight = 0.5
    else:
        popularity_weight = 0.25

    similarity_weight = 1.0 - popularity_weight

    return queryset\
        .filter(searchable_name__trigram_similar=unidecode(value.replace(' ', '')))\
        .annotate(
            levenshtein_distance=Levenshtein('searchable_name', Value(value)),
            max_distance=Greatest(Length('searchable_name'), Length(Value(value))),
            similarity=ExpressionWrapper(
                1 - (F('levenshtein_distance') / F('max_distance')),
                output_field=FloatField()
            ),
            popularity_normalized=Cast(
                Case(
                    When(popularity__lt=1, then=Value(1)),
                    When(popularity__gt=100, then=Value(100)),
                    default=F('popularity')),
                output_field=FloatField(),
            ),
            search_score=(
                (F('similarity') * similarity_weight)
                + ((F('popularity_normalized') / 100) * popularity_weight))
        )\
        .order_by('-search_score')  # Sorting by search_score

The Levenshtein call itself looks correct. The line levenshtein_distance=Levenshtein('searchable_name', Value(value)) calculates the Levenshtein distance between the searchable_name field and the search query, and the similarity is then calculated using 1 - (F('levenshtein_distance') / F('max_distance')).

If your results are returned in popularity order, it's possible that the search_score calculation may not be working as intended. You can try modifying the weights or the calculation formula to see if that changes the order of the results, or inspect the queryset's annotations to see if the search_score values are being calculated correctly.

Joe C
what type will the Levenshtein function return ?

The Levenshtein function returns an integer: the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other. This value is always a non-negative integer, where 0 indicates that the strings are identical and larger values indicate greater dissimilarity.

Joe C
what type will max_distance return?

max_distance will also return an integer. The Greatest function takes the maximum of two Length functions, both of which return the length of a string as an integer, so max_distance, the maximum length between searchable_name and the search query, is an integer.

Joe C
since these two functions are returning integers is it possible similarity is returning 0 and not a float?

Yes, you're correct. Since both levenshtein_distance and max_distance return integers, when you calculate similarity using this expression:

python

similarity=ExpressionWrapper(
    1 - (F('levenshtein_distance') / F('max_distance')),
    output_field=FloatField()
)

This expression is evaluated by PostgreSQL, not Python, and in SQL dividing two integers performs integer division that truncates toward zero. Since levenshtein_distance is almost always smaller than max_distance, the division yields 0 and similarity comes out as 1 - 0 = 1 for nearly every matching row. With similarity effectively constant, search_score is driven almost entirely by popularity, which would explain why your results come back in popularity order. The output_field=FloatField() on the ExpressionWrapper only casts the final result; it does not change how the inner division is computed.

To ensure that the similarity value is always a float, you can cast either levenshtein_distance or max_distance to a float before performing the division. Here's how to do that using Cast:

python

from django.db.models import FloatField
from django.db.models.functions import Cast

similarity=ExpressionWrapper(
    1 - (Cast(F('levenshtein_distance'), output_field=FloatField()) / F('max_distance')),
    output_field=FloatField()
)

Now the division is performed in floating point, so similarity takes fractional values between 0 and 1, and search_score will rank results by name similarity as intended.
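The pitfall can be reproduced in plain Python, where the // operator mimics the truncating division PostgreSQL performs on two integer operands:

```python
levenshtein_distance = 5   # e.g. 'rachel winter' vs 'rachel weiss'
max_distance = 13

# What PostgreSQL computes for integer / integer (truncating division):
similarity_int = 1 - levenshtein_distance // max_distance
print(similarity_int)  # 1 - every row with distance < max_distance scores 1

# What the query computes once one operand is cast to float:
similarity_float = 1 - levenshtein_distance / max_distance
print(round(similarity_float, 4))  # 0.6154
```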
