How DeepSeek explained the SimpleSim algorithm and found an oddity in it

2025-01-29
I want to tell you about a remarkable experience I had today. We asked this new AI engine to explain an algorithm I designed, and it identified room for improvement.

Six months ago, I needed a function in Excel that compares fields based on similarity. The built-in functions like XLOOKUP worked very well, but only for exact matches. My use case was different: I had accounting lists from two different sources that described the same facts, but worded differently. There was no common key to match them, only an amount and a year to match exactly and a label to match approximately.

So I wrote LLOOKUP and SLOOKUP. LLOOKUP implements the Levenshtein algorithm, which is well studied but very slow. Its complexity is O(n*m) in all cases. There had to be a faster way, so I created the SimpleSim algorithm. It starts with a matching character and then looks forward, always jumping to the next matching character. The longer it has to look forward, the bigger the penalty. The worst case is O(n*m), but for every approximate match the algorithm stops earlier, giving O(n) in the best and average cases. I did not find a benchmark test set, but the algorithm behaved well and gave usable results.

A few days ago, my son, who studies computer science, told me he could not find anything on the Internet describing the SimpleSim algorithm. This is not a surprise, because my description in SLOOKUP and LLOOKUP for Excel was rather superficial. I told him I would take the time to write proper documentation or a white paper.

A few days later, my son told me about a Chinese algorithm taking down the stock value of Nvidia. We discussed DeepSeek and how it works. He described it in the following words: while ChatGPT starts talking immediately, DeepSeek thinks before it speaks, and it also tells you about its reasoning.

We then thought: how about asking DeepSeek to document the algorithm before I do it manually? In another project, I had ported SimpleSim to JavaScript but never published it. So we asked DeepSeek to explain the following code.

function simpleSim(needle, haystack, simple = false) {
  if (simple) {
    needle = cleanString(needle);
    haystack = cleanString(haystack);
  }
  var score = 0.0;
  var offset = 0;
  const needleLength = needle.length;
  const haystackLength = haystack.length;
  if (!needleLength) return 0.0; // always true
  for (var i = 0; i < needleLength; i++) {
    const n = needle.charAt(i);
    for (var j = offset; j < offset + haystackLength; j++) {
      const test = j % haystackLength;
      const h = haystack.charAt(test);
      if (n == h) {
        if (test >= offset) score += 1 / (test - offset + 1);
        offset = test + 1;
        break;
      }
    }
  }
  return score / needleLength;
}
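
A side note for readers: cleanString is not part of the snippet we gave DeepSeek. A minimal placeholder, assuming it only lowercases and strips everything that is not a letter or digit, could look like this, together with a quick call of the function above:

// Hypothetical helper, not part of the original snippet:
// assumed to lowercase and strip non-alphanumeric characters.
function cleanString(s) {
  return s.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Quick check with the function above:
// "a" and "b" match immediately (1 each), "c" is found three positions
// after the offset (1/4), so the result is (1 + 1 + 0.25) / 3 = 0.75.
console.log(simpleSim("abc", "abxabc")); // 0.75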

This is what DeepSeek answered

The simpleSim function calculates a similarity score between two strings, needle and haystack, based on how closely the needle appears in order within the haystack, with rewards for proximity. Here's a breakdown:

Key Steps:

1. Preprocessing (simple = true):

  • Cleans both strings using cleanString (assumed to normalize case/special characters).

2. Initialization:

  • score starts at 0.
  • offset tracks where the next character search begins in haystack.

3. Matching Logic:

  • For each character in needle, search the haystack starting from the current offset.
  • If a match is found:
    • Score: Add 1 / (distance + 1), where distance is how far the match is from offset.
    • Offset Update: Move the search start to just after the found position.

4. Normalization:
  • Final score is the average of all character scores (score / needleLength).

How It Works:

  • Order Matters: Characters in needle must appear in order in haystack.
  • Proximity Boost: Closer matches contribute more to the score.
  • No Wrapping: The search doesn't wrap around haystack; it stops at the end.

Example:

  • Needle: "abc", Haystack: "aXbYc"
    • a at 0 → score += 1.
    • b at 2 → score += 1/3 (distance 1).
    • c at 4 → score += 1/2 (distance 1).
  • Final Score: (1 + 0.333 + 0.5) / 3 ≈ 0.611.

Use Case:

Useful for fuzzy string matching where order and proximity matter, like search suggestions or typo-tolerant lookups.

Note:

The modulo operation in the code seems to intend wrapping, but the test >= offset condition prevents it, making the search linear. This might be a bug or design choice.

First impression

DeepSeek follows the algorithm closely by paraphrasing it. What is interesting is that it identifies the purpose of the algorithm, which was not documented in the prompt. DeepSeek may have looked at the function name, but that name is not common. Most likely it interpreted the names of the parameters (everybody searches for a needle in a haystack) and probably tried the algorithm out with some sample values.

There was this strange note at the end, but we ignored it at first. With a second prompt, we asked DeepSeek to compare SimpleSim with the Levenshtein algorithm.

How DeepSeek compared SimpleSim with Levenshtein

The SimpleSim method and Levenshtein distance solve different problems in string similarity, and their approaches are fundamentally distinct. Here’s a detailed comparison:

1. Core Purpose

  • SimpleSim: Measures how well the characters of needle appear in order within haystack, with a focus on proximity (closer matches get higher scores). Example: Searching for "abc" in "aXbYc" gives a high score because the characters appear in order, even with gaps.
  • Levenshtein Distance: Measures the minimum number of edits (insertions, deletions, substitutions) required to turn one string into another. Example: Turning "kitten" into "sitting" requires 3 edits (substitute "k" → "s", substitute "e" → "i", insert "g").

2. Key Differences

| Aspect | SimpleSim | Levenshtein Distance |
| --- | --- | --- |
| Goal | Order and proximity of characters in needle within haystack. | Structural similarity based on edit operations. |
| Output | Normalized score (higher = better match). | Integer distance (lower = better match). |
| Handling Order | Requires characters to appear in order (like a subsequence). | Does not require order; focuses on overall string alignment. |
| Proximity Sensitivity | Rewards closer matches (e.g., "aXb" scores higher than "aXXXb"). | Ignores proximity; only counts edits. |
| Partial Matches | Works well for partial matches (e.g., "abc" in "aXbYc"). | Requires exact alignment; partial matches may still have high edit costs. |
| Performance | O(n*m) worst case (but optimizes via early exits). | O(n*m) time and space (due to dynamic programming matrix). |

3. Example Comparison

Case: needle = "abc", haystack = "aXbYc"

  • SimpleSim:
    • Score ≈ 0.611 (rewarded for order and proximity).
    • Logic: a at position 0, b at 2 (distance 1), c at 4 (distance 1).
  • Levenshtein:
    • Distance = 2 (insert "X" and "Y" between a, b, and c).
    • Logic: a→a (no edit), insert X, b→b (no edit), insert Y, c→c (no edit).

4. When to Use Which

  • Use SimpleSim if:
    • You care about subsequence matching (e.g., search suggestions where characters appear in order but not necessarily consecutively).
    • Proximity matters (e.g., "apple" vs. "a_p_p_l_e" scores higher than "aXXXe").
  • Use Levenshtein if:
    • You need typo tolerance (e.g., "levenshtein" vs. "levnstien").
    • You want a standardized metric for general string similarity.
    • Substitutions, insertions, or deletions are critical (e.g., password correction).

5. Strengths and Weaknesses

| Metric | Strengths | Weaknesses |
| --- | --- | --- |
| SimpleSim | Fast for fuzzy subsequence searches. Proximity-aware. | Ignores substitutions (e.g., "abcd" vs. "abzd"). No penalty for extra characters. |
| Levenshtein | Handles substitutions/insertions/deletions. Standardized and precise. | Computationally heavy for long strings. Insensitive to order/proximity. |

6. Hybrid Approaches

In practice, tools like fuzzy search engines often combine ideas from both:

  1. Use a SimpleSim-like method to filter candidate matches quickly.
  2. Apply Levenshtein or Damerau-Levenshtein for final ranking.

For example, libraries like Fuse.js or RapidFuzz blend proximity, order, and edit-based metrics.
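
As a reference for readers who have not seen it: the dynamic-programming Levenshtein that DeepSeek describes above looks roughly like the sketch below. This is the generic textbook algorithm, not my LLOOKUP implementation.

// Generic Levenshtein distance (dynamic programming), shown for comparison.
// A textbook sketch, not the LLOOKUP code from the Excel add-in.
function levenshtein(a, b) {
  const m = a.length, n = b.length;
  // dp[j] holds the distance between a[0..i) and b[0..j)
  let dp = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    let prev = dp[0]; // value of dp[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= n; j++) {
      const temp = dp[j];
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      dp[j] = Math.min(dp[j] + 1,      // deletion
                       dp[j - 1] + 1,  // insertion
                       prev + cost);   // substitution
      prev = temp;
    }
  }
  return dp[n];
}

console.log(levenshtein("kitten", "sitting")); // 3
console.log(levenshtein("abc", "aXbYc"));      // 2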

This is quite impressive but...

DeepSeek explained quite well why I wrote SimpleSim. We could have left it there, but I read the note at the end of the first answer again. Why did DeepSeek criticise my code?

The modulo operation in the code seems to intend wrapping, but the "test >= offset" condition prevents it, making the search linear. This might be a bug or design choice.

What happens here in the code? The algorithm looks for the next matching character, starting at the position after the last match. The longer it has to look, the lower the score. At some point we reach the end of the string, start over at the beginning, and stop either when we find the character or after a complete loop if we do not find it.

At that moment, test is lower than offset and the scoring term 1 / (test - offset + 1) would have been negative. A negative value did not make sense, so I set the contribution to zero.

This can be a design choice, but DeepSeek is right: We can do better than setting it to zero.

Why? Let's look at a use case where the characters are in the wrong order.

  • needle: "Lausanne"
  • haystack: "The Geneva lake is near Luasanne."

We could agree that the score should be high because there is just a swap "au" → "ua" which could be a simple typo.
The score is (1/25 + 1/2 + 0 + 1 + 1 + 1 + 1 + 1)/ 8 = 0.692
The low score for the first character is understandable, but the zero score for "u" is not.

But what would be a good score? We also count in the other direction. The search for "u" starts at "s" (the character at the current offset), and the nearer the matching character is to that starting point, in either direction, the higher the score we give.


if (test >= offset) {
  score += 1 / (test - offset + 1);
} else {
  score += 1 / (offset - test + 1);
}
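
For completeness, here is the full function with this change applied; everything else stays exactly as in the version above.

function simpleSim(needle, haystack, simple = false) {
  if (simple) {
    needle = cleanString(needle);
    haystack = cleanString(haystack);
  }
  var score = 0.0;
  var offset = 0;
  const needleLength = needle.length;
  const haystackLength = haystack.length;
  if (!needleLength) return 0.0;
  for (var i = 0; i < needleLength; i++) {
    const n = needle.charAt(i);
    for (var j = offset; j < offset + haystackLength; j++) {
      const test = j % haystackLength;
      const h = haystack.charAt(test);
      if (n == h) {
        // Score by distance from the offset in either direction,
        // instead of giving wrapped-around matches a zero.
        if (test >= offset) {
          score += 1 / (test - offset + 1);
        } else {
          score += 1 / (offset - test + 1);
        }
        offset = test + 1;
        break;
      }
    }
  }
  return score / needleLength;
}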

Then we asked DeepSeek to explain the corrected algorithm.

The second explanation of SimpleSim by DeepSeek

This JavaScript function, simpleSim, calculates a similarity score between two strings: needle and haystack. The score represents how well the needle string matches within the haystack string. Here's a detailed explanation of how it works:

Purpose

The function compares the needle string against the haystack string and calculates a score based on how closely the characters of the needle appear in the haystack in order. The score is normalized by the length of the needle.

Parameters

1. needle: The string to search for within the haystack.
2. haystack: The string in which to search for the needle.
3. simple (optional, default = false): If true, both needle and haystack are preprocessed using a cleanString function (not shown in the code). This might involve removing special characters, converting to lowercase, etc.

Steps in the Algorithm

1. Preprocessing (Optional):

  • If simple is true, the cleanString function is applied to both needle and haystack to normalize them.

2. Initialization:

  • score: A variable to accumulate the similarity score.
  • offset: A variable to track the position in the haystack where the next character of the needle should be searched.
  • needleLength and haystackLength: Store the lengths of the needle and haystack strings, respectively.

3. Edge Case:

  • If the needle is an empty string (needleLength == 0), the function immediately returns 0.0 because an empty string cannot match anything.

4. Nested Loops:

  • The outer loop iterates over each character of the needle.
  • The inner loop searches for the current needle character (n) in the haystack, starting from the current offset.
  • The search wraps around the haystack using modulo (%) to handle cases where the haystack is shorter than the needle.
  • If a match is found:
    • The score is updated based on the distance between the current offset and the position of the match (test). The closer the match, the higher the contribution to the score.
    • The offset is updated to the position after the match (test + 1) to ensure the next character of the needle is searched in the correct part of the haystack.

5. Score Calculation:

  • The score is calculated as the sum of inverse distances for each matched character.
  • The final score is normalized by dividing by the length of the needle.

6. Return Value:

  • The function returns the normalized score, which represents how well the needle matches the haystack.

Key Features

  • Wrapping Search: The use of modulo (%) allows the search to wrap around the haystack, making the algorithm flexible for cases where the haystack is shorter than the needle.
  • Distance-Based Scoring: The score is weighted by the distance between matches, giving higher importance to closer matches.
  • Normalization: The final score is divided by the length of the needle, ensuring the result is consistent regardless of the length of the input.

Example

Let’s say:

  • needle = "abc"
  • haystack = "abxabc"

1. Preprocessing (if simple = true): Both strings are cleaned.
2. The outer loop iterates over each character of needle (a, b, c).
3. The inner loop searches for each character in haystack:

  • a is found at position 0. Score += 1 / (0 - 0 + 1) = 1.0. Offset = 1.
  • b is found at position 1. Score += 1 / (1 - 1 + 1) = 1.0. Offset = 2.
  • c is found at position 5. Score += 1 / (5 - 2 + 1) = 0.25. Offset = 6.

4. Final score = (1.0 + 1.0 + 0.25) / 3 = 0.75.

Potential Improvements

1. Efficiency: The nested loops make this algorithm O(n*m), where n is the length of the needle and m is the length of the haystack. This could be optimized for large inputs.
2. Edge Cases: The function assumes the haystack is non-empty. If the haystack is empty, the function might behave unexpectedly.
3. cleanString Function: Since cleanString is not defined, its behavior is unclear. It should be documented or included in the code.

Let me know if you'd like further clarification or help with optimizing this algorithm!

Conclusion

DeepSeek helped us identify an improvement to the code. We are trying it out and are still searching for a dataset to benchmark SimpleSim. By the way, DeepSeek, do you know a public dataset to benchmark algorithms that score the similarity of strings? DeepSeek proposed many datasets, which we will explore soon. The most interesting are: