1 year ago
#249881
Joel
How to calculate the distance of one string across a defined range of strings?
Given an interval defined by two strings, [x, y], and a third string s between them, is there a way to calculate what percentage of the whole interval lies between x and s? Preferably the method would honor collation (whether case matters, for instance). An approximate answer is reasonable.
For example, given the strings 'a' and 'c', 'b' is halfway across, in the normal Latin-1 collation, so we'd expect an answer of 50%.
The obvious, and wrong, way is to trust the encoding to carry the day. Unfortunately, that ignores the fact that in a case-insensitive collation, 'B' is in the interval ['a', 'c'] and is equivalent to 'b', even though 'B' is encoded as a lower number than 'a' and so falls outside the interval by code point. So the encoding doesn't carry this information unless we go through some normalization, which might be expensive.
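To make the problem concrete, here is a minimal sketch of the naive approach with a normalization step bolted on. It reads the leading bytes of each string's key as a base-256 fraction and interpolates. The names `sort_key`, `key_to_fraction`, and `position` are made up for illustration, and `str.casefold()` is only a stand-in for a case-insensitive collation; a real collation would need proper sort keys from a collation library.

```python
from fractions import Fraction


def sort_key(s: str, case_sensitive: bool = True) -> bytes:
    """Hypothetical sort key: UTF-8 bytes, optionally case-folded first.

    Assumption: casefold() approximates a case-insensitive collation.
    It is not a real collation key.
    """
    if not case_sensitive:
        s = s.casefold()
    return s.encode("utf-8")


def key_to_fraction(key: bytes, digits: int = 8) -> Fraction:
    """Read the first `digits` bytes of a key as a base-256 fraction in [0, 1)."""
    padded = key[:digits].ljust(digits, b"\x00")
    return Fraction(int.from_bytes(padded, "big"), 256 ** digits)


def position(x: str, s: str, y: str, case_sensitive: bool = True) -> float:
    """Approximate fraction of the interval [x, y] that lies between x and s."""
    fx, fs, fy = (
        key_to_fraction(sort_key(v, case_sensitive)) for v in (x, s, y)
    )
    if fx == fy:
        raise ValueError("x and y map to the same key prefix")
    return float((fs - fx) / (fy - fx))


print(position("a", "b", "c"))                        # 0.5
print(position("a", "B", "c"))                        # -15.5, outside the interval
print(position("a", "B", "c", case_sensitive=False))  # 0.5
```

Without normalization, 'B' lands far outside the interval; with it, 'B' lands at 50% as expected. Only the first 8 bytes of each key are used, so strings sharing a long common prefix all collapse to the same value, and the result is an approximation at best.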
I'm hoping someone has thought of a better way. It seems like something that should come up in database implementation quite a bit, but I haven't seen anything in the literature, or online, alluding to this. To be fair, it's entirely possible I'm looking in the wrong places and under the wrong names. String distance questions seem to be dominated by edit distance, not this sort of collation-related distance.
It's also possible that the question depends on the encoding, in addition to the collation. In that case, I'm most interested in the various UTF encodings.
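As a small illustration of why the encoding could matter, the sketch below checks whether comparing encoded bytes orders two strings the same way as comparing code points. The helper name `byte_order_agrees` is made up for this example. UTF-8 and UTF-32 (big-endian) byte order follows code point order, while UTF-16 does not for characters outside the Basic Multilingual Plane, because surrogate code units sort below U+E000 through U+FFFF.

```python
def byte_order_agrees(a: str, b: str, encoding: str) -> bool:
    """True if byte-wise comparison under `encoding` matches code point order."""
    # Python compares str values by code point, so `a < b` is the reference order.
    return (a.encode(encoding) < b.encode(encoding)) == (a < b)


bmp_char = "\uff5e"         # U+FF5E, inside the Basic Multilingual Plane
astral_char = "\U0001f600"  # U+1F600, encoded with a surrogate pair in UTF-16

for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(enc, byte_order_agrees(bmp_char, astral_char, enc))
# utf-8 True
# utf-16-be False
# utf-32-be True
```

So any byte-based estimate of position would, at a minimum, need to be defined per encoding, before collation is even considered.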
string
algorithm
utf
database-engine
0 Answers