How do you spell boscodictiasaur? how we correct spelling and complete queries (0x65.dev)
41 points by jpschm on Dec 8, 2019 | hide | past | favorite | 12 comments



SymSpell is great for tiny datasets (what counts as tiny depends on the maximum edit distance and the dictionary size). It's fast, but the generated index will likely be very large; furthermore, you can't practically encode rules and transition costs (e.g. making the cost from "f" to "ph" zero, which would give better results for [ifone]).

An FST is the better choice. We (https://www.bestprice.gr/) used SymSpell in the past, but have since switched to an FST-based design, where the FST is used to match prefixes. We also store all queries encoded in the FST, together with their frequency, sorted by the query.

For each distinct corrected prefix, we determine the range in that (query, frequency) list using binary search, then iterate over [first, last] in that range and compute a score based on a language model and the edit cost from prefix to corrected prefix (i.e. a noisy-channel model).
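A minimal sketch of that lookup-and-score step, assuming a toy in-memory list in place of the real FST-backed store (the queries, frequencies, and the 1.5 edit-cost weight are all illustrative, not from the actual system):

```python
import bisect
import math

# Hypothetical (query, frequency) pairs, sorted by query string,
# standing in for the sorted list described above.
queries = sorted([
    ("iphone 11", 500), ("iphone case", 300),
    ("ipad", 200), ("iphone 11 pro", 150),
])

def candidates_for_prefix(prefix, edit_cost, top_k=3):
    """Find all stored queries under a corrected prefix via binary
    search, then score each with a toy noisy-channel model:
    log P(query) from frequencies, minus a penalty for the edit cost."""
    keys = [q for q, _ in queries]
    first = bisect.bisect_left(keys, prefix)
    # High sentinel character marks the end of the prefix range.
    last = bisect.bisect_right(keys, prefix + "\xff")
    total = sum(f for _, f in queries)
    scored = []
    for q, freq in queries[first:last]:
        score = math.log(freq / total) - 1.5 * edit_cost  # weight is a guess
        scored.append((score, q))
    return [q for _, q in sorted(scored, reverse=True)[:top_k]]
```

A real implementation would enumerate corrected prefixes from the FST rather than take them as given, and use a proper language model for the prior.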

We encode many different rules in the FST, including domain-specific language rules, and building the index takes a few seconds for millions of queries. The language model is based on 5-grams with Stupid Backoff.


[Disclaimer: I'm the author of SymSpell] >>"SymSpell is great for tiny datasets". Dictionaries with one million word entries are no problem, even for a maximum edit distance of 4. For comparison, the Oxford English Dictionary contains 301,100 main entries and 616,500 word-forms in total. https://towardsdatascience.com/symspell-vs-bk-tree-100x-fast...

SymSpell can be augmented with a Weighted Edit distance giving a higher priority to pairs that are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms which identify different spellings of the same sound). There are two SymSpell implementations with a Weighted Edit distance available: https://github.com/MighTguY/customized-symspell https://github.com/searchhub/preDict
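The weighted edit distance idea can be sketched as standard Wagner-Fischer dynamic programming with a substitution cost that depends on keyboard proximity. The adjacency set and the 0.5 cost below are illustrative, not taken from either linked implementation:

```python
# Small excerpt of QWERTY neighbours; adjacent-key substitutions are
# cheaper than arbitrary ones. Cost values are illustrative.
ADJACENT = {("q", "w"), ("w", "e"), ("e", "r"), ("a", "s"), ("s", "d")}

def sub_cost(a, b):
    if a == b:
        return 0.0
    if (a, b) in ADJACENT or (b, a) in ADJACENT:
        return 0.5  # likely a slip onto a neighbouring key
    return 1.0

def weighted_edit_distance(s, t):
    """Levenshtein distance with keyboard-aware substitution costs,
    computed with Wagner-Fischer dynamic programming."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                            # deletion
                d[i][j - 1] + 1.0,                            # insertion
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),
            )
    return d[m][n]
```

A phonetic variant would do the same thing but discount substitutions between characters that map to the same Soundex code.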


For the few million queries we needed to index, many of them close to 30 characters in length (some even longer), the generated index for a maximum edit distance of 3 was really large.

So we used three indices (for unigrams, bigrams, and trigrams respectively). During query processing we would segment the input query; for each segment we'd consult whichever of those three indices made sense and keep the top-K candidates. We'd then proceed to the next segment and consider the n-gram sequence formed by the "suffix" of the carried top-K candidates from the previous segment and the "prefix" of the current segment's matches, and so on, until we had exhausted the query. Segmentation was particularly hard to get right.
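The segment-by-segment composition described above is essentially a beam search. A minimal sketch, assuming hypothetical per-segment candidates and bigram bonuses in place of the actual SymSpell index lookups (all names and scores below are invented for illustration):

```python
import heapq

# Hypothetical per-segment correction candidates with log-scale
# unigram scores, standing in for the unigram-index lookups.
segment_candidates = [
    [("new", -1.0), ("now", -1.4)],
    [("york", -1.2), ("fork", -2.0)],
]
# Bonus for a known bigram, standing in for the bigram-index lookup.
bigram_bonus = {("new", "york"): 1.0}

def beam_correct(segments, k=2):
    """Process segments left to right, carrying the top-K partial
    corrections and rescoring each extension with the bigram bonus
    between the carried last word and the next candidate."""
    beam = [(0.0, [])]
    for cands in segments:
        nxt = []
        for score, words in beam:
            for w, s in cands:
                bonus = bigram_bonus.get((words[-1], w), 0.0) if words else 0.0
                nxt.append((score + s + bonus, words + [w]))
        beam = heapq.nlargest(k, nxt)  # keep only the top-K partials
    return [" ".join(w) for _, w in sorted(beam, reverse=True)]
```

The real pipeline additionally has to choose the segmentation itself, which multiplies the candidate space and is the hard part mentioned above.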

It was a fairly involved process, but it was what worked for us -- again, SymSpell is _very_ fast for short tokens, but we had to execute maybe thousands of such SymSpell queries when processing a single input query, and it adds up quickly. (We will probably open-source our SymSpell implementation soon.)


[Disclaimer: work at Cliqz] Both techniques have their share of upsides and downsides; in fact, we also use an FST-based model to perform splits, e.g. donaldtrump ---> donald trump. The problem with the FST approach is when the prefix has a typo. I tried the spell correct on the website you linked to, and it seems to fail when the first character is incorrect. Thanks for the information though, it is always interesting to read and learn from other approaches :)
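The split step (donaldtrump ---> donald trump) can be sketched without an FST as a dynamic program over a word vocabulary, maximizing the sum of log unigram probabilities. The vocabulary and frequencies here are toy values, not the actual model:

```python
import math

# Toy unigram frequencies; a real system would use its query-log model.
freq = {"donald": 50, "trump": 40, "don": 5, "ald": 1}
total = sum(freq.values())

def split(text):
    """Best segmentation of `text` into known words, by dynamic
    programming over split points: best[i] holds the top score and
    word list covering text[:i]."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if word in freq and best[start][1] is not None:
                score = best[start][0] + math.log(freq[word] / total)
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    # Fall back to the input unchanged if it cannot be segmented.
    return " ".join(best[-1][1]) if best[-1][1] is not None else text
```

An FST makes this tolerant to typos inside the concatenated words, which the plain dictionary DP above is not.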


Could you please tell me what you entered?

As for prefixes and typos, in our case, the FST is used for both matching a prefix and spelling correction of the prefix.

The first character is treated specially; according to our findings, the chance of getting the first character wrong is very low, so only completions/corrections that score very high get matched.


Sure. cmputer gets corrected, not omputer.

Just to provide some additional information, our library Keyvi (https://github.com/KeyviDev/Keyvi) has a very fast implementation of FST-based spell correction.


Got it. This is by design -- though I suspect we need to dial down the cost for a misspelled first character in prefixes.

We built our own FST, which is somewhat similar to the design of the Rust fst crate ( https://blog.burntsushi.net/transducers/#fst-construction ).


I spell it B13R, you're welcome: https://ghost.sk/dinosaurs/


Have you considered neural network based techniques for both problems?


Hi! Yes, we have played around with character and trigram level neural network language models. Also, we experimented with training a supervised neural network based on a misspellings dataset for the corrector.

Unfortunately, we had trouble getting the performance to a point where they could replace this system. It is definitely something we will revisit soon though!


Mboscodictiasaur obviously.


Interesting tidbit from Cliqz (https://cliqz.com/en/):

Europe has failed to build its own digital infrastructure. US companies such as Google have thus been able to secure supremacy. They ruthlessly exploit our data. They skim off all profits. They impose their rules on us. In order not to become completely dependent and end up as a digital colony, we Europeans must now build our own independent infrastructure as the foundation for a sovereign future.

Why not post this article to the European "HN"? Would hate for the US Hacker News to be ruthlessly providing traffic to a company whose stated mission is anti-US.



