Inflection based de-duplication

26/03/2018

Say you have a list of words ordered by priorities (scores), in which more important words are located at the beginning, whereas less important words are located at the end, and this list could look like this:

[pets, dog, ball, cat, pet, food]

 

Problem

As you can see, this list contains two different forms of the same word: “pet” and “pets”. Having these both words, doesn’t provide more unique information than having only one of them. So how can we decide which one should be kept and which one should be removed from the list?

Another interesting detail, according to the list's order, because word “pets” is closer to the beginning of the list than the word “pet”, this means that “pets” is more important than “pet”. Now we can determine that “pets” should be kept and “pet” should be removed.

Given this information we can state two problems that we need to solve:

  1. Find duplicates regardless of their form plural or singular.
  2. Keep only more important form of the word.

 

Solution

As title states, today I’ll be using inflection in order to solve first problem, here is an algorithm:

  1. Start list iteration.
  2. Build an index of words on each step to check whether current word’s singular form has been seen already or not.
  3. Check if current word in the step:
    1. if word is of singular form and index has same position for this word - keep word;
    2. if word is of plural form and index doesn’t have its singular forms yet - keep word;
    3. if word is of plural form and index contains its singular forms but their positions are closer to the end than the position of current word - keep word;
    4. else - remove word.

This approach requires O(N) runtime and O(N) space. 

As a result of this algorithm, input list

[pets, dog, ball, cat, pet, food]

will be transformed into:

[pets, dog, ball, cat, food]