Support for more languages in Tagify

20/03/2020

From day 0, Tagify supported only the English language. It stayed that way until an issue about “Languages” was opened on GitHub, asking for support for languages other than English.

The first new language added was Russian. It is one of the languages the author of Tagify understands, so it was an obvious choice. All it required was adding the Russian letters (from а to я, or simply а-я) and a Russian stop-words vocabulary. This was released in Tagify v0.32.0.

Generalization

There have been a few releases since then, but the idea of adding more languages was still in the air. The question was whether language support could be generalized, so that a new letter set (like а-я for Russian) would not have to be introduced for every new language.

Thanks to the power of regular expression syntax, there is a way to match letters generically without listing a separate letter set per language: the \p{L} character class, which matches any Unicode letter.
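As a rough illustration of the idea, here is a minimal sketch in Go using the standard regexp package, which supports \p{L}. The helper name words is hypothetical and is not taken from Tagify's code; it only shows how a single pattern can pick out words in several scripts at once.

```go
package main

import (
	"fmt"
	"regexp"
)

// letters matches runs of Unicode letters in any script,
// so no per-language letter set (such as а-я) is needed.
var letters = regexp.MustCompile(`\p{L}+`)

// words is a hypothetical helper (not Tagify's API): it returns
// every run of letters found in text, skipping digits and punctuation.
func words(text string) []string {
	return letters.FindAllString(text, -1)
}

func main() {
	// English, Russian and Hebrew words are matched by the same pattern;
	// the trailing number is not a letter run and is dropped.
	fmt.Println(words("quantum компьютер מחשב 42"))
}
```

One caveat with this sketch: \p{L} covers letters only, so in scripts that rely on combining marks (Hindi vowel signs, for example) a pattern like this would split words unless marks (\p{M}) are matched as well.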

The rest was just a matter of adding the corresponding stop-words for each language.
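To make the stop-words step concrete, here is a small sketch in Go. Everything in it is an assumption made for illustration: the stopWords table, its tiny word lists, the language codes and the dropStopWords helper are hypothetical and do not reflect how Tagify actually stores or applies its vocabularies.

```go
package main

import (
	"fmt"
	"strings"
)

// stopWords is a hypothetical per-language vocabulary; real lists
// are much longer. Keys are assumed language codes.
var stopWords = map[string]map[string]bool{
	"en": {"the": true, "a": true, "of": true},
	"ru": {"и": true, "в": true, "на": true},
}

// dropStopWords returns words without the stop-words of the given
// language. An unknown language has no vocabulary, so nothing is dropped.
func dropStopWords(lang string, words []string) []string {
	var kept []string
	for _, w := range words {
		if !stopWords[lang][strings.ToLower(w)] {
			kept = append(kept, w)
		}
	}
	return kept
}

func main() {
	fmt.Println(dropStopWords("en", []string{"The", "power", "of", "tags"}))
	fmt.Println(dropStopWords("ru", []string{"кот", "и", "пёс"}))
}
```

With a layout like this, supporting another language would only mean adding one more entry to the table, which matches the point above that the letter matching itself no longer needs to change.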

9 more languages

With the release of v0.42.1, Tagify supports nine (9) more languages in addition to English and Russian:

  • Chinese
  • Hindi
  • Hebrew
  • Spanish
  • Arabic
  • Japanese
  • German
  • French
  • Korean

So, for example, running the following:

% tagify -s https://he.wikipedia.org/wiki/%D7%9E%D7%97%D7%A9%D7%91_%D7%A7%D7%95%D7%95%D7%A0%D7%98%D7%99

which, by the way, is a wiki page about “Quantum Computing” in Hebrew, will produce something like this:

מחשב קלאסי מספר המפתח קוונטי

The resulting tags above translate to (if I’m not mistaken):

classic computer quantum key number

And this is how the very first issue raised by someone other than the author got resolved.

Cheers!