Towards a (true) Dhivehi search engine

As much as I would like the Dhivehi language to die and rot away, it seems that won't happen, at least for a while. The (relatively) newly minted freedom to publish newspapers and the growth of web-based news sites may have poised Dhivehi for a serious revival. The revival probably isn't so much in terms of improvements in vocabulary or other linguistics-related changes, but rather in terms of the amount of information now being pumped out in Dhivehi - and in my opinion, that's a great start.

A (if not THE) point worth noting here is that much of this new information is being produced - and published - by digital means. Most government authorities now have web portals and an increasing number of them maintain them diligently. Most, if not all, newspapers and magazines also seem to maintain web portals and make their content available online. This modern revival thus presents a very interesting and very modern problem (to geeks like me at least :-P): accessing it all. It is probably the first time in Maldivian history that a "Dhivehi search engine" makes practical sense.

Now, I am aware that Google and other search engines can be used to search for Dhivehi content, and I'm also aware that there are a few local operations that purport/aspire to be Maldivian search engines, but they all share important shortcomings. These shortcomings are mostly inherent to the various methods of writing Thaana used on the World Wide Web.

Say you want to search for the word "rayyithunge". Typing that into a search engine would bring up an entirely different set of results from typing in "rwacyituncge" or "ރައްޔިތުންގެ" - both of which are alternative ways of representing the same word in Dhivehi. The results differ because different sites use different representation schemes. A search for "rayyithunge" brings up pages that seem to contain mostly English, because "rayyithunge" is Dhivehi "Latin"ised so that standard English characters can be used to write Dhivehi words. People commonly use such Latinised Dhivehi when writing emails or chatting - say "haalu kihineththa" etc. Meanwhile, a search for "rwacyituncge" lists content from sites like Haveeru and Miadhu, which use standard ASCII coupled with custom Dhivehi fonts that map those characters to Thaana letters. If you try copy-pasting something written on the Haveeru page, you'd see that it comes out as a seemingly meaningless jumble of letters. Lastly, a search for "ރައްޔިތުންގެ" brings up results from sites like Minivan Daily and Sangu Daily, which use Unicode to display Dhivehi. Anyway, technical explanations aside, the point is that Dhivehi search is (currently) a messy enterprise.

The solution to this problem can (seem to) be pretty simple. A custom search interface could simply take the search query from a user, convert it into the three different representation schemes and then spawn a search for each representation on any of the existing search engines. This would work just fine... until you run into peculiar problems related to the Latinised and ASCII Dhivehi schemes. Take for example the word "ފަލަ" Latinised into "fala" - a search on the word would return almost entirely non-Dhivehi results totally unrelated to what we really want. Similarly, a search on the ASCII'ed phrase "Oled" (which is the word "ދެލޯ") would return a large number of non-Dhivehi results with no bearing on what we wanted. These problems occur because Latinised and ASCII Dhivehi representations can produce text that has meaning in English as well - as in the case of "Oled" above, which happens to be a popular technical term in English.
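The naive "expand the query" idea could be sketched roughly as below. This is only an illustration, not a finished converter: the character table is a partial one inferred from the example words in these posts, the function names are made up, and the Latinised variant is left out because it would need a proper transliteration scheme. The `\u{...}` escapes assume PHP 7 or later.

```php
<?php
// Partial ASCII-font => Unicode Thaana table, inferred only from the
// example words in these posts. A real table would cover the whole alphabet.
const ASCII_TO_THAANA = [
    'r' => "\u{0783}", 'n' => "\u{0782}", 'a' => "\u{0787}",
    'd' => "\u{078B}", 't' => "\u{078C}", 'l' => "\u{078D}",
    'g' => "\u{078E}", 'y' => "\u{0794}", 'j' => "\u{0796}",
    'w' => "\u{07A6}", 'W' => "\u{07A7}", 'i' => "\u{07A8}",
    'u' => "\u{07AA}", 'e' => "\u{07AC}", 'O' => "\u{07AF}",
    'c' => "\u{07B0}",
];

// Convert Unicode Thaana to its ASCII-font spelling, character by character.
function thaanaToAscii(string $text): string
{
    return strtr($text, array_flip(ASCII_TO_THAANA));
}

// Build the list of spellings to send to an existing search engine.
function expandQuery(string $unicodeQuery): array
{
    return [
        'unicode' => $unicodeQuery,
        'ascii'   => thaanaToAscii($unicodeQuery),
        // A 'latin' variant would go here; it needs its own transliteration rules.
    ];
}

// The example word from this post, spelled out as Unicode code points.
$query = "\u{0783}\u{07A6}\u{0787}\u{07B0}\u{0794}\u{07A8}"
       . "\u{078C}\u{07AA}\u{0782}\u{07B0}\u{078E}\u{07AC}";

foreach (expandQuery($query) as $scheme => $variant) {
    echo $scheme, ': ', $variant, PHP_EOL;
}
```

Each variant would then be submitted as a separate search, which is exactly where the "fala" and "Oled" collisions start to hurt.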

A more sophisticated approach to the search problem could probably iron out (most of) these quirks. An ideal solution would be to do away with existing search engines such as Google, despite their awesomeness, and develop a custom search engine. A custom engine would allow for the recognition of the various representation schemes used and the subtle differences between them. Such an engine could standardize the search phrase and look it up in a standardized index, returning results that better mirror the Dhivehi content that is out there. It could also bundle in extra Dhivehi-related facilities, such as conversions to make up for the lack of the (particular) fonts used on some sites, and spelling correction, among others.
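To make the "standardize, then index" idea a bit more concrete, here is a minimal sketch of my own. It assumes that the engine knows which scheme each page uses, that Unicode is the standard form, and that a partial character table inferred from the examples in these posts is good enough for a demo. The class name, the page ids and the scheme labels are all hypothetical.

```php
<?php
// Partial ASCII-font => Unicode Thaana table inferred from this blog's examples.
const FONT_TO_UNICODE = [
    'r' => "\u{0783}", 'n' => "\u{0782}", 'a' => "\u{0787}",
    't' => "\u{078C}", 'g' => "\u{078E}", 'y' => "\u{0794}",
    'w' => "\u{07A6}", 'i' => "\u{07A8}", 'u' => "\u{07AA}",
    'e' => "\u{07AC}", 'c' => "\u{07B0}",
];

final class ThaanaIndex
{
    // Standardized word => set of page ids containing it.
    private $postings = [];

    // Bring text into the single standard form (Unicode Thaana).
    public static function normalize(string $text, string $scheme): string
    {
        return $scheme === 'ascii' ? strtr($text, FONT_TO_UNICODE) : $text;
    }

    // Index a page, given the scheme its site is known to use.
    public function add(string $pageId, string $text, string $scheme): void
    {
        $standard = self::normalize($text, $scheme);
        foreach (preg_split('/\s+/u', $standard, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            $this->postings[$word][$pageId] = true;
        }
    }

    // Standardize the query the same way, then look it up.
    public function search(string $query, string $scheme): array
    {
        $word = self::normalize(trim($query), $scheme);
        return array_keys($this->postings[$word] ?? []);
    }
}

$index = new ThaanaIndex();
$index->add('ascii-font-page', 'rwacyituncge', 'ascii');
$index->add('unicode-page', "\u{0783}\u{07A6}\u{0787}\u{07B0}\u{0794}\u{07A8}"
    . "\u{078C}\u{07AA}\u{0782}\u{07B0}\u{078E}\u{07AC}", 'unicode');

// Either spelling of the query now finds both pages.
print_r($index->search('rwacyituncge', 'ascii'));
```

Because the conversion happens per page with a known scheme, an English page containing "Oled" would never be treated as Dhivehi in the first place.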

So, perhaps the question now is, is there a real need for a Dhivehi search engine yet? When should a Maldivian "Google" be born?

Back from pause

It's been quite a while since I last made a post here on my blog - more than a month to be more precise - but the lack of posts wasn't all due to forgetful negligence or busy schedules. Rather, it was mostly due to deliberate inaction while I contemplated some things regarding blogging.

Simply put, I decided to halt blogging because I was quite intimidated by the effects that my blog had lately been having on my personal life. Never had I thought, when I started blogging over two years ago, that the things I publish in the virtual world would lead to alarming consequences for myself in the real world.

One such important consequence is one that is (and should be) reasonably expected: exposure. This is especially relevant for bloggers like me who are upfront about who they are and choose not to hide behind pseudonyms and veils of secrecy. What I say on the blog then becomes directly attributable to me as a person and I am held accountable for what I say rather than all of it being chalked up to some anonymous pseudonym. Further, blog posts - explicitly or implicitly - reveal a lot about the intellectual dispositions of the author, including ideological stances and beliefs. It is important to note that these tend to have direct detrimental effects on the blogger's individual privacy and anonymity. I find it startling when random people whom I've never met or heard of strike up a conversation with an air of supposed familiarity with me, or when people talk of me with a sort of conviction that I'm this or that - all based on my blog posts. Exposure is good for aspiring politicians and performance artists, neither of which I have the slightest inclination towards becoming. The balance between anonymity and exposure is a tricky one, especially for someone who wants to stay pretty much under the radar.

Another important consequence of blogging relates to the response it evokes in readers. Personal blogs like mine often contain the ideas, thoughts and ramblings of the author - all of which have the potential of being controversial and disagreeable to some. Sadly, this sometimes goes to the extent that some choose to take extreme offence and retaliate with the vilest of language and threats. The comment system on my blog has been very open and mostly unmoderated, yet I had to switch to active comment moderation in the latter part of last year due to the growing use of vile language, mindless insults and threats against me and my family. While I appreciate all types of comments and welcome disagreement and discussion, I don't think death threats and chants wishing ill upon me are really warranted. Freedom of expression is a great thing but not all people seem to exercise it with civility and restraint...

These aren't things that would normally bother me, and they hadn't until I met a couple of eager fanatics whose unreserved and unashamed drive to show their disagreement through violence - something that I had the (dis)pleasure of experiencing in the months I was in Male' last year - gave me cause for serious concern. Anyway, I hope to resume blogging regularly again - in spite of the above-mentioned consequences...

Meteor shower December 13-14

Just thought I'd share this:
Best Meteor Shower of 2007 Peaks Dec. 13
"What could be the best meteor display of the year will reach its peak on the night of Dec.13-14.

Here is what astronomers David Levy and Stephen Edberg have written of the annual Geminid Meteor Shower: "If you have not seen a mighty Geminid fireball arcing gracefully across an expanse of sky, then you have not seen a meteor."

...

There is a fair chance of perhaps catching sight of some "Earth-grazing" meteors. Earth grazers are long, bright shooting stars that streak overhead from a point near to even just below the horizon. Such meteors are so distinctive because they follow long paths nearly parallel to our atmosphere.

The Geminids begin to appear noticeably more numerous in the hours after 10 p.m. local time, because the shower's radiant is already fairly high in the eastern sky by then. The best views, however, come around 2 a.m., when their radiant point will be passing very nearly overhead."

The full article is at http://www.space.com/spacewatch/071207-ns-geminids.html

As I've said before, I find meteoroids very exciting! I will surely be out watching for this meteor shower. Will you? :-P

Akunbakun update: 1 week expiry date

I made a slight change to Akunbakun, the Facebook application, over the weekend that makes Akunbakun challenges expire one week after they have been set. A message about the expiry is published to the mini feed and news feed along with the names of all those who managed to win that particular challenge.
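For the curious, the one-week rule boils down to a simple date comparison. The snippet below is only a sketch of that check and not the application's actual code; the function names and the sample date are made up for illustration.

```php
<?php
// A challenge expires exactly one week after it was set.
function challengeExpiry(DateTimeImmutable $setAt): DateTimeImmutable
{
    return $setAt->modify('+1 week');
}

// True once the current time has reached the expiry time.
function hasExpired(DateTimeImmutable $setAt, DateTimeImmutable $now): bool
{
    return $now >= challengeExpiry($setAt);
}

// Hypothetical challenge set on 1 December 2007 at 10:00 UTC.
$setAt = new DateTimeImmutable('2007-12-01 10:00:00', new DateTimeZone('UTC'));

var_dump(hasExpired($setAt, $setAt->modify('+6 days'))); // bool(false)
var_dump(hasExpired($setAt, $setAt->modify('+8 days'))); // bool(true)
```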

I plan to add features to make it easier to find which of your friends has an active Akunbakun challenge running as an easier alternative to having to run through profiles to see if they have one.

Meanwhile, the number of users of the application keeps on increasing! :-)


Expiry notice as published in the mini feed


Thaana Unicode<->ASCII conversions PHP class

Here is something that would probably be very handy to Maldivian web developers dabbling with Dhivehi sites. This PHP class addresses the need for converting text between Thaana in ASCII and Thaana in Unicode.

The class makes it easy to standardize text into one format irrespective of how it was/is written. This means that you can take text written in Accent or MS Word 97 (and prior), or written using Unicode as featured in recent MS Word editions, and use the class to present output in the format of your choice without imposing restrictions on the people who write the text. The class comes in even more handy when you have a form submission that takes input in Unicode but needs to be stored in the database or presented later as ASCII, or vice versa.

The class was something I originally wrote around 2001 and was used in the free Online Document Converter that featured on maldivianunderground.net. I rewrote it for PHP 5 recently for use in a project I am working on. The original class also supported Latin Dhivehi -> Unicode/ASCII conversions, which I haven't included in this release but will add in a future update.

Usage should be pretty straightforward but here is an example just to illustrate:

Example:
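<?php
// Load the class and create a converter.
require 'thaana_conversions.obj.php';
$thaana = new Thaana_Conversions();

// Unicode Thaana (written here as numeric HTML entities) to ASCII.
echo $thaana->convertUnicodeToAscii('&#1931;&#1960;&#1928;&#1964;&#1920;&#1960;');
// ASCII Thaana to Unicode.
echo $thaana->convertAsciiToUnicode('rWacje');
?>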
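
The Unicode input in the example is written as numeric HTML entities. If your text arrives in that form and you need real UTF-8 characters (or just want to see what the entities stand for), PHP's built-in html_entity_decode() can do the decoding. This is a side note using standard PHP only; it is not part of the class, and how the class treats raw UTF-8 input is not covered here.

```php
<?php
// The Unicode input from the example, written as numeric HTML entities.
$entities = '&#1931;&#1960;&#1928;&#1964;&#1920;&#1960;';

// Decode the entities into actual UTF-8 encoded Thaana characters.
$utf8 = html_entity_decode($entities, ENT_QUOTES, 'UTF-8');

echo $utf8, PHP_EOL;         // the same word in Thaana script
echo strlen($utf8), PHP_EOL; // 12, since each Thaana letter takes 2 bytes in UTF-8
```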

Download:
- Thaana_Conversions.zip (v0.1, 2KB)

Enjoy :-)

Update (7-May-2008): This version is now superseded by v0.2.

Updated: PHP script for Hulhule flight info

I'm releasing an update to the Hulhule FlightInfo PHP class I released late last year, in response to a feature request I received recently. The update adds support for flight information for all the days available on the Flight Information Services site, rather than the current-day-only mode that the original script supported. A small demo script is packaged in the download to highlight the usage details.

The Hulhule Flight Info class simplifies access to the flight information by doing the dirty work of grabbing the HTML page from the FIS site, parsing it, extracting the required information and making it accessible as a multidimensional array. It supports grabbing both the arrival and departure information and is easy to use. It has since been integrated into a few sites by different people.
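
To give an idea of the kind of dirty work involved, here is a rough sketch of turning an HTML table into a multidimensional array using PHP's DOM extension. It is not the class itself: the sample markup, the flight rows, the column names and the function name are all made up for illustration, and the real FIS page layout may look nothing like this.

```php
<?php
// Hypothetical sample markup standing in for a fetched flight page.
$html = <<<HTML
<table>
  <tr><th>Flight</th><th>From</th><th>Time</th><th>Status</th></tr>
  <tr><td>XX 101</td><td>Colombo</td><td>09:15</td><td>Landed</td></tr>
  <tr><td>YY 202</td><td>Dubai</td><td>11:40</td><td>Delayed</td></tr>
</table>
HTML;

// Extract every data row of the page as an associative array.
function parseFlightTable(string $html, array $columns): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so keep parser warnings quiet.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $flights = [];
    foreach ($doc->getElementsByTagName('tr') as $row) {
        $cells = [];
        foreach ($row->getElementsByTagName('td') as $cell) {
            $cells[] = trim($cell->textContent);
        }
        // Header rows use <th>, so they yield no <td> cells and are skipped.
        if (count($cells) === count($columns)) {
            $flights[] = array_combine($columns, $cells);
        }
    }
    return $flights;
}

$flights = parseFlightTable($html, ['flight', 'from', 'time', 'status']);
print_r($flights);
```

In real use the markup would come from fetching the page first, and the parsing would have to follow whatever structure that page actually has.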

- Download the flightinfo class v0.2

Enjoy :-)

Akunbakun for Facebook - 100th user and Rankings

Akunbakun for Facebook, which I released last month, racked up its 100th user today and so I'm celebrating the occasion by adding a new feature: Ranking.

Ranking is basically the success rate of the user on the Akunbakun challenges they take up. Numerically, it is simply a ratio between the number of wins and the number of tries taken. I added the feature to allow users to know how they are doing overall and to entice people to play more and get better scores. ;-)
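
In code, the calculation is about as simple as it sounds. The snippet below is a sketch of the ratio only; the function name is made up, and treating a user with no tries as having a ranking of zero is my assumption here rather than a description of how the application handles that case.

```php
<?php
// Success rate: number of wins divided by number of tries.
function akunbakunRanking(int $wins, int $tries): float
{
    // Assumption: a user who has not tried anything yet ranks at zero.
    if ($tries <= 0) {
        return 0.0;
    }
    return $wins / $tries;
}

// Three wins out of four tries.
printf("%.0f%%\n", akunbakunRanking(3, 4) * 100); // prints "75%"
```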

Ranking, along with a few minor fixes and enhancements, was rolled out a while ago and the Akunbakun box on the profile pages of all those who have added the application should now reflect the change.

Enjoy.


The new Ranking feature on Akunbakun