Data driven language learning

A post that outlines my experiment in learning Bahasa Indonesia by identifying and memorizing the most frequently used words that make up 80% of a body of text.

Some research here discusses word frequency and the vocabulary size required to learn a language. The page mentions that “By knowing the 2000 most frequent word families of English, readers can understand approximately 80% of the words in any text. Therefore, the goal of an English learner should be to acquire these 2000 word families first, since this relatively small number of words is recycled in any piece of writing and ensures the basis for reading comprehension.”

I figured I would do the same for Bahasa Indonesia. The strategy is to load the most frequent words covering 80% of the corpus, along with their English equivalents, into Anki and memorize them, allowing me to comprehend a significant part of the language relatively quickly.

My first attempt was to extract words from the Bahasa Indonesia Wikipedia corpus by downloading the “All pages, current versions only.” dump from here. The words were extracted from the document, lowercased with all non-alphabetic characters removed, and ranked in order of frequency. This attempt failed due to the sheer number of tags, specialized symbols, and metadata in the Wikipedia dump.
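The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration, not the original code; the sample sentence is mine.

```python
import re
from collections import Counter

def word_frequencies(text):
    # Lowercase the text, strip everything except a-z and spaces,
    # then count the remaining whitespace-separated tokens.
    cleaned = re.sub(r"[^a-z ]", " ", text.lower())
    return Counter(cleaned.split())

freqs = word_frequencies("Saya suka makan dan saya suka minum")
# most_common() ranks words by descending frequency
print(freqs.most_common(3))
```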

My second attempt used the corpus from a “500,000 Word Bahasa Indonesia Parallel Corpus with Penn Treebank” (much thanks to Prasetya Dwicahya for his help with verification), which worked better since it produced pure words. However, the issue with this method is that individual words often don’t make sense on their own: some vocabulary items are actually word pairs or even sequences of three words. For example, one could memorize “thank” and “you” separately but miss out on “thank you”, which is an important part of the vocabulary.
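One common way to catch such multi-word expressions is to count n-grams (sliding windows of n consecutive tokens) alongside single words. A minimal sketch, using the Indonesian phrase “terima kasih” (“thank you”) as the example:

```python
from collections import Counter

def ngram_frequencies(tokens, n=2):
    # Count every run of n consecutive tokens as a single unit
    return Counter(" ".join(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

tokens = "terima kasih atas bantuan anda terima kasih".split()
bigrams = ngram_frequencies(tokens, n=2)
# "terima kasih" is counted as one vocabulary item, not two
print(bigrams["terima kasih"])
```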

This issue was resolved by scraping kamus.net, an Indonesian <-> English dictionary. A table was created with the Indonesian word in one column and the equivalent English in another; this was matched against the corpus loaded from the Treebank, which produced a third column containing the frequency.
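The matching step amounts to a join between the scraped dictionary and the corpus frequency counts. A minimal sketch (the dictionaries here are toy stand-ins, not the real scraped data):

```python
def build_table(dictionary, frequencies):
    # dictionary:  {indonesian_word: english_gloss} from the scrape
    # frequencies: {indonesian_word: count} from the corpus
    # Keep only dictionary entries that occur in the corpus,
    # and rank the resulting rows by descending frequency.
    rows = [(word, gloss, frequencies[word])
            for word, gloss in dictionary.items()
            if word in frequencies]
    return sorted(rows, key=lambda row: row[2], reverse=True)

table = build_table(
    {"yang": "which; whom; whose", "dan": "and", "xyzzy": "nonsense"},
    {"yang": 16103, "dan": 8851},
)
```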

For example:

    Indonesian | English                    | Frequency
    yang       | which; whom; whose; which  | 16103
    dan        | and; and is                | 8851

The results were saved as a CSV file and loaded into Anki, a spaced repetition learning system that works like flash cards, with the Indonesian word on the front and the English on the back. Rather than showing the cards randomly, it shows you words you have difficulty remembering more frequently and easy words less often.
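Producing that CSV is straightforward with the standard library; Anki’s text importer accepts comma-separated front/back columns. A minimal sketch with made-up rows:

```python
import csv
import io

def to_anki_csv(rows):
    # rows: (front, back) pairs — Indonesian on the front, English on the back.
    # Write to an in-memory buffer; in practice this would go to a file.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    return buf.getvalue()

deck = to_anki_csv([("yang", "which; whom; whose"), ("dan", "and")])
print(deck)
```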

I’ve been on it for about a week now and have so far picked up a vocabulary of about a hundred words. I hope somebody will find this post useful and will be able to build on it. In case anyone is wondering, 1327 words form 80% of all the words in the corpus.
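For anyone reproducing that last number on their own corpus: the cutoff is just the point where the cumulative frequency of the ranked words crosses 80% of all tokens. A minimal sketch with a toy distribution (not the real corpus counts):

```python
def words_for_coverage(counts, coverage=0.8):
    # counts: word frequency counts from the corpus.
    # Walk the counts in descending order, accumulating tokens,
    # and return how many distinct words are needed to reach the
    # requested share of all tokens.
    total = sum(counts)
    running = 0
    for n, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running / total >= coverage:
            return n
    return len(counts)

# Toy distribution: 4 words accounting for 10 + 5 + 3 + 2 = 20 tokens;
# the top 3 words cover 18/20 = 90% of tokens, crossing the 80% line.
n = words_for_coverage([10, 5, 3, 2], coverage=0.8)
```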
