Pulau Bahasa Word Lists

Pulau Bahasa Word Lists (“PBWL”) are lists of common Indonesian words sorted by language proficiency (the CEFR scale) and grouped by root word. Learn more about the basic features and advantages of PBWL on the summary page first.

Image created by geralt and distributed via Pixabay (for Free for use & download)

As an addendum, this page gives you in-depth answers to the following questions.

FAQs

  1. PBWL as a series offers several files. Which PBWL file(s) should I download?
  2. Why is PBWL better than other “word frequency lists” such as the ones by Wiktionary?
  3. How many words do I need to reach a basic fluent level? Indonesian has many derivative forms with prefixes and suffixes. How do you count these words?
  4. I heard that “top 3,000 words cover 95% of context.” Is that true? What does 95% mean in practice?
  5. Which CEFR level can I reach after completing the Duolingo/Clozemaster Indonesian courses?
  6. What do the abbreviations and jargon in the files mean?
  7. Is PBCT a legally safe material? Can I use/remix PBCT without getting permission from the original copyright holders?
  8. How can I file an error report?



Q1. PBWL Series and Version Info

Last Update: Mar. 18, 2024

💡 To first-time visitors: Before downloading, read this section first to learn more about the difference between “word tokens”, “lemmas” and “root words”.

So far, the PBWL series consists of six Microsoft Excel files. Click a file name to view the content on Google Sheets and download it. The exception is the Master file, which is distributed on request only; please contact the author.

Do not share the file(s) with others via e-mail, etc. Instead, give them the URL so that they can access the latest version directly, and encourage them to read this FAQ page first.

https://pulaubahasa.wordpress.com/vocab-builders/pbwl/#series
| File Name (.xlsx) | Latest Version | Size |
|---|---|---|
| (1) Master (on a request basis only) | v1.0a (Mar. 18, 2024) | 440,000+ word tokens |
| (2) Root ⭐ ⭐ (highly recommended for learners) | v1.0a (Mar. 18, 2024) | 8,244 root words (1.32MB) |
| (3) Acronym (appendix of (2) Root) | v1.0a (Mar. 18, 2024) | 1,017 word tokens (135KB) |
| (4) Country (appendix of (2) Root) | v1.0a (Mar. 18, 2024) | 267 word tokens (38.0KB) |
| (5) Lemma (long list of lemmas not sorted by root word) | v1.0a (Mar. 18, 2024) | 27,852 lemmas (3.65MB) |
| (6) Unchecked (word tokens in (1) Master that are not checked) | v1.0a (Mar. 18, 2024) | 203,095 word tokens (10.4MB) |

Click the link. On the menu bar, go to “File” >> “Download” >> “Microsoft Excel” to save the file on your computer.

  • With 440K+ word tokens, (1) the “Master” file serves as a raw data set for application developers and linguistic data scientists (i.e. tech geeks).
  • On the other hand, the other five files are processed from the Master and ready to use for your daily practice on a flashcard app and for self-assessment.

Older versions are also available here.

The author of PBWL (MsFixer) is manually checking each word token in the Master. Here is the progress report.

| Total tokens (X+Y) | Checked (X = X1+X2) | Unchecked (Y) | Meaningful (X1) | Garbage (X2) |
|---|---|---|---|---|
| 449,059 (100.0%) | 245,964 (54.8%) | 203,095 (45.2%) | 39,267 (8.7%) | 206,697 (46.0%) |

How many word tokens are checked? (as of Mar. 18, 2024)
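
The figures in the progress report can be cross-checked with a few lines of Python. This is only a sanity check on the published numbers; the variable and function names are illustrative and do not come from any PBWL file.

```python
# Figures copied from the progress report (as of Mar. 18, 2024).
meaningful = 39_267   # X1
garbage = 206_697     # X2
unchecked = 203_095   # Y

checked = meaningful + garbage   # X = X1 + X2
total = checked + unchecked      # X + Y


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(checked, total)             # 245964 449059
print(share(checked, total))      # 54.8
print(share(unchecked, total))    # 45.2
print(share(meaningful, total))   # 8.7
print(share(garbage, total))      # 46.0
```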

Wait! Where are these 440K+ word tokens from? Why does the Master file contain so much “garbage”?


File 1. Master

440,000+ word tokens in the Master file are extracted from the following three external corpora and merged into a single file.

  • OpenSubtitles (OS) Y2018 contains 357,441 word tokens, all of which are imported into the PBWL Master.
  • The Y2013 mixed version from the University of Leipzig Corpora Collection (LCC) contains 3,000,000+ word tokens, of which only the top 300,000 are additionally imported into the Master.
  • Apa Kata Dunia (AKD) Y2016 contains 846,630 word tokens, of which only the top 50,000 are additionally imported into the Master.

“Garbage” in PBWL terms refers to typos, foreign words and “unique nouns” (proper nouns) such as names of people, companies, products, local cities/towns, and local ethnic groups or their languages (e.g. Java and Javanese). “Agglutinative” forms such as “apel-apel” (apples; plural form) and “apelku” (my apple; with possessive suffix) duplicate the root word “apel”, so these forms are also treated as “garbage”.

Duplicated tokens across the three corpora are also trimmed from the Master. Also note that LCC is case-sensitive while the other two and the PBWL Master are case-insensitive.
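
The merge described above could look roughly like the following sketch. It is not the author's actual procedure: the function name `merge_corpora`, the data layout and the tiny sample lists are all assumptions made for illustration. The only details taken from this page are that each corpus contributes its top-N tokens, that case is folded, and that duplicates are dropped.

```python
from typing import Dict, List, Optional


def merge_corpora(corpora: Dict[str, List[str]],
                  limits: Dict[str, Optional[int]]) -> Dict[str, Dict[str, int]]:
    """Merge ranked token lists into one case-insensitive table.

    corpora: corpus name -> tokens ordered from most to least frequent.
    limits:  corpus name -> how many top tokens to import (None = all).
    Returns: lower-cased token -> {corpus name: best rank in that corpus}.
    """
    master: Dict[str, Dict[str, int]] = {}
    for name, tokens in corpora.items():
        top = tokens[:limits.get(name)]   # a slice up to None keeps everything
        for rank, token in enumerate(top, start=1):
            key = token.lower()           # fold case, e.g. "Saya" -> "saya"
            ranks = master.setdefault(key, {})
            ranks.setdefault(name, rank)  # keep only the first (best) rank
    return master


# Made-up miniature corpora; the real ones hold hundreds of thousands of tokens.
sample = {
    "OS": ["kau", "saya", "jual"],
    "LCC": ["saya", "Saya", "dijual", "SAYA"],
    "AKD": ["rupiah", "saya", "jaksa"],
}
master = merge_corpora(sample, {"OS": None, "LCC": 3, "AKD": 2})
print(len(master))       # 5
print(master["saya"])    # {'OS': 2, 'LCC': 1, 'AKD': 2}
```

With the real data, the limits would be all of OS, the top 300,000 of LCC and the top 50,000 of AKD.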

Let’s look at how many of the top 100K most frequent words in each of the three corpora have been checked so far.

| Top | Checked (OS) | Meaningful (OS) | Checked (LCC) | Meaningful (LCC) | Checked (AKD) | Meaningful (AKD) |
|---|---|---|---|---|---|---|
| 1K | 1,000 (100.0%) | 846 (84.6%) | 1,000 (100.0%) | 843 (84.3%) | 1,000 (100.0%) | 930 (93.0%) |
| 3K | 3,000 (100.0%) | 2,241 (74.7%) | 3,000 (100.0%) | 2,331 (77.7%) | 3,000 (100.0%) | 2,620 (87.3%) |
| 5K | 5,000 (100.0%) | 3,490 (69.8%) | 5,000 (100.0%) | 3,655 (73.1%) | 5,000 (100.0%) | 4,061 (81.2%) |
| 10K | 10,000 (100.0%) | 6,115 (61.2%) | 10,000 (100.0%) | 6,470 (64.7%) | 10,000 (100.0%) | 7,263 (72.6%) |
| 20K | 19,138 (95.7%) | 10,109 (50.5%) | 20,000 (100.0%) | 11,080 (55.4%) | 18,116 (90.6%) | 12,030 (60.2%) |
| 30K | 27,251 (90.2%) | 12,890 (43.0%) | 29,048 (96.8%) | 14,287 (47.6%) | 24,652 (82.2%) | 15,310 (51.0%) |
| 50K | 41,551 (83.1%) | 16,654 (33.3%) | 44,750 (89.5%) | 18,574 (37.1%) | 35,207 (70.4%) | 19,470 (38.9%) |
| 100K | 71,561 (71.6%) | 22,111 (22.1%) | 79,698 (79.7%) | 24,553 (24.6%) | 64,984 (65.0%) | 24,404 (24.4%) |
Progress report (top 100K only; as of Mar. 18, 2024)

As explained in another Q/A section, the rankings by the three corpora are not so reliable. As a rough reference, however, you may want to regard LCC top 30K and LCC top 50-60K as Lower B2 and Upper B2 of the CEFR scale, respectively.

For example, approx. 5.3K (10.5%) of the LCC top 50K tokens haven’t been checked yet. Once they are all checked, the number of meaningful lemmas in the top 50K will be somewhere between 18.6K (37.1%, if none of the unchecked tokens is meaningful) and 23.8K (47.6%, if all of them are). Future updates may therefore add meaningful lemmas, though the vast majority of unchecked tokens are presumably garbage or loan words from English.
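
The lower and upper bounds quoted for the LCC top 50K follow directly from the progress report. The short calculation below reproduces them; the variable names are illustrative only.

```python
top_n = 50_000
checked = 44_750      # LCC top 50K tokens already checked
meaningful = 18_574   # of those, judged meaningful

unchecked = top_n - checked
lower = meaningful               # if every unchecked token turns out to be garbage
upper = meaningful + unchecked   # if every unchecked token turns out to be meaningful

print(unchecked, f"{unchecked / top_n:.1%}")   # 5250 10.5%
print(lower, f"{lower / top_n:.1%}")           # 18574 37.1%
print(upper, f"{upper / top_n:.1%}")           # 23824 47.6%
```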


File 2. Root

The PBWL “Root” file excludes garbage tokens from the “Master” file, picks out 8,100+ meaningful root words (equivalent to 27K+ lemmas under the KBBI counting method), and categorizes them by CEFR level. Each root word is listed with up to three of its most common lemmas.

This file is prepared for learners who would like:

  • to memorize important lemmas sharing the same root word at once (i.e. the bulk memorizing method),
  • to properly gauge their vocab size and assess their current proficiency level (CEFR level),
  • to identify which root words are not taught by Duolingo and Clozemaster (i.e. words to learn additionally on their own), and
  • not to be bothered by duplicated word tokens and garbage.

💡 Check out the average number of root words required for each CEFR level here.


File 3. Acronym

Acronyms are kept separate from the “Root” Excel file because most of them are compounded from multiple words and are therefore hard to group under a single root word, and because acronyms have little effect on your vocab size assessment.

Examples of acronyms:

  • SD (Sekolah Dasar): Elementary School
  • Kemenkeu (= Kementerian Keuangan): Ministry of Finance
  • Menkeu (= Menteri Keuangan): Minister of Finance

Some of the acronyms are less frequently used. You can filter out such acronyms by using rankings by LCC and AKD.

Some acronyms have more than one meaning, like this:

  • PBB = 1) “perserikatan bangsa-bangsa” (the United Nations), 2) “pajak bumi dan bangunan” (real estate tax), or 3) “peraturan baris-berbaris” (regulation on marching/parades)

File 4. Country

The “Country” file lists names of countries and their currencies, as well as regions, ethnic groups and languages that span multiple countries. These words are also kept separate from the “Root” Excel file because they should not be prioritized by CEFR level and because most of them are borrowed from English names, so they have little effect on your vocab size assessment.

Please note that names of cities and local ethnic groups within a single country are regarded as “unique nouns”. For example,

  • Selandia Baru (New Zealand) ==> in scope of the Country file
  • Yen (Japanese Yen) ==> in scope of the Country file
  • Karibia (the Caribbean including the Caribbean Sea and its islands) ==> in scope of the Country file because higher than a country level
  • Jawa (Java as an island or Javanese as a local ethnic group and language) ==> out of scope because lower than a country level
  • Yerusalem (Jerusalem as the holy capital city for Palestinians and Israelis) ==> out of scope because lower than a country level

Although “Jawa” and “Yerusalem” are excluded from the “Country” file, you can still find them marked as “unique nouns” in the “Master” file.


File 5. Lemma

Although Pulau Bahasa highly recommends that all learners memorize words sharing the same root word at once, some of you may dislike the “bulk” method and prefer to follow the LCC rankings without grouping by root word. If so, download this “Lemma” file instead of the “Root” file.

Here is an illustrative example of lemmatization (the root word = “jual”, meaning: “to sell”).

| Lemmas | Word Tokens | LCC Rank & Frequency |
|---|---|---|
| menjual (to sell) | dijual (lower case) | 853rd (189,746 times) |
| | Dijual (upper case) | 1,660th (99,575 times) |
| | menjual | 1,256th (130,584 times) |
| | Menjual | 20,706th (3,876 times) |
| | … and other forms | |
| penjual (seller) | penjual | 3,077th (49,301 times) |
| | Penjual | 22,147th (3,500 times) |
| | penjualnya | 32,810th (1,948 times) |
| | penjual-penjual | 197,066th (95 times) |
| | … and other forms | |
| jual (root) | jual | 934th (175,808 times) |
| | Jual | 2,917th (52,758 times) |
| | juallah | 146,586th (155 times) |
| | … and other forms | |

As the table shows, the PBWL “Lemma” file uses the same lemmatization method as KBBI, which merges passive forms (di- verbs) etc. into the active forms (me- verbs). Under the IndoDic lemmatization method, passive and active forms are registered as different lemmas. ➡ Learn more about the differences among the four lemmatization methods.
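
To make the KBBI-style grouping concrete, here is a minimal sketch that folds case and sums the LCC frequencies of the tokens listed in the example table. The lookup table `LEMMA_OF` is a hand-made assumption for this example, not a real lemmatizer, and the lemma for “seller” is written as “penjual”. Because the “… and other forms” rows are left out, the totals are partial.

```python
from collections import defaultdict

# (token, LCC frequency) pairs copied from the example table.
tokens = [
    ("dijual", 189_746), ("Dijual", 99_575),
    ("menjual", 130_584), ("Menjual", 3_876),
    ("penjual", 49_301), ("Penjual", 3_500),
    ("penjualnya", 1_948), ("penjual-penjual", 95),
    ("jual", 175_808), ("Jual", 52_758), ("juallah", 155),
]

# Hypothetical hand-made lookup: lower-cased token -> lemma (KBBI style,
# so the passive "dijual" is filed under the active "menjual").
LEMMA_OF = {
    "dijual": "menjual", "menjual": "menjual",
    "penjual": "penjual", "penjualnya": "penjual",
    "penjual-penjual": "penjual",
    "jual": "jual", "juallah": "jual",
}

totals = defaultdict(int)
for token, freq in tokens:
    totals[LEMMA_OF[token.lower()]] += freq

for lemma, freq in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(lemma, freq)
# menjual 423781
# jual 228721
# penjual 54844
```

Under the IndoDic method, “dijual” and “menjual” would map to two separate lemmas instead of one.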

The (5) Lemma file contains acronyms and country names, etc. Once you download the (5) Lemma file, you do not need to additionally download the (3) Acronym and the (4) Country files.


File 6. Unchecked

As mentioned above, 20.3% of the LCC top 100K word tokens in the “Master” file have not been checked yet. The “Unchecked” file simply lists them for advanced learners who wish to additionally pick up meaningful words on their own.

Please bear in mind, however, that the vast majority of tokens in the Unchecked file are presumably garbage or loan words from English.


Q2. Frequency Rankings of Other Lists

You might assume that word frequency rankings by other famous “traditional” lists (OS, LCC, and AKD) are in line with the CEFR levels — e.g., Top 1,000 words on those lists are A1 words. That is absolutely untrue.

According to OS, “public prosecutor” is a more important word than “October” and “carrot”. LCC and AKD suggest that you should give a higher priority to “social welfare” over “fork” and “dictionary”. They don’t make any sense.

| Indonesian Words and Rankings | OS | LCC | AKD | PBWL |
|---|---|---|---|---|
| kau (casual form of “you”; sounds rude in some areas in Indonesia) | 2 | 1,391 | 7,762 | Lower A2 |
| Nona (Miss) | 491 | 26,212 | 24,769 | Lower A1 |
| jaksa (public prosecutor; district attorney) | 2,122 | 3,822 | 429 | Upper B1 |
| berlian (diamond) | 2,386 | 13,916 | 5,934 | Upper A2 |
| Oktober (October) | 4,468 | 780 | 334 | Lower A1 |
| koboi (cowboy) | 4,532 | 42,274 | 14,567 | Upper B1 |
| garpu (fork) | 7,039 | 29,071 | 30,629 | Lower A1 |
| wortel (carrot) | 8,434 | 13,965 | 16,081 | Lower A1 |
| kancing (button of clothes) | 8,469 | 21,081 | 26,908 | Upper A1 |
| kesejahteraan (social welfare; prosperity) | 9,249 | 1,409 | 2,074 | Upper B1 |
| kamus (dictionary) | 11,943 | 10,120 | 16,375 | Upper A1 |
| dahi (forehead) | 13,989 | 15,584 | 15,755 | Upper A1 |
| rupiah (Indonesian currency) | 49,070 | 1,069 | 627 | Upper A1 |
| Lebaran (Islamic holiday at the end of the fasting month) | 53,981 | 3,372 | 777 | Upper A1 |

In fact, the OS/LCC/AKD rankings are not ordered by language proficiency level. Their authors simply crawled the Internet automatically, split the text into word tokens at spaces and counted the number of occurrences of each token. People on the Internet talk more about criminal cases and public prosecutors than about carrots. That is why their rankings are so skewed.

  • OS = OpenSubtitles: The earliest (and probably the most famous) list introduced by Wiktionary. The language learning app Clozemaster refers to the OS list. OS is a community platform where users post subtitles of famous films and TV dramas (mostly from the US/UK, Japanese anime and Bollywood). The OS list is built from all of those subtitles. As a result, words related to Indonesian local or Islamic culture are often undervalued, while slang, colloquial expressions and terms from criminal cases are overvalued.
  • LCC = the University of Leipzig Corpora Collection: The second list introduced by Wiktionary, in March 2023. Among the three frequency lists, LCC is probably the most balanced one. However, tokens in LCC are case-sensitive, so the LCC ranks are diluted. For example, “saya” (meaning: “I/me”) in lower case ranks 31st, “Saya” with an initial capital ranks 108th and “SAYA” in all upper case ranks 42,182nd. The other two lists sum up the three tokens.
  • AKD = Apa Kata Dunia: The author of the AKD list crawled online Indonesian news articles only. Formal written expressions as well as soccer terms are overvalued. On the other hand, common typos are less overvalued in AKD than in the other two lists because the articles crawled by AKD are more likely to have been professionally proofread.
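
The counting approach described above, and the dilution caused by case sensitivity, can be illustrated with a toy example. This is a simplified sketch and not the actual pipeline of OS, LCC or AKD; the function name and the sample sentence are made up.

```python
from collections import Counter


def count_tokens(text: str, case_sensitive: bool) -> Counter:
    """Split on whitespace and count every token, with no further cleaning."""
    if not case_sensitive:
        text = text.lower()
    return Counter(text.split())


sample = "Saya jual apel . saya beli apel . SAYA suka apel"

cs = count_tokens(sample, case_sensitive=True)
ci = count_tokens(sample, case_sensitive=False)

print(cs["saya"], cs["Saya"], cs["SAYA"])   # 1 1 1
print(ci["saya"])                           # 3
print(cs["."])                              # 2
```

In the case-sensitive count, the three spellings of “saya” compete as separate tokens, so each one ranks lower than the combined word would. The stray “.” also shows how naive splitting lets non-words into the list.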

Loosely referring to these three lists, PBWL manually corrects such distortions and reorganizes the words by CEFR level. The CEFR uses a six-grade scale: A1, A2 (beginner), B1, B2 (intermediate), C1 and C2 (fluent). PBWL, on the other hand, splits each of A1 to B2 into “Lower” and “Upper” (e.g., Lower A1, Upper A1, Lower A2…), while C1 and C2 are merged into one group.


Q3. Vocab Size and Proficiency Level

Linguists commonly count “root words” (the “word-family basis”) to assess one’s language skills, which allows an apples-to-apples comparison across languages. Here are the average numbers of root words required for each of the CEFR levels.

| CEFR | Indonesian (avg and range) | English | Spanish | French | German | Russian |
|---|---|---|---|---|---|---|
| A1 | ~800 (100 – 1,100) | 785 | 1,146 | 975 | 650 | n.a. |
| A2 | ~2,000 (1,101 – 2,300) | 2,382 | 2,730 | 1,645 | 1,300 | n.a. |
| B1 | ~4,000 (2,301 – 4,400) | 5,327 | 6,066 | 3,388 | 3,500 | 2,300 |
| B2 | ~7,000 (4,401 – 7,700) | 9,502 | 11,830 | 6,407 | 6,053 | 10,000 |
| C1 | ~9,000 (7,701 -) | 11,908 | 14,910 | n.a. | n.a. | 12,000 |
| C2 | n.a. | 15,715 | 23,343 | n.a. | n.a. | 20,000 |

Numbers of root words required for each of the CEFR levels
Source (except Indonesian): "Reading proficiency and vocabulary size: An empirical investigation" - Table 4.1. on Page 62, Tschirner et al. (2018)
Source (Indonesian): estimated by MsFixer (the author of Pulau Bahasa Word Lists (PBWL))
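
As a rough illustration, the Indonesian ranges in the table can be turned into a small lookup function. This is only a sketch based on the estimated ranges; the function name `estimate_cefr` is hypothetical, and a raw count says nothing about which words you know, so treat the result as a ballpark figure.

```python
from typing import Optional

# Upper bounds of the Indonesian ranges in the table (estimates by the
# PBWL author). C1 has no upper bound in the table.
UPPER_BOUNDS = [("A1", 1_100), ("A2", 2_300), ("B1", 4_400), ("B2", 7_700)]
A1_MINIMUM = 100


def estimate_cefr(root_words: int) -> Optional[str]:
    """Map a count of known root words to a CEFR level, or None below A1."""
    if root_words < A1_MINIMUM:
        return None
    for level, upper in UPPER_BOUNDS:
        if root_words <= upper:
            return level
    return "C1"


print(estimate_cefr(50))      # None
print(estimate_cefr(800))     # A1
print(estimate_cefr(4_401))   # B2
print(estimate_cefr(9_000))   # C1
```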

Your fellow learners may say “I learned 2,000 words from language course X” or “the dictionary Y has more than 50,000 lemmas.” Don’t forget, however, that Indonesian as an agglutinative language has a variety of derivative forms. How do you count words sharing the same root word, e.g., “baik” (good/kind), “kebaikan” (kindness), “kebaikanmu” (your kindness), “sebaik” (as good as), “terbaik” (best), “memperbaiki” (to make it better) and “diperbaiki” (made better)? That’s why the root word counting method is important.

Learn more about the lemmatization and counting methods.

Don’t be overwhelmed

According to the abovementioned academic paper by Tschirner et al. (2018) (see Pages 58 – 59), the average vocab size of native English speakers entering English-speaking colleges is 10K word families. This figure is equivalent to approx. 30K lexical items (or lemmas) in English. 8K word families are required for 98% coverage of nonacademic texts in English.

As a standard criterion, international (non-native) applicants to prestigious English-speaking colleges are required to demonstrate upper C1 by submitting their TOEFL or IELTS score. Therefore, non-native English speakers with over 8K word families may call themselves “fluent in English”, and those with over 11K may be “competitive”.

Good news! After reviewing over 35K lemmas in Indonesian, MsFixer (the author of PBWL) now assumes that Indonesian has fewer synonyms than English. More good news! Approx. 25-30% of Indonesian root words at B2 level or below are borrowed from or similar to English. So, don’t be overwhelmed by the numbers.


Q4. Myth of Top 3K & 95% Coverage

You might already have heard somewhere (example) that “top 3,000 words cover 95% of context.” As explained on this page, the term “3,000 words” is vague: it could be counted on the word-family basis or by the IndoDic method, and it’s more likely to be the latter. Moreover, what exactly does “95% coverage” mean?

According to the academic paper by Tschirner et al. (2018) (see Page 60), text coverage of 95% and of 98% are adequate for only 60% and 70% comprehension, respectively.

Here is another intuitive example. If your vocab size gives you text coverage of 95%, you will get stuck on one unfamiliar word in every two lines of a typical hardcover book, and will need to look up such words in a dictionary 10+ times per page. That’s not enjoyable at all. In summary,

3K words = text coverage of 95% = 60% comprehension without a dictionary
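
The “one unfamiliar word in every two lines” figure can be reproduced with simple arithmetic. The page layout numbers below (10 words per line, 25 lines per page) are assumptions chosen for illustration; they do not come from the source, and real books vary.

```python
coverage = 0.95
words_per_line = 10   # assumption, not from the source
lines_per_page = 25   # assumption, not from the source

unknown_share = 1 - coverage                         # 5% of running words
unknown_per_line = unknown_share * words_per_line    # about 0.5 per line
lines_per_unknown = 1 / unknown_per_line             # about 2 lines per unknown word
unknown_per_page = unknown_per_line * lines_per_page

print(round(lines_per_unknown, 1))   # 2.0
print(round(unknown_per_page, 1))    # 12.5
```

Under these assumptions, a reader at 95% coverage meets roughly a dozen unknown words on every page.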


Q5. Vocab Coverage by Duolingo and Clozemaster

Duolingo claims that

Our new courses now cover all the A1- and A2-level content, with about 800 words introduced at each level by following the CEFR framework.

Blog title: “How are Duolingo courses evolving?” (published on Apr. 3, 2019, written by Bozena Pajak)

Clozemaster says that

It aims to answer the question, “What should I do after Duolingo?” and provide a more sentence based and contextual learning experience to complement other language learning apps like Memrise and Anki.

“What is Clozemaster?” on the official FAQ page

Do Duolingo and Clozemaster really cover essential words? Or do they overstate? Here are the facts.

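| Vocab covered by | Lower A1 | Upper A1 | Lower A2 | Upper A2 | Lower B1 | Upper B1 |
|---|---|---|---|---|---|---|
| (1) Duolingo only | 96.0% | 69.4% | 36.0% | 15.1% | 5.7% | 2.0% |
| (2) Clozemaster only | 97.2% | 82.6% | 65.0% | 51.6% | 35.2% | 18.5% |
| (3) Duo + CM | 99.8% | 92.8% | 74.3% | 57.6% | 37.6% | 19.8% |
| (4) Duo + CM + Pulau Bahasa | 99.8% | 95.6% | 83.2% | 74.6% | 55.4% | 31.1% |
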
Coverage ratios of root words taught by the Duolingo and Clozemaster Indonesian courses as well as Pulau Bahasa Cloze Test (PBCT) as of Feb. 23, 2024

Note: “Covered” here means that Duolingo, Clozemaster or Pulau Bahasa Cloze Test (PBCT) teaches a root word or any of its derivative forms. For example, Duolingo teaches only “perawat” (a nurse) and “merawat” (to care for; look after). PBWL categorizes their root word “rawat” as Upper A1 and marks it as “covered” by Duolingo.
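
The definition of “covered” in the note can be expressed as a small set operation. The sketch below is only an illustration of that rule: the function name, the sample root list and the course vocabulary are hypothetical and are not the data behind the published table.

```python
from typing import Dict, Set


def coverage_ratio(roots: Dict[str, Set[str]], taught: Set[str]) -> float:
    """Share of root words for which the root or any derivative is taught.

    roots:  root word -> set of its derivative forms (lemmas).
    taught: every word a course teaches.
    """
    if not roots:
        return 0.0
    covered = sum(1 for root, forms in roots.items()
                  if root in taught or forms & taught)
    return covered / len(roots)


# Made-up sample; real PBWL levels contain hundreds of root words each.
sample_roots = {
    "rawat": {"perawat", "merawat"},
    "baik": {"kebaikan", "terbaik", "memperbaiki"},
    "jual": {"menjual", "penjual"},
}
course_vocab = {"perawat", "merawat", "baik"}   # hypothetical course vocabulary

print(f"{coverage_ratio(sample_roots, course_vocab):.1%}")   # 66.7%
```

Here “rawat” counts as covered through its derivatives and “baik” through the root itself, while “jual” is not covered at all.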

URL for citation: https://pulaubahasa.wordpress.com/vocab-builders/pbwl/#table-of-duo-cm-coverage

Pulau Bahasa concludes that:

The Duolingo Indonesian course alone may enable learners to reach upper A1.

The Clozemaster course combined with Duolingo may enable learners to reach upper A2.


In-depth analysis

There are two critical flaws in the official statement by Duolingo.

First, the Duolingo Indonesian course teaches you 1,832 words (lemmas), which are equivalent to 1,309 root words (grouped by word family and excluding unique nouns). As the table in Q3 shows, you need approx. 2,000 root words on average to reach A2. So, the course simply does not contain enough words.

Second, some of the words taught by Duolingo are at B1+ level.

Here are examples of words at A1 or A2 that Duolingo does NOT teach you but Clozemaster covers. The data shows that Clozemaster’s Indonesian course is a good “post-Duolingo” app to further expand your vocabulary.

Adjectives, adverbs and verbs

  • memang (really; indeed)
  • khawatir (worried; afraid)
  • bermanfaat (useful)
  • sopan (polite)
  • pandai (clever)
  • mabuk (drunk)
  • kira (think; guess)
  • lari (run)
  • tekan (push (a button, etc.))
  • curi (steal)
  • buang (dispose (rubbish, etc.))
  • baring (lie down)
  • rencana (plan)

Nouns

  • bibi (aunt)
  • rokok (cigarette)
  • tali (rope)
  • boneka (doll)
  • bangku (bench; seat)
  • prangko (postage stamp)
  • acara (program (of TV, etc.))

PBWL shows which words (lemmas) or their root words you can learn from Duolingo or Clozemaster.


Q6. Abbreviations and Jargon

See this page.


See this page.


Q8. Report Errors

See the FAQ page of the sister product “PBCT”.

