SIL Three-letter Codes for Identifying Languages

Overview

In the interest of fostering the uniform identification of all the world's languages in information systems, SIL International releases for public use the system of three-letter codes used in the Ethnologue. This set of codes offers a unique identifier for each of the more than 7,000 languages that are listed in the Ethnologue. Efforts that already use these codes as a standard for language identification include the Open Language Archives Community, the Linguist List, and the Rosetta Project.

Any application that makes use of these language identifiers is just one click away from access to the full language descriptions that are available in the Ethnologue. That is, for any language identifier XXX that may be stored in a database, an application may present a link to the following URL in order to give the user access to the Ethnologue's description of that language:

http://www.ethnologue.com/show_language.asp?code=XXX
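As a minimal sketch of how an application might build that link, the following Python helper assembles the URL from a stored identifier. The function name language_report_url and the input check are illustrative assumptions; only the address itself comes from this document.

```python
from urllib.parse import urlencode

# Address taken from the text above.
LANGUAGE_REPORT_URL = "http://www.ethnologue.com/show_language.asp"


def language_report_url(code: str) -> str:
    """Return the Ethnologue report URL for a three-letter identifier."""
    # Reject anything that is not exactly three ASCII letters.
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"expected a three-letter code, got {code!r}")
    return f"{LANGUAGE_REPORT_URL}?{urlencode({'code': code})}"


print(language_report_url("AAA"))
# prints http://www.ethnologue.com/show_language.asp?code=AAA
```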

The most widely used standard for identifying languages in Internet documents (such as in HTTP headers, in HTML metadata, or in the XML lang attribute) is RFC 3066. In that standard, a three-letter identifier is interpreted as a code from the ISO 639 standard. That code set has codes for only 381 individual languages. (See elsewhere on this site for the mapping between SIL and ISO language codes.) RFC 3066 offers an extension mechanism of tags beginning with x- to handle custom codes for languages not covered in the standard. We recommend the following form for constructing an RFC 3066-compliant language tag from the SIL language code XXX:

x-sil-XXX
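The recommended form can be produced mechanically. The sketch below is one possible helper; the name sil_language_tag is hypothetical, and the code is passed through in whatever letter case it is stored in, since RFC 3066 treats tags as case-insensitive.

```python
def sil_language_tag(code: str) -> str:
    """Build a private-use language tag of the form x-sil-XXX."""
    # A subtag made of exactly three ASCII letters is expected here.
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"expected a three-letter code, got {code!r}")
    return f"x-sil-{code}"


print(sil_language_tag("AAA"))  # prints x-sil-AAA
```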

The remainder of this document, after describing the terms of use for the code tables, explains the structure of the code tables, gives some hints on how they might be used, and offers links for downloading them.

Terms of use

You are welcome to download the complete code set as provided below and incorporate the supplied tables into your own database application on condition that you do so in accordance with our Terms of Use statement.

Structure of the code tables

Four files make up the package of data tables that SIL International releases in support of its standard for language identifiers. They are tab-delimited files in which each line represents one row of a database table. The characters are encoded in the 8-bit standard known as ISO 8859-1 (which is a subset of the default Windows code page 1252).

LanguageCodes.tab  The complete list of three-letter language identifiers from the 14th edition of the Ethnologue (along with name, primary country, and language status).
CountryCodes.tab The list of two-letter country codes that are used in the main language code table.
LanguageIndex.tab An index for finding languages by country and by all known names (including primary name, alternate names, and dialect names).
ChangeHistory.tab A record of the history of changes to the language identifiers.

The following declarations provide the formal definitions for SQL data tables into which the tab-delimited files can be loaded:

CREATE TABLE LanguageCodes (
   LangID      char(3) NOT NULL,        -- Three-letter code
   CountryID   char(2) NOT NULL,        -- Main country where used 
   LangStatus  char(1) NOT NULL,        -- L(iving), N(early extinct),
                                        -- E(xtinct)
   Name        varchar(75) NOT NULL)    -- Primary name in that country 
 
CREATE TABLE CountryCodes (
   CountryID   char(2) NOT NULL,        -- Two-letter code from ISO 3166
   Name        varchar(75) NOT NULL)    -- Country name

CREATE TABLE LanguageIndex (
   LangID      char(3) NOT NULL,        -- Three-letter code for language
   CountryID   char(2) NOT NULL,        -- Country where this name is used
   NameType    char(2) NOT NULL,        -- L(anguage), LA(lternate),
                                        -- D(ialect), DA(lternate)
   Name        varchar(75) NOT NULL)    -- The name

CREATE TABLE ChangeHistory (
   LangID      char(3) NOT NULL,        -- The code that has changed
   Date        smalldatetime NOT NULL,  -- Date change was released
   Action      char(1) NOT NULL,        -- C(reated), E(xtended),
                                        -- U(pdated), R(etired)
   Description varchar(200) NOT NULL)   -- Description of change
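
As a rough sketch of loading the files into such tables, the Python code below uses the standard csv and sqlite3 modules. SQLite is an assumption made for illustration and is not named in this document; it accepts the type names used in the declarations. The helper load_table is hypothetical, and it assumes that the files have no header row and hold one record per line.

```python
import csv
import sqlite3

# Number of columns per table, following the table declarations.
TABLE_COLUMNS = {
    "LanguageCodes": 4,
    "CountryCodes": 2,
    "LanguageIndex": 4,
    "ChangeHistory": 4,
}


def load_table(conn, table, lines):
    """Insert tab-delimited lines into an existing table; return the row count.

    Blank lines are skipped; any other line with the wrong number of
    fields raises ValueError.
    """
    ncols = TABLE_COLUMNS[table]  # KeyError for an unknown table name
    rows = []
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        if len(row) != ncols:
            raise ValueError(f"{table}: expected {ncols} fields, got {row!r}")
        rows.append(row)
    placeholders = ", ".join(["?"] * ncols)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CountryCodes ("
    " CountryID char(2) NOT NULL,"
    " Name varchar(75) NOT NULL)"
)
# Two lines in the layout of CountryCodes.tab, standing in for the file.
sample = ["AD\tAndorra\n", "AE\tUnited Arab Emirates\n"]
loaded = load_table(conn, "CountryCodes", sample)
```

With a downloaded file, the same call would take a file object opened with the encoding named above, for example open("CountryCodes.tab", encoding="iso-8859-1", newline="").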

Using the code tables

LanguageCodes.tab lists the 7,148 distinct language identifiers used in the 14th edition of the Ethnologue. Of these, 308 represent extinct languages, 406 are nearly extinct, and the remaining 6,434 are listed with "living" status. The following shows the entries for the first six language identifiers:

LangID CountryID LangStatus Name                                                                        
------ --------- ---------- ------------- 
AAA    NG        L          Ghotuo
AAB    NG        L          Arum-tesu
AAC    PG        L          Ari
AAD    PG        L          Amal
AAE    IT        L          Albanian, Arbëreshë
AAF    IN        L          Aranadan

We see that AAA and AAB denote living languages spoken in Nigeria, AAC and AAD denote living languages spoken in Papua New Guinea, and so on. When a language is spoken in more than one country, CountryID gives the country that is considered primary, usually the country of origin or the country where most of the speakers are located.

CountryCodes.tab lists the two-letter identifier and name for 220 countries of the world. The codes are from the international standard known as ISO 3166-1 (Codes for the representation of names of countries and their subdivisions, Part 1: Country codes. Geneva: International Organization for Standardization, 1997. http://www.din.de/gremien/nas/nabd/iso3166ma/). The following shows the entries for the first five codes in the list:

CountryID Name                                               
--------- --------------------- 
AD        Andorra
AE        United Arab Emirates
AF        Afghanistan
AG        Antigua and Barbuda
AI        Anguilla

The CountryCodes.tab table would be used to narrow the search for an identifier to a particular country. The user would choose a country from the country list in order to select the appropriate country code. That code would then be used in a SQL query to restrict the language identifier list to just entries for that country. For instance, if the user were interested only in Afghanistan, the following SQL query would return just the table rows for that country:

SELECT * FROM LanguageCodes WHERE CountryID='AF'

Alternatively, the following link to the Ethnologue web site could be used to generate a report listing all the languages for Afghanistan:

http://www.ethnologue.com/show_country.asp?code=AF

LanguageIndex.tab documents 37,420 distinct names used for the 7,148 languages. The entries in this index of names indicate in which country each name is used. The table thus contains 46,416 records since many of the names are used in more than one country and some are used with more than one language or dialect. The following shows the entries in the name index for the first three language identifiers:

LangID CountryID NameType Name                                                                        
------ --------- -------- ------------- 
AAA    NG        L        Ghotuo
AAA    NG        LA       Otuo
AAA    NG        LA       Otwa
AAB    NG        LA       Alumu
AAB    NG        D        Arum
AAB    NG        LA       Arum-cesu
AAB    NG        LA       Arum-chessu
AAB    NG        L        Arum-tesu
AAB    NG        D        Tesu
AAC    PG        L        Ari

We see that AAA has two alternate names in addition to the primary name of Ghotuo. AAB has three alternate names and two dialect names in addition to its primary name. AAC has just one name.

The LanguageIndex.tab table would be used to implement a search by name. For instance, the following query would return the three-letter codes for all the languages that use the name xyz:

SELECT DISTINCT LangID FROM LanguageIndex
WHERE Name='xyz'

Note that DISTINCT is used since the same language could be known by the same name in multiple countries. To allow the user to verify that a proposed identifier is indeed the right one, the software would offer the following link to the Ethnologue web site to see a report giving detailed information about the selected language (where XXX is the proposed three-letter identifier):

http://www.ethnologue.com/show_language.asp?code=XXX
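
The name search described above might be wrapped as follows. This is a sketch that assumes SQLite through Python's sqlite3 module; the helper name codes_for_name is hypothetical. The rows are copied from the sample listing of LanguageIndex.tab, and the name is passed as a bound parameter rather than pasted into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LanguageIndex ("
    " LangID char(3) NOT NULL, CountryID char(2) NOT NULL,"
    " NameType char(2) NOT NULL, Name varchar(75) NOT NULL)"
)
# A few rows from the sample listing of LanguageIndex.tab.
conn.executemany(
    "INSERT INTO LanguageIndex VALUES (?, ?, ?, ?)",
    [
        ("AAA", "NG", "L", "Ghotuo"),
        ("AAA", "NG", "LA", "Otuo"),
        ("AAB", "NG", "D", "Arum"),
        ("AAB", "NG", "L", "Arum-tesu"),
        ("AAC", "PG", "L", "Ari"),
    ],
)


def codes_for_name(conn, name):
    """Return the sorted language codes indexed under exactly this name."""
    cursor = conn.execute(
        "SELECT DISTINCT LangID FROM LanguageIndex "
        "WHERE Name = ? ORDER BY LangID",
        (name,),
    )
    return [lang_id for (lang_id,) in cursor]


print(codes_for_name(conn, "Otuo"))  # prints ['AAA']
```

The match is exact, so a dialect name such as Arum finds AAB while Arum-tesu is a separate entry for the same code.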

Another application of the LanguageIndex.tab table is to find all the countries in which a given language is spoken. For instance, the following query returns the names of all the countries in which language XXX is spoken:

SELECT DISTINCT C.Name FROM CountryCodes AS C
JOIN LanguageIndex AS L ON C.CountryID=L.CountryID
WHERE L.LangID='XXX'

In this case DISTINCT must be used since a language could have multiple names in a given country.
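
The effect of DISTINCT in this join can be seen with the sample rows shown earlier. The sketch below assumes SQLite through Python's sqlite3 module; the helper name countries_for_language is hypothetical, and the country name for NG follows the discussion of the LanguageCodes.tab sample.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE CountryCodes (
        CountryID char(2) NOT NULL, Name varchar(75) NOT NULL);
    CREATE TABLE LanguageIndex (
        LangID char(3) NOT NULL, CountryID char(2) NOT NULL,
        NameType char(2) NOT NULL, Name varchar(75) NOT NULL);
    """
)
conn.executemany(
    "INSERT INTO CountryCodes VALUES (?, ?)",
    [("NG", "Nigeria"), ("PG", "Papua New Guinea")],
)
# The three index rows listed for AAA in the sample above.
conn.executemany(
    "INSERT INTO LanguageIndex VALUES (?, ?, ?, ?)",
    [
        ("AAA", "NG", "L", "Ghotuo"),
        ("AAA", "NG", "LA", "Otuo"),
        ("AAA", "NG", "LA", "Otwa"),
    ],
)

JOIN_CLAUSE = (
    "FROM CountryCodes AS C "
    "JOIN LanguageIndex AS L ON C.CountryID = L.CountryID "
    "WHERE L.LangID = ?"
)


def countries_for_language(conn, lang_id):
    """Return the sorted names of the countries where the language is indexed."""
    cursor = conn.execute(
        f"SELECT DISTINCT C.Name {JOIN_CLAUSE} ORDER BY C.Name", (lang_id,)
    )
    return [name for (name,) in cursor]


# Without DISTINCT the join yields one row per name, not one per country.
rows_without_distinct = conn.execute(
    f"SELECT COUNT(*) {JOIN_CLAUSE}", ("AAA",)
).fetchone()[0]

print(countries_for_language(conn, "AAA"))  # prints ['Nigeria']
print(rows_without_distinct)                # prints 3
```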

Finally, the LanguageIndex.tab table can be used to learn all the languages spoken in a particular country. Whereas the query illustrated previously retrieves all languages whose primary country is Afghanistan, the following query retrieves all languages spoken in Afghanistan:

SELECT DISTINCT LangID FROM LanguageIndex
WHERE CountryID='AF'
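
The difference between the two queries can be made concrete by subtracting one result from the other. The sketch below assumes SQLite through Python's sqlite3 module. The rows are invented for illustration only: they use codes from the local-use range and made-up names, and the helper name secondary_languages is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE LanguageCodes (
        LangID char(3) NOT NULL, CountryID char(2) NOT NULL,
        LangStatus char(1) NOT NULL, Name varchar(75) NOT NULL);
    CREATE TABLE LanguageIndex (
        LangID char(3) NOT NULL, CountryID char(2) NOT NULL,
        NameType char(2) NOT NULL, Name varchar(75) NOT NULL);
    """
)
# Invented rows: QVA has AF as its primary country, while QVB has AD as
# its primary country but is also indexed under AF.
conn.executemany(
    "INSERT INTO LanguageCodes VALUES (?, ?, ?, ?)",
    [("QVA", "AF", "L", "Example One"), ("QVB", "AD", "L", "Example Two")],
)
conn.executemany(
    "INSERT INTO LanguageIndex VALUES (?, ?, ?, ?)",
    [
        ("QVA", "AF", "L", "Example One"),
        ("QVB", "AD", "L", "Example Two"),
        ("QVB", "AF", "L", "Example Two"),
    ],
)


def secondary_languages(conn, country_id):
    """Return codes indexed under a country that is not their primary country."""
    cursor = conn.execute(
        "SELECT LangID FROM LanguageIndex WHERE CountryID = ? "
        "EXCEPT "
        "SELECT LangID FROM LanguageCodes WHERE CountryID = ? "
        "ORDER BY LangID",
        (country_id, country_id),
    )
    return [lang_id for (lang_id,) in cursor]


print(secondary_languages(conn, "AF"))  # prints ['QVB']
```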

Change management

A new edition of the Ethnologue (both in print and on the Web) is published approximately every four years. (The 14th edition was published in 2000; the 15th edition is scheduled for publication in 2004.) Between editions, editorial work is ongoing and changes affecting the code tables can be made at any time. The Web edition of the Ethnologue will not reflect these changes until the next edition appears. However, beginning in 2002, there will be a twice-yearly release of an updated language code table that incorporates all the changes that have been decided on for the next edition. Users who want to keep up to date with the latest version of the code set may download these updated code tables from this page. (Note, however, that LanguageIndex.tab will not be updated until the 15th edition is produced.)

The fourth table in the set, ChangeHistory.tab, documents all the changes that have been made in successive versions of LanguageCodes.tab. The first column of ChangeHistory.tab specifies the three-letter code that has undergone a change. The second column gives the date the change was released for download in a revised LanguageCodes.tab. The third column identifies the kind of change that was made; four kinds of changes are tracked by means of a one-letter code:

Code Action   Meaning
---- -------- -------
C    Created  The code is a new one that has been added.
E    Extended The range of meaning of the code has been extended by merger with a now retired code. The description field tells what it was merged with.
R    Retired  The code has been retired from use. The description field tells what code or codes replace it.
U    Updated  The name, primary country, or status of the language has been changed.

Here are some sample rows from ChangeHistory.tab:

LangID Date       Action Description
------ ---------- ------ ----------------------------------------
AOX    2002-01-31 C      Add ATORADA, Guyana, living
APR    2002-01-31 E      Includes [LOA] which was retired
LOA    2002-01-31 R      Merge with [APR]; change all [LOA] to [APR]
AWG    2002-01-31 R      Same as [WMI]; change all [AWG] to [WMI]
CKN    2002-01-31 R      Unable to verify existence; delete from database
AAS    2002-01-31 U      Change from extinct to living
BCJ    2002-01-31 U      Change name from BAADI to BARDI

Note that there is not a change type for the case of narrowing the meaning of a code, such as when the language denoted by one code is split into two languages. In such a case, the original code is retired, and two new codes are added. In this way, the user of the code set is assured that once a code has been used to tag an item of data, it will continue to be the right code to use for as long as the code remains an active member of the code set.

ChangeHistory.tab is cumulative. Thus it can be queried to discover what changes have been made to LanguageCodes.tab since a given date. For instance, the following SQL query would be used to find out what changes have occurred since the beginning of 2002:

SELECT * FROM ChangeHistory WHERE Date >= '2002-01-01'
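
A query of this kind might be run from an application as sketched below. The sketch assumes SQLite through Python's sqlite3 module; since SQLite has no dedicated date type, the Date column is declared as text holding dates in YYYY-MM-DD form, which compare correctly as strings. The helper name changes_since is hypothetical, and the rows come from the sample listing of ChangeHistory.tab.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ChangeHistory ("
    " LangID char(3) NOT NULL, Date text NOT NULL,"  # text is an assumption
    " Action char(1) NOT NULL, Description varchar(200) NOT NULL)"
)
# Three rows from the sample listing of ChangeHistory.tab.
conn.executemany(
    "INSERT INTO ChangeHistory VALUES (?, ?, ?, ?)",
    [
        ("AOX", "2002-01-31", "C", "Add ATORADA, Guyana, living"),
        ("APR", "2002-01-31", "E", "Includes [LOA] which was retired"),
        ("LOA", "2002-01-31", "R", "Merge with [APR]; change all [LOA] to [APR]"),
    ],
)


def changes_since(conn, cutoff):
    """Return (LangID, Action) pairs released on or after the cutoff date."""
    cursor = conn.execute(
        "SELECT LangID, Action FROM ChangeHistory "
        "WHERE Date >= ? ORDER BY LangID",
        (cutoff,),
    )
    return cursor.fetchall()


print(changes_since(conn, "2002-01-01"))
# prints [('AOX', 'C'), ('APR', 'E'), ('LOA', 'R')]
print(changes_since(conn, "2002-02-01"))  # prints []
```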

For sites that have used the SIL three-letter language codes to identify languages in their own database applications, an important use of ChangeHistory.tab is to discover codes used in their data that are now obsolete and thus need to be changed. These will be only the codes that have been retired. Thus a full list of all data records needing to be changed can be found by doing a JOIN on the change history table. For instance, if the column named code in MyTable holds an SIL language code, then the following SQL statement will select all records that need to be changed due to changes to the code set since the beginning of 2002:

SELECT * FROM MyTable AS M
JOIN ChangeHistory AS C ON M.code=C.LangID
WHERE C.Action='R' AND C.Date >= '2002-01-01'

Note that the Description field of the joined result set will describe what needs to be done to bring the language code up-to-date.
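
As a sketch of how such a check might be automated, the Python code below runs the join and then tries to read a replacement code out of the Description text. It assumes SQLite through the sqlite3 module and a text Date column. The pattern "change all [OLD] to [NEW]" is an assumption based only on the sample rows shown above, so a description that does not match is reported with no replacement and needs a human decision. The MyTable rows and the helper name retired_codes_in_use are hypothetical.

```python
import re
import sqlite3

# Assumed wording, seen only in the sample rows of ChangeHistory.tab.
REPLACEMENT = re.compile(r"change all \[([A-Z]{3})\] to \[([A-Z]{3})\]")

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ChangeHistory (
        LangID char(3) NOT NULL, Date text NOT NULL,
        Action char(1) NOT NULL, Description varchar(200) NOT NULL);
    CREATE TABLE MyTable (id integer PRIMARY KEY, code char(3) NOT NULL);
    """
)
conn.executemany(
    "INSERT INTO ChangeHistory VALUES (?, ?, ?, ?)",
    [
        ("APR", "2002-01-31", "E", "Includes [LOA] which was retired"),
        ("LOA", "2002-01-31", "R", "Merge with [APR]; change all [LOA] to [APR]"),
        ("CKN", "2002-01-31", "R", "Unable to verify existence; delete from database"),
    ],
)
# Hypothetical application records tagged with SIL codes.
conn.executemany(
    "INSERT INTO MyTable (code) VALUES (?)",
    [("LOA",), ("CKN",), ("APR",), ("LOA",)],
)


def retired_codes_in_use(conn, since):
    """Map each retired code found in MyTable to its replacement, or None."""
    rows = conn.execute(
        "SELECT DISTINCT C.LangID, C.Description "
        "FROM MyTable AS M "
        "JOIN ChangeHistory AS C ON M.code = C.LangID "
        "WHERE C.Action = 'R' AND C.Date >= ? "
        "ORDER BY C.LangID",
        (since,),
    ).fetchall()
    result = {}
    for lang_id, description in rows:
        match = REPLACEMENT.search(description)
        # Accept the replacement only if the pattern names this very code.
        if match and match.group(1) == lang_id:
            result[lang_id] = match.group(2)
        else:
            result[lang_id] = None
    return result


print(retired_codes_in_use(conn, "2002-01-01"))
# prints {'CKN': None, 'LOA': 'APR'}
```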

Giving feedback

The Ethnologue is a work in progress; our knowledge of the world’s languages is always incomplete and subject to improvement. Many people who use the Ethnologue can give feedback that will make it better and SIL International has always valued this kind of input. Users may have more accurate information on details like locations or names or population figures or language development status. Or they may be able to provide information that would lead to a change to the set of language identifiers. For instance, they may be able to show that what is treated as one language is really two, or vice versa, or that a listed language does not exist or that an existing language is not listed.

If you believe any of the information in the Ethnologue is in error, send your proposed change by e-mail to the Ethnologue editor. Be sure to report the source of your information. When you want to request that a language be added to the code list because you believe it to be missing altogether, use the questionnaires supplied on this web site to submit a basic profile of the language.

Before a proposed change is accepted, it must meet two requirements: it needs to be in keeping with the criteria for language identification used in the Ethnologue, and the facts that lie behind the proposed change need to be verified. The verification process may take months as it generally involves making enquiries of individuals who are resident in the country where the language is spoken. These persons may in turn make enquiries of others in order to perform the verification. The submitter can expect to receive an acknowledgment from the Ethnologue editor.

The three-letter codes in the range QVA to QZZ are reserved for local use. That is, they will never be assigned by SIL International as language identifiers. Thus, when users feel that a needed code is missing from the code set, they may freely use one of these local use codes as a temporary measure until the outcome of a change request is known.
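
An application that accepts both assigned and local-use codes may want to tell them apart. The following Python check is a sketch; the function name is_local_use_code is hypothetical, and treating lowercase input as equivalent to uppercase is an assumption.

```python
def is_local_use_code(code: str) -> bool:
    """Return True if the code lies in the reserved range QVA to QZZ."""
    # Only three ASCII letters can form a code at all.
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        return False
    # For three uppercase ASCII letters, string order matches code order.
    return "QVA" <= code.upper() <= "QZZ"


print(is_local_use_code("QVA"))  # prints True
print(is_local_use_code("AAA"))  # prints False
```

The range covers second letters V through Z with any third letter, which gives 5 x 26 = 130 local-use codes.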

Downloading the code tables

The code tables (as tab-delimited, ISO 8859-1 encoded plain text files) may be downloaded individually by clicking the following links. These are the tables that define the code set as it was in the 14th edition of the Ethnologue (2000):

Here are all of the updated versions of the language code table that have been released since the 14th edition (with the most recent listed first), plus the latest version of the code history table (which is a cumulative list of all changes since the 14th edition):

Or download a complete set of code tables with this documentation file and the terms of use in a single zip file:


 
Copyright © 2002 SIL International