This paper is a progress report on a project to create a balanced and representative electronic corpus of colloquial Jakartan Indonesian. The corpus is intended to provide coverage data and testbeds for a project on the syntax of modern Indonesian in the Lexical Functional Grammar (LFG) framework.
*Source materials* consist of:
• Indonesian corpora, including 72 hours of recordings of Jakartan Indonesian from the University of Wellington;
• natural language use of colloquial (Jakartan) Indonesian in a range of settings;
• publicly available internet materials on Indonesian, used to build a subcorpus of balanced samples representing different text types of written Indonesian.
*Data and corpus composition* This covers language use in a range of contexts, with relevant parameters including (sub)genres/registers, sex and age group of speakers, modality (spoken vs. written), social context, event structure (monologue/dialogue), channel (radio broadcasting, etc.), and so on.
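One way to picture balancing across these parameters is stratified sampling: group candidate texts into cells defined by a few parameters and draw the same number from each cell. The following is a minimal sketch under stated assumptions, not the project's actual procedure; the function name `balanced_sample`, the dictionary keys, and the toy records are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(texts, keys, per_cell, seed=0):
    """Draw up to `per_cell` texts from every combination of the given keys.

    `texts` is a list of dicts; `keys` names the parameters that define a
    cell (hypothetical names such as "modality" or "event").
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    cells = defaultdict(list)
    for text in texts:
        cells[tuple(text[k] for k in keys)].append(text)

    sample = []
    for cell in sorted(cells):  # sorted for a deterministic cell order
        items = cells[cell]
        # Small cells contribute everything they have.
        sample.extend(rng.sample(items, min(per_cell, len(items))))
    return sample


# Toy inventory (invented for illustration): 5 spoken and 5 written texts.
toy_texts = [
    {"id": i, "modality": "spoken" if i % 2 else "written", "event": "dialogue"}
    for i in range(10)
]
toy_sample = balanced_sample(toy_texts, ["modality", "event"], per_cell=2)
```

Cells with fewer texts than `per_cell` are kept whole rather than dropped, so under-represented contexts remain visible in the sample.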
*Metadata and annotation*
This follows the OLAC (Open Language Archives Community) metadata set standard and includes descriptions of the content (topic, genre, communicative contexts, modalities, etc.), the people involved and their roles (collector of the data, language speakers/consultants, annotator, etc.), the tools used, and the associated media files.
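As an illustration of what such a record might look like, the sketch below builds a small OLAC-style XML record with Python's standard library. OLAC metadata extends Dublin Core, so the record uses `dc:` elements, with OLAC role and language codes attached as attributes. The function name `build_record`, the title, the participant labels, and the media format are hypothetical placeholders, not the project's actual metadata.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OLAC metadata (Dublin Core plus OLAC extensions).
DC = "http://purl.org/dc/elements/1.1/"
OLAC = "http://www.language-archives.org/OLAC/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def build_record(title, language_code, contributors, media_format):
    """Build a minimal OLAC-style record as an ElementTree element."""
    for prefix, uri in (("dc", DC), ("olac", OLAC), ("xsi", XSI)):
        ET.register_namespace(prefix, uri)

    record = ET.Element(f"{{{OLAC}}}olac")
    ET.SubElement(record, f"{{{DC}}}title").text = title

    # Subject language, given as an ISO 639-3 code.
    subject = ET.SubElement(record, f"{{{DC}}}subject")
    subject.set(f"{{{XSI}}}type", "olac:language")
    subject.set(f"{{{OLAC}}}code", language_code)

    # One contributor element per person, tagged with an OLAC role code.
    for name, role in contributors:
        person = ET.SubElement(record, f"{{{DC}}}contributor")
        person.set(f"{{{XSI}}}type", "olac:role")
        person.set(f"{{{OLAC}}}code", role)
        person.text = name

    ET.SubElement(record, f"{{{DC}}}format").text = media_format
    return record


# Hypothetical example values, invented for illustration only.
example_record = build_record(
    title="Conversation sample (placeholder title)",
    language_code="ind",  # ISO 639-3 code for Indonesian
    contributors=[("Speaker A", "speaker"), ("Annotator B", "annotator")],
    media_format="audio/x-wav",
)
example_xml = ET.tostring(example_record, encoding="unicode")
```

Keeping roles as coded attributes rather than free text makes it straightforward to query the corpus by, for example, all sessions with a given annotator.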
*Linguistic annotation*
This consists of grammatical (morphological and syntactic), lexical and pragmatic tagging relevant to our LFG research, e.g. POS tagging and treebanking.
*Annotation depth* We plan three sub-corpora with different degrees of annotation depth: i) sub-corpus 1, minimally annotated with some basic metadata; ii) sub-corpus 2, with basic interlinear annotation in addition to the metadata; and iii) sub-corpus 3, with rich grammatical, lexical and pragmatic annotation.
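The three depths can be pictured as progressively filled-in layers on the same record. The sketch below is an assumption about how this might be modelled, not the project's data format: the class names `Token` and `Utterance`, the `depth` method, and the use of a POS tag as a stand-in for "rich" annotation are hypothetical, and the example utterance is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Token:
    form: str                    # surface word form
    gloss: Optional[str] = None  # interlinear gloss (depth 2 and above)
    pos: Optional[str] = None    # POS tag, standing in for rich annotation


@dataclass
class Utterance:
    metadata: dict
    tokens: list = field(default_factory=list)

    def depth(self) -> int:
        """Return 1 (metadata only), 2 (glossed) or 3 (glossed and tagged)."""
        if not self.tokens or any(t.gloss is None for t in self.tokens):
            return 1
        if all(t.pos is not None for t in self.tokens):
            return 3
        return 2


# Invented colloquial example: "gue udah makan" ('I have already eaten').
meta = {"genre": "conversation", "modality": "spoken"}
level1 = Utterance(meta, [Token("gue"), Token("udah"), Token("makan")])
level2 = Utterance(
    meta, [Token("gue", "1SG"), Token("udah", "already"), Token("makan", "eat")]
)
level3 = Utterance(
    meta,
    [
        Token("gue", "1SG", "PRON"),
        Token("udah", "already", "ADV"),
        Token("makan", "eat", "VERB"),
    ],
)
```

Modelling depth as optional layers on one structure means a text can move from sub-corpus 1 to sub-corpus 3 by adding annotation, without changing its format.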
Authors: Avery Andrews, Wayan Arka, Meladel Mistica, Jane Simpson
Event: SF08: Designing the Australian National Corpus Workshop