Signed language corpora and annotation

Signed language (SL) corpora are only just being created. The Auslan (Australian SL) corpus was the first major SL corpus. SL corpora are needed to ground generalizations about SLs empirically, and to facilitate peer review of descriptions and of the theories that draw on SL data. I discuss ongoing work on the creation of corpora for SL research. In particular, I address the issue of machine-readability and the use of lemmatized glosses. Naturally, SL corpora, as with all modern linguistic corpora, should be representative, well-documented (i.e., with relevant metadata) and machine-readable (i.e., able to be annotated and tagged consistently and systematically). This requires dedicated technology (e.g., ELAN), standards and protocols (e.g., IMDI metadata descriptors), and transparent and agreed linguistic tags (e.g., grammatical class labels). However, it also requires the identification of lemmata. Lemmatization, the classification or identification of related forms under a single label or lemma (the equivalent of headwords or headsigns in a dictionary), is fundamental to the process of corpus creation. To achieve this, a reference dictionary or lexical database is needed to enable consistency in lemma identification. A robust understanding of the processes of lexicalization in SLs is thus essential. Reflecting this, annotation conventions that discriminate between, and treat consistently, the different types of signs found within any SL text need to be articulated and adhered to. This paper describes such conventions. Plans for the creation of new SL corpora in Europe and North America will be seriously flawed if they do not take into account the issue of lemmatization.

Authors: Trevor Johnston

Event: SF08: Designing the Australian National Corpus Workshop
