Enhancing the Value of a Linguistic Text Collection

This paper describes the incorporation of a linguistically annotated text collection into a server-based relational database which serves as the ‘backend’ to a web-based query system. It constitutes a case study in adapting an existing corpus to typical open-source technologies. The project is part of the DoBeS Program, funded by the Volkswagen Foundation, which aims to document endangered languages. The text collection documents Saliba-Logea, an Oceanic language of Papua New Guinea.

The original data is in a format commonly used for the documentation and analysis of little-described languages. It comprises transcripts with time-code mark-up to facilitate the playback of related audio/video segments. This format is optimised for specific tasks – e.g. texts can be morphologically analysed via a related lexicon – but, being specialised, it is likely to remain unavailable to a wider community of researchers. A benefit of incorporating these texts into a generalised system is that this risk is mitigated. A further benefit is gaining access to additional query methods which allow new kinds of questions to be answered. Standard SQL statements can operate on both texts and related attributes, e.g. regarding location, speaker and subject matter. Complex processes such as regular-expression filters can also be applied.
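
As a rough illustration of the kind of query described above, the following sketch combines a filter on a text attribute (location) with a regular-expression filter on the transcript. Everything in it is an assumption made for illustration: the table and column names (texts, utterances, start_sec and so on), the helper functions, and the placeholder rows are hypothetical and are not taken from the project's schema or from the Saliba-Logea data. SQLite is used here only so that the example is self-contained; it stands in for whatever server-based database the project uses.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Back the SQL REGEXP operator, which SQLite does not define by default.

    SQLite evaluates `X REGEXP Y` as regexp(Y, X), so the pattern comes first.
    """
    return int(value is not None and re.search(pattern, value) is not None)


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("REGEXP", 2, regexp)
    conn.executescript("""
        CREATE TABLE texts (
            id INTEGER PRIMARY KEY,
            title TEXT, speaker TEXT, location TEXT, topic TEXT
        );
        CREATE TABLE utterances (
            id INTEGER PRIMARY KEY,
            text_id INTEGER REFERENCES texts(id),
            start_sec REAL, end_sec REAL,   -- time codes into the recording
            transcript TEXT
        );
    """)
    # Placeholder rows only: invented strings, not real language data.
    conn.executemany("INSERT INTO texts VALUES (?, ?, ?, ?, ?)", [
        (1, "Text A", "Speaker 1", "Village X", "fishing"),
        (2, "Text B", "Speaker 2", "Village Y", "gardening"),
    ])
    conn.executemany("INSERT INTO utterances VALUES (?, ?, ?, ?, ?)", [
        (1, 1, 0.0, 2.5, "tana kolo mi"),
        (2, 1, 2.5, 5.0, "kolo tana"),
        (3, 2, 0.0, 3.1, "mi tana kolo"),
    ])
    return conn


def find_utterances(conn, location, pattern):
    """Return utterances recorded at `location` whose transcript matches `pattern`."""
    return conn.execute("""
        SELECT t.title, t.speaker, u.start_sec, u.end_sec, u.transcript
        FROM utterances AS u
        JOIN texts AS t ON t.id = u.text_id
        WHERE t.location = ? AND u.transcript REGEXP ?
        ORDER BY t.id, u.start_sec
    """, (location, pattern)).fetchall()
```

Calling find_utterances(build_demo_db(), "Village X", r"^tana\b") would return only those utterances from texts recorded at the given location whose transcript begins with the given word, together with the time codes needed to locate the matching audio/video segment.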

The resulting dataset can be used and edited simultaneously by different researchers through different interfaces. Moreover, the local database can communicate directly with external resources via web-service mechanisms. I will outline a general method for preparing source texts, show some ways to query and manipulate the resulting database, and discuss human and machine interfaces.

Authors: Andrew Margetts

Event: SF08: Designing the Australian National Corpus Workshop
