Including Email in an Australian National Corpus

In corpus and computational linguistics, email is a distinctive and important text-type. Its usage spans many genres and borrows linguistic features from both written and spoken text. As a communication medium, email touches the lives of many Australians, so it should be considered and included in the design and construction of an Australian National Corpus.

Current email research is limited by the scarcity of available email corpora. Much email research is based on a single American corpus of email released during the legal proceedings surrounding the collapse of Enron. While the Enron corpus has much to offer, an Australian email corpus as part of an Australian National Corpus would provide the opportunity to expand beyond a single dominant corpus. It would also make available a collection of email that reflects Australian language and culture to linguists, language technologists, sociologists and other researchers, in Australia and at large.

While privacy concerns make it difficult to gather email data for inclusion in an Australian National Corpus, we believe this challenge should be addressed.

We will bring to the workshop ideas from our experience in building tools for managing and annotating email data in corpora.

Authors: Andrew Lampert, Robert Dale and Cecile Paris

Event: SF08: Designing the Australian National Corpus Workshop
