Author(s) |
Title |
Abstract |
Ana Krajinović, Rosey Billington, Lionel Emil, Gray Kaltap̃au and Nick Thieberger |
Building capacity for community-led documentation in Erakor, Vanuatu |
Close collaboration between community members and visiting researchers offers mutual benefits, including opportunities for new research insights and an expanded scope for supporting language maintenance and developing practical materials. We discuss a collaboration in Erakor, Vanuatu, aiming to build the capacity of community-based researchers to undertake and sustain language and cultural documentation projects. We focus on the technical and procedural skills required to collect, manage, and work with audio and video data, and give an overview of the outcomes of a community-led project after initial training. We discuss the benefits and challenges of this type of project from the perspective of the community researchers and the external linguists. We show that the community-led project in Erakor, in which data management and archiving are incorporated into the documentation process, has crucial benefits for both the community and the linguists. The two most salient benefits are: a) long-term documentation of linguistic and cultural practices calibrated towards the community's needs, and b) collections of large quantities of data of good phonetic quality, which, besides being readily available for research, have great potential for training and testing emerging language technologies based on machine learning. |
Cheikh Bamba Dione |
LSTM based Language Models for Wolof |
This paper reports on the creation of a neural language model for Wolof, a less-resourced Niger-Congo language. The language model is based on recurrent neural networks with long short-term memory (LSTM) units. Neural network language models have been shown to provide good solutions to the data sparsity and curse of dimensionality issues typically encountered with classical language modeling approaches. To investigate the performance of the LSTM-based language model, different baseline algorithms are run in an experimental setting. These include standard recurrent neural networks and state-of-the-art n-gram models using Kneser-Ney and Good-Turing smoothing algorithms. The obtained results indicate that the LSTM-based model consistently outperforms all other algorithms evaluated, including in cases where the amount of training data is severely limited. |
Annika Tjuka, Lena Weißmann and Kilu von Prince |
Investigating habitual aspect in corpora from language documentation |
The Oceanic languages of Melanesia are generally small, low-resource languages, for which very little primary data is available. For our study on TAM categories, we have access to richly annotated corpora from seven endangered Oceanic languages. In this paper, we describe the methodology we used to investigate the category of habitual aspect in these languages. We show that some information can be recovered from the English translations. For a more in-depth study, we also relied on metadata on genres, and on clause-based tags labeling tense, aspect, mood, polarity, and clause type. The process of tagging aspect, in particular, revealed the theoretically and practically important fact that habituality is sometimes a property of larger spans of text, rather than just a property of clauses, and can combine with more specific clause-level aspect. |
Alice Millour |
Getting to know the speakers: a survey of a non-standardized language digital use |
This paper presents the results of an on-line survey regarding the use on the Internet of a less-resourced non-standardized language: Alsatian. The survey, entitled "Alsatian, the Internet, and You", received 1,224 answers in a two-month period starting in January 2019. The purpose of this survey is twofold. First, we collect generic information on the use of their language by Alsatian-speaking Internet users. Second, based on our own experience of crowdsourcing linguistic resources for Alsatian, we use this survey to gather insights on the needs, abilities and expectations of the speakers in order to make the most of their participation. |