Studies over the past five decades have enriched our understanding of the lexicon, phonology and grammar of Cantonese; yet some deeper areas, such as pragmatics, semantics and discourse, remain to be explored. Research of this kind requires a significant amount of authentic, natural language data. The research team therefore proposed constructing a Cantonese corpus to expand the scope of Cantonese linguistic research. One major advantage of using a corpus in language studies is that it provides objective, unbiased quantitative and qualitative data for research and other applications, including the compilation of language materials and natural language processing tasks such as speech-to-text and text-to-speech.
The research project started in 2011 with the support of an EdUHK internal research grant and the Early Career Scheme of the Research Grants Council. Dr Chin constructed the corpus in two phases; it contains about one million Chinese characters. The data was collected by transcribing the dialogue of 80 black-and-white movies produced between the 1950s and 1970s, and the corpus is now available online.
The corpus won the Gold Medal and Special Award at the Silicon Valley International Invention Festival in 2019. Dr Chin has also developed mobile apps that draw on the corpus data. The CanPro app, which enables learners to practise Cantonese pronunciation through commonly used expressions in the corpus, won a Silver Medal at the 2021 Inventions Geneva Evaluation Days. Another mobile app, ‘Learn Cantonese with Big Data’, supported by the Language Fund of the Standing Committee on Language Education and Research, was launched in March 2022. One major feature of this app is that it provides linguistic information that Cantonese learners might find relevant and useful, such as verb-noun and classifier-noun collocations, which cannot be obtained without corpus data.