Automated Analysis of Pragmatic Language Development in Autism
This is an ongoing research project that I have worked on with Professor Prud'hommeaux of Boston College. Autism spectrum disorder (ASD) is a neurodevelopmental condition associated with life-long deficits in communication that can impact both personal and professional well-being. Although the linguistic features associated with these deficits are routinely observed in clinical settings, they are difficult to quantify. For our research, we are collecting a growing dataset of conversations between high-functioning adults with ASD and their neurotypical conversational partners as they complete several collaborative tasks. We compare the linguistic characteristics of the two groups using both manually annotated features and computationally predicted features extracted from the conversations.
Yang, C., Liu, D., Yang, Q., Liu, Z., and Prud’hommeaux, E. 2021. Predicting pragmatic discourse features in the language of adults with autism spectrum disorder. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop.
Yang, C., Liu, D., Canfield, A., Hoffkins, C., Aldrich, J., Farash, S., Silverman, L., and Prud’hommeaux, E. 2021. Distinctive Features of Pragmatic Expression in Adults with ASD. Annual Conference of the International Society for Autism Research (INSAR-2021), virtual.
Yang, C., Prud’hommeaux, E., Silverman, L.B., Canfield, A. 2020. Toward Characterizing the Language of Adults with Autism in Collaborative Discourse. In Proceedings of the Workshop on Resources and processing of linguistic, para-linguistic and extra-linguistic data from people with various forms of cognitive, psychiatric, developmental impairments (RaPID-3), 54–59.
Automated Language Analysis for Dementia Screening
This is an ongoing research project that I have worked on with Professor Prud'hommeaux of Boston College. A common method of screening for dementia is the administration of a verbal fluency test, in which the patient is presented with a semantic or phonetic category and is given a minute to list as many words belonging to that category as they can. This research aims to use word embedding models to analyze the fluency tests automatically, removing the need for specialists to analyze the data manually. Using "animals" as the semantic category, we collect verbal fluency test data from participants in several age groups, with and without dementia. We then use vector space models to compute the cosine similarity between each pair of adjacent animals in a response, which indicates whether a new semantic thread has begun. We generate the similarity metrics with a number of pre-trained models, using embedding methods such as fastText, word2vec, and GloVe together with the gensim library, and we also collect domain-specific data from Wikipedia on which new models are trained from scratch.
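The adjacent-pair comparison described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the three-dimensional vectors, the `find_switches` helper, and the 0.5 threshold are all hypothetical stand-ins for real pre-trained embeddings and a tuned cutoff.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_switches(words, vectors, threshold=0.5):
    """Return each index i where words[i] is dissimilar to words[i - 1].

    A similarity below `threshold` (an assumed, untuned value) is treated
    as the start of a new semantic thread.
    """
    switches = []
    for i in range(1, len(words)):
        if cosine(vectors[words[i - 1]], vectors[words[i]]) < threshold:
            switches.append(i)
    return switches


# Hypothetical toy embeddings; a real run would look these up in a
# pre-trained model instead.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "shark": [0.1, 0.9, 0.1],
    "whale": [0.2, 0.8, 0.2],
    "eagle": [0.1, 0.1, 0.9],
}

responses = ["cat", "dog", "shark", "whale", "eagle"]
switch_points = find_switches(responses, TOY_VECTORS)
print(switch_points)  # indices where a new thread begins: [2, 4]
```

With these toy vectors, the response splits into three threads: pets, sea animals, and birds.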
Fraktur is a blackletter typeface commonly used in German texts until the mid-20th century. The ability to automatically transcribe texts printed in Fraktur would enable scholars to work with and analyze historical texts more efficiently. Performing optical character recognition on historical texts poses additional challenges resulting from broken characters, poor scan quality, and spelling variations. This project applies a simple similarity-based approach to the optical character recognition of individual Fraktur letters. Using a zoning-based black pixel density feature, we achieved approximately 92% accuracy with a k-NN classifier.
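As a rough illustration of the zoning idea, the sketch below divides a binarized glyph into a grid, uses the fraction of black pixels in each zone as a feature, and labels a new glyph by its nearest training example. The 4x4 glyphs, the 2x2 zone grid, and the function names are hypothetical simplifications, not the project's actual data or settings.

```python
import math
from collections import Counter


def zone_densities(image, zones=2):
    """Split a binary image (rows of 0/1, where 1 is black) into a
    zones x zones grid and return the black-pixel fraction of each zone.

    Assumes the image is at least `zones` pixels tall and wide.
    """
    height, width = len(image), len(image[0])
    features = []
    for zr in range(zones):
        r0, r1 = zr * height // zones, (zr + 1) * height // zones
        for zc in range(zones):
            c0, c1 = zc * width // zones, (zc + 1) * width // zones
            cells = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            features.append(sum(cells) / len(cells))
    return features


def knn_predict(train, query, k=1):
    """Majority label among the k training items closest to `query`.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical 4x4 glyphs standing in for scanned letter images.
GLYPH_L = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
GLYPH_T = [[1, 1, 1, 1], [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0]]
NOISY_L = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]]

train = [(zone_densities(GLYPH_L), "l"), (zone_densities(GLYPH_T), "t")]
prediction = knn_predict(train, zone_densities(NOISY_L))
print(prediction)  # the distorted glyph is closest to the "l" example
```

Because each zone is summarized by a single density value, small distortions such as a broken stroke shift the feature vector only slightly, which is one plausible reason the approach suits degraded scans.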
Authorship attribution often relies upon the writing style, vocabulary, and topics of interest that distinguish different writers; however, these factors are more difficult to ascertain when analyzing shorter documents. This project experiments with different methods of authorship attribution on shorter texts, using lines of dialogue from TV shows to try to predict the speaker of each line. We explore two sets of features, one consisting of manually selected linguistic features and the other consisting of sentence embeddings generated from the Google News word2vec model. Of the three algorithms tested, we found that using sentence embeddings with a logistic regression algorithm generally yielded the highest accuracy for each show. We also observed some interesting differences in how each show's genre affected the accuracy of the predictions.
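The embedding-plus-logistic-regression setup can be sketched as below. Several pieces are assumptions made for illustration: sentence embeddings are formed by averaging word vectors (the text above does not say how they were built), the two-dimensional vectors are hypothetical toy values rather than the Google News model, and the classifier is a hand-written binary logistic regression with only two speakers.

```python
import math

# Hypothetical toy word vectors; a real run would use a pre-trained model.
TOY_VECTORS = {
    "ship": [1.0, 0.0], "captain": [0.9, 0.1], "warp": [0.8, 0.0],
    "coffee": [0.0, 1.0], "paper": [0.1, 0.9], "office": [0.0, 0.8],
    "the": [0.5, 0.5],
}


def sentence_embedding(line, vectors):
    """Average the vectors of known tokens (assumed pooling strategy)."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in line.lower().split() if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit binary logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - target
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return speaker 1 if the predicted probability is at least 0.5."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Invented lines for two hypothetical speakers (labels 0 and 1).
lines = ["the captain ship", "warp the ship", "the office coffee", "paper coffee"]
speakers = [0, 0, 1, 1]

X = [sentence_embedding(line, TOY_VECTORS) for line in lines]
weights, bias = train_logreg(X, speakers)

print(predict(weights, bias, sentence_embedding("captain warp", TOY_VECTORS)))
print(predict(weights, bias, sentence_embedding("office paper", TOY_VECTORS)))
```

Extending this to more than two speakers would require a multiclass formulation such as one-vs-rest or softmax regression, which a library implementation would normally provide.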