My co-author, Dustin S. Stoltz, and I propose a method for measuring a text’s engagement with a focal concept using distributional representations of the meaning of words. More specifically, this measure relies on Word Mover’s Distance, which uses word embeddings to determine similarities between two documents. In our approach, which we call Concept Mover’s Distance (CMD), a document’s engagement with a concept is measured by the minimum distance the words in the document need to travel to arrive at the position of a “pseudo document” consisting of only words denoting the focal concept. This approach captures the prototypical structure of concepts, and is fairly robust to pruning sparse terms as well as to variation in text lengths within a corpus. It can be used with pre-trained embeddings, even when terms denoting the concept are absent from the corpus. It can also be applied to bag-of-words datasets.
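To make the idea concrete, here is a minimal Python sketch of the simplest case, where the pseudo document contains a single concept word. In that case the transport problem is trivial: all of the document's word mass must travel to that one word, so the minimum cost is the frequency-weighted average distance from each document word to the concept word. The embeddings below are invented toy vectors, and the names `EMBEDDINGS`, `cosine_distance`, and `concept_distance` are hypothetical; this is an illustration of the logic, not the R package's API or the exact procedure used in the paper.

```python
import math
from collections import Counter

# Toy 2-d embeddings, invented for illustration (not from a trained model).
EMBEDDINGS = {
    "family": (1.0, 0.0),
    "mother": (0.9, 0.1),
    "child": (0.8, 0.3),
    "tax": (0.0, 1.0),
    "budget": (0.1, 0.9),
}


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def concept_distance(tokens, concept, embeddings=EMBEDDINGS):
    """Cost of moving a document's word mass to a single concept word.

    Assumes a one-word pseudo document, so every word's mass travels
    to the same target and no optimization step is needed.
    """
    # Words without an embedding are dropped (an assumption of this sketch).
    known = [t for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no tokens have an embedding")
    counts = Counter(known)
    total = sum(counts.values())
    target = embeddings[concept]
    # Each word's share of the document mass times its distance to the concept.
    return sum(
        (n / total) * cosine_distance(embeddings[word], target)
        for word, n in counts.items()
    )


family_doc = ["mother", "child", "family"]
fiscal_doc = ["tax", "budget", "tax"]

# A smaller distance means closer engagement with the concept.
print(concept_distance(family_doc, "family"))
print(concept_distance(fiscal_doc, "family"))
```

Note that the concept word does not need to appear in the document: `fiscal_doc` never mentions "family" but still receives a distance, because the comparison happens in the embedding space rather than through term matching. With a multi-word pseudo document, the mass can be split across several targets, and a transport solver would be needed to find the minimum.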

The paper detailing CMD, “Concept Mover’s Distance: Measuring Concept Engagement via Word Embeddings in Texts,” is published in the Journal of Computational Social Science. The reproduction repository for the paper can be found here.

The R-based package, created in conjunction with Dustin S. Stoltz, is available via GitHub.