Natural Language Processing

A multinomial logistic regression analysis of word use across groups, intended to identify hallmarks of experience and to better understand the strengths and concerns of individuals.

In the Clustering and Bayesian Comparison project, I demonstrated how thought patterns in everyday life could be used to differentiate groups of people. Importantly, these groups differed on traits that could be helpful to precision mental healthcare. To further understand how these groups differed, I wanted to explore word use within and across the identified groups. Words and language are ubiquitous in our everyday life and the words that we choose to use, along with those that we choose not to, can provide vital clues about the things that preoccupy our thoughts.

Data collected from Mind Window not only contain measures of the characteristics of thoughts that people have in their everyday lives, but also include a ‘free-response’ option for users to enter information about what they were doing, feeling, or thinking. In this analysis, I used data from over 1,500 individuals who provided at least 100 words along with their thought characteristic data (the average was 562 words, practically a small essay about their experiences). I first used frequency analysis to summarize the language used by each group. These frequencies were then compared via Bayesian ANOVA to see which word categories differed between groups with at least a 95% probability.
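As a rough illustration of the frequency step (not the code used in the project), the sketch below scores a free-response entry against a few word categories and reports each category's share of the total word count. The `CATEGORIES` lexicon and the `category_rates` function are hypothetical stand-ins; the actual dictionary behind the categories is much larger and is not reproduced here.

```python
from collections import Counter

# Hypothetical mini-lexicon, for illustration only.
CATEGORIES = {
    "positive_emotion": {"happy", "good", "love"},
    "negative_emotion": {"sad", "worried", "tired"},
    "i_words": {"i", "me", "my"},
}

def category_rates(text):
    """Return each category's share of the total word count, in percent."""
    words = text.lower().split()
    counts = Counter()
    for word in words:
        for name, lexicon in CATEGORIES.items():
            if word in lexicon:
                counts[name] += 1
    total = len(words)
    return {
        name: (100.0 * counts[name] / total) if total else 0.0
        for name in CATEGORIES
    }

rates = category_rates("I feel happy but my friend is worried")
print(rates)
```

Per-person rates like these could then be averaged within each group before any statistical comparison.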

The groups differed in almost all of the linguistic categories that were compared. These included not only word count, but also categories such as positive emotion, negative emotion, social, cognitive processing, work, leisure, and ‘I’ words. The figure to the right shows each group’s posterior distributions from the Bayesian inferential approach. In these plots, if the dashed lines for one distribution fall outside the dashed lines of another, there is a 95% probability that the groups differ on that measure.
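The dashed-line comparison can be sketched in code. The example below assumes the dashed lines mark an equal-tailed 95% interval of each group's posterior samples (the project may have used a different interval, such as a highest-density interval) and uses simulated samples in place of the real posteriors. The function names are hypothetical.

```python
import random

def interval_95(samples):
    """Equal-tailed 95% interval taken from sorted posterior samples."""
    ordered = sorted(samples)
    n = len(ordered)
    lower = ordered[int(0.025 * (n - 1))]
    upper = ordered[int(0.975 * (n - 1))]
    return lower, upper

def intervals_separate(samples_a, samples_b):
    """True when the two 95% intervals do not overlap at all."""
    low_a, high_a = interval_95(samples_a)
    low_b, high_b = interval_95(samples_b)
    return high_a < low_b or high_b < low_a

# Simulated posterior samples (made-up values, for illustration only).
rng = random.Random(0)
group_a = [rng.gauss(5.0, 0.5) for _ in range(4000)]
group_b = [rng.gauss(8.0, 0.5) for _ in range(4000)]

print(interval_95(group_a), interval_95(group_b))
print(intervals_separate(group_a, group_b))
```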

While word frequency analysis is a quick and easy way of characterizing language use, it is rather broad in its application and the word categories can seem somewhat arbitrary. I therefore took a second approach, using logistic regression to determine which specific words had lesser or greater odds of predicting a participant’s group membership. This analysis was performed on over 2,000 participants, with 2,000 words as potential predictive features. The results indicated that some words carried almost six times the odds of predicting a particular group, while others carried roughly nine times the odds of favoring the other groups over the group in question. The top-five ‘favoring’ and ‘not-favoring’ words and their respective odds ratios can be seen in the tables below. Word clouds with the top 10 predictive words are also shown below, where words colored in red favor that group while words in blue favor other groups (when compared to the one being considered). Words in the clouds are sized based on the strength of their predictive or non-predictive quality.
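To make the link between regression weights and odds ratios concrete, here is a minimal sketch under several assumptions: a tiny made-up corpus, binary word-presence features, and a one-vs-rest logistic regression with L2 regularization fitted by plain gradient descent. The project describes a multinomial model, so this is a simplified stand-in rather than the method actually used. Each word's odds ratio is the exponential of its fitted weight, so values above 1 favor the target group and values below 1 favor the other groups.

```python
import math

# Toy stand-in data (hypothetical): (group label, free-response text).
docs = [
    (1, "hungry want food"),
    (1, "want food feel hungry"),
    (2, "finish plan day"),
    (2, "plan finish family"),
    (5, "read watch news"),
    (5, "watch show read"),
]

vocab = sorted({w for _, text in docs for w in text.split()})

def featurize(text):
    """Binary bag-of-words vector: 1.0 if the word occurs in the text."""
    present = set(text.split())
    return [1.0 if w in present else 0.0 for w in vocab]

def fit_one_vs_rest(target, l2=0.1, lr=0.5, epochs=2000):
    """Fit target-vs-rest logistic regression; return word -> odds ratio."""
    X = [featurize(text) for _, text in docs]
    y = [1.0 if group == target else 0.0 for group, _ in docs]
    w = [0.0] * len(vocab)
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]  # L2 penalty on weights only
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    # Odds ratio for a word is exp(weight).
    return {word: math.exp(wj) for word, wj in zip(vocab, w)}

odds = fit_one_vs_rest(target=5)
ranked = sorted(odds.items(), key=lambda kv: kv[1], reverse=True)
for word, ratio in ranked[:3]:
    print(word, round(ratio, 2))
```

With real data, a library implementation and a held-out evaluation would be the more sensible route; the hand-written loop is only there to keep the example self-contained.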

In the Clustering and Bayesian Comparison project, the group with the highest psychological well-being (PWB) was group two. As can be seen in the figures, this group was most associated with words that seemed to indicate completion and task-related action. Other predictive words for group two included “plan” and “prepare,” while the least associated words included “think” and “want.” Group five was most favored by words that involved leisure: “read” and “watch” were the words that most favored this group, while its least associated words were “think” and “get.” Group one, the group with the lowest PWB score, seemed most predicted by drive-related words like “hungry” and “want” and least associated with leisure-related words (“read,” “watch,” “tv”).

While these analyses don’t tell us that the language use of everyone in a particular group is dominated by these words, they do help to determine the trend of the content in a group’s naturalistic, everyday thought. They also provide a connection between specific words and the traits that are characteristic of certain groups (e.g., anxiety, goal orientation, age, and PWB). Understanding these indicative words gives us a better view of the inner world of these individuals and sets up an additional mechanism for providing more effective mental health care.

            Word 1    Word 2    Word 3    Word 4    Word 5
Group 1     hungry    want      food      feel      wake
Odds ratio  2.00      1.82      1.63      1.58      1.51
Group 2     finish    plan      family    eat       day
Odds ratio  2.45      2.37      1.92      1.91      1.90
Group 3     think     school    chase     homework  nothing
Odds ratio  5.73      3.17      2.17      2.00      1.95
Group 4     think     go        homework  class     many
Odds ratio  5.44      3.61      2.84      1.97      1.95
Group 5     read      watch     news      show      game
Odds ratio  3.53      2.49      1.86      1.74      1.72
The top-five words that favor each group and the odds ratio of the prediction. For example, the word “think” favors Group 3 over other groups at odds of 5.73 to 1.

            Word 1    Word 2    Word 3    Word 4    Word 5
Group 1     read      watch     tv        game      news
Odds ratio  0.31      0.47      0.48      0.52      0.55
Group 2     think     want      tire      really    school
Odds ratio  0.27      0.59      0.63      0.64      0.65
Group 3     go        finish    day       try       feel
Odds ratio  0.51      0.70      0.70      0.71      0.73
Group 4     watch     listen    husband   show      clean
Odds ratio  0.47      0.57      0.62      0.65      0.71
Group 5     think     homework  get       work      go
Odds ratio  0.11      0.19      0.33      0.36      0.42
The top-five words that favor other groups and the odds ratio. For example, the word “read” favors other groups over Group 1 at odds of 3.23 to 1.
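
The “3.23 to 1” figure in the caption is the reciprocal of the tabulated odds ratio: an odds ratio below 1 for a group is the same as odds of 1 divided by that value in favor of the other groups. A short check of the arithmetic, using values from the table (the helper name `odds_against` is only for illustration):

```python
def odds_against(odds_ratio):
    """Invert an odds ratio below 1 to express it as 'x to 1' for other groups."""
    return round(1.0 / odds_ratio, 2)

print(odds_against(0.31))  # "read" for Group 1
print(odds_against(0.11))  # "think" for Group 5
```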