Topic Modeling Definitions of Nature

Topic Modeling

https://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf

Topic models are “mixed-membership models” (see [1]). Such a model assumes that definitions (the documents in this analysis) can belong to several topics at once and that the topic distribution can vary across definitions. This differs from simpler models such as the unigram model, which assumes that each word comes from the same distribution; that is, a given word is just as likely to appear in any one definition.

Two models can be used:

Latent Dirichlet Allocation (LDA) - a Bayesian mixture model in which topics are assumed to be uncorrelated (an assumption that would not hold in this analysis)
Correlated Topics Model (CTM) - extends LDA by allowing correlations between the topics

(see [2] and [3] for intros to these models)
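To make the difference between the two models concrete, here is a minimal sketch in plain Python (not the R topicmodels package) of how per-definition topic proportions could be drawn under each. LDA draws them from a Dirichlet distribution; CTM draws a Gaussian vector and maps it to proportions, so an off-diagonal covariance term can make two topics tend to appear together. The function names, the value of alpha, and the covariance matrix are illustrative assumptions, not values from this analysis.

```python
import math
import random

def lda_topic_proportions(alpha, k, rng):
    """Draw proportions over k topics from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def ctm_topic_proportions(mean, chol, rng):
    """Draw proportions over k topics from a logistic normal.

    mean: list of length k-1.
    chol: lower-triangular Cholesky factor of the (k-1)x(k-1) covariance.
    """
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    # eta = mean + chol @ z, written out for a lower-triangular factor
    eta = [m + sum(chol[i][j] * z[j] for j in range(i + 1))
           for i, m in enumerate(mean)]
    eta.append(0.0)  # last component fixed at 0 so the mapping is identifiable
    peak = max(eta)
    weights = [math.exp(e - peak) for e in eta]  # numerically stable softmax
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(1)
theta_lda = lda_topic_proportions(alpha=0.5, k=3, rng=rng)

# Cholesky factor of the assumed covariance [[1.0, 0.8], [0.8, 1.0]]:
# the first two topics are positively correlated.
chol = [[1.0, 0.0], [0.8, 0.6]]
theta_ctm = ctm_topic_proportions(mean=[0.0, 0.0], chol=chol, rng=rng)
```

Both functions return a list of proportions that sums to 1; only the CTM version has a parameter (the covariance) that can encode topics co-occurring.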

Correlated Topics Model

This analysis uses the topicmodels package, which fits the model using the VEM (variational expectation-maximization) algorithm.

Preprocessing

Each row = participant
Each column = word
convert to lower case
remove punctuation
remove numbers
stemming (reducing words to their stem by stripping suffixes, e.g. “animals” becomes “anim”)
removing stop words
removing words shorter than a minimum length
Optional: select only words that occur in a minimum number of definitions (see [4])
Optional: select the terms with the highest term-frequency inverse document frequency (tf-idf) scores (see [3]) - this is only used for selecting the vocabulary in the corpus; the model itself is still fitted to raw term counts
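The steps above can be sketched in plain Python as follows. This is an illustration only, not the R pipeline used in the analysis: the stop-word list is a tiny placeholder, `crude_stem` is a rough stand-in for a real stemmer such as Porter's, and the three example definitions are invented. Stop words are removed before stemming here so that the list can match unstemmed words.

```python
import re
from collections import Counter

# Tiny placeholder list; a real analysis would use a fuller stop-word list.
STOP_WORDS = {"the", "and", "of", "a", "is", "in", "that", "to", "it", "are"}

def crude_stem(word):
    """Rough suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text, min_length=3):
    text = text.lower()
    # Replace anything that is not a-z or whitespace: drops punctuation
    # and numbers (and, in this sketch, any non-ASCII letters too).
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    tokens = [crude_stem(t) for t in tokens]
    return [t for t in tokens if len(t) >= min_length]

def document_term_matrix(definitions, min_docs=1):
    """Rows are definitions, columns are terms, cells are counts."""
    tokenized = [preprocess(d) for d in definitions]
    # Number of definitions each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(t for t, n in doc_freq.items() if n >= min_docs)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocab])
    return vocab, rows

# Invented example definitions, for illustration only
definitions = [
    "Nature is the plants and animals around us.",
    "Plants, animals, and 2 forests!",
    "The outdoors",
]
vocab, dtm = document_term_matrix(definitions, min_docs=2)
```

With `min_docs=2` only terms found in at least two definitions survive, which leaves the third definition with an all-zero row. Rows like that would need to be dropped before fitting a model.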

Fitting the Model

In the CTM you fix the number of topics (k) a priori. In this analysis I ran the model with k fixed at 14 (the number of categories identified during manual coding) and at 7 (the number of higher-order topics that these categories fit into).

“Additionally, estimation using Gibbs sampling requires specification of values for the parameters of the prior distributions. [4] suggest a value of 50/k for α and 0.1 for δ. Because the number of topics is in general not known, models with several different numbers of topics are fitted and the optimal number is determined in a data-driven way.”

Selecting the number of topics can be done by splitting the data into training and testing sets. “The likelihood for the test data is then approximated using the lower bound for VEM estimation.”
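As a rough sketch of that procedure in plain Python: split the definitions, fit on the training part, and turn the test-set log-likelihood into a perplexity, which is the exponential of the negative log-likelihood per word token. The function names and the 20% test fraction are assumptions made for illustration; the model fitting itself is left out.

```python
import math
import random

def train_test_split(documents, test_fraction=0.2, seed=0):
    """Shuffle the documents and hold out a fraction for testing."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

def perplexity(log_likelihood, n_tokens):
    """exp(-log-likelihood / number of word tokens); lower is better."""
    return math.exp(-log_likelihood / n_tokens)

train, test = train_test_split(range(10))

# Sanity check: a model that spreads probability evenly over 4 words
# assigns log(1/4) to every token, so its perplexity is 4.
uniform_perplexity = perplexity(log_likelihood=10 * math.log(1 / 4), n_tokens=10)
```

A perplexity of 4 can be read as the model being as uncertain as a uniform choice among 4 words, which is why lower values indicate a better fit to the held-out text.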

“Another possibility for model selection is to use hierarchical Dirichlet processes as suggested in [5].”

Topic modeling uses a probabilistic algorithm to estimate the probability of a word appearing given a particular topic (here, a category of definition).
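A small worked example may help here. In a mixed-membership model, the probability of a term in a given definition is the sum over topics of P(topic | definition) times P(term | topic). The sketch below uses plain Python with made-up numbers; the topic names and all probabilities are hypothetical, not output from the fitted models.

```python
# Hypothetical P(term | topic) for two topics and three stemmed terms
term_given_topic = {
    "living_things": {"plant": 0.5, "anim": 0.4, "earth": 0.1},
    "planet": {"plant": 0.1, "anim": 0.1, "earth": 0.8},
}

# Hypothetical P(topic | definition) for one definition
topic_given_definition = {"living_things": 0.75, "planet": 0.25}

def term_probability(term):
    """P(term | definition) as a topic-weighted mixture."""
    return sum(
        topic_given_definition[topic] * probs[term]
        for topic, probs in term_given_topic.items()
    )

p_plant = term_probability("plant")  # 0.75 * 0.5 + 0.25 * 0.1
total = sum(term_probability(t) for t in ("plant", "anim", "earth"))
```

Here “plant” gets a probability of about 0.4 in this definition, and the probabilities over the three terms sum to 1.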

Perplexity of models for varying numbers of topics
Num Topics	LogLik	Perplexity	Themes
2	-59381.9503174082	772.326437265421	natur, anim
3	-59421.5592366393	776.164516674393	plant, anim, natur
4	-59442.0295455392	778.518088140724	plant, anim, natur, earth
5	-59446.6197440542	779.4854042851	outdoor, anim, natur, earth, world
6	-59445.1615487207	779.468312805306	plant, anim, tree, earth, plants_anim, natur
7	-59448.3613681738	779.724657438038	outdoor, anim, tree, earth, plants_anim, environ, natur
8	-59439.8127945491	778.830509353136	outdoor, anim, everyth, earth, plants_anim, environ, forest, natur
9	-59440.8107295069	778.865172591979	outdoor, anim, around_us, thing, plants_anim, without, forest, outsid, earth
10	-59427.1546790232	778.156239975586	outdoor, anim, everyth, plant, plants_anim, without, forest, outsid, earth, tree
11	-59423.0705650545	776.153599507428	outdoor, anim, around_us, plant, plants_anim, without, forest, outsid, earth, live, natur
12	-59421.6136498809	778.111006218198	outdoor, anim, around_us, plant, plants_anim, without, forest, outsid, earth, live, environ, natur
13	-59422.6977120152	777.47886099978	outdoor, anim, around_us, thing, around, without, man, exist, earth, live, environ, world, natur
14	-59421.6979283443	778.077493546316	outdoor, live, around_us, thing, natur, without, man, exist, earth, life, environ, world, anim, plant
15	-59419.8811201895	778.343854648547	outdoor, live, around_us, thing, around, without, man, exist, earth, life, environ, world, anim, area, natur
16	-59420.2245459775	778.629073395814	outdoor, live, around_us, thing, around, without, man, exist, earth, life, environ, world, anim, tree, plants_anim, natur
17	-59407.2209355495	776.453640798642	outdoor, part, around_us, plant, natur, without, man, exist, earth, life, environ, world, anim, area, plants_anim, live, natur
18	-59403.9166452776	775.449449888357	outdoor, part, around_us, plant, natur, without, man, exist, earth, tree, environ, world, anim, forest, plants_anim, live, natur, natur
19	-59407.0185773046	777.424100007218	outdoor, part, around_us, plant, natur, without, man, exist, earth, life, wildlif, world, anim, forest, plants_anim, live, natur, environ, natur
20	-59395.371028869	777.45481758584	outdoor, part, around_us, plant, physic, without, man, exist, earth, life, wildlif, world, anim, tree, plants_anim, live, natur, environ, someth, tree
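One data-driven way to read this table is to pick the number of topics with the lowest perplexity. The plain-Python sketch below does that using the perplexity column above, rounded to two decimals. Whether these perplexities were computed on held-out data is not stated here, so treat the comparison as indicative only.

```python
# Perplexity by number of topics, rounded from the table above
perplexity_by_k = {
    2: 772.33, 3: 776.16, 4: 778.52, 5: 779.49, 6: 779.47, 7: 779.72,
    8: 778.83, 9: 778.87, 10: 778.16, 11: 776.15, 12: 778.11, 13: 777.48,
    14: 778.08, 15: 778.34, 16: 778.63, 17: 776.45, 18: 775.45, 19: 777.42,
    20: 777.45,
}

# Number of topics with the lowest perplexity
best_k = min(perplexity_by_k, key=perplexity_by_k.get)

# Difference between the worst and best perplexity across all k
spread = max(perplexity_by_k.values()) - min(perplexity_by_k.values())
```

On these numbers the lowest perplexity is at k = 2, and the full range across 2 to 20 topics is only about 7.4, so perplexity alone gives little reason to prefer either 7 or 14 topics.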

After going through all of this work, I do not think that this is a useful way of categorizing definitions. People use synonyms in too many different ways for the algorithm to detect probabilistically which category a definition belongs to.

[1] Airoldi EM, Blei DM, Fienberg SE, Xing EP (2008). “Mixed Membership Stochastic Blockmodels.” Journal of Machine Learning Research, 9, 1981–2014.
[2] Steyvers M, Griffiths T (2007). “Probabilistic Topic Models.” In TK Landauer, DS McNamara, S Dennis, W Kintsch (eds.), Handbook of Latent Semantic Analysis. Lawrence Erlbaum Associates.
[3] Blei DM, Lafferty JD (2009). “Topic Models.” In A Srivastava, M Sahami (eds.), Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC Press.
[4] Griffiths TL, Steyvers M (2004). “Finding Scientific Topics.” Proceedings of the National Academy of Sciences of the United States of America, 101, 5228–5235.
[5] Teh YW, Jordan MI, Beal MJ, Blei DM (2006). “Hierarchical Dirichlet Processes.” Journal of the American Statistical Association, 101(476), 1566–1581.

https://heartbeat.comet.ml/text-classification-using-machine-learning-algorithm-in-r-ba763117c8aa