Topic Modeling Definitions of Nature

Jun 27, 2025 · 4 min read

Topic Modeling

https://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf

Topic models are “mixed-membership models” (see [1]): they assume that definitions (in the case of this analysis) can belong to several topics and that the topic distribution can vary across definitions. This differs from other models (such as Semantic Analysis), which assume that each word comes from the same distribution - that is, a given word is just as likely to appear in any one definition as in any other.

Two models can be used:

  1. Latent Dirichlet Allocation (LDA) - a Bayesian mixture model in which the topics are assumed to be uncorrelated (an assumption that would not hold in this analysis)

  2. Correlated Topic Model (CTM) - extends LDA by allowing correlations between the topics

(see [2] and [3] for introductions to these models)

Correlated Topic Model

This analysis uses the topicmodels package, which fits the model with the VEM (variational expectation-maximization) algorithm.

Preprocessing

  • Each row = participant
  • Each column = word
  • convert to lower case
  • remove punctuation
  • remove numbers
  • stemming (reducing words to their stems by stripping suffixes, e.g. “animals” becomes “anim”)
  • removing stop words
  • removing words shorter than a minimum length
  • Optional: select only words that occur in a minimum number of definitions (see [4])
  • Optional: select the terms with the highest term frequency-inverse document frequency (tf-idf) scores (see [3]) - this is used only for selecting the vocabulary of the corpus (a preprocessing sketch follows this list)
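The steps above can be implemented with the tm package. A minimal sketch, assuming the raw responses sit in a character vector called definitions; the minimum word length of 3, the minimum document frequency of 2, and the tf-idf cutoff of 0.1 are illustrative values, not ones taken from this analysis:

```r
library(tm)
library(slam)

# One free-text definition per participant (assumed input)
corpus <- Corpus(VectorSource(definitions))

corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removePunctuation)                  # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                      # remove numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove stop words
corpus <- tm_map(corpus, stemDocument)                       # stemming

# Document-term matrix: rows = participants, columns = stemmed words.
# wordLengths drops very short words; bounds keeps only words that appear
# in at least 2 definitions (both values are illustrative).
dtm <- DocumentTermMatrix(corpus,
                          control = list(wordLengths = c(3, Inf),
                                         bounds = list(global = c(2, Inf))))

# Optional tf-idf vocabulary selection, following the approach in the
# package vignette: keep terms whose mean tf-idf is above a cutoff.
term_tfidf <- tapply(dtm$v / row_sums(dtm)[dtm$i], dtm$j, mean) *
  log2(nDocs(dtm) / col_sums(dtm > 0))
dtm <- dtm[, term_tfidf >= 0.1]
```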

Fitting the Model

In the CTM the number of topics (k) is fixed a priori. In this analysis I fitted the model with k fixed at both 14 (the number of categories identified during manual coding) and 7 (the number of higher-order topics that these categories fit into).
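A sketch of those two fits with topicmodels, where dtm is the document-term matrix from the preprocessing step and the seed is an illustrative choice:

```r
library(topicmodels)

# Correlated topic models with k fixed a priori; VEM is the default
# estimation method for CTM() in topicmodels.
ctm_14 <- CTM(dtm, k = 14, control = list(seed = 2025))
ctm_7  <- CTM(dtm, k = 7,  control = list(seed = 2025))

# Most probable stemmed terms per topic, and per-definition topic weights
terms(ctm_14, 5)
head(posterior(ctm_14)$topics)
```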

“Additionally, estimation using Gibbs sampling requires specification of values for the parameters of the prior distributions. [4] suggest a value of 50/k for α and 0.1 for δ. Because the number of topics is in general not known, models with several different numbers of topics are fitted and the optimal number is determined in a data-driven way.”

Selecting the number of topics can be done by splitting the data into training and testing sets: “The likelihood for the test data is then approximated using the lower bound for VEM estimation.”
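A sketch of that idea, assuming perplexity() (the exponential of the negative average log-likelihood per word) can be evaluated on a held-out document-term matrix; the 75/25 split, seed, and range of k are illustrative:

```r
set.seed(2025)
n <- nrow(dtm)
train <- sample(seq_len(n), size = round(0.75 * n))

dtm_train <- dtm[train, ]
dtm_test  <- dtm[-train, ]

# Fit candidate models on the training definitions and compare how well
# they predict the held-out definitions (lower perplexity is better).
ks <- 2:20
heldout <- sapply(ks, function(k) {
  fit <- CTM(dtm_train, k = k, control = list(seed = 2025))
  perplexity(fit, newdata = dtm_test)
})

data.frame(k = ks, heldout_perplexity = heldout)
```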

“Another possibility for model selection is to use hierarchical Dirichlet processes, as suggested in [5].”

Topic modeling uses a probabilistic model to estimate the probability of a word appearing given a particular topic (category of definition).
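Those word-given-topic probabilities can be read off a fitted model directly; a sketch using the ctm_14 object from above:

```r
# beta is a topics-by-terms matrix: beta[t, w] is the probability of
# word w given topic t, so each row sums to 1.
beta <- posterior(ctm_14)$terms
rowSums(beta)

# The ten most probable stemmed terms for each topic
terms(ctm_14, 10)
```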

Perplexity of models with varying numbers of topics

| Num Topics | LogLik | Perplexity | Themes |
| --- | --- | --- | --- |
| 2 | -59381.9503174082 | 772.326437265421 | natur, anim |
| 3 | -59421.5592366393 | 776.164516674393 | plant, anim, natur |
| 4 | -59442.0295455392 | 778.518088140724 | plant, anim, natur, earth |
| 5 | -59446.6197440542 | 779.4854042851 | outdoor, anim, natur, earth, world |
| 6 | -59445.1615487207 | 779.468312805306 | plant, anim, tree, earth, plants_anim, natur |
| 7 | -59448.3613681738 | 779.724657438038 | outdoor, anim, tree, earth, plants_anim, environ, natur |
| 8 | -59439.8127945491 | 778.830509353136 | outdoor, anim, everyth, earth, plants_anim, environ, forest, natur |
| 9 | -59440.8107295069 | 778.865172591979 | outdoor, anim, around_us, thing, plants_anim, without, forest, outsid, earth |
| 10 | -59427.1546790232 | 778.156239975586 | outdoor, anim, everyth, plant, plants_anim, without, forest, outsid, earth, tree |
| 11 | -59423.0705650545 | 776.153599507428 | outdoor, anim, around_us, plant, plants_anim, without, forest, outsid, earth, live, natur |
| 12 | -59421.6136498809 | 778.111006218198 | outdoor, anim, around_us, plant, plants_anim, without, forest, outsid, earth, live, environ, natur |
| 13 | -59422.6977120152 | 777.47886099978 | outdoor, anim, around_us, thing, around, without, man, exist, earth, live, environ, world, natur |
| 14 | -59421.6979283443 | 778.077493546316 | outdoor, live, around_us, thing, natur, without, man, exist, earth, life, environ, world, anim, plant |
| 15 | -59419.8811201895 | 778.343854648547 | outdoor, live, around_us, thing, around, without, man, exist, earth, life, environ, world, anim, area, natur |
| 16 | -59420.2245459775 | 778.629073395814 | outdoor, live, around_us, thing, around, without, man, exist, earth, life, environ, world, anim, tree, plants_anim, natur |
| 17 | -59407.2209355495 | 776.453640798642 | outdoor, part, around_us, plant, natur, without, man, exist, earth, life, environ, world, anim, area, plants_anim, live, natur |
| 18 | -59403.9166452776 | 775.449449888357 | outdoor, part, around_us, plant, natur, without, man, exist, earth, tree, environ, world, anim, forest, plants_anim, live, natur, natur |
| 19 | -59407.0185773046 | 777.424100007218 | outdoor, part, around_us, plant, natur, without, man, exist, earth, life, wildlif, world, anim, forest, plants_anim, live, natur, environ, natur |
| 20 | -59395.371028869 | 777.45481758584 | outdoor, part, around_us, plant, physic, without, man, exist, earth, life, wildlif, world, anim, tree, plants_anim, live, natur, environ, someth, tree |
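A sketch of the kind of loop that could produce a table like this one; the range of k and the use of the single most probable term per topic as the “theme” are assumptions about how the table was built:

```r
rows <- lapply(2:20, function(k) {
  fit <- CTM(dtm, k = k, control = list(seed = 2025))
  data.frame(NumTopics  = k,
             LogLik     = as.numeric(logLik(fit)),
             Perplexity = perplexity(fit),  # computed on the fitted data here
             Themes     = paste(terms(fit, 1), collapse = ", "))
})
do.call(rbind, rows)
```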

After going through all of this work, I do not think that this is a useful way of categorizing definitions. People use synonyms in too many different ways for the algorithm to infer, from word probabilities alone, which category a definition belongs to.

  1. Airoldi EM, Blei DM, Fienberg SE, Xing EP (2008). “Mixed Membership Stochastic Blockmodels.” Journal of Machine Learning Research, 9, 1981–2014.

  2. Steyvers M, Griffiths T (2007). “Probabilistic Topic Models.” In TK Landauer, DS McNamara, S Dennis, W Kintsch (eds.), Handbook of Latent Semantic Analysis. Lawrence Erlbaum Associates.

  3. Blei DM, Lafferty JD (2009). “Topic Models.” In A Srivastava, M Sahami (eds.), Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC Press.

  4. Griffiths TL, Steyvers M (2004). “Finding Scientific Topics.” Proceedings of the National Academy of Sciences of the United States of America, 101, 5228–5235.

  5. Teh YW, Jordan MI, Beal MJ, Blei DM (2006). “Hierarchical Dirichlet Processes.” Journal of the American Statistical Association, 101(476), 1566–1581.

https://heartbeat.comet.ml/text-classification-using-machine-learning-algorithm-in-r-ba763117c8aa