The Oxford Advanced Learner’s Dictionary has an appendix called “Defining Vocabulary”. It says:
“In order to make the dictionary definitions easy to understand, we have written them using only the words in the following list.
[…]
Occasionally it has been necessary to use in a definition a word not in the list. When such a word occurs it is shown in SMALL CAPITAL LETTERS.”
I estimate that list has about 3,500 words.
⇒ If you base your network on that dictionary or one carefully constructed like that, the graph could have a central core of about 3,500 nodes with the other words circling around it.
Making a good visualization would still be a challenge, of course.
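To make the "central core" idea concrete, here is a minimal sketch that counts how often each word is used inside other words' definitions and treats the frequently used ones as the core. The tiny dictionary, the `DEFINITIONS` name, and the threshold of 2 are all invented for illustration; a real run would load an actual dictionary and probably need proper tokenizing and lemmatizing.

```python
from collections import defaultdict

# Toy data, made up for illustration: headword -> definition text.
DEFINITIONS = {
    "puppy": "a young dog",
    "kitten": "a young cat",
    "dog": "an animal that people keep",
    "cat": "a small animal that people keep",
    "young": "not old",
}


def build_edges(definitions):
    """Map each headword to the set of words used in its definition."""
    return {head: set(text.lower().split()) for head, text in definitions.items()}


def usage_counts(edges):
    """Count, for each word, how many definitions use it."""
    counts = defaultdict(int)
    for used_words in edges.values():
        for word in used_words:
            counts[word] += 1
    return dict(counts)


edges = build_edges(DEFINITIONS)
counts = usage_counts(edges)

# Hypothetical threshold: a word is "core" if at least 2 definitions use it.
core = sorted(word for word, n in counts.items() if n >= 2)
print(core)
```

With a dictionary written from a controlled defining vocabulary, the words above the threshold should roughly be that vocabulary, and everything else would sit on the outside with edges pointing in.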
You can still browse it a bit online with some third-party sites: https://en-word.net/
(that link is broken, though; it should be https://github.com/globalwordnet/english-wordnet)
That site also uses WordNet, as does this project:
https://en.wikipedia.org/wiki/WordNet
WordNet was developed at Princeton with DARPA money as an early investigation into AI and so forth.
What am I missing?
My first thought was that the creator used a search library that filters common words by default, but the search code is all in the page and doesn't do that.
My second thought was that the 10k word corpus doesn't include those most common words. But it does.
Then I realized that the creator filtered them out. The page does say "7931 words", and the title here on HN says "10k* most common". The original corpus has exactly 10,000 words.
https://github.com/first20hours/google-10000-english/blob/d0...
The first 21 include all four we've mentioned:
the, of, and, to, a, in, for, is, on, that, by, this, with, i, you, it, not, or, be, are, from
From the old Princeton WordNet FAQ page (https://wordnet.princeton.edu/frequently-asked-questions):
> WordNet only contains "open-class words": nouns, verbs, adjectives, and adverbs. Thus, excluded words include determiners, prepositions, pronouns, conjunctions, and particles.
I suppose I could have included them as source nodes (only outgoing), but I think they would have ended up connecting to a whole bunch of definitions, while not providing much in the way of interest.
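For anyone wondering how much an open-class-only rule removes, here is a rough sketch applied to the 21 corpus words listed earlier in the thread. The `CLOSED_CLASS` set is hand-written for this example and is an assumption, not WordNet's actual contents (WordNet has entries for some short words under other senses), so treat the result as an illustration of the filtering step rather than a reproduction of the site's 7931-word list.

```python
# The first 21 words of the corpus, as quoted in the thread.
TOP_WORDS = [
    "the", "of", "and", "to", "a", "in", "for", "is", "on", "that", "by",
    "this", "with", "i", "you", "it", "not", "or", "be", "are", "from",
]

# Assumption: a hand-made set of determiners, prepositions, pronouns,
# and conjunctions. This is NOT taken from WordNet itself.
CLOSED_CLASS = {
    "the", "of", "and", "to", "a", "in", "for", "on", "that", "by",
    "this", "with", "i", "you", "it", "or", "from",
}


def split_by_class(words, closed_class):
    """Return (kept, dropped): words outside and inside the closed-class set."""
    kept = [w for w in words if w not in closed_class]
    dropped = [w for w in words if w in closed_class]
    return kept, dropped


kept, dropped = split_by_class(TOP_WORDS, CLOSED_CLASS)
print(f"kept {len(kept)} of {len(TOP_WORDS)}: {kept}")
```

Under this hand-made set only the verb forms and "not" survive from the top 21, which fits the observation that the most common words are the ones missing from the graph.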