  Church of Virus BBS
  Mailing List
  Virus 2005

  Smallest Complete Semantic Dictionary
   Author  Topic: Smallest Complete Semantic Dictionary  (Read 849 times)
David Lucifer
Archon
Posts: 2642
Reputation: 8.94

"Enlighten me."

Smallest Complete Semantic Dictionary
« on: 2005-07-30 15:15:02 »

Conjecture: For every real human language there exist one or more smallest complete semantic dictionaries (SCSDs).

Take any real dictionary and remove all pronunciations, etymologies, pictures, side notes and front matter so you are left with just a list of words and definitions. This is, by definition, a semantic dictionary. If all the words used in the dictionary are also defined in the dictionary, it is said to be complete.

Every word in the dictionary has an associated dictionary length, or DLen. The following algorithm calculates the DLen of any word by generating a complete semantic dictionary from it.

1. Starting with one word, assign it the number 1.
2. Look up the definition of the word in the complete semantic dictionary and write it next to the word.
3. For each word in the definition, look up its number and replace the word with its number.
4. If a word in the definition doesn't yet have a number, assign it the next unused number and add it to the new dictionary.
5. Repeat steps 2-4 until all the words in the new dictionary are replaced with their assigned numbers.

The highest number assigned (corresponding to the number of words in the new dictionary) is the original word's DLen.
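The steps above amount to a breadth-first traversal of the definition graph. A minimal sketch in Python, using a tiny invented dictionary (the words and definitions below are made up purely for illustration; a real run would load an actual complete semantic dictionary):

```python
from collections import deque

# Toy "complete" semantic dictionary: every word appearing in a
# definition is itself defined. All entries are invented examples.
TOY_DICT = {
    "thing": ["entity"],
    "entity": ["thing"],
    "living": ["thing"],
    "animal": ["living", "thing"],
}

def dlen(word, dictionary):
    """DLen of `word`: the number of words in the complete semantic
    dictionary generated from it by steps 1-5 above."""
    numbers = {word: 1}          # step 1: the starting word gets number 1
    queue = deque([word])
    while queue:
        current = queue.popleft()
        # steps 2-4: number each word used in the definition, pulling
        # newly seen words into the new dictionary
        for used in dictionary[current]:
            if used not in numbers:
                numbers[used] = len(numbers) + 1
                queue.append(used)
    # the highest number assigned equals the size of the new dictionary
    return len(numbers)

print(dlen("thing", TOY_DICT))   # 2: {thing, entity}
print(dlen("animal", TOY_DICT))  # 4: {animal, living, thing, entity}
```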

Which word(s) in the English language have the smallest DLen, and what is that value? Which language contains the word with the smallest DLen? Do the same words in different languages tend to have the same or similar DLens? Are there any patterns in the distribution of DLens (histogram) within a given language?

More generally, does anyone else find these questions interesting? Has this kind of analysis been done before? What area of study would this fall in?
Blunderov
Archon
Gender: Male
Posts: 3160
Reputation: 8.90

"We think in generalities, we live in details"

RE: virus: Smallest Complete Semantic Dictionary
« Reply #1 on: 2005-07-31 04:24:42 »

[Blunderov] I asked the Politburo about this - her field is linguistics. Her thoughts are:
a: such an enquiry would probably fall mostly into the fields of lexicography or computational linguistics, and
b: with the obvious exception of 'dead' languages like Latin, new meanings (and contexts) arise continuously with the use of language. For a lexicographer, the SCSD (in any language) would always be the very largest and most complete dictionary available at that time.

(My own sixpence worth is that, as Chomsky has famously pointed out, the
universal grammar is capable of infinite expression. If I understand him
correctly, this is the very thing that must necessarily exist at the core of
language in order for it to be a 'language'.)

The Politburo goes on to point out that the internet has caused a tremendous
surge in new meanings, usages and indeed users of language; so much so that
lexicographers have no hope of being completely current. Language is
dynamic.

Take the word 'blog' - a contraction of 'weblog'. It has moved from slang to
colloquialism and must soon become an officially recognised 'dictionary'
word; but, in the famous (to me) words of the Rhino, 'We don't care!' It's
still language.

Best Regards. 

David Lucifer
Archon
Posts: 2642
Reputation: 8.94

"Enlighten me."

RE: virus: Smallest Complete Semantic Dictionary
« Reply #2 on: 2005-07-31 14:18:17 »

Quote from: Blunderov on 2005-07-31 04:24:42   

b: with, obviously, the exception of 'dead' languages like Latin , new
meanings (and contexts) arise continuously with the use of language. For a
lexicographer, the SCSD (in any language) would always be the very largest
and most complete dictionary available at that time.

That is true for a different sense of "complete". Here I am defining "complete" to mean that every sense of every word used in the dictionary is defined in the same dictionary.

Rhino kindly pointed out a bug in my algorithm. In step 4 we were adding new words to the new dictionary and (implicitly) pulling in all the definitions associated with each word, which is unnecessary. We should only add the definitions of senses that are actually used in other definitions. So the numbers should be associated with definitions (concepts), not words. There is a many-to-many relationship between words and concepts: each word can (and usually does) have many definitions, and one definition can have many words (synonyms).

In the revised algorithm the word "concept" refers to a single definition (sense) of a word.

1. Starting with one concept, assign it the number 1.
2. Look up the definition of the concept in the complete semantic dictionary and write it next to the concept.
3. For each word in the definition, look up the number of its corresponding concept and replace the word with that concept number.
4. If a concept in the definition doesn't yet have a number, assign it the next unused number and add it to the new dictionary.
5. Repeat steps 2-4 until all the words in the new dictionary are replaced with their assigned concept numbers.

The highest number assigned (the number of concepts in the new dictionary) is the DLen of the dictionary. The DLen of a concept is the DLen of the dictionary generated from it by the above algorithm.
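The revision is easy to sketch if the sense disambiguation is assumed to be already done, i.e. definitions are given directly as lists of concept IDs. The IDs below (written word#sense) are invented for illustration; the point of the fix is visible in that numbering "bank#1" no longer drags in "bank#2":

```python
from collections import deque

# Toy concept-level dictionary: keys are concepts (a single sense of a
# word) and definitions are lists of concept IDs. All IDs are invented;
# real input would require already-disambiguated senses.
TOY_CONCEPTS = {
    "thing#1": ["entity#1"],
    "entity#1": ["thing#1"],
    "side#1": ["thing#1"],
    "institution#1": ["thing#1"],
    "bank#1": ["side#1"],          # river bank
    "bank#2": ["institution#1"],   # financial bank
}

def dlen(concept, dictionary):
    """DLen of `concept`: the number of concepts in the complete
    semantic dictionary generated from it."""
    numbers = {concept: 1}
    queue = deque([concept])
    while queue:
        for used in dictionary[queue.popleft()]:
            if used not in numbers:
                numbers[used] = len(numbers) + 1
                queue.append(used)
    return len(numbers)

# Numbering "bank#1" pulls in only the river-bank sense, not "bank#2":
print(dlen("bank#1", TOY_CONCEPTS))  # 4: bank#1, side#1, thing#1, entity#1
```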

Let DLen' be the DLen of the source dictionary (the dictionary used as the source of definitions).

Conjecture: For some concepts DLen = DLen'. In other words, it will be necessary to pull in all definitions from the source dictionary to define every word used in the definition of the original concept. It would be interesting to find out what proportion of the concepts fall into this category.

Conjecture: For some concepts DLen < DLen'. I think this is obvious, because it was possible to generate complete semantic dictionaries 10 years ago, before the word "blog" was coined. The DLen of any concept 10 years ago was <= DLen' 10 years ago, and DLen' now is > DLen' 10 years ago, thanks to the introduction of new words.

If you understand the definitions and algorithms, it should be obvious that DLen cannot be > DLen'. It should also be obvious that both DLen and DLen' depend on the source dictionary, so the values will differ depending on whether you start with Merriam-Webster or the Concise OED, while the algorithm remains the same.
David Lucifer
Archon
Posts: 2642
Reputation: 8.94

"Enlighten me."

Re: Smallest Complete Semantic Dictionary
« Reply #3 on: 2005-07-31 14:25:42 »

Incidentally, with this refinement the algorithm changes from easy to implement in software to AI-complete. Before, it would have been quite straightforward to start with a source semantic dictionary (perhaps in XML format) generated from a real dictionary (e.g. the OED), input a concept, and automatically generate a CSD and calculate its DLen. With the change to concepts instead of words, the program would need to understand the definitions well enough to resolve which sense is meant by each word. That requires real understanding and real human-level intelligence, hence AI-complete. I guess this will have to remain a thought experiment for a while.
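The AI-complete step can be isolated in a single hypothetical function: mapping each word in a definition to the sense actually meant. The baseline below (all names and senses invented) just picks the first listed sense regardless of context, which is exactly the shortcut that makes a fully automatic implementation unreliable:

```python
# Invented sense inventory for one ambiguous word.
SENSES = {"bank": ["bank#1", "bank#2"]}  # river bank vs. financial bank

def resolve_sense(word, sense_inventory, context):
    # Hypothetical word-sense resolution step. A faithful implementation
    # would have to understand `context`; this naive first-sense baseline
    # ignores it entirely.
    return sense_inventory[word][0]

# Both calls return "bank#1", even though the contexts clearly differ:
print(resolve_sense("bank", SENSES, context=["river", "side"]))
print(resolve_sense("bank", SENSES, context=["money", "deposit"]))
```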
« Last Edit: 2005-07-31 14:29:22 by David Lucifer »