• "He who knows does not speak, he who speaks does not know."
-- Lao Tze (Quoted in the opening paragraph of C. K. Ogden & I. A. Richards (1923) The Meaning of Meaning)
  • "The cruelest lies are often told in silence."
-- Robert Louis Stevenson
  • "But I shall let the little I have known go forth into the day in order that someone better than I may guess the truth, and in his work may prove and rebuke my error. At this I shall rejoice that I was yet a means whereby this truth has come to light."
-- ALBRECHT DÜRER (Opening quote in K. R. Popper (1963) Conjectures and Refutations)

A DIRECT APPROACH TO INFORMATION RETRIEVAL

Master's Thesis, University of London, 1975
Submitted by Kyung-Youn Park
Supervised by B. C. Brookes
University College London

WHAT

"What we should do, I suggest, is to give up the idea of ultimate sources of knowledge, and admit that all knowledge is human; that it is mixed with our errors, our prejudices, our dreams, and our hopes; that all we can do is to grope for truth even though it be beyond our reach. We may admit that our groping is often inspired, but we must be on our guard against the belief, however deeply felt, that our inspiration carries any authority, divine or otherwise. If we thus admit that there is no authority beyond the reach of criticism to be found within the whole province of our knowledge, however far it may have penetrated into the unknown, then we can retain, without danger, the idea that truth is beyond human authority. And we must retain it. For without this idea there can be no objective standards of inquiry; no criticism of our conjectures; no groping for the unknown; no quest for knowledge."

-- K. R. Popper, Conjectures and Refutations [4]

WHY

"In science men have learned consciously to subordinate themselves to a common purpose without losing the individuality of their achievements. Each one knows that his work depends on that of his predecessors and colleagues, and that it can only reach its fruition through the work of his successors."

-- J. D. Bernal, The Social Function of Science [5]

HOW

"The modern World Encyclopaedia should consist of relations, extracts, quotations, very carefully assembled with the approval of outstanding authorities in each subject, carefully collated and edited and critically presented. It would not be a miscellany, but a concentration, a clarification and a synthesis."

-- H. G. Wells, World Brain [6]

1. INTRODUCTION

In this study I am concerned with the file organization of scientific literature with a view to discovering useful information efficiently; broadly, the problem of information retrieval. Information retrieval now seems to imply something more than a mechanistic and technical problem, something that gradually resolves into the complexity of human communication, understanding and knowledge. Similar views have recently been expressed by Mitroff et al. [1] and by Brookes [2] in a wider context. "As we may think" or look back, our initial hope for information retrieval has faded in spite of the tremendous development of computer and other techniques over the past thirty years. This frustration was anticipated as early as 1948 by Wiener [3]. Still we are not sure whether the hope can be restored in the near future, particularly along the same line of thought.

As to scientific information* in the wide sense, the following fundamental questions may be raised:

  • What is scientific information?
  • Why should scientific information be organized?
  • How can scientific information be organized?

Obviously, information retrieval is most closely related to the last question. But I feel that the other two questions should also be taken into consideration when we intend to discuss information retrieval carefully. I selected the prefatory statements by Popper [4], by Bernal [5], and by Wells [6] as the most thought-provoking with respect to these three fundamental questions. They also represent the standpoint I have taken in approaching the problems of information retrieval.

In the following chapters, I first discuss some fundamental considerations for information retrieval. I shall understand the retrieval problems in the narrow sense mainly through Fairthorne's insightful contention [7]. Further, I shall attempt to understand the problems in the light of communication and information, both of which appear to be almost undefined. For this purpose I attend to Cherry's critical view of human communication [8] and to Ogden and Richards' classic theory of interpretation [9]. In short, I am seeking a solution to the problems of information retrieval by questioning what influences those who communicate and obtain information.

Eventually, I propose a way of file organization as most essential for information retrieval. The proposal is only crude at this stage. Indeed, the discussion of fundamental considerations is intended to clarify and to some extent justify the idea, which may require further elaboration and application. The main feature of the proposal is to use in retrieval those extracts in which the source document cites, describes, criticizes, and/or collates other documents. Such extracts seem to provide concise but significant clues for discriminating the cited documents. The most concise clues should be regarded as significant when they are coherent in their proper environments or contexts.
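The main feature of the proposal can be illustrated by a minimal sketch in Python. It is only one possible reading, not the file organization worked out in this study: the sample records, the document names, and the functions `build_extract_file` and `search` are all hypothetical, and plain substring matching stands in for whatever search procedure a real file would use.

```python
from collections import defaultdict

# Hypothetical sample data: (citing document, cited document, extract in
# which the citing document describes or criticizes the cited one).
CITING_EXTRACTS = [
    ("doc-A", "doc-X", "X reports a test of abstracts against titles in retrieval."),
    ("doc-B", "doc-X", "The sample used in X is too small to support its claim."),
    ("doc-B", "doc-Y", "Y proposes an inverted file of index terms."),
]


def build_extract_file(records):
    """Group the citing extracts under the document they cite."""
    extract_file = defaultdict(list)
    for citing, cited, extract in records:
        extract_file[cited].append((citing, extract))
    return dict(extract_file)


def search(extract_file, word):
    """Return the cited documents with at least one extract containing `word`.

    Plain substring matching is an assumption made for brevity.
    """
    word = word.lower()
    hits = [
        cited
        for cited, entries in extract_file.items()
        if any(word in extract.lower() for _, extract in entries)
    ]
    return sorted(hits)


extract_file = build_extract_file(CITING_EXTRACTS)
print(search(extract_file, "abstracts"))
print(search(extract_file, "inverted"))
```

In this sketch a cited document is found through what other documents say about it, so each entry carries its clue together with the context in which the clue was given.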

* Honestly I cannot quite clearly distinguish between scientific information and scientific knowledge, and again science, in the sense that these are sometimes interchangeable. Information and knowledge may represent the same thing in essence, which I shall understand as information particularly when it is oriented to the specific use or value.

2. THE LINE OF ATTACK

The overall view of main retrieval events may be represented schematically as shown in Figure 1. It may be said here that:

S-d (substitution)
The system S substitutes d for a document D for the purpose of notification and prediction.
d-U (notification)
The user U is notified of a document D through d, and discerns what the document D is about.
U-E (interaction)
The user U interacts with the system S, giving evidence E either on his information need, or on his satisfaction of the need.
E-S (inference)
The system S makes inferences from the evidence E, either making a search formulation, or evaluating its performance.
S-D (prediction)
The system S predicts a relevant document D, based on d and the search formulation.
D-U (discrimination)
The user U discriminates the document D in the light of his information need.

Figure 1. Schematic View of Information Retrieval Events.

Information retrieval is a complex type of communication between the system and the user. The schematic diagram in Figure 1 roughly shows the situation. Admittedly, the diagram is too simple and crude for explaining information retrieval meaningfully. It will be expanded in Chapter 5. Meanwhile, it may suffice to show how to approach retrieval problems.

What we want to know ultimately is the relationship between the system and the user, which is represented in Figure 1 by the solid arrows and characterized by prediction and discrimination of documents. Also, we can consider many other relationships in the diagram; for example, those represented by the dotted arrows and the broken arrows. Here we can reasonably assert that all knowledge of these relationships should concentrate on explicating the relationship of utmost importance between the system and the user.

On the other hand, information retrieval may be possible with little or no attention to knowledge of the relationship between the system and the user. That is to say, we can contain the system and the user in a black box*, perform information retrieval, and improve the performance successively by feedback control. Combination of the solid arrows and the dotted arrows makes a closed cycle for feedback control. The black box has two input terminals, Ein and Din, which are input to the system and the user, respectively. It also has two output terminals: one for the user to give Eout in search of, and then in response to, Din, and the other for the system to retrieve Dout in response to Ein. This principle is illustrated in Figure 2, where Po represents the given initial condition or a set of performance factors of the black box.

Figure 2. Feedback Control of Information Retrieval.

Whether or not it is possible and practicable, this principle almost certainly would not tell us much that is meaningful about the relationship between the system and the user. In other words, it may not necessarily be suitable for explicating the relationship. Even if suitable, it can explain the relationship only indirectly, i.e., through inferences from a great deal of valid and consistent evidence.

The approach that has been overwhelmingly used in the field of information retrieval is very similar to this principle. The main difference is that the initial condition Po is varied in many ways in order to find which initial condition gives the optimum performance of the system. This approach is not really intended to reveal the relationship between the system and the user. The other, direct approach will be attempted in this study.

* "I shall understand by a black box a piece of apparatus, such as four-terminal networks with two input and two output terminals, which performs a definite operation on the present and past of the input potential, but for which we do not necessarily have any information of the structure by which this operation is performed. On the other hand, a white box will be similar network in which we have built in the relation between input and output potentials in accordance with a definite structural plan for securing a previously determined input-output relation." -- Norbert Wiener, Cybernetics [3]

3. SYSTEMS VS. USERS

3.1 Discrimination

The user needs information. Even if he is seeking it in a document, he may be little conscious of the physical form of the document. For he may understand the notion of information in much the same way as a housewife in the marketplace does.

The user can easily and quite properly speak of pertinent, relevant, useful, or valuable information, expressing somewhat different shades of meaning. All these similar qualifiers, however, seem to be more or less redundant, for the user would appreciate information as such only when it is relevant to his specific purpose. Furthermore, it seems certain that these qualifiers are used only ambiguously.

Also, it is hard to say that the user, having found extremely valuable information in the library, must have in mind any economic sense at all. Of course, anyone would be perfectly right to understand information essentially as something "bought, sold, stored, treated, exchanged and consumed in economic terms," if there exists that sort of information. Another way of understanding it [8] is that:

...the information content of signals is not to be regarded as a commodity; it is more a property or potential of the signals, and as a concept it is closely related to the idea of selection, or discrimination.

Obviously, this quotation represents our common understanding. However, we know that, by understanding or defining something in a particular way, one in effect specifies one's readiness or intention to communicate with other people about the thing defined. And those people are expected or invited to share the same understanding. Unfortunately, some people would not or could not agree with the suggested definition, however authoritative, because it seems to them absurd, unnecessary, or outside their concern, or for some other reason. In this case, communication is likely to become a conflicting argument apart from what is to be communicated; in effect, a breakdown of communication.

Communication between the retrieval system and the user, as illustrated in Figure 1, is of secondary importance to the user. It is only intermediary or necessary for another communication of primary importance, that is, communication of information with the author of a document. Therefore misconception of information may damage both communications.

From the psychological point of view, Stevens [10] attempted to generalize communication by defining it as "the discriminatory response of an organism to a stimulus." This definition was criticized by Cherry [8] on the grounds that communication is essentially the relationship established between stimuli and responses. It does not follow however that the definition is wrong. It simply focuses on the communication event at the receiving end; for example, the discriminatory response of the user to a given document.

Information seeking presupposes satisfaction of information needs. The user will be really satisfied only when he finds information relevant to his need. Naturally, he discriminates what he has received in the light of his need and criteria; that is, he makes a relevance judgment. Nothing can stop him from being subjective and tough in the judgment. He may even totalize his relevance criteria. It is now well known that relevance judgments are complicated, depending not only on subject matter but also on many other things [11]. It is noteworthy that the real judge is the user. If a panel of judges were to take his place, it might serve some practical purposes, but only as a makeshift. We shall reserve the pure notion of relevance for those systems that aim to provide relevant information.

Many studies have shown that informal communication is very popular among scientists, especially among those who are eminent. This phenomenon is quite convincing. However, it is a mistake to conclude that informal communication is superior to formal communication. At least, informal communication does not pay much attention to the "social responsibility" or morality, if you like, which was emphasized by Bernal [5]. It should be said that each represents a different machinery of communication, neither degrading the other. Presumably, all the past experience of an eminent scientist in formal communication (e.g., through books, journals, lectures, libraries, and so on) must have enabled him to short-circuit to only the essence, say, a few words of suggestion. This short-circuiting may be applied to formal communication.

Among many others, Jahoda et al. [12] observed that 66% of faculty members interviewed in one university maintained personal indexes, that 42% of them regarded preparation of indexes as too time-consuming, and that 32% complained of inconsistency in indexing. No doubt, a large proportion of scientists spend much time in preparing their own systems, i.e., personal indexes or the like, and they are themselves suffering from so-called retrieval problems. Therefore, anything that can help them solve the problems and improve their own systems would be duly appreciated.

On the other hand, naive and primitive as they may be, personal indexes must be worth careful study in order to learn how scientists extend their retrieval facilities toward systems. An important suggestion to all kinds of retrieval systems may be found there.

From the user's point of view, any retrieval system may be regarded as an extension of his information-seeking facilities. The user can satisfy his information need to some extent for himself, instead of delegating retrieval to what may be called outside systems as opposed to personal means. In this respect, it is questionable whether outside systems, however elaborate, can promise better satisfaction than personal means which are familiar to the user.

3.2 Prediction

When the user delegates retrieval to the system, there must be some agreement, although tacit, between both parties about the way in which the system is designed to act on behalf of the user. A set of constraints is characteristic of any system whatsoever. Beyond these constraints the retrieval system is not expected to answer, and the user is not normally allowed to ask. However, these obvious constraints seem to be often disguised and overlooked. (Note that at the moment we are talking about the current systems regardless of their future developments.) This aspect has been convincingly discussed by Fairthorne in many of his writings.

Then, what is the agreement? The straightforward answer is that the user should agree that the system works on the basis of subject similarity rather than relevance. The distinction between these two apparently similar notions should be made clear.

If any two readers are compared as to what each of them recognizes from the same document, the two will in general differ from each other. For everyone tends to interpret subjectively what is written or said. These individual interpretations may be superimposed for many readers, in order to separate the densely overlapping, and thus explicit, meaning from the relatively subjective and more or less implicit meanings.
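The superimposition of individual interpretations may be sketched as follows. This is an illustration under stated assumptions, not a procedure given in this study: each reader's interpretation is modelled as a set of recognized items, and the cut-off `min_share` (more than half of the readers) is an arbitrary, hypothetical choice for deciding what counts as "densely overlapping".

```python
from collections import Counter


def superimpose(interpretations, min_share=0.5):
    """Split recognized items into explicit and implicit meaning.

    An item is treated as explicit when more than `min_share` of the
    readers recognize it; the threshold is an assumption of this sketch.
    """
    counts = Counter(item for reading in interpretations for item in set(reading))
    readers = len(interpretations)
    explicit = {item for item, n in counts.items() if n / readers > min_share}
    implicit = set(counts) - explicit
    return explicit, implicit


# Hypothetical readings of one document by three readers.
readings = [
    {"retrieval", "indexing", "cost"},
    {"retrieval", "indexing"},
    {"retrieval", "history"},
]
explicit, implicit = superimpose(readings)
print(sorted(explicit))
print(sorted(implicit))
```

Items recognized by most readers fall on the explicit side; items recognized by a single reader remain on the implicit side.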

We can fairly reasonably say that in interpreting a document, the indexer tends to behave in a unanimous way; in principle he can discern from a document most of the explicit meaning. The implicit meaning that he may discern in addition would be negligible considering the large number of potential users. As opposed to the indexer, the user tends to interpret in a subjective way; he may find information not in the explicit meaning but elsewhere. Various users may be very different from each other in finding information according to their past experiences and present state of mind.

Not attempting to be precise, let us associate the explicit, unanimous interpretation with subject similarity. And let us associate the subjective, individual interpretation with relevance. Perhaps we cannot discuss similarity without unanimity or commonality of interpretation; nor relevance without subjectivity or individuality of interpretation. Fairthorne [7] distinguishes them as extensional aboutness and intensional aboutness. We shall return to this distinction later.

It may be said that subject similarity is a necessary condition for relevance and that relevance is a sufficient condition for subject similarity. In this respect, the system which operates even ideally on the basis of subject similarity, or in a unanimous way, is liable to two types of error, that is, to miss relevant documents and to retrieve non-relevant documents. It cannot turn aside relevance by any means. Thus it can only predict relevance, however ideal in recognizing subject similarity.
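The two types of error can be made concrete with a small sketch. The function `retrieval_errors` and the document names are hypothetical; recall and precision are computed in their usual sense as a convenience and are not results reported in this study.

```python
def retrieval_errors(retrieved, relevant):
    """Compare what the system retrieved with what the user judged relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    return {
        # First type of error: relevant documents the system missed.
        "missed": relevant - retrieved,
        # Second type of error: non-relevant documents the system retrieved.
        "non_relevant_retrieved": retrieved - relevant,
        "recall": len(hits) / len(relevant) if relevant else None,
        "precision": len(hits) / len(retrieved) if retrieved else None,
    }


# Hypothetical outcome of one search.
result = retrieval_errors(
    retrieved={"d1", "d2", "d3"},
    relevant={"d2", "d3", "d4", "d5"},
)
print(sorted(result["missed"]))
print(sorted(result["non_relevant_retrieved"]))
print(result["recall"], result["precision"])
```

Note that the set of relevant documents is supplied from outside the function, by the user's judgment; the system itself only produces the retrieved set.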

Strictly speaking, relevance is a priori from the system's point of view. It is true in the sense that relevance criteria, however precisely stated by the user, cannot wholly be accepted due to the system constraints, so the accepted part may not be sufficient in the end. It is all the more so because relevance criteria, however readily accepted by the system, cannot affect indexing retrospectively. To borrow Fairthorne's contention [7]:

An indexer does not and cannot index all the ways in which a document will interest all kinds of readers, present and future.

We have still another reason to believe in the a priori characteristic of relevance. A great deal of experimental as well as operational experience in retrieval has been accumulated at least over the past twenty years. Retrieval languages and devices must have been greatly improved. Nevertheless, how much has been learned about how the user judges a relevant document* as such? Obviously not much.

If so understood, relevance must have been overemphasized in evaluating retrieval systems. In particular, comparison of different systems might have been unfair to some; for it is not certain whether or not subject similarity runs parallel with relevance. How retrieval systems are to be evaluated and compared should be made clear first; in terms of either subject similarity or relevance or both. An evaluation based solely on subject similarity would not tell much about how to give more satisfaction to the user; one based solely on relevance would not tell much about how to keep the agreement with the user better. Collation of both evaluations will be necessary to learn about the relationship between the system and the user.

* Here we are mainly concerned with retrieving documents and the sort of information which is obtainable from the documents. We may attribute relevance to a document; but only indirectly, that is, after information has been discovered from it.

4. DOCUMENTS VS. SURROGATES

A group of documents can be said to be similar to each other, when they have in common a set of identical properties A; they are similar with respect to the shared properties A. In general, each document in a similarity group has some other (different) properties B in addition to A. Therefore, the content C of a document may be represented:

C = A * B.

This equation may apply somewhat analogously to the document surrogate, too.

Because of the repetitive nature of the shared properties A, a group of similar documents are characterized by semantic redundancy, even if not by textual redundancy. This characteristic will be transferred somewhat analogously to the corresponding document surrogates. That is to say, the identical properties A are repeated not only in similar documents but also in their surrogates. This repetition or redundancy in a group of similar surrogates appears to be inevitable, because there would be no grouping of similar documents or surrogates without that. But it is not quite so from the point of view of file organization. For one thing, the idea of inverted files may be worth remembering in this connection; however, this idea is likely to raise another kind of redundancy, that is, repetition of the name of the surrogate which belongs to many similarities, e.g., index terms.
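The two kinds of repetition mentioned above can be compared in a short sketch. The surrogates, index terms, and function names are hypothetical sample material; the point is only that inverting the file moves the repetition from the term names to the surrogate names.

```python
from collections import Counter, defaultdict

# Hypothetical direct file: each surrogate lists its index terms.
SURROGATES = {
    "d1": ["retrieval", "indexing"],
    "d2": ["retrieval", "abstracts"],
    "d3": ["retrieval", "indexing", "abstracts"],
}


def invert(surrogates):
    """Build an inverted file: each index term lists its surrogates."""
    inverted = defaultdict(list)
    for doc, terms in surrogates.items():
        for term in terms:
            inverted[term].append(doc)
    return dict(inverted)


def name_repetitions(file_):
    """Count how often each name is repeated among the entries of a file."""
    return Counter(name for names in file_.values() for name in names)


inverted = invert(SURROGATES)
print(name_repetitions(SURROGATES))  # repetition of index terms
print(name_repetitions(inverted))    # repetition of surrogate names
```

In the direct file the shared term "retrieval" is written three times; in the inverted file it is written once, but the name "d3" is now written three times, once under each of its terms.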

An abstract file as a retrieval tool is no exception to such redundancy. The comparative efficiency of abstracts in retrieval is still controversial. The low efficiency of abstracts, if true, may stem from difficulties in formalization and in machine processing. However, formalization does not really matter so much in human processing. And we can reasonably assert that abstracts contain much greater "semantic information" than other kinds of surrogates such as titles, sets of index terms, and classification codes. Therefore, without considering the time consumed, the human searching of abstracts should perform better than that of other surrogates in judging similarity, at least in principle.

Suppose that abstracting processes are formalized to such an extent that the above equation holds well. Then it will be possible to exclude the identical part A from all but one of the similar abstracts, letting them refer to A in the retained abstract. Alternatively, we can list all the similar abstracts in one of them. By doing so, we need not search for them one by one but as a group, whenever the search requests fall upon A.
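Storing the identical part A once for a whole group may be sketched as follows. The sketch rests on a strong assumption that the text does not make: the content of each abstract is modelled as a set of properties, and the combination written "*" in C = A * B is modelled as set union. The sample group and the function names are hypothetical.

```python
def factor_group(abstracts):
    """Split a group of similar abstracts into the shared part A and residues B.

    `abstracts` maps an abstract name to its set of properties (at least
    one abstract is required). Returns (A, {name: B}).
    """
    shared = set.intersection(*abstracts.values())
    residual = {name: props - shared for name, props in abstracts.items()}
    return shared, residual


def restore(shared, residual, name):
    """Recover the content C of one abstract from A and its own B."""
    return shared | residual[name]


# Hypothetical group of three similar abstracts.
GROUP = {
    "a1": {"retrieval", "abstracts", "recall"},
    "a2": {"retrieval", "abstracts", "cost"},
    "a3": {"retrieval", "abstracts"},
}
A, B = factor_group(GROUP)
print(sorted(A))
print({name: sorted(props) for name, props in B.items()})
```

A request that falls upon A can then be answered with the whole group at once, while each abstract keeps only its different part B.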

Once the existence of A is accepted, a model abstract of the identical part A may be desired for all the documents that have A in common. A collection of such models will look like a classification scheme. This can be applied to individual abstracts. Then, each abstract may consist of the prescriptive code for A and the descriptive text for B, the different part. (The same procedure may be adapted in parallel to combining a hierarchical classification system with a descriptive indexing system.) In practice, the prescriptive code may or may not be substituted for the text corresponding to A in an abstract. What is implied in this idea is not merely to reduce the textual or semantic redundancy involved in a group of similar abstracts.

In general, document surrogates include errors of various kinds. Let us take for example just one kind of error: inconsistency in surrogation. Many inconsistencies can hardly be said to be errors in the strict sense, for the surrogates are fairly correct individually. These inconsistencies may be attributable to difficulty in, or lack of, formalization.

In this respect, abstracting systems, particularly those based on author abstracts, seem to be hopeless to control. However, this is not the whole point. The fault lies in leaving the failure caused by inconsistency to be repeated each time the abstracts are searched. Certainly, this failure can be prevented or reduced by careful examination and grouping of similar abstracts, prior to a series of searches.

This prior grouping process implies retrieval which ensures high recall even at the cost of low precision. One thing that matters here is the manageable number of abstracts to be examined as to their similarity. The greater the number, the more preventive work there is to be done. What makes matters worse is the possible multiplicity of similarity groups which an abstract belongs to at the same time. We may not even make certain which groups will be more significant or more likely to be requested by the user. This situation will eventually demand enormous efforts. Our ideal to rule out inconsistencies may require prohibitive efforts.

We all know something about abstracts and extracts, not being pretentious. However, this general kind of knowledge may not suffice for critical discussion of their characteristics, merits, snags, and so on. An abstract was defined as an abbreviated, accurate representation of a document; and an extract as consisting of one or more portions of a document selected to represent the whole. Were they defined with accuracy? Were the definitions intended for making clearer how to make abstracts and extracts? Are there any really working standards for making them?

Any document surrogate, however small and biased its content, may be justified, because it is not the document itself but a representation, description or prescription. It is sometimes mistakenly assumed that the content of a surrogate is the same as the content of the corresponding document; or that the equation C = A * B holds equally in both cases. Distinguishing between intensional aboutness and extensional aboutness, Fairthorne [7] says that:

Parts of a document are not always about what the entire document is about, nor is a document usually about the sum of the things it mentions. A document is a unit of discourse, and its component statements must be considered in the light of why this unit has been acquired or requested.

Even with the great flexibility and elasticity of language, it seems almost impossible to make an abstract of about two hundred words exactly analogous to the content C of the corresponding document. In other words, selection and bias are more or less unavoidable in abstracting. If paraphrasing of selection is considered to be semantically superficial, then the difference between an abstract and an extract will be somewhat marginal. Both are biased selections or parts of the content C.

Roughly speaking, an abstract is more intended to balance selection uniformly over C, aiming at inductive information effects. By contrast, an extract is more intended to spot selection (perhaps the conclusive part) eccentrically from C, aiming at immediate rather than inductive information effects. Yet no formal procedures, beyond conventions of a vague nature, are available for deciding what to select.

Considering the power of meta-language and its use in retrieval, Goffman et al. [13] note that an abstract is given in meta-language whereas an extract is given in object-language. They further note that many abstracts, being written in "trivial" meta-language, should more accurately be called extracts.

Selection or part of a document, whether balancing or spotting, should assume that it can do without the rest or context. In other words, it should be an independent unit of discourse. Truly, abstracts, extracts, titles, even index terms, all these tell us something on their own account. Fairthorne [7] paraphrases Bohnert's notion of data as:

parts of a document that, in the given environment, will be read in isolation from the rest of the text.

This phrase seems to be worth careful scrutiny. Perhaps, we can raise several questions such as:

  • What is the given environment?
  • What happens to a reader when he reads the parts in isolation from the rest?
  • What is the relationship between a document D and its part d in terms of effects on the reader?

We shall discuss these and other questions in the next chapter. Meanwhile, Belzer [14] calculates "the entropies of the various surrogates of error-free information," by assigning one bit of information to a full document. For five different types of surrogates - citation, abstract, first paragraph, last paragraph, first and last paragraph - he observes the 2 x 2 contingency of:

P = relevant as predicted from surrogates,
P'= non-relevant as predicted from surrogates,
R = relevant as evaluated from full documents,
R'= non-relevant as evaluated from full documents.

By showing the calculation result as in Table 1, and by calling attention to the fact that only the production of abstracts requires extensive professional effort, he in effect reasserts the superiority of extracts over abstracts. Comparison of a document with its surrogates is also interesting.

Table 1. The Entropies of Various Surrogates.
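The 2 x 2 contingency of P, P', R and R' can be tallied as in the sketch below. It shows only the counting step; it does not reproduce Belzer's entropy calculation, and the judgments, document names, and the function `contingency` are hypothetical.

```python
from collections import Counter


def contingency(predicted, evaluated):
    """Tally the 2 x 2 contingency of surrogate predictions and full-document judgments.

    `predicted[doc]` is True when the document was judged relevant from its
    surrogate (P); `evaluated[doc]` is True when it was judged relevant from
    the full document (R). Primed labels mark the non-relevant cases.
    """
    cells = Counter()
    for doc, p in predicted.items():
        r = evaluated[doc]
        cells[("P" if p else "P'", "R" if r else "R'")] += 1
    return cells


# Hypothetical judgments for five documents and one type of surrogate.
predicted = {"d1": True, "d2": True, "d3": False, "d4": False, "d5": True}
evaluated = {"d1": True, "d2": False, "d3": True, "d4": False, "d5": True}

cells = contingency(predicted, evaluated)
for cell in [("P", "R"), ("P", "R'"), ("P'", "R"), ("P'", "R'")]:
    print(cell, cells[cell])
```

The off-diagonal cells, (P, R') and (P', R), are the cases in which the surrogate misled the judge about the full document.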

5. THE THEORY OF INTERPRETATION

5.1 Denotation and Connotation

We are free to think of anything. At one moment we think of the trees in the garden; at another, of tomorrow's weather; at still another, of past experience. These things that we think of, whether existent or non-existent, substantial or imaginary, true or false, may be said to be fairly stable in contrast to our free thoughts. The trees in the garden must be there dropping leaves, even while we stop thinking of them. Thus we can freely organize or map our thoughts upon this stable background.

We invent symbols* with which to express or map our thoughts on things. In isolation we are free to invent and use any symbols to express our thoughts. In a society where we should associate, cooperate, share, or communicate with other people [8], we have two options. That is, either to dictate our own invention or to obey the social rule. Small babies would begin with dictations, e.g., crying out instinctively to express their desire. However, they cannot merely dictate. That would be admittedly too limiting. Far more required of them and far more convenient in most cases is to conform to the social rule. They learn to practice it gradually but never completely. Even when grown up, dictations are necessary. As a matter of fact, the two options are normally intermingled, but eventually separable.

* By symbol is meant any unit of communication as a whole, regardless of its parts or constituents. For example, a word, a phrase, a sentence, a journal article, or even a non-sensical string of words under consideration, as far as it is intended to convey some definite thought or idea. The term 'sign' may be broader than 'symbol.' For example, any sense-data or stimuli to an organism may be treated as signs.

The social rule on how to use symbols has evolved among people, somewhat loosely, and is still changing. Truly, people have been inventing, elaborating, and using more and more convenient symbols for more efficient communication. The speaker expects the listener to receive the symbol that is substantially imparted or communicated. However, the speaker's real expectation is that the listener will share the same thought and be directed to the same thing that the speaker has in mind. Largely, this expectation is met when the speaker conforms to the well-established social rule, e.g., grammar and dictionary. Then the majority understand him unanimously; communication is clear-cut. In this case the meaning of such a symbol may be said to be denotative, explicit, or extensional. Refer to a dictionary for these words (Table 2).

On the other hand, communication does not appear as simple as that. Indeed, communication as a whole is a loose affair, full of ambiguous and misleading expressions, misunderstandings, different interpretations, and so on. Suppose that someone intends to convey a definite thought or story with the following word string [8]:

woman, street, crowd, traffic, noise, haste, thief, bag, loss, scream, police, .....

which looks almost non-sensical as a whole. Then, what will happen to us listeners? We have a dictionary, but we cannot simply sum up the meanings of individual words. That "a whole is more than the sum of the parts" is too plain a saying. There seems to be no grammar to which the speaker might have conformed. He merely suggests rather than tells the story, which in other words is implied or implicit in the word string, i.e., symbol. From this awkward symbol we can guess the story with varying accuracies, if we are ready to take risks. In this case, the meaning of such a symbol may be said to be connotative, implicit, or intensional.[13] [14] [15] [16] [17] [18] Again, refer to a dictionary for these words (Table 2).

Table 2. Denotation and Connotation: Excerpts [15].

5.2 The Theory of Ogden and Richards


When we communicate our thoughts about things, we use signs. That is to say, three factors - a thought, a thing, and a sign - are essentially involved in any communication event, either speaking or listening.[19] [20] [21] Ogden and Richards [9] place the three factors at the corners of a triangle, where the relations between these factors are represented by the sides, as shown in Figure 3. They recognize that there are causal, direct relations between a sign and a thought, and between a thought and a thing. But, they insist, the relation between a sign and a thing is merely "imputed" as opposed to the causal relations; it holds only indirectly round the two sides of the triangle. They further insist that it is because of this imputed relation that most of the language problems arise. Signs are instruments subjected to thinking or interpretation; they can be related to things only through thinking, or more specifically through interpretation.

Things and experiences are also interpreted; they are treated as signs. Thus, through all our life, we interpret signs in the widest sense, with few exceptions. Then, what happens when we interpret signs? Ogden and Richards [9] generalize the process of sign interpretations as follows:

The effects upon the organism due to any sign which may be any stimulus from without, or any process taking place within, depend upon the past history of the organism, both generally and in a more precise fashion. In a sense, no doubt, the whole past history is relevant; but there will be some among the past events in that history which more directly determine the value of the present agitation than others.

For example, a dog, on hearing the dinner bell, interprets the bell sounds as a sign and runs into the dining room. He can do so owing to the past experience in which clumps of events - Bells, savours, longing for food, etc. - have recurred "nearly uniformly." Such a clump of events may be called an external context. And the mental events, occurring in the dog which can link merely the present bell sound together with the past experience of bells-savours-longings, may be called a psychological context. To define more precisely:

A context is a set of entities (things or events) related in a certain way; these entities have each a character such that other sets of entities occur having the same characters and related by the same relation; and these occur 'nearly uniformly.'[22]

Contexts occur more or less uniformly; that is to say, the constitutive characters of a context recur with uncertainty or with a probability. It follows that the context is said to be determinative with respect to one character if both characters are closely related. By taking very general constitutive characters and uniting relations, we have contexts of high probability; we can increase the probability of a context by adding suitable members. Thus we react to the recurring part of the context in the same way as we did to the whole context. Experience recurs in contexts which recur more or less uniformly, and interpretation is only possible in these recurring contexts.

The notion of relevance is of great importance in the theory of meaning. A consideration (notion, idea) or an experience, we shall say, is relevant to an interpretation when it forms part of the psychological context which links other contexts together in the peculiar fashion in which interpretation so links them.*
* "Other psychological linkings of external contexts are not essentially different from interpretation, but we are only here concerned with the cognitive aspect of mental process."

Finally, Ogden and Richards [9] attempt to narrow down their implications by applying the context theory of interpretation[23][24][25] to the use of words at different levels; from simple recognition of sounds as words to critical interpretation of words.

With most thinkers, however, the symbol seems to be less essential. It can be dispensed with, altered within limits and is subordinate to the reference for which it is a symbol. For such people, for the normal case that is to say, the symbol is only occasionally part of the psychological context required for the references. No doubt for us all there are references which we can only make by the aid of words, ie, by contexts of which words are members, but these are not necessarily the same for people of different mental types and levels; and further, even for one individual a reference which may be able to dispense with a word on one occasion may require it, in the sense of being impossible without it, on another. On different occasions quite different contexts may be determinative in respect of similar references. It will be remembered that two references, which are sufficiently similar in essentials to be regarded as the same for practical purposes, may yet differ very widely in their minor features. The contexts operative may include additional supernumerary members. But any one of these minor features may, through a change in the wider contexts upon which these narrower contexts depend, become an essential element instead of a mere accompaniment. This appears to happen in the change from word-freedom, when the word is not an essential member of the context of the reference, to word-dependence, when it is.

5.3 Implications for Information Retrieval


Ogden and Richards [9] do not specify contexts in the triangle in Figure 3. Cherry [8] modifies the diagram as shown in Figure 4a. We shall further modify it as shown in Figure 4b, and say that the triangle is surrounded by the external context and contains the psychological context inside. Still, the diagram only represents either speaking or listening. Thus we shall develop the diagram further in the following.

Figure 4. Modified Triangle Diagram.
Figure 5. Functional Flow in a Unit Communication.
Figure 6. An Ideal Unit Communication.

A unit communication, including both speaking and listening, may be represented by the diagram as shown in Figure 5. The arrows and the corresponding words may be convenient to represent the functional flow in a unit communication. Thus we shall say that:

In speaking
A thing initiates a thought which in turn adopts a sign.
In listening
A sign evokes a thought which in turn directs to a thing.

If there arises no physical distortion between two signs, then the sign in speaking and the sign in listening will be the same, or get together. If the listener's thought directs to the same thing that initiated the speaker's thought, then we shall have an ideal unit communication as shown in Figure 6.

We may better develop the diagram further in order to represent communication situations which are more complex than a unit communication. And we shall normally approximate individual units of communication to ideal units as shown in Figure 6.

Let us take for example the password game. The questioner, thinking of WATCHWORD, gives a symbol 'watchword' to the intermediary, who in turn gives another symbol 'password' to the answerer. Before and after translation from 'watchword' into 'password,' the intermediary's thoughts I and I' should be different such that I corresponds to WATCHWORD and I' to PASSWORD. Therefore, the answerer's thought should direct first to PASSWORD, and then to WATCHWORD which is the correct answer. The answerer should make a guess that is the reverse of translation. This password game is illustrated in Figure 7. Communication between the questioner and the intermediary makes an ideal unit, and that between the intermediary and the answerer makes another ideal unit. These two ideal units are separated by a communication gap which should be overcome by the answerer's guesswork. As a corollary, the complexity of communications involved in information retrieval may be shown as the diagram in Figure 8.

Figure 7. Password Game.
Figure 8. Complex Communication Involved in Retrieval.

6. PROPOSAL FOR FILE ORGANIZATION


6.1 Incentives


The idea proposed in this chapter is to use in information retrieval those extracts in which the source document cites, describes, criticizes, and/or collates other documents (See Figure 9). It is only exploratory within the scope of this study. It can be justified on the grounds that the citing and the cited documents are coherent with each other, that extracts provide concise clues for discriminating these documents, and that even concise clues are interpreted meaningfully in the given contexts. Although widely practiced among information users, the idea has not yet been formally studied in view of efficient file organization as far as I know. Therefore, the implication of the idea might go farther in the future than can be expected now, and require more exploration. In this respect, what is immediately required will be some rationale behind the idea. While all the preceding discussions are relevant to this rationale, the following are intended to support the idea focally.

Now it is almost certain that subject coverage or specialization can hardly be defined consistently and objectively. At best we can say that two documents are similar with respect to something, based on the evidence that we recognize from the documents. Still, the totality of evidence would not make sure similarity; it gives us no more than a degree of belief.

In most cases, two documents similar with respect to something are indexed or abstracted individually. In this sense they are related to each other only indirectly, or with some uncertainty. Indexing inconsistency, mainly caused by individual varieties even in case of fairly adequate assignment of index terms, is now well known. This will significantly degrade retrieval as a grouping process of similar documents. Therefore, to use the direct evidence of similarity established between the two or more documents will be desirable.

We can quite reasonably say that the citations, by which I mean both the citing and the cited articles inclusively, are similar at a certain level of abstraction, especially in highly specialized fields of science. Therefore we can trace back and forth between the citations in order to find similar articles. This is the principle of citation indexing applied by Garfield [16]. However, the serious objection to citation indexing is that it demands too much risk, relying heavily on the mere fact that X cites Y. Tracing back and forth tends to diverge tremendously. The solution required for this technique would be to exclude noise sources and provide all the citations with subject indicators more powerful than titles. Lipetz [17] attempted to improve selectivity of citations by providing "context indicators" rather than "subject indicators." His approach seems laudable, but demands much intellectual effort. After all, the usefulness of direct evidence has not yet been warranted significantly by citation indexing.

In this respect, the far more elaborate method, bibliographic coupling, developed by Kessler [18] shares the same fate as citation indexing. It is noticed [19] that "citation tracing is pervasive information-seeking mode." What should be further noticed is that backward tracing is much more pervasive and that any intellectual tracing is initiated by discerning some meaningful evidence rather than the "mere fact."
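As a rough illustration of the techniques discussed above, the following Python sketch represents citations as a small table and shows backward tracing, forward tracing, and the count of shared references on which bibliographic coupling rests. The document labels, the citation table, and the function names are all hypothetical and serve this sketch only. Note that it uses nothing but the "mere fact" that X cites Y, which is exactly the limitation noted above.

```python
# Hypothetical citation table: each citing document -> the documents it cites.
cites = {
    "X": {"A", "B", "C"},
    "Y": {"B", "C", "D"},
    "Z": {"E"},
}


def trace_backward(doc, cites):
    """Backward tracing: the documents that `doc` cites."""
    return set(cites.get(doc, set()))


def trace_forward(doc, cites):
    """Forward tracing: the documents that cite `doc`."""
    return {citing for citing, refs in cites.items() if doc in refs}


def coupling_strength(doc1, doc2, cites):
    """Number of references shared by two citing documents."""
    return len(cites.get(doc1, set()) & cites.get(doc2, set()))


print(sorted(trace_backward("X", cites)))   # ['A', 'B', 'C']
print(sorted(trace_forward("B", cites)))    # ['X', 'Y']
print(coupling_strength("X", "Y", cites))   # 2
```

None of these functions knows why one document cites another; the extracts proposed in this chapter are meant to supply that missing evidence.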

On the other hand, it is questionable whether indexes and abstracts are the only means of retrieval as an extension of information-seeking facilities. Books, reviews, monographs, and journal articles: all these are likely to lead our information needs to other sources of information. Almost all scientific articles cite, describe, analyse, and group a number of other articles. Thus, the reader of the citing articles can, perhaps very easily, discriminate the cited articles as to their subjects, crucial points, logical relationships, and so on. By doing so, the reader is in effect retrieving relevant articles with the aid of expertise.

Vickery [20] emphasizes the importance of review articles and the like as an efficient, selective "means to discover what they must read amid the vast mass of available documents," pointing out that "the traditional means of discovery of the pertinent literature are inadequate." Nevertheless, the traditional means may better give access to more selective means. That is to say, the strategy of discovery may best be divided into two different means.

A similar strategy was considered by Goffman, et al [13] by introducing meta-linguistic terms to indexing. However, their approach appears passive in that it is simply intended to divide a file in order to economize searches. A more active approach is therefore desired for selective discovery in terms of quality rather than quantity.

On the other hand, Goffman, et al [13] regret that many abstracts written in "trivial" meta-language are much closer to object-linguistic "extract," and that many reviewers write abstracts instead of the state of the art. They seem to favor meta-linguistic abstracts more than object-linguistic "extracts." Ironically, one of the authors recently shows that extracts are better than abstracts in terms of calculated entropies as well as intellectual efforts [14]. The power of meta-language which they properly recognized suffers from inconclusiveness, waiting for further observation.

On the whole, most of the traditional means, such as subject indexes, abstracts, and extracts seem to go paralytic facing efficient file organization. Obsolescence of scientific literature [21] is now widely known. Brookes [22] was interested in obsolescence involved in a cumulative file. Unfortunately, his interest has not yet been worked out. Certainly accumulated in a large, cumulative file would be archival value, but at the cost of retrieval devaluation. Thus systematic file organization AND maintenance should be taken as most essential in view of information retrieval.

Recently, Blaxter and Blaxter [23] report an interesting observation on the needs and habits of scientific authors and readers in three research institutes. They show that the information needs of individual working scientists are met by a very small number of primary journals, and that the cited references appended to primary articles or review articles are used in most literature searches. More precisely:

  • Trace back from a paper : about 40% on average
  • Trace back from a review : more than 20% each.

If this were to be the general pattern of literature searches by working scientists, and if information retrieval is to meet ultimately the information needs of individual scientists, file organization should be considered in the light of the above observation.

6.2 Extracts as Indexing Sources


Figure 9 shows the first paragraph extracted from an article* (hereafter called the sample article) in a recent issue of Physical Review. The extract has eight references (Refs. 1-8) not merely cited and described, but also criticized and collated. With respect to the cited references, the extract is meta-linguistic and of a review kind. Similar extracts can be made from other parts of the sample article wherever each cites one or more references (Figure 13). By extract is meant hereafter an extract of this kind, as opposed to a common, object-linguistic extract.

* G. J. Kutcher, P. P. Szydlik, and A. E. S. Green. "Independent-particle-model study of electrons elastically scattering from oxygen." Physical Review A, vol. 10 no. 3 (September 1974) pp. 842-850.

From the extract in Figure 9, a subject index to Refs. 1-8 may be derived as illustrated in Figure 10. The complete subject index to Refs. 1-37 of the sample article is shown in Figure 11. The actual indexing is done on the work sheet as illustrated in Figure 12, while the indexer scanning the source document selects index terms. There is therefore no need to make extracts substantially.

6.3 Extracts as Review Sources


Figure 13 illustrates a provisional compilation for the sample article where:

  • ScD : Source of the sample article, i.e., location, title and author;
  • Abs : Abstract of the sample article;
  • Ext : Extracts (Exts a-aa);
  • Ref : References (Refs 1-37).

Similar compilations for a number of source documents may be serially accumulated into a file. Being combined with the subject index and the author index, this file may be used as personal or other means for information retrieval. Convenience of the file will remain a technical problem.

The use of the file in retrieval is much the same as that of reviews and text books which can lead the reader to various sources of information. As mentioned previously, extracts under consideration are in fact of a review kind. External and psychological contexts are involved in reading reviews. In extending to other sources of information, the reader can benefit from expertise provided by reviews of source documents. Certainly he would not make instantaneous, mechanistic YES-NO decisions based on simple criteria. To the contrary, his decisions will be carefully thought out.

Selection of one source document by using the subject index is relatively less important, since it is mainly intended to lead to retrieval of as many cited references as possible. Therefore usefulness of the file will depend on coherence of citations, i.e., coherence of cited references with each other as well as with the source document. And extracts should be kept as short as possible, so long as they do not significantly degrade the maximum coherence that is obtainable from the full text. Here, coherence may be defined as:

            number of citations retrieved as relevant
coherence = -----------------------------------------.
                  number of citations examined 
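The ratio above can be computed directly once a reader's judgements are recorded. The following is a minimal Python sketch; the function name and the example judgements (six of Refs 1-8 kept as relevant) are hypothetical and are not taken from the sample article.

```python
def coherence(examined, relevant):
    """Citations retrieved as relevant, divided by citations examined."""
    examined = set(examined)
    if not examined:
        raise ValueError("no citations were examined")
    retrieved_as_relevant = examined & set(relevant)
    return len(retrieved_as_relevant) / len(examined)


# Hypothetical judgement: the reader examines Refs 1-8 and keeps six of them.
examined_refs = range(1, 9)
relevant_refs = {1, 2, 3, 5, 6, 8}
print(coherence(examined_refs, relevant_refs))  # 0.75
```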

From the extract Ext a in Figure 13, the reader may notice that all the cited references (Refs 1-8) are about INDEPENDENT PARTICLE MODEL, which presumably represents the significant aspect in common. Much subject content behind this representation may be covered by the abstract of the source document. Thus, given the context by the abstract, the reader can to some extent do without the individual abstracts of the cited articles. Similarly, the reader can benefit from other contexts which are exchanged between the cited references. How much he can benefit from these external contexts will depend on his psychological context.

Extracts should be made primarily in one or more sentences. Description in sentences is one of the advantages of extracts over description in keywords. However, some extracts are non-sensical, mostly redundant, or require modification. It would be better in these cases either:

  • to abandon an extract (See Exts h, n) or
  • to reinforce an extract (See Ext d) or
  • to select only keywords or phrases (See Exts g, q).

In short, the length and the coherence of extracts should be balanced. Extraction of keywords or phrases, similar to subject indexing, may suffice in many cases.

Perhaps the simplest file organization would be to mark extracts directly on the source document and to derive the subject index from them. In a sophisticated environment, e.g., visual display and keyboard manipulation of constituent files, the following organization may be convenient.

  • Subject index - in alphabetic order.
  • Citation index - including unduplicated citations.
  • Extract file - including abstracts and extracts.

Figure 14 illustrates an entry to the subject index, and Figure 15 illustrates ways of access from the subject index to the citation index and to the extract file.
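One way to picture the three constituent files is the Python sketch below. It is only an assumption about how such a file might be laid out, not the layout of Figures 13-15. The class and field names are hypothetical, and placeholder strings stand in for the actual abstract, extract text, and references of the sample article.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Extract:
    label: str   # extract label, e.g. "a"
    text: str    # the extracted sentence(s)
    refs: list   # numbers of the references cited in the extract


@dataclass
class Compilation:
    source: str                                     # ScD
    abstract: str                                   # Abs
    extracts: list = field(default_factory=list)    # Ext
    references: dict = field(default_factory=dict)  # Ref: number -> citation


def build_subject_index(files, terms):
    """Map each index term to (document id, extract label, reference numbers).

    `terms` holds the indexer's choices: (document id, extract label) -> terms.
    """
    index = defaultdict(list)
    for doc_id, doc in files.items():
        for ext in doc.extracts:
            for term in terms.get((doc_id, ext.label), []):
                index[term].append((doc_id, ext.label, tuple(ext.refs)))
    return dict(sorted(index.items()))  # alphabetic order


# Placeholder strings stand in for the real abstract, extract, and references.
sample = Compilation(
    source="Physical Review A, vol. 10 no. 3 (1974) pp. 842-850",
    abstract="(abstract of the sample article)",
    extracts=[Extract("a", "(text of Ext a)", list(range(1, 9)))],
    references={n: f"(Ref {n})" for n in range(1, 38)},
)
files = {"sample": sample}
terms = {("sample", "a"): ["INDEPENDENT PARTICLE MODEL"]}

subject_index = build_subject_index(files, terms)
for term, entries in subject_index.items():
    for doc_id, label, refs in entries:
        cited = [files[doc_id].references[n] for n in refs]
        print(term, "-> Ext", label, "->", len(cited), "references")
```

Access runs from a subject term to an extract and from there to the cited references, so the reader meets each reference in the context of the extract that cites it.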

7. CONCLUSION


I think, as many others may do,[26] that in his World Encyclopedia,[27] H. G. Wells[28] proposed in effect an ideal of file organization for information retrieval.[29] Refer again to the prefatory statement made by him. The crucial point here is to select and collate carefully, and to present critically. So far this study has attempted to move toward his ideal.[30]

Say, "World Encyclopedia." This somewhat tricky wording seems to invite some misunderstanding. Clearly, it is to put away miscellany and synthesize the essence only rather than to bring all together. In general, words freed from their proper contexts, whether literary or external or psychological, are mischievous and easily bring in misinterpretations. Incidentally, Wells himself experienced such a mischief done by a professional journalist. Hayakawa [24] says that:[31]

"...the ignoring of contexts in any act of interpretation is at best a stupid practice. At its worst, it can be a vicious practice."[32]

By saying "ignoring," however, he would not ignore the possibility of dispensing with part of the whole context. Given the environment, or given the wider context, part of the context is determinative in interpretation.

REFERENCES


AFTERMATH


1. INTRODUCTION

Look back in anger man-machine communication
C. L. Borgman, Univ. of California, Los Angeles
N. J. Belkin, Rutgers Univ., New Brunswick, NJ
W. B. Croft, Univ. of Massachusetts, Amherst
M. E. Lesk, Bell Communications Research
T. K. Landauer, Bell Communications Research
"Retrieval systems for the information seeker: can the role of the intermediary be automated?" In: Proceedings of the SIGCHI conference on Human factors in computing systems, Washington, D.C., United States, 1988, p.51-53. (Abstract)

2. THE LINE OF ATTACK

prediction and discrimination
Kenji Yamanishi (NEC Research Institute, Inc., 4 Independence Way, Princeton, NJ) "Randomized approximate aggregating strategies and their applications to prediction and discrimination." (Annual Workshop on Computational Learning Theory) Proceedings of the eighth annual conference on Computational learning theory, Santa Cruz, California, United States. p.83-90, 1995. ISBN:0-89791-723-5 [10]
Schematic view of IR events
The Atom of Work, as mentioned in Using the Methods of Fernando Flores by Jack Reilly [11]
  1. Preparation (Customer) cf. Interaction
  2. Negotiation (Provider) cf. Inference
  3. Performance (Provider) cf. Substitution
  4. Assessment (Customer) cf. Notification

3. SYSTEMS VS. USERS

``Terry Winograd left the Artificial Intelligence field after trying to expand the SHRDLU work in natural language understanding. In the early 1980s, he was a founding member and national president of Computer Professionals for Social Responsibility. In the early 1990s, he worked on an early form of groupware with Fernando Flores. Their approach was based on conversation-for-action analysis. The work led to a new design perspective based on phenomenology. Starting in 1995, he served as adviser to Stanford PhD student Larry Page. In 1998, Page took a leave of absence from Stanford to co-found Google. In 2002, Winograd took a sabbatical from teaching and spent some time at Google as a visiting researcher. Today, he continues to do research at Stanford in human-computer interaction.``
``Fernando Flores spent three years as a political prisoner of General Augusto Pinochet (from September 11, 1973 to 1976). Released after negotiations by Amnesty International, he started to work as a researcher in the Computer Science department at Stanford University, where he studied for a PhD under the guidance of Hubert Dreyfus, Stuart Dreyfus, John Searle and Ann Markussen. There he developed his work on philosophy, coaching, and workflow technology, influenced by Heidegger, Maturana, John Austin and others. He obtained a PhD in Philosophy from the University of California, Berkeley. His thesis was titled Management and Communication in the Office of the Future. He created several companies.``
``In conversation for action, there are always two players—a customer and a provider. These are shown on the left and right side of the following figure, which is known as the atom of work. To begin the atom of work cycle, in the upper left quadrant the provider makes an offer or the customer makes a request. Requests always include a "for sake of" statement. A "for sake of" statement explains something about why this request is being made; it provides context.``
``The generally accepted purpose of Customer Relationship Management (CRM) is to enable organizations to better serve their customers through the introduction of reliable processes and procedures for interacting with those customers.``
``The term CRM is used to describe either the software or the whole business strategy (or lack of one) oriented on customer needs. The second one is the description which is correct. The main misconception of CRM is that it is only software, instead of whole business strategy.``
``In economics, disintermediation is the removal of intermediaries in a supply chain: "cutting out the middleman". Instead of going through traditional distribution channels, which had some type of intermediate (such as a distributor, wholesaler, broker, or agent), companies may now deal with every customer directly, for example via the Internet. One important factor is a drop in the cost of servicing customers directly.``
"Reintermediation can be defined as the reintroduction of an intermediary between end users (consumers) and a producer. This term applies especially to instances in which disintermediation has occurred first."

3.1 Discrimination

Relevance as satisfaction of information need
Park (1975) "...subjective...the real judge is the user."
van Rijsbergen (1975) "...Relevance is a subjective notion." [12]
Saracevic (1997) "...Whose relevance? Users!" [13]

3.2 Prediction

  • necessary condition for relevance
    • Library Services in Theory and Context/retrieval [14]
    • The Principle of Relevance [15]
    • Relevance [16]
    • Reference theory [17]
  • objectivity v. subjectivity

Bruza, P.D., Song, D., Wong, K.F. (2000) Aboutness from a Commonsense Perspective. Journal of the American Society for Information Science and Technology (JASIST), 51(12), 1090-1105.

Maron (1977) tackled aboutness by relating it to a probability of satisfaction. Three types of aboutness were characterized: S-about, O-about and R-about. S-about (i.e. subjective about) is a relationship between a document and the resulting inner experience of the user. O-about (i.e. objective about) is a relationship between a document and a set of index terms. More specifically, a document D is about a term set T if user X employs T to search for D. R-about purports to be a generalization of O-about to a specific user community (i.e., a class of users). Let I be an index term and D be a document; then the degree to which D is R-about I is the ratio between the number of users satisfied with D when using I and the number of users satisfied by D. Using this as a point of departure, Maron further constructs a probabilistic model of R-aboutness. The advantage of this is that it leads to an operational definition of aboutness which can then be tested experimentally. However, once the step has been made into the probabilistic framework, it becomes difficult to study properties of aboutness, e.g. how does R-about behave under conjunction? The underlying problem relates to the fact that probabilistic independence lacks properties with respect to conjunction and disjunction. In other words, one's hands are largely tied when trying to express qualitative properties of aboutness within a probabilistic setting. (For this reason Dubois et al. (1997) developed a qualitative framework for relevance using possibility theory).
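The R-about ratio as worded above can be sketched in a few lines of Python. The record layout (user, document, term used, satisfied) and the sample observations are assumptions made for illustration; they are not drawn from Maron (1977) or from Bruza et al. (2000).

```python
from fractions import Fraction


def r_about(document, term, observations):
    """Users satisfied with `document` when using `term`,
    divided by all users satisfied with `document`."""
    satisfied = {
        user for user, doc, used, ok in observations
        if doc == document and ok
    }
    satisfied_using_term = {
        user for user, doc, used, ok in observations
        if doc == document and ok and used == term
    }
    if not satisfied:
        return Fraction(0)
    return Fraction(len(satisfied_using_term), len(satisfied))


# Hypothetical observations: (user, document, index term used, satisfied?)
observations = [
    ("u1", "D", "I", True),
    ("u2", "D", "I", True),
    ("u3", "D", "J", True),
    ("u4", "D", "I", False),
]
print(r_about("D", "I", observations))  # 2/3
```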

Maron, M.E. (1977). On Indexing, Retrieval and the Meaning of About. Journal of the American Society for Information Science, 28 (1): 38-43.

Dubois, D., Farinas del Cerro, L., Herzig, A., & Prade, H. (1997). Qualitative Relevance and Independence: A Roadmap. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), pp. 62-67, 1997.

4. DOCUMENTS VS. SURROGATES


model abstract

  • Exemplary documents: a foundation for information retrieval design [18]

5. THE THEORY OF INTERPRETATION


5.1 Denotation and Connotation

  • Fairthorne (1969) Extensional and intensional aboutness
    • Hutchins, W.J. (1977). On the problem of 'aboutness' in document analysis. Journal of Informatics, 1(1):17-35, 1977.
    • Lisa C. Stark (1999) [19]
    • Stuart Hawthorne et al (2002) [20]
    • Alberto Cheti (2004) [21]
  • Park (1975) Explicit and implicit meaning
  • Bohm (1983) Implicate and Explicate Order
  • Searle (1983) Aboutness in Intentionality
  • Dumais (1988) Implicit structure/Latent semantic analysis [22] [23]
  • Cooper, W.S. (1971). A Definition of relevance for Information Retrieval. Information Storage and Retrieval, 7, pp. 19-37, 1971.
  • Maron, M.E. (1977). On Indexing, Retrieval and the Meaning of About. Journal of the American Society for Information Science, 28 (1): 38-43.
  • Swanson, D.R. (1986): ‘Fish Oil, Raynaud’s Syndrome, and Undiscovered Public Knowledge’. Perspectives in Biology and Medicine, 30(1), 7-18.
  • The dictionary and grammar as the "social rule" work out explications, while implications rule out such rule-based reasoning. Refer to the non-sensical, ungrammatical word string implying a story!
  • John R. Searle, "Minds, Brains, and Programs," from The Behavioral and Brain Sciences, vol. 3. Copyright 1980 Cambridge University Press.
       I will consider the work of Roger Schank and his colleagues at Yale (Schank and Abelson 1977), because I am more familiar with it than I am with any other similar claims, and because it provides a very clear example of the sort of work I wish to examine. But nothing that follows depends upon the details of Schank’s programs. The same arguments would apply to Winograd’s SHRDLU (Winograd 1973), Weizenbaum’s ELIZA (Weizenbaum 1965), and indeed any Turing machine simulation of human mental phenomena. [See "Further Reading" for Searle’s references.]
       Very briefly, and leaving out the various details, one can describe Schank’s program as follows: The aim of the program is to simulate the human ability to understand stories. It is characteristic of human beings’ story-understanding capacity that they can answer questions about the story even though the information that they give was never explicitly stated in the story. Thus, for example, suppose you are given the following story: "A man went into a restaurant and ordered a hamburger. When the hamburger arrived it was burned to a crisp, and the man stormed out of the restaurant angrily, without paying for the hamburger or leaving a tip." Now, if you are asked "Did the man eat the hamburger?" you will presumably answer, "No, he did not." [...]
  • Strong AI experts speak well of Roger Schank. He said, "A theory like Chomsky's doesn't help me solve my problem; knowing the universal constraints on grammars of all languages isn't going to help me devise a program that can understand stories in English. Therefore Chomsky was wrong about language." [my boldtype]

5.2 The Theory of Ogden and Richards

David Bohm has frequently referred to meaning, particularly when talking about his recent experiments with dialogue groups in which "a free flow of meaning" is encouraged. This whole question of meaning, and of what we mean by it, is clearly of importance, in particular the question "What do you mean by language?"

C.K. Ogden and I. A. Richards's classic The Meaning of Meaning8 provides a useful introduction to such questions. Following Ogden and Richards the work of Ludwig Wittgenstein had made a particularly significant contribution to the notion of meaning in linguistics.9 According to his dictum: Don't look for the meaning, look for the use. Essentially this can be interpreted as saying that meaning is a generalization that doesn't correspond to anything that is actually available in language behavior. What we actually rely upon are individual uses which are themselves interrelated according to a pattern of family resemblances. In this sense words could no more be said to "possess" an intrinsic meaning that is independent of their use than, in Bohr's view, could an electron be said to "possess" an intrinsic position or spin.

5.3 Implications for Information Retrieval

  • Hubert Dreyfus
    • Cognitivism (psychology)
      "... Phenomenologist and hermeneutic philosophers have criticised the positivist approach of cognitivism for reducing individual meaning to what they perceive as measurements stripped of all significance. They argue that by representing experiences and mental functions as measurements, cognitivism is ignoring the context (cf contextualism) and, therefore, the meaning of these measurements. They believe that it is this personal meaning of experience gained from the phenomenon as it is experienced by a person (what Heidegger called being in the world) which is the fundamental aspect of our psychology that needs to be understood: therefore they argue that a context free psychology is a contradiction in terms. They also argue in favour of holism: that positivist methods cannot be meaningfully used on something which is inherently irreducible to component parts. Hubert Dreyfus has been the most notable critic of cognitivism from this point of view. Humanistic psychology draws heavily on this philosophy, and practitioners have been among the most critical of cognitivism."
      "...The idea that mental functions can be described as information processing models has been criticised by philosopher John Searle and mathematician Roger Penrose who both argue that computation has some inherent shortcomings which cannot capture the fundamentals of mental processes."
    • A Critical Review of What Computers Still Can't Do, by Hubert Dreyfus; Ron Barnette
      "... To reinforce earlier GOFAI criticisms, he describes insuperable difficulties (outlined in 1972) confronting successful modelling of common-sense understanding, which requires a notion of relevance, contextually and holistically characterized, resulting from worldly, bodily experiences, not compatible with atomistic, symbolic data structures and discrete computations. He then develops the 1979-edition criticism which cites a further insuperable GOFAI problem: that of modelling the know-how requisite to judge relevance. 'Know-how,' or that activity of generalizing and determining relevance in an open-textured world constantly confronted, is argued to be not a matter of manipulating data, no matter how much is provided. Moreover, with serial processing strictures, symbolic computational attempts appear to be biologically unrealizable as well, and demand, in principle, procedures that require a combinatorial explosion of information that cannot be resolved by computational means alone."
      "... Basically, CKP is defined by three problems: (1) How everyday knowledge must be organized so that one can make inferences from it; (2) How skills or knowledge can be represented as knowing-that; and (3) How relevant knowledge can be brought to bear in particular situations (xviii). In fact, one might treat all three problems as ones involving selection of relevance. For example, learning to generalize is critical for intelligent behavior, but this requires associating inputs of the same type with successful decisions and actions. But in what does the relevant type consist? Relevance in this regard and in the context of ignoring and attending to features of novel settings as we confront them are not inherent in the context data, Dreyfus argues, but are, instead, relative to current situations in light of a myriad of human background experiences. What is relevant in one setting might not be so in another. Thus, whether by means of symbolic tokens (GOFAI), or having been learned through adjusted network connections (PDP), relevance is not to be gleaned by means of providing more information for the system to work with and through. Information about what is relevant only leads to circularity or a vicious regress of what is relevant to relevance for relevance for..... Admittedly, in narrowly-defined problem domains a machine might seem to pull off generalization skills, but this would be to mimic intelligence, at best, artificially enforced, as it were. Solving the CKP is a gauntlet Dreyfus lays down. Can the PDP paradigm solve it?"

6. PROPOSAL FOR FILE ORGANIZATION

6.1 Incentives

Reviews of This Book by Douglas Hofstadter

When Martin Gardner retired from writing his Mathematical Games column for Scientific American magazine, Hofstadter succeeded him with a column entitled Metamagical Themas (an anagram of "Mathematical Games").

Hofstadter invented the concept of Reviews of This Book, a book containing nothing but cross-referenced reviews of itself. He introduces the idea in Metamagical Themas:

"[it] is just a fantasy of mine. I would love to see a book consisting of nothing but a collection of reviews of it that appeared (after its publication, of course) in major newspapers and magazines. It sounds paradoxical, but it could be arranged with a lot of planning and hard work. First, a group of major journals would all have to agree to run reviews of the book by the various contributors to the book. Then all the reviewers would begin writing. But they would have to mail off their various drafts to all the other reviewers very regularly so that all the reviews could evolve together, and thus eventually reach a stable state of a kind known in physics as a "Hartree-Fock self-consistent solution". Then the book could be published, after which its reviews would come out in their respective journals, as per arrangement."

Hofstadter's Law: "It always takes longer than you expect, even when you take into account Hofstadter's Law".

Two experts, to explicate Meaning,
Wrote a book called The Meaning of Meaning.
But the world was perplexed!
So three experts wrote next
The Meaning of Meaning of Meaning.
- Douglas Hofstadter
Two experts, to explicate Meaning,
Penned a text called "The Meaning of Meaning",
But the world was perplexed,
So three experts penned next
"The Meaning of Meaning of Meaning".

Automatic Review Article Generation

Title
Classification of Research Papers using Citation Links and Citation Types: Towards Automatic Review Article Generation.
Publication
In Proc. of the 11th SIG Classification Research Workshop, Classification for User Support and Learning, pages 117-134, 2000.
Authors
Hidetsugu Nanba, School of Information Science, Japan Advanced Institute of Science and Technology
Noriko Kando, National Institute of Informatics
Manabu Okumura, Precision and Intelligence Laboratory, Tokyo Institute of Technology
Abstract
We are investigating automatic generation of a review (or survey) article in a specific subject domain. In a research paper, there are passages where the author describes the essence of a cited paper and the differences between the current paper and the cited paper (we call them citing areas). These passages can be considered as a kind of summary of the cited paper from the current author's viewpoint. We can know the state of the art in a specific subject domain from the collection of citing areas. Further, if these citing areas are properly classified and organized, they can act as a kind of a review article. In our previous research, we proposed the automatic extraction of citing areas. Then, with the information in the citing areas, we automatically identified the types of citation relationships that indicate the reasons for citation (we call them citation types). Citation types offer a useful clue for organizing citing areas. In addition, to support writing a review article, it is necessary to take account of the contents of the papers together with the citation links and citation types. In this paper, we propose several methods [...] BCCT-C, the bibliographic coupling considering only type C citations, which pointed out the problems or gaps in related works, are more effective than others. We also implemented a prototype system to support writing a review article, which is based on our proposed method.
Acknowledgement
The authors would like to express our gratitude to Dr. Dagobert Soergel of University of Maryland and anonymous reviewers [my boldtype] for their suggestions to improve our paper.
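
The abstract's central idea, bibliographic coupling restricted to one citation type, can be illustrated with a minimal sketch. It is not the authors' implementation: the toy citation triples, the function name coupling_strengths, and the type labels other than "C" are assumptions made only for illustration. Two papers are taken to be coupled when they cite the same reference, and the restricted variant counts a shared reference only when the citations are of type C.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (citing paper, cited paper, citation type).
# "C" stands for citations that point out problems or gaps in the cited work;
# "B" and "O" are placeholder labels for any other citation types.
CITATIONS = [
    ("p1", "r1", "C"),
    ("p1", "r2", "B"),
    ("p2", "r1", "C"),
    ("p2", "r3", "C"),
    ("p3", "r3", "C"),
    ("p3", "r2", "O"),
]


def coupling_strengths(citations, allowed_types=None):
    """Count shared references for every pair of citing papers.

    If allowed_types is given, only citations of those types are counted.
    Pairs with no shared reference are omitted from the result.
    """
    refs = defaultdict(set)
    for citing, cited, ctype in citations:
        if allowed_types is None or ctype in allowed_types:
            refs[citing].add(cited)

    strengths = {}
    for a, b in combinations(sorted(refs), 2):
        shared = len(refs[a] & refs[b])
        if shared:
            strengths[(a, b)] = shared
    return strengths


# Plain bibliographic coupling versus coupling over type C citations only.
plain = coupling_strengths(CITATIONS)
type_c_only = coupling_strengths(CITATIONS, allowed_types={"C"})
```

In this toy data the pair (p1, p3) is coupled only through r2, which neither paper cites as type C, so it drops out of the restricted result. That is the sense in which the restricted measure groups papers by the problems they address rather than by any shared reference.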

CiteSeer, in the past known as ResearchIndex, is a public specialty search engine and digital library that was created by researchers Dr. Steve Lawrence, Kurt Bollacker and Dr. Lee Giles while they were at the NEC Research Institute (now NEC Labs), Princeton, NJ, USA. CiteSeer crawls for and harvests academic scientific documents and uses autonomous citation indexing to permit querying by citation or by document. [...]

CiteSeer's goal is to improve the dissemination and access of academic and scientific literature. As a non-profit service that can be freely used by anyone, it is an example of the democratization of scientific knowledge and the Open Access movement that is revolutionizing academic and scientific publishing and scientific literature access. [...]

Google began as a research project in early 1996 by Larry Page and Sergey Brin, two Ph.D. students at Stanford who developed the hypothesis that a search engine based on analysis of the relationships between Web sites would produce better results than the basic techniques then in use. It was originally nicknamed BackRub because the system checked backlinks to estimate a site's importance. (A small search engine called RankDex was already exploring a similar strategy.)

Convinced that the pages with the most links to them from other highly relevant Web pages must be the most relevant ones, Page and Brin decided to test their thesis as part of their studies, and laid the foundation for their search engine. [...]
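
The hypothesis quoted above, that a page is important when important pages link to it, can be sketched as a simple iterative computation. This is an illustrative assumption, not Google's actual system: the function name link_rank, the damping value 0.85, the iteration count, and the four-page toy graph are all hypothetical.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Estimate page importance from backlinks by repeated redistribution.

    links maps each page to the list of pages it links to.
    Each round, a page passes a damped share of its score to its targets.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy web: a, b and d all link to c; c links back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = link_rank(LINKS)
```

Here c collects the most backlinks and ends up with the highest score, while a outranks b and d because its single backlink comes from the highly ranked c. Counting links alone would not separate a from a page with one backlink from an obscure source; weighting each link by the score of its origin does.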

cf. Terry Winograd

6.2 Extracts as Indexing Sources

6.3 Extracts as Review Sources

7. CONCLUSION

References

  • Alessandro Duranti and Charles Goodwin, eds. (1992) Rethinking Context: Language as an Interactive Phenomenon (Series No. 11: Studies in the Social and Cultural Foundations of Language). See: Cambridge UP Catalogue

See also

Key Words and Names

University College London

The idea was worked out in more detailed form by Cerf's networking research group at Stanford in the 1973–74 period, resulting in the first TCP specification (Request for Comments 675) (The early networking work at Xerox PARC, which produced the PARC Universal Packet protocol suite, much of which was contemporaneous, was also a significant technical influence; people moved between the two).

DARPA then contracted with BBN Technologies, Stanford University, and the University College London to develop operational versions of the protocol on different hardware platforms. Four versions were developed: TCP v1, TCP v2, a split into TCP v3 and IP v3 in the spring of 1978, and then stability with TCP/IP v4 — the standard protocol still in use on the Internet today.

In 1975, a two-network TCP/IP communications test was performed between Stanford and University College London (UCL). In November, 1977, a three-network TCP/IP test was conducted between the U.S., UK, and Norway. Between 1978 and 1983, several other TCP/IP prototypes were developed at multiple research centres. A full switchover to TCP/IP on the ARPANET took place January 1, 1983.

— "TCP/IP," Wikipedia.

Internet

  • "Worldwide network" in Wellsian terms

The Advanced Research Projects Agency (ARPA) was renamed the Defense Advanced Research Projects Agency (DARPA) in 1972.

A fundamental pioneer in the call for a global network, J.C.R. Licklider, articulated the ideas in his January 1960 paper, Man-Computer Symbiosis.

"A network of such [computers], connected to one another by wide-band communication lines" which provided "the functions of present-day libraries together with anticipated advances in information storage and retrieval and [other] symbiotic functions." —J.C.R. Licklider.

World Wide Web

The World Wide Web has evolved into a universe of information at our finger tips. But this was not an idea born with the Internet. This lecture recounts earlier attempts to disseminate information that influenced the Web - such as the French Encyclopédists in the 18th century, H. G. Wells' World Brain in the 1930s, and Vannevar Bush's Memex in the 1940s.

— Editorial comment

There are quite a number of published histories of the Internet and the World Wide Web. Typically these histories portray the Internet as a revolutionary development of the late 20th century—perhaps with distant roots that date back to the early 1960s. In one sense this is an apt characterization. The Internet is absolutely a creation of the computer age.

But we should not think of the Internet just as a revolutionary development. It is also an evolutionary development in the dissemination of information. In that sense the Internet is simply the latest chapter of a history that can be traced back to the Library of Alexandria or the printing press of William Caxton.

In this lecture I will not be going back to the Library of Alexandria or the printing press of William Caxton. Instead I will focus on the contributions of three individuals who envisioned something very like the Internet and the World Wide Web, long before the Internet became a technical possibility.

These three individuals each set an agenda. They put forward a vision of what the dissemination of information might become, when the world had developed the technology and was willing to pay for it. Since the World Wide Web became established in 1991 thousands of inventors and entrepreneurs have changed the way in which many of us conduct our daily lives. Today, most of the colonists of the Web are unaware of their debt to the past. I think Sir Isaac Newton put it best: “If [they] have seen further, it is by standing on the shoulders of giants.” This lecture is about three of those giants: H.G. Wells, Vannevar Bush, and J.C.R. Licklider.

— Introduction

Around 1937, Wells perceived that the world was drifting into war. He believed this was because of the sheer ignorance of ordinary people, that allowed them to be duped into voting for fascist governments. He believed that the World Brain could be a force in conquering this ignorance and he set about trying to raise the half-a-million pounds a year that he estimated would be needed to run the project. He lectured and wrote articles which were later published as a book called the World Brain (1938). He made an American lecture tour, hoping it would raise interest in his grand project. One lecture, in New York, was broadcast and relayed across the nation. He dined with President Roosevelt, and if Wells raised the issue of the World Brain with him — which seems more than likely — it did not have the effect of loosening American purse-strings. Sadly, Wells never succeeded in establishing his program before World War II broke out, and then of course such a cultural project would have been unthinkable in the exigencies of war.

— H. G. Wells and the World Brain

The rapid growth of the Internet in the 1990s was primarily due to the World Wide Web. The Web Browser made using the Internet easy for ordinary people, and also worth doing and worth investing in. The World Wide Web was invented by Sir Tim Berners-Lee working in the CERN European particle physics laboratory in Geneva, in 1991. As Berners-Lee put it himself, the World Wide Web was “the marriage of hypertext and the Internet.” The ideas were in the air. He just put the pieces together. And in so doing, he set in train a chain of events that have changed the world.

— Conclusion

Neither Berners-Lee nor Campbell-Kelly tells the whole truth, whether wittingly or unwittingly. The development of personal computers should be taken into account as well as that of the Internet. Personal computing was the first and foremost of Vannevar Bush's wishes, and hypertext necessarily follows from it. His dream of the Memex was coming true when the first personal hypertext program, Guide, was developed in 1982 by Peter Brown at the University of Kent, a place that recalls Wells's birthplace. Berners-Lee says that he developed a similar program, ENQUIRE, as early as 1980, as the precursor to the Web. It should be stressed above all that hypertext, mass storage, personal computing, and networking were all advancing hand in hand in the late 1970s, aiming to replace the huge library and mainframe time-sharing, and that Wells, Bush, and Licklider were all longing for libraries and encyclopedias "at our finger tips" by virtue of technology. They simply took seriously the necessity that is the mother of invention, and acted on it to some effect. Who was most effective? Bush? Was he aware of the Wellsian idea of a World Encyclopaedia? As a national science administrator and science advisor to the U.S. President, he was probably well aware of it and affected by it. Otherwise he must have been shamefully ignorant. To credit him with all the creativity is to make him a hero at the risk of implying either shameless plagiarism or shameful ignorance.
  • On "information retrieval initiative"

The WorldWideWeb (W3) is a wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents.

— The first web page, CERN 1991 [24]
This first passage of the first web page made it explicit that the World Wide Web aims to contribute to "information retrieval" (IR), which strangely was seldom mentioned afterwards, as if such a motivation had been mistaken. IR in turn contributes to information science rather than computer science. It is the core, if not almost everything, of information science. The fame of IR and information science may imply the shame of AI and computer science, which would then wish the web, and hypertext as its basis, to be an absolutely new computing agenda with little bearing on any other tradition. Tim Berners-Lee regards the web as "the marriage of hypertext and the Internet." If the web contributes to IR, it must be because of hypertext rather than the Internet. Vannevar Bush was centrally concerned with IR before the term was in use, whereas the other two "founding fathers" of hypertext, Douglas Engelbart and Ted Nelson, were not, which contributed to the fallacy that the two agendas were independent. In contrast, Charles Goldfarb, known as the father of SGML, said that it had aimed at IR, and regretted that the antagonism between the traditional and hypertextual IR paradigms looked like a religious war. Soon after the emergence of the web, it was said to be "better than sex," suggesting psychological warfare!

Hypertext

The idea of hypertext is neither a great invention nor even an invention. To say it is great is to say that information is great or that computing is great. Neither is great. But its use may be great, or evil, or merely better than nothing. How to judge its pornographic uses, for example, is up to you. Put otherwise, an atom is barely great while an atomic bomb is surely great.

Speech is strictly linear; text is less so, but still linear. Yet you need not read linearly from opening to ending. Reading is nonlinear from time to time, if not in general, and especially so for reference works. It is simply a programmer's job to help readers read nonlinearly, as they usually need to. A programmer may be lucky enough to be the first to get the job done. Congratulate her, then, on the luck of being the first programmer rather than on being an inventor, a title she has not thereby earned.

Notes

  1. ^ A DIRECT APPROACH TO INFORMATION RETRIEVAL
  2. ^ "Master's Thesis"
  3. ^ "University of London"
  4. ^ "1975"
  5. ^ "Kyung-Youn Park"
  6. ^ "B. C. Brookes"
  7. ^ "University College London"
  8. ^ "[R]elevance judgments are so complicated depending on not only subject matter but also many other things. [...] the real judge is the user."

    Relevance is the basic underlying notion of IR by choice. [...] IR, as formulated, is based on [...] relevance. Whose relevance? Users! In reality, wanting or not, users and IR are not 'separable.' This brings us to the necessity for a broader context for IR provided by information science. Computer science provides the infrastructure. Information science the context. [emphasis original]

    — Saracevic (1997). "Users lost: Reflections on the past, future, and limits of information science." SIGIR Forum, 31 (2) 16-27. [1]
  9. ^ "We are free to think of anything."

    Some thinkers attack the problem of free will by distinguishing different notions of freedom or meaning of the word 'free'. In one sense we are free -- free enough for concepts of morality and responsibility to come into play. In another sense we are not free, and all that happens now is determined by what has happened earlier. According to this 'soft determinism', as William James called it, determinism is supposed to express a true doctrine in one sense of the words, and a false doctrine in another. Plenty of philosophers have argued that the problem about free will arises from what Hobbes called the 'inconstancy' of language. The same word, they say, is inconstant -- it can have several meanings. Even philosophers who argue for a simple determinism have to show that in their arguments the word 'free' is used with a constant sense, leading up to the conclusion that we are not free.

    — Ian Hacking (1975) Why Does Language Matter to Philosophy? (pp. 4-5)
  10. ^ "We are free ..."

    Berlin did not assert that determinism was untrue, but rather that to accept it required a radical transformation of the language and concepts we use to think about human life -- especially a rejection of the idea of individual moral responsibility. To praise or blame individuals, to hold them responsible, is to assume that they have some control over their actions, and could have chosen differently. If individuals are wholly determined by unalterable forces, it makes no more sense to praise or blame them for their actions than it would to blame someone for being ill, or praise someone for obeying the laws of gravity. Indeed, Berlin suggested that acceptance of determinism -- that is, the complete abandonment of the concept of human free will -- would lead to the collapse of all meaningful rational activity as we know it.

    — "Isaiah Berlin," on: "Free Will and Determinism," in: Stanford Encyclopedia of Philosophy
  11. ^ "The trees in the garden must be there ..."

    Local realism is the combination of the principle of locality with the "realistic" assumption that all objects must objectively have pre-existing values for any possible measurement before these measurements are made. Einstein liked to say that the moon is "out there" even when no one is observing it.

  12. ^ "Thus we can freely organize or map our thoughts upon this stable background."
    In the previous excerpt on local realism, rephrase "value" as "form," "measurement" as "thought," and "made" as "mapped." Note that local realism underlies Einstein's determinism and hidden variable theory, as opposed to the quantum indeterminism and Copenhagen interpretation embraced by Niels Bohr and mainstream quantum physicists. Bohmian quantum mechanics attempts to preserve determinism by virtue of nonlocality, that is, at the cost of locality, although Bell's inequality complicates the view. Ted Honderich at UCL argues against quantum indeterminism as too detached to be relevant to our life. A synoptic version or vision of positivism was strongly desired to escape from the reductionistic logical positivism as well as skepticism.
  13. ^ Stephen E. Robertson (1975 at UCL) "Explicit and implicit variables in information retrieval systems," Journal of the American Society for Information Science, 26(4): 214-22.
  14. ^ Mary Douglas (1975 at UCL) Implicit Meanings: Essays in Anthropology
  15. ^ Paul Grice (1975 at UC Berkeley) Implicature
  16. ^ John Searle (1975 at UC Berkeley) Indirect speech act
  17. ^ David Bohm (1980 at Birkbeck College) Implicate and Explicate Order
  18. ^ Susan Dumais, et al (1988 at Bellcore) Latent semantic analysis [2] [3]
  19. ^ 5.2 The Theory of Ogden and Richards
  20. ^ When we communicate our thoughts about things, we use signs. That is to say, three factors - a thought, a thing, and a sign - are essentially involved in any communication event, either speaking or listening.
  21. ^ Walker Percy (1975) "The Delta Factor" (in) The Message in the Bottle
  22. ^ Peter P. Chen (1976) "The Entity-Relationship Model: Toward a Unified View of Data," ACM Transactions on Database Systems, 1(1): 9-36.
  23. ^ Ogden and Richards (1923) also called their theory the "contextual theory of reference" or "causal theory of reference" from which the current use differs.
  24. ^ Contextual theory of reference:

    McGinn's aim is two-fold: to undermine both descriptive and causal theories of reference, and to argue for his preferred, 'contextual' theory of reference. McGinn is moved to this position by emphasizing indexicals—which he takes to be the primary referential devices—rather than proper names. Linguistic reference, for McGinn, is a conventional activity governed by rules that prescribe the spatio-temporal conditions of correct use; the semantic referent of a speaker's term is given by combining its linguistic meaning with the spatio-temporal context in which the speaker is located. McGinn concludes his defence of this theory by demonstrating the plausibility of its implications for such topics as abstract objects, self-reference, attribution, the language of thought hypothesis, truth, and the reducibility of reference.

    — (Abstract) Colin McGinn (2002) "The Mechanism of Reference" (in) Knowledge and Reality, pp. 197-223.
  25. ^ Context in context

    Context is a term that has come into more and more frequent use in the last thirty or forty years in a number of disciplines--among them, anthropology, archaeology, art history, geography, intellectual history, law, linguistics, literary criticism, philosophy, politics, psychology, sociology, and theology. A trawl through the on-line catalogue of the Cambridge University Library in 1999 produced references to 1,453 books published since 1978 with the word context in the title (and 377 more with contexts in the plural). There have been good reasons for this development. The attempt to place ideas, utterances, texts, and other artifacts "in context" has led to many insights.

    — Peter Burke (2002) "Context in Context." Common Knowledge, 8(1): 152-177.
  26. ^ 7. CONCLUSION
  27. ^ World Encyclopedia
  28. ^ H. G. Wells
  29. ^ "I think, as many others may do, that in his World Encyclopedia, H. G. Wells proposed in effect an ideal of file organization for information retrieval."
  30. ^ "The crucial point here is to select and collate carefully, and to present critically. So far this study has attempted to move toward his ideal."

    Today the digital library community spends some effort on scanning, compression, and OCR; tomorrow it will have to focus almost exclusively on selection, searching, and quality assessment. Input will not matter as much as relevant choice. Missing information won't be on the tip of your tongue; it will be somewhere in your files. Or, perhaps, it will be in somebody else's files. With all of everyone's work online, we will have the opportunity first glimpsed by H. G. Wells (and a bit later and more concretely by Vannevar Bush) to let everyone use everyone else's intellectual effort. We could build a real 'World Encyclopedia' with a true 'planetary memory for all mankind' as Wells wrote in 1938. [Wells 1938]. He talked of "knitting all the intellectual workers of the world through a common interest;" we could do it. The challenge for librarians and computer scientists is to let us find the information we want in other people's work; and the challenge for the lawyers and economists is to arrange the payment structures so that we are encouraged to use the work of others rather than re-create it.

    — Michael Lesk (1997) How Much Information Is There In the World? (Excerpt from Conclusion)
    Note: He was a visiting professor at UCL.
    Note: This unpublished paper is one of 11 articles Jim Gray recommended.

    World Brain or Global Brain proponents tend to extrapolate quite extravagantly the capabilities and implications of emerging technology. For Wells it was microfilm. Today it is the infinitely more sophisticated Internet and World Wide Web which have enmeshed our globe in a fantastically intricate and diffused communications infrastructure. By means of this technology as World or Global Brain proponents imagine it taking shape, the effective deployment of the entire universe of knowledge will become possible. But this begs unresolved questions about the relative value of the individual and the state, about the nature of individual and social benefits and how they are best to be allocated, about what constitutes freedom and how it might be appropriately constrained. It flies in the face of the intransigent reality that what constitutes the ever-expanding store of human knowledge is almost incalculably massive in scale, is largely viewpoint-dependent, is fragmented, complex, ceaselessly in dispute and always under revision.

    — W. Boyd Rayward (1999) "H. G. Wells's Idea of a World Brain: A Critical Re-Assessment." JASIS, 50: 557-579.
    Note: He was also a visiting professor at UCL.
  31. ^ "Hayakawa"
  32. ^ "...the ignoring of contexts in any act of interpretation is at best a stupid practice. At its worst, it can be a vicious practice."