Note: 01/06/2002 The AltaVista piece on this page was written in October 1997 and first published on the web in January 1998. Obviously, a lot has changed since then.
The most important evolution since that time is probably that Google came up with a better ranking technology that takes the webbed nature of the WWW into account. For those interested in "link analysis", the papers of Page and Brin are still around. Even though the details of the implementation of PageRank(TM) are not public, the way Google functions can be understood pretty well from the information that is available. I have very much the same feeling about Google as I used to have about AltaVista: given the task, it's an awesome machine. In general searching has become much easier: since Google I do not do a lot of Boolean searching anymore.
Since most other search engines started to do some form of link analysis, the business of site promotion has responded by changing too. There is still a lot of blah blah about keywords and META-tags around though. However, in order to be successful, keywords in the TITLE-tags (still all-important) now have to be combined with links from other sites. There are a lot of clever ways to do that but the best way now (as ever) is simply to make a good web site, with useful content that a lot of people want to know about. So a lot has probably changed for the better.
One of the many minor changes since 1997 that I came across while (at last!) updating the links was that AltaVista still had http://altavista.digital.com as URL. It still works, but you will be taken to the regular www.altavista.com.

Two popular questions & answers (don't ask them again):
1. No, I do not have the original AltaVista PX software anymore. I've been looking for it myself but it seems to have disappeared from the web altogether. Amazing, but true. Once immensely popular, now gone. Contrary to what many people said about the web in the closing years of the last century, it is not a place where all kinds of information linger for eternity. It might be keeping itself up to date much better than some of us thought. If anyone can find the original software somewhere I would be very glad to receive a copy. (I still have a Win '98 machine somewhere and it might still function.)
2. Also no: I'm quite happy with my library job and not interested in a site promotion career.


© Copyright Dirk van Eylen
All standard disclaimers apply, this article contains theory and conjecture and does not claim accuracy


Executive summary

While everybody agrees that internet search engines are valuable tools to bring some order to chaos, many people that are used to dealing with information systems feel they cannot rely on search engines and their indexes since the companies that own them do not provide adequate information on how documents are processed for indexing and retrieval. This article is expanded from a software review I did for one of my courses in the Library and Information Specialist program. Since the package under review was the AltaVista Personal Extensions program, general notions in this text are exemplified by referring to AltaVista.
After the introduction follows a section on indexing. General full text index characteristics, such as distribution of index terms according to Zipf's law, are mentioned to offer some understanding of what one might expect to find in an automatically generated full text index.
The next section tries to dispel some of the mystery that surrounds ranking of search results in AltaVista. Based on extensive usage of several AltaVista products and on information from the AltaVista help files, general considerations are offered on the relation between retrieval weights and frequency of terms in the index. After that the basic ranking algorithm is explained in detail by going through the results of a sample simple keyword search step by step. Even though the same simple algorithm seems to be used to process queries with multiple keywords, some aspects of results ranking in complex searches remain unclear.
The article concludes with a summary of findings and of the impact that using a search engine such as AltaVista has on the information that is retrieved. Anti-spamdexing measures taken by search engines sometimes seem to get in the way of retrieval effectiveness.


CONTENTS
Introduction
Indexing
More indexing information
Full text indexing and Zipf
Ranking
Single keyword queries
5 - 4 - 1.5 - 1
Multiple keyword queries
Conclusions
Summary table
Implications for information retrieval
How relevant is relevance ranking?


Introduction

In a recent thread in Web4Lib a general feeling of frustration with the poor documentation of many internet search engines became apparent. As many librarians argued correctly : a good (even if it is general) understanding of how search engines work is crucial for search engines to be fully accepted as information retrieval tools. I hope that sharing my experience as a user of several AltaVista products will add to the understanding of how AltaVista's ranking algorithm works.
The search engines have good reasons not to disclose detailed information on their inner workings. One reason is the ongoing war with keyword spammers. Keyword spamming or spamdexing is generally disapproved of by most serious site promoters. At the same time the thread in Web4Lib started, another thread was initiated in Online Advertising, a discussion group for site marketeers and promoters. (This discussion has been summarized by Danny Sullivan as the "Searchengines are dead discussion"). Until recently submitting your site to search engines was considered a valuable way to direct traffic to the site. Now, the competition among various site-promotion companies seems to have become so intense that search engine traffic is no longer considered the most efficient way to do promotion. You have to work very hard to get your site into the Top Ten (default number of displayed results in many engines) and the next day there will be somebody else taking your place. It's no longer worth the effort: ROI (Return On Investment seems to be a favorite term amongst promoters) has become too small. While many site promoters do not have a library background, it is sort of interesting to see that their main daily concern is with the Siamese twins of information retrieval : indexing and retrieval, traditional librarian core competences. Another reason why search engines are not very forthcoming on the subject of ranking is that it cannot be readily explained without introducing a host of concepts that are probably not familiar to most of the millions of users.

Indexing

As is probably well known, the index of any internet search engine is built by spidering documents on the internet and indexing them. AltaVista used to provide information on what "indexing" means :

"AltaVista treats every page on the Web and every article of Usenet news as a sequence of words. A word in this context means any string of letters and digits delimited either by punctuation and other non-alphabetic characters (for example, &, %, $, /, #, _, ~), or by white space (spaces, tabs, line ends, start of document, end of document). To be a word, a string of alphanumerics does not have to be spelled correctly or be found in any dictionary. All that is required is that someone typed it as a single word in a Web page or Usenet news article. Thus, the following are words if they appear delimited in a document HAL5000, Gorbachevnik, 602e21, www, http, EasierSaidThenDone, etc. The following are all considered to be two words because the internal punctuation separates them: don't, digital.com, x-y, AT&T, 3.14159, U.S., All'sFairInLoveAndWar."

This is so simple that it needs some time to sink in. Look how comfortably and mindlessly the indexing algorithm eats its way through documents from head to toe. The only question it ever, ever asks : is this character I'm reading a string boundary? Yes or no is decided by a lookup of the character against a table that contains all characters that are known as "delimiting characters" (non-alphabetics, spaces, tabs, line-ends etc.). If the answer to the question is yes, a new "word" (or a new occurrence of a "word") is added to the index. If the answer is no, the indexing algorithm reads the next character and asks the same question.
What holds true for indexing inevitably holds true for retrieval too. A URL that appears as text in a page, say http://www.dma.be/p/amphion/brakke-h/, is not one string, but 8 different "words" : http, www, dma, be, p, amphion, brakke, h. As AltaVista indicates, the index that results from all this will be full of all kinds of nonsense strings such as "602e21". A slight rewording of AltaVista's statement should go on a three-by-five card near the PC you are doing your searches on : "All that is required is that some moron typed it as a single word in a Web Page or Usenet article." Try the most exotic spelling mistake: if someone ever made it (on the web) you will find it (if it has been spidered and indexed, that is). As librarians know, this is not necessarily a bad feature. The presence of nonsense strings can also be of great help. Try for instance the phrase "All'sFairInLoveAndWar" to find copies of the AltaVista help files. The reverse holds true too : valuable information might be lost for retrieval because of spelling mistakes. But that's an old one.
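
The character-by-character splitting described above can be sketched in a few lines of Python. This is only an illustration of the rule as the help file states it, not AltaVista's actual code: the name split_into_words is made up for the example, and treating every character that is not a letter or a digit (including the underscore) as a delimiter is an assumption based on the quoted passage.

```python
import re
from typing import List

# Assumption: a "word" is any unbroken run of letters and digits; everything
# else (punctuation, underscore, white space) acts as a delimiter.
WORD_PATTERN = re.compile(r"[^\W_]+")


def split_into_words(text: str) -> List[str]:
    """Return the "words" of a text in the order in which they appear."""
    return WORD_PATTERN.findall(text)


# The URL from the text falls apart into eight separate "words".
print(split_into_words("http://www.dma.be/p/amphion/brakke-h/"))

# Internal punctuation splits these strings into two "words" each.
print(split_into_words("don't AT&T 3.14159"))
```

Note that nothing in the sketch checks spelling or consults a dictionary: whatever was typed between two delimiters ends up in the index.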

More Index information

One of the nice things about AltaVista is that you can always count the occurrences of a term in the index. If a keyword occurs too many times in the index, the search itself will be "ignored" but the number of times the word appears in the index will still be adequately counted. Librarians tend to call these too frequently occurring keywords "stop words". Stop words they are, but in the old days the status of stop word was assigned manually. This is no longer the case in the huge full text databases that are generated by spidering documents on the internet.
Now some arbitrary value is used as cut-off. As the database grows, more and more words will have a number of occurrences that is above this value. The important thing here is that the old static concept of "stop words" has become dynamic. What is a valid term today can be a stop word tomorrow.
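
A small sketch may make the "dynamic stop word" idea concrete. The cut-off value below is purely hypothetical (AltaVista never published the real one), as are the function name and the assumed 10% growth of the index; only the count for "where" is a figure quoted later in this article.

```python
# Hypothetical cut-off: the real value used by AltaVista is not documented.
STOP_WORD_CUTOFF = 20_000_000


def is_stop_word(occurrences: int, cutoff: int = STOP_WORD_CUTOFF) -> bool:
    """A term is treated as a stop word once its count exceeds the cut-off."""
    return occurrences > cutoff


where_today = 19_001_167                       # count quoted later in this article
where_after_growth = where_today * 110 // 100  # assume the index grows by 10%

print(is_stop_word(where_today))         # still a valid search term
print(is_stop_word(where_after_growth))  # the same word, now a stop word
```

The word itself has not changed between the two calls; only the size of the index around it has.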

Full text indexing and Zipf

As I will explain later, some insight into the distribution of terms in the index is important in order to understand what happens to your searches. Because AltaVista is a full text index, terms in the index will be roughly distributed according to Zipf's law.
Materials for an Information Retrieval course (note 06/2002: the course by Mr. Allan is still there but I could not find the WSJ and TIME data anymore) at the University of Massachusetts, Amherst, include examples of the most frequent terms in a full text database of TIME articles and a database of Wall Street Journal articles. I took these terms and counted them in AltaVista. Since I cannot know which terms are in AltaVista and not in the list of most frequent words of the TIME or WSJ databases, the results of this count do not constitute a Top 40 of most frequent AltaVista terms. It's only an approximation of what a Top 40 might look like.

Word Occ. Word Occ. Word Occ. Word Occ.
the 1,364 mn it 151 mn new 64 mn our 40 mn
of 771 mn be 151 mn u 58 mn who 37 mn
and 711 mn this 146 mn one 57 mn out 37 mn
to 662 mn as 131 mn he 57 mn when 36 mn
a 645 mn are 129 mn but 56 mn search 34 mn
in 474 mn at 128 mn has 54 mn been 33 mn
for 298 mn from 118 mn which 53 mn would 33 mn
s 269 mn an 89 mn about 51 mn date 30 mn
is 265 mn was 88 mn they 51 mn its 30 mn
an 203 mn not 87 mn more 50 mn had 29 mn
with 161 mn have 86 mn up 45 mn internet 29 mn
by 156 mn all 85 mn their 43 mn into 26 mn
or 154 mn week 67 mn his 42 mn    

Note how small these TIME and WSJ databases are compared to the AltaVista index. If we take the number of times "the" occurs as an indication of size :

Conversely, a rough estimate of the size of the AltaVista index could be derived. The WSJ corpus consists of 46,449 newspaper articles, with 19 million term occurrences, which means that "the" occurs on average once every 17 words. If the same ratio applied to the AltaVista index, total index size might be estimated at 23,000 million occurrences. While this is certainly impressive, I do not believe that it represents more than half of what is actually out on the web. The fact that most internet indexes contain documents in various languages is of course a flaw in the "the"-argument presented here. Reliable data or even reliable estimates on how many documents there are on the web are missing. A hundred million documents was the latest figure I read, but no authority was given. Data on search engine indexing policies are (except for various claims to be "the biggest" index) missing as well. In fact "completeness" has for some time now been almost a non-issue with most of the major search engines. It has been a while since I last saw a search engine claim to index THE internet.
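
The back-of-the-envelope estimate can be written out as follows. The two input figures come from the text (the count for "the" in the table and the one-in-17 ratio from the WSJ corpus); the variable names are just for illustration, and the result inherits every weakness of the "the"-argument mentioned above.

```python
THE_IN_ALTAVISTA = 1_364_000_000  # occurrences of "the" in the AltaVista index
WORDS_PER_THE = 17                # in the WSJ corpus "the" is one word in 17

# If "the" makes up the same share of the AltaVista index, the total number
# of term occurrences is simply the count of "the" times 17.
estimated_index_size = THE_IN_ALTAVISTA * WORDS_PER_THE

print(f"{estimated_index_size / 1_000_000:,.0f} million occurrences")
```

This gives a little over 23,000 million, the rounded figure used in the text.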
The words that were most frequent in the TIME full text database and in the Wall Street Journal example are distributed similarly in the AltaVista index. There are numerous exceptions, some due to "database vernacular" such as "britain" and "govern" in the TIME database, "million", "company" and "market" in the WSJ corpus or "internet" in the AltaVista index. So even though I cannot say very much about the total size of an internet index such as AltaVista's, I know how strings or words are distributed in the index. This will be important later on when we consider the role "weighting" plays in the retrieval process.
The important thing is that Zipf makes for a very skewed distribution. Jakob Nielsen has a very interesting column on web site popularity and Zipf distribution, part of which I use here with his permission to clarify what is at stake :

[Figures: Zipf distribution plotted with linear scales on both axes (zipf_linear.gif) and with logarithmic scales on both axes (zipf_log.gif)]

A simple description of data that follow a Zipf distribution is that they have :

Zipf distributions have been shown to characterize use of words in a natural language (like English) and the popularity of library books, so typically

End of quote. In terms of our index this means that the very few words that occur extremely often are the "stop words", which are not allowed for searches (but whose occurrences in the index can be counted), while "the abundance of words that are almost never used" may provide interesting keywords for searches that do not give too much of a headache.
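
To make the skew concrete, the sketch below sets the counts of the five most frequent words from the table above next to an idealised Zipf curve, in which the word at rank r occurs 1/r times as often as the top word. The 1/r form is the textbook statement of Zipf's law, not something taken from AltaVista, and the function name is only illustrative.

```python
from typing import List


def ideal_zipf_counts(top_count: float, ranks: int) -> List[float]:
    """Counts for ranks 1..ranks if frequency were exactly top_count / rank."""
    return [top_count / rank for rank in range(1, ranks + 1)]


# Observed counts (in millions) of the five most frequent words in the table.
observed = [("the", 1364), ("of", 771), ("and", 711), ("to", 662), ("a", 645)]
ideal = ideal_zipf_counts(observed[0][1], len(observed))

for (word, count), expected in zip(observed, ideal):
    print(f"{word:>4}  observed {count:>5} mn   ideal {expected:7.1f} mn")
```

The real counts follow the ideal curve only roughly, which is why the text speaks of terms being "roughly distributed" according to Zipf's law.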

Ranking

AltaVista ranks the results of a search on criteria that - according to a help file that came with the Personal Extensions program - include these :

These are general principles that hold true for almost any indexing and retrieval program. But what do they mean exactly? For instance, the third criterion seems very generous, but isn't it just good practice that your search on "education AND 'distance learning' AND resources" ranks results higher if they match 3 of your terms instead of just 2?
When I reviewed the freely downloadable demo package of AltaVista's Personal Extensions, I noticed that it had an extra search interface to be used when the default search interface failed to install properly (find and double-click a file named pav_gui.exe after you have installed the program). This interface actually gave ranking scores for each document that was retrieved :

[Figure: page.gif. Screen shot of the results window with ranking scores]

A few months ago AltaVista opened up a Belgian branch (note 06/2002: this service no longer functions, the new service at http://altavista.advalvas.be/av2/nl/default.asp simply reroutes queries to the regular AltaVista site that does not return ranking scores) which also gives ranking scores.
Ranking scores like these are very valuable information because if you have the patience to go and look in the retrieved documents, count the occurrences of your keyword and look at where the hits are, you can learn a lot about how ranking is done. In the next section I will focus on single keyword searching in order to describe the basics of ranking. After that I will theorise on what happens in complex searches.

Single keyword searches

One of the first surprises I had was that single keyword searches constantly returned "groups" of results. In the case of the simple search on "page" with the Personal AltaVista (screenshot above) the results had only three different ranking scores (765; 229; 153) for all 15 documents that were retrieved. The same holds true if you do a single keyword search on the Belgian branch, even though the scores might be a little "fuzzy". For instance a search on "flaubert" returns four groups : 875-870; 700-696; 262-261; 175 which in my experience is the maximum number of different ranking scores for any single keyword search. (We will look at these different groups in more detail further down.)
Number crunchers will have noticed already, but for as poor a mathematician as I am it took quite some fiddling around before I saw it. Once you've seen it though you will never again NOT see it. The ranking scores are numerically related as :

5 - 4 - 1.5 - 1

Since the lowest score in the row equals 1 in the numerical relation I will call this score the index-score or index-weight. The three other scores can always be found by multiplying the index-score :
875 = 5 x 175
700 = 4 x 175
262 = 1.5 x 175
I call this score the index score because it is computed - as AltaVista indicates in the help file above - as a function of the number of occurrences ("rarity") in the index. I do not know how that computation is done, but we get an idea of what goes on if we look at the following table in which index-scores of some keywords are matched with their occurrences in the global index :

Word            Number of occurrences in the index    Index score
where           19,001,167                            2/3
bookmark        1,388,021                             64
edison          255,634                               102
flaubert        11,228                                175
monotheistic    5,317                                 192
zipf            3,492                                 202
rogiers         678                                   242
logaritmic      155                                   275
flipsies        1                                     356

I think the general rule is pretty obvious and logical : the more occurrences in the index, the smaller the weight becomes. The underlying assumption is plain old information theory basics : the greater the probability that a word (string, character, event etc.) will occur, the less information it carries.
The main problem to overcome here is of course the Zipf distribution. What is a good measure to compute index weights? The enormous differences between the number of times any given term appears in the index should not and cannot give rise to the same order of difference in the weights because that would allow one term to dominate other terms in a way that would make any serious information retrieval impossible. As said before I do not know how AltaVista computes the weights, and tastes differ, but I've always thought they're doing a pretty good job at it. I guess you could find out the function AltaVista uses if you plot enough coordinates (occurrence-weight couples). Also, there's room for some tweaking here : different functions will give different weights which will be good for different kinds of users.
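
As a first attempt at "plotting enough coordinates", the sketch below fits a straight line through the index scores from the table above against the logarithm of the number of occurrences. The logarithmic form is my own guess, chosen because it tames the Zipf skew; nothing in the AltaVista documentation confirms it, the function name is made up, and eight points are far too few to settle the matter. "where" is left out because its score is only known to lie between 2 and 3.

```python
import math

# (occurrences in the index, index score) pairs from the table above.
SAMPLES = [
    (1_388_021, 64), (255_634, 102), (11_228, 175), (5_317, 192),
    (3_492, 202), (678, 242), (155, 275), (1, 356),
]


def fit_log_line(samples):
    """Least-squares fit of: score = intercept + slope * log10(occurrences)."""
    xs = [math.log10(occurrences) for occurrences, _ in samples]
    ys = [score for _, score in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_log_line(SAMPLES)
print(f"score = {intercept:.0f} {slope:+.0f} * log10(occurrences)")
```

On these eight points the fitted line loses roughly 50 points of weight for every tenfold increase in occurrences, which at least agrees with the general rule: the more occurrences, the smaller the weight.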
In the table above I have only tried to give an idea of the boundaries of the system. The weight of a common word such as "where" is very low (it should be somewhere between 2 and 3, hence the 2/3 notation). It was the closest I could find before a word becomes a "stop word". The 0 weight is where the cut-off value for stop words is. Note that this feature can be used to do a rough simulation of NLP (Natural Language Processing) of queries. A natural language question such as Where do I find Bill Gates on the net? will yield results similar to those of just typing Bill Gates (NOT as a phrase) because most of the words in the sentence are stop words and will not be used to perform the search anyway. Whether the results will be useful is another matter.
"Flipsies" on the other hand occurs only once in the index and gets 356. This seems to be the maximum weight a word can aspire to. Now, "flipsies" is clearly a rather exotic case. I do not think that there are many single words that occur only once in the index, but phrases are treated as single keywords too.
Note also that these boundaries are not fixed. Since the weights are computed against the index, and the index grows larger every day, the maximum weight will increase slightly with a larger index size.

5 - 4 - 1.5 - 1

Back to the numerical relation. What causes a document to be ranked with the index-score, or with 4 or 5 times the index score? In order to find out we will look at some pages of results. Following popular format each table lists result numbers, titles, URLs, description and score.
Let's look at the first ten URLs that are retrieved by the search on "Flaubert", one of my favourite writers :

855 documents found on keyword flaubert | showing page 1-10
1 Flaubert on creativity
- http://sunsite.unc.edu/ibic/CPB/msg00085.html

Sponsored this month by: How you can sponsor these pages. [Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Flaubert on...

score 875
2 Index of /faces/local/us/in/bloomington/flaubert
- http://www.cs.indiana.edu/faces/local/us/in/bloomington/flaubert/

Index of /faces/local/us/in/bloomington/flaubert. Name Last modified Size Description. Parent Directory 14-Feb-95 04:31 - lisa/ 31-Mar-95 13:58 -

score 875
3 Penguin - Three Tales by Gustave Flaubert
- http://www.futurenet.co.uk/Penguin/Books/0140441069.html

Three Tales. by Gustave Flaubert Date published: 1/2/61. TRANSLATED WITH AN INTRODUCTION BY ROBERT BALDICK. Twenty years after Madame Bovary, Flaubert...

score 875
4 COM L 411 The Short Novel from Flaubert and James to the Present (Courses of
- http://www.cornell.edu/Academic/Courses97/csas/as485.html

COLLEGE OF ARTS AND SCIENCES COMPARATIVE LITERATURE 1997-98 COURSE DESCRIPTIONS. COM L 411 The Short Novel from Flaubert and James to the Present. Spring..

score 875
5 Flaubert : L'éducation sentimentale
- http://www.alexandrie.com/alex2/pagealex/litterat/roman/flaubert/educsent/tdm.html

Gustave Flaubert L'éducation sentimentale. ALEXANDRIE - La Bibliothèque Virtuelle. TABLE DES MATIERES. Première partie. Chapitre...

score 875
6 Gustave Flaubert: Lust
- http://www.anesi.com/q0012.htm

At other times, seared by that hidden fire which her adultery kept feeding, consumed with longing, feverish with desire, she would open her window, inhale.

score 875
7 Index of /faces/users/us/in/bloomington/flaubert/lisa
- http://www.cs.indiana.edu/faces/users/us/in/bloomington/flaubert/lisa/

Index of /faces/users/us/in/bloomington/flaubert/lisa. Name Last modified Size Description. Parent Directory 21-Dec-92 12:18 - face.gif 31-Mar-95 13:38 1k.

score 875
8 LE DICTIONNAIRE DES IDEES RECUES DE FLAUBERT
- http://l3av01.univ-lille3.fr/www/PUS/TDM_HERSCHBERG.HTML

LE DICTIONNAIRE DES IDEES RECUES DE FLAUBERT par Anne HERSCHBERG PIERROT. Avant-Propos. INTRODUCTION. L'Opinion et les majorités. Langage de la bêtise....

score 875
9 Normandie Web : Gustave Flaubert
- http://www.normandie.fr.eu.org/culture/litterature/flaubert/flaubert.html

Gustave Flaubert

score 875
10 Flaubert's Critics
- http://www.wtamu.edu/academic/finearts/english/critics.htm

Flaubert's Critics. "Too frequent in studying a great work we only end where in fact we should have begun: by examine directly our impressions as we read..

score 875

It is pretty obvious that all the retrieved results have the keyword in the title. These documents are ranked highest. For most simple searches the result of a search roughly equals a good old title search in the library catalogue. Let's have a look at where ranking scores begin to fall :

855 documents found on keyword flaubert | showing page 71-80
71 Gustave Flaubert (1821-1880) : Amazon Book List
- http://www.mala.bc.ca/~mcneil/list/citamaflaub.htm

Gustave Flaubert (1821-1880) : Amazon Book List. | Advanced LC Search | Advanced Amazon Search | 56 items shown. Click on title for more details and...

score 870
72 newyork.sidewalk: Chère Maître: The Flaubert-Sand Correspondence
- http://newyork.sidewalk.com/detail/32138

restaurants  events  arts & music  places to...

score 870
73 Index of /picons/db/users/us/in/bloomington/flaubert/lisa
- http://www.cs.indiana.edu/picons/db/users/us/in/bloomington/flaubert/lisa/

Index of /picons/db/users/us/in/bloomington/flaubert/lisa. Name Last modified Size Description. Parent Directory 21-Dec-92 12:18 - face.gif 31-Mar-95...

score 700
74 Yahoo! France - Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert,
- http://www.yahoo.fr/Art_et_culture/Litterature/Genres/Romans/Romanciers/Flaubert__Gustave__1821_1880_/

Index:Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert, Gustave (1821-1880) Options. Bouvard et Pécuchet. Coeur simple, Un. Éducation...

score 700
75 Index of /l/www/faces/local/us/in/bloomington/flaubert
- http://www.cs.indiana.edu/l/www/faces/local/us/in/bloomington/flaubert/

Index of /l/www/faces/local/us/in/bloomington/flaubert. Name Last modified Size Description. Parent Directory 14-Feb-95 04:31 - lisa/ 31-Mar-95 13:58 -

score 700
76 Index of /picons/db/local/us/in/bloomington/flaubert. Name Last modified Size
- http://www.cs.indiana.edu/picons/db/local/us/in/bloomington/flaubert/

Index of /picons/db/local/us/in/bloomington/flaubert. Name Last modified Size Description. Parent Directory 14-Feb-95 04:31 - lisa/ 31-Mar-95 13:58 -

score 700
77 Penguin - The Dictionary of Received Ideas by Gustave Flaubert
- http://www.futurenet.co.uk/Penguin/Books/0140389040.html

The Dictionary of Received Ideas. Preface by Julian Barnes. by Gustave Flaubert Preface by Julian Barnes Date published: 14/11/94. Lake: Always have a...

score 700
78 Index of /picons/db/local/us/in/bloomington/flaubert/lisa
- http://www.cs.indiana.edu/picons/db/local/us/in/bloomington/flaubert/lisa/

Index of /picons/db/local/us/in/bloomington/flaubert/lisa. Name Last modified Size Description. Parent Directory 14-Feb-95 04:31 - face.gif 31-Mar-95...

score 696
79 Palabras y vacío: Lenguaje y tópico en Gustave Flaubert - nº 4 Espéculo
- http://www.ucm.es/OTROS/especulo/numero4/g_flaub.htm

Palabras y vacío. Lenguaje y tópico en la obra de Gustave Flaubert. Joaquín Mª Aguirre Romero Dpto. Filología Española III (CC Información) Universidad...

score 696
80 Index of /picons/db/users/us/in/bloomington/flaubert
- http://www.cs.indiana.edu/picons/db/users/us/in/bloomington/flaubert/

Index of /picons/db/users/us/in/bloomington/flaubert. Name Last modified Size Description. Parent Directory 21-Dec-92 12:18 - lisa/ 31-Mar-95 13:38 -

score 696

First thing to note is that this is already the eighth page of results. As site promoters (and librarians, I should add) know, the eighth page is beyond most users. Several interesting things happen here. While 870 is still the highest score (x 5), from the third URL onwards the score drops to the next level (x 4), first 700 and then 696. We still have our keyword in the title, but it has moved after the eighth word position and this seems to be the criterion for dropping the score. You would expect that a rule that relates scores and positions within a document would be measured in characters rather than words, but that is not the case. If you go back to the AltaVista explanation of how the indexing algorithm works, you can guess why : the indexing algorithm only knows "strings". For instance, the title of the 76th result :

Index of /picons/db/local/us/in/bloomington/flaubert. Name Last modified Size

Since the / is a non-alphabetic character, it is considered a word boundary and our keyword "flaubert" occupies position number nine. This makes the difference with the almost identical result we encountered on the first page of results which had

Index of /faces/local/us/in/bloomington/flaubert

for title. Almost identical. Almost, but "flaubert" occupies position number eight : just enough to make it to the top. There are more files like these from the same server. They contain nothing but a directory overview of a very small directory; the keyword "flaubert" occurs once in the body of the text because the text starts with the same words as the title (Index of /picons/db/local/us/in/bloomington/flaubert/lisa).
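
The word counting behind these two titles can be checked with a short sketch. It reuses the splitting rule from the indexing section (every character that is not a letter or digit is a boundary, which is my reading of the help file), and the threshold of eight words is the one inferred from the scores, not a documented AltaVista setting. The function and variable names are invented for the example.

```python
import re
from typing import Optional

WORD_PATTERN = re.compile(r"[^\W_]+")  # assumption: runs of letters and digits
TOP_TITLE_WORDS = 8                    # threshold inferred from the scores


def title_position(title: str, keyword: str) -> Optional[int]:
    """1-based word position of keyword in title, or None if it is absent."""
    words = [word.lower() for word in WORD_PATTERN.findall(title)]
    keyword = keyword.lower()
    return words.index(keyword) + 1 if keyword in words else None


title_result_2 = "Index of /faces/local/us/in/bloomington/flaubert"
title_result_76 = ("Index of /picons/db/local/us/in/bloomington/flaubert. "
                   "Name Last modified Size")

for title in (title_result_2, title_result_76):
    position = title_position(title, "flaubert")
    in_top_words = position is not None and position <= TOP_TITLE_WORDS
    print(position, in_top_words)
```

The first title puts "flaubert" at position 8, the second at position 9: one extra directory level in the path is all it takes to fall out of the highest-scoring group.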
These files shouldn't have been indexed to begin with (the read property of the directory was probably set to "all users" so that the spider robot read and indexed it as any other file). The name of the directory has changed over time and each time it has been indexed by the spider. Apparently, this kind of near-duplicates cannot be easily (automatically that is) removed from the index. They constitute some of the inevitable noise that is retrieved first by the indexing mechanism and later by searches. Note however, that the "noise" in this case can easily make it to the top 10.
Why only the first 8 words of a title should be considered important is an interesting question but cannot be answered. My guess would be that this is largely an anti-spamdexing measure. Lots of people in the business of unethical site promotion know from experience that nothing can beat a word that appears in the title of an HTML file. The result was that some began to use multiple TITLE-tags. Others began to write extremely long titles with lots of repeated keywords. In addition to spamdexing there are genuine mistakes. Some people may forget to close the TITLE-tag. Since documents cannot be relied on to provide a bulletproof boundary, it is very likely that the spidering robot has some simple rule to decide where a title begins and ends : index and count eight strings beginning from <TITLE>, then count and index x more as 'rest of the title', after that index everything as if it was part of the document body. Of course there might be plenty of other reasons why files are indexed the way they are.

Let's continue our journey through the "flaubert" search.

855 documents found on keyword flaubert | showing page 81-90
81 Penguin - The Temptation of Saint Anthony by Gustave Flaubert
- http://www.futurenet.co.uk/Penguin/Books/0140444106.html

The Temptation of Saint Anthony. by Gustave Flaubert Date published: 31/3/83. Price: £6.99 (Paperback) ISBN: 0140444106. Order this book. Click here..

score 696
82 Yahoo! France - Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert,
- http://www.yahoo.fr/Art_et_culture/Litterature/Genres/Romans/Romanciers/Flaubert__Gustave__1821_1880_

Index:Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert, Gustave (1821-1880)   Options. Bouvard et Pécuchet. Coeur simple, Un....

score 696
83 Yahoo! France - Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert,
- http://www.yahoo.fr/Art_et_culture/Litterature/Genres/Romans/Romanciers/Flaubert__Gustave

Qwam: l'intelligence de l'économie. Index:Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert, Gustave. Options. Bouvard et Pécuchet....

score 696
84 Maxime Du Camp
- http://www.med.univ-rennes1.fr/bfe/ducamp.htm

Extrait de "Souvenirs littéraires" de Maxime Du Camp (écrivain, ami de Gustave Flaubert) "Au mois de janvier 1844, Gustave...

score 262
85 Loreart & BUCHLUST
- http://www.loreart.com/lothar.htm

Buchlust  Medienwelt   Café Central   Shopping Passage  Stellenmarkt Anzeigenmarkt Home. KOLLEGENGESPRÄCHE und andere...

score 262
86 Penguin - Three Classic Romantic Stories by Charlotte Brontë, Emily Brontë and
- http://www.futurenet.co.uk/Penguin/Books/0140860983.html

Three Classic Romantic Stories. by Charlotte Brontë, Emily Brontë and Gustave Flaubert Date published: 3/11/94. (PEN 99) 9hrs B. Three classic romantic...

score 262
87 Magasins Sud Est
- http://www.kiabi.fr/sp/magsudes.htm

REGION SURESTE. TIENDAS. DIRECCION. Nº TELEFONO. Horario comercial. CLERMONT. Boulevard Gustave Flaubert. 63000 - CLERMONT FERRAND. Telf :...

score 262
88 Aschehoug forlag
- http://www.aschehoug.no/aschehoug/host97/boker/076.html

GUSTAVE FLAUBERT Frédéric Moreau En ung manns historie. Flauberts romanhelt Frédéric er en passiv, lettbevegelig...

score 262
89 Records for Madame Bovary : a story of provincial life. (in MARION)
- http://utcat.library.utoronto.ca:8002/MARION/+MADAME%20BOVARY/5a8ed2004100/0

Madame Bovary : a story of provincial life. Records 1 to 1 of 1. Flaubert, Gustave, 1821-1880.Madame Bovary : a story of provincial life / Gustave...

score 262
90 Recherches 3e
- http://bleue.ac-aix-marseille.fr/bleue/francais/mrs19e/flaubert.htm

Flaubert, l'initiation amoureuse et le rêve oriental. Retour sommaire Retour page précédente. En 1840, Flaubert est reçu au...

score 262

From the 84th result onward the score drops considerably. The keyword no longer appears in the title of the HTML file, but it does appear in the body text. This is where I had my second big surprise. I had always assumed that the number of occurrences of a keyword in a document would be important. It is not, or rather not in the way I had expected. Opening documents and counting occurrences with the Find command of a browser shows that at this level (x 1.5) there can be any number of occurrences of the keyword. As long as there are at least two, the score remains the same. Again, from an information retrieval point of view this may appear very strange, but in the light of spamdexing it is not. A common spamdexing technique is to repeat keywords over and over in the body of the text. This is done either shamelessly in plain view, or invisibly by writing in a white font on a white background. A bright idea, but (at least in AltaVista's case) not a very effective one: mere repetition is not very highly rewarded.
Just one more page of results :

855 documents found on keyword flaubert | showing page 171-180
171 Search Results
- http://www.booksmith.com/bin/search.cgi/author=Flaubert,%20G/874616992091


score 261
172 Søkeresultat: tittelnr. 24732
- http://www.of.fylkesbibl.no/cgi-bin/bibliofil/x_base=data/x_frameOn=0/x_tabell=0/t_vis=24732.4241

Katalogopplysninger: HYLLEPLASS: 840.9 A FORFATTER: Amadou, Anne-Lisa, n., 1930- TITTEL: Omkring Marcel Proust : elleve franske romanstudier ANSVARLIGE:...

score 261
173 art_inserto1.html-"il manifesto" del 03-Luglio-1997
- http://mir10.mir.it/mani/insert/talpa/03-Luglio-1997/art_talpa1.html

Uno scrittore bestiale. - JACQUELINE RISSET. "I CAPOLAVORI sono bêtes; hanno il volto tranquillo delle produzioni della natura, dei grandi animali e.

score 261
174 Re: Noweb and html.
- http://www.uni-giessen.de/hrz/tex/more_info/info/mailarchiv/litprog.1995/msg00442.html

Prev][Next][Index][Thread] Re: Noweb and html. Subject: Re: Noweb and html. From: norman@flaubert.bellcore.com (Norman Ramsey) Date: 29 Apr 1995 03:44:04..

score 261
175 Re: Beginner's Guide?
- http://www.uni-giessen.de/hrz/tex/more_info/info/mailarchiv/litprog.1995/msg00562.html

Prev][Next][Index][Thread] Re: Beginner's Guide? Subject: Re: Beginner's Guide? From: norman@flaubert.bellcore.com (Norman Ramsey) Date: 18 May 1995...

score 261
176 Thornton's: Editions Pleiade
- http://www.demon.co.uk/thorntons/pleiade.htm

Editions Pleiade. Complete Set. Individual Volumes Also Available. Apollinaire, 1971. Balzac, 1962. Baudelaire, 1974. Camus, 1982. Carroll, 1990. Celine,..

score 175
177 December quote
- http://www.mindspring.com/~melscrib/decquote.htm

Quote of the month: December. "...none of us can ever express the exact measure of his needs or his thoughts or his sorrows; and human speech is like.

score 175
178 CMLT C347 1014 Ideas in Literature
- http://www.indiana.edu/~deanfac/blspr97/cmlt/cmlt_c347_1014.html

Comparative Literature | Ideas in Literature C347 | 1014 | Johnston. Topic -- Love and Tears: Women Criminals and Saints How do saints, criminals and...

score 175
179 The Puzzle Factory
- http://www.puzzlefactory.com/

Welcome to the Puzzle Factory. Who are we? Not sure yet but I'll keep you posted. Until then, I'll leave you with a couple of my favorite quotes. "Be...

score 175
180 WHOLESALE PRODUCTS FICTION AND LITERATURE BOOKSTORE [F-L]
- http://www.wholesaleproducts.com/fictionbookstorefl.html

WHOLESALE PRODUCTS FICTION AND LITERATURE BOOKSTORE [F-L] Orders fulfilled by Book Stacks Unlimited. Wholesale Products Bookstore. Pick from your favorite.

score 175

Finally, at result 176, the score has dropped to the bare index weight. From result 176 onwards, the keyword occurs only once in the body text of the document. I did not inspect the long tail of results beyond this page (877 documents found!). In any case, most of the time AltaVista does not allow inspection beyond the 200th result.

Multiple keyword queries

What would happen if we were interested in the relation between Flaubert and his travel companion, the writer and photographer Maxime du Camp? In Simple Search mode the search would look like this:

+flaubert +"maxime du camp"

Here the + is used to simulate Boolean AND. In Simple Search mode ranking is executed automatically. Only Advanced Search mode supports the use of the AND operator. To obtain the same results in Advanced Search mode you would have to type flaubert AND "maxime du camp" in the search box, and the keywords flaubert and "maxime du camp" in the ranking box (without the Boolean AND).
Unfortunately, the Belgian branch does not support any searching more complex than single keyword and phrase searches. So I first did a search on the Internet and opened all the documents I could open. Then I had the Personal AltaVista program index my browser cache. Of course, the index built from my browser cache was very small and, measured against total index size, contained comparatively many occurrences of flaubert and "maxime du camp". Thus, the index weights I found were 165 (flaubert) and 188 ("maxime du camp"). Then I ran the search described above (+flaubert +"maxime du camp") on Personal AltaVista. Here is what I found:

[Figure: screenshot of ranking results for the complex query]

All documents contain at least one flaubert and at least one "maxime du camp" because of the +... +... formulation. The documents are ranked in the same order as in the Simple Search over the Internet. Up to 1187, the scores make sense. The lowest score seems to be computed as:

(165 x 1.5) + (188 x 1) = 435.5

The decimal is dropped and 435 is left over. The second score seems to be :

(165 x 1) + (188 x 1.5) = 447

And so on. It does not take long to see the metric that is applied here. The individual scores are computed exactly as in single keyword queries, and then added up. Opening the documents and counting the occurrences confirms that all documents with a score of 435 have multiple occurrences of flaubert and only one of "maxime du camp". The document with score 1187 (in fact 1187.5) has "maxime du camp" once in the title and flaubert several times in the body of the text, hence:

(165 x 1.5) + (188 x 5) = 1187.5

This formula can be visualised in the following table :

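Rows: multiplier applied to weight 165 (flaubert). Columns: multiplier applied to weight 188 ("maxime du camp").

165 \ 188 | x 1    | x 1.5  | x 4    | x 5
x 1       | 353    | 447    | 917    | 1105
x 1.5     | 435.5  | 529.5  | 999.5  | 1187.5
x 4       | 848    | 942    | 1412   | 1600
x 5       | 1013   | 1107   | 1577   | 1765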

Scores that I found in the results are rendered in red. Not all combinations are found in a small search such as this one.
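The additive rule can be written out as a small Python sketch. It only restates the arithmetic observed above (multiply each index weight by its tier multiplier, add up, drop the decimal); the names MULTIPLIERS, combined_score and score_table are invented for the example and say nothing about how AltaVista is implemented internally.

```python
import math
from itertools import product

# The four tiers observed in the results: one body occurrence, two or more
# body occurrences, rest of the title, first 8 words of the title.
MULTIPLIERS = (1, 1.5, 4, 5)


def combined_score(weights, multipliers):
    """Sum weight x multiplier per keyword and drop the decimal."""
    return math.floor(sum(w * m for w, m in zip(weights, multipliers)))


def score_table(weights):
    """Unrounded score for every combination of tiers, keyed by multipliers."""
    return {
        combo: sum(w * m for w, m in zip(weights, combo))
        for combo in product(MULTIPLIERS, repeat=len(weights))
    }


table = score_table((165, 188))
print(table[(1.5, 1)], combined_score((165, 188), (1.5, 1)))
```

For two keywords score_table has 16 entries, the cells of the table above. Each extra keyword multiplies the number of possible combinations by four, since every keyword can land in any of the four tiers.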
Looking back at what I said about Zipf, the importance of converting weights according to "rarity" in the index becomes clear. If one keyword had a weight of 3,000 and the other 50, the keyword with weight 50 would always end up somewhere at the back, no matter where it appeared in the document. Even if it appeared in the title, it would still be ranked at the far end of the results.
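A quick way to see this is to rank all tier combinations for two keywords under the additive rule. The sketch below is my own illustration with the hypothetical weights 3,000 and 50 from the paragraph above; the function name rank_combinations is made up, and the multipliers are the four observed tiers.

```python
from itertools import product

MULTIPLIERS = (1, 1.5, 4, 5)


def rank_combinations(heavy, light):
    """Order (heavy_mult, light_mult) pairs from highest to lowest summed score."""
    combos = product(MULTIPLIERS, repeat=2)
    return sorted(combos, key=lambda c: heavy * c[0] + light * c[1], reverse=True)


ranking = rank_combinations(3000, 50)
# Position of "light keyword in the title, heavy keyword once in the body"
# versus "heavy keyword twice in the body, light keyword once in the body".
print(ranking.index((1, 5)), ranking.index((1.5, 1)))
```

With weights this far apart the order is decided entirely by the tier of the heavy keyword, and the light keyword only breaks ties within a tier. With weights as close as 165 and 188 the two keywords trade places, which is what the table of scores shows.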
Note that this way of processing results is elegantly simple and robust, but that it is also very heavy on the processor. Every keyword that is added to the query potentially adds a factor of 4 to the required processing power.
The scores from 1187 upwards are unclear. None of the documents in this segment has one of the keywords in its title, yet they rank very highly, higher in fact than could be expected on the basis of the simple rule above. Grouping is done in a similar way (depending on occurrences of keywords) but some additional criterion seems to be used in the computation. I cannot find what it is, but it seems likely that some distance metric is involved, because only documents that have both keywords close to each other (close being something like "in the same sentence") are in the upper segment. However, in the first section (up to 1187) there are also two documents with the keywords very near each other. So even though the basis is clear, more research needs to be done. I would be delighted if someone with experience in statistical information retrieval techniques would look into this.

Conclusions

Besides some as yet unknown distance metric, there seem to be only four criteria for ranking scores: two positional ones, and two related to occurrences in the text. The positional ones concern the words contained in the HTML TITLE-tags; the criteria concerning the occurrences (or frequency) of keywords apply to the body text of the document. For clarity these criteria are summarized in the table below.

Summary table

POSITION (in the TITLE)                     | OCCURRENCES IN TEXT (body)
first 8 words of the title | rest of title  | 2 - unlimited occ. | 1 occ.
x 5                        | x 4            | x 1.5              | x 1
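Read as a rule, the summary table amounts to something like the following Python sketch for a single keyword. It is a reconstruction from the observed scores, not the real algorithm. The function names are invented, and the assumption that only the highest applicable tier counts (rather than the tiers adding up) is mine.

```python
import math


def tier_multiplier(in_title_head, in_title_rest, body_occurrences):
    """Pick the multiplier from the summary table (highest tier wins: an assumption)."""
    if in_title_head:          # keyword among the first 8 words of the title
        return 5
    if in_title_rest:          # keyword in the rest of the title
        return 4
    if body_occurrences >= 2:  # two or more occurrences in the body text
        return 1.5
    if body_occurrences == 1:  # exactly one occurrence in the body text
        return 1
    return 0                   # keyword absent: document would not be retrieved


def keyword_score(index_weight, in_title_head=False, in_title_rest=False,
                  body_occurrences=0):
    """Index weight times tier multiplier, with the decimal dropped."""
    multiplier = tier_multiplier(in_title_head, in_title_rest, body_occurrences)
    return math.floor(index_weight * multiplier)


print(keyword_score(175, body_occurrences=1))    # one occurrence in the body
print(keyword_score(175, body_occurrences=40))   # many occurrences, same tier as two
```

With an index weight of 175 this gives 175 for a single body occurrence and 262 for two or more, the two lowest score levels in the flaubert listings above.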

Implications for information retrieval

Some issues with regard to information retrieval are :

How relevant is relevance ranking?

There's room for a lot of philosophy here. I have to confess that when I began to see how strongly retrieval relied on simple brute computing power, I was disappointed and relieved at the same time. Disappointed because what seemed such a great tool was again one of those really dumb computer tricks, relieved for the same reason. But then again, given the size of the index, most of the time a search engine does the job and AltaVista does it well, at least if you are looking for something rather specific. On the other hand, even if you are not looking for something specific you will very likely find something relevant on the first page of results (but miss a lot of potentially relevant material further down). After all, it is this combination of characteristics that has made AltaVista one of the popular engines for end users. For instance, if you are looking for different tie knots, you might try a simple search on the phrase "tie knot*". Some noise will enter the results because the same words occur with a different meaning in a sentence such as "John knows how to tie knots". The first page of results will yield an online tie shop that offers drawings of Windsor and Half-Windsor knots. However, only on page four will you find the page of the venerable American Neckwear Association, which is unfortunately titled "How to tie a tie" but which contains the best graphics on how to really solve everyday tie-knot problems.
In the end it seems to be the same old story all over again: if you do not expect too much it works great; if you want to get some work done it pays to get to know the machine first.