Note: 01/06/2002 The AltaVista piece on this page was written in October 1997 and first published
on the web in January 1998. Obviously, since then a lot has changed.
The most important evolution since that time is probably that Google
came up with a better ranking technology that takes the webbed nature of the WWW into account.
For those interested in "link analysis", the papers
of Page and Brin are still around. Even though the details of the implementation
of PageRank(TM) are not public, the way Google functions can be understood pretty
well from the information that is available. I have very much the same feeling
about Google as I used to have about AltaVista: given the task it's an awesome
machine. In general searching has become much easier: since Google I do not
do a lot of Boolean anymore.
Since most other search engines started to do some form of link analysis, the
business of site promotion has responded by changing too. There is still a lot
of blah blah about keywords and META-tags around though. However: in order to be
successful, keywords in the TITLE-tags (still all-important) now have to
be combined with links from other sites. There are a lot of clever ways to do
that but the best way now (as ever) is simply to make a good web site, with
useful content that a lot of people want to know about. So a lot has probably changed
for the better.
One of the many minor changes since 1997 I came across while (at last!) updating
the links was that AltaVista still had http://altavista.digital.com
as URL. It still works, but you will be taken to the regular www.altavista.com.
Two popular questions & answers
(don't ask them again):
1. No, I do not have the original AltaVista PX software anymore. I've been looking
for it myself but it seems to have disappeared from the web altogether. Amazing,
but true. Once immensely popular, now gone. Contrary to what many people in the closing
years of the last century said about the web, it is not a place where all kinds of information linger for eternity. It might be keeping itself up to date much
better than some of us thought. If anyone can find the original software somewhere I would be very glad to receive a copy.
(I still have a Win '98 machine somewhere and it might still function.)
2. Also no: I'm quite happy with my library job and not interested in a site
promotion career.
© Copyright Dirk
van Eylen
All standard disclaimers apply, this article contains theory and conjecture
and does not claim accuracy
Executive summary
While everybody agrees that internet search engines are
valuable tools to bring some order to chaos, many people that are used to dealing
with information systems feel they cannot rely on search engines and their indexes
since the companies that own them do not provide adequate information on how
documents are processed for indexing and retrieval. This article is expanded from
a software review I did for one of my courses in the Library
and Information Specialist program. Since the package under review was the
AltaVista Personal Extensions program, general notions in this text are exemplified
by referring to AltaVista.
After the introduction
follows a section on indexing. General full text index characteristics, such
as distribution of index terms according to Zipf's law, are mentioned to offer
some understanding of what one might expect to find in an automatically generated
full text index.
The next section
tries to dispel some of the mystery that surrounds ranking of search results
in AltaVista. Based on extensive usage of several AltaVista products and on
information from the AltaVista help files, general considerations are offered
on the relation between retrieval weights and frequency of terms in the index.
After that the basic ranking algorithm is explained in detail by going through
the results of a sample simple keyword search step by step. Even though the
same simple algorithm seems to be used to process queries with multiple keywords,
some aspects of results ranking in complex searches remain unclear.
The article concludes
with a summary of findings and their impact on the usage of a search engine
such as AltaVista on the information that is retrieved. Anti-spamdexing measures
taken by search engines sometimes seem to get in the way of retrieval effectiveness.
CONTENTS
Introduction
Indexing
More indexing information
Full text indexing and Zipf
Ranking
Single keyword queries
5 - 4 - 1.5 - 1
Multiple keyword queries
Conclusions
Summary table
Implications for information retrieval
How relevant is relevance ranking?
In a recent thread in Web4Lib
a general feeling of frustration with the poor documentation of many internet
search engines became apparent. As many librarians argued correctly : a good
(even if it is general) understanding of how search engines work is crucial
for search engines to be fully accepted as information retrieval tools. I hope
that sharing my experience as a user of several AltaVista products will add
to the understanding of how AltaVista's
ranking algorithm works.
The search engines
have good reasons to not disclose detailed information on their inner workings.
One reason is the ongoing war with keyword spammers. Keyword spamming or spamdexing is generally disapproved of
by most serious site promoters. At the same time the thread in Web4Lib
started, another thread was initiated in Online
Advertising, a discussion group for site marketeers and promoters. (This
discussion has been summarized by Danny Sullivan as the "Searchengines
are dead discussion"). Until recently, submitting your site to search
engines was considered a valuable way to direct traffic to the site. Now, the
competition among various site-promotion companies seems to have become so intense
that search engine traffic is no longer considered the most efficient way to
do promotion. You have to work very hard to get your site into the Top Ten (default
number of displayed results in many engines) and the next day there will be
somebody else taking your place. It's no longer worth the effort: ROI (Return
On Investment, which seems to be a favorite term amongst promoters) has become too
small. While many site promoters do not have a library background, it is sort
of interesting to see that their main daily concern is with the information
retrieval Siamese twins : indexing and retrieval, traditional librarian core
competences. Another reason why search engines are not very forthcoming on the
subject of ranking is that it cannot be readily explained without introducing
a host of concepts that are probably not familiar to most of the millions of
users.
As is probably well known, the index of any internet search engine is built by spidering documents on the internet and indexing them. AltaVista used to provide information on what "indexing" means :
"AltaVista treats every page on the Web and every article of Usenet news as a sequence of words. A word in this context means any string of letters and digits delimited either by punctuation and other non-alphabetic characters (for example, &, %, $, /, #, _, ~), or by white space (spaces, tabs, line ends, start of document, end of document). To be a word, a string of alphanumerics does not have to be spelled correctly or be found in any dictionary. All that is required is that someone typed it as a single word in a Web page or Usenet news article. Thus, the following are words if they appear delimited in a document HAL5000, Gorbachevnik, 602e21, www, http, EasierSaidThenDone, etc. The following are all considered to be two words because the internal punctuation separates them: don't, digital.com, x-y, AT&T, 3.14159, U.S., All'sFairInLoveAndWar."
This is so simple, that it needs some time to sink in.
Look how comfortably and mindlessly the indexing algorithm eats its way through documents
from head to toe. The only question it ever, ever asks : is this character I'm
reading a string boundary? Yes or no is decided by a lookup of the character
against a table that contains all characters that are known as "delimiting
characters" (non-alphabetics, spaces, tabs, line-ends etc.). If the answer
to the question is yes, a new "word" (or a new occurrence of a "word")
is added to the index. If the answer is no, the indexing algorithm reads the
next character and asks the same question.
What holds true
for indexing inevitably holds true for retrieval too. A URL that appears as
text in a page, say http://www.dma.be/p/amphion/brakke-h/, is not one string,
but 8 different "words" : http, www, dma, be, p, amphion, brakke,
h. As AltaVista indicates, the index that results from all this will be full
of all kinds of nonsense-strings such as "602e21". A slight rewording
of AltaVista's statement should go on a three-by-five near the PC you are doing
your searches on : "All that is required is that some moron typed it as
a single word in a Web Page or Usenet article." Try the most exotic spelling
mistake: if someone ever made it (on the web) you will find it (if it has been
spidered and indexed that is). As librarians know, this is not necessarily a
bad feature. The presence of nonsense strings can also be of great help. Try
for instance the phrase "All'sFairInLoveAndWar" to find copies of
the AltaVista help files. The reverse holds true too : valuable information
might be lost for retrieval because of spelling mistakes. But that's an old
one.
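The character-by-character scan described above can be sketched in a few lines of Python. This is only an illustration of the rule as AltaVista's help text states it, not AltaVista's code: the function name `split_into_words` is made up for the example, and treating every non-alphanumeric character as a delimiter (via `str.isalnum`) is an assumption based on the quoted definition.

```python
def split_into_words(text):
    """Split text into 'words' the way the quoted help text describes.

    Assumption: a character is part of a word only if it is a letter
    or a digit; everything else (punctuation, _, white space) is a boundary.
    """
    words = []
    current = []
    for ch in text:
        if ch.isalnum():
            # Not a boundary: keep reading the current string.
            current.append(ch)
        elif current:
            # Boundary reached: the string read so far becomes a "word".
            words.append("".join(current))
            current = []
    if current:
        # End of document also closes the last word.
        words.append("".join(current))
    return words


print(split_into_words("http://www.dma.be/p/amphion/brakke-h/"))
print(split_into_words("don't AT&T 3.14159 HAL5000"))
```

Run on the URL from the text, the sketch returns the same eight strings that were counted by hand above.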
One of the nice things about AltaVista is that you can
always count the occurrences of a term in the index. If a keyword occurs too
many times in the index, the search itself will be "ignored" but the
number of times the word appears in the index will still be adequately counted.
Librarians tend to call these too frequently occurring keywords "stop words".
Stop words they are, but in the old days the status of stop word was assigned
manually. This is no longer the case in the huge full text databases that are
generated by spidering documents on the internet.
Now some arbitrary
value is used as cut-off. As the database grows, more and more words will have
a number of occurrences that is above this value. The important thing here is
that the old static concept of "stop words" has become dynamic. What
is a valid term today can be a stop word tomorrow.
As I will come to explain later, some insight in the distribution
of terms in the index is important in order to understand what happens to your
searches. Because AltaVista is a full text index, terms in the index will be
roughly distributed according to Zipf's law.
Materials for
an Information
Retrieval course (note 06/2002: the course by Mr. Allan is still there but
I could not find the WSJ and TIME data anymore) at the University of Massachusetts,
Amherst, include examples of the most frequent terms in a full text database
of TIME articles and a database of Wall Street Journal articles. I took these
terms and counted them in AltaVista. Since I cannot know which terms are in
AltaVista and not in the list of most frequent words of the TIME or WSJ databases,
the results of this count do not constitute a Top 40 of most frequent AltaVista
terms. It's only an approximation of what a Top 40 might look like.
| Word | Occ. | Word | Occ. | Word | Occ. | Word | Occ. |
|---|---|---|---|---|---|---|---|
| the | 1,364 mn | it | 151 mn | new | 64 mn | our | 40 mn |
| of | 771 mn | be | 151 mn | u | 58 mn | who | 37 mn |
| and | 711 mn | this | 146 mn | one | 57 mn | out | 37 mn |
| to | 662 mn | as | 131 mn | he | 57 mn | when | 36 mn |
| a | 645 mn | are | 129 mn | but | 56 mn | search | 34 mn |
| in | 474 mn | at | 128 mn | has | 54 mn | been | 33 mn |
| for | 298 mn | from | 118 mn | which | 53 mn | would | 33 mn |
| s | 269 mn | an | 89 mn | about | 51 mn | date | 30 mn |
| is | 265 mn | was | 88 mn | they | 51 mn | its | 30 mn |
| an | 203 mn | not | 87 mn | more | 50 mn | had | 29 mn |
| with | 161 mn | have | 86 mn | up | 45 mn | internet | 29 mn |
| by | 156 mn | all | 85 mn | their | 43 mn | into | 26 mn |
| or | 154 mn | week | 67 mn | his | 42 mn | | |
Note how small these TIME and WSJ databases are compared to the AltaVista index. If we take the number of times "the" occurs as an indication of size :
Conversely, a rough estimate on the size of the AltaVista
index could be derived. The WSJ corpus consists of 46,449 newspaper articles,
with 19 million term occurrences, which means that "the" occurs on
average once every 17 words. If the same ratio applied to the AltaVista
index, total index size might be estimated at 23,000 million occurrences. While
this is certainly impressive, I do not believe that it represents more than
half of what is actually out on the web. The fact that most internet indexes
contain documents in various languages is of course a flaw in the "the"-argument
presented here. Reliable data or even reliable estimates on how many documents there are on
the web are missing. A hundred million documents was the latest figure I read, but no
authority was given. Data on search engine indexing policies are (except for
various claims to be "the biggest" index) missing as well. In fact
"completeness" has for some time now been almost a non-issue with
most of the major search engines. It has been a while since I last saw a search
engine claim to index THE internet.
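The size estimate above is simple arithmetic and can be retraced as follows. The two inputs come straight from the text (the count for "the" in the frequency table and the one-in-17 ratio from the WSJ corpus); the variable names are invented for the example, and the assumption that the WSJ ratio carries over to a multilingual web index is exactly the weak point already mentioned.

```python
# Occurrences of "the" in the AltaVista index, from the frequency table (1,364 mn).
THE_IN_ALTAVISTA = 1_364_000_000

# In the WSJ corpus "the" occurs on average once every 17 words.
WORDS_PER_THE = 17

# Assumption: the same ratio holds for the AltaVista index.
estimated_index_size = THE_IN_ALTAVISTA * WORDS_PER_THE

# For reference: the number of "the" occurrences the ratio implies for the
# 19 million term WSJ corpus (derived here, not reported in the source).
implied_the_in_wsj = 19_000_000 / WORDS_PER_THE

print(f"{estimated_index_size / 1_000_000:,.0f} million occurrences")
print(f"implied 'the' count in WSJ: about {implied_the_in_wsj:,.0f}")
```

The product comes to 23,188 million, which is the "23,000 million" quoted above after rounding.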
The words that
were most frequent in the TIME full text database and in the Wall Street Journal
example are distributed similarly in the AltaVista index. There are numerous
exceptions, some due to "database vernacular" such as "britain"
and "govern" in the TIME database, "million", "company"
and "market" in the WSJ corpus or "internet" in the AltaVista
index. So even though I cannot say very much about the total size of an internet
index such as AltaVista's, I know how strings or words are distributed in the
index. This will be important later on when we will consider the role "weighting"
plays in the retrieval process.
The important thing
is that Zipf makes for a very skewed distribution. Jakob
Nielsen has a very interesting column on web site popularity and Zipf distribution,
part of which I use here with his permission to clarify what is at stake :
[Two diagrams from Nielsen's column: the same distribution plotted with linear scales on both axes and with logarithmic scales on both axes]
A simple description of data that follow a Zipf distribution is that they have :
- a few elements that score very high (the left tail in the diagrams)
- a medium number of elements with middle-of-the-road scores (the middle part of the diagram)
- a huge number of elements that score very low (the right tail in the diagram)
Zipf distributions have been shown to characterize use of words in a natural language (like English) and the popularity of library books, so typically
- a language has a few words ("the", "and", etc.) that are used extremely often, and a library has a few books that everybody wants to borrow (current bestsellers)
- a language has quite a lot of words ("dog", "house", etc.) that are used relatively much, and a library has a good number of books that many people want to borrow (crime novels and such)
- a language has an abundance of words ("Zipf", "double-logarithmic", etc.) that are almost never used, and a library has piles and piles of books that are only checked out every few years (reference manuals for Apple II word processors, etc.)
End of quote. In terms of our index this means that the very few words that occur extremely often are the "stop words", which are not allowed in searches (but whose occurrences in the index can be counted), while "the abundance of words that are almost never used" may serve as interesting keywords for searches that do not give too much of a headache.
AltaVista ranks the results of a search on criteria that - according to a help file that came with the Personal Extensions program - include these :
- Whether the words or phrases are found in the first few lines of the document (for example, in the title of a web page).
- The frequency of occurrence of a query word or phrase. Rare words in a query are weighted more heavily than common words (rarity is determined by the number of occurrences of the word in the index).
- Whether all of the specified words or phrases appear in a document. A document containing all three words specified in a three-word query would rank higher than a document containing only two or one of the words.
- Whether multiple query words or phrases are found close to each other in a document.
These are general principles that hold true for almost
any indexing and retrieval program. But what does it mean exactly? For instance,
the third criterion seems very generous but isn't it just good practice that
your search on "education AND 'distance learning' AND resources" ranks
results higher if they match 3 of your terms instead of just 2?
When I reviewed
the freely downloadable demo package of AltaVista's Personal Extensions, I
noticed that it had an extra search interface to be used when the default search
interface failed to install properly (find and double-click a file named pav_gui.exe
after you have installed the program). This interface actually gave ranking scores
for each document that was retrieved :

Screen shot of the results window with ranking scores
A few months ago AltaVista opened up a Belgian
branch (note 06/2002: this service no longer functions, the new service
at http://altavista.advalvas.be/av2/nl/default.asp
simply reroutes queries to the regular AltaVista site that does not return ranking
scores) which also gives ranking scores.
Ranking scores
like these are very valuable information because if you have the patience to
go and look in the retrieved documents and count the occurrences of your keyword
and look at where the hits are, you can learn a lot about how ranking is done.
In the next section I will focus on single keyword searching in order to describe
the basics of ranking. After that I will theorise on what happens in complex
searches.
One of the first surprises I had was that single keyword
searches constantly returned "groups" of results. In the case of the
simple search on "page" with the Personal AltaVista (screenshot
above) the results had only three different ranking scores (765; 229; 153)
for all 15 documents that were retrieved. The same holds true if you do a single
keyword search on the Belgian branch, even though the scores might be a little
"fuzzy". For instance a search on "flaubert" returns four
groups : 875-870; 700-696; 262-261; 175 which in my experience is the maximum
number of different ranking scores for any single keyword search. (We will look
at these different groups in more detail further down.)
Number crunchers
will have noticed already, but for a poor mathematician like me it took quite
some fiddling around before I saw it. Once you've seen it though you will never
again NOT see it. The ranking scores are numerically related as :
5 - 4 - 1.5 - 1
Since the lowest score in the row equals 1 in the numerical
relation I will call this score the index-score or index-weight. The three other
scores can always be found by multiplying the index-score :
875 = 5 x 175
700 = 4 x 175
262 ≈ 1.5 x 175
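The multiplication can be checked against both searches mentioned so far. The sketch below is only a way of verifying the arithmetic: `score_levels` is a name invented for the example, and dropping the fraction with `math.floor` is an assumption suggested by the displayed scores (262 rather than 262.5, and 229 for the "page" search). The slightly "fuzzy" scores of the Belgian service (870, 696, 261) are not modelled.

```python
import math

# The four multiples observed in the ranking scores.
MULTIPLIERS = (5, 4, 1.5, 1)


def score_levels(index_score):
    """Return the score levels derived from an index-score.

    Assumption: fractional results are truncated, as the displayed
    scores 262 and 229 suggest.
    """
    return [math.floor(m * index_score) for m in MULTIPLIERS]


print(score_levels(175))  # index-score of "flaubert"
print(score_levels(153))  # index-score of "page" in Personal AltaVista
```

For 175 this reproduces 875, 700, 262 and 175. For 153 it yields 765, 229 and 153, the three scores seen in the "page" search, plus a x4 level that did not occur among those 15 documents.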
I call this score the index score because it is computed - as AltaVista indicates
in the help file above - as a function of the number of occurrences ("rarity") in the index.
I do not know how that computation is done, but we get an idea of what goes
on if we look at the following table in which index-scores of some keywords
are matched with their occurrences in the global index :
| Word | Number of occurrences | Index score |
|---|---|---|
| where | 19,001,167 | 2/3 |
| bookmark | 1,388,021 | 64 |
| edison | 255,634 | 102 |
| flaubert | 11,228 | 175 |
| monotheistic | 5,317 | 192 |
| zipf | 3,492 | 202 |
| rogiers | 678 | 242 |
| logaritmic | 155 | 275 |
| flipsies | 1 | 356 |
I think the general rule is pretty obvious and logical
: the more occurrences in the index, the smaller the weight becomes. The underlying
assumption is plain old information theory basics : the greater the probability
that a word (string, character, event etc.) will occur, the less information
it carries.
The main problem
to overcome here is of course the Zipf distribution. What is a good measure
to compute index weights? The enormous differences between the number of times
any given term appears in the index should not and cannot give rise to the same
order of difference in the weights because that would allow one term to dominate
other terms in a way that would make any serious information retrieval impossible.
As said before I do not know how AltaVista computes the weights, and tastes
differ, but I've always thought they're doing a pretty good job at it. I guess
you could find out the function AltaVista uses if you plot enough coordinates
(occurrence-weight couples). Also, there's room for some tweaking here : different
functions will give different weights which will be good for different kinds
of users.
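As a starting point for that plotting exercise, the occurrence-weight couples from the table can be put on a logarithmic scale. This is a minimal sketch, not a reconstruction of AltaVista's function: the choice of a base-10 logarithm, the helper names `log_points` and `slopes`, and the idea of looking at slopes between neighbouring points are all assumptions made for illustration. "where" is left out because its score is only known to lie between 2 and 3.

```python
import math

# (keyword, occurrences in the index, index score), copied from the table.
OBSERVATIONS = [
    ("bookmark", 1_388_021, 64),
    ("edison", 255_634, 102),
    ("flaubert", 11_228, 175),
    ("monotheistic", 5_317, 192),
    ("zipf", 3_492, 202),
    ("rogiers", 678, 242),
    ("logaritmic", 155, 275),
    ("flipsies", 1, 356),
]


def log_points(observations):
    """Return (log10(occurrences), score) pairs, sorted from rare to common."""
    return sorted((math.log10(count), score) for _, count, score in observations)


def slopes(points):
    """Score change per tenfold increase in occurrences between neighbours."""
    return [
        (s2 - s1) / (x2 - x1)
        for (x1, s1), (x2, s2) in zip(points, points[1:])
    ]


points = log_points(OBSERVATIONS)
for (x, score), slope in zip(points[1:], slopes(points)):
    print(f"log10(occurrences) = {x:5.2f}   score = {score:3d}   slope = {slope:7.1f}")
```

If the slopes come out roughly constant, a logarithmic weighting would be a plausible candidate; with only eight points that remains a hint rather than a finding.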
In the table above
I have only tried to give an idea of the boundaries of the system. The weight
of a common word such as "where" is very low (it should be somewhere
between 2 and 3, hence the 2/3 notation). It was the closest I could find before
a word becomes a "stop word". The 0 weight is where the cut-off value
for stop words is. Note that this feature can be used to do a rough simulation
of NLP (Natural Language Processing) of queries. A natural language question
such as Where do I find Bill Gates on the net? will yield results similar
to just typing Bill Gates (NOT as a phrase) because most of the words
in the sentence are stop words and will not be used to perform the search anyway.
Whether the results will be useful is another matter.
"Flipsies"
on the other hand occurs only once in the index and gets 356. This seems to
be the maximum weight a word can aspire to. Now, "flipsies" is clearly
a rather exotic case. I do not think that there are many single words that occur
only once in the index, but phrases are treated as single keywords too.
Back to the numerical relation. What causes a document
to be ranked with the index-score, or with 4 or 5 times the index score? In
order to find out we will look at some pages of results. Following popular format
each table lists result numbers, titles, URLs, description and score.
Let's look at
the first ten URLs that are retrieved by the search on "Flaubert",
one of my favourite writers:
855 documents found on keyword flaubert, showing page 1-10

| # | Title | URL | Description | Score |
|---|---|---|---|---|
| 1 | Flaubert on creativity | http://sunsite.unc.edu/ibic/CPB/msg00085.html | Sponsored this month by: How you can sponsor these pages. [Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Flaubert on... | 875 |
| 2 | Index of /faces/local/us/in/bloomington/flaubert | http://www.cs.indiana.edu/faces/local/us/in/bloomington/flaubert/ | Index of /faces/local/us/in/bloomington/flaubert. Name Last modified Size Description. Parent Directory 14-Feb-95 04:31 - lisa/ 31-Mar-95 13:58 - | 875 |
| 3 | Penguin - Three Tales by Gustave Flaubert | http://www.futurenet.co.uk/Penguin/Books/0140441069.html | Three Tales. by Gustave Flaubert Date published: 1/2/61. TRANSLATED WITH AN INTRODUCTION BY ROBERT BALDICK. Twenty years after Madame Bovary, Flaubert... | 875 |
| 4 | COM L 411 The Short Novel from Flaubert and James to the Present (Courses of | http://www.cornell.edu/Academic/Courses97/csas/as485.html | COLLEGE OF ARTS AND SCIENCES COMPARATIVE LITERATURE 1997-98 COURSE DESCRIPTIONS. COM L 411 The Short Novel from Flaubert and James to the Present. Spring.. | 875 |
| 5 | Flaubert : L'éducation sentimentale | http://www.alexandrie.com/alex2/pagealex/litterat/roman/flaubert/educsent/tdm.html | Gustave Flaubert L'éducation sentimentale. ALEXANDRIE - La Bibliothèque Virtuelle. TABLE DES MATIERES. Première partie. Chapitre... | 875 |
| 6 | Gustave Flaubert: Lust | http://www.anesi.com/q0012.htm | At other times, seared by that hidden fire which her adultery kept feeding, consumed with longing, feverish with desire, she would open her window, inhale. | 875 |
| 7 | Index of /faces/users/us/in/bloomington/flaubert/lisa | http://www.cs.indiana.edu/faces/users/us/in/bloomington/flaubert/lisa/ | Index of /faces/users/us/in/bloomington/flaubert/lisa. Name Last modified Size Description. Parent Directory 21-Dec-92 12:18 - face.gif 31-Mar-95 13:38 1k. | 875 |
| 8 | LE DICTIONNAIRE DES IDEES RECUES DE FLAUBERT | http://l3av01.univ-lille3.fr/www/PUS/TDM_HERSCHBERG.HTML | LE DICTIONNAIRE DES IDEES RECUES DE FLAUBERT par Anne HERSCHBERG PIERROT. Avant-Propos. INTRODUCTION. L'Opinion et les majorités. Langage de la bêtise.... | 875 |
| 9 | Normandie Web : Gustave Flaubert | http://www.normandie.fr.eu.org/culture/litterature/flaubert/flaubert.html | Gustave Flaubert | 875 |
| 10 | Flaubert's Critics | http://www.wtamu.edu/academic/finearts/english/critics.htm | Flaubert's Critics. "Too frequent in studying a great work we only end where in fact we should have begun: by examine directly our impressions as we read.. | 875 |
It is pretty obvious that all the retrieved results have the keyword in the title. These documents are ranked highest. For most simple searches the result of a search roughly equals a good old title search in the library catalogue. Let's have a look at where ranking scores begin to fall :
documents found on keyword flaubert; showing page 71-80

| # | Title | URL | Description | Score |
|---|---|---|---|---|
| 71 | Gustave Flaubert (1821-1880) : Amazon Book List | http://www.mala.bc.ca/~mcneil/list/citamaflaub.htm | Gustave Flaubert (1821-1880) : Amazon Book List. \| Advanced LC Search \| Advanced Amazon Search \| 56 items shown. Click on title for more details and... | 870 |
| 72 | newyork.sidewalk: Chère Maître: The Flaubert-Sand Correspondence | http://newyork.sidewalk.com/detail/32138 | restaurants events arts & music places to... | 870 |
| 73 | Index of /picons/db/users/us/in/bloomington/flaubert/lisa | http://www.cs.indiana.edu/picons/db/users/us/in/bloomington/flaubert/lisa/ | Index of /picons/db/users/us/in/bloomington/flaubert/lisa. Name Last modified Size Description. Parent Directory 21-Dec-92 12:18 - face.gif 31-Mar-95... | 700 |
| 74 | Yahoo! France - Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert, | http://www.yahoo.fr/Art_et_culture/Litterature/Genres/Romans/Romanciers/Flaubert__Gustave__1821_1880_/ | Index:Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert, Gustave (1821-1880) Options. Bouvard et Pécuchet. Coeur simple, Un. Éducation... | 700 |
| 75 | Index of /l/www/faces/local/us/in/bloomington/flaubert | http://www.cs.indiana.edu/l/www/faces/local/us/in/bloomington/flaubert/ | Index of /l/www/faces/local/us/in/bloomington/flaubert. Name Last modified Size Description. Parent Directory 14-Feb-95 04:31 - lisa/ 31-Mar-95 13:58 - | 700 |
| 76 | Index of /picons/db/local/us/in/bloomington/flaubert. Name Last modified Size | http://www.cs.indiana.edu/picons/db/local/us/in/bloomington/flaubert/ | Index of /picons/db/local/us/in/bloomington/flaubert. Name Last modified Size Description. Parent Directory 14-Feb-95 04:31 - lisa/ 31-Mar-95 13:58 - | 700 |
| 77 | Penguin - The Dictionary of Received Ideas by Gustave Flaubert | http://www.futurenet.co.uk/Penguin/Books/0140389040.html | The Dictionary of Received Ideas. Preface by Julian Barnes. by Gustave Flaubert Preface by Julian Barnes Date published: 14/11/94. Lake: Always have a... | 700 |
| 78 | Index of /picons/db/local/us/in/bloomington/flaubert/lisa | http://www.cs.indiana.edu/picons/db/local/us/in/bloomington/flaubert/lisa/ | Index of /picons/db/local/us/in/bloomington/flaubert/lisa. Name Last modified Size Description. Parent Directory 14-Feb-95 04:31 - face.gif 31-Mar-95... | 696 |
| 79 | Palabras y vacío: Lenguaje y tópico en Gustave Flaubert - nº 4 Espéculo | http://www.ucm.es/OTROS/especulo/numero4/g_flaub.htm | Palabras y vacío. Lenguaje y tópico en la obra de Gustave Flaubert. Joaquín Mª Aguirre Romero Dpto. Filología Española III (CC Información) Universidad... | 696 |
| 80 | Index of /picons/db/users/us/in/bloomington/flaubert | http://www.cs.indiana.edu/picons/db/users/us/in/bloomington/flaubert/ | Index of /picons/db/users/us/in/bloomington/flaubert. Name Last modified Size Description. Parent Directory 21-Dec-92 12:18 - lisa/ 31-Mar-95 13:38 - | 696 |
The first thing to note is that, at ten results per page, this is already the eighth page of results. As site promoters (and librarians, I should add) know, a page that deep is beyond most users. Several interesting things happen here. While 870 is still the highest score (x 5), from the third URL onwards the score drops to the next level (x 4), first 700 and then 696. We still have our keyword in the title, but it has moved after the eighth word position and this seems to be the criterion for dropping the score. You would expect that a rule that relates scores and positions within a document would be measured in characters rather than words, but that is not the case. If you go back to the AltaVista explanation of how the indexing algorithm works, you can guess why : the indexing algorithm only knows "strings". For instance, the title of the 76th result :
Index of /picons/db/local/us/in/bloomington/flaubert. Name Last modified Size
Since the / is a non-alphabetic character, it is considered a word boundary and our keyword "flaubert" occupies position number nine. This makes the difference with the almost identical result we encountered on the first page of results which had
Index of /faces/local/us/in/bloomington/flaubert
for title. Almost identical. Almost, but "flaubert"
occupies position number eight : just enough to make it to the top. There are
more files like these from the same server. They contain nothing but a directory
overview of a very small directory, the keyword "flaubert" occurs once
in the body of the text because the text starts with the same words as the title
(Index of /picons/db/local/us/in/bloomington/flaubert/lisa).
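The word positions counted above can be reproduced mechanically. The sketch below is an illustration under stated assumptions: `title_position` and `in_top_group` are names invented for the example, words are split on every non-alphanumeric character as the help text describes, and matching is made case-insensitive by lowercasing, which the source does not spell out. The limit of eight comes from the observations in the result tables, not from any AltaVista documentation.

```python
import re

# Cut-off suggested by the result pages: the keyword must be among
# the first eight words of the title to get the x5 score.
TOP_TITLE_WORDS = 8


def title_position(title, keyword):
    """Return the 1-based word position of keyword in title, or None."""
    # [^\W_]+ matches runs of letters and digits; everything else delimits.
    words = re.findall(r"[^\W_]+", title.lower())
    keyword = keyword.lower()
    return words.index(keyword) + 1 if keyword in words else None


def in_top_group(title, keyword):
    """True if the keyword falls within the first TOP_TITLE_WORDS words."""
    position = title_position(title, keyword)
    return position is not None and position <= TOP_TITLE_WORDS


first_page = "Index of /faces/local/us/in/bloomington/flaubert"
result_76 = "Index of /picons/db/local/us/in/bloomington/flaubert. Name Last modified Size"

print(title_position(first_page, "flaubert"), in_top_group(first_page, "flaubert"))
print(title_position(result_76, "flaubert"), in_top_group(result_76, "flaubert"))
```

The first title puts "flaubert" at position eight and the second at position nine, which matches the difference between the 875 and 700 scores.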
These files shouldn't
have been indexed to begin with (the read property of the directory was probably
set to "all users" so that the spider robot read and indexed it as
any other file). The name of the directory has changed over time and each time
it has been indexed by the spider. Apparently, this kind of near-duplicates
cannot be easily (automatically that is) removed from the index. They constitute
some of the inevitable noise that is retrieved first by the indexing mechanism
and later by searches. Note however, that the "noise" in this case
can easily make it to the top 10.
Why only the first
8 words of a title should be considered important is an interesting question
but cannot be answered. My guess would be that this is largely an anti-spamdexing
measure. Lots of people in the business of unethical site promotion know from
experience that nothing can beat a word that appears in the title of an HTML
file. The result was that some began to use multiple TITLE-tags. Others began
to write extremely long titles with lots of repeated keywords. In addition to
spamdexing there are genuine mistakes. Some people may forget to close the TITLE-tag.
Since documents cannot be relied on to provide a bulletproof boundary, it is
very likely that the spidering robot has some simple rule to decide where a
title begins and ends : index and count eight strings beginning from <TITLE>,
then count and index x more as 'rest of the title', after that index everything
as if it was part of the document body. Of course there might be plenty of other
reasons why files are indexed the way they are.
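The guessed rule can be written down as a small sketch, if only to make clear what it would and would not do. Everything here is hypothetical: `assign_zones` is an invented name, the zone labels are made up, and since the value of x is unknown it is passed in as the required argument `rest_words`. The sketch covers only the case the guess is about, a title that is unclosed or runs on; how a properly closed short title would be handled is not addressed.

```python
# Number of words counted as the top part of the title (from the observed cut-off).
HEAD_WORDS = 8


def assign_zones(words, rest_words, head_words=HEAD_WORDS):
    """Label each word, counted from <TITLE>, with a hypothetical zone.

    words      -- the strings found from the opening TITLE-tag onwards
    rest_words -- the unknown "x": how many further words still count as title
    """
    zones = []
    for position, word in enumerate(words, start=1):
        if position <= head_words:
            zone = "title-head"
        elif position <= head_words + rest_words:
            zone = "title-rest"
        else:
            zone = "body"
        zones.append((word, zone))
    return zones


# A spamdexed title repeating one keyword 14 times; rest_words=4 is arbitrary.
spam_title = ["flaubert"] * 14
labels = [zone for _, zone in assign_zones(spam_title, rest_words=4)]
print(labels.count("title-head"), labels.count("title-rest"), labels.count("body"))
```

Under such a rule, repeating a keyword beyond the counted words gains nothing more than an ordinary occurrence in the body would, which is what an anti-spamdexing measure would be after.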
Let's continue our journey through the "flaubert" search.
| 855 documents found on keyword flaubert | showing page 81-90 | |
| 81 | Penguin - The Temptation of Saint Anthony by Gustave Flaubert - http://www.futurenet.co.uk/Penguin/Books/0140444106.html The Temptation of Saint Anthony. by Gustave Flaubert Date published: 31/3/83. Price: £6.99 (Paperback) ISBN: 0140444106. Order this book. Click here.. score 696 |
| 82 | Yahoo! France - Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert, - http://www.yahoo.fr/Art_et_culture/Litterature/Genres/Romans/Romanciers/Flaubert__Gustave__1821_1880_ Index:Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert, Gustave (1821-1880) Options. Bouvard et Pécuchet. Coeur simple, Un.... score 696 |
| 83 | Yahoo! France - Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert, - http://www.yahoo.fr/Art_et_culture/Litterature/Genres/Romans/Romanciers/Flaubert__Gustave Qwam: l'intelligence de l'économie. Index:Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert, Gustave. Options. Bouvard et Pécuchet.... score 696 |
| 84 | Maxime Du Camp - http://www.med.univ-rennes1.fr/bfe/ducamp.htm Extrait de "Souvenirs littéraires" de Maxime Du Camp (écrivain, ami de Gustave Flaubert) "Au mois de janvier 1844, Gustave... score 262 |
| 85 | Loreart & BUCHLUST - http://www.loreart.com/lothar.htm Buchlust Medienwelt Café Central Shopping Passage Stellenmarkt Anzeigenmarkt Home. KOLLEGENGESPRÄCHE und andere... score 262 |
| 86 | Penguin - Three Classic Romantic Stories by Charlotte Bront, Emily Bront and - http://www.futurenet.co.uk/Penguin/Books/0140860983.html Three Classic Romantic Stories. by Charlotte Bront, Emily Bront and Gustave Flaubert Date published: 3/11/94. (PEN 99) 9hrs B. Three classic romantic... score 262 |
| 87 | Magasins Sud Est - http://www.kiabi.fr/sp/magsudes.htm REGION SURESTE. TIENDAS. DIRECCION. Nº TELEFONO. Horario comercial. CLERMONT. Boulevard Gustave Flaubert. 63000 - CLERMONT FERRAND. Telf :... score 262 |
| 88 | Aschehoug forlag - http://www.aschehoug.no/aschehoug/host97/boker/076.html GUSTAVE FLAUBERT Frédéric Moreau En ung manns historie. Flauberts romanhelt Frédéric er en passiv, lettbevegelig... score 262 |
| 89 | Records for Madame Bovary : a story of provincial life. (in MARION) - http://utcat.library.utoronto.ca:8002/MARION/+MADAME%20BOVARY/5a8ed2004100/0 Madame Bovary : a story of provincial life. Records 1 to 1 of 1. Flaubert, Gustave, 1821-1880.Madame Bovary : a story of provincial life / Gustave... score 262 |
| 90 | Recherches 3e - http://bleue.ac-aix-marseille.fr/bleue/francais/mrs19e/flaubert.htm Flaubert, l'initiation amoureuse et le rêve oriental. Retour sommaire Retour page précédente. En 1840, Flaubert est reçu au... score 262 |
From the 84th result onward the score drops considerably.
The keyword no longer appears in the title of the HTML file, but it does appear
in the body text of the files. Here came my second big surprise. I had
always assumed that the number of occurrences of a keyword in a document would
be important. It is not, or rather it is not in the way I had expected it to
be. Opening documents and counting occurrences with the Find command of a browser
shows that at this level (x 1.5) there can be any number of occurrences
of the keyword. As long as there are at least two, the score will remain the
same. Again, from an information retrieval point of view this may appear very
strange, but in the light of spamdexing it is not. A common spamdexing technique
is to repeat keywords over and over in the body of the text. This is done either
shamelessly visibly, or invisibly by writing in a white font on a white background.
A bright idea, but (at least in AltaVista's case) not a very efficient one:
mere repetition is not very highly rewarded.
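The way repetition stops paying off can be sketched in Python. This is a minimal illustration, not AltaVista's code: the function name body_score is made up, the weight of 175 is taken from the score of pages with a single occurrence in this search, and dropping the decimal is an assumption read off the observed scores (262 rather than 262.5).

```python
def body_score(weight, occurrences):
    """Score of a keyword that appears only in the body text."""
    if occurrences <= 0:
        return 0
    # One occurrence counts x 1; two or more all count x 1.5.
    multiplier = 1.5 if occurrences >= 2 else 1.0
    return int(weight * multiplier)  # assumed: the decimal is dropped

WEIGHT = 175  # assumed index weight of "flaubert" in this search

for count in (1, 2, 10, 500):
    print(count, "occurrence(s):", body_score(WEIGHT, count))
```

Under these assumptions a page that repeats the keyword five hundred times ends up with exactly the same score as a page that mentions it twice.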
Just one more
page of results :
| 855 documents found on keyword flaubert | showing page 171-180 | |
| 171 | Search Results - http://www.booksmith.com/bin/search.cgi/author=Flaubert,%20G/874616992091 score 261 |
| 172 | Søkeresultat: tittelnr. 24732 - http://www.of.fylkesbibl.no/cgi-bin/bibliofil/x_base=data/x_frameOn=0/x_tabell=0/t_vis=24732.4241 Katalogopplysninger: HYLLEPLASS: 840.9 A FORFATTER: Amadou, Anne-Lisa, n., 1930- TITTEL: Omkring Marcel Proust : elleve franske romanstudier ANSVARLIGE:... score 261 |
| 173 | art_inserto1.html-"il manifesto" del 03-Luglio-1997 - http://mir10.mir.it/mani/insert/talpa/03-Luglio-1997/art_talpa1.html Uno scrittore bestiale. - JACQUELINE RISSET. "I CAPOLAVORI sono bêtes; hanno il volto tranquillo delle produzioni della natura, dei grandi animali e. score 261 |
| 174 | Re: Noweb and html. - http://www.uni-giessen.de/hrz/tex/more_info/info/mailarchiv/litprog.1995/msg00442.html Prev][Next][Index][Thread] Re: Noweb and html. Subject: Re: Noweb and html. From: norman@flaubert.bellcore.com (Norman Ramsey) Date: 29 Apr 1995 03:44:04.. score 261 |
| 175 | Re: Beginner's Guide? - http://www.uni-giessen.de/hrz/tex/more_info/info/mailarchiv/litprog.1995/msg00562.html Prev][Next][Index][Thread] Re: Beginner's Guide? Subject: Re: Beginner's Guide? From: norman@flaubert.bellcore.com (Norman Ramsey) Date: 18 May 1995... score 261 |
| 176 | Thornton's: Editions Pleiade - http://www.demon.co.uk/thorntons/pleiade.htm Editions Pleiade. Complete Set. Individual Volumes Also Available. Apollinaire, 1971. Balzac, 1962. Baudelaire, 1974. Camus, 1982. Carroll, 1990. Celine,.. score 175 |
| 177 | December quote - http://www.mindspring.com/~melscrib/decquote.htm Quote of the month: December. "...none of us can ever express the exact measure of his needs or his thoughts or his sorrows; and human speech is like. score 175 |
| 178 | CMLT C347 1014 Ideas in Literature - http://www.indiana.edu/~deanfac/blspr97/cmlt/cmlt_c347_1014.html Comparative Literature | Ideas in Literature C347 | 1014 | Johnston. Topic -- Love and Tears: Women Criminals and Saints How do saints, criminals and... score 175 |
| 179 | The Puzzle Factory - http://www.puzzlefactory.com/ Welcome to the Puzzle Factory. Who are we? Not sure yet but I'll keep you posted. Until then, I'll leave you with a couple of my favorite quotes. "Be... score 175 |
| 180 | WHOLESALE PRODUCTS FICTION AND LITERATURE BOOKSTORE [F-L] - http://www.wholesaleproducts.com/fictionbookstorefl.html WHOLESALE PRODUCTS FICTION AND LITERATURE BOOKSTORE [F-L] Orders fulfilled by Book Stacks Unlimited. Wholesale Products Bookstore. Pick from your favorite. score 175 |
Finally then, we have reached the index weight at result 176. I did not inspect the long tail of results beyond this page (877 documents found!). Also, most of the time AltaVista does not allow inspection beyond the 200th result either. From result 176 onwards, the keyword occurs only once in the body text of the document.
What would happen if we were interested in the relation between Flaubert and his travel companion, the writer and photographer, Maxime du Camp? In Simple Search mode we would do a search that would look like :
+flaubert +"maxime du camp"
Here the + is used to simulate Boolean AND. In Simple
Search mode ranking is executed automatically. Only Advanced Search mode supports
the use of the AND operator. In order to obtain the same results in Advanced
Search mode you would have to type flaubert AND "maxime du camp"
in the search box, and the keywords flaubert and "maxime du camp"
in the ranking box (without the Boolean AND).
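The translation from the Simple Search form to the two Advanced Search boxes can be sketched as follows. This is a hypothetical helper written for illustration only: the function name is made up, it handles nothing but +required terms, and it says nothing about how AltaVista parses queries internally.

```python
import shlex

def simple_to_advanced(simple_query):
    """Turn '+a +"b c"' into (search box text, ranking box text)."""
    terms = []
    # shlex keeps a quoted phrase together and removes the quotes.
    for token in shlex.split(simple_query):
        if not token.startswith("+"):
            raise ValueError("this sketch only handles +required terms")
        term = token[1:]
        # Put the quotes back around phrases of more than one word.
        terms.append(f'"{term}"' if " " in term else term)
    search_box = " AND ".join(terms)
    ranking_box = " ".join(terms)
    return search_box, ranking_box

search_box, ranking_box = simple_to_advanced('+flaubert +"maxime du camp"')
print("search box :", search_box)
print("ranking box:", ranking_box)
```

The search box decides which documents are retrieved, the ranking box decides in what order they are shown, which is why the same keywords have to be entered twice.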
Unfortunately,
the Belgian branch
does not support any searching more complex than single keyword and phrase searches.
So I first did a search on the internet and opened all the documents I could
open. Then I had the Personal AltaVista program index my browser cache. Of course,
the index resulting from my browser cache was very small and contained comparatively
(measured against total index size that is) many flaubert's and maxime du camp's.
Thus, the index weights I found were 165 (Flaubert) and 188 ("maxime du
camp"). Then I did the search described above (+flaubert +"maxime
du camp") on Personal AltaVista. Here is what I found :
Screenshot of ranking results for complex query
All documents contain at least 1 flaubert and at least 1 "maxime du camp" because of the +... +... formulation. The documents are ranked the same way as I found them ranked in the Simple Search on the internet. Up to 1187, the scores make sense. The lowest score seems to be computed as :
(165 x 1.5) + (188 x 1) = 435.5
The decimal is dropped and 435 is left over. The second score seems to be :
(165 x 1) + (188 x 1.5) = 447
And so on. It does not take long to see the metric that is applied here. The individual scores are computed in exactly the same way as in single keyword queries, and are then added up. Opening the documents and counting the occurrences confirms that all documents with a score of 435 have multiple flaubert's and only one maxime du camp. The document with score 1187 (in fact 1187.5) has one maxime du camp in the title and several flaubert's in the body of the text, hence :
(165 x 1.5) + (188 x 5) = 1187.5
This formula can be visualised in the following table :
| flaubert (weight 165, rows) + "maxime du camp" (weight 188, columns) | x 1 | x 1.5 | x 4 | x 5 |
| x 1 | 353 | 447 | 917 | 1105 |
| x 1.5 | 435.5 | 529.5 | 999.5 | 1187.5 |
| x 4 | 848 | 942 | 1412 | 1600 |
| x 5 | 1013 | 1107 | 1577 | 1765 |
Scores that I found in the results are rendered in red.
Not all combinations are found in a small search such as this one.
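The table can be recomputed with a short Python sketch. It simply applies the additive rule inferred above to the two index weights I measured (165 and 188); the function names are made up, and the rule itself remains my reading of the scores, not documented AltaVista behaviour.

```python
MULTIPLIERS = (1, 1.5, 4, 5)

def combined_score(weight_a, mult_a, weight_b, mult_b):
    """Add up the individual keyword scores of a two-keyword query."""
    return weight_a * mult_a + weight_b * mult_b

def score_table(weight_a, weight_b):
    """Rows follow the multiplier of keyword a, columns that of keyword b."""
    return [[combined_score(weight_a, mult_a, weight_b, mult_b)
             for mult_b in MULTIPLIERS]
            for mult_a in MULTIPLIERS]

table = score_table(165, 188)  # flaubert, "maxime du camp"
for mult_a, row in zip(MULTIPLIERS, table):
    print(f"x {mult_a:<4}", *[f"{score:>8g}" for score in row])
```

The values are kept with their decimal here; the scores shown by Personal AltaVista appear to drop it (435.5 is displayed as 435).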
Looking back to
what I said about Zipf, the importance of the weight conversion according to
"rarity" in the index becomes clear. If one keyword had 3,000 as weight
and the other 50, the keyword that had 50 would always be in the back somewhere,
no matter where it appeared in the document. So even if it was appearing in
the title, it would always be ranked at the far end of the results.
Note that this
way of processing results is elegantly simple and robust, but that it is also
very heavy on the processor. Every keyword that is added to the query potentially
adds a factor 4 to required processing power.
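One way to read the factor 4 is that every keyword can land in one of four multiplier classes, so the number of possible combinations is multiplied by four with each keyword added. The sketch below only counts those combinations; it is an illustration of that reading and makes no claim about the actual processor load or about how AltaVista organises the computation.

```python
from itertools import product

MULTIPLIERS = (1, 1.5, 4, 5)

def score_classes(weights):
    """Map every multiplier combination of a query to the score it yields."""
    return {
        combo: sum(weight * mult for weight, mult in zip(weights, combo))
        for combo in product(MULTIPLIERS, repeat=len(weights))
    }

for keywords in (1, 2, 3, 4):
    # The weight of 100 is arbitrary; only the number of combinations matters.
    classes = score_classes([100] * keywords)
    print(keywords, "keyword(s):", len(classes), "combinations")
```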
The scores from
1187 upwards are unclear. None of the documents in this segment have one of
the keywords in their titles, still they rank very highly. Higher in fact than
could be expected on the basis of the simple rule above. Grouping is done in
a similar way (depending on occurrences of keywords) but some additional criterion
seems to be used for computation. I cannot find what it is but it seems likely
that some distance metric is involved because only documents that have both
keywords close to each other (close being something like "in the same sentence")
are in the upper segment. However, also in the first section (up to 1187) there
are two documents with the keywords very near each other. So even though the
basis is clear, more research needs to be done. I would be delighted if someone
with experience in statistical information retrieval techniques would look into
this.
Besides some as yet unknown distance metric, there seem to be only four criteria for ranking scores : two positional ones, and two related to occurrences in the text. The positional criteria concern the words contained in the HTML TITLE-tags; the occurrence (or frequency) criteria concern the body text of the document. For clarity these criteria are summarized in the table below.
| POSITION | | OCCURRENCES IN TEXT | |
| first 8 words of the title | rest of title | 2 - unlimited occ. | 1 occ. |
| x 5 | x 4 | x 1.5 | x 1 |
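The four criteria can be put together in one small Python function. This is a sketch under several assumptions of mine: single-word keywords only, case-insensitive matching on whitespace-separated words, and the highest applicable multiplier wins when a keyword appears both in the title and in the body (my observations do not settle that last point). The function name is made up.

```python
def keyword_multiplier(keyword, title_words, body_words, first_zone=8):
    """Pick the multiplier for one keyword in one document."""
    keyword = keyword.lower()
    title = [word.lower() for word in title_words]
    body = [word.lower() for word in body_words]
    if keyword in title[:first_zone]:
        return 5      # among the first 8 words of the title
    if keyword in title[first_zone:]:
        return 4      # in the rest of the title
    hits = body.count(keyword)
    if hits >= 2:
        return 1.5    # two or more occurrences in the body
    if hits == 1:
        return 1      # a single occurrence in the body
    return 0          # keyword absent: the document is not retrieved

title = "Gustave Flaubert home page".split()
body = "Flaubert wrote Madame Bovary".split()
print(keyword_multiplier("flaubert", title, body))
print(keyword_multiplier("bovary", title, body))
```

Multiplying the result by the index weight of the keyword gives the individual score; for queries with several keywords the individual scores are added up.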
Implications for information retrieval
Some issues with regard to information retrieval are :
How relevant is relevance ranking?
There's room for a lot of philosophy here. I have to confess
that when I began to see how strongly retrieval relied on simple brute computing
power, I was disappointed and relieved at the same time. Disappointed because
what seemed such a great tool was again one of those really dumb computer tricks,
relieved for the same reason. But then again, given the size of the index,
most of the time a search engine does the job and AltaVista does it well. At
least, if you are looking for something rather specific. On the other hand,
even if you are not looking for something specific you will very likely find
something relevant on the first page of results (but miss a lot of potentially
relevant material further down). After all, it is this combination of characteristics
that has made AltaVista one of the popular engines for end users. For instance,
if you are looking for different tie knots, you might try a simple search on
the phrase : "tie knot*". Some noise will enter the results because
tie knots will be written the same (but have a different meaning) in
a sentence such as John knows how to tie knots. The first page of results
will yield an online tie shop that offers drawings of Windsor and Half-Windsor
knots. However, only at page four will you find the page of the venerable American
Neckwear Association which is unfortunately titled How to tie a tie,
but which contains the best graphics on how to really solve everyday tie-knot
problems.
In the end it
seems to be just the same old story over again : if you do not expect too much
it works great, if you want to get some work done it pays to get to know the
machine first.