Note (01/06/2002): The AltaVista piece on this page was written in October 1997 and first published
on the web in January 1998. Obviously, a lot has changed since then.
The most important evolution since that time is probably that Google came up with a better ranking technology that takes the webbed nature of the WWW into account. For those interested in "link analysis", the papers of Page and Brin are still around. Even though the details of the implementation of PageRank(TM) are not public, the way Google functions can be understood pretty well from the information that is available. I have very much the same feeling about Google as I used to have about AltaVista: given the task, it's an awesome machine. In general, searching has become much easier: since Google I do not do a lot of Boolean anymore.
Since most other search engines started to do some form of link analysis, the business of site promotion has responded by changing too. There is still a lot of blah blah about keywords and META-tags around, though. However: in order to be successful, keywords in the TITLE-tags (still all-important) now have to be combined with links from other sites. There are a lot of clever ways to do that, but the best way now (as ever) is simply to make a good web site, with useful content that a lot of people want to know about. So a lot has probably changed for the better.
One of the many minor changes since 1997 I came across while (at last!) updating the links: back then AltaVista still had http://altavista.digital.com as its URL. That address still works, but you will be taken to the regular www.altavista.com.
Two popular questions & answers
(don't ask them again):
1. No, I do not have the original AltaVista PX software anymore. I've been looking for it myself, but it seems to have disappeared from the web altogether. Amazing, but true. Once immensely popular, now gone. Contrary to what many people in the closing years of the last century said about the web, it is not a place where all kinds of information linger for eternity. It might be keeping itself up to date much better than some of us thought. If anyone can find the original software somewhere, I would be very glad to receive a copy. (I still have a Win '98 machine somewhere, and it might still function.)
2. Also no: I'm quite happy with my library job and not interested in a site promotion career.
© Copyright Dirk
All standard disclaimers apply; this article contains theory and conjecture and does not claim accuracy.
While everybody agrees that internet search engines are
valuable tools to bring some order to chaos, many people who are used to dealing
with information systems feel they cannot rely on search engines and their indexes,
since the companies that own them do not provide adequate information on how
documents are processed for indexing and retrieval. This article is expanded from
a software review I did for one of my courses in the Library
and Information Specialist program. Since the package under review was the
AltaVista Personal Extensions program, general notions in this text are exemplified
by referring to AltaVista.
After the introduction follows a section on indexing. General full text index characteristics, such as distribution of index terms according to Zipf's law, are mentioned to offer some understanding of what one might expect to find in an automatically generated full text index.
The next section tries to dispel some of the mystery that surrounds the ranking of search results in AltaVista. Based on extensive usage of several AltaVista products and on information from the AltaVista help files, general considerations are offered on the relation between retrieval weights and the frequency of terms in the index. After that, the basic ranking algorithm is explained in detail by going through the results of a sample single-keyword search step by step. Even though the same simple algorithm seems to be used to process queries with multiple keywords, some aspects of result ranking in complex searches remain unclear.
The article concludes with a summary of the findings and their impact on the usage of a search engine such as AltaVista and on the information that is retrieved. Anti-spamdexing measures taken by search engines sometimes seem to get in the way of retrieval effectiveness.
More indexing information
Full text indexing and Zipf
Single keyword queries
5 - 4 - 1.5 - 1
Multiple keyword queries
Implications for information retrieval
How relevant is relevance ranking?
In a recent thread in Web4Lib
a general feeling of frustration with the poor documentation of many internet
search engines became apparent. As many librarians correctly argued: a good
(even if general) understanding of how search engines work is crucial
for search engines to be fully accepted as information retrieval tools. I hope
that sharing my experience as a user of several AltaVista products will add
to the understanding of how AltaVista's
ranking algorithm works.
The search engines have good reasons not to disclose detailed information on their inner workings. One reason is the ongoing war with keyword spammers. Keyword spamming, or spamdexing, is generally disapproved of by most serious site promoters.

At the same time the thread in Web4Lib started, another thread was initiated in Online Advertising, a discussion group for site marketers and promoters. (This discussion has been summarized by Danny Sullivan as the "Search engines are dead" discussion.) Until recently, submitting your site to search engines was considered a valuable tool to direct traffic to the site. Now the competition among various site-promotion companies seems to have become so intense that search engine traffic is no longer considered the most efficient way to do promotion. You have to work very hard to get your site into the Top Ten (the default number of displayed results in many engines), and the next day there will be somebody else taking your place. It is no longer worth the effort; the ROI (Return On Investment seems to be a favorite term amongst promoters) has become too small. While many site promoters do not have a library background, it is sort of interesting to see that their main daily concern is with the Siamese twins of information retrieval: indexing and retrieval, traditional librarian core competences.

Another reason why search engines are not very forthcoming on the subject of ranking is that it cannot be readily explained without introducing a host of concepts that are probably not familiar to most of the millions of users.
As is probably well known, the index of any internet search engine is built by spidering documents on the internet and indexing them. AltaVista used to provide information on what "indexing" means:
"AltaVista treats every page on the Web and every article of Usenet news as a sequence of words. A word in this context means any string of letters and digits delimited either by punctuation and other non-alphabetic characters (for example, &, %, $, /, #, _, ~), or by white space (spaces, tabs, line ends, start of document, end of document). To be a word, a string of alphanumerics does not have to be spelled correctly or be found in any dictionary. All that is required is that someone typed it as a single word in a Web page or Usenet news article. Thus, the following are words if they appear delimited in a document HAL5000, Gorbachevnik, 602e21, www, http, EasierSaidThenDone, etc. The following are all considered to be two words because the internal punctuation separates them: don't, digital.com, x-y, AT&T, 3.14159, U.S., All'sFairInLoveAndWar."
This is so simple that it needs some time to sink in.
Look how comfortably mindless the indexing algorithm eats its way through documents
from head to toe. The only question it ever, ever asks is: is this character I'm
reading a string boundary? Yes or no is decided by looking the character up
in a table that contains all characters that are known as "delimiting
characters" (non-alphabetics, spaces, tabs, line ends, etc.). If the answer
to the question is yes, a new "word" (or a new occurrence of a "word")
is added to the index. If the answer is no, the indexing algorithm reads the
next character and asks the same question.
What holds true for indexing inevitably holds true for retrieval too. A URL that appears as text in a page, say http://www.dma.be/p/amphion/brakke-h/, is not one string but 8 different "words": http, www, dma, be, p, amphion, brakke, h. As AltaVista indicates, the index that results from all this will be full of all kinds of nonsense strings such as "602e21". A slight rewording of AltaVista's statement should go on a three-by-five card near the PC you are doing your searches on: "All that is required is that some moron typed it as a single word in a Web Page or Usenet article." Try the most exotic spelling mistake: if someone ever made it (on the web), you will find it (if it has been spidered and indexed, that is). As librarians know, this is not necessarily a bad feature. The presence of nonsense strings can also be of great help. Try for instance the phrase "All'sFairInLoveAndWar" to find copies of the AltaVista help files. The reverse holds true too: valuable information might be lost for retrieval because of spelling mistakes. But that's an old one.
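The delimiter-driven tokenization described above can be sketched in a few lines. This is a minimal sketch of the idea, not AltaVista's actual code; the exact delimiter set is an assumption based on the quoted help text:

```python
import re

def tokenize(text):
    # Any run of letters and digits is a "word"; every other character
    # (punctuation, whitespace) is a string boundary, as the help file says.
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

# A URL that appears as text in a page becomes 8 separate "words":
print(tokenize("http://www.dma.be/p/amphion/brakke-h/"))
# ['http', 'www', 'dma', 'be', 'p', 'amphion', 'brakke', 'h']

# Internal punctuation splits familiar tokens in two:
print(tokenize("don't"))   # ['don', 't']
print(tokenize("AT&T"))    # ['at', 't']
```

Note how the algorithm has no notion of a URL, a contraction or an abbreviation; it only ever sees strings and boundaries.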
More indexing information
One of the nice things about AltaVista is that you can
always count the occurrences of a term in the index. If a keyword occurs too
many times in the index, the search itself will be "ignored", but the
number of times the word appears in the index will still be adequately counted.
Librarians tend to call these too frequently occurring keywords "stop words".
Stop words they are, but in the old days the status of stop word was assigned
manually. This is no longer the case in the huge full text databases that are
generated by spidering documents on the internet.
Now some arbitrary value is used as a cut-off. As the database grows, more and more words will have a number of occurrences that is above this value. The important thing here is that the old static concept of "stop words" has become dynamic. What is a valid term today can be a stop word tomorrow.
Full text indexing and Zipf
As I will explain later, some insight into the distribution
of terms in the index is important in order to understand what happens to your
searches. Because AltaVista is a full text index, terms in the index will be
roughly distributed according to Zipf's law.
Materials for an Information Retrieval course (note 06/2002: the course by Mr. Allan is still there, but I could not find the WSJ and TIME data anymore) at the University of Massachusetts, Amherst, include examples of the most frequent terms in a full text database of TIME articles and a database of Wall Street Journal articles. I took these terms and counted them in AltaVista. Since I cannot know which terms are in AltaVista but not in the list of most frequent words of the TIME or WSJ databases, the results of this count do not constitute a Top 40 of the most frequent AltaVista terms. It is only an approximation of what a Top 40 might look like.
|the||1,364 mn||it||151 mn||new||64 mn||our||40 mn|
|of||771 mn||be||151 mn||u||58 mn||who||37 mn|
|and||711 mn||this||146 mn||one||57 mn||out||37 mn|
|to||662 mn||as||131 mn||he||57 mn||when||36 mn|
|a||645 mn||are||129 mn||but||56 mn||search||34 mn|
|in||474 mn||at||128 mn||has||54 mn||been||33 mn|
|for||298 mn||from||118 mn||which||53 mn||would||33 mn|
|s||269 mn||an||89 mn||about||51 mn||date||30 mn|
|is||265 mn||was||88 mn||they||51 mn||its||30 mn|
|on||203 mn||not||87 mn||more||50 mn||had||29 mn|
|with||161 mn||have||86 mn||up||45 mn||internet||29 mn|
|by||156 mn||all||85 mn||their||43 mn||into||26 mn|
|or||154 mn||week||67 mn||his||42 mn|
Note how small these TIME and WSJ databases are compared to the AltaVista index, if we take the number of times "the" occurs as an indication of size.
Conversely, a rough estimate of the size of the AltaVista
index can be derived. The WSJ corpus consists of 46,449 newspaper articles,
with 19 million term occurrences, which means that "the" occurs on
average once every 17 words. If the same ratio applied to the AltaVista
index, total index size might be estimated at 23,000 million occurrences. While
this is certainly impressive, I do not believe that it represents more than
half of what is actually out on the web. The fact that most internet indexes
contain documents in various languages is of course a flaw in the "the"-argument
presented here. Reliable data, or even reliable estimates, on how many documents there are on
the web are missing. A hundred million documents was the latest figure I read, but no
authority was given. Data on search engine indexing policies are (except for
various claims to be "the biggest" index) missing as well. In fact,
"completeness" has for some time now been almost a non-issue with
most of the major search engines. It has been a while since I last saw a search
engine claim to index THE internet.
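The back-of-the-envelope estimate above can be reproduced directly. The figures are the ones quoted in the text; the one-in-17 ratio is rounded:

```python
# WSJ corpus: 46,449 articles, 19 million term occurrences,
# "the" occurring roughly once in every 17 words.
wsj_occurrences = 19_000_000
the_ratio = 17

# AltaVista counted about 1,364 million occurrences of "the".
altavista_the = 1_364_000_000

# If the same one-in-17 ratio held for AltaVista, the total index
# size in term occurrences would be:
estimate = altavista_the * the_ratio
print(estimate)  # 23188000000, i.e. roughly 23,000 million occurrences
```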
The words that were most frequent in the TIME full text database and in the Wall Street Journal example are distributed similarly in the AltaVista index. There are numerous exceptions, some due to "database vernacular" such as "britain" and "govern" in the TIME database, "million", "company" and "market" in the WSJ corpus, or "internet" in the AltaVista index. So even though I cannot say very much about the total size of an internet index such as AltaVista's, I know how strings or words are distributed in the index. This will be important later on, when we consider the role "weighting" plays in the retrieval process.
The important thing is that Zipf makes for a very skewed distribution. Jakob Nielsen has a very interesting column on web site popularity and Zipf distribution, part of which I use here with his permission to clarify what is at stake :
[Two diagrams of a Zipf distribution of web site popularity: one with linear scales on both axes, one with logarithmic scales on both axes]
A simple description of data that follow a Zipf distribution is that they have :
- a few elements that score very high (the left tail in the diagrams)
- a medium number of elements with middle-of-the-road scores (the middle part of the diagram)
- a huge number of elements that score very low (the right tail in the diagram)
Zipf distributions have been shown to characterize use of words in a natural language (like English) and the popularity of library books, so typically
- a language has a few words ("the", "and", etc.) that are used extremely often, and a library has a few books that everybody wants to borrow (current bestsellers)
- a language has quite a lot of words ("dog", "house", etc.) that are used relatively much, and a library has a good number of books that many people want to borrow (crime novels and such)
- a language has an abundance of words ("Zipf", "double-logarithmic", etc.) that are almost never used, and a library has piles and piles of books that are only checked out every few years (reference manuals for Apple II word processors, etc.)
End of quote. In terms of our index this means that the very few words that occur extremely often are the "stop words", which are not allowed in searches (but whose occurrences in the index can still be counted). The "abundance of words that are almost never used", on the other hand, make for interesting keywords: searches on them do not give too much of a headache.
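Zipf's law, in its simplest form, says that the frequency of the r-th most common word is roughly proportional to 1/r. A toy sketch of how quickly that falls off (the constant is just the count of "the" from the table above; real data only follow the curve approximately):

```python
def zipf_frequency(rank, top_frequency):
    # Simplest form of Zipf's law: frequency ~ top_frequency / rank.
    return top_frequency / rank

top = 1_364_000_000  # occurrences of "the", the most frequent term
for rank in (1, 2, 10, 100, 1_000_000):
    print(rank, round(zipf_frequency(rank, top)))
```

Already at rank 2 the prediction (682 million) is in the right ballpark for "of" (771 million), while a word at rank one million is predicted to occur only some 1,364 times: the skew is enormous.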
AltaVista ranks the results of a search on criteria that, according to a help file that came with the Personal Extensions program, include these:
- Whether the words or phrases are found in the first few lines of the document (for example, in the title of a web page).
- The frequency of occurrence of a query word or phrase. Rare words in a query are weighted more heavily than common words (rarity is determined by the number of occurrences of the word in the index).
- Whether all of the specified words or phrases appear in a document. A document containing all three words specified in a three-word query would rank higher than a document containing only two or one of the words.
- Whether multiple query words or phrases are found close to each other in a document.
These are general principles that hold true for almost
any indexing and retrieval program. But what do they mean exactly? For instance,
the third criterion seems very generous, but isn't it just good practice that
your search on "education AND 'distance learning' AND resources" ranks
results higher if they match 3 of your terms instead of just 2?
When I reviewed the freely downloadable demo package of AltaVista's Personal Extensions, I noticed that it had an extra search interface to be used when the default search interface failed to install properly (find and double-click a file named pav_gui.exe after you have installed the program). This interface actually gave ranking scores for each document that was retrieved:
[Screen shot of the results window with ranking scores]
A few months ago AltaVista opened up a Belgian
branch (note 06/2002: this service no longer functions; the new service
simply reroutes queries to the regular AltaVista site, which does not return ranking
scores), which also gives ranking scores.
Ranking scores like these are very valuable information, because if you have the patience to go and look in the retrieved documents, count the occurrences of your keyword and look at where the hits are, you can learn a lot about how ranking is done. In the next section I will focus on single keyword searching in order to describe the basics of ranking. After that I will theorise on what happens in complex searches.
Single keyword queries
One of the first surprises I had was that single keyword
searches consistently returned "groups" of results. In the case of the
simple search on "page" with the Personal AltaVista (screenshot
above) the results had only three different ranking scores (765; 229; 153)
for all 15 documents that were retrieved. The same holds true if you do a single
keyword search on the Belgian branch, even though the scores might be a little
"fuzzy". For instance, a search on "flaubert" returns four
groups: 875-870; 700-696; 262-261; 175, which in my experience is the maximum
number of different ranking scores for any single keyword search. (We will look
at these different groups in more detail further down.)
Number crunchers will have noticed it already, but for a poor mathematician like me it took quite some fiddling around before I saw it. Once you've seen it, though, you will never again NOT see it. The ranking scores are numerically related as:
5 - 4 - 1.5 - 1
Since the lowest score in the row equals 1 in the numerical
relation, I will call this score the index score or index weight. The three other
scores can always be found by multiplying the index score:
875 = 5 x 175
700 = 4 x 175
262 = 1.5 x 175
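These multiples can be stated as a one-line rule. This is my reconstruction of the observed relation, not anything AltaVista has published:

```python
def group_scores(index_score):
    # The four observed score groups are fixed multiples of the
    # lowest score: 5x, 4x, 1.5x and 1x (fractions truncated).
    return [int(index_score * m) for m in (5, 4, 1.5, 1)]

print(group_scores(175))  # the "flaubert" search: [875, 700, 262, 175]
```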
I call this score the index score because it is computed, as AltaVista indicates in the help file quoted above, as a function of the number of occurrences ("rarity") in the index. I do not know how that computation is done, but we get an idea of what goes on if we look at the following table, in which index scores of some keywords are matched with their occurrences in the global index:
I think the general rule is pretty obvious and logical:
the more occurrences in the index, the smaller the weight becomes. The underlying
assumption is plain old information theory basics: the greater the probability
that a word (string, character, event, etc.) will occur, the less information its occurrence carries.
The main problem to overcome here is of course the Zipf distribution. What is a good measure to compute index weights? The enormous differences between the number of times any given term appears in the index should not, and cannot, give rise to the same order of difference in the weights, because that would allow one term to dominate other terms in a way that would make any serious information retrieval impossible. As said before, I do not know how AltaVista computes the weights, and tastes differ, but I've always thought they are doing a pretty good job at it. I guess you could find out the function AltaVista uses if you plot enough coordinates (occurrence-weight pairs). Also, there is room for some tweaking here: different functions will give different weights, which will be good for different kinds of users.
In the table above I have only tried to give an idea of the boundaries of the system. The weight of a common word such as "where" is very low (it should be somewhere between 2 and 3, hence the 2/3 notation). It was the closest I could find before a word becomes a "stop word". The 0 weight is where the cut-off value for stop words lies. Note that this feature can be used to do a rough simulation of NLP (Natural Language Processing) of queries. A natural language question such as Where do I find Bill Gates on the net? will yield similar results to just typing Bill Gates (NOT as a phrase), because most of the words in the sentence are stop words and will not be used to perform the search anyway. Whether the results will be useful is another matter.
"Flipsies", on the other hand, occurs only once in the index and gets 356. This seems to be the maximum weight a word can aspire to. Now, "flipsies" is clearly a rather exotic case. I do not think that there are many single words that occur only once in the index, but phrases are treated as single keywords too.
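One plausible family of weighting functions is a logarithmic (IDF-style) transform, which is the standard way to compress Zipf's enormous frequency differences into manageable weights. The sketch below is purely illustrative: the formula is not AltaVista's (which is unknown), and the constants are chosen only so that a term occurring once gets the observed maximum weight of 356.

```python
import math

TOTAL = 23_000_000_000  # rough index size in occurrences, estimated earlier

def index_weight(occurrences, total=TOTAL):
    # IDF-style weight: rarer terms weigh more, but only logarithmically,
    # so no single term can dominate a multi-term query outright.
    scale = 356 / math.log10(total)  # calibrated so index_weight(1) == 356
    return round(scale * math.log10(total / occurrences))

print(index_weight(1))           # a hapax such as "flipsies": 356
print(index_weight(1_000_000))   # a moderately common term weighs far less
```

The real curve must fall off much faster near the stop-word cut-off than this sketch does, since a word like "where" ends up with a weight between 2 and 3; only the general shape (monotone decreasing, logarithmic compression) is the point here.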
5 - 4 - 1.5 - 1
Back to the numerical relation. What causes a document
to be ranked with the index score, or with 4 or 5 times the index score? In
order to find out, we will look at some pages of results. Following the popular format,
each table lists result numbers, titles, URLs, descriptions and scores.
Let's look at the first ten URLs that are retrieved by the search on "Flaubert", one of my favourite writers:
|855 documents found on keyword flaubert | showing page 1-10|
Sponsored this month by: How you can sponsor these pages. [Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Flaubert on...
Index of /faces/local/us/in/bloomington/flaubert. Name Last modified Size Description. Parent Directory 14-Feb-95 04:31 - lisa/ 31-Mar-95 13:58 -
- Three Tales by Gustave Flaubert
Three Tales. by Gustave Flaubert Date published: 1/2/61. TRANSLATED WITH AN INTRODUCTION BY ROBERT BALDICK. Twenty years after Madame Bovary, Flaubert...
L 411 The Short Novel from Flaubert and James to the Present (Courses
COLLEGE OF ARTS AND SCIENCES COMPARATIVE LITERATURE 1997-98 COURSE DESCRIPTIONS. COM L 411 The Short Novel from Flaubert and James to the Present. Spring..
: L'éducation sentimentale
Gustave Flaubert L'éducation sentimentale. ALEXANDRIE - La Bibliothèque Virtuelle. TABLE DES MATIERES. Première partie. Chapitre...
At other times, seared by that hidden fire which her adultery kept feeding, consumed with longing, feverish with desire, she would open her window, inhale.
Index of /faces/users/us/in/bloomington/flaubert/lisa. Name Last modified Size Description. Parent Directory 21-Dec-92 12:18 - face.gif 31-Mar-95 13:38 1k.
DICTIONNAIRE DES IDEES RECUES DE FLAUBERT
LE DICTIONNAIRE DES IDEES RECUES DE FLAUBERT par Anne HERSCHBERG PIERROT. Avant-Propos. INTRODUCTION. L'Opinion et les majorités. Langage de la bêtise....
Web : Gustave Flaubert
Flaubert's Critics. "Too frequent in studying a great work we only end where in fact we should have begun: by examine directly our impressions as we read..
It is pretty obvious that all the retrieved results have the keyword in the title. These documents are ranked highest. For most simple searches, the result of a search roughly equals a good old title search in the library catalogue. Let's have a look at where the ranking scores begin to fall:
|documents found on keyword flaubert; showing page 71-80|
Flaubert (1821-1880) : Amazon Book List
Gustave Flaubert (1821-1880) : Amazon Book List. | Advanced LC Search | Advanced Amazon Search | 56 items shown. Click on title for more details and...
Chère Maître: The Flaubert-Sand Correspondence
restaurants events arts & music places to...
Index of /picons/db/users/us/in/bloomington/flaubert/lisa. Name Last modified Size Description. Parent Directory 21-Dec-92 12:18 - face.gif 31-Mar-95...
France - Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert,
Index:Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert, Gustave (1821-1880) Options. Bouvard et Pécuchet. Coeur simple, Un. Éducation...
Index of /l/www/faces/local/us/in/bloomington/flaubert. Name Last modified Size Description. Parent Directory 14-Feb-95 04:31 - lisa/ 31-Mar-95 13:58 -
of /picons/db/local/us/in/bloomington/flaubert. Name Last modified Size
Index of /picons/db/local/us/in/bloomington/flaubert. Name Last modified Size Description. Parent Directory 14-Feb-95 04:31 - lisa/ 31-Mar-95 13:58 -
- The Dictionary of Received Ideas by Gustave Flaubert
The Dictionary of Received Ideas. Preface by Julian Barnes. by Gustave Flaubert Preface by Julian Barnes Date published: 14/11/94. Lake: Always have a...
Index of /picons/db/local/us/in/bloomington/flaubert/lisa. Name Last modified Size Description. Parent Directory 14-Feb-95 04:31 - face.gif 31-Mar-95...
y vacío: Lenguaje y tópico en Gustave Flaubert - nº 4 Espéculo
Palabras y vacío. Lenguaje y tópico en la obra de Gustave Flaubert. Joaquín Mª Aguirre Romero Dpto. Filología Española III (CC Información) Universidad...
Index of /picons/db/users/us/in/bloomington/flaubert. Name Last modified Size Description. Parent Directory 21-Dec-92 12:18 - lisa/ 31-Mar-95 13:38 -
First thing to note is that this is already the seventh page of results. As site promoters (and librarians, I should add) know, the seventh page is beyond most users. Several interesting things happen here. While 870 is still the highest score (x 5), from the third URL onwards the score drops to the next level (x 4), first 700 and then 696. We still have our keyword in the title, but it has moved past the eighth word position, and this seems to be the criterion for dropping the score. You would expect that a rule relating scores to positions within a document would be measured in characters rather than words, but that is not the case. If you go back to the AltaVista explanation of how the indexing algorithm works, you can guess why: the indexing algorithm only knows "strings". For instance, the title of the 76th result:
Index of /picons/db/local/us/in/bloomington/flaubert. Name Last modified Size
Since the / is a non-alphabetic character, it is considered a word boundary and our keyword "flaubert" occupies position number nine. This makes the difference with the almost identical result we encountered on the first page of results which had
Index of /faces/local/us/in/bloomington/flaubert
for title. Almost identical. Almost, but "flaubert"
occupies position number eight: just enough to make it to the top. There are
more files like these from the same server. They contain nothing but a directory
overview of a very small directory; the keyword "flaubert" occurs once
in the body of the text because the text starts with the same words as the title
(Index of /picons/db/local/us/in/bloomington/flaubert/lisa).
These files shouldn't have been indexed to begin with (the read property of the directory was probably set to "all users", so that the spider robot read and indexed it like any other file). The name of the directory has changed over time, and each time it has been indexed by the spider. Apparently, this kind of near-duplicate cannot easily be removed from the index (automatically, that is). They constitute some of the inevitable noise that is picked up first by the indexing mechanism and later by searches. Note, however, that the "noise" in this case can easily make it into the top 10.
Why only the first 8 words of a title should be considered important is an interesting question, but it cannot be answered. My guess would be that this is largely an anti-spamdexing measure. Lots of people in the business of unethical site promotion know from experience that nothing can beat a word that appears in the title of an HTML file. The result was that some began to use multiple TITLE-tags. Others began to write extremely long titles with lots of repeated keywords. In addition to spamdexing there are genuine mistakes: some people may forget to close the TITLE-tag. Since documents cannot be relied on to provide a bulletproof boundary, it is very likely that the spidering robot has some simple rule to decide where a title begins and ends: index and count eight strings beginning from <TITLE>, then count and index x more as 'rest of the title', and after that index everything as if it were part of the document body. Of course there might be plenty of other reasons why files are indexed the way they are.
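The guessed title rule can be written down as a tiny function. This is a reconstruction of the observed behaviour, not anything documented; the cut-off of eight positions is the one inferred above, and tokenization uses the same string-boundary rule as the index:

```python
import re

def title_multiplier(title, keyword):
    # Tokenize the title the way the index does: any non-alphanumeric
    # character is a word boundary.
    words = [w.lower() for w in re.split(r"[^A-Za-z0-9]+", title) if w]
    if keyword.lower() in words[:8]:
        return 5   # keyword within the first eight title "words"
    if keyword.lower() in words:
        return 4   # keyword in the title, but past position eight
    return None    # not in the title: body occurrences decide instead

# The two almost identical directory listings from the results pages:
print(title_multiplier("Index of /faces/local/us/in/bloomington/flaubert", "flaubert"))      # 5
print(title_multiplier("Index of /picons/db/local/us/in/bloomington/flaubert", "flaubert"))  # 4
```

The extra path component /picons/db/ pushes "flaubert" from position eight to position nine, which is exactly what separates the x 5 group from the x 4 group in the "flaubert" search.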
Let's continue our journey through the "flaubert" search.
|855 documents found on keyword flaubert | showing page 81-90|
- The Temptation of Saint Anthony by Gustave Flaubert
The Temptation of Saint Anthony. by Gustave Flaubert Date published: 31/3/83. Price: £6.99 (Paperback) ISBN: 0140444106. Order this book. Click here..
France - Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert,
Index:Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert, Gustave (1821-1880) Options. Bouvard et Pécuchet. Coeur simple, Un....
France - Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert,
Qwam: l'intelligence de l'économie. Index:Art et culture:Littérature:Genres:Romans:Romanciers:Flaubert, Gustave. Options. Bouvard et Pécuchet....
Extrait de "Souvenirs littéraires" de Maxime Du Camp (écrivain, ami de Gustave Flaubert) "Au mois de janvier 1844, Gustave...
Buchlust Medienwelt Café Central Shopping Passage Stellenmarkt Anzeigenmarkt Home. KOLLEGENGESPRÄCHE und andere...
- Three Classic Romantic Stories by Charlotte Bront, Emily Bront
Three Classic Romantic Stories. by Charlotte Bront, Emily Bront and Gustave Flaubert Date published: 3/11/94. (PEN 99) 9hrs B. Three classic romantic...
REGION SURESTE. TIENDAS. DIRECCION. Nº TELEFONO. Horario comercial. CLERMONT. Boulevard Gustave Flaubert. 63000 - CLERMONT FERRAND. Telf :...
GUSTAVE FLAUBERT Frédéric Moreau En ung manns historie. Flauberts romanhelt Frédéric er en passiv, lettbevegelig...
for Madame Bovary : a story of provincial life. (in MARION)
Madame Bovary : a story of provincial life. Records 1 to 1 of 1. Flaubert, Gustave, 1821-1880.Madame Bovary : a story of provincial life / Gustave...
Flaubert, l'initiation amoureuse et le rêve oriental. Retour sommaire Retour page précédente. En 1840, Flaubert est reçu au...
From the 84th result onward the score drops considerably.
The keyword no longer appears in the title of the HTML file, but it does appear
in the body text of the files. Here is where my second big surprise came. I had
always assumed that the number of occurrences of a keyword in a document would
be important. It is not, or rather it is not in the way I had expected it to
be. Opening documents and counting occurrences with the Find command of a browser
shows that at this level (x 1.5) there can be an unlimited number of occurrences
of the keyword. As long as there are at least two, the score will remain the
same. Again, from an information retrieval point of view this may appear very
strange, but in the light of spamdexing it is not. A common spamdexing technique
is to repeat keywords over and over in the body of the text. This is done either
shamelessly visibly, or invisibly, by writing in a white font on a white background.
A bright idea, but, at least in AltaVista's case, not a very efficient one:
mere repetition is not very highly rewarded.
Just one more page of results :
|855 documents found on keyword flaubert | showing page 171-180|
Katalogopplysninger: HYLLEPLASS: 840.9 A FORFATTER: Amadou, Anne-Lisa, n., 1930- TITTEL: Omkring Marcel Proust : elleve franske romanstudier ANSVARLIGE:...
manifesto" del 03-Luglio-1997
Uno scrittore bestiale. - JACQUELINE RISSET. "I CAPOLAVORI sono bêtes; hanno il volto tranquillo delle produzioni della natura, dei grandi animali e.
Noweb and html.
Prev][Next][Index][Thread] Re: Noweb and html. Subject: Re: Noweb and html. From: email@example.com (Norman Ramsey) Date: 29 Apr 1995 03:44:04..
Prev][Next][Index][Thread] Re: Beginner's Guide? Subject: Re: Beginner's Guide? From: firstname.lastname@example.org (Norman Ramsey) Date: 18 May 1995...
Editions Pleiade. Complete Set. Individual Volumes Also Available. Apollinaire, 1971. Balzac, 1962. Baudelaire, 1974. Camus, 1982. Carroll, 1990. Celine,..
Quote of the month: December. "...none of us can ever express the exact measure of his needs or his thoughts or his sorrows; and human speech is like.
C347 1014 Ideas in Literature
Comparative Literature | Ideas in Literature C347 | 1014 | Johnston. Topic -- Love and Tears: Women Criminals and Saints How do saints, criminals and...
Welcome to the Puzzle Factory. Who are we? Not sure yet but I'll keep you posted. Until then, I'll leave you with a couple of my favorite quotes. "Be...
PRODUCTS FICTION AND LITERATURE BOOKSTORE [F-L]
WHOLESALE PRODUCTS FICTION AND LITERATURE BOOKSTORE [F-L] Orders fulfilled by Book Stacks Unlimited. Wholesale Products Bookstore. Pick from your favorite.
Finally then, at result 176, we have reached the bare index weight. I did not inspect the long tail of results beyond this page (877 documents found!). Besides, most of the time AltaVista does not allow inspection beyond the 200th result anyway. From result 176 onwards, the keyword occurs only once, in the body text of the document.
Multiple keyword queries
What would happen if we were interested in the relation between Flaubert and his travel companion, the writer and photographer, Maxime du Camp? In Simple Search mode we would do a search that would look like :
+flaubert +"maxime du camp"
Where the + is used to simulate a Boolean AND. In Simple
Search mode, ranking is performed automatically. Only Advanced Search mode supports
the use of the AND operator. To obtain the same results in Advanced
Search mode you would have to type flaubert AND "maxime du camp"
in the search box, and the keywords flaubert and "maxime du camp"
in the ranking box (without the Boolean AND).
Unfortunately, the Belgian branch does not support any searching more complex than single-keyword and phrase searches. So I first did a search on the internet and opened all the documents I could open. Then I had the Personal AltaVista program index my browser cache. Of course, the index resulting from my browser cache was very small and contained comparatively many flaubert's and maxime du camp's (measured against total index size, that is). Thus, the index weights I found were 165 (flaubert) and 188 ("maxime du camp"). Then I did the search described above (+flaubert +"maxime du camp") on Personal AltaVista. Here is what I found :
Screenshot of ranking results for complex query
Because of the +... +... formulation, all documents contain at least one flaubert and at least one "maxime du camp". The documents are ranked the same way I found them ranked with the Simple Search over the internet. Up to 1187, the scores make sense. The lowest score seems to be computed as :
(165 x 1.5) + (188 x 1) = 435.5
The decimal is dropped and 435 remains. The second score seems to be :
(165 x 1) + (188 x 1.5) = 447
And so on. It does not take long to see the metric that is applied here. The individual scores are computed exactly the same way they are computed in single keyword queries, but they are then added up. Opening the documents and counting the occurrences confirms that all documents with a score of 435 contain multiple flaubert's and only one maxime du camp. The document with score 1187 (in fact 1187.5) has one maxime du camp in the title and several flaubert's in the body of the text, hence :
(165 x 1.5) + (188 x 5) = 1187.5
This formula can be visualised in the following table :
|           | 188 x 1   | 188 x 1.5 | 188 x 4 | 188 x 5    |
| 165 x 1   | 353       | **447**   | 917     | 1105       |
| 165 x 1.5 | **435.5** | 529.5     | 999.5   | **1187.5** |
| 165 x 4   | 848       | 942       | 1412    | 1600       |
| 165 x 5   | 1013      | 1107      | 1577    | 1765       |
Scores that I found in the results are marked in bold.
Not all combinations are found in a small search such as this one.
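The additive rule described above can be sketched directly, using the index weights measured on my Personal AltaVista index (165 for flaubert, 188 for "maxime du camp"). The function name is mine; the arithmetic is exactly what the results showed.

```python
# Sketch of the additive multi-keyword score: each keyword's score is
# computed as in a single-keyword query, then the scores are summed.

WEIGHTS = {"flaubert": 165, "maxime du camp": 188}

def combined_score(multipliers):
    """Sum of (index weight * multiplier) over all query keywords."""
    return sum(WEIGHTS[kw] * m for kw, m in multipliers.items())

# Lowest observed score: several flaubert's in the body (x 1.5),
# one maxime du camp in the body (x 1):
print(combined_score({"flaubert": 1.5, "maxime du camp": 1}))   # 435.5
# Second observed score:
print(combined_score({"flaubert": 1, "maxime du camp": 1.5}))   # 447.0
# Top of the first segment: maxime du camp in the title (x 5):
print(combined_score({"flaubert": 1.5, "maxime du camp": 5}))   # 1187.5
```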
Looking back at what I said about Zipf, the importance of converting the weights according to "rarity" in the index becomes clear. If one keyword had a weight of 3,000 and the other a weight of 50, the keyword with weight 50 would always end up somewhere at the back, no matter where it appeared in the document. Even if it appeared in the title, it would be ranked at the far end of the results.
Note that this way of processing results is elegantly simple and robust, but also very heavy on the processor. Every keyword added to the query potentially multiplies the required processing power by a factor of four.
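That factor of four can be made concrete: with four possible weight levels per keyword, a query of k keywords admits 4 to the power k distinct multiplier combinations. A tiny illustration (the function is mine, the four levels are the ones observed above):

```python
# Four possible multipliers per keyword (x1, x1.5, x4, x5) means the
# number of distinct score combinations grows exponentially with the
# number of query keywords.

def combinations(k):
    """Distinct multiplier combinations for a k-keyword query."""
    return 4 ** k

for k in range(1, 5):
    print(k, "keyword(s):", combinations(k), "combinations")
# 1 keyword(s): 4 combinations
# 2 keyword(s): 16 combinations
# 3 keyword(s): 64 combinations
# 4 keyword(s): 256 combinations
```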
The scores from 1187 upwards are unclear. None of the documents in this segment has one of the keywords in its title, yet they rank very highly -higher, in fact, than could be expected on the basis of the simple rule above. Grouping is done in a similar way (depending on occurrences of keywords), but some additional criterion seems to be used in the computation. I cannot find what it is, but it seems likely that some distance metric is involved, because only documents that have both keywords close to each other (close being something like "in the same sentence") are in the upper segment. However, in the first section (up to 1187) there are also two documents with the keywords very near each other. So even though the basis is clear, more research needs to be done. I would be delighted if someone with experience in statistical information retrieval techniques would look into this.
Apart from this as yet unknown distance metric, there seem to be only four criteria for ranking scores : two positional ones, and two related to occurrences in the text. The positional criteria have to do with the words contained in the HTML TITLE-tags; the occurrence (or frequency) criteria concern the body text of the document. For clarity these criteria are summarized in the table below.
| POSITION                   |               | OCCURRENCES IN TEXT  |        |
| first 8 words of the title | rest of title | 2 - unlimited occ.   | 1 occ. |
| x 5                        | x 4           | x 1.5                | x 1    |
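Putting the four criteria together, a complete toy ranking pass might look like the sketch below. It is a reconstruction only: the document names are invented, the weights are the ones measured on my Personal AltaVista index, and the dropping of the decimal for display matches the 435.5 result shown earlier.

```python
import math

# The four inferred criteria and their multipliers.
MULTIPLIERS = {
    "title_first8": 5,    # keyword in first 8 words of the TITLE
    "title_rest": 4,      # keyword elsewhere in the TITLE
    "body_multi": 1.5,    # 2 or more occurrences in the body text
    "body_single": 1,     # exactly one occurrence in the body text
}

def rank(docs, weights):
    """docs: list of (name, {keyword: criterion}) pairs.
    Returns (floored score, name) pairs sorted by descending score."""
    scored = []
    for name, hits in docs:
        score = sum(weights[kw] * MULTIPLIERS[c] for kw, c in hits.items())
        scored.append((math.floor(score), name))  # decimal is dropped
    scored.sort(reverse=True)
    return scored

weights = {"flaubert": 165, "maxime du camp": 188}
docs = [
    ("doc A", {"flaubert": "body_multi", "maxime du camp": "body_single"}),
    ("doc B", {"flaubert": "body_multi", "maxime du camp": "title_first8"}),
]
print(rank(docs, weights))   # [(1187, 'doc B'), (435, 'doc A')]
```

Doc B reproduces the 1187.5 result (displayed as 1187) and doc A the 435.5 result (displayed as 435).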
Implications for information retrieval
Some issues with regard to information retrieval are :
How relevant is relevance ranking?
There is room for a lot of philosophy here. I have to confess
that when I began to see how strongly retrieval relied on simple brute computing
power, I was disappointed and relieved at the same time. Disappointed because
what had seemed such a great tool was again one of those really dumb computer tricks,
relieved for the same reason. But then again, given the size of the index,
most of the time a search engine does the job, and AltaVista does it well. At
least, if you are looking for something rather specific. On the other hand,
even if you are not looking for something specific you will very likely find
something relevant on the first page of results (but miss a lot of potentially
relevant material further down). After all, it is this combination of characteristics
that has made AltaVista one of the most popular engines for end users. For instance,
if you are looking for different tie knots, you might try a simple search on
the phrase : "tie knot*". Some noise will enter the results because
tie knots is written the same (but means something different) in
a sentence such as John knows how to tie knots. The first page of results
will yield an online tie shop that offers drawings of the Windsor and Half-Windsor
knots. However, only on page four will you find the page of the venerable American
Neckwear Association, which is unfortunately titled How to tie a tie,
but which contains the best graphics on how to really solve everyday tie-knot problems.
In the end it is just the same old story all over again : if you do not expect too much it works great; if you want to get some work done it pays to get to know the machine first.