MIT/Brown Vannevar Bush Symposium 1995 – 50 Years After ‘As We May Think’ – Part 3/5

[MUSIC PLAYING] VAN DAM: Our last speaker
today is Michael Lesk. And to tell you a bit about
why he is here, as you know, memex was a personal information
organization and retrieval mechanism. And whereas Michael is
probably known to you as one of the original
Unix hackers– for example, Lex
and Yacc and UUCP– the real reason
for his being here is that he’s a deep
scholar equally at home in the humanities and
in computer science where he’s been specializing
in information retrieval. He’s the chief research
scientist at Bellcore. He’s also visiting professor
at the University College in London where
he simultaneously holds appointments in computer
science and librarianship. And he’s had an abiding
interest in books and libraries. Michael. LESK: Thank you very much. Of all the undeserved
praise, I will in particular correct that I did
not write Yacc. I’m very flattered to be here. And I thank you. The title of this
talk is the Seven Ages of Information Retrieval. If we could have
the first slide, Shakespeare wrote about
the seven ages of man. And information
retrieval, we think of it as being born in 1945. I think it’s going
to work out very well to view it as a human life. That is, by the time that
it would be, say, 70 in 2015, we will be done. We will have achieved
what Bush set out to. We will have a library
of a million books fitting into your desk,
at least virtually. To show you how far
along we are on this, everyone in the audience think. The last time you
needed to know something that you had to look up,
you didn’t remember it and the guy in the next
office didn’t know, did you look it up on
a screen or on paper? How many on a screen? Raise your hand. And how many on paper? Raise your hand. Last time, okay. It was about 75%
to 80% on screens. Hands down. That is how far we have come. And I can easily believe
that in the next 20 years, we’ll get the rest of the way. So I’m going to talk about IR
as a life with its seven stages. This works particularly well for
me because I was born in 1945. And so I think of
this, right, you know, when IR was 20
years old and should have been in, you know,
sort of college student, change-the-universe
mode, well, you know, I was working for Gerry Salton. And we were publishing the first
paper about SMART in CACM, trying to introduce
free-text indexing. So it works well enough. The other comment I use as
a thread through the talk is there’s been a tension
throughout this life. We’ve heard a lot about Bush. Now, in addition
to Professor Bush, there was another MIT man
named Warren Weaver who was active at the same time. Weaver wrote an article
in 1949 recommending machine translation. Bush, remember, is
alluding in his article to the work of the
physicists, the people who built the nuclear bomb
and the microwave radio. Weaver, who was another MIT man,
another propeller head going out from Cambridge
to rule the world, was thinking in terms of the
cryptographers, the people who had broken codes with
computers during the war. He thought of, well, we
could use that technology on language. He said, quote, “it
is very tempting to say that a book
written in Chinese is simply a book written
in English, which was coded into the Chinese code. If we have useful methods
for solving almost any cryptographic
problem, may it not be that with
proper interpretation we already have useful
methods for translation?” So Bush started hypertext
and information retrieval. Weaver started
machine translation. But there’s also a tension here. Bush thought of
information organization as, people will make
trails analogous to the manual indexers
who had been working in libraries for generations. Weaver is talking
about statistics. And throughout the life
of information retrieval, there has been this
tension between, are we going to do intellectual
analysis– whether it be manual or automatic–
or are we going to sort of do word counting? It’s exactly the
same tension that exists in chess between
the people who say, we’ll try all legal moves
as far ahead as we can. And those who say
no, no, no, there must be a role for
intelligent selection of which moves to evaluate. And in the same way, we’ve
had this tension in IR for its entire history. So, okay, we’ll go through this. As I say, in 1945
the field starts. Shakespeare’s next
stage is the schoolboy, which would have been the ’60s,
then adulthood, experience. Shakespeare actually goes on to
things like soldier and justice and comic character
on the stage. I won’t do those, so we’ve
changed it a little bit. Bush’s predictions are
rather interesting. If we could have the
next slide, please. This is a list of
some of the things Bush said we were going to have. And it will not be unfamiliar
to everybody in the business that most of the
technology predictions were achieved at some point. We have instant photography. In fact, we had it very
soon after he wrote. We have motor-driven cameras
and we have automatic exposure cameras. We have computers that
have card and film control and select their own data. What don’t we have? Well, we haven’t actually
achieved his storage goals. We can’t put a million
volumes in a desk yet, although Mead Data
Central has the equivalent of a million volumes. They have 2.6 terabytes. But that’s a very
selected set of stuff. We don’t have
speech recognition. We don’t have
really working OCR. And some of the
things that were done turned out not to
be worth doing. Stereo photography
has been achieved, but it’s sort of a toy
to sell to tourists. And the ultramicrofiche that
was held up earlier today, again was achieved. But if you’re a librarian and
let’s say out of each $100 you budget, you spend $25 on
the building and somebody tells you that regular
microfiche would reduce that cost to $5 and
you don’t do it, it’s unlikely that telling
you that ultramicrofiche will reduce the cost to $2 will
make enough of a difference. If reducing 125 down to 105
doesn’t cut it, down to 102 isn’t going to cut it either. So again, we’ve done most
of the hardware stuff. We’re still waiting on
some of the software stuff. We haven’t got automatic
typing from dictation. We haven’t got OCR. We also, by the way,
haven’t got the stuff about plugging your
nerves directly into electronic systems. Although, I did see a story
on that in the newspaper fairly recently. What I found more
interesting was that Bush talked a lot about
individualized systems. The way Bush envisioned
his interface, each person would have their own
personalized information space. For most of the history
of information retrieval, that isn’t what’s been done. What’s been done
is to have systems that look the same to
everybody who uses them and which therefore
provide the best access to the information
that is sort of impersonal, the journal articles. But again, that’s something
we’ll come back to later. So let’s go on. We still have this
infancy stage. Now, in 1957 the
Soviet Union put up the first artificial
earth satellite. And that produced a
wide array of fears that the United States
was falling behind in science and a realization
that the United States didn’t even know much about what was
going on in Russian science. So there was some funding
of Russian language studies and machine translation. And there was a lot of funding
of information transfer, to say, well, the answer
will be that we’ll improve our knowledge
by building systems to distribute information. There were widespread
urban legends of some company that had spent
either $100,000 or $250,000 to reproduce a
result that had been readily available in literature,
but they had not found. I once eventually saw a paper
in which someone claimed that they had run the story into
the ground, but it was false. But it was widely
believed at the time, so people set to work. The first thing they did was
they built quick indexes. I don’t know how
many of you remember keyword-in-context indexes,
a man named HP Luhn. If you think that’s
primitive, you should remember what
the competition was. How many people here
remember something called edge notch punch cards? A reasonable number. You know, I thought
about bringing one. And then I said, where am I
going to find an edge notch card today? So I made some. And the idea for those of
you who haven’t seen them is you have these cards. And there’s a row
of holes and you can hold them up with
something through the holes. And let’s say this
is a bibliography, so this card has the
citation for Moby Dick. And this card has the
citation for Principles of Computer Graphics. And you take each hole
and you assign it. So let’s say that we decide
that this hole is going to be books written by Andy van Dam. So for this card, we
tear out that notch. And let’s say the next hole
is books written by Al Aho. We haven’t got one of those. And let’s say the next
one is going to be books written by Jim Foley. So again we’ll make a notch. And let’s say that somewhere
we get to Herman Melville and we make another. And now you see we have
this batch of cards. And again, you can put
something through the slots. And suppose you want all books,
all your references for van Dam you pick his hole. You put something through. Now traditionally, you
used a darning needle. But these holes are so big
in the demonstration that I’m going to emphasize that this
is a digital process by using my finger. And you shake the cards and
the right ones fall out. This is the retrieval
technology of the 1950s. So this is what got replaced. Now as I said, because
of things like Sputnik, it was a boom time for
retrieval in the ’50s and ’60s. More people attended SIGIR– Special Interest Group
on Information Retrieval Conferences– in the early
’60s with the predecessor conferences, actually,
than attended them a couple of years ago. It’s really remarkable. The first experiments were
systems that used indexing. Because things were being keyed
in and they were painful to do. What happened in
the ’60s when we’ve gone into the experimental
schoolboy stage was people started doing experiments
on free-text retrieving, most particularly
Gerry Salton, who died recently, unfortunately,
of cancer a few weeks ago. Gerry was the
first person to try to do large scale
experiments to say, yes, free-text indexing can be
compared and really work. Gerry worked a lot with a
man named Cyril Cleverdon. Cyril Cleverdon was the inventor
of the recall and precision measures. For the first time,
we had an attempt to make this part
of computer science an experimental science in which
people would run experiments. This is not typical
in computer science. You know, people do not
do evaluated experiments. When I wrote UUCP, we didn’t
say, oh, well, now let’s take 20 MIT undergraduates
and tell 10 of them to get a message to Brown
with UUCP and 10 of them to try Amtrak or Greyhound and
see which ones get there first. We just sort of write it
and put it out and say here. Well, Gerry and
Cyril had the idea that, no, information
retrieval was going to be evaluated experiments. Because there was this history
of there is a lot of stuff out there. And we need an effective
way to find it. There had been a long
tradition of, well, the way to do this is with
standardized nomenclature and whatnot. But nobody knew whether
they would really work. And so a lot of
experiments were done. The test collections, in
fact, that Gerry and Cyril prepared in the 1960s were
basically still in use until about 1992 as the
standard retrieval techniques. Now, this stuff was all
basic keyword retrieval. As this work was going
on, some new techniques that were also
sort of statistical went on like relevance
feedback– the idea that, well, we’ll find
one relevant document and we’ll use that as
a set of search terms to retrieve more
relevant documents. But all of this was,
again, statistical. Now, at the same time, I said
there’s this other thread. The other thread is
intellectual analysis. And in the IR
context, that was AI. That was artificial
intelligence. And what we had was that people
like Terry Winograd and Daniel began looking at
programs that would do linguistic
analysis of queries, and perhaps some
documents, and attempt to match them and attempt to
retrieve answers on that basis. Now, these two groups
didn’t get along very well. The AI people were in the
computer science tradition of we’re going to
build something that will make enough examples
to fill the back of a thesis, and that’s it. And the IR people were down this
trend of, we’re going to run the same 1,400-document
test collection over and over again until the experimenters
memorized all the articles on aeronautics. And all this was going on, but
the tension is still there. The tension fortunately is
only at the research level. Because at this
point, we’re still in the schoolboy phase and
not that much is actually getting done. Now, then we get into the 1970s. In the 1970s, we’re now an
adult. The field is now, you know, in its 20s. And what’s happening? The main thing that’s happening
is computer typesetting and word processing. You know, I appreciated
Ted Nelson’s videotape of real cut-and-paste. I remember that. I haven’t seen anybody
do that for 20 years. Once we had computer
typesetting, and I hadn’t realized
that Bush had worked– I mean, one of the other things
I’ve learned from Andy’s talk was that Bush had
worked on typesetting and that he had worked
on management theory. So that he was not
only the progenitor of information retrieval,
but he was also sort of the antecedent
of Monotype and Linotype modern typesetting
machines and of Dilbert. But we now had, because of
computer typesetting and word processing, we had large volumes
of material in free text. We also had online time sharing. All of the experiments
Gerry Salton did in the ’60s were done in a
batch [INAUDIBLE]. You wrote down your queries
and sent them into a system, you got it back later. And there was a
whole lot of talk about selective
dissemination of information. People would write
down lists of queries and leave them in the
background to be run. And this all was blown away
when time sharing came in. All of a sudden, you could
put up these real systems. You also had some early
examples of real cooperation. OCLC, for example, this was an
organization now called the– it was originally called
the Ohio College Library Center, that being an
insufficiently expansive name. It is now the Online
Computer Library Center. But OCLC was founded to
distribute catalog cards in libraries. And there was a
cooperative effort. If a library got a book
which hadn’t been cataloged in the file, they would catalog
it and enter it in the file and other people would
then retrieve that record. And if you entered a record,
you got the next 10 records free or something as an
incentive to do that. These were all very
limited search systems. But they were very popular. A lot of promises were made. Why don’t we look
at the next slide. People believe that,
you know, there was going to be a change
in libraries, last chance before everything
goes on microfilm. And now if we could look at
the top half of the next slide please. There are earlier
statements about microfilm that will remind you of
some of the statements that have been made over the years
by the more extreme proponents of artificial
intelligence and hypertext in which people say that
microfilm, micro photography is one of the most
important developments in the transmission of the
printed word since Gutenberg. And so what I have to say from
this is that hypertext did not only not invent text as
they would have you believe, they didn’t even invent hype. [LAUGHING] [APPLAUSE] So what happened in the
’70s in the research arena? Well, again, we’ve
still got this tension. We still have the people
doing statistical processing. And now, in fact,
they get a new weapon. Keith van Rijsbergen shows up
with probabilistic information retrieval and introduces even
more statistics to an area. I should confess that I’m not
terribly sympathetic to things like statistics, you know. I was taught as a
chemist that statistics are what people
do when they can’t get their apparatus in
good enough adjustment to get convincing answers. But we went on. And the AI people were
getting into trouble now. They had made lots of
promises about what they were going to do
with machine translation and computational linguistics. And to get away from these,
they had started making promises about speech recognition. And the thing they did they
got into information retrieval was they invented
expert systems. Now, the easiest way to
think about expert systems in the late ’70s
and early ’80s is that they occupied
the same buzzword niche that intelligent
agents do today. And people wrote, for example,
“the 1980s are very probably going to be the era
of the expert system. By the end of the decade, it
is possible that each of us will telephone an
expert system whenever we need to obtain
advice and information on a range of technical,
legal, or medical topics.” In fact, when I went
out to get quotes like that for some
talk a few years ago, it was shooting
fish in a barrel. They were all over the place. People have wildly
different views about the importance of this. You know, my name is Lesk. It’s a very rare
name, and it’s not an English word or a word in
any other language that I know. When I do a search on
“Lesk” in the database, if I don’t find myself, it’s
because I found my brother or my second cousin. That’s it. Stu Card is in the room. I suspect Stu Card does
not feel the same way about dumb searching
of ordinary strings. There was a problem, though,
that none of the AI systems really seemed to
generalize very well. Roger Schank was perhaps
the best proponent of AI systems in
the ’70s, that we’re going to introduce higher-level
language processing. And their idea was that every
document could be mapped into standard frameworks. For example, there’s
a very large number of medical articles
that boil down to, we have a batch of rats. They’re all suffering from
such and such a disease. We gave them such and such. And some of them got better
and some of them died. This is how many
in each category. And Schank’s group would try
to construct such schemas for many areas. And then they tried
to fill them out. They produced a lot of
argument about whether this was for real or not. There were other systems like
Bill Woods’ LUNAR system or Stan Petrick’s
Transformational Question Answering
that were evaluated, but they tended not to
work off ordinary text. So we weren’t getting
very far on this. But we still had this tension
between people who said, no, all you need is statistics, and no, intellectual
analysis will help. But everyone was going
away from manual indexing. They all said, well, we can’t
afford to do manual indexing. So if we’re going to do
intellectual analysis, it’s going to have to
be done by machines. And we’ve got to
get machines that are smart enough to do that. So now we get on to the 1980s. And a couple of things
go on in the 1980s. We have a steady increase in
word processing to the extent that it becomes impossible
to buy hot-lead typesetting machines. And they only exist as
sort of craft devices. And the price of
disk space goes down. So everybody starts
to think of, ah, these are going to be
the new alternatives. There’s a little data
on the next few slides. Go to the next slide, please. [INAUDIBLE] any data. This is a Berke Breathed
“Outland” cartoon in which one of the characters
comes over and says, you know, could I borrow– Oh, no, he said,
we’re on the brink of a gleaming, digital upheaval. 500 cable channels, so many
sitcoms, so little time. TVs, telephones,
computers merged in one, our lives awash with
instant visual input. And well, what do you want? Well, I actually
wanted to curl up with your copy of
Winnie the Pooh. And you can see
in the last frame that the guy is here
staring at a compact disk. And that’s what– [? NELSON: ?] Hey, Lesk. LESK: What? [? NELSON: ?] These are the
slides I was gonna close with. LESK: Sorry, well, we’ll
go onto the next one then, the next slide. And the next slide
is why the libraries are starting to
chase this so hard, why there’s a business here. This is what university
library budgets look like. Your typical university spends
3% to 4% on its library. Unless it’s the
place up the river that you all are afraid of
which spends about 6% or 7% on its library. Actually, there
was a nasty comment about Harvard made earlier. I’ll make a nasty one back. All right, I’m talking
about the seven ages of man in Shakespeare. Everybody who knows which play
it comes from, raise your hand? Got you. The answer is As You Like It. But it reminds me of the story
about the time the bus fare in Cambridge went from $0.50
to $0.60 and some kid gets on and he pays the old fare and
walks past without realizing. The bus driver
calls out and says, hey, you, are you from MIT and
you can’t read or from Harvard and you can’t count? Anyway. Most universities– 3%
to 4% on the library. And it’s not going up. Where does the
library budget go? About a third of it goes for
purchases, about half of it goes for salaries, and
the rest goes for other. This is not quite fair. Most universities
don’t monetize space. About another third would have
to be added if the library were charged the fair rental
value of the buildings it’s cluttering up. Of what’s spent on buying
a book, where does that go? The right-hand column
is the breakdown of the prices of books. Retail markup is like 40%. Distributor gets 15%. Printer gets 15%. Publishing office gets 20%. Author gets 10%. What that means is,
if you say, suppose we blew away this system,
suppose, you know, what Ted Nelson wants happened. We were getting stuff directly
from the author to the reader. And we didn’t have to bother
with paying for the printer or paying for the librarian
to put it on the shelf. You know, relatively little
of that money is needed. So there’s a lot of potential
economic gain in this system if we can do it. Can we do it? The next slide is the price
of disk space over time. And this is a chart of
how much disk space– And you see it’s going
down very nicely. And now I’ll tell you what the
people in the back can’t see, which is it’s a log scale. Between 1971 and now, we’ve
had 100,000 fold decrease in the price of disk space. So that’s why these things
are becoming possible. Next slide is, as a
result, the increase in the number of databases. But an interesting
thing, the top green is the number of online
commercial databases. The red line near the bottom
that’s going up much faster is CD-ROM. And then the mag
tape is the blue line that’s actually going down. So what we’re seeing is a
switch to CD-ROMs, which are becoming increasingly popular. CD-ROM is one of the big
inventions of the ’80s. Let’s go on– well,
let me just talk for a minute about the ’80s. Okay, by the ’80s, we’re up
to the 40 year mark, right? So the world is
now getting mature. And what that means
is, yes, these systems are commercially available. Lots of people have
personal accounts on DIALOG and CompuServe and
they’re looking things up. And people are actually
getting used to OPACs. Ordinary people walk
into libraries now and they get met with
computer terminals, not with card catalogs. Some of them write
nasty articles for The New Yorker magazine
and make trouble for us. But basically, most
people are pretty happy. What is annoying people is
that the research community has been cranking along
all these years, right? And they’ve invented
probabilistic retrieval and relevancy [INAUDIBLE]. And none of it’s in use. All these commercial
services are doing dumb free-text searching. They don’t think they need
either intellectual analysis or even good
statistical analysis. The AI community, meanwhile,
is down on this expert system and knowledge representation
language stuff that imagines translations
into unbelievably sophisticated and
complicated languages. I mean, Feigenbaum got the
Turing Award last year. I think he should have
gotten a special award from the Department of
Commerce for sending the Japanese down the
fifth-generation rat hole. And the enthusiasm, in the
early part of the ’80s, this stuff was riding high. In the middle of the ’80s, we
got what is called AI winter. And we have now gone
into a world in which instead of people believing
that it was possible– I should say, the idea that
you can take natural language descriptions of
subjects and turn them into a single
artificial symbolism predates artificial
intelligence. There’s probably
one other person in the room who
remembers the name Jason Farradane, put your hand up. Dick? Where’s Dick Marcus? You must know Farradane. He’s the only one. Farradane was this British
information retrieval guy who came up with this
amazing nine operator language with special symbols for
each operator, you know, and he believed that
if he encoded something in this language, he would
get 100% precision and recall. Everything would be perfect. Well, it’s the same thing as
translating into [INAUDIBLE] or some language like that. Most people now are
sort of almost back. There was a famous linguist
named Benjamin Lee Whorf. Whorf had a theory that, in
fact, language constrains thought. What you will think
depends on the language you use to express it. And many of us are going back
to believe things like that. Okay, so now we’ve made
it through the ’80s, things are going fine. Basically, the intellectual
analysis people are in full retreat. The statistical people
are riding high. All the systems that
are running are dumb. And we get into the 1990s. And the world is now, you
know, in its late 40s, and it’s time for
the midlife crisis. What is happening? Well, what happens
is the internet. Even one year ago,
there were people at Bellcore saying
to me, only 15% of the computers in people’s
homes even have modems. Who cares about the internet? Nobody says that anymore. I now see charts suggesting that
by 2003, the entire population of the world will be on the net. Now, what is remarkable is
that everybody is providing it and everybody is
classifying it themselves. You know, it’s the revenge of
the Bush people over the Weaver people. The Weaver people have
been winning for 40 years. They’ve almost won. And now all of a sudden the
Bush people make a comeback. And there’s all
these people, you know, organizing
their own stuff. Now, admittedly, there
is a lot of problem with quality on this. You know, I used to be taught
as an example of probability, the statement that if you
took a million monkeys and sat them in a
million typewriters and gave them long
enough, they’d write all the works of Shakespeare. The internet has proven
that this is not true. [LAUGHING] [APPLAUSE] Sorry? There’s this vast amount. And the question is, how
do you sort through this? I mean, the Lycos
people now think there are 10 million
pages on the net or more. You know, everybody now
thinks, you know, in the ’80s, I first met with the
expectation that everyone is supposed to have a fax machine. And now you meet with the
expectation everyone is supposed to have a home page. The world’s hard disk
industry will ship over a terabyte of disks this year– two megabytes for every
person in the world. I can remember not
too long ago when the people at Bell Labs,
not a stingy institution, were bugging me to keep my
disk space below 200 kilobytes. The next thing that
we’ll talk about is, well, what
are we going to do to get some of the
quality valued information into the libraries. And one of the
answers is scanning. One of the big things
that happened in the ’90s, is the rise of scanning. So the next slide
shows an example. I figure I have to sooner or
later get to my own research. This is a sample of the CORE
information retrieval system which we built at Cornell. This is based on scanned
chemical journals. This page is an
image of a real page. This is also an image. You’re looking at a
picture of the page. Somebody keyed the page
in in Columbus, Ohio, took the computer typesetting
tape, printed it on paper. I took the paper, fed it back
in through a scanning machine to get that. You may say, that’s dumb. The next slide is
the alternative. It is an ASCII thing,
in which in this case, it’s a different Bellcore
retrieval system. But in this case, this stuff
is regenerated from ASCII. We did some
experiments– or actually “we” is Dennis Egan who’s
sitting in the back somewhere, people can read both of these
about equally quickly, so they both work. And they both work if you
have to search for something a lot better than paper. So you now see a lot of things
coming out in image form, a lot of libraries doing things. Some more examples of
how you get things in. The next slide shows
another alternative of how you might get things
into the digital library. Mozart writing a digital
version of Symphony number 38. This is not what you do. Let’s see, what’s the next? Put up the next slide. The next slide is an example. I was helping some people
who wanted to digitize Charles Sanders Peirce. So he wrote 100,000
pages of manuscripts such as that at the bottom. And you could scan it in, and
the scholars could read it. I should say the sample here is
the page when he was applying for money to the
Carnegie Foundation to publish his works in 1900. They turned him down. In 1992 about, the
people who wanted to scan all his
manuscripts applied to the Carnegie
Foundation for money, and they were turned down again. So scholars never learn. Let’s skip the next
slide, go on to the one after that, which is an example. Once you get into images, you
can do lots of other things. This is a British Library
1771 map of New England. The black is the original map. The red is an overlay of
where the boundaries really are, assuming that
the geological survey today knows what it’s doing. And if you look at
this, you’ll see that the latitude
is reasonably good, but the longitude is
somewhat mucked up, which is what you’d expect. The next slide is
another scanned map. And this slide is one of the
reasons why you’re not seeing these, you know, from some– I’m sorry, rotate it
90 degrees, please. Right. This is Manhattan
Island in 1775. And the reason
that I’m not really doing this on a SparkBook– the image from which this
comes is 50 megabytes. And it takes a little
bit long to put up. And the view graph
goes up very quickly. Simply, as a matter
of attractiveness, we’ll put up the
next slide, which is another map, Lord
Jeffery Amherst’s map of New York in 1770. The next slide is to move
on to still other media. There’s lots of talk
about other media. I like to listen to the radio. So I’ve got this
system where I’ve got a radio plugged into my
workstation permanently tuned to the public radio
station in New York. And every day it digitizes “All
Things Considered” and “Morning Edition.” And I can listen to them
at my convenience later. And I can actually listen
to them faster than normal. I can also clip out interesting
things and save them. And I can do things like
segment by looking for silences. The top bit is that I also
like to listen to the BBC. Well, you know,
they don’t really broadcast a strong
enough signal. So a colleague in
London has a radio plugged into her workstation. And it’s tuned to BBC Radio 4. And if you ever wonder why
the response on the net isn’t even good at 4:30
AM, it’s because there are people like me dragging
over the BBC from London so they can listen to it
when they get to work late. Let’s see, other things. Bush talked about OCR. The next slide is some random
attempt to evaluate OCR. And the top is a
newspaper story which would be perfectly readable
to you if you could see it. It got 80% of the words right. Better printing you get 90%. But if you’re counting by
words, we still don’t have that. We need to get there. The next slide is
another example of why you would
like to have images. And the problem, “cat,”
C-A-T, three bytes. That little FrameMaker sketch
was 1,000 bytes in FrameMaker. The picture on the
right is 12,000 bytes. So a picture is worth a
thousand words, but it costs it. I said the price of
disk space had gone down about a factor of 100,000. The problem is the
difference between video and ASCII is about 50,000. So nearly all of the
progress has been given back. What does that mean? In the 1960s and early
’70s, all sorts of scholars had keypunched one
thing, they had their copy of Paradise Lost or
something that they had keyed in that they
were doing research on. And right now it’s the same way. I meet people who have their
45 seconds of digital video that they’re experimenting with. So, all right, based on past
history, another five to 10 years we’re going to have really
large digital collections which will be video
collections, which we will have no idea how to search. So, all right, what is
happening with that? Well, research. Interesting thing
with research– there is suddenly a
text retrieval industry. All of a sudden– well, actually, let me back off. I want to talk a little bit
about the libraries again. And they’re getting
all this stuff. And next slide please. Here’s another quote for
the future, you know. “Libraries for books
will cease to exist.” To be true in 1984,
predicted in 1964. No, you know, it didn’t happen. Today, it’s another story. What are the relative costs? Scanning a book– well, let’s
put up the next slide, please. Scanning a book costs about
$30, between $30 and $40. And you need another $10 to pay
for the disk drive to hold it. To build a space to put
it on the shelf, Cornell– $20 for a book in the newest
book stack they’ve built. Berkeley is building a
stack at $30 per book. The top building there, the
British Library in London, will come in at $75 per book. And the French
National Library is $100 per space for
a book on the shelf. So the libraries are
now in a situation where within a
few years, it will be cheaper for them
to scan and not put up the space to put the book. And since many of them
are under strong pressure not to put up any more
buildings for other reasons, you’re going to see an awful
lot more of conversion. It’s already economical if any
libraries could get together. But they can’t do it right. Let’s see, we also have a
text retrieval industry. Next slide is the
online search industry. We’re at a couple of
billion dollars a year now among the different vendors
in which LexisNexis is biggest and Dialog is next. The next slide is the software
sales to support that industry. This is from 1990 to ’96. The rate of growth
in selling programs like Personal Librarian and
DynaText and things like that. So this is a thriving
business now. And by god, all of the
research technologies that have been ignored for
so long are now in use. Bruce Croft’s Center for
Intelligent Information Retrieval, which has developed
a lot of intelligent text processing algorithms,
provided the software for the THOMAS system at
the Library of Congress. Gerry Salton’s software was used
to make a CD-ROM encyclopedia. The WAIS system has
relevance feedback in it. So we have an active industry. And by god, it’s finally
using the research results. Politically, we
have a big thing. Vice President Gore, as
people probably know, decided that since
his father had started the interstate highway
bill through Congress, he would do the internet
and the national information infrastructure as
the analogous thing. And he constantly talks
about this schoolchild in his hometown of
Carthage, Tennessee being able to turn
on her computer and plug into the
Library of Congress. And so the feds have started a
major digital library program. Do me a favor, skip
the next two slides and put up the one after that. The black spots here are
the federally funded DLI groups at Stanford, Santa Barbara,
Berkeley, Michigan. Some other projects
are in green. For example, the JSTOR
project in Michigan is the Mellon
Foundation’s attempt to see whether it really
is true that libraries can scan as a replacement
for shelf space. And what Mellon
is doing is paying to scan 10 major economics
and history journals back to the beginning of
time, of their time, and seeing whether these
libraries can use these instead. We also got a return
to evaluation. I said that for 30
years, everybody had been using the same
collections for evaluation. And Donna Harman started
running something called the TREC
Conference in which people sent in– hundreds of queries were
run against a gigabyte of text. And you suddenly got
some realistic numbers. And what we learned,
not surprisingly, was that there is still enormous
scatter in the performance of these systems. The reason we can’t agree
on what a good system is is that it depends very much
on the query and the user. You could do enormously
better with best hindsight. If you could say for each
query, on this query, we’ll use that search
system, on the other query, we’ll use the other
search system. We don’t know how
to do that yet, but there’s clearly
gains to be made. So as I said, the midlife
crisis is, all of a sudden this is succeeding, but
it’s succeeding with manual content
analysis rather than just the statistical retrieval. They’re actually both
working together, sort of the Bush disciples
have made a comeback. All right, so what happens next? The next decade is the 2000s. IR is going to be 55 to 65. So this is fulfillment, right? We should be [INAUDIBLE] away
the money for our retirement. And our problem is
going to be, how are we going to do the image? We can get all
these images, video, it’s bad enough
indexing the text. At least we have
free text retrieval. What are we going to do with
all the images of the video? We’ve got to have
manual analysis for that at the moment. I don’t know how we do it. That’s going to
be the big issue. But it’s going to have to
rely on people doing it, at the moment. And that’s going to
require us to learn how to manage manual indexing
after many years in which librarians were sort
of viewed as, you know, well, librarians are sort of the
equivalent of quill pen makers in the typesetting era. Suddenly, we need these people. We need people who know how to
organize information and make use of it. Okay, but I think we
can get through that. Finally– 2010. This is– I’ve labeled it
retirement on my view graph. Put back the original
view graph, please. Shakespeare had it as senility. I prefer to be more optimistic. At this point, I think that,
basically, the conversion job has been done, that by 30 years,
you know– well, by this time, we will have had
enough of the scanning and enough of the
keying available and we will have a market
in which people can buy this stuff that we
will basically have the material we need online. We will need to
know how to find it. But we will have it there. And we can all go off to
study biotech or something. The printed books remain sort
of in warehouses somewhere. The library
buildings on campuses have been turned over to more
bureaucrats and administrators and something else. What might go wrong with this? I’m very optimistic, but what
are the problems going to be? One of them is internationalism. We’re doing just fine now on
the internet on the assumption that everybody in the
world speaks English. At some point, that’s
going to break down. And it’s going to be political
as well as technical. There’s going to need to be
more new kinds of research. I am a little worried
that 10 years from now, there will be
library departments giving PhDs in arcane details
of probabilistic indexing, carrying out the arguments
that Keith van Rijsbergen had with Gerry Salton in the 1970s. You know, we’ve got to get
onto some of these things. There will be a few
of us old fogies who even remember that we were
promised automatic language parsing and question answering. We’re still waiting for it. More seriously, I’m
worried about some of the social effects. CB radio, anybody
remember CB radio, right? You know, lots of trash. How do we keep the
internet from being overwhelmed with that stuff? How do we get a
role for the people who will sort out what is
reasonable and what is not and let us find that stuff? That is going to require
some form of charging. Bob Kahn already said, we need
to go to service charging. We need some leadership
in how to do this right. My problem is that,
you know, it’s hard to look to
economics for leadership. Even the Economist
magazine once said that an economist was somebody
who was good with numbers but lacks the personal
charisma to be an accountant. And there is a problem of
where do we get the leadership to build a system, such as
the one that Ted Nelson has envisioned, where people
actually do get paid and it works. At the moment, the copyright
law is a real problem. You know, somebody made
some comment about, well, we needed more legal stuff. I mean, I have been going
to computer conferences since 1969. So I remember the
days before there were lawyers in computing. I never heard at one of
those conferences anybody suggest that we had a
problem in computing because there weren’t enough
lawyers paying attention to it. There is a story that– Okay, oh, wait,
no, that’s right. IBM did prepare a
CD-ROM to commemorate the 500th anniversary of the
Columbus voyage to America. There is a story that they
paid over a million dollars to clear the rights for the
material used on that CD-ROM. That’s okay, but
only $10,000 of it was paid to the rights holders. All the rest went into
the administrative costs. We have to get somewhere. There is a real possibility
that the world of the future is everything published
after about 1990 is available because the publishers
have it and they’re making it available. Everything before 1920 which is
out of copyright is available, and everything in
between is falling into a black hole in which
every once in a while, someone tries hard to find
the people who have control and get the data. Finally, as I alleged
in an earlier question, I’m worried about
political problems. There are technologies
which looked as if they were going to
be successful and have been stopped by political
or legal liability issues. As I say, nuclear power,
childhood vaccination. We need to try to
design our world so that doesn’t happen to us. All right, but I don’t want
to end on an unhappy note. You know, it is possible
that we will all, you know, be drowned in information
water or something. But I think that it
will actually work. I think given the
rates of progress, Bush’s dream will be
achieved in one lifetime, in the lifetime of
this profession. Now Bush actually talked about
the information organization. He said there will
be a new profession. There will be
trailblazers, people who make trails on request. Now, part of the problem
is that today people think of a librarian as someone
who alphabetizes books and puts them on a shelf. The function that is known in
a library as “mark and park.” You label the book and
stuff it on the shelf. We need to get sort of a
higher status for this, whether it’s called
information trailblazer. I’m not as good as some
people at making up new words for things. I’m optimistic about that. Once upon a time, accountants
had to be good in arithmetic. Then computers came
along and made skill in doing arithmetic totally
irrelevant to the real world. Did that mean that accountants
became uninteresting minimum wage people? No, accountants took over
all the major corporations. Bean counters now run the world. So if computers go
in there and say, all right, alphabetizing
is no longer interesting, what happens to librarians? And I hope that, you
know, information becomes more valuable. Somebody earlier talked
about information as a sea. All right. The purpose of librarians
and trailblazers, whatever, the purpose of
those people in the future is now going to be to navigate. It is not going to be
to provide the water. That’s it. Thank you. [APPLAUSE] VAN DAM: [INAUDIBLE] LESK: Thank you. VAN DAM: Great. Let’s have some
questions for Michael. NELSON: How do you see– LESK: Ted? NELSON: I can yell
so everybody– LESK: No, no but then you
won’t be recorded, Ted. Please. NELSON: How do you see the
copyright issue as resolving? LESK: First of all, the
question was by Ted Nelson. I don’t– I would hope, frankly,
that some payment scheme is adopted. My personal guess is that it
will be a much simpler scheme than yours. When I talk to the
journal publishers and say, how do you want to
charge for your stuff online, they don’t say a
hundredth of a cent a byte or hundredth of a cent a minute. They say $25 a year because
that’s what they’re used to. When I talk to a book
publisher and say, what do you want to
charge for online access? $50 a person. So my hope is that there will be
larger and bigger units charged so that I don’t get into the
huge overhead of $0.02 here and $0.03 there. NELSON: You think
that’s a simpler system. LESK: Yeah, I do. You know what I
would really like? I’d really like the
German solution, the tax on blank tape. You know, the Germans do this. They want to deal with
over-the-air taping. And they put a tax on blank
tape and give the money to the German
society of composers. And I wish some mechanism
like that would work. I’ve never understood why
in the US political context a tax on blank tape, which
is all made in the Far East for the benefit of recording
artists who are all Americans doesn’t go through Congress. But it never does. Anyway. AUDIENCE: Michael, you
said there are only about a terabyte– VAN DAM: [INAUDIBLE] please. AUDIENCE: Raj Reddy– only
a terabyte of sales of disk. I don’t know where
you got the number. LESK: No, I’m sorry. 2 to the– it’s
the 10 to the 15th. I’m sorry. AUDIENCE: Good. LESK: Yeah, I’m wrong. It’s a petabyte. AUDIENCE: Yeah, at
least a petabyte. Last year, 50 million
pieces were sold. Even if each of them
had only 100 megabytes, it would be more
than a petabyte. It would be at least
five petabytes. It is probably
more like 20 or 30. LESK: The number that I got
came from Jim Gray giving a talk at VLDB last fall. AUDIENCE: It’s
changing every day. LESK: I mean that
may be the answer. That was last September. AUDIENCE: Last year we only
bought one gigabyte disks. This year we’re buying
9 gigabyte disks for the same price. The second issue I think
is more interesting, you said OCR doesn’t work. LESK: That’s right. AUDIENCE: One company,
[? Care, ?] has a $50 million a year product business. LESK: Yes. AUDIENCE: Obviously
a lot of people are buying it and using it. So none of the technologies
that you talked about and we are all
working on will ever be perfect, including
information retrieval. The precision and recall
is still pretty lousy. And it will never
get that much better. It will never be perfect. LESK: Yes. AUDIENCE: The question is,
how do you define what works and what does not work? LESK: When my friends
stopped sending things to be keypunched
in the Philippines because OCR is good
enough that they don’t need to do that anymore. AUDIENCE: And lots of people
are still doing it, I gather. LESK: To tell you a really
horrible story, admittedly, from a few years ago, a
friend of mine runs the– AUDIENCE: A few years is a
long time in this industry. LESK: OCR isn’t
changing that much. AUDIENCE: I’m surprised. LESK: Well, a friend
of mine was involved in having the publications
of the AIAA put into one of the online services. She not only talked to
the online service about, do you want the pages to
OCR, they offered them the typesetting tapes. The service said, nah,
we’ll send it to Korea. We got lots of people we can
hire overseas to keystroke. We can’t even find people to
decode your typesetting format, let alone to patch up the OCR. So I still see too much stuff
being sent out for keying. AUDIENCE: So the definition
is when most people use– I’m sorry. I agree with your definition,
when most people routinely use scanners for
use of information rather than sending it
out to be keypunched. LESK: When people use
OCR rather than keying. In the same way,
speech recognition I will recognize
as really practical when the business of typing
from dictation disappears. AUDIENCE: So the issue is, when
you type with a word processor, you make mistakes also. LESK: Yes. AUDIENCE: And you learn to
live with it, you fix it when you make a mistake. So the issue is when you make
a mistake with voice input, you’ll make mistakes. There’s no such thing as–
it’ll never be perfect. Pen input and voice
input, including keyboard input for
word processing, will never be perfect. LESK: I agree it’s not perfect. The librarians have a funny
attitude towards this. They read too many books on
total quality management long before there
were any such books. A friend of mine
made the mistake of saying in public that the key
stroking of the British Library catalog had introduced
50,000 lost books by errors in the press marks. He got fired for it. A more interesting story– many people have thought of the
idea of let’s scan something, OCR it, and because
the OCR isn’t perfect, we’ll use the OCR only for
searching behind the scenes. And we’ll display
the scanned image. This is the way the Elsevier
TULIP project works. I first heard this
idea from 20 years ago at the National
Agricultural Library. And I said, well, what
happened to that project? Answer– they started it. But they had the OCR. It seemed ridiculous
to have this and not send it out to the people who
were buying the scanned disks. But it wasn’t accurate. It was embarrassing to
send it out this way. So they started
trying to correct it. Then they couldn’t afford
to put out the product. So the whole thing died. And I said, you couldn’t
argue this one out and just distribute knowingly
inaccurate stuff, but saying for its
purpose it’s good enough? No, that’s not our world view. We want to be proud
of what we’re doing. Oh, forget it. AUDIENCE: Adobe is
selling a product just like the one you mentioned. You can buy it in the market– LESK: Yes. AUDIENCE: –for a couple of
hundred dollars right now. LESK: Okay. AUDIENCE: Samuel Epstein
with And one of the things
that you mentioned was that I guess by the year
2000 or some date like that, that all the
information that we need will be digitized and up on
the net ready to be retrieved. One of my concerns,
and I’m starting to see it now with a
lot of the smaller kids that we do some work with
is that the idea that, well, anything that we
need to find we can find by going into Infoseek
or Lycos and doing it. Regarding the parochialism
of data and the fact that even in a few
years or 20 years, that there’s going to be
a lot of stuff that is not on the net and stuff that
even never will be on the net, whether it’s coming out of
a rainforest or whoever, how do we as designers of these
systems impart to our users that this is not the end to all
ends of information retrieval and as seductive as it sounds
the reality is sometimes you gotta go crawl around a jungle? LESK: I know. And the unfortunate
answer, I think, is if you go to Western
European libraries, there are many
manuscripts in them, that have never been printed. The average manuscript that
survives in an old library has not been transcribed
to printing in 500 years. So what’s happened to this stuff
is that no ordinary person ever finds it. A few devoted scholars, shrinking
with university budgets every year, devote their lives
to crawling through this. So if somebody says,
does there happen to be any estimate of the
cost of having a horse shod in Germany in 1300, one
of them pipes up and says, yeah, I saw that 20 years ago. It’s in a library in
Dresden or something. The same thing, I
am afraid, is going to happen to the data that
is not easily available in electronic form. And this is unfortunate. And I don’t really know
what to do about it. Because I’m afraid that the cost
of converting the entire past is too high. Enough will be
converted to serve most people most of the time. And then we’re
going to be stuck. And I don’t know how
to get around that. AUDIENCE: Just as a follow up,
do you see, possibly– now, we’re starting to
see advertisement for professional
paid net surfers to go out there
and search the web. Do you see in the
future a potential for a career as a professional
analog information surfer? LESK: Yeah, I think
there will be. My problem is that I think
that the career for somebody who wants to dig around in
the non-electronic stuff will turn out to be
a few specialists, as with the people who
deal with manuscripts. I mean, there’s this whole
profession of archivists. And, you know, you
cannot get a job in it. So I just don’t know. I wish people would be more
interested in diversity and getting more
kinds of information. AUDIENCE: Thanks. AUDIENCE: John Smith,
University of North Carolina. I’d like to ask you two
related questions, if I could. First a factual question
and then kind of an opinion question. Your projection about
this or your vision is kind of based on, in
part, on this amortization of cost for storing and
providing access to books and the sort of declining
curve of cost for disk space. But what you didn’t
mention is what is the cost trend for moving
bits around the internet? Is that going up, going down? Are we seeing any kind of a
logarithmic decrease in that? Because it seems to me
that this vision really is predicated on lower
and lower cost of that. And the second question is
kind of a follow on to that, and that is I think a lot
of us think that the internet is a kind of God-given right. But as it expands in use, I really worry
about how it’s going to be financed to bring it
up to that level of service. And so I wonder what
kind of economic model you see in the future that would
make the internet universally accessible. LESK: It’s irresistible. Bob is the next person
at the microphone, and he can answer both of these
questions better than I can. AUDIENCE: [INAUDIBLE]
answer my question. LESK: And the reality
is, costs of transmission are going way down. I mean, for example,
when you were growing up, long distance calls cost a
lot more than local calls. Today, most large purchases
of long distance service pay irrespective of distance. You know, LL Bean will pay
something like $0.06 a minute, flat rate, anywhere
in the United States. So these costs are coming down. And all the costs
are coming down. You’ve given me the option. Ted gave out this paper. And one of the things he
says in this 1965 paper is, “the costs are
now down considerably. A small computer with
mass memory and video-type display now cost $37,000.” “Several people could
use it around the clock.” [LAUGHING] All of these costs are crashing. Now, how do we
amortize, or how do we arrange that people
like me who are perhaps using vast quantities
of internet service for frivolous purposes get
charged more than people who are only using a little? I don’t know how
that will play out. My gut feeling is that
the bandwidth regulation model we have now, which is that
Bellcore pays more because it has a T1 link than people
who have a 9,600 baud link, is not entirely unreasonable,
some sort of peak bandwidth limitation pricing. But I don’t– the economists are
going to have to sort that one out. KAHN: Bob Kahn. Michael, I’d like to
get you to speculate about retirement in 2010. In particular,
given that this is a birth-to-retirement process,
the presumption I make is that you think that the
information retrieval problem will have been solved by then. I’m afraid that
that may turn out to be more hype
than it is reality in that people’s expectations of
what the information retrieval capabilities will be will be
far greater than the reality. So can you give us some sense
of the scope of our information retrieval capabilities
at retirement? LESK: Oh. What I believe
will have happened is that there will be
vast numbers of people, for reasons I probably
won’t understand, who have put together enough
trails on enough subjects that when I sit
down and I say, I want to find a
photograph of Bob Kahn, there will be a way to do it. Now, I don’t
understand all this. There are some
subjects on the net, you know, if I want to
know the names assigned to the locomotives of any
class of British railways, I assure you I can
post that question, and 30 people will
fall over themselves to show off that
they can answer it. If I were to say, what fraction
of current computer programs are written in Pascal? I am also quite
confident I would not receive a single
competent answer, although I would receive
many idiotic opinions. I don’t know how to arrange
that for every subject, there will be some
bibliography equivalent. But I believe that
is what it will be. What I’m saying
is, the Bush people are eventually going to
win over the Weaver people. We’re not going to be able to
do the image and sound and video analysis. I know, Raj, you
know, I know there’s a nice demo at CMU
of how we’re going to solve all these problems,
but I’m not convinced. Whereas, I am amazed
by the number of people on the web willing to
spend their time collecting every locomotive picture,
every picture of a pet rabbit, and making lists of them. [LAUGHING] Someone objects to
the pet rabbits? Okay. VAN DAM: Terrific. LESK: Okay. VAN DAM: Great. Great way to finish. [APPLAUSE] Okay, we’re going to go
for half an hour max. We’re going to take
some local questions, and then we’re going to take
some questions via the MBone. All right, who will be first? Okay, general or to
anyone particular? AUDIENCE: Perhaps both. But I think I’ll direct the
question first to Robert Kahn. My name is Ricky
[? Goldman-Siegel, ?] and I deal a lot with video
annotation tools. I was concerned about one of
the things you talked about, which was this notion
of malleable content. And I guess what
worries me and what I think about is that
what happens when the content that we have, let’s
say the video images are images of people in our own
research, that we’ve taken in our own research
of children and adults doing all kinds of things. Now, we have ethical
approval to use those. We might even have
ethical approval to use them on the internet. But what happens when someone
else uses that and takes it out of context? It’s not just a
matter of authorship. Because let’s say I’m the
author, I’m the research. Okay, I have the authorship. It’s not just an issue
of property rights. It’s an issue of ethics. And no one in the
panel today, no one in any of the– none
of the speakers today addressed the issue of the
ethics of malleable content. So could you talk
about that, please? KAHN: Well, I mean,
malleable content is not exactly a household
word in most quarters. And even today in the
world of what I would call hard copyright– I don’t mean hardcore copyright,
I just mean hard copyright– there are moral rights that
do attach to that material. You might think that
after copyright runs out you can do anything you want. But under the Berne
Convention, there are still moral rights
retained by authors. So for example, if you were to
try and do syncopated whoever and the estate of
that musician did not want you to do that
because they thought it impinged upon the moral
rights of the musical works writer, they could prevent
you from doing that legally. Similar kinds of constraints
occur in other related areas. So moral rights
even pertain when you have hard
copyright where you think it’s now in
the public domain, but it was originally protected. My sense of what’s
needed is that people need
to be able to state what it is they’re
willing to have happen to material that
they have rights in. And if somebody
really does not want somebody else to manipulate
material that they have rights in, that is their prerogative. And therefore all the trappings
of the law that would normally pertain in whatever part
of the world that you’re in ought to apply to that. On the other hand, I can see
many people providing material where they say you
can make changes. You know, for example,
suppose that I had created a film called, I
don’t know, Gone with the Wind, pick your favorite. But I put it out into
the world of cyberspace in a form where the
storyline was out there, the emotions of the
characters were out there. And the only thing that
was not really fixed was the actual choice of
characters or the location. But all the rest
of it was sort of pinned down in the storyline. And let’s say my guideline
was fix the storyline, can’t change that, but
you can change the party. So you can watch Gone
with the Wind starring you as Scarlett, for example. And that might be– AUDIENCE: [INAUDIBLE] KAHN: You like that? That would be okay
with me, okay? But maybe I put out
another one where I sort of had the
start of a good story and it was really more
like interactive fiction and you could change the
storyline by virtue of actions that you took. And that was okay
with me because you aren’t changing the core
that is manipulating the changes of the storyline. And that was what
I put out there. There may be somebody
else who says, well, I’m going to
put out a blank slate and you can put anything
you want on there and they consider that a work
although somebody else might not. So I see a whole range
of possibilities. I just would like
to make it possible that we can deal with derivative
works out there or additions to works. Here’s another case in point. Nobody can stop you today
from doing a visual overlay on a wall. So if you are doing a
simultaneous showing of A and B, two different audio
visual works, then what you see is the composite of those two. And somebody may object
for legal reasons that you’re affecting
the original, but you’re actually not
changing it directly, you’re just changing the visual
experience that somebody has. So I could easily see
somebody giving you a template that if somebody
else created one that would do a projection
that would overlay on it, in fact, that
might be okay, too. NELSON: Can I
piggy-back on that? I’d just like to,
once again, explain how the transcopyright
seems, to some people, to solve this issue. You see, what’s been at
the end of the trail here has been the problem that
all copyright owners have under this Berne
Convention of being able to stop you from taking
things out of context. And meanwhile, people are
quote “downloading” things on the network and inventing
for themselves all sorts of schemes in
their minds that make it reasonable and correct
and with total ignorance of the law, basically. So the transcopyright
proposal is essentially that anyone can reuse this
arbitrarily in a new context pre-permission for that new
context with the understanding that all the pieces will be
bought from the originator. Now, what about
altering the bytes? That’s where it
gets into trouble. So the only way the
transcopyright thing works is if the bytes are always
obtained from the originator. So if you want it to be
stretched and morphed or something like that,
what the map then contains is the directions for how to
stretch and morph these things once you download them. And that gives the
desired result. So malleability is possible
within that framework on a strict basis. The reason that some people want
to talk about the moral issue and others of us want to talk
about the legalistic issue is that moralisms can
blow away with the wind, whereas if you can set
down some guidelines that can be implemented as
a workable solution that people can live with that
could have a long term effect. BERNERS-LEE: I could offer– I can offer an alternative. Hello? I can offer an alternative. The alternative point of view to
suggesting that transcopyright should be mandated, basically
should be a flag that you turn on or turn off is to say– I’ll make two observations. First of all, to observe that a
system which tries to constrain how people behave doesn’t fly. So if you, for example,
make a documentation which requires everybody to write
in a given word processor, it doesn’t work. If you try to make
a system which changes the way the
relationships that people have, which forces them
into some mold, even if the CEO of the
company of 50,000 people mandates that
everybody use it, they will use it under coercion,
and it won’t really work. So the reason that
hypertext is neat is that it’s very
unconstraining. It allows a lot of flexibility. Now, the second
observation is that when you look at the agreement under
which information is passed from one person to
another or anything else, for that matter,
goods, Corn Flakes. But when you buy Corn Flakes
you think at the first level that there is an
exchange whereby money goes in one
direction and Corn Flakes go in the other direction. But in fact what
you’re getting is you’re getting the Corn
Flakes, and on the packet there’s a UPC bar code,
which, if you send it in, it will get you 100 American
Airlines free miles. You can send the
coupon in with $10 and you can get
yourself a free camera and 10 free rolls of film. And if you combine
it with the UPC from a particular brand
of washing powder, you can get a free
ticket to the theater. So there are an incredible
number of different– This is a very complex
agreement involved. There are an incredible
number of different kinds of legal tender and
illegal tender which have crept in here. So that if you’re going to
allow this sort of behavior to go on, which marketing
people seem to need to do, they seem to need to have 16
different types of train ticket and 18 different types
of airplane ticket so as to extract the most
money from the populace, then if you’re going to
represent that in a system, if you’re going to represent
the commerce and the agreement on the network, the network
has to be extremely flexible so that when you
pass information from one person to
another, basically you have to be able to
pass an arbitrarily complicated expression
of the license terms. And, yeah, I mean, even when
you buy it, think of a video. What is it, really? What do you get the right to
do when you take a video home from the video store? It doesn’t have a
complex license on it. But you get the understanding
that you can show it under reasonable terms
as long as you’re not on a coach or an oil
rig to people who are, well, in your family, well,
say, in your house. The number of people
who can reasonably cluster around the television. And things like that. And it’s reasonable to watch
it twice as well as long as you get it back by the date. If you try to write that
sort of thing down in Lisp, it gets frightfully complicated. If you don’t write
it down in a Lisp, and you send the
thing over a network and it sits there on a proxy
server and it’s cached. And then somebody else
asks for the thing. This is video. This is serious network traffic. So the proxy server
is very interested to know whether it
can give you a copy. If it doesn’t understand
the license agreement because it can’t
read it, it can’t work out whether it can
A, give it to you for free or, B, give it to you and
charge you a certain amount and pass the cash back. So in fact, this is a
really big, hairy problem. In fact in solving
that problem, if you can find a sufficiently
powerful solution, you may solve a whole lot of
other problems accidentally. VAN DAM: Given the fact that
this is very complicated, I think we should get off the
topic and go to another one. Otherwise, I’m
afraid we’ll be stuck with this very interesting
one for the remainder of the period. Roger. AUDIENCE: I’m Roger Bloomberg. And this is a question
for each of you, if you’d like to answer. As you know, in
Bush’s essay, he cites the disappearance of Mendel’s
paper as a catastrophe that, in reforming modes of
communication and transmission, one must avoid. So I’d like to ask each
of you to speculate about the future of reforms
in modes of communication and transmission. What catastrophes
should we avoid? NELSON: I’ve got the microphone,
so I’ll say it very briefly. Basically, I see ours as
an age of information loss. And as we digitize
more and more, the formats get
crazier and crazier. I’ve heard that NASA actually
has a job designation called data archeologist. And the great problem of being
able to see the same document twice years apart is
it’s very important to be able to see
it the same way. I’ll tell this very quickly. When I was editing Creative
Computing magazine, I saw a great
article on Smalltalk that I wanted to run from
an obscure English journal. And it was so much
better than that piece that Alan Kay had written in
The Scientific American, which really ticked me off. And I was very irritated with
Xerox PARC and their attitudes, anyway, though I liked Alan. So we got permission
to reprint this article from this obscure
English journal. And guess what? It was plagiarized. It was Alan’s article, in which
they changed every occurrence of the word “we” to “they,” thinking
that this somehow guarded against a copyright violation. Now, that was the same article
I had read several years before. And I was a different person
when I read the article. The article had not changed
except for “we” to “they.” And what this shows
is the importance of being able to
know that you’re seeing the same thing twice,
even though you have changed. And if you change and
the documents change, nothing can be kept track of. [LAUGHING] [APPLAUSE] LESK: I agree that saving
all data is a problem. I believe that there are several
task forces working on this. I think that one can be solved. The problem that
worries me much more is the potential for
the loss of diversity in information sources. It is not clear whether the new
technology with very low cost for making many
copies of one thing will drive us in the
direction of having one extremely glitzy
multimedia college physics textbook instead of
20 easier-to-prepare, ordinary written textbooks. And I don’t want to see the
technology change in such a way that we lose diversity of
information preparation. And that means we have to work
on making multimedia authoring easier. And we have to
work on, you know, preserving symmetric
access to networks rather than, let us
say, direct broadcast satellite for everything. NELSON: And ways of
freely merging material. LESK: Well, merging– not
necessarily freely merging, merging with low
administrative cost. NELSON: Unfettered merging. LESK: Unfettered– I’ll
go with unfettered. AUDIENCE: I would just like to– VAN DAM: Name? AUDIENCE: –bring us back to– VAN DAM: David? AUDIENCE: Yes. VAN DAM: Your name, please? AUDIENCE: I’m sorry.
I’m David [INAUDIBLE] from Brown University. I would just like to bring
us back to Vannevar Bush. He was, as you know, an
extraordinary manager. And one of his concerns
was directing the efforts that he was in charge of toward
essential and specialized tasks. To what end do you see the
World Wide Web, the internet, electronic text,
electronic libraries, if you care, becoming essential
services in our society? And if they are becoming
essential services, how do we justify
our taxation of them? Or how do we go
about funding them, in that case, which is something
you’re already addressing, of course? KAHN: Well, having
the mic, let me just pick a piece of that to
address, rather than deal with it totally generically. Because I think every instance
of capability in society can be, in some ways,
addressed separately or should be
addressed separately. In the case of the internet,
the thing that is really the most viable attribute
there is the connectivity that it provides in
an open architecture framework at the moment. That may change with
more functionality, but that, I would say, is the
basic element that’s there. I don’t see how that
is going to disappear. But the form and shape
in which it’s provided could change fairly
dramatically over time. This is a marketplace
right now for services. And every instance
where a capability can be provided as a
marketplace service, then the market, it
seems to me, will deal with those issues
as time moves on. Things may evolve. Other parties may come in. Some may drop out. Prices may go up. Prices may go down. Parts of it could be subsidized. Parts of it could be paid
for in straightforward or indirect ways. But it seems to me that at least
at that level what you have is a basic service that’s
out there in the marketplace. Now there are areas
where oversight has turned out to be important. This was nurtured from
really very little through federal government, US
government efforts initially. Today, the US government
still plays a significant role in the oversight process. But more and more, the
participation in that process has become broader. It’s involved folks from
the commercial sector. The universities have been
involved from day one. And it involves international
participation. And I see that continuing. But the participation
is at the level of ensuring that the
process by which things work is maintained effectively,
not in the provision of individual services. People may want to talk
about your other points separately as well. VAN DAM: Why don’t we
take two more questions, and then we’ll switch to MBone. Eric? AUDIENCE: My name is
Eric [? Nelson. ?] And what I’d like
to ask is if there is another
McCarthy-era-style witch hunt sometime in the
future, will it be made more vicious by
today’s new tools? Specifically, two issues,
one is does one leave a trail when one uses the network? And one is if information
that is about you can be processed faster
by parties unknown, do you necessarily want that? BERNERS-LEE: Doug, you
haven’t said anything yet. ENGELBART: Well, the
criterion for my answering this is that I haven’t
said anything yet. [LAUGHING] NELSON: You’ve said more than
enough, but please go on. ENGELBART: That’s the kind
of thing that interests me a lot is the long-term impact
of how our social processes will change. And some of them can get bad. And so the only
thing I look at is saying if you get there first
with the good stuff that’ll help you solve the
bad stuff or not. And so I’ve really been
focusing all the time on how can people
that are collectively trying to really
cope with a problem in a straightforward way do it. And it’s all been
focused on domains in which you assume
that people are agreeing to collaborate on it. So the copyright issues
aren’t there in a salient way. So therefore, I’ve said it. And I’m not much good at
answering your question, I think. He’ll always say something. NELSON: I came out
of– in high school I read Fahrenheit 451 and 1984
and The Space Merchants and all those terrible government
conspiracy kinds of things. And so I’ve always thought the
government was a conspiracy. And when Lee Harvey
Oswald was shot, I predicted the shooting of
Lee Harvey Oswald, Debbie, didn’t I, three minutes
before it happened? And so for years
during the ’60s, I was an assassination
buff and just sure that the entire structure
of American society was controlled by
some terrible group. It was so obvious. I mean, there’s so much
evidence everywhere. And gradually, this hypothesis
has fallen of its own weight because the number
of people involved in the conspiracy
at a witting level had to be in the
hundreds of thousands. But it is certainly
the case that there are new tools out there
that people are figuring out how to use [? adversarially ?]
in new ways every day. The creation of any
new tool creates a new adversarial
method, perhaps, just as the creation
of any new capability creates a deprived class
and a privileged class. People talk about how are we
going to make sure everyone has equal access to information. Give me a break. When did anyone ever have
equal access to information and how can it ever be possible? Those of us who
spend 48 hours a day trying to get as much
information as we can are obviously going to be
in a different category from couch potatoes. Okay, so anyway here. BERNERS-LEE: Good answer. Good question. This is really– I think it
concerns a lot of people. It certainly concerns me. This is largely the
import of my talk. Civilization is walking a
road between the mountains of despotic dictatorship
and the swamps of terrorism. And every now and
again, somebody feels that we’re veering in
one direction and drives hard in the other direction. So how can we ensure
that, in fact, we stay in the green
pastures in between? Perhaps one of the
answers is making a sort of [? fractal ?] society. The interesting thing
for me that Ted said there is, when the
required number of people to have been in the conspiracy
approached, what is it, 10,000? 100,000? At that point, you
realized that they couldn’t do that without
you knowing one of them. NELSON: Right. BERNERS-LEE: So in other words– NELSON: Or being one of them and
not knowing it, which is worse. BERNERS-LEE: Right, who knows? [LAUGHING] So it’s a question of the
number of links in a way. If society is structured right,
then you will know somebody sufficiently well who knows
somebody sufficiently well who knows somebody who knows
the president that you trust the guy or you don’t. Or if it’s too many, if there
are too many levels, then you and all the people that
you know and all the people that they know can all have
the same conspiracy theory, can all be meeting in
the woods doing nasty, mean things with
explosives and seriously think that they’re right. VAN DAM: One more question. AUDIENCE: I’m Rosemary
[? Simpson. ?] And this question is for Ted. You mentioned the
problem of spaghetti webs and the issue of
information programming and alluded to
Dijkstra in your talk. And I wonder if you
could address that issue. What solutions you would
propose for enabling us to create better-structured
hypertext, given the tools that we have now? NELSON: Well, that
last codicil, that’s, again, like what kind of ketchup
do you want on your cow patty? And the important thing
is to improve the tools. Right now what Tim has given
us in the way of addressability of links has been fabulous. We have to go on, make
these bi-followable, again, not bi-directional. Because links can be directional
and followable from both ends. And secondly–
transclusion meaning that you can look at
the same thing in some of its many contexts. And transparallel
visualization in browsers. So transparallel
visualization could be added to Netscape
or any other browser. You say, just, well,
keep the thing you’re jumping from on the left
and add the thing you’re jumping to on the right
and sort of scroll leftward as you keep jumping
through things. Isn’t it amazing that hypertext,
which has non-linear structure, is stored by Netscape
and Mosaic as a sequence? You say back and forward
in the structure. What does that mean? You’re talking about
back and forward in time in your history
of the structure. Anyway, so that trans-parallel
visualization and transclusion are my answers, and it doesn’t
matter what the question is. [LAUGHING] [APPLAUSE] ENGELBART: This is
the kind of thing, I think, that Ted and I
have waved at each other across the frontier
for many years, that I think there’s
more to it than that. Eloquently spoken,
but there are a lot of tool changes
in the things you can do in there to make a
different way for moving around, for visualizing
what’s going on, for constructing views that
are interesting and supportive, and for getting conventions
in the way you structure and tag things so that they’re
more useful, et cetera. So there’s a lot to explore. And I think it’s not going
to happen unless there is a purposeful push
just to try to make a better constructive way in
which knowledge is dynamically developed by a collective group. And so– but thank you
very much for the ideas. VAN DAM: Let’s switch to MBone.
Let’s lower the lights a little so we can see the question. CREW: Not yet. VAN DAM: Not yet? Oh, I was under the impression
we had some things cued up. Okay, in that case, somebody
else from the audience. Hey, Raj. AUDIENCE: Raj Reddy,
question for Michael. Do you think we’ll
get to the stage where all the libraries will
stop funding the current 3% and go all electronic? And when should they do that? LESK: You’re saying
when should lib– AUDIENCE: –we stop funding
the current library systems and go to all electronic
digital libraries? LESK: This obviously is going
to be a progressive thing. There are pharmaceutical
companies today where 80% of their
acquisitions budget goes to purchasing
electronic materials. There are publishers today
where more than half their– publishers like the
American Chemical Society that have a paper tradition
where more than half of their revenue comes from
bytes rather than pages. So there is a
transition happening. What I am saying is, I
believe that by 2010, most of the information
that is needed by ordinary students and
ordinary faculty members will be coming electronically. And that will be the
major acquisition item in a library budget. They probably won’t
be buying objects. They’ll probably be buying the
right to access the object, in some way. AUDIENCE: The issue is,
what is the transition path? Namely, those of
us in universities who have a $3 or
$5 million budget have to worry about how
to do it without impacting a lot of people and
without impacting students. And what is the right
transition path? LESK: Actually, my experience
of a number of universities is I know more
library directors who are anxious to move in this
direction being held up by some faculty members
demanding that the library keep buying books than
I do the reverse, places where the faculty
is beating up the library because it keeps buying books. Every university
librarian will have a story of how they have had
to cut back journal purchases. Because one of
the odd statistics I tried to calculate
recently is if you plot the statistic, books
purchased by academic research libraries in the United States
divided by number of books published in the United
States year by year, it’s been going down
steeply for 20 years. It’s now back to
where it was in 1920. So every librarian has
stories of cutting back journal subscriptions. There are screams from the
faculty when they try it. After it’s done, nobody notices. So the answer is,
unfortunately, fairly clear. Librarians are going
to cut back the sorts of journals that exist because
a few people on the faculty have to get tenure and for no
other reason, journals that are write-only and never read. And the journals will go away. And no one will notice. And it will free up a bit
of money for the libraries to do new service. And what they
really need is a way to do this without the
faculty members who have been seduced into
being editorial board members of those journals
complaining. VAN DAM: Last, last
question from Stu. And then Paul will tell
us where we go next. AUDIENCE: Stu Card. So as I was sitting
back listening, it finally struck me
what was odd about all your presentations, in
a way, is that you all assume a world in which things
go, eventually, some sooner, some later, into a completely–
we go into a completely electronic world and leave
the paper world behind, as opposed to a world in
which we have a role for paper or augmented realities in
which physical realities are interleaved with electronic
realities, pieces of paper, Ted’s cut and
paste, for example, that have digitally-embedded
URLs in them so that, in fact, he carries around linkages
to things around the world or paper-tronic systems
in which the computer data structures are embedded
perhaps in the halftone images of the things
that are printed out or other forms that are mixtures
between the physical and electronic world. So I was wondering if someone
might make a comment on that. KAHN: Holding the
microphone, I would say that if you got
that out of my talk, you’ve got a problem
with your perceiver. Because I actually hold
the view that paper is, in fact, going
to continue to be probably an increasingly present
component of this system. I remember all of those articles
about the paperless office. And I think that’s
as likely to happen as the paperless bathroom. I mean, just not really
in the cards, so to speak. And I just don’t
believe that people are going to be without paper. I think there will be
augmentation technologies, not of the kind that
Doug has worked on, but that give you the
equivalent of that. If you want to carry
it around, and you don’t like the equivalent of a
laptop or something like that. And it may even be embedded
in the clothing that you wear, if that’s the case. But I think paper is going to
be with us for a long time. ENGELBART: How long? KAHN: Hm? ENGELBART: How long? KAHN: As long as I’m
around to perceive it. ENGELBART: Gee, goodbye. Well, I think there’s
some things you look at and you could say,
it’s inevitable. And you don’t have to
set the time exactly. But if it’s inevitable, you
start kind of counting on it in the future. And I think it’s inevitable
that we won’t have paper, that technology will let you
have, if it’s because it’s convenient and
small, it’ll still let you have all kinds of
views you can toss in on it or something like that. So I think its days– it’s doomed. And, you know, I used to tell
people I like books, too. You know, but
practically speaking, I can hold in my hand
something that has access to all kinds of
books, it won’t weigh as much maybe as a
book does, it will have more information,
accessibility to it, and flexible usage of it. So I think it’s inevitable. NELSON: I love books. I have warehouses full of them. And I don’t get to
read them very often. And I would like to
have them all right there in a virtual surround. Because they are
there in my heart. And everybody’s
information environment is essentially their
spiritual environment. I mean, your books,
the magazines you read, the TV you watch,
this is your sense of identity. But for me, for
the last 35 years, it’s just been clear
as a divine revelation, as Doug has said,
that obviously that’s what we want to get away from. And I have been designing a
paperless world for 35 years. And I’m very
comfortable with it. And I promise you that
when you actually see it, it’ll be great. But the other part of that is– and as I’ve said to other
people who bring up the bathroom metaphor, confusing an office
and a bathroom, I think, is a rather deep question. KAHN: Don’t invite
them to your office. NELSON: Right, I wouldn’t
want to go to your office. And– [LAUGHING] And I just want to
say this one thing. I am very, very
happy this moment to be between two of the men
I admire most in the world. BERNERS-LEE: Thank you. I love books, too. And when it comes
to pieces of paper to scribble on, just looking
at the hardware technology, come off it, I mean– in my office, I have two of the
biggest high resolution screens I could get stuck together
as close as I could get them so that you can drag
the windows across. And I have very little paper. And I must confess,
at home, I have stacks and stacks of paper. I feel quite at home
in either environment. But for scribbling
on, for doodling, I think that whereas
the reference books you’re going to have
to have in cyberspace, most of the pieces
of paper, it’s going to be so difficult to find
hardware, which you can layer, pin on the wall quite as easily. My feeling is that you’re
not going to get rid of paper unless you can make virtual
paper, in other words, unless you can with
your hand gestures, with your virtual
reality glasses, you can basically do
the equivalent thing. You can actually zoom with
perhaps accelerated mouse hands so that you can just
flick 100 miles to pick up a really interesting piece
of paper and flick it back. Scribble on it, then
screw it into a ball, and throw it into the
metaphorical wastebasket. When you can do
that sort of thing, then maybe you will be
getting to the point where you can get rid of– [VOCALIZING] Bye, folks. Bye. [LAUGHING] [APPLAUSE] PENFIELD: I think we should
give a wonderful hand to the panel here and to Andy. [APPLAUSE]
