EDITOR'S
WELCOME:
[ But first this: IT'S OFFICIAL! The
ABC audit figures are just in. And they prove *conclusively*
that we're now even more popular than 'Sewerage Maintenance'
monthly magazine. It's all down to you dear reader. Thank
you! ]
~ * ~
I'm sitting on the balcony of an hotel
in the centre of Paris as I write this. And I'm reading the
many conspiracy theories about the recent Google update.
My wife is delighted that our next door neighbours on Faubourg Saint-Honore are Yves Saint Laurent and Christian Dior. Personally, I'm delighted with the little miracle the French know as "wee-fee". For it is the power of "wee-fee" which allows me to sit, cable free, and connect to the Internet gazing at the Eiffel Tower...
While the staff at our next door neighbours gaze dutifully at my wife's credit cards...
So, what, actually, has happened over
at Google? And who's to blame? Just before we move to my overdue
interview with a Google guy called Daniel Dulitz, let me try
and explain a couple of things I've learned about information
retrieval. This may help to rationalise things a little.
I actually wrote a spoof conspiracy theory
article which ended with Google being the bad guys and forcing
SEM's to open AdWords accounts, yada, yada... But then I realised
that some people may take it seriously. In fact: Some people
are taking this whole thing far too seriously!
I don't want to get too deep "into the weeds" here. But perhaps a quick overview of the fundamental principles of what's involved in a modern information retrieval system for the web may be a little... enlightening?
Back in 1999 Ricardo Baeza-Yates and Berthier Ribeiro-Neto authored a groundbreaking textbook which has become standard fodder for students of modern information retrieval. In fact, that's actually the title of the book.
It was groundbreaking in the sense that it distinguished, from the very beginning of the book, between data retrieval and information retrieval. There has been a considerable amount of previous work carried out in the field of data retrieval. Many systems and processes have been tried and tested and proven. But data retrieval, in the context of an information retrieval system, consists mainly of determining which documents in a collection contain the keywords in the user query. And this is simply not enough to satisfy a user's information need. One million documents may contain the keywords, and so, technically, they're all relevant to the query. But which ones are the most important or authoritative documents in the returned set?
What we have here is what's known as
the "abundance problem". Too many documents and,
generally speaking, a lazy user who has no desire to go beyond
the first few pages of results. And, again, generally speaking,
a user who has little or no understanding of how to form advanced
queries to broaden or narrow the scope of their search.
The area of information retrieval, since
the dawning of the web, has advanced and grown well beyond
its primary goals of simply indexing text and finding relevant
documents in a collection (or corpus).
Data retrieval may satisfy the user of a database system by providing some form of a solution. But
it doesn't solve the problem of retrieving "information"
about a "subject" or "topic". And it certainly
doesn't provide a (basic) user friendly method of ranking
the most important documents related to the user's information
needs.
The foundation of modern information
retrieval on the web is all about hypertext-based machine
learning and data mining methods such as clustering, collaborative
filtering and supervised learning. However, in the main, it
focuses on web linkage analysis.
The two most popular algorithms used in information retrieval on the web are PageRank and HITS.
The use of these algorithms substantially enhances the quality
and relevancy of keyword-based searches.
Stay with me here guys - there's not
much further to go.
The fundamental difference between the two algorithms is this: PageRank is a keyword "independent" algorithm and HITS is a keyword "dependent" algorithm. What that means is, with PageRank, you already have a score even before the user keys in a query, i.e. your PageRank is computed "upfront". But with HITS, your score is determined based on the keywords which are input, and then the score is gauged around the linkage data surrounding the community relevant to that keyword/phrase.
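Purely to make that distinction concrete, here's a little back-of-a-napkin sketch. It's my own toy illustration - the four-page graph is invented, and this certainly isn't Google's or Teoma's production code - but it shows the difference: PageRank is computed once, over the whole collection, before anyone has typed a word, while HITS-style hub and authority scores are computed at query time over just the pages relevant to that query.

    # Toy illustration only: a made-up four-page web.
    # page -> list of pages it links to
    graph = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    def pagerank(graph, damping=0.85, iterations=50):
        """Keyword INDEPENDENT: computed once, 'upfront', for every page."""
        pages = list(graph)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for p, outlinks in graph.items():
                if outlinks:
                    share = damping * rank[p] / len(outlinks)
                    for q in outlinks:
                        new_rank[q] += share
                else:
                    for q in pages:  # dangling page: spread its rank evenly
                        new_rank[q] += damping * rank[p] / n
            rank = new_rank
        return rank

    def hits(graph, query_pages, iterations=50):
        """Keyword DEPENDENT: run at query time on the subgraph of pages
        relevant to the query (the 'community')."""
        sub = {p: [q for q in graph.get(p, []) if q in query_pages]
               for p in query_pages}
        hub = {p: 1.0 for p in query_pages}
        auth = {p: 1.0 for p in query_pages}
        for _ in range(iterations):
            # authority: sum of hub scores of the pages linking to you
            auth = {p: sum(hub[q] for q in query_pages if p in sub[q])
                    for p in query_pages}
            # hub: sum of authority scores of the pages you link to
            hub = {p: sum(auth[q] for q in sub[p]) for p in query_pages}
            a_norm = sum(auth.values()) or 1.0   # normalise so the scores
            h_norm = sum(hub.values()) or 1.0    # don't blow up
            auth = {p: v / a_norm for p, v in auth.items()}
            hub = {p: v / h_norm for p, v in hub.items()}
        return auth, hub

    print(pagerank(graph))               # scores exist before any query
    print(hits(graph, {"A", "B", "C"}))  # scores depend on the query set

Run it and you'll see the point: swap in a different query set and the HITS scores change, while the PageRank scores don't.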
Of course, from here on in, it becomes very complicated. In fact, it took over 30 pages in my book to explain this. And that was heavily edited.
So why am I using this space to explain
part of it now?
Simply for this reason (and these are
purely my own thoughts and opinions): I believe that PageRank
has always been flawed. I believe that Kleinberg's HITS algorithm
(and the variations on it), being closer to subject specific,
provides more relevant results.
A few years ago when Teoma was launched, there were lots of comparisons made with Jon Kleinberg's HITS algorithm. What many people didn't realise was that Kleinberg's algorithm had suffered its own problems: namely "topic drift" and "run time analysis" delays.
Monika Henzinger, now head of research
at Google, played a major role in developing solutions to
the "topic drift" problem (curiously enough by introducing
a little element of PageRank in the recipe). But the "run
time analysis" problem remained. In simple terms, the
results from the HITS algorithm were more relevant, but they
took an eternity (in web search expectation time) to compute.
So how had the guys at Teoma resolved
this and managed to get the results of a "keyword dependent"
algorithm in sub second times? Apostolos Gerasoulis, the scientist
and founder of Teoma, either discovered the Holy Grail, or
knows some pretty good parlour tricks. Personally I believe
it's closer to the former.
Now if Teoma can do it - why shouldn't Google? What if this major upheaval is just that - a shift toward more subject specific relevancy and keyword dependent relevancy? It would certainly mean that the idea of "quality" links would count for a lot more than "quantity", which was something PageRank could so frequently be fooled with.
So, what if Google is pushing the envelope when it comes to information retrieval? What if they're experimenting with the use of this new technology over a corpus of three billion documents? A corpus of information the size of which has never before been known to mankind. And therefore, never before manipulated in such a way.
What if this change, which will likely take a few iterations, as you'd expect in a "living experiment", is all about providing more relevant results to the END USER?
And what if it has nothing at all to
do with penalising search engine marketers, webmasters, affiliate marketers or "algorithm botherers" per se?
I lost some and I gained some in the
recent shake up. And I'm not losing a second of sleep over
it. There may be another huge, and perhaps unexpected, information
retrieval belch over at Google come the next dance. But you
know, whatever does happen... It won't be technology going
backwards to see if it can raise a few dead listings for some
impertinent folks who seemingly want to bite the hand that's
been feeding them these past years.
I've read about people taking the case
to the FTC, class action blah, blah, blah... There's somebody
who is losing a fortune because he lost his top ten rank at
Google. This is the guy whose business model relies on some
external organisation, which he has no control over whatsoever,
sending him customers for free! I'll say that again: For FREE!
Now he wants to sue!
I have a much better idea for that guy
and all the others like him. Go to Google and wait in the
reception area for Sergey Brin to arrive. When he does, bend
him over (at this point you may wish to apply a little dab
of cologne) and give his ass a HUGE kiss. He's been sending
you wheelbarrow loads of free money since he opened the Google
doors for business!
I'll start worrying when Google starts
worrying. And that's when end users start complaining. And
as I've said many times before: The end user wouldn't know
a PageRank from a HITS and I doubt if they would even care.
Yes, I could write something with a conspiracy
theory around it. But why should I? Google is a fantastic
company, with fabulous technology, great people and a HUGE
amount of pride over the INTEGRITY of their results.
You know, in this game, right now, they're
simply the best. And more power to them I say. Roll on Bill
Gates and anybody else. You'll have your work cut out!
Okay. So what I've written is a little
dull and a little pragmatic... The search for the truth usually
is...
Soap box back in the corner...
Read on, dear reader - read on!
Mike.
GOOGLE
GUY NAILED TO THE GROUND AND FORCED TO ANSWER QUESTIONS!
Earlier this year I was booked to have
lunch with Daniel Dulitz from Google. However, at the very
last moment, I was forced to cancel as I was rushed off to
New England Medical Centre! Serious bout of poisoning over,
and released from hospital, I tried to rearrange...
But the Google guy had gone and I was
circumnavigating the globe again, as is my wont... So, some
months later, I managed to catch up with Daniel on the phone.
He at the Googleplex, and me at home. The Devil I have come
to know better than most is: Time-zone! Daniel is keen to
have lunch. While I am keen to pull a sheet over my head and
go to sleep! Somehow, he manages to last without his lunch...
And I manage to keep my head above the sheet...
If this is the first time you've received
this newsletter, then you may not know that, what you are
about to read is a transcript. It's not an editorial piece:
It's exactly how it was recorded! << Play >>
Mike: Daniel,
nice to get to speak again. So, where were we before I popped
out for a quick medical thing? [bursts out laughing] I think
we were starting with your background, so let's pick it up
from there...
Daniel: Okay, I'm not
sure how far back you want to go Mike...
Mike: [Adopts stereotypical
therapist character] I vont to take you back to ven you were
just a leetle child Daniel...
Daniel: [Big laugh] Well,
my very first memory was of...
Mike: [Still laughing]
Okay, I'll stop you there... let's do University first.
Daniel: I got my Bachelor's degree at Cornell University. I didn't really want to stay at school, but at that time I hadn't decided what I'd like to do with my life. So, I thought maybe work was the best idea, so I went to Motorola in Texas and got myself a job on an automatic, computer aided design synthesis system project. And that is still very advanced, even this many years later: still state-of-the-art.
My wife got a teaching job in Pennsylvania,
she teaches Anglo Saxon, and there aren't that many good Anglo Saxon jobs anywhere in the world. So when you get one...
Mike: You gotta hang
on to that one...
Daniel: Right! So, I
moved up there and worked on digital signal processing for
a while and well, it was, frankly, not the most satisfying
work. It was very difficult to find good people there in Pennsylvania, and the work was a little military. I'm more interested in doing things
that people use and enjoy. I'd been using the search engine
Google for years. Well, actually, not so many years at that
time, maybe a year. And I thought now that seems like it would
be a great place to work.
My wife, ultimately, convinced me that
if I didn't like what I was doing then maybe I should be looking
for a job somewhere else in the world... anyway, that's kind
of how I came to Google.
Mike: It's very interesting
Daniel. I've done a lot of research into the early work of
innovators in information retrieval, and if memory serves
me well: Isn't Cornell the same university where Gerard Salton developed the SMART system? And not only that, the same place
where Jon Kleinberg developed the HITS algorithm?
Daniel: Yes, Mike that's
right...
Mike: So Cornell's kind of like Stanford, I guess, in a sense, in that guys like Larry Page and Sergey Brin came out of Stanford, which kind of suggests that both universities are sort of steeped in this search technology as it has become?
Daniel: Indeed, it's
ironic though, that when I was at Cornell, I focused almost
all my efforts on high performance optimising compilers and
did absolutely nothing at all in information retrieval. I
did a little bit of work in AI (artificial intelligence) and
I did listen to two very interesting talks by Salton. But
I never actually did anything significant with him.
Mike: I just thought,
in passing, that it was an interesting observation that yourself and alumni such as Salton and Kleinberg were there at the
same time.
Of course, there are two main algorithms
based on linkage data. HITS which was developed, as I've already
said, by Jon Kleinberg at Cornell and PageRank at Stanford.
There's a great deal which has been written about PageRank
(not so much about HITS) but it would still be nice to get
the simple version of what PageRank actually is.
When you joined Google, did they give
you a kind of "Idiot's Guide to PageRank" to use
as a simple analogy when you get asked the question?
Daniel: So, as you're
very well aware yourself Mike, it's a 'not so' complex, but
very useful mathematical model of how surfers might surf from
page-to-page. Essentially, the random surfer starts at a particular
page on the web. And, given enough time, there's a certain
chance that the surfer will arrive at ANY given page - and
the chance that he'd arrive there is the PageRank of that
particular page.
It's a fairly correct explanation. Perhaps,
an easier to understand explanation is that, PageRank captures
the vote that one page makes for another. You know if you
have a web page and you link to another web page then you have used your useful screen real-estate to send someone to
this other page. And if you do that - then you must believe
that there's some value in that other page. Or you wouldn't
have put it in front of the eyes of your own users.
So, for some, the first response to that
is: what if you just link to a bunch of really useless pages?
And the response to that is: well who links to you? If you
provide some service then more people will link to you. Your
vote will be more important because your own PageRank score
is higher.
Mike: It's the citation/co-citation
thing. You know, people saying, basically we're pointing to
you because we think you're an important page. The hubs and
authorities principle which Jon Kleinberg developed is based
on the same thing really.
In fact, Daniel, I seem to remember a
scientist once saying to me: "We think we'd rarely come
across a web page which has - these are links to the 20 worst
web pages you'll ever visit!"
Daniel: [Laughs] Right, and in as much as they do, there is some limited value to that if you want to do it for humour or as an example of "what not to do", but it's of little value.
Mike: I think the search
engine optimisation community or fraternity, whichever suits
best, do tend to spend a WHOLE lot of time gazing at the Google
toolbar and weighing up the PageRank of each page that they
visit. Just how accurate is the data in the toolbar Daniel?
And do you think that search engine optimisers may be a little
TOO engrossed in this PageRank thing?
I mean it really is just a single factor
in the overall ranking algorithm isn't it?
Daniel: Okay, so first,
how accurate is the toolbar? Well, I don't want to give too
'nerdy' an answer here Mike, but there's two things really:
there's accuracy and there's precision. And, in terms of accuracy,
it is a completely accurate indicator of the PageRank of a
page.
We don't... well, we don't lie in the
PageRank bar. But it's not very precise. We have a lot more
precision available to us than we represent in a ten step
scale or whatever it is on the PageRank tool...
Mike: So there is data
- but it's not precise! Does this mean then, that if I have
PageRank seven and my competitor has PageRank nine, I shouldn't
really have a breakdown and go and visit my therapist or anything...[Laughs]
Daniel: Not really!
So, the second part of the question was,
are people focusing too much on PageRank. I guess that depends
on what they're after really. You know, if they're curious
about PageRank and how it works, well that's really why we
put that bar on the toolbar in the first place. You know,
if people are just casually surfing the web they can see what
Google's PageRank impression of that page is. And that's great.
I'm all for people doing that.
But for search engine marketing, search
engine optimisation purposes, yeah, I'd say that there's too
much emphasis placed on what that PageRank number actually
is. Our job as a search engine is to return the best results
that we can. And we're not naive enough to think that we can
condense every indicator about a page into a number from one
to ten. We certainly can't do that.
So, if people are trying to look at what
we're doing and their idea is based on that single number
from one to ten, then ... well, they're not going to be effective
in figuring out what we're doing at all.
Mike: Okay Daniel, moving
swiftly onto something else here... Since the web's first full-text retrieval search engines, such as WebCrawler and Lycos, hit the web a number of years ago, the technology for developing web sites has advanced dramatically.
With the dawning of broadband, we have more multi-media sites,
including audio visual entertainment sites and such. They're
a lot more commonplace now. Yet, these are the types of technology
which seem to cause the most problems for the average crawler
out there (if there is such a thing!).
How well do you think search engine crawlers
are coping with the likes of dynamically delivered content
and non HTML files?
Daniel: Clearly there's
a lot of room for improvement here. On the other hand, I think
that Google's crawler does a pretty good job with the wide
variety of content sites there are out there. It can handle
Microsoft Word documents, PowerPoint documents, pdf files...
And Google's crawler has, for quite a
long time now, parsed links out of Flash files [swf] in order
to discover other content. Certainly there has been an explosion
in the number of content types that someone uses. And the
search engines and Google are doing everything we can to track
that in a reliable way. We don't want to put out something that, perhaps, doesn't work very well. We want to support it well. And again on the other hand, I think it is clearly the case that most useful content on the web is in, you know, Microsoft Word, pdf or HTML. There are clearly some examples
of useful content in other formats...
So, I'm making sure that I'm not giving
the wrong impression. But by far, the most important content
is in those three formats.
Mike: What about the
subject of dynamic delivery and those major database driven
sites? I mean, obviously, one of the problems that you have is that search engine optimisers and web site developers know that once a ? appears in the URL or query string, that's it! A crawler is most likely to back out of there. I know that Google prefers to say that it can deal with that type of content. But I seem to remember other search engines saying that they do have problems with that type of delivery. This is why pay for inclusion services were created, I guess.
You know with a pay for inclusion service
you can avoid some of those problems. But Google insists it
won't use a pay for inclusion service. Would it not be wise
for you to create one so that you could help webmasters to
get those problem URL's in there?
Daniel: So... Mike, that's
a very complicated question!!
Mike: Daniel I'm sorry,
I realised that I was actually rolling that into about three
questions there... Would you like me to break it down to three
parts? Sorry, my fault there!
Daniel: [Laughs] Okay,
let's deal with the dynamic delivery part first. We deal very
well with dynamic delivery per se. So the mere presence of
a question mark in the URL doesn't throw us for a loop at
all. We will crawl those pages more slowly than we do with
pages without question marks in the URL. Purely because they've
identified themselves as being dynamic and we certainly don't
want to bring anyone's site down because we're hitting the
database too hard. If you hit a web server at a couple of
pages per second, nobody's going to notice that much. But
if you hit a web server with a couple of queries per second,
especially the smaller sites, that can be a very big problem
for them. That's the primary motivation to crawl dynamic delivery sites more slowly, but we do crawl them.
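[Editor's note: just to picture what Daniel means by crawling dynamic URLs more gently, here's a minimal sketch. It's purely my own illustration, not Google's crawler; the delay figures and the fetch function are made-up placeholders.]

    # Toy politeness rule: pause much longer between fetches of
    # dynamic-looking URLs (those with a query string) than static ones,
    # so we don't hammer somebody's database. Numbers are invented.
    import time
    from urllib.parse import urlparse

    STATIC_DELAY = 0.5    # seconds between ordinary page fetches
    DYNAMIC_DELAY = 5.0   # much longer pause for ?query-string URLs

    def polite_crawl(urls, fetch):
        """fetch(url) is whatever HTTP client you use; supplied by the caller."""
        for url in urls:
            looks_dynamic = bool(urlparse(url).query)  # has a ?key=value part
            fetch(url)
            time.sleep(DYNAMIC_DELAY if looks_dynamic else STATIC_DELAY)

    # e.g. polite_crawl(["http://www.example.com/",
    #                    "http://www.example.com/catalog?page=2"], print)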
The difficulty isn't really with dynamic
content or not. The difficulty is more about how people use
content management tools... content management systems. Content
management systems are really powerful and a great way for
webmasters to provide a lot of content on their site in a
way which is visually pleasing. And so, of course, a lot of
sites use them. But there are a couple of factors which make
CMS's not very crawler friendly. And one of those is that
they provide multiple views of the same data. You know, in different sort orders and different colour schemes and, well, links which produce the same content with just a little expanded view here and there. And crawlers have to try and identify that and try NOT to crawl too much of the same content.
And that's a hard problem. And I have
to say it's not a problem which is unique to dynamic content.
But dynamic content is certainly the one which makes it easier
to generate those types of issues.
Mike: So, I guess that
before we had these content management systems there was still
a problem with duplicate material being spread across organisations.
You know mirror sites for different geographic regions and
that sort of thing...
Daniel: Yes, that did present a problem, but on a smaller scale. If you look at it from a technical standpoint, there are very few of those mirror sites, but there are very many of those Cold Fusion and Domino type tools...
That makes it so easy to generate vast amounts of duplicate information. One mirror site results in one duplicate. With Domino for instance, being able to hide or expand columns, one page can easily map to 20 or 25 URL's which we really need to be able to identify, ultimately, as duplicate material.
Mike: So what kind of
advice could you give to those larger organisations where
they have, you know, 25 offices world wide and they have so
many people working across the organisation with the same
material on these content management systems... Is there any
kind of 'rule of thumb' that we could give to those guys? I mean, do they have to pull this stuff out of the database and make it static on the surface if they want it to be found, or only one version of it to be found?
Daniel: Not really no...
The general 'rule of thumb' that I would use is, provide some
sort of 'browsable' interface to your content first of all.
Some people want to find information by search and some want to find it by browsing a hierarchy, because they're not sure what they really want.
So they don't know what to search for.
Take Google, we're a search engine, but we still provide a
directory for people who simply want to browse. That would
be my first message, to give people who want to browse, the
opportunity to browse. And that will also help with search
engines a lot.
The second major factor is to disable
the features of your content management system that are programmed
to produce the most duplicate content. Session ID's are something
I've mentioned many times at conferences. They really do cause
a lot of grief for crawlers. Let me just give a brief picture
of what a session ID is for the benefit of those who aren't
sure.
Most content management systems provide
a facility whereby, if a user comes in that does not accept
cookies, they can still track that user by changing the URL.
Robots don't accept cookies, so the content management systems
end up delivering session ID's for them. We do try and detect
these session ID's and not take them into account but it's
difficult. And sometimes we'll be wrong.
So, from the point of view of getting listed, and so that we don't accidentally crawl a site many, many times, it will certainly help to disable the session ID's.
You know what happens is, we might have 20 or 40 bots out
crawling the web and they know not to crawl the same URL twice.
If bot one crawls URL X then bot 10 isn't going to crawl it.
But every time one of these bots visits one of these pages
on a session ID site, it's likely to get a different session
ID back.
So 40 bots may crawl the site - and so
we end up with 40 copies of the site. And that's a lot of
pages from the same site, using a lot of processing power
and that's not good for the site owner either.
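[Editor's note: to make the session ID headache concrete, here's a minimal sketch of the sort of URL clean-up a crawler has to attempt. It's purely my own illustration, not Google's de-duplication code, and the parameter names are assumptions for the example.]

    # Strip suspected session ID parameters so that 40 bots fetching the
    # "same" page under 40 different session IDs can recognise it as one URL.
    from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

    SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid", "session"}

    def canonicalise(url):
        """Return the URL with suspected session ID parameters removed."""
        parts = urlparse(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k.lower() not in SESSION_PARAMS]
        return urlunparse(parts._replace(query=urlencode(kept)))

    # Two fetches of the same page by two different bots...
    a = "http://www.example.com/catalog?page=3&sessionid=AB12"
    b = "http://www.example.com/catalog?page=3&sessionid=ZZ99"
    assert canonicalise(a) == canonicalise(b)   # ...collapse to one URL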
Mike: So what about the solutions provided by other search engines? You know, like Inktomi for instance. They provide a pay for inclusion service. So you can get around the problem by only feeding into the database those URL's that are safe, as it were. What's Google's standpoint on that type of service?
Daniel: Mike, we keep
an open mind about doing whatever we can to help our end users.
But the thing that has kept us from offering paid inclusion,
up until now, is really that... Well, when someone pays you
to include a URL - then are you really free to judge that
as you would all other pages?
And the other point is, what relationship
have you formed with that content provider, so that, if they
ask you: "Should my page look like X, Y or Z?" What
do you tell them? What obligations have been entered into
here to provide a tighter support relationship with them because
you've accepted their money to list that page?
When we enter into a business relationship
with someone we want it to be fair to them.
Mike: I think it is a
very curious situation. I mean, it does have the benefits
of faster refreshes in the database and the fact that you
can feed in some of those URL's with the characters and query
strings that don't stand as much of a chance just through
the natural crawling process.
But you're right: What is it that you
really are paying for? As I've said many times before - if
you have a page which is already 5,052 in the index - do you really want to start paying to be there?
Daniel: [Laughing] Right...
right... But you know, I think that the concept that paid
inclusion leads to faster refreshes... Well, in general, I
don't think that's true. Google has worked very diligently
to update as many pages as possible on a rapid basis. I'm
sure you know about 'Fresh-bot'?
Mike: Of course... Yes,
I do...
Daniel: So that gives
us the flexibility to do the right thing for our users, if
WE choose which pages to crawl on a 'freshness' basis. Instead
of only crawling those pages, for freshness purposes, that
people are paying us to crawl. It's a hard issue, Mike.
What we'd like to do, going forward,
is to provide more means for webmasters to communicate with
us, where money is NOT an issue. We're keen to do things where
we're able to help webmasters and they can help us - but where
webmasters don't have to pay.
Mike: Yeah, you know,
I think that there's a lot closer kind of relationship taking
place now between search engine marketers and search engines.
Well, certainly a lot closer than it used to be. It has been
a little bit of a strange space between the search engine
optimisation community and the search engines. I'm coming back to that in a minute actually, Daniel. What I want to do first is touch on the subject of cloaking.
Just as we're glossing over the search engine/webmaster relationship: cloaking has been, and still is, I guess, a very controversial issue. I'd like to get the
official Google line on cloaking, but first, and for the benefit
of those readers who don't know what cloaking is, could you
give a brief technical overview and let us know what actually
constitutes cloaking in the first instance?
Daniel: The short answer
is, that cloaking is identifying a search engine so that you
can serve the search engine different content to that which
you'd serve to a real browser. That's really the idea...
Mike: So Daniel, it's
not really so much about the method or the way that you actually
do it technically - it's more about the principle yes? It's
about the fact that a search engine crawler is going to see
one page - but when I click on that link I'm going to see
something else...
Daniel: Yes, people,
well some people, try to obscure things. Some people say that
Google does cloaking. You know that we detect things like
which country you're from and serve you different content based on your IP address. That's not really cloaking. There's nothing hidden about that. There's nothing underhand about that and we're very up front that we provide slightly different content, slightly different logos and so forth for people from different geographical locations. From a search engine standpoint, we have an obligation to our users to judge the content of the page that they will see when they visit those sites.
And anyone who tries to stand in the way of that, of our evaluating what a browser would actually see, anyone engaging in cloaking, is getting in the way of the relationship with our customers...
Mike: And that's something
you frown upon... I know. Actually, I'm going to put you on
the spot a little here Daniel. Let's just take this example
(and a true example it is). Take a large media company, a
TV station in fact. They have a web site which is based on
a very popular TV show. When somebody comes to that web site,
what they'll expect to see is streaming media, you know, audio
visual presentations, sound-bites, Flash and other animation,
that type of technology.
All of that stuff is very, very crawler
unfriendly as we've already discussed. So, suppose the crawler
is reading text which is completely accurate and relates to
the actual content of that web site. And when a person clicks
through the link and they get to see the web site which is
the audio visual experience they wanted - not the text version
- surely everybody's happy there?
I mean the search engine served up a relevant
result to the query. The end user is happy with the web site
she's viewing. The webmaster is happy that everybody else
is happy... Is that not okay...
Daniel: Well, there are
a couple of actors who are not happy in that situation. One
is the end user who does not have Flash installed in their
browser. They go to that site and they simply don't have the
fancy high performance, high bandwidth streaming technology
capability etc. They're obviously going to have a bad experience
in that particular scenario.
But that's not necessary you know. Almost
all new technology and HTML itself provides ways of including
alternate content for folks who don't have the fancy add-ons
and plug-ins. So there's one actor who may not be very happy.
Another actor who may not be very happy is the person who
is visually impaired and is not going to be able to take advantage
of text to speech - at all - because there is no text there.
It's pretty much a case of this technology needing to recognise these accessibility issues. And you know, it's entirely possible, in every technology that I've looked at, to build a single page which handles a wide variety of browser types, including text only browsers and robots which don't understand that fancy content. There is another issue as well, and that's that Google and other search engines would like, as best as possible, to understand what that content actually is, in general. That complicated high bandwidth Flash content etc. And we can't do that if we can't get to the real content.
So going forward, for instance, we have people who need to know how much better we need to be at Flash conversion. You know, if there's lots of Flash out there but we can't see it because people are cloaking us, then it looks like it's not important for us to work on that problem.
Mike: Got you. And you
know, I'd never even thought about that as an issue. It's
a fact. The less you're able to identify this sort of alien
content, as it is, the less you're likely to know how important
it is in its application across the web. That means it's less likely that you'll get around to trying to find a resolution...
Daniel: Absolutely, and that leads to a situation where content providers will have to do something special for search engines. And that means spending effort on doing things specifically for search engines when they shouldn't have to. We want to help content providers - right.
Every time someone does something special because a search engine exists, it's sort of, well, it's a case where we've failed.
Mike: And that takes
us back to the standard philosophical thing about, you know,
if the search engines didn't exist, would you be going to
all of this trouble?
Daniel: [Big laugh] Yeah,
yeah... To a certain extent, yes...
Mike: For the last edition of my book, one of the things I wanted to dispel was the notion of themed web sites. By that, I mean the idea that people had about trying to develop your entire web site around a couple of keywords. You know like, every page has to be about "blue widgets" and the domain should be "blue-widgets.com" yada, yada... I think the whole thing was nothing more than SEO propaganda - what are your thoughts?
Daniel: I think people sometimes mean different things by "themes." The statement above - that somehow your blue widget site would be "weaker" if it contained a page about tigers - is completely wrong.
No search engine would want to do that; having a page on tigers doesn't affect your ability to be a resource for blue widgets. We'd miss good blue widget pages if we excluded the sites that also talk about tigers.
However, there is a difference between "having a little bit of content about blue widgets" and "having in-depth content about blue widgets." Clearly we prefer in-depth (more useful) content. That's not so much a preference for themes as a preference for depth. "Utility" and "depth" really should be measured by a site's users.
Mike: So Daniel, there's
good advice and there's bad advice about how to achieve a
better rank at search engines. And people do spend SO MUCH
time trying to do whatever they can to get a decent rank at
a search engine. It's so important to be in the top ten at
Google and the other major search engines.
So, what about those people who do take the bad advice? You know, the spamming and stuff, and there's plenty of bad advice out there, what's the deal with them?
What kind of penalties are people likely to incur? There are
horror stories out there and people are afraid to do this...
or afraid to do that. What are the penalties? Can I get banned
from Google? Do I get banished for life for that stuff...
Daniel: Mike - you cannot
get banned from Google for life... But yes, you know particularly,
and these are extreme cases... Yes, bad spammers could be
removed from the index. But that happens so rarely. I mean,
very rarely...
You know, I could count up the number of people who say they've been banned from Google... Well, it's a lot more than the number who really have been banned from Google. So, I think there are some people who are just too
worried about the ban thing. You know, if you're just trying
to provide good content to your own users, you should simply
not be worried about a ban.
Mike: Absolutely, this
is one of the things that I say a lot. I mean I do study a
lot of this and I look at what's written and, well, there
are so many scaremongers out there. Don't do that, don't do
this... The search engine will have you banned for life. There'll
be a plague of locusts and... You know, that sort of thing... But basically Daniel, what you're saying is: If you've got good content, just present it as best as you possibly can and get it out there...
Daniel: Yes!
Mike: And the next "no-no"
Daniel. There are so many products that I see out there for
making automated submissions to search engines and doing ranking
checks and link pop checks. They're either online services
or desktop applications... I mean, basically, all of these
things are a "no-no" with search engines... Correct?
Daniel: Right, well again
it's a case of people interfering with our ability to serve
our non automated users. And yes, Google is a robot, by that
I mean our crawlers are robots. We obey the robots exclusion
protocol. If someone says, no robots on this site or a particular
part of a site, you know, we don't go there. We're not going
to crawl those pages. These automated tools you're talking
about here are also robots. And they are not obeying our no
robots text file.
The reason that we have a robots text file is because every query that comes into Google uses a fixed resource. We provide this resource for free to LIVE
people. And we want to continue doing that. In fact, that's
what a lot of us at Google enjoy, just providing those free
services to real people. But the more automated queries we
get, the harder it becomes to continue providing that free
service. It's just using resources that could be better spent
elsewhere...
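[Editor's note: for readers who haven't met it, the "robots text file" Daniel keeps mentioning is just a plain text file, robots.txt, sitting at the root of a site. Here's a minimal sketch, my own illustration using Python's standard library rather than anything Google-specific, of how a well-behaved robot checks it before fetching; the site and user agent name are made up.]

    # A robots.txt might contain, for instance:
    #   User-agent: *
    #   Disallow: /cgi-bin/
    # A well-behaved robot reads that file and stays out of disallowed areas.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    if rp.can_fetch("ExampleBot", "http://www.example.com/some/page.html"):
        print("Allowed - go ahead and fetch it.")
    else:
        print("Disallowed - stay out, as the robots exclusion protocol asks.")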
Mike: I think part of the problem also is that people are too fascinated by what kind of rank they have at search engines - regardless of whether it's any use having that rank. You know, whether
it brings them any traffic or not. And those issues seem to
become just so important to them. As far as I'm concerned,
I get a better idea of what traffic I'm getting, and from
where, just by looking at log files - not by doing a rank
check to see if I'm still number one for my own name spelt
backwards...
But - people are always going to be curious
I guess. What's the answer - you do a rank check by hand?
Daniel: Much more important
than checking your rank on search engines... In fact, I think
THE most important thing you can possibly do is track conversions
of users who visit your site - and to track that by referrer
- yes, as you say, to analyse your own site logs.
Then you can work backwards from there,
you can see what referrers are doing a good job for you. Are these directories or links from other sites providing a lot of good traffic AND are search engines providing good traffic? And then you can look at changes in those relationships.
You can detect between yesterday and today
that there are fewer conversions from, say, Alta Vista than
there used to be. Then you have a real, live person asking the question: "Why?" You can go to Alta Vista and
then try a query yourself and see what happened - did my listing
disappear, is my listing on a different page, that sort of
thing. It doesn't do any good to have that kind of information
if you're not using it. And if you are using the important
bits of information, then really, it takes no effort to get
it manually.
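[Editor's note: for the curious, here's the kind of thing Daniel is describing, boiled down to a minimal sketch. It's my own illustration, not anything he prescribed: it pulls referrer counts out of a web server access log in the common "combined" format, and the log file name is an assumption - point it at your own.]

    # Count visits by referring host from a combined-format access log,
    # where the referrer is the second-to-last quoted field on each line.
    import re
    from collections import Counter
    from urllib.parse import urlparse

    referrers = Counter()
    with open("access.log") as log:
        for line in log:
            quoted = re.findall(r'"([^"]*)"', line)  # request, referrer, agent
            if len(quoted) >= 2:
                ref = quoted[-2]
                if ref and ref != "-":
                    referrers[urlparse(ref).netloc] += 1  # count by host

    for host, count in referrers.most_common(10):
        print(f"{count:6d}  {host}")

From there it's a short step to set those hosts against whatever you count as a conversion and see which referrers actually pay their way.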
Mike: So, as I said,
you can learn a lot more about your success from your log
files than you will doing a rank check for a keyword or phrase,
which even if you're number one, doesn't convert for you...
Can I just touch on the subject of the
relationship between search engines and search engine marketers, or optimisers as they are also known. I said I'd try and get back to it again. You can see that there is a lot more contact
with the search engine marketing community as it develops
into a fully fledged business sector. Guys like yourself from
Google turning up at conferences...
Do you foresee the possibility of search
engines getting into relationships in the way that, in conventional
advertising and marketing, agencies have a very strong tie
with the media companies?
Daniel: Yes, I think
that's entirely possible. I mean, some agencies, while they
don't have formal relationships with search engines, do certainly
have a lot of understanding about search engines and can give,
overall, better advice about advertising strategy to clients.
You can certainly see that within the search engine optimisation
industry itself, that there is a transition towards more "full
service" marketing and less emphasis on, say ranking
checks, or any single factor. I certainly would not be surprised
to see that trend continue and...
Well, I don't have any special insight...
all I do is work for a search engine! But, yeah, I think customers
want a one stop solution.
Mike: What about the
general state, or condition, of the Industry? There have been
so many changes taking place, I really have to ask about the
acquisitions. Yahoo! buys Inktomi, Overture buys Alta Vista and Fast - Yahoo! buys Overture... Is that all about consolidation?
Daniel: Sure, definitely
there has been a medium term trend towards consolidation.
Part of that is driven by the recognition that search is important.
And I'm very happy that Google has played some role in convincing
people that search is important. But there's no escaping the
fact that, the business we're in is very competitive.
It's competitive the day before Yahoo! buys Inktomi: it's competitive the day after! We know that people come to Google because we serve good results. Every one of us here is working really hard to ensure that we continue to provide better results all the time. That way, people will always be happy to come to Google. And for as long as we're doing that, it really doesn't matter who owns Alta Vista, or Fast or Inktomi.
There are a lot of smart people at those
companies. They do excellent work, which means that we
have our work cut out for us. But there again we always have.
We've always been competing against the best and we think
we can continue to do that.
Mike: Will we continue
to see Google results at Yahoo! Daniel?
Daniel: Who knows? They're
a great partner of ours and we certainly do everything we
can to make them happy. There are lots of ways that Yahoo!
could use a search engine. We hope that our relationship continues,
but, ultimately, it's up to them.
Mike: Daniel, thanks
so much for your time, I know I'm keeping you away from your
lunch...
Daniel: Correct - and
I'm extremely hungry! But it was a pleasure talking to you
Mike.
(c) 2003 Net Writer Publishing.
< http://www.google.com
>
MEET
THE NEW ASSOCIATE EDITOR OF e-marketing-news:
It's one of those moments I recall all
too well. I was being viciously beaten around the head with
the very hefty conference handbook at SES in San Jose. There
was a woman's voice and she was screaming: "Okay, okay,
I'll do it... Now leave me alone and stop stalking me before
I call a cop - creep!!"
And so, in that historic moment, I applied
an ice pack to my head and celebrated my success in having
convinced one of the web's leading SEM professionals to join
me here, at your favourite e-rag.
But why don't I shut up for a minute
(as if that were really possible!!) and let Christine introduce
herself. So how do you feel about teaming up for the newsletter
then?
"Newsletter? Heck I thought that
was a blog! <big grin>
Okay - I'll be serious. Yes, Mike and
I will be joining forces to keep the newsletter alive. Ages
ago I had mentioned that it would be fun to work on a joint
project sometime. Little did I know that offhand comment would
turn into managing his newsletter.
I have several years' experience running an online newsletter - I should know better. The truth is
there is something irresistible about the opportunity to work
with someone as knowledgeable and hilarious as Mike Grehan.
I just hope I can stop laughing long enough to work."
Laughing? Laughing? You mean I'm not
being taken seriously!
BUT really seriously: It's an honour
and a privilege to have you come on board to shape us up for
next year Christine.
< http://www.keyrelevance.com
>
SEARCH
ENGINE STRATEGIES CONFERENCE COMES TO CHICAGO:
Yes, there's still time for you to make
that last minute decision and head to Chicago for the final
search engine strategies conference of this year.
I'll be teaming up with my panel partners
Brett Tabke of WebMasterWorld and Anthony Muller of Range
Online Media this time for the organic listings forum.
This is definitely one of my favourite
sessions as it's a chance to tackle all things "webmastery".
It usually starts fairly general, but then it gets right down
into the weeds with stuff on technical platforms, crawling
issues, linkage data and even a good old spam discussion from
time-to-time.
So, don't be afraid to "pipe-up"
with anything at all at this session. With guys as knowledgeable
as Brett and Anthony to hand, and stalwart Detlev Johnson
moderating, you really do have access to a dream-team of search
consultants.
And let's not forget the link building
session, always great fun and very informative. It's the San
Jose team back together again this time. So that's me, "Link Mensch" Eric Ward, my friend Paul Gardi from Jeeves/Teoma
and Marissa Mayer from Google.
This, again, is very informative with
Marissa explaining PageRank and linkage data over at Google
and Paul Gardi explaining how web communities are identified
through linkage data. (A subject close to my heart as you
know!)
Organised by the great man himself, Danny
Sullivan, SES now attracts the leading practitioners in the
field of search marketing and the vendors, media companies
and technology firms that go with it.
I frequently get asked by people: "Is it worth going if I already know that stuff?" The answer is fairly simple: the conference is designed to accommodate attendees at all levels. So whether you're brand new to the discipline or an old hand, you'll still find sessions which are tailored to suit.
And not only that, you get to hang with
the coolest people in the industry at all of the sponsored
events such as Overture's SES Mixer and also the SEMPO (Search
Engine Marketing Professional Organisation) lig. So there's
free drinks too!
Don't be afraid to come over and speak
and you'll also get to meet our new associate editor Christine
Churchill as well as Jill Whalen and her High Rankings crew
and more!
< http://www.jupiterevents.com/sew/fall03/index.html
>
BUILD
YOUR MAILING LIST BY *UNSUBSCRIBING* EVERYONE:
Jim Sterne, I'm proud to say, is a friend
of mine. He and I have worked together a few times this year,
both here and in the States.
I listen to what Jim says, because he
knows what he's talking about. But when I got a mail from
him, just the same as the rest of the subscribers to his newsletter,
I thought...
Well? Has my mate been sitting in the California sun too long? Turns out: Not at all! In fact, once again, he's simply just ahead of the rest of us! As usual.
What is the point of having a figure which SUGGESTS how big your mailing list is? Wouldn't you just rather know how many READERS you have?
And, to that end, he unsubscribed them
all and told them to subscribe again if they thought his newsletter
was important enough!
I subscribed again, of course. But what
happened after that? Here's a tale and a short answer for
brave marketers and realists! I wrote to Jim in CA and asked him what happened.
Too much sun? Nope - just too little
information about my readers. Are they all avid readers or
are they too lazy to unsubscribe? Are they hanging on every word, or unaware that their spam filters are eating every issue? No way to tell. So - it's Off With Their Heads!
What happened?
Within four days, 10% had re-subscribed.
I expected about 2% so this was a real ego boost.
What would I recommend to others who
are tempted to do the same? Phase it in.
I quit everybody cold turkey - all at
once - no notice.
I'd suggest you tell people you're going
to do it.
Remind them you're going to do it. And
then do it - with an opportunity to re-subscribe each time.
And let me know how it goes...
< http://www.targeting.com
>
THOSE LITTLE THINGS
WORTH A MENTION. Includes:
o MIKE GREHAN APPOINTED AS MANAGING DIRECTOR,
iPROSPECT, EUROPE. Yes, your brave and fearless leader has
joined the world's leading search marketing firm.
< http://biz.yahoo.com/prnews/031104/netu016_1.html
>
o FREE SEARCH ENGINE MARKETING SUITE OF
TOOLS. Did he say free? I believe he did!
New features in Web CEO 4.2 include:
- Free Link popularity analysis tool
- Free Optimization tool
- Free Knowledge base updates
- Free support
- Free SE submission (166 search engines to choose from).
< http://www.e-marketing-news.co.uk/webceo
>
o WEBMASTERWORLD.COM ANNOUNCES PUBCONFERENCE
V1 IN FLORIDA: I'll be there with my SES team mate Brett Tabke
- find out more!
< http://www.webmasterworld.com/conference
>
o ONLINE RETAIL STUDY FOR CUSTOMER FOCUSED
EXCELLENCE: Conversion and analytics gurus Bryan and Jeffrey
Eisenberg, of Future Now Inc., have prepared a free report.
< http://www.futurenowinc.com
>
o SBI DOES CARTWHEELS OVER HTML EDITOR
COMPATIBILITY. And FormBuilder... and Masters Course and...
< http://www.e-marketing-news.co.uk/sbi
>
o DANIEL BAZAC IN NEW YORK POINTS ME
IN THE DIRECTION OF: "Search the Web More Efficiently:
Tips, Techniques and Strategies"(He also has a search
blog too!)
< http://www.web-design-in-new-york.com/search-the-web.html
>
o YOU MAY BE A GREAT ONLINE MARKETER:
But how good are you in the sack?
< http://www.e-marketing-news.co.uk/emodeIQ
>
Editor: Mike Grehan. Search
engine marketing consultant, speaker and author.
http://www.search-engine-book.co.uk
Associate Editor: Christine
Churchill. KeyRelevance.com
e-marketing-news is published selectively, on a 'when it's ready' basis.
At no cost you may use the content of this newsletter on
your own site, providing you display it in its entirety (no
cutting) with due credits and place a link to:
< http://www.e-marketing-news.co.uk
>