EDITOR'S
WELCOME:
[ But first this: IT'S OFFICIAL! The
ABC audit figures are just in. And they prove *conclusively*
that we're now even more popular than 'Sewerage Maintenance'
monthly magazine. It's all down to you dear reader. Thank
you! ]
~ * ~
I'm sitting on the balcony of an hotel
in the centre of Paris as I write this. And I'm reading the
many conspiracy theories about the recent Google update.
My wife is delighted that our next door neighbours on Faubourg Saint-Honore are Yves Saint Laurent and Christian Dior. Personally, I'm delighted with the little miracle the French know as "wee-fee". For it is the power of "wee-fee" which allows me to sit, cable free, and connect to the Internet gazing at the Eiffel Tower...
While the staff at our next door neighbours gaze dutifully at my wife's credit cards...
So, what, actually, has happened over
at Google? And who's to blame? Just before we move to my overdue
interview with a Google guy called Daniel Dulitz, let me try
and explain a couple of things I've learned about information
retrieval. This may help to rationalise things a little.
I actually wrote a spoof conspiracy theory
article which ended with Google being the bad guys and forcing
SEM's to open AdWords accounts, yada, yada... But then I realised
that some people may take it seriously. In fact: Some people
are taking this whole thing far too seriously!
I don't want to get too deep "into the weeds" here. But perhaps a quick overview of the fundamental principles of what's involved in a modern information retrieval system for the web may be a little... enlightening?
Back in 1999 Ricardo Baeza-Yates and Berthier Ribeiro-Neto authored a groundbreaking textbook which has become standard fodder for students of modern information retrieval. In fact, that's actually the title of the book.
It was groundbreaking in the sense that it distinguished, from the very beginning of the book, between data retrieval and information retrieval. There has been a considerable amount of previous work carried out in the field of data retrieval. Many systems and processes have been tried and tested and proven. But data retrieval, in the context of an information retrieval system, consists mainly of determining which documents in a collection contain the keywords in the user query. And this is simply not enough to satisfy a user's information need. One million documents may contain the keywords, and so, technically, they're all relevant to the query. But which ones are the most important or authoritative documents in the returned set?
What we have here is what's known as
the "abundance problem". Too many documents and,
generally speaking, a lazy user who has no desire to go beyond
the first few pages of results. And, again, generally speaking,
a user who has little or no understanding of how to form advanced
queries to broaden or narrow the scope of their search.
The area of information retrieval, since
the dawning of the web, has advanced and grown well beyond
its primary goals of simply indexing text and finding relevant
documents in a collection (or corpus).
Data retrieval may satisfy the user of a database system by providing some form of a solution. But
it doesn't solve the problem of retrieving "information"
about a "subject" or "topic". And it certainly
doesn't provide a (basic) user friendly method of ranking
the most important documents related to the user's information
needs.
The foundation of modern information
retrieval on the web is all about hypertext-based machine
learning and data mining methods such as clustering, collaborative
filtering and supervised learning. However, in the main, it
focuses on web linkage analysis.
The two most popular algorithms used in information retrieval on the web are PageRank and HITS.
The use of these algorithms substantially enhances the quality
and relevancy of keyword-based searches.
Stay with me here guys - there's not
much further to go.
The fundamental difference between the two algorithms is this: PageRank is a keyword "independent" algorithm and HITS is a keyword "dependent" algorithm. What that means is, with PageRank, you already have a score even before the user keys in a query, i.e. your PageRank is computed "upfront". But with HITS, your score is determined based on the keywords which are input, and then the score is gauged around the linkage data surrounding the community relevant to that keyword/phrase.
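Purely to make that distinction concrete, here's a little back-of-a-napkin sketch. It's my own toy illustration - the four-page graph is invented, and this certainly isn't Google's or Teoma's production code - but it shows the difference: PageRank is computed once, over the whole collection, before anyone has typed a word, while HITS-style hub and authority scores are computed at query time over just the pages relevant to that query.

    # Toy illustration only: a made-up four-page web.
    # page -> list of pages it links to
    graph = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    def pagerank(graph, damping=0.85, iterations=50):
        """Keyword INDEPENDENT: computed once, 'upfront', for every page."""
        pages = list(graph)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for p, outlinks in graph.items():
                if outlinks:
                    share = damping * rank[p] / len(outlinks)
                    for q in outlinks:
                        new_rank[q] += share
                else:
                    for q in pages:  # dangling page: spread its rank evenly
                        new_rank[q] += damping * rank[p] / n
            rank = new_rank
        return rank

    def hits(graph, query_pages, iterations=50):
        """Keyword DEPENDENT: run at query time on the subgraph of pages
        relevant to the query (the 'community')."""
        sub = {p: [q for q in graph.get(p, []) if q in query_pages]
               for p in query_pages}
        hub = {p: 1.0 for p in query_pages}
        auth = {p: 1.0 for p in query_pages}
        for _ in range(iterations):
            # authority: sum of hub scores of the pages linking to you
            auth = {p: sum(hub[q] for q in query_pages if p in sub[q])
                    for p in query_pages}
            # hub: sum of authority scores of the pages you link to
            hub = {p: sum(auth[q] for q in sub[p]) for p in query_pages}
            a_norm = sum(auth.values()) or 1.0   # normalise so the scores
            h_norm = sum(hub.values()) or 1.0    # don't blow up
            auth = {p: v / a_norm for p, v in auth.items()}
            hub = {p: v / h_norm for p, v in hub.items()}
        return auth, hub

    print(pagerank(graph))               # scores exist before any query
    print(hits(graph, {"A", "B", "C"}))  # scores depend on the query set

Run it and you'll see the point: swap in a different query set and the HITS scores change, while the PageRank scores don't.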
Of course, from here on in, it becomes very complicated. In fact, it took over 30 pages in my book to explain this. And that was heavily edited.
So why am I using this space to explain
part of it now?
Simply for this reason (and these are
purely my own thoughts and opinions): I believe that PageRank
has always been flawed. I believe that Kleinberg's HITS algorithm
(and the variations on it), being closer to subject specific,
provides more relevant results.
A few years ago when Teoma was launched, there were lots of comparisons made with Jon Kleinberg's HITS algorithm. What many people didn't realise was that Kleinberg's algorithm had suffered its own problems: namely "topic drift" and "run time analysis" delays.
Monika Henzinger, now head of research
at Google, played a major role in developing solutions to
the "topic drift" problem (curiously enough by introducing
a little element of PageRank in the recipe). But the "run
time analysis" problem remained. In simple terms, the
results from the HITS algorithm were more relevant, but they
took an eternity (in web search expectation time) to compute.
So how had the guys at Teoma resolved
this and managed to get the results of a "keyword dependent"
algorithm in sub second times? Apostolos Gerasoulis, the scientist
and founder of Teoma, either discovered the Holy Grail, or
knows some pretty good parlour tricks. Personally I believe
it's closer to the former.
Now if Teoma can do it - why shouldn't Google? What if this major upheaval is just that - a shift toward more subject specific relevancy and keyword dependent relevancy? It would certainly mean that the idea of "quality" links would count for a lot more than "quantity", which was something PageRank could so frequently be fooled with.
So, what if Google is pushing the envelope when it comes to information retrieval? What if they're experimenting with the use of this new technology over a corpus of three billion documents? A corpus of information the size of which has never before been known to mankind. And therefore, never before manipulated in such a way.
What if this change, which will likely take a few iterations, as you'd expect in a "living experiment", is all about providing more relevant results to the END USER?
And what if it has nothing at all to
do with penalising search engine marketers, webmasters, affiliate marketers or "algorithm botherers" per se?
I lost some and I gained some in the
recent shake up. And I'm not losing a second of sleep over
it. There may be another huge, and perhaps unexpected, information
retrieval belch over at Google come the next dance. But you
know, whatever does happen... It won't be technology going
backwards to see if it can raise a few dead listings for some
impertinent folks who seemingly want to bite the hand that's
been feeding them these past years.
I've read about people taking the case
to the FTC, class action blah, blah, blah... There's somebody
who is losing a fortune because he lost his top ten rank at
Google. This is the guy whose business model relies on some
external organisation, which he has no control over whatsoever,
sending him customers for free! I'll say that again: For FREE!
Now he wants to sue!
I have a much better idea for that guy
and all the others like him. Go to Google and wait in the
reception area for Sergey Brin to arrive. When he does, bend
him over (at this point you may wish to apply a little dab
of cologne) and give his ass a HUGE kiss. He's been sending
you wheelbarrow loads of free money since he opened the Google
doors for business!
I'll start worrying when Google starts
worrying. And that's when end users start complaining. And
as I've said many times before: The end user wouldn't know
a PageRank from a HITS and I doubt if they would even care.
Yes, I could write something with a conspiracy
theory around it. But why should I? Google is a fantastic
company, with fabulous technology, great people and a HUGE
amount of pride over the INTEGRITY of their results.
You know, in this game, right now, they're
simply the best. And more power to them I say. Roll on Bill
Gates and anybody else. You'll have your work cut out!
Okay. So what I've written is a little
dull and a little pragmatic... The search for the truth usually
is...
Soap box back in the corner...
Read on, dear reader - read on!
Mike.
GOOGLE
GUY NAILED TO THE GROUND AND FORCED TO ANSWER QUESTIONS!
Earlier this year I was booked to have
lunch with Daniel Dulitz from Google. However, at the very
last moment, I was forced to cancel as I was rushed off to
New England Medical Centre! Serious bout of poisoning over,
and released from hospital, I tried to rearrange...
But the Google guy had gone and I was
circumnavigating the globe again, as is my wont... So, some
months later, I managed to catch up with Daniel on the phone.
He at the Googleplex, and me at home. The Devil I have come
to know better than most is: Time-zone! Daniel is keen to
have lunch. While I am keen to pull a sheet over my head and
go to sleep! Somehow, he manages to last without his lunch...
And I manage to keep my head above the sheet...
If this is the first time you've received
this newsletter, then you may not know that, what you are
about to read is a transcript. It's not an editorial piece:
It's exactly how it was recorded! << Play >>
Mike: Daniel,
nice to get to speak again. So, where were we before I popped
out for a quick medical thing? [bursts out laughing] I think
we were starting with your background, so let's pick it up
from there...
Daniel: Okay, I'm not
sure how far back you want to go Mike...
Mike: [Adopts stereotypical
therapist character] I vont to take you back to ven you were
just a leetle child Daniel...
Daniel: [Big laugh] Well,
my very first memory was of...
Mike: [Still laughing]
Okay, I'll stop you there... let's do University first.
Daniel: I got my Bachelor's degree at Cornell University. I didn't really want to stay at school, but at that time I hadn't decided what I'd like to do with my life. So, I thought maybe work was the best idea, so I went to Motorola in Texas and got myself a job on an automatic, computer aided design synthesis system project. And that is still very advanced, even this many years later: still state-of-the-art.
My wife got a teaching job in Pennsylvania,
she teaches Anglo Saxon, and there aren't that many good Anglo Saxon jobs anywhere in the world. So when you get one...
Mike: You gotta hang
on to that one...
Daniel: Right! So, I
moved up there and worked on digital signal processing for
a while and well, it was, frankly, not the most satisfying
work. It was very difficult to find good people there in Pennsylvania, and the work was a little military. I'm more interested in doing things
that people use and enjoy. I'd been using the search engine
Google for years. Well, actually, not so many years at that
time, maybe a year. And I thought now that seems like it would
be a great place to work.
My wife, ultimately, convinced me that
if I didn't like what I was doing then maybe I should be looking
for a job somewhere else in the world... anyway, that's kind
of how I came to Google.
Mike: It's very interesting
Daniel. I've done a lot of research into the early work of
innovators in information retrieval, and if memory serves
me well: Isn't Cornell the same university where Gerard Salton developed the SMART system? And not only that, the same place
where Jon Kleinberg developed the HITS algorithm?
Daniel: Yes, Mike that's
right...
Mike: So Cornell's kind of like Stanford, I guess, in a sense, in that guys like Larry Page and Sergey Brin came out of Stanford, which kind of suggests that both universities are sort of steeped in this search technology as it has become?
Daniel: Indeed, it's
ironic though, that when I was at Cornell, I focused almost
all my efforts on high performance optimising compilers and
did absolutely nothing at all in information retrieval. I
did a little bit of work in AI (artificial intelligence) and
I did listen to two very interesting talks by Salton. But
I never actually did anything significant with him.
Mike: I just thought,
in passing, that it was an interesting observation that yourself and alumni such as Salton and Kleinberg were there at the
same time.
Of course, there are two main algorithms
based on linkage data. HITS which was developed, as I've already
said, by Jon Kleinberg at Cornell and PageRank at Stanford.
There's a great deal which has been written about PageRank
(not so much about HITS) but it would still be nice to get
the simple version of what PageRank actually is.
When you joined Google, did they give
you a kind of "Idiot's Guide to PageRank" to use
as a simple analogy when you get asked the question?
Daniel: So, as you're
very well aware yourself Mike, it's a 'not so' complex, but
very useful mathematical model of how surfers might surf from
page-to-page. Essentially, the random surfer starts at a particular
page on the web. And, given enough time, there's a certain
chance that the surfer will arrive at ANY given page - and
the chance that he'd arrive there is the PageRank of that
particular page.
It's a fairly correct explanation. Perhaps,
an easier to understand explanation is that, PageRank captures
the vote that one page makes for another. You know if you
have a web page and you link to another web page then you have used your useful screen real-estate to send someone to
this other page. And if you do that - then you must believe
that there's some value in that other page. Or you wouldn't
have put it in front of the eyes of your own users.
So, for some, the first response to that
is: what if you just link to a bunch of really useless pages?
And the response to that is: well who links to you? If you
provide some service then more people will link to you. Your
vote will be more important because your own PageRank score
is higher.
Mike: It's the citation/co-citation
thing. You know, people saying, basically we're pointing to
you because we think you're an important page. The hubs and
authorities principle which Jon Kleinberg developed is based
on the same thing really.
In fact, Daniel, I seem to remember a
scientist once saying to me: "We think we'd rarely come
across a web page which has - these are links to the 20 worst
web pages you'll ever visit!"
Daniel: [Laughs] Right, and in as much as they do, there is some limited value to that if you want to do it for humour or as an example of "what not to do", but it's of little value.
Mike: I think the search
engine optimisation community or fraternity, whichever suits
best, do tend to spend a WHOLE lot of time gazing at the Google
toolbar and weighing up the PageRank of each page that they
visit. Just how accurate is the data in the toolbar Daniel?
And do you think that search engine optimisers may be a little
TOO engrossed in this PageRank thing?
I mean it really is just a single factor
in the overall ranking algorithm isn't it?
Daniel: Okay, so first,
how accurate is the toolbar? Well, I don't want to give too
'nerdy' an answer here Mike, but there's two things really:
there's accuracy and there's precision. And, in terms of accuracy,
it is a completely accurate indicator of the PageRank of a
page.
We don't... well, we don't lie in the
PageRank bar. But it's not very precise. We have a lot more
precision available to us than we represent in a ten step
scale or whatever it is on the PageRank tool...
Mike: So there is data
- but it's not precise! Does this mean then, that if I have
PageRank seven and my competitor has PageRank nine, I shouldn't
really have a breakdown and go and visit my therapist or anything...[Laughs]
Daniel: Not really!
So, the second part of the question was,
are people focusing too much on PageRank. I guess that depends
on what they're after really. You know, if they're curious
about PageRank and how it works, well that's really why we
put that bar on the toolbar in the first place. You know,
if people are just casually surfing the web they can see what
Google's PageRank impression of that page is. And that's great.
I'm all for people doing that.
But for search engine marketing, search
engine optimisation purposes, yeah, I'd say that there's too
much emphasis placed on what that PageRank number actually
is. Our job as a search engine is to return the best results
that we can. And we're not naive enough to think that we can
condense every indicator about a page into a number from one
to ten. We certainly can't do that.
So, if people are trying to look at what
we're doing and their idea is based on that single number
from one to ten, then ... well, they're not going to be effective
in figuring out what we're doing at all.
Mike: Okay Daniel, moving
swiftly onto something else here... Since the web's first full-text retrieval search engines, such as WebCrawler and Lycos, hit the web a number of years ago, the technology for developing web sites has advanced dramatically.
With the dawning of broadband, we have more multi-media sites,
including audio visual entertainment sites and such. They're
a lot more commonplace now. Yet, these are the types of technology
which seem to cause the most problems for the average crawler
out there (if there is such a thing!).
How well do you think search engine crawlers
are coping with the likes of dynamically delivered content
and non HTML files?
Daniel: Clearly there's
a lot of room for improvement here. On the other hand, I think
that Google's crawler does a pretty good job with the wide
variety of content sites there are out there. It can handle
Microsoft Word documents, PowerPoint documents, pdf files...
And Google's crawler has, for quite a
long time now, parsed links out of Flash files [swf] in order
to discover other content. Certainly there has been an explosion
in the number of content types that someone uses. And the
search engines and Google are doing everything we can to track
that in a reliable way. We don't want to put out something that, perhaps, doesn't work very well. We want to support it well. And again on the other hand, I think it is clearly the case that most useful content on the web is in, you know, Microsoft Word, pdf or HTML. There are clearly some examples
of useful content in other formats...
So, I'm making sure that I'm not giving
the wrong impression. But by far, the most important content
is in those three formats.
Mike: What about the
subject of dynamic delivery and those major database driven
sites? I mean, obviously, one of the problems that you have is that search engine optimisers and web site developers know that once a ? appears in the URL or query string, that's it! A crawler is most likely to back out of there. I know that Google prefers to say that it can deal with that type of content. But I seem to remember other search engines saying that they do have problems with that type of delivery. This is why pay for inclusion services were created, I guess.
You know with a pay for inclusion service
you can avoid some of those problems. But Google insists it
won't use a pay for inclusion service. Would it not be wise
for you to create one so that you could help webmasters to
get those problem URL's in there?
Daniel: So... Mike, that's
a very complicated question!!
Mike: Daniel I'm sorry,
I realised that I was actually rolling that into about three
questions there... Would you like me to break it down to three
parts? Sorry, my fault there!
Daniel: [Laughs] Okay,
let's deal with the dynamic delivery part first. We deal very
well with dynamic delivery per se. So the mere presence of
a question mark in the URL doesn't throw us for a loop at
all. We will crawl those pages more slowly than we do with
pages without question marks in the URL. Purely because they've
identified themselves as being dynamic and we certainly don't
want to bring anyone's site down because we're hitting the
database too hard. If you hit a web server at a couple of
pages per second, nobody's going to notice that much. But
if you hit a web server with a couple of queries per second,
especially the smaller sites, that can be a very big problem
for them. That's the primary motivation to crawl dynamic delivery sites more slowly, but we do crawl them.
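[Editor's note: just to picture what Daniel means by crawling dynamic URLs more gently, here's a minimal sketch. It's purely my own illustration, not Google's crawler; the delay figures and the fetch function are made-up placeholders.]

    # Toy politeness rule: pause much longer between fetches of
    # dynamic-looking URLs (those with a query string) than static ones,
    # so we don't hammer somebody's database. Numbers are invented.
    import time
    from urllib.parse import urlparse

    STATIC_DELAY = 0.5    # seconds between ordinary page fetches
    DYNAMIC_DELAY = 5.0   # much longer pause for ?query-string URLs

    def polite_crawl(urls, fetch):
        """fetch(url) is whatever HTTP client you use; supplied by the caller."""
        for url in urls:
            looks_dynamic = bool(urlparse(url).query)  # has a ?key=value part
            fetch(url)
            time.sleep(DYNAMIC_DELAY if looks_dynamic else STATIC_DELAY)

    # e.g. polite_crawl(["http://www.example.com/",
    #                    "http://www.example.com/catalog?page=2"], print)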
The difficulty isn't really with dynamic
content or not. The difficulty is more about how people use
content management tools... content management systems. Content
management systems are really powerful and a great way for
webmasters to provide a lot of content on their site in a
way which is visually pleasing. And so, of course, a lot of
sites use them. But there are a couple of factors which make
CMS's not very crawler friendly. And one of those is that
they provide multiple views of the same data. You know, in different sort orders and different colour schemes and, well, links which produce the same content with just a little expanded view here and there. And crawlers have to try and identify that and try NOT to crawl too much of the same content.
And that's a hard problem. And I have
to say it's not a problem which is unique to dynamic content.
But dynamic content is certainly the one which makes it easier
to generate those types of issues.
Mike: So, I guess that
before we had these content management systems there was still
a problem with duplicate material being spread across organisations.
You know mirror sites for different geographic regions and
that sort of thing...
Daniel: Yes, that did present a problem, but on a smaller scale. If you look at it from a technical standpoint, there are very few of those mirror sites, but there are very many of those Cold Fusion and Domino type tools...
That makes it so easy to generate vast amounts of duplicate information. One mirror site results in one duplicate. With Domino for instance, being able to hide or expand columns, one page can easily map to 20 or 25 URL's which we really need to be able to identify, ultimately, as duplicate material.
Mike: So what kind of
advice could you give to those larger organisations where
they have, you know, 25 offices world wide and they have so
many people working across the organisation with the same
material on these content management systems... Is there any
kind of 'rule of thumb' that we could give to those guys? I mean, do they have to pull this stuff out of the database and make it static on the surface if they want it to be found, or only one version of it to be found?
Daniel: Not really no...
The general 'rule of thumb' that I would use is, provide some
sort of 'browsable' interface to your content first of all.
Some people want to find information by search and some want to find it by browsing a hierarchy, because they're not sure what they really want.
So they don't know what to search for.
Take Google, we're a search engine, but we still provide a
directory for people who simply want to browse. That would
be my first message, to give people who want to browse, the
opportunity to browse. And that will also help with search
engines a lot.
The second major factor is to disable
the features of your content management system that are programmed
to produce the most duplicate content. Session ID's are something
I've mentioned many times at conferences. They really do cause
a lot of grief for crawlers. Let me just give a brief picture
of what a session ID is for the benefit of those who aren't
sure.
Most content management systems provide
a facility whereby, if a user comes in that does not accept
cookies, they can still track that user by changing the URL.
Robots don't accept cookies, so the content management systems
end up delivering session ID's for them. We do try and detect
these session ID's and not take them into account but it's
difficult. And sometimes we'll be wrong.
So, from the point of view of getting listed, and so that we don't accidentally crawl a site many, many times, it will certainly help to disable the session ID's.
You know what happens is, we might have 20 or 40 bots out
crawling the web and they know not to crawl the same URL twice.
If bot one crawls URL X then bot 10 isn't going to crawl it.
But every time one of these bots visits one of these pages
on a session ID site, it's likely to get a different session
ID back.
So 40 bots may crawl the site - and so
we end up with 40 copies of the site. And that's a lot of
pages from the same site, using a lot of processing power
and that's not good for the site owner either.
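[Editor's note: to make the session ID headache concrete, here's a minimal sketch of the sort of URL clean-up a crawler has to attempt. It's purely my own illustration, not Google's de-duplication code, and the parameter names are assumptions for the example.]

    # Strip suspected session ID parameters so that 40 bots fetching the
    # "same" page under 40 different session IDs can recognise it as one URL.
    from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

    SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid", "session"}

    def canonicalise(url):
        """Return the URL with suspected session ID parameters removed."""
        parts = urlparse(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k.lower() not in SESSION_PARAMS]
        return urlunparse(parts._replace(query=urlencode(kept)))

    # Two fetches of the same page by two different bots...
    a = "http://www.example.com/catalog?page=3&sessionid=AB12"
    b = "http://www.example.com/catalog?page=3&sessionid=ZZ99"
    assert canonicalise(a) == canonicalise(b)   # ...collapse to one URL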
Mike: So what about the solutions provided by other search engines? You know, like Inktomi for instance. They provide a pay for inclusion service. So you can get around the problem by only feeding into the database those URL's that are safe, as it were. What's Google's standpoint on that type of service?
Daniel: Mike, we keep
an open mind about doing whatever we can to help our end users.
But the thing that has kept us from offering paid inclusion,
up until now, is really that... Well, when someone pays you
to include a URL - then are you really free to judge that
as you would all other pages?
And the other point is, what relationship
have you formed with that content provider, so that, if they
ask you: "Should my page look like X, Y or Z?" What
do you tell them? What obligations have been entered into
here to provide a tighter support relationship with them because
you've accepted their money to list that page?
When we enter into a business relationship
with someone we want it to be fair to them.
Mike: I think it is a
very curious situation. I mean, it does have the benefits
of faster refreshes in the database and the fact that you
can feed in some of those URL's with the characters and query
strings that don't stand as much of a chance just through
the natural crawling process.
But you're right: What is it that you
really are paying for? As I've said many times before - if
you have a page which is already 5,052 in the index - do you really want to start paying to be there?
Daniel: [Laughing] Right...
right... But you know, I think that the concept that paid
inclusion leads to faster refreshes... Well, in general, I
don't think that's true. Google has worked very diligently
to update as many pages as possible on a rapid basis. I'm
sure you know about 'Fresh-bot'?
Mike: Of course... Yes,
I do...
Daniel: So that gives
us the flexibility to do the right thing for our users, if
WE choose which pages to crawl on a 'freshness' basis. Instead
of only crawling those pages, for freshness purposes, that
people are paying us to crawl. It's a hard issue, Mike.
What we'd like to do, going forward,
is to provide more means for webmasters to communicate with
us, where money is NOT an issue. We're keen to do things where
we're able to help webmasters and they can help us - but where
webmasters don't have to pay.
Mike: Yeah, you know,
I think that there's a lot closer kind of relationship taking
place now between search engine marketers and search engines.
Well, certainly a lot closer than it used to be. It has been
a little bit of a strange space between the search engine
optimisation community and the search engines. I'm coming back to that in a minute actually, Daniel. What I want to do first is touch on the subject of cloaking.
Just as we're glossing over the search engine/webmaster relationship: cloaking has been, and still is, I guess, a very controversial issue. I'd like to get the
official Google line on cloaking, but first, and for the benefit
of those readers who don't know what cloaking is, could you
give a brief technical overview and let us know what actually
constitutes cloaking in the first instance?
Daniel: The short answer
is, that cloaking is identifying a search engine so that you
can serve the search engine different content to that which
you'd serve to a real browser. That's really the idea...
Mike: So Daniel, it's
not really so much about the method or the way that you actually
do it technically - it's more about the principle yes? It's
about the fact that a search engine crawler is going to see
one page - but when I click on that link I'm going to see
something else...
Daniel: Yes, people,
well some people, try to obscure things. Some people say that
Google does cloaking. You know that we detect things like
which country you're from and serve you different content based on your IP address. That's not really cloaking. There's nothing hidden about that. There's nothing underhand about that and we're very up front that we provide slightly different content, slightly different logos and so forth for people from different geographical locations. From a search engine standpoint, we have an obligation to our users to judge the content of the page that they will see when they visit those sites.
And anyone who tries to stand in the way of that, of our evaluating what a browser would actually see, anyone engaging in cloaking, is getting in the way of the relationship with our customers...
Mike: And that's something
you frown upon... I know. Actually, I'm going to put you on
the spot a little here Daniel. Let's just take this example
(and a true example it is). Take a large media company, a
TV station in fact. They have a web site which is based on
a very popular TV show. When somebody comes to that web site,
what they'll expect to see is streaming media, you know, audio
visual presentations, sound-bites, Flash and other animation,
that type of technology.
All of that stuff is very, very crawler
unfriendly as we've already discussed. So, suppose the crawler
is reading text which is completely accurate and relates to
the actual content of that web site. And when a person clicks
through the link and they get to see the web site which is
the audio visual experience they wanted - not the text version
- surely everybody's happy there?
I mean the search engine served up a relevant
result to the query. The end user is happy with the web site
she's viewing. The webmaster is happy that everybody else
is happy... Is that not okay...
Daniel: Well, there are
a couple of actors who are not happy in that situation. One
is the end user who does not have Flash installed in their
browser. They go to that site and they simply don't have the
fancy high performance, high bandwidth streaming technology
capability etc. They're obviously going to have a bad experience
in that particular scenario.
But that's not necessary you know. Almost
all new technology and HTML itself provides ways of including
alternate content for folks who don't have the fancy add-ons
and plug-ins. So there's one actor who may not be very happy.
Another actor who may not be very happy is the person who
is visually impaired and is not going to be able to take advantage
of text to speech - at all - because there is no text there.
It's pretty much a case of this technology needing to recognise these accessibility issues. And you know, it's entirely possible, in every technology that I've looked at, to build a single page which handles a wide variety of browser types, including text only browsers and robots which don't understand that fancy content. There is another issue as well, and that's that Google and other search engines would like, as best as possible, to understand what that content actually is, in general. That complicated high bandwidth Flash content etc. And we can't do that if we can't get to the real content.
So going forward, for instance, we have people who need to know how much better we need to be at Flash conversion. You know, if there's lots of Flash out there but we can't see it because people are cloaking us, then it looks like it's not important for us to work on that problem.
Mike: Got you. And you
know, I'd never even thought about that as an issue. It's
a fact. The less you're able to identify this sort of alien
content, as it is, the less you're likely to know how important
it is in its application across the web. That means it's less likely that you'll get around to trying to find a resolution...
Daniel: Absolutely, and that leads to a situation where content providers will have to do something special for search engines. And that means spending effort on doing things specifically for search engines when they shouldn't have to. We want to help content providers - right.
Every time someone does something special because a search engine exists, it's sort of, well, it's a case where we've failed.
Mike: And that takes
us back to the standard philosophical thing about, you know,
if the search engines didn't exist, would you be going to
all of this trouble?
Daniel: [Big laugh] Yeah,
yeah... To a certain extent, yes...
Mike: For the last edition of my book, one of the things I wanted to dispel was the notion of themed web sites. By that, I mean the idea that people had about trying to develop your entire web site around a couple of keywords. You know like, every page has to be about "blue widgets" and the domain should be "blue-widgets.com" yada, yada... I think the whole thing was nothing more than SEO propaganda - what are your thoughts?
Daniel: I think people sometimes mean different things by "themes." The statement above - that somehow your blue widget site would be "weaker" if it contained a page about tigers - is completely wrong.
No search engine would want to do that; having a page on tigers doesn't affect your ability to be a resource for blue widgets. We'd miss good blue widget pages if we excluded the sites that also talk about tigers.
However, there is a difference between "having a little bit of content about blue widgets" and "having in-depth content about blue widgets." Clearly we prefer in-depth (more useful) content. That's not so much a preference for themes as a preference for depth. "Utility" and "depth" really should be measured by a site's users.
Mike: So Daniel, there's
good advice and there's bad advice about how to achieve a
better rank at search engines. And people do spend SO MUCH
time trying to do whatever they can to get a decent rank at
a search engine. It's so important to be in the top ten at
Google and the other major search engines.
So, what about those people who do take the bad advice? You know, the spamming and stuff, and there's plenty of bad advice out there, what's the deal with them?
What kind of penalties are people likely to incur? There are
horror stories out there and people are afraid to do this...
or afraid to do that. What are the penalties? Can I get banned
from Google? Do I get banished for life for that stuff...
Daniel: Mike - you cannot
get banned from Google for life... But yes, you know particularly,
and these are extreme cases... Yes, bad spammers could be
removed from the index. But that happens so rarely. I mean,
very rarely...
You know, I could count up the number of people who say they've been banned from Google... Well, it's a lot more than the number who really have been banned from Google. So, I think there are some people who are just too
worried about the ban thing. You know, if you're just trying
to provide good content to your own users, you should simply
not be worried about a ban.
Mike: Absolutely, this
is one of the things that I say a lot. I mean I do study a
lot of this and I look at what's written and, well, there
are so many scaremongers out there. Don't do that, don't do
this... The search engine will have you banned for life. There'll
be a plague of locusts and... You know, that sort of thing... But basically Daniel, what you're saying is: If you've got good content, just present it as best as you possibly can and get it out there...
Daniel: Yes!
Mike: And the next "no-no"
Daniel. There are so many products that I see out there for
making automated submissions to search engines and doing ranking
checks and link pop checks. They're either online services
or desktop applications... I mean, basically, all of these
things are a "no-no" with search engines... Correct?
Daniel: Right, well again
it's a case of people interfering with our ability to serve
our non automated users. And yes, Google is a robot, by that
I mean our crawlers are robots. We obey the robots exclusion
protocol. If someone says, no robots on this site or a particular
part of a site, you know, we don't go there. We're not going
to crawl those pages. These automated tools you're talking
about here are also robots. And they are not obeying our no
robots text file.
The reason that we have a robots text file is because every query that comes into Google uses a fixed resource. We provide this resource for free to LIVE
people. And we want to continue doing that. In fact, that's
what a lot of us at Google enjoy, just providing those free
services to real people. But the more automated queries we
get, the harder it becomes to continue providing that free
service. It's just using resources that could be better spent
elsewhere...
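[Editor's note: for readers who haven't met it, the "robots text file" Daniel keeps mentioning is just a plain text file, robots.txt, sitting at the root of a site. Here's a minimal sketch, my own illustration using Python's standard library rather than anything Google-specific, of how a well-behaved robot checks it before fetching; the site and user agent name are made up.]

    # A robots.txt might contain, for instance:
    #   User-agent: *
    #   Disallow: /cgi-bin/
    # A well-behaved robot reads that file and stays out of disallowed areas.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    if rp.can_fetch("ExampleBot", "http://www.example.com/some/page.html"):
        print("Allowed - go ahead and fetch it.")
    else:
        print("Disallowed - stay out, as the robots exclusion protocol asks.")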
Mike: I think part of the problem also is that people are too fascinated by what kind of rank they have at search engines - regardless of whether it's any use having that rank. You know, whether
it brings them any traffic or not. And those issues seem to
become just so important to them. As far as I'm concerned,
I get a better idea of what traffic I'm getting, and from
where, just by looking at log files - not by doing a rank
check to see if I'm still number one for my own name spelt
backwards...
But - people are always going to be curious
I guess. What's the answer - you do a rank check by hand?
Daniel: Much more important
than checking your rank on search engines... In fact, I think
THE most important thing you can possibly do is track conversions
of users who visit your site - and to track that by referrer
- yes, as you say, to analyse your own site logs.
Then you can work backwards from there,
you can see what referrers are doing a good job for you. Are these directories or links from other sites providing a lot of good traffic AND are search engines providing good traffic? And then you can look at changes in those relationships.
You can detect between yesterday and today
that there are fewer conversions from, say, Alta Vista than
there used to be. Then you have a real, live person asking the question: "Why?" You can go to Alta Vista and
then try a query yourself and see what happened - did my listing
disappear, is my listing on a different page, that sort of
thing. It doesn't do any good to have that kind of information
if you're not using it. And if you are using the important
bits of information, then really, it takes no effort to get
it manually.
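[Editor's note: for the curious, here's the kind of thing Daniel is describing, boiled down to a minimal sketch. It's my own illustration, not anything he prescribed: it pulls referrer counts out of a web server access log in the common "combined" format, and the log file name is an assumption - point it at your own.]

    # Count visits by referring host from a combined-format access log,
    # where the referrer is the second-to-last quoted field on each line.
    import re
    from collections import Counter
    from urllib.parse import urlparse

    referrers = Counter()
    with open("access.log") as log:
        for line in log:
            quoted = re.findall(r'"([^"]*)"', line)  # request, referrer, agent
            if len(quoted) >= 2:
                ref = quoted[-2]
                if ref and ref != "-":
                    referrers[urlparse(ref).netloc] += 1  # count by host

    for host, count in referrers.most_common(10):
        print(f"{count:6d}  {host}")

From there it's a short step to set those hosts against whatever you count as a conversion and see which referrers actually pay their way.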
Mike: So, as I said,
you can learn a lot more about your success from your log
files than you will doing a rank check for a keyword or phrase,
which even if you're number one, doesn't convert for you...
Can I just touch on the subject of the
relationship between search engines and search engine marketers, or optimisers as they are also known. I said I'd try and get back to it again. You can see that there is a lot more contact
with the search engine marketing community as it develops
into a fully fledged business sector. Guys like yourself from
Google turning up at conferences...
Do you foresee the possibility of search
engines getting into relationships in the way that, in conventional
advertising and marketing, agencies have a very strong tie
with the media companies?
Daniel: Yes, I think
that's entirely possible. I mean, some agencies, while they
don't have formal relationships with search engines, do certainly
have a lot of understanding about search engines and can give,
overall, better advice about advertising strategy to clients.
You can certainly see that within the search engine optimisation
industry itself, that there is a transition towards more "full
service" marketing and less emphasis on, say ranking
checks, or any single factor. I certainly would not be surprised
to see that trend continue and...
Well, I don't have any special insight...
all I do is work for a search engine! But, yeah, I think customers
want a one stop solution.
Mike: What about the
general state, or condition, of the Industry? There have been
so many changes taking place, I really have to ask about the
acquisitions. Yahoo! buys Inktomi, Overture buys Alta Vista and Fast - Yahoo! buys Overture... Is that all about consolidation?
Daniel: Sure, definitely
there has been a medium term trend towards consolidation.
Part of that is driven by the recognition that search is important.
And I'm very happy that Google has played some role in convincing
people that search is important. But there's no escaping the
fact that, the business we're in is very competitive.
It's competitive the day before Yahoo! buys Inktomi: it's competitive the day after! We know that people come to Google because we serve good results. Every one of us here is working really hard to ensure that we continue to provide better results all the time. That way, people will always be happy to come to Google. And for as long as we're doing that, it really doesn't matter who owns Alta Vista, or Fast or Inktomi.
There are a lot of smart people at those
companies. They do excellent work, which means that we
have our work cut out for us. But there again we always have.
We've always been competing against the best and we think
we can continue to do that.
Mike: Will we continue
to see Google results at Yahoo! Daniel?
Daniel: Who knows? They're
a great partner of ours and we certainly do everything we
can to make them happy. There are lots of ways that Yahoo!
could use a search engine. We hope that our relationship continues,
but, ultimately, it's up to them.
Mike: Daniel, thanks
so much for your time, I know I'm keeping you away from your
lunch...
Daniel: Correct - and
I'm extremely hungry! But it was a pleasure talking to you
Mike.
(c) 2003 Net Writer Publishing.
< http://www.google.com
>
MEET
THE NEW ASSOCIATE EDITOR OF e-marketing-news:
It's one of those moments I recall all
too well. I was being viciously beaten around the head with
the very hefty conference handbook at SES in San Jose. There
was a woman's voice and she was screaming: "Okay, okay,
I'll do it... Now leave me alone and stop stalking me before
I call a cop - creep!!"
And so, in that historic moment, I applied
an ice pack to my head and celebrated my success in having
convinced one of the web's leading SEM professionals to join
me here, at your favourite e-rag.
But why don't I shut up for a minute
(as if that were really possible!!) and let Christine introduce
herself. So how do you feel about teaming up for the newsletter
then?
"Newsletter? Heck I thought that
was a blog! <big grin>
Okay - I'll be serious. Yes, Mike and
I will be joining forces to keep the newsletter alive. Ages
ago I had mentioned that it would be fun to work on a joint
project sometime. Little did I know that offhand comment would
turn into managing his newsletter.
I have several years' experience running an online newsletter - I should know better. The truth is
there is something irresistible about the opportunity to work
with someone as knowledgeable and hilarious as Mike Grehan.
I just hope I can stop laughing long enough to work."
Laughing? Laughing? You mean I'm not
being taken seriously!
BUT really seriously: It's an honour
and a privilege to have you come on board to shape us up for
next year Christine.
< http://www.keyrelevance.com
>
SEARCH
ENGINE STRATEGIES CONFERENCE COMES TO CHICAGO:
Yes, there's still time for you to make
that last minute decision and head to Chicago for the final
search engine strategies conference of this year.
I'll be teaming up with my panel partners
Brett Tabke of WebMasterWorld and Anthony Muller of Range
Online Media this time for the organic listings forum.
This is definitely one of my favourite
sessions as it's a chance to tackle all things "webmastery".
It usually starts fairly general, but then it gets right down
into the weeds with stuff on technical platforms, crawling
issues, linkage data and even a good old spam discussion from
time-to-time.
So, don't be afraid to "pipe-up"
with anything at all at this session. With guys as knowledgeable
as Brett and Anthony to hand, and stalwart Detlev Johnson
moderating, you really do have access to a dream-team of search
consultants.
And let's not forget the link building
session, always great fun and very informative. It's the San
Jose team back together again this time. So that's me, "Link Mensch" Eric Ward, my friend Paul Gardi from Jeeves/Teoma
and Marissa Mayer from Google.
This, again, is very informative with
Marissa explaining PageRank and linkage data over at Google
and Paul Gardi explaining how web communities are identified
through linkage data. (A subject close to my heart as you
know!)
Organised by the great man himself, Danny
Sullivan, SES now attracts the leading practitioners in the
field of search marketing and the vendors, media companies
and technology firms that go with it.
I frequently get asked by people: "Is it worth going if I already know that stuff?" The answer is fairly simple: the conference is designed to accommodate attendees at all levels. So whether you're brand new to the discipline or an old hand, you'll still find sessions which are tailored to suit.
And not only that, you get to hang with
the coolest people in the industry at all of the sponsored
events such as Overture's SES Mixer and also the SEMPO (Search
Engine Marketing Professional Organisation) lig. So there's
free drinks too!
Don't be afraid to come over and speak
and you'll also get to meet our new associate editor Christine
Churchill as well as Jill Whalen and her High Rankings crew
and more!
< http://www.jupiterevents.com/sew/fall03/index.html
>
BUILD
YOUR MAILING LIST BY *UNSUBSCRIBING* EVERYONE:
Jim Sterne, I'm proud to say, is a friend
of mine. He and I have worked together a few times this year,
both here and in the States.
I listen to what Jim says, because he
knows what he's talking about. But when I got a mail from
him, just the same as the rest of the subscribers to his newsletter,
I thought...
Well? Has my mate been sitting in the California sun too long? Turns out: Not at all! In fact, once again, he's simply just ahead of the rest of us! As usual.
What is the point of having a figure which SUGGESTS how big your mailing list is? Wouldn't you just rather know how many READERS you have?
And, to that end, he unsubscribed them
all and told them to subscribe again if they thought his newsletter
was important enough!
I subscribed again, of course. But what
happened after that? Here's a tale and a short answer for
brave marketers and realists! I wrote to Jim in CA and asked him what happened.
Too much sun? Nope - just too little
information about my readers. Are they all avid readers or
are they too lazy to unsubscribe? Are they hanging on every word, or unaware that their spam filters are eating every issue? No way to tell. So - it's Off With Their Heads!
What happened?
Within four days, 10% had re-subscribed.
I expected about 2% so this was a real ego boost.
What would I recommend to others who
are tempted to do the same? Phase it in.
I quit everybody cold turkey - all at
once - no notice.
I'd suggest you tell people you're going
to do it.
Remind them you're going to do it. And
then do it - with an opportunity to re-subscribe each time.
And let me know how it goes...
< http://www.targeting.com
>
THOSE LITTLE THINGS
WORTH A MENTION. Includes:
o MIKE GREHAN APPOINTED AS MANAGING DIRECTOR,
iPROSPECT, EUROPE. Yes, your brave and fearless leader has
joined the world's leading search marketing firm.
< http://biz.yahoo.com/prnews/031104/netu016_1.html
>
o FREE SEARCH ENGINE MARKETING SUITE OF
TOOLS. Did he say free? I believe he did!
New features in Web CEO 4.2 include:
- Free Link popularity analysis tool
- Free Optimization tool
- Free Knowledge base updates
- Free support
- Free SE submission (166 search engines to choose from).
< http://www.e-marketing-news.co.uk/webceo
>
o WEBMASTERWORLD.COM ANNOUNCES PUBCONFERENCE
V1 IN FLORIDA: I'll be there with my SES team mate Brett Tabke
- find out more!
< http://www.webmasterworld.com/conference
>
o ONLINE RETAIL STUDY FOR CUSTOMER FOCUSED
EXCELLENCE: Conversion and analytics gurus Bryan and Jeffrey
Eisenberg, of Future Now Inc., have prepared a free report.
< http://www.futurenowinc.com
>
o SBI DOES CARTWHEELS OVER HTML EDITOR
COMPATIBILITY. And FormBuilder... and Masters Course and...
< http://www.e-marketing-news.co.uk/sbi
>
o DANIEL BAZAC IN NEW YORK POINTS ME
IN THE DIRECTION OF: "Search the Web More Efficiently:
Tips, Techniques and Strategies"(He also has a search
blog too!)
< http://www.web-design-in-new-york.com/search-the-web.html
>
o YOU MAY BE A GREAT ONLINE MARKETER:
But how good are you in the sack?
< http://www.e-marketing-news.co.uk/emodeIQ
>
Editor: Mike Grehan. Search
engine marketing consultant, speaker and author.
http://www.search-engine-book.co.uk
Associate Editor: Christine
Churchill. KeyRelevance.com
e-marketing-news is published selectively, on a 'when it's ready' basis.
At no cost you may use the content of this newsletter on
your own site, providing you display it in its entirety (no
cutting) with due credits and place a link to:
< http://www.e-marketing-news.co.uk
>