Matt Cutts: There is this spectrum where on one side a site validates with the W3C validator; it's very clean, and I encourage it, but you don't get a ranking boost for it. On the other end of the spectrum there are people who make really, really sloppy errors. They are coding a site by hand and might not close their tables, or they might have lots of nested tables.
So what we try to do, since we have to crawl the web as we find it, is process it as well as we can. We handle sites that don't necessarily validate and that have some simple syntax errors, but it is possible to have degenerate web pages that can effectively cause your page not to be indexed. If a page is hundreds of megabytes long, it might not get completely indexed.
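Google's actual size limits aren't stated here, but the general behavior he describes, indexing only up to some byte cap, can be sketched with a size-capped read. The cap value and function name below are illustrative assumptions, not Google's real numbers:

```python
import io

# Hypothetical cap for illustration only; Google's real limit is not public.
MAX_BYTES = 1_000_000

def read_capped(stream, cap=MAX_BYTES):
    """Read at most `cap` bytes from a page, noting whether we truncated."""
    data = stream.read(cap)
    truncated = bool(stream.read(1))  # anything left over means we cut it off
    return data, truncated

# A "page" far larger than the cap: only the first MAX_BYTES get processed.
huge_page = io.BytesIO(b"<html>" + b"x" * 2_000_000)
body, truncated = read_capped(huge_page)
```

Anything past the cap simply never reaches the indexer, which is why an enormous page may be only partially indexed.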
We have seen some very limited situations where we had a regular expression that tried to match something: it would match the first half, and then people would go off and have the most random gibberish text. So the regular expression eventually died and brought that machine down with it.
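He doesn't say what the expression was, but the failure mode he describes, a pattern that half-matches and then drowns in gibberish, is classic catastrophic backtracking. A minimal sketch with a deliberately pathological pattern (an illustrative example, not Google's actual expression):

```python
import re
import time

# Nested quantifiers like (a+)+ are a textbook backtracking trap.
pattern = re.compile(r"(a+)+$")

def time_failed_match(n):
    """Time how long the pattern takes to *fail* on n a's plus junk."""
    text = "a" * n + "!"  # the trailing "!" plays the role of the gibberish
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

# Each extra 'a' roughly doubles the work: after the half-match fails,
# the engine retries every possible way of splitting the run of a's.
result, elapsed = time_failed_match(20)
```

With enough input, a pattern like this can pin a CPU for hours on a single document, which is how one bad page can take a machine down.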
So in processing the web we found there were a few documents that would tickle problems in this particular regular expression. But for the most part, if you have a reasonable page, something that most users can see, we will be able to process it and index it.
The easy way to check is to open the page in a text browser, or in two or three of your favorite browsers, to make sure that you can see the text. If all that text is visible, then Google should be able to index it.
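A rough programmatic stand-in for the text-browser check is to strip the markup and see what text survives. This sketch uses Python's standard `html.parser` and is only an approximation of what a text browser renders (it ignores CSS-hidden content, for instance); the class and function names are mine:

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text a text browser would show: data outside script/style tags."""
    HIDDEN = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ("<html><head><script>var x = 1;</script></head>"
        "<body><p>Hello, crawler.</p></body></html>")
```

If `visible_text` comes back empty for your page, a crawler that reads the HTML the same way likely sees nothing to index either.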