It gets worse, though. Indexes like Moz report total stats (the number of links or domains in our index), but those totals can be very misleading. Imagine a restaurant that claims to have the largest wine selection in the world with 1,000,000 bottles. They can make that claim, but it wouldn't be useful if they had 1,000,000 of a single type, or only Cabernet, or half-empty bottles. It's easy to mislead by just throwing out big numbers. Instead, it would be much better to randomly sample wines from around the world and check whether the restaurant has each one in stock, and in what quantity. Only then would you have a good measure of their inventory. The same goes for measuring link indexes — that's the theory behind my methodology.
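The wine analogy can be sketched in a few lines of Python. This is a toy illustration, not the author's actual methodology: `world`, `inventory`, and `estimate_coverage` are hypothetical names, and the data is made up to show how a uniform random sample estimates coverage without trusting a vendor's headline total.

```python
import random

def estimate_coverage(population, index, sample_size, seed=0):
    """Estimate the fraction of `population` present in `index`
    by checking membership for a uniform random sample."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    hits = sum(1 for item in sample if item in index)
    return hits / sample_size

# Toy data: a "world" of 10,000 wines; the restaurant stocks 25% of them.
world = [f"wine-{i}" for i in range(10_000)]
inventory = set(world[:2_500])

# A sample of 1,000 should land close to the true 0.25 coverage.
print(estimate_coverage(world, inventory, 1_000))
```

The key point is that the estimate depends only on the sample being uniform; a biased sample (say, only popular wines) would systematically overstate coverage, which is exactly the problem discussed next.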
Unfortunately, it turns out that getting a random sample of the web is really hard. The first instinct for most of us at Moz was to just take a random sample of the URLs in our index. Of course we couldn’t — that would skew the sample toward our own index, so we scrapped that idea. The next thought was: “We know all these URLs from the SERPs we collect — maybe we can use them.” But we knew that would be biased toward high-quality pages. Most URLs don’t rank for anything — scrap that idea. It was time to take a deeper look.
I fired up Google Scholar to see if any other organizations had attempted this process and found a paper, produced by Google in June 2000, called "On Near-Uniform URL Sampling." I hastily pulled out my credit card to buy the paper after reading the first sentence of the abstract: "We consider the problem of sampling URLs uniformly from the Web." It was exactly what I needed.
Starting with a bad sample guarantees bad results.