Crawling Statistics:
Does Search Console Make Log Files Obsolete?


By: Rank Jacker | Updated: April 10, 2021


On November 24th, 2020, Google launched the new Google Search Console report for crawling statistics. This gives us exciting crawling insights out-of-the-box and is super helpful!


But the information should also be treated with caution: it paints a pretty perfect crawling world that doesn’t reflect the full truth. We’ll show you what the report can and can’t do.

Review: What Were The Old Crawling Statistics Capable Of?

Crawling statistics have existed in Search Console before. However, the old report is no longer available as of January 19, 2020.
The three old reports were very straightforward:
  1. Pages crawled per day: Here you can see whether there have been major spikes. If there is an outlier right after a release that seemed irrelevant for SEO, it is worth digging deeper. However, not every spike is a cause for concern – sometimes Google simply crawls a lot of images.
  2. Size of the downloaded pages: This is nice to know, but not very helpful. At best, it hints at which kinds of URLs were crawled (images, for example).
  3. Average response time: This report is very helpful for keeping an eye on server performance. (A sketch for pulling these numbers out of your own log files follows this list.)
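If you want to reproduce exactly these metrics from your own server logs, a few lines of Python are enough. A minimal sketch, assuming a combined-format access log with the response time appended as an optional last field – the `access.log` path and that extra field are assumptions about your setup, not something Google provides:

```python
import re
from collections import defaultdict

# Assumed line format (combined log with the request time as an optional last field), e.g.:
# 66.249.66.1 - - [07/Dec/2020:10:15:32 +0000] "GET /page HTTP/1.1" 200 5123 "-" "Googlebot/2.1 ..." 0.123
LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] "(?:GET|POST|HEAD) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<ua>[^"]*)"(?: (?P<rtime>[\d.]+))?'
)

pages_per_day = defaultdict(int)    # crawled URLs per day (old report no. 1)
response_times = defaultdict(list)  # response times per day in seconds (old report no. 3)

with open("access.log", encoding="utf-8") as f:  # placeholder path
    for line in f:
        m = LINE.match(line)
        if not m or "Googlebot" not in m.group("ua"):
            continue  # only count Googlebot requests
        day = m.group("day")  # e.g. "07/Dec/2020"
        pages_per_day[day] += 1
        if m.group("rtime"):
            response_times[day].append(float(m.group("rtime")))

# Note: the day strings sort alphabetically, not chronologically.
for day, count in sorted(pages_per_day.items()):
    times = response_times[day]
    avg_ms = 1000 * sum(times) / len(times) if times else 0
    print(f"{day}: {count} requests, avg response {avg_ms:.0f} ms")
```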


The new report offers all of these features as well, just within the new Search Console.

What New Features Does The New Report Bring?


The new report brings a lot of numbers and evaluations. That is probably also why Google hides the report from laypeople: it is only accessible under “Settings”. There you will find the following features.

Differentiation According To File Type

In the old report, you had to guess whether Google just crawled a lot of images.

Each file type can even be clicked on, and then we see that the outliers were mainly caused by images – which is rarely problematic.
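You can build a similar file-type breakdown from your own logs. A rough sketch, assuming a log that has already been filtered to (verified) Googlebot requests and classifying simply by URL extension – the report itself groups by the returned file type, so treat this mapping only as an approximation:

```python
import re
from collections import Counter
from urllib.parse import urlsplit

LINE = re.compile(r'"(?:GET|POST|HEAD) (?P<url>\S+)')

# Very rough extension-to-bucket mapping, made up for illustration.
TYPES = {
    ".html": "HTML", ".htm": "HTML", "": "HTML",
    ".jpg": "Image", ".jpeg": "Image", ".png": "Image", ".webp": "Image", ".gif": "Image",
    ".css": "CSS", ".js": "JavaScript", ".json": "JSON", ".pdf": "PDF", ".xml": "Other XML",
}

counts = Counter()
with open("googlebot.log", encoding="utf-8") as f:  # placeholder: pre-filtered Googlebot log
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        path = urlsplit(m.group("url")).path
        last_segment = path.rsplit("/", 1)[-1]
        ext = path[path.rfind("."):].lower() if "." in last_segment else ""
        counts[TYPES.get(ext, "Other")] += 1

for file_type, n in counts.most_common():
    print(f"{file_type}: {n}")
```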

Differentiation According To Response (Status Code)

This feature is cool! Because these data points are all clickable as well, you can see with just one click whether the number of crawled 404 pages has gone up.
This is a very helpful addition to the existing coverage report, which only shows the total number of URLs.
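Cross-checking these numbers against your server is just as easy. A small sketch, again assuming an access log that has already been filtered down to Googlebot requests (`googlebot.log` is a placeholder):

```python
import re
from collections import Counter, defaultdict

LINE = re.compile(r'"(?:GET|POST|HEAD) (?P<url>\S+)[^"]*" (?P<status>\d{3}) ')

by_status = Counter()
examples = defaultdict(list)  # keep a few sample URLs per status code

with open("googlebot.log", encoding="utf-8") as f:  # placeholder path
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        status = m.group("status")
        by_status[status] += 1
        if len(examples[status]) < 5:
            examples[status].append(m.group("url"))

for status, count in by_status.most_common():
    print(f"{status}: {count} requests, e.g. {examples[status][:3]}")
```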

Differentiation According To Googlebot

The distinction between bots is also new and quite interesting: Does Google mainly crawl images? And are there page types that are increasingly being crawled by the desktop bot?
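A rough log-based counterpart is to classify the requests by user-agent string. The mapping below is an assumption for illustration, not an official list of Google’s crawlers:

```python
from collections import Counter

def classify_googlebot(user_agent: str) -> str:
    """Rough classification by user-agent substring (an assumption, not an official mapping)."""
    if "Googlebot-Image" in user_agent:
        return "Googlebot Image"
    if "Googlebot-Video" in user_agent:
        return "Googlebot Video"
    if "Googlebot" in user_agent:
        # The smartphone crawler announces a mobile Chrome UA, the desktop crawler does not.
        return "Googlebot Smartphone" if "Mobile" in user_agent else "Googlebot Desktop"
    return "other"

counts = Counter()
with open("googlebot.log", encoding="utf-8") as f:  # placeholder: pre-filtered Googlebot log
    for line in f:
        # Assumed combined format: the user agent is the last quoted field.
        ua = line.rsplit('"', 2)[-2] if line.count('"') >= 2 else ""
        counts[classify_googlebot(ua)] += 1

for bot, n in counts.most_common():
    print(f"{bot}: {n}")
```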


Differentiation According To Purpose

The purpose is also new and is something that cannot be evaluated with log files: whether Google crawled a URL it already knew or discovered a new one.

The breakdown by discovery, i.e. the overview of newly found URLs, is extremely helpful in the detailed view.


In one case, a strikingly large number of new URLs was crawled in a single day – reason enough to ask our customer whether that many new products had been launched without letting us know!
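You cannot read the crawl purpose out of your log files directly, but you can approximate the “newly discovered URLs” part by remembering which URLs you have already seen in earlier runs. A sketch, with `seen_urls.json` and `googlebot.log` as placeholder files:

```python
import json
import re
from pathlib import Path

# Assumed combined-format log, already filtered to Googlebot requests.
LINE = re.compile(r'\[(?P<day>[^:]+):[^\]]+\] "(?:GET|POST|HEAD) (?P<url>\S+)')

SEEN_FILE = Path("seen_urls.json")  # placeholder: URLs observed in earlier runs
seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()

new_per_day = {}
with open("googlebot.log", encoding="utf-8") as f:  # placeholder path
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        url = m.group("url")
        if url not in seen:
            seen.add(url)
            new_per_day[m.group("day")] = new_per_day.get(m.group("day"), 0) + 1

for day, n in sorted(new_per_day.items()):
    print(f"{day}: {n} URLs crawled for the first time")

SEEN_FILE.write_text(json.dumps(sorted(seen)))  # persist for the next run
```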

Host Status

Another report that qualifies as a regular monitoring bookmark is the host status. Here you can see at a glance whether there have been any problems with the server recently that you should address.
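A simple log-based companion to this report is to watch the share of 5xx responses per day. Another sketch under the same assumptions as above; the 5 % alert threshold is an arbitrary value for illustration, not one Google uses:

```python
import re
from collections import defaultdict

LINE = re.compile(r'\[(?P<day>[^:]+):[^\]]+\] "[^"]*" (?P<status>\d{3}) ')

totals = defaultdict(int)
errors = defaultdict(int)  # 5xx responses per day

with open("googlebot.log", encoding="utf-8") as f:  # placeholder: pre-filtered Googlebot log
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        day = m.group("day")
        totals[day] += 1
        if m.group("status").startswith("5"):
            errors[day] += 1

for day in sorted(totals):
    rate = errors[day] / totals[day]
    flag = "  <-- check your server" if rate > 0.05 else ""  # arbitrary 5% alert threshold
    print(f"{day}: {errors[day]}/{totals[day]} 5xx responses ({rate:.1%}){flag}")
```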


Why Should The Data Be Treated With Caution?

As beautiful as the data is, it should still be treated with caution. This becomes clear when you compare it with the old report. The numbers in the old report differ significantly from those in the new one.


We dug deeper into a property in the Search Console: the old report showed 14,422 URLs on December 7th. The new report only shows 7,041 crawled pages. So where are the other 7,000+ URLs in the new report?

In the log files, we find 6,950 requests for HTML pages on this day, while the new report only lists 3,229 requests for HTML pages. That’s less than half.


Google is filtering here and only wants to display relevant URLs. Why Google doesn’t show all of them is unclear. However, Google’s reasoning probably boils down to: “Some of these answers would unsettle you.”


What is missing, in this case, is pretty clear on closer inspection: 3,021 of the crawled URLs contain a parameter and are not indexable. However, Google does not show a single parameter URL in the example URLs, even though they make up almost half of the crawled URLs.
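Counting this yourself is trivial once the log data is at hand. A sketch of the parameter check used above, again with a placeholder path:

```python
import re
from urllib.parse import urlsplit

LINE = re.compile(r'"(?:GET|POST|HEAD) (?P<url>\S+)')

total, with_params = 0, 0
examples = []

with open("googlebot.log", encoding="utf-8") as f:  # placeholder: pre-filtered Googlebot log
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        total += 1
        if urlsplit(m.group("url")).query:  # URL carries a ?parameter=... part
            with_params += 1
            if len(examples) < 5:
                examples.append(m.group("url"))

if total:
    print(f"{with_params} of {total} crawled URLs contain parameters ({with_params / total:.1%})")
print("examples:", examples)
```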


Here’s the first problem: The fact that Google crawls so many parameter URLs is quite relevant to me – because they shouldn’t be crawled in the first place. And on top of that, they contain broken links that don’t appear in my crawls.


The second problem is the example URLs: a comparison shows that of the 30 most crawled pages in the log files, only 22 appear in the Search Console list. If you include the parameter URLs, only 12 of the top 30 URLs are there.
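The comparison itself is only a few lines of code once you export the example URLs from the Search Console report. The CSV filename and the “URL” column below are assumptions about your own export, not a fixed Google format:

```python
import csv
import re
from collections import Counter

LINE = re.compile(r'"(?:GET|POST|HEAD) (?P<url>\S+)')

# Top 30 most crawled URLs according to the log file.
hits = Counter()
with open("googlebot.log", encoding="utf-8") as f:  # placeholder: pre-filtered Googlebot log
    for line in f:
        m = LINE.search(line)
        if m:
            hits[m.group("url")] += 1
top_30 = {url for url, _ in hits.most_common(30)}

# Example URLs exported from the new Search Console report (assumed CSV with a "URL" column).
with open("gsc_example_urls.csv", newline="", encoding="utf-8") as f:
    gsc_urls = {row["URL"] for row in csv.DictReader(f)}

overlap = top_30 & gsc_urls
print(f"{len(overlap)} of the top 30 log-file URLs also appear in the Search Console sample")
print("missing from Search Console:", sorted(top_30 - gsc_urls)[:10])
```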

Does Search Console Make Log Files Redundant?

So the new picture is helpful, but far from complete. It is still worthwhile to do a log file analysis, or to collect the data yourself with Ryte’s BotLogs. The particularly exotic URLs are usually exactly the ones I want to get rid of; if these are withheld from me, I lose important potential.

Still Good!

Despite the concerns mentioned, the new report is very good! There are many new, exciting insights that you can get quickly and easily, without having to deal with large amounts of data.

By the way, very few people actually have to deal with large crawling data sets: actively managing crawling only makes sense from several thousand up to millions of URLs. And then hopefully you will have a good SEO agency on hand that is familiar with it anyway. 😉

