Audit Report: GoogleOther Crawler Collecting Non-Existent (Phantom) URLs from JavaScript Code Blocks
Google's research crawler (GoogleOther) is collecting non-existent (phantom) URLs from JavaScript code blocks, leading to 404 errors. This behavior appears to be part of Google's research and development activities rather than its main indexing process, and it has the following effects:
- Attempting to crawl non-existent (phantom) URLs
- Potential negative SEO impact due to 404 errors
- Wasted crawl budget
- Unnecessary server load
- Name of the auditor: Zsolt Tövis
- Name of the company: Stacklegend
- Competence of the auditor: Full Stack Developer, Co-Founder
- Date of the audit start: 2025-09-03
- Date of the report completion: 2025-09-05
- Date of final documentation: 2025-09-18 (delayed while waiting for the Google Search Console warning-message confirmation)
Goal: find out whether Google Crawler collects non-existent (phantom) URLs from parts of HTML documents where it should not.
- While taking the Stacklegend Blog site live, we noticed that Google Crawler was collecting and attempting to crawl URLs that appeared nowhere in the website's HTML, neither as href attributes nor in any other HTML tags, nor in the sitemap. In our case, Not Found (404) crawls accounted for 36% of all crawls (critical) within the first 24 hours.
- After a deep investigation, we found that Google Crawler was parsing URLs out of JavaScript code blocks, even when those URLs were not part of any HTML tags or attributes.
- This behavior was causing Google Crawler to attempt to crawl non-existent (phantom) URLs, leading to 404 errors and potentially negatively impacting the website's SEO performance. Furthermore, it was creating unnecessary load on the server and wasting crawl budget.
- We created a very simple `index.html` file in the `test_files` folder, which is deployed at `google-crawler-test-20250903.stacklegend.com`.
- The `index.html` file contains various HTML elements, including HTML comments and script tags, to test how Google Crawler interacts with different parts of the document (a minimal illustration follows this list).
- We connected the website to Google Search Console and submitted the `sitemap.xml` and the start URL `https://google-crawler-test-20250903.stacklegend.com/` for indexing.
- We monitored the URLs collected by Google Crawler over the crawling period to see whether any non-existent (phantom) URLs were picked up from the JavaScript code blocks or comments.
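The exact file we deployed is in `test_files/index.html`; the excerpt below is a minimal, hypothetical reconstruction of the kinds of constructs it contains, inferred from the phantom URLs Google later collected. None of the URL strings is attached to any `href` attribute or rendered link.

```html
<!DOCTYPE html>
<html lang="en">
<head><title>Google crawler phantom-URL test</title></head>
<body>
  <h1>Crawler test page</h1>
  <!-- No <a href> anywhere on the page points at the paths below. -->
  <script>
    /*
     * URL that exists only inside a JS comment block:
     * https://google-crawler-test-20250903.stacklegend.com/url-from-js-comment-block-1
     */
    // https://google-crawler-test-20250903.stacklegend.com/url-from-js-comment-block-2

    // URL stored only in a plain variable, never used for navigation:
    var urlFromJsVariableName =
      'https://google-crawler-test-20250903.stacklegend.com/url-from-js-variable-name';

    // URL stored only as an object property:
    var routes = {
      phantom: 'https://google-crawler-test-20250903.stacklegend.com/url-from-js-object-property'
    };

    console.log(urlFromJsVariableName, routes); // keep the values "used"
  </script>
</body>
</html>
```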
- Based on our server logs, the specific crawler involved is GoogleOther, which, according to Google's public documentation, is used for research and development purposes: "Crawling preferences addressed to the GoogleOther user agent don't affect any specific product. GoogleOther is the generic crawler that may be used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development. It has no effect on Google Search or other products." (https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#googleother)
- Important Note: While Google's documentation states that GoogleOther is used for research purposes and "has no effect on Google Search or other products," the reality is more complex. These phantom URLs do appear in Google Search Console as crawl errors, and Google continues to track and attempt to crawl them over time, suggesting some level of integration with Google's broader crawling infrastructure. The actual SEO impact remains unclear and may be more significant than Google's documentation suggests.
- During the crawling period, Google Crawler collected several URLs that do not exist in the deployed `index.html`.
- The collected URLs were not part of any HTML attributes, tags, or sitemap entries; they appeared only inside JavaScript code blocks.
- Examples of phantom URLs collected:
- https://google-crawler-test-20250903.stacklegend.com/url-from-js-comment-block-1
- https://google-crawler-test-20250903.stacklegend.com/url-from-js-comment-block-2
- https://google-crawler-test-20250903.stacklegend.com/url-from-js-variable-name
- https://google-crawler-test-20250903.stacklegend.com/url-from-js-object-property
- Monitoring via server logs and Google Search Console confirmed that Google attempted to crawl these URLs, with each attempt resulting in a 404 response from the server.
- This confirms that Google Crawler parses URLs out of JavaScript code blocks even when those URLs are not linked or intended to be crawled.
- Impact:
- Attempting to crawl non-existent (phantom) URLs
- Potential negative SEO impact due to 404 errors
- Wasted crawl budget
- Unnecessary server load
- Modern web applications use JavaScript to build URLs dynamically for routing, navigation, or internal logic.
- Google Crawler parses HTML pages to discover URLs. When it encounters `<script>` tags, it does not execute the JavaScript; instead, it applies pattern matching or regex-style extraction to the raw JS code.
- This approach can mistakenly identify URL-like strings inside JS comments, variables, objects, or template strings as valid URLs.
- As a result, Google Crawler collects phantom URLs that do not exist as HTML links, attributes, or sitemap entries.
- For example, in a React/Next.js application, JavaScript objects often define route segments such as `{ locale: 'en', category: 'category-1', href: '/product-1' }`. At runtime, this object might be assembled into the URL `/en/category-1/product-1`. These href strings exist only in the code and do not correspond to rendered HTML links, yet Google Crawler's parser treats them as real URLs and attempts to crawl them, causing phantom URLs and 404 errors (a sketch of this failure mode follows this list).
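Google has not published how its parsers extract URLs from script bodies, so the following is a hypothetical, minimal sketch of the kind of regex-based scan that would explain the behavior we observed; the regexes and the sample `source` string are our assumptions, not Google's actual code.

```js
// Hypothetical sketch: a naive regex scan over raw JS source text.
// A crawler doing this would surface the same phantom URLs we observed.
const source = `
  // https://example.com/url-from-js-comment-block
  const route = { locale: 'en', category: 'category-1', href: '/product-1' };
  const api = 'https://example.com/url-from-js-variable-name';
`;

// Match absolute URLs anywhere, including inside comments...
const ABSOLUTE_URL = /https?:\/\/[^\s'"<>)]+/g;
// ...and quoted root-relative paths, including object property values.
const QUOTED_PATH = /['"](\/[a-z0-9\-_/]+)['"]/gi;

const found = new Set();
for (const m of source.match(ABSOLUTE_URL) ?? []) found.add(m);
for (const m of source.matchAll(QUOTED_PATH)) found.add(m[1]);

console.log([...found]);
// -> the comment-block URL, the variable URL, and '/product-1',
//    even though none of them is a rendered HTML link.
```

Running this under Node.js prints all three URL-like strings; an extractor of this kind cannot distinguish a comment or an object property from a real link, which matches the phantom-URL pattern in our logs.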
These findings raise several open questions:
- How do false 404 errors affect a page's indexing and ranking?
- How much crawl budget is wasted on phantom URLs?
- How many websites worldwide could be affected by this bug?
- Are important pages being under-crawled because the crawler wastes resources on phantom URLs?
- Can internal link equity be diluted due to non-existent URL indexing?
- Are sitemap submissions being partially ignored or deprioritized due to phantom URL noise?
All evidence collected during the audit can be found in the `proof/` folder:
- Reproducibility: The issue can be reproduced by deploying the test application and monitoring the URLs crawled by Google (a minimal log-filtering sketch follows this list).
- Timeline Note: The audit was completed on September 5th, 2025, but final documentation was delayed until September 18th, 2025, while we waited for the Google Search Console warning-message confirmation, which arrived just before this report was finalized.
- GSC crawl reports and 404 pages:
- Server logs:
These files demonstrate the crawler behavior that triggers phantom URL collection and 404 errors.
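For reproduction, the sketch below shows the kind of log filtering we relied on. It assumes a combined-format access log at `access.log` (a placeholder path; adjust to your server) and simply pairs the GoogleOther user-agent token with 404 status codes.

```js
// Minimal sketch: list the paths that returned 404 to GoogleOther,
// given a combined-format access log. 'access.log' is a placeholder.
const fs = require('fs');

const lines = fs.readFileSync('access.log', 'utf8').split('\n');
for (const line of lines) {
  // Combined format: ... "GET /path HTTP/1.1" 404 ... "user-agent"
  const hit = line.match(/"[A-Z]+ (\S+) [^"]*" (\d{3})/);
  if (hit && hit[2] === '404' && line.includes('GoogleOther')) {
    console.log(hit[1]); // phantom path requested by GoogleOther
  }
}
```

On a log like ours, this filter isolates the phantom URLs listed above.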
The GoogleOther crawler collects and attempts to crawl phantom URLs extracted from JavaScript code blocks. Despite Google's documentation stating that GoogleOther "has no effect on Google Search or other products," the evidence shows these phantom URLs appear in Google Search Console as crawl errors and are persistently tracked by Google's systems. This suggests potential integration with Google's broader crawling infrastructure, making the actual SEO impact uncertain and potentially more significant than officially documented.