Audit Report: GoogleOther Crawler Collecting Non-Existent (Phantom) URLs from JavaScript Code Blocks
Google's research crawler (GoogleOther) is collecting non-existent (phantom) URLs from JavaScript code blocks, leading to 404 errors. This behavior appears to be part of Google's research and development activities rather than its main indexing process, and it has the following effects:
- Attempting to crawl non-existent (phantom) URLs
- Potential negative SEO impact due to 404 errors
- Wasted crawl budget
- Unnecessary server load
- Name of the auditor: Zsolt Tövis
- Name of the company: Stacklegend
- Competence of the auditor: Full Stack Developer, Co-Founder
- Date of the audit start: 2025-09-03
- Date of the report completion: 2025-09-05
- Date of final documentation: 2025-09-18 (delayed while waiting for the Google Search Console warning-message confirmation)
Goal: find out whether Google Crawler collects non-existent (phantom) URLs from parts of HTML documents where it should not.
- While taking the Stacklegend Blog site live, we noticed that Google Crawler was collecting and attempting to crawl URLs that appeared nowhere in the website's HTML, neither as href attributes nor in any other HTML tags, nor in the sitemap. In our case, Not Found (404) crawls accounted for 36% of all crawls (critical) within the first 24 hours.
- After a deep investigation, we found that Google Crawler was parsing URLs out of JavaScript code blocks, even when those URLs were not part of any HTML tags or attributes.
- This behavior was causing Google Crawler to attempt to crawl non-existent (phantom) URLs, leading to 404 errors and potentially negatively impacting the website's SEO performance. Furthermore, it was creating unnecessary load on the server and wasting crawl budget.
- We created a very simple `index.html` file in the `test_files` folder, which is deployed at `google-crawler-test-20250903.stacklegend.com`.
- The `index.html` file contains various HTML elements, including HTML comments and script tags, to test how Google Crawler interacts with different parts of the document (a minimal illustration follows this list).
- We connected the website to Google Search Console and submitted the `sitemap.xml` and the start URL `https://google-crawler-test-20250903.stacklegend.com/` for indexing.
- We monitored the URLs collected by Google Crawler over the crawling period to see whether any non-existent (phantom) URLs were picked up from the JavaScript code blocks or comments.
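The exact file we deployed is in `test_files/index.html`; the excerpt below is a minimal, hypothetical reconstruction of the kinds of constructs it contains, inferred from the phantom URLs Google later collected. None of the URL strings is attached to any `href` attribute or rendered link.

```html
<!DOCTYPE html>
<html lang="en">
<head><title>Google crawler phantom-URL test</title></head>
<body>
  <h1>Crawler test page</h1>
  <!-- No <a href> anywhere on the page points at the paths below. -->
  <script>
    /*
     * URL that exists only inside a JS comment block:
     * https://google-crawler-test-20250903.stacklegend.com/url-from-js-comment-block-1
     */
    // https://google-crawler-test-20250903.stacklegend.com/url-from-js-comment-block-2

    // URL stored only in a plain variable, never used for navigation:
    var urlFromJsVariableName =
      'https://google-crawler-test-20250903.stacklegend.com/url-from-js-variable-name';

    // URL stored only as an object property:
    var routes = {
      phantom: 'https://google-crawler-test-20250903.stacklegend.com/url-from-js-object-property'
    };

    console.log(urlFromJsVariableName, routes); // keep the values "used"
  </script>
</body>
</html>
```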
- Based on our server logs, the specific crawler involved is GoogleOther, which, according to Google's public documentation, is used for research and development purposes: "Crawling preferences addressed to the GoogleOther user agent don't affect any specific product. GoogleOther is the generic crawler that may be used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development. It has no effect on Google Search or other products." (https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#googleother)
- Important Note: While Google's documentation states that GoogleOther is used for research purposes and "has no effect on Google Search or other products," the reality is more complex. These phantom URLs do appear in Google Search Console as crawl errors, and Google continues to track and attempt to crawl them over time, suggesting some level of integration with Google's broader crawling infrastructure. The actual SEO impact remains unclear and may be more significant than Google's documentation suggests.
- During the crawling period, Google Crawler collected several URLs that do not exist in the deployed `index.html`.
- The collected URLs were not part of any HTML attributes, tags, or sitemap entries; they appeared only inside JavaScript code blocks.
- Examples of phantom URLs collected:
- https://google-crawler-test-20250903.stacklegend.com/url-from-js-comment-block-1
- https://google-crawler-test-20250903.stacklegend.com/url-from-js-comment-block-2
- https://google-crawler-test-20250903.stacklegend.com/url-from-js-variable-name
- https://google-crawler-test-20250903.stacklegend.com/url-from-js-object-property
- Monitoring via server logs and Google Search Console confirmed that Google attempted to crawl these URLs, with each attempt resulting in a 404 response from the server.
- This confirms that Google Crawler parses URLs out of JavaScript code blocks even when those URLs are not linked or intended to be crawled.
- Impact:
- Attempting to crawl non-existent (phantom) URLs
- Potential negative SEO impact due to 404 errors
- Wasted crawl budget
- Unnecessary server load
- Modern web applications use JavaScript to build URLs dynamically for routing, navigation, or internal logic.
- Google Crawler parses HTML pages to discover URLs. When it encounters `<script>` tags, it does not execute the JavaScript; instead, it applies pattern matching or regex-style extraction to the raw JS code.
- This approach can mistakenly identify URL-like strings inside JS comments, variables, objects, or template strings as valid URLs.
- As a result, Google Crawler collects phantom URLs that do not exist as HTML links, attributes, or sitemap entries.
- For example, in a React/Next.js application, JavaScript objects often define route segments such as `{ locale: 'en', category: 'category-1', href: '/product-1' }`. At runtime, this object might be assembled into the URL `/en/category-1/product-1`. These href strings exist only in the code and do not correspond to rendered HTML links, yet Google Crawler's parser treats them as real URLs and attempts to crawl them, causing phantom URLs and 404 errors (a sketch of this failure mode follows this list).
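Google has not published how its parsers extract URLs from script bodies, so the following is a hypothetical, minimal sketch of the kind of regex-based scan that would explain the behavior we observed; the regexes and the sample `source` string are our assumptions, not Google's actual code.

```js
// Hypothetical sketch: a naive regex scan over raw JS source text.
// A crawler doing this would surface the same phantom URLs we observed.
const source = `
  // https://example.com/url-from-js-comment-block
  const route = { locale: 'en', category: 'category-1', href: '/product-1' };
  const api = 'https://example.com/url-from-js-variable-name';
`;

// Match absolute URLs anywhere, including inside comments...
const ABSOLUTE_URL = /https?:\/\/[^\s'"<>)]+/g;
// ...and quoted root-relative paths, including object property values.
const QUOTED_PATH = /['"](\/[a-z0-9\-_/]+)['"]/gi;

const found = new Set();
for (const m of source.match(ABSOLUTE_URL) ?? []) found.add(m);
for (const m of source.matchAll(QUOTED_PATH)) found.add(m[1]);

console.log([...found]);
// -> the comment-block URL, the variable URL, and '/product-1',
//    even though none of them is a rendered HTML link.
```

Running this under Node.js prints all three URL-like strings; an extractor of this kind cannot distinguish a comment or an object property from a real link, which matches the phantom-URL pattern in our logs.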
These findings raise several open questions:
- How do false 404 errors affect a page's indexing and ranking?
- How much crawl budget is wasted on phantom URLs?
- How many websites worldwide could be affected by this bug?
- Are important pages being under-crawled because the crawler wastes resources on phantom URLs?
- Can internal link equity be diluted due to non-existent URL indexing?
- Are sitemap submissions being partially ignored or deprioritized due to phantom URL noise?
All evidence collected during the audit can be found in the `proof/` folder:
- Reproducibility: The issue can be reproduced by deploying the test application and monitoring the URLs crawled by Google (a minimal log-filtering sketch follows this list).
- Timeline Note: The audit was completed on September 5th, 2025, but final documentation was delayed until September 18th, 2025, while we waited for the Google Search Console warning-message confirmation, which arrived just before this report was finalized.
- GSC crawl reports and 404 pages:
- Server logs:
These files demonstrate the crawler behavior that triggers phantom URL collection and 404 errors.
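For reproduction, the sketch below shows the kind of log filtering we relied on. It assumes a combined-format access log at `access.log` (a placeholder path; adjust to your server) and simply pairs the GoogleOther user-agent token with 404 status codes.

```js
// Minimal sketch: list the paths that returned 404 to GoogleOther,
// given a combined-format access log. 'access.log' is a placeholder.
const fs = require('fs');

const lines = fs.readFileSync('access.log', 'utf8').split('\n');
for (const line of lines) {
  // Combined format: ... "GET /path HTTP/1.1" 404 ... "user-agent"
  const hit = line.match(/"[A-Z]+ (\S+) [^"]*" (\d{3})/);
  if (hit && hit[2] === '404' && line.includes('GoogleOther')) {
    console.log(hit[1]); // phantom path requested by GoogleOther
  }
}
```

On a log like ours, this filter isolates the phantom URLs listed above.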
The GoogleOther crawler collects and attempts to crawl phantom URLs extracted from JavaScript code blocks. Despite Google's documentation stating that GoogleOther "has no effect on Google Search or other products," the evidence shows these phantom URLs appear in Google Search Console as crawl errors and are persistently tracked by Google's systems. This suggests potential integration with Google's broader crawling infrastructure, making the actual SEO impact uncertain and potentially more significant than officially documented.