This is a Python script that generates a sitemap.xml
file for a given website by crawling its pages. It only includes pages that belong to the specified domain and excludes external links.
- Domain-Specific Crawling: Ensures only URLs within the starting domain are included in the sitemap.
- Proper URL Encoding: Handles spaces and special characters in URLs.
- Error Handling: Skips non-HTML content and logs errors for inaccessible pages.
- User-Agent Simulation: Sends a browser-like User-Agent header so requests are less likely to be rejected as bot traffic (the sketch below illustrates these behaviors).
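The core of such a crawler is a function that fetches a page, skips non-HTML responses, and keeps only links on the starting domain. The sketch below is illustrative only: the README mentions a `get_links` function, but the exact signature, header dictionary, and error handling shown here are assumptions rather than the script's actual code.

```python
# Illustrative sketch only -- names and details are assumptions, not the script's code.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; SitemapGenerator/1.0)"}

def get_links(page_url, base_domain):
    """Return the set of same-domain links found on page_url."""
    try:
        response = requests.get(page_url, headers=HEADERS, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Skipping {page_url}: {exc}")  # log inaccessible pages and move on
        return set()

    # Skip non-HTML content such as PDFs and images
    if "text/html" not in response.headers.get("Content-Type", ""):
        return set()

    soup = BeautifulSoup(response.text, "lxml")
    links = set()
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(page_url, anchor["href"]).split("#")[0]  # drop fragments
        # Keep only URLs that belong to the starting domain
        if urlparse(absolute).netloc == base_domain:
            links.add(absolute)
    return links
```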
- Python 3.x
- Required libraries: `requests`, `beautifulsoup4`, `lxml`

Install the required libraries with pip: `pip install requests beautifulsoup4 lxml`
- Save the script to a file, e.g. `generate_sitemap.py`.
- Run the script: `python generate_sitemap.py`
- Enter the starting URL when prompted.
- The script will crawl the website and generate a `sitemap.xml` file in the same directory (a rough sketch of this flow follows).
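For orientation, here is a rough sketch of the overall flow those steps describe: prompt for a starting URL, breadth-first crawl using a link extractor like the `get_links` sketch above, and collect every discovered in-domain URL. Function names and structure are assumptions; the real script may organize this differently.

```python
# Rough sketch of the overall flow; assumes the get_links helper sketched above.
from collections import deque
from urllib.parse import urlparse

def crawl(start_url):
    """Breadth-first crawl starting at start_url, returning all in-domain URLs found."""
    base_domain = urlparse(start_url).netloc
    visited = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url, base_domain):  # from the sketch above
            if link not in visited:
                queue.append(link)
    return visited

if __name__ == "__main__":
    start = input("Enter the starting URL: ").strip()
    print(f"Discovered {len(crawl(start))} URLs")
```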
If the starting URL is `http://example.com`, the script will crawl all pages under example.com and save a `sitemap.xml` file that looks like this (a sketch of the writing step follows the example):
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
  </url>
  <url>
    <loc>http://example.com/page1</loc>
  </url>
</urlset>
```
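The sitemap itself is plain XML, so the writing step can be as simple as the sketch below. The `write_sitemap` name and the use of `urllib.parse.quote` for the advertised URL encoding are assumptions about how this might be done, not necessarily the script's implementation.

```python
# Hypothetical writer; quote() percent-encodes spaces and special characters
# while keeping the characters that give the URL its structure.
from urllib.parse import quote

def write_sitemap(urls, filename="sitemap.xml"):
    with open(filename, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in sorted(urls):
            encoded = quote(url, safe=":/?&=#%")
            f.write(f"  <url>\n    <loc>{encoded}</loc>\n  </url>\n")
        f.write("</urlset>\n")
```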
- Exclusions:
  - The script excludes external links.
  - Pages with non-standard URL patterns can be ignored by adding additional filtering logic (see the sketch below).
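As one example of such filtering logic, a small predicate could drop URLs with query strings or non-page file extensions before they are added to the sitemap. The function name and skip list below are hypothetical and should be adapted to your site.

```python
# Hypothetical filter -- adjust the rules to match your site's URL patterns.
from urllib.parse import urlparse

SKIP_EXTENSIONS = (".pdf", ".jpg", ".png", ".zip")

def should_include(url):
    """Return True for URLs that should appear in the sitemap."""
    parsed = urlparse(url)
    if parsed.query:  # ignore URLs carrying query strings
        return False
    if parsed.path.lower().endswith(SKIP_EXTENSIONS):
        return False
    return True
```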
- Customizing the Script:
  - You can update the `headers` in the `get_links` function to mimic a specific browser (see the example header set below).
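For instance, a more browser-like header set might look like the following; the dictionary name and exact values are illustrative, and any realistic User-Agent string will do.

```python
# Illustrative header set; the dictionary used inside get_links may be named differently.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}
```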
- Limitations:
  - The script does not handle JavaScript-generated links.
  - Very large websites may take a long time to crawl.
If you'd like to contribute or suggest improvements, feel free to open an issue or submit a pull request.
This project is licensed under the MIT License.