This is a Python script that generates a sitemap.xml
file for a given website by crawling its pages. It only includes pages that belong to the specified domain and excludes external links.
- Domain-Specific Crawling: Ensures only URLs within the starting domain are included in the sitemap.
- Proper URL Encoding: Handles spaces and special characters in URLs.
- Error Handling: Skips non-HTML content and logs errors for inaccessible pages.
- User-Agent Simulation: Sends a browser-like User-Agent header so requests are less likely to be rejected as bot traffic (the sketch below illustrates these behaviors).
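The core of such a crawler is a function that fetches a page, skips non-HTML responses, and keeps only links on the starting domain. The sketch below is illustrative only: the README mentions a `get_links` function, but the exact signature, header dictionary, and error handling shown here are assumptions rather than the script's actual code.

```python
# Illustrative sketch only -- names and details are assumptions, not the script's code.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; SitemapGenerator/1.0)"}

def get_links(page_url, base_domain):
    """Return the set of same-domain links found on page_url."""
    try:
        response = requests.get(page_url, headers=HEADERS, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Skipping {page_url}: {exc}")  # log inaccessible pages and move on
        return set()

    # Skip non-HTML content such as PDFs and images
    if "text/html" not in response.headers.get("Content-Type", ""):
        return set()

    soup = BeautifulSoup(response.text, "lxml")
    links = set()
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(page_url, anchor["href"]).split("#")[0]  # drop fragments
        # Keep only URLs that belong to the starting domain
        if urlparse(absolute).netloc == base_domain:
            links.add(absolute)
    return links
```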
- Python 3.x
- Required libraries: `requests`, `beautifulsoup4`, `lxml`

Install the required libraries with pip: `pip install requests beautifulsoup4 lxml`
- Save the script to a file, e.g. `generate_sitemap.py`.
- Run the script: `python generate_sitemap.py`
- Enter the starting URL when prompted.
- The script will crawl the website and generate a `sitemap.xml` file in the same directory (a rough sketch of this flow follows).
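For orientation, here is a rough sketch of the overall flow those steps describe: prompt for a starting URL, breadth-first crawl using a link extractor like the `get_links` sketch above, and collect every discovered in-domain URL. Function names and structure are assumptions; the real script may organize this differently.

```python
# Rough sketch of the overall flow; assumes the get_links helper sketched above.
from collections import deque
from urllib.parse import urlparse

def crawl(start_url):
    """Breadth-first crawl starting at start_url, returning all in-domain URLs found."""
    base_domain = urlparse(start_url).netloc
    visited = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url, base_domain):  # from the sketch above
            if link not in visited:
                queue.append(link)
    return visited

if __name__ == "__main__":
    start = input("Enter the starting URL: ").strip()
    print(f"Discovered {len(crawl(start))} URLs")
```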
If the starting URL is `http://example.com`, the script will crawl all pages under example.com and save a `sitemap.xml` file that looks like this (a sketch of the writing step follows the example):
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
  </url>
  <url>
    <loc>http://example.com/page1</loc>
  </url>
</urlset>
```
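The sitemap itself is plain XML, so the writing step can be as simple as the sketch below. The `write_sitemap` name and the use of `urllib.parse.quote` for the advertised URL encoding are assumptions about how this might be done, not necessarily the script's implementation.

```python
# Hypothetical writer; quote() percent-encodes spaces and special characters
# while keeping the characters that give the URL its structure.
from urllib.parse import quote

def write_sitemap(urls, filename="sitemap.xml"):
    with open(filename, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in sorted(urls):
            encoded = quote(url, safe=":/?&=#%")
            f.write(f"  <url>\n    <loc>{encoded}</loc>\n  </url>\n")
        f.write("</urlset>\n")
```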
- Exclusions:
  - The script excludes external links.
  - Pages with non-standard URL patterns can be ignored by adding additional filtering logic (see the sketch below).
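As one example of such filtering logic, a small predicate could drop URLs with query strings or non-page file extensions before they are added to the sitemap. The function name and skip list below are hypothetical and should be adapted to your site.

```python
# Hypothetical filter -- adjust the rules to match your site's URL patterns.
from urllib.parse import urlparse

SKIP_EXTENSIONS = (".pdf", ".jpg", ".png", ".zip")

def should_include(url):
    """Return True for URLs that should appear in the sitemap."""
    parsed = urlparse(url)
    if parsed.query:  # ignore URLs carrying query strings
        return False
    if parsed.path.lower().endswith(SKIP_EXTENSIONS):
        return False
    return True
```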
- Customizing the Script:
  - You can update the `headers` in the `get_links` function to mimic a specific browser (see the example header set below).
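For instance, a more browser-like header set might look like the following; the dictionary name and exact values are illustrative, and any realistic User-Agent string will do.

```python
# Illustrative header set; the dictionary used inside get_links may be named differently.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}
```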
- Limitations:
  - The script does not handle JavaScript-generated links.
  - Very large websites may take a long time to crawl.
If you'd like to contribute or suggest improvements, feel free to open an issue or submit a pull request.
This project is licensed under the MIT License.