📌 Overview
This API extracts structured information from web pages and provides the following features:
✅ HTML content (only body section)
✅ Cleaned text (using the Trafilatura library)
✅ Internal and external links (only from the homepage of the website)
✅ HTTP status codes (for both target URL and homepage)
✅ Error handling (with retry mechanism for network issues)
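For illustration, below is a minimal sketch of the kind of retry loop the error handling describes, built on aiohttp. The function name, backoff policy, and exception set are assumptions, not the service's actual code:

import asyncio
import aiohttp

async def fetch_with_retries(url: str, timeout: int = 10, max_retries: int = 2) -> str:
    # Try the request up to max_retries + 1 times, backing off between attempts.
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=timeout)) as resp:
                    return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            last_error = exc
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff (assumption)
    raise last_error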
🔗 Important Links (after server launch)
- Interactive documentation (Swagger UI): http://127.0.0.1:8000/docs (FastAPI's default docs path). Ideal for testing the API directly and exploring the available endpoints.
- Alternative documentation (ReDoc): http://127.0.0.1:8000/redoc. Provides a more organized, reference-style view of the documentation.
🛠 Technologies Used
- FastAPI (Python web framework)
- aiohttp (Asynchronous HTTP client)
- BeautifulSoup4 (HTML parsing)
- Trafilatura (Text extraction from web pages)
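To show how these pieces fit together, here is a minimal, illustrative extraction pipeline. fetch_and_extract is a hypothetical helper, not the API's actual internals:

import asyncio
import aiohttp
import tldextract
import trafilatura
from bs4 import BeautifulSoup

async def fetch_and_extract(url: str, timeout: int = 10) -> dict:
    # Download the page asynchronously.
    async with aiohttp.ClientSession() as session:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=timeout)) as resp:
            html = await resp.text()
            status = resp.status
    soup = BeautifulSoup(html, "html.parser")
    body = str(soup.body) if soup.body else None   # only the <body> section
    text = trafilatura.extract(html)               # cleaned main text (may be None)
    domain = tldextract.extract(url).registered_domain
    return {"status_code": status, "body": body, "text": text, "domain": domain}

print(asyncio.run(fetch_and_extract("https://example.com")))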
🔧 Setup and Installation

Prerequisites:
- Python 3.8 or higher
- pip (Python package manager)

1. Install the required libraries:

pip install fastapi aiohttp beautifulsoup4 trafilatura tldextract uvicorn

2. Run the server:

uvicorn main:app --reload
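By default, uvicorn serves on http://127.0.0.1:8000; its standard --host and --port flags change that, for example:

uvicorn main:app --host 0.0.0.0 --port 8080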
📡 API Endpoints
1. Extract information from a URL (GET)

Path: /extract

Input parameters:
- url (required): The URL of the website to extract content from.
- timeout (optional): Request timeout in seconds (default: 10).
- max_retries (optional): Number of retry attempts in case of error (default: 2).

Request Example:

curl "http://127.0.0.1:8000/extract?url=https://example.com&timeout=15&max_retries=3"

Response Example:

{
  "original_url": "https://example.com",
  "home_url": "https://example.com/",
  "page_data": {
    "url": "https://example.com",
    "status_code": 200,
    "error": null,
    "body": "<body>...content...</body>",
    "text": "Cleaned text...",
    "success": true
  },
  "home_data": {
    "url": "https://example.com/",
    "status_code": 200,
    "error": null,
    "body": "<body>...homepage content...</body>",
    "text": "Homepage cleaned text...",
    "success": true,
    "links": {
      "internal": ["https://example.com/about"],
      "external": ["https://google.com"]
    }
  },
  "timing": {
    "processing_time_seconds": 1.5,
    "timestamp": "2023-05-20T12:00:00.000000"
  }
}
2. Extract information from a batch of URLs (POST)

Path: /extract/batch

Request body (JSON):

{
  "urls": ["https://example.com", "https://another-site.com"],
  "timeout": 20,
  "max_retries": 3
}

Response Example:

{
  "results": [
    {
      "original_url": "https://example.com",
      "page_data": { ... },
      "home_data": { ... }
    },
    {
      "original_url": "https://another-site.com",
      "page_data": { ... },
      "home_data": { ... }
    }
  ],
  "successful": 2,
  "failed": 0,
  "total_time": 4.2
}
3. Check Service Health (GET)

Path: /health

Response Example:

{
  "status": "healthy",
  "timestamp": "2023-05-20T12:00:00.000000"
}
⚙️ Potential Errors
| Status Code | Description |
|---|---|
| 400 | Invalid URL |
| 408 | Request timed out |
| 500 | Internal server error |
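On the client side, these codes can be handled explicitly. The sketch below assumes the server uses FastAPI's default error body, a JSON object with a "detail" field; that shape is an assumption:

import requests

response = requests.get(
    "http://127.0.0.1:8000/extract",
    params={"url": "not-a-valid-url"},
    timeout=30,
)
if response.status_code == 400:
    print("Invalid URL:", response.json().get("detail"))
elif response.status_code == 408:
    print("Request timed out")
elif response.status_code == 500:
    print("Internal server error")
else:
    print(response.json())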
📌 Important Notes
✅ Supports HTTPS
✅ Automatic homepage analysis, even if the input URL is problematic
✅ Logs error details for troubleshooting
✅ Asynchronous processing for better performance
📄 Python Sample Code
import requests

# Extract a single URL (passing the target via params handles query-string encoding)
response = requests.get(
    "http://127.0.0.1:8000/extract",
    params={"url": "https://example.com"},
)
print(response.json())

# Extract a batch of URLs
data = {
    "urls": ["https://example.com", "https://another-site.com"],
    "timeout": 20,
}
response = requests.post("http://127.0.0.1:8000/extract/batch", json=data)
print(response.json())