Scrape Tor/Onion Links
itsOwen committed Oct 27, 2024
1 parent e3c19a6 commit e19fad5
Showing 8 changed files with 387 additions and 25 deletions.
34 changes: 32 additions & 2 deletions Dockerfile
@@ -4,14 +4,25 @@ FROM python:3.10-slim-bullseye
# Set the working directory in the container
WORKDIR /app

# Install system dependencies including Git and Tor
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    git \
    tor \
    tor-geoipdb \
    # curl and nc are used by run.sh and are not part of the slim base image
    curl \
    netcat-openbsd \
    # Additional dependencies that might be needed
    build-essential \
    python3-dev \
    libffi-dev \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Configure Tor
RUN echo "SocksPort 9050" >> /etc/tor/torrc && \
echo "ControlPort 9051" >> /etc/tor/torrc && \
echo "CookieAuthentication 1" >> /etc/tor/torrc

# Cyberscraper repo :)
RUN git clone https://github.com/itsOwen/CyberScraper-2077.git .

@@ -22,22 +33,41 @@ ENV PATH="/app/venv/bin:$PATH"
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Install additional Tor-related Python packages
RUN pip install --no-cache-dir \
PySocks \
requests[socks]

# Install playwright and its browser
RUN pip install playwright requests
RUN playwright install chromium
RUN playwright install-deps

# Expose ports for Streamlit and Tor
EXPOSE 8501
EXPOSE 9050
EXPOSE 9051

# Create a shell script to run the application
RUN echo '#!/bin/bash\n\
# Start Tor service\n\
service tor start\n\
\n\
# Wait for Tor to be ready\n\
echo "Waiting for Tor to be ready..."\n\
timeout 60 bash -c "until nc -z localhost 9050; do sleep 1; done"\n\
\n\
if [ ! -z "$OPENAI_API_KEY" ]; then\n\
export OPENAI_API_KEY=$OPENAI_API_KEY\n\
fi\n\
if [ ! -z "$GOOGLE_API_KEY" ]; then\n\
export GOOGLE_API_KEY=$GOOGLE_API_KEY\n\
fi\n\
\n\
# Check Tor connection\n\
echo "Verifying Tor connection..."\n\
curl --socks5-hostname localhost:9050 -s https://check.torproject.org/api/ip\n\
\n\
streamlit run main.py\n\
' > /app/run.sh
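
The startup script above waits for Tor by polling port 9050 with `nc -z` under a 60-second timeout. As a rough illustration of the same readiness check in Python, here is a minimal sketch using only the standard library. The function name `wait_for_port` and its parameters are assumptions made for this example and are not part of the commit.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example call (9050 is the SocksPort written to /etc/tor/torrc above):
# wait_for_port("127.0.0.1", 9050, timeout=60.0)
```

An open port only shows that the SOCKS listener is up. It does not show that Tor has finished bootstrapping, which is why the script follows up with a request to check.torproject.org.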

84 changes: 84 additions & 0 deletions README.md
@@ -30,6 +30,7 @@ Whether you're a corpo data analyst, a street-smart netrunner, or just someone l
- 🤖 **AI-Powered Extraction**: Utilizes cutting-edge AI models to understand and parse web content intelligently.
- 🖥️ **Sleek Streamlit Interface**: User-friendly GUI that even a chrome-armed street samurai could navigate.
- 🔄 **Multi-Format Support**: Export your data in JSON, CSV, HTML, SQL or Excel – whatever fits your cyberdeck.
- 🧅 **Tor Network Support**: Safely scrape .onion sites through the Tor network with automatic routing and security features.
- 🕵️ **Stealth Mode**: Implemented stealth mode parameters that help avoid detection as a bot.
- 🦙 **Ollama Support**: Use a huge library of open source LLMs.
- **Async Operations**: Lightning-fast scraping that would make a Trauma Team jealous.
@@ -242,6 +243,89 @@ As this feature is in beta, we highly value your feedback. If you encounter any

Your input is crucial in helping us refine and stabilize this feature for future releases.

## 🧅 Tor Network Scraping

> **Note**: The Tor network scraping feature allows you to access and scrape .onion sites. This feature requires additional setup and should be used responsibly and legally.

CyberScraper 2077 now supports scraping .onion sites through the Tor network, allowing you to access and extract data from the dark web safely and anonymously. This feature is perfect for researchers, security analysts, and investigators who need to gather information from Tor hidden services.

### Prerequisites

1. Install Tor on your system:
```bash
# Ubuntu/Debian
sudo apt install tor
# macOS (using Homebrew)
brew install tor
# Start the Tor service
sudo service tor start # on Linux
brew services start tor # on macOS
```

2. Install additional Python packages:
```bash
pip install PySocks "requests[socks]"
```

### Using Tor Scraping

1. **Basic Usage**:
   Simply enter an .onion URL, and CyberScraper will automatically detect and route it through the Tor network:
   ```
   http://example123abc.onion
   ```

2. **Safety Features**:
   - Automatic .onion URL detection
   - Built-in connection verification
   - Tor Browser-like request headers
   - Automatic circuit isolation
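
To make the automatic detection and routing concrete, here is a minimal sketch of how an .onion check can be built on `urllib.parse`. The `proxies_for` helper is hypothetical and only illustrates choosing a `socks5h://` proxy for onion hosts. It assumes Tor's SOCKS port is 9050.

```python
from urllib.parse import urlparse


def is_onion_url(url: str) -> bool:
    """Return True when the URL's hostname ends in .onion."""
    hostname = urlparse(url).hostname  # None when the URL has no scheme
    return bool(hostname) and hostname.endswith(".onion")


def proxies_for(url: str, socks_port: int = 9050) -> dict:
    """Hypothetical helper: pick requests-style proxies based on the URL."""
    if is_onion_url(url):
        # socks5h makes the proxy resolve the hostname, which .onion names require.
        proxy = f"socks5h://127.0.0.1:{socks_port}"
        return {"http": proxy, "https": proxy}
    return {}
```

Note that a bare `example123abc.onion` without `http://` has no parsed hostname, so a check like this one treats it as a regular (non-onion) input.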

### Configuration Options

You can customize the Tor scraping behavior by adjusting the following settings:

```python
tor_config = TorConfig(
    socks_port=9050,          # Default Tor SOCKS port
    circuit_timeout=10,       # Timeout for circuit creation
    auto_renew_circuit=True,  # Automatically renew Tor circuit
    verify_connection=True    # Verify Tor connection before scraping
)
```

### Security Considerations

- Always ensure you're complying with local laws and regulations
- Use a VPN in addition to Tor for extra security
- Be patient as Tor connections can be slower than regular web scraping
- Avoid sending personal or identifying information through Tor
- Some .onion sites may be offline or unreachable

### Docker Support

For Docker users, add these additional flags to enable Tor support:
```bash
# Note: Docker discards published ports (-p) when --network="host" is used
docker run -p 8501:8501 \
  --network="host" \
  -e OPENAI_API_KEY="your-api-key" \
  cyberscraper-2077
```

### Troubleshooting

If you encounter issues with Tor scraping:
- Verify Tor service is running (`sudo service tor status`)
- Check SOCKS port availability (`netstat -an | grep 9050`)
- Ensure proper Tor installation (`tor --version`)
- Verify internet connectivity
- Check firewall settings

### Example Usage

<p align="center">https://i.postimg.cc/Jz0w4Kry/Screenshot-2024-10-27-at-1-25-32-AM.png</p>

## Setup Google Sheets Authentication:

1. Go to the Google Cloud Console (https://console.cloud.google.com/).
4 changes: 3 additions & 1 deletion requirements.txt
@@ -22,4 +22,6 @@ google_auth_oauthlib
google-auth-httplib2
google-api-python-client
google-generativeai
langchain-google-genai
PySocks>=1.7.1
requests[socks]>=2.28.1
23 changes: 23 additions & 0 deletions src/scrapers/tor/exceptions.py
@@ -0,0 +1,23 @@
class TorException(Exception):
    """Base exception for Tor-related errors"""
    pass

class TorConnectionError(TorException):
    """Raised when there's an error connecting to the Tor network"""
    pass

class TorInitializationError(TorException):
    """Raised when Tor service fails to initialize"""
    pass

class TorCircuitError(TorException):
    """Raised when there's an error creating or managing Tor circuits"""
    pass

class OnionServiceError(TorException):
    """Raised when there's an error accessing an onion service"""
    pass

class TorProxyError(TorException):
    """Raised when there's an error with the Tor SOCKS proxy"""
    pass
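
Because every class above derives from `TorException`, callers can handle a specific failure first and fall back to the base class for everything else. The sketch below redefines two of the classes so it runs on its own. `run_safely` and `fail` are hypothetical helpers for illustration, not part of the commit.

```python
class TorException(Exception):
    """Base exception for Tor-related errors"""


class TorConnectionError(TorException):
    """Raised when there's an error connecting to the Tor network"""


class OnionServiceError(TorException):
    """Raised when there's an error accessing an onion service"""


def fail(exc: Exception):
    """Hypothetical helper that simply raises the given exception."""
    raise exc


def run_safely(action) -> str:
    """Call action() and translate Tor errors into short messages."""
    try:
        return action()
    except OnionServiceError as e:
        # Most specific handler first.
        return f"onion service problem: {e}"
    except TorException as e:
        # Catches TorConnectionError and any other subclass.
        return f"tor problem: {e}"
```

Ordering matters here: if the `TorException` handler came first, the `OnionServiceError` branch would never run.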
23 changes: 23 additions & 0 deletions src/scrapers/tor/tor_config.py
@@ -0,0 +1,23 @@
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TorConfig:
    """Configuration for Tor connection and scraping"""
    socks_port: int = 9050
    control_port: int = 9051
    debug: bool = False
    max_retries: int = 3
    timeout: int = 30
    circuit_timeout: int = 10
    auto_renew_circuit: bool = True
    verify_connection: bool = True
    user_agents: Optional[List[str]] = None

    def __post_init__(self):
        # Fill in the default list per instance so configs never share one list
        if self.user_agents is None:
            self.user_agents = [
                'Mozilla/5.0 (Windows NT 10.0; rv:102.0) Gecko/20100101 Firefox/102.0',
                'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0',
                'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0',
            ]
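
The `None` default plus `__post_init__` above exists because dataclasses reject a mutable list as a direct default. An alternative, shown here only as a sketch, is `field(default_factory=...)`. The class name `TorConfigSketch` and the constant `DEFAULT_USER_AGENTS` are made up for this example, and only a few of the original fields are reproduced.

```python
from dataclasses import dataclass, field
from typing import List

# Same user agent strings as in the commit's TorConfig defaults.
DEFAULT_USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; rv:102.0) Gecko/20100101 Firefox/102.0',
    'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0',
    'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0',
]


@dataclass
class TorConfigSketch:
    """Illustrative subset of TorConfig using default_factory."""
    socks_port: int = 9050
    timeout: int = 30
    # The factory runs once per instance, so each config gets its own copy.
    user_agents: List[str] = field(default_factory=lambda: list(DEFAULT_USER_AGENTS))
```

Both approaches give each instance its own list. The factory version keeps the field's type as a plain `List[str]` and needs no `__post_init__`.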
113 changes: 113 additions & 0 deletions src/scrapers/tor/tor_manager.py
@@ -0,0 +1,113 @@
import requests
import random
import logging
import socket
import socks
from typing import Dict, Optional
from urllib.parse import urlparse
from .tor_config import TorConfig
from .exceptions import (
    TorException,
    TorConnectionError,
    TorInitializationError,
    OnionServiceError,
    TorProxyError
)

class TorManager:
    """Manages Tor connection and session handling"""

    def __init__(self, config: Optional[TorConfig] = None):
        # Build the default per instance; a TorConfig() default argument
        # would be created once and shared by every TorManager.
        config = config if config is not None else TorConfig()
        self.logger = logging.getLogger(__name__)
        self.logger.setLevel(logging.DEBUG if config.debug else logging.INFO)
        self.config = config
        self._setup_logging()
        self._setup_proxy()

    def _setup_logging(self):
        # Attach the handler only once so repeated instantiation
        # does not duplicate log lines.
        if self.logger.handlers:
            return
        handler = logging.StreamHandler()
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        self.logger.addHandler(handler)

    def _setup_proxy(self):
        """Configure SOCKS proxy for Tor"""
        try:
            # Note: this replaces socket.socket process-wide, so every new
            # socket in the process is routed through the SOCKS proxy.
            socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", self.config.socks_port)
            socket.socket = socks.socksocket
            self.proxies = {
                'http': f'socks5h://127.0.0.1:{self.config.socks_port}',
                'https': f'socks5h://127.0.0.1:{self.config.socks_port}'
            }
        except Exception as e:
            raise TorProxyError(f"Failed to setup Tor proxy: {str(e)}")

    def get_headers(self) -> Dict[str, str]:
        """Get randomized Tor Browser-like headers"""
        return {
            'User-Agent': random.choice(self.config.user_agents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'DNT': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1'
        }

    async def verify_tor_connection(self) -> bool:
        """Verify Tor connection is working"""
        try:
            session = self.get_tor_session()
            response = session.get('https://check.torproject.org/api/ip',
                                   timeout=self.config.timeout)
            is_tor = response.json().get('IsTor', False)
        except Exception as e:
            raise TorConnectionError(f"Failed to verify Tor connection: {str(e)}")

        # Checked outside the try block so this error is not re-wrapped above
        if is_tor:
            self.logger.info("Successfully connected to Tor network")
            return True
        raise TorConnectionError("Connection is not using Tor network")

    def get_tor_session(self) -> requests.Session:
        """Create a requests session that routes through Tor"""
        session = requests.Session()
        session.proxies = self.proxies
        session.headers = self.get_headers()
        return session

    @staticmethod
    def is_onion_url(url: str) -> bool:
        """Check if the given URL is an onion service"""
        try:
            parsed = urlparse(url)
            return parsed.hostname.endswith('.onion') if parsed.hostname else False
        except Exception:
            return False

    async def fetch_content(self, url: str) -> str:
        """Fetch content from an onion site"""
        if not self.is_onion_url(url):
            raise OnionServiceError("URL is not a valid onion service")

        try:
            session = self.get_tor_session()

            if self.config.verify_connection:
                await self.verify_tor_connection()

            response = session.get(url, timeout=self.config.timeout)
            response.raise_for_status()

            self.logger.info(f"Successfully fetched content from {url}")
            return response.text

        except TorException:
            # Already a Tor-specific error (e.g. failed verification); propagate unchanged
            raise
        except requests.RequestException as e:
            raise OnionServiceError(f"Failed to fetch onion content: {str(e)}")
        except Exception as e:
            raise TorException(f"Unexpected error fetching onion content: {str(e)}")
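
`fetch_content` and `verify_tor_connection` are declared `async` but call the blocking `requests` API directly, so the event loop is stalled for the duration of each request. One possible way to avoid that, sketched here with a stand-in function instead of a real Tor request, is `asyncio.to_thread` (Python 3.9+). `blocking_fetch`, `fetch_without_blocking`, and `fetch_many` are hypothetical names used only for this illustration.

```python
import asyncio
import time


def blocking_fetch(url: str) -> str:
    """Stand-in for a blocking call such as a requests session GET."""
    time.sleep(0.2)  # simulate network latency
    return f"content of {url}"


async def fetch_without_blocking(url: str) -> str:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so the event loop can keep servicing other tasks meanwhile.
    return await asyncio.to_thread(blocking_fetch, url)


async def fetch_many(urls):
    """Fetch several URLs concurrently; results keep the input order."""
    return await asyncio.gather(*(fetch_without_blocking(u) for u in urls))
```

In the manager itself, the same pattern would wrap the `session.get(...)` call. Whether that is worthwhile depends on how the rest of the application schedules its coroutines.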