Skip to content

Parithikrishnan/Social-Data-Scraping-Engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

⚡ Concurrent Social Data Scraping Engine 🚀

Overview ⚙️

This high-performance, multi-threaded Python application concurrently retrieves public data from multiple social media platforms using a single target username. It is optimized for speed via parallel execution, intentionally leveraging high system resources (CPU and memory) as a trade-off for maximum performance. Advanced anti-detection measures and configurable auto-termination ensure controlled and stealthy operation.

Key Features ✨

  • 🔍 Multi-Platform Concurrency: Simultaneously collects data from YouTube, Instagram, X, Threads, Quora, and Reddit.
  • 👤 Human-Like Behavior: Randomized, time-delayed actions simulate human activity to reduce the risk of bot detection and IP bans.
  • 💾 Unified Logging: Aggregates all collected data into a structured text file (storing.txt).
  • 🛑 Auto-Termination: Automatically terminates the process once a pre-defined item count is reached.
  • 🔑 Credential Support: Supports dummy/burner accounts for authenticated scraping sessions on specific platforms.

Technical Architecture 🛡️

1. Concurrency Model (Python Threads)

  • Execution: Uses ThreadPoolExecutor or equivalent to run platform-specific scrapers in parallel.
  • Resource Trade-Off: High parallelism increases CPU and memory usage to drastically reduce scraping time.

2. Thread Safety and Synchronization

  • Shared Resources: Output file (storing.txt) and global collected item counter.
  • Synchronization: Threading locks ensure safe updates to the counter, file writes, and limit checks.

Anti-Detection Protocol 🕵️

1.Human Delay

  • Introduces random time.sleep() delays before each request , mimicking human behavior.

2. Account Management

  • Allows dummy/burner account credentials for authenticated sessions, increasing rate limits, data visibility, and IP safety.

Execution Control and Data Logging 🛑

1. Automated Self-Termination

  • Limit Variable: MAX FETCH COUNT sets the operational lifespan.
  • Termination Logic: The main thread monitors the thread-safe counter and executes sys.exit(0) once the limit is reached.

2. Process Verification ⚠️

  • Post-execution, verify no orphaned threads or helper processes remain to prevent unnecessary resource usage.

3. Output Data Format

  • Data is appended to storing.txt in a pipe-delimited format:

About

Social media data scraper using username

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages