This README file provides step-by-step instructions on how to set up, configure, and run the Keyword Finder script. This script processes HTML content, searches for keywords, and extracts relevant sentences and additional context.
Ensure you have the following installed on your system:
- Python (Version 3.8 or higher)
- pip (Python's package manager)
- Git (for cloning the repository)
To get started, clone the repository to your local machine:
git clone https://github.com/Madaocv/KristiyanY.git
cd KristiyanY
It is recommended to use a virtual environment to avoid conflicts with other Python packages.
On macOS/Linux:
python3 -m venv venv
source venv/bin/activate
On Windows:
python -m venv venv
venv\Scripts\activate
Install all necessary Python packages using pip:
pip install -r requirements.txt
The script accepts the following arguments:
- --input_path: The path to the input CSV file containing URLs.
- --input_keywords: A list of keywords to search for, passed as a stringified Python list.
- --output_file: The path where the results file will be saved. Ensure this includes a valid file name and extension (e.g., .xlsx).
python main.py --input_path="/Downloads/Template 2 - Sheet6.csv" --input_keywords='["art portfolio" , "website ideas" , "idea for a website" , "website design" , "mobile-friendly design" , "restaurant website" , "website for a restaurant" , "online store" , "website builder"]' --output_file="data/Initial_test_05_12_v6.xlsx"
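The --input_keywords value in the command above is a stringified Python list, so it has to be converted back into a real list before use. The following is a minimal sketch of how the three documented arguments might be parsed. It is not taken from main.py: the helper names parse_keyword_list and build_parser are hypothetical, and the use of argparse with ast.literal_eval is an assumption about one reasonable approach.

```python
import argparse
import ast


def parse_keyword_list(raw):
    """Convert a stringified Python list such as '["a", "b"]' into a list of str."""
    try:
        # literal_eval only evaluates literals, so it never runs arbitrary code.
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError) as exc:
        raise argparse.ArgumentTypeError(f"not a valid Python list: {exc}")
    if not isinstance(value, list) or not all(isinstance(k, str) for k in value):
        raise argparse.ArgumentTypeError("expected a list of strings")
    return [k.strip() for k in value]


def build_parser():
    # Hypothetical parser mirroring the three arguments documented above.
    parser = argparse.ArgumentParser(description="Keyword Finder (illustrative parser)")
    parser.add_argument("--input_path", required=True,
                        help="CSV file containing URLs")
    parser.add_argument("--input_keywords", required=True, type=parse_keyword_list,
                        help="stringified Python list of keywords")
    parser.add_argument("--output_file", required=True,
                        help="path of the results file, including extension")
    return parser


# Example invocation with placeholder values.
args = build_parser().parse_args([
    "--input_path", "urls.csv",
    "--input_keywords", '["online store" , "website builder"]',
    "--output_file", "data/results.xlsx",
])
print(args.input_keywords)
```

Wrapping the keyword list in single quotes on the command line, as in the example above, keeps the shell from interpreting the inner double quotes.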
- Keyword Detection:
  - Extracts sentences containing specified keywords.
  - Identifies whether a keyword is wrapped in <a> tags and adds this information to the output.
- Contextual Sentence Extraction:
  - Extracts one sentence before and after each match (when available).
- Output File:
  - Results are saved in an Excel file with columns for the keyword, matching sentences, context, and additional metadata.
- Automatic Directory Creation:
  - If the specified output directory does not exist, it will be created automatically.
- Usage tips:
  - Run the script with different combinations of output file name and keyword set to produce separate result files.
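The keyword detection and context extraction described above could look roughly like the sketch below. It is an illustration only and does not reproduce the script's actual code: it uses the standard library's html.parser and a naive regex sentence splitter, and the names _TextAndLinks and find_keyword_sentences are hypothetical. As a simplification, the link check is page-wide (does the keyword appear in any <a> text), not per sentence, and text inside script or style tags is not filtered out.

```python
import re
from html.parser import HTMLParser


class _TextAndLinks(HTMLParser):
    """Collects all text, and separately the text found inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.link_parts = []
        self._a_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth > 0:
            self._a_depth -= 1

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._a_depth:
            self.link_parts.append(data)


def find_keyword_sentences(html, keywords):
    parser = _TextAndLinks()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so sentence splitting is not confused by line breaks.
    text = " ".join("".join(parser.text_parts).split())
    link_text = " ".join(parser.link_parts).lower()
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    rows = []
    for i, sentence in enumerate(sentences):
        for kw in keywords:
            if kw.lower() in sentence.lower():
                rows.append({
                    "keyword": kw,
                    "in_link": kw.lower() in link_text,  # page-wide check
                    "sentence": sentence,
                    "previous": sentences[i - 1] if i > 0 else None,
                    "next": sentences[i + 1] if i + 1 < len(sentences) else None,
                })
    return rows


sample = '<p>We build sites. Try our <a href="/b">website builder</a> today. It is free.</p>'
rows = find_keyword_sentences(sample, ["website builder"])
print(rows[0]["previous"], "|", rows[0]["sentence"], "|", rows[0]["next"])
```

A match in the first or last sentence of a page has no neighbor on one side, which is why the sketch returns None there, matching the "when available" caveat above.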
The output Excel file contains the following columns:
- 1 Keyword inside URL: True/False.
- 1.1 Keyword in URL: keyword name.
- Response Status Code: The HTTP status code for each URL.
- Keyword in text: Keywords found in the page content.
- Link inside sentence: Boolean indicating if the keyword was wrapped in an <a> tag.
- Sentence: Sentences containing the keywords.
- Sentence -1: The sentence preceding the match.
- Sentence +1: The sentence following the match.
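To make the column layout concrete, here is a small sketch that assembles one result row as a dictionary keyed by the column names listed above. The function build_row and the helper _normalize are hypothetical, and the URL check (treating hyphens and underscores as spaces, case-insensitive) is an assumption rather than the script's documented behavior.

```python
COLUMNS = [
    "1 Keyword inside URL",
    "1.1 Keyword in URL",
    "Response Status Code",
    "Keyword in text",
    "Link inside sentence",
    "Sentence",
    "Sentence -1",
    "Sentence +1",
]


def _normalize(value):
    # Assumption: URL slugs use '-' or '_' where keywords use spaces.
    return value.lower().replace("-", " ").replace("_", " ")


def build_row(url, status_code, keywords, keyword, in_link,
              sentence, previous=None, following=None):
    """Assemble one output row; keys follow the documented column order."""
    url_hits = [k for k in keywords if _normalize(k) in _normalize(url)]
    return {
        "1 Keyword inside URL": bool(url_hits),
        "1.1 Keyword in URL": ", ".join(url_hits),
        "Response Status Code": status_code,
        "Keyword in text": keyword,
        "Link inside sentence": in_link,
        "Sentence": sentence,
        "Sentence -1": previous,
        "Sentence +1": following,
    }


row = build_row(
    url="https://example.com/blog/online-store-tips",  # placeholder URL
    status_code=200,
    keywords=["online store", "website builder"],
    keyword="website builder",
    in_link=True,
    sentence="Try our website builder today.",
    previous="We build sites.",
    following="It is free.",
)
print(row["1 Keyword inside URL"], row["1.1 Keyword in URL"])
```

A list of such dictionaries maps directly onto a table with one column per key, which is the shape a spreadsheet writer expects.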
- File Not Found:
  - Ensure the input file path is correct and accessible.
  - Use absolute paths if running the script from a different directory.
- Output File Errors:
  - Make sure the --output_file argument includes a valid file name (e.g., results.xlsx).
  - Ensure you have write permissions for the specified directory.
- Dependency Issues:
  - Run pip install -r requirements.txt to ensure all dependencies are installed.
- If using relative paths, ensure the script is run from the directory containing the main.py file.
- Use Python 3.8 or higher for compatibility.
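Several of the problems above come down to the output path. The sketch below shows one way such a path could be checked and its parent directory created before any results are written. The function prepare_output_path is hypothetical and only illustrates the checks; it is not the script's own code.

```python
import tempfile
from pathlib import Path


def prepare_output_path(output_file):
    """Validate the output path and create its parent directory if missing."""
    path = Path(output_file)
    if not path.suffix:
        # Catches values like "data/results" that lack an extension such as .xlsx.
        raise ValueError(f"--output_file needs a file name with an extension: {output_file}")
    # parents=True creates intermediate directories; exist_ok=True tolerates reruns.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = prepare_output_path(Path(tmp) / "data" / "results.xlsx")
    created = target.parent.is_dir()
print(created)
```

If the directory cannot be created because of missing write permissions, mkdir raises PermissionError, which points directly at the permissions item in the list above.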
Contributions are welcome! Feel free to submit issues or pull requests on the repository.
This project is licensed under the MIT License.
python main.py \
--input_path="Template2-Sheet72.csv" \
--output_file="data/5_2025_exclude_true.xlsx" \
--input_keywords='["art portfolio" , "website ideas" , "idea for a website" , "website design" , "mobile-friendly design" , "restaurant website" , "website for a restaurant" , "online store" , "website builder"]' \
--title_keywords='["art portfolio" , "website ideas" , "idea for a website" , "website design" , "mobile-friendly design" , "restaurant website" , "website for a restaurant" , "online store" , "website builder"]' \
--exclude_h_and_true=True