This is a Python application that demonstrates web scraping techniques using AWS Lambda and the AWS Toolkit for VSCode. It provides a flexible framework for scraping web pages, parsing data, and leveraging AWS serverless functions for scalable web scraping.
Before proceeding, ensure you have the necessary tools installed and configured. It is strongly recommended to set up a virtual environment to isolate the project dependencies; follow the instructions for your preferred tool (e.g., virtualenv or conda) to create and activate one.

- Clone the repository:

  ```
  git clone https://github.com/workshop-msano/python-webscrayping-app.git
  ```

- Navigate to the project directory:

  ```
  cd python-webscrayping-app
  ```

- Install the required dependencies using pip:

  ```
  pip install -r requirements.txt
  ```
- Create a `.env` file in the project directory.

- In the `.env` file, set your API token, for example:

  ```
  BOT_USER_OAUTH_TOKEN=<your_slack_api_token>
  ```
- Slack: Sending messages
- LINE: Building a bot
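Values in the `.env` file have to be loaded into the environment before the scraper can use them. Many projects do this with a library such as python-dotenv; the snippet below is a dependency-free sketch of the same idea (the file format assumed here is plain `KEY=VALUE` lines):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: parse KEY=VALUE lines and export them.

    Sketch only -- the project may instead use a library such as
    python-dotenv, which handles quoting and interpolation as well.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and malformed lines.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Do not overwrite variables already set in the environment.
            os.environ.setdefault(key.strip(), value.strip())

# After loading, the token is available to the scraper:
# load_env()
# token = os.environ["BOT_USER_OAUTH_TOKEN"]
```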
Open the `scraper.py` file and modify the code to define the specific web scraping rules based on your requirements.
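A scraping rule is typically a small parser that pulls specific elements out of fetched HTML. The repository's actual logic in `scraper.py` is not shown here; the following dependency-free sketch (using only the standard library's `html.parser`, though a library like BeautifulSoup is more common) illustrates the shape of such a rule:

```python
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Example rule: collect the href of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; keep only <a href="..."> values.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Feed it a fetched page (a static snippet here, for illustration):
page = '<html><body><a href="https://example.com">Example</a></body></html>'
scraper = LinkScraper()
scraper.feed(page)
print(scraper.links)  # -> ['https://example.com']
```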
Follow the AWS documentation to create an AWS Lambda function and configure the necessary permissions and triggers.
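Lambda invokes a handler function with the triggering event and a runtime context object. The handler below is a hypothetical sketch of how the scraper might be wired in -- the event keys and result shape are illustrative, not taken from the repository:

```python
import json

def lambda_handler(event, context):
    """Illustrative Lambda entry point: scrape a URL, return results.

    `event` carries the invocation payload (here, an assumed "url" key);
    `context` provides runtime metadata and is unused in this sketch.
    """
    url = event.get("url", "https://example.com")
    # ... run the scraping logic against `url` here ...
    result = {"scraped_url": url, "items": []}
    # Return an API-Gateway-style response with a JSON body.
    return {
        "statusCode": 200,
        "body": json.dumps(result),
    }
```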
Install the AWS Toolkit for VSCode and set up your AWS credentials using the AWS Command Line Interface (CLI) or VSCode's integrated AWS credentials management.
To build the application locally and test it:

```
sam build
sam local invoke
```

To deploy the application to AWS Lambda:

```
sam deploy --guided
```

This will also create a `samconfig.toml` file that contains the deployment configurations. For subsequent deployments, simply run `sam deploy` to deploy the app.
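A generated `samconfig.toml` typically looks something like the following (the stack name and region below are illustrative values, not taken from this repository):

```toml
version = 0.1
[default.deploy.parameters]
stack_name = "python-webscraping-app"
resolve_s3 = true
region = "ap-northeast-1"
capabilities = "CAPABILITY_IAM"
```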
Check the AWS CloudWatch logs and the output generated by the Lambda function to view the scraped data.
Contributions are welcome! If you have any ideas, suggestions, or bug reports, please open an issue or submit a pull request. Your input is highly appreciated.
To contribute to the project, follow these steps:
- Fork the repository.

- Create a new branch:

  ```
  git checkout -b my-feature-branch
  ```

- Make your changes and commit them:

  ```
  git commit -m "Add new feature"
  ```

- Push your changes to the forked repository:

  ```
  git push origin my-feature-branch
  ```

- Open a pull request with a detailed description of your changes.

- Wait for the project maintainers to review and merge your pull request.
It's possible to trigger a Lambda function by a specific time using EventBridge. By utilizing EventBridge rules, you can schedule the execution of your Lambda function at a predetermined time. This allows you to automate the retrieval of data without manual intervention.
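With AWS SAM, such a schedule can be declared directly on the function as an EventBridge `Schedule` event. The excerpt below is a sketch; the function name, handler, runtime, and cron expression are illustrative:

```yaml
# template.yaml (excerpt) -- names and schedule are illustrative
Resources:
  ScraperFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: scraper.lambda_handler
      Runtime: python3.9
      Events:
        DailyScrape:
          Type: Schedule
          Properties:
            Schedule: cron(0 9 * * ? *)  # every day at 09:00 UTC
```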
You can find more details in the Amazon EventBridge documentation.
This project is licensed under the MIT License. See the LICENSE file for details.
Feel free to explore, use, and enhance this web scraping application. If you have any questions or need assistance, please don't hesitate to reach out. Happy web scraping!