Task Assigned by Black Coffer
This project extracts the textual content of articles from a set of given URLs and performs text analysis to compute the required variables. The code is written in a Jupyter Notebook, chosen for its flexibility over a plain script.
Before running the code, make sure to install the required packages by running the following command:
pip3 install -r requirements.txt
The project consists of the following steps:
- Input Data: Read the input data from input.xlsx using pandas.
- Web Scraping: Use the requests module to fetch the raw HTML from the URLs specified in the input file, then parse it with the BeautifulSoup module (see the scraping sketch after this list).
- Find the title of the article using the <h1> tag and obtain the content using the <div> tag together with class-name selectors.
- Since different articles may have different class selectors, try-except blocks are used to handle any exceptions.
- Text Analysis: Perform text analysis using the NLTK library (see the analysis sketch after this list).
- Use the Punkt tokenizer to divide the text into sentences using the sent_tokenize function.
- Remove stop words (common words like 'the', 'and', 'I', etc.) using the NLTK stopwords module.
- Tokenize the text into words using the word_tokenize function.
- Estimate the number of syllables per word with a simple heuristic.
- Compute Variables: Loop over all the URLs and compute the required variables from the tokenized text.
- Store Results: Collect the computed variables in lists and assemble them into a dictionary keyed by the specified column names.
- Export Results: Export the results to Output Data Structure.xlsx (a sketch of this step follows below).
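As a rough illustration of the input and scraping steps, the sketch below reads the URLs with pandas and fetches each page with requests and BeautifulSoup. The input column name URL and the candidate class names for the article body are assumptions for illustration; the notebook may use different selectors.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.read_excel("input.xlsx")  # expects a column of article URLs

def scrape_article(url):
    """Return (title, text) for one article URL, or (None, None) on failure."""
    try:
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None, None  # page not found or unreachable

    soup = BeautifulSoup(response.text, "html.parser")
    h1 = soup.find("h1")
    title = h1.get_text(strip=True) if h1 else None

    # Different articles use different class selectors for the body,
    # so fall back through a list of candidates (hypothetical names).
    for class_name in ("td-post-content", "tdb-block-inner"):
        div = soup.find("div", class_=class_name)
        if div is not None:
            return title, div.get_text(separator=" ", strip=True)
    return title, None
```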
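The text-analysis step could look roughly like the following sketch. The Punkt tokenizer and the stopword list require a one-time data download, and the syllable estimator shown here is a simple vowel-group counter, which is an assumption about how the notebook estimates syllables.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time data downloads for the tokenizer and stopword list
# (newer NLTK releases may also require the "punkt_tab" package).
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))

def count_syllables(word):
    """Rough syllable estimate: count vowel groups, discount a silent 'e'."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    count = len(groups)
    if word.lower().endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def analyse(text):
    """Return a few illustrative metrics for one article's text."""
    sentences = sent_tokenize(text)
    words = [w for w in word_tokenize(text) if w.isalpha()]
    cleaned = [w for w in words if w.lower() not in STOP_WORDS]
    syllables = sum(count_syllables(w) for w in cleaned)
    return {
        "word_count": len(cleaned),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "syllables_per_word": syllables / max(len(cleaned), 1),
    }
```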
The output file Output Data Structure.xlsx will contain the computed variables for each article.
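Putting the pieces together, a minimal sketch of the compute-store-export loop, reusing the helpers from the sketches above, might look like this; the column names are placeholders for those specified in the output data structure.

```python
import pandas as pd

# Column names here are placeholders for the ones specified in the
# assignment's output data structure; "URL" is the assumed input column.
results = {"URL": [], "WORD COUNT": [], "SYLLABLE PER WORD": []}

for url in df["URL"]:                      # df from the scraping sketch
    title, text = scrape_article(url)
    if text is None:
        continue                           # skip pages that could not be scraped
    metrics = analyse(text)
    results["URL"].append(url)
    results["WORD COUNT"].append(metrics["word_count"])
    results["SYLLABLE PER WORD"].append(metrics["syllables_per_word"])

pd.DataFrame(results).to_excel("Output Data Structure.xlsx", index=False)
```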
Note: During the analysis, a few web pages were not found and could not be scraped. Please refer to the Jupyter Notebook for the detailed implementation.