This project is a web application built with Streamlit for performing semantic analysis on website content. It allows users to load website sitemaps, analyze content, and visualize relationships between pages based on semantic similarity. The app includes features for topic modeling, cosine similarity calculations, and network visualizations.
Latent Dirichlet Allocation (LDA) is a statistical method for identifying abstract topics in a collection of documents. It assumes that each document is a mixture of a few topics, and each topic is a mixture of words. LDA helps identify key topics across a website's content.
- Content Themes: Discover the main topics discussed on the website and check if they align with user intent.
- SEO Improvement: Identify content gaps and optimize topics for better search engine rankings.
- Content Organization: Group similar content into silos for better SEO and user navigation.
Cosine Similarity measures how similar two documents are by calculating the cosine of the angle between two vectors in multi-dimensional space. These vectors are created using TF-IDF (Term Frequency-Inverse Document Frequency) scores. A similarity score of 1 means identical content, while 0 means no similarity.
- Duplicate Content Detection: Identify similar pages to avoid search engine penalties for duplicate content.
- Content Gap Analysis: Discover gaps between related topics and create new content accordingly.
- Internal Linking: Link pages with high similarity to enhance SEO and improve user navigation.
- Topic Optimization: Align your content with important topics to boost user engagement and SEO.
- Content Strategy Refinement: Identify underrepresented topics and combine high-similarity pages to prevent content cannibalization.
- SEO and User Flow Improvement: Use semantic relationships to build better internal linking and enhance the site's structure.
- Semantic Network: Interactive graph displaying relationships between pages based on their semantic similarity.
- Topics Discovered: Displays the topics identified across website content.
The app provides several downloadable files:
- Cosine Similarity Matrix: A CSV file containing similarity scores between pages.
- Semantic Network Edges: A CSV file representing relationships between pages.
- Topic Modeling Results: A CSV file showing topics assigned to each page.
- Nodes with Depth: A CSV of nodes and their depth in the semantic network.
- Semantic Network Visualization: An HTML file of the interactive graph.
- Load and process sitemap data from a website.
- Perform LDA topic modeling to discover key topics.
- Calculate cosine similarity between website pages.
- Build an interactive semantic network using Pyvis.
- Download analysis results, including cosine similarity matrices, topic models, and semantic network edges.
To run this project, you'll need the following libraries:
- Streamlit: Web app framework.
- Nest Asyncio: Allows running asyncio event loops in notebooks.
- NLTK: For text preprocessing, including stopwords and lemmatization.
- BeautifulSoup: For HTML parsing.
- Gensim: For LDA topic modeling.
- Scikit-learn: For cosine similarity calculations.
- NetworkX: For graph construction.
- Pyvis: For graph visualization.
- Matplotlib & Seaborn: For data visualization.
- Pandas: For data manipulation.
-
Clone the repository:
git clone https://github.com/MHA535/semantic-analysis-app.git cd semantic-analysis-app -
Install the dependencies:
pip install -r requirements.txt
-
Run the app:
streamlit run app.py
- Sitemap URL: Provide a sitemap URL (e.g.,
https://example.com/sitemap.xml). - Cosine Similarity Threshold: Set the threshold to filter page similarity connections (default: 0.5).
- Number of Topics: Specify the number of topics for LDA topic modeling (default: 10). For websites, this should match the number of articles on the site.
After setting the inputs, click the Run Analysis button to begin.
- Sitemap Loading: Load website URLs from the sitemap and extract headings.
- Text Preprocessing: Clean and process content by removing stopwords, URLs, and performing lemmatization.
- Topic Modeling: Use LDA to identify topics across the website.
- Cosine Similarity: Measure similarity between pages based on their TF-IDF vectors.
- Semantic Network: Create a graph where nodes represent pages, and edges represent high cosine similarity.
This project is licensed under the MIT License.