diff --git a/book/en/_toc.yml b/book/en/_toc.yml index 843a66e..19ed899 100644 --- a/book/en/_toc.yml +++ b/book/en/_toc.yml @@ -42,6 +42,9 @@ parts: - file: labs/nlp4ss-lab-1 - file: labs/nlp4ss-lab-2 - file: labs/nlp4ss-lab-3 + - caption: Projects + chapters: + - file: projects/nlp4ss-project-1 - caption: About chapters: - file: syllabus/index diff --git a/book/en/projects/nlp4ss-project-1.ipynb b/book/en/projects/nlp4ss-project-1.ipynb new file mode 100644 index 0000000..57ddd50 --- /dev/null +++ b/book/en/projects/nlp4ss-project-1.ipynb @@ -0,0 +1,540 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# NLP Analysis of Sierra Club Press Releases\n", + "\n", + "This notebook demonstrates an advanced Natural Language Processing (NLP) analysis of Sierra Club press releases. We'll cover various techniques including text preprocessing, word frequency analysis, named entity recognition, sentiment analysis, topic modeling, and more.\n", + "\n", + "## Installation\n", + "\n", + "Before we begin, let's install the necessary packages for this project. Run the following cell to install the required libraries:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install nlp4ss\n", + "!python -m spacy download en_core_web_sm" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup and Data Loading\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from hyfi import HyFI\n", + "\n", + "if HyFI.is_colab():\n", + "    HyFI.mount_google_drive()\n", + "    project_root = \"/content/drive/MyDrive/courses/nlp4ss\"\n", + "else:\n", + "    project_root = \"$HOME/workspace/courses/nlp4ss\"\n", + "\n", + "h = HyFI.initialize(\n", + "    project_name=\"nlp4ss\",\n", + "    project_root=project_root,\n", + "    logging_level=\"INFO\",\n", + "    verbose=True,\n", + ")\n", + "\n", + "print(\"project_dir:\", h.project.root_dir)\n", + "print(\"project_workspace_dir:\", h.project.workspace_dir)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01J4JDJHSB1MMAH2CZD26PAAS2", + "metadata": {}, + "outputs": [], + "source": [ + "# Import necessary libraries\n", + "import re\n", + "\n", + "import matplotlib.pyplot as plt\n", + "import nltk\n", + "import numpy as np\n", + "import pandas as pd\n", + "import pyLDAvis.gensim_models\n", + "import seaborn as sns\n", + "import spacy\n", + "from gensim import corpora\n", + "from gensim.models import CoherenceModel, LdaModel\n", + "from nltk import pos_tag\n", + "from nltk.corpus import stopwords\n", + "from nltk.stem import WordNetLemmatizer\n", + "from nltk.tokenize import word_tokenize\n", + "from sklearn.feature_extraction.text import TfidfVectorizer\n", + "from textblob import TextBlob\n", + "from wordcloud import WordCloud\n", + "\n", + "# Download necessary NLTK data\n", + "nltk.download(\"punkt\")\n", + "nltk.download(\"averaged_perceptron_tagger\")\n", + "nltk.download(\"wordnet\")\n", + "nltk.download(\"stopwords\")\n", + "\n", + "# Load spaCy model\n", + "nlp = spacy.load(\"en_core_web_sm\")\n", + "\n", + "# Load the data\n", + "raw_data_file = h.project.workspace_dir / \"data/raw/articles.jsonl\"\n", + "rdata = h.load_dataset(\"json\", data_files=raw_data_file.as_posix())\n", + "df = rdata[\"train\"].to_pandas()\n", + "\n", + "print(\"Data loaded. 
Shape:\", df.shape)\n", + "print(\"\\nSample of the data:\")\n", + "print(df.head())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Text Preprocessing\n", + "\n", + "We'll now preprocess the text data. This includes lowercasing, removing special characters, tokenization, removing stopwords, and lemmatization. We'll also filter for only nouns and adjectives, which are typically the most informative for topic modeling.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01J4JDJHSCW7146QJX8YF5SAXQ", + "metadata": {}, + "outputs": [], + "source": [ + "# Custom stopwords list\n", + "custom_stopwords = set(stopwords.words(\"english\"))\n", + "custom_stopwords.update(\n", + " [\"sierra\", \"club\", \"press\", \"release\"]\n", + ") # Add domain-specific stopwords\n", + "\n", + "\n", + "def preprocess_text(text):\n", + " # Lowercase and remove special characters\n", + " text = re.sub(r\"[^a-zA-Z\\s]\", \"\", text.lower())\n", + "\n", + " # Tokenize\n", + " tokens = word_tokenize(text)\n", + "\n", + " # Remove stopwords and keep only nouns and adjectives\n", + " lemmatizer = WordNetLemmatizer()\n", + " tokens = [\n", + " lemmatizer.lemmatize(token)\n", + " for token, pos in pos_tag(tokens)\n", + " if pos.startswith(\"NN\")\n", + " or pos.startswith(\"JJ\")\n", + " and token not in custom_stopwords\n", + " ]\n", + "\n", + " return tokens\n", + "\n", + "\n", + "# Apply preprocessing to all documents\n", + "df[\"processed_text\"] = df[\"content\"].apply(preprocess_text)\n", + "df[\"processed_string\"] = df[\"processed_text\"].apply(\" \".join)\n", + "\n", + "print(\"Preprocessing completed.\")\n", + "print(\"\\nSample of processed text:\")\n", + "print(df[\"processed_text\"].head())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Word Frequency Analysis\n", + "\n", + "Let's analyze the most frequent words in our corpus. 
This can give us a quick overview of the main themes in the press releases.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01J4JDJHSD99PBPTKRXBNGQHJA", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_word_frequency(text, title, n=20):\n", + " word_freq = pd.Series(\" \".join(text).split()).value_counts()[:n]\n", + " plt.figure(figsize=(12, 6))\n", + " word_freq.plot(kind=\"bar\")\n", + " plt.title(title)\n", + " plt.xlabel(\"Words\")\n", + " plt.ylabel(\"Frequency\")\n", + " plt.xticks(rotation=45, ha=\"right\")\n", + " plt.tight_layout()\n", + " plt.show()\n", + "\n", + "\n", + "plot_word_frequency(\n", + " df[\"processed_string\"], \"Top 20 Most Frequent Words in Sierra Club Press Releases\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Named Entity Recognition\n", + "\n", + "Named Entity Recognition (NER) can help us identify key entities (like people, organizations, or locations) mentioned in the press releases.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01J4JDJHSDJVHN4ZZV3FG1S4TK", + "metadata": {}, + "outputs": [], + "source": [ + "def extract_entities(text):\n", + " doc = nlp(text)\n", + " return [(ent.text, ent.label_) for ent in doc.ents]\n", + "\n", + "\n", + "# Extract entities from a sample of articles\n", + "sample_size = min(100, len(df))\n", + "sample_entities = (\n", + " df[\"content\"].sample(n=sample_size, random_state=42).apply(extract_entities)\n", + ")\n", + "entity_df = pd.DataFrame(\n", + " [entity for entities in sample_entities for entity in entities],\n", + " columns=[\"Entity\", \"Label\"],\n", + ")\n", + "\n", + "print(\"Top 10 most common named entities:\")\n", + "print(entity_df[\"Entity\"].value_counts().head(10))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Sentiment Analysis\n", + "\n", + "Sentiment analysis can help us understand the overall tone of the press releases.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01J4JDJHSD1HGHBVRRFKDJSSVH", + "metadata": {}, + "outputs": [], + "source": [ + "def get_sentiment(text):\n", + " return TextBlob(text).sentiment.polarity\n", + "\n", + "\n", + "df[\"sentiment\"] = df[\"content\"].apply(get_sentiment)\n", + "\n", + "plt.figure(figsize=(10, 6))\n", + "plt.hist(df[\"sentiment\"], bins=20)\n", + "plt.title(\"Sentiment Distribution of Sierra Club Press Releases\")\n", + "plt.xlabel(\"Sentiment Score\")\n", + "plt.ylabel(\"Frequency\")\n", + "plt.show()\n", + "\n", + "print(f\"Average sentiment: {df['sentiment'].mean():.2f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## TF-IDF Analysis\n", + "\n", + "TF-IDF (Term Frequency-Inverse Document Frequency) can help us identify important terms in each document.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01J4JDJHSD20NDZTX1FS8HV10Q", + "metadata": {}, + "outputs": [], + "source": [ + "tfidf_vectorizer = TfidfVectorizer(max_features=1000)\n", + "tfidf_matrix = tfidf_vectorizer.fit_transform(df[\"processed_string\"])\n", + "\n", + "feature_names = tfidf_vectorizer.get_feature_names_out()\n", + "first_doc_vector = tfidf_matrix[0]\n", + "top_terms = sorted(\n", + " zip(feature_names, first_doc_vector.toarray()[0]), key=lambda x: x[1], reverse=True\n", + ")[:10]\n", + "\n", + "print(\"Top 10 terms in the first document (TF-IDF):\")\n", + "for term, score in top_terms:\n", + " print(f\"{term}: {score:.4f}\")" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "## Topic Modeling\n", + "\n", + "We'll use Latent Dirichlet Allocation (LDA) for topic modeling. First, we'll determine the optimal number of topics.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01J4JDJHSD4G9AR45F4WGKGBVT", + "metadata": {}, + "outputs": [], + "source": [ + "id2word = corpora.Dictionary(df[\"processed_text\"])\n", + "corpus = [id2word.doc2bow(text) for text in df[\"processed_text\"]]\n", + "\n", + "\n", + "def compute_coherence_values(dictionary, corpus, texts, start, limit, step):\n", + " coherence_values = []\n", + " model_list = []\n", + " for num_topics in range(start, limit, step):\n", + " model = LdaModel(\n", + " corpus=corpus,\n", + " id2word=dictionary,\n", + " num_topics=num_topics,\n", + " random_state=100,\n", + " update_every=1,\n", + " chunksize=100,\n", + " passes=10,\n", + " alpha=\"auto\",\n", + " per_word_topics=True,\n", + " )\n", + " model_list.append(model)\n", + " coherencemodel = CoherenceModel(\n", + " model=model, texts=texts, dictionary=dictionary, coherence=\"c_v\"\n", + " )\n", + " coherence_values.append(coherencemodel.get_coherence())\n", + " return model_list, coherence_values\n", + "\n", + "\n", + "model_list, coherence_values = compute_coherence_values(\n", + " dictionary=id2word,\n", + " corpus=corpus,\n", + " texts=df[\"processed_text\"],\n", + " start=2,\n", + " limit=40,\n", + " step=2,\n", + ")\n", + "\n", + "# Plot coherence scores\n", + "plt.plot(range(5, 40, 5), coherence_values)\n", + "plt.xlabel(\"Number of Topics\")\n", + "plt.ylabel(\"Coherence Score\")\n", + "plt.title(\"Topic Coherence Scores\")\n", + "plt.show()\n", + "\n", + "# Find the optimal number of topics\n", + "optimal_num_topics = coherence_values.index(max(coherence_values)) * 5 + 5\n", + "print(f\"Optimal number of topics: {optimal_num_topics}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have the optimal number of topics, let's train our LDA model and examine the results.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01J4JDJHSDRMH4Z51G57MT3BR7", + "metadata": {}, + "outputs": [], + "source": [ + "# Train LDA model with optimal number of topics\n", + "lda_model = LdaModel(\n", + " corpus=corpus,\n", + " id2word=id2word,\n", + " num_topics=optimal_num_topics,\n", + " random_state=100,\n", + " update_every=1,\n", + " chunksize=100,\n", + " passes=10,\n", + " alpha=\"auto\",\n", + " per_word_topics=True,\n", + ")\n", + "\n", + "# Print topics\n", + "print(\"Top words for each topic:\")\n", + "topics = lda_model.print_topics()\n", + "for idx, topic in topics:\n", + " print(f\"Topic {idx}: {topic}\")\n", + "\n", + "# Visualize topics\n", + "vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)\n", + "pyLDAvis.save_html(vis, \"lda_visualization.html\")\n", + "print(\"\\nLDA visualization saved as 'lda_visualization.html'\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Topic-Sentiment Analysis\n", + "\n", + "Let's analyze the sentiment associated with each topic.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01J4JDJHSE0YBM0FD5CDTW3HSC", + "metadata": {}, + "outputs": [], + "source": [ + "def get_dominant_topic(ldamodel, corpus):\n", + " dominant_topics = []\n", + " for bow in corpus:\n", + " topic_probs = ldamodel.get_document_topics(bow, minimum_probability=0)\n", + " dominant_topic = max(topic_probs, key=lambda x: x[1])[0]\n", + " 
dominant_topics.append(dominant_topic)\n", + " return dominant_topics\n", + "\n", + "\n", + "df[\"dominant_topic\"] = get_dominant_topic(lda_model, corpus)\n", + "\n", + "topic_sentiments = df.groupby(\"dominant_topic\")[\"sentiment\"].mean()\n", + "\n", + "plt.figure(figsize=(12, 6))\n", + "topic_sentiments.plot(kind=\"bar\")\n", + "plt.title(\"Average Sentiment by Topic\")\n", + "plt.xlabel(\"Topic\")\n", + "plt.ylabel(\"Average Sentiment Score\")\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Word Clouds\n", + "\n", + "Let's create word clouds for each topic to visually represent the most significant words.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01J4JDJHSEGXN26D6T0M0BEEP3", + "metadata": {}, + "outputs": [], + "source": [ + "def generate_wordcloud(text, title):\n", + " wordcloud = WordCloud(width=800, height=400, background_color=\"white\").generate(\n", + " text\n", + " )\n", + " plt.figure(figsize=(10, 5))\n", + " plt.imshow(wordcloud, interpolation=\"bilinear\")\n", + " plt.axis(\"off\")\n", + " plt.title(title)\n", + " plt.tight_layout()\n", + " plt.show()\n", + "\n", + "\n", + "for topic_id in range(optimal_num_topics):\n", + " topic_words = dict(lda_model.show_topic(topic_id, topn=30))\n", + " generate_wordcloud(\" \".join(topic_words.keys()), f\"Word Cloud for Topic {topic_id}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Saving Results\n", + "\n", + "Let's save our analysis results for future reference or further analysis.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01J4JDJHSEX3C7VJXNGW4C1X40", + "metadata": {}, + "outputs": [], + "source": [ + "output_file = h.project.workspace_dir / \"data/processed/sierra_club_analysis.csv\"\n", + "df[[\"content\", \"processed_string\", \"sentiment\", \"dominant_topic\"]].to_csv(\n", + " output_file, index=False\n", + ")\n", + "print(f\"Analysis results saved to {output_file}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this notebook, we've conducted a comprehensive NLP analysis of Sierra Club press releases. Here's a summary of what we've accomplished:\n", + "\n", + "1. **Text Preprocessing**: We cleaned and normalized the text data, focusing on nouns and adjectives which are most informative for our analysis.\n", + "\n", + "2. **Word Frequency Analysis**: We identified the most common words in the press releases, giving us an initial insight into the main themes.\n", + "\n", + "3. **Named Entity Recognition**: We extracted key entities mentioned in the press releases, which can be useful for understanding who and what the Sierra Club frequently discusses.\n", + "\n", + "4. **Sentiment Analysis**: We analyzed the overall sentiment of the press releases, which can indicate the general tone of Sierra Club's communications.\n", + "\n", + "5. **TF-IDF Analysis**: We identified important terms in individual documents, which can highlight specific focus areas in different press releases.\n", + "\n", + "6. **Topic Modeling**: Using LDA, we discovered the main topics discussed in the press releases. We also determined the optimal number of topics for our dataset.\n", + "\n", + "7. **Topic-Sentiment Analysis**: We examined the average sentiment associated with each topic, which can reveal how different subjects are framed.\n", + "\n", + "8. 
**Word Clouds**: We created visual representations of the most significant words in each topic.\n", + "\n", + "This analysis provides valuable insights into the content and framing of Sierra Club's press releases. It can be used to understand their communication strategies, main areas of focus, and how they discuss different environmental issues.\n", + "\n", + "For further analysis, you might consider:\n", + "\n", + "- Examining how topics and sentiment change over time\n", + "- Comparing these results with press releases from other environmental organizations\n", + "- Diving deeper into specific topics or entities of interest\n", + "\n", + "Remember that while these computational methods provide powerful insights, they should be combined with close reading and domain expertise for the most comprehensive understanding of the text data.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}
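
As a starting point for the first follow-up suggested in the notebook's conclusion (how topics and sentiment change over time), here is a minimal sketch. It assumes the scraped articles carry a parseable date field — the column name `published_date` is hypothetical and should be replaced with whatever the raw data actually provides — and it reuses the `df`, `sentiment`, and `dominant_topic` columns built earlier in the notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical date column; swap "published_date" for the real field name.
df["year"] = pd.to_datetime(df["published_date"], errors="coerce").dt.year

# Average sentiment per year
yearly_sentiment = df.groupby("year")["sentiment"].mean()
yearly_sentiment.plot(marker="o", figsize=(10, 4), title="Average Sentiment by Year")
plt.ylabel("Mean TextBlob polarity")
plt.tight_layout()
plt.show()

# Share of each dominant topic per year, normalized within each year
topic_shares = (
    df.groupby("year")["dominant_topic"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
topic_shares.plot(kind="area", stacked=True, figsize=(10, 5), title="Topic Shares by Year")
plt.ylabel("Share of press releases")
plt.tight_layout()
plt.show()
```

Normalizing the topic counts within each year keeps the comparison meaningful even if the volume of press releases varies from year to year.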