Pull request bh #120

Open
wants to merge 47 commits into
base: main
47 commits
3e75fad
init
yanglianglu Oct 16, 2023
e0caab7
init
yanglianglu Oct 16, 2023
2b52992
Add an Elasticsearch docker-compose file and util function
yanglianglu Oct 17, 2023
51c084c
Add comment to readme
yanglianglu Oct 17, 2023
cfba8a4
update utils
yanglianglu Oct 17, 2023
7adced0
add project proposal
yanglianglu Oct 22, 2023
e935f64
Merge pull request #1 from yanglianglu/create_database_utils
yanglianglu Oct 22, 2023
1006a48
rebase
yanglianglu Oct 23, 2023
4ade8d2
Update README.md
yanglianglu Oct 23, 2023
012251c
Update README.md
yanglianglu Oct 23, 2023
2ec3380
Update README.md
yanglianglu Oct 23, 2023
2eb718d
add a skeleton search bar and search page
dxmtb Oct 23, 2023
d5f6fcb
Merge branch 'main' of github.com:yanglianglu/Auto_Dash into main
dxmtb Oct 23, 2023
df3900f
Crawler
weikunwu Oct 23, 2023
7d729fc
Add urls inside document
weikunwu Oct 24, 2023
aa3f121
Merge pull request #2 from yanglianglu/crawler
yanglianglu Oct 29, 2023
993b551
Merge pull request #1 from dxmtb/main
yanglianglu Oct 29, 2023
84a1567
update text preprocessing
persme1111 Oct 30, 2023
3cf652f
update
persme1111 Oct 30, 2023
e73d851
Description of the changes you made
persme1111 Oct 30, 2023
f469744
more modern search bar and search results
dxmtb Nov 2, 2023
3472ca9
fix button in results page
dxmtb Nov 2, 2023
c5fb265
fix button in results page
dxmtb Nov 2, 2023
45ddc60
add placeholder for text summary and sentiment
dxmtb Nov 2, 2023
e3b0152
Optimize speed and add method to crawl larger data
weikunwu Nov 4, 2023
8dd3f6c
Add insertion to elastic search
weikunwu Nov 12, 2023
9c66676
Merge pull request #5 from yanglianglu/crawler
yanglianglu Nov 13, 2023
661382e
Merge branch 'main' of https://github.com/dxmtb/Auto_Dash into dxmtb-…
yanglianglu Nov 13, 2023
3835259
Merge pull request #4 from dxmtb/main
yanglianglu Nov 14, 2023
46cf0b1
add report
yanglianglu Nov 17, 2023
c5dc707
Merge pull request #6 from yanglianglu/progress_report
yanglianglu Nov 17, 2023
a2bf0c4
add summarization model
yanglianglu Nov 26, 2023
a7de10d
Add sentiment classification model
weikunwu Dec 2, 2023
5b0a291
update and call scrape documents
dxmtb Dec 3, 2023
e03d8ae
Merge pull request #7 from yanglianglu/summarization
dxmtb Dec 3, 2023
5ba69ba
Merge pull request #3 from yanglianglu/pull_request_bh
dxmtb Dec 3, 2023
8cec10b
Merge pull request #8 from yanglianglu/sentiment
dxmtb Dec 3, 2023
dc94818
Merge branch 'main' of github.com:yanglianglu/Auto_Dash into main
dxmtb Dec 3, 2023
5d3344f
add topic cloud
dxmtb Dec 3, 2023
40ddd0d
Merge pull request #9 from dxmtb/main
dxmtb Dec 3, 2023
d89e81e
Updated sentiment model to save model
weikunwu Dec 4, 2023
941db3d
Fix merge conflict with main
weikunwu Dec 4, 2023
7101375
Merge pull request #10 from yanglianglu/sentiment
dxmtb Dec 4, 2023
e95b5c7
update text preprocessing and topic model
persme1111 Dec 5, 2023
3912ca2
Merge pull request #11 from yanglianglu/pull_request_bh
persme1111 Dec 6, 2023
180f411
implement topic model
persme1111 Dec 9, 2023
24d375f
Merge branch 'main' into pull_request_bh
yanglianglu Dec 11, 2023
Add sentiment classification model
weikunwu committed Dec 2, 2023
commit a7de10de422e6282f5d648335325b0ace78ae97c
2 changes: 2 additions & 0 deletions requirements.txt
@@ -2,5 +2,7 @@ beautifulsoup4==4.12.2
bs4==0.0.1
elasticsearch==8.10.1
html5lib==1.1
pandas==2.1.3
scikit-learn==1.3.2
selenium==4.14.0
webdriver-manager==4.0.1
1 change: 1 addition & 0 deletions src/models/sentiment-training-data.csv

Large diffs are not rendered by default.

84 changes: 84 additions & 0 deletions src/models/sentiment_model.py
@@ -0,0 +1,84 @@
import sys
import os

# Get the directory that contains this file and add its parent
# to sys.path so that the utils package can be imported.
current = os.path.dirname(os.path.realpath(__file__))
parent = os.path.dirname(current)
sys.path.append(parent)

import utils.database_utils as db

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV


class LogisticRegressionModel:
    def __init__(self):

        # The CSV appears to have no header row; name the columns explicitly so
        # the first sample is kept as data, and resolve the path relative to this file.
        df = pd.read_csv(
            os.path.join(current, "sentiment-training-data.csv"),
            names=["Sentiment", "Sentence"],
            encoding="latin-1",
        )

        train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
        X_train = train_df["Sentence"]
        X_test = test_df["Sentence"]
        y_train = train_df["Sentiment"]
        y_test = test_df["Sentiment"]

        pipeline = Pipeline(
            [
                ("tfidf_vect", TfidfVectorizer(stop_words="english")),
                ("lr_clf", LogisticRegression(solver="liblinear")),
            ]
        )

        params = {
            "tfidf_vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
            "tfidf_vect__max_df": [0.5, 0.75, 1.0],
            "lr_clf__C": [1, 5, 10],
        }

        grid_cv_pipe = GridSearchCV(
            pipeline, param_grid=params, cv=3, scoring="accuracy", verbose=1
        )
        grid_cv_pipe.fit(X_train, y_train)
        print("Optimized Hyperparameters: ", grid_cv_pipe.best_params_)

        self.model = grid_cv_pipe

        # # Accuracy
        # pred = grid_cv_pipe.predict(X_test)
        # print("Optimized Accuracy Score: {0: .3f}".format(accuracy_score(y_test, pred)))

    def predict(self, x):
        return self.model.predict(x)

# Example Usage
if __name__ == "__main__":

    # Get documents from elasticsearch
    client = db.create_client()
    res = db.search_documents(client, "documents", {"match_all": {}})
    docs = res["hits"]["hits"]

    # Map documents into headlines
    titles = [doc["_source"]["title"] for doc in docs]

    # Format test data
    x = pd.DataFrame(titles, columns=["Text"])

    # Perform prediction
    model = LogisticRegressionModel()
    print(x["Text"])
    print(model.predict(x["Text"]))
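
The accuracy check that is commented out in `__init__` can be tried in isolation. The following is a minimal, self-contained sketch and not the PR's code: the headlines and labels are made up for illustration, the parameter grid is reduced to keep the run fast, and `fit_and_evaluate` is a hypothetical helper name. It fits the same TF-IDF plus logistic regression pipeline with `GridSearchCV`, then scores the refitted best estimator on a held-out split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Made-up headlines standing in for sentiment-training-data.csv (assumption).
SENTENCES = [
    "Profit rose sharply in the third quarter",
    "Sales increased and margins improved",
    "The company reported record earnings growth",
    "Operating profit grew compared with last year",
    "Revenue rose and the outlook improved",
    "Orders increased strongly this quarter",
    "Profit fell sharply in the third quarter",
    "Sales decreased and margins declined",
    "The company reported a heavy loss",
    "Operating profit dropped compared with last year",
    "Revenue fell and the outlook worsened",
    "Orders decreased strongly this quarter",
]
LABELS = ["positive"] * 6 + ["negative"] * 6


def fit_and_evaluate(sentences, labels):
    """Grid-search a TF-IDF + logistic regression pipeline, then score it on a held-out split."""
    # Stratify so both classes appear in the training and test splits.
    X_train, X_test, y_train, y_test = train_test_split(
        sentences, labels, test_size=0.25, random_state=123, stratify=labels
    )

    pipeline = Pipeline(
        [
            ("tfidf_vect", TfidfVectorizer(stop_words="english")),
            ("lr_clf", LogisticRegression(solver="liblinear")),
        ]
    )

    # Reduced grid (assumption) so the sketch runs quickly on toy data.
    params = {
        "tfidf_vect__ngram_range": [(1, 1), (1, 2)],
        "lr_clf__C": [1, 10],
    }

    grid = GridSearchCV(pipeline, param_grid=params, cv=3, scoring="accuracy")
    grid.fit(X_train, y_train)

    # With the default refit=True, predict() uses the best estimator
    # refitted on the whole training split.
    pred = grid.predict(X_test).tolist()
    return pred, accuracy_score(y_test, pred)


if __name__ == "__main__":
    predictions, accuracy = fit_and_evaluate(SENTENCES, LABELS)
    print("Held-out predictions:", predictions)
    print("Held-out accuracy: {0:.3f}".format(accuracy))
```

Scoring on the held-out split checks the model on data the grid search never saw. On a toy set this small the number itself carries little meaning; the point is the shape of the evaluation step.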