Progetto_gestione

This project was developed for the Gestione dell'informazione (Information Management) exam of the Computer Science degree at the University of Modena by Filippo Reggiani and Janath Uthayakumar.

Libraries

beautifulsoup4 4.12.2
scrapy 2.9.0
whoosh 2.7.4
tqdm 4.65.0
transformers 4.29.1
textblob 0.17.1
vadersentiment 3.3.2
tensorflow 2.12.0
matplotlib 3.7.1

Setup

Clone this project into a folder using the git link
Create a virtual environment with the tool you prefer (we tested this project with Pipenv)
Install the dependencies from the Pipfile
Run Start.py (it will take about 10 minutes)
When everything is ready, it will ask you to enter a query. You do not need to initialize anything else: just run Start.py to make queries

How to query

After following the steps described in the Setup section:

Run Start.py
Enter the query to search, for example amazing visual effects
The application will prompt you to choose the search engine, either BM25F or TF_IDF (the default is TF_IDF)
The application will prompt you to choose the sentiment analyzer: Distil-roberta, Textblob or Vader
Depending on the analyzer chosen, you need to specify the sentiment type, for example positive, negative or neutral
The results will be displayed

Output example

The following output is an example obtained by entering these inputs:

amazing visual effects
TF_IDF
Vader
Negative

Ranking_finale: 1
Categoria: Cd e Vynil, Nome del prodotto : Pink Floyd - Pulse VHS
Autore recensione: Erico Macedo
Testo recensione: After waiting years to see Pulse on DVD, I have to say I am disappointed with the DVD. The concert is amazing, but as already posted in other reviews, it was shot in video instead of film. The result is clearly visible and the image quality is poor - too bad, as the visual effects would look amazing on a large screen. Anyway, it is still a must-have for Pink Floyd fans. You might be a little disappointed with the image quality, but you will definitely not regret your purchase.
Ranking pertinenza: 2
Ranking sentimento: 9
Sentiment analyzer: Vader

Ranking_finale = the final ranking, obtained by combining the Ranking pertinenza and Ranking sentimento scores
Categoria = the category name of the product
Autore recensione = the name of the author of the review
Testo recensione = the body of the review
Ranking pertinenza = the ranking of the review based on how pertinent its content is to the query
Ranking sentimento = the ranking based on the chosen sentiment analyzer and sentiment type
Sentiment analyzer = the sentiment analyzer used for the analysis
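The README does not say how the two rankings are combined into Ranking_finale. As an illustration only, the sketch below shows one common way to merge two ranked lists: summing each item's position in both lists and sorting by that sum. The function name `combine_rankings`, the tie-breaking rule and the sum-of-positions scheme are all assumptions, not the method used by Start.py, and the sketch is not meant to reproduce the numbers in the example above.

```python
from typing import Hashable, List, Sequence


def combine_rankings(
    relevance_order: Sequence[Hashable],
    sentiment_order: Sequence[Hashable],
) -> List[Hashable]:
    """Merge two ranked lists by summing 1-based positions (hypothetical scheme).

    Items missing from either list are dropped. Ties on the summed
    position are broken in favour of the better relevance position.
    """
    rel_pos = {doc: i for i, doc in enumerate(relevance_order, start=1)}
    sent_pos = {doc: i for i, doc in enumerate(sentiment_order, start=1)}
    common = [doc for doc in relevance_order if doc in sent_pos]
    return sorted(common, key=lambda d: (rel_pos[d] + sent_pos[d], rel_pos[d]))


if __name__ == "__main__":
    by_relevance = ["a", "b", "c"]
    by_sentiment = ["c", "a", "b"]
    # Summed positions: a = 1 + 2, b = 2 + 3, c = 3 + 1
    print(combine_rankings(by_relevance, by_sentiment))  # ['a', 'c', 'b']
```

A position sum treats both rankings as equally important; a weighted sum would be a natural variation if one signal should dominate.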

output.txt is an example of the output of Start.py.
benchmark.txt is an example of the output of Benchmark.py.

Benchmark

For the benchmark, we selected 100 reviews at random using Benchmark/Benchmark_create.py and chose 10 queries that mimic what a user might ask.

The 10 queries are defined in Benchmark/Benchmark.py:
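Random selection of a fixed number of reviews can be sketched with the standard library as follows. This is a minimal illustration, not the content of Benchmark/Benchmark_create.py: the function name `sample_reviews`, the `seed` parameter and the in-memory list of reviews are assumptions made for the example.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_reviews(
    reviews: Sequence[T], k: int = 100, seed: Optional[int] = None
) -> List[T]:
    """Return k distinct reviews chosen uniformly at random (hypothetical helper).

    If fewer than k reviews are available, all of them are returned
    in random order. Passing a seed makes the selection reproducible.
    """
    rng = random.Random(seed)
    pool = list(reviews)
    return rng.sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    # Placeholder data standing in for the indexed reviews.
    all_reviews = [f"review_{i}" for i in range(1000)]
    benchmark_set = sample_reviews(all_reviews, k=100, seed=42)
    print(len(benchmark_set))  # 100
```

Fixing the seed is useful here because the manual relevance scores only stay valid if the same 100 reviews are selected on every run.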

"r1": "love AND stories",
"r2": "DVD OR VHS",
"r3": "bad AND packing",
"r4": "worth buying",
"r5": "background music",
"r6": "amazing visual effects",
"r7": "cast",
"r8": "faithful to the book",
"r9": "Oscar OR awards ",
"r10": "highest rated",

We then manually scored each review against each query, assigning a relevance score in the range [0, 3].

We used the DCG formula:

$$ DCG_{p} = rel_{1} + \sum_{i=2}^{p} \frac{rel_{i}}{\log_{2} i} $$
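The DCG formula can be computed directly from a list of relevance scores given in ranked order. The sketch below is a minimal illustration of that formula; the function name `dcg` and the optional cut-off argument `p` are assumptions, not names taken from Benchmark.py.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float], p: Optional[int] = None) -> float:
    """DCG_p = rel_1 + sum over i = 2..p of rel_i / log2(i).

    `relevances` holds the manual scores (0 to 3) of the results,
    in the order the search engine returned them.
    """
    rels = list(relevances) if p is None else list(relevances[:p])
    if not rels:
        return 0.0
    total = float(rels[0])  # the first result is not discounted
    for i, rel in enumerate(rels[1:], start=2):
        total += rel / math.log2(i)
    return total


if __name__ == "__main__":
    # 3 + 2/log2(2) + 0/log2(3) + 1/log2(4) = 3 + 2 + 0 + 0.5
    print(dcg([3, 2, 0, 1]))  # 5.5
```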

We plotted the resulting DCG values as graphs.

We then calculated the NDCG:

$$ NDCG_{p} = \frac{DCG_{p}}{IDCG_{p}} $$

where IDCG is the ideal DCG, that is, the DCG obtained when the results are sorted by decreasing relevance.
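The normalization step can be sketched as follows. This is an illustration under stated assumptions: the function names `dcg` and `ndcg` are hypothetical, and the ideal ordering is built by sorting the same list of scores, whereas a benchmark could also build it from all judged reviews for the query.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """DCG = rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    if not relevances:
        return 0.0
    total = float(relevances[0])
    for i, rel in enumerate(relevances[1:], start=2):
        total += rel / math.log2(i)
    return total


def ndcg(relevances: Sequence[float]) -> float:
    """NDCG = DCG / IDCG, where IDCG is the DCG of the ideal ordering.

    Assumption: the ideal ordering is the same scores sorted in
    decreasing order. Returns 0.0 when every score is zero.
    """
    ideal = sorted(relevances, reverse=True)
    idcg = dcg(ideal)
    if idcg == 0:
        return 0.0
    return dcg(relevances) / idcg


if __name__ == "__main__":
    print(ndcg([3, 2, 1, 0]))  # 1.0, the ranking is already ideal
    print(ndcg([3, 2, 0, 1]))  # slightly below 1.0
```

Because NDCG lies between 0 and 1, it allows queries with different numbers of relevant reviews to be compared on the same scale.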
