flitz99/Progetto_gestione

Progetto_gestione

This project was developed for the Gestione dell'informazione (Information Management) exam of the Computer Science degree at the university in Modena, by Filippo Reggiani and Janath Uthayakumar.

Libraries

  • beautifulsoup4 4.12.2
  • scrapy 2.9.0
  • whoosh 2.7.4
  • tqdm 4.65.0
  • transformers 4.29.1
  • textblob 0.17.1
  • vadersentiment 3.3.2
  • tensorflow 2.12.0
  • matplotlib 3.7.1

Setup

  • Clone this project into a folder using the git link

  • Use a virtual environment of your choice (we tested this project with Pipenv)

  • Install the dependencies from the Pipfile

  • Run Start.py (this takes about 10 minutes)

  • Once everything is ready, it will prompt you for a query. No separate initialization step is needed: just run Start.py to make queries

How to query

After following the steps described in the Setup section:

  1. Run Start.py
  2. Enter the query to search for, for example: amazing visual effects
  3. The application will prompt you to choose the search engine, BM25F or TF_IDF (the default is TF_IDF)
  4. The application will prompt you to choose the sentiment analyzer: Distil-roberta, Textblob or Vader
  5. Depending on the analyzer chosen, specify the sentiment type, for example positive, negative or neutral
  6. The results will be displayed

output example

The following output is an example obtained by searching with these choices:

  • amazing visual effects
  • TF_IDF
  • Vader
  • Negative

Ranking_finale: 1
Categoria: Cd e Vynil, Nome del prodotto : Pink Floyd - Pulse VHS
Autore recensione: Erico Macedo
Testo recensione: After waiting years to see Pulse on DVD, I have to say I am disappointed with the DVD. The concert is amazing, but as already posted in other reviews, it was shot in video instead of film. The result is clearly visible and the image quality is poor - too bad, as the visual effects would look amazing on a large screen. Anyway, it is still a must-have for Pink Floyd fans. You might be a little disappointed with the image quality, but you will definitely not regret your purchase.
Ranking pertinenza: 2
Ranking sentimento: 9
Sentiment analyzer: Vader

Ranking_finale = final score obtained by combining the Ranking pertinenza and Ranking sentimento scores
Categoria = the category name of the product
Nome del prodotto = the name of the product
Autore recensione = the name of the author of the review
Testo recensione = the body of the review
Ranking pertinenza = the ranking of the review based on how pertinent its content is to the query
Ranking sentimento = the ranking based on the chosen sentiment analyzer and sentiment
Sentiment analyzer = the sentiment analyzer used for the analysis
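This README does not say how Ranking pertinenza and Ranking sentimento are combined into Ranking_finale. As an illustration only, the sketch below assumes a simple rank-sum rule (lower sum is better, ties broken by the relevance rank). The function name `combine_rankings` and the review ids are hypothetical and are not taken from Start.py.

```python
from typing import Dict, List, Tuple


def combine_rankings(
    relevance_rank: Dict[str, int],
    sentiment_rank: Dict[str, int],
) -> List[Tuple[str, int]]:
    """Return (review_id, final_rank) pairs, best first.

    Assumption: the final order is given by the sum of the two ranks,
    with ties broken by the relevance rank. The real project may use
    a different rule.
    """
    # Only reviews that appear in both rankings can be combined.
    common = relevance_rank.keys() & sentiment_rank.keys()
    ordered = sorted(
        common,
        key=lambda rid: (relevance_rank[rid] + sentiment_rank[rid], relevance_rank[rid]),
    )
    return [(rid, position) for position, rid in enumerate(ordered, start=1)]


if __name__ == "__main__":
    # Hypothetical ranks for three reviews.
    relevance = {"a": 1, "b": 2, "c": 3}
    sentiment = {"a": 3, "b": 1, "c": 2}
    print(combine_rankings(relevance, sentiment))
```

A rank-sum rule is only one option; a weighted sum of the underlying scores would be an equally plausible design.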

output.txt is an example of the output of Start.py.
benchmark.txt is an example of the output of Benchmark.py.

Benchmark

For the benchmark we selected 100 reviews at random using Benchmark/Benchmark_create.py and chose 10 queries that mimic what a user might ask.
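The random selection can be sketched with the standard library as follows. This is not the code of Benchmark/Benchmark_create.py: the function name `sample_reviews`, the `seed` parameter and the placeholder review list are assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_reviews(reviews: Sequence[T], k: int = 100, seed: Optional[int] = None) -> List[T]:
    """Pick up to k distinct reviews uniformly at random.

    A fixed seed makes the benchmark set reproducible between runs.
    """
    rng = random.Random(seed)
    # random.sample raises ValueError if k exceeds the population size,
    # so cap k at the number of available reviews.
    return rng.sample(list(reviews), min(k, len(reviews)))


if __name__ == "__main__":
    # Placeholder corpus: in the real project these would be the indexed reviews.
    corpus = [f"review-{i}" for i in range(1000)]
    benchmark_set = sample_reviews(corpus, k=100, seed=42)
    print(len(benchmark_set))
```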

The 10 queries are defined in Benchmark/Benchmark.py:

"r1": "love AND stories",
"r2": "DVD OR VHS",
"r3": "bad AND packing",
"r4": "worth buying",
"r5": "background music",
"r6": "amazing visual effects",
"r7": "cast",
"r8": "faithful to the book",
"r9": "Oscar OR awards ",
"r10": "highest rated",

We then manually scored each review against each query, assigning a relevance score in the range [0, 3].

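We evaluated the rankings with the discounted cumulative gain (DCG), where rel_i is the manual relevance score of the result at position i and p is the number of results considered.

DCG Formula

$$ DCG_{p} = rel_{1} + \sum_{i=2}^{p} \frac{rel_{i}}{\log_{2} i} $$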
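As a minimal sketch, the DCG formula above translates directly into Python. The function name `dcg` and the example scores are hypothetical; the actual implementation in Benchmark/Benchmark.py may differ.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """DCG_p = rel_1 + sum_{i=2}^{p} rel_i / log2(i), with 1-based positions."""
    total = 0.0
    for i, rel in enumerate(relevances, start=1):
        # The first result is not discounted; log2(1) = 0 would divide by zero.
        total += rel if i == 1 else rel / math.log2(i)
    return total


if __name__ == "__main__":
    # Hypothetical manual scores (0 to 3) of the top 5 results for one query.
    scores = [3, 2, 3, 0, 1]
    print(round(dcg(scores), 3))
```

Note that with this variant of the formula the results at positions 1 and 2 carry the same weight, since log2(2) = 1.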

We then plotted the following graphs:

DCG using Tf_IDF

DCG using BM25

We then calculated the normalized DCG (NDCG):

$$ NDCG_{p} = \frac{DCG_{p}}{IDCG_{p}} $$

where IDCG_p is the ideal DCG, i.e. the DCG obtained when the results are sorted by decreasing relevance.
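The normalization can be sketched as follows. The names `dcg`, `ndcg` and `all_judged` are hypothetical, and how Benchmark/Benchmark.py builds the ideal ranking is an assumption: here it is the p highest manual scores among all judged reviews for the query, or among the returned results when no such list is given.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float]) -> float:
    """DCG_p = rel_1 + sum_{i=2}^{p} rel_i / log2(i), with 1-based positions."""
    return sum(
        rel if i == 1 else rel / math.log2(i)
        for i, rel in enumerate(relevances, start=1)
    )


def ndcg(ranked: Sequence[float], all_judged: Optional[Sequence[float]] = None) -> float:
    """NDCG_p = DCG_p / IDCG_p, where p = len(ranked).

    Assumption: IDCG_p is the DCG of the p best scores in all_judged
    (or in ranked itself when all_judged is not provided).
    """
    pool = ranked if all_judged is None else all_judged
    ideal = dcg(sorted(pool, reverse=True)[: len(ranked)])
    # A query with no relevant review has IDCG = 0; report 0 instead of dividing.
    return dcg(ranked) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical scores of the returned results, in ranked order.
    print(ndcg([1, 2, 3]))
    # Same idea, with the ideal ranking taken from all judged reviews.
    print(ndcg([1, 0], all_judged=[3, 1, 0, 2]))
```

An NDCG of 1 means the engine returned the results in the best possible order for that query; values closer to 0 mean relevant reviews were ranked low or missed.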

NDCG using Tf_IDF

NDCG using BM25