
modified code according to project handout #2

Merged
merged 2 commits into main on Mar 13, 2022

Conversation

@bobcchen (Owner) commented Feb 27, 2022

1. Do you understand the steps involved in creating and deploying an LTR model? Name them and describe what each step does in your own words.
a. Initialize the LTR store on OpenSearch, which stores feature sets and models.
b. Feature engineering.
Determine which features from the dataset to use for training the LTR model, and upload them to the LTR store as a feature set.
c. Data processing.
Clean the dataset and split it into training and test sets. Both grades and features are required. Grades can be derived from explicit human judgments or implicit judgments; in this project the grades were derived from implicit judgments, followed by a heuristic (step()) to transform them into the range [0.0, 1.0]. Features can be calculated using a query (create_feature_log_query()).
d. Training and testing the model.
Use the training set, which consists of grades and features, to train an LTR model (in this project, XGBoost). Test the model after training using metrics such as MRR and P@k.
e. Upload the LTR model to OpenSearch.
f. Use the LTR model to search, typically via the rescore API (see the sketch below).
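
As a concrete illustration of step f, here is a minimal sketch of a rescore request using the LTR plugin's sltr query. The index name (bbuy_products), store name (week1), model name (ltr_model), field name, and client setup are illustrative assumptions, not the exact values used in this project.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

user_query = "lcd tv"
body = {
    # first-pass BM25 query
    "query": {"match": {"name": user_query}},
    "rescore": {
        "window_size": 500,  # only the top 500 hits are rescored by the model
        "query": {
            "rescore_query": {
                "sltr": {
                    "params": {"keywords": user_query},  # substituted into the feature templates
                    "model": "ltr_model",
                    "store": "week1",
                }
            },
            "rescore_query_weight": 2.0,
        },
    },
}

response = client.search(index="bbuy_products", body=body)
```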

2. What is a feature and featureset?
A feature is a query that depends on certain field(s) of documents; the score returned by the query is used for training. Essentially, a feature is a property of a document that the LTR model is trained on. A featureset is a named collection of these features, stored together in the LTR store.
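
For example, a single feature can be defined as a Mustache-templated query and uploaded as part of a featureset. This is a minimal sketch following the LTR plugin's featureset format; the feature name, featureset name, and the name field are illustrative assumptions.

```python
import json
import requests

featureset = {
    "featureset": {
        "features": [
            {
                "name": "name_match",
                "params": ["keywords"],
                "template_language": "mustache",
                # the score of this query against the "name" field becomes the feature value
                "template": {"match": {"name": "{{keywords}}"}},
            }
        ]
    }
}

requests.post(
    "http://localhost:9200/_ltr/_featureset/bbuy_featureset",
    headers={"Content-Type": "application/json"},
    data=json.dumps(featureset),
)
```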

3. What is the difference between precision and recall?
Precision is defined as: no. of relevant documents retrieved / total no. of documents retrieved
Recall is defined as: no. of relevant documents retrieved / total no. of relevant documents
There is often a trade-off between the two: an increase in precision might result in a decrease in recall, and vice versa. E-commerce use cases typically prioritize precision, while discovery use cases prioritize recall.
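
A minimal illustration of the two definitions, assuming binary relevance judgments:

```python
def precision(retrieved, relevant):
    """no. of relevant documents retrieved / total no. of documents retrieved"""
    return len(set(retrieved) & set(relevant)) / len(retrieved)


def recall(retrieved, relevant):
    """no. of relevant documents retrieved / total no. of relevant documents"""
    return len(set(retrieved) & set(relevant)) / len(relevant)


retrieved = ["d1", "d2", "d3", "d4"]   # documents returned by a query
relevant = ["d2", "d5"]                # documents judged relevant
print(precision(retrieved, relevant))  # 1/4 = 0.25
print(recall(retrieved, relevant))     # 1/2 = 0.5
```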

4. What are some of the traps associated with using click data in your model?
Clicks do not necessarily translate to relevance. There is also presentation bias: items that are ranked low receive few clicks, so items with low click counts tend to remain low regardless of their actual relevance.

5. What are some of the ways we are faking our data and how would you prevent that in your application?
The data is faked by synthetically generating impressions. To prevent this, log information about all the documents returned by each query, including documents that were not clicked (see the sketch below). Alternatively, pay for explicit human judgment data.
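
One possible shape for such an impression log, recording every returned document rather than only the clicked one; the field names and file format are assumptions.

```python
import json
import time


def log_impression(query, hits, clicked_sku, log_file="impressions.jsonl"):
    """Append one impression record covering every returned document, clicked or not."""
    record = {
        "timestamp": time.time(),
        "query": query,
        "results": [
            {"sku": hit["_id"], "rank": rank, "clicked": hit["_id"] == clicked_sku}
            for rank, hit in enumerate(hits, start=1)
        ],
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")
```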

6. What is target leakage and why is it a bad thing?
Target leakage is training a model on information that will not be available at prediction time (in production). It causes an overestimate of the model's predictive effectiveness.

7. When can using prior history cause problems in search and LTR?
When new versions of existing products are added, and when new products with no prior history are added; such items have little or no click history, so history-based features will unfairly demote them.

8. Submit your project along with your best MRR scores
[screenshot: MRR scores]
The scores were obtained using mostly the same configuration as described in the project handout, with the exception of 2 new features, customerReviewAverage and customerReviewCount. Adding the new features did not further improve the MRR obtained with LTR. However, the new features appear to be more significant than the salesRank*Terms features and the prices (see the sketch after the plots below).
[ltr_model_importance and ltr_model_tree plots]
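
For reference, a numeric document field such as customerReviewAverage can be exposed as an LTR feature with a function_score/field_value_factor template. This is a hedged sketch; the missing default and the exact template are assumptions, not necessarily what was used here.

```python
review_average_feature = {
    "name": "customerReviewAverage",
    "params": [],
    "template_language": "mustache",
    "template": {
        "function_score": {
            "query": {"match_all": {}},
            "field_value_factor": {
                "field": "customerReviewAverage",
                "missing": 0,  # documents without a review average contribute a feature value of 0
            },
        }
    },
}
```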

@bobcchen (Owner, Author) commented Mar 12, 2022

1. For query classification:
a. How many unique categories did you see in your rolled up training data when you set the minimum number of queries per category to 100? To 1000?

Getting the unique categories and their counts using df['category'].value_counts(), I initially got 1486 unique categories.
When the minimum number of queries per category is set to 100, I got 866 unique categories.
When the minimum number of queries per category is set to 1000, I got 374 unique categories.
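
The roll-up itself can be sketched as follows: categories with fewer than min_queries labeled queries are repeatedly replaced by their parent category until every remaining category meets the threshold. The parents dict (child category -> parent category, taken from the category taxonomy) is an assumed input; this is not the exact code in create_labeled_queries.py.

```python
import pandas as pd


def roll_up(df: pd.DataFrame, parents: dict, min_queries: int) -> pd.DataFrame:
    """Replace under-represented categories with their parents until all meet min_queries."""
    while True:
        counts = df["category"].value_counts()
        below = [c for c in counts[counts < min_queries].index if c in parents]
        if not below:
            # every remaining category meets the threshold or has no parent to roll up into
            break
        mask = df["category"].isin(below)
        df.loc[mask, "category"] = df.loc[mask, "category"].map(parents)
    return df
```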

b. What values did you achieve for P@1, R@3, and R@5? You should have tried at least a few different models, varying the minimum number of queries per category as well as trying different fastText parameters or query normalization. Report at least 3 of your runs.

|                   | Model 1 | Model 2 | Model 3 | Model 4 |
|-------------------|---------|---------|---------|---------|
| min queries       | 100     | 1000    | 100     | 1000    |
| preprocessed text | FALSE   | FALSE   | TRUE    | TRUE    |
| learning rate     | 0.5     | 0.5     | 0.5     | 0.5     |
| epochs            | 25      | 25      | 25      | 25      |
| word ngrams       | 2       | 2       | 2       | 2       |
| P@1               | 0.493   | 0.504   | 0.514   | 0.522   |
| R@3               | 0.674   | 0.686   | 0.695   | 0.706   |
| R@5               | 0.739   | 0.749   | 0.757   | 0.77    |

After the data is prepared by create_labeled_queries.py, I shuffled it and took the first 50000 examples as training data and the last 50000 as test data, following the reading. Only the minimum number of queries per category and the preprocessing were varied across runs; training parameters were held constant (non-default values as in the table above), with the default softmax loss. Both preprocessing the queries and increasing min queries (at the cost of losing resolution in the predicted categories) improved the model's performance, so I will be using Model 4 moving forward (see the training sketch below).
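
A run such as Model 4 could be trained and evaluated with fastText roughly as follows; the file names are assumptions, and the labeled files are expected in fastText's __label__<category> <query> format.

```python
import fasttext

model = fasttext.train_supervised(
    input="labeled_queries.train",  # first 50000 shuffled examples
    lr=0.5,
    epoch=25,
    wordNgrams=2,
)

# model.test returns (number of examples, precision@k, recall@k)
print(model.test("labeled_queries.test", k=1))  # P@1
print(model.test("labeled_queries.test", k=3))  # R@3
print(model.test("labeled_queries.test", k=5))  # R@5
```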

2. For integrating query classification with search:
I have implemented both filtering and boosting. Five predictions are obtained from the model with no threshold. Filtering and boosting are then implemented as follows:
[screenshot: filtering and boosting implementation]
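
The screenshot above shows the actual implementation; the sketch below is a reconstruction based on the description in the answers that follow (top-5 predictions, 0.6 confidence threshold), with assumed function and variable names.

```python
THRESHOLD = 0.6  # minimum classifier confidence before a category influences the query


def apply_query_classification(query_obj, predictions):
    """Boost (or optionally filter on) predicted categories whose score clears the threshold."""
    confident = [category for category, score in predictions if score >= THRESHOLD]
    if not confident:
        return query_obj  # no confident prediction: leave the query unchanged
    for category in confident:
        # boosting: matching categories score higher, but other results are still returned
        query_obj["query"]["bool"]["should"].append(
            {"term": {"categoryPathIds": {"value": category, "boost": 50.0}}}
        )
        # filtering (alternative): restrict results to the predicted category
        # query_obj["query"]["bool"]["filter"].append({"term": {"categoryPathIds": category}})
    return query_obj
```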

a. Give 2 or 3 examples of queries where you saw a dramatic positive change in the results because of filtering. Make sure to include the classifier output for those queries.
Query: "ipad 2"
Classifier output: Your query categorization model predicted: [('pcmcat209000050007', 0.760281503200531), ('pcmcat218000050000', 0.08647284656763077), ('pcmcat209000050008', 0.030620116740465164), ('pcmcat218000050003', 0.026567159220576286), ('pcmcat217900050000', 0.023749062791466713)]
According to the code above, results with category pcmcat209000050007 (or iPad) are boosted, and hence the top results are all variations of the iPad 2, followed by variations of other iPads, and finally accessories related to the iPads, which is desirable.

Query: "lcd tv"
Classifier output: Your query categorization model predicted: [('abcat0101001', 0.9857031106948853), ('pcmcat200900050015', 0.006267087999731302), ('abcat0106004', 0.0047890073619782925), ('pcmcat233200050010', 0.0013648761669173837), ('abcat0101005', 0.0004192907072138041)]
Results with category abcat0101001 (or All Flat-Panel TVs) are boosted. The first result not in the All Flat-Panel TVs category is of the category abcat0101005 (or TV/DVD Combos) and is ranked 34th, which is desirable.

b. Give 2 or 3 examples of queries where filtering hurt the results, either because the classifier was wrong or for some other reason. Again, include the classifier output for those queries.
Query: "ps3"
Classifier output: Your query categorization model predicted: [('abcat0703001', 0.5304310917854309), ('abcat0703002', 0.16260522603988647), ('pcmcat232900050029', 0.06001700833439827), ('abcat0715007', 0.05300392210483551), ('pcmcat144700050004', 0.044538553804159164)]
The score of abcat0703001 (or PS3 Consoles) is slightly below the threshold of 0.6, so no particular category is boosted and the actual PS3 console results are not boosted to the top. Instead, the top results are in categories such as PS3 Accessories, PS3 Games, and PS3 Controllers. This could be rectified by fine-tuning the 0.6 threshold.

Query: "transformers"
Classifier output: Your query categorization model predicted: [('cat02015', 0.7603320479393005), ('abcat0707002', 0.12136812508106232), ('pcmcat209000050008', 0.019910454750061035), ('cat02685', 0.01568685658276081), ('abcat0703002', 0.008503180928528309)]
Firstly, the predicted category with the highest score, cat02015, does not exist for any result (verified with the match query { "match": { "categoryPathIds": "cat02015" } }); this could be due to information loss during rollup or an error in the data processing. Secondly, the top results are mostly of the category with the second-highest score, abcat0707002 (or Nintendo DS Games), including games that do not contain the word "transformers" at all. Hence the outcome is undesirable, as the user might not be searching for Nintendo DS games.

@bobcchen merged commit 679aa44 into main on Mar 13, 2022