modified code according to project handout #2
1. Do you understand the steps involved in creating and deploying an LTR model? Name them and describe what each step does in your own words.
a. Initialize the LTR store on OpenSearch, which stores feature sets and models.
b. Feature engineering.
Determine which features from the dataset to use for training the LTR model, and upload them to the LTR store as a feature set.
c. Data processing.
Clean the dataset and split it into training and test sets. Both grades and features are required. Grades can be derived from explicit human judgments or implicit judgments. In this project the grades were derived from implicit judgments, followed by a heuristic (step()) to transform the grades into the range [0.0, 1.0]. Features can be calculated using a query (create_feature_log_query()).
d. Training and testing the model.
Use the training set, which consists of grades and features, to train an LTR model (in this project, xgboost). Test the model after training, using metrics such as MRR and P@k.
e. Upload the LTR model to OpenSearch.
f. Use the LTR model to search (typically done using the rescore API).
2. What is a feature and featureset?
A feature is a query which depends on certain field(s) of documents. The score returned from the query will be used for training. Essentially a feature is a property of documents, to be used to train the LTR model. A featureset is a set of these features.
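As a rough illustration of the description above, a feature set for the OpenSearch LTR plugin could look like the sketch below. The feature set name, the name_match feature, and the document field "name" are assumptions made for illustration and are not taken from the project; customerReviewAverage is the field mentioned in the answer to question 8.

```python
import json

# Hypothetical feature set: each feature is a templated query whose score
# becomes one input value for the LTR model.
feature_set = {
    "featureset": {
        "name": "example_featureset",  # hypothetical name
        "features": [
            {
                # Text-match feature on an assumed "name" field.
                "name": "name_match",
                "params": ["keywords"],
                "template_language": "mustache",
                "template": {"match": {"name": "{{keywords}}"}},
            },
            {
                # Numeric feature that exposes a document field as a score.
                "name": "customerReviewAverage",
                "params": [],
                "template_language": "mustache",
                "template": {
                    "function_score": {
                        "functions": [
                            {
                                "field_value_factor": {
                                    "field": "customerReviewAverage",
                                    "missing": 0,
                                }
                            }
                        ],
                        "query": {"match_all": {}},
                    }
                },
            },
        ],
    }
}

feature_names = [f["name"] for f in feature_set["featureset"]["features"]]
print(json.dumps(feature_set, indent=2))
```

The first feature depends on the user's query through the keywords parameter, while the second depends only on the document, which matches the idea that a feature is a query over certain document fields.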
3. What is the difference between precision and recall?
Precision is defined as:
no. of relevant documents retrieved / total no. of documents retrieved
Recall is defined as:
no. of relevant documents retrieved / total no. of relevant documents
There is often a trade-off between precision and recall: an increase in precision might result in a decrease in recall, and vice versa. E-commerce use cases prioritize precision, while discovery use cases prioritize recall.
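The two definitions above can be sketched for a single query as follows. The function name and the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents returned by the search.
    relevant:  IDs of all documents judged relevant for the query.
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant documents retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# 4 documents retrieved, 3 relevant in total, 2 of the relevant ones retrieved.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here precision is 2/4 and recall is 2/3. Retrieving more documents could raise recall but would likely lower precision.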
4. What are some of the traps associated with using click data in your model?
Clicks do not necessarily translate to relevance. There is also presentation bias: items that are shown less prominently get fewer clicks, so items with low click counts tend to remain low.
5. What are some of the ways we are faking our data and how would you prevent that in your application?
Data is faked by generating the impressions. To prevent that, log all the documents returned by each query, including documents that were not clicked on. Alternatively, pay for explicit human judgment data.
6. What is target leakage and why is it a bad thing?
Target leakage is the usage of information which might not be available during prediction (in production) to train a model. It will cause an overestimate of the model's predictive effectiveness.
7. When can using prior history cause problems in search and LTR?
When new versions of existing products are added, and when new products with no prior history are added.
8. Submit your project along with your best MRR scores
The scores were obtained mostly using the same configuration as described in the project handout, with the exception of 2 new features, customerReviewAverage and customerReviewCount. Adding the new features did not improve the MRR obtained using LTR. However, the new features seem to be more significant than the salesRank*Term features and the prices.
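For reference, MRR (mean reciprocal rank) can be computed along the lines of the sketch below. This is not the project's evaluation code; the function name and the sample data are illustrative, and it assumes one clicked (relevant) document per query.

```python
def mean_reciprocal_rank(ranked_results, clicked):
    """Compute MRR over a set of queries.

    ranked_results: one ranked list of document IDs per query.
    clicked:        the relevant (clicked) document ID for each query.
    A query whose clicked document is missing from its results contributes 0.
    """
    if not ranked_results:
        return 0.0
    total = 0.0
    for docs, target in zip(ranked_results, clicked):
        if target in docs:
            total += 1.0 / (docs.index(target) + 1)  # ranks are 1-based
    return total / len(ranked_results)


results = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
targets = ["a", "z", "missing"]
mrr = mean_reciprocal_rank(results, targets)
print(round(mrr, 3))
```

In this example the reciprocal ranks are 1, 1/3, and 0, so the MRR is 4/9.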