Fix empty urlClassification #122
Main problem: there is no urlClassification data collected during the crawl because Firefox's Enhanced Tracking Protection did not detect any sites during the crawl. (When visiting a site manually, opening the protection dashboard showed that Enhanced Tracking Protection detected and blocked first-party and third-party sites that fall under the relevant categories.)
Ruled out the recent Firefox Nightly update: I downloaded the older Nightly version (the one I used for the June crawl) and tried it with both the current version of the code and the June crawl version of the code; both still gave an empty output for urlClassification.
Ruled out a problem with my local device: I ran the crawl on my own computer and also got the same empty output for urlClassification.
As discussed, I have checked several things over the past week:
For the July, June, and even April crawls, the codebase is the same regarding the selenium-webdriver version used (see attached). I further confirmed this by checking the "resolved" field in the package-lock.json in our codebase; it is indeed the downloaded 4.7.1 version, so npm install will install this exact version. The only ways for us to update the Selenium version when running the crawl are manually editing the package.json file or running npm update selenium-webdriver. I can be sure that I used version 4.7.1 for the June crawl, and presumably that is also the case for the previous crawls. So, my last experiment was manually editing the package-lock.json file to update the Selenium version to the latest one (4.23.0) in issue-122. However, it still gave me no urlClassification. Nevertheless, I don't think we can rule out Selenium versions just yet, given that the jump from v4.7.1 to v4.23.0 is pretty big.
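As a quick sanity check on which version npm actually installed (a minimal sketch; the filename is hypothetical, and it assumes the script is run from the directory where npm install was executed):

```js
// check-selenium-version.js (hypothetical helper): print the selenium-webdriver
// version actually installed in node_modules, to compare against package-lock.json.
const { version } = require("selenium-webdriver/package.json");
console.log(`Installed selenium-webdriver version: ${version}`);
```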
Thanks, @franciscawijaya! You compared the April/June dependency versions against the July dependency versions that you have installed locally on your computer, right? (as opposed to the July dependency versions that are committed here in the remote repo)
Yes, correct. |
@franciscawijaya and @Mattm27, there are three places with dependencies: the top level of the repo, the REST API, and the crawler.
When you go back in time to check if an updated dependency is causing the issue, consider that any of these may be relevant.
Like @franciscawijaya, I received empty urlClassification entries when performing the crawl with the same dependencies from the April crawl. In addition, after using the identical codebase from the April crawl, I still received an empty urlClassification. However, I performed this analysis before reading through @SebastianZimmeck's comment above, so I will double-check that the dependencies are identical in those three places and see if the results change at all. I will continue to dive deeper into this issue in the coming days to see if I come across any new information/findings that may be relevant!
Sounds good! (Also, as additional clarification, what matters are the local dependencies on your computer and not on the remote repo.)
I have gone through and run the crawl with identical dependencies from the June crawl at the top level, REST API, and crawler, and I still received empty urlClassification results. The API call is working properly within the extension, indicating that it is most likely not an issue with the API. After some more thought, it occurred to me as I was reviewing analysis.js that it could possibly be a timing issue, with the urlClassification data being accessed before it is fully populated. I will look more into this, but I would imagine this issue would have been accounted for in previous crawls if it were causing problems with the crawl data. As @SebastianZimmeck mentioned in last week's meeting, it will be beneficial to create a high-level crawler that only pulls the urlClassification object from a site, which @franciscawijaya and I will begin to work on!
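As a rough starting point for that stripped-down crawler, a background script that does nothing but record the urlClassification object Firefox attaches to webRequest events might look like the sketch below (the filename and log format are placeholders; it assumes the "webRequest" and "<all_urls>" permissions are declared in manifest.json):

```js
// background.js (sketch): log the urlClassification Firefox attaches to each response.
const seen = [];

browser.webRequest.onHeadersReceived.addListener(
  (details) => {
    // urlClassification is a Firefox-only field; empty firstParty/thirdParty arrays
    // here would mirror the empty results we see in the full crawl.
    if (details.urlClassification) {
      seen.push({
        url: details.url,
        firstParty: details.urlClassification.firstParty,
        thirdParty: details.urlClassification.thirdParty,
      });
      console.log("urlClassification for", details.url, details.urlClassification);
    }
  },
  { urls: ["<all_urls>"] }
);
```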
Similar to what @Mattm27 encountered, the issue persisted on my end even after making sure that all the dependencies from the three package.json files are identical. This past week, I have also written my own Python script using Selenium to exclusively extract and collect the urlClassification from Nightly's Enhanced Tracking Protection (ETP) and finally finished trying it out. However, the result is still empty even with this isolation. To double-check that it's not a matter of the data being there but not collected properly, I also checked the crawl manually while it was still ongoing, and the same circumstance as our GPC crawler's issue existed: when the protection banner is clicked during the ongoing crawl, no trackers known to Nightly are detected. This seems to suggest that the problem comes from the use of the Selenium WebDriver for collecting ETP data.
This is a good point for exploration! Following this suggestion, I reran the script after extending the timeout for Selenium's implicit wait to 30 seconds and 60 seconds. Unfortunately, using the Python script below, the timeout did not seem to make any difference. Here is the script that I wrote and used. I will discuss with @Mattm27 to see if there's anything that I missed in this script, as it is far from perfect at the moment.
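For reference, the same timeout experiment expressed in the crawler's Node selenium-webdriver bindings would look roughly like this (a sketch only; the 30- and 60-second values just mirror the experiment above):

```js
// Extend Selenium's implicit wait and page-load timeout before visiting sites,
// to test whether urlClassification simply needs more time to populate.
const { Builder } = require("selenium-webdriver");

(async () => {
  const driver = await new Builder().forBrowser("firefox").build();
  await driver.manage().setTimeouts({ implicit: 30000, pageLoad: 60000 });
  await driver.get("https://example.com");
  // ... read urlClassification via the extension here ...
  await driver.quit();
})();
```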
@wesley-tan, given your expertise, do you know why our calls to retrieve the urlClassification data come back empty? This behavior seems to have to do with Selenium. The call only fails when we use it as part of our crawl infrastructure with Selenium. @franciscawijaya created a minimal working example (which also requires Selenium to run). Do you have any thoughts?
As mentioned during the call, the API that I used in the minimal working example is unfortunately an internal API, rendering the script unable to access the ETP data and thus making it invalid. Right now, I'm working on a new script that includes web APIs and extension APIs (e.g., methods like chrome.webRequest.onBeforeRequest.addListener that are used in our actual crawler) so that I would be able to access the ETP data. As of now, the new minimal working example script is still not working (i.e., it's not printing the data yet), but when I ran the minimal Selenium crawl and checked it manually while it was visiting a site, the protection banner showed that there are sites blocked by Firefox according to ETP, which is a good sign given that our first challenge with our GPC crawler was that the protection banner showed 'None detected' for sites blocked by ETP. I will continue working on this and start discussing with @Mattm27 possible ways to get the now-present sites blocked by ETP with the new minimal working example script.
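One possible shape for that combined Selenium-plus-extension setup, sketched in the Node bindings (the binary and extension paths are placeholders, and the temporary-install flag of installAddon is assumed to be supported by the selenium-webdriver version in use):

```js
// Launch Firefox Nightly via Selenium, load the urlClassification-logging extension
// as a temporary add-on, and visit a site so ETP has a chance to classify requests.
const { Builder } = require("selenium-webdriver");
const firefox = require("selenium-webdriver/firefox");

(async () => {
  const options = new firefox.Options().setBinary("/path/to/firefox-nightly"); // placeholder

  const driver = await new Builder()
    .forBrowser("firefox")
    .setFirefoxOptions(options)
    .build();

  await driver.installAddon("/path/to/extension.xpi", true); // true = temporary install
  await driver.get("https://example.com");
  await driver.sleep(10000); // give ETP time to classify requests
  await driver.quit();
})();
```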
@wesley-tan is saying:
I began to play around with the preferences highlighted above by @wesley-tan.
I did a lot of investigating, but I only made progress once I tried to figure out how Firefox blocked a tracker for a website to begin with. That train of thought leads you to the about:url-classifier page. If you inspect element, you can find the actual code that's run when you search:

```js
let input = "https://ad-delivery.net"
let uri;
try {
uri = Services.io.newURI(input);
if (!uri) {
Search.reportError("url-classifier-search-error-invalid-url");
}
} catch (ex) {
Search.reportError("url-classifier-search-error-invalid-url");
}
console.log(uri);
let classifier = Cc["@mozilla.org/url-classifier/dbservice;1"].getService(
Ci.nsIURIClassifier
);
let featureNames = classifier.getFeatureNames();
let features = [];
featureNames.forEach(featureName => {
if (document.getElementById("feature_" + featureName).checked) {
let feature = classifier.getFeatureByName(featureName);
if (feature) {
features.push(feature);
}
}
});
if (!features.length) {
Search.reportError("url-classifier-search-error-no-features");
}
let listType =
document.getElementById("search-listtype").value == 0
? Ci.nsIUrlClassifierFeature.blocklist
: Ci.nsIUrlClassifierFeature.entitylist;
classifier.asyncClassifyLocalWithFeatures( uri, features, listType, list =>
Search.showResults(list)
);
```

I thought the getTables method looked interesting. It's used to get the table states from the database. If you run the following line in the browser console:

```js
var tables = () => Cc["@mozilla.org/url-classifier/dbservice;1"].getService(
Ci.nsIURIClassifier
).getTables(table => console.log(table));
```

Running this in the normal Nightly browser prints a very full list, while the crawler browser prints nothing. This gave me the idea that the tracking list information simply wasn't in the database. I am not sure about the exact mechanics, but I read enough docs to realize that the list data probably comes through Remote Settings, which pointed me to the setting to fiddle with. The relevant code reads:

```js
async fetchLatestChanges(serverUrl, options = {}) {
const { expectedTimestamp, lastEtag = "", filters = {} } = options;
let url = serverUrl + Utils.CHANGES_PATH;
const params = {
...filters,
_expected: expectedTimestamp ?? 0,
};
if (lastEtag != "") {
params._since = lastEtag;
}
if (params) {
url +=
"?" +
Object.entries(params)
.map(([k, v]) => `${k}=${encodeURIComponent(v)}`)
.join("&");
}
const response = await Utils.fetch(url);
if (response.status >= 500) {
throw new Error(`Server error ${response.status} ${response.statusText}`);
}
const is404FromCustomServer =
response.status == 404 &&
Services.prefs.prefHasUserValue("services.settings.server");
const ct = response.headers.get("Content-Type");
if (!is404FromCustomServer && (!ct || !ct.includes("application/json"))) {
throw new Error(`Unexpected content-type "${ct}"`);
}
```

So, the RemoteSettingsClient seems to be fetching data from the server provided, but we seem to get caught because of a mismatched Content-Type header. If we look at the server we query our settings from (which can be found in about:config under services.settings.server), we see its default value has been set to `data:,#remote-settings-dummy/v1`.
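If that automation-only dummy value is indeed why the tracking lists never get downloaded, one experiment would be to point the preference back at a real Remote Settings endpoint when launching the crawler's Firefox instance. A sketch, assuming the standard production URL (worth verifying against about:config in a normal, non-automated profile):

```js
// Restore a real Remote Settings endpoint so the url-classifier tables can be fetched.
const firefox = require("selenium-webdriver/firefox");

const options = new firefox.Options().setPreference(
  "services.settings.server",
  "https://firefox.settings.services.mozilla.com/v1" // assumed production default
);
```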
Thank you! The privacy project is amazing; I am happy I got a chance to contribute to it. For the sake of completeness as well:

```js
// Ensure remote settings do not hit the network
["services.settings.server", "data:,#remote-settings-dummy/v1"]
```

This was the only trace on the internet I found of the default value of the services.settings.server preference; hopefully someone else will be able to provide a much more sound answer. I hope this helped!
Thank you so much, @eakubilo! This is super!
As mentioned in the previous meeting, when I did the first batch of the crawl, all the data were collected except for Firefox's urlClassification. When I opened a site manually without a crawl, I was able to check the sites flagged by Firefox Tracking Protection.
Some fundamental things that I checked:
Since then, I have tried out different things and ruled out some of the possible causes: