Recommended List of Malicious Websites #4667

Closed
3 tasks done
tanmarpn opened this issue Dec 22, 2024 · 20 comments

@tanmarpn

Which domain(s) should be blocked?

www.kcczai.com
www.bitbitquark.com
bvox11.cc
www.fxnovus.co
uslasry.com
www.hkeusy.com
www.maicoinmn.com

Why should these domain(s) be blocked?

The site linked below is an anti-fraud website operated by the national government of Taiwan (Taipei). It regularly publishes links to websites identified as fraudulent or malicious by national monitoring systems.

I have browsed through multiple entries and noticed that some of these websites are not blocked by TIF.

Hope you can make good use of the information here to ensure safer browsing for our regional users.

Uri:
https://165.npa.gov.tw/#/articles/subclass/3

I confirm ...

  • that I have checked if there is no other issue for the domain(s) whereby they were unblocked or the blocking was declined.
  • that I have checked that the domain(s) are not already blocked.

Privacy

  • I confirm that the report does not contain any private information.
@tanmarpn tanmarpn added the deny Deny domain(s) label Dec 22, 2024
@jarelllama
Contributor

@hagezi I'll check in the morning if it's possible to use the website as a source for my blocklist

@hagezi
Owner

hagezi commented Dec 22, 2024

Thanks @tanmarpn and @jarelllama

@hagezi hagezi added the in progress A solution is being worked on label Dec 22, 2024
@xRuffKez
Contributor

@jarelllama I think you need to scrape it

@jarelllama
Contributor

Unable to scrape due to Google bot verification

@tanmarpn
Author

@jarelllama Thank you for checking.
This is the original API source. Please check if it can be used. Thank you.

Uri:
https://data.gov.tw/dataset/160055

@jarelllama
Contributor

@tanmarpn it seems I am unable to access the dataset download: https://data.moi.gov.tw/MoiOD/System/DownloadFile.aspx?DATA=3BB8E3CE-8223-43AF-B1AB-5824FA889883. Even if I could, I am not sure the URL stays the same across updates. If it changes, I would not be able to scrape the data automatically.

@jarelllama
Contributor

jarelllama commented Dec 23, 2024

What I can access is the CSV: https://quality.data.gov.tw/dq_download_csv.php?nid=160055&md5_url=45ab3c35d9f3f23d0166ba8f5ab9fd6d (last updated December 3rd 2024). I am not sure if this is the entire dataset or just a part of it. I will try to scrape the domains, but I will have to monitor whether the URL changes after each update. If that happens, I have tested that I can scrape https://data.gov.tw/dataset/160055 directly to get the CSV URL.
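A minimal sketch of that fallback, assuming the dataset page embeds the CSV link in the same `dq_download_csv.php` form quoted above. The function name, the regular expression, and the sample HTML are hypothetical; the real page markup may differ.

```python
import html
import re
from typing import Optional

# Hypothetical pattern based on the CSV link format quoted in this thread.
CSV_LINK = re.compile(r'https://quality\.data\.gov\.tw/dq_download_csv\.php\?[^"\s<>]+')

def find_csv_url(page_html: str) -> Optional[str]:
    """Return the first CSV download URL found in the page, or None."""
    match = CSV_LINK.search(page_html)
    # HTML attributes usually encode '&' as '&amp;', so unescape before use.
    return html.unescape(match.group(0)) if match else None

# Made-up page fragment for illustration only.
sample_html = (
    '<a href="https://quality.data.gov.tw/dq_download_csv.php'
    '?nid=160055&amp;md5_url=45ab3c35d9f3f23d0166ba8f5ab9fd6d">CSV</a>'
)
print(find_csv_url(sample_html))
```

Looking the link up on every run means a changed `md5_url` value would be picked up without editing the scraper.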

@jarelllama
Contributor

The CSV has 21922 domains dating back to 2022. I will only add those from 2024 onwards.
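A rough sketch of that year filter. The column names (`domain`, `date`) and the `YYYY/MM/DD` date format are assumptions for illustration; the actual CSV headers are not shown in this thread and will likely differ.

```python
import csv
import io

def domains_since(csv_text, min_year=2024, date_col="date", domain_col="domain"):
    """Return domains from rows whose date begins with a year >= min_year."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        year_text = (row.get(date_col) or "")[:4]
        # Skip rows with a missing or malformed date instead of failing.
        if year_text.isdigit() and int(year_text) >= min_year:
            kept.append(row[domain_col].strip().lower())
    return kept

# Made-up rows using reserved example domains, not real dataset entries.
sample_csv = "domain,date\nold.example,2022/05/01\nNew.example,2024/03/09\n"
print(domains_since(sample_csv))  # ['new.example']
```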

@hagezi
Owner

hagezi commented Dec 23, 2024

Thank you very much @jarelllama, let me know when it is integrated and your list is updated online.

@jarelllama
Contributor

Testing build now

@jarelllama
Contributor

Source: 165 Anti-fraud
Raw:11219  Final:10790  Whitelisted:   0  Excluded:   3  Toplist:   4
Processing time: 2 second(s)

All good 👍 . Thanks @tanmarpn

@tanmarpn
Author

@jarelllama Thank you very much. I have gathered some information that I hope will be helpful to you.

  1. "od.moi.gov.tw" provides the fastest data updates (every 7 days) and has a fixed URL. However, it blocks all IP addresses that are not from Taiwan. If you can provide a Taiwan proxy for your crawler, I believe this would be the best choice.
    Complete API URL: "https://od.moi.gov.tw/api/v1/rest/datastore/A01010000C-002150-013?format=json&limit=0"

  2. Data from "165.npa.gov.tw" is updated almost simultaneously with "od.moi.gov.tw", and I have tested it with IP addresses from multiple countries, all of which can access it normally. I did not encounter the "Google bot verification" issue you mentioned, though I am not sure about the design of your crawler system. Since this API is not ideally designed, it includes all the information from the web pages, making it bloated and requiring some logical analysis to extract the necessary links. You might consider adding a longer timeout for it.
    Complete API URL: "https://165.npa.gov.tw/api/article/subclass/3"

  3. Using "quality.data.gov.tw", as you mentioned, might be the best solution when the above two methods cannot be used. However, based on my tests, its data is updated more slowly. The latest data contains 32,218 records (as of 2024/12/17), while "quality.data.gov.tw" has only 31,538 records (as of 2024/12/03).
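For option 2, here is a minimal sketch of the "logical analysis" step: walk the whole JSON response and collect the hostname of every URL found in any string field. The response structure is not documented in this thread, so the code makes no assumption about field names. It does assume that reported sites appear as full `http(s)://` URLs, which may not hold, and it will also pick up legitimate links that need filtering afterwards. The helper names and the 60-second timeout are illustrative choices.

```python
import json
import re
from urllib.parse import urlsplit
from urllib.request import urlopen

API_URL = "https://165.npa.gov.tw/api/article/subclass/3"
URL_RE = re.compile(r'https?://[^\s"<>]+')

def fetch_json(url=API_URL, timeout=60):
    """Download and decode the response; the long timeout allows for a large payload."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

def iter_strings(node):
    """Yield every string found anywhere in a nested JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)

def extract_hosts(payload):
    """Return the sorted, de-duplicated hostnames of all URLs in the payload."""
    hosts = set()
    for text in iter_strings(payload):
        for url in URL_RE.findall(text):
            try:
                host = urlsplit(url).hostname
            except ValueError:  # malformed URL, e.g. unbalanced brackets
                continue
            if host:
                hosts.add(host.lower())
    return sorted(hosts)

# Made-up payload with reserved example domains; not the real API shape.
sample_payload = {
    "articles": [
        {"title": "notice", "content": "<p>https://scam.example/login and http://Fake.example:8080/a</p>"}
    ],
    "meta": {"page": 1},
}
print(extract_hosts(sample_payload))  # ['fake.example', 'scam.example']
```

In real use, `extract_hosts(fetch_json())` would run the same extraction against the live endpoint.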

@jarelllama
Contributor

jarelllama commented Dec 23, 2024

Option number 2 seems to work fine. I will update the code in a bit. @hagezi I'll let you know when I am done so you can close this issue.

@jarelllama
Contributor

Where can I check the total number of entries? I'm currently pulling about 35847.

@jarelllama
Contributor

Anyway, it doesn't matter. I will only pull domains from 2024 onwards.

@tanmarpn
Author

@jarelllama Currently, it seems that the total number of entries can only be calculated by visiting the API in method 1, as they do not provide an additional column or annotation for the total count.
However, the newer data should all be included in 'https://165.npa.gov.tw/api/article/subclass/3'.

jarelllama added a commit to jarelllama/Scam-Blocklist that referenced this issue Dec 23, 2024
@jarelllama
Contributor

@hagezi @tanmarpn updated the source:

Source: 165 Anti-fraud
Raw:11904  Final: 664  Whitelisted:   0  Excluded:   3  Toplist:   6
Processing time: 15 second(s)
----------------------------------------------------------------------

Thanks for the help @tanmarpn

@hagezi hagezi added fixed-pending-release Will be fixed in the next release and removed in progress A solution is being worked on labels Dec 23, 2024

Thank you for your support. The issue is scheduled to be fixed in the next release. You will be notified when the issue is finally fixed.

@hagezi
Owner

hagezi commented Dec 23, 2024

Thanks @jarelllama @tanmarpn


This issue has been fixed in release 2024.358.60841

@github-actions github-actions bot removed the fixed-pending-release Will be fixed in the next release label Dec 23, 2024