Manually create a profile.tar.gz #731

Closed
djhmateer opened this issue Dec 9, 2024 · 3 comments

@djhmateer

djhmateer commented Dec 9, 2024

Is it possible to create a profile.tar.gz manually, instead of generating it with create-login-profile as in:

docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles/ -it webrecorder/browsertrix-crawler create-login-profile --url "https://facebook.com/"

I started looking in here:

C:\Users\djhma\AppData\Local\Google\Chrome\User Data - I tried tar.gz'ing this directory but it didn't seem to work.

I've posted here too https://forum.webrecorder.net/t/manually-create-and-use-a-profile-tar-gz/702

Facebook is not happy with the docker profile.tar.gz creation process.

@tw4l
Member

tw4l commented Dec 9, 2024

Hi @djhmateer, part of the issue may be that Browsertrix Crawler uses Brave Browser. Brave and Chrome are both Chromium-based, so their browser profile data structures are similar, but I would guess they diverge at some points. I'm also not sure whether user profiles differ by operating system; the current browsertrix-browser-base Dockerfile is based on Ubuntu 24.

My guess is that a manually saved and tarred/gzipped user data directory from a Brave browser installation would work, but I haven't tested this myself, and I'm not sure whether it would also have to come from the same OS.

@djhmateer
Author

Hi @tw4l - thank you so much for the reply. Will test and report back.

@djhmateer
Author

This strategy has worked well, thank you @tw4l.

Essentially I ran Release Channel Brave on my WSL2 (Ubuntu 22) instance using instructions from https://brave.com/linux/

Then did something like:

brave-browser
# now login to whatever site eg https://www.osr4rightstools.org
cd ~/.config/BraveSoftware/Brave-Browser
tar -czvf profile.tar.gz *

mv profile.tar.gz ~/auto-archiver/tmp/.

cd ~/auto-archiver/tmp
chmod 777 profile.tar.gz

# test
docker run --rm -v /home/dave/auto-archiver/tmp:/crawls/ webrecorder/browsertrix-crawler crawl --url https://www.osr4rightstools.org --scopeType page --generateWACZ --text --screenshot fullPage --collection 2 --id 2 --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 200 --timeout 200 --profile /crawls/profile.tar.gz

# unzip the .wacz (a WACZ file is a ZIP archive, not a tar.gz)
# look for the screenshot .warc under archive/

# use replayweb.page to see if the screenshot is correct (easy to see if the site is logged in)
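
The packaging step above could also be wrapped in a small helper. This is only a sketch under a few assumptions: `make_profile_tarball` is a hypothetical function name, not part of Browsertrix Crawler; using `tar -C <dir> .` instead of `cd` plus `*` is a swapped-in variant that also picks up dotfiles; and `chmod 644` assumes read access is enough for the crawler (the commands above used 777).

```shell
# Hypothetical helper, not part of Browsertrix Crawler.
# Packs the contents of a browser user data directory into a gzipped tarball.
make_profile_tarball() {
  local src_dir="$1"   # e.g. ~/.config/BraveSoftware/Brave-Browser
  local out_file="$2"  # e.g. ~/auto-archiver/tmp/profile.tar.gz (keep it outside src_dir)

  if [ ! -d "$src_dir" ]; then
    echo "not a directory: $src_dir" >&2
    return 1
  fi

  # -C switches into src_dir first, so entries are stored with relative
  # paths (./Default/...) rather than the full ~/.config/... prefix.
  tar -czf "$out_file" -C "$src_dir" . || return 1

  # Assumption: read access is sufficient for the container to load the profile.
  chmod 644 "$out_file"
}
```

It would be called as `make_profile_tarball ~/.config/BraveSoftware/Brave-Browser ~/auto-archiver/tmp/profile.tar.gz`, and `tar -tzf ~/auto-archiver/tmp/profile.tar.gz | head` lists the first entries as a quick sanity check before passing the file to `--profile`.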

@github-project-automation github-project-automation bot moved this from Triage to Done! in Webrecorder Projects Dec 11, 2024