Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize bookmark import process #287

Open
sinkaszab opened this issue Jun 13, 2021 · 14 comments
Open

Optimize bookmark import process #287

sinkaszab opened this issue Jun 13, 2021 · 14 comments
Labels
Enhancement Any requests for improvements or new features

Comments

@sinkaszab
Copy link

I have several thousands of bookmarks (783KB of HTML). Importing them will result in an error:

[2021-06-13 07:10:11] production.ERROR: Maximum execution time of 30 seconds exceeded {"userId":1,"exception":"[object] (Symfony\\Component\\ErrorHandler\\Error\\FatalError(code: 0): Maximum execution time of 30 seconds exceeded at /home/pi/${PATH_TO_APP}/vendor/nesbot/carbon/src/Carbon/Traits/Creator.php:563) [stacktrace] #0 {main} "}

To avoid maximum execution time error, please delegate processing of bookmarks to another (long running) process that can process in batches and/or restart on maximum execution time error and whose progress can be queried.

If I want to directly import from pinboard.in exported bookmarks, I have no other acceptable alternatives. I wouldn't want to increase maximum execution time setting.

@sinkaszab sinkaszab added the Enhancement Any requests for improvements or new features label Jun 13, 2021
@Kovah
Copy link
Owner

Kovah commented Jun 13, 2021

Did you try the import via command line? It was created exactly for this case.
https://www.linkace.org/docs/v1/cli/#import-links-from-a-html-bookmarks-file

@sinkaszab
Copy link
Author

No, I didn't, although I would expect this to be handled from the UI side as well; I mean as a user I am not really interested in how bookmarks will be processed, until the upload fits into the maximum execution time. But if not, a warning to try with CLI would help: "Over X bookmarks, please use the CLI to import. For usage please visit the docs". (No harm, just sharing my user experience.)

I started with the setup docs and rushed to import when site was ready.

The CLI seems to have problem with resolving absolute paths. It needs to be relative to storage. I think that should be fixed. Please see output below:

$ sudo php artisan links:import ~/pinboard_export.2021.06.11_09.25.html                                                               
You will be asked to select a user who will be the owner of the imported bookmarks now.                                                     

 Please enter the user email address:
 > ${MY_EMAIL}

Reading file "/home/pi/pinboard_export.2021.06.11_09.25.html"...

In Filesystem.php line 57:
                                                                                                                                            
  File does not exist at path /home/pi/${PATH_TO_APP}/storage//home/pi/pinboard_export.202  
  1.06.11_09.25.html.

@sinkaszab
Copy link
Author

I started from clean slate and tried to import my bookmarks via the CLI.

$ sudo php artisan links:import --skip-check pinboard_export.2021.06.11_09.25.html 
You will be asked to select a user who will be the owner of the imported bookmarks now.

 Please enter the user email address:
 > ${MY_EMAIL}

Reading file "pinboard_export.2021.06.11_09.25.html"...
Skipping link check.
2577 links imported successfully, 1000 skipped.

What are those 1000 skipped bookmarks? Why were they skipped? I would like to import all of them. I can see 404 warnings in the logs, although I explicitly told the program to skip checking.

@Kovah
Copy link
Owner

Kovah commented Jun 14, 2021

Bookmarks are only skipped if the url already exists in the database. Were there links existing in the database from previous imports?

If you disable the check all links will be imported without being checked if they are online or not.

@sinkaszab
Copy link
Author

sinkaszab commented Jun 14, 2021

I dropped all tables from the database, re-installed the application and imported the bookmark export.

If you say there are 1000 duplicates then:
A.) pinboard.in saves a new version of bookmarks on update, meaning I updated exactly a 1000 bookmarks over the course of years.
B.) LinkAce considers http://foo.com/a and http://foo.com/b equal and I exactly added 1000 bookmarks that share the same domain over the years.

(It's also weird there are exactly a 1000 duplicates. I don't say there can't be, but such nice numbers ring a bell, I can't stand it.)

Unfortunately LinkAce doesn't log skipped links. (It would be nice though. Btw. I have debug mode set to true.)

$ cat storage/logs/laravel-2021-06-14.log | grep -c skip
0
$ cat storage/logs/laravel-2021-06-14.log | grep -c 404
108

@Kovah
Copy link
Owner

Kovah commented Jun 14, 2021

As I wrote, LinkAce skips all urls that are already in the database. That check runs for all links while doing the import. I do not know how the Poinboard export looks like, but there must be duplicates in there.
foo.com/a should not be equal to foo.com/b, but the check may be case sensitive, meaning foo.com/a is equal to foo.com/A. I think it can be possible that pinboard exports are not case sensitive.

I will see if I can add more details to the import via logs.

@sinkaszab
Copy link
Author

sinkaszab commented Jun 14, 2021

I loaded the export into the browser and checked for duplicates with code below:

var links = [...document.querySelectorAll('dt a')].map(e => e.href);
console.log(`No. links: ${links.length}`);
var unique = new Set([...links]);
console.log(`No. unique links: ${unique.size}`);
var counter = {};

links.forEach(link => {
  if (counter[link]) {
    counter[link] += 1;
  } else {
    counter[link] = 1;
  }
});

console.table(counter);

var nonUniques = Object.entries(counter).filter(([_, count]) => count > 1).map(([link]) => link);

console.log(nonUniques);

The numbers match up to my biggest surprise. LinkAce's behavior is OK.

One thing that popped into my mind, how does it get handled when there is a duplicate with different tags or description? Maybe there should be a temporary storage, from where you can resolve merge one by one or applying some kind of rules (eg. choose by creation date; merge fields; by number of tags, etc. ...)

@Kovah Kovah reopened this Jun 14, 2021
@Kovah
Copy link
Owner

Kovah commented Jun 14, 2021

Awesome!
Good to know everything worked. I'll keep this issue open for improvements.

Feel free to share any further feedback. It's very valuable to me. 👍

@sinkaszab
Copy link
Author

Great, keep up the good work!

@Kovah Kovah changed the title Delegate processing bookmarks to a long running process on import. Optimize bookmark import process Jun 14, 2021
@Kovah Kovah added this to the v1.10.0 milestone Mar 25, 2022
@Kovah Kovah removed this from the v1.10.0 milestone Apr 4, 2022
@gingerbeardman
Copy link

gingerbeardman commented Apr 4, 2022

I started an import using the web UI, 874K (4,185 bookmarks)

  • can I close the page or not?
  • (it seems I can, this should be stated on the page)

perhaps

  • if the import file is large, recommend using the CLI (as in this thread)

a better import experience could be:

  1. import all the links quickly (all 4,185 entries)
  2. declare import stage 1 complete! (user is impressed and can use their bookmarks immediately)
  3. show notification that additional metadata will be collected in the background
  4. show a progress indicator of some sort in the UI

What do you think?

@Kovah
Copy link
Owner

Kovah commented Apr 4, 2022

I have not worked on the import process because I decided to give it a complete overhaul in the next big version. I know the current process is not good.

If you are using the web UI, the page should not be closed. But there is a CLI command to use instead.

I noted the idea of something like a queue that works of all links after they were stored. I am not just sure how this will look like.

@gingerbeardman
Copy link

Hmm. I closed the page and the import seems to be continuing in the background?

I have reopened LinkAce and it's currently at ~1,380 out of 4,185

@gingerbeardman
Copy link

gingerbeardman commented Apr 5, 2022

My web import stopped at ~1,980 of 4,185 seems odd it would go that long after me closing the page. Maybe there was some disruption to my server.

I started the CLI import a short while ago and it's already at ~3,060 of 4,185. Almost there!

@piegamesde
Copy link

When importing bookmarks, most of them already have description and title and tags (the assumption is that you mainly use import for migrating from a different bookmark manager or the browser).

https://github.com/Kovah/LinkAce/blob/main/app/Actions/ImportHtmlBookmarks.php#L49-L56

A low effort high return improvement would be to only generate metadata if any of the values is actually being used. This should make the import process a lot faster for the common case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Any requests for improvements or new features
Projects
Status: Backlog
Development

No branches or pull requests

4 participants