An experiment to see whether, by reading the texts and images from the Albert offer catalog PDF, tracking their coordinates, and clustering them by proximity, I can scrape the PDF and associate each item's price with its photo.
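A minimal sketch of the proximity clustering idea, assuming the extract step emits plain objects with x, y, width and height bounds in page canvas coordinates (the field names here are illustrative, not the actual output format):

```js
// Assumed item shape: { x, y, width, height, ... } in page canvas coordinates.
function center(box) {
  return { x: box.x + box.width / 2, y: box.y + box.height / 2 };
}

function distance(a, b) {
  const ca = center(a);
  const cb = center(b);
  return Math.hypot(ca.x - cb.x, ca.y - cb.y);
}

// Group every text item under its nearest image, one cluster per image.
function clusterByProximity(images, texts) {
  const clusters = images.map((image) => ({ image, texts: [] }));
  for (const text of texts) {
    let best = null;
    let bestDistance = Infinity;
    for (const cluster of clusters) {
      const d = distance(cluster.image, text);
      if (d < bestDistance) {
        bestDistance = d;
        best = cluster;
      }
    }
    if (best) best.texts.push(text);
  }
  return clusters;
}
```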
Run node extract to generate the data directory with the extracted image and text bounds, then python3 -m http.server and open http://localhost:8000 to view the extracted data.
Right now the page canvas size is hard-coded to 480x840, which looks correct.
Are all of those images identified as masks by their objId? If so, remove them.
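If the extract script uses pdf.js and records which operator painted each image, masks could also be dropped by keying off the operator type rather than the objId; the op field below is an assumption about the extracted data, not something the current output necessarily has:

```js
// Assumes pdf.js and that each extracted image record carries the operator id
// that painted it (OPS.paintImageXObject vs OPS.paintImageMaskXObject).
import { OPS } from "pdfjs-dist";

function isMask(image) {
  return (
    image.op === OPS.paintImageMaskXObject ||
    image.op === OPS.paintImageMaskXObjectGroup
  );
}

// Example: images.filter((image) => !isMask(image));
```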
Provide multiple image candidates for each text, so that texts sitting in the middle appear under both neighbouring images, and hopefully some of those images can then be excluded as masks, graphics, etc.
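One hedged way to do that: instead of keeping only the single nearest image for a text, keep every image whose distance is within some slack of the best, so borderline texts land in both neighbouring clusters. The slack value below is a guess, not a tuned number, and the function reuses distance() from the clustering sketch above:

```js
// Keep every image whose distance is within `slack` canvas pixels of the best.
function candidateImagesFor(text, images, slack = 40) {
  const scored = images
    .map((image) => ({ image, d: distance(image, text) }))
    .sort((a, b) => a.d - b.d);
  if (scored.length === 0) return [];
  const best = scored[0].d;
  return scored.filter(({ d }) => d - best <= slack).map(({ image }) => image);
}
```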
Also consider whether simple bounding-box overlap is enough to associate texts with an image in all or most cases; text pattern matching could do the rest of the job afterwards.
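A sketch of the overlap test plus a price pattern, assuming text items carry a str field as pdf.js text content does; the regex is a guess at the price format, not something verified against the actual catalog text:

```js
// Axis-aligned bounding-box overlap test between a text item and an image.
function overlaps(a, b) {
  return (
    a.x < b.x + b.width &&
    b.x < a.x + a.width &&
    a.y < b.y + b.height &&
    b.y < a.y + a.height
  );
}

// Guess at the price format, e.g. "129,90" or "24.90"; adjust once the real
// text items have been inspected.
const PRICE_RE = /^\d{1,4}([.,]\d{2})?$/;

function pricesForImage(image, texts) {
  return texts
    .filter((text) => overlaps(text, image))
    .filter((text) => PRICE_RE.test(text.str.trim()));
}
```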
Help locate the canvas elements by highlighting them when the user hovers the mouse cursor over their corresponding UI toggles.
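A rough browser-side sketch of that hover behaviour; the element id, toggle class, data attributes, and the redrawPage() helper are all made up for illustration and would need to match whatever the viewer page actually renders:

```js
// Draw a highlight rectangle over the page canvas when a toggle is hovered.
const canvas = document.getElementById("page-canvas"); // assumed element id
const ctx = canvas.getContext("2d");

function highlight(bounds) {
  ctx.save();
  ctx.strokeStyle = "red";
  ctx.lineWidth = 2;
  ctx.strokeRect(bounds.x, bounds.y, bounds.width, bounds.height);
  ctx.restore();
}

document.querySelectorAll(".item-toggle").forEach((toggle) => {
  toggle.addEventListener("mouseenter", () => {
    // Bounds are assumed to be stored on the toggle as data attributes.
    highlight({
      x: Number(toggle.dataset.x),
      y: Number(toggle.dataset.y),
      width: Number(toggle.dataset.width),
      height: Number(toggle.dataset.height),
    });
  });
  toggle.addEventListener("mouseleave", () => {
    redrawPage(); // assumed existing function that repaints the page canvas
  });
});
```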
This should help us isolate only the foreground images, but it will need a threshold setting so that dropping decorative junk doesn't also remove the product shots.
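For that threshold, a simple starting point could be a minimum rendered area relative to the page canvas, so small decorative graphics get dropped while full product shots survive; the 1% cutoff below is an arbitrary starting value to tune against real pages:

```js
// Keep only images covering at least `minFraction` of the page canvas area.
// 480x840 matches the hard-coded canvas size mentioned above.
function foregroundImages(images, pageWidth = 480, pageHeight = 840, minFraction = 0.01) {
  const pageArea = pageWidth * pageHeight;
  return images.filter(
    (image) => (image.width * image.height) / pageArea >= minFraction
  );
}
```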