diff --git a/MuPDF b/MuPDF
index c81feac8..54dea4cc 160000
--- a/MuPDF
+++ b/MuPDF
@@ -1 +1 @@
-Subproject commit c81feac83090e9efc14801fdf255f37f68ad1916
+Subproject commit 54dea4cca713eaf51acea692ab46ba8d191c9a88
diff --git a/docs-src/Notes/Aside/C++ std.unique_ptr for array type auto-delete.md b/docs-src/Notes/Aside/C++ std.unique_ptr for array type auto-delete.md
index b6a79708..02fcab1c 100644
--- a/docs-src/Notes/Aside/C++ std.unique_ptr for array type auto-delete.md
+++ b/docs-src/Notes/Aside/C++ std.unique_ptr for array type auto-delete.md
@@ -158,4 +158,15 @@ int edit_distance_dp_basic(const string& str1, const string& str2)
 	return rv;
 }
 
-```
\ No newline at end of file
+```
+
+---
+
+## Post Scriptum: minor bonus note: `std::unique_ptr(new Obj)` vs. `std::make_unique()` (and their `shared_ptr` cousins)
+
+As seen on a C++ blog (link not immediately at hand while writing this from human memory, sorry) there's one minor performance angle here, although it really belongs to `unique_ptr`'s cousin `std::shared_ptr`: `std::shared_ptr p1(new X2)` is not cache-optimal as it takes two (2) heap allocations: one for the `X2` object and one for the *control block* that `shared_ptr` needs to deliver its gorgeous magic to us, which includes the reference counters (for the `shared_ptr`/`weak_ptr` family relations) and a to-be-dereferenced *pointer* to the `X2` instance now located elsewhere on the heap.
+The `std::make_shared` idiom, on the other hand, only takes a *single* allocation, combining the `X2` instance and the control block into a single heap chunk, thus improving CPU access performance of the `X2` instance as the pointer dereference will very probably land (depending on the size of `X2` and the CPU cache line width) in the very same cache line as the control block, which is 👍 for performance.
+
+The caveat was that `std::shared_ptr p1(new X2)` is only potentially beneficial when your usage is one where you use `.reset()`, `.swap()` or `operator=` to release the hold on your `X2` instance *early*, long before the last `weak_ptr` referring to it goes away: as `std::make_shared` treats object and control block as a single heap chunk, your `X2` destructor may be called just as early, but the actual releasing-back-allocated-memory-to-the-heap activity has to wait until the control block (i.e. the last outstanding `weak_ptr`) is gone as well, potentially increasing peak heap memory usage to undesirable levels as the allocate-release timeline for the heap will be quite different. Otherwise, they argued (and I don't see any flaw in their logic), it's advisable to use `std::make_shared` instead of the `new`/assignment idiom.
+
+For a plain `std::unique_ptr` there is no control block and hence no second allocation: `std::unique_ptr<X2> p1(new X2)` and `std::make_unique<X2>()` both cost exactly one heap allocation. The reasons to prefer `std::make_unique` are rather exception safety (no leak when another argument in the same expression throws, in pre-C++17 argument evaluation scenarios) and not having to write a naked `new`.
+
+Granted, my brain keeps thinking in terms of the outmoded constructor+`new` rather than the `std::make_unique` / `std::make_shared` best practices, so this sub-par idiom may show up in a few places still, after today. Alas. I blame my advancing age.
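+A minimal sketch of the difference (toy `X2` struct, nothing Qiqqa-specific; just the two idioms side by side):
+
+```cpp
+#include <memory>
+
+struct X2 { int payload[16]; };
+
+int main()
+{
+    // unique_ptr: both forms cost exactly one heap allocation (there is no control block).
+    std::unique_ptr<X2> a(new X2);       // works, but with a naked `new`
+    auto b = std::make_unique<X2>();     // preferred: no `new` in sight, exception-safe
+
+    // shared_ptr: this is where the one-vs-two allocations story applies.
+    std::shared_ptr<X2> c(new X2);       // 2 allocations: X2 plus a separate control block
+    auto d = std::make_shared<X2>();     // 1 allocation: X2 fused with its control block
+
+    std::weak_ptr<X2> w = d;
+    d.reset();                           // ~X2() runs right here, but the fused chunk is only
+                                         // returned to the heap once `w` has expired as well
+}
+```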
*(koff koff koff)*😇 diff --git a/docs-src/Notes/Aside/Chrome Browser - colouring a selected chunk of text in an arbitrary web page.md b/docs-src/Notes/Aside/Chrome Browser - colouring a selected chunk of text in an arbitrary web page.md new file mode 100644 index 00000000..f30c35cf --- /dev/null +++ b/docs-src/Notes/Aside/Chrome Browser - colouring a selected chunk of text in an arbitrary web page.md @@ -0,0 +1,55 @@ +# \[useful feature:] Chrome Browser :: colouring a selected chunk of text in an arbitrary web page + +*(Handy when you want to have searched phrases in a larger text highlighted.)* + +I noticed this behaviour for a while in Chrome when I followed links from explanatory paragraphs in Google Search: it's a specially crafted `#` hash part of the URL, e.g. + +`https://en.wikipedia.org/wiki/Attention_(machine_learning)#:~:text=which%20are%20computed` + +Note the `#:~:text=` in that URL: that suffices to have Chrome show the web page with any matched text highlighted in purple for your convenience, e.g. try & click this link: https://en.wikipedia.org/wiki/Attention_(machine_learning)#:~:text=which%20are%20computed + +Once I had realized it's a *browser*-specific thing, here's more info on this browser feature: + +- https://support.google.com/chrome/answer/10256233?hl=en&co=GENIE.Platform%3DDesktop +- https://stackoverflow.com/questions/62161819/what-exactly-is-the-text-location-hash-in-an-url +- https://wicg.github.io/ScrollToTextFragment/ + https://chromestatus.com/feature/4733392803332096 +- **[CanIUse](https://caniuse.com/) feature check**: https://caniuse.com/url-scroll-to-text-fragment --> everybody's got it, in 2024AD, which is a good thing for us, as I can use this in Qiqqa Ⅱ 🥳 + + + +------ + +Quoting https://stackoverflow.com/posts/62162093/timeline: + +## Scroll To Text Fragment + +This is a feature called **[Scroll To Text Fragment](https://chromestatus.com/feature/4733392803332096)**. It is [enabled by default since Chrome 80](https://www.chromestatus.com/features/4733392803332096), but apparently not yet implemented in other browsers. + +There are quite nice examples in the ["W3C Community Group Draft Report"](https://wicg.github.io/ScrollToTextFragment/). More good examples can be found on [Wikipedia](https://en.wikipedia.org/wiki/Fragment_identifier#Examples). + +### Highlighting the first appearance of a certain text + +Just append `#:~:text=` to the URL. The text search is not case-sensitive. + +**Example:** [https://example.com#:~:text=domain](https://example.com/#:%7E:text=domain) [![The word "domain" is highlighted on example.com](https://i.sstatic.net/mHPz1.png)](https://i.sstatic.net/mHPz1.png) + +### Highlighting a whole section of text + +You can use `#:~:text=,` to highlight a whole section of text. 
+ +**Example:** [https://stackoverflow.com/questions/62161819/what-exactly-is-the-text-location-hash-in-an-url/62162093#:~:text=Apparently,Wikipedia](https://stackoverflow.com/questions/62161819/what-exactly-is-the-text-location-hash-in-an-url/62162093#:%7E:text=Apparently,Wikipedia) [![part of this very answer is highlighted](https://i.sstatic.net/fIqVh.jpg)](https://i.sstatic.net/fIqVh.jpg) + +### More advanced techniques + +- Prefixing and suffixing like the [example suggested in the repository for the suggestion](https://github.com/WICG/ScrollToTextFragment/#identifying-a-text-snippet) [https://en.wikipedia.org/wiki/Cat#:~:text=Claws-,Like%20almost,the%20Felidae%2C,-cats](https://en.wikipedia.org/wiki/Cat#:%7E:text=Claws-,Like%20almost,the%20Felidae%2C,-cats) texts as proposed don't seem to work for me (yet? I use Chrome 83). +- You can [style the look of the highlighted text](https://github.com/WICG/ScrollToTextFragment/#target) with the CSS `:target` and you can [opt your website out](https://github.com/WICG/ScrollToTextFragment/#opting-out) so this feature does not work with it anymore. + + + + +-------------- + +Allegedly already available since Chrome 80/83, which is quite a while back. Ah well, browser feature trend watching isn't my forte. 😅 + + + diff --git a/docs-src/Notes/Aside/File last change + last access timestamps' woes.md b/docs-src/Notes/Aside/File last change + last access timestamps' woes.md new file mode 100644 index 00000000..28f5eb06 --- /dev/null +++ b/docs-src/Notes/Aside/File last change + last access timestamps' woes.md @@ -0,0 +1,66 @@ +# x + +See https://github.com/git-for-windows/git/issues/1000#issuecomment-301611003: + +Upon closer inspection using stat test.xls in Git Bash, it would appear that the change time is modified by Excel along with the bytes on disk, but not the modified time. I fear that the described problem is related to the fact that Git for Windows has to take a couple of shortcuts when trying to emulate Linux' semantics. In particular, when the so-called "stat" data (essentially, all the metadata for a give file) is emulated, we use the FindFirstFile()/FindNextFile() API which gives us the time of the last access, the time of the last modification and the creation time. Sadly, that differs slightly from the POSIX semantics that Git wants to see, where the first two times are identical, but the ctime does not refer to the creation time but the change time. But we do not have a fast way to get at the change time, only the access time, modified time and creation time. We could get the change time, via the ChangeTime field in the FILE_BASIC_INFO data structure initialized by GetFileAttributesByHandleEx() function, but that requires a HANDLE, which we can only obtain using CreateFile() (which is orders of magnitude slower than FindFirstFile()/FindNextFile(). So what Git for Windows does is rely on applications to update the modified time when changing any file contents. But that is not the case with Excel. I fear there is not really anything we can do here, not unless we want to slow down Git for Windows dramatically (in most cases, for no good reason)... + +Just FWIW. +It seem that it depend on the FS used (and the underlying OS drivers for that FS). 
+My findings are that: + +``` +#-------------------------------------- +# [a/c/m]time +#-------------------------------------- +# On Windows (via Cygwin & Python3): +# The creation time is: aTime .CreationTime === .LastAccessTime in Poweshell, but known as "access" time in Linux) +# The modification time is: mTime == cTime .LastWriteTime in Poweshell +# +# On Linux: +# The creation time is: cTime +# The modification time is: mTime +# The access time is: aTime (normally not used) +# +# ==> For seeing last modification time, use "cTime" on Windows FS's, and "mTime" on *linux FS's +#-------------------------------------- +``` + +IDK why an _Excel_ file would behave different from any other "Windows" generated file, in this respect. + +... plus: + +https://community.hpe.com/t5/operating-system-hp-ux/what-s-the-difference-of-ctime-and-mtime-in-find-command/td-p/3341256?nobounce + +### Re: what's the difference of -ctime and -mtime in find command? + +mtime refers to the modification time of the file, while ctime refers to a change in the status information of the file. For example, you could use the touch command to alter the date of the file (the status information), without actually changing the file itself. + +At least that's the way I interpret it. + +Pete + +------ + +Pete's right. Even if you can change both change and modification with touch (-c or -m). + +You have 3 dates for a file : +. ctime : change time. It gives you the last time a modification was done on the inode. For example chmod. You can see it with ls -lu file. +. mtime : modification time. It gives the last time the file content was modified. For example with vi. It is the one normally displayed by ls -l. +. atime : access time. It gives you the last time the file was accessed. Even cat modifies this date. + + +### https://nicolasbouliane.com/blog/knowing-difference-mtime-ctime-atime :: Knowing the difference between mtime, ctime and atime + +If you are dealing with files, you might wonder what the difference is between `mtime`, `ctime` and `atime`. + +`mtime`, or modification time, is when the file was last modified. When you change the _contents_ of a file, its mtime changes. + +`ctime`, or change time, is when the file’s property changes. It will always be changed when the mtime changes, but also when you change the file’s permissions, name or location. + +`atime`, or access time, is updated when the file’s contents are read by an application or a command such as `grep` or `cat`. + +The easiest way to remember which is which is to read their alphabetical order: + +- `Atime` can be updated alone +- `Ctime` will update `atime` +- `Mtime` will update both `atime` and `ctime`. diff --git a/docs-src/Notes/Aside/Unicode alternatives for square brackets in MarkDown text.md b/docs-src/Notes/Aside/Unicode alternatives for square brackets in MarkDown text.md new file mode 100644 index 00000000..10a0ee5d --- /dev/null +++ b/docs-src/Notes/Aside/Unicode alternatives for square brackets in MarkDown text.md @@ -0,0 +1,31 @@ +# Unicode ~~replacement~~ ~~alternative~~ *homoglyphs* for \[...\] square brackets in MarkDown + +Because I'm kinda lazy and slightly irritated at having to ugly-`\\`-backslash-escape the square brackets when I want them to appear as-is. + +First, the general I-am-looking-for-??? 
Unicode glyph/codepoint site that works well (best) for me is: https://symbl.cc/ + +But finding homoglyphs when you need/want them is still a bit of a hassle as the added/follow-up problem is: does the codepoint I selected as a homoglyph-du-jour actually *exist* in my display/screen font? Sometimes it doesn't, so the process becomes iterative. Alas. + +Some homoglyph lists: + +- http://xahlee.info/comp/unicode_look_alike_math_symbols.html +- https://gist.github.com/StevenACoffman/a5f6f682d94e38ed804182dc2693ed4b +- https://github.com/codebox/homoglyph +- https://github.com/life4/homoglyphs +- and the inverse: getting back to the base/ASCII form: https://github.com/nodeca/unhomoglyph, which happens to reference the very useful **official set of Unicode Confusables**: [Recommended confusable mapping for IDN](http://www.unicode.org/Public/security/latest/confusables.txt) + +Anyway, let's see what we got for the `[` bracket glyph.... + +- `[`: "\[...]" +- `⟦`: "⟦..." -- https://symbl.cc/en/27E6/ "Mathematical Left White Square Bracket" +- (note the ugly extra left-side whitespace occupied by the codepoint) `〚`: "〚....]" -- https://symbl.cc/en/301A/ "Left White Square Bracket" +- (uglier due to minimal underline-like cruft...) `⦋`: "⦋...]" -- https://symbl.cc/en/298B/ "Left Square Bracket with Underbar" +- - `⦍`: "⦍...]" -- https://symbl.cc/en/298D/ "Left Square Bracket with Tick In Top Corner" +- - `⦏`: "⦏...]" -- https://symbl.cc/en/298F/ "Left Square Bracket with Tick In Bottom Corner" + +and the other one of the 'matched set': + +- `]` +- `⟧`: https://symbl.cc/en/27E7/ "Mathematical Right White Square Bracket" + +(... TODO ....) diff --git a/docs-src/Notes/Aside/Unicode homoglyphs - adversarial Unicode characters.md b/docs-src/Notes/Aside/Unicode homoglyphs - adversarial Unicode characters.md new file mode 100644 index 00000000..476c86a1 --- /dev/null +++ b/docs-src/Notes/Aside/Unicode homoglyphs - adversarial Unicode characters.md @@ -0,0 +1,90 @@ +# Unicode homoglyphs :: adversarial Unicode characters + +The Unicode Consortium has its own place for these: https://util.unicode.org/UnicodeJsps/confusables.jsp?a=-:;/\?!*|%3C%3E{}%27%22abcdefghijklmnopqrstuvwxyz0123456789&r=None & https://www.unicode.org/Public/security/16.0.0/confusables.txt + the other files in https://www.unicode.org/Public/security/16.0.0/ . + +Also do note the recommendations in https://www.unicode.org/reports/tr36/ (*Unicode Security Considerations*), e.g. *3.6 Secure Encoding Conversion* and *3.7 Enabling Lossless Conversion*. + + +## Unicode homoglyphs for Win32/NTFS & UNIX illegal filename characters + + +PHP code based on examples and libraries from phlyLabs Berlin; part of [phlyMail](http://phlymail.com/) +Also thanks to [http://homoglyphs.net](http://homoglyphs.net/) for helping me find more glyphs. + + +| **Char** | **Homoglyphs** | +| ---- | ---- | +| ! | ! ǃ ! | +| " | " ״ ″ " | +| $ | $ $ | +| % | % % | +| & | & & | +| ' | ' ' | +| ( | ( ﹝ ( | +| ) | ) ﹞ ) | +| * | * ⁎ * | +| + | + + | +| , | , ‚ , | +| - | - ‐ - | +| . | . ٠ ۔ ܁ ܂ ․ ‧ 。 . 。  ․  | +| / | / ̸ ⁄ ∕ ╱ ⫻ ⫽ / ノ | +| 0 | 0 O o Ο ο О о Օ O o | +| 1 | 1 1 | +| 2 | 2 2 | +| 3 | 3 3 | +| 4 | 4 4 | +| 5 | 5 5 | +| 6 | 6 6 | +| 7 | 7 7 | +| 8 | 8 8 | +| 9 | 9 9 | +| | | +| : | : ։ ܃ ܄ ∶ ꞉ : ∶  | +| ; | ; ; ; ; | +| < | < ‹ < | +| = | = = | +| > | > › > | +| ? | ? ? 
| +| @ | @ @ | +| [ | [ [ | +| ] | ] ] | +| ^ | ^ ^ | +| _ | _ _ | +| ` | ` ` | +| a | A a À Á Â Ã Ä Å à á â ã ä å ɑ Α α а Ꭺ A a | +| b | B b ß ʙ Β β В Ь Ᏼ ᛒ B b ḅ | +| c | C c ϲ Ϲ С с Ꮯ Ⅽ ⅽ C c | +| d | D d Ď ď Đ đ ԁ ժ Ꭰ ḍ Ⅾ ⅾ D d | +| e | E e È É Ê Ë é ê ë Ē ē Ĕ ĕ Ė ė Ę Ě ě Ε Е е Ꭼ E e | +| f | F f Ϝ F f | +| g | G g ɡ ɢ Ԍ ն Ꮐ G g | +| h | H h ʜ Η Н һ Ꮋ H h | +| i | I i ɩ Ι І і ا Ꭵ ᛁ Ⅰ ⅰ I i | +| j | J j ϳ Ј ј յ Ꭻ J j | +| k | K k Κ κ К Ꮶ ᛕ K K k | +| l | L l ʟ ι ا Ꮮ Ⅼ ⅼ L l | +| m | M m Μ Ϻ М Ꮇ ᛖ Ⅿ ⅿ M m | +| n | N n ɴ Ν N n | +| 0 | 0 O o Ο ο О о Օ O o | +| p | P p Ρ ρ Р р Ꮲ P p | +| q | Q q Ⴍ Ⴓ Q q | +| r | R r ʀ Ի Ꮢ ᚱ R r | +| s | S s Ѕ ѕ Տ Ⴝ Ꮪ S s | +| t | T t Τ τ Т Ꭲ T t | +| u | U u μ υ Ա Ս ⋃ U u | +| v | V v ν Ѵ ѵ Ꮩ Ⅴ ⅴ V v | +| w | W w ѡ Ꮃ W w | +| x | X x Χ χ Х х Ⅹ ⅹ X x | +| y | Y y ʏ Υ γ у Ү Y y | +| z | Z z Ζ Ꮓ Z z | +| { | { { | +| \| | \| ǀ ا | | +| } | } } | +| ~ | ~ ⁓ ~ | +| ß | ß | +| ä | Ä Ӓ | +| ö | ӧ Ö Ӧ | +| | | + + + diff --git a/docs-src/Notes/Aside/tesseract preprocessing - discovering an optimal preprocess.md b/docs-src/Notes/Aside/tesseract preprocessing - discovering an optimal preprocess.md new file mode 100644 index 00000000..71605f85 --- /dev/null +++ b/docs-src/Notes/Aside/tesseract preprocessing - discovering an optimal preprocess.md @@ -0,0 +1,34 @@ +# tesseract preprocessing :: discovering an optimal preprocess + +Preprocessing page images for (tesseract) OCR is a process that's unsolved and will stay unsolved, for there's always yet another image that doesn't react well to any treatment X that you throw at it. + +## The idea... + +I had the idea that using a scripting language for this part would be useful: that way users can mess around and fiddle with the various available preprocessing steps and their parameters and thus proceed towards a preprocess that's more suitable to their particular input. + +Using a DSL (Domain Specific Language) for this is a good idea and has been very useful in the past. While I picked QuickJS as the preferred engine for this (thus JavaScript as a base for the DSL), my inner geek was bikeshedding about JS, TCL, LISP. Which led me to JANET as an alternative, because on closer inspection, I didn't like the way TCL does variables: the `$` seems unnecessary and I'd rather write `awk` instead, *if ya dig*. + +Anyway, the bikeshedding / procrastination there is also due to the fact that deep down I had immediately realized that offering a scripting solution wouldn't actually *solve* the issue of producing improved page images. Nor would any regular photoshop-like GUI, which is why I looked at scantailor and friends and didn't get any arousal out of it: same problem, different approach (clickity-click instead of typiddi-type), same trouble, not moving forward any. + +## The next idea... + +While I don't want the burden of going Deep AI with this - primarily it would guzzle my time and financial means like crazy: training a model is not cheap, au contraire, and since I am not extremely well versed in that particular niche either, I must reckon with the learning curve. More cost, less prospect of success. + +After having looked at non-related material and stumbling across the now-sorely-out-of-hipness Kohonen maps (SOMs), which I happen to like a lot but always struggle a bit with the 2D output mapping to make them work well as a visual aid, I got tripped up by the root of the GA (Genetic Algorithms) tree: another part of CS that's had its day and is now severely uncool if the amount of work happening there is any indication. 
+But this whole thing stuck to me like [catchweed](https://en.wikipedia.org/wiki/Galium_aparine) sticks to cloth (Dutch: *kleefkruid*): what if we could use GA or other "unsupervised" means to find us an improved preprocess for image X? But then what would the 'gene' represent? It's all nice and dandy to brag in the literature about running a GA with 10K parameters, but I'm looking for a limited number of features/parameters/phenotypes/whatever-you-wanna-call-it, as this should be a relatively fast and hopefully *human reviewable* process...
+
+So the idea is this: the gene represents the sequence of image operations to be executed during preprocessing: scaling, contrast mapping, greyscale conversion, binarization, noise reduction, sharpening, etc., where the gene can have variable length and each element in the gene codes for a specific function/algorithm to use.
+If this works, we can then enhance the concept by including the parameterization of these algorithms/functions as part of the 'gene' and using some sort of Monte Carlo process to test several settings. This is not exactly *pure* GA, but it uses some of those ideas, while the problem now shifts to this question:
+since we want this to be an "unsupervised" process, hence no human in the loop, how do we teach the machine to evaluate the test results and rank them, i.e. what do we use for an output quality metric?
+
+We can't know the "ground truth", as this is intended as an *unsupervised* approach, so we need to come up with a metric that approximates ground truth. We cannot just use the OCR engine's reported confidence and turn that into a KPI, as the OCR engine can be quite confident while utterly mistaken at times; more importantly: when the quality is *meh*, the OCR engine spits out all sorts of confidence values and I fear those won't be much different from *noise*, so a big no-no as a decision-making metric.
+As we won't always be OCRing text books, applying a dictionary search + match score is also a bit naive, so then the thought becomes: we did Markov chain analysis once for language detection (using 2-grams); how about we do that again to rank text output as sensible/crazy? That way we can feed the Markov-chain based n-gram ranker *previous* ground truth to have it develop its own kind of n-gram based ground truth table, and then we can apply stuff like TF-IDF to score the actual OCR output against such a *prior* to obtain a (hopefully) useful KPI to drive the GA/MCMC towards a potentially optimal preprocessing solution.
+
+Suppose this works, then the final 'gene' can be stored with the document page for later reproduction; the 'gene' encodes the preprocessing 'script' that must be executed to produce the discovered 'best result'.
+When processing a book, this costly search can be done for a few sample pages and then that same 'gene' can be applied to all pages alike. If the user reviews the output and finds some pages lacking in quality, *hir* should be able to rerun the GA search for these particular pages and store the discovered optimal genes for each page; and if the user wants to try their own hand at it, they can take the gene and manually tweak it to change the preprocessing pipeline to their taste.
+Which we could call *gene programming*.
+
+Anyway, that's the second idea, and currently it feels like I might be able to pull this off...
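+A very rough, compilable toy sketch of what such a 'gene' plus its search loop might look like (all names here are made up for illustration; the `fitness()` stub is where the n-gram/Markov ranking KPI from above would go, and real mutation would also insert/delete ops and mutate their parameters):
+
+```cpp
+#include <vector>
+#include <random>
+#include <algorithm>
+
+// Each gene element selects one preprocessing operation; parameters could be added later.
+enum class ImgOp { Scale, ContrastMap, Greyscale, Binarize, Denoise, Sharpen };
+
+using Gene = std::vector<ImgOp>;   // variable-length description of the preprocessing pipeline
+
+double fitness(const Gene& g)
+{
+    // Placeholder: run the pipeline on a sample page, OCR it, and score the text
+    // against the n-gram/Markov 'prior' described above. Dummy score here.
+    return -static_cast<double>(g.size());
+}
+
+Gene mutate(Gene g, std::mt19937& rng)
+{
+    // point mutation only in this sketch; assumes a non-empty gene
+    std::uniform_int_distribution<int> op(0, 5);
+    std::uniform_int_distribution<size_t> pos(0, g.size() - 1);
+    g[pos(rng)] = static_cast<ImgOp>(op(rng));
+    return g;
+}
+
+Gene evolve(std::vector<Gene> population, int generations, std::mt19937& rng)
+{
+    for (int gen = 0; gen < generations; ++gen) {
+        // rank by fitness, best first
+        std::sort(population.begin(), population.end(),
+                  [](const Gene& a, const Gene& b) { return fitness(a) > fitness(b); });
+        // keep the best half, refill the rest with mutated copies of the survivors
+        for (size_t i = population.size() / 2; i < population.size(); ++i)
+            population[i] = mutate(population[i - population.size() / 2], rng);
+    }
+    return population.front();   // the discovered 'gene' = preprocessing script for this page
+}
+
+int main()
+{
+    std::mt19937 rng(42);
+    std::vector<Gene> pop(8, Gene{ ImgOp::Greyscale, ImgOp::Binarize });
+    Gene best = evolve(pop, 50, rng);
+    return static_cast<int>(best.size());
+}
+```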
+
+
+
diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/Notebook = Dumping ground for ideas/tesseract - using Normalized Discounted Cumulative Gain as a metric during training (and use).md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/Notebook = Dumping ground for ideas/tesseract - using Normalized Discounted Cumulative Gain as a metric during training (and use).md
new file mode 100644
index 00000000..2f6a22f9
--- /dev/null
+++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/Notebook = Dumping ground for ideas/tesseract - using Normalized Discounted Cumulative Gain as a metric during training (and use).md
@@ -0,0 +1,86 @@
+# tesseract / LSTM+CTC/Neural Networks :: using Normalized Discounted Cumulative Gain as a metric during training (and use)?
+
+From what I understand after a quick scan through the NDCG literature, it is a metric that operates on the neural net's output vector, which carries some kind of **ranking**, i.e. it is **not** a probability but rather a **rank** value for each of the categories = items in the vector. They (a.k.a. The Internet Interpreted Through My Reading) write that:
+
+1. **classifiers** (e.g. LSTM) do **not** output probabilities but rather produce **rank** values, which do not translate 1:1 to *probabilities*. In other words: one SHOULD NOT treat these numbers as probabilities, not even after "normalization" (which is, AFAIU, summing them and scaling them so the sum is 1, and '*thus*' \[sic] these scaled values can now be processed as probabilities); that is *wrong*, an over-simplified and incorrect transformation of rankings into probability *perunages*: *enter NDCG* and then, *ta.daah!*, we have a 'proper' transform of rankings to probabilities.[^1]
+2. NDCG serves as a 'proper', i.e. useful, training metric for such classifiers to help determine whether these are performing well. Hmmmmm. Training tesseract has been a serious bother from day one, so *maybe* this may be able to assist a little: I was not feeling very confident yet that the current training mechanism was up to snuff. Another reason to push this one up my priority tree of *Stuff To Implement & Field-Test*!
+
+
+Notes/Thoughts du Jour: my current general perception of NNs is that (almost?) all of them should be interpreted as single/multi-level *classifiers*, **ranking** the output choices, rather than spitting out probabilities for those. ('single' as in A-or-B, e.g. SVM classifiers; 'multi' as in 'output vectors' such as you get from LSTM, whether interpreted through CTC or not, etc.: each element in the vector is one element in your "*output alphabet of choices*", hence I can/do look at the codepoints that make up an LSTM's output vector as a set of categories where each codepoint ("character") is its own category.) We humans read those as a time series a.k.a. "written text".
+
+Then again, suppose I am wrong about this, would it *hurt* to use NDCG? Without having grokked it yet, gut feeling says: heck, even if the output vector *is* a bunch of probabilities, each of those is also valid as a category rank, so we are okay to feed it as a bunch of rankings into NDCG and be a-okay! Ideally we would have arrived at a 1-to-1 direct transformation then, that's all! *shrugs*...
+
+
+> ### What is Normalized Discounted Cumulative Gain?
+>
+>Normalized Discounted Cumulative Gain (NDCG) is a ranking quality metric.
It compares rankings to an ideal order where all relevant items are at the top of the list. NDCG at K is determined by dividing the Discounted Cumulative Gain (DCG) by the ideal DCG representing a perfect ranking. +> \[...] +> Normalized Discounted Cumulative Gain (NDCG) is a metric used in information retrieval to **measure the effectiveness** of search engines, recommendation systems, and other ranking algorithms. It serves as a measure of ranking quality in information retrieval and is normalized so that it is comparable across queries. +> + + +### On Second Thought... + +Having read a little more about it, including [Wikipedia](https://en.wikipedia.org/wiki/Discounted_cumulative_gain), it looks like this is a nice buzz term *I didn't know yet*, but is already happening in some way in tesseract. What *is* interesting, though, is whether tesseract employs those same/similar *logarithmic denominators and position-in-ranking-order-equals-gain/importance type of approach*. + +Consequently this is one more for my personal RTFC must-grok queue... + +---- + +> ### Cons of NDCG +> +> * **Complex calculation**: The computation-intensive nature of the NDCG score calculation, particularly its normalization step, demands additional resources for processing. This complexity may also impede the evaluation process speed in large datasets or real-time systems, a factor that renders it less than ideal for applications demanding swift feedback on ranking performance. +> * **Sensitivity to rank depth**: It significantly prioritizes the top-ranked results and while this is usually desirable, at times relevant items lower down in the list may be undervalued. In scenarios where there’s a more uniform distribution of these relevant items across rankings or when identifying all relevancies trumps their positions it can lead to skewed evaluations due to this particular characteristic. +> * **Dependence on relevance judgments**: The metric’s effectiveness, NDCG in this case, hinges profoundly on the quality and granularity of relevance judgments. These can often be subjective and not to mention challenging to obtain. Such a heavy reliance implies that the accuracy of NDCG scores could bear significant influence from its initial relevance assessment process: it demands careful consideration and potentially extensive manual review to ensure an accurate reflection of user expectations or needs via relevant scores. +> + +----- + +### Alternative metrics, better ones for us? + +To be investigated. + +There's: +- MAP (Mean Average Precision) +- MMR (Mean Reciprocal Rank) +- ... + +Quoting/interpreting from https://www.shaped.ai/blog/evaluating-recommendation-systems-map-mmr-ndcg: + +> MAP and NDCG seem like they have everything \[...] — they both take all relevant items into account, and the order in which they are ranked. **However, where MRR beats them across the board is interpretability.** MRR gives an indication of the average rank of your algorithm, which can be a helpful way of communicating when a user is most likely to see their first relevant item using your algorithm. +> Meanwhile, MAP has some interpretability characteristics in that it represents the area under the Precision-Recall curve. 
**NDCG is hard to interpret because of the seemingly arbitrary log discount within the DCG equation.** The advantage of NDCG is that it can be extended to use-cases when numerical *relevancy* is provided for each item, but it may be hard or even impossible to obtain this 'ground-truth' relevance score in practice, so you don't see it used as much in practice. + +... and there goes my initial euphoria tonight! Still, all this leaves me with the strengthened impression that those CTC output vectors' (the codepoint-in-OCR-model-alphabet) rakings' processing is an area to be researched and tweaking those numbers, as I did a while back, is par for the course. +*sigh* +Yep, same ol', some ol': throw any sufficiently large load of more-or-less 'founded in solid theory'[^2] statistics at the problem and ultimately you end up with... the numeric equivalent of The Wet Finger wind-tasting approach, which one only can **hope** is better and more discriminating (as in: identifying the actual character written there) that your favorite [ouijaboard](https://en.wikipedia.org/wiki/Ouija). + + +------- + + +> **The NDCG is a ranking metric.** Imagine that you predicted a sorted list of 1000 documents and there are 100 relevant documents, the NDCG equals 1 is reached when the 100 relevant docs have the 100 highest ranks in the list. +> +> So 0.8 NDCG is 80% of the best ranking. +> +> This is an intuitive explanation of why the math includes some logarithms. +> + + +See also: [A Theoretical Analysis of NDCG Type Ranking Measures (2013), Yining Wang, Li-Wei Wang, Yuanzhi Li, Di He, Tie-Yan Liu, Wei Chen](https://arxiv.org/abs/1304.6480): + +> A central problem in ranking is to design a ranking measure for evaluation of ranking functions. In this paper we study, from a theoretical perspective, the widely used Normalized Discounted Cumulative Gain (NDCG)-type ranking measures. +> +> Although there are extensive empirical studies of NDCG, little is known about its theoretical properties. We first show that, whatever the ranking function is, the standard NDCG, which adopts a logarithmic discount, converges to 1 as the number of items to rank goes to infinity. This result is surprising at first sight: it seems to imply that NDCG cannot differentiate good and bad ranking functions, contradicting to the empirical success of NDCG in many applications. +> In order to have a deeper understanding of ranking measures in general, we propose a notion referred to as *consistent distinguishability*. This notion captures the intuition that any ranking measure should have such a property: for every pair of substantially different ranking functions, the ranking measure can decide which one is better in a consistent manner on almost all datasets. We show that NDCG with logarithmic discount has *consistent distinguishability* although it converges to the same limit for all ranking functions. +> +> We next characterize the set of all feasible discount functions for NDCG according to the concept of consistent distinguishability. Specifically *we show that whether NDCG has consistent distinguishability depends on how fast the discount decays*, and 1/r is a critical point. We then turn to the cut-off version of NDCG, i.e. NDCG@k. We analyze the distinguishability of NDCG@k for various choices of *k* and the discount functions. Experimental results on real Web search datasets agree well with the theory. 
+> + + + + + + +[^1]: which, incidentally, made it sound like the current tesseract OCR output vector treatment (the LSTM+CTC output) is indeed **not** to be treated as a yet-to-scaled collection of probabilities; I already felt that way, but this might just be me being biased and reading confirmation into that pre-existing bias of mine. Anyhow, I haven't seed any mention of NDCG in the tesseract code yet, so "*to be investigated further, most assuredly!*" + +[^2]: of course, using those statistics' algorithms does serve extremely well the purpose of 'splaining why your system is better than any other. I'm getting cynical at my advancing age. ... Oh, *wait*! Damn! I was born this way! \ No newline at end of file diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Front-End-GUI - why we choose to use Neutralino.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Front-End-GUI - why we choose to use Neutralino.md new file mode 100644 index 00000000..5b4335d3 --- /dev/null +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Front-End-GUI - why we choose to use Neutralino.md @@ -0,0 +1,59 @@ +# Front-End/GUI : why we choose to use Neutralino + +From the [`gluon` github](https://github.com/gluon-framework/gluon) README: + + +**Gluon is a new framework for creating desktop apps from websites**, using **system installed browsers** *(not webviews)* and NodeJS, differing a lot from other existing active projects - opening up innovation and allowing some major advantages. Instead of other similar frameworks bundling a browser like Chromium or using webviews (like Edge Webview2 on Windows), **Gluon just uses system installed browsers** like Chrome, Edge, Firefox, etc. Gluon supports Chromium ***and Firefox*** based browsers as the frontend, while Gluon's backend uses NodeJS to be versatile and easy to develop (also allowing easy learning from other popular frameworks like Electron by using the same-ish stack). + +## Features + +\[...\] +- **Cross-platform** - Gluon works on Windows, Linux, and macOS (WIP) +\[...\] + +## Comparisons + +### Internals + +| Part | Gluon | Electron | Tauri | Neutralinojs | +| ---- | ----- | -------- | ------------ | ----- | +| Frontend | System installed Chromium *or Firefox* | Self-contained Chromium | System installed webview | System installed webview | +| Backend | System installed *or bundled* Node.JS | Self-contained Node.JS | Native (Rust) | Native (Any) | +| IPC | Window object | Preload | Window object | Window object | +| Status | Early in development | Production ready | Usable | Usable | +| Ecosystem | Integrated | Distributed | Integrated | Integrated | + + +### Benchmark / Stats + +Basic (plain HTML) Hello World demo, measured on up to date Windows 10, on my machine (your experience will probably differ). Used latest stable versions of all frameworks as of 9th Dec 2022. (You shouldn't actually use random stats in benchmarks to compare frameworks, this is more so you know what Gluon is like compared to other similar projects.) 
+| Stat | Gluon | Electron | Tauri | Neutralinojs |
+| ---- | ----- | -------- | ------------ | ----- |
+| Build Size | <1MB[^system][^gluon][^1] | ~220MB | ~1.8MB[^system] | ~2.6MB[^system] |
+| Memory Usage | ~80MB[^gluon] | ~100MB | ~90MB | ~90MB |
+| Backend[^2] Memory Usage | ~13MB[^gluon] (Node) | ~22MB (Node) | ~3MB (Native) | ~3MB (Native) |
+| Build Time | ~0.7s[^3] | ~20s[^4] | ~120s[^5] | ~2s[^3][^6] |
+
+*Extra info: All HTML/CSS/JS is unminified (including Gluon). Built in release configuration. All binaries were left as compiled with common size optimizations enabled for that language, no stripping/packing done.*
+
+[^system]: Does not include system installed components.
+[^gluon]: Using Chrome as system browser. Early/WIP data, may change in future.
+
+[^1]: *How is Gluon so small?* Since NodeJS is expected as a system installed component, it is "just" bundled and minified Node code.
+[^2]: Backend like non-Web (not Chromium/WebView2/etc).
+[^3]: Includes Node.JS spinup time.
+[^4]: Built for win32 zip (not Squirrel) as a fairer comparison.
+[^5]: Cold build (includes deps compiling) in release mode.
+[^6]: Using `neu build -r`.
+
+--------------
+
+**And that's exactly why we choose to use Neutralino:** "Backend: Native (Any)", plus the same main advantages as `tauri` or `gluon` (or a whole slew of others, e.g. Chromely, DeskCap, etc., all of which target the market of Electron-without-the-extra-load).
+
+The only alternative we still ponder is using the system-installed browser directly, as we will need that one anyway for Qiqqa Sniffer style PDF grabbing off websites: many sites employ advanced fingerprinting technology anno 2024 to detect anything that is not a standard, fully-up-to-date system browser. That makes a webview-based application of any kind (Electron/Neutralino/...) not much of an advantage any more, except for more available screen real estate and possibly a cleaner UI/UX on desktops and mobiles when we're not actively hunting PDFs on the internet.
+
+I don't know... we need to scan/monitor the download directory anyway, as the user can download/drop a new PDF at *any time*, and dumping PDFs via drag&drop into the browser/application window has to be supported as well, which would mean a local system 'upload' transfer that's rather quick to run: data is copied a few more times on the way into the database / storage directory tree, but that's overhead we can easily handle on modern-ish hardware. Writing the UI code once for web viewing/use helps to move toward 'remote access', i.e. using Qiqqa as a web-server-like backend where folks can access their Qiqqa database from different places as long as they have internet connectivity... (which adds another challenge: securing the data access, but anyhoo... it's either straight to browser or PWA/NeutralinoJS for our future.)
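+As a back-of-the-napkin illustration of that download-directory monitoring bit, a naive polling sketch over `std::filesystem` (purely illustrative: the path is an example, and a real implementation would rather hook the native change notifications such as `ReadDirectoryChangesW` on Windows or `inotify` on Linux):
+
+```cpp
+#include <filesystem>
+#include <set>
+#include <string>
+#include <thread>
+#include <chrono>
+#include <iostream>
+
+namespace fs = std::filesystem;
+
+// Poll a downloads directory and report newly appearing PDF files.
+void watch_downloads(const fs::path& dir)
+{
+    std::set<fs::path> seen;
+    for (;;) {
+        for (const auto& entry : fs::directory_iterator(dir)) {
+            if (!entry.is_regular_file())
+                continue;
+            fs::path p = entry.path();
+            std::string ext = p.extension().string();
+            if (ext == ".pdf" || ext == ".PDF") {
+                if (seen.insert(p).second)
+                    std::cout << "new PDF spotted: " << p << "\n";  // hand off to the import queue here
+            }
+        }
+        std::this_thread::sleep_for(std::chrono::seconds(2));
+    }
+}
+
+int main()
+{
+    watch_downloads("C:/Users/example/Downloads");   // example path, not a real config value
+}
+```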
+
+
+
diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Stuff To Look At/LLM training video transcript.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Stuff To Look At/LLM training video transcript.md
new file mode 100644
index 00000000..ced858d1
--- /dev/null
+++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Stuff To Look At/LLM training video transcript.md
@@ -0,0 +1,74 @@
+# LLM training video transcript + notes (DeepLearning.AI)
+
+Large Language Models with Semantic Search
+
+
+- [Introduction](https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/1/introduction)
+- [Keyword Search](https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/2/keyword-search)
+- [Embeddings](https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/3/embeddings)
+- [Dense Retrieval](https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/4/dense-retrieval)
+- [ReRank](https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/5/rerank)
+- [Generating Answers](https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/6/generating-answers)
+- [Conclusion](https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/7/conclusion)
+- [Course Feedback](https://learn.deeplearning.ai/large-language-models-semantic-search/course-feedback)
+- [Community](https://learn.deeplearning.ai/large-language-models-semantic-search/community)
+
+
+
+## Introduction
+
+Welcome to this short course, Large Language Models with Semantic Search, built in partnership with Cohere.
+
+In this course, you'll learn how to incorporate large language models, or LLMs, into information search in your own applications. For example, let's say you run a website with a lot of articles, picture Wikipedia for the sake of argument, or a website with a lot of e-commerce products.
+
+Even before LLMs, it was common to have keyword search to let people maybe search your site. But with LLMs, you can now do much more. First, you can let users ask questions that your system then searches your site or database to answer. Second, the LLM is also making the retrieved results more relevant to the meaning or the semantics of what the user is asking about.
+
+I'd like to introduce the instructors for this course, Jay Alammar and Luis Serrano. Both Jay and Luis are experienced machine learning engineers as well as educators.
+I've admired for a long time some highly referenced illustrations that Jay had created to explain transformer networks. He's also co-authoring a book, Hands-On Large Language Models. Luis is the author of the book, Grokking Machine Learning, and he also taught Math for Machine Learning with DeepLearning.AI. At Cohere, Jay and Luis, together with Meor Amer, have also been working on a site called LLMU, and have a lot of experience teaching developers to use LLMs. So I was thrilled when they agreed to teach semantic search with LLMs.
+
+Thanks, Andrew. What an incredible honor it is to be teaching this course with you. Your machine learning course introduced me to machine learning eight years ago, and continues to be an inspiration to continue sharing what I learn. As you mentioned, Luis and I work at Cohere, so we get to advise others in the industry on how to use and deploy large language models for various use cases.
We are thrilled to be doing this course to give developers the tools they need to build robust LLM powered apps. We're excited to share what we learned from our experience in the field. Thank you, Jay and Luis. Great to have you with us.  + +This course consists of the following topics.  +First, it shows you how to use basic keyword search, which is also called lexical search, which powered a lot of search systems before large language models. It consists of finding the documents that has the highest amount of matching words with the query.  +Then you learn how to enhance this type of keyword search with a method called re-rank. As the name suggests, this then ranks the responses by relevance with the query.  +After this, you learn a more advanced method of search, which has vastly improved the results of keyword search, as it tries to use the actual meaning or the actual semantic meaning of the text with which to carry out the search. This method is called dense retrieval.  +This uses a very powerful tool in natural language processing called embeddings, which is a way to associate a vector of numbers with every piece of text. Semantic search consists of finding the closest documents to the query in the space of embeddings.  +Similar to other models, search algorithms need to be properly evaluated. You also learn effective ways to do this.  +Finally, since LLMs can be used to generate answers, you also learn how to plug in the search results into an LLM and have it generate an answer based on them. Dense retrieval with embeddings vastly improves the question answering capabilities of an LLM as it first searches for and retrieves the relevant documents and it creates an answer from this retrieved information. Many people have contributed to this course. We're grateful for the hard work of Meor Amer, Patrick Lewis, Nils Reimer, and Sebastian Hofstätter from Cohere, as well as on the DeepLearning.ai side, Eddie Shyu and Diala Ezzedine. In the first lesson, you will see how search was done before large language models. From there, we will show you how to improve search using LLMs, including tools such as embeddings and re-rank. That sounds great. And with that, let's dive in and go on to the next video. + + + + + +## Keyword Search + + +Welcome to Lesson 1. In this lesson, you will learn how to use keyword search to answer questions using a database. Search is crucial to how we navigate the world. This includes search engines, but it also includes search within apps, so when you search Spotify or YouTube or Google Maps for example. Companies and organizations also need to use keyword search or various methods of search to search their internal documents. Keyword search is the method most commonly used to build search systems. Let's look at how we can use a keyword search system and then look at how language models can improve those systems. Now, in this code example, we'll connect to a database and do some keyword search on it. The first cell installs "weaviate=client". You don't need to run this if you're running this from inside the classroom.  But if you want to download this code and run it on your own environment, client, you would want to install "weaviate=client".  The first code cell we need to run loads the API keys we'll need later in the lesson. And then now we can import Weaviate. This will allow us to connect to an online database. We'll talk about this database. Let's talk about what Weaviate is. Weaviate is an open source database. 
It has keyword search capabilities, but also vector search capabilities that rely on language models. The API key we'll be using here is public, this is part of a public demo, so the actual key is not a secret and you can use it and we encourage you to use it to access this demo database. Now that we've set the configurations for authentication, let's look at this code that connects the client to the actual database. Now this database is a public database and it contains 10 million records. These are records from Wikipedia. Each cell, each record, a row in this database, is a passage, is a paragraph from Wikipedia. These 10 million records are from 10 different languages. So one million of them are in English and the other nine million are in different languages. And we can choose and filter which language we want to query, and we'll see that later in this lab. After we run this line of code, we make sure that the client is ready and connected. And if we get true, then that means that our local Weaviate client is able to connect to the remote Weaviate database. And then now we're able to do a keyword search on this data set. Let's talk a little bit about keyword search before we look at the code. So let's say you have the query, what color is the grass? And you're searching a very small archive that has these five texts, these five sentences. One says tomorrow is Saturday, one says the grass is green, the capital of Canada is Ottawa, the sky is blue, and a whale is a mammal. So this is a simple example of search. A high-level look at how keyword search works is to compare how many words are in common between the query and the documents. So if we compare how many words are in common between the query and the first sentence, they only share the word is. And so that's one word they have in common. And we can see the counts of every sentence in this archive. We can see that the second sentence has the most words in common with the query, and so keyword search might retrieve that as the answer. So now that we're connected to the database, let's build out the function that will query it. Let's call it "keyword_search" and we'll be building this and going back and forth. So the simplest thing we'll need to do here is to say "response = (" and then "client.query.get". Now everything we're doing here, this is Weaviate. So this is all defined by the Weaviate API. And it tells us the kind of data, I think the collection we need to add here is called articles. So that's defined in that database. And since we want to do keyword search, let's say before we go into keyword search, let's copy what these properties are like. So these will be, let's say, a list defined with this data set, like this. Every article in this database has a number of properties. What we're saying here is that the results for this search, we want you to return to us the title, the URL, and the text for each result. There are other properties in here, but we don't want the database to return them to us now. Now to do the keyword search part, Weaviate has us type ".with_bm25", and bm25 is this keyword search or lexical search algorithm commonly used, and it scores the documents in the archive versus the query based on a specific formula that looks at the count of the shared words between the query and each document and the way we need to do this is to say "query=query" we will pass to you and the query we need to add to this function so it is a parameter here as well. 
A couple more lines we need to pass to alleviate our ".with_where", so we can have a where clause here that is formatted in a specific way. So what we want to do here is limit this to only English results. And results slang will be something we also add to this definition. So let's say "en". By default, we'll filter by the English language and only look at the English language articles. So that's why we're adding it here as a default, but it's something we can change whenever we call the keyword search method. Now one more line we need to add is to say ".with_limit". So how many results do we want the search engine to retrieve to us? So, we say "num_results" and then we define that here as well, so "num_results". And let's set that by default to say 3. And with that, our query is complete and we just say do and then we close that request. And once we've completed this, we can now get the response and return the result. With this, that is our keyword search function. Now let's use this keyword search function and pass it one query. Say we say, what is the most viewed televised event? We pass the query to the function and then we print it. So let's see what happens when we run this. It goes and comes back and these are the search results that we have. It's a lot of text, we'll go through it, but we can see that it's a list of dictionaries. So let's define a function that prints it in maybe a better way. And this function can look like this "print_result". And with this, we can say, okay, now print it and let me see exactly what the results were. So the first result that we got back is this is the text. This is the passage or paragraph of the text. This is the title, and remember, we're trying to look for what is the most televised event. This does not really look like the correct result very much, but it contains a lot of the keywords. Now, we have another article here about the Super Bowl. This is a better result, so the Super Bowl could probably be a highly televised event. and then there's another result here that kind of mentions the World Cup but it's not exactly the World Cup result.  With each of these you see the URL of that article we can click on it and it will lead us to a Wikipedia page. You can try to edit this query so you can see what else is in this data set but this is a high-level example of the query connecting to the database and then seeing the results. A few things you can try here as well is you need to look at the properties. This is the list of properties that this data set was built using and so these are all columns that are stored within the database. So you can say you're gonna look at how many views a Wikipedia page received. You can use that to filter or sort. This is an estimated figure but then this is the language column that we use to filter, and you can use other values for language. The codes for the other languages look like this. So we have English, German, French, Spanish, Italian, Japanese, Arabic, Chinese, Korean, and Hindi, I believe. So just input one of these and pass it to the keyword search, and it will give you results in that language. Let's see how we can query the database with a different language. So let's say we copy this code. Notice that now we're printing the result here. Let's specify the language to a different language here. I'm going to be using, let's say, German. And we did German here because some words might be shared and we can see here some results. 
So this result for the most televised event is for Busta Rhymes, the musician. But you can see why it brought this as a result, right? Because the word event is here. And then the name of the album mentioned here is event. So the text here and the query that we have shared, they don't have to share all of the keywords but at least some of them. BM25 only needs one word to be shared for it to score that as somewhat relevant. And the more words the query and the document share, the more it's repeated in the document, the higher the score is. But we can see in general, while these results are returned, this is maybe not the best, most relevant answer to this question or document that is most relevant to this query. We'll see how language models help with this. So at the end of the first lesson, let's look back at search at a high level. The major components are the query, the search system, the search system has access to a document archive that it processed beforehand, and then in response to the query the system gives us a list of results ordered by the most relevant to the query. If we look a little bit more closely, we can think of search systems as having multiple stages. The first stage is often a retrieval or a search stage, and there's another stage after it called re-ranking.  Re-ranking is often needed because we want to involve or include additional signals rather than just text relevance. The first stage, the retrieval, commonly uses the BM25 algorithm to score the documents in the archive versus the query. The implementation of the first stage retrieval often contains this idea of an inverted index. Notice that this table is a little bit different than the table we showed you before of the documents. The inverted index is this kind of table that has kind of these two columns. One is the keyword, and then next to the keyword is the documents that this keyword is present in. This is done to optimize the speed of the search. When you enter a query into a search engine, you expect the results in a few milliseconds. This is how it's done. In practice, in addition to the document ID, the frequency of how many times this keyword appears is also added to this call. With this, you now have a good high-level overview of keyword search. Now, notice for this query, what color is the sky, when we look at the inverted index, the word color has the document 804, and the word sky also has the document 804. So 804 will be highly rated from the results that are retrieved in the first stage. From our understanding of keyword search, we can see some of the limitations. So, let's say we have this query, strong pain in the side of the head. If we search a document archive that has this other document that answers it exactly, but it uses different keywords, so it doesn't use them exactly, it says sharp temple headache, keyword search is not going to be able to retrieve this document. This is an area where language models can help, because they're not comparing keywords simply. They can look at the general meaning and they're able to retrieve a document like this for a query like this. Language models can improve both search stages and in the next lessons, we'll look at how to do that. We'll look at how language models can improve the retrieval or first stage using embeddings, which are the topic of the next lesson. And then we'll look at how re-ranking works and how it can improve the second stage. 
And at the end of this course, we'll look at how large language models can generate responses as informed by a search step that happens beforehand. So let's go to the next lesson and learn about embeddings. + + + + + +## Embeddings + + +Welcome to Lesson 2. In this lesson you will learn embeddings. Embeddings are numerical representations of text that computers can more easily process. This makes them one of the most important components of large language models. So let's start with the embeddings lab. This code over here is going to help us load all the API keys we need. Now in the classroom this is all done for you, but if you'd like to do this yourself you would have to pip install some packages, for example, the Cohere one. Other packages you would have to install for the visualizations are umap-learn, Altair, and datasets for the Wikipedia dataset. I'm gonna comment this line because I don't need to run it in this classroom. So next, you'll import the Cohere library. The Cohere library is an extensive library of functions that use large language models and they can be called via API calls. In this lesson we're going to use the embed function but there are other functions like the generate function which you'll use later in the course. The next step is to create a Cohere client using the API key. So first let me tell you what an embedding is. Over here we have a grid with a horizontal and a vertical axis and coordinates, and we have a bunch of words located in this grid as you can see. Given the locations of these words, where would you put the word apple? As you can see in this embedding, similar words are grouped together. So in the top left you have sports, in the bottom left you have houses and buildings and castles, in the bottom right you have vehicles like bikes and cars, and in the top right you have fruits. So the apple would go among the fruits. Then the coordinates for Apple here are 5'5 because we are associating each word in the table in the right to two numbers, the horizontal and the vertical coordinate. This is an embedding. Now this embedding sends each word to two numbers like this. In general, embeddings would send words to a lot more numbers and we would have all the possible words. Embeddings that we use in practice could send a word to hundreds of different numbers or even thousands. Now let's import a package called pandas and we're going to call it "pd". Pandas is very good for dealing with tables of data. And the first table of data that we're going to use is a very small one. It has three words. The word joy, the word happiness, and the word potato which you can see over here. The next step is to create embeddings for these three words. We're going to call them three words emb and to create the embeddings we're going to call the cohere function embed. The embed function takes some inputs. The first one is the data set that we want to embed which is called three words for this table and we also have to specify the column which is called text. Next we specify which of the cohere models we want to use and finally we extract the embeddings from there. So now we have our three words embeddings. Now let's take a look at the vector associated with each one of the words. The one associated with word joy, we're going to call it "word_1". And the way we get it is by looking at "three_words_emb" and taking the first row. Now we're going to do the same thing with "word_2" and "word_3". Those are the vectors corresponding to the words happiness and potato. 
Just out of curiosity, let's look at the first 10 entries of the vector associated with the word joy: that's "word_1", sliced up to its first 10 entries. Now, embeddings don't only work for words; they also work for longer pieces of text, actually for really long pieces of text. In this example here we have embeddings for sentences: each sentence gets sent to a vector, a list of numbers. Notice that the first sentence is "hello, how are you?" and the last one is "hi, how's it going?". They don't share the same words, but they are very similar, and because they're very similar, the embedding sends them to numbers that are really close to each other.

Now, let me show you an example of embeddings. First, we import pandas as "pd"; pandas is a library for handling tables of data. Next, we're going to take a look at a small dataset of sentences. This one has eight sentences, as you can see, and they come in pairs: each one is the answer to the previous one, for example, "What color is the sky?" / "The sky is blue.", "What is an apple?" / "An apple is a fruit." We are going to plot this embedding and see which sentences are close to or far from each other. To turn all these sentences into embeddings we use the embed function from Cohere. So we're going to call the result "emb", and we're going to call the endpoint "co.embed". This function gives us all the embeddings, and it takes some inputs: the first input is the table of sentences that we want to embed (the table is called "sentences", and we have to specify the column, which is called "text"); the next input is the name of the model we're going to use; and finally, we extract the embeddings from the output of this function. The function gives us a long list of numbers for each one of the sentences. Let's take a look at the first 10 entries of the embeddings of each of the first three sentences; they are over here. Now, how many numbers are associated with each one of the sentences? In this particular case it's 4096, but different embeddings have different lengths.

Now we're going to visualize the embedding. For this we call a function from utils called umapplot. umapplot uses the packages umap and altair, and it produces this plot over here. Notice that this plot gives us eight points, in pairs of two, and let's look at what the pairs are. This one over here is "the bear lives in the woods", and the closest sentence is "where does the bear live?", which makes sense because they are quite similar sentences. Let's look at these two over here: "what is an apple" and "an apple is a fruit". Over here we have "where is the World Cup?" and "the World Cup is in Qatar", and over here we have "what color is the sky" and "the sky is blue". So, as you can see, the embedding puts similar sentences at points that are close by, and different sentences at points that are far away from each other.

Notice something very particular: the closest sentence to any question is its corresponding answer. So we could technically use this to find the answer to a question by searching for the closest sentence. This is actually the basis for dense retrieval, which is what Jay is going to teach you in the very next video. Now, feel free to add some more sentences, or change these sentences completely, then plot the embedding and see what it looks like. Now that you know how to embed a small dataset of eight sentences, let's do it for a large dataset.
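That "closest sentence to a question is its answer" observation is exactly what dense retrieval exploits. A small illustration with plain NumPy; the vectors below are random placeholders standing in for the co.embed output, so only the mechanics (cosine similarity, picking the nearest neighbour) carry over:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means 'pointing the same way'."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sentences = ["What color is the sky?", "The sky is blue.",
             "What is an apple?", "An apple is a fruit."]

# Placeholder vectors; with real embeddings each row would have e.g. 4096 entries.
emb = np.random.default_rng(0).normal(size=(len(sentences), 8))

query_vec = emb[0]                      # embedding of "What color is the sky?"
candidates = range(1, len(sentences))   # every other sentence
best = max(candidates, key=lambda i: cosine_similarity(query_vec, emb[i]))

# With real embeddings, "The sky is blue." would be the expected winner.
print("closest sentence:", sentences[best])
```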
We're going to work with a big dataset of Wikipedia articles. Let's load the following dataset: it has a bunch of articles with the title, the text of the first paragraph, and the embedding of that first paragraph, and it contains 2,000 articles. We import NumPy and a function that will help us visualize this in a plot very similar to the previous one; we bring the embeddings down to two dimensions so that they are visible to us. The embedding is over here, and notice that similar articles sit in similar places: over here you can find a lot of languages, here countries, over here a lot of kings and queens, here a lot of soccer players, and over here artists. Feel free to explore this embedding and try to find where the topics are located. And that's it for embeddings. In the next lesson, with Jay, you will use embeddings to do dense retrieval, that is, to search for the answer to a query in a big database.

### My Notes

Looks like clustering (k-means, etc.) over a vector space to me. All examples use a pre-trained language model, but my issue is how to train a LOCAL language model like that: the design/system requirement for me is: no external companies, no remote-access APIs, information does not leave the room ("anonymization" is, IMO, flawed, as shown by the various copyright-infringement complaints against ChatGPT et al., where people are able to reproduce content they wrote/created themselves: chopped up, but still retrievable from the foreign database).
Anyway, it looks to me like "embeddings" is teaching a neural network which words and phrases are related (hence same cluster, same neighbourhood); with this I can see how a search based on the "embeddings", i.e. the cluster identifiers ("coordinates" if you will), can produce more useful answers, as it is no longer an FTS based on direct word/phrase ngrams: IFF you train the embeddings model to closely associate, say, "coprolites" and "fossilized dung", then a search for "coprolites" would also rank highly any documents that do NOT use that word but talk about "fossilized dung" instead. (Bonus points if your training included "excrement" + "dung" clustering as well, for then the "embeddings" vector would also hit documents that say "fossilized excrement", and their rank would jump to the top accordingly.)

What I have been pondering for some time now is how to get such a model trained "as you go", *locally*: explicitly creating a training set for the embeddings database is beyond the means of most, unless^📕, perhaps, one can go about it something like this: start with classical FTS (based on ngrams/skipgrams, as I already intended), get your hits (documents), take the highest-ranked keywords for each hit document and find matching documents for those keywords as "similar documents"; use that as a training set to help *cluster* similar documents. This would have to be unsupervised training for it to scale beyond a very low number of documents; the auto-trained clustering info can then be used to "find" additional high-ranking documents (hopefully; that would mean we got our training + ranking right for this new approach) to be added to our original query results. A rough sketch of this bootstrap idea follows below.


📕: this was the idea, but I don't know yet how to make this happen on average local hardware, as this stuff has all the smells of exploding storage and CPU costs; something to test and experiment with further.
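A rough sketch of that bootstrap idea (hypothetical helper code, not an existing Qiqqa component): rank each document's terms, treat documents that share several of a hit's top keywords as "similar", and emit those pairs as a weak "same cluster" training signal. A real implementation would reuse the FTS engine's own ranking (BM25/TF-IDF over ngrams/skipgrams); a stop-word-filtered term frequency stands in here.

```python
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {"a", "an", "the", "is", "are", "of", "by", "and", "also", "called"}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def top_keywords(text, k=5):
    """Highest-ranked content words of one document (stand-in for the FTS ranking)."""
    tf = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    return {term for term, _ in tf.most_common(k)}

def similar_pairs(docs, min_shared=2):
    """Yield (id_a, id_b) pairs sharing enough top keywords: weak 'same cluster' signal."""
    tops = {doc_id: top_keywords(text) for doc_id, text in docs.items()}
    for a, b in combinations(sorted(docs), 2):
        if len(tops[a] & tops[b]) >= min_shared:
            yield a, b

docs = {
    1: "coprolites are fossilized dung studied by paleontologists",
    2: "fossilized excrement is also called coprolites",
    3: "sky color is caused by rayleigh scattering of sunlight",
}
# Expect documents 1 and 2 to pair up; such pairs would feed the clustering/embedding step.
print(list(similar_pairs(docs)))
```

Whether pairs like these are a strong enough signal to train useful embeddings on John Doe hardware is exactly the open question flagged in the footnote above.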
Couple of things:
- a query will produce search results, but for response-latency reasons we don't want to limit that to classic REST query-response cycles; we would rather have a query process that keeps running kind of "in the background", spitting out results as/when they are discovered. The second key element in this process is that results will NOT necessarily show up in ranking order: each answer will come with a score, and the client side will have to merge the new search result in and re-sort the list on display. This also means the UI behaviour must be tolerant of result-list updates arriving while users go through / select / mark / do other things with the list: "live updates while you work".
- as the unsupervised training would not, on its own, produce very well-defined clusters from the start, the clustering (model output / training result) should only gradually boost the ranking of related "embedding-based" matches. The "learn as you go" idea that sits at the back of my head is becoming clearer now, but is still an unsolved problem: what I need to do is find a way to gradually improve the training by taking user feedback about the query responses. If I can get people to VERY QUICKLY (low user effort/cost!) mark/tag search results for each (or many/several) of their queries, I can "construct" a query-response training set from that to improve the embeddings being discovered: this becomes semi-supervised training in a round-about way, as it becomes a "human sometimes in / influenced the loop".
- The lesson mentions using a 4K-wide embeddings vector for the demo database; for a generic language model (such as the English one they're using), this size makes sense to me as a near-minimum for learning the meanings of words ("semantics"): the average human active vocabulary sits at around 2K+ (higher ed) / 1-1.5K for "everyone", depending on which investigation/report you pick, so that's 2K wide for allowing a machine to learn the meaning of words, thesaurus/dictionary style. That leaves 2K for idioms, sayings and other (partial-)sentence-like mappings/relationships. However, training such an animal would come at considerable cost when done on local John Doe hardware; I don't have the equipment for that either, so I'm looking at much-reduced language models. Hm, "good enough" should be redefined as "lightly better" in my mind, perhaps... :-)


diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Tools To Be Developed/bezoar - OCR and related document page prep.md index 97511945..25ac1959 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Tools To Be Developed/bezoar - OCR and related document page prep.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Tools To Be Developed/bezoar - OCR and related document page prep.md @@ -8,7 +8,7 @@ 3. *hopefully* add the ability to manipulate these behaviours through user-provided *scripting* for customized behaviours for individual input files. This SHOULD include arbitrary *page image* preprocessing flows, inspired on the algorithms included in `unpaper` and `libprecog` (a.k.a. `PRLib`).
- The key to this (scriptable) flow is that image (pre)processing is just not a single forward movement, but should allow for a preprocessing *graph* to be set up so we can create masks, etc., which are then to be applied to later stages in hat preprocessing graph + The key to this (scriptable) flow is that image (pre)processing is just not a single forward movement, but should allow for a preprocessing *graph* to be set up so we can create masks, etc., which are then to be applied to later stages in that preprocessing graph. 1. Also add the feature to tesseract to provide a separate image for the *segmentation phase* in that codebase and/or override that phase entirely by allowing external processes (or the preprocessor) to deliver a list of segments (*bboxes*) to be OCR-ed. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 018-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 018-of-N.md index cf4957b1..feb9ef77 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 018-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 018-of-N.md @@ -2,96 +2,6 @@ - -## Unicode homoglyphs for Win32/NTFS & UNIX illegal filename characters - - -PHP code based on examples and libraries from phlyLabs Berlin; part of [phlyMail](http://phlymail.com/) -Also thanks to [http://homoglyphs.net](http://homoglyphs.net/) for helping me find more glyphs. - - -| | | -| ---- | ---- | -| **Char** | **Homoglyphs** | -| | ᅟ ᅠ                     ㅤ | -| ! | ! ǃ ! | -| " | " ״ ″ " | -| $ | $ $ | -| % | % % | -| & | & & | -| ' | ' ' | -| ( | ( ﹝ ( | -| ) | ) ﹞ ) | -| * | * ⁎ * | -| + | + + | -| , | , ‚ , | -| - | - ‐ - | -| . | . ٠ ۔ ܁ ܂ ․ ‧ 。 . 。  ․  | -| / | / ̸ ⁄ ∕ ╱ ⫻ ⫽ / ノ | -| 0 | 0 O o Ο ο О о Օ O o | -| 1 | 1 1 | -| 2 | 2 2 | -| 3 | 3 3 | -| 4 | 4 4 | -| 5 | 5 5 | -| 6 | 6 6 | -| 7 | 7 7 | -| 8 | 8 8 | -| 9 | 9 9 | -| | | -| : | : ։ ܃ ܄ ∶ ꞉ : ∶  | -| ; | ; ; ; ; | -| < | < ‹ < | -| = | = = | -| > | > › > | -| ? | ? ? | -| @ | @ @ | -| [ | [ [ | -| \| \| | -| ] | ] ] | -| ^ | ^ ^ | -| _ | _ _ | -| ` | ` ` | -| a | A a À Á Â Ã Ä Å à á â ã ä å ɑ Α α а Ꭺ A a | -| b | B b ß ʙ Β β В Ь Ᏼ ᛒ B b ḅ | -| c | C c ϲ Ϲ С с Ꮯ Ⅽ ⅽ C c | -| d | D d Ď ď Đ đ ԁ ժ Ꭰ ḍ Ⅾ ⅾ D d | -| e | E e È É Ê Ë é ê ë Ē ē Ĕ ĕ Ė ė Ę Ě ě Ε Е е Ꭼ E e | -| f | F f Ϝ F f | -| g | G g ɡ ɢ Ԍ ն Ꮐ G g | -| h | H h ʜ Η Н һ Ꮋ H h | -| i | I i ɩ Ι І і ا Ꭵ ᛁ Ⅰ ⅰ I i | -| j | J j ϳ Ј ј յ Ꭻ J j | -| k | K k Κ κ К Ꮶ ᛕ K K k | -| l | L l ʟ ι ا Ꮮ Ⅼ ⅼ L l | -| m | M m Μ Ϻ М Ꮇ ᛖ Ⅿ ⅿ M m | -| n | N n ɴ Ν N n | -| 0 | 0 O o Ο ο О о Օ O o | -| p | P p Ρ ρ Р р Ꮲ P p | -| q | Q q Ⴍ Ⴓ Q q | -| r | R r ʀ Ի Ꮢ ᚱ R r | -| s | S s Ѕ ѕ Տ Ⴝ Ꮪ S s | -| t | T t Τ τ Т Ꭲ T t | -| u | U u μ υ Ա Ս ⋃ U u | -| v | V v ν Ѵ ѵ Ꮩ Ⅴ ⅴ V v | -| w | W w ѡ Ꮃ W w | -| x | X x Χ χ Х х Ⅹ ⅹ X x | -| y | Y y ʏ Υ γ у Ү Y y | -| z | Z z Ζ Ꮓ Z z | -| { | { { | -| \| | \| ǀ ا | | -| } | } } | -| ~ | ~ ⁓ ~ | -| ß | ß | -| ä | Ä Ӓ | -| ö | ӧ Ö Ӧ | -| | | -| | | - - - - - ## PCA, PPA, SVD, LCA, auto-encoder, etc: dimension reductions for search, clustering, topic analysis, ... 
Paper: Empirical comparison between autoencoders and traditional dimensionality reduction methods diff --git a/thirdparty/DirScanner b/thirdparty/DirScanner index 8c72b4c0..f9086236 160000 --- a/thirdparty/DirScanner +++ b/thirdparty/DirScanner @@ -1 +1 @@ -Subproject commit 8c72b4c0baf9702b573b9c002757ebfd97fb358b +Subproject commit f9086236db0b75be56c8cf2d76273c29355e371b