regenerated doc/site again with experimental deGaulle code.
GerHobbelt committed Feb 6, 2021
1 parent a52e64d commit 3e0f248
Showing 260 changed files with 19,861 additions and 106,348 deletions.
@@ -4,29 +4,7 @@
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">

<p>If you want to do more with PDFs than merely <em>read</em> them on screen<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>, there’s trouble ahead.</p>
<h2>Searchable text anyone?</h2>
<p>Here are a few articles which discuss the banana peels on your path when you need to <em>extract data from the PDF</em>, whether it’s for search or other purposes. Things are <em>not</em> rosy even when you don’t need OCR to get anything legible, as the next article shows: here’s a group of journalists who wrestle with this file format on a daily basis:</p>
<p>Heart of Nerd Darkness: Why Updating Dollars for Docs Was So Difficult</p>
<p>–Quote–</p>
<p>&quot;Compiling the data for it has been an enormous project right from the beginning. After we published the first version, the original developer on the project, Dan Nguyen, compiled all of the things he had to learn into a guide to scraping data. This year’s update took more than eight months of full-time work by me, working with other news-app developers, and at times with our CAR team, a researcher, two editors and two health-care reporters. It was a massive effort and presented huge technical and journalistic challenges.</p>
<p>After we launched, my editor pulled me aside and asked what was so hard about Dollars for Docs. What follows is my answer.</p>
<p>PDFs Considered Harmful</p>
<p>–Quote End–</p>
<p><a href="https://www.propublica.org/nerds/heart-of-nerd-darkness-why-dollars-for-docs-was-so-difficult">https://www.propublica.org/nerds/heart-of-nerd-darkness-why-dollars-for-docs-was-so-difficult</a></p>
<p>(source: <a href="https://softwarerecs.stackexchange.com/questions/18728/pdf-content-extraction-software">https://softwarerecs.stackexchange.com/questions/18728/pdf-content-extraction-software</a>)</p>
<p>Bulletproofing Your Data</p>
<p><a href="https://github.com/propublica/guides/blob/master/data-bulletproofing.md">https://github.com/propublica/guides/blob/master/data-bulletproofing.md</a></p>
<p>How to attach raw data to a PDF:
<a href="https://www.youtube.com/watch?v=CKDWr1h8Y9c">https://www.youtube.com/watch?v=CKDWr1h8Y9c</a></p>
<p>This almost never happens, not even in research/university circles, so you’re bound to need tools. And then it’s a statistics game: how good are your tools, how much effort are you willing to spend, and what is the target accuracy/veracity of your extracted data?</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li tabindex="-1" id="fn1" class="footnote-item"><p>For reading there’s Adobe Acrobat, and if you don’t like it, there’s a plethora of PDF <em>Viewers</em> out there, e.g. SumatraPDF. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>


</head>
<body>
@@ -4,38 +4,7 @@
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">

<h1>Contributing to Qiqqa :: What Can You Do To Help?</h1>
<p><a href="./toc.html">toc</a></p>
<h2>Preface</h2>
<p>We need more than just software engineers / programmers!</p>
<p>There’s plenty to do in other fields while working on improving Qiqqa:</p>
<ul>
<li>documentation for users</li>
<li>documentation for testers and developers</li>
<li>testing</li>
<li>helping out with user questions</li>
<li>…</li>
</ul>
<h2>TBD</h2>
<hr>
<p><strong>ripped raw from email: TODO: copy-edit this material</strong></p>
<hr>
<p>Thanks for the offer! Not everyone has to be a programmer to help out. Rather not, I’d say!</p>
<p>First of all, any user feedback is appreciated: positive feedback helps to keep me motivated 😃</p>
<p>It’s also a great help if you test Qiqqa (experimental) releases and report any bugs/usability/weirdness observations to the github issue tracker. (You’ll need to create a github account for that, but github is not reserved for IT folk 😉 )</p>
<p>In the near future there’s going to be work done on:</p>
<ol>
<li>Qiqqa documentation (served at the new domain <a href="http://qiqqa.org">qiqqa.org</a> (<a href="https://qiqqa.org/">https://qiqqa.org/</a>) but the source material is managed at github); again there’s lots of help I could use: review, writing, copy editing, …</li>
<li>creating small Technology Tests, which are (relatively) small applications meant to test a specific bit of technology before I can incorporate that into Qiqqa. Testing those executables on any hardware that will be/should be running Qiqqa is also a great help, as I have found that the really worrisome bugs only surface when you run your stuff “someplace else”, hence I cannot emphasize enough the importance of bug reports in the issue tracker at github. Any data you can provide with a bug report, including screenshots and/or (zipped) copies of log files, will help me find out what’s going on both quicker and more reliably.</li>
</ol>
<p>For your information, so you get a bit of a feel for the ‘speed of development’ here:</p>
<p>“In the near future” means I plan on getting some first results on these subjects in Q2 2020 (so April/May/June); if we’re lucky there’s some quick progress, but expect periods of slow going as my attention has to turn elsewhere, so, on average, progress should be reckoned in weeks or months rather than days. You will see bursts of activity and some bouts of silence - real life can take over hard and fast sometimes.</p>
<p>I am happy with any time and effort you’re willing to spend on Qiqqa, so have a look around at the github project site. Re the documentation effort: if you’re interested in helping out there, say in a few weeks, note that you don’t even need to install <code>git</code> for that: you can fork and then use the github website to edit the text files and save your work on-line. Of course there’s more that can be done when you can work with a local copy of the project, but I just wanted to mention that the entry level is really low if you’re copywriting.</p>
<p>If you want to help out with the documentation, first order of the day would be setting up a github account for yourself (easy and free of charge), then either ‘fork’ the repository (so you have a personal copy of the source data to work with) or maybe create a small test project of your own to get used to the github website and the text editing facilities offered on-line. That would be entry level and there’s plenty you can accomplish at this “entry level”.</p>
<p>If you want to help out with testing Qiqqa and/or the technology test applications, you’ll be more dependent on me &amp; my rate of output as I’ll have to provide the binaries and installers on-line as releases for you to download them, but then all you need is a github account for providing feedback and maybe a zip application (e.g. 7-zip) when you want to upload log files of the test runs. I can send you a notification when new tests/versions are on-line via the google groups mailing list I just set up today.</p>
<p>Let me know what you think and what might interest you, then we can work something out re process.</p>
<hr>
<h2></h2>


</head>
<body>
@@ -4,46 +4,7 @@
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">

<h1>How to locate your Qiqqa Base Directory</h1>
<p>This is the place where Qiqqa stores all your libraries and works on them. Intranet and Web Libraries are also stored in here (as “local work copies”).</p>
<p>TBD: copyedit this copy/pasta’d article.</p>
<hr>
<p>First things first: what I need is a copy of your <em>local machine’s qiqqa library/libraries</em>: that is the library your Qiqqa Windows software shows you and operates on when you search the library, add to it, etc.</p>
<p>To find out where your Qiqqa software keeps the library files on your PC, follow the sequence of screenshots below to find the directory; in the screenshots that is <code>D:\Qiqqa\base\</code>. You will need that path to find out where all the Qiqqa data is, so you can package it and send it to me (and later get stuff from me and add that to your Qiqqa library set).</p>
<p>In Qiqqa</p>
<p><img src="./images/htlyqbd1.png" alt="[screenshot]"></p>
<p>click on the Tools button as shown below:</p>
<p><img src="./images/htlyqbd2.png" alt="[screenshot]"></p>
<p>to get a dropdown menu, where you click on the Configuration item:</p>
<p><img src="./images/htlyqbd3.png" alt="[screenshot]"></p>
<p>to see the Configuration Tab in Qiqqa as below. Scroll down if necessary to find the System item line and ‘unfold’ it by clicking on the ‘+’ icon to the right:</p>
<p><img src="./images/htlyqbd4.png" alt="[screenshot]"></p>
<p>This will show you the system details, including the base path to ALL Qiqqa libraries on your PC: in my case that’s the indicated path on the D: drive of my PC:</p>
<p><img src="./images/htlyqbd5.png" alt="[screenshot]"></p>
<p>You can copy that path and then paste it into Windows Explorer. You can open a new Windows Explorer window in various ways, e.g. by RIGHT-clicking on the icon (1) in the Windows Task Bar and then, in the popup menu, clicking on the ‘File Explorer’ line (number 2 in the screenshot below),</p>
<p><img src="./images/htlyqbd6.png" alt="[screenshot]"></p>
<p>after which you get a new Windows Explorer window in which you can either paste that base directory in the bar at the top of that window or browse to that base directory in the usual way, until you end up in that Qiqqa base directory and see something like this:</p>
<p><img src="./images/htlyqbd7.png" alt="[screenshot]"></p>
<p>Note the blue-selected directory path at the top there; that’s where I pasted that path copied from the Qiqqa Configuration panel.</p>
<p>The slightly more technical bit now is to find out which library is the local copy of that XXXXX library you invited me to on Qiqqa:
with a bit of luck one of your directories in there is also given the cryptic name</p>
<pre><code>C9F0D079-EE4C-4A15-8547-72164A7A356D
</code></pre>
<p>but to make sure (or in case that directory name is NOT available on your PC: I am guessing here) you need to look inside those directories and look for a file named</p>
<pre><code>Qiqqa.known_web_libraries
</code></pre>
<p>In my case, that file is inside the 3A614D8C-0882-4CAB-8FBD-A9E494093283 directory but in your case it’ll certainly sit in another directory.</p>
<p>(Note: no need to look deep inside those folders; when you look in them, you’ll notice almost all of those cryptic directories have a file called
<code>Qiqqa.library</code>
in them and we’re only interested in those directories. Most of them will have subdirectories called ‘documents’ (which is where Qiqqa stores all your collected PDFs) and ‘index’ (which is where the search index database for Qiqqa is stored), but right now we don’t care as we’re only looking for that
<code>Qiqqa.known_web_libraries</code>
file.)</p>
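<p>The manual search described above can also be sketched as a small script. This is only a rough sketch under assumptions: the function names <code>find_library_dirs</code> and <code>find_known_web_libraries</code> are made up for illustration, and it assumes (as the text describes) that the library directories sit directly below the base directory. It only reads directory listings and never writes anything.</p>

```python
from pathlib import Path


def find_library_dirs(base_dir):
    """Return the subdirectories directly below base_dir that contain a Qiqqa.library file."""
    base = Path(base_dir)
    return sorted(
        p for p in base.iterdir()
        if p.is_dir() and (p / "Qiqqa.library").is_file()
    )


def find_known_web_libraries(base_dir):
    """Return every Qiqqa.known_web_libraries file found one level below base_dir."""
    return sorted(Path(base_dir).glob("*/Qiqqa.known_web_libraries"))


# Hypothetical usage, with the base path shown in the screenshots:
#   for d in find_library_dirs(r"D:\Qiqqa\base"):
#       print(d.name)
#   print(find_known_web_libraries(r"D:\Qiqqa\base"))
```

<p>Substitute the base path shown in your own Configuration tab for the example path.</p>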
<p>When found, you can open it in a text editor (Notepad) and it will look like some chunks of text, interspersed with unintelligible characters. No worries, just have a look and NEVER choose ‘save’ if Notepad asks: Qiqqa should be the ONLY application writing to that file! But we can have a look: you’ll see something like this:</p>
<p><img src="./images/htlyqbd8.png" alt="[screenshot]"></p>
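<p>If you would rather not open the file in an editor at all, a read-only peek can be sketched as follows. The file format is not documented here, so this simply assumes that the readable chunks are plain ASCII text, and the helper name <code>printable_runs</code> is made up for illustration. The file is opened for reading only, so there is no risk of saving over it.</p>

```python
import re
from pathlib import Path


def printable_runs(path, min_len=4):
    """Return the runs of printable ASCII characters (at least min_len long) found in a file."""
    data = Path(path).read_bytes()  # read-only access: the file is never modified
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, data)]


# Hypothetical usage:
#   for text in printable_runs(r"D:\Qiqqa\base\<directory>\Qiqqa.known_web_libraries"):
#       print(text)
```

<p>The output should contain the library names and descriptions, with the directory ‘hash’ somewhere near each of them.</p>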
<p>We find the XXXXX library name and description and then pick the corresponding ‘hash’ which is used by Qiqqa for the directory name: highlighted in blue in the screenshot below:</p>
<p><img src="./images/htlyqbd9.png" alt="[screenshot]"></p>
<p>No need to be super-precise here: all we need this for is a hint as to which of those cryptic directory names is the one storing the XXXXX library we are looking for. It’s enough to recognize that one of those directory names STARTS with the same “C9” characters as the “XXXXX” line in that “known_web_libraries” file: recognizing the first two characters is enough, as none of the other directories in that base directory starts with “C9”.</p>
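<p>The prefix check can be sketched in a few lines as well. The helper name <code>match_dirs_by_prefix</code> is made up for illustration; the comparison ignores upper/lower case, which is an assumption on my part. If the returned list has exactly one entry, the prefix was enough to identify the directory; if it has more, use a longer prefix.</p>

```python
from pathlib import Path


def match_dirs_by_prefix(base_dir, prefix):
    """Return the names of subdirectories of base_dir whose name starts with prefix (case-insensitive)."""
    wanted = prefix.upper()
    return sorted(
        p.name for p in Path(base_dir).iterdir()
        if p.is_dir() and p.name.upper().startswith(wanted)
    )


# Hypothetical usage, with the base path shown in the screenshots:
#   print(match_dirs_by_prefix(r"D:\Qiqqa\base", "C9"))
```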
<p>Now that we have the location where the entire library is stored on your local PC, you can create a copy or backup of it and send it or back it up anywhere.</p>
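<p>Packaging the library directory into a single zip file could look like the sketch below. The function name <code>backup_library</code> and the example paths are assumptions made for illustration, and it is probably wise (also an assumption) to close Qiqqa first so no files change while they are being copied. Write the archive somewhere <em>outside</em> the library directory itself.</p>

```python
import shutil
from pathlib import Path


def backup_library(library_dir, archive_stem):
    """Zip one Qiqqa library directory and return the path of the created .zip file.

    archive_stem is the output path without the '.zip' extension.
    """
    library_dir = Path(library_dir)
    # Sanity check: the text above says every library directory holds a Qiqqa.library file.
    if not (library_dir / "Qiqqa.library").is_file():
        raise FileNotFoundError(f"no Qiqqa.library file found in {library_dir}")
    # Keep the cryptic directory name as the top-level folder inside the archive.
    return shutil.make_archive(
        str(archive_stem),
        "zip",
        root_dir=library_dir.parent,
        base_dir=library_dir.name,
    )


# Hypothetical usage:
#   backup_library(r"D:\Qiqqa\base\C9F0D079-EE4C-4A15-8547-72164A7A356D",
#                  r"D:\backups\my-qiqqa-library")
```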


</head>
<body>