Documentation reorg #2
Also, see https://github.com/archivesunleashed/aut-docs-new/blob/doc-reorg/current/index.md. I've broken down each type of analysis into its own page.
Small changes, but overall I like the direction the structure is going in.
...it'll be easier to sort out what needs to be done on archivesunleashed/aut#223 with this new structure. 🤝 @lintool
@ruebot I think those issues were in the previous version, but I fixed them anyway. Note that I haven't org'ed text, link, and image analysis in the "How do I..." format.
At some point in time: https://github.com/archivesunleashed/aut-docs-new/blob/doc-reorg/current/collection-analysis.md#to-take-or-to-save "To Take or To Save" might get its own page, as it's applicable to every script... and we can link to it. Same treatment with "Filters".
Just minor things, which can be taken or left!
@@ -0,0 +1,48 @@
## Image Analysis

AUT supports image analysis, a growing area of interest within web archives.
I've been trying to move us away from acronyms all over the place, so maybe just The Toolkit
### Most frequent image URLs in a collection
This script is so dated - and now that we can just write the images out directly rather than having to wget them, maybe stick with just that?
### Extraction of Simple Site Link Structure

If your web archive does not have a temporal component, the following Spark script will generate the site-level link structure.
`Spark` -> `scala` (just to keep things clear)?
.saveAsTextFile("plain-text/")
```

If you wanted to use it on your own collection, you would change "src/test/resources/arc/example.arc.gz" to the directory with your own ARC or WARC files, and change "out/" on the last line to where you want to save your output data.
This line of explanation might be superfluous now? In any case, we should change `src/test/resources/arc/example.arc.gz` to just `example.arc.gz` to reflect the script above (this was probably in the original!).
Note that this will create a new directory to store the output, which cannot already exist.
Maybe these generic examples around manipulating code should just go in one place at the beginning of the docs?
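If the output-directory note quoted above does move to one shared page, a small pre-flight check could sit next to it. The following is only a sketch in plain Scala: `outputDirIsFree` is a hypothetical helper, not part of the Toolkit, and it assumes the output is written to a local filesystem path.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical helper (not a Toolkit function): returns true when nothing
// exists yet at `dir`, so a script can safely create it as its output directory.
def outputDirIsFree(dir: String): Boolean =
  !Files.exists(Paths.get(dir))
```

A script could call `require(outputDirIsFree("out/"), "out/ already exists")` before its final save step, so the failure shows up before any processing starts.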
### Plain text by domain

The following Spark script generates plain text renderings for all the web pages in a collection with a URL matching a filter string. In the example case, it will go through the collection and find all of the URLs within the "archive.org" domain.
For all these, `Spark` -> `scala`?
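To make the "Plain text by domain" description quoted above concrete, here is a minimal sketch of the domain-filtering idea in plain Scala. It deliberately does not use the Toolkit's own API: `domainOf` and `keepDomain` are hypothetical names, and treating a leading `www.` as part of the same domain is an assumption made for illustration.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper: the host of a URL with a leading "www." dropped,
// or an empty string when the URL cannot be parsed or has no host.
def domainOf(url: String): String =
  Try(Option(new URI(url).getHost).getOrElse(""))
    .getOrElse("")
    .stripPrefix("www.")

// Hypothetical helper: keep only the URLs whose domain equals `domain`.
def keepDomain(urls: Seq[String], domain: String): Seq[String] =
  urls.filter(u => domainOf(u) == domain)
```

For example, `keepDomain(Seq("http://www.archive.org/about", "http://example.com/"), "archive.org")` keeps only the first URL.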
hey @ianmilligan1 I've left Let's focus on
Take a look at this example: https://github.com/archivesunleashed/aut-docs-new/blob/doc-reorg/current/collection-analysis.md
Major changes:
Let me know what you think...