Documentation reorg #2

Merged
merged 21 commits into from
Oct 20, 2019

Conversation

lintool
Member

@lintool lintool commented Oct 19, 2019

Take a look at this example: https://github.com/archivesunleashed/aut-docs-new/blob/doc-reorg/current/collection-analysis.md

Major changes:

  • Grouping analyses by type, each type on its own page. This keeps the heading levels from getting too deep to be manageable.
  • Rephrasing headers into tasks, in the form of "How do I..."
  • Every task has Scala RDD, Scala DF, and Python DF subsections - with TODO stubs for the latter two.

Let me know what you think...

Member Author

lintool commented Oct 19, 2019

Also, see https://github.com/archivesunleashed/aut-docs-new/blob/doc-reorg/current/index.md

I've broken down each type of analysis into its own page.

Member

@ruebot ruebot left a comment

Small changes, but overall I like the direction the structure is going in.

current/link-analysis.md (outdated)
current/link-analysis.md (outdated)
current/link-analysis.md (outdated)
Member

ruebot commented Oct 19, 2019

...it'll be easier to sort out what needs to be done on archivesunleashed/aut#223 with this new structure.

🤝 @lintool

Member Author

lintool commented Oct 19, 2019

@ruebot I think those issues were in the previous version, but I fixed them anyway.

Note that I haven't organized text, link, and image analysis in the "How do I..." format.

Member Author

lintool commented Oct 19, 2019

At some point in time: https://github.com/archivesunleashed/aut-docs-new/blob/doc-reorg/current/collection-analysis.md#to-take-or-to-save

"To Take or To Save" might get its own page, as it's applicable to every script... and we can link to it. Same treatment with "Filters".

Member

@ianmilligan1 ianmilligan1 left a comment

Just minor things, which can be taken or left!

current/collection-analysis.md (outdated)
current/collection-analysis.md (outdated)
@@ -0,0 +1,48 @@
## Image Analysis

AUT supports image analysis, a growing area of interest within web archives.
Member

I've been trying to move us away from acronyms all over the place, so maybe just The Toolkit


### Most frequent image URLs in a collection
Member

This script is so dated - and now that we can just write the images out directly rather than having to wget them, maybe stick with just that?


### Extraction of Simple Site Link Structure

If your web archive does not have a temporal component, the following Spark script will generate the site-level link structure.
Member

Spark -> Scala (just to keep things clear)?

.saveAsTextFile("plain-text/")
```

If you wanted to use it on your own collection, you would change "src/test/resources/arc/example.arc.gz" to the directory with your own ARC or WARC files, and change "out/" on the last line to where you want to save your output data.
Member

This line of explanation might be superfluous now? In any case, we should change src/test/resources/arc/example.arc.gz to just example.arc.gz to reflect the script above (this was probably in the original!).


Note that this will create a new directory to store the output, which cannot already exist.
Member

Maybe these generic examples around manipulating code should just go in one place at the beginning of the docs?


### Plain text by domain

The following Spark script generates plain text renderings for all the web pages in a collection with a URL matching a filter string. In the example case, it will go through the collection and find all of the URLs within the "archive.org" domain.
Member

For all these, Spark -> Scala?

Member Author

lintool commented Oct 19, 2019

hey @ianmilligan1

I've left image-analysis.md and text-analysis.md alone for now... since they'll need to be rewritten later anyway.

Let's focus on collection-analysis.md and see if we're happy with it?

@ruebot ruebot merged commit 25d3310 into master Oct 20, 2019
@lintool lintool deleted the doc-reorg branch October 21, 2019 19:04
3 participants