Commit
* Update package name
* Move warcbase-core to root
* Remove Wayback
* Combine parent pom and warcbase-core pom
* Update pom
* Remove warcbase-hbase
* Remove vis
* Update README
* Add LICENSE
* Update TravisCI config
* Update gitignore
* Update CONTRIBUTING.md
Showing 189 changed files with 415 additions and 25,747 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,3 +6,8 @@ target/ | |
*.iml | ||
*~ | ||
src/main/solr/lib/ | ||
.gradle | ||
.settings | ||
.*.swp | ||
workbench.xmi | ||
build |
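A quick way to sanity-check ignore patterns like the ones added in this hunk is `git check-ignore`. The following is only a sketch: it builds a throwaway repository containing just the five new patterns, and the sample paths it tests (such as `.Foo.scala.swp` and `src/Foo.scala`) are made up for illustration rather than taken from the repository.

```shell
# Scratch repository, so the check never touches a real checkout.
scratch=$(mktemp -d)
git init -q "$scratch"
cd "$scratch" || exit 1

# The five patterns added in this hunk.
printf '%s\n' '.gradle' '.settings' '.*.swp' 'workbench.xmi' 'build' > .gitignore

# git check-ignore exits 0 when a path is ignored and 1 when it is not.
# The paths below are hypothetical examples; they do not need to exist.
for path in .gradle .settings .Foo.scala.swp workbench.xmi build src/Foo.scala; do
    if git check-ignore -q "$path"; then
        echo "ignored:     $path"
    else
        echo "not ignored: $path"
    fi
done
```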
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,4 +9,4 @@ before_install: | |
- "export JAVA_OPTS=-Xmx512m" | ||
|
||
script: | ||
- mvn clean package | ||
- mvn clean install |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,55 +1,53 @@ | ||
# Welcome! | ||
|
||
If you are reading this document then you are interested in contributing to the warcbase or warcbase workshop project. All contributions are welcome: use-cases, documentation, code, patches, bug reports, feature requests, etc. You do not need to be a programmer to speak up! | ||
If you are reading this document then you are interested in contributing to The Archives Unleashed Project. All contributions are welcome: use-cases, documentation, code, patches, bug reports, feature requests, etc. You do not need to be a programmer to speak up! | ||
|
||
### Use cases | ||
|
||
If you would like to submit a use case for the warcbase project, please submit an issue [here](https://github.com/lintool/warcbase/issues/new), assigning the "use case" label to the issue. | ||
If you would like to submit a use case for The Archives Unleashed Toolkit, please submit an issue [here](https://github.com/archivesunleashed/aut/issues/new), and begin the issue title with "Use Case:". | ||
|
||
### Documentation | ||
|
||
You can contribute documentation in two different ways. One way is to create an issue [here](https://github.com/lintool/warcbase/issues/new) and assign the "documentation" label to the issue. | ||
|
||
We also do have a [warcbase-docs](https://github.com/lintool/warcbase-docs) repository. You can fork and do a Pull Request. All documentation resides in [`docs`](https://github.com/lintool/warcbase-docs/tree/master/docs). | ||
You can contribute documentation in two different ways. One way is to create an issue [here](https://github.com/archivesunleashed/aut/issues/new) and begin the issue title with "Documentation:". | ||
|
||
### Request a new feature | ||
|
||
To request a new feature you should [open an issue](https://github.com/lintool/warcbase/issues/new) or create a use case as described above (see _use case_ section above), and summarize the desired functionality. Select the label "enhancement" if creating an issue on the project repo. | ||
To request a new feature you should [open an issue](https://github.com/archivesunleashed/aut/issues/new) or create a use case as described above (see _use case_ section above), and summarize the desired functionality. Begin the issue title with "Enhancement:". | ||
|
||
### Report a bug | ||
|
||
To report a bug you should [open an issue](https://github.com/lintool/warcbase/issues/new) that summarizes the bug. Set the label to "bug". | ||
To report a bug you should [open an issue](https://github.com/archivesunleashed/aut/issues/new) that summarizes the bug. Set the label to "bug". | ||
|
||
In order to help us understand and fix the bug it would be great if you could provide us with: | ||
|
||
1. The steps to reproduce the bug. This includes information about e.g. the warcbase version you were using, whether on a single node or cluster, etc. | ||
1. The steps to reproduce the bug. This includes information about e.g. The Archives Unleashed Toolkit version you were using, whether on a single node or cluster, etc. | ||
2. The expected behavior. | ||
3. The actual, incorrect behavior. | ||
|
||
Feel free to search the issue queue for existing issues (aka tickets) that already describe the problem; if there is such a ticket please add your information as a comment. | ||
|
||
### Contribute code | ||
|
||
_If you are interested in contributing code to Warcbase but do not know where to begin:_ | ||
_If you are interested in contributing code to The Archives Unleashed Toolkit but do not know where to begin:_ | ||
|
||
In this case you should [browse open issues](https://github.com/lintool/warcbase/issues), and or [use cases](https://github.com/lintool/warcbase/labels/use%20case). | ||
In this case you should [browse open issues](https://github.com/archivesunleashed/aut/issues). | ||
|
||
Contributions to the Warcbase codebase should be sent as GitHub pull requests. See section _Create a pull request_ below for details. If there is any problem with the pull request we can work through it using the commenting features of GitHub. | ||
Contributions to The Archives Unleashed Toolkit codebase should be sent as GitHub pull requests. See section _Create a pull request_ below for details. If there is any problem with the pull request we can work through it using the commenting features of GitHub. | ||
|
||
* For _small patches_, feel free to submit pull requests directly for those patches. | ||
* For _larger code contributions_, please use the following process. The idea behind this process is to prevent any wasted work and catch design issues early on. | ||
|
||
1. [Open an issue](https://github.com/lintool/warcbase/issues) and assign it the label of "enhancement", if a similar issue does not exist already. If a similar issue does exist, then you may consider participating in the work on the existing issue. | ||
1. [Open an issue](https://github.com/archivesunleashed/aut/issues), if a similar issue does not exist already. If a similar issue does exist, then you may consider participating in the work on the existing issue. | ||
2. Comment on the issue with your plan for implementing the issue. Explain what pieces of the codebase you are going to touch and how everything is going to fit together. | ||
3. Warcbase committers will work with you on the design to make sure you are on the right track. | ||
3. The Archives Unleashed Toolkit committers will work with you on the design to make sure you are on the right track. | ||
4. Implement your issue, create a pull request (see below), and iterate from there. | ||
|
||
### Create a pull request | ||
|
||
Take a look at [Creating a pull request](https://help.github.com/articles/creating-a-pull-request). In a nutshell you need to: | ||
|
||
1. [Fork](https://help.github.com/articles/fork-a-repo) the warcbase GitHub repository at [https://github.com/lintool/warcbase](https://github.com/lintool/warcbase) to your personal GitHub account. | ||
1. [Fork](https://help.github.com/articles/fork-a-repo) The Archives Unleashed Toolkit GitHub repository at [https://github.com/archivesunleashed/aut](https://github.com/archivesunleashed/aut) to your personal GitHub account. | ||
2. Commit any changes to your fork. | ||
3. Send a [pull request](https://help.github.com/articles/creating-a-pull-request) to the warcbase GitHub repository that you forked in step 1. If your pull request is related to an existing issue -- for instance, because you reported a [bug/issue](https://github.com/lintool/warcbase/issues) earlier -- prefix the title of your pull request with the corresponding issue number (e.g. `issue-123: ...`). Please also include a reference to the issue in the description of the pull request. This can be done by using '#' plus the issue number, like so: '#123'. Also try to pick an appropriate name for the branch from which you are issuing the pull request. | ||
3. Send a [pull request](https://help.github.com/articles/creating-a-pull-request) to The Archives Unleashed Toolkit GitHub repository that you forked in step 1. If your pull request is related to an existing issue -- for instance, because you reported a [bug/issue](https://github.com/archivesunleashed/aut/issues) earlier -- prefix the title of your pull request with the corresponding issue number (e.g. `issue-123: ...`). Please also include a reference to the issue in the description of the pull request. This can be done by using '#' plus the issue number, like so: '#123'. Also try to pick an appropriate name for the branch from which you are issuing the pull request. | ||
|
||
You may want to read [Syncing a fork](https://help.github.com/articles/syncing-a-fork) for instructions on how to keep your fork up to date with the latest changes of the upstream (official) `warcbase` repository. | ||
You may want to read [Syncing a fork](https://help.github.com/articles/syncing-a-fork) for instructions on how to keep your fork up to date with the latest changes of the upstream (official) `aut` repository. |
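For readers who prefer the command line, syncing a fork roughly amounts to fetching from the upstream repository and fast-forwarding the local branch. The helper below is a minimal sketch and not part of the project: the function name `sync_fork`, the remote name `upstream`, and the default branch `master` are assumptions made for illustration.

```shell
# sync_fork UPSTREAM_URL [BRANCH]
# Hypothetical helper: run it inside a local clone of your fork.
# Example call (assumes the upstream branch is named "master"):
#   sync_fork https://github.com/archivesunleashed/aut.git master
sync_fork() {
    upstream_url="$1"
    branch="${2:-master}"

    # Register the upstream remote on first use; later runs reuse it.
    if ! git remote get-url upstream >/dev/null 2>&1; then
        git remote add upstream "$upstream_url" || return 1
    fi

    git fetch upstream || return 1
    git checkout "$branch" || return 1

    # Fast-forward only: if the local branch has diverged from upstream,
    # this fails instead of creating a merge commit.
    git merge --ff-only "upstream/$branch"
}
```

After a successful run, `git push origin master` would publish the updated branch to your fork on GitHub. The fast-forward-only merge is a deliberately cautious choice; if your branch carries local commits, a regular merge or a rebase may be what you want instead.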
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,113 +1,38 @@ | ||
Warcbase [![Build Status](https://travis-ci.org/lintool/warcbase.svg?branch=master)](https://travis-ci.org/lintool/warcbase) | ||
======== | ||
# The Archives Unleashed Toolkit [![Build Status](https://travis-ci.org/archivesunleashed/aut.svg?branch=master)](https://travis-ci.org/archivesunleashed/aut) | ||
|
||
Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark. | ||
The Archives Unleashed Toolkit is an open-source platform for analyzing web archives. Tight integration with Hadoop provides powerful tools for analytics and data processing via Apache Spark. | ||
|
||
There are two main ways of using Warcbase: | ||
|
||
+ The first and most common is to analyze web archives using [Spark](http://spark.apache.org/): these functionalities are contained in the `warcbase-core` module. | ||
+ The second is to take advantage of HBase to provide random access as well as analytics capabilities. Random access allows Warcbase to provide temporal browsing of archived content (i.e., "wayback" functionality): these functionalities are contained in the `warcbase-hbase` module. | ||
|
||
You can use Warcbase without HBase, and since HBase requires more extensive setup, it is recommended that if you're just starting out, play with the Spark analytics and don't worry about HBase. | ||
|
||
Other helpful links: | ||
|
||
+ Detailed documentation is available [here](http://lintool.github.io/warcbase-docs/). | ||
+ Supporting files can be found in the [warcbase-resources repository](https://github.com/lintool/warcbase-resources). | ||
|
||
Getting Started | ||
--------------- | ||
## Getting Started | ||
|
||
Clone the repo: | ||
|
||
``` | ||
$ git clone http://github.com/lintool/warcbase.git | ||
$ git clone http://github.com/archivesunleashed/aut.git | ||
``` | ||
|
||
You can then build Warcbase. If you are just interested in the analytics function, you can run the following: | ||
You can then build The Archives Unleashed Toolkit. | ||
|
||
``` | ||
$ mvn clean package -pl warcbase-core | ||
$ mvn clean install | ||
``` | ||
|
||
For the impatient, to skip tests: | ||
|
||
``` | ||
$ mvn clean package -pl warcbase-core -DskipTests | ||
``` | ||
|
||
If you are interested in the HBase functionality as well, you can build everything using: | ||
|
||
``` | ||
$ mvn clean package | ||
$ mvn clean install -DskipTests | ||
``` | ||
|
||
Warcbase is built against CDH 5.7.1: | ||
The Archives Unleashed Toolkit is built against CDH 5.7.1: | ||
|
||
+ Hadoop version: 2.6.0-cdh5.7.1 | ||
+ Spark version: 1.6.0-cdh5.7.1 | ||
+ HBase version: 1.2.0-cdh5.7.1 | ||
|
||
The Hadoop ecosystem is evolving rapidly, so there may be incompatibilities with other versions. | ||
|
||
Spark Quickstart | ||
---------------- | ||
|
||
For the impatient, let's do a simple analysis with Spark. Within the repo there's already a sample ARC file stored at `warcbase-core/src/test/resources/arc/example.arc.gz`. Our supporting resources repository also has [larger ARC and WARC files as real-world examples](https://github.com/lintool/warcbase-resources/tree/master/Sample-Data). | ||
|
||
If you need to install Spark, [we have a walkthrough here](http://lintool.github.io/warcbase-docs/Getting-Started/). This page also has instructions on how to install and run Spark Notebook, an interactive web-based editor. | ||
|
||
Once you've got Spark installed, go ahead and fire up the Spark shell: | ||
|
||
``` | ||
$ spark-shell --jars warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar | ||
``` | ||
|
||
Here's a simple script that extracts and counts the top-level domains (i.e., number of pages for each top-level domain) in the sample ARC data: | ||
|
||
```scala | ||
import org.warcbase.spark.matchbox._ | ||
import org.warcbase.spark.rdd.RecordRDD._ | ||
|
||
val r = RecordLoader.loadArchives("warcbase-core/src/test/resources/arc/example.arc.gz", sc) | ||
.keepValidPages() | ||
.map(r => ExtractDomain(r.getUrl)) | ||
.countItems() | ||
.take(10) | ||
``` | ||
|
||
**Tip:** By default, commands in the Spark shell must be one line. To run multi-line commands, type `:paste` in the Spark shell: you can then copy-paste the script above directly into Spark shell. Use Ctrl-D to finish the command. | ||
|
||
Want to learn more? Check out our [detailed documentation](http://lintool.github.io/warcbase-docs/). | ||
|
||
|
||
Visualizations | ||
-------------- | ||
|
||
The results of analyses using Warcbase can serve as input to visualizations that help scholars interactively explore the data. Examples include: | ||
|
||
+ [Basic crawl statistics](http://lintool.github.io/warcbase/vis/crawl-sites/index.html) from the Canadian Political Parties and Political Interest Groups collection. | ||
+ [Interactive graph visualization](http://lintool.github.io/warcbase-docs/Gephi-Converting-Site-Link-Structure-into-Dynamic-Visualization/) using Gephi. | ||
+ [Named entity visualization](http://lintool.github.io/warcbase-docs/Spark-NER-Visualization/) for exploring relative frequencies of people, places, and locations. | ||
+ [Shine interface](http://webarchives.ca/) for faceted full-text search. | ||
|
||
|
||
Next Steps | ||
---------- | ||
|
||
+ [Ingesting content into HBase](http://lintool.github.io/warcbase-docs/Ingesting-Content-into-HBase/): loading ARC and WARC data into HBase | ||
+ [Warcbase/Wayback integration](http://lintool.github.io/warcbase-docs/Warcbase-Wayback-Integration/): guide to provide temporal browsing capabilities | ||
+ [Warcbase Java tools](http://lintool.github.io/warcbase-docs/Warcbase-Java-Tools/): building the URL mapping, extracting the webgraph | ||
|
||
|
||
License | ||
------- | ||
# License | ||
|
||
Licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0). | ||
|
||
# Acknowledgments | ||
|
||
Acknowledgments | ||
--------------- | ||
|
||
This work is supported in part by the U.S. National Science Foundation, the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Ontario Ministry of Research and Innovation's Early Researcher Award program, and the Mellon Foundation (via Columbia University). Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors. | ||
|
||
This work is supported in part by the U.S. National Science Foundation, the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Ontario Ministry of Research and Innovation's Early Researcher Award program, and the Andrew W. Mellon Foundation (via Columbia University, University of Waterloo, and York University). Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors. |