Visualize relationships between any number of extracted features #100

Open
ericleasemorgan opened this issue Jun 8, 2020 · 1 comment

@ericleasemorgan
Owner

The Reader excels at: 1) feature extraction, and 2) listing those features. Features include named entities, parts-of-speech, email addresses, URLs, latent themes/topics, keywords, ngrams, etc. The listing of features is usually interesting and easy to interpret, but many times the student, researcher, or scholar wants to discover the relationships between features and answer questions such as, "What features are central to a corpus?" or "What features are highly connected to other features?" The way to address these sorts of questions is through (interactive) network diagrams.

Some of this work has previously been done by Team JAMS, a few computer science students who took part in a PEARC hack-a-thon. (Team JAMS won first prize for their good work.) I took their efforts as a starting point, abstracted them, and made them a part of a different repository called "reader-workbook". See:

The first script (carrel2diagram.sh) is merely a front-end to everything else. The second script (carrel2json.py) does the hard work. It creates a stream of JSON and saves the result to the file system. The third script (template2html-diagram.sh) reads the template (template-diagram.htm), does a substitution, and sends the result to STDOUT as a stream of HTML. Finally, when the resulting HTML is loaded, a cool JavaScript library (D3.js) reads the JSON and outputs a network diagram. The process works pretty well, and the resulting diagrams are very interesting, but the process is not scalable, and it only functions against a tiny handful of our extracted features (namely, different types of nouns).
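To make the JSON step concrete, here is a minimal sketch of how co-occurring features might be turned into a node/link structure of the kind D3 force layouts commonly consume. It is not the actual carrel2json.py; the function name `build_graph`, the `min_weight` parameter, and the `nodes`/`links` field names are assumptions for illustration, and the real script's schema may differ.

```python
import json
from collections import Counter
from itertools import combinations


def build_graph(documents, min_weight=1):
    """Build a node/link dictionary from per-document feature lists.

    `documents` is a list of lists of feature strings (hypothetical input
    shape). Two features are linked when they occur in the same document,
    and the link value is the number of documents they share.
    """
    edge_weights = Counter()
    for features in documents:
        # Sort the de-duplicated features so (a, b) and (b, a) are one edge.
        edge_weights.update(combinations(sorted(set(features)), 2))

    # Drop weak edges so the diagram stays readable.
    kept = {pair: w for pair, w in edge_weights.items() if w >= min_weight}

    # Degree (number of distinct neighbors) is one simple answer to
    # "What features are highly connected to other features?"
    degree = Counter(name for pair in kept for name in pair)

    return {
        "nodes": [{"id": name, "degree": degree[name]} for name in sorted(degree)],
        "links": [
            {"source": a, "target": b, "value": w}
            for (a, b), w in sorted(kept.items())
        ],
    }


if __name__ == "__main__":
    sample = [["Plato", "Athens", "Socrates"], ["Plato", "Socrates"]]
    print(json.dumps(build_graph(sample, min_weight=2), indent=2))
```

Raising `min_weight` is one cheap way to keep a large carrel's diagram legible, since it prunes pairs that co-occur only once or twice.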

Your mission, if you choose to accept it, is to incorporate this into our repository and increase scalability by parallelizing this whole process, probably by editing carrel2json.py. Remember, you will have at least 24 cores at your disposal. In the end, the output will include at least one network diagram of nouns saved in a study carrel at ./htm/network-diagram.htm.
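One possible way to approach the parallelization, sketched under assumptions: count feature pairs per document in separate worker processes, then merge the partial counts in the parent. The function names and the input shape (a list of feature lists) are hypothetical and are not taken from carrel2json.py; only the standard library is used.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations


def count_pairs(features):
    """Count each unordered pair of distinct features in one document."""
    return Counter(combinations(sorted(set(features)), 2))


def merge_counts(partials):
    """Sum an iterable of per-document Counters into one Counter."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_pairs_parallel(documents, workers=24, chunksize=64):
    """Map count_pairs over documents in worker processes, then merge.

    `workers=24` mirrors the core count mentioned above; `chunksize`
    batches documents to reduce inter-process overhead.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return merge_counts(pool.map(count_pairs, documents, chunksize=chunksize))


if __name__ == "__main__":
    # Tiny made-up corpus; a real carrel would supply thousands of documents.
    documents = [["war", "sea", "ship"], ["sea", "ship"], ["war", "ship"]]
    edges = count_pairs_parallel(documents, workers=2, chunksize=1)
    for (a, b), weight in edges.most_common():
        print(a, b, weight)
```

Because each document is counted independently, the work splits cleanly across cores; the merge step is serial but cheap relative to the counting.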

For extra credit, create four different network diagrams, one for each different type of noun found in carrel2json.py.

Once we get this far, we will explore the creation of even more network diagrams illustrating the relationships between any number of things such as: 1) authors and keywords, 2) types of entities and DOIs, or 3) dates and places.
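For relationships between two different kinds of features, such as authors and keywords, the links would connect items of one type to items of the other. A rough sketch, assuming each document is available as a dictionary mapping a feature type to a list of values (a hypothetical shape, not something the Reader is known to produce):

```python
from collections import Counter


def cross_feature_links(records, left, right):
    """Link values of feature type `left` to values of feature type `right`.

    `records` is a list of dicts such as
    {"author": ["Homer"], "keyword": ["war", "sea"]} (illustrative only).
    The link value is the number of records in which both values appear.
    """
    weights = Counter()
    for record in records:
        for a in set(record.get(left, [])):
            for b in set(record.get(right, [])):
                weights[(a, b)] += 1

    # Prefix each node with its type so an author and a keyword that
    # share a spelling remain distinct nodes in the diagram.
    return [
        {"source": f"{left}:{a}", "target": f"{right}:{b}", "value": w}
        for (a, b), w in sorted(weights.items())
    ]


if __name__ == "__main__":
    records = [
        {"author": ["Homer"], "keyword": ["war", "sea"]},
        {"author": ["Homer"], "keyword": ["sea"]},
    ]
    for link in cross_feature_links(records, "author", "keyword"):
        print(link)
```

The same function would cover entity types and DOIs, or dates and places, by changing the two type names passed in.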

In my mind, this is the most difficult hack to write, but the result will be one of the most well-respected features of a Distant Reader study carrel.

@ericleasemorgan
Owner Author

How goes the work on this task?
