entitydebs is a social science tool to programmatically analyze entities in
non-fictional texts. In particular, it's well-suited to extract the sentiment
for an entity using dependency parsing. Tokenization is highly customizable and
supports the Google Cloud Natural Language API out-of-the-box. It can help
answer questions like:
- How do politicians describe their country in governmental speeches?
- Which current topics correlate with celebrities?
- What are the most common root verbs used in different music genres?
Visit the live demo or read through the source code here. To learn more about dependency trees, consult the Google Cloud Natural Language API guide.
- Dependency parsing: Build and traverse dependency trees for syntactic and sentiment analysis
- AI tokenizer: Out-of-the-box support for the Google Cloud Natural Language API for robust tokenization, with a built-in retrier
- Bullet-proof trees: Dependency trees are constructed using gonum
- Efficient traversal: Native iterators for traversing analysis results
- Text normalization: Built-in normalizers (lowercasing, NFKC, lemmatization) to reduce redundancy and improve data integrity
- High test coverage: Over 80% test coverage, tested against millions of tokens
go get github.com/ndabAP/entitydebs
Let's walk through an example. We would like to know the syntactic dependencies around the United States in Congressional speeches. Our entity is the United States and our texts are Congressional speeches.
First, we need a source from which to create the data frames. Our entity
with its aliases is:
entity := []string{"America", "USA", "United States", "US"}
As a text source, we can use "Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts". We skip how to parse the texts and use a subset for this example:
texts := []string{
	"Pay tribute for the past 237 years of sacrifice to our great United States Army.",
	"So having a Special Order this evening is an opportunity for us all to come together and celebrate the commitment of the United States Congress to communities around the world as they experience.",
	"Thank you very much for your sacrifice and your commitment for a free Cuba and a strong United States.",
}
src := entitydebs.NewSource(entity, texts)
Now, we need a tokenizer that implements tokenize.Tokenizer. The built-in
package nlp uses the Google Cloud Natural Language API and supports dependency parsing.
For this to work, we need a service account file (learn more here)
with the respective permissions.
creds := os.Getenv("GCLOUD_SERVICE_ACCOUNT_KEY")
nlp := nlp.New(creds, language.EN)
source.Frames uses the provided tokenizer to generate the data frames. This
may take a while depending on the input and how the tokenizer works.
// Assumption: ctx is a standard context.Context; it is not defined elsewhere in this walkthrough.
ctx := context.Background()
frames, err := src.Frames(ctx, nlp, tokenize.FeatureSyntax)
if err != nil {
	panic(err.Error())
}
From this point, Frames can construct a forest of dependency trees for us.
Trees that don't contain the entity are not created. We print all relationships
using Go's native text/tabwriter:
w := tabwriter.NewWriter(os.Stdout, 0, 0, 1, ' ', 0)
_, _ = fmt.Fprintln(w, "Head\tRelationship\tDependent")
frames.Forest().Dependencies(func(
	head,
	dependent *tokenize.Token,
	rel tokenize.DependencyEdgeLabel,
	tree dependency.Tree,
) bool {
	_, _ = fmt.Fprintf(
		w,
		"%s\t%s\t%s\n",
		head.Text.Content,
		languagepb.DependencyEdge_Label_name[int32(rel)],
		dependent.Text.Content,
	)
	return true
})
_ = w.Flush()
// Output:
// Head        Relationship Dependent
// tribute     P            .
// tribute     PREP         to
// tribute     PREP         for
// tribute     NN           Pay
// for         POBJ         years
// years       PREP         of
// years       NUM          237
// years       AMOD         past
// years       DET          the
// of          POBJ         sacrifice
// to          POBJ         Army
// States      NN           United
// Army        NN           States
// Army        AMOD         great
// Army        POSS         our
// having      TMOD         evening
// having      DOBJ         Order
// having      ADVMOD       So
// Order       NN           Special
// Order       DET          a
// evening     DET          this
// is          P            .
// is          ATTR         opportunity
// is          CSUBJ        having
// opportunity CCOMP        come
// opportunity DET          an
// us          DET          all
// come        CONJ         celebrate
// come        CC           and
// come        ADVMOD       together
// come        AUX          to
// come        NSUBJ        us
// come        MARK         for
// celebrate   ADVCL        experience
// celebrate   PREP         to
// celebrate   DOBJ         commitment
// commitment  PREP         of
// commitment  DET          the
// of          POBJ         Congress
// States      NN           United
// Congress    NN           States
// Congress    DET          the
// to          POBJ         communities
// communities PREP         around
// around      POBJ         world
// world       DET          the
// experience  NSUBJ        they
// experience  MARK         as
// Thank       P            .
// Thank       PREP         for
// Thank       ADVMOD       much
// Thank       DOBJ         you
// much        ADVMOD       very
// for         POBJ         sacrifice
// sacrifice   CONJ         commitment
// sacrifice   CC           and
// sacrifice   POSS         your
// commitment  PREP         for
// commitment  POSS         your
// for         POBJ         Cuba
// Cuba        CONJ         States
// Cuba        CC           and
// Cuba        AMOD         free
// Cuba        DET          a
// States      NN           United
// States      AMOD         strong
// States      DET          a
You can find more examples in the examples folder.
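The rows printed above can be post-processed with plain Go. As a minimal sketch, assuming a few rows have been copied by hand into a hypothetical dep struct (not a type from entitydebs), the following groups adjectival modifiers (AMOD) by head. This is one simple way to see how an entity is described.

```go
package main

import (
	"fmt"
	"sort"
)

// dep is a hypothetical, simplified stand-in for one printed row
// (head, relationship, dependent). It is not a type from entitydebs.
type dep struct {
	head, rel, dependent string
}

// modifiers returns, per head, the dependents attached with the given label.
func modifiers(deps []dep, label string) map[string][]string {
	out := make(map[string][]string)
	for _, d := range deps {
		if d.rel == label {
			out[d.head] = append(out[d.head], d.dependent)
		}
	}
	return out
}

func main() {
	// A subset of rows copied from the printed output.
	deps := []dep{
		{"years", "AMOD", "past"},
		{"Army", "NN", "States"},
		{"Army", "AMOD", "great"},
		{"Army", "POSS", "our"},
		{"Cuba", "AMOD", "free"},
		{"States", "NN", "United"},
		{"States", "AMOD", "strong"},
		{"States", "DET", "a"},
	}

	amod := modifiers(deps, "AMOD")

	// Sort the heads so the printed order is deterministic.
	heads := make([]string, 0, len(amod))
	for h := range amod {
		heads = append(heads, h)
	}
	sort.Strings(heads)
	for _, h := range heads {
		fmt.Println(h, amod[h])
	}
}
```

In this subset, "States" is modified by "strong", and "Army" (whose NN dependent is "States") by "great". Swapping the label for another one, such as POSS or DET, reuses the same helper.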
Julian Claus and contributors.
MIT