Skip to content

Datastore entry method #771

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
May 25, 2022
Merged

Datastore entry method #771

merged 22 commits into from
May 25, 2022

Conversation

hepengfe
Copy link
Collaborator

@hepengfe hepengfe commented May 4, 2022

This PR fixes #767.

Description of changes

  • implemented method all_entries(entry_type_name)
  • implemented method num_entries(entry_type_name)

Possible influences of this PR.

Describe what are the possible side-effects of the code change.

Test Conducted

  • test all_entries(entry_type_name) returns an iterator of raw entry data in the List format of entry_type_name
  • test num_entries(entry_type_name) returns a correct number of entries of entry_type_name

@hepengfe hepengfe self-assigned this May 4, 2022
@hepengfe hepengfe added topic: data Issue about data loader modules and data processing related data_efficiency labels May 4, 2022
@codecov
Copy link

codecov bot commented May 4, 2022

Codecov Report

Merging #771 (71f77fb) into master (e2ce0a4) will decrease coverage by 0.15%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #771      +/-   ##
==========================================
- Coverage   81.13%   80.97%   -0.16%     
==========================================
  Files         249      249              
  Lines       18824    18693     -131     
==========================================
- Hits        15272    15136     -136     
- Misses       3552     3557       +5     
Impacted Files Coverage Δ
forte/data/data_store.py 90.67% <100.00%> (+0.39%) ⬆️
tests/forte/data/data_store_test.py 94.49% <100.00%> (+0.23%) ⬆️
...e/processors/pretrained_encoder_processors_test.py 39.39% <0.00%> (-3.47%) ⬇️
tests/forte/data/data_pack_test.py 96.25% <0.00%> (-2.50%) ⬇️
forte/processors/writers.py 60.52% <0.00%> (-1.98%) ⬇️
forte/data/readers/misc_readers.py 55.55% <0.00%> (-1.90%) ⬇️
...s/forte/data/readers/classification_reader_test.py 96.42% <0.00%> (-1.79%) ⬇️
forte/data/readers/deserialize_reader.py 63.95% <0.00%> (-1.61%) ⬇️
forte/data/caster.py 90.24% <0.00%> (-1.60%) ⬇️
forte/data/extractors/subword_extractor.py 38.46% <0.00%> (-1.54%) ⬇️
... and 77 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e2ce0a4...71f77fb. Read the comment docs.

@hepengfe hepengfe requested a review from mylibrar May 5, 2022 00:00
@hepengfe hepengfe marked this pull request as ready for review May 5, 2022 00:00
Copy link
Collaborator

@mylibrar mylibrar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not preserve the previous behavior in DataPack since it does not retrieve the subclasses. For example, DataPack.all_annotations is expected to return an iterator of all entries that satisfy issubclass(Annotation) (e.g., Token, Document, Sentence, etc.). Hence DataStore.all_entries("forte.data.ontology.top.Annotation") should also be able to get all subtypes of Annotation in its outputs.

@mylibrar mylibrar requested review from hunterhector May 5, 2022 22:46
for entry in list(self.data_store._DataStore__entry_dict.values())
if entry[3] == "ft.onto.base_ontology.Document"
]
ann_list = list(self.data_store._DataStore__entry_dict.values())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be correct if there are Links or Groups in DataStore. For example, the format of test entry with tid=10123 looks like a Group but will also be included in ann_list

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to check Link, I found Link type entry in self.__elements is similar to a regular entry.

Copy link
Collaborator

@mylibrar mylibrar May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the docstring in DataStore.__init__, a Link entry has the following format: [<parent_tid>, <child_tid>, <tid>, <type_name>, <attr_1>, <attr_2>, ..., <attr_n>, index_id]. So I guess you can still use <type_name> to check if it's a link.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the rest of the PR is fine. What's the current status of this comment?

@mylibrar feel free to resolve the comments if you are fine with them, and you can approve it after your review too.

@hepengfe hepengfe marked this pull request as draft May 7, 2022 04:51
@mylibrar mylibrar added this to the 0.3 alpha milestone May 20, 2022
@hepengfe hepengfe marked this pull request as ready for review May 25, 2022 16:24
Copy link
Member

@hunterhector hunterhector left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but will leave it up to @mylibrar to resolve all comments and approve it.

@mylibrar mylibrar merged commit 8af7f5a into asyml:master May 25, 2022
@hepengfe hepengfe deleted the datastore_entry_method branch May 25, 2022 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data_efficiency topic: data Issue about data loader modules and data processing related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement num_entries, all_entries in DataStore
3 participants