Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate a content manifest file with exportcontent #9460

Merged

Conversation

dylanmccall
Copy link
Contributor

@dylanmccall dylanmccall commented May 24, 2022

Summary

This change will update Kolibri's exportcontent command to generate a manifest file, describing an intentional content selection associated with the exported data files. This is distinct from the user's provided node_ids and exclude_node_ids, because it needs to be reproducible on another Kolibri instance with a different set of available content. To achieve this, I split get_import_export_data into a few different functions, and added another function (get_content_nodes_selectors) which finds the optimal list of include nodes to describe the list of content being exported.

That list of nodes is what is written to a manifest.json file alongside the exported content. For simplicity, it is possible for the same channel ID to be repeated multiple times, with different sets of include nodes and export nodes, in the same manifest.json file. This will only happen if the user runs exportcontent multiple times for the same channel.

This pull request does not include the importcontent counterpart to this work, which is in #9467.

References

Reviewer guidance

Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch from dd4fcb8 to 00b383c Compare May 24, 2022 23:51
@github-actions
Copy link
Contributor

github-actions bot commented May 24, 2022

@dylanmccall
Copy link
Contributor Author

dylanmccall commented May 26, 2022

Getting this to play nice with exportcontent is a bit trickier than it sounds at first because we need to deal with the tool being run multiple times, which means we need to generate a sane set of include_nodes / exclude_nodes, and then update that set as needed. After some poking and prodding today, the solution I have for now is to append a new {channel id, include nodes, exclude nodes} to the manifest file for every run. When I'm working on importcontent I think I'll be able to add a mechanism that extends the existing selection for a channel as long as it is the same version, but I think it's reasonable to expect the channel to appear multiple times in some cases.

@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch 3 times, most recently from d459fc5 to 35e01c1 Compare May 28, 2022 00:15
@dylanmccall dylanmccall marked this pull request as ready for review May 28, 2022 00:15
@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch 2 times, most recently from 30f4533 to e695c66 Compare May 28, 2022 00:25
@dylanmccall dylanmccall changed the title WIP: Generate a content manifest file with exportcontent Generate a content manifest file with exportcontent May 28, 2022
@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch 4 times, most recently from 37d5bf6 to db8c569 Compare May 30, 2022 22:10
danigm added a commit to endlessm/kolibri-explore-plugin that referenced this pull request May 31, 2022
This is a first step to import the content from USB if present instead
of from network, without importing the content, just the metadata.

The list of channels is getted from the network right now, waiting for
the metadata.json definition that could be present in the
KOLIBRI_DATA/content folder:
learningequality/kolibri#9460

https://phabricator.endlessm.com/T33526
danigm added a commit to endlessm/kolibri-explore-plugin that referenced this pull request May 31, 2022
This is a first step to import the content from USB if present instead
of from network, without importing the content, just the metadata.

The list of channels is getted from the network right now, waiting for
the metadata.json definition that could be present in the
KOLIBRI_DATA/content folder:
learningequality/kolibri#9460

https://phabricator.endlessm.com/T33526
danigm added a commit to endlessm/kolibri-explore-plugin that referenced this pull request May 31, 2022
This is a first step to import the content from USB if present instead
of from network, without importing the content, just the metadata.

The list of channels is getted from the network right now, waiting for
the metadata.json definition that could be present in the
KOLIBRI_DATA/content folder:
learningequality/kolibri#9460

https://phabricator.endlessm.com/T33526
files_to_download = list(queried_file_objects.values())

total_bytes_to_transfer = sum(map(lambda x: x["file_size"] or 0, files_to_download))
return number_of_resources, files_to_download, total_bytes_to_transfer


def get_content_nodes_selectors(channel_id, nodes_queries_list):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there's a more efficient way to do this with a hierarchical aggregation, similar to how we do our annotation logic. I will try to sketch out what that might look like and get back to you here.

Copy link
Contributor Author

@dylanmccall dylanmccall Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think you're right that there's probably a better way. I opted for the simple to read, procedural approach since it's part of a one-off task that users already understand will take a length of time measured in minutes (so a few seconds generating a manifest isn't a big deal as long as we aren't pegging their CPU to do it). I think it's more important here that the code is clearly understandable and somebody can optimize it once it has been proved out.

But fetching all the node IDs into memory is certainly ugly. For Khan Academy, for example, that's a really big query and about 800 kB of memory for our set of available_node_ids. I'm sure we could do something clever involving the lft / rft of leaf nodes that are excluded, bubbling up from them to select the correct topics. There's a danger that it could turn into a lot of queries for a problem where we're really just shuffling primary keys around. So while it would involve less duplicate data in memory, I don't think we're guaranteed that it would be more efficient, particularly for the sort of deployment where someone would be using exportcontent. Worth exploring in a branch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do something very similar in our importability annotation https://github.com/learningequality/kolibri/blob/develop/kolibri/core/content/utils/importability_annotation.py#L36 - the main difference here would be instead of doing annotation based on some, we would annotate on all children being available. Otherwise the code would be nearly identical. We can then descend down the tree in a read query until we have a set of ids that are marked as 'all' available.

The main trick here is that the write and read all happen within a transaction, and then we roll it back at the end.

Comment on lines 37 to 29
try:
FileNotFoundError
except NameError:
FileNotFoundError = IOError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I guess this is to be compliant with older python.

@dylanmccall
Copy link
Contributor Author

dylanmccall commented Jun 22, 2022

Okay, I added a test suite here, which should give us a nice way to measure performance, although the current test fixture isn't quite doing that. (It's a very small tree of content). In general, when I say this is slow I really just mean "it isn't trying very hard to be fast," but (anecdotally) it is acceptable with the channels I've tried it against. In the worst case I've had it spend ~1.5 seconds generating a manifest file for Khan Academy, which is a particularly large channel. It could be faster with some added cleverness, but I don't think we need to block on it here - and the nice thing is the test suite should make it easier to add such cleverness in the future without breaking existing behaviour :)

Here's a handy little script to play with this in isolation:

from kolibri.utils.main import initialize
initialize(skip_update=True)

import json

# Constants to know:
# Khan Academy (c9d7f950ab6b5a1199e3d6c10d7f0103)

from kolibri.core.content.utils.import_export_content import get_import_export_nodes
from kolibri.core.content.utils.import_export_content import get_content_nodes_selectors

KHAN_ACADEMY="c9d7f950ab6b5a1199e3d6c10d7f0103"

nodes_segments = get_import_export_nodes(KHAN_ACADEMY, available=True)
content_selection = get_content_nodes_selectors(KHAN_ACADEMY, nodes_segments)

print("Content selection:", json.dumps(content_selection, indent=4))

@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch 2 times, most recently from ddc6709 to 1cb17ea Compare June 23, 2022 21:03
@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch 2 times, most recently from 704c71a to 933d5eb Compare July 5, 2022 20:57
@dylanmccall
Copy link
Contributor Author

dylanmccall commented Jul 5, 2022

I did a couple tweaks here, moving some stuff over from #9467:

  • In get_import_export_nodes, differentiate between node_ids=None (all node IDs) and node_ids=[] (no node IDs). This required changes to a bunch of tests which were using an empty array to mean "all the nodes from this channel", but I think it makes sense. Kolibri's annotation code appears to be making the same distinction: https://github.com/learningequality/kolibri/blob/develop/kolibri/core/content/utils/annotation.py#L207-L228.
  • When exporting all nodes in a channel, include_node_ids in the manifest file will be set to [channel_id]. In a previous version, I had it output an empty list instead.
  • With these changes together, we have the ability to clearly express both a channel with no content and a channel with all of the content in our manifest files. In theory, that means we can have exportchannel generate a manifest file as well.

@dylanmccall dylanmccall requested a review from rtibbles July 5, 2022 21:25
@rtibbles
Copy link
Member

rtibbles commented Jul 5, 2022

In get_import_export_nodes, differentiate between node_ids=None (all node IDs) and node_ids=[] (no node IDs). This required changes to a bunch of tests which were using an empty array to mean "all the nodes from this channel", but I think it makes sense. Kolibri's annotation code appears to be making the same distinction: https://github.com/learningequality/kolibri/blob/develop/kolibri/core/content/utils/annotation.py#L207-L228.

I think that's generally helpful, and may be the cause of some behavioural errors I've witnessed elsewhere in Kolibri develop.

When exporting all nodes in a channel, include_node_ids in the manifest file will be set to [channel_id]. In a previous version, I had it output an empty list instead.

I think for precision this should be [root_node_id] - they are frequently identical, but there are some cases where they are not.

With these changes together, we have the ability to clearly express both a channel with no content and a channel with all of the content in our manifest files. In theory, that means we can have exportchannel generate a manifest file as well.

This would also seem fine.

@dylanmccall
Copy link
Contributor Author

dylanmccall commented Jul 5, 2022

I think for precision this should be [root_node_id] - they are frequently identical, but there are some cases where they are not.

Ah, good point. That is in fact what it's doing - I didn't realize they could be different, so that's useful to know! In a previous version, it was special-casing this by saying "if the list we end up with is [channel_id] then make it []", which was silly, so now it just isn't doing that special case :)


channels = manifest_data.setdefault("channels", [])

# TODO: If the channel is already listed, it would be nice to merge the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in the case the channel is already in the manifest what happens, do we have two entries for the same channel in the manifest?

Copy link
Contributor Author

@dylanmccall dylanmccall Jul 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, exactly. So it could end up with a manifest file like this:

{
    "channels": [
        {
            "id": "e409b964366a59219c148f2aaa741f43",
            "version": 10,
            "include_node_ids": [
                "6277aa0c44235435acdc8a9ed98f466b"
            ],
            "exclude_node_ids": []
        },
        {
            "id": "e409b964366a59219c148f2aaa741f43",
            "version": 10,
            "include_node_ids": [
                "e409b964366a59219c148f2aaa741f43"
            ],
            "exclude_node_ids": []
        }
    ],
    "channel_list_hash": "a2c59ee9bab4586a78e56ca8cf13a912"
}

Where the first time around, I exported just "6277aa0c44235435acdc8a9ed98f466b", and the second time I exported the entire channel.

Note that identical entries are skipped, but if we have a different selection for the same channel, they each get a new entry. It would be better if it merges them more exhaustively. For instance the second entry is a superset of the first entry. So, I added a TODO about that. Could do that by adding the existing entries to our list of selected nodes.

Come to think of it, that doesn't sound too difficult to implement, so I think I'll give it a shot right now over in #9467 :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I was wondering if we could just do the TODO now, and not leave it for later!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, indeed. Ran out of time today, but I'll let you know how it goes tomorrow!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I added a commit with a couple things:

  • Removed exclude_node_ids from manifest.json. It isn't being used right now, and it makes our life easier if it isn't there. If we are only including nodes, we can trivially extend that set of nodes when a new export happens. Otherwise (if there is the possibility of nodes being excluded), we would have to expand the selection into a complete set of nodes and then simplify it again.
  • Added a content_manifest module to deal with this in one place, moving some file access and json parsing code out of import_export_content.py. It is responsible for reading and writing manifest files, as well as simplifying the provided set of node IDs.

Copy link
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All seems reasonable on a read through, and the tests and generation time for KA gives assurance that this is a reasonable approach.

One question about the merge case.

Copy link
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of questions - nothing that is completely blocking, but I think it would be good to handle malformed JSON in the manifest reading, and I wonder if optional filtering by available might be useful.

Beyond that - some doc strings on the methods of the ContentManifest class would be helpful - especially to distinguish the role of different but somewhat similarly named methods.

json_str = fp.read()
if not json_str:
return
# Raises JSONDecodeError if the file is invalid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note in Python 2 this raises a ValueError, of which JSONDecodeError is a subclass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah. I made this clearer along with my fix for the below comment, defining JSONDecodeError = ValueError as a fallback.

if not json_str:
return
# Raises JSONDecodeError if the file is invalid
manifest_data = json.loads(json_str)
Copy link
Member

@rtibbles rtibbles Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to just do a try...except here, catch any errors and just do json.load directly from the fp object?

This would also have the benefit of handling non-empty, but invalid JSON.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's funny it ended up looking like this :b Doesn't seem to be any reason, so I switched it to use json.load(fp).

@@ -97,15 +97,56 @@ def filter_by_file_availability(nodes_to_include, channel_id, drive_id, peer_id)

def get_import_export_data( # noqa: C901
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guessing we can probably remove the noqa from here now?

Copy link
Contributor Author

@dylanmccall dylanmccall Jul 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Indeed we can!

if len(missing_leaf_nodes) == 0:
include_node_ids.add(node.id)
elif len(matching_leaf_nodes) > 0:
available_nodes_queue.extend(node.children.all())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be optionally filtering the node children by an available flag here (similarly to how we do in the import export data function).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah, I think that's right. Assuming the availability of nodes in the tree is correct, this should be doing node.children.filter(available=True). Unavailable nodes aren't going to resolve to anything useful here so it's just wasting memory. I should make sure the tests actually cover this properly first (might need to adopt that channel builder instead of the questionable database fixture :b), so, I'll do that on Monday.

I see what you mean about passing an available= flag to the function to be consistent with the import export data function, although I should tweak some names in that case, and probably wrap my head about the use case there. It does sound like it would be convenient for something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, looking at how the function is used, I'm not sure there is a need to have optional available filtering. It's really just that it can be filtered by available=True to limit the search space.

The only question in my mind is.... does the manifest handle the case where a ContentNode has the files it needs in the exported directory, but is not available on the machine doing the exporting?

So, for example, if I exported the basic arithmetic topic from Khan Academy from my machine, and it generated a content manifest - I then pass that USB key to you, and you export more of Khan Academy from your machine, but you don't have basic arithmetic on yours - is that going to now be excluded from the updated manifest?

Would be good to make a test case to cover this scenario, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I added some test cases which explore that. Specifically:

  • If we add some new content nodes to a ContentManifest, include_node_ids is extended to include those nodes. Any nodes that were listed previously are still there.
  • If content nodes are added for one channel version, and then for a different version of the same channel, the two versions are separated; the lists of include_node_ids are not merged. (In practice, importcontent merges them in Read options from a manifest file in importcontent #9467, albeit with a warning message).

This behaviour means a bit of duplication in some cases. That can be solved, but it adds some complexity - especially if we start thinking about channel versions - so I think just something to be aware of for the moment.

Copy link
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is nearly ready to go, I just have one more question that I couldn't be sure of by looking at the code and test cases.

if len(missing_leaf_nodes) == 0:
include_node_ids.add(node.id)
elif len(matching_leaf_nodes) > 0:
available_nodes_queue.extend(node.children.all())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, looking at how the function is used, I'm not sure there is a need to have optional available filtering. It's really just that it can be filtered by available=True to limit the search space.

The only question in my mind is.... does the manifest handle the case where a ContentNode has the files it needs in the exported directory, but is not available on the machine doing the exporting?

So, for example, if I exported the basic arithmetic topic from Khan Academy from my machine, and it generated a content manifest - I then pass that USB key to you, and you export more of Khan Academy from your machine, but you don't have basic arithmetic on yours - is that going to now be excluded from the updated manifest?

Would be good to make a test case to cover this scenario, I think.

@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch 2 times, most recently from cfa0be5 to 0787542 Compare August 4, 2022 15:13
Using the same list of nodes selected for export, we can generate a set
of node IDs which work with importcontent.
This case indicates that all of the content from the channel is
included. It is distinct from include_node_ids being an empty list,
which suggests that none of the content is included.
When node_ids is set to an empty list, get_import_export_nodes will
understand that to mean no nodes are selected, which is different from
when node_ids is unset, in which case all nodes are selected.
Previously, node_ids defaulted to empty list. With the previous
change, that resulted in no nodes being exported by default.
This is designed similarly to ConfigParser, with a ContentManifest
class that can read from or write to a JSON file. It is responsible
for simplifying the list of nodes to include and merging additions to
that list for the same channel ID and version.
These explore situations where we extend an existing list of node IDs.
We need to ensure that data is not lost in the process.
In get_content_nodes_selectors, only step through child nodes that are
marked as available.
@dylanmccall dylanmccall force-pushed the exportcontent-manifest-file branch from 0787542 to 2da94c3 Compare August 4, 2022 15:35
@dylanmccall dylanmccall requested a review from rtibbles August 8, 2022 20:08
@dylanmccall
Copy link
Contributor Author

Funky build failure from the CI system, but the tests passed :)

@rtibbles
Copy link
Member

Oh yes, the buildkite error was due to a sysops issue with the builder.

@rtibbles rtibbles merged commit b4539a6 into learningequality:develop Aug 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants