Generate a content manifest file with exportcontent #9460
Build Artifacts
Getting this to play nice with
This is a first step to import the content from USB if present, instead of from the network, without importing the content itself, just the metadata. The list of channels is fetched from the network right now, waiting for the metadata.json definition that could be present in the KOLIBRI_DATA/content folder: learningequality/kolibri#9460 https://phabricator.endlessm.com/T33526
files_to_download = list(queried_file_objects.values())

total_bytes_to_transfer = sum(map(lambda x: x["file_size"] or 0, files_to_download))
return number_of_resources, files_to_download, total_bytes_to_transfer


def get_content_nodes_selectors(channel_id, nodes_queries_list):
I feel like there's a more efficient way to do this with a hierarchical aggregation, similar to how we do our annotation logic. I will try to sketch out what that might look like and get back to you here.
Yeah, I think you're right that there's probably a better way. I opted for the simple to read, procedural approach since it's part of a one-off task that users already understand will take a length of time measured in minutes (so a few seconds generating a manifest isn't a big deal as long as we aren't pegging their CPU to do it). I think it's more important here that the code is clearly understandable and somebody can optimize it once it has been proved out.
But fetching all the node IDs into memory is certainly ugly. For Khan Academy, for example, that's a really big query and about 800 kB of memory for our set of available_node_ids. I'm sure we could do something clever involving the lft / rght of leaf nodes that are excluded, bubbling up from them to select the correct topics. There's a danger that it could turn into a lot of queries for a problem where we're really just shuffling primary keys around. So while it would involve less duplicate data in memory, I don't think we're guaranteed that it would be more efficient, particularly for the sort of deployment where someone would be using exportcontent. Worth exploring in a branch?
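As a rough illustration of the lft / rght idea above: in a nested-set (MPTT-style) tree, a node is an ancestor of a leaf exactly when its interval encloses the leaf's interval, so excluded leaves can be bubbled up to find the topics that cannot be selected wholesale. This is a hypothetical sketch in plain Python with invented names and data, not Kolibri's actual ContentNode model:

```python
# Hypothetical sketch: nested-set ancestor lookup. All names and the tiny
# tree below are invented for illustration; Kolibri's real ContentNode is
# a Django MPTT model with lft/rght columns.
from collections import namedtuple

Node = namedtuple("Node", ["id", "lft", "rght"])

# A tiny tree: root(1,8) -> topic(2,7) -> leaf_a(3,4), leaf_b(5,6)
nodes = [
    Node("root", 1, 8),
    Node("topic", 2, 7),
    Node("leaf_a", 3, 4),
    Node("leaf_b", 5, 6),
]

def ancestors_of(leaf, all_nodes):
    """Return IDs of every node whose interval strictly encloses the leaf's."""
    return {
        n.id
        for n in all_nodes
        if n.lft < leaf.lft and n.rght > leaf.rght
    }

excluded_leaf = nodes[2]  # leaf_a is excluded from the export
# Topics containing an excluded leaf cannot be selected as a whole subtree.
tainted_topics = ancestors_of(excluded_leaf, nodes)
print(sorted(tainted_topics))  # ['root', 'topic']
```

In a database this interval comparison is one range query per excluded leaf, which is where the "lot of queries" worry in the comment above comes from.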
We do something very similar in our importability annotation https://github.com/learningequality/kolibri/blob/develop/kolibri/core/content/utils/importability_annotation.py#L36 - the main difference here would be instead of doing annotation based on some, we would annotate on all children being available. Otherwise the code would be nearly identical. We can then descend down the tree in a read query until we have a set of IDs that are marked as 'all' available.
The main trick here is that the write and read all happen within a transaction, and then we roll it back at the end.
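As a rough illustration of that "all children available" aggregation (plain Python rather than the Django annotation code linked above; all names are invented for this sketch):

```python
# Illustrative sketch, not Kolibri's implementation: bubble availability up
# the tree, marking a topic available only when *all* of its children are,
# then descend to collect the smallest set of node IDs covering everything.

class Node:
    def __init__(self, id, available=False, children=None):
        self.id = id
        self.available = available
        self.children = children or []

def annotate_all_available(node):
    """Post-order pass: a topic is available iff every child is."""
    if node.children:
        # Materialize the list first so every subtree gets annotated,
        # even after a False result is found.
        child_results = [annotate_all_available(c) for c in node.children]
        node.available = all(child_results)
    return node.available

def collect_selectors(node, selected):
    """Pre-order pass: stop descending once a subtree is fully available."""
    if node.available:
        selected.append(node.id)
    else:
        for child in node.children:
            collect_selectors(child, selected)
    return selected

tree = Node("root", children=[
    Node("topic_a", children=[Node("a1", True), Node("a2", True)]),
    Node("topic_b", children=[Node("b1", True), Node("b2", False)]),
])
annotate_all_available(tree)
print(collect_selectors(tree, []))  # ['topic_a', 'b1']
```

The transaction-and-rollback trick mentioned above would let the annotation pass write temporary values to the database without persisting them.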
try:
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError
👍 I guess this is to be compliant with older Python.
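For context on why that shim is safe: in Python 3, IOError is an alias of OSError and FileNotFoundError is an OSError subclass, so code that catches FileNotFoundError-as-IOError behaves consistently on both versions. A minimal check (mirroring the shim quoted above):

```python
# In Python 3, IOError is an alias of OSError, and FileNotFoundError is an
# OSError subclass. On Python 2 the name doesn't exist, so the shim binds it
# to IOError instead.
try:
    FileNotFoundError
except NameError:  # Python 2: no FileNotFoundError builtin
    FileNotFoundError = IOError

print(issubclass(FileNotFoundError, IOError))  # True on Python 2 and 3
```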
Okay, I added a test suite here, which should give us a nice way to measure performance, although the current test fixture isn't quite doing that. (It's a very small tree of content). In general, when I say this is slow I really just mean "it isn't trying very hard to be fast," but (anecdotally) it is acceptable with the channels I've tried it against. In the worst case I've had it spend ~1.5 seconds generating a manifest file for Khan Academy, which is a particularly large channel. It could be faster with some added cleverness, but I don't think we need to block on it here - and the nice thing is the test suite should make it easier to add such cleverness in the future without breaking existing behaviour :)

Here's a handy little script to play with this in isolation:

from kolibri.utils.main import initialize
initialize(skip_update=True)

import json

# Constants to know:
# Khan Academy (c9d7f950ab6b5a1199e3d6c10d7f0103)
from kolibri.core.content.utils.import_export_content import get_import_export_nodes
from kolibri.core.content.utils.import_export_content import get_content_nodes_selectors

KHAN_ACADEMY = "c9d7f950ab6b5a1199e3d6c10d7f0103"

nodes_segments = get_import_export_nodes(KHAN_ACADEMY, available=True)
content_selection = get_content_nodes_selectors(KHAN_ACADEMY, nodes_segments)

print("Content selection:", json.dumps(content_selection, indent=4))
I did a couple tweaks here, moving some stuff over from #9467:
I think that's generally helpful, and may be the cause of some behavioural errors I've witnessed elsewhere in Kolibri develop.
I think for precision this should be
This would also seem fine.
Ah, good point. That is in fact what it's doing - I didn't realize they could be different, so that's useful to know! In a previous version, it was special-casing this by saying "if the list we end up with is
channels = manifest_data.setdefault("channels", [])

# TODO: If the channel is already listed, it would be nice to merge the
So, in the case where the channel is already in the manifest, what happens? Do we have two entries for the same channel in the manifest?
Yeah, exactly. So it could end up with a manifest file like this:
{
"channels": [
{
"id": "e409b964366a59219c148f2aaa741f43",
"version": 10,
"include_node_ids": [
"6277aa0c44235435acdc8a9ed98f466b"
],
"exclude_node_ids": []
},
{
"id": "e409b964366a59219c148f2aaa741f43",
"version": 10,
"include_node_ids": [
"e409b964366a59219c148f2aaa741f43"
],
"exclude_node_ids": []
}
],
"channel_list_hash": "a2c59ee9bab4586a78e56ca8cf13a912"
}
Where the first time around, I exported just "6277aa0c44235435acdc8a9ed98f466b", and the second time I exported the entire channel.
Note that identical entries are skipped, but if we have a different selection for the same channel, each one gets a new entry. It would be better if it merged them more exhaustively; for instance, the second entry is a superset of the first entry. So, I added a TODO about that. We could do that by adding the existing entries to our list of selected nodes.
Come to think of it, that doesn't sound too difficult to implement, so I think I'll give it a shot right now over in #9467 :)
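The simple part of that merge (unioning include_node_ids for entries that share a channel ID and version) could look roughly like this. This is a hypothetical sketch with invented names, not Kolibri's actual code:

```python
# Hypothetical sketch of the TODO above: merge manifest entries that share
# a channel ID and version by unioning their include_node_ids.
def merge_channel_entries(channels):
    merged = {}
    for entry in channels:
        key = (entry["id"], entry["version"])
        node_ids = merged.setdefault(key, set())
        node_ids.update(entry["include_node_ids"])
    return [
        {"id": cid, "version": ver, "include_node_ids": sorted(node_ids)}
        for (cid, ver), node_ids in sorted(merged.items())
    ]

channels = [
    {"id": "e409b964", "version": 10, "include_node_ids": ["6277aa0c"]},
    {"id": "e409b964", "version": 10, "include_node_ids": ["e409b964"]},
]
print(merge_channel_entries(channels))
```

A plain union like this still leaves redundancy when one node ID is an ancestor of another, as in the example above; collapsing that requires knowledge of the content tree, which is the harder part of the TODO.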
Yes - I was wondering if we could just do the TODO now, and not leave it for later!
I think so, indeed. Ran out of time today, but I'll let you know how it goes tomorrow!
Okay, I added a commit with a couple of things:

- Removed exclude_node_ids from manifest.json. It isn't being used right now, and it makes our life easier if it isn't there. If we are only including nodes, we can trivially extend that set of nodes when a new export happens. Otherwise (if there is the possibility of nodes being excluded), we would have to expand the selection into a complete set of nodes and then simplify it again.
- Added a content_manifest module to deal with this in one place, moving some file access and JSON parsing code out of import_export_content.py. It is responsible for reading and writing manifest files, as well as simplifying the provided set of node IDs.
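The shape of such a ConfigParser-style module might look roughly like the following. This is a toy stand-in with an invented API, not the real kolibri content_manifest module:

```python
# Toy stand-in for the content_manifest module described above: reads and
# writes manifest JSON and merges include_node_ids per (channel_id, version).
# The API here is invented for illustration only.
import io
import json

class ContentManifestSketch:
    def __init__(self):
        self._entries = {}  # (channel_id, version) -> set of node IDs

    def read(self, fp):
        """Load entries from an existing manifest file object."""
        data = json.load(fp)
        for channel in data.get("channels", []):
            self.add_nodes(
                channel["id"], channel["version"], channel["include_node_ids"]
            )

    def add_nodes(self, channel_id, version, node_ids):
        """Merge node IDs into the entry for this channel ID and version."""
        self._entries.setdefault((channel_id, version), set()).update(node_ids)

    def write(self, fp):
        channels = [
            {"id": cid, "version": ver, "include_node_ids": sorted(ids)}
            for (cid, ver), ids in sorted(self._entries.items())
        ]
        json.dump({"channels": channels}, fp, indent=4)

manifest = ContentManifestSketch()
manifest.add_nodes("e409b964", 10, ["6277aa0c"])
manifest.add_nodes("e409b964", 10, ["9f2d1c3a"])  # merged, not duplicated
out = io.StringIO()
manifest.write(out)
print(out.getvalue())
```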
All seems reasonable on a read-through, and the tests and generation time for KA give assurance that this is a reasonable approach.
One question about the merge case.
A couple of questions - nothing that is completely blocking, but I think it would be good to handle malformed JSON in the manifest reading, and I wonder if optional filtering by available might be useful.
Beyond that, some docstrings on the methods of the ContentManifest class would be helpful, especially to distinguish the roles of different but somewhat similarly named methods.
json_str = fp.read()
if not json_str:
    return
# Raises JSONDecodeError if the file is invalid
Note in Python 2 this raises a ValueError, of which JSONDecodeError is a subclass.
Oh yeah. I made this clearer along with my fix for the below comment, defining JSONDecodeError = ValueError as a fallback.
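One way to spell that fallback (a sketch; the exact form in the PR may differ):

```python
import json

try:
    JSONDecodeError = json.JSONDecodeError  # Python 3.5+
except AttributeError:  # Python 2: json raises plain ValueError
    JSONDecodeError = ValueError

# JSONDecodeError subclasses ValueError on Python 3, so either name works
# in an except clause on both versions.
try:
    json.loads("{not valid json")
except JSONDecodeError as error:
    print("caught:", type(error).__name__)
```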
if not json_str:
    return
# Raises JSONDecodeError if the file is invalid
manifest_data = json.loads(json_str)
Any reason not to just do a try...except here, catch any errors, and just do json.load directly from the fp object? This would also have the benefit of handling non-empty but invalid JSON.
Yeah, that's funny it ended up looking like this :b Doesn't seem to be any reason, so I switched it to use json.load(fp).
def get_import_export_data(  # noqa: C901
Guessing we can probably remove the noqa from here now?
👍 Indeed we can!
if len(missing_leaf_nodes) == 0:
    include_node_ids.add(node.id)
elif len(matching_leaf_nodes) > 0:
    available_nodes_queue.extend(node.children.all())
Should we be optionally filtering the node children by an available flag here (similarly to how we do in the import export data function)?
Oh, yeah, I think that's right. Assuming the availability of nodes in the tree is correct, this should be doing node.children.filter(available=True). Unavailable nodes aren't going to resolve to anything useful here, so it's just wasting memory. I should make sure the tests actually cover this properly first (might need to adopt that channel builder instead of the questionable database fixture :b), so I'll do that on Monday.
I see what you mean about passing an available= flag to the function to be consistent with the import export data function, although I should tweak some names in that case, and probably wrap my head around the use case there. It does sound like it would be convenient for something.
Yeah, looking at how the function is used, I'm not sure there is a need to have optional available filtering. It's really just that it can be filtered by available=True to limit the search space.
The only question in my mind is.... does the manifest handle the case where a ContentNode has the files it needs in the exported directory, but is not available on the machine doing the exporting?
So, for example, if I exported the basic arithmetic topic from Khan Academy from my machine, and it generated a content manifest - I then pass that USB key to you, and you export more of Khan Academy from your machine, but you don't have basic arithmetic on yours - is that going to now be excluded from the updated manifest?
Would be good to make a test case to cover this scenario, I think.
Okay, I added some test cases which explore that. Specifically:

- If we add some new content nodes to a ContentManifest, include_node_ids is extended to include those nodes. Any nodes that were listed previously are still there.
- If content nodes are added for one channel version, and then for a different version of the same channel, the two versions are separated; the lists of include_node_ids are not merged. (In practice, importcontent merges them in "Read options from a manifest file in importcontent" #9467, albeit with a warning message.)

This behaviour means a bit of duplication in some cases. That can be solved, but it adds some complexity - especially if we start thinking about channel versions - so I think it's just something to be aware of for the moment.
I think this is nearly ready to go, I just have one more question that I couldn't be sure of by looking at the code and test cases.
Using the same list of nodes selected for export, we can generate a set of node IDs which work with importcontent.
This case indicates that all of the content from the channel is included. It is distinct from include_node_ids being an empty list, which suggests that none of the content is included.
When node_ids is set to an empty list, get_import_export_nodes will understand that to mean no nodes are selected, which is different from when node_ids is unset, in which case all nodes are selected.
Previously, node_ids defaulted to an empty list. With the previous change, that resulted in no nodes being exported by default.
This is designed similarly to ConfigParser, with a ContentManifest class that can read from or write to a JSON file. It is responsible for simplifying the list of nodes to include and merging additions to that list for the same channel ID and version.
These explore situations where we extend an existing list of node IDs. We need to ensure that data is not lost in the process.
In get_content_nodes_selectors, only step through child nodes that are marked as available.
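The node_ids semantics described in the commit messages above (unset meaning "all nodes", versus an empty list meaning "no nodes") can be sketched with a small invented helper; this is not Kolibri's actual function:

```python
# Sketch of the node_ids semantics from the commit messages above:
# None means "no filter, select everything", while an empty list means
# "select nothing". Invented helper, for illustration only.
def select_nodes(all_node_ids, node_ids=None):
    if node_ids is None:
        return list(all_node_ids)
    wanted = set(node_ids)
    return [n for n in all_node_ids if n in wanted]

nodes = ["a", "b", "c"]
print(select_nodes(nodes))         # ['a', 'b', 'c']  (unset: all nodes)
print(select_nodes(nodes, []))     # []               (empty list: none)
print(select_nodes(nodes, ["b"]))  # ['b']
```

Distinguishing None from an empty list is what lets an empty include_node_ids in the manifest mean something different from a missing one.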
Funky build failure from the CI system, but the tests passed :)
Oh yes, the buildkite error was due to a sysops issue with the builder.
Summary

This change updates Kolibri's exportcontent command to generate a manifest file describing an intentional content selection associated with the exported data files. This is distinct from the user's provided node_ids and exclude_node_ids, because it needs to be reproducible on another Kolibri instance with a different set of available content. To achieve this, I split get_import_export_data into a few different functions, and added another function (get_content_nodes_selectors) which finds the optimal list of include nodes to describe the list of content being exported.

That list of nodes is what is written to a manifest.json file alongside the exported content. For simplicity, it is possible for the same channel ID to be repeated multiple times, with different sets of include nodes and export nodes, in the same manifest.json file. This will only happen if the user runs exportcontent multiple times for the same channel.

This pull request does not include the importcontent counterpart to this work, which is in #9467.
Reviewer guidance
Testing checklist
PR process
Reviewer checklist
yarn
andpip
)