Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor content fetching #237

Merged
merged 43 commits into from
Mar 15, 2023
Merged

Conversation

jan-cerny
Copy link
Member

Description:

This PR contains a series of small refactoring of the code responsible for content fetching, with focus on the org_fedora_oscap.content_discovery.ContentBringer class.

The main change is that the code performing analysis of the content has been moved away from ContentBringer to a new class ContentAnalyzer.

Moreover, code manipulating the members of the PolicyData class has been moved away from ContentBringer to the PolicyData class.

The outcome is that now the ContentBringer contains only code performing the download of the content and ContentAnalyzer contains only code performing the analysis of the already fetched content. These 2 classes are independent.

For more details, please read commit messages of all commits.

Rationale:

This change prepares the code for further refactoring with the goal to consolidate content fetching functionality. The separation of content fetching and content analysis functionality will allow us to focus on each of these 2 stages separately, with lower mutual collisions, in future PRs.

Review Hints:

cd testing_files/
python3 -m http.server

./create_update_image.sh install_addon_from_repo

./reanaconda.py updates ~/work/git/oscap-anaconda-addon/update.img

Separates code for determining target file name from code
that fetches the file from URI.
We will not split URI into 2 parts and then pass it as arguments
to methods. Instead we will keep the whole URI as a property
of ContentBringer instances. This change will simplify the code.
The argument of various methods in this class that holds the destination
directory name is always set to self.CONTENT_DOWNLOAD_LOCATION, which
means that these methods don't need to have this argument but they can
use self.CONTENT_DOWNLOAD_LOCATION directly.
The static method ContentBringer.__get_dest_file_name is called
only in a single place with arguments self.content_uri and
self.CONTENT_DOWNLOAD_LOCATION which are both members of the object.
So it can become also a property of the object.
The destination file name depends on URI and changes only if
the URI is set. It doesn't make sense to compute the destination
file name each time it's requested. Instead, we can compute it
only once, when the URI is changed.
The parameter report_callback isn't used anywhere.
Split _finish_actual_fetch into 2 methods:
1. _finish_actual_fetch
2. _analyze_fetched_content
The reason is to separate fetching from content analysis.
We know the path where the file is downloaded, we have the
path stored in self.dest_file_name, that means we don't
need to pass the path as an argument
The self.fetched_content isn't used anywhere.
Pass the target URL as a parameter to fetch_content instead of
digging it from the PolicyData Class.
This change will simplify us to remove PolicyData class
from ContentBringer in future.
This function only manipulates PolicyData members, so it should
be a method of the PolicyData class.
The class memebers rule_data and dry_run aren't set and aren't
used anywhere.
Move method use_system_content from ContentBringer class
to PolicyData class. The reason is that this method
only manipulates the PolicyData instance referenced from
the ContentBringer so it should be a responsibility of the
PolicyData instance.
The outcome of this move is that self._addon_data won't be
directly read in the get_preferred_tailoring() method,
and that will allow further refactoring in future.
The preferred tailoring file is a property of the
ObtainedContent class, so the get_preffered_tailoring
method should belong to the ObtainedContent
The outcome of this move is that self._addon_data won't be
directly read in the get_preferred_content() method,
and that will allow further refactoring in future.
Move method get_preffered_content to the ObtainedContent class
because the preffered content path is a property of the
ObtainedContent class.
Move the use_downloaded_content method from the ContentBringer
to the PolicyData class. This method mostly manipulates the
PolicyData class members, so the PolicyData is a more suitable
place for the method.
This change will cause that allow_one_expected_tailoring_or_no_tailoring
will not directl acess self._addon_data, therefore will not
depend on self._addon_data and can be usable without self._addon_data.
Prevent shadowing, the name "expected_path" is already used above
with different value.
Don't access self._addon_data directly in filter_discovered_content
but instead pass them as parameter. This will reduce the dependency
on the PolicyData class (self._addon_data).
Stop accessing self._addon_data in the _analyze_fetched_content
to reduce dependency on PolicyData
Don't access self._addon_data directly in finish_content_fetch
but instead pass them as parameter. This will reduce the dependency
on the PolicyData class (self._addon_data).
After recent changes, the PolicyData instance isn't used in the ContentBringer
at all.
This method doesn't access self and isn't oused outside
of the class.
This method doesn't use self so it can be a static method
Change allow_one_expected_tailoring_or_no_tailoring method
to private static method because it doesn't access self and
it isn't used outside of this class.
Change filter_discovered_content to private static method.
This method doesn't read self and is used only inside this class.
Change the _analyze_fetched_content method to private static method
because this method doesn't use self (apart form accessing the
constant) and is called only inside this class.
Move all static methods from ContentBringer out to a new class
ContentAnalyzer. This faciliates the responsiibilities split.
ContentBringer contains code related to fetching content from URI
resources. ContentAnalyzer contains code analyzing the fetched
content.
This commit will move content analysis away from the
ContentBringer.finish_content_fetch() method. Therefore, the
ContentBringer class will be responsible only for download of
the content and not for the analysis of the downloaded content.
Instead, there will be a new method static public method
content_discovery.ContentAnalyzer.analyze() and we will add calling
this method everywhere where ContentBringer.finish_content_fetch()
is called.
These parameters of finish_content_fetch aren't used in it.
From now on, we will call finish_content_fetch only if
fetch_content has been called. These 2 calls will therefore
always form a kind of "pair" of calls.
The function update_progress_label isn't used anywhere.
Move the decision whether to verify fingerprint outside from the
`finish_content_fetch` method. The fingerprint will be verified if the
`fingerprint` parameter of the method is set (is not None). This method
is called from 2 places. When calling from _init_data_fetch, the
dest_filename is always not None. And, when calling from
`PrepareValidContent.run()` we can set fingerprint to a non-None value
if the content_dest is set. The reason for this change is that the
`finish_content_fetch` method will not depend on the destination content
file name so it will reduce amount of needed parameters.
We will pass the failure callback parameter in the constructor
instead of the individual methods, because it's always the same
function, it doesn't differ in different methods and depends
only on the caller class.
@matejak matejak self-assigned this Mar 8, 2023
Copy link
Contributor

@matejak matejak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR, it does many changes totally right, while some I don't like, so let's talk about them.

Great:

  • Streamlining of the code.
  • Factoring the analyzer out of the bringer.

Controversial:

  • PolicyData getting new abilities. I agree with the ability to "clear" itself, but the rest should be responsibility of a different class. PolicyData should remain a data structure, an interface that other classes understand and use. This data structure shouldn't know that a thing s.a. "system content" or "downloadable content" exist.


@staticmethod
def __get_dest_file_name(url, destdir):
path = url.split("://")[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will crash in some cases of invalid URLs s.a. http://. We had a BZ about this recently, and the code that extracts path in a robust way is already in the project, that could be extracted as well.
The code also assumes that destdir is a pathlib path, so adding an appropriate type hint may be a good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, but in later commits in this PR, the __get_dest_file_name is further removed and the code using the URI is consolidated at a single place: the content_uri setter method, in which the sanity of the URI is checked and an exception is raised.

@@ -84,7 +83,7 @@ def get_content_type(self, url):

@property
def content_uri(self):
return self.content_uri_scheme + "://" + self.content_uri_path
return self._content_uri
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest renaming to _valid_content_uri, so the code that uses it knows that it is safe to make certain assumptions.

Comment on lines 97 to 106
path = scheme_and_maybe_path[1]
if "/" not in path:
msg = f"Missing the path component of the '{uri}' URL"
raise KickstartValueError(msg)
basename = path.rsplit("/", 1)[1]
if not basename:
msg = f"Unable to deduce basename from the '{uri}' URL"
raise KickstartValueError(msg)
self._content_uri = uri
self.dest_file_name = self.CONTENT_DOWNLOAD_LOCATION / basename
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to keep the code in that private method, and call it only once in the constructor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the basename is derived from URI of the content and the URI will change each time a new content will be requested to fetch.

@@ -159,7 +159,7 @@ def _start_actual_fetch(self, ca_certs_path):
)
return fetching_thread_name

def finish_content_fetch(self, fetching_thread_name, fingerprint, report_callback, dest_filename,
def finish_content_fetch(self, fetching_thread_name, fingerprint, dest_filename,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tricky, the report callback is supposed to be a function that I can use to get information out of the depths of data processing to the user in a good form. It may not be used because of simple omission, not because it doesn't make sense to use it. IOW, the fix for this unused argument could be that it is put to a use.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an idea what information it should return?

Comment on lines 210 to 212
# internal values
self.rule_data = rule_handling.RuleData()
self.dry_run = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting, almost suspicious. Note to self - needs further investigation.

@@ -1215,6 +1215,6 @@ def on_change_content_clicked(self, *args):
self._refresh_ui()

def on_use_ssg_clicked(self, *args):
self.content_bringer.use_system_content()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree here - the content spec data class shouldn't have the knowledge what "system content" even is, and even how to get it. The "bringer" class knows this, and, of course, hammers the information into the content spec aspect of the "policy data" dataclass. It is OK for the bringer to know the data format of the content spec, and the content spec just has to store the data, it doesn't have to fully understand them to the finest extent.

@@ -209,3 +209,22 @@ def use_system_content(self):
self.clear_all()
self.content_type = "scap-security-guide"
self.content_path = common.get_ssg_path()

def use_downloaded_content(self, content):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, the logic of obtaining content is independent from the description of the intent. The PolicyData should store data that have objective meaning, and a single line of its definition shouldn't need to be touched if we decide to use different algorithm for content download, or other implementation detail.

@@ -436,14 +436,14 @@ def update_progress_label(msg):

if actually_fetched_content:
content_path = common.get_raw_preinst_content_path(self._policy_data)
self.content_bringer.finish_content_fetch(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for self: Probably the right thing to do now, but we need to change how it works.
Code should read like a prose, not like a cipher game, as it reads right now.

self.content_bringer.finish_content_fetch(
fetching_thread_name, self._policy_data.fingerprint,
content_dest, _handle_error)
if fetching_thread_name is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: Probably the right thing to do now, but we need to change how it works.
Code should read like a prose, not like a cipher game, as it reads right now. It is very well visible here that a variable storing a name of a thread is seriously abused.

Even if this breaks othe object-oriented principles, the PolicyData
should remain a data structure, an interface that other classes
understand and use. This data structure shouldn't know that a thing s.a.
"system content" or "downloadable content" exist.
@scrutinizer-notifier
Copy link

A new inspection was created.

@jan-cerny
Copy link
Member Author

I have moved the methods out from the PolicyData class and then I have renamed _content_uri.

@jan-cerny
Copy link
Member Author

@matejak ^

@matejak
Copy link
Contributor

matejak commented Mar 15, 2023

LGTM, thanks!

@matejak matejak merged commit 2a87992 into OpenSCAP:rhel9-branch Mar 15, 2023
@matejak matejak added port-RHEL8 Port this PR to the rhel8-branch port-rawhide Port this PR to the master labels Apr 19, 2023
@jan-cerny jan-cerny deleted the fetching2 branch April 20, 2023 13:17
This was referenced Apr 20, 2023
@matejak matejak removed the port-RHEL8 Port this PR to the rhel8-branch label Apr 28, 2023
matejak added a commit that referenced this pull request Apr 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
port-rawhide Port this PR to the master
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants