Refactor content fetching #237

jan-cerny · 2023-03-06T16:03:09Z

Description:

This PR contains a series of small refactoring of the code responsible for content fetching, with focus on the org_fedora_oscap.content_discovery.ContentBringer class.

The main change is that the code performing analysis of the content has been moved away from ContentBringer to a new class ContentAnalyzer.

Moreover, code manipulating the members of the PolicyData class has been moved away from ContentBringer to the PolicyData class.

The outcome is that now the ContentBringer contains only code performing the download of the content and ContentAnalyzer contains only code performing the analysis of the already fetched content. These 2 classes are independent.

For more details, please read commit messages of all commits.

Rationale:

This change prepares the code for further refactoring with the goal to consolidate content fetching functionality. The separation of content fetching and content analysis functionality will allow us to focus on each of these 2 stages separately, with lower mutual collisions, in future PRs.

Review Hints:

cd testing_files/
python3 -m http.server

./create_update_image.sh install_addon_from_repo

./reanaconda.py updates ~/work/git/oscap-anaconda-addon/update.img

Separates code for determining target file name from code that fetches the file from URI.

We will not split URI into 2 parts and then pass it as arguments to methods. Instead we will keep the whole URI as a property of ContentBringer instances. This change will simplify the code.

The argument of various methods in this class that holds the destination directory name is always set to self.CONTENT_DOWNLOAD_LOCATION, which means that these methods don't need to have this argument but they can use self.CONTENT_DOWNLOAD_LOCATION directly.

The static method ContentBringer.__get_dest_file_name is called only in a single place with arguments self.content_uri and self.CONTENT_DOWNLOAD_LOCATION which are both members of the object. So it can become also a property of the object.

The destination file name depends on URI and changes only if the URI is set. It doesn't make sense to compute the destination file name each time it's requested. Instead, we can compute it only once, when the URI is changed.

The parameter report_callback isn't used anywhere.

Split _finish_actual_fetch into 2 methods: 1. _finish_actual_fetch 2. _analyze_fetched_content The reason is to separate fetching from content analysis.

We know the path where the file is downloaded, we have the path stored in self.dest_file_name, that means we don't need to pass the path as an argument

The self.fetched_content isn't used anywhere.

Pass the target URL as a parameter to fetch_content instead of digging it from the PolicyData Class. This change will simplify us to remove PolicyData class from ContentBringer in future.

This function only manipulates PolicyData members, so it should be a method of the PolicyData class.

The class memebers rule_data and dry_run aren't set and aren't used anywhere.

Move method use_system_content from ContentBringer class to PolicyData class. The reason is that this method only manipulates the PolicyData instance referenced from the ContentBringer so it should be a responsibility of the PolicyData instance.

The outcome of this move is that self._addon_data won't be directly read in the get_preferred_tailoring() method, and that will allow further refactoring in future.

The preferred tailoring file is a property of the ObtainedContent class, so the get_preffered_tailoring method should belong to the ObtainedContent

The outcome of this move is that self._addon_data won't be directly read in the get_preferred_content() method, and that will allow further refactoring in future.

Move method get_preffered_content to the ObtainedContent class because the preffered content path is a property of the ObtainedContent class.

Move the use_downloaded_content method from the ContentBringer to the PolicyData class. This method mostly manipulates the PolicyData class members, so the PolicyData is a more suitable place for the method.

This change will cause that allow_one_expected_tailoring_or_no_tailoring will not directl acess self._addon_data, therefore will not depend on self._addon_data and can be usable without self._addon_data.

Prevent shadowing, the name "expected_path" is already used above with different value.

Don't access self._addon_data directly in filter_discovered_content but instead pass them as parameter. This will reduce the dependency on the PolicyData class (self._addon_data).

Stop accessing self._addon_data in the _analyze_fetched_content to reduce dependency on PolicyData

Don't access self._addon_data directly in finish_content_fetch but instead pass them as parameter. This will reduce the dependency on the PolicyData class (self._addon_data).

After recent changes, the PolicyData instance isn't used in the ContentBringer at all.

This method doesn't access self and isn't oused outside of the class.

This method doesn't use self so it can be a static method

Change allow_one_expected_tailoring_or_no_tailoring method to private static method because it doesn't access self and it isn't used outside of this class.

Change filter_discovered_content to private static method. This method doesn't read self and is used only inside this class.

Change the _analyze_fetched_content method to private static method because this method doesn't use self (apart form accessing the constant) and is called only inside this class.

Move all static methods from ContentBringer out to a new class ContentAnalyzer. This faciliates the responsiibilities split. ContentBringer contains code related to fetching content from URI resources. ContentAnalyzer contains code analyzing the fetched content.

This commit will move content analysis away from the ContentBringer.finish_content_fetch() method. Therefore, the ContentBringer class will be responsible only for download of the content and not for the analysis of the downloaded content. Instead, there will be a new method static public method content_discovery.ContentAnalyzer.analyze() and we will add calling this method everywhere where ContentBringer.finish_content_fetch() is called.

These parameters of finish_content_fetch aren't used in it.

From now on, we will call finish_content_fetch only if fetch_content has been called. These 2 calls will therefore always form a kind of "pair" of calls.

The function update_progress_label isn't used anywhere.

Move the decision whether to verify fingerprint outside from the `finish_content_fetch` method. The fingerprint will be verified if the `fingerprint` parameter of the method is set (is not None). This method is called from 2 places. When calling from _init_data_fetch, the dest_filename is always not None. And, when calling from `PrepareValidContent.run()` we can set fingerprint to a non-None value if the content_dest is set. The reason for this change is that the `finish_content_fetch` method will not depend on the destination content file name so it will reduce amount of needed parameters.

We will pass the failure callback parameter in the constructor instead of the individual methods, because it's always the same function, it doesn't differ in different methods and depends only on the caller class.

matejak

Thank you for the PR, it does many changes totally right, while some I don't like, so let's talk about them.

Great:

Streamlining of the code.
Factoring the analyzer out of the bringer.

Controversial:

PolicyData getting new abilities. I agree with the ability to "clear" itself, but the rest should be responsibility of a different class. PolicyData should remain a data structure, an interface that other classes understand and use. This data structure shouldn't know that a thing s.a. "system content" or "downloadable content" exist.

matejak · 2023-03-08T11:46:57Z

org_fedora_oscap/content_discovery.py

-
+    @staticmethod
+    def __get_dest_file_name(url, destdir):
+        path = url.split("://")[1]


This will crash in some cases of invalid URLs s.a. http://. We had a BZ about this recently, and the code that extracts path in a robust way is already in the project, that could be extracted as well.
The code also assumes that destdir is a pathlib path, so adding an appropriate type hint may be a good idea.

Good point, but in later commits in this PR, the __get_dest_file_name is further removed and the code using the URI is consolidated at a single place: the content_uri setter method, in which the sanity of the URI is checked and an exception is raised.

matejak · 2023-03-08T11:49:35Z

org_fedora_oscap/content_discovery.py

@@ -84,7 +83,7 @@ def get_content_type(self, url):

    @property
    def content_uri(self):
-        return self.content_uri_scheme + "://" + self.content_uri_path
+        return self._content_uri


I suggest renaming to _valid_content_uri, so the code that uses it knows that it is safe to make certain assumptions.

matejak · 2023-03-08T11:57:09Z

org_fedora_oscap/content_discovery.py

+        path = scheme_and_maybe_path[1]
+        if "/" not in path:
+            msg = f"Missing the path component of the '{uri}' URL"
+            raise KickstartValueError(msg)
+        basename = path.rsplit("/", 1)[1]
+        if not basename:
+            msg = f"Unable to deduce basename from the '{uri}' URL"
+            raise KickstartValueError(msg)
        self._content_uri = uri
+        self.dest_file_name = self.CONTENT_DOWNLOAD_LOCATION / basename


Why not to keep the code in that private method, and call it only once in the constructor?

Because the basename is derived from URI of the content and the URI will change each time a new content will be requested to fetch.

matejak · 2023-03-08T12:00:33Z

org_fedora_oscap/content_discovery.py

@@ -159,7 +159,7 @@ def _start_actual_fetch(self, ca_certs_path):
            )
        return fetching_thread_name

-    def finish_content_fetch(self, fetching_thread_name, fingerprint, report_callback, dest_filename,
+    def finish_content_fetch(self, fetching_thread_name, fingerprint, dest_filename,


This is tricky, the report callback is supposed to be a function that I can use to get information out of the depths of data processing to the user in a good form. It may not be used because of simple omission, not because it doesn't make sense to use it. IOW, the fix for this unused argument could be that it is put to a use.

Do you have an idea what information it should return?

matejak · 2023-03-08T12:09:08Z

org_fedora_oscap/structures.py

-        # internal values
-        self.rule_data = rule_handling.RuleData()
-        self.dry_run = False


This is interesting, almost suspicious. Note to self - needs further investigation.

matejak · 2023-03-08T12:10:35Z

org_fedora_oscap/gui/spokes/oscap.py

@@ -1215,6 +1215,6 @@ def on_change_content_clicked(self, *args):
        self._refresh_ui()

    def on_use_ssg_clicked(self, *args):
-        self.content_bringer.use_system_content()


I disagree here - the content spec data class shouldn't have the knowledge what "system content" even is, and even how to get it. The "bringer" class knows this, and, of course, hammers the information into the content spec aspect of the "policy data" dataclass. It is OK for the bringer to know the data format of the content spec, and the content spec just has to store the data, it doesn't have to fully understand them to the finest extent.

matejak · 2023-03-09T12:53:03Z

org_fedora_oscap/structures.py

@@ -209,3 +209,22 @@ def use_system_content(self):
        self.clear_all()
        self.content_type = "scap-security-guide"
        self.content_path = common.get_ssg_path()
+
+    def use_downloaded_content(self, content):


Ditto, the logic of obtaining content is independent from the description of the intent. The PolicyData should store data that have objective meaning, and a single line of its definition shouldn't need to be touched if we decide to use different algorithm for content download, or other implementation detail.

matejak · 2023-03-09T12:59:24Z

org_fedora_oscap/gui/spokes/oscap.py

@@ -436,14 +436,14 @@ def update_progress_label(msg):

        if actually_fetched_content:
            content_path = common.get_raw_preinst_content_path(self._policy_data)
+            self.content_bringer.finish_content_fetch(


Note for self: Probably the right thing to do now, but we need to change how it works.
Code should read like a prose, not like a cipher game, as it reads right now.

matejak · 2023-03-09T13:01:24Z

org_fedora_oscap/service/installation.py

-        self.content_bringer.finish_content_fetch(
-            fetching_thread_name, self._policy_data.fingerprint,
-            content_dest, _handle_error)
+        if fetching_thread_name is not None:


Note to self: Probably the right thing to do now, but we need to change how it works.
Code should read like a prose, not like a cipher game, as it reads right now. It is very well visible here that a variable storing a name of a thread is seriously abused.

Even if this breaks othe object-oriented principles, the PolicyData should remain a data structure, an interface that other classes understand and use. This data structure shouldn't know that a thing s.a. "system content" or "downloadable content" exist.

scrutinizer-notifier · 2023-03-10T08:55:59Z

A new inspection was created.

jan-cerny · 2023-03-10T08:56:48Z

I have moved the methods out from the PolicyData class and then I have renamed _content_uri.

jan-cerny · 2023-03-15T08:30:07Z

@matejak ^

matejak · 2023-03-15T09:35:43Z

LGTM, thanks!

Port #237 to rhel8 branch

jan-cerny added 30 commits March 2, 2023 10:30

Extract method __get_dest_file_name

52ac17b

Separates code for determining target file name from code that fetches the file from URI.

Stop splitting URI

200eab4

We will not split URI into 2 parts and then pass it as arguments to methods. Instead we will keep the whole URI as a property of ContentBringer instances. This change will simplify the code.

Change __get_dest_file_name to a property

7b0ae59

The static method ContentBringer.__get_dest_file_name is called only in a single place with arguments self.content_uri and self.CONTENT_DOWNLOAD_LOCATION which are both members of the object. So it can become also a property of the object.

Set destination file name when URI is set

f13a3b4

The destination file name depends on URI and changes only if the URI is set. It doesn't make sense to compute the destination file name each time it's requested. Instead, we can compute it only once, when the URI is changed.

Remove unused parameter

3a0ef3c

The parameter report_callback isn't used anywhere.

Split method into 2 methods

af9c56e

Split _finish_actual_fetch into 2 methods: 1. _finish_actual_fetch 2. _analyze_fetched_content The reason is to separate fetching from content analysis.

Move fingerprint verification out from content analysis

2a030c9

Verify fingerprint of the downloaded file

c6038d4

We know the path where the file is downloaded, we have the path stored in self.dest_file_name, that means we don't need to pass the path as an argument

Remove unused class member

b49dfff

The self.fetched_content isn't used anywhere.

Pass the target URL as a parameter to fetch_content

e30695b

Pass the target URL as a parameter to fetch_content instead of digging it from the PolicyData Class. This change will simplify us to remove PolicyData class from ContentBringer in future.

Move method closer to the caller method

89457dd

Move clear_all to PolicyData

b55b617

This function only manipulates PolicyData members, so it should be a method of the PolicyData class.

Remove unused code

e92b724

The class memebers rule_data and dry_run aren't set and aren't used anywhere.

Move method use_system_content

ab6be82

Move method use_system_content from ContentBringer class to PolicyData class. The reason is that this method only manipulates the PolicyData instance referenced from the ContentBringer so it should be a responsibility of the PolicyData instance.

Move variable access

7e78b23

The outcome of this move is that self._addon_data won't be directly read in the get_preferred_tailoring() method, and that will allow further refactoring in future.

Move get_preffered_tailoring to ObtainedContent

5d043d1

The preferred tailoring file is a property of the ObtainedContent class, so the get_preffered_tailoring method should belong to the ObtainedContent

Move variable access

a22296b

The outcome of this move is that self._addon_data won't be directly read in the get_preferred_content() method, and that will allow further refactoring in future.

Move method get_preffered_content

0012b07

Move method get_preffered_content to the ObtainedContent class because the preffered content path is a property of the ObtainedContent class.

Move use_downloaded_content to the PolicyData class

a04f4e8

Move the use_downloaded_content method from the ContentBringer to the PolicyData class. This method mostly manipulates the PolicyData class members, so the PolicyData is a more suitable place for the method.

Move variable access

01b0237

This change will cause that allow_one_expected_tailoring_or_no_tailoring will not directl acess self._addon_data, therefore will not depend on self._addon_data and can be usable without self._addon_data.

Rename variable

990fe9a

Prevent shadowing, the name "expected_path" is already used above with different value.

Don't access self._addon_data

46d028f

Don't access self._addon_data directly in filter_discovered_content but instead pass them as parameter. This will reduce the dependency on the PolicyData class (self._addon_data).

Stop accessing self._addon_data

8dec02d

Stop accessing self._addon_data in the _analyze_fetched_content to reduce dependency on PolicyData

Stop accessing self._addon_data

3ac39f8

Don't access self._addon_data directly in finish_content_fetch but instead pass them as parameter. This will reduce the dependency on the PolicyData class (self._addon_data).

Remove PolicyData from ContentBringer

9cd6cf5

After recent changes, the PolicyData instance isn't used in the ContentBringer at all.

Change get_content_type to static private method

762be9b

This method doesn't access self and isn't oused outside of the class.

Change reduce_files to static method

f73871e

This method doesn't use self so it can be a static method

Change method to static

a77dbf2

Change allow_one_expected_tailoring_or_no_tailoring method to private static method because it doesn't access self and it isn't used outside of this class.

Change method to private static method

0fdaa65

Change filter_discovered_content to private static method. This method doesn't read self and is used only inside this class.

jan-cerny added 10 commits March 3, 2023 16:07

Change method to private static method

680bc0c

Change the _analyze_fetched_content method to private static method because this method doesn't use self (apart form accessing the constant) and is called only inside this class.

Create ContentAnalyzer

617f466

Move all static methods from ContentBringer out to a new class ContentAnalyzer. This faciliates the responsiibilities split. ContentBringer contains code related to fetching content from URI resources. ContentAnalyzer contains code analyzing the fetched content.

Remove unused parameters

5073ae4

These parameters of finish_content_fetch aren't used in it.

Finish content fetching only if started

554ae38

From now on, we will call finish_content_fetch only if fetch_content has been called. These 2 calls will therefore always form a kind of "pair" of calls.

Remove unused code

c086edd

The function update_progress_label isn't used anywhere.

Move the failure callback parameter

212ed3a

We will pass the failure callback parameter in the constructor instead of the individual methods, because it's always the same function, it doesn't differ in different methods and depends only on the caller class.

Add missing doctext

94c97b4

Fix code style

b6c317e

jan-cerny requested review from matejak and evgenyz March 8, 2023 08:00

matejak self-assigned this Mar 8, 2023

matejak requested changes Mar 9, 2023

View reviewed changes

jan-cerny added 2 commits March 10, 2023 08:50

Rename _content_uri to _valid_content_uri

df9fd5c

matejak approved these changes Mar 15, 2023

View reviewed changes

matejak merged commit 2a87992 into OpenSCAP:rhel9-branch Mar 15, 2023

matejak added port-RHEL8 Port this PR to the rhel8-branch port-rawhide Port this PR to the master labels Apr 19, 2023

jan-cerny deleted the fetching2 branch April 20, 2023 13:17

This was referenced Apr 20, 2023

Port #237 to rawhide branch #242

Closed

Port #237 to rhel8 branch #243

Merged

matejak removed the port-RHEL8 Port this PR to the rhel8-branch label Apr 28, 2023

matejak added a commit that referenced this pull request Apr 28, 2023

Merge pull request #243 from jan-cerny/fetching2_rhel8

e3e7804

Port #237 to rhel8 branch

alexhaydock mentioned this pull request Jun 17, 2023

Regression between Rocky 9.1 and 9.2 with using of tailoring-path #246

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor content fetching #237

Refactor content fetching #237

jan-cerny commented Mar 6, 2023

matejak left a comment

matejak Mar 8, 2023

jan-cerny Mar 10, 2023

matejak Mar 8, 2023

matejak Mar 8, 2023

jan-cerny Mar 10, 2023

matejak Mar 8, 2023

jan-cerny Mar 10, 2023

matejak Mar 8, 2023

matejak Mar 8, 2023

matejak Mar 9, 2023

matejak Mar 9, 2023

matejak Mar 9, 2023

scrutinizer-notifier commented Mar 10, 2023

jan-cerny commented Mar 10, 2023

jan-cerny commented Mar 15, 2023

matejak commented Mar 15, 2023

Refactor content fetching #237

Refactor content fetching #237

Conversation

jan-cerny commented Mar 6, 2023

Description:

Rationale:

Review Hints:

matejak left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

scrutinizer-notifier commented Mar 10, 2023

jan-cerny commented Mar 10, 2023

jan-cerny commented Mar 15, 2023

matejak commented Mar 15, 2023