Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pytx] Implement FileContent class #1680

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

ZeyadTarekk
Copy link

Summary

Added FileContent class that detects the type of the file

Test Plan

added some unit tests to test the class

@ZeyadTarekk
Copy link
Author

@Dcallies Could you check that
#1680

@Dcallies Dcallies changed the title Implement filecontent class [pytx] Implement FileContent class Nov 6, 2024
Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dcallies Could you check that
#1680

I am having trouble parsing your comment, did you post this accidentally before completing your thought?

Toplevel comments:

  • There is some unexpected functionality in here that I think we should talk about more - see questions inline
  • All of the other ContentTypes today are powered by static/class methods, since we haven't defined what instantiation means yet. I propose we stick with static/class methods to start with, unless you have a specific proposal that depends on making these methods more objective.

@@ -21,8 +21,8 @@ def get_name(cls) -> str:

@classmethod
def extract_additional_content(
cls, content_arg: str
) -> t.List[t.Tuple[t.Type["ContentType"], str]]:
cls, content_in_file: Path, available_content: t.Sequence[t.Type["ContentType"]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking question: Walk me through your thoughts about these changes. Why are they on this PR?


class FileContent(ContentType):
"""
Content representing a file. Determines if a file is a photo or video based on file extension.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggested:

Suggested change
Content representing a file. Determines if a file is a photo or video based on file extension.
ContentType representing a generic file.
Determines if a file is a photo or video based on file extension.

Comment on lines 17 to 19
@classmethod
def get_name(cls) -> str:
return "File"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this the same as a default implementation, and so can be removed

Comment on lines 14 to 15
def __init__(self, file_name: str):
self.file_name = file_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking q: Currently, I don't any ContentType has an implementation for instantiation (we couldn't agree on what it should mean at the time, and so just saved it for the future). What would be your arguments to make file the first one?

Content representing a file. Determines if a file is a photo or video based on file extension.
"""

VALID_PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For photos, we can align with the image types that Pillow supports. Check out this code:

from PIL import Image

exts = Image.registered_extensions()

Comment on lines 21 to 30
def get_content_type_from_filename(self) -> t.Type[ContentType]:
"""
Determines content type based on file extension.
"""
if any(self.file_name.endswith(ext) for ext in self.VALID_PHOTO_EXTENSIONS):
return PhotoContent
elif any(self.file_name.endswith(ext) for ext in self.VALID_VIDEO_EXTENSIONS):
return VideoContent
else:
raise ValueError(f"Unknown content type for file: {self.file_name}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Since the caller can't easily know whether the type is supported before calling, suggest making this return None instead of throwing an exception.

from threatexchange.content_type.video import VideoContent
from threatexchange.content_type.file_content import FileContent

class TestFileContentType(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: In this library, we use py.test and not unittest.TestCase. Take look at some of the other tests in this file to see how to write those.

The logic overall looks good though!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any tests that using py.test
All are using unittest.TestCase
@Dcallies

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very definitive statement! Here's one: https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/cli/tests/cli_smoke_test.py

I will open issues for you to convert all of the old ones to py.test :)

def get_name(cls) -> str:
return "File"

def get_content_type_from_filename(self) -> t.Type[ContentType]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggested signature

Suggested change
def get_content_type_from_filename(self) -> t.Type[ContentType]:
@classmethod
def get_content_type_from_filename(cls, file_name: str) -> t.Type[ContentType]:

@ZeyadTarekk
Copy link
Author

I have added all the requested changes
Could you review them @Dcallies

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please:

  1. Convert the test to py.test as previously requested
  2. Make sure lint and testing is passing (you can run it locally before submitting)

@@ -0,0 +1,33 @@
import unittest
Copy link
Contributor

@Dcallies Dcallies Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Please convert this to py.test as previously requested.

Additionally, we want the tests for modules to live in the same directory as the code. Please move this file to

threatexchange/content_type/tests/test_file_content_type.py

Determines if a file is a photo or video based on file extension.
"""

VALID_PHOTO_EXTENSIONS = {ext.lower() for ext in Image.registered_extensions()}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is doing work during module import time, which is generally considered an anti-pattern, but I don't think it's worth changing

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you provide more details?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you tell me more details?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the first result for me from google for "doing work during module import time"
https://www.benkuhn.net/importtime/

As mentioned, I don't think it's worth changing.

@ZeyadTarekk
Copy link
Author

Please:

  1. Convert the test to py.test as previously requested
  2. Make sure lint and testing is passing (you can run it locally before submitting)

How to run the linting and testing locally?

@Dcallies
Copy link
Contributor

Dcallies commented Nov 8, 2024

Please:

  1. Convert the test to py.test as previously requested
  2. Make sure lint and testing is passing (you can run it locally before submitting)

How to run the linting and testing locally?

It's still in the same place in https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/CONTRIBUTING.md :)

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Please make sure all lint and CI are passing. You can test them locally before submitting.

from threatexchange.content_type.video import VideoContent
from threatexchange.content_type.file_content import FileContent

@pytest.mark.parametrize("file_name,expected_content_type", [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Determines if a file is a photo or video based on file extension.
"""

VALID_PHOTO_EXTENSIONS = {ext.lower() for ext in Image.registered_extensions()}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the first result for me from google for "doing work during module import time"
https://www.benkuhn.net/importtime/

As mentioned, I don't think it's worth changing.

as either PhotoContent or VideoContent based on file extension.
"""
content_type = FileContent.get_content_type_from_filename(file_name)
assert content_type == expected_content_type, f"Failed for {file_name}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably don't need to add a message - parameterize will do that for you.

content_type = FileContent.get_content_type_from_filename(file_name)
assert content_type == expected_content_type, f"Failed for {file_name}"

def test_unknown_file_type():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You can combine this into your previous test by just doing

("file.txt", None)

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to run linting locally before submitting! I cannot review this PR anymore today - you can save time by running tests and typing yourself!

@ZeyadTarekk
Copy link
Author

Remember to run linting locally before submitting! I cannot review this PR anymore today - you can save time by running tests and typing yourself!

@Dcallies
I am getting an error and I can't run the tests locally
Screenshot 2024-11-08 at 5 22 54 PM
Screenshot 2024-11-08 at 6 54 08 PM

@Dcallies
Copy link
Contributor

Dcallies commented Nov 8, 2024

Hey @ZeyadTarekk it looks like you are showing the the result of a pip run, which are not the linting tools (black, mypy, pytest). Read more about how pip works to figure out how to avoid installing vpdq to unblock yourself: https://pip.pypa.io/en/stable/cli/pip_install/

Here are my suggestions:

  1. Can you open an issue for the error you are seeing? Include the paste that includes the stacktrace. @ianwal might also respond to a ping on the dangling ; here https://github.com/facebook/ThreatExchange/blame/main/vpdq/cpp/vpdq/cpp/hashing/ffmpegwrapper.cpp#L134
  2. You can ignore vPDQ or install an older version, and omit the extensions directory when testing, then use the relevant python tools to run the CI.

@ZeyadTarekk
Copy link
Author

Hey @ZeyadTarekk it looks like you are showing the the result of a pip run, which are not the linting tools (black, mypy, pytest). Read more about how pip works to figure out how to avoid installing vpdq to unblock yourself: https://pip.pypa.io/en/stable/cli/pip_install/

Here are my suggestions:

  1. Can you open an issue for the error you are seeing? Include the paste that includes the stacktrace. @ianwal might also respond to a ping on the dangling ; here https://github.com/facebook/ThreatExchange/blame/main/vpdq/cpp/vpdq/cpp/hashing/ffmpegwrapper.cpp#L134
  2. You can ignore vPDQ or install an older version, and omit the extensions directory when testing, then use the relevant python tools to run the CI.

I have already run the linter and it gives me that there is no problem
Screenshot 2024-11-09 at 12 10 48 AM

And when I try to run the tests it is giving me error in the imports because the packages are not installed using pip install --editable .[dev]

Screenshot 2024-11-09 at 12 13 58 AM

@Dcallies
Copy link
Contributor

Dcallies commented Nov 9, 2024

Hey @ZeyadTarekk , @haianhng31 can teach you about a workaround using test selection with pytest, or you can read the command line --help. See the public documentation at https://docs.pytest.org/en/stable/how-to/usage.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants