Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pytx] Implement FileContent class #1680

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
from enum import Enum, auto
import typing as t

from threatexchange import common
import common


class ContentType:
Expand All @@ -21,8 +21,8 @@ def get_name(cls) -> str:

@classmethod
def extract_additional_content(
cls, content_arg: str
) -> t.List[t.Tuple[t.Type["ContentType"], str]]:
cls, content_in_file: Path, available_content: t.Sequence[t.Type["ContentType"]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking question: Walk me through your thoughts about these changes. Why are they on this PR?

) -> t.Dict[t.Type["ContentType"], t.List[Path]]:
"""
Post-process/download content to find additional components

Expand All @@ -32,7 +32,7 @@ def extract_additional_content(
* Photo => run OCR and extract text
* Video => break out photo thumbnail, close caption text, audio
"""
return []
return {}


class RotationType(Enum):
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
import typing as t
from content_base import ContentType
from photo import PhotoContent
from video import VideoContent

class FileContent(ContentType):
"""
Content representing a file. Determines if a file is a photo or video based on file extension.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggested:

Suggested change
Content representing a file. Determines if a file is a photo or video based on file extension.
ContentType representing a generic file.
Determines if a file is a photo or video based on file extension.

"""

VALID_PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For photos, we can align with the image types that Pillow supports. Check out this code:

from PIL import Image

exts = Image.registered_extensions()

VALID_VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}

def __init__(self, file_name: str):
self.file_name = file_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking q: Currently, I don't any ContentType has an implementation for instantiation (we couldn't agree on what it should mean at the time, and so just saved it for the future). What would be your arguments to make file the first one?


@classmethod
def get_name(cls) -> str:
return "File"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this the same as a default implementation, and so can be removed


def get_content_type_from_filename(self) -> t.Type[ContentType]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggested signature

Suggested change
def get_content_type_from_filename(self) -> t.Type[ContentType]:
@classmethod
def get_content_type_from_filename(cls, file_name: str) -> t.Type[ContentType]:

"""
Determines content type based on file extension.
"""
if any(self.file_name.endswith(ext) for ext in self.VALID_PHOTO_EXTENSIONS):
return PhotoContent
elif any(self.file_name.endswith(ext) for ext in self.VALID_VIDEO_EXTENSIONS):
return VideoContent
else:
raise ValueError(f"Unknown content type for file: {self.file_name}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Since the caller can't easily know whether the type is supported before calling, suggest making this return None instead of throwing an exception.

Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
import unittest
Copy link
Contributor

@Dcallies Dcallies Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Please convert this to py.test as previously requested.

Additionally, we want the tests for modules to live in the same directory as the code. Please move this file to

threatexchange/content_type/tests/test_file_content_type.py

from threatexchange.content_type.photo import PhotoContent
from threatexchange.content_type.video import VideoContent
from threatexchange.content_type.file_content import FileContent

class TestFileContentType(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: In this library, we use py.test and not unittest.TestCase. Take look at some of the other tests in this file to see how to write those.

The logic overall looks good though!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any tests that using py.test
All are using unittest.TestCase
@Dcallies

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very definitive statement! Here's one: https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/cli/tests/cli_smoke_test.py

I will open issues for you to convert all of the old ones to py.test :)

def test_photo_detection(self):
file_content = FileContent("file.jpg")
content_type = file_content.get_content_type_from_filename()
self.assertEqual(content_type, PhotoContent)

def test_video_detection(self):
file_content = FileContent("file.mp4")
content_type = file_content.get_content_type_from_filename()
self.assertEqual(content_type, VideoContent)

def test_unknown_file_type(self):
file_content = FileContent("file.txt")
with self.assertRaises(ValueError):
file_content.get_content_type_from_filename()
Loading