Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/bb 2 way diff fix #1172

Merged
merged 6 commits into from
Sep 3, 2024
Merged

Conversation

MaxHoecker
Copy link

@MaxHoecker MaxHoecker commented Aug 23, 2024

User description

Fixes #1131

Gets the 2-way file diff between the merging branch head and the best common ancestor


PR Type

Bug fix, Enhancement


Description

  • Fixed issue Bitbucket server uses 2-way diff but Codium AI uses 3-way diff #1131: Implemented correct 2-way file diff calculation for Bitbucket Server
  • Enhanced BitbucketServerProvider class to find the best common ancestor (merge base) between branches
  • Updated get_diff_files method to use the correct base commit for diff calculation
  • Added comprehensive unit tests to verify the correct behavior of the 2-way diff functionality
  • Improved code flexibility by allowing injection of a custom bitbucket_client in the constructor

Changes walkthrough 📝

Relevant files
Enhancement
bitbucket_server_provider.py
Improve BitbucketServerProvider diff calculation                 

pr_agent/git_providers/bitbucket_server_provider.py

  • Added optional bitbucket_client parameter to the constructor
  • Implemented get_best_common_ancestor method to find the merge base
  • Updated get_diff_files to use the best common ancestor for diff
    calculation
  • +23/-4   
    Tests
    test_bitbucket_provider.py
    Add comprehensive tests for BitbucketServerProvider           

    tests/unittest/test_bitbucket_provider.py

  • Added new test class TestBitbucketServerProvider
  • Implemented multiple test cases for get_diff_files method
  • Added mock implementations for Bitbucket API calls
  • +200/-1 

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    @codiumai-pr-agent-pro codiumai-pr-agent-pro bot added enhancement New feature or request Bug fix labels Aug 23, 2024
    Copy link
    Contributor

    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🏅 Score: 85
    🧪 PR contains tests
    🔒 No security concerns identified
    🔀 Multiple PR themes

    Sub-PR theme: Implement 2-way diff calculation for Bitbucket Server

    Relevant files:

    • pr_agent/git_providers/bitbucket_server_provider.py

    Sub-PR theme: Add unit tests for Bitbucket Server 2-way diff functionality

    Relevant files:

    • tests/unittest/test_bitbucket_provider.py

    ⚡ Key issues to review

    Code Complexity
    The get_best_common_ancestor method might be difficult to understand and maintain. Consider adding more comments or breaking it down into smaller, more descriptive functions.

    Copy link
    Contributor

    codiumai-pr-agent-pro bot commented Aug 23, 2024

    PR Code Suggestions ✨

    Latest suggestions up to cf14e45

    CategorySuggestion                                                                                                                                    Score
    Maintainability
    Use parameterized testing to reduce code duplication in API version tests

    Consider using a parameterized test for the different API versions to reduce code
    duplication and improve test maintainability.

    tests/unittest/test_bitbucket_provider.py [233-297]

    -def test_get_diff_files_multi_merge_diverge_60(self):
    -    bitbucket_client = self.get_multi_merge_diverge_mock_client(60)
    +import pytest
    +
    +@pytest.mark.parametrize("api_version, expected_file_content", [
    +    (60, 'file\nwith\nmultiple\nlines\nto\nemulate\na\nreal\nfile'),
    +    (70, 'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile'),
    +    (816, 'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile')
    +])
    +def test_get_diff_files_multi_merge_diverge(self, api_version, expected_file_content):
    +    bitbucket_client = self.get_multi_merge_diverge_mock_client(api_version)
     
         provider = BitbucketServerProvider(
             "https://git.onpreminstance.com/projects/AAA/repos/my-repo/pull-requests/1",
             bitbucket_client=bitbucket_client
         )
     
         expected = [
             FilePatchInfo(
    -            'file\nwith\nmultiple\nlines\nto\nemulate\na\nreal\nfile',
    -            'readme\nwithout\nsome\nlines\nto\nsimulate\na\nreal\nfile',
    -            '--- \n+++ \n@@ -1,9 +1,9 @@\n-file\n-with\n-multiple\n+readme\n+without\n+some\n lines\n to\n-emulate\n+simulate\n a\n real\n file',
    -            'Readme.md',
    -            edit_type=EDIT_TYPE.MODIFIED,
    -        )
    -    ]
    -
    -    actual = provider.get_diff_files()
    -
    -    assert actual == expected
    -
    -def test_get_diff_files_multi_merge_diverge_70(self):
    -    bitbucket_client = self.get_multi_merge_diverge_mock_client(70)
    -
    -    provider = BitbucketServerProvider(
    -        "https://git.onpreminstance.com/projects/AAA/repos/my-repo/pull-requests/1",
    -        bitbucket_client=bitbucket_client
    -    )
    -
    -    expected = [
    -        FilePatchInfo(
    -            'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile',
    +            expected_file_content,
                 'readme\nwithout\nsome\nlines\nto\nsimulate\na\nreal\nfile',
                 '--- \n+++ \n@@ -1,9 +1,9 @@\n-file\n-with\n+readme\n+without\n some\n lines\n to\n-emulate\n+simulate\n a\n real\n file',
                 'Readme.md',
                 edit_type=EDIT_TYPE.MODIFIED,
             )
         ]
     
         actual = provider.get_diff_files()
     
         assert actual == expected
     
    -def test_get_diff_files_multi_merge_diverge_816(self):
    -    bitbucket_client = self.get_multi_merge_diverge_mock_client(816)
    -
    -    provider = BitbucketServerProvider(
    -        "https://git.onpreminstance.com/projects/AAA/repos/my-repo/pull-requests/1",
    -        bitbucket_client=bitbucket_client
    -    )
    -
    -    expected = [
    -        FilePatchInfo(
    -            'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile',
    -            'readme\nwithout\nsome\nlines\nto\nsimulate\na\nreal\nfile',
    -            '--- \n+++ \n@@ -1,9 +1,9 @@\n-file\n-with\n+readme\n+without\n some\n lines\n to\n-emulate\n+simulate\n a\n real\n file',
    -            'Readme.md',
    -            edit_type=EDIT_TYPE.MODIFIED,
    -        )
    -    ]
    -
    -    actual = provider.get_diff_files()
    -
    -    assert actual == expected
    -
    Suggestion importance[1-10]: 8

    Why: Parameterized testing significantly improves test maintainability and reduces code duplication, which is a substantial improvement in the testing framework.

    8
    Extract version comparison logic into separate methods for improved readability

    Consider extracting the version comparison logic into a separate method for better
    readability and maintainability.

    pr_agent/git_providers/bitbucket_server_provider.py [162-186]

    -if self.bitbucket_api_version is not None and self.bitbucket_api_version >= LooseVersion("8.16"):
    +def _get_base_sha(self):
    +    if self._is_version_gte("8.16"):
    +        return self._get_base_sha_for_version_816_and_above()
    +    else:
    +        return self._get_base_sha_for_version_below_816()
    +
    +def _is_version_gte(self, version):
    +    return self.bitbucket_api_version is not None and self.bitbucket_api_version >= LooseVersion(version)
    +
    +def _get_base_sha_for_version_816_and_above(self):
         try:
    -        base_sha = self.bitbucket_client.get(self._get_merge_base())['id']
    -    except Exception as e:
    -        get_logger().error(f"Failed to get the best common ancestor for PR: {self.pr_url}, \nerror: {e}")
    -        raise e
    -else:
    +        return self.bitbucket_client.get(self._get_merge_base())['id']
    +    except Exception as error:
    +        get_logger().error(f"Failed to get the best common ancestor for PR: {self.pr_url}, \nerror: {error}")
    +        raise error
    +
    +def _get_base_sha_for_version_below_816(self):
         source_commits_list = list(self.bitbucket_client.get_pull_requests_commits(
             self.workspace_slug,
             self.repo_slug,
             self.pr_num
         ))
    -    # if Bitbucket api version is None or < 7.0 then do a simple diff with a guaranteed common ancestor
         base_sha = source_commits_list[-1]['parents'][0]['id']
    -    # if Bitbucket api version is 7.0-8.15 then use 2-way diff functionality for the base_sha
    -    if self.bitbucket_api_version is not None and self.bitbucket_api_version >= LooseVersion("7.0"):
    +    if self._is_version_gte("7.0"):
             try:
                 destination_commits = list(
                     self.bitbucket_client.get_commits(self.workspace_slug, self.repo_slug, base_sha,
                                                       self.pr.toRef['latestCommit']))
                 base_sha = self.get_best_common_ancestor(source_commits_list, destination_commits, base_sha)
    -        except Exception as e:
    +        except Exception as error:
                 get_logger().error(
    -                f"Failed to get the commit list for calculating best common ancestor for PR: {self.pr_url}, \nerror: {e}")
    -            raise e
    +                f"Failed to get the commit list for calculating best common ancestor for PR: {self.pr_url}, \nerror: {error}")
    +            raise error
    +    return base_sha
     
    +base_sha = self._get_base_sha()
    +
    Suggestion importance[1-10]: 7

    Why: Extracting complex logic into separate methods improves code maintainability and readability, which is a valuable improvement but not critical.

    7
    Best practice
    Use more descriptive variable names in exception handling

    Consider using a more descriptive variable name instead of 'e' in the exception
    handling blocks. This will make the code more readable and maintainable.

    pr_agent/git_providers/bitbucket_server_provider.py [165-167]

    -except Exception as e:
    -    get_logger().error(f"Failed to get the best common ancestor for PR: {self.pr_url}, \nerror: {e}")
    -    raise e
    +except Exception as error:
    +    get_logger().error(f"Failed to get the best common ancestor for PR: {self.pr_url}, \nerror: {error}")
    +    raise error
     
    Suggestion importance[1-10]: 6

    Why: Using more descriptive variable names improves code readability, but it's a minor improvement that doesn't address major issues or bugs.

    6

    Previous suggestions

    ✅ Suggestions up to commit 2a9e3ee
    CategorySuggestion                                                                                                                                    Score
    Best practice
    ✅ Refactor the mock method to use a dictionary for improved efficiency and maintainability
    Suggestion Impact:The commit implemented the suggestion by replacing the series of if-elif statements with a dictionary (content_map) and using the .get() method for efficient lookup

    code diff:

    -        if at == '9c1cffdd9f276074bfb6fb3b70fbee62d298b058':
    -            return 'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile\n'
    -        elif at == '2a1165446bdf991caf114d01f7c88d84ae7399cf':
    -            return 'file\nwith\nmultiple \nlines\nto\nemulate\na\nfake\nfile\n'
    -        elif at == 'f617708826cdd0b40abb5245eda71630192a17e3':
    -            return 'file\nwith\nmultiple \nlines\nto\nemulate\na\nreal\nfile\n'
    -        elif at == 'cb68a3027d6dda065a7692ebf2c90bed1bcdec28':
    -            return 'file\nwith\nsome\nchanges\nto\nemulate\na\nreal\nfile\n'
    -        elif at == '1905dcf16c0aac6ac24f7ab617ad09c73dc1d23b':
    -            return 'file\nwith\nsome\nlines\nto\nemulate\na\nfake\ntest\n'
    -        elif at == 'ae4eca7f222c96d396927d48ab7538e2ee13ca63':
    -            return 'readme\nwithout\nsome\nlines\nto\nsimulate\na\nreal\nfile'
    -        elif at == '548f8ba15abc30875a082156314426806c3f4d97':
    -            return 'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile'
    -        return ''
    +        content_map = {
    +            '9c1cffdd9f276074bfb6fb3b70fbee62d298b058': 'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile\n',
    +            '2a1165446bdf991caf114d01f7c88d84ae7399cf': 'file\nwith\nmultiple \nlines\nto\nemulate\na\nfake\nfile\n',
    +            'f617708826cdd0b40abb5245eda71630192a17e3': 'file\nwith\nmultiple \nlines\nto\nemulate\na\nreal\nfile\n',
    +            'cb68a3027d6dda065a7692ebf2c90bed1bcdec28': 'file\nwith\nsome\nchanges\nto\nemulate\na\nreal\nfile\n',
    +            '1905dcf16c0aac6ac24f7ab617ad09c73dc1d23b': 'file\nwith\nsome\nlines\nto\nemulate\na\nfake\ntest\n',
    +            'ae4eca7f222c96d396927d48ab7538e2ee13ca63': 'readme\nwithout\nsome\nlines\nto\nsimulate\na\nreal\nfile',
    +            '548f8ba15abc30875a082156314426806c3f4d97': 'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile'
    +        }
    +
    +        return content_map.get(at, '')

    Consider using a more efficient data structure, such as a dictionary, for the
    mock_get_content_of_file method. This would improve the readability and
    maintainability of the code, especially as the number of test cases grows.

    tests/unittest/test_bitbucket_provider.py [25-40]

     def mock_get_content_of_file(self, project_key, repository_slug, filename, at=None, markup=None):
    -    if at == '9c1cffdd9f276074bfb6fb3b70fbee62d298b058':
    -        return 'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile\n'
    -    elif at == '2a1165446bdf991caf114d01f7c88d84ae7399cf':
    -        return 'file\nwith\nmultiple \nlines\nto\nemulate\na\nfake\nfile\n'
    -    elif at == 'f617708826cdd0b40abb5245eda71630192a17e3':
    -        return 'file\nwith\nmultiple \nlines\nto\nemulate\na\nreal\nfile\n'
    -    elif at == 'cb68a3027d6dda065a7692ebf2c90bed1bcdec28':
    -        return 'file\nwith\nsome\nchanges\nto\nemulate\na\nreal\nfile\n'
    -    elif at == '1905dcf16c0aac6ac24f7ab617ad09c73dc1d23b':
    -        return 'file\nwith\nsome\nlines\nto\nemulate\na\nfake\ntest\n'
    -    elif at == 'ae4eca7f222c96d396927d48ab7538e2ee13ca63':
    -        return 'readme\nwithout\nsome\nlines\nto\nsimulate\na\nreal\nfile'
    -    elif at == '548f8ba15abc30875a082156314426806c3f4d97':
    -        return 'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile'
    -    return ''
    +    content_map = {
    +        '9c1cffdd9f276074bfb6fb3b70fbee62d298b058': 'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile\n',
    +        '2a1165446bdf991caf114d01f7c88d84ae7399cf': 'file\nwith\nmultiple \nlines\nto\nemulate\na\nfake\nfile\n',
    +        'f617708826cdd0b40abb5245eda71630192a17e3': 'file\nwith\nmultiple \nlines\nto\nemulate\na\nreal\nfile\n',
    +        'cb68a3027d6dda065a7692ebf2c90bed1bcdec28': 'file\nwith\nsome\nchanges\nto\nemulate\na\nreal\nfile\n',
    +        '1905dcf16c0aac6ac24f7ab617ad09c73dc1d23b': 'file\nwith\nsome\nlines\nto\nemulate\na\nfake\ntest\n',
    +        'ae4eca7f222c96d396927d48ab7538e2ee13ca63': 'readme\nwithout\nsome\nlines\nto\nsimulate\na\nreal\nfile',
    +        '548f8ba15abc30875a082156314426806c3f4d97': 'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile'
    +    }
    +    return content_map.get(at, '')
     
    • Apply this suggestion
    Suggestion importance[1-10]: 9

    Why: This suggestion greatly improves code maintainability and readability, especially as the number of test cases grows, making it a valuable refactoring.

    9
    Maintainability
    Improve variable naming for better code readability and understanding

    Consider using a more descriptive variable name for guaranteed_common_ancestor. A
    name like oldest_source_commit might better describe what this variable represents,
    improving code readability.

    pr_agent/git_providers/bitbucket_server_provider.py [156-160]

     source_commits_list = list(self.bitbucket_client.get_pull_requests_commits(self.workspace_slug, self.repo_slug, self.pr_num))
    -guaranteed_common_ancestor = source_commits_list[-1]['parents'][0]['id']
    -destination_commits = list(self.bitbucket_client.get_commits(self.workspace_slug, self.repo_slug, guaranteed_common_ancestor, self.pr.toRef['latestCommit']))
    +oldest_source_commit = source_commits_list[-1]['parents'][0]['id']
    +destination_commits = list(self.bitbucket_client.get_commits(self.workspace_slug, self.repo_slug, oldest_source_commit, self.pr.toRef['latestCommit']))
     
    -base_sha = self.get_best_common_ancestor(source_commits_list, destination_commits, guaranteed_common_ancestor)
    +base_sha = self.get_best_common_ancestor(source_commits_list, destination_commits, oldest_source_commit)
     
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: The suggested variable name change significantly improves code readability and understanding, which is crucial for maintainability.

    8
    Performance
    ✅ Optimize the creation of the destination_commit_hashes set using a set comprehension and union operation
    Suggestion Impact:The suggestion was directly implemented in the commit. The code was changed to use a set comprehension with a union operation to create destination_commit_hashes, exactly as suggested.

    code diff:

    -        destination_commit_hashes = {commit['id'] for commit in destination_commits_list}
    -        destination_commit_hashes.add(guaranteed_common_ancestor)
    +        destination_commit_hashes = {commit['id'] for commit in destination_commits_list} | {guaranteed_common_ancestor}

    The get_best_common_ancestor method could be optimized by using a set comprehension
    instead of a loop for creating destination_commit_hashes. This would make the code
    more concise and potentially more efficient.

    pr_agent/git_providers/bitbucket_server_provider.py [140-143]

     @staticmethod
     def get_best_common_ancestor(source_commits_list, destination_commits_list, guaranteed_common_ancestor) -> str:
    -    destination_commit_hashes = {commit['id'] for commit in destination_commits_list}
    -    destination_commit_hashes.add(guaranteed_common_ancestor)
    +    destination_commit_hashes = {commit['id'] for commit in destination_commits_list} | {guaranteed_common_ancestor}
     
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: This suggestion offers a minor performance improvement and enhances code readability by using more concise Python syntax.

    7
    Enhancement
    Simplify the initialization of the Bitbucket client by using a default value in the method signature

    Consider using a default value for bitbucket_client instead of None. This would
    allow you to create the Bitbucket instance directly in the parameter list, reducing
    the need for an additional conditional statement in the method body.

    pr_agent/git_providers/bitbucket_server_provider.py [17-19]

     def __init__(
             self, pr_url: Optional[str] = None, incremental: Optional[bool] = False,
    -        bitbucket_client: Optional[Bitbucket] = None,
    +        bitbucket_client: Optional[Bitbucket] = Bitbucket(url=self._parse_bitbucket_server(pr_url),
    +                                                          token=get_settings().get("BITBUCKET_SERVER.BEARER_TOKEN", None)),
     ):
     
    • Apply this suggestion
    Suggestion importance[1-10]: 5

    Why: The suggestion improves code readability but may not be the best practice as it could lead to unexpected behavior if the URL is not provided.

    5

    @mrT23
    Copy link
    Collaborator

    mrT23 commented Aug 25, 2024

    @MaxHoecker
    Does this PR support the different calculation methods that were used in Bitbucket Server 7.0 before and after ?

    In other words, is it backward (and forward) compatible ?

    @MaxHoecker
    Copy link
    Author

    @mrT23
    It does not support the 3-way. Bitbucket Server 7.0 came out over 4 years ago, are you suggesting an additional config value to toggle between them?

    @mrT23
    Copy link
    Collaborator

    mrT23 commented Aug 26, 2024

    @MaxHoecker I dont know.
    I am not a Bitbucket server user.
    Just trying to make sure this is an improvement.

    @MarkRx are you interested in reviewing this PR ?

    @MaxHoecker
    Copy link
    Author

    Added a config to allow the legacy 3-way diff. Although I think it's unlikely anyone using this and bitbucket server is using Bitbucket Server 6.0 or older, so I defaulted to the new diff calculation method.

    @MarkRx
    Copy link
    Contributor

    MarkRx commented Aug 27, 2024

    Looks like there is an API endpoint you can call to get the base line commit for a 2-way diff. It may be better to use that to match what BB is doing. If the API client doesn't expose the endpoint you can call it manually like I did here.

    As for supporting a 3-way diff unfortunately BB forces the 2-way diff from version 7+. Given that Enterprises should be keeping up to date in theory support for 6 with a 3-way diff is not necessary.

    @MaxHoecker
    Copy link
    Author

    Unfortunately that endpoint is not available in my Bitbucket API version. I'll add a check for the api version and use a different strategy depending on api version though

    …egy depending on API version, and expanded unit test cases
    @MaxHoecker
    Copy link
    Author

    Changes pushed, @MarkRx please review when you get the chance 😄

    @@ -407,3 +451,6 @@ def get_pr_labels(self, update=False):

    def _get_pr_comments_path(self):
    return f"rest/api/latest/projects/{self.workspace_slug}/repos/{self.repo_slug}/pull-requests/{self.pr_num}/comments"

    def _get_best_common_ancestor(self):
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    This method name is confusing

    Copy link
    Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    completely agree, will fix

    base_sha, head_sha = source_commits_list[-1]['parents'][0]['id'], source_commits_list[0]['id']

    # if Bitbucket api version is greater than or equal to 7.0 then use 2-way diff functionality for the base_sha
    if self.bitbucket_api_version is not None and self.bitbucket_api_version >= LooseVersion("7.0"):
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    it may be more clear to split this into 3 cases:
    <7
    <8.16
    8.16+

    Copy link
    Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Unfortunately I don't think this block can be cleanly split into 3 cases like that based on the shared logic between certain cases. I do think I can cut out a call to get the source_commits_list for the >8.16 condition though so I'll at least add that

    @mrT23
    Copy link
    Collaborator

    mrT23 commented Sep 3, 2024

    ok, thanks for the contribution

    @mrT23 mrT23 merged commit d4d9a7f into Codium-ai:main Sep 3, 2024
    2 checks passed
    @hussam789
    Copy link
    Collaborator

    @CodiumAI-Agent /improve
    --pr_code_suggestions.num_code_suggestions_per_chunk="5"

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    Bitbucket server uses 2-way diff but Codium AI uses 3-way diff
    4 participants