Confluence scanning: for large sites not all spaces/pages are scanned without error

I use a fork of your n0s1 code to scan our (large) Confluence Cloud instance. Thanks for that, it is very useful.

However, I found that not all spaces are being scanned, even though I got no error message or timeout. I only noticed because a test space I had added was missing from the report. The total scan took about 5 hours. My guess is that the connection was closed at some point during the run, leaving the client object empty. I saw that you recently added error handling and did some refactoring. Strangely, we didn't get any errors, but I will adopt the error handling in any case.
For now, I solved the missing spaces by calling `self._connect()` in `get_data` before every batch of spaces is collected. There may be a better way, but for now this works.

```
    def set_config(self, config):
        # Store the connection settings so _connect() can be called again later.
        self._url = config.get("server", "")
        self._user = config.get("email", "")
        self._password = config.get("token", "")
        self.label_false_positive = config.get("label_false_positive", "cict-no-secrets-confirmed")
        self._connect()
        return self.is_connected()

    def _connect(self):
        from atlassian import Confluence
        if self._user:
            # Email + API token (basic auth)
            self._client = Confluence(url=self._url, username=self._user, password=self._password)
        else:
            # No user given: pass the stored token directly
            self._client = Confluence(url=self._url, token=self._password)
```
And in `get_data`:
```
    def get_data(self, include_comments=False, test=""):
        if not self._client:
            return None, None, None, None, None, None
        start = 0
        limit = 50

        finished = False
        while not finished:
            logging.info(f"Spaces batch: {start} - {start+limit}")
            # reconnect for every batch
            self._connect()
            if not test:
                res = self._client.get_all_spaces(
                    start=start, limit=limit, expand="history"
                )
                start += limit
                spaces = res.get("results", [])
            else:
                key = test
                res = self._client.get_space(key, expand="history")
                finished = True
                spaces = [res]
```
I also added a `test` parameter that restricts the scan to a single space, since the full scan takes such a long time.
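
The snippet above does not show how the loop ends, so here is a minimal sketch of the reconnect-per-batch idea as a standalone generator. It assumes the only reliable end signal is an empty `results` list, and it advances `start` by the number of spaces actually returned in case the server returns fewer than `limit`. The names `iter_space_batches` and `FakeClient` are hypothetical and not part of n0s1; `FakeClient` only stands in for the real client so the loop can be tried without a Confluence connection.

```python
from typing import Callable, Iterator, List


def iter_space_batches(connect: Callable[[], object], limit: int = 50) -> Iterator[List[dict]]:
    """Yield batches of spaces, creating a fresh client before every request."""
    start = 0
    while True:
        client = connect()  # reconnect for every batch
        res = client.get_all_spaces(start=start, limit=limit, expand="history")
        spaces = (res or {}).get("results", [])
        if not spaces:
            break  # an empty page means everything has been listed
        yield spaces
        # Advance by what was actually returned, not by the requested limit,
        # so nothing is skipped if the server returns a shorter page.
        start += len(spaces)


class FakeClient:
    """Hypothetical stand-in for the Confluence client, used only to exercise the loop."""

    def __init__(self, keys):
        self._keys = keys

    def get_all_spaces(self, start=0, limit=50, expand=None):
        return {"results": [{"key": k} for k in self._keys[start:start + limit]]}


if __name__ == "__main__":
    keys = [f"SPACE{i}" for i in range(120)]
    for batch in iter_space_batches(lambda: FakeClient(keys), limit=50):
        print(len(batch), batch[0]["key"])
```

In the real class, `connect` would be something like a function that calls `self._connect()` and returns `self._client`.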

In case it is of interest: another improvement for our use case is a change to the `generic-api-key` rule in config.yaml. We got tons of false positives because this regex matched the Confluence user macro and link macro in combination with 'key'.
```
  - id: generic-api-key
    description: Generic API Key
    regex: >-
      (?i)(?<!ri:user|CDATA\[\<add )(?:key|api|token|secret|client|passwd|password|auth|access)(?:[0-9a-z\-_\t
      .]{0,20})(?:[\s|']|[\s|"]){0,3}(?:=|>|:{1,3}=|\|\|:|<=|=>|:|\?=)(?:'|\"|\s|=|\x60){0,5}([0-9a-z\-_.=]{10,150})(?:['|\"|\n|\r|\s|\x60|;]|$)
```
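
To illustrate what the added negative lookbehind does, here is a small sketch with a deliberately simplified stand-in for the rule (fewer keywords, simpler separators), not the full regex above. It assumes Python's built-in `re` module, which rejects a single lookbehind whose alternatives have different widths, so the two prefixes are written as two separate lookbehinds here. Whether n0s1 compiles its rules with `re` or another engine is not something this sketch verifies. The sample strings are made up.

```python
import re

# Simplified keyword and value parts, loosely modelled on the generic-api-key rule.
KEYWORDS = r"(?:key|token|secret)[0-9a-z\-_\t .]{0,20}"
VALUE = r"(?:=|:)\s{0,3}[\"']?([0-9a-z\-_.=]{10,150})"

# Without the lookbehind: also fires on Confluence macro attributes.
BASELINE = re.compile(r"(?i)" + KEYWORDS + VALUE)

# With two fixed-width lookbehinds: skips keywords directly preceded by
# "ri:user" (user macro) or "CDATA[<add " (link macro).
PATTERN = re.compile(r"(?i)(?<!ri:user)(?<!CDATA\[<add )" + KEYWORDS + VALUE)

if __name__ == "__main__":
    samples = [
        '<ri:user ri:userkey="ff8080814a0b5f2a014a0b6f3c7d0001"/>',
        '<![CDATA[<add key="abcdefghij12345" />',
        'api_key = "abcd1234efgh5678"',
    ]
    for text in samples:
        print(bool(BASELINE.search(text)), bool(PATTERN.search(text)), text)
```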
We also added a method that skips a page when it carries a specific label. Users can add this label to mark a page as a false positive, for example when the secret that was found is only meant as an example.
```
    def is_false_positive(self, page_id):
        # A page counts as a false positive when it carries the configured label.
        labels_json = self._client.get_page_labels(page_id)
        labels = labels_json.get("results", [])
        for label in labels:
            if label["name"] == self.label_false_positive:
                logging.info(f"page {page_id} is false positive due to label {label['name']}")
                return True
        return False
```
And in the method `get_data`:
```
                        for p in pages:
                            comments = []
                            title = p.get("title", "")
                            page_id = p.get("id", "")
                            if self.is_false_positive(page_id):
                                continue
```
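
The label check itself can be separated from the API call, which makes it easy to test without a Confluence connection. A minimal sketch, assuming the response shape used above (a dict with a `results` list of dicts that each have a `name`); `has_label` is a hypothetical helper name, and the sample payload is made up.

```python
def has_label(labels_json, label_name):
    """Return True if the label response contains a label called label_name."""
    # Tolerate an empty or missing response instead of raising.
    results = (labels_json or {}).get("results", [])
    return any(label.get("name") == label_name for label in results)


if __name__ == "__main__":
    sample = {"results": [{"name": "draft"}, {"name": "cict-no-secrets-confirmed"}]}
    print(has_label(sample, "cict-no-secrets-confirmed"))
```

With this, `is_false_positive` would reduce to fetching the labels and passing them to `has_label` together with `self.label_false_positive`. Note that the check still costs one extra request per page, which adds up on a large site.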
In any case, thanks for your code. Hope my comments are useful.
Kind regards,
Mariska
