Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional options for APIDataSet (e.g. proxies) #711

Closed
lucasjamar opened this issue Mar 4, 2021 · 13 comments
Closed

Additional options for APIDataSet (e.g. proxies) #711

lucasjamar opened this issue Mar 4, 2021 · 13 comments
Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature pinned Issue shouldn't be closed by stale bot

Comments

@lucasjamar
Copy link
Contributor

Description

Add proxies parameter to APIDataSet. Perhaps other options could also interesting?

Context

When comparing the list of options available from requests with those available in the APIDataSet, there are few missing that might be interesting to add.
Personally, the most interesting to have would be the proxies option as discussed here. Before I used to set my proxies in my global env variables but I stumble upon cases where my pipeline has two APIs requiring two different proxies. Having the proxies options in the data catalog would have the added benefit that the proxies could be loaded using the TemplatedConfigLoader, thereby keeping the proxies confidential in the credentials.yml.

Other than proxies, I was wondering if there are other options that users might find to be of interest? Should we consider adding all options available in requests that are not yet available in the APIDataSet so that users have full flexibility?

Possible Implementation

I would like to try solving this one if thats ok. Hopefully it requires just adding these options in extras/datasets/api/APIDataSet.py. I am not certain if these options would require extra tests. Any inputs?

@lucasjamar lucasjamar added the Issue: Feature Request New feature or improvement to existing feature label Mar 4, 2021
@antonymilne
Copy link
Contributor

antonymilne commented Mar 15, 2021

Thanks for opening the issue, and for your enthusiasm in trying to solve it! This sounds like an eminently reasonable idea to me, but would definitely appreciate the input from people who know APIDataSet better or who were involved with its initial addition to kedro - maybe @limdauto?

Just to be clear, here's where we currently stand. Below are all the options available in requests:

Available in kedro:

- url
- params
- data
- json (added in #424 after original implementation)
- headers
- auth
- timeout

Not available in kedro:

- cookies
- files
- allow_redirects
- proxies
- verify
- stream
- cert

Personally I don't know which of these options are or aren't important, but I don't see why we shouldn't support all of them. However, I do wonder whether our current implementation would be the correct way to do this (having many different options in the __init__ method). Should we instead treat these as load_args? This way we could support all the options without needing to explicitly list and document them.

pull bot pushed a commit to vishalbelsare/kedro that referenced this issue Apr 4, 2021
@stale
Copy link

stale bot commented May 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 14, 2021
@lorenabalan lorenabalan added pinned Issue shouldn't be closed by stale bot and removed stale labels May 19, 2021
@merelcht merelcht added the Community Issue/PR opened by the open-source community label Mar 7, 2022
@merelcht
Copy link
Member

Hi @lucasjamar, is this still a feature you're interested in? If so would you be able to make a contribution PR?

@lucasjamar
Copy link
Contributor Author

Hi,

Definitely useful. As @AntonyMilneQB mentionned, should we just wrap up the other options inside load_args or add them like the options?

@merelcht
Copy link
Member

Definitely useful. As @AntonyMilneQB mentionned, should we just wrap up the other options inside load_args or add them like the options?

I'd be in favour of wrapping them in load_args instead of adding each option in the __init__

@daBlesr
Copy link
Contributor

daBlesr commented Jun 17, 2022

Hi, I can pick this one up. The API specification of the requests library is the following:

def request(method, url, **kwargs): ...

Hence, my proposal is to keep the interface (almost) same in the data catalog: method, url, load_args.

Since authentication while using the requests library requires BaseAuth instances and this is not possible from a catalog, I'd like to add the parameter auth.
This auth parameter allows to configure complex authentication instances.

if auth is not None:
    if 'type' not in auth:
        raise DataSetError("'type' is missing from Authenticator configuration")
    auth_type = auth.pop('type')
    self._auth = BaseAuthFactory.create(auth_type, **auth, **credentials)

There is a PR open #1445 for this implementation, but since my implementation is quite deviant from the current implementation, I'd like to open a new one.

Also fyi @afuetterer as you are working on the subclass of the general API class.

@deepyaman
Copy link
Member

@daBlesr Sounds fine at a high level. Not entirely sure how you're planning to handle auth? If it's not an object, the default handling seems to be user/password as in https://github.com/psf/requests/blob/2a6f290bc09324406708a4d404a88a45d848ddf9/requests/models.py#L589-L609, and the current APIDataSet implementation also looks to be geared towards creating an auth object from tuple, if possible.

@daBlesr
Copy link
Contributor

daBlesr commented Jun 18, 2022

Hi @deepyaman, I think since most commonly credentials are provided in dictionary type, it would be better to stay consistent with that across the project. Hence, in my approach I would not support tuples.

For instance, if you want to support access tokens, you can currently write something like this:

url: example.com
method: GET
auth_type: tests.extras.datasets.api.test_api_auth.AccessTokenAuth 
credentials: access_token_cred
load_args:
    headers: 
        some_header: some_value

-- credentials.yaml
access_token_cred:
    token: abc

Custom Access Token class:

class AccessTokenAuth(AuthBase):
    """Attaches Access Token Authentication to the given Request object."""

    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        r.headers['Authorization'] = f'access_token {self.token}'
        return r

In my tests I have verified that indeed the following is the case:

requests_mocker.last_request.headers['Authorization'] == 'access_token abc' 

The default for auth_type is requests.auth.HTTPBasicAuth, so username and password are generally expected, if credentials are actually provided. Let me know what you think.

@afuetterer
Copy link

Thank you for continuing this PR, @daBlesr. I am looking forward for your update on the "base" APIDataSet. I will gladly adapt my subclass to your approach here.

A question: The APIDataSet s main use case is to make a single request in a pipeline run, right? It would not make sense to instantiate a requests.Session` in this class, right? It would be more consistent with the subclass, that is used for multiple API calls and therefore needs a Session object. If you adapt the "base" class here, I thought I might ask.

@daBlesr
Copy link
Contributor

daBlesr commented Jun 19, 2022

In the general case, I think it is not very beneficial to create a session for the user (and expose it), as sessions should be closed at some point. However, it would not hurt to create a (non-exposed) session in this class and autoclose it afterwards (which you would be able to use), since sessions are created internally in the requests library itself for one-offs as well.

@afuetterer
Copy link

Thanks for your input.

I tried exactly that already in the paginated api dataset.
The session is not exposed. It is instantiated in the dataset's init method as _session. Headers and auth are set on the session object. I close it with finally in the execute_request method and at the end of the load method.

Is that what you meant?

@antonymilne
Copy link
Contributor

Auth

@deepyaman @daBlesr Just for context, the current behaviour of converting to tuple was added fairly recently: #1101 (PR #1105). Actually I raised exactly the same point @daBlesr makes there but decided it was ok: "I just have one small hesitation: credentials is usually a kwargs dictionary rather than a list.".

The case of an API token in headers, which is not currently supported by APIDataSet, certainly sounds very valid, and actually has also come up before: #1523. As you can see this is currently possible through a custom MyAPIDataSet. @daBlesr's solution is very much more elegant than this, but I don't think we should implement it straight away. The whole topic of credentials in kedro is something that needs a re-think and I believe we need a more holistic solution here rather than something that's specific to APIDataSet. e.g. ideally it would also handle other credentials-related issues: #1280 #1621. I intend to make an issue on this shortly and will very much welcome your thoughts on it there! The pattern of "provide auth_type class in data catalog" certainly sounds like one possible solution here.

So for now let's leave auth out of it and try and brainstorm some more general solutions in a separate issue to be created soon.

Session

@afuetterer @daBlesr Agree with all that's said above that there's no need to use session in the base APIDataSet since this is exactly what requests.requests does underneath:

    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

... but equally I don't see any harm in using a session for this in APIDataSet.

Also agree that using a session for PaginatedAPIDataSet does make sense since it's not just a one-off request. I don't mind whether this lives in the base APIDataSet or in the Paginated one alone - whatever seems neatest overall. Happy to go with whatever you guys think is best here, and I plan to review the paginated PR very soon 🙂

@merelcht
Copy link
Member

Completed in #1633

@merelcht merelcht moved this to Done in Kedro Framework Sep 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature pinned Issue shouldn't be closed by stale bot
Projects
Archived in project
Development

No branches or pull requests

7 participants