Description
Is your feature request related to a problem or challenge?
We need to be able to read TaskContext.task_id anywhere a custom data source is invoked. Currently, state: &SessionState is available in TableProvider.scan and task_ctx: Arc<TaskContext> is available in ExecutionPlan.execute, but neither is available in supports_filters_pushdown. This prevents per-query customization or tracking of external state in that method. For example, if there are 3 filters on a custom table and 10 are possible, we need to be able to choose the best one at runtime.
Further, the task_id should always be available, either through the TaskContext or through SessionState, to keep things consistent.
Implementing this proved infeasible because supports_filters_pushdown is defined in 2 interfaces in 2 separate crates: TableProvider (in core) and TableSource (in expr). It is not possible to add state: &SessionState to the TableSource method, because expr cannot depend on the core crate; with the current layout that would create a cyclic dependency. The split was intentional, to make LogicalPlan separable, which makes sense, but it prevents this type of enhancement.
Describe the solution you'd like
Add &SessionState, or at minimum TaskContext, to every pertinent method so that a custom data source can do per-query processing.
A possible way to solve this is to create a new datafusion-traits crate and to move SessionState and other common items to datafusion-common, so that these components can be used by both core and expr. This will make some components available in expr that are not strictly necessary, but I think that is a good trade-off. This work could be combined with other efforts to break core into more sub-crates, which would make DataFusion much more flexible overall.
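To make the idea concrete, here is a rough sketch of the dependency direction being proposed. It is not DataFusion code: the modules common_crate, expr_crate and core_crate stand in for separate crates, and QueryContext, Pushdown, PerTaskSource and the extra ctx parameter are all hypothetical names invented for illustration. The only point is that a small shared trait lets the expr-side method see a task id without depending on core.

```rust
use std::collections::HashMap;

use common_crate::QueryContext;
use expr_crate::{Pushdown, TableSource};

// Stand-in for a shared crate that both `expr` and `core` may depend on.
// `QueryContext` is a hypothetical trait, not an existing DataFusion type.
mod common_crate {
    /// Minimal per-query context exposed to planner-side code.
    pub trait QueryContext {
        fn task_id(&self) -> Option<&str>;
    }
}

// Stand-in for the `expr` crate: it depends on the shared crate only.
mod expr_crate {
    use super::common_crate::QueryContext;

    /// Simplified, hypothetical pushdown answer for one filter.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum Pushdown {
        Unsupported,
        Exact,
    }

    /// Simplified trait; the `ctx` parameter is the proposed addition.
    /// Filters are plain strings here to keep the sketch self-contained.
    pub trait TableSource {
        fn supports_filters_pushdown(
            &self,
            ctx: &dyn QueryContext,
            filters: &[&str],
        ) -> Vec<Pushdown>;
    }
}

// Stand-in for the `core` crate: it implements the shared trait for its
// own session type, so `expr` never has to name `SessionState`.
mod core_crate {
    use super::common_crate::QueryContext;

    /// Simplified stand-in for the real session state.
    pub struct SessionState {
        pub task_id: Option<String>,
    }

    impl QueryContext for SessionState {
        fn task_id(&self) -> Option<&str> {
            self.task_id.as_deref()
        }
    }
}

/// Hypothetical custom source that accepts a different filter per task.
pub struct PerTaskSource {
    /// Maps a task id to the single filter this source evaluates for it.
    pub preferred: HashMap<String, String>,
}

impl TableSource for PerTaskSource {
    fn supports_filters_pushdown(
        &self,
        ctx: &dyn QueryContext,
        filters: &[&str],
    ) -> Vec<Pushdown> {
        // Look up the filter chosen for the current task, if there is one.
        let chosen = ctx.task_id().and_then(|id| self.preferred.get(id));
        filters
            .iter()
            .map(|f| match chosen {
                Some(c) if c.as_str() == *f => Pushdown::Exact,
                _ => Pushdown::Unsupported,
            })
            .collect()
    }
}
```

In this layout core would pass its SessionState as a &dyn QueryContext when calling into the source, so the per-query decision happens at planning time while the dependency still only points from core and expr toward the shared crate.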
Describe alternatives you've considered
No response
Additional context
Restructuring crates in a project of this size will be a lot of work, but I believe the benefit will be there, and other issues would benefit as well. I would recommend a separate restructuring ticket that can be reviewed before any implementation is attempted. In addition, this would need to be implemented by multiple contributors; it will inevitably cause a lot of temporary breakage, and retesting will also be required.