Add s3_url_compatibility_mode to create_secret_functions #82

Closed · wants to merge 76 commits

Conversation

carlopi (Owner) commented Oct 17, 2024
No description provided.

pdet and others added 30 commits September 30, 2024 14:11
…_install)

The idea is that if a `.info` file ends up being corrupt, you will be stuck in a loop like:
duckdb -c "FORCE INSTALL x"
IO Error: Failed to read info file for 'x' extension:
Try reinstalling the extension using 'FORCE INSTALL x;'

And given that in the force_install codepath (that is, either in the case of `FORCE INSTALL` or in
the case of `UPDATE EXTENSIONS`) the info file will be rewritten in any case, I think this is
fine.

Note that we could also just avoid the read (but only in the case of FORCE INSTALL), but I wanted
to avoid yet another slightly different codepath.
…e to detect headers and types. (duckdb#14174)

When globbing multiple CSV files, if a file that would be run with the
adaptive sniffer has only one row, the adaptive sniffer could
misidentify whether that single row is a header or a data row.

This PR extends the checks to identify this case, and adds tests for it.

Fix: duckdb#14166
Fixes duckdb#14294

I ran the test with 1 thread and saw that it consistently yielded 5 files, so
this should be the proper lower bound.
This is a follow-up to duckdb#13712,
using a more fine-granular locking approach.

duckdb#13712 has fixed an issue where checkpoints together with concurrent
queries that referenced the same table multiple times could lead to
deadlocks. The locking approach in that PR used a transaction-level lock
to coordinate lock acquisition for the same table.

For a query with many table scans (of different tables), this could
now potentially block all executor threads if only a single table
lock could not be acquired (e.g. due to a concurrent checkpoint). One
thread would be blocked trying to acquire the checkpoint lock for the
given table, and many others would be blocked on the active_locks_lock (even
though they would only be interested in acquiring the checkpoint lock for
another table that wasn't being checkpointed).
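
To illustrate the difference, here is a minimal sketch of the fine-granular approach (hypothetical names, not DuckDB's actual classes): the shared map mutex is held only long enough to look up the per-table entry, so blocking on one table's checkpoint lock no longer stalls threads that want a different table.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch of fine-granular lock coordination.
struct TableLockEntry {
	std::mutex checkpoint_lock; // held while the table is being checkpointed
};

class TransactionLocks {
public:
	std::unique_lock<std::mutex> LockTable(const std::string &table) {
		TableLockEntry *entry;
		{
			// Short critical section: protects only the map itself.
			std::lock_guard<std::mutex> guard(active_locks_lock);
			auto &slot = active_locks[table];
			if (!slot) {
				slot = std::make_unique<TableLockEntry>();
			}
			entry = slot.get();
		}
		// Waiting on this table's checkpoint lock happens outside the
		// map mutex, so scans of other tables can proceed concurrently.
		return std::unique_lock<std::mutex>(entry->checkpoint_lock);
	}

private:
	std::mutex active_locks_lock;
	std::unordered_map<std::string, std::unique_ptr<TableLockEntry>> active_locks;
};
```
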
…b#14395)

This PR fixes duckdb#14386

Offsets were not used correctly; now they are.
…kdb#14397)

File list materialization can be expensive and should be done only when
necessary.
Clamp the range of shifted defaults/NULLs to the end of the partition,
not the end of the chunk.

fixes: duckdb#14398
fixes: duckdblabs/duckdb-internal#3300
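
As a minimal sketch of the clamp (hypothetical names, not the actual window-operator code), the end of the range that receives defaults/NULLs is bounded by the partition end rather than the chunk end:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch: rows shifted past the input by LEAD/LAG receive the
// default/NULL value only up to the partition boundary, never past it.
uint64_t ClampShiftedEnd(uint64_t shifted_end, uint64_t partition_end) {
	return std::min(shifted_end, partition_end);
}
```
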
bjornasm and others added 28 commits October 22, 2024 16:42
str(summarized_phase.percentage * 100)[:6] gives a misleading result if the number is of the form 1.23e-05 or similar. Changed to round.
The existing code to calculate the percentages shown in the HTML report,
`str(summarized_phase.percentage * 100)[:6]`, gives a misleading result if the
number is of the form 8.09860748592347e-07 or similar. Changed to
`round(summarized_phase.percentage * 100, 4)` to give a correctly rounded
number.


```python
summarized_phase.percentage = 8.09860748592347e-07

str(summarized_phase.percentage * 100)[:6]
# -> '8.0986'  (truncated scientific notation, off by orders of magnitude)

round(summarized_phase.percentage * 100, 4)
# -> 0.0001
```
…_ROW_ID (duckdb#14480)

Updated two instances where LogicalType::BIGINT was mistakenly used
instead of ROW_TYPE when handling the COLUMN_IDENTIFIER_ROW_ID
pseudo-column. This change aligns the code with the rest of the project
for consistency and clarity.
… installing) (duckdb#14272)

The idea is that if a `.info` file ends up being corrupt, you will be stuck
in a loop like:
```
duckdb -c "FORCE INSTALL x"
IO Error: Failed to read info file for 'x' extension: ...
....
Try reinstalling the extension using 'FORCE INSTALL x;'
```

And given that in the force_install codepath (that is, either in the case of
`FORCE INSTALL` or in the case of `UPDATE EXTENSIONS`) the info file will be
rewritten in any case, I think this is OK, and there is no need to notify the
user that this happened.
In the LOAD codepath this will still be visible; we only need a way to
get unblocked (yes, removing the file works, but it's not super cool).

Note that we could also just avoid the read (but only in the case of a
proper `FORCE INSTALL`), but I wanted to avoid yet another slightly
different codepath.
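
A minimal sketch of the described behavior (hypothetical names, not the actual extension-install code): on a force install, a failure to read the existing `.info` file is swallowed, since the file is about to be rewritten anyway.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

struct ExtensionInstallInfo {
	std::string version;
};

// Stub for illustration; assumed to throw on a corrupt or unreadable file.
ExtensionInstallInfo ReadInfoFile(const std::string &path) {
	throw std::runtime_error("Failed to read info file for '" + path + "'");
}

// Hypothetical sketch: during FORCE INSTALL (or UPDATE EXTENSIONS) the info
// file is rewritten regardless, so a corrupt file should not abort the
// install; in other codepaths (e.g. LOAD) the error still surfaces.
std::optional<ExtensionInstallInfo> TryReadInfoFile(const std::string &path, bool force_install) {
	try {
		return ReadInfoFile(path);
	} catch (const std::exception &) {
		if (force_install) {
			return std::nullopt; // will be overwritten by the new install
		}
		throw;
	}
}
```
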
…ls but avoid testing them (duckdb#14510)

Discussed with @Mytherin: Python 3.7 is already end of life; we keep the
support just because it seems weird to remove what currently works
(but it keeps generating noise).
…ning empty lists (duckdb#14538)

This PR fixes duckdb#14497

I can't produce a test for this other than the reproduction from the
issue, which uses pyiceberg.
The problem was obvious though: there are no offsets when the list is
empty, so `offsets[0]` accessed uninitialized memory; it no longer does.
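
A minimal sketch of the shape of the fix (hypothetical names; the actual change is in DuckDB's Arrow list conversion): the offset buffer is only dereferenced when the list is non-empty.

```cpp
#include <cstdint>

// Hypothetical sketch: an empty list has zero offsets, so reading
// offsets[0] would touch uninitialized memory; guard the access instead.
int64_t FirstListOffset(const int64_t *offsets, uint64_t list_count) {
	if (list_count == 0) {
		return 0; // no entries, so any downstream range is empty
	}
	return offsets[0];
}
```
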
…b#14556)

Fix duckdb#14545

Yacc "helpfully" allows you to accidentally define an empty token, which
defaults to returning nothing, in the following way:

```
pivot_header:
	| d_expr	                 			{ $$ = list_make1($1); }
```

This PR fixes the issue and makes the query return a syntax error (as it
should).
…pushdown (duckdb#14553)

This PR fixes duckdb#14344 

I've also included a small change to TableFilter: calling ToString from a
debugger is impossible, as `ToString("c0")` would complain about `c0`
not being a `std::string`.

DebugToString inserts a bogus default column name so the method can be
called.
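
A minimal sketch of what such a wrapper can look like (assumed shape, not the actual TableFilter signature):

```cpp
#include <string>

// Hypothetical sketch of a TableFilter-like interface.
struct TableFilterSketch {
	// Debuggers typically cannot construct a std::string from a literal,
	// so evaluating ToString("c0") in a watch window fails.
	virtual std::string ToString(const std::string &column_name) const = 0;
	virtual ~TableFilterSketch() = default;

	// A no-argument wrapper with a bogus default column name is trivially
	// callable from a debugger.
	std::string DebugToString() const {
		return ToString("c0");
	}
};
```
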
Pre-allocating the hash vector results in a large amount of memory being
consumed by the duckdb::Vector type (70% of the program's memory) and also
results in a 3-fold increase in resident memory usage.
…uckdb#14570)

The `DistinctStatistics` constructor pre-initializes a hash vector of
type `duckdb::Vector` with a size of `sizeof(uint64_t) * 2048` = 16KB. This
pre-initialization ends up consuming a lot of memory, as seen in this
massif output of a program that attaches 50 catalogs, each having 97
columns. We see that 78MB of the 116MB allocated is related to this
vector.

```
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 45 933,253,335,399      113,180,936      112,322,871       858,065            0
 46 933,265,400,713      114,960,272      114,086,884       873,388            0
 47 945,145,815,618      114,955,328      114,082,020       873,308            0
 48 948,725,400,279      116,662,272      115,778,767       883,505            0
99.24% (115,778,767B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->67.65% (78,921,728B) 0xEB0B05: duckdb::VectorBuffer::CreateStandardVector(duckdb::PhysicalType, unsigned long) (in /home/admin/a.out)
| ->67.65% (78,921,728B) 0xEB1F0D: duckdb::Vector::Initialize(bool, unsigned long) (in /home/admin/a.out)
|   ->67.65% (78,921,728B) 0xEB2065: duckdb::Vector::Vector(duckdb::LogicalType, unsigned long) (in /home/admin/a.out)
|     ->59.93% (69,910,528B) 0x2670C4F: duckdb::DistinctStatistics::DistinctStatistics(duckdb::unique_ptr<duckdb::HyperLogLog, std::__1::default_delete<duckdb::HyperLogLog>, true>, unsigned long, unsigned long) (in /home/admin/a.out)
|     | ->59.93% (69,910,528B) 0x26573C6: duckdb::DistinctStatistics::Deserialize(duckdb::Deserializer&) (in /home/admin/a.out)
|     |   ->59.93% (69,910,528B) 0x267340E: duckdb::unique_ptr<duckdb::DistinctStatistics, std::__1::default_delete<duckdb::DistinctStatistics>, true> duckdb::Deserializer::ReadPropertyWithExplicitDefault<duckdb::unique_ptr<duckdb::DistinctStatistics, std::__1::default_delete<duckdb::DistinctStatistics>, true> >(unsigned short, char const*, duckdb::unique_ptr<duckdb::DistinctStatistics, std::__1::default_delete<duckdb::DistinctStatistics>, true>&&) (in /home/admin/a.out)
|     |     ->59.93% (69,910,528B) 0x2670A4E: duckdb::ColumnStatistics::Deserialize(duckdb::Deserializer&) (in /home/admin/a.out)
|     |       ->59.93% (69,910,528B) 0x26AF648: duckdb::TableStatistics::Deserialize(duckdb::Deserializer&, duckdb::ColumnList&) (in /home/admin/a.out)
|     |         ->59.93% (69,910,528B) 0x259D9FC: duckdb::TableDataReader::ReadTableData() (in /home/admin/a.out)
```

After this change, the same program only takes up 36MB and the dominant
memory consumption is from `duckdb::FileBuffer` at 26MB (not surprising
at all).

```
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 45 1,580,151,349,285       31,102,848       30,481,800       621,048            0
 46 1,580,163,374,582       31,273,928       30,640,637       633,291            0
 47 1,642,024,314,218       31,269,416       30,636,181       633,235            0
 48 1,645,707,202,738       36,805,360       36,075,097       730,263            0
98.02% (36,075,097B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->71.81% (26,429,440B) 0xD76F48: duckdb::Allocator::DefaultAllocate(duckdb::PrivateAllocatorData*, unsigned long) (in /home/admin/a.out)
| ->71.81% (26,429,440B) 0xD76FFF: duckdb::Allocator::AllocateData(unsigned long) (in /home/admin/a.out)
|   ->71.78% (26,419,200B) 0xD8ADC3: duckdb::FileBuffer::Resize(unsigned long) (in /home/admin/a.out)
|   | ->71.22% (26,214,400B) 0x255CBE3: duckdb::Block::Block(duckdb::Allocator&, long, unsigned long) (in /home/admin/a.out)
|   | | ->71.22% (26,214,400B) 0x257A571: duckdb::SingleFileBlockManager::CreateBlock(long, duckdb::FileBuffer*) (in /home/admin/a.out)
|   | |   ->71.22% (26,214,400B) 0x2596FBD: duckdb::BlockHandle::Load(duckdb::unique_ptr<duckdb::FileBuffer, std::__1::default_delete<duckdb::FileBuffer>, true>) (in /home/admin/a.out)
|   | |     ->71.22% (26,214,400B) 0x256F56C: duckdb::StandardBufferManager::Pin(duckdb::shared_ptr<duckdb::BlockHandle, true>&) (in /home/admin/a.out)
|   | |       ->71.22% (26,214,400B) 0x2625E40: duckdb::MetadataReader::ReadNextBlock() (in /home/admin/a.out)
```
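
A minimal sketch of the lazy-initialization pattern implied by the fix (hypothetical names, not the actual DistinctStatistics code): the 16KB hash buffer is materialized on first use, so deserializing statistics for thousands of columns no longer pays the allocation up front.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

static constexpr uint64_t HASH_VECTOR_SIZE = 2048;

// Hypothetical sketch of deferring the hash-buffer allocation.
class DistinctStatisticsSketch {
public:
	// The constructor allocates nothing, keeping deserialization cheap.
	DistinctStatisticsSketch() = default;

	void Update(const std::vector<uint64_t> &values) {
		if (!hashes) {
			// The 16KB scratch buffer is created on first use only.
			hashes = std::make_unique<uint64_t[]>(HASH_VECTOR_SIZE);
		}
		// ... hash `values` into `hashes` and update the HyperLogLog ...
	}

private:
	std::unique_ptr<uint64_t[]> hashes;
};
```
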
duckdb#14290 changed the timing field from `operator_timing` to `latency` for
the root node. This commit follows that change to make the query_graph tool
work again.