Add s3_url_compatibility_mode to create_secret_functions #82

Closed · wants to merge 76 commits

Conversation

carlopi (Owner) commented Oct 17, 2024
No description provided.

pdet and others added 30 commits September 30, 2024 14:11
…_install)

The idea is that if a `.info` file ends up being corrupt, you will be stuck in a loop like:
duckdb -c "FORCE INSTALL x"
IO Error: Failed to read info file for 'x' extension:
Try reinstalling the extension using 'FORCE INSTALL x;'

And given that in the force_install codepath (that is, either in the case of `FORCE INSTALL` or in
the case of `UPDATE EXTENSIONS`) the info file will be rewritten in any case, I think this is
fine.

Note that we could also just avoid the read (but only in the case of FORCE INSTALL), but I wanted
to avoid yet another slightly different codepath.
…e to detect headers and types. (duckdb#14174)

When globbing multiple CSV files, if a file that would be run with the
adaptive sniffer has only one row, the adaptive sniffer could
misidentify whether that single row is a header or a data row.

This PR extends the checks to identify this case, and adds tests for it.

Fix: duckdb#14166
Fixes duckdb#14294

I ran the test with 1 thread and saw that it consistently yielded 5 files, so
this should be the proper lower bound.
This is a follow-up to duckdb#13712,
using a more fine-granular locking approach.

duckdb#13712 has fixed an issue where checkpoints together with concurrent
queries that referenced the same table multiple times could lead to
deadlocks. The locking approach in that PR used a transaction-level lock
to coordinate lock acquisition for the same table.

For a query with many table scans (of different tables), this could
now potentially block all executor threads if only a single table
lock could not be acquired (e.g. due to a concurrent checkpoint). One
thread would be blocked trying to acquire the checkpoint lock for the
given table, and many others would be blocked on the active_locks_lock (even
though they would only be interested in acquiring the checkpoint lock for
another table that wasn't being checkpointed).
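
To illustrate the difference, here is a minimal sketch of the fine-granular approach (hypothetical names, not DuckDB's actual classes): the shared map mutex is held only long enough to look up the per-table entry, so blocking on one table's checkpoint lock no longer stalls threads that want a different table.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch of fine-granular lock coordination.
struct TableLockEntry {
	std::mutex checkpoint_lock; // held while the table is being checkpointed
};

class TransactionLocks {
public:
	std::unique_lock<std::mutex> LockTable(const std::string &table) {
		TableLockEntry *entry;
		{
			// Short critical section: protects only the map itself.
			std::lock_guard<std::mutex> guard(active_locks_lock);
			auto &slot = active_locks[table];
			if (!slot) {
				slot = std::make_unique<TableLockEntry>();
			}
			entry = slot.get();
		}
		// Waiting on this table's checkpoint lock happens outside the
		// map mutex, so scans of other tables can proceed concurrently.
		return std::unique_lock<std::mutex>(entry->checkpoint_lock);
	}

private:
	std::mutex active_locks_lock;
	std::unordered_map<std::string, std::unique_ptr<TableLockEntry>> active_locks;
};
```
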
…b#14395)

This PR fixes duckdb#14386

Offsets were not used correctly; now they are.
…kdb#14397)

File list materialization can be expensive and should be done only when
necessary.
Clamp the range of shifted defaults/NULLs to the end of the partition,
not the end of the chunk.

fixes: duckdb#14398
fixes: duckdblabs/duckdb-internal#3300
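
As a minimal sketch of the clamp (hypothetical names, not the actual window-operator code), the end of the range that receives defaults/NULLs is bounded by the partition end rather than the chunk end:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch: rows shifted past the input by LEAD/LAG receive the
// default/NULL value only up to the partition boundary, never past it.
uint64_t ClampShiftedEnd(uint64_t shifted_end, uint64_t partition_end) {
	return std::min(shifted_end, partition_end);
}
```
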
bjornasm and others added 28 commits October 22, 2024 16:42
str(summarized_phase.percentage * 100)[:6] gives a misleading result if the number is of the form 1.23e-05 or similar. Changed to round.
The existing code to calculate the percentages shown in the HTML report,
`str(summarized_phase.percentage * 100)[:6]`, gives a misleading result if the
number is of the form 8.09860748592347e-07 or similar. Changed to
`round(summarized_phase.percentage * 100, 4)` to give a correctly rounded
number.


```python
summarized_phase.percentage = 8.09860748592347e-07

str(summarized_phase.percentage * 100)[:6]
# -> '8.0986'  (truncated scientific notation, off by orders of magnitude)

round(summarized_phase.percentage * 100, 4)
# -> 0.0001
```
…_ROW_ID (duckdb#14480)

Updated two instances where LogicalType::BIGINT was mistakenly used
instead of ROW_TYPE when handling the COLUMN_IDENTIFIER_ROW_ID
pseudo-column. This change aligns the code with the rest of the project
for consistency and clarity.
… installing) (duckdb#14272)

The idea is that if a `.info` file ends up being corrupt, you will be stuck
in a loop like:
```
duckdb -c "FORCE INSTALL x"
IO Error: Failed to read info file for 'x' extension: ...
....
Try reinstalling the extension using 'FORCE INSTALL x;'
```

And given that in the force_install codepath (that is, either in the case of
`FORCE INSTALL` or in the case of `UPDATE EXTENSIONS`) the info file will be
rewritten in any case, I think this is OK, and there is no need to notify the
user that this happened.
In the LOAD codepath this will still be visible; we only need a way to
get unblocked (yes, removing the file works, but it's not super cool).

Note that we could also just avoid the read (but only in the case of a
proper `FORCE INSTALL`), but I wanted to avoid yet another slightly
different codepath.
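
A minimal sketch of the described behavior (hypothetical names, not the actual extension-install code): on a force install, a failure to read the existing `.info` file is swallowed, since the file is about to be rewritten anyway.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

struct ExtensionInstallInfo {
	std::string version;
};

// Stub for illustration; assumed to throw on a corrupt or unreadable file.
ExtensionInstallInfo ReadInfoFile(const std::string &path) {
	throw std::runtime_error("Failed to read info file for '" + path + "'");
}

// Hypothetical sketch: during FORCE INSTALL (or UPDATE EXTENSIONS) the info
// file is rewritten regardless, so a corrupt file should not abort the
// install; in other codepaths (e.g. LOAD) the error still surfaces.
std::optional<ExtensionInstallInfo> TryReadInfoFile(const std::string &path, bool force_install) {
	try {
		return ReadInfoFile(path);
	} catch (const std::exception &) {
		if (force_install) {
			return std::nullopt; // will be overwritten by the new install
		}
		throw;
	}
}
```
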
…ls but avoid testing them (duckdb#14510)

Discussed with @Mytherin: Python 3.7 is already end of life; we keep the
support just because it seems weird to remove what currently works
(but it keeps generating noise).
…ning empty lists (duckdb#14538)

This PR fixes duckdb#14497

I can't produce a test for this other than the reproduction from the
issue, which uses pyiceberg.
The problem was obvious though: there are no offsets when the list is
empty, so `offsets[0]` accessed uninitialized memory; it no longer does.
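
A minimal sketch of the shape of the fix (hypothetical names; the actual change is in DuckDB's Arrow list conversion): the offset buffer is only dereferenced when the list is non-empty.

```cpp
#include <cstdint>

// Hypothetical sketch: an empty list has zero offsets, so reading
// offsets[0] would touch uninitialized memory; guard the access instead.
int64_t FirstListOffset(const int64_t *offsets, uint64_t list_count) {
	if (list_count == 0) {
		return 0; // no entries, so any downstream range is empty
	}
	return offsets[0];
}
```
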
…b#14556)

Fix duckdb#14545

Yacc "helpfully" allows you to accidentally define an empty token, which
defaults to returning nothing, in the following way:

```
pivot_header:
	| d_expr	                 			{ $$ = list_make1($1); }
```

This PR fixes the issue and makes the query return a syntax error (as it
should).
…pushdown (duckdb#14553)

This PR fixes duckdb#14344 

I've also included a small change to TableFilter: calling ToString from a
debugger is impossible, as `ToString("c0")` would complain about `c0`
not being a `std::string`.

DebugToString inserts a bogus default column name so the method can be
called.
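
A minimal sketch of what such a wrapper can look like (assumed shape, not the actual TableFilter signature):

```cpp
#include <string>

// Hypothetical sketch of a TableFilter-like interface.
struct TableFilterSketch {
	// Debuggers typically cannot construct a std::string from a literal,
	// so evaluating ToString("c0") in a watch window fails.
	virtual std::string ToString(const std::string &column_name) const = 0;
	virtual ~TableFilterSketch() = default;

	// A no-argument wrapper with a bogus default column name is trivially
	// callable from a debugger.
	std::string DebugToString() const {
		return ToString("c0");
	}
};
```
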
Pre-allocating the hash vector results in a large amount of memory being
consumed by the duckdb::Vector type (70% of the program's memory) and also
results in a 3-fold increase in resident memory usage.
…uckdb#14570)

The `DistinctStatistics` constructor pre-initializes a hash vector of
type `duckdb::Vector` with a size of `sizeof(uint64_t) * 2048` = 16KB. This
pre-initialization ends up consuming a lot of memory, as seen in this
massif output of a program that attaches 50 catalogs, each having 97
columns. We see that 78MB of the 116MB allocated is related to this
vector.

```
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 45 933,253,335,399      113,180,936      112,322,871       858,065            0
 46 933,265,400,713      114,960,272      114,086,884       873,388            0
 47 945,145,815,618      114,955,328      114,082,020       873,308            0
 48 948,725,400,279      116,662,272      115,778,767       883,505            0
99.24% (115,778,767B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->67.65% (78,921,728B) 0xEB0B05: duckdb::VectorBuffer::CreateStandardVector(duckdb::PhysicalType, unsigned long) (in /home/admin/a.out)
| ->67.65% (78,921,728B) 0xEB1F0D: duckdb::Vector::Initialize(bool, unsigned long) (in /home/admin/a.out)
|   ->67.65% (78,921,728B) 0xEB2065: duckdb::Vector::Vector(duckdb::LogicalType, unsigned long) (in /home/admin/a.out)
|     ->59.93% (69,910,528B) 0x2670C4F: duckdb::DistinctStatistics::DistinctStatistics(duckdb::unique_ptr<duckdb::HyperLogLog, std::__1::default_delete<duckdb::HyperLogLog>, true>, unsigned long, unsigned long) (in /home/admin/a.out)
|     | ->59.93% (69,910,528B) 0x26573C6: duckdb::DistinctStatistics::Deserialize(duckdb::Deserializer&) (in /home/admin/a.out)
|     |   ->59.93% (69,910,528B) 0x267340E: duckdb::unique_ptr<duckdb::DistinctStatistics, std::__1::default_delete<duckdb::DistinctStatistics>, true> duckdb::Deserializer::ReadPropertyWithExplicitDefault<duckdb::unique_ptr<duckdb::DistinctStatistics, std::__1::default_delete<duckdb::DistinctStatistics>, true> >(unsigned short, char const*, duckdb::unique_ptr<duckdb::DistinctStatistics, std::__1::default_delete<duckdb::DistinctStatistics>, true>&&) (in /home/admin/a.out)
|     |     ->59.93% (69,910,528B) 0x2670A4E: duckdb::ColumnStatistics::Deserialize(duckdb::Deserializer&) (in /home/admin/a.out)
|     |       ->59.93% (69,910,528B) 0x26AF648: duckdb::TableStatistics::Deserialize(duckdb::Deserializer&, duckdb::ColumnList&) (in /home/admin/a.out)
|     |         ->59.93% (69,910,528B) 0x259D9FC: duckdb::TableDataReader::ReadTableData() (in /home/admin/a.out)
```

After this change, the same program only takes up 36MB and the dominant
memory consumption is from `duckdb::FileBuffer` at 26MB (not surprising
at all).

```
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 45 1,580,151,349,285       31,102,848       30,481,800       621,048            0
 46 1,580,163,374,582       31,273,928       30,640,637       633,291            0
 47 1,642,024,314,218       31,269,416       30,636,181       633,235            0
 48 1,645,707,202,738       36,805,360       36,075,097       730,263            0
98.02% (36,075,097B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->71.81% (26,429,440B) 0xD76F48: duckdb::Allocator::DefaultAllocate(duckdb::PrivateAllocatorData*, unsigned long) (in /home/admin/a.out)
| ->71.81% (26,429,440B) 0xD76FFF: duckdb::Allocator::AllocateData(unsigned long) (in /home/admin/a.out)
|   ->71.78% (26,419,200B) 0xD8ADC3: duckdb::FileBuffer::Resize(unsigned long) (in /home/admin/a.out)
|   | ->71.22% (26,214,400B) 0x255CBE3: duckdb::Block::Block(duckdb::Allocator&, long, unsigned long) (in /home/admin/a.out)
|   | | ->71.22% (26,214,400B) 0x257A571: duckdb::SingleFileBlockManager::CreateBlock(long, duckdb::FileBuffer*) (in /home/admin/a.out)
|   | |   ->71.22% (26,214,400B) 0x2596FBD: duckdb::BlockHandle::Load(duckdb::unique_ptr<duckdb::FileBuffer, std::__1::default_delete<duckdb::FileBuffer>, true>) (in /home/admin/a.out)
|   | |     ->71.22% (26,214,400B) 0x256F56C: duckdb::StandardBufferManager::Pin(duckdb::shared_ptr<duckdb::BlockHandle, true>&) (in /home/admin/a.out)
|   | |       ->71.22% (26,214,400B) 0x2625E40: duckdb::MetadataReader::ReadNextBlock() (in /home/admin/a.out)
```
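
A minimal sketch of the lazy-initialization pattern implied by the fix (hypothetical names, not the actual DistinctStatistics code): the 16KB hash buffer is materialized on first use, so deserializing statistics for thousands of columns no longer pays the allocation up front.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

static constexpr uint64_t HASH_VECTOR_SIZE = 2048;

// Hypothetical sketch of deferring the hash-buffer allocation.
class DistinctStatisticsSketch {
public:
	// The constructor allocates nothing, keeping deserialization cheap.
	DistinctStatisticsSketch() = default;

	void Update(const std::vector<uint64_t> &values) {
		if (!hashes) {
			// The 16KB scratch buffer is created on first use only.
			hashes = std::make_unique<uint64_t[]>(HASH_VECTOR_SIZE);
		}
		// ... hash `values` into `hashes` and update the HyperLogLog ...
	}

private:
	std::unique_ptr<uint64_t[]> hashes;
};
```
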
duckdb#14290 changed the timing field from `operator_timing` to `latency` for
the root node. This commit follows that change to make the query_graph tool
work again.