Large columns #643

Open · jsims-slower wants to merge 27 commits into master from large-columns
Commits (27)
30ac707  Fix OrcUtil bug where convertStruct() returns an Array instead of an … (jsims-slower, Nov 10, 2022)
d091270  Fix OrcUtil to correctly handle Kafka Connect Decimals. (jsims-slower, Nov 11, 2022)
b7e5ce4  Fix bug in OrcUtil.convertStruct where the outer TypeInfo is passed t… (jsims-slower, Nov 12, 2022)
b85fe16  Initial JDBC large-column support. Query back into the DB for columns… (jsims-slower, Nov 30, 2022)
947b463  Correctly identify and load LOB columns from the DB, and write them t… (jsims-slower, Nov 30, 2022)
347f233  Add support for Connection reconnect-and-retry. Refactor to fit Confl… (jsims-slower, Dec 1, 2022)
ef10ed4  Cache LOB Hashes in memory (LRU). Restructure code to be less complex… (jsims-slower, Dec 7, 2022)
62b89bb  Make retries configurable. Move Retry logic into the retrySpec class.… (jsims-slower, Dec 8, 2022)
6958cf3  Make retries more intuitive. Make most SQLExceptions be retriable. (jsims-slower, Dec 8, 2022)
351fc49  Refactor some code to be less coupled. Clean up some logging and add … (jsims-slower, Dec 8, 2022)
fd5f059  Improve logging, and related cleanup. (jsims-slower, Dec 9, 2022)
f5d852f  Minor refactor so JdbcTableInfo is less coupled to Record structure. … (jsims-slower, Dec 12, 2022)
bcb848a  Use Hikari for DB Connection pooling and management. Remove custom DB… (jsims-slower, Dec 13, 2022)
2a55a16  Fix code-quality warnings. (jsims-slower, Dec 14, 2022)
3816732  Rename to JdbcColumnInfo, and add equals/hashCode methods. (jsims-slower, Dec 14, 2022)
7a1f8ca  Generify reading data from the original SourceRecord so testing will … (jsims-slower, Dec 14, 2022)
3c59661  Refactor how filtered columns are configured. Make hash-cache size co… (jsims-slower, Dec 16, 2022)
9011272  Make table+column configuration be in line with the source connector … (jsims-slower, Dec 16, 2022)
b6f2b56  Add support for float and double types as primary keys. Rename HashCa… (jsims-slower, Dec 19, 2022)
c1e8a19  Remove repetitive method call in loop. (jsims-slower, Dec 20, 2022)
0b0ff35  Make Hash algorithm dynamic. Add LONG+VARCHAR as potential large colu… (jsims-slower, Dec 23, 2022)
3e15048  Merge remote-tracking branch 'upstream/master' into logical-types (jsims-slower, Dec 29, 2022)
41e9996  Merge branch 'logical-types' into large-columns (jsims-slower, Dec 29, 2022)
1e3eece  Correctly handle null value from transformed record. Add VARCHAR type… (jsims-slower, Dec 30, 2022)
dbb1864  Make HashCache configurable. (jsims-slower, Jan 3, 2023)
972200b  Make HashCache configurable. (jsims-slower, Jan 3, 2023)
bf12a16  Make HashCache configurable. (jsims-slower, Jan 4, 2023)
5 changes: 5 additions & 0 deletions pom.xml
@@ -226,6 +226,11 @@
      <artifactId>kafka-connect-storage-common-avatica-shaded</artifactId>
      <version>${kafka.connect.storage.common.version}</version>
    </dependency>
    <dependency>
      <groupId>com.ibm.db2</groupId>
      <artifactId>jcc</artifactId>
      <version>11.5.8.0</version>
    </dependency>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
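This adds IBM's DB2 JCC driver, which the connector uses to query LOB columns back out of the source database. As a rough illustration of how the driver could be wired through HikariCP (adopted for pooling in commit bcb848a), here is a minimal sketch; the URL, credentials, and pool size are placeholder assumptions, not connector configuration:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class Db2PoolSketch {
  public static HikariDataSource createPool() {
    HikariConfig config = new HikariConfig();
    // Placeholder URL; the JCC driver is discovered via JDBC 4 auto-registration
    config.setJdbcUrl("jdbc:db2://db2-host:50000/MYDB");
    config.setUsername("user");
    config.setPassword("password");
    config.setMaximumPoolSize(4);
    return new HikariDataSource(config);
  }
}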
81 changes: 81 additions & 0 deletions src/main/java/io/confluent/connect/hdfs/jdbc/FilteredColumnToStructVisitor.java
@@ -0,0 +1,81 @@
/*
* Copyright 2018 Confluent Inc.
*
* Licensed under the Confluent Community License (the "License"); you may not use
* this file except in compliance with the License. You may obtain a copy of the
* License at
*
* http://www.confluent.io/confluent-community-license
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OF ANY KIND, either express or implied. See the License for the
* specific language governing permissions and limitations under the License.
*/

package io.confluent.connect.hdfs.jdbc;

import org.apache.kafka.connect.data.Struct;

import java.sql.Blob;
import java.sql.Clob;
import java.sql.SQLException;
import java.sql.SQLXML;

public class FilteredColumnToStructVisitor extends FilteredJdbcColumnVisitor {
private final Struct struct;

public FilteredColumnToStructVisitor(HashCache hashCache,
JdbcTableInfo tableInfo,
String primaryKey,
Struct struct) {
super(hashCache, tableInfo, primaryKey);
this.struct = struct;
}

@Override
public void visit(String columnName, Blob value) throws SQLException {
// TODO: Would be so much better if we could stream this data
// TODO: Write to a disk-buffer first, and then digest()? RocksDB?
byte[] bytes = value != null
? value.getBytes(1L, (int) value.length())
: null;

updateCache(columnName, bytes);

struct.put(columnName, bytes);
}

@Override
public void visit(String columnName, Clob value) throws SQLException {
// TODO: Would be so much better if we could stream this data
// TODO: Write to a disk-buffer first, and then digest()? RocksDB?
String valueStr = value != null
? value.getSubString(1L, (int) value.length())
: null;

updateCache(columnName, valueStr);

struct.put(columnName, valueStr);
}

@Override
public void visit(String columnName, SQLXML value) throws SQLException {
// TODO: Would be so much better if we could stream this data
// TODO: Write to a disk-buffer first, and then digest()? RocksDB?
String valueStr = value != null
? value.getString()
: null;

updateCache(columnName, valueStr);

struct.put(columnName, valueStr);
}

@Override
public void visit(String columnName, String value) {
updateCache(columnName, value);

struct.put(columnName, value);
}
}
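For context on how the overloads above are reached: a hypothetical driver loop (not part of this diff, and assuming JdbcColumnVisitor declares the four visit overloads overridden here) might dispatch ResultSet columns by JDBC type like so:

import java.sql.JDBCType;
import java.sql.ResultSet;
import java.sql.SQLException;

public class VisitorDispatchSketch {
  static void visitColumn(ResultSet resultSet,
                          String columnName,
                          JDBCType jdbcType,
                          JdbcColumnVisitor visitor) throws SQLException {
    switch (jdbcType) {
      case BLOB:
        visitor.visit(columnName, resultSet.getBlob(columnName));
        break;
      case CLOB:
        visitor.visit(columnName, resultSet.getClob(columnName));
        break;
      case SQLXML:
        visitor.visit(columnName, resultSet.getSQLXML(columnName));
        break;
      default:
        // VARCHAR and other character types fall back to the String overload
        visitor.visit(columnName, resultSet.getString(columnName));
    }
  }
}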
61 changes: 61 additions & 0 deletions src/main/java/io/confluent/connect/hdfs/jdbc/FilteredJdbcColumnVisitor.java
@@ -0,0 +1,61 @@
/*
* Copyright 2018 Confluent Inc.
*
* Licensed under the Confluent Community License (the "License"); you may not use
* this file except in compliance with the License. You may obtain a copy of the
* License at
*
* http://www.confluent.io/confluent-community-license
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OF ANY KIND, either express or implied. See the License for the
* specific language governing permissions and limitations under the License.
*/

package io.confluent.connect.hdfs.jdbc;

import java.util.Optional;

public abstract class FilteredJdbcColumnVisitor implements JdbcColumnVisitor {
private final HashCache hashCache;
protected final JdbcTableInfo tableInfo;
protected final String primaryKey;
private int columnsChanged = 0;

public FilteredJdbcColumnVisitor(HashCache hashCache,
JdbcTableInfo tableInfo,
String primaryKey) {
this.hashCache = hashCache;
this.tableInfo = tableInfo;
this.primaryKey = primaryKey;
}

public boolean hasChangedColumns() {
return columnsChanged > 0;
}

protected void updateCache(String columnName, byte[] value) {
boolean columnChanged = Optional
.ofNullable(hashCache)
.map(hashCache_ -> hashCache_.updateCache(tableInfo, primaryKey, columnName, value))
.orElse(true);

// If it has changed, indicate that we should write the new value to HDFS
if (columnChanged) {
columnsChanged++;
}
}

protected void updateCache(String columnName, String value) {
boolean columnChanged = Optional
.ofNullable(hashCache)
.map(hashCache_ -> hashCache_.updateCache(tableInfo, primaryKey, columnName, value))
.orElse(true);

// If it has changed, indicate that we should write the new value to HDFS
if (columnChanged) {
columnsChanged++;
}
}
}
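The changed-column counter is what lets the sink skip redundant HDFS writes. A hypothetical call site (not in this diff) could gate a write on it; note that when hashCache is null, updateCache() above treats every column as changed:

import org.apache.kafka.connect.data.Struct;

public class ChangeGateSketch {
  static boolean shouldWrite(HashCache hashCache,
                             JdbcTableInfo tableInfo,
                             String primaryKey,
                             Struct struct) {
    FilteredColumnToStructVisitor visitor =
        new FilteredColumnToStructVisitor(hashCache, tableInfo, primaryKey, struct);
    // ... feed each filtered LOB column into visitor.visit(...) here ...
    return visitor.hasChangedColumns();
  }
}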
109 changes: 109 additions & 0 deletions src/main/java/io/confluent/connect/hdfs/jdbc/HashCache.java
@@ -0,0 +1,109 @@
/*
* Copyright 2018 Confluent Inc.
*
* Licensed under the Confluent Community License (the "License"); you may not use
* this file except in compliance with the License. You may obtain a copy of the
* License at
*
* http://www.confluent.io/confluent-community-license
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OF ANY KIND, either express or implied. See the License for the
* specific language governing permissions and limitations under the License.
*/

package io.confluent.connect.hdfs.jdbc;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Cache of LOB hashes, for pruning unnecessary HDFS writes.
 * NOTE: None of this is synchronized, as it expects to be called from non-parallel code.
 * NOTE: MD5 hashing takes about 1-2 seconds per gigabyte, at least locally.
 */
public class HashCache {
private static final byte[] EMPTY_HASH = {};

private final Map<JdbcTableInfo, LinkedHashMap<String, Map<String, byte[]>>>
tablePkColumnCache = new HashMap<>();

private final int maxHashSize;
private final MessageDigest messageDigest;

public HashCache(int maxHashSize, MessageDigest messageDigest) {
this.maxHashSize = maxHashSize;
this.messageDigest = messageDigest;
}

public boolean updateCache(JdbcTableInfo tableInfo,
String primaryKey,
String columnName,
String value
) {
byte[] bytes = Optional
.ofNullable(value)
// TODO: Should we hardcode UTF_8 here?
.map(value_ -> value_.getBytes(StandardCharsets.UTF_8))
.orElse(null);

return updateCache(tableInfo, primaryKey, columnName, bytes);
}

/**
 * Update the cache, and return true if the hash has changed.
 */
public boolean updateCache(JdbcTableInfo tableInfo,
String primaryKey,
String columnName,
byte[] value
) {
byte[] hash = Optional
.ofNullable(value)
.map(bytes -> {
messageDigest.reset();
return messageDigest.digest(bytes);
})
.orElse(EMPTY_HASH);

// Get the Cache for the given Table + PK
Map<String, Map<String, byte[]>> pkColumnHashCache = getPkColumnCache(tableInfo);
Map<String, byte[]> columnHashCache = pkColumnHashCache.computeIfAbsent(
primaryKey,
__ -> new HashMap<>()
);

// Check the hash against the cache
byte[] oldHash = columnHashCache.put(columnName, hash);
return !Arrays.equals(hash, oldHash);
}

private LinkedHashMap<String, Map<String, byte[]>> getPkColumnCache(
JdbcTableInfo tableInfo
) {
// The entire tablePkColumnCache map structure is accessed serially,
// so no need to make it threadsafe
return tablePkColumnCache.computeIfAbsent(
tableInfo,
__ -> new LinkedHashMap<String, Map<String, byte[]>>(
100,
.75f,
true
) {
// NOTE: Turns LinkedHashMap into an LRU Cache
@Override
protected boolean removeEldestEntry(
Map.Entry<String, Map<String, byte[]>> eldest
) {
return this.size() > maxHashSize;
}
}
);
}
}
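To make the cache semantics concrete, here is a hedged usage sketch; the cache size, primary key, and column name are illustrative, and JdbcTableInfo is constructed elsewhere in this PR:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashCacheSketch {
  static void example(JdbcTableInfo tableInfo) throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    // Track hashes for up to 10000 primary keys per table (LRU-evicted beyond that)
    HashCache cache = new HashCache(10000, md5);

    boolean changed = cache.updateCache(tableInfo, "pk-42", "XML_DOC", "<doc/>");
    // changed == true: first sighting of this (table, pk, column) triple
    changed = cache.updateCache(tableInfo, "pk-42", "XML_DOC", "<doc/>");
    // changed == false: identical bytes hash identically, so the write can be pruned
  }
}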
87 changes: 87 additions & 0 deletions src/main/java/io/confluent/connect/hdfs/jdbc/JdbcColumnInfo.java
@@ -0,0 +1,87 @@
/*
* Copyright 2018 Confluent Inc.
*
* Licensed under the Confluent Community License (the "License"); you may not use
* this file except in compliance with the License. You may obtain a copy of the
* License at
*
* http://www.confluent.io/confluent-community-license
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OF ANY KIND, either express or implied. See the License for the
* specific language governing permissions and limitations under the License.
*/

package io.confluent.connect.hdfs.jdbc;

import java.sql.JDBCType;
import java.util.Comparator;
import java.util.Objects;

public class JdbcColumnInfo {
private final String name;
private final JDBCType jdbcType;
private final int ordinal;
private final boolean nullable;

public static Comparator<JdbcColumnInfo> byOrdinal =
Comparator.comparingInt(JdbcColumnInfo::getOrdinal);

public JdbcColumnInfo(String name,
JDBCType jdbcType,
int ordinal,
boolean nullable) {
this.name = name;
this.jdbcType = jdbcType;
this.ordinal = ordinal;
this.nullable = nullable;
}

public String getName() {
return name;
}

public JDBCType getJdbcType() {
return jdbcType;
}

public int getOrdinal() {
return ordinal;
}

public boolean isNullable() {
return nullable;
}

@Override
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (!(o instanceof JdbcColumnInfo)) {
return false;
}
JdbcColumnInfo column = (JdbcColumnInfo) o;
return Objects.equals(getName(), column.getName())
&& getJdbcType() == column.getJdbcType()
&& getOrdinal() == column.getOrdinal()
&& isNullable() == column.isNullable();
}

@Override
public int hashCode() {
return Objects.hash(getName(), getJdbcType(), getOrdinal(), isNullable());
}

@Override
public String toString() {
return getClass().getSimpleName()
+ "{"
+ "name='" + name + "'"
+ ", jdbcType=" + jdbcType
+ ", ordinal=" + ordinal
+ ", nullable=" + nullable
+ "}";
}
}
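As a usage note, the byOrdinal comparator restores database column order regardless of discovery order. A small hypothetical example (names and types are illustrative):

import java.sql.JDBCType;
import java.util.ArrayList;
import java.util.List;

public class ColumnOrderSketch {
  public static void main(String[] args) {
    List<JdbcColumnInfo> columns = new ArrayList<>();
    columns.add(new JdbcColumnInfo("DATA", JDBCType.BLOB, 2, true));
    columns.add(new JdbcColumnInfo("ID", JDBCType.INTEGER, 1, false));

    // Sort by the ordinal captured from database metadata
    columns.sort(JdbcColumnInfo.byOrdinal);
    // columns is now [ID (ordinal 1), DATA (ordinal 2)]
  }
}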