Commit 0927fdb

iindyk authored and tf-transform-team committed
Adding tf.RaggedTensor support to tft.bucketize, tft.compute_and_apply_vocabulary and related mappers.
PiperOrigin-RevId: 403989982
1 parent 24b98be commit 0927fdb

10 files changed, +503 -324 lines changed

RELEASE.md

Lines changed: 3 additions & 0 deletions
@@ -4,6 +4,9 @@
 
 ## Major Features and Improvements
 
+* Added `tf.RaggedTensor` support to `tft.bucketize`,
+  `tft.compute_and_apply_vocabulary` and related analyzers and mappers.
+
 ## Bug Fixes and Other Changes
 
 * Fix re-loading a transform graph containing pyfuncs exported as a TF1

tensorflow_transform/analyzers.py

Lines changed: 14 additions & 9 deletions
@@ -1632,7 +1632,7 @@ def _register_vocab(sanitized_filename: str,
 # https://github.com/tensorflow/community/blob/master/rfcs/20190116-embedding-partitioned-variable.md#goals
 @common.log_api_use(common.ANALYZER_COLLECTION)
 def vocabulary(
-    x: common_types.TensorType,
+    x: common_types.InputTensorType,
     top_k: Optional[int] = None,
     frequency_threshold: Optional[int] = None,
     vocab_filename: Optional[str] = None,
@@ -1651,7 +1651,7 @@ def vocabulary(
   r"""Computes the unique values of a `Tensor` over the whole dataset.
 
   Computes The unique values taken by `x`, which can be a `Tensor` or
-  `SparseTensor` of any size. The unique values will be aggregated over all
+  `CompositeTensor` of any size. The unique values will be aggregated over all
   dimensions of `x` and all instances.
 
   In case one of the tokens contains the '\n' or '\r' characters or is empty it
@@ -1697,7 +1697,7 @@ def vocabulary(
   within each vocabulary entry (b/117796748).
 
   Args:
-    x: A categorical/discrete input `Tensor` or `SparseTensor` with dtype
+    x: A categorical/discrete input `Tensor` or `CompositeTensor` with dtype
       tf.string or tf.int[8|16|32|64]. The inputs should generally be unique per
       row (i.e. a bag of words/ngrams representation).
     top_k: Limit the generated vocabulary to the first `top_k` elements. If set
@@ -1729,11 +1729,10 @@ def vocabulary(
       dense tensor of the identical shape as x (i.e. element-wise labels).
       Labels should be a discrete integerized tensor (If the label is numeric,
       it should first be bucketized; If the label is a string, an integer
-      vocabulary should first be applied). Note: `SparseTensor` labels are not
-      yet supported (b/134931826). WARNING: When labels are provided, the
-      frequency_threshold argument functions as a mutual information
-      threshold,
-      which is a float. TODO(b/116308354): Fix confusing naming.
+      vocabulary should first be applied). Note: `CompositeTensor` labels are
+      not yet supported (b/134931826). WARNING: When labels are provided, the
+      frequency_threshold argument functions as a mutual information
+      threshold, which is a float. TODO(b/116308354): Fix confusing naming.
     use_adjusted_mutual_info: If true, and labels are provided, calculate
       vocabulary using adjusted rather than raw mutual information.
     min_diff_from_avg: MI (or AMI) of a feature x label will be adjusted to zero
@@ -2174,7 +2173,13 @@ def quantiles(x: tf.Tensor,
     return quantile_boundaries
 
 
-def _quantiles_per_key(x, key, num_buckets, epsilon, name=None):
+def _quantiles_per_key(
+    x: tf.Tensor,
+    key: tf.Tensor,
+    num_buckets: int,
+    epsilon: float,
+    name: Optional[str] = None
+) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor, tf.Tensor, int]:
   """Like quantiles but per-key.
 
   For private use in tf.Transform implementation only.
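
Reviewer note: the `vocabulary` docstring in this diff says unique values are "aggregated over all dimensions of `x` and all instances", and that `top_k` limits the result to the first `top_k` elements. The plain-Python sketch below illustrates what that could mean for ragged input, using nested lists as a stand-in for `tf.RaggedTensor`. It is not the tf.Transform implementation: `flatten` and `sketch_vocabulary` are hypothetical names, and the ordering (most frequent first, ties broken by token) is an assumption made for the example.

```python
from collections import Counter
from typing import Any, Iterable, Iterator, List, Optional


def flatten(values: Iterable[Any]) -> Iterator[Any]:
    """Yields every leaf of an arbitrarily nested (ragged) list."""
    for value in values:
        if isinstance(value, (list, tuple)):
            yield from flatten(value)
        else:
            yield value


def sketch_vocabulary(instances: Iterable[Any],
                      top_k: Optional[int] = None) -> List[Any]:
    """Hypothetical stand-in: unique leaves over all dimensions and instances."""
    counts = Counter(flatten(instances))
    # Assumed ordering: most frequent first, ties broken by token value.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    tokens = [token for token, _ in ordered]
    return tokens if top_k is None else tokens[:top_k]


# Rows have different lengths, as in a ragged batch of tokenized strings.
ragged_batch = [["a", "b", "a"], [], ["c", "a", "b"]]
full_vocab = sketch_vocabulary(ragged_batch)
top_two = sketch_vocabulary(ragged_batch, top_k=2)
```

Empty rows contribute nothing, and deeper nesting is flattened the same way, which is the behavior the "all dimensions" wording suggests.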
