Skip to content

Avoid nested unwind_protect() and unsafe string usage #241

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 29 additions & 17 deletions src/ngram.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -11,26 +11,44 @@ fill_one_ngram(const cpp11::strings& x,
const R_xlen_t x_size = x.size();
const R_xlen_t range = std::max(x_size - n + 1, static_cast<R_xlen_t>(0));

// Not strictly necessary to call `unwind_protect()` here because we also
// call it in `cpp11_ngram()`, but it seems like it would be good practice
// to call it here too because this is where we leave cpp11 and use the less
// safe but faster R API directly.
std::string out_elt;

// Using `unwind_protect()` manually because character vector extraction and
// insertion is currently quite slow in cpp11 due to how protection of each
// CHARSXP is handled. We are very careful to:
// - Not use any of the cpp11 API here
// - Avoid exceptions that could be thrown by `std::string` operations
// - Avoid `std::string` creation inside `unwind_protect()`, because its
// destructor would not run if an R error occurred.
cpp11::unwind_protect([&] {
for (R_xlen_t i = 0; i < range; ++i) {
// `x[i]` goes through `r_string`, and that is too expensive because
// generating each `r_string` involves protecting each CHARSXP.
std::string elt = CHAR(STRING_ELT(x, i));
const char* elt = CHAR(STRING_ELT(x, i));

try {
out_elt.clear();
out_elt.assign(elt);
} catch (...) {
Rf_errorcall(R_NilValue, "C++ error while assigning.");
}

for (R_xlen_t j = 1; j < n; ++j) {
const std::string piece = CHAR(STRING_ELT(x, i + j));
elt = elt + delim + piece;
const char* piece = CHAR(STRING_ELT(x, i + j));

try {
out_elt = out_elt + delim + piece;
} catch (...) {
Rf_errorcall(R_NilValue, "C++ error while concatenating.");
}
}

// `out[i] = elt` would approximately do the same thing, but it goes
// through `std::string elt` -> `r_string` before assigning, and that
// again involves protecting each `r_string`, which is expensive and not
// necessary.
SET_STRING_ELT(out, loc, Rf_mkCharLenCE(elt.data(), elt.size(), CE_UTF8));
// We don't expect `.data()` or `.size()` to ever throw exceptions.
SET_STRING_ELT(out, loc, Rf_mkCharLenCE(out_elt.data(), out_elt.size(), CE_UTF8));
++loc;
}
});
Expand Down Expand Up @@ -78,15 +96,9 @@ cpp11_ngram(cpp11::list_of<cpp11::strings> x,
const R_xlen_t x_size = x.size();
cpp11::writable::list_of<cpp11::writable::strings> out(x_size);

// Calling `unwind_protect()` here because each call to `ngram()` allocates
// a `cpp11::writable::strings` vector, and that allocation calls
// `unwind_protect()` too, so `unwind_protect()` would be run `x_size` times.
// Calling it here "turns off" the nested inner calls to it.
cpp11::unwind_protect([&] {
for (R_xlen_t i = 0; i < x_size; ++i) {
out[i] = ngram(x[i], n, n_min, delim);
}
});
for (R_xlen_t i = 0; i < x_size; ++i) {
out[i] = ngram(x[i], n, n_min, delim);
}

return(out);
}