Conversation


@wesm wesm commented Feb 13, 2019

I admit this feels gross, but it's less gross than what was there before. I can do some more cleanup, but I wanted to get feedback before spending any more time on it.

So, the problem partially lies with the gRPC C++ library. The obvious approach, and the first thing I tried, was to specialize SerializationTraits<protocol::FlightData> and cast between FlightData and protocol::FlightData (the proto) at the last possible moment. Unfortunately, that does not seem to be possible because of this:

https://github.com/grpc/grpc/blob/master/include/grpcpp/impl/codegen/proto_utils.h#L100

So I had to override that Googly hack and resort to some shenanigans (see protocol.h/protocol.cc) to make sure the same templates are always visible both in Flight.grpc.pb.cc and in our client.cc/server.cc.
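The difficulty comes from how the gRPC traits pattern dispatches serialization. A self-contained mock of that pattern (the names Message, SomeProto, FlightData, and the string return values here are purely illustrative, not the real gRPC API) shows why a full specialization for a non-protobuf type must be visible wherever the templates are instantiated:

```cpp
#include <string>
#include <type_traits>

// Stand-in for grpc::protobuf::Message.
struct Message {};

// Primary template: declared but intentionally not defined, so any type
// without a matching specialization fails to compile.
template <class T, class Enable = void>
struct SerializationTraits;

// gRPC's proto_utils.h provides roughly this shape of partial
// specialization, which captures every type derived from Message.
template <class T>
struct SerializationTraits<
    T, typename std::enable_if<std::is_base_of<Message, T>::value>::type> {
  static std::string Serialize(const T&) { return "protobuf path"; }
};

// Our own non-protobuf payload type with a full specialization. This only
// takes effect in translation units where the specialization is visible.
struct FlightData {};

template <>
struct SerializationTraits<FlightData, void> {
  static std::string Serialize(const FlightData&) { return "zero-copy path"; }
};

struct SomeProto : Message {};

inline std::string SerializeProto() {
  return SerializationTraits<SomeProto>::Serialize(SomeProto{});
}
inline std::string SerializeFlight() {
  return SerializationTraits<FlightData>::Serialize(FlightData{});
}
```

If a translation unit instantiates the templates without seeing the FlightData specialization, it silently gets (or fails to find) a different implementation, which is the visibility problem protocol.h/protocol.cc works around.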

…_cast on the client and server C++ types

Change-Id: I19b28f112c52cc658b4f37f9516607b85738726c

wesm commented Feb 13, 2019

cc @pitrou @lihalite


pitrou commented Feb 13, 2019

Perhaps it would be useful to ping the gRPC community to find out whether this is idiomatic.

if (!custom_writer->grpc::ClientWriter<IpcPayload>::Write(payload, grpc::WriteOptions())) {

if (!writer_->Write(*reinterpret_cast<const pb::FlightData*>(&payload),

Basically we are praying that Write doesn't do anything with the FlightData pointer except pass it to SerializationTraits<FlightData>, right?



#include "arrow/flight/protocol.h"

#include "arrow/flight/Flight.grpc.pb.cc" // NOLINT

So the trick of protocol.cc and protocol.h is to make sure you can only get the generated Protobuf code along with our template specialization?


Right


It appears that the linker (at least with gcc on Linux) was seeing the vtable generated in Flight.grpc.pb.cc and overriding the ones in server.cc/client.cc. This was the only way I could make sure that the same code is generated everywhere.
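The single-point-of-inclusion trick can be sketched as a file layout (a simplified illustration based on the snippets in this review; the real declarations live in arrow/flight/protocol.h and protocol.cc):

```cpp
// protocol.h (sketch): the only header the rest of Flight should use to
// reach the generated protobuf types. It declares our SerializationTraits
// specializations, so every translation unit that includes it sees the
// same templates:
//
//   #include "arrow/flight/Flight.pb.h"
//   namespace grpc {
//   template <>
//   class SerializationTraits<arrow::flight::protocol::FlightData> { ... };
//   }  // namespace grpc
//
// protocol.cc (sketch): compiles the generated service code in the same
// translation unit as the specializations, so the linker cannot pick up a
// vtable or template instantiation produced without them:
//
//   #include "arrow/flight/protocol.h"
//   #include "arrow/flight/Flight.grpc.pb.cc"  // NOLINT
```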


I think I'd appreciate a comment about that just so that someone else looking at it isn't confused.


done

@ghost ghost left a comment


This feels icky, but less icky than before.

grpc/grpc#15764 is the upstream issue we need, I think (it'd be ideal if whatever API they come up with eventually just gives us a byte stream/sink).


pitrou commented Feb 13, 2019

@lihalite Interesting.
Apparently, https://github.com/tensorflow/tensorflow/blob/ced4fddc0400fedce8ba15fb9a3e6e765ddda757/tensorflow/core/distributed_runtime/rpc/grpc_tensor_coding.cc#L171 shows how to do actual zero-copy writes, i.e. write directly from application data to the socket, by creating a bunch of slices that point to application data and chaining them in the ByteBuffer (which can contain any number of slices).


ghost commented Feb 13, 2019

@pitrou That is awesome, maybe we should do that in a follow-up.


wesm commented Feb 13, 2019

@pitrou I was going to propose that on the e-mail thread, but you found it already =) Composing a slice from smaller zero-copy constituent slices is the way to avoid memcpy there. The only annoyance is that we'll have to add padding slices, but c'est la vie.
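The idea of chaining non-owning slices, with small zero-filled padding slices interleaved to keep each buffer 8-byte aligned, can be sketched in plain C++ (SliceView, AppendWithPadding, and TotalSize are illustrative stand-ins, not the gRPC Slice/ByteBuffer API):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Non-owning view of application memory; queuing one copies no payload bytes,
// mimicking a zero-copy gRPC slice.
struct SliceView {
  const uint8_t* data;
  size_t size;
};

// Shared zero-filled buffer backing all padding slices (the Java side does
// something similar with its PADDING_BUFFERS table).
static const uint8_t kZeros[8] = {0};

// Append `buf` to the slice chain, followed by a padding slice that rounds
// this buffer's contribution up to the next multiple of 8 bytes.
inline void AppendWithPadding(std::vector<SliceView>* out, const uint8_t* buf,
                              size_t size) {
  out->push_back({buf, size});
  size_t remainder = size % 8;
  if (remainder != 0) {
    out->push_back({kZeros, 8 - remainder});
  }
}

// Total on-the-wire size of the chained slices.
inline size_t TotalSize(const std::vector<SliceView>& slices) {
  size_t total = 0;
  for (const auto& s : slices) total += s.size;
  return total;
}
```

A 13-byte buffer thus contributes two slices (13 payload bytes plus a 3-byte padding slice), while an already-aligned buffer contributes only itself.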


ghost commented Feb 13, 2019

FWIW, the Java side also adds padding slices (it was one of the incompatibilities between the two implementations)

// [ARROW-4213] These buffers must be aligned to an 8-byte boundary in order to be readable from C++.
if (b.readableBytes() % 8 != 0) {
  int paddingBytes = 8 - (b.readableBytes() % 8);
  assert paddingBytes > 0 && paddingBytes < 8;
  size += paddingBytes;
  allBufs.add(PADDING_BUFFERS.get(paddingBytes).retain());
}


pitrou commented Feb 13, 2019

Hmm... why do we need padding slices?

Change-Id: I2f8d64eee08c3eb68eb38da64c4729f68cd0c466

wesm commented Feb 13, 2019

I commented on the gRPC thread


wesm commented Feb 13, 2019

@pitrou a buffer's size may not be a multiple of 8, but a buffer should already be padded, so we should be able to round the buffer size up to the next multiple of 8 and avoid the padding slices:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/serialization-internal.h#L302
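The round-up itself is a one-liner; a minimal sketch (PaddedLength is an illustrative name, not the function in serialization-internal.h). The assumption being relied on is that the allocator already guarantees the bytes in [size, padded length) exist, so they can be sent directly instead of appending a padding slice:

```cpp
#include <cstdint>

// Round a buffer length up to the next multiple of 8. (size + 7) & ~7 clears
// the low three bits after bumping past the next boundary.
inline int64_t PaddedLength(int64_t size) {
  return (size + 7) & ~static_cast<int64_t>(7);
}
```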


pitrou commented Feb 13, 2019

Right... but the "padding" won't necessarily be zero-filled (for example, when taking a slice of a buffer). Is that a problem?


pitrou commented Feb 13, 2019

I looked at Tensorflow's use of gRPC and it seems they don't do zero-copy on the receiving side. For example with the SendTensor method in eager_service.proto, it seems the server will use regular deserialization for the tensor bytes data sent by the client.

Also, there are weird hacks where protobuf-generated definitions are replaced/overridden with hand-written ones; I'm not sure why that is:
https://github.com/tensorflow/tensorflow/blob/ced4fddc0400fedce8ba15fb9a3e6e765ddda757/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.h#L43
https://github.com/tensorflow/tensorflow/blob/ced4fddc0400fedce8ba15fb9a3e6e765ddda757/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.h#L98


wesm commented Feb 13, 2019

I opened https://issues.apache.org/jira/browse/ARROW-4562 about the composite byte buffer optimization

@wesm wesm changed the title WIP ARROW-4558: [C++][Flight] Implement gRPC customizations without UB ARROW-4558: [C++][Flight] Implement gRPC customizations without UB Feb 13, 2019

wesm commented Feb 13, 2019

I'm sure you could ask TF about it and find the right person to talk to.

You might want to try benchmarking between two of the nodes in the UL Nashville network rather than localhost-to-localhost. They are connected with gigabit Ethernet only (I'm upgrading the network switches, but I'm not sure I can get 10-gigabit working; too many factors are out of my control), so it would be interesting to see to what extent we are saturating the network bandwidth.


wesm commented Feb 13, 2019

@pitrou regarding padding, I could imagine an application that hashes IPC payloads for some purpose (caching?) having an issue with non-zero padding. It seems like an esoteric concern, though; I'm not sure it's worth paying the price, or at least "sanitize padding" should be an option that's disabled by default.


pitrou commented Feb 13, 2019

Perhaps the price isn't that large either. Ideally the batches are large, so adding an 8-byte zero-initialized slice may be cheap...

Change-Id: I7a191986b6544310ab71ae074ef218bb776d06f9

wesm commented Feb 13, 2019

OK to merge this with a passing build? This could benefit from an IWYU pass, but I might like to do that for all of Flight to clean up all the includes and forward declarations.


wesm commented Feb 13, 2019

I should rename protocol.h to protocol-internal.h so the header is not installed.


@pitrou pitrou left a comment


LGTM, one comment.

#include <memory>

#include "grpcpp/impl/codegen/config_protobuf.h"
#undef GRPC_OPEN_SOURCE_PROTO

Comment on this? I think it's easy to miss it or to misunderstand why it's here (as in "what happens if I remove this?").


done

using google::protobuf::io::CodedInputStream;
using google::protobuf::io::CodedOutputStream;

bool ReadBytesZeroCopy(const std::shared_ptr<arrow::Buffer>& source_data,

Oh, and you needn't redeclare this (it's already defined above).


done

… feedback

Change-Id: I908aa0568ea0f75968fe37c20fa7629ce9a9af36

wesm commented Feb 13, 2019

OK I'll wait for the build to run again and then merge


wesm commented Feb 13, 2019

Merging. A couple of builds experienced transient failures.
