Skip to content

Adds tee and cat commands and a supporting DataStore service #1101

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 27, 2016

Conversation

imbriaco
Copy link
Member

  • Creates a DataStore service based heavily on the Memory service, but which writes files to disk.
    • This service is not exposed via the web API and is only currently intended for use by Cog internals.
    • Data is written by default to a service_data directory under the top-level Cog data directory. This is configurable in config.exs.
    • File manipulation is handled by a new NestedFile module which creates a multi-level directory hierarchy in order to reduce number of files which will be written to a directory for performance reasons.
  • Adds a tee command to save pipeline output using the DataStore service for later use
  • Adds a cat command to retrieve saved pipeline output
    • Includes support for modifying the saved tee data with the current pipeline output in the following ways:
      • Append current output to saved data returning an array
      • Insert current output into saved map as a provided field name
      • Merge current output map into saved map

Note that this does include tests for the cat and tee commands which will pass but which are currently configured with @moduletag :skip to align with the other command tests.

def init(nil),
do: raise "Unable to start #{__MODULE__}: Data path not configured"
def init(base_path) do
state = %__MODULE__{base_path: base_path}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to verify this path exists, is a directory, can be written to, etc? Is that done somewhere else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NestedFile module will try to create it if it does not exist the first time it is attempted to be written to and return an error if it is unable to. That felt like the right thing in this case since not everyone is likely to use the commands that make use of it and I wouldn't want to keep Cog from working. It also defaults to be nested under the default data directory just like logs and audit_logs so it feels pretty safe to me.

Happy to hear arguments to the contrary if you think we should do something else.

opts["insert"] ->
handle_transform(:insert, req, data, state)
opts["append"] ->
{:reply, req.reply_to, List.wrap(data) ++ List.wrap(req.cog_env), state}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this actually "prepend"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's appending the current pipeline output to the saved pipeline output.

Example:

@ash seed '{ "foo": "value1" }' | tee foo
[
  {
    "foo": "value1"
  }
]
@ash seed '{ "foo": "value2" }' | cat -a foo
[
  {
    "foo": "value1"
  },
  {
    "foo": "value2"
  }
]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh, you're right... I got a tad mixed up.

end
end

defp transform_data(action, [prev], curr, opts),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to name this something like transform_map_data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Done.

do: GenServer.call(__MODULE__, {:delete, namespace, key})

def init(nil),
do: raise "Unable to start #{__MODULE__}: Data path not configured"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the convention for this kind of thing is to return {:stop, reason}. That way start_link will return the correct {:error, reason} response.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derp, yeah. Thanks.

# must be the same one that data was written into.
@data_namespace [ "commands", "tee" ]

@description "Save and pass through pipeline data"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might have been a copy-pasta'd and not updated, unless cat actually "saves" stuff.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

assert File.exists?(path)

assert :ok = NestedFile.delete(base_paths, key)
assert !File.exists?(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but refute is handy for this kind of thing.

response = send_message(user, "@bot: seed '{\"foo\":\"fooval6\"}' | cat -a test")
assert [%{foo: "fooval5"},%{foo: "fooval6"}] = response
end
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels weird to add a test that is skipped. Really want we need to do is to test the commands in isolation but we don't have any test helpers setup to do that yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems to pass and doesn't seem to have anything provider specific (since there are no templates, etc) so it's not clear to me why it needs to be skipped. I was just following the pattern for the rest of the command tests.

I think it's worth having even if it's skipped since it describes the expected behavior if nothing else.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it can run, lets run it.

Copy link
Member

@vanstee vanstee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; just left some minor comments.


MemoryClient.delete(root, token, memory_key)

case DataStore.replace(@data_namespace, key, data) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth calling out that existing keys are replaced in the desc.

Copy link
Collaborator

@christophermaier christophermaier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the security implications here? If someone tee's sensitive information, anybody can cat it if they know the name it was saved under, even if they couldn't have run the commands that generated the data in the first place.

Related: how do you actually delete data once you're done with it?

@@ -12,8 +11,10 @@ defmodule Cog.Command.Service.Supervisor do
token_monitor_table = :ets.new(:token_monitor_table, [:public])
memory_table = :ets.new(:memory_table, [:public])
memory_monitor_table = :ets.new(:memory_monitor_table, [:public])
data_path = Application.get_env(:cog, Cog.Command.Service, [])[:data_path]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we're defining this in config/config.exs, and that the thing won't even start if this is nil, how about using Application.fetch_env!/2 here instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated it to use:

Application.fetch_env!(:cog, Cog.Command.Service)
|> Keyword.fetch!(:data_path)

Which should raise if either Cog.Command.Service or the data_path key under it are missing from the Cog application configuration.

opts["insert"] ->
handle_transform(:insert, req, data, state)
opts["append"] ->
{:reply, req.reply_to, List.wrap(data) ++ List.wrap(req.cog_env), state}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh, you're right... I got a tad mixed up.

"""
require Logger

def fetch(base_paths, key, ext \\ "data") do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to specify an extension?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was mostly there because I didn't know if we'd want to use NestedFile in other places in Cog since it's fairly generic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in the case of simultaneous writes to the same key?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last one wins. They're serialized through the DataStore service's GenServer.


Note: The key is sanitized to remove dangerous characters, but the
base paths and extension are not. Make sure not to use user entered
values for these unsafe arguments.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just sanitize everything here, then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't really, since "everything" includes the base path which will obviously have things like slashes in it. We could sanitize the contents of the namespace list that is passed to DataStore but it felt unnecessary since this is an internal API that's not exposed outside of Cog core.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the key the only thing that's user-specified? The wording of the comment led me to believe otherwise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the key is the only thing that is user specified. That note is there for future-me who wants to reuse this for something else so that I don't make a mistake an let user-specified data into the namespace for instance.

foo1 = %{"foo1" => "fooval"}
DataStore.replace(@namespace, "foo1", foo1)
assert {:ok, ^foo1} = DataStore.fetch(@namespace, "foo1")
DataStore.delete(@namespace, "foo1")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this delete call (and others like it in these tests) really needed? If for some reason the assert failed, this line wouldn't be executed.

If it's important, an on_exit callback should be used. If it's not important, we can just chuck it 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mostly there to keep cruft from accumulating after tests since files get written to the filesystem and we don't have transactions to roll them back. I can use an on_exit instead.

response = send_message(user, "@bot: seed '{\"foo\":\"fooval6\"}' | cat -a test")
assert [%{foo: "fooval5"},%{foo: "fooval6"}] = response
end
end
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it can run, lets run it.

@imbriaco
Copy link
Member Author

Yep, anything that anyone saves with tee can be retrieved using cat by anyone else, unless they write rules to prevent it. I updated the tee docs with some discussion of that.

Copy link
Member

@kevsmith kevsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo question about behavior w/concurrent writes and/or deletes.

"""
require Logger

def fetch(base_paths, key, ext \\ "data") do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in the case of simultaneous writes to the same key?

@imbriaco
Copy link
Member Author

I commented the command test back out. They pass locally if I run it in dev mode but test mode behaves strangely, probably related to the same issue with the rest of the command tests. When we fix those up, at least we have this to start from.

add support for option descriptions to embedded bundle command options
@imbriaco imbriaco force-pushed the imbriaco/tee-command branch from e864e91 to b199ff7 Compare October 27, 2016 13:32
@imbriaco imbriaco merged commit c58b5bd into master Oct 27, 2016
@imbriaco imbriaco deleted the imbriaco/tee-command branch October 27, 2016 13:33
@imbriaco imbriaco added this to the Cog 0.16 milestone Nov 3, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants