name Perun. If you don't specify the name of your node, Elasticsearch will
provide one for you. This node was elected the master node for our cluster and
is in the state of accepting other nodes into the cluster for replication.
Finally we can see that the Elasticsearch API has the public address
localhost:9200.

Remember to give your cluster a name. If you use the default one, your node
might join a cluster that is on the same network, especially if you are doing
Now let's play around with Elasticsearch by adding, updating and deleting
documents.

To add documents to Elasticsearch we use the Elasticsearch API and the
well-known HTTP methods.

To index a document we do a `PUT` request with a JSON object in the body.
The URL that we `PUT` to has the following format:
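
http://localhost:9200/<index>/<type>/<id>

Here `<index>`, `<type>` and `<id>` are placeholders; in the examples below they are `entries`, `entry` and `1`.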
If we `PUT` the document again, using the same id, the document is updated in the index.

curl -XPUT http://localhost:9200/entries/entry/1 -d '{
"title": "Today I learned to search",
"data": "Lets change the blog content",
"data": "Lets change the blog content",
"created": "2014-12-24T14:24:23"}'
{"_index":"entries","_type":"entry","_id":"1","_version":2,"created":false}%
{"_index":"entries","_type":"entry","_id":"1","_version":2,"created":false}%

% curl http://localhost:9200/entries/entry/1
{"_index":"entries","_type":"entry","_id":"1","_version":2,"found":true,"_source":{
"title": "Today I learned to search",
"data": "Lets change the blog content",
You can remove a given document from the index using the `DELETE` HTTP method.

% curl -XDELETE http://localhost:9200/entries/entry/1
{"found":true,"_index":"entries","_type":"entry","_id":"1","_version":3}%

% curl http://localhost:9200/entries/entry/1
{"_index":"entries","_type":"entry","_id":"1","found":false}%

If we try to fetch the document after the delete, we can see that it has been removed.
You can also delete all the documents under a given type as follows:
curl -XDELETE http://localhost:9200/entries/entry

or remove the whole index:

curl -XDELETE http://localhost:9200/entries/

# Search
To perform a query we POST to the `_search` endpoint with our query in the body of the request.
curl -XPOST http://localhost:9200/entries/_search -d '
{
"query":{
// query goes here!
}
}'

Let's create a query and search for entries that contain the word "scaffolding".
"query_string": {
"query": "scaffolding",
"default_field" : "data"
}
}
}
}'

Expand All @@ -256,15 +256,15 @@ When execute this query,
quote> "query_string": {
quote> "query": "scaffolding",
quote> "default_field" : "data"
quote> }
quote> }
quote> }
quote> }'
{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.095891505,"hits":[{"_index":"entries","_type":"entry","_id":"2","_score":0.095891505,"_source":{
"title": "Ruby on rails is just a bubble in bathtub",
"data": "Yeah, Ruby on rails is awesome, but scaffolding is not!",
"created": "2013-04-02T12:34:02",
"tags": ["programming", "ruby", "rails"]
}}]}}%

We get back one entry that contains the text "scaffolding": the document with id 2.
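
The responses come back as a single line of JSON. If you add the `pretty` flag to any request, Elasticsearch formats its answer for easier reading, for example:

curl "http://localhost:9200/entries/entry/2?pretty=true"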

## Query by range on dates

For example, to find entries created between two dates:

curl -XPOST http://localhost:9200/entries/_search -d '
{
"query" : {
"range" : {
"created" : {
"gte" : "2013-01-01",
"lte" : "2014-12-31"
}
}
}
}'

# Advanced key storing
When storing keys, Elasticsearch automatically stores them in a clever way: for example, a key that includes a dash is split on the dashes.

If we have a slug, for example `some-pretty-long-slug`, Elasticsearch will want to split it into smaller pieces to make lookups faster. In some cases this is not good, since we might want to look documents up by the slug itself (which should be unique). In order to prevent Elasticsearch from changing the key, we need to create our own mapping.
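
You can see this behaviour directly with the analyze API (a quick check, assuming a node running on the default port):

```bash
# Ask the standard analyzer how it tokenizes the slug;
# it returns the tokens "some", "pretty", "long" and "slug"
curl "http://localhost:9200/_analyze?analyzer=standard&text=some-pretty-long-slug&pretty=true"
```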

Let's use Kodemon from PA3 as an example. There we had keys that combined the names of a function and a file. These keys should be unique (although that is not 100% accurate, since two files could have the same name in different directories) and we want to look each key up as a unique key in the index.

>If you want to run the following code, make sure Elasticsearch is running with the default settings, or change the code according to your own setup.
>Do note that this will not work if you have data in Elasticsearch that you do not want to lose. To use this method on existing data you would need a data migration, which is not covered in this section.

We start off by creating the index. We can use the following script to do so:
```bash
# Delete the kodemon index if you have one already
curl -XDELETE http://127.0.0.1:9200/kodemon
# Create the kodemon/execution index with one document
curl -XPOST http://127.0.0.1:9200/kodemon/execution -d '{ "key": "script-hehe", "token": "token" }'
# Delete all documents from the execution index
curl -XDELETE http://127.0.0.1:9200/kodemon/execution
```
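
As an optional sanity check (assuming the same host and port), the type should now contain no documents:

```bash
# Should report a count of zero after the delete
curl http://127.0.0.1:9200/kodemon/execution/_count
```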

Now we can be sure that we have the index we want to create a custom mapping for.
Let's assume the data has the following format when it gets posted to Elasticsearch:
```json
{
    "execution_time": 0.0019073486328125,
    "timestamp": "2014-11-05T02:12:39.000Z",
    "token": "test-token",
    "key": "second.py-cool_function",
    "_id": "54598797538bb0f6084f0072"
}
```

The next thing we want to do is make sure that the `key` property will not be stored as `second.py` and `cool_function`, but as the single piece `second.py-cool_function`.

In order to do that we can execute the following script:
```bash
curl -X POST "http://127.0.0.1:9200/kodemon/execution/_mapping?pretty=true" -d '
{
    "execution": {
        "properties": {
            "key": {
                "type": "multi_field",
                "fields": {
                    "original_key": {
                        "type": "string",
                        "index": "not_analyzed"
                    },
                    "key": {
                        "type": "string",
                        "index": "analyzed"
                    }
                }
            },
            "token": {
                "type": "multi_field",
                "fields": {
                    "original_token": {
                        "type": "string",
                        "index": "not_analyzed"
                    },
                    "token": {
                        "type": "string",
                        "index": "analyzed"
                    }
                }
            }
        }
    }
}'
```

The mapping above applies to all executions: it tells Elasticsearch specifically how to store the `key` and `token` properties, while every other property gets its default mapping.
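
You can check that the mapping was applied by reading it back (same default host and port assumed):

```bash
# Fetch the stored mapping for the execution type
curl "http://127.0.0.1:9200/kodemon/execution/_mapping?pretty=true"
```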

We take both the `key` and `token` properties and tell Elasticsearch to store each single field as two properties. The first is the original string, which we mark as not analyzed; this means that if we query the `key` property through `key.original_key`, we are asking Elasticsearch to match only keys that are exactly as they were when they were saved (e.g. `second.py-cool_function`).
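
For example, a `term` query on the not-analyzed field only hits the exact key; this is a sketch that assumes the example document from earlier has been indexed again:

```bash
# Exact lookup: matches only documents whose key is precisely this string
curl -XPOST "http://127.0.0.1:9200/kodemon/execution/_search?pretty=true" -d '
{
    "query": {
        "term": { "key.original_key": "second.py-cool_function" }
    }
}'
```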

On the other hand, if we still want the clever lookup, Elasticsearch has also saved the analyzed version of each property. So if we query by `key` as before, the key is split on dashes, just as with the default mapping behaviour, as the sketch below shows. You can read more about mappings [here](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-create-index.html).
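
```bash
# Analyzed lookup: matches because the key was tokenized on the dash
curl -XPOST "http://127.0.0.1:9200/kodemon/execution/_search?pretty=true" -d '
{
    "query": {
        "query_string": {
            "query": "cool_function",
            "default_field": "key"
        }
    }
}'
```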
