ALIS-1265: create tag case insensitive #164

matsumatsu20 · 2018-09-01T15:20:52Z

概要

タグ作成時に大文字小文字を区別していたがそれを区別したくなかった
- 具体的にはタグ作成時に大文字小文字区別せずに検索し、存在したらその表記に合わせてタグの作成を行う
- 例: awsで登録してもAWSがすでに登録されている場合はAWSで登録され、AWSのカウントが上がる

環境変数(SSMパラメータ)

なし

影響範囲(ユーザ)

リリースに気をつけないとユーザ影響あり。後述する

影響範囲(システム)

影響を与えるシステムはどこか

サーバレス

技術的変更点概要

まず、「大文字小文字を区別する」がDynamoDBだと非常に難しい（方法は全スキャンしかない）
- よってタグ情報を保存するデータストアをESに変更した。
- これにより、タグマスタは不要になる。
タグ追加時に全てのタグをESに問い合わせて大文字小文字区別せずすでに存在済みかどうかをチェックし、存在してれば存在している形に直して返却させるメソッドを一回通すようにしている
カウントアップやなかったら作成するロジックも、全て一度上記のメソッドを通して大文字小文字区別しないようにしている
ElasticSearchは(というかLucene)にセグメントマージという仕組みがあり、一度インデックス情報をバッファメモリにためてある程度まとめてディスクに書き込む（ここで検索可能になる）
- ディスクI/Oを減らす仕組みだが、この間(デフォルトは1sごとにセグメントマージしようとする)は検索対象にならない = この間は文字の大きさ違いのタグが飛んで来た場合はないものとして登録してしまうので、create時には明示的にrefreshをしている
- ここは結構ハックっぽくあまりにもマージが大量に発生しまくると、逆にキューを捌けなくなってrefreshの期間が遅くなることも考えられるのでモニタリングしつつ探っていきたい

DBやDBへのクエリに対する変更

タグテーブルを使わないように修正
DBのスキーマに変更があるか
- がリソース削除をPRに混ぜるとこのPRが扱いづらくなるので別対応にしたい

注意点・その他

こいつは単独ではマージできない
- 今タグテーブルに入っている全てのタグ情報を一度ElasticSearchに移してからリリースする必要がある
- Index更新用のバッチを作ってあげて、それを流した後にリリースする
  - Index更新用のバッチ〜リリースまでに作られてしまったものは拾えない
    - 正確になるならサイト停止
  - しかしどちらにせよ、今日整合性がないDBなので整合性の担保はできない。特にインクリメンタルなカウントは苦手なはず
  - なので、結論としては
    - 十分にアクセスが少ない時間帯でインデックス更新+リリースを行う
    - カウントの整合性を高めるためには、記事を全件舐めてインデクシングするバッチを作ればよい

To save tags case-insensitive

matsumatsu20 · 2018-09-01T16:49:22Z

elasticsearch-setup.py

+        }
+    }
+}
+create_index_list.append({"name": "tags", "setting": tag_settings})


[質問] ESのmappingなどのリリースって現状どう行なっているのでしょうか？

あと、このスクリプトをローカルで動かそうとするとendpointの正規表現のところでコケてしまったんですが、既知の問題だったりしますか？
もし既知の問題であれば調べるのにコストかけるよりも聞いてしまいたいってのが背景なので、道であれば調査します。

ES に関わるリリースは、ブルー・グリーンデプロイを行っています。
また、 endpoint 箇所でコケる件ですが、もしかしたら実行方法に問題あるかもしれません。
下記のように形で、ローカルIPを引数として渡しているか一度ご確認いただければと思います。

python elasticsearch-setup.py $(curl https://checkip.amazonaws.com/)

matsumatsu20 · 2018-09-01T16:52:41Z

elasticsearch-setup.py

+            'properties': {
+                'name': {
+                    'type': 'keyword',
+                    'normalizer': 'lowercase_normalizer'


このように keyword で指定しておいて、normalizer で小文字フィルターをかけてあげると
Term level queries(言語解析されずに検索) で、かつ 大文字でも小文字でも検索対象になる ようにできる

上記精査ありがとうございます！

matsumatsu20 · 2018-09-01T16:56:07Z

src/common/tag_util.py

        )

+        # デフォルトのrefresh_intervalだと、1sほど作成されたタグが検索対象にならないのでセグメントマージを強制的に行う
+        elasticsearch.indices.refresh(index='tags')


ここは相当悩ましかったんですがPRの説明にも書いたように、明示的にRefreshしてます（これをすることによって、インメモリバッファからディスク書き込みが行われ、検索可能になる）

セグメントマージはあまりに高頻度だと逆にパフォーマンスも劣化するんですが、この処理自体それほど頻度が多いものではないので大丈夫だろうと判断しています。

matsumatsu20 · 2018-09-01T16:57:49Z

src/common/tag_util.py

        )

+        # デフォルトのrefresh_intervalだと、1sほど作成されたタグが検索対象にならないのでセグメントマージを強制的に行う
+        elasticsearch.indices.refresh(index='tags')


セグメントマージはあまりに高頻度だと逆にパフォーマンスも劣化するんですが、この処理自体それほど頻度が多いものではないので大丈夫だろうと判断しています。

matsumatsu20 · 2018-09-01T17:05:17Z

src/common/tag_util.py

+    """
+    与えられたタグ名をElasticSearchに問い合わせ(大文字小文字区別せず)
+    すでに存在する場合はElasticSearchに存在する文字列に完全一致する形に変換し、タグ名の配列を返却する
+    """


やってることは単純だが意図がわかりづらいのでコメント追加

matsumatsu20 · 2018-09-01T17:05:55Z

src/common/tag_util.py

+            else:
+                results.append(tag_name)
+
+        return results


リスト内包表記とかでもかけるのだが、上にも書いたようにすこし意図がわかりにくいのであえて冗長に書いている

matsumatsu20 · 2018-09-01T17:08:27Z

src/common/tag_util.py

+                    'count': num
+                }
+            }
+        }


カウンターの更新には
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html
を使っている。

Note, this operation still means full reindex of the document, it just removes some network roundtrips and reduces chances of version conflicts between the get and the index. The _source field needs to be enabled for this feature to work.

にあるように、一回取得して〜カウントして〜更新して〜、と比べると

ラウンドトリップが減る

その分整合性がおかしくなる可能性が減る

という話であって整合性が保たれるわけではなく、ダーティリード的なものは起こりうる、というところは注意。

と言ってもDynamoDBのatomic counterも整合性は保証されないので、整合性レベルが低くなるわけではない
https://docs.aws.amazon.com/ja_jp/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters

matsumatsu20 added 3 commits September 2, 2018 00:04

Change datastore to ElasticSearch for tag

74ef6b0

To save tags case-insensitive

Add tag index settings

cd6eda6

Delete os imports

dc6d561

matsumatsu20 commented Sep 1, 2018

View reviewed changes

Change term from match

710a14b

matsumatsu20 commented Sep 1, 2018

View reviewed changes

Don't refresh tag index after craete tag

d634291

matsumatsu20 requested a review from keillera September 8, 2018 07:45

matsumatsu20 mentioned this pull request Sep 8, 2018

ALIS-1261: Suggest tag API #166

Merged

matsumatsu20 self-assigned this Sep 10, 2018

matsumatsu20 added the In Review label Sep 10, 2018

matsumatsu20 changed the title ~~Feature/create tag case insensitive~~ ALIS-1265: create tag case insensitive Sep 10, 2018

keillera merged commit d634291 into AlisProject:develop Sep 12, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ALIS-1265: create tag case insensitive #164

ALIS-1265: create tag case insensitive #164

matsumatsu20 commented Sep 1, 2018 •

edited

Loading

Uh oh!

matsumatsu20 Sep 1, 2018

Uh oh!

matsumatsu20 Sep 1, 2018

Uh oh!

keillera Sep 12, 2018

Uh oh!

matsumatsu20 Sep 1, 2018

Uh oh!

keillera Sep 12, 2018

Uh oh!

matsumatsu20 Sep 1, 2018

Uh oh!

matsumatsu20 Sep 1, 2018

Uh oh!

matsumatsu20 Sep 1, 2018

Uh oh!

matsumatsu20 Sep 1, 2018

Uh oh!

matsumatsu20 Sep 1, 2018

Uh oh!

matsumatsu20 Sep 1, 2018

Uh oh!

matsumatsu20 Sep 1, 2018

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

ALIS-1265: create tag case insensitive #164

ALIS-1265: create tag case insensitive #164

Conversation

matsumatsu20 commented Sep 1, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

概要

環境変数(SSMパラメータ)

関連URL

影響範囲(ユーザ)

影響範囲(システム)

技術的変更点概要

DBやDBへのクエリに対する変更

注意点・その他

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

matsumatsu20 commented Sep 1, 2018 •

edited

Loading