@ijokarumawak
Created May 20, 2021 04:23
Logstash sample for indexing Japanese Wikipedia pages into Elasticsearch

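Create a mapping so that the article body can be analyzed with Kuromoji. The kuromoji analyzer is provided by the analysis-kuromoji plugin, which must be installed on the Elasticsearch nodes.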

PUT jawiki
{
  "mappings": {
    "properties": {
      "doc": {
        "properties": {
          "revision": {
            "properties": {
              "text": {
                "properties": {
                  "content": {
                    "type": "text",
                    "analyzer": "kuromoji"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
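
The nesting in the request above follows the dotted field path doc.revision.text.content, with one "properties" level per path segment. As a minimal sketch of that correspondence, the same body can be built from the path in Python. The helper name nested_mapping is hypothetical and not part of any Elasticsearch client.

```python
import json


def nested_mapping(path, leaf):
    """Wrap `leaf` in one "properties" object per dotted path segment.

    Hypothetical helper for illustration only.
    """
    node = leaf
    # Build from the innermost field outwards.
    for name in reversed(path.split(".")):
        node = {"properties": {name: node}}
    return node


# Same structure as the body of the PUT jawiki request.
mapping_body = {
    "mappings": nested_mapping(
        "doc.revision.text.content",
        {"type": "text", "analyzer": "kuromoji"},
    )
}

print(json.dumps(mapping_body, indent=2))
```

The printed JSON can be pasted as the body of the PUT jawiki request.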

Pick one of the smaller dump files from https://dumps.wikimedia.org/jawiki/20210501/, download it, and decompress it.

jawiki-20210501-pages-articles-multistream6.xml-p4307948p4365476.bz2

This file contains 36,166 pages and is 29.5 MB.
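
If a command-line bunzip2 is not at hand, the decompression step can be sketched with the Python standard library, whose bz2 module reads multi-stream files such as this one. The function names and the way pages are counted (lines consisting only of a <page> tag) are assumptions made for illustration, not part of the original procedure.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path):
    """Decompress a (possibly multi-stream) .bz2 file to dst_path."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def count_pages(xml_path):
    """Count lines that consist only of an opening <page> tag."""
    count = 0
    with open(xml_path, encoding="utf-8") as f:
        for line in f:
            if line.strip() == "<page>":
                count += 1
    return count


# Example usage, assuming the dump was downloaded to /tmp:
# name = "jawiki-20210501-pages-articles-multistream6.xml-p4307948p4365476"
# decompress_bz2(f"/tmp/{name}.bz2", f"/tmp/{name}")
# print(count_pages(f"/tmp/{name}"))
```

Counting the pages up front gives a number to compare against the document count in the index after Logstash finishes.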

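input {
  file {
    path => "/tmp/jawiki-20210501-pages-articles-multistream6.xml-p4307948p4365476"
    # Read the file once from start to end instead of tailing it.
    mode => "read"
    # Record the finished file in a log rather than deleting it.
    file_completed_action => "log"
    file_completed_log_path => "/tmp/file_completed.log"
    # Start a new event at each <page> (or the closing </mediawiki>);
    # every other line is appended to the previous event.
    codec => multiline {
      pattern => " <page>|</mediawiki>"
      negate => true
      what => "previous"
      max_lines => 10000
    }
  }
}
filter {
  # Parse each <page> element into the "doc" field.
  xml {
    source => "message"
    target => "doc"
  }
  # Drop events that are not well-formed XML, such as the dump header.
  if "_xmlparsefailure" in [tags] {
    drop { }
  }
  # Remove the raw XML once it has been parsed.
  prune {
    blacklist_names => ["message"]
  }
}
output {
  # stdout {}
  elasticsearch {
    hosts => "https://localhost:9200"
    index => "jawiki"
  }
}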
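
To illustrate how the multiline codec settings split the dump into one event per page, here is a rough Python emulation of negate => true with what => "previous". It is a sketch, not Logstash's implementation: it ignores max_lines, and it uses Python's XML parser only as a stand-in for the xml filter's well-formedness check. The function names and the sample lines are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as in the codec settings; matched anywhere in the line.
PATTERN = re.compile(r" <page>|</mediawiki>")


def group_events(lines, pattern=PATTERN):
    """A matching line starts a new event; other lines join the previous one."""
    events = []
    current = []
    for line in lines:
        if pattern.search(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events


def is_well_formed(text):
    """Stand-in for the xml filter: True if the event parses as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


sample = [
    "<mediawiki>",
    "  <siteinfo>",
    "  </siteinfo>",
    "  <page>",
    "    <title>A</title>",
    "  </page>",
    "  <page>",
    "    <title>B</title>",
    "  </page>",
    "</mediawiki>",
]

events = group_events(sample)
kept = [e for e in events if is_well_formed(e)]
print(len(events), len(kept))
```

In this sample the header block and the trailing </mediawiki> line become events of their own but do not parse as XML, which corresponds to the events that the _xmlparsefailure check drops in the pipeline.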