How to run sparse retrieval on Japanese texts with Pyserini

VM Environments

Java 11.0.13
Maven 3.8.3
Lucene 8.10.1
Python 3.9.2

Get a VM with JDK11 and Maven

$ docker pull maven:3.8.3-openjdk-11
$ docker run -it maven:3.8.3-openjdk-11 bash
root@277eb5500d97:/# java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment 18.9 (build 11.0.13+8)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8, mixed mode, sharing)
root@277eb5500d97:/# mvn -version
Apache Maven 3.8.3 (ff8e977a158738155dc465c6a97ffaf31982d739)
Maven home: /usr/share/maven
Java version: 11.0.13, vendor: Oracle Corporation, runtime: /usr/local/openjdk-11
Default locale: en, platform encoding: UTF-8
OS name: "linux", version: "4.19.104-microsoft-standard", arch: "amd64", family: "unix"

Install Lucene and Anserini in VM

wget "https://archive.apache.org/dist/lucene/java/8.10.1/lucene-8.10.1.tgz"
tar xvfz lucene-8.10.1.tgz
export LUCENE_HOME=/lucene-8.10.1
export CLASSPATH=$CLASSPATH:$LUCENE_HOME/core/lucene-core-8.10.1.jar
export CLASSPATH=$CLASSPATH:$LUCENE_HOME/queryparser/lucene-queryparser-8.10.1.jar
export CLASSPATH=$CLASSPATH:$LUCENE_HOME/analysis/common/lucene-analyzers-common-8.10.1.jar
export CLASSPATH=$CLASSPATH:$LUCENE_HOME/demo/lucene-demo-8.10.1.jar

git clone --recurse-submodules https://github.com/castorini/anserini.git
cd anserini && mvn clean package appassembler:assemble -DskipTests -Dmaven.javadoc.skip=true

Install pyserini and other packages

apt update
apt install -y python3 python3-pip
python3 -m pip install -U pip
python3 -m pip install pyserini faiss-cpu torch
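
Before indexing, it can help to confirm that the pieces installed above are visible from Python. The following is a minimal sketch, not part of Pyserini: `check_environment` is a hypothetical helper name, and it only checks that a `java` executable is on the PATH and that the `pyserini` package can be located, without importing it.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether the tools this guide relies on are visible.

    Hypothetical helper for illustration; it is not part of Pyserini.
    """
    return {
        # Pyserini runs Lucene through a JVM, so check that java is reachable.
        "java": shutil.which("java") is not None,
        # find_spec locates the package without importing it.
        "pyserini": importlib.util.find_spec("pyserini") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If either entry is reported as missing, revisit the corresponding install step before moving on.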

Sample texts

ja_texts/text.jsonl

{"id": "doc1", "contents": "吾輩わがはいは猫である。名前はまだ無い。"}
{"id": "doc2", "contents": "どこで生れたかとんと見当けんとうがつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。"}
{"id": "doc3", "contents": "吾輩はここで始めて人間というものを見た。しかもあとで聞くとそれは書生という人間中で一番獰悪どうあくな種族であったそうだ。"}
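
The sample file above holds one JSON object per line, each with an `id` and a `contents` field. For larger collections it may be easier to generate this file from Python. The sketch below assumes the documents are available as `(doc_id, text)` pairs; `write_collection` is a hypothetical helper name, not a Pyserini function.

```python
import json
from pathlib import Path


def write_collection(docs, path):
    """Write (doc_id, text) pairs as one JSON object per line.

    Hypothetical helper for illustration; the field names "id" and
    "contents" mirror the sample file above.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for doc_id, text in docs:
            record = {"id": doc_id, "contents": text}
            # ensure_ascii=False keeps Japanese text readable instead of
            # writing \uXXXX escapes.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

Calling `write_collection(pairs, "ja_texts/text.jsonl")` would create the `ja_texts` directory if needed and write the file in the same shape as the sample.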

Index texts

mkdir -p indexes/ja_texts
python3 -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator -threads 1 -language ja -input ja_texts -index indexes/ja_texts -storePositions -storeDocvectors -storeRaw
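
The indexer reads the files in the `-input` directory as a `JsonCollection`, so each line needs an `id` and a `contents` field. A quick check before indexing can catch malformed lines early. This is a minimal sketch under that assumption; `find_problems` is a hypothetical helper, not part of Pyserini, and it only looks at `*.jsonl` files.

```python
import json
from pathlib import Path


def find_problems(input_dir):
    """Return a list of human-readable problems found in *.jsonl files.

    Hypothetical pre-flight check, not part of Pyserini: it looks for
    invalid JSON, missing "id"/"contents" fields, and duplicate ids.
    """
    problems = []
    seen_ids = set()
    for path in sorted(Path(input_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                if not line.strip():
                    continue  # ignore blank lines
                where = f"{path.name}:{line_no}"
                try:
                    record = json.loads(line)
                except json.JSONDecodeError as exc:
                    problems.append(f"{where}: invalid JSON ({exc.msg})")
                    continue
                if not isinstance(record, dict):
                    problems.append(f"{where}: not a JSON object")
                    continue
                for field in ("id", "contents"):
                    if field not in record:
                        problems.append(f"{where}: missing field '{field}'")
                if "id" in record:
                    key = str(record["id"])
                    if key in seen_ids:
                        problems.append(f"{where}: duplicate id '{key}'")
                    seen_ids.add(key)
    return problems
```

An empty list from `find_problems("ja_texts")` means none of these particular issues were found; it does not guarantee that indexing will succeed.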

Search texts

SimpleSearcher.py

from pyserini.search import SimpleSearcher

q = '吾輩'
searcher = SimpleSearcher('indexes/ja_texts')
searcher.set_language('ja')
hits = searcher.search(q)

for rank, hit in enumerate(hits, start=1):
    print(f'{rank:2} {hit.docid:4} {hit.score:.5f}')

Results

python3 SimpleSearcher.py
 1 doc1 0.27330
 2 doc3 0.23620
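
The search step can also be wrapped in small functions so that it is reusable from other scripts. The sketch below uses the same `SimpleSearcher` calls as the script above; `format_hits` and `search_ja` are hypothetical names, and the Pyserini import is deferred into `search_ja` so that the formatter can be tried without an index. The stand-in hits at the bottom reuse the scores from the results above and are not computed by this code.

```python
from collections import namedtuple


def format_hits(hits):
    """Format hits as 'rank docid score' lines, as in SimpleSearcher.py.

    Works with any objects exposing .docid and .score attributes.
    """
    return [f"{rank:2} {hit.docid:4} {hit.score:.5f}"
            for rank, hit in enumerate(hits, start=1)]


def search_ja(query, index_dir="indexes/ja_texts", k=10):
    """Hypothetical wrapper around the calls used in SimpleSearcher.py."""
    # Imported lazily so format_hits can be used without Pyserini installed.
    from pyserini.search import SimpleSearcher

    searcher = SimpleSearcher(index_dir)
    searcher.set_language("ja")  # use the Japanese analyzer for the query
    return searcher.search(query, k=k)


# Stand-in hits for trying the formatter without an index; the values are
# copied from the results shown above, not computed here.
FakeHit = namedtuple("FakeHit", ["docid", "score"])
sample = [FakeHit("doc1", 0.27330), FakeHit("doc3", 0.23620)]
print("\n".join(format_hits(sample)))
```

With an index in place, `print("\n".join(format_hits(search_ja("吾輩"))))` would run the same query as the script above.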

Fetch texts

SimpleFetcher.py

from pyserini.search import SimpleSearcher
import json

docid = 'doc1'
searcher = SimpleSearcher('indexes/ja_texts')

json_doc = json.loads(searcher.doc(docid).raw())
print(json_doc['contents'])

Result

python3 SimpleFetcher.py
吾輩わがはいは猫である。名前はまだ無い。
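
The raw JSON is available here because the index was built with `-storeRaw`. To fetch several documents at once, the same calls can be wrapped as follows. This is a sketch only: `contents_from_raw` and `fetch_contents` are hypothetical names, and the handling of unknown docids assumes that `doc()` returns `None` when the docid is not in the index.

```python
import json


def contents_from_raw(raw):
    """Extract the "contents" field from a raw JSON document string."""
    return json.loads(raw)["contents"]


def fetch_contents(docids, index_dir="indexes/ja_texts"):
    """Hypothetical helper: map each docid to its stored contents."""
    # Imported lazily so contents_from_raw can be tried without Pyserini.
    from pyserini.search import SimpleSearcher

    searcher = SimpleSearcher(index_dir)
    results = {}
    for docid in docids:
        doc = searcher.doc(docid)
        # Assumption: doc() yields None when the docid is not in the index.
        results[docid] = contents_from_raw(doc.raw()) if doc is not None else None
    return results


# Parsing can be tried on its own with a raw string in the same shape as
# the sample collection.
raw = '{"id": "doc1", "contents": "名前はまだ無い。"}'
print(contents_from_raw(raw))
```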

URLs

https://github.com/castorini/pyserini/blob/master/docs/usage-multilingual.md#how-do-i-index-and-search-my-own-non-english-documents
