How to run sparse retrieval on Japanese texts with Pyserini
Tested environment
Java 11.0.13
Maven 3.8.3
Lucene 8.10.1
Python 3.9.2
Start a Docker container with JDK 11 and Maven
$ docker pull maven:3.8.3-openjdk-11
$ docker run -it maven:3.8.3-openjdk-11 bash
root@277eb5500d97:/# java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment 18.9 (build 11.0.13+8)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8, mixed mode, sharing)
root@277eb5500d97:/# mvn -version
Apache Maven 3.8.3 (ff8e977a158738155dc465c6a97ffaf31982d739)
Maven home: /usr/share/maven
Java version: 11.0.13, vendor: Oracle Corporation, runtime: /usr/local/openjdk-11
Default locale: en, platform encoding: UTF-8
OS name: "linux", version: "4.19.104-microsoft-standard", arch: "amd64", family: "unix"
Install Lucene and Anserini in the container
wget "https://dlcdn.apache.org/lucene/java/8.10.1/lucene-8.10.1.tgz"
tar xvfz lucene-8.10.1.tgz
export LUCENE_HOME=/lucene-8.10.1
export CLASSPATH=$CLASSPATH:$LUCENE_HOME/core/lucene-core-8.10.1.jar
export CLASSPATH=$CLASSPATH:$LUCENE_HOME/queryparser/lucene-queryparser-8.10.1.jar
export CLASSPATH=$CLASSPATH:$LUCENE_HOME/analysis/common/lucene-analyzers-common-8.10.1.jar
export CLASSPATH=$CLASSPATH:$LUCENE_HOME/demo/lucene-demo-8.10.1.jar
git clone --recurse-submodules https://github.com/castorini/anserini.git
cd anserini && mvn clean package appassembler:assemble -DskipTests -Dmaven.javadoc.skip=true
Install Pyserini and other packages
apt update
apt install -y python3 python3-pip
python3 -m pip install -U pip
python3 -m pip install pyserini faiss-cpu torch
Prepare Japanese texts as JSON Lines in the ja_texts directory (one document per line, with "id" and "contents" fields)
{"id": "doc1", "contents": "吾輩わがはいは猫である。名前はまだ無い。"}
{"id": "doc2", "contents": "どこで生れたかとんと見当けんとうがつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。"}
{"id": "doc3", "contents": "吾輩はここで始めて人間というものを見た。しかもあとで聞くとそれは書生という人間中で一番獰悪どうあくな種族であったそうだ。"}
Build the index
mkdir -p indexes/ja_texts
python3 -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator -threads 1 -language ja -input ja_texts -index indexes/ja_texts -storePositions -storeDocvectors -storeRaw
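The -language ja option matters because Japanese text has no spaces between words, so splitting on whitespace leaves a whole sentence as a single token. The toy sketch below illustrates the problem, using character bigrams as a stand-in for a language-aware tokenizer. It is only an illustration: Lucene's Japanese analyzer segments text morphologically rather than into bigrams, and the helper names here are hypothetical.

```python
def whitespace_tokens(text):
    """Split on whitespace, as one would for English text."""
    return text.split()


def char_bigrams(text):
    """Overlapping two-character tokens; a crude stand-in for real segmentation."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


doc = "吾輩はここで始めて人間というものを見た。"
query = "吾輩"

# The sentence contains no spaces, so it stays a single token
# and the query term cannot match it exactly.
found_by_whitespace = query in whitespace_tokens(doc)

# With bigrams the query term appears as a token of its own.
found_by_bigrams = query in char_bigrams(doc)

print(found_by_whitespace, found_by_bigrams)
```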
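Search the index with SimpleSearcher.py
# In newer Pyserini releases this class is called LuceneSearcher (pyserini.search.lucene).
from pyserini.search import SimpleSearcher

q = '吾輩'
searcher = SimpleSearcher('indexes/ja_texts')
# Analyze the query with the Japanese analyzer, matching the -language ja index.
searcher.set_language('ja')
hits = searcher.search(q)
# Print rank, document ID, and score for each hit.
for i in range(len(hits)):
    print(f'{i + 1:2} {hits[i].docid:4} {hits[i].score:.5f}')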
python3 SimpleSearcher.py
1 doc1 0.27330
2 doc3 0.23620
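Both hits contain 吾輩, yet the shorter doc1 ranks above doc3. Pyserini ranks with BM25 by default, which normalizes by document length, so with equal term frequency a shorter document scores higher. The sketch below shows that effect. It assumes k1=0.9 and b=0.4 (Pyserini's defaults, to my understanding) and uses hypothetical token counts, so it does not reproduce the exact scores above; those depend on how the Japanese analyzer tokenizes each document. The constant factor (k1 + 1) found in the textbook formula is left out, which does not change the ranking.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.9, b=0.4):
    """BM25 score of a single query term for a single document."""
    # Rarer terms (low doc_freq) get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Longer-than-average documents are penalized through this term.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf / (tf + length_norm)


# Hypothetical token counts; the real ones depend on the analyzer.
doc_lens = {"doc1": 10, "doc2": 30, "doc3": 25}
avg_len = sum(doc_lens.values()) / len(doc_lens)

# Assume the query term occurs once in doc1 and once in doc3 (2 of 3 documents).
scores = {
    docid: bm25_term_score(tf=1, doc_len=doc_lens[docid], avg_doc_len=avg_len,
                           n_docs=3, doc_freq=2)
    for docid in ("doc1", "doc3")
}

for docid, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(docid, round(score, 5))
```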
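Fetch a document by ID with SimpleFetcher.py
import json
from pyserini.search import SimpleSearcher

docid = 'doc1'
searcher = SimpleSearcher('indexes/ja_texts')
# raw() returns the stored JSON string, which is available because the index was built with -storeRaw.
json_doc = json.loads(searcher.doc(docid).raw())
print(json_doc['contents'])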
python3 SimpleFetcher.py
吾輩わがはいは猫である。名前はまだ無い。