@sxfmol
Last active July 15, 2020 10:34
ES_plugins: full-text search of Word and PDF documents with Elasticsearch (ES)

OCR tools

Other

Tesseract

pillow

wand

gocr

ES plugins

Ingest Attachment: the Ingest Attachment Processor plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.

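The source field must be base64-encoded binary. If you do not want the overhead of converting to and from base64, you can use the CBOR format instead of JSON and specify the field as a byte array rather than a string representation; the processor then skips base64 decoding. The plugin must be installed on every node in the cluster, and each node must be restarted after installation.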
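The base64 requirement can be sketched as follows. This is a minimal illustration, not a complete client: the pipeline name `attachment`, the index name `my_docs`, and the field name `data` are assumptions chosen for the example, and the code only builds the JSON request bodies rather than sending them to a cluster.

```python
import base64
import json

PIPELINE_ID = "attachment"  # assumed pipeline name, pick your own
INDEX = "my_docs"           # assumed index name, pick your own


def pipeline_body(field="data"):
    """Body for: PUT _ingest/pipeline/<PIPELINE_ID>."""
    return {
        "description": "Extract attachment information",
        # The attachment processor reads the base64 content from `field`.
        "processors": [{"attachment": {"field": field}}],
    }


def document_body(raw, field="data"):
    """Body for: PUT <INDEX>/_doc/<id>?pipeline=<PIPELINE_ID>."""
    # JSON cannot carry raw bytes, so the file is base64-encoded
    # and stored as an ASCII string in the source field.
    return {field: base64.b64encode(raw).decode("ascii")}


if __name__ == "__main__":
    # Placeholder bytes; in practice read the file with open(path, "rb").
    sample = b"placeholder file bytes"
    print(json.dumps(pipeline_body(), indent=2))
    print(json.dumps(document_body(sample)))
```

With the processor's default settings, the extracted text should end up under the `attachment.content` field of the indexed document, which is the field to query for full-text search.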

Download

wget https://artifacts.elastic.co/downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-7.8.0.zip

Install and remove

  • elasticsearch-plugin install ingest-attachment
  • elasticsearch-plugin remove ingest-attachment
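For nodes without internet access, the downloaded zip can be installed by passing a `file://` URL to `elasticsearch-plugin install`. The helper below is a small sketch that builds the download URL and the offline install command; the function names and the `/tmp` path are hypothetical, and the plugin version is assumed to match the Elasticsearch version of the node.

```python
from pathlib import PurePosixPath

BASE = "https://artifacts.elastic.co/downloads/elasticsearch-plugins"


def plugin_url(version, name="ingest-attachment"):
    """Download URL for a given plugin version (same pattern as the wget line)."""
    return f"{BASE}/{name}/{name}-{version}.zip"


def offline_install_cmd(zip_path):
    """Command for installing from a local zip; zip_path must be absolute."""
    # as_uri() turns /tmp/x.zip into file:///tmp/x.zip
    return ["elasticsearch-plugin", "install", PurePosixPath(zip_path).as_uri()]


if __name__ == "__main__":
    print(plugin_url("7.8.0"))
    print(" ".join(offline_install_cmd("/tmp/ingest-attachment-7.8.0.zip")))
```

The resulting command still has to be run on every node, followed by a restart of that node.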

A good example

-how-to-read-files-in-elasticsearch-doc-docx-pdf

Tika

OCR recognition

Ambar

Ambar supports large files (>30MB).

Supported file types:

  • ZIP archives
  • Mail archives (PST)
  • MS Office documents (Word, Excel, Powerpoint, Visio, Publisher)
  • OCR over images
  • Email messages with attachments
  • Adobe PDF (with OCR)
  • OCR languages: Eng, Rus, Ita, Deu, Fra, Spa, Pl, Nld
  • OpenOffice documents
  • RTF, Plaintext
  • HTML / XHTML
  • Multithread processing

https://ambar.cloud/docs/crawlers/
