Ingest Attachment
The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.
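The source field must be base64-encoded binary. If you want to avoid the overhead of converting to and from base64, you can use the CBOR format instead of JSON and specify the field as a byte array rather than a string representation. The processor then skips the base64 decoding.
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.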
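The base64 requirement above can be sketched in Python with only the standard library. This is a minimal sketch, not the plugin's official client code: the helper names `build_pipeline` and `build_document` and the field name `data` are assumptions chosen for illustration. The resulting dictionaries are intended as JSON bodies for a pipeline definition (for example `PUT _ingest/pipeline/attachment`) and for an index request that names that pipeline.

```python
import base64
import json


def build_pipeline(field="data"):
    """Pipeline body with a single attachment processor that reads `field`."""
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }


def build_document(raw_bytes, field="data"):
    """Document body: the raw file bytes as a base64 string, as JSON requires."""
    return {field: base64.b64encode(raw_bytes).decode("ascii")}


# Hypothetical content; in practice use open(path, "rb").read()
pipeline_body = json.dumps(build_pipeline())
document_body = json.dumps(build_document(b"hello"))
```

Keeping the encoding step in one small function makes it easy to replace later, for instance if the documents are sent as CBOR with a raw byte array instead of base64 text.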
- elasticsearch-plugin install ingest-attachment
- elasticsearch-plugin remove ingest-attachment
- how-to-read-files-in-elasticsearch-doc-docx-pdf
- apache tika
- pypi tika python
- github tika python (Docker installation, Python examples)
- Parsing PDFs in Python with Tika pdf
- Supported file formats
  - HyperText Markup Language
  - XML and derived formats
  - Microsoft Office document formats
  - OpenDocument Format
  - Portable Document Format
  - Electronic Publication Format
  - Rich Text Format
  - Compression and packaging formats
  - Text formats
  - Audio formats
  - Image formats
  - Video formats
  - Java class files and archives
  - The mbox format
- Fast Text Extraction with Python and Tika
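Text extraction with the tika-python package referenced above might look like the following. This is a sketch under assumptions: it relies on `tika.parser.from_file`, which returns a dictionary with `content` and `metadata` keys, and it needs the `tika` package plus a Java runtime at run time. The wrapper name `extract_text` and its `parse` parameter are hypothetical, added so that a different parser can be swapped in.

```python
def extract_text(path, parse=None):
    """Return (text, metadata) for a file using a Tika-style parser.

    `parse` is any callable that returns a dict with "content" and
    "metadata" keys; it defaults to tika.parser.from_file.
    """
    if parse is None:
        # Imported lazily: requires the tika package and a Java runtime
        from tika import parser
        parse = parser.from_file
    parsed = parse(path)
    # "content" can be None when the file contains no extractable text
    content = parsed.get("content") or ""
    # Tika output tends to contain many blank lines; keep non-empty ones
    lines = [line.strip() for line in content.splitlines()]
    text = "\n".join(line for line in lines if line)
    return text, parsed.get("metadata") or {}
```

Calling `extract_text("report.pdf")` (a hypothetical file name) would return the cleaned text together with the metadata dictionary reported by Tika.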
Ambar supports large files (>30MB).
Supported file types:
- ZIP archives
- Mail archives (PST)
- MS Office documents (Word, Excel, PowerPoint, Visio, Publisher)
- OCR over images
- Email messages with attachments
- Adobe PDF (with OCR)
- OCR languages: Eng, Rus, Ita, Deu, Fra, Spa, Pl, Nld
- OpenOffice documents
- RTF, Plaintext
- HTML / XHTML
- Multithread processing
https://ambar.cloud/docs/crawlers/