This Seed Streams guide illustrates how to use Lucidworks Fusion to crawl a specific set of documents on a website whose URIs match a regular expression. Additionally, `img src` fields are extracted with a JavaScript index pipeline stage and inserted into the index for use in other indexing stages. Google's Cloud Vision API can then be used to extract additional fields from the images.
- Start a Fusion instance on Google Cloud. Click the link the script outputs to navigate to the Fusion instance page. Set a password. Log in with `admin` and the new password.
- Create a new application. Call it `XKCD`.
- Click on the new application.
- Create a new datasource under Indexing > Datasources. Add a Web source. Add https://xkcd.com as a start link. Limit the crawl to a maximum of 200 documents. Click Save at the top right.
- Navigate to Indexing > Index Pipelines. Add a new JavaScript pipeline stage (under Advanced). Copy and paste the `javascript_indexing_pipeline_stage.js` code below into the script body (an illustrative sketch of such a stage also appears after this list). Click Save.
- Add a new Include Documents stage. Add a new field, `id`, and set the regex pattern to `.*/[0-9]{1,5}/*.`, then click Save. This limits the documents to comic pages, which appear in the form https://xkcd.com/501/, https://xkcd.com/4/, etc.
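For orientation, here is a minimal sketch of what such a JavaScript stage can look like. It is not the shipped `javascript_indexing_pipeline_stage.js`: the input field name (`body`), the output field name (`img_url_s`), and the `imgs.xkcd.com/comics` pattern are assumptions to adjust against your own pipeline.

```javascript
// Illustrative sketch only -- not the shipped javascript_indexing_pipeline_stage.js.
// Assumes the parsed page HTML is in the 'body' field and writes the first
// comic image URL found into 'img_url_s'.
function (doc) {
  if (doc.hasField("body")) {
    var html = String(doc.getFirstFieldValue("body"));
    // xkcd comic images are served from imgs.xkcd.com/comics/...
    var match = /<img[^>]+src="((?:https?:)?\/\/imgs\.xkcd\.com\/comics\/[^"]+)"/i.exec(html);
    if (match) {
      // Normalize protocol-relative URLs (//imgs.xkcd.com/...) to https.
      var url = match[1].indexOf("//") === 0 ? "https:" + match[1] : match[1];
      doc.addField("img_url_s", url);
    }
  }
  return doc;
}
```

The stage function receives the pipeline document and must return it for the rest of the pipeline to keep processing it.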
- Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
- Navigate to Indexing > Index Workbench. Remove all parsing stages except the Tika parser and the fallback stage from the XKCD datasource: click a stage, then click Remove Stage below it. Repeat until only Tika and the fallback remain. Click Save.
- Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click Apply below, then click Save at the top right.
- Navigate to Indexing > Datasources, click Run and then the Start button. The crawler will start and then complete in about 30 seconds.
- Navigate to Querying > Query Workbench. Set the display fields to `id` and `img_url_s`.
- Run a search and ensure the `image_url_s` and `image_url_t` fields are present.
- Note that the text of the comic is already available in the `<div id="transcript">` tag on the comic page. Google's Vision API returns other data about the images, however.
- Navigate to Indexing > Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
- Create a query parameter with a property name of `key`. Set the property value to your Google API key for the Vision API.
- Copy and paste the `request_entity_indexing_pipeline.json` string below into the request entity field.
- Add a mapping of returned values XPath expression. Use `//responses/fullTextAnnotation/text` for the first expression. Set the target field to `gv_text_s`. Check `Append To Existing Values In Target Field` (a trimmed example of the Vision response this expression runs against appears after this list). Click Save at the top.
- Navigate to Indexing > Datasources, click Clear Datasource, then Run > Start to restart the crawl.
- Run a search and ensure the `gv_text_s` field is present.
- Navigate to the Credentials dashboard in the Google Cloud Console. You may need to select the correct project.
- Click the create credentials button. Select API key. Copy the API key when it appears.
- Click Restrict Key. On the application restrictions tab, select `IP addresses`. Enter just the IP address of the Fusion instance from your browser, without the port number or colon.
- Click the API restrictions tab. Set the API restrictions to Cloud Vision API.
- Click save.
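Before wiring the key into Fusion, it can help to sanity-check it directly against the Vision endpoint. The sketch below is an assumption-laden example (Node 18+ for the global `fetch`; the image URL is a placeholder); note that once the IP-address restriction above is applied, the call has to originate from the Fusion instance or it will be rejected.

```javascript
// Quick sanity check of a Vision API key outside Fusion (requires Node 18+ for global fetch).
// The image URL below is a placeholder -- substitute a real, publicly reachable image.
const key = process.env.VISION_KEY; // export VISION_KEY=<your API key> before running

const request = {
  requests: [
    {
      image: { source: { imageUri: "https://imgs.xkcd.com/comics/some_comic.png" } },
      features: [{ type: "DOCUMENT_TEXT_DETECTION" }]
    }
  ]
};

fetch(`https://vision.googleapis.com/v1/images:annotate?key=${key}`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(request)
})
  .then((res) => res.json())
  .then((data) => {
    // fullTextAnnotation.text is what the REST Query stage maps into gv_text_s.
    const annotation = data.responses && data.responses[0].fullTextAnnotation;
    console.log(annotation ? annotation.text : "No text detected (or the API returned an error).");
  })
  .catch(console.error);
```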
Tail `connectors-classic.log` in the `./fusion/4.0.2/var/log/connectors/connectors-classic` directory to debug:

```
$ cd ./fusion/4.0.2/var/log/connectors/connectors-classic
$ tail -f connectors-classic.log
```