Salman* javagrails

Overview

This Seed Streams guide shows how to use Lucidworks Fusion to crawl a specific set of documents on a website whose URIs match a regular expression. In addition, img src fields are extracted with a JavaScript parsing stage and inserted into the index for use in other indexing stages. A vision network may be used to extract additional fields from the images.
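
The two ideas in this overview, limiting a crawl to URIs that match a regular expression and pulling img src values out of a page, can be sketched outside of Fusion in plain Python. This is only an illustration under stated assumptions, not Fusion's implementation: the pattern `COMIC_URI` assumes numbered comic pages of the form https://xkcd.com/<number>/, and the names `filter_uris`, `ImgSrcParser`, and `extract_img_srcs` are hypothetical. In Fusion itself, this work is done by the datasource configuration and a JavaScript stage.

```python
import re
from html.parser import HTMLParser

# Assumed pattern (not from the guide): numbered comic pages
# such as https://xkcd.com/327/
COMIC_URI = re.compile(r"^https://xkcd\.com/\d+/?$")


def filter_uris(uris):
    """Keep only the URIs that match the assumed comic-page pattern."""
    return [uri for uri in uris if COMIC_URI.match(uri)]


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def extract_img_srcs(html):
    """Return the src values of all <img> tags in an HTML string."""
    parser = ImgSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources
```

For example, `filter_uris` applied to a list of crawled links drops pages such as an "about" page, and `extract_img_srcs` applied to the HTML of a kept page yields the image URLs that a later stage could pass to a vision network.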

Start Fusion and Create a New Application

Start a Fusion instance on Google Cloud. Click the link that the script outputs to open the Fusion instance page. Set a password, then log in as admin with the new password.
Create a new application. Call it XKCD.
Click on the new application.

Add a New Datasource and Limit the Documents

Create a new datasource under Indexing > Datasources. Add a web source. Add https://xkcd.com a

Deploy Solr on Google Cloud in 2 Minutes

Useful information and scripts for deploying an instance-based SolrCloud in 2 minutes.

Check out this repo in your Google Cloud Shell terminal.

Launch Solr

Deploy a secure Solr instance on Google Cloud:

$ ./deploy-solr.sh

	#!/bin/bash
	NEW_UUID=$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 4 | head -n 1)
	gcloud compute instances create fusion-server-$NEW_UUID \
	--machine-type "n1-standard-8" \
	--image "ubuntu-1604-xenial-v20170811" \
	--image-project "ubuntu-os-cloud" \
	--boot-disk-size "50" \
	--boot-disk-type "pd-ssd" \
	--boot-disk-device-name "$NEW_UUID" \
	--zone us-west1-b \

	#!/bin/bash
	NEW_UUID=$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 4 | head -n 1)
	SERVER_NAME=ubuntu-dev-$NEW_UUID
	gcloud compute instances create $SERVER_NAME \
	--machine-type "n1-standard-1" \
	--image "ubuntu-1604-xenial-v20170811" \
	--image-project "ubuntu-os-cloud" \
	--boot-disk-size "10" \
	--boot-disk-type "pd-ssd" \
	--boot-disk-device-name "$NEW_UUID" \

	# Run this script to install a Fusion cluster locally.
	#
	# It creates fusion-1, fusion-2, etc. directories in the current working directory.
	#
	# You can then run those directories from the same machine or copy them to separate instances.
	#
	# There are two optional command line properties:
	#
	# --no-download Do not download Fusion from https://download.lucidworks.com; use the tar.gz file already in this directory instead.
	# -v Verbose mode.

	007addict.com
	020.co.uk
	027168.com
	0815.ru
	0815.ru0clickemail.com
	0815.ry
	0815.su
	0845.ru
	0clickemail.com
	0-mail.com

	1033edge.com
	11mail.com
	123.com
	123box.net
	123india.com
	123mail.cl
	123qwe.co.uk
	126.com
	150ml.com
	15meg4free.com

	#!/usr/bin/env python
	#
	# Extracts email addresses from one or more plain text files.
	#
	# Notes:
	# - Does not save to file (pipe the output to a file if you want it saved).
	# - Does not check for duplicates (which can easily be done in the terminal).
	#
	# chmod +x extract_emails_from_text.py
	# ./extract_emails_from_text.py file_to_parse.txt | sort | uniq
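
The header above describes the script's behavior but not its body. A minimal sketch of what such an extractor might look like follows; it is an assumption, not the original script. The pattern `EMAIL_RE` is deliberately simple and will not match every valid address, and the names `extract_emails` and `main` are hypothetical. As the notes say, it prints matches to standard output and does not remove duplicates.

```python
import re
import sys

# Assumed, deliberately simple pattern; it does not cover every valid
# address form.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return every match in order of appearance, duplicates included."""
    return EMAIL_RE.findall(text)


def main(paths):
    """Print each address found in the given text files, one per line."""
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                for email in extract_emails(line):
                    print(email)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Piping the output through `sort | uniq`, as in the usage line above, handles deduplication in the terminal.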

	0-mail.com
	0815.ru
	0clickemail.com
	0wnd.net
	0wnd.org
	10minutemail.com
	20minutemail.com
	2prong.com
	30minutemail.com
	3d-painting.com