- Download and install Riak 2.0 and an appropriate JVM for your machine.
- Download the Erlang load_data.erl escript from https://raw.githubusercontent.com/basho/basho_docs/master/source/data/load_data.erl
- Download the goog.csv test data from https://github.com/basho/basho_docs/raw/master/source/data/goog.csv (both downloads can also be fetched from the command line, as shown below)
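If you prefer to fetch both files from the command line, the downloads might look like this (curl's -L flag follows redirects and -O keeps the remote file names):
curl -L -O https://raw.githubusercontent.com/basho/basho_docs/master/source/data/load_data.erl
curl -L -O https://github.com/basho/basho_docs/raw/master/source/data/goog.csv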
While the node is stopped, modify the riak.conf file: add the line
search = on
to the end of the file, or find the existing search setting and change its value to on. Once search is enabled on all of the nodes in your cluster, start them up. If the nodes fail to start with search enabled, verify that a Java Virtual Machine is installed and that the java command works as expected.
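For a single node with a default package install, enabling search and checking the JVM might look like the following (the riak.conf path is an assumption; adjust it for your platform):
riak stop
echo "search = on" >> /etc/riak/riak.conf   # or edit the existing search setting in place
java -version                               # confirm a JVM is available on the PATH
riak start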
Since the load script does not use column names that are compatible with the default schema, and since using the default schema is not recommended for a production system anyway, we are going to create a custom schema suited to the goog.csv data.
Start with the skeleton Search schema from the Search Schema documentation. Create a file in your working folder named goog.xml and insert the following content:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="schedule" version="1.5">
  <fields>
    <!-- All of these fields are required by Riak Search -->
    <field name="_yz_id" type="_yz_str" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="_yz_ed" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_pn" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_fpn" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_vtag" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_rk" type="_yz_str" indexed="true" stored="true" multiValued="false"/>
    <field name="_yz_rt" type="_yz_str" indexed="true" stored="true" multiValued="false"/>
    <field name="_yz_rb" type="_yz_str" indexed="true" stored="true" multiValued="false"/>
    <field name="_yz_err" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
  </fields>
  <uniqueKey>_yz_id</uniqueKey>
  <types>
    <!-- YZ String: Used for non-analyzed fields -->
    <fieldType name="_yz_str" class="solr.StrField" sortMissingLast="true" />
  </types>
</schema>
We will need to configure some additional Solr field types to use for our fields. Inside the <types> element, add the following definitions:
<fieldtype name="date" class="solr.TrieDateField" />
<fieldtype name="integer" class="solr.TrieLongField" />
<fieldtype name="float" class="solr.TrieFloatField" />
<!-- Catch-all Field Type -->
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
These types will be used in the field definitions we create in the next step.
The "catch-all" type is used on a dynamic field: any fields that are present in the object but not defined in the schema will match the catch-all definition and be ignored rather than causing an error.
Next, we will define the Solr fields. These field definitions tell Search what to do with incoming data: index it, store it, or ignore it. Inside the <fields> element, add the following field definitions:
<field name="date" type="date" indexed="true" stored="false" multiValued="false" />
<field name="open" type="float" indexed="true" stored="false" multiValued="false" />
<field name="high" type="float" indexed="true" stored="false" multiValued="false" />
<field name="low" type="float" indexed="true" stored="false" multiValued="false" />
<field name="close" type="float" indexed="true" stored="false" multiValued="false" />
<field name="volume" type="integer" indexed="true" stored="false" multiValued="false" />
<!-- Catch-all field -->
<dynamicField name="*" type="ignored" />
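If you have libxml2's xmllint tool available, a quick well-formedness check on the finished file can catch copy-and-paste mistakes before the schema is uploaded:
xmllint --noout goog.xml && echo "goog.xml is well-formed"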
Next, set a few environment variables so the remaining commands are easier to copy and paste, then run the following in your shell, substituting the correct values for your environment. The three PUT requests upload the custom schema, create a search index named goog that uses it, and associate that index with the goog bucket.
export RIAK_HOST="http://localhost"
export RIAK_PORT=8098
export SOLR_PORT=8093
curl -XPUT "$RIAK_HOST:$RIAK_PORT/search/schema/goog" \
-H 'Content-Type:application/xml' --data-binary @goog.xml
curl -XPUT "$RIAK_HOST:$RIAK_PORT/search/index/goog" \
-H 'Content-Type:application/json' -d '{"schema":"goog"}'
curl -XPUT "$RIAK_HOST:$RIAK_PORT/buckets/goog/props" -H'content-type:application/json' -d'{"props":{"search_index":"goog"}}'
There are a few fixes that will need to be made to the original load_data.erl file. First, if you do not have Erlang installed on your machine, you will need to change the path in the first line of the file to point to the embedded Erlang that ships with Riak.
Change
#!/usr/bin/env escript
to
#!/usr/lib64/riak/erts-5.10.3/bin/escript
if you are on CentOS or RHEL, or
#!/usr/lib/riak/erts-5.10.3/bin/escript
for Ubuntu, Debian, and FreeBSD.
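The exact erts-* directory varies with the Riak version, so it can be easier to locate the embedded escript and patch the shebang automatically. A rough sketch (the search paths are assumptions, and the in-place sed shown is the GNU form; BSD sed needs -i ''):
ESCRIPT=$(find /usr/lib64/riak /usr/lib/riak -name escript 2>/dev/null | head -n 1)   # locate the bundled escript
sed -i "1s|.*|#!$ESCRIPT|" load_data.erl                                              # rewrite the shebang line
head -n 1 load_data.erl                                                               # confirm the change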
Next, we need to make some changes to the curl command that the load script invokes. We need to change the URL so that it reflects our Riak IP address and port, and emit ISO 8601 date strings so that Riak Search can parse them. Change line 9 from
JSON = io_lib:format("{\"Date\":\"~s\",\"Open\":~s,\"High\":~s,\"Low\":~s,\"Close\":~s,\"Volume\":~s,\"Adj. Close\":~s}", Line),
to
JSON = io_lib:format("{\"date\":\"~sT00:00:00Z\",\"open\":~s,\"high\":~s,\"low\":~s,\"close\":~s,\"volume\":~s,\"adj_close\":~s}", Line),
Modify line 10 and correct the IP address and port if necessary (if you're using a devrel, for example):
Command = io_lib:format("curl -X PUT http://127.0.0.1:8091/riak/goog/~s -d '~s' -H 'content-type: application/json'", [hd(Line),JSON]),
Once those edits are made, make the script executable and run it against the test data:
chmod +x load_data.erl
./load_data.erl goog.csv
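When the script finishes, a quick way to confirm that documents were indexed is to ask the index for a match count (rows=0 returns only the count; numFound in the response should be roughly the number of rows in goog.csv):
curl "$RIAK_HOST:$RIAK_PORT/search/query/goog?q=*:*&rows=0&wt=json" | jsonpp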
In this section, we will run some of the sample queries that were provided in the "Advanced MapReduce - Bigger Data Examples" documentation against the sample dataset.
To find all of the trading days where the high reached 600 or above, open the following URL in a browser (substituting your own node's address):
http://10.0.1.19:8098/search/query/goog?wt=json&q=high:[600%20TO%20*]
or run the equivalent curl command (the -g flag turns off curl's URL globbing so the brackets are passed through literally):
curl -g "$RIAK_HOST:$RIAK_PORT/search/query/goog?wt=json&q=high:[600%20TO%20*]" | jsonpp
curl "$RIAK_HOST:$RIAK_PORT/search/query/goog?q=*:*&fq=%7B%21frange+u%3D0%7Dsub%28close%2Copen%29&wt=json" | jsonpp
or
http://10.0.1.19:8098/search/query/goog?q=*:*&fq={!frange%20u=0}sub(close,open)&wt=json
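The same pattern works for any indexed field in the schema. For instance, a couple of illustrative queries (the 50,000,000-share volume threshold is arbitrary) might look like this:
# trading days with volume of at least 50,000,000 shares
curl -g "$RIAK_HOST:$RIAK_PORT/search/query/goog?wt=json&q=volume:[50000000%20TO%20*]" | jsonpp
# the same query, sorted by date and limited to the first five matches
curl -g "$RIAK_HOST:$RIAK_PORT/search/query/goog?wt=json&q=volume:[50000000%20TO%20*]&sort=date%20asc&rows=5" | jsonpp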
Before deleting an index, note that there is no quick way to rebuild one: re-indexing existing data requires clearing the AAE trees and waiting for Active Anti-Entropy to repair them. To remove the index, first dissociate it from the bucket by pointing the bucket's search_index property at the special _dont_index_ value, then delete the index itself:
curl -XPUT "$RIAK_HOST:$RIAK_PORT/buckets/goog/props" -H'content-type:application/json' -d'{"props":{"search_index":"_dont_index_"}}'
curl -XDELETE "$RIAK_HOST:$RIAK_PORT/search/index/goog"
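A follow-up request can confirm the index is gone; after the DELETE, the index resource should return a 404:
curl -i "$RIAK_HOST:$RIAK_PORT/search/index/goog"   # expect HTTP/1.1 404 Not Found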