
@terrywbrady
Last active August 29, 2015 14:08
Copy DSpace Statistics Records to Force UID generation

We found that 5M of our 12M statistics records did not have a uid. The absence of this field caused the sharding process to fail.

  • Add the following to solr.xml

      <core name="tstatistics" instanceDir="tstatistics" />
    
  • Build solrFix-2.0.jar using the pom file included in this gist

  • Run the solrFix jar repeatedly until all records have been copied from "statistics" to "tstatistics". Each run calls the SolrTouch class, which reads each statistics record and copies it (excluding uid and _version_). This forces Solr to re-initialize these fields.

This process runs into heap or garbage collection constraints when processing large numbers of items. On line #63 of the SolrTouch source, tune the process to set a maximum number of records to process at one time. (Recommended: 100,000 to 500,000)

java -Xmx1000m -jar solrFix-2.0.jar 
  • Stop solr and swap the statistics and tstatistics cores.
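
The copy step in the list above amounts to rebuilding each record from every field except uid and _version_, so that Solr assigns both again on insert. Here is a minimal sketch of that filtering in plain Java, using a Map as a stand-in for a Solr document. The class and method names are illustrative assumptions and are not part of solrFix.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class FieldCopySketch {
    // Fields that Solr is expected to regenerate on insert.
    public static final Set<String> EXCLUDED =
            new HashSet<>(Arrays.asList("uid", "_version_"));

    // Return a copy of the record without the excluded fields, keeping field order.
    public static Map<String, Object> copyWithoutGeneratedFields(Map<String, Object> record) {
        Map<String, Object> copy = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            if (EXCLUDED.contains(e.getKey())) continue;
            copy.put(e.getKey(), e.getValue());
        }
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical record; the field values are made up for illustration.
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", "123");
        record.put("uid", "stale-uid");
        record.put("_version_", 0L);
        record.put("time", "2014-01-01T00:00:00Z");
        System.out.println(copyWithoutGeneratedFields(record).keySet()); // prints [id, time]
    }
}
```

In SolrTouch the same idea is applied to SolrDocument and SolrInputDocument objects rather than plain maps.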

We found that 5M of our 12M statistics records did not have a proper version attribute. The following code override forces the version to be corrected during the sharding process.

  • Update SolrLogger.java in your DSpace code base to contain the shardSolrIndex method included in this gist. This method excludes version numbers from the CSV export in the sharding process.
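
The version exclusion comes down to finding the _version_ column in the CSV header row and dropping that position from every row before the file is uploaded. The sketch below shows that step on plain String arrays, assuming the rows have already been parsed. The class and method names are hypothetical, and no CSV library is used here.

```java
import java.util.Arrays;

public class CsvColumnDropSketch {
    // Index of the named column in the header row, or -1 if it is absent.
    public static int indexOf(String[] header, String name) {
        for (int i = 0; i < header.length; i++) {
            if (name.equals(header[i])) return i;
        }
        return -1;
    }

    // Copy a row, skipping the column at index excl.
    // Index 0 is a valid position; a negative index means "nothing to drop".
    public static String[] dropColumn(String[] row, int excl) {
        if (excl < 0 || excl >= row.length) return row.clone();
        String[] out = new String[row.length - 1];
        int o = 0;
        for (int i = 0; i < row.length; i++) {
            if (i == excl) continue;
            out[o++] = row[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical header and row for illustration.
        String[] header = {"_version_", "id", "time"};
        String[] row = {"1482", "42", "2014-01-01T00:00:00Z"};
        int excl = indexOf(header, "_version_");
        System.out.println(Arrays.toString(dropColumn(header, excl))); // prints [id, time]
        System.out.println(Arrays.toString(dropColumn(row, excl)));    // prints [42, 2014-01-01T00:00:00Z]
    }
}
```

The header row goes through the same function as the data rows, so the uploaded CSV stays consistent.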
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>edu.georgetown.library</groupId>
<artifactId>solrFix</artifactId>
<version>2.0</version>
<packaging>jar</packaging>
<name>SolrFix</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<artifactId>solr-solrj</artifactId>
<groupId>org.apache.solr</groupId>
<version>4.7.2</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.4</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.3.3</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>4.3.2</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpmime</artifactId>
<version>4.3.3</version>
</dependency>
<dependency>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
<version>3.4.6</version>
</dependency>
<dependency>
<groupId>org.codehaus.woodstox</groupId>
<artifactId>wstx-asl</artifactId>
<version>4.0.6</version>
</dependency>
<dependency>
<groupId>org.noggit</groupId>
<artifactId>noggit</artifactId>
<version>0.5</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jcl-over-slf4j</artifactId>
<optional>true</optional>
<version>1.7.7</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jul-to-slf4j</artifactId>
<optional>true</optional>
<version>1.7.7</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.7</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<optional>true</optional>
<version>1.7.7</version>
</dependency>
</dependencies>
<build>
<resources>
<resource>
<directory>src/main/resources</directory>
</resource>
</resources>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.7</source>
<target>1.7</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>2.4</version>
<configuration>
<archive>
<manifest>
<addClasspath>true</addClasspath>
<mainClass>edu.georgetown.library.solrFix.SolrTouch</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
<plugin>
<artifactId>maven-dependency-plugin</artifactId>
<executions>
<execution>
<phase>install</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>${project.build.directory}</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
<directory>target</directory>
<outputDirectory>target/classes</outputDirectory>
<finalName>${project.artifactId}-${project.version}</finalName>
<sourceDirectory>src/main</sourceDirectory>
</build>
</project>
public static void shardSolrIndex() throws IOException, SolrServerException {
/*
Start by faceting by year so we can include each year in a separate core !
*/
SolrQuery yearRangeQuery = new SolrQuery();
yearRangeQuery.setQuery("*:*");
yearRangeQuery.setRows(0);
yearRangeQuery.setFacet(true);
yearRangeQuery.add(FacetParams.FACET_RANGE, "time");
//We go back to the year 2000, this is a bit overkill but this way we ensure we have everything
//The alternative would be to sort but that isn't recommended since it would be a very costly query !
yearRangeQuery.add(FacetParams.FACET_RANGE_START, "NOW/YEAR-" + (Calendar.getInstance().get(Calendar.YEAR) - 2000) + "YEARS");
//Add the +0year to ensure that we DO NOT include the current year
yearRangeQuery.add(FacetParams.FACET_RANGE_END, "NOW/YEAR+0YEARS");
yearRangeQuery.add(FacetParams.FACET_RANGE_GAP, "+1YEAR");
yearRangeQuery.add(FacetParams.FACET_MINCOUNT, String.valueOf(1));
//Create a temp directory to store our files in !
File tempDirectory = new File(ConfigurationManager.getProperty("dspace.dir") + File.separator + "temp" + File.separator);
tempDirectory.mkdirs();
QueryResponse queryResponse = solr.query(yearRangeQuery);
//We only have one range query !
List<RangeFacet.Count> yearResults = queryResponse.getFacetRanges().get(0).getCounts();
for (RangeFacet.Count count : yearResults) {
long totalRecords = count.getCount();
//Create a range query from this !
//We start with our current year
DCDate dcStart = new DCDate(count.getValue());
Calendar endDate = Calendar.getInstance();
//Advance one year for the start of the next one !
endDate.setTime(dcStart.toDate());
endDate.add(Calendar.YEAR, 1);
DCDate dcEndDate = new DCDate(endDate.getTime());
StringBuilder filterQuery = new StringBuilder();
filterQuery.append("time:([");
filterQuery.append(ClientUtils.escapeQueryChars(dcStart.toString()));
filterQuery.append(" TO ");
filterQuery.append(ClientUtils.escapeQueryChars(dcEndDate.toString()));
filterQuery.append("]");
//The next part of the filter query excludes the content from midnight of the next year !
filterQuery.append(" NOT ").append(ClientUtils.escapeQueryChars(dcEndDate.toString()));
filterQuery.append(")");
Map<String, String> yearQueryParams = new HashMap<String, String>();
yearQueryParams.put(CommonParams.Q, "*:*");
yearQueryParams.put(CommonParams.ROWS, String.valueOf(10000));
yearQueryParams.put(CommonParams.FQ, filterQuery.toString());
yearQueryParams.put(CommonParams.WT, "csv");
//Start by creating a new core
String coreName = "statistics-" + dcStart.getYear();
HttpSolrServer statisticsYearServer = createCore(solr, coreName);
System.out.println("Moving: " + totalRecords + " into core " + coreName);
log.info("Moving: " + totalRecords + " records into core " + coreName);
List<File> filesToUpload = new ArrayList<File>();
for(int i = 0; i < totalRecords; i+=10000){
String solrRequestUrl = solr.getBaseURL() + "/select";
solrRequestUrl = generateURL(solrRequestUrl, yearQueryParams);
GetMethod get = new GetMethod(solrRequestUrl);
new HttpClient().executeMethod(get);
InputStream csvInputstream = get.getResponseBodyAsStream();
//Write the csv output to a file !
File csvFile = new File(tempDirectory.getPath() + File.separatorChar + "temp." + dcStart.getYear() + "." + i + ".csv");
CSVWriter bw = new CSVWriter(new FileWriter(csvFile));
int excl = -1;
try {
CSVReader reader = new CSVReader(new InputStreamReader(csvInputstream));
String [] nextLine;
String [] firstLine = new String[0];
if ((nextLine = reader.readNext()) != null) {
firstLine = nextLine;
for(int pi=0; pi<firstLine.length; pi++) {
String s = firstLine[pi];
if (s == null) s = "";
if (s.equals("_version_")) {
excl = pi;
break;
}
}
}
for (; nextLine !=null; nextLine = reader.readNext()) {
int sz = firstLine.length;
if (excl >= 0) sz--; //_version_ may be the first column (index 0)
String[] outLine = new String[sz];
int outIndex = 0;
for(int pi=0; pi<firstLine.length; pi++) {
String s = (pi > nextLine.length - 1) ? "\"\"" : nextLine[pi];
if (pi == excl) continue;
if (s == null) s = "";
outLine[outIndex++] = s;
}
bw.writeNext(outLine);
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
bw.flush();
bw.close();
//FileUtils.copyInputStreamToFile(csvInputstream, csvFile);
filesToUpload.add(csvFile);
//Add 10000 & start over again
yearQueryParams.put(CommonParams.START, String.valueOf((i + 10000)));
}
for (File tempCsv : filesToUpload) {
//Upload the data in the csv files to our new solr core
try {
ContentStreamUpdateRequest contentStreamUpdateRequest = new ContentStreamUpdateRequest("/update/csv");
contentStreamUpdateRequest.setParam("stream.contentType", "text/plain;charset=utf-8");
contentStreamUpdateRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
contentStreamUpdateRequest.addFile(tempCsv, "text/plain;charset=utf-8");
statisticsYearServer.request(contentStreamUpdateRequest);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
statisticsYearServer.commit(true, true);
//Delete contents of this year from our year query !
solr.deleteByQuery(filterQuery.toString());
solr.commit(true, true);
log.info("Moved " + totalRecords + " records into core: " + coreName);
}
FileUtils.deleteDirectory(tempDirectory);
}
package edu.georgetown.library.solrFix;
/*
* java -Xmx2500M -cp commons-codec-1.6.jar:commons-io-2.4.jar:commons-logging-1.1.3.jar:httpclient-4.3.3.jar:httpcore-4.3.2.jar:httpmime-4.3.3.jar:jcl-over-slf4j-1.7.7.jar:jline-0.9.94.jar:jul-to-slf4j-1.7.7.jar:log4j-1.2.17.jar:netty-3.7.0.Final.jar:noggit-0.5.jar:slf4j-api-1.7.7.jar:slf4j-log4j12-1.7.7.jar:solr-solrj-4.7.2.jar:solrFix-2.0.jar:stax2-api-3.0.1.jar:stax-api-1.0.1.jar:woodstox-core-asl-4.0.6.jar:zookeeper-3.4.6.jar edu.georgetown.library.solrFix.SolrTouch
*/
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Map;
import java.util.Vector;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;
public class SolrTouch {
static String conttype = "";
public static void main(String[] args) {
boolean win = System.getProperty("os.name").startsWith("Windows");
conttype = win ? "text/xml" : "application/xml";
long stime = Calendar.getInstance().getTimeInMillis();
int MAX = 50_000;
String url = "http://localhost/solr/statistics";
String turl = "http://localhost/solr/tstatistics";
try {
HttpSolrServer server = new HttpSolrServer( url );
HttpSolrServer tserver = new HttpSolrServer( turl );
//server.setRequestWriter(new BinaryRequestWriter());
XMLResponseParser xrp = new XMLResponseParser() {
public String getContentType() {return conttype;}
};
SolrQuery tsq = new SolrQuery();
tsq.setQuery("*:*");
tsq.setRows(0);
tserver.setParser(xrp);
QueryResponse tresp = tserver.query(tsq);
int start = (int)tresp.getResults().getNumFound();
tsq = new SolrQuery();
String myQuery = "*:*";
SolrQuery sq = new SolrQuery();
sq.setQuery(myQuery);
sq.setRows(MAX);
sq.setSort("time", ORDER.asc);
server.setParser(xrp);
for(int total = 0; total<100_000 ;) {
System.out.format("%,d%n", start);
sq.setStart(start);
QueryResponse resp = server.query(sq);
SolrDocumentList list = resp.getResults();
if (list.size() == 0) break;
ArrayList<SolrInputDocument> idocs = new ArrayList<SolrInputDocument>();
for(int i=0; i<list.size(); i++) {
SolrDocument doc = list.get(i);
SolrInputDocument idoc = new SolrInputDocument();
Map<String, Object> m = doc.getFieldValueMap();
for(String k: m.keySet()){
if (k.equals("uid")) continue;
if (k.equals("_version_")) continue;
idoc.addField(k, m.get(k));
}
idocs.add(idoc);
}
tserver.add(idocs);
tserver.commit();
start += list.size();
long etime = Calendar.getInstance().getTimeInMillis();
total += idocs.size();
System.out.format("%,d updated in %,d sec%n", total, (etime - stime)/1000);
System.gc();
etime = Calendar.getInstance().getTimeInMillis();
//System.out.format("End GC at %,d sec%n", (etime - stime)/1000);
}
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
package edu.georgetown.library.solrFix;
/*
*/
import java.io.IOException;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.ModifiableSolrParams;
public class SolrTouch2 {
static String conttype = "";
public static void main(String[] args) {
boolean win = System.getProperty("os.name").startsWith("Windows");
conttype = win ? "text/xml" : "application/xml";
String url = "http://localhost/solr/statistics";
try {
HttpSolrServer server = new HttpSolrServer( url );
XMLResponseParser xrp = new XMLResponseParser() {
public String getContentType() {return conttype;}
};
server.setParser(xrp);
String myQueryFacet = "NOT(uid:*)";
SolrQuery sqf = new SolrQuery();
sqf.setQuery(myQueryFacet);
sqf.setRows(0);
sqf.setFacet(true);
sqf.setFacetMinCount(0);
sqf.addFacetField("id");
sqf.setFacetSort("count");
sqf.setFacetLimit(-1);
int total = 0;
int subtotal = 0;
int batchcount = 0;
QueryResponse fresp = server.query(sqf);
List<FacetField> flist = fresp.getFacetFields();
ArrayList<String> solrids = new ArrayList<>();
for(FacetField ff: flist) {
for(org.apache.solr.client.solrj.response.FacetField.Count fc: ff.getValues()) {
if (fc.getCount() == 0) continue;
solrids.add(fc.getName());
subtotal += fc.getCount();
if (subtotal > 50_000 || solrids.size() >= 500) {
System.out.printf("\t%5d\t%5d\t%,d%n", ++batchcount, solrids.size(), subtotal);
total += queryByIds(server, solrids);
subtotal = 0;
}
}
}
System.out.printf("\t%5d\t%5d\t%,d%n", ++batchcount, solrids.size(), subtotal);
total += queryByIds(server, solrids);
System.out.printf("Total %d%n", total);
long etime = Calendar.getInstance().getTimeInMillis();
System.out.format("%,d updated in %,d sec%n", total, (etime - jstime)/1000);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public static long stime = Calendar.getInstance().getTimeInMillis();
public static long jstime = stime;
public static String getQuery(ArrayList<String> solrids) {
StringBuffer sbuf = new StringBuffer();
sbuf.append("NOT(uid:*) AND id:(");
boolean first = true;
for(String s:solrids) {
if (first) {
first = false;
} else {
sbuf.append(" OR ");
}
//Escape query syntax characters in case an id contains any
sbuf.append(ClientUtils.escapeQueryChars(s));
}
sbuf.append(")");
return sbuf.toString();
}
public static int queryByIds(HttpSolrServer server, ArrayList<String> solrids) throws SolrServerException, IOException {
if (solrids.size() == 0) return 0;
ArrayList<SolrInputDocument> idocs = new ArrayList<SolrInputDocument>();
int MAX = 250_000;
String myQuery = getQuery(solrids);
//System.out.println(myQuery);
SolrQuery sq = new SolrQuery();
sq.setQuery(myQuery);
sq.setRows(MAX);
sq.setSort("time", ORDER.asc);
QueryResponse resp = server.query(sq);
SolrDocumentList list = resp.getResults();
if (list.size() > 0) {
for(int i=0; i<list.size(); i++) {
SolrDocument doc = list.get(i);
SolrInputDocument idoc = ClientUtils.toSolrInputDocument(doc);
idocs.add(idoc);
}
}
server.add(idocs);
server.commit(true, true);
server.deleteByQuery(myQuery);
server.commit(true, true);
int subtotal = idocs.size();
//System.gc();
long etime = Calendar.getInstance().getTimeInMillis();
System.out.format("%,d updated in %,d sec%n", subtotal, (etime - stime)/1000);
stime = etime;
solrids.clear();
idocs.clear();
return subtotal;
}
}
@terrywbrady

SolrTouch2.java performs the same action as SolrTouch.java without requiring a separate core to hold the updated records.
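
SolrTouch2 works in place by grouping record ids into batches and re-adding each batch's records that lack a uid. Below is a rough sketch of the batching and query-building logic in plain Java with no Solr dependency. The class and method names and the sample counts are assumptions made for illustration; the thresholds play the same role as the 50_000 and 500 limits in SolrTouch2.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IdBatchSketch {
    // Group ids into batches. A batch is closed once its record count exceeds
    // maxRecords or it holds maxIds ids. Ids with a zero count are skipped.
    public static List<List<String>> batch(Map<String, Integer> countsById,
                                           int maxRecords, int maxIds) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int subtotal = 0;
        for (Map.Entry<String, Integer> e : countsById.entrySet()) {
            if (e.getValue() == 0) continue;
            current.add(e.getKey());
            subtotal += e.getValue();
            if (subtotal > maxRecords || current.size() >= maxIds) {
                batches.add(current);
                current = new ArrayList<>();
                subtotal = 0;
            }
        }
        if (!current.isEmpty()) batches.add(current);
        return batches;
    }

    // Build the query selecting records without a uid for one batch of ids.
    public static String toQuery(List<String> ids) {
        StringBuilder sb = new StringBuilder("NOT(uid:*) AND id:(");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) sb.append(" OR ");
            sb.append(ids.get(i));
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        // Hypothetical facet counts: id -> number of records without a uid.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("a", 30);
        counts.put("b", 30);
        counts.put("c", 0);
        counts.put("d", 10);
        for (List<String> ids : batch(counts, 50, 500)) {
            System.out.println(toQuery(ids));
        }
    }
}
```

With these sample counts the sketch prints one query for ids a and b, then a second query for d. A batch can exceed maxRecords by the size of its last id, so the row limit used when fetching a batch needs to stay above that worst case.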
