@robinkraft
Created April 4, 2013 01:11
Sample queries showing a problem with the VertNet wide source when using the Cascalog predicate operator :>>. Check out this diff for relevant recent changes to the project: https://github.com/VertNet/gulo/compare/feature/stats-queries...feature/new-stats-queries
Fri, 15 Jun 2012 16:05:03 -0500 http://ipt.vertnet.org:8080/ipt/resource.do?r=isu_mammals http://ipt.vertnet.org:8080/ipt/eml.do?r=isu_mammals http://ipt.vertnet.org:8080/ipt/archive.do?r=isu_mammals81e4afd9-0b61-483d-b7fa-0690f06c8e14 ISU Mammals 2e4967ed-fd35-4d34-ae4d-e8731d366e97 Illinois State University PreservedSpecimen 1Mammals North America United States McLean County 48.7288900000 -101.9727800000 1954-01-01 North America, United States, North Dakota, McLean County ISU Unknown Normal Skin only - 1 Female North Dakota 48.7288900° N 101.9727800° W 1954 Animalia Chordata Mammalia Rodentia Sciuridae Sciurus niger Sciurus niger
;; Take this query (where /tmp/simpledata.txt contains "a\ta\t\ta"):
(let [src (hfs-textline "/tmp/simpledata.txt")
      fields ["?first" "?second" "?third" "?fourth"]]
  (??<- [?first ?second ?third]
        (src ?line)
        (u/splitline ?line :>> fields)))
;=> (["a" "a" ""])
;; When I use a real Vertnet record (slurped in previously, stored in `a`),
;; I get the same, expected behavior:
(let [src [[a]]]
  (??<- [?pubdate]
        (src ?line)
        (u/splitline ?line :>> u/harvest-fields)))
;=> (["Fri, 15 Jun 2012 16:05:03 -0500"])
;; However, if I'm sourcing from a textfile using hfs-textline, Cascalog only
;; sees 169 fields instead of the 191 fields that are actually there.
(let [src (hfs-textline "/tmp/sampledata.txt")]
  (??<- [?pubdate]
        (src ?line)
        (u/splitline ?line :>> u/harvest-fields)))
;=> cascading.tuple.TupleException: operation added the wrong number of fields. ... got result size: 169
;; It's as though some of the empty fields are going missing.
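;; One possible explanation, sketched below under assumptions: if u/splitline
;; is built on clojure.string/split without a limit (an assumption, its source
;; is not shown here), trailing empty fields are dropped. A slurped string
;; usually keeps its final "\n", so its last field is never empty, while
;; hfs-textline hands over lines with the terminator already stripped.
;; `sample-line` and `split-keeping-empties` are hypothetical names used only
;; for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical line: four leading fields followed by two trailing empty fields.
(def sample-line "a\ta\t\ta\t\t")

;; Default split: trailing empty strings are discarded.
(count (str/split sample-line #"\t"))
;=> 4

;; With a trailing newline (as in a slurped file) the last field is "\n",
;; so nothing counts as a trailing empty field and all six survive.
(count (str/split (str sample-line "\n") #"\t"))
;=> 6

;; A negative limit keeps trailing empty strings.
(defn split-keeping-empties
  "Splits line on tabs, preserving trailing empty fields."
  [line]
  (str/split line #"\t" -1))

(count (split-keeping-empties sample-line))
;=> 6
```

;; If this is the cause, the 22 missing fields (191 - 169) would all be empty
;; fields at the end of the record; counting the fields of the raw line with
;; and without the -1 limit is a quick way to check.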