Created April 4, 2013 01:11
Sample queries showing a problem with the VertNet wide source when using the Cascalog predicate operator `:>>`. Check out this diff for relevant recent changes to the project: https://github.com/VertNet/gulo/compare/feature/stats-queries...feature/new-stats-queries
Fri, 15 Jun 2012 16:05:03 -0500 http://ipt.vertnet.org:8080/ipt/resource.do?r=isu_mammals http://ipt.vertnet.org:8080/ipt/eml.do?r=isu_mammals http://ipt.vertnet.org:8080/ipt/archive.do?r=isu_mammals81e4afd9-0b61-483d-b7fa-0690f06c8e14 ISU Mammals 2e4967ed-fd35-4d34-ae4d-e8731d366e97 Illinois State University PreservedSpecimen 1Mammals North America United States McLean County 48.7288900000 -101.9727800000 1954-01-01 North America, United States, North Dakota, McLean County ISU Unknown Normal Skin only - 1 Female North Dakota 48.7288900° N 101.9727800° W 1954 Animalia Chordata Mammalia Rodentia Sciuridae Sciurus niger Sciurus niger
;; Take this query (where /tmp/simpledata.txt contains "a\ta\t\ta"):
(let [src (hfs-textline "/tmp/simpledata.txt")
      fields ["?first" "?second" "?third" "?fourth"]]
  (??<- [?first ?second ?third]
        (src ?line)
        (u/splitline ?line :>> fields)))
;=> (["a" "a" ""])

;; When I use a real VertNet record (slurped in previously, stored in `a`),
;; I get the same, expected behavior:
(let [src [[a]]]
  (??<- [?pubdate]
        (src ?line)
        (u/splitline ?line :>> u/harvest-fields)))
;=> (["Fri, 15 Jun 2012 16:05:03 -0500"])

;; However, if I'm sourcing from a text file using hfs-textline, Cascalog only
;; sees 169 fields instead of the 191 fields that are actually there.
(let [src (hfs-textline "/tmp/sampledata.txt")]
  (??<- [?pubdate]
        (src ?line)
        (u/splitline ?line :>> u/harvest-fields)))
;=> cascading.tuple.TupleException: operation added the wrong number of fields. ... got result size: 169

;; It's as though some of the empty fields are going missing.
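One possible explanation, offered as an assumption rather than a confirmed diagnosis: if `u/splitline` is built on `clojure.string/split` without a limit (its implementation is not shown here), then trailing empty fields are dropped, because the underlying Java split discards trailing empty strings by default. A slurped record that still ends in a newline would not show the problem, since the final `"\n"` field is non-empty and protects the empties before it, whereas a text-line source hands over each line with the newline already stripped. The sketch below uses hypothetical helper names (`split-default`, `split-keep-empties`) and made-up sample strings to show the difference.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for u/splitline; the real implementation may differ.
(defn split-default [line]
  (str/split line #"\t"))

;; Same split, but a negative limit keeps trailing empty strings.
(defn split-keep-empties [line]
  (str/split line #"\t" -1))

;; Made-up sample lines: four tab-separated fields, the last two empty.
(def textline-style "a\tb\t\t")    ; newline already stripped
(def slurp-style    "a\tb\t\t\n")  ; trailing newline still present

(count (split-default textline-style))      ;=> 2, trailing empties dropped
(count (split-default slurp-style))         ;=> 4, the "\n" field keeps the empties
(count (split-keep-empties textline-style)) ;=> 4, all fields kept
```

If this does turn out to be the cause, comparing `(count (u/splitline a))` with the count for the same record read through `hfs-textline` should reproduce the 191 versus 169 gap, and a negative limit in the split would be one way to close it.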