Created August 9, 2018 14:09
Thomson Reuters invalid URLs, and a script to fix them
| #!perl -n | |
| # zcat OpenPermID-bulk-organization-20180729_132845.ttl.gz | perl permid-fix-url.pl - | gzip - > organization.ttl.gz | |
| # fix permid organization.ttl URLs to avoid: https://jira.ontotext.com/browse/GDB-2798 | |
| m{<https://\?\?} and next; # skip line with ????, pray and hope this won't break turtle | |
| # specific broken URLs first: the generic prefix fix below would otherwise rewrite http://wwhttp:// before the bbarkansas pattern could match | |
| s{http://www\.allaboutsigns\.nlstatus:}{http://www.allaboutsigns.nl}; | |
| s{http://www\.keds\.comhttp://www\.keds\.com}{http://www.keds.com}; | |
| s{https://fshaheen\@oryxgroup\.com\.qa}{https://oryxgroup.com.qa}; | |
| s{http://wwhttp://www\.bbarkansas\.com/w\.bbarkansas\.com/}{http://www.bbarkansas.com/}; | |
| s{ # various malformed variants of http: | |
| http://htpp:// | | |
| http://http:// | | |
| http://ttp:// | | |
| http://wwhttp:// | | |
| http://www\.http:// | |
| }{http://}x; | |
| # https://github.com/eclipse/rdf4j/issues/1066 | |
| s{<(https?://[\d.]+)>}{<$1/>}; # add a trailing slash when the host is only digits and dots and there is no path | |
| print; |
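To sanity-check the malformed-prefix alternation against IRIs from the warning log without streaming the whole dump through Perl, a small Python port can help. This is only a sketch and not part of the gist: the names `BAD_PREFIX` and `fix_prefix` are assumptions introduced here, and only the generic prefix substitution is ported.

```python
import re

# Python port (an assumption, not the gist's own code) of the Perl alternation
# that collapses doubled or misspelled scheme prefixes into "http://".
BAD_PREFIX = re.compile(
    r"""
    http://htpp://    |
    http://http://    |
    http://ttp://     |
    http://wwhttp://  |
    http://www\.http://
    """,
    re.VERBOSE,
)


def fix_prefix(line: str) -> str:
    """Rewrite the first malformed prefix on the line, like Perl's s{}{} without /g."""
    return BAD_PREFIX.sub("http://", line, count=1)
```

Feeding it a few IRIs copied from the riot warnings shows whether each variant is caught; lines without a malformed prefix pass through unchanged.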
| 12:42:01 WARN riot :: [line: 395875, col: 38] Bad IRI: <http://hp1946.cn:81> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name. | |
| 12:42:24 WARN riot :: [line: 3308342, col: 38] Bad IRI: <http://http://www.europages.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:42:26 WARN riot :: [line: 3506093, col: 38] Bad IRI: <http://http://www.jfm.go.jp/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:42:37 WARN riot :: [line: 4960571, col: 38] Bad IRI: <http://ttp://www.drachengas.de> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:42:49 WARN riot :: [line: 6458867, col: 38] Bad IRI: <http://www.amberleyestate.com.au:81/> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name. | |
| 12:43:16 WARN riot :: [line: 9725820, col: 38] Bad IRI: <http://http://www.asainternational.co.uk/index.asp> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:43:17 WARN riot :: [line: 9781935, col: 38] Bad IRI: <https://????.com/> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A component that is required by the scheme is missing. | |
| 12:43:41 WARN riot :: [line: 13262050, col: 38] Bad IRI: <http://220.163.124.46:88> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name. | |
| 12:43:50 WARN riot :: [line: 14358852, col: 38] Bad IRI: <http://ttp://www.woodholmes.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:44:03 WARN riot :: [line: 16972852, col: 41] Bad IRI: <http://www.china-kaidiwt.com:81/> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name. | |
| 12:44:33 WARN riot :: [line: 20687030, col: 38] Bad IRI: <http://http://videos.tllis.net> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:44:40 WARN riot :: [line: 21363354, col: 38] Bad IRI: <http://www.intermonetary.com:80/im/> Code: 13/DEFAULT_PORT_SHOULD_BE_OMITTED in PORT: If the port is the default one for the scheme it should be omitted. | |
| 12:44:40 WARN riot :: [line: 21363354, col: 38] Bad IRI: <http://www.intermonetary.com:80/im/> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name. | |
| 12:45:03 WARN riot :: [line: 24226772, col: 38] Bad IRI: <http://http://eastprovidencecycle.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:03 WARN riot :: [line: 24228141, col: 38] Bad IRI: <http://http://www.draperenergy.com/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:03 WARN riot :: [line: 24228247, col: 38] Bad IRI: <http://http://www.ucomparehealthcare.com/drs/matthew_galsky/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24875962, col: 38] Bad IRI: <http://http://www.silstar.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24875973, col: 38] Bad IRI: <http://http://www.jax.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24876058, col: 38] Bad IRI: <http://http://www.chapel2000.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24876347, col: 38] Bad IRI: <http://http://www.msm-cherokee.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24876487, col: 38] Bad IRI: <http://http://www.tradeenvelopes.com/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24876508, col: 38] Bad IRI: <http://http://www.wboc.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24876625, col: 38] Bad IRI: <http://http://www.bestmfg.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24876636, col: 38] Bad IRI: <http://http://www.t-m-c.com/contact_us.html> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24876647, col: 38] Bad IRI: <http://http://www.mitsubishi.co.jp> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24877078, col: 38] Bad IRI: <http://http://www.usfilter.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24878037, col: 38] Bad IRI: <http://http://www.recycleitnow.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24878080, col: 38] Bad IRI: <http://http://www.jrhale.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24889958, col: 38] Bad IRI: <http://http://www.ambed.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:08 WARN riot :: [line: 24899393, col: 38] Bad IRI: <http://http://www.saniglaze714.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24934777, col: 38] Bad IRI: <http://http://www.umpqua.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24934990, col: 38] Bad IRI: <http://http://www.superiorcontrolsco.com/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24936167, col: 38] Bad IRI: <http://http://www.kontron.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24938005, col: 38] Bad IRI: <http://http://www.rubbertrimproducts.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24940150, col: 38] Bad IRI: <http://http://www.auditbag.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24941280, col: 38] Bad IRI: <http://http://www.colonybank.net/ashburn/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24943779, col: 38] Bad IRI: <http://http://www.aquariusproducts.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24944588, col: 38] Bad IRI: <http://http://www.northpointford.net> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24945346, col: 38] Bad IRI: <http://http://www.mediaserv.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24945658, col: 38] Bad IRI: <http://http://www.kable.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24945866, col: 38] Bad IRI: <http://http://www.boisaise.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24945920, col: 38] Bad IRI: <http://http://www.fanshawe.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:09 WARN riot :: [line: 24998231, col: 38] Bad IRI: <http://http://www.sciotocorp.com/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:41 WARN riot :: [line: 28964666, col: 38] Bad IRI: <http://http://starbrandimports.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:42 WARN riot :: [line: 29037423, col: 38] Bad IRI: <http://http://www.carlblackchevrolet.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:43 WARN riot :: [line: 29132985, col: 38] Bad IRI: <http://www.http://marcqatar.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:46 WARN riot :: [line: 29449961, col: 38] Bad IRI: <http://http://www.citybank.com/portugal> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:46 WARN riot :: [line: 29450096, col: 38] Bad IRI: <http://http://www.cazadores.com.mx> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:46 WARN riot :: [line: 29450117, col: 38] Bad IRI: <http://http://www.cajaduero.es> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:46 WARN riot :: [line: 29450890, col: 38] Bad IRI: <http://http://www.schem-resin.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:46 WARN riot :: [line: 29450919, col: 38] Bad IRI: <http://http://www.ua.is> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:46 WARN riot :: [line: 29452508, col: 38] Bad IRI: <http://http://www.daeins.co.kr> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:45:58 WARN riot :: [line: 31130929, col: 38] Bad IRI: <http://htpp://microbiopharma.net> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:46:01 WARN riot :: [line: 31441347, col: 38] Bad IRI: <https://fshaheen@oryxgroup.com.qa> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present. | |
| 12:46:20 WARN riot :: [line: 34012103, col: 41] Bad IRI: <http://www.keds.comhttp://www.keds.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:46:26 WARN riot :: [line: 34882264, col: 41] Bad IRI: <http://www.autoclubgroup.com:80/michigan/> Code: 13/DEFAULT_PORT_SHOULD_BE_OMITTED in PORT: If the port is the default one for the scheme it should be omitted. | |
| 12:46:26 WARN riot :: [line: 34882264, col: 41] Bad IRI: <http://www.autoclubgroup.com:80/michigan/> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name. | |
| 12:46:29 WARN riot :: [line: 35278486, col: 38] Bad IRI: <http://www.planseguro.com.mx:88/?lang=es_mx> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name. | |
| 12:46:47 WARN riot :: [line: 37610001, col: 41] Bad IRI: <https://???.com> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A component that is required by the scheme is missing. | |
| 12:46:54 WARN riot :: [line: 38461003, col: 38] Bad IRI: <http://www.sfisd.net:81/admin/info.aspx> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name. | |
| 12:47:01 WARN riot :: [line: 39363675, col: 38] Bad IRI: <https://www.huic.co.kr:442/> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name. | |
| 12:47:03 WARN riot :: [line: 39520884, col: 38] Bad IRI: <http://http://www.xilink.co.kr> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:47:06 WARN riot :: [line: 39916927, col: 38] Bad IRI: <http://http://www.pickpay.ch> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:47:09 WARN riot :: [line: 40359191, col: 40] Bad IRI: <https://www.daehanpaper.com:449> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name. | |
| 12:47:17 WARN riot :: [line: 41437908, col: 38] Bad IRI: <http://http://www.seradyn.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:47:17 WARN riot :: [line: 41474988, col: 41] Bad IRI: <http://wwhttp://www.bbarkansas.com/w.bbarkansas.com/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:47:21 WARN riot :: [line: 41957369, col: 38] Bad IRI: <http://http://www.classicmotorcycle.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:47:29 WARN riot :: [line: 42837816, col: 38] Bad IRI: <http://http://www.fargem.com.tr/en> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:47:36 WARN riot :: [line: 43701057, col: 41] Bad IRI: <http://http://zhetysu.kz/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:47:52 WARN riot :: [line: 45951140, col: 41] Bad IRI: <http://www.allaboutsigns.nlstatus:> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:48:07 WARN riot :: [line: 47836395, col: 41] Bad IRI: <http://www.http://uba.com.jo> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:48:18 WARN riot :: [line: 49262635, col: 41] Bad IRI: <http://ttp://www.capitatrustees.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified. | |
| 12:48:26 WARN riot :: [line: 50379246, col: 38] Bad IRI: <https://????????????.??/> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A component that is required by the scheme is missing. | |
| 12:48:31 WARN riot :: [line: 50900663, col: 38] Bad IRI: <http://big5.spdb.com.cn:88/site/cht/news.spdb.com.cn/overseas_institutions/hk_bank/> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name. |
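A log like the one above is easier to triage when the warnings are grouped by violation name. The following is a minimal sketch, assuming the warning lines keep the `Bad IRI: <...> Code: NN/NAME in COMPONENT:` shape shown here; the names `WARNING` and `tally_warnings` are hypothetical helpers, not part of riot or of the gist.

```python
import re
from collections import Counter

# Matches the riot warning shape seen in the log:
#   ... Bad IRI: <iri> Code: NN/NAME in COMPONENT: message
WARNING = re.compile(
    r"Bad IRI: <(?P<iri>[^>]*)> Code: (?P<code>\d+)/(?P<name>\w+) in (?P<part>\w+):"
)


def tally_warnings(lines):
    """Count 'Bad IRI' warnings per violation name; other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = WARNING.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts
```

Passing an open file of saved riot output (for example a hypothetical `riot.log`) to `tally_warnings` and calling `most_common()` on the result lists the violation names by frequency, which makes it easy to see which classes of bad IRI remain after the script has run.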