@VladimirAlexiev
Created August 9, 2018 14:09
Thomson Reuters invalid URLs, and a script to fix them
#!perl -n
# zcat OpenPermID-bulk-organization-20180729_132845.ttl.gz | perl permid-fix-url.pl - | gzip - > organization.ttl.gz
# fix permid organization.ttl URLs to avoid: https://jira.ontotext.com/browse/GDB-2798
m{<https://\?\?} and next; # skip lines whose host starts with "??" (e.g. <https://????.com/>); dropping a whole line may break the Turtle
# specific broken URLs; these must run before the generic prefix fix below,
# which would otherwise rewrite the bbarkansas URL so that its pattern no longer matches
s{http://www\.allaboutsigns\.nlstatus:}{http://www.allaboutsigns.nl};
s{http://www\.keds\.comhttp://www\.keds\.com}{http://www.keds.com};
s{https://fshaheen\@oryxgroup\.com\.qa}{https://oryxgroup.com.qa};
s{http://wwhttp://www\.bbarkansas\.com/w\.bbarkansas\.com/}{http://www.bbarkansas.com/};
s{ # various malformed variants of http:
http://htpp:// |
http://http:// |
http://ttp:// |
http://wwhttp:// |
http://www\.http://
}{http://}x;
# https://github.com/eclipse/rdf4j/issues/1066
s{<(https?://[\d.]+)>}{<$1/>}; # host made only of digits and dots, no path: add a trailing slash
print;
12:42:01 WARN riot :: [line: 395875, col: 38] Bad IRI: <http://hp1946.cn:81> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.
12:42:24 WARN riot :: [line: 3308342, col: 38] Bad IRI: <http://http://www.europages.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:42:26 WARN riot :: [line: 3506093, col: 38] Bad IRI: <http://http://www.jfm.go.jp/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:42:37 WARN riot :: [line: 4960571, col: 38] Bad IRI: <http://ttp://www.drachengas.de> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:42:49 WARN riot :: [line: 6458867, col: 38] Bad IRI: <http://www.amberleyestate.com.au:81/> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.
12:43:16 WARN riot :: [line: 9725820, col: 38] Bad IRI: <http://http://www.asainternational.co.uk/index.asp> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:43:17 WARN riot :: [line: 9781935, col: 38] Bad IRI: <https://????.com/> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A component that is required by the scheme is missing.
12:43:41 WARN riot :: [line: 13262050, col: 38] Bad IRI: <http://220.163.124.46:88> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.
12:43:50 WARN riot :: [line: 14358852, col: 38] Bad IRI: <http://ttp://www.woodholmes.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:44:03 WARN riot :: [line: 16972852, col: 41] Bad IRI: <http://www.china-kaidiwt.com:81/> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.
12:44:33 WARN riot :: [line: 20687030, col: 38] Bad IRI: <http://http://videos.tllis.net> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:44:40 WARN riot :: [line: 21363354, col: 38] Bad IRI: <http://www.intermonetary.com:80/im/> Code: 13/DEFAULT_PORT_SHOULD_BE_OMITTED in PORT: If the port is the default one for the scheme it should be omitted.
12:44:40 WARN riot :: [line: 21363354, col: 38] Bad IRI: <http://www.intermonetary.com:80/im/> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.
12:45:03 WARN riot :: [line: 24226772, col: 38] Bad IRI: <http://http://eastprovidencecycle.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:03 WARN riot :: [line: 24228141, col: 38] Bad IRI: <http://http://www.draperenergy.com/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:03 WARN riot :: [line: 24228247, col: 38] Bad IRI: <http://http://www.ucomparehealthcare.com/drs/matthew_galsky/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24875962, col: 38] Bad IRI: <http://http://www.silstar.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24875973, col: 38] Bad IRI: <http://http://www.jax.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24876058, col: 38] Bad IRI: <http://http://www.chapel2000.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24876347, col: 38] Bad IRI: <http://http://www.msm-cherokee.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24876487, col: 38] Bad IRI: <http://http://www.tradeenvelopes.com/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24876508, col: 38] Bad IRI: <http://http://www.wboc.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24876625, col: 38] Bad IRI: <http://http://www.bestmfg.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24876636, col: 38] Bad IRI: <http://http://www.t-m-c.com/contact_us.html> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24876647, col: 38] Bad IRI: <http://http://www.mitsubishi.co.jp> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24877078, col: 38] Bad IRI: <http://http://www.usfilter.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24878037, col: 38] Bad IRI: <http://http://www.recycleitnow.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24878080, col: 38] Bad IRI: <http://http://www.jrhale.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24889958, col: 38] Bad IRI: <http://http://www.ambed.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:08 WARN riot :: [line: 24899393, col: 38] Bad IRI: <http://http://www.saniglaze714.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24934777, col: 38] Bad IRI: <http://http://www.umpqua.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24934990, col: 38] Bad IRI: <http://http://www.superiorcontrolsco.com/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24936167, col: 38] Bad IRI: <http://http://www.kontron.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24938005, col: 38] Bad IRI: <http://http://www.rubbertrimproducts.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24940150, col: 38] Bad IRI: <http://http://www.auditbag.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24941280, col: 38] Bad IRI: <http://http://www.colonybank.net/ashburn/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24943779, col: 38] Bad IRI: <http://http://www.aquariusproducts.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24944588, col: 38] Bad IRI: <http://http://www.northpointford.net> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24945346, col: 38] Bad IRI: <http://http://www.mediaserv.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24945658, col: 38] Bad IRI: <http://http://www.kable.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24945866, col: 38] Bad IRI: <http://http://www.boisaise.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24945920, col: 38] Bad IRI: <http://http://www.fanshawe.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:09 WARN riot :: [line: 24998231, col: 38] Bad IRI: <http://http://www.sciotocorp.com/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:41 WARN riot :: [line: 28964666, col: 38] Bad IRI: <http://http://starbrandimports.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:42 WARN riot :: [line: 29037423, col: 38] Bad IRI: <http://http://www.carlblackchevrolet.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:43 WARN riot :: [line: 29132985, col: 38] Bad IRI: <http://www.http://marcqatar.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:46 WARN riot :: [line: 29449961, col: 38] Bad IRI: <http://http://www.citybank.com/portugal> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:46 WARN riot :: [line: 29450096, col: 38] Bad IRI: <http://http://www.cazadores.com.mx> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:46 WARN riot :: [line: 29450117, col: 38] Bad IRI: <http://http://www.cajaduero.es> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:46 WARN riot :: [line: 29450890, col: 38] Bad IRI: <http://http://www.schem-resin.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:46 WARN riot :: [line: 29450919, col: 38] Bad IRI: <http://http://www.ua.is> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:46 WARN riot :: [line: 29452508, col: 38] Bad IRI: <http://http://www.daeins.co.kr> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:45:58 WARN riot :: [line: 31130929, col: 38] Bad IRI: <http://htpp://microbiopharma.net> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:46:01 WARN riot :: [line: 31441347, col: 38] Bad IRI: <https://fshaheen@oryxgroup.com.qa> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
12:46:20 WARN riot :: [line: 34012103, col: 41] Bad IRI: <http://www.keds.comhttp://www.keds.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:46:26 WARN riot :: [line: 34882264, col: 41] Bad IRI: <http://www.autoclubgroup.com:80/michigan/> Code: 13/DEFAULT_PORT_SHOULD_BE_OMITTED in PORT: If the port is the default one for the scheme it should be omitted.
12:46:26 WARN riot :: [line: 34882264, col: 41] Bad IRI: <http://www.autoclubgroup.com:80/michigan/> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.
12:46:29 WARN riot :: [line: 35278486, col: 38] Bad IRI: <http://www.planseguro.com.mx:88/?lang=es_mx> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.
12:46:47 WARN riot :: [line: 37610001, col: 41] Bad IRI: <https://???.com> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A component that is required by the scheme is missing.
12:46:54 WARN riot :: [line: 38461003, col: 38] Bad IRI: <http://www.sfisd.net:81/admin/info.aspx> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.
12:47:01 WARN riot :: [line: 39363675, col: 38] Bad IRI: <https://www.huic.co.kr:442/> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.
12:47:03 WARN riot :: [line: 39520884, col: 38] Bad IRI: <http://http://www.xilink.co.kr> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:47:06 WARN riot :: [line: 39916927, col: 38] Bad IRI: <http://http://www.pickpay.ch> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:47:09 WARN riot :: [line: 40359191, col: 40] Bad IRI: <https://www.daehanpaper.com:449> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.
12:47:17 WARN riot :: [line: 41437908, col: 38] Bad IRI: <http://http://www.seradyn.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:47:17 WARN riot :: [line: 41474988, col: 41] Bad IRI: <http://wwhttp://www.bbarkansas.com/w.bbarkansas.com/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:47:21 WARN riot :: [line: 41957369, col: 38] Bad IRI: <http://http://www.classicmotorcycle.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:47:29 WARN riot :: [line: 42837816, col: 38] Bad IRI: <http://http://www.fargem.com.tr/en> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:47:36 WARN riot :: [line: 43701057, col: 41] Bad IRI: <http://http://zhetysu.kz/> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:47:52 WARN riot :: [line: 45951140, col: 41] Bad IRI: <http://www.allaboutsigns.nlstatus:> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:48:07 WARN riot :: [line: 47836395, col: 41] Bad IRI: <http://www.http://uba.com.jo> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:48:18 WARN riot :: [line: 49262635, col: 41] Bad IRI: <http://ttp://www.capitatrustees.com> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
12:48:26 WARN riot :: [line: 50379246, col: 38] Bad IRI: <https://????????????.??/> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A component that is required by the scheme is missing.
12:48:31 WARN riot :: [line: 50900663, col: 38] Bad IRI: <http://big5.spdb.com.cn:88/site/cht/news.spdb.com.cn/overseas_institutions/hk_bank/> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.