Knut Hellan khellan

@khellan
khellan / README.md
Last active March 30, 2020 13:41
Sentencepiece 0.1.85 for Python 3.8 on OSX/Mac

Download the file and install it:

pipenv install <path to local wheel>

There you go.

@khellan
khellan / batch_deleter.py
Created September 21, 2018 11:15
Batchwise deletion of malformed HBase row keys. The loop never exits on its own, so it must be monitored and stopped manually.
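import happybase

# HBASE_MASTER_IP, TABLE_NAME and COLUMN_NAMES are placeholders to fill in.
connection = happybase.Connection(HBASE_MASTER_IP)
table = connection.table(TABLE_NAME)
while True:
    batch = table.batch()
    # Collect up to 10000 row keys that contain a tab character (\x09).
    for key, _ in table.scan(columns=[COLUMN_NAMES], filter="RowFilter(=, 'regexstring:.*\x09.*')", limit=10000):
        batch.delete(key)
    # Send all queued deletes in one round trip, then log the last key as progress.
    batch.send()
    print(key)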
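The loop above runs forever, which is why the description asks for monitoring. A minimal sketch of a self-terminating variant follows, assuming deletes become visible to the next scan once the batch is sent. The function name `delete_matching_rows` and its parameters are hypothetical and not part of the original gist; `table` is expected to behave like a happybase `Table`, providing `batch()` and `scan(filter=..., limit=...)`.

```python
def delete_matching_rows(table, row_filter, batch_size=10000):
    """Delete rows matching row_filter in batches and return the total deleted.

    Hypothetical helper: `table` is assumed to behave like a happybase Table.
    The loop stops as soon as a scan yields no matching rows.
    """
    deleted = 0
    while True:
        batch = table.batch()
        count = 0
        # Queue at most batch_size matching row keys for deletion.
        for key, _ in table.scan(filter=row_filter, limit=batch_size):
            batch.delete(key)
            count += 1
        if count == 0:
            # Nothing matched on this pass, so there is nothing left to delete.
            return deleted
        batch.send()
        deleted += count
```

With a real connection, this could be called as `delete_matching_rows(table, "RowFilter(=, 'regexstring:.*\x09.*')")`, reusing the filter string from the gist.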
Abigail - Nabby, Abby, Gail
Abraham - Abe, Bram
Adelaida - Ida, Idly
Alan - Al
Alastair - Al, Alex
Albert - Al, Bert
Alexander - Alex, Lex, Xander, Sander, Sandy
Alexandra - Alex, Ali, Lexie, Sandy
Alfred - Al, Alf, Alfie, Fred, Fredo
Alonzo - Lonnie
Satya Nadella
B Turner
Lisa Brummel
Rupert Bader
Janet Kennedy
Jordan Levin
Horacio Rrez
Christophe Capossela
Angela Jones
David Aucsmith
package no.companybook.extraction.tables;
import org.junit.Test;
import java.util.HashSet;
import java.util.Set;
import static org.junit.Assert.*;
public class PersonTest {
@khellan
khellan / settings.py
Last active May 31, 2016 20:15
Frontera scrapy fetch error
2016-05-31 21:08:31 [scrapy] INFO: Scrapy 1.1.0 started (bot: cb_crawl)
2016-05-31 21:08:31 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'cb_crawl.spiders', 'DOWNLOAD_TIMEOUT': 60, 'ROBOTSTXT_OBEY': True, 'DEPTH_LIMIT': 10, 'CONCURRENT_REQUESTS_PER_DOMAIN': 1, 'CONCURRENT_REQUESTS': 256, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['cb_crawl.spiders'], 'AUTOTHROTTLE_START_DELAY': 0.25, 'REACTOR_THREADPOOL_MAXSIZE': 20, 'BOT_NAME': 'cb_crawl', 'AJAXCRAWL_ENABLED': True, 'COOKIES_ENABLED': False, 'USER_AGENT': 'cb crawl (+http://www.companybooknetworking.com)', 'SCHEDULER': 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler', 'REDIRECT_ENABLED': False, 'AUTOTHROTTLE_ENABLED': True, 'DOWNLOAD_DELAY': 0.25}
2016-05-31 21:08:31 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.throttle.AutoThrottle']
2016-05-31 21:08:31 [scrapy] INFO: Enabled downloader middlewares
@khellan
khellan / word2vec_optimized.py
Last active June 22, 2018 14:30
A version of the optimized word2vec that doesn't require access to the training data when restoring the saved model. Run python tensorflow/tensorflow/models/embedding/word2vec_optimized.py --save_path=/Users/knut/data/wiki/model --embedding_size=500 --use --interactive to test.
# Copyright 2015 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
@khellan
khellan / word2vec.py
Created November 30, 2015 08:04
TensorFlow word2vec with model loading
"""Multi-threaded word2vec mini-batched skip-gram model.
Trains the model described in:
(Mikolov, et. al.) Efficient Estimation of Word Representations in Vector Space
ICLR 2013.
http://arxiv.org/abs/1301.3781
This model does traditional minibatching.
The key ops used are:
* placeholder for feeding in tensors for each example.
@khellan
khellan / JRuby 1.6.7 double resume
Created June 7, 2012 15:19
Double resume in JRuby. Note that the result in JRuby varies, so it seems to be timing-sensitive.
ruby -v
jruby 1.6.7 (ruby-1.9.2-p312) (2012-02-22 3e82bc8) (Java HotSpot(TM) 64-Bit Server VM 1.7.0_01) [linux-amd64-java]
ruby test/double_resume.rb
Loaded suite test/double_resume
Started
E
Finished in 0.157000 seconds.
1) Error:
test_0001_should_raise_double_resume(ResumingFiberSpec):
@khellan
khellan / gobbler.erl
Created May 15, 2012 06:40
Stepwise introduction to a distributed Erlang message loop
-module(gobbler).
-behaviour(gen_server).
-export([code_change/3, handle_call/3, handle_cast/2, handle_info/2]).
-export([init/1, start_link/0, terminate/2]).
-export([count/0, increment/0, stop/0]).
count() -> gen_server:call(?MODULE, count).