Skip to content

Instantly share code, notes, and snippets.

@pezon
pezon / scrappy_command.py
Last active December 24, 2015 15:19
ScrapyCommand is a Django management command that fires multiple scrapy spiders, each with different settings and parameters, i.e., if they need to support different pipelines. These settings are defined by a settings dictionary in django's config settings. ScrapyCommand can be extended by other commands to support separate scraping jobs. When t…
from __future__ import absolute_import
from django.core.management.base import BaseCommand
from django.conf import settings as django_settings
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import CrawlerSettings
from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher
@pezon
pezon / admin.py
Last active December 27, 2015 15:09
human write-able integer field for django admin. can also rip the individual dehumanize() function for standalone use.
from django import forms
from django.contrib import admin
from project.app.models import Foo
from project.app.forms import HumanIntegerField
class FooAdminForm(forms.ModelForm):
class Meta:
model = Foo
bar_field = HumanIntegerField()
@pezon
pezon / handsontable-observedblclick.js
Last active March 30, 2018 06:34
Handsontable plugin - Observe dblClick event
function ObserveDblClick() {
var plugin = this;
// public methods
this.init = function() {
bindHeaders.call(this);
};
this.dblclickColHeader = function(col) {
Handsontable.hooks.run(this, 'dblclickColHeader', col);
@pezon
pezon / json-to-table.py
Last active September 13, 2015 14:52
Proof of Concept: Convert JSON object to tabular data using ijson.
"""
Proof of concept: json-to-table
Convert JSON object to tabular data using ijson,
which provides a SAX-like iterable for JSON objects
"""
from urlparse import urlparse
from collections import Counter, OrderedDict
@pezon
pezon / README.md
Last active May 3, 2019 16:00
scrape namecheap domains to a spreadsheet

LAZY AF

  • Step 1. Load console-save.js
  • Step 2. Load namecheap-cart-scraper.js
  • Step 3. Convert with https://json-csv.com/
  • Step 4. Load to Google Spreadsheet
@pezon
pezon / README.md
Last active May 3, 2019 16:01
scrape namecheap domains to a spreadsheet

LAZY AF

  • Step 1. Load console-save.js
  • Step 2. Load namecheap-cart-scraper.js
  • Step 3. Convert with https://json-csv.com/
  • Step 4. Load to Google Spreadsheet
from collections import defaultdict
class Collection(object):
def __init__(self, items={}, id='id'):
self.id = id
self.items = defaultdict(dict)
self.update(items)
def __getitem__(self, key):
@pezon
pezon / 01_description.md
Last active May 3, 2019 15:58
Drop managed tables from Spark cluster

Maintenance - Drop managed tables

This notebook will clean up the table space by dropping all managed tables, including the metadata and underlying data. This process is necessary when updating table schemas (e.g., changing data types).

Note: It is advised to backup your data first.

Recommended procedure

  • Back-up data tables in ADLS (not implemented here)
  • Run maintenance - drop managed tables notebook (this notebook)
  • Re-import all data.
@pezon
pezon / null_transformer.scala
Created May 3, 2019 16:12
Applying a UDF function to multiple columns of different types. The UDF function here (null operation) is trivial.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
object NullTransformer {
val nullStringColumnUDF = udf(() => None: Option[String])
val nullLongColumnUDF = udf(() => None: Option[Long])
val nullIntegerColumnUDF = udf(() => None: Option[Integer])
val nullFloatColumnUDF = udf(() => None: Option[Float])
val nullDoubleColumnUDF = udf(() => None: Option[Double])
@pezon
pezon / LowerCaseAlphabet.scala
Created May 15, 2019 16:37
FPE tokenization Spark transformer
package com.example.tokenizers.alphabet
class LowerCaseAlphabet extends TokenizerAlphabet {
val CHARACTERS = Array[Char](
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z')
}