Nick Doiron mapmeld

@mapmeld
mapmeld / add_data_task.py
Created July 9, 2021 12:40
Add text file task to T5
t5.data.TaskRegistry.add(
    "byt5_ex",
    t5.data.TextLineTask,
    split_to_filepattern={
        "train": "gs://BUCKET/train_lines.txt",
        "validation": "gs://BUCKET/validation_lines.txt",
    },
    text_preprocessor=[
        functools.partial(
            t5.data.preprocessors.parse_tsv,
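
The task above reads one example per line and passes each line to `parse_tsv`. As a rough illustration of what that step does conceptually, the plain-Python sketch below splits a tab-separated line into named fields. The function name `split_tsv_line`, the field names `inputs` and `targets`, and the sample line are assumptions for illustration only; they do not come from the truncated snippet, and this is not the T5 implementation.

```python
def split_tsv_line(line, field_names=("inputs", "targets"), delim="\t"):
    """Split one text line into a dict keyed by field name (illustrative only)."""
    # Drop the trailing newline, then split on the delimiter.
    values = line.rstrip("\n").split(delim)
    if len(values) != len(field_names):
        raise ValueError(f"expected {len(field_names)} fields, got {len(values)}")
    return dict(zip(field_names, values))


# Hypothetical sample line: source text, a tab, then target text.
example = split_tsv_line("source text\ttarget text\n")
```

Lines with a missing or extra tab raise an error here, which makes malformed rows easy to spot before training.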
@mapmeld
mapmeld / bb.md
Last active January 4, 2021 16:01
Bangla Benchmark runs

Code: https://colab.research.google.com/drive/1vltPI81atzRvlALv4eCvEB0KdFoEaCOb?usp=sharing

Can these scores be improved? YES!

Options include rerunning with more training data, training for more epochs, or using other libraries to set a learning rate or other hyperparameters before training.

  • Experimenting with epochs: when I doubled the number of epochs, MuRIL improved only slightly (69.5 -> 69.7 on one task).

The point of a benchmark is to run these models through a reasonable and identical process; you can tweak hyperparameters on any model to improve results.

@mapmeld
mapmeld / twiml-lightning-share.md
Last active October 22, 2020 15:38
twiml-lightning-share
@mapmeld
mapmeld / dv-wave.py
Last active July 16, 2020 18:29
PythonCode
from simpletransformers.classification import ClassificationModel
# set use_cuda=False on CPU-only platforms
model = ClassificationModel('bert', 'monsoon-nlp/dv-wave', num_labels=8, use_cuda=True, args={
    'reprocess_input_data': True,
    'use_cached_eval_features': False,
    'overwrite_output_dir': True,
    'num_train_epochs': 3,
    'silent': True
})
@mapmeld
mapmeld / add_to_shapefile.py
Created July 5, 2020 23:10
Add JSON block data to a shapefile with GDAL
# pip install gdal
import json
from osgeo import ogr
# depends on your shapefile
target_shapefile = 'tl_2010_sample_shapefile.shp'
fips_id = 'GEOID10'
# load the saved block data, closing the file when done
with open('savefile.json', 'r') as f:
    saveblocks = json.load(f)
@mapmeld
mapmeld / load_acs.py
Last active July 8, 2020 16:15
Load 5-year ACS race + ethnicity data, ending in 2017
# pip install requests
import time, json
import requests
api_key = "API_KEY_STRING"
# look up FIPS for state and county:
# https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697
state = '12'
county_fips = ['086']
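
The preview stops before any request is made. As a minimal sketch, one way the query for each county might be assembled is shown below. The endpoint, the `B03002_001E` variable (the total from the Hispanic or Latino origin by race table), the tract-level geography, and the helper name `build_acs_url` are all assumptions for illustration, not necessarily what the full gist uses.

```python
from urllib.parse import urlencode

# Assumed endpoint for the 2013-2017 ACS 5-year estimates.
ACS5_2017_URL = "https://api.census.gov/data/2017/acs/acs5"


def build_acs_url(state, county, api_key,
                  variables=("NAME", "B03002_001E"), geography="tract"):
    """Return a query URL covering every `geography` unit in one county."""
    params = {
        "get": ",".join(variables),
        "for": f"{geography}:*",
        "in": f"state:{state} county:{county}",
        "key": api_key,
    }
    return f"{ACS5_2017_URL}?{urlencode(params)}"


# Mirrors the values in the snippet: Florida (12), Miami-Dade County (086).
urls = [build_acs_url("12", county, "API_KEY_STRING") for county in ["086"]]
```

Each URL could then be fetched with `requests.get(url).json()`; the request itself is left out so the sketch runs offline.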
@mapmeld
mapmeld / links.md
Last active May 13, 2020 04:19
References and links for Spanish counterfactuals
@mapmeld
mapmeld / AutoKeras_image_regression.ipynb
Created April 28, 2020 21:22
AutoKeras Image Regression
@mapmeld
mapmeld / yolo.py
Created April 27, 2020 05:16
Adjusting yolo.py to return raw boxes and classes for images
# -*- coding: utf-8 -*-
"""
Class definition of YOLO_v3 style detection model on image and video
"""
import colorsys
import os
from timeit import default_timer as timer
import numpy as np

Releasing Hindi ELECTRA model

This is a first attempt at a Hindi language model trained with Google Research's ELECTRA. I don't modify ELECTRA until we get into finetuning, and only then because there are hardcoded train and test files.

CoLab: https://colab.research.google.com/drive/1R8TciRSM7BONJRBc9CBZbzOmz39FTLl_

Additional background: https://medium.com/@mapmeld/teaching-hindi-to-electra-b11084baab81

It's available on HuggingFace: https://huggingface.co/monsoon-nlp/hindi-bert - sample usage: https://colab.research.google.com/drive/1mSeeSfVSOT7e-dVhPlmSsQRvpn6xC05w