Skip to content

Instantly share code, notes, and snippets.

View jdvala's full-sized avatar

Jay Vala jdvala

View GitHub Profile
@tamuhey
tamuhey / tokenizations_post.md
Last active July 27, 2024 14:46
How to calculate the alignment between BERT and spaCy tokens effectively and robustly

How to calculate the alignment between BERT and spaCy tokens effectively and robustly

image

site: https://tamuhey.github.io/tokenizations/

Natural Language Processing (NLP) has made great progress in recent years because of neural networks, which allows us to solve various tasks with end-to-end architecture. However, many NLP systems still require language-specific pre- and post-processing, especially in tokenizations. In this article, I describe an algorithm that simplifies calculating correspondence between tokens (e.g. BERT vs. spaCy), one such process. And I introduce Python and Rust libraries that implement this algorithm. Here are the library and the demo site links:

@ddelange
ddelange / Installing PyICU, libpostal, pypostal on Mac OS X 10.14+.md
Last active August 9, 2024 12:03
Installation instructions for libicu-dev, PyICU, libpostal, pypostal on Mac OS X 10.14+

Installing PyICU, libpostal, pypostal on Mac OS X 10.14+

libicu-dev (PyICU dependency)

brew uninstall --ignore-dependencies icu4c
brew install pkg-config icu4c  # keg-only
@mark5280
mark5280 / relative_to_absolute_urls.py
Last active July 21, 2023 13:03
A Python script to replace relative paths in src/href attributes in a HTML string with absolute path equivalents. Inspiration was provided by https://gist.github.com/sprintingdev/8843526. See additional notes below.
# In development of this script I had four goals:
# 1) Operate on a string fragment of HTML and assume it could be poorly formed (ie - might not parse well with lxml, etc.)
# 2) Ability to call the function with base_url since it would be used in the processing of multiple sites and HTML fragments
# from them
# 3) Desire to use urlparse.urljoin to insure that the resulting absolute url was well formed
# 4) Even if the HTML fragment contains absolute URLs I could pass it through this routine without issue
# Starting with https://gist.github.com/sprintingdev/8843526 I found that certain
# relative URLs would, when made absolute, were not always well formed. Often there
# could be multiple slashes in the URL path. To combat this urlparse.urljoin was introduced.
@stefanthaler
stefanthaler / stateful_lstm_embedding.py
Last active September 2, 2020 14:10
Simple example for a stateful keras LSTM with embedding.
"""
Learning Task:
Given a sequence, predict a label based on the first value of the sequence
Explanation of stateful LSTM and setup:
http://philipperemy.github.io/keras-stateful-lstm/
Exmple:
given a sequence [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], predict 1
given a sequence [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], predict 0
@sprintingdev
sprintingdev / relative_to_absolute_path.py
Created February 6, 2014 12:47
A python script to replace relative paths in src/href attributes of HTML with absolute paths. Created with help from http://stackoverflow.com/questions/3836644/c-sharp-convert-relative-to-absolute-links-in-html-string and http://docs.python.org/2/howto/regex.html
#!/usr/bin/env python
import re
import os
def srcrepl(match):
"Return the file contents with paths replaced"
absolutePath = "http://www.example.com/" #update the URL to be prefixed here.
print "<" + match.group(1) + match.group(2) + "=" + "\"" + absolutePath + match.group(3) + match.group(4) + "\"" + ">"
return "<" + match.group(1) + match.group(2) + "=" + "\"" + absolutePath + match.group(3) + match.group(4) + "\"" + ">"
#