Dan Nguyen dannguyen

@dannguyen
dannguyen / spleeter-and-ffmpeg-quick-tutorial.md
Last active November 29, 2022 04:23
Example use of spleeter (Python+Tensorflow audio-extraction library) and ffmpeg
@dannguyen
dannguyen / nihao-excel-README.md
Last active October 25, 2019 19:18
ni hao world (testing unicode in Excel)

Does Excel correctly open utf8-encoded CSV files?

(using Office for Mac 16.16.15)

printf "hello,world\nNǐ hǎo,shìjiè\n你好,世界\n" > nihao.csv
# or:
# curl -o nihao.csv https://gist.githubusercontent.com/dannguyen/13d5c39d499e4bbec622e055283fbb19/raw/f1c50ced36033b9a3aed36ebbbf0cf8734a98809/nihao.csv
  
open -a 'Microsoft Excel' nihao.csv
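One variant worth comparing against the test above: Excel is commonly reported to rely on a UTF-8 byte-order mark (bytes EF BB BF) to detect the encoding of a CSV file. Whether that applies to the Office for Mac version used here has not been verified. The sketch below builds a second copy of the file with the mark prepended; the filename `nihao-bom.csv` is a hypothetical name chosen for illustration.

```shell
#!/bin/sh
# Write the same three rows as the original example.
printf 'hello,world\nNǐ hǎo,shìjiè\n你好,世界\n' > nihao.csv

# Prepend the UTF-8 byte-order mark (EF BB BF, written here as octal escapes)
# to a copy of the file. nihao-bom.csv is a hypothetical filename.
printf '\357\273\277' > nihao-bom.csv
cat nihao.csv >> nihao-bom.csv

# Print the first three bytes of each file in hex to confirm the difference.
head -c 3 nihao.csv | od -An -tx1
head -c 3 nihao-bom.csv | od -An -tx1
```

Opening `nihao-bom.csv` with the same `open -a 'Microsoft Excel'` command and comparing it with `nihao.csv` would show whether the byte-order mark changes how the non-ASCII rows are displayed.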
@dannguyen
dannguyen / modern-cli.md
Last active May 25, 2023 18:19
my modern personal list of modern command-line utility replacements for macos bash
@dannguyen
dannguyen / bashfoo.yaml
Last active July 25, 2019 17:16
bashfoo.yaml : Dan Nguyen's personally curated list of bash/command-line commands that are useful but that he keeps forgetting
"""
bashfoo.yaml
https://gist.github.com/dannguyen/ad80b9d03f755822d3cc03174bcbef74
Dan Nguyen's personally curated list of bash/command-line commands and snippets
that are useful but that he keeps forgetting
"""
# gist: https://gist.github.com/dannguyen/ad80b9d03f755822d3cc03174bcbef74
@dannguyen
dannguyen / aws-textract-sample-readme.md
Last active October 30, 2023 05:49
A gist of AWS Textract sample/demo data for easy reference and preview, in case you're curious how well Amazon does when it comes to pdf-to-csv

AWS Textract -- sample document image and data from the official demo

AWS Textract is now out of closed beta. You can read the features page here, and you can also read about its limits here (e.g. no handwriting). Basically, if you've ever had to deal with the hell of getting structured data out of a PDF (scanned image or not), Textract is aiming for your business:

This short gist contains some of my brief observations about Textract and its demo, as well as direct links to the most relevant and important files, such as the Textract demo sample image and the resulting data files from Textract's API. If you have an AWS account, I h

@dannguyen
dannguyen / aws-textract-demo-readme.md
Last active May 30, 2019 04:38
Amazon Textract, i.e. AWS's OCR-as-a-cloud-service, was just released to the public. Here's how well it did with recognizing data tables in a particularly difficult PDF

[Ignore this gist, check out the GitHub repo] Testing AWS Textract's ability to correctly extract data tables from a difficult FBI stats report PDF

Update: I've since realized that this writeup would be far easier to do as its own GitHub repo, given the number of files involved. Please ignore this gist, which I'm keeping here as a backup, and instead visit: https://github.com/dannguyen/aws-textract-pdf-to-csv-demo

tl;dr: pretty good table structure overall, given the issues with the original PDF. However, there were inexplicable and critical data errors, as if Textract converted the PDF to an image, OCRed it, and then attempted to extract the data tables.

Amazon Textract was announced about 6 months ago but was made public today (May 29). If you have an AWS account, you can check out Textract's point-and-click demo, which allows you to upload an image or PDF for T

@dannguyen
dannguyen / sample-public-tweets.csv
Created April 30, 2019 13:11
A few seconds of Twitter's statuses/sample (a small random sample of all public statuses) https://developer.twitter.com/en/docs/tweets/sample-realtime/api-reference/get-statuses-sample.html
ID,Posted at,Screen name,Text
1123212586919419905,2019-04-30 13:08:43 +0000,shinya1720777,"妙にテンションの高いまーちゃんにあおられた
キビナゴりせ
#今日のりせ活 https://t.co/qabC3PQi7m"
1123212591109672961,2019-04-30 13:08:44 +0000,inesteiixeira,RT @lunaaaaa20: acho que a coisa mais linda é acompanhar o crescimento da pessoa que gostamos e contribuir para isso
1123212591109689348,2019-04-30 13:08:44 +0000,BTS20520283,#BBMAsTopSocial BTS @BTS_twt kalp
1123212591139045381,2019-04-30 13:08:44 +0000,Dipendr80247123,"RT @YL511: #البتكوين
📌اش الحكاية :
-
@dannguyen
dannguyen / cms-medicare-bulk-downloading.md
Created April 14, 2019 17:14
Compiling all the Medicare payment data
@dannguyen
dannguyen / census-bulk-downloading-scripts.md
Created April 14, 2019 17:12
Bulk Census data downloading script