Skip to content

Instantly share code, notes, and snippets.

@dannguyen
dannguyen / README.openai-structured-output-demo.md
Last active January 3, 2025 19:55
A basic test of OpenAI's Structured Output feature against financial disclosure reports and a newspaper's police blotter. Code examples use the Python SDK and pydantic for the schema definition.

Extracting financial disclosure reports and police blotter narratives using OpenAI's Structured Output

tl;dr this demo shows how to call OpenAI's gpt-4o-mini model, provide it with URL of a screenshot of a document, and extract data that follows a schema you define. The results are pretty solid even with little effort in defining the data — and no effort doing data prep. OpenAI's API could be a cost-efficient tool for large scale data gathering projects involving public documents.

OpenAI announced Structured Outputs for its API, a feature that allows users to specify the fields and schema of extracted data, and guarantees that the JSON output will follow that specification.

For example, given a Congressional financial disclosure report, with assets defined in a table like this:

Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
@Bilbottom
Bilbottom / values-statement.sql
Created May 21, 2024 20:24
The SQL `VALUES` statement
/*
The SQL `VALUES` statement (in DuckDB)
DuckDB version: 0.10.2
Bill Wallis, 2024-05-21
*/
select version();
@RaczeQ
RaczeQ / pyarrow_multiprocessing_streaming.py
Last active May 19, 2024 19:41
Pyarrow Multiprocessing with streaming the result
import multiprocessing
from pathlib import Path
from queue import Queue
from time import sleep
from typing import Callable
import pyarrow as pa
import pyarrow.parquet as pq
from tqdm import tqdm
@marklit
marklit / places.sql
Last active May 19, 2024 22:43
Pull H3s for Overture's Places Dataset for May 2024
COPY (
WITH a AS (
SELECT h3_cell_to_parent(h3_string_to_h3(SUBSTR(id, 0, 17)), 2) h3_2,
COUNT(*) num_recs
FROM read_parquet('s3://overturemaps-us-west-2/release/2024-05-16-beta.0/theme=places/type=place/*.parquet',
filename=true,
hive_partitioning=1)
GROUP BY 1
)
SELECT h3_cell_to_boundary_wkt(h3_2),
@JesseCrocker
JesseCrocker / merge-pmtiles.py
Created March 29, 2024 13:19
Merge a directory of PMTiles files into a single file
#!/usr/bin/env python3
import argparse
import os
from pmtiles.reader import MmapSource, Reader, all_tiles
from pmtiles.writer import Writer
from pmtiles.tile import Compression
from pmtiles.tile import zxy_to_tileid
from tqdm import tqdm
def merge_pmtiles(input_dir: str, output_file: str) -> None:
@raydouglass
raydouglass / ffmpeg_nvidia.sh
Last active February 21, 2024 00:06
Compile ffmpeg 6.0 with NVIDIA hardware acceleration
#!/usr/bin/env bash
set -euxo pipefail
if [ "$EUID" -ne 0 ]; then
echo "Please run as root"
exit 1
fi
if ! command nvcc --version >/dev/null 2>&1; then
@wriglz
wriglz / national_park_voronoi.sql
Last active January 17, 2023 12:56
SQL to generate Voronoi Polygons to determine National Park catchment areas.
/*
Data sources for National Park boundaries:
- England: https://environment.data.gov.uk/DefraDataDownload/?mapService=NE/NationalParksEngland&Mode=spatial
- Scotland: https://spatialdata.gov.scot/geonetwork/srv/eng/catalog.search#/home
- Wales: https://datamap.gov.wales/layers/inspire-nrw:NRW_NATIONAL_PARK
*/
WITH
park_info AS(
-- Select required information about each National Park from a merged dataset
@wriglz
wriglz / snap_points_to_lines.sql
Last active September 1, 2022 04:21
SQL to snap points to the closest line within a predefined radius
-- Snap the points to their closest lines, found in the subquery below
SELECT
point_id,
line_id,
ST_LINE_INTERPOLATE_POINT(line_geom,
ST_Line_Locate_Point(line_geom, point_geom)) AS snapped_points --Create the snapped points
FROM
--Subquery to find the closest line to each point (within a pre-defined raidus)
(
@kylebarron
kylebarron / convert.py
Last active August 31, 2024 04:55
preprocessing script for geoparquet on the web demo (https://observablehq.com/@kylebarron/geoparquet-on-the-web)
import geopandas as gpd
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq
import pygeos
import pyogrio
# https://ookla-open-data.s3.us-west-2.amazonaws.com/parquet/performance/type=mobile/year=2019/quarter=1/2019-01-01_performance_mobile_tiles.parquet