Skip to content

Instantly share code, notes, and snippets.

@pszemraj
Last active January 23, 2025 21:16
Show Gist options
  • Save pszemraj/2145b5c748290dcea44f83e535effe13 to your computer and use it in GitHub Desktop.
Save pszemraj/2145b5c748290dcea44f83e535effe13 to your computer and use it in GitHub Desktop.

bash/zsh util for readerlm-v2

Warning

Using this will send the data of your (public URL) to jina ai, and the license for readerlm-v2 is cc-by-nc

Here's the improved version of your smart_curl function that better handles filename extraction and works in Zsh:

function smart_curl() {
    local raw_url="$1"
    local header="${2:-'x-engine: readerlm-v2'}"
    local jina_prefix="r.jina.ai/"
    local url="https://${jina_prefix}${raw_url#*://}"

    # Clean and parse URL components
    local clean_url="${raw_url%%[?#]*}"          # Remove query/fragment
    clean_url="${clean_url#*://}"                # Remove protocol
    local domain="${clean_url%%/*}"              # Extract domain
    local path_part="/${clean_url#*/}"           # Extract path

    # Split path into meaningful segments
    local -a path_segments
    path_segments=(${(s:/:)path_part})
    path_segments=(${path_segments:#})  # Remove empty elements

    # Determine base filename
    local output_file=""
    case ${#path_segments} in
        0) # No path - use domain
            output_file="${domain//./_}"
            output_file="${output_file#www_}"  # Remove www prefix
            ;;
        1) # Single path segment
            output_file="${path_segments[-1]}"
            ;;
        *) # Multiple path segments
            output_file="${path_segments[-2]}_${path_segments[-1]}"
            ;;
    esac

    # Fallback if all above failed
    if [[ -z "$output_file" ]]; then
        output_file="${domain//./_}_$(date +%s)"
    fi

    # Clean problematic characters and ensure proper extension
    output_file="${output_file//[^a-zA-Z0-9._-]/_}"
    output_file="${${output_file//_+/_}#_}"
    [[ "$output_file" == *_ ]] && output_file="${output_file%_}"
    [[ "$output_file" != *.* ]] && output_file+=".md"

    # Download with improved curl command
    echo "Downloading to: $output_file"
    curl -sSL -H "$header" "$url" -o "$output_file"
    echo "Downloaded to: ${output_file:a}"  # Show absolute path
}

Key improvements:

  1. Better URL Parsing: Uses parameter expansion instead of regex for more reliable component extraction
  2. Path Segment Handling: Properly splits and analyzes URL path components
  3. Domain Cleaning: Removes www. prefix and converts dots to underscores
  4. Filename Sanitization: More robust cleaning of special characters and underscores
  5. Error Handling: Fallback to timestamp if all extraction methods fail
  6. Zsh Compatibility: Uses Zsh-specific features like array slicing and parameter expansion flags
  7. Curl Improvements: Uses -o flag for output and -sSL for better behavior

Example usage with your test case:

smart_curl "https://minecraft.fandom.com/wiki/Music_Disc?so=search#Discs"
# Output filename: wiki_Music_Disc.md

Other cases:

  • https://news.ycombinator.comycombinator_com.md
  • https://example.com/aboutabout.md
  • https://www.blog.google/features/newfeatures_new.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment