@dmd
Created June 24, 2026 14:46

wehackthemoon.com → Static Site: Conversion Notes

Goal: Produce a fully static copy of https://wehackthemoon.com/, with no server-side components, so it can be hosted without PHP, Drupal, or a database (the current host is discontinuing PHP/Drupal support).

Result: wehackthemoon-static/ — 515 HTML pages and ~9,100 supporting files (~1.4 GB). Upload the folder's contents to any static web host (nginx, Apache, Netlify, Cloudflare Pages, S3, etc.) and the site runs as-is.


What was done (high level)

  1. Mirrored the live site with wget (recursive crawl seeded from the XML sitemap plus link-following), capturing every reachable page and all of its CSS, JavaScript, images, and media.
  2. Rewrote every internal link to be relative so pages, navigation, and assets resolve correctly when served from any path — with no dependency on the original domain.
  3. Replaced the two server-dependent features that cannot exist in a static site:
    • Site search → rebuilt as a fully client-side search (lunr.js) with a pre-built index of all 512 content pages. No backend required.
    • "Contact Us" form → preserved as-is, because the original was already just a message directing visitors to the MIT Museum research form (it never submitted anything).
  4. Verified the result in a browser (homepage, biography pages, and search) and programmatically confirmed that every referenced page, image, stylesheet, and script returns successfully — zero broken links or assets.
  5. Packaged the finished site as a single zip for transfer.
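As an illustration of the search rebuild in step 3, a client-side index needs one small record per page. The sketch below extracts a title and plain-text body from a page's HTML. The record fields (`id`, `title`, `body`) and the function name are assumptions for illustration; the actual format of the pre-built lunr.js index is not described in these notes.

```python
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collect the <title> text and visible words, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.words = []
        self._in_title = False
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif not self._skip:
            self.words.extend(data.split())


def search_document(page_id: str, html: str) -> dict:
    """Build one search record (hypothetical schema) for a single page."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    return {"id": page_id, "title": parser.title, "body": " ".join(parser.words)}
```

A list of such records, serialized as JSON, is the kind of input a client-side search library can index ahead of time.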

Problems inherent to the source site that had to be solved

These were not artifacts of the conversion process — they are characteristics of how the original Drupal site was built and served. Each one would have produced a broken or incomplete archive if left unaddressed.

1. The sitemap advertised the wrong hostname

The site's sitemap.xml listed every URL under an internal hosting hostname (live-wehackthemoon.pantheonsite.io) rather than the public wehackthemoon.com. Crawling those URLs directly would have hit the wrong server. The seed list had to be corrected to the public domain before crawling.
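A minimal sketch of that correction, assuming a standard sitemaps.org `<urlset>` document. The function name is hypothetical, and forcing the scheme to `https` is an assumption rather than something the notes specify.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
INTERNAL_HOST = "live-wehackthemoon.pantheonsite.io"
PUBLIC_HOST = "wehackthemoon.com"


def seed_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL with the internal host swapped for the public one."""
    root = ET.fromstring(sitemap_xml)
    seeds = []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        parts = urlsplit((loc.text or "").strip())
        if parts.netloc == INTERNAL_HOST:
            # Assumption: the public site is served over https.
            parts = parts._replace(scheme="https", netloc=PUBLIC_HOST)
        seeds.append(urlunsplit(parts))
    return seeds
```

The resulting list can be written one URL per line and handed to the crawler as its seed file.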

2. An effectively infinite crawl trap (faceted galleries)

Media pages embed a gallery with sort/filter/pagination controls that generate endless unique URLs (?search=&bundle=All&page=1, &page=2, … observed running past page 98 and climbing). A naive mirror follows these forever, ballooning without ever finishing. These URL patterns had to be detected and excluded so the crawl could terminate while still capturing the real content.
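One way to express such an exclusion is a predicate on the query string. This is a sketch only: the parameter names come from the example URL above, and the real exclusion list used during the crawl may have been broader or narrower. The `/media` path in the usage example is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed trap parameters, taken from the example URL pattern above.
TRAP_PARAMS = {"search", "bundle", "page"}


def is_gallery_trap(url: str) -> bool:
    """True if the URL carries any of the gallery's sort/filter/pagination parameters."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return any(name in TRAP_PARAMS for name in query)
```

For example, `is_gallery_trap("https://wehackthemoon.com/media?search=&bundle=All&page=2")` is true, while an image derivative URL whose query holds only `h` and `itok` is not flagged.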

3. Drupal's administrative URL space (thousands of dead requests)

The pages contain links into Drupal's operational URLs — /node/N/edit, /node/N/delete, /taxonomy/term/N, etc. — which return 403 Forbidden to anonymous visitors. Following them produced over 10,000 forbidden requests with no useful content. The crawl had to be scoped to public content only.
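A sketch of the scoping rule as a path filter. It covers only the three patterns named above; the "etc." in the original list means the real filter likely needed more entries, so treat this pattern as incomplete.

```python
import re

# Only the operational paths named in the text; not an exhaustive list.
ADMIN_PATH = re.compile(r"^/(?:node/\d+/(?:edit|delete)|taxonomy/term/\d+)(?:/|$)")


def is_public_path(path: str) -> bool:
    """True if the path is not one of the known Drupal operational URLs."""
    return ADMIN_PATH.match(path) is None
```

Anchoring the pattern at both ends keeps it from rejecting ordinary content whose path merely begins with a similar word.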

4. "Uncacheable" dynamic pages defeated automatic link conversion

The homepage and the media-gallery page are served by Drupal as uncacheable (dynamic), so they were regenerated on every fetch. This caused the mirroring tool to skip converting their links to relative form, leaving them pointing at absolute server paths that would break once the original domain goes away. These pages required a dedicated link-conversion pass.
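The core of such a pass is computing a relative path from the page's own location to the target. A minimal sketch, assuming both paths are root-relative (start with `/`); the function name and the example paths are illustrative, not taken from the actual conversion script.

```python
import posixpath


def to_relative(page_path: str, target_path: str) -> str:
    """Rewrite a root-relative target so it resolves from the page's directory."""
    page_dir = posixpath.dirname(page_path) or "/"
    return posixpath.relpath(target_path, start=page_dir)
```

For a page at `/bios/neil-armstrong.html`, a stylesheet at `/sites/default/files/site.css` becomes `../sites/default/files/site.css`, which resolves the same way regardless of the domain or subdirectory the archive is served from.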

5. Drupal's aggregated CSS/JS "drifts" between page loads

Drupal bundles its CSS and JavaScript into hashed aggregate files (e.g. js_VtafjX….js). Because the dynamic pages were regenerated during the crawl, a page ended up referencing an aggregate hash that no longer matched what had been downloaded — producing a missing-script 404. The referenced-but-missing aggregate had to be fetched directly to restore site behavior.
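Finding a referenced-but-missing aggregate amounts to comparing what each page references against what exists on disk. The sketch below does that for scripts and stylesheets on one page. It assumes links have already been made relative, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlsplit


class AssetRefs(HTMLParser):
    """Collect script src and stylesheet href values from one page."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.refs.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.refs.append(attrs["href"])


def missing_assets(page: Path) -> list[str]:
    """Return the relative script/stylesheet references that have no file on disk."""
    parser = AssetRefs()
    parser.feed(page.read_text(encoding="utf-8", errors="replace"))
    missing = []
    for ref in parser.refs:
        if urlsplit(ref).scheme or ref.startswith("//"):
            continue  # external reference, out of scope
        # Percent-decoding maps an encoded link back to the on-disk filename.
        local = page.parent / unquote(urlsplit(ref).path)
        if not local.exists():
            missing.append(ref)
    return missing
```

Each reported reference can then be requested from the live server while it is still available.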

6. Responsive image derivatives also drift (and use security tokens)

Drupal generates many resized "image style" derivatives per photo, each addressed by a URL containing a focal-point/crop hash and a signed itok token (…?h=…&itok=…). The dynamic listing/detail pages referenced derivative variants that had not all been captured — 407 image references were initially missing. The full set of referenced derivatives had to be reconciled against the live server and the missing ones (≈330 images) downloaded directly.

7. The CMS emitted malformed markup for files with commas/spaces in their names

A number of biography portraits have source filenames containing literal commas and spaces (e.g. Greene, Kenton W.png). In responsive <picture>/srcset markup — where commas are structural separators — this produced invalid, broken image markup (e.g. Kenton Greene's portrait failed to load entirely). The corrupted responsive sources had to be repaired so the portraits render from their clean fallback image.
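To illustrate why these filenames break and how a repair might look, the sketch below flags a `srcset` whose comma-separated candidates do not look like "URL plus optional descriptor", then drops the offending `<source>` elements so the browser falls back to the `<img>`. This is a simplified heuristic, not the full HTML srcset parsing algorithm, and it is not necessarily how the actual repair was done.

```python
import re

DESCRIPTOR = re.compile(r"^\d+(?:\.\d+)?[wx]$")  # e.g. "2x" or "480w"
SOURCE_TAG = re.compile(r'<source\b[^>]*\bsrcset="([^"]*)"[^>]*>', re.IGNORECASE)


def srcset_is_well_formed(srcset: str) -> bool:
    """Heuristic: each comma-separated candidate is a URL plus at most one descriptor."""
    for candidate in srcset.split(","):
        tokens = candidate.split()
        if not tokens or len(tokens) > 2:
            return False
        if len(tokens) == 2 and not DESCRIPTOR.match(tokens[1]):
            return False
    return True


def drop_broken_sources(picture_html: str) -> str:
    """Remove <source> tags whose srcset fails the check, keeping the <img> fallback."""
    return SOURCE_TAG.sub(
        lambda m: m.group(0) if srcset_is_well_formed(m.group(1)) else "",
        picture_html,
    )
```

With a filename like `Greene, Kenton W.png`, splitting on the comma yields a second candidate with three tokens (`Kenton`, `W.png`, `1x`), which is how the check recognizes the markup as corrupted.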

8. The original site already contained broken internal links

29 internal links in the site's body content point to pages that return 404 on the live site itself — leftover links to articles that had been removed or had their URLs renamed in Drupal (e.g. links labeled "software", "simulation", "Polaris Missile", and several person names). These were broken before any archiving. With approval, 24 of them were redirected to the correct surviving page (each person link to that person's biography; topic links to the matching article/term page), and the remaining 5 — which have no equivalent page — were converted to plain text. In several cases this makes the archive more correct than the live site.
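The two kinds of repair can be sketched as a single pass over anchor tags: redirect the href when a surviving page is known, or unwrap the link to plain text when none exists. The mapping and the paths in the usage note are hypothetical placeholders; the real redirect table was compiled by hand and approved, as described above.

```python
import re

ANCHOR = re.compile(r'<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def repair_links(html: str, redirects: dict[str, str], dead: set[str]) -> str:
    """Redirect known-broken hrefs, or unwrap links that have no equivalent page."""

    def fix(match: re.Match) -> str:
        href, label = match.group(1), match.group(2)
        if href in redirects:
            # Swap only the href so other attributes on the tag are preserved.
            return match.group(0).replace(f'href="{href}"', f'href="{redirects[href]}"', 1)
        if href in dead:
            return label  # keep the visible text, drop the link
        return match.group(0)

    return ANCHOR.sub(fix, html)
```

Called with `redirects={"/people/old-name": "/bios/neil-armstrong"}` and `dead={"/articles/removed"}`, the first kind of link is repointed and the second is left as its label text.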

9. Clean (extensionless) URLs

The site uses extensionless "clean" URLs (e.g. /bios/neil-armstrong). These were normalized to portable .html files with all links adjusted to match, so the site is self-consistent and host-independent.
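The normalization can be sketched as a path-mapping rule. The `index.html` convention for directory paths and the "has a file extension" test are assumptions; a clean URL whose last segment contains a dot would need special handling.

```python
import posixpath


def static_path(url_path: str) -> str:
    """Map a clean URL path to the file path used in the static archive."""
    if url_path == "" or url_path.endswith("/"):
        return url_path + "index.html"  # assumed directory-index convention
    if posixpath.splitext(url_path)[1]:
        return url_path  # already has an extension (image, CSS, JS, ...)
    return url_path + ".html"
```

Applying the same function to both the saved filename and every link that points at it is what keeps the archive self-consistent.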


Known, intentional limitations

  • External links remain external (YouTube, NASA, MIT Museum, Draper, and social/share links) — by design.
  • The search index and a couple of metadata tags retain absolute references where appropriate; these do not affect rendering or navigation.
  • The static site is a snapshot; it will not update when content changes on the (retiring) source.

Deployment

Upload the contents of wehackthemoon-static/ to the document root of any static web server. No build step, runtime, or database is required.

Note: the archive intentionally preserves image filenames containing ? and space characters (Drupal's derivative naming). Standard web servers and unzip tools handle these correctly; if syncing to object storage (e.g. S3), preserve those characters in the keys so the relative links continue to resolve.
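The reason the characters must be preserved is that a link can only reach a file whose name contains `?` if the `?` is percent-encoded in the link; otherwise it is read as the start of a query string. The sketch below shows the encoding a link would need for a given on-disk name. The filename is a made-up example, and the exact set of characters the mirroring tool encoded may differ from what `quote` produces, although both decode to the same name.

```python
from urllib.parse import quote


def href_for_file(relative_path: str) -> str:
    """Percent-encode a relative file path so a browser requests that exact filename."""
    return quote(relative_path, safe="/")


print(href_for_file("styles/large/photo.jpg?h=abc&itok=xyz"))
# styles/large/photo.jpg%3Fh%3Dabc%26itok%3Dxyz
```

When copying to object storage, the key should be the decoded name (with the literal `?` and spaces), since that is what the server looks up after decoding the request path.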
