Last active
August 3, 2022 12:09
HTML Table to Markdown Extra converter
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>HTML Table to Markdown Extra Table</title>
  <style>
    * { -moz-box-sizing: border-box; -webkit-box-sizing: border-box; box-sizing: border-box;}
    body { font-family: -apple-system, "Segoe UI", Arial, Helvetica, sans-serif; line-height: 1.5;
      text-rendering: optimizeLegibility; -webkit-font-smoothing: antialiased; -moz-osx-font-smoothing: grayscale; }
    textarea { width: 100%; height: 15em; }
    button { font-size: inherit; }
    p, li { max-width: 75ch; } /* WCAG 2.0, guideline 1.4.8 */
    .two-by { display: grid; grid-gap: 1rem; gap: 1rem; }
    .two-by h2, .two-by p { margin: 0; }
    code { color: #303; font-weight: bold; font-family: Consolas, Menlo, "Courier New", Courier, monospace; border-radius: 2px; border: 1px solid rgba(0, 0, 0, 0.2); padding: 0 0.2em; background-color: #eee; }
    @media (min-width: 40rem) {
      .two-by { grid-template-columns: 1fr 1fr; }
    }
  </style>
</head>
<body>
  <h1>HTML Table to Markdown Extra Table</h1>
  <p><em>Paste HTML table code into the Input, click the Convert button, and a
    <a href="https://michelf.ca/projects/php-markdown/extra/#table">Markdown Extra table</a>
    will be placed in the Output. This is meant as a first-pass for table conversion and will
    not work for all types of tables.</em></p>
  <div class="two-by">
    <div>
      <h2>Input</h2>
      <form method="post" id="form">
        <p><textarea name="in" id="in" autofocus placeholder="paste HTML table code here"><table><thead>
<tr>
<th>Header 1</th>
<th>Header 2</th>
</tr>
</thead><tbody>
<tr>
<td><a href="http://example.com/">Cell 1, 1</a></td>
<td>Cell <em>1, 2</em></td>
</tr>
<tr>
<td>Cell <code>2, 1</code></td>
<td>Cell <strong>2, 2</strong></td>
</tr>
<tr>
<td><img src="cat.png" alt="cat pic"></td>
<td><img alt='another cat pic' src="kitty.gif" width="300"></td>
</tr>
</tbody></table></textarea>
        <button type="submit" id="submit">Convert</button></p>
      </form>
    </div>
    <div>
      <h2>Output</h2>
      <p><textarea id="out" placeholder="markdown extra converted table will appear here"></textarea></p>
    </div>
  </div>
  <h2>Notes</h2>
  <ul>
    <li>Malformed HTML such as missing closing tags will likely produce undesirable results.
      Tag attribute values, such as the URL for <code>href</code>, are assumed to be quoted
      with <code>"</code> or <code>'</code>. For this tool, it only affects <code>href</code>,
      <code>src</code>, and <code>alt</code> attributes.</li>
    <li>No CSS styling is preserved.</li>
    <li>Cell widths in the Markdown are not equalized.</li>
    <li>The first row in the table is assumed to be the header row, regardless
      of <code>thead</code> and <code>th</code> tags.</li>
    <li>Cells containing <code>ul/ol</code>, <code>p</code> are compressed to single
      line cells. It's likely that such a complex table would be better served with
      different formatting (headings, paragraphs) anyway for
      accessibility/readability reasons.</li>
    <li>Tables with <code>colspan</code> and <code>rowspan</code> are likely to
      produce undesirable results and are beyond the scope of Markdown Extra anyway.</li>
    <li><code>img</code> tags are converted to Markdown only if they have a
      <code>src</code> and an <code>alt</code> attribute. Why?
      <a href="https://www.w3.org/TR/WCAG20/#text-equiv">WCAG 2.0, guideline 1.1</a>.
      All other attributes are ignored, including <code>title</code>.</li>
    <li><code>a</code>, <code>br</code>, <code>strong</code>, <code>em</code>,
      and <code>code</code> tags are converted to Markdown. <strong>All other
      tags are discarded.</strong></li>
  </ul>
  <script>
  ((window, document) => {
    // compliments to http://stackoverflow.com/a/5450113
    const repeat = (pattern, count) => {
      if (count < 1) return '';
      let result = '';
      while (count > 1) {
        if (count & 1) result += pattern;
        count >>= 1, pattern += pattern;
      }
      return result + pattern;
    };
    // cache the inputs
    const i = document.getElementById('in');
    const o = document.getElementById('out');
    // bind a submit handler to the form
    document.getElementById('form').addEventListener('submit', ev => {
      // stop the normal form submit process because we're doing everything here in this function
      ev.preventDefault();
      // get the input text
      let t = i.value;
      // only proceed if there is text to work with
      if (t.length) {
        // now perform all the changes on our t string with a series of regex replacements
        t = t.replace(/\t/g, ' '); // convert tabs to a single space
        t = t.replace(/\s*[\r\n]\s*/g, ''); // remove lines
        t = t.replace(/<\!--[\s\S]*?-->/g, ''); // remove html comments
        t = t.replace(/ *<a[^>]* href=(["'])(.*?)\1[^>]*> *(.*?)<\/a>/ig, '[$3]($2)'); // convert anchor tags
        t = t.replace(/<\/?strong.*?>/g, '**'); // convert strong to **
        t = t.replace(/<\/?em.*?>/g, '_'); // convert em to _
        t = t.replace(/<\/?code.*?>/g, '`'); // convert code to `
        t = t.replace(/<img[^>]* src=(["'])(.*?)\1[^>]* alt=(["'])(.*?)\3[^>]*>/ig, '![$4]($2)'); // convert images with src, alt
        t = t.replace(/<img[^>]* alt=(["'])(.*?)\1[^>]* src=(["'])(.*?)\3[^>]*>/ig, '![$2]($4)'); // convert images with alt, src
        t = t.replace(/ *<tr[^>]*>/ig, '\n|'); // build <tr> as "\n|"
        t = t.replace(/\s*<t[dh].*?>/ig, ' '); // convert <td> and <th> to a space
        t = t.replace(/\s*<\/t[dh]>/ig, ' |'); // build </td> and </th> as " |"
        t = t.replace(/&nbsp;/ig, ''); // drop non-breaking spaces
        t = t.replace(/&amp;/ig, '&'); // de-entize ampersands
        t = t.replace(/<br[^>]*>/ig, '\t'); // temporarily convert BR tags to tabs
        t = t.replace(/<\/?[^>]+>/ig, ''); // drop all other tags
        t = t.replace(/\t *\|/g, ' |'); // drop cell-ending BRs
        t = t.replace(/\s*\t\s*/g, '<br />'); // convert tabs back to BR tags
        t = t.replace(/\| {2,}/g, '| '); // tighten spacing after the pipe symbols
        t = t.replace(/ {2,}\|/g, ' |'); // tighten spacing before the pipe symbols
        t = t.replace(/^ +\|/gm, '|'); // trim line-leading whitespace
        t = t.replace(/ {4,}/g, '   '); // convert 4+ spaces to three spaces
        t = t.replace(/^\s+|\s+$/g, ''); // trim whitespace
        // generate the header row separators
        const lines = t.split("\n");
        if (lines && lines.length) {
          const segments = lines[0].split('|');
          let headers = '|';
          for (let j = 1; j < segments.length - 1; j++) {
            headers += repeat('-', segments[j].length) + '|';
          }
          // console.log(headers);
          t = lines[0] + "\n" + headers + "\n" + lines.slice(1).join("\n");
        }
      }
      // put the new version into the output box
      o.value = t;
      // clear the old version for quick pasting of new code
      i.value = '';
      // select all the text in the output box and set the browser focus to the output box
      o.select();
      o.focus();
    });
  })(window, document);
  </script>
</body>
</html>
<!--
LICENSE: CC0, Public Domain
CHANGE LOG:
2021-02-11, SW:
  - stop paying the jQuery tax
  - upgrade to ES6 syntax
  - swap the CSS to use Grid instead of Flexbox and renamed to .two-by
    because… reasons
  - allow tick parameter value delimiter support in addition to the default
    double quote delimiter; only for href, src, alt
  - add some img tags to the sample table
  - make other mysterious but minor tweaks to html, notes, and CSS
  - add this change log
  - add license
-->
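The header separator step near the end of the script can be tried on its own. What follows is a minimal sketch, not the gist's code: `separatorRow` is a hypothetical name, and it assumes the built-in `String.prototype.repeat` (ES2015) in place of the script's own `repeat` helper.

```javascript
// Hypothetical helper (not part of the gist): builds the Markdown Extra
// separator row that sits under an already-converted header line.
const separatorRow = (headerLine) => {
  // '| A | B |'.split('|') yields ['', ' A ', ' B ', '']; drop the empty ends
  const segments = headerLine.split('|').slice(1, -1);
  // one run of dashes per cell, as wide as that cell including its padding
  return '|' + segments.map((s) => '-'.repeat(s.length) + '|').join('');
};

console.log(separatorRow('| Header 1 | Header 2 |'));
// prints |----------|----------|
```

As in the gist, the dash count follows the header cell width only, so columns are not equalized against the body rows.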
Author sunnywalker commented Feb 11, 2021 via email
Aloha Russell,
I'm glad you've found it useful!
Do you have some example links that it fails on?
I just tried a few (including the sample table in the default page load,
`<td><a href="http://example.com/">Cell 1, 1</a></td>`) and it converted to
`[Cell 1, 1](http://example.com/)` without a problem.
Or do you mean you want `<a href="https://example.com/">link text</a>` to
convert to the shortcut MD `<https://example.com/>`?
The regex I'm using assumes well-formed HTML with double-quoted parameters.
That is, <a … href="url" …>, so it will fail if the href value is unquoted
or single-quoted. Maybe that's where the conversion problem is happening?
It's simple to account for either with some adjustment to the anchor tag
regex.
If you can, share something that is failing and I'll take a look at it.
Mahalo,
Sunny
I added support for tick-quoted (single-quoted) attributes for href, src, and
alt, dumped jQuery, made a few other changes, and updated the gist. I hope this
is helpful for you.
Mahalo,
Sunny
The CodePen version of this gist has been updated with these new changes as well.
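The quote handling discussed in this thread can be checked in isolation. This is a minimal sketch that wraps the anchor pattern from the updated gist in a standalone function; the name `anchorToMarkdown` is hypothetical and exists only for illustration.

```javascript
// Hypothetical wrapper around the gist's anchor rule.
// (["']) captures the opening quote and \1 requires the same character
// to close the href value, so both "url" and 'url' are accepted.
const anchorToMarkdown = (html) =>
  html.replace(/ *<a[^>]* href=(["'])(.*?)\1[^>]*> *(.*?)<\/a>/ig, '[$3]($2)');

// double-quoted href
console.log(anchorToMarkdown('<a href="http://example.com/">Cell 1, 1</a>'));
// single-quoted href with another attribute before it
console.log(anchorToMarkdown("<a class='x' href='http://example.com/'>Cell 1, 1</a>"));
// unquoted href: the pattern does not match, so the tag is left as-is
console.log(anchorToMarkdown('<a href=http://example.com/>Cell 1, 1</a>'));
```

An unquoted `href` still falls through untouched, which matches the note that attribute values are assumed to be quoted.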