Skip to content

Instantly share code, notes, and snippets.

@danmou
Last active August 21, 2024 18:41
Show Gist options
  • Save danmou/46e21b0c7090cb27425fa389f5102b16 to your computer and use it in GitHub Desktop.
Save danmou/46e21b0c7090cb27425fa389f5102b16 to your computer and use it in GitHub Desktop.
Onenote export to HTML. NOTE: This script is now maintained at https://github.com/Danmou/onenote_export
### README
# This Python scripts exports all the OneNote notebooks linked to your Microsoft account to HTML files.
## Output
# The notebooks will each become a subdirectory of the `output` folder, with further subdirectories
# for the sections within each notebook and the pages within each section. Each page is a directory
# containing the HTML file `main.html` and two directories `images` and `attachments` (if necessary)
# for the images and attachments. Any sub-pages will be subdirectories within this one.
## Setup
# In order to run the script, you must first do the following:
# 1. Go to https://aad.portal.azure.com/ and log in with your Microsoft account.
# 2. Select "Azure Active Directory" and then "App registrations" under "Manage".
# 3. Select "New registration". Choose any name, set "Supported account types" to "Accounts in any
# organizational directory and personal Microsoft accounts" and under "Redirect URI", select Web
# and enter `http://localhost:5000/getToken`. Register.
# 4. Copy "Application (client) ID" and paste it as `client_id` below in this script.
# 5. Select "Certificates & secrets" under "Manage". Press "New client secret", choose a name and
# confirm.
# 6. Copy the client secret and paste it as `secret` below in this script.
# 7. Select "API permissions" under "Manage". Press "Add a permission", scroll down and select OneNote,
# choose "Delegated permissions" and check "Notes.Read" and "Notes.Read.All". Press "Add
# permissions".
# 8. Make sure you have Python 3.7 (or newer) installed and install the dependencies using the command
# `pip install flask msal requests_oauthlib`.
## Running
# In a terminal, navigate to the directory where this script is located and run it using
# `python onenote_export.py`. This will start a local web server on port 5000.
# In your browser navigate to http://localhost:5000 and log in to your Microsoft account.
# The first time you do it, you will also have to accept that the app can read your OneNote notes.
# (This does not give any third parties access to your data, as long as you don't share the client id
# and secret you created on the Azure portal). After this, go back to the terminal to follow the progress.
## Note
# Microsoft limits how many requests you can do within a given time period. Therefore, if you have many
# notes you might eventually see messages like this in the terminal: "Too many requests, waiting 20s and
# trying again." This is not a problem, but it means the entire process can take a while. Also, the login
# session can expire after a while, which results in a TokenExpiredError. If this happens, simply reload
# `http://localhost:5000` and the script will continue (skipping the files it already downloaded).
client_id = '...'
secret = '...'
import os
import random
import re
import shutil
import string
import time
import uuid
from html.parser import HTMLParser
from pathlib import Path
from xml.etree import ElementTree
import flask
import msal
from requests_oauthlib import OAuth2Session
output_path = Path('output')
graph_url = 'https://graph.microsoft.com/v1.0'
authority_url = 'https://login.microsoftonline.com/common'
scopes = ['Notes.Read', 'Notes.Read.All']
redirect_uri = 'http://localhost:5000/getToken'
app = flask.Flask(__name__)
app.debug = True
app.secret_key = os.urandom(16)
application = msal.ConfidentialClientApplication(
client_id,
authority=authority_url,
client_credential=secret
)
@app.route("/")
def main():
resp = flask.Response(status=307)
resp.headers['location'] = '/login'
return resp
@app.route("/login")
def login():
auth_state = str(uuid.uuid4())
flask.session['state'] = auth_state
authorization_url = application.get_authorization_request_url(scopes, state=auth_state,
redirect_uri=redirect_uri)
resp = flask.Response(status=307)
resp.headers['location'] = authorization_url
return resp
def get_json(graph_client, url, params=None):
values = []
next_page = url
while next_page:
resp = get(graph_client, next_page, params=params).json()
if 'value' not in resp:
raise RuntimeError(f'Invalid server response: {resp}')
values += resp['value']
next_page = resp.get('@odata.nextLink')
return values
def get(graph_client, url, params=None):
while True:
resp = graph_client.get(url, params=params)
if resp.status_code == 429:
# We are being throttled due to too many requests.
# See https://docs.microsoft.com/en-us/graph/throttling
print(' Too many requests, waiting 20s and trying again.')
time.sleep(20)
elif resp.status_code == 500:
# In my case, one specific note page consistently gave this status
# code when trying to get the content. The error was "19999:
# Something failed, the API cannot share any more information
# at the time of the request."
print(' Error 500, skipping this page.')
return None
else:
resp.raise_for_status()
return resp
def download_attachments(graph_client, content, out_dir):
image_dir = out_dir / 'images'
attachment_dir = out_dir / 'attachments'
# if image_dir.exists():
# shutil.rmtree(image_dir)
class MyHTMLParser(HTMLParser):
def handle_starttag(self, tag, attrs):
self.attrs = {k: v for k, v in attrs}
def generate_html(tag, props):
element = ElementTree.Element(tag, attrib=props)
return ElementTree.tostring(element, encoding='unicode')
def download_image(tag_match):
# <img width="843" height="218.5" src="..." data-src-type="image/png" data-fullres-src="..." data-fullres-src-type="image/png" />
parser = MyHTMLParser()
parser.feed(tag_match[0])
props = parser.attrs
image_url = props.get('data-fullres-src', props['src'])
image_type = props.get('data-fullres-src-type', props['data-src-type']).split("/")[-1]
file_name = ''.join(random.choice(string.ascii_lowercase) for _ in range(10)) + '.' + image_type
img = get(graph_client, image_url).content
print(f' Downloaded image of {len(img)} bytes.')
image_dir.mkdir(exist_ok=True)
with open(image_dir / file_name, "wb") as f:
f.write(img)
props['src'] = "images/" + file_name
props = {k: v for k, v in props.items() if not 'data-fullres-src' in k}
return generate_html('img', props)
def download_attachment(tag_match):
# <object data-attachment="Trig_Cheat_Sheet.pdf" type="application/pdf" data="..." style="position:absolute;left:528px;top:139px" />
parser = MyHTMLParser()
parser.feed(tag_match[0])
props = parser.attrs
data_url = props['data']
file_name = props['data-attachment']
if (attachment_dir / file_name).exists():
print(f' Attachment {file_name} already downloaded; skipping.')
else:
data = get(graph_client, data_url).content
print(f' Downloaded attachment {file_name} of {len(data)} bytes.')
attachment_dir.mkdir(exist_ok=True)
with open(attachment_dir / file_name, "wb") as f:
f.write(data)
props['data'] = "attachments/" + file_name
return generate_html('object', props)
content = re.sub(r"<img .*?\/>", download_image, content, flags=re.DOTALL)
content = re.sub(r"<object .*?\/>", download_attachment, content, flags=re.DOTALL)
return content
@app.route("/getToken")
def main_logic():
code = flask.request.args['code']
token = application.acquire_token_by_authorization_code(code, scopes=scopes,
redirect_uri=redirect_uri)
graph_client = OAuth2Session(token=token)
notebooks = get_json(graph_client, f'{graph_url}/me/onenote/notebooks')
print(f'Got {len(notebooks)} notebooks.')
for nb in notebooks:
nb_name = nb["displayName"]
print(f'Opening notebook {nb_name}')
sections = get_json(graph_client, nb['sectionsUrl'])
print(f' Got {len(sections)} sections.')
for sec in sections:
sec_name = sec["displayName"]
print(f' Opening section {sec_name}')
pages = get_json(graph_client, sec['pagesUrl'] + '?pagelevel=true')
print(f' Got {len(pages)} pages.')
pages = sorted([(page['order'], page) for page in pages])
level_dirs = [None]*4
for order, page in pages:
level = page['level']
page_title = f'{order}_{page["title"]}'
print(f' Opening page {page_title}')
if level == 0:
out_dir = output_path / nb_name / sec_name / page_title
else:
out_dir = level_dirs[level - 1] / page_title
level_dirs[level] = out_dir
out_html = out_dir / 'main.html'
if out_html.exists():
print(' HTML file already exists; skipping this page')
continue
out_dir.mkdir(parents=True, exist_ok=True)
response = get(graph_client, page['contentUrl'])
if response is not None:
content = response.text
print(f' Got content of length {len(content)}')
content = download_attachments(graph_client, content, out_dir)
with open(out_html, "w") as f:
f.write(content)
print("Done!")
return flask.render_template_string('<html><head><title>Done</title></head><body><p1><b>Done</b></p1></body></html>')
if __name__ == "__main__":
app.run()
@minghao51
Copy link

This is awesome, would you be keen to transfer this into a repository so people can contribute?

@danmou
Copy link
Author

danmou commented Jan 9, 2020

@minghao51 Thanks! Sure, here you go: https://github.com/Danmou/onenote_export

I added filename sanitizing, but PRs with further improvements are welcome!

@Underquant
Copy link

Good job! Thanks for sharing your idea!

@sspaeti
Copy link

sspaeti commented Feb 7, 2021

This is awesome and works like a charm. I had a lot of recursive section groups. Therefore I added this to the export and had to handle one error with broken images. If of interest see my fork: https://gist.github.com/sspaeti/8daab59a80adc664fa8cbcba707ea21d/revisions

The changes are mainly recursion_section_group(). I just formated that's why it looks like a lot of differences.

@danmou
Copy link
Author

danmou commented Feb 8, 2021

@sspaeti I'm not sure what you mean by recursive section groups, but if you think your change is generally useful and backwards compatible, feel free to submit a PR to the repo I linked above

@sspaeti
Copy link

sspaeti commented Feb 13, 2021

I meant Section Groups were not handled in your example. And even less if you have a section group in a section group (this can be nested quite heavily, what I did ;-). That's what I added to your script in the example above, just in case someone else has also lots of these.

@Rakeshpro9040
Copy link

This is awesome, now I can start my own free blogging channel, Thanks a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment