@mara004
Last active July 31, 2024 20:53
Parse pdfbox versions
# SPDX-FileCopyrightText: 2024 geisserml <[email protected]>
# SPDX-License-Identifier: Apache-2.0

import re
from datetime import datetime
from urllib.request import urlopen
from packaging.version import Version as PypaVersion

PB_RELEASE_URL = "https://archive.apache.org/dist/pdfbox/"
PB_DISTS_RE = r'<a href="([\d\.]+.+?)/">.+</a>\s+([\d\-]+ [\d:]+)'
PB_DATE_FMT = r"%Y-%m-%d %H:%M"


class PdfboxVersion(PypaVersion):

    def __init__(self, version, date):
        super().__init__(version)
        self.date = date
        # prioritize date over pre-release tags because pdfbox uses them inconsistently, and pre-releases will not get backports
        # (indices 0, 1 are epoch and release, the rest follows)
        self._key = (*self._key[:2], date, *self._key[2:])

    def __repr__(self):
        return f"PdfboxVersion({super().__str__()!r}, {self.date!r})"

    def __str__(self):
        return f"{super().__str__():<10} {self.date}"


content = urlopen(PB_RELEASE_URL).read().decode("utf-8")
results = [
    PdfboxVersion(m.group(1), datetime.strptime(m.group(2), PB_DATE_FMT))
    for m in re.finditer(PB_DISTS_RE, content)
]
results.sort()

if __name__ == "__main__":
    print(*results, sep="\n")

mara004 commented Jul 27, 2024

Turns out sorting by major version + date alone is not sufficient.
If multiple releases were made on the same day and the patch number goes from one to two digits, the ordering goes amiss:

(<Version('1.8.10')>, datetime.datetime(2015, 10, 14, 16, 26))
(<Version('1.8.7')>, datetime.datetime(2015, 10, 14, 16, 26))
(<Version('1.8.8')>, datetime.datetime(2015, 10, 14, 16, 26))
(<Version('1.8.9')>, datetime.datetime(2015, 10, 14, 16, 26))

On the other hand, sorting by version alone also goes wrong where pre-release tags are involved (here, rc1 was published before the alphas and the beta):

(<Version('3.0.0a2')>, datetime.datetime(2021, 9, 11, 11, 54))
(<Version('3.0.0a3')>, datetime.datetime(2022, 6, 17, 12, 27))
(<Version('3.0.0b1')>, datetime.datetime(2023, 7, 14, 6, 7))
(<Version('3.0.0rc1')>, datetime.datetime(2021, 4, 1, 21, 18))

So I suppose we need a combination of both (something like: pre-sort by version, then sort by date).
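
A minimal sketch of that two-pass idea, using only the sample entries quoted above. The helper names (release_tuple, two_pass_sort) are made up for illustration, and the numeric parsing is a simplification that ignores pre-release suffixes, so this is not necessarily how the gist code did it at the time. It relies on Python's sort being stable: the second pass keeps the version order from the first pass wherever major version and date are equal.

```python
import re
from datetime import datetime


def release_tuple(version):
    # Hypothetical helper: numeric release part only, e.g. "3.0.0rc1" -> (3, 0, 0).
    # Pre-release suffixes are deliberately ignored in this sketch.
    m = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in m.group(0).split("."))


samples = [
    ("1.8.10", datetime(2015, 10, 14, 16, 26)),
    ("1.8.7", datetime(2015, 10, 14, 16, 26)),
    ("1.8.8", datetime(2015, 10, 14, 16, 26)),
    ("1.8.9", datetime(2015, 10, 14, 16, 26)),
    ("3.0.0a2", datetime(2021, 9, 11, 11, 54)),
    ("3.0.0a3", datetime(2022, 6, 17, 12, 27)),
    ("3.0.0b1", datetime(2023, 7, 14, 6, 7)),
    ("3.0.0rc1", datetime(2021, 4, 1, 21, 18)),
]


def two_pass_sort(entries):
    # Pass 1: order by numeric release, so 1.8.9 comes before 1.8.10.
    by_version = sorted(entries, key=lambda e: release_tuple(e[0]))
    # Pass 2: stable sort by (major, date); same-day ties keep the pass 1 order.
    return sorted(by_version, key=lambda e: (release_tuple(e[0])[0], e[1]))


ordered = [version for version, _ in two_pass_sort(samples)]
print(*ordered, sep="\n")
```

With these samples, the same-day 1.8.x releases come out in numeric order, and the 3.0.0 pre-releases come out in publication order.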


mara004 commented Jul 27, 2024

I updated the code above; should work now.

Output as of 2024-07-27, grouped by major version:
1
2010-03-29 10:12:00  1.1.0
2010-06-28 14:18:00  1.2.0
2010-07-09 10:13:00  1.2.1
2010-10-25 10:49:00  1.3.1
2010-12-20 10:05:00  1.4.0
2011-03-03 08:50:00  1.5.0
2011-07-01 19:20:00  1.6.0
2012-05-28 20:13:00  1.7.0
2012-07-24 20:53:00  1.7.1
2013-03-22 22:24:00  1.8.0
2013-04-10 15:43:00  1.8.1
2013-06-01 21:57:00  1.8.2
2013-11-28 20:30:00  1.8.3
2014-01-30 18:40:00  1.8.4
2014-05-01 18:59:00  1.8.5
2014-06-22 13:40:00  1.8.6
2015-10-14 16:26:00  1.8.7
2015-10-14 16:26:00  1.8.8
2015-10-14 16:26:00  1.8.9
2015-10-14 16:26:00  1.8.10
2016-01-17 21:55:00  1.8.11
2016-04-25 17:02:00  1.8.12
2017-10-04 11:08:00  1.8.13
2018-05-04 15:48:00  1.8.14
2018-06-28 19:24:00  1.8.15
2022-06-17 12:27:00  1.8.16
2022-09-15 17:13:00  1.8.17

2
2015-10-18 20:57:00  2.0.0rc1
2015-11-21 18:57:00  2.0.0rc2
2016-01-14 20:55:00  2.0.0rc3
2016-03-18 12:02:00  2.0.0
2016-04-25 17:23:00  2.0.1
2016-06-09 17:51:00  2.0.2
2016-09-17 09:15:00  2.0.3
2016-12-15 18:02:00  2.0.4
2017-06-26 17:52:00  2.0.5
2017-06-26 17:52:00  2.0.6
2017-10-04 11:08:00  2.0.7
2017-11-02 20:53:00  2.0.8
2018-05-04 15:48:00  2.0.9
2018-06-21 20:04:00  2.0.10
2018-06-28 19:38:00  2.0.11
2018-10-04 18:43:00  2.0.12
2018-11-30 22:31:00  2.0.13
2019-02-28 17:28:00  2.0.14
2019-04-11 15:36:00  2.0.15
2019-06-27 18:20:00  2.0.16
2019-09-20 18:36:00  2.0.17
2019-12-23 18:33:00  2.0.18
2020-02-23 17:50:00  2.0.19
2020-06-07 16:08:00  2.0.20
2020-11-05 18:56:00  2.0.21
2020-12-19 18:33:00  2.0.22
2021-03-18 21:25:00  2.0.23
2021-06-10 17:57:00  2.0.24
2021-12-16 20:50:00  2.0.25
2022-06-17 12:27:00  2.0.26
2022-09-29 15:52:00  2.0.27
2023-04-13 14:37:00  2.0.28
2023-07-01 17:00:00  2.0.29
2023-11-05 11:10:00  2.0.30
2024-03-24 18:04:00  2.0.31
2024-07-24 15:41:00  2.0.32

3
2021-04-01 21:18:00  3.0.0rc1
2021-09-11 11:54:00  3.0.0a2
2022-06-17 12:27:00  3.0.0a3
2023-07-14 06:07:00  3.0.0b1
2023-08-18 04:31:00  3.0.0
2023-11-30 18:47:00  3.0.1
2024-03-14 20:33:00  3.0.2

(datetime.datetime(2024, 3, 14, 20, 33), <Version('3.0.2')>)


mara004 commented Jul 27, 2024

Another problem: If a backport were made to a minor release series, like so,

a.b.0  2024-07-01
a.c.0  2024-07-02
a.b.1  2024-07-03

then the above would produce the wrong order.
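
To make the backport case concrete, here is a small sketch with made-up numbers substituted for a, b, c (a=1, b=2, c=3). The versions and dates are hypothetical, not actual pdfbox releases, and versions are written as plain tuples to keep the example free of dependencies.

```python
from datetime import date

# Hypothetical releases: 1.2.1 is a backport published after 1.3.0.
releases = [
    ((1, 2, 0), date(2024, 7, 1)),
    ((1, 3, 0), date(2024, 7, 2)),
    ((1, 2, 1), date(2024, 7, 3)),
]

# Major version + date: the backport lands after the newer minor series.
by_major_and_date = [v for v, _ in sorted(releases, key=lambda r: (r[0][0], r[1]))]

# Version alone: the backport stays within its own series.
by_version = [v for v, _ in sorted(releases, key=lambda r: r[0])]

print(by_major_and_date)
print(by_version)
```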

So sorting just by version might be better after all; we'd just need to resolve the v3 RC/alpha situation somehow:

[3.0.0-RC1/]     2021-04-01 21:18
[3.0.0-alpha2/]  2021-09-11 11:54
[3.0.0-alpha3/]  2022-06-17 12:27


mara004 commented Jul 27, 2024

Updated the code yet again, this time by inheriting from packaging's Version class and hooking the date into the comparison key.
While that addresses the above issue, it might be a bit wonky, because technically _key is private API...
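
For comparison, the same ordering idea (release first, then date, with pre-release tags effectively ignored) can be sketched without touching packaging's internals. This is a simplified stand-in, not packaging's actual key: the sort_key helper is hypothetical, it only looks at the leading numeric part of the directory name, and it does not handle epochs, post/dev releases or local versions. The 1.x entries are made-up backport examples; the 3.0.0 dates are the ones listed in this thread.

```python
import re
from datetime import datetime


def sort_key(entry):
    # Hypothetical key: (numeric release, date). The date breaks ties between
    # entries sharing a release number, e.g. 3.0.0-RC1 vs 3.0.0-alpha2.
    version, date = entry
    numeric = re.match(r"\d+(?:\.\d+)*", version).group(0)
    release = tuple(int(part) for part in numeric.split("."))
    return (release, date)


entries = [
    ("1.2.0", datetime(2024, 7, 1)),           # made-up backport scenario
    ("1.3.0", datetime(2024, 7, 2)),
    ("1.2.1", datetime(2024, 7, 3)),
    ("3.0.0", datetime(2023, 8, 18, 4, 31)),   # dates as listed in this thread
    ("3.0.0-alpha2", datetime(2021, 9, 11, 11, 54)),
    ("3.0.0-RC1", datetime(2021, 4, 1, 21, 18)),
    ("3.0.0-alpha3", datetime(2022, 6, 17, 12, 27)),
]

ordered = [version for version, _ in sorted(entries, key=sort_key)]
print(*ordered, sep="\n")
```

One caveat of putting the date ahead of the pre-release tag, in this sketch as in the subclass above: a pre-release published after the final release with the same number would sort after it.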
