Created September 5, 2022 04:30
Return all unique book identifiers from a list of raw links
import re

# return all unique book identifiers from a list of raw links
def get_book_identifiers(links):
    # define the url pattern we are looking for, capturing the numeric id
    pattern = re.compile(r'/ebooks/([0-9]+)')
    # collect the ids in a set so that only unique ids are kept
    books = set()
    for link in links:
        # check whether the link starts with the pattern
        match = pattern.match(link)
        if not match:
            continue
        # extract the book id nnn from /ebooks/nnn, ignoring any trailing text
        books.add(match.group(1))
    return books
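A short usage sketch may help show what the function returns. The sample links below are hypothetical and made up for illustration. The function is repeated in compact form so the example runs on its own, and it assumes the id is taken from a capture group rather than by slicing the string.

```python
import re

def get_book_identifiers(links):
    # keep the numeric id of every link that starts with /ebooks/<digits>
    pattern = re.compile(r'/ebooks/([0-9]+)')
    books = set()
    for link in links:
        match = pattern.match(link)
        if match:
            books.add(match.group(1))
    return books

# hypothetical input: duplicates and non-book links are mixed in
sample_links = [
    '/ebooks/1342',
    '/ebooks/84',
    '/ebooks/1342',      # duplicate, stored only once
    '/about/',           # does not match the pattern
    '/ebooks/search/',   # no numeric id after /ebooks/
]

ids = get_book_identifiers(sample_links)
print(sorted(ids, key=int))  # prints ['84', '1342']
```

The ids come back as strings in an unordered set, so sort them (for example with `key=int`) if a stable order matters.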