@fwSara95h
Last active August 17, 2024 07:55
A helper function for extracting a JavaScript variable from a BeautifulSoup object [Example at https://stackoverflow.com/a/76366675/6146136 ]

Extract JavaScript variables from BeautifulSoup objects

INPUTS

  • inpX: must be a bs4 document/tag/ResultSet or a string or a list of strings
    • ( target variable must be JSON and separated from other variables by ; )
  • varName: name of the target variable
    • ( only the first variable found with the specified name will be returned )
  • selector: a CSS selector for searching the bs4 document/tag for target script
    • ( if inpX is a script-tag/ResultSet/string/list then selector doesn't matter )
  • prepFn: should be a univariate function that takes a string and returns a string
    • ( for modifying the script string before searching for and parsing variable )
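
As an illustration of prepFn, here is a minimal sketch of a preprocessing function that removes var/let/const keywords, so that a statement such as `var x = {...}` is left with the bare name `x` before the `=`. The helper name strip_declaration_keywords is hypothetical and not part of this gist, and the regex is deliberately naive: it also removes those keywords if they appear inside string values.

```python
import re

def strip_declaration_keywords(script_text):
    """Hypothetical prepFn: drop var/let/const keywords from a script string.

    Naive on purpose: any whole-word var/let/const followed by whitespace is
    removed, including occurrences inside quoted string values.
    """
    return re.sub(r'\b(?:var|let|const)\s+', '', script_text)

cleaned = strip_declaration_keywords('var y = 8; const x = {"a": 1};')
print(cleaned)

# Intended use with the function from this gist (assumes it is in scope):
# jsonload_from_script('var y = 8; const x = {"a": 1};', 'x',
#                      prepFn=strip_declaration_keywords)
```

Any other single-argument function that takes a string and returns a string can be passed the same way.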

EXAMPLES:

  • jsonload_from_script(BeautifulSoup('<script>y=8</script>'), 'y') --> 8
  • jsonload_from_script(['y=8','x=7'], 'x') --> 7
  • jsonload_from_script('y=8;x=7', 'x') --> 7
  • jsonload_from_script('y=8;x={"a":1};w="lorem";', 'x') --> {'a':1}
  • jsonload_from_script('y=8;x={"a":1};w="lorem";', 'w') --> 'lorem'
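import json

def get_jsScriptVal(jSoup, valDecl, isJson=True):
    # Match only <script> tags whose text mentions the declaration
    script_finder = lambda s: s and valDecl in s
    # find_all (not find) so the loop visits each matching script tag
    for sc in jSoup.find_all('script', string=script_finder):
        for st in sc.string.split(';'):
            # Split each statement into its left- and right-hand sides
            ls, rs, *_ = [s.strip() for s in (st.split('=', 1) + [''])]
            if ls == valDecl and rs:
                return json.loads(rs) if isJson else rs
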
## For extracting a [JSON] variable by name from JavaScript string or bs4 script-tag ##
import json

def jsonload_from_script(inpX, varName, selector='script', prepFn=lambda x:x):
    # bs4 script tag -> use as is; other bs4 document/tag -> select script tags
    try: sList = [inpX] if inpX.name=='script' else inpX.select(selector)
    # anything else (string, list, ResultSet) -> treat as the script(s) to search
    except Exception: sList = inpX if isinstance(inpX, list) else [inpX]
    for s in sList:
        s = getattr(s, 'string', s)
        if not (isinstance(s,str) and s.strip()): continue
        sections = prepFn(s.strip()).split(';')
        for i, section in enumerate(sections):
            name, val = ['', *section.split('=', 1)][-2:]
            if '=' in section and name.strip()==varName:
                sections = [val]+sections[i+1:]
                try: return json.loads(val)
                except ValueError: break  # value may contain ';' -> retry below
        else: continue  # variable not found in this script
        # Re-join the sections after the match, dropping the last one per attempt
        while len(sections) > 1:
            try: return json.loads(';'.join(sections))
            except ValueError: sections = sections[:-1]
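
The re-joining loop above exists because a JSON value may itself contain `;`. An alternative way to handle that case, shown here only as a sketch and not as the approach this gist uses, is the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows it. The function name load_js_var and its regex are assumptions for illustration. Unlike jsonload_from_script, it does not require the name to be the whole left-hand side of a statement, and it may also match text inside string literals.

```python
import json
import re

def load_js_var(script_text, var_name):
    """Return the first JSON value assigned to var_name, or None if none parses.

    Hypothetical alternative to splitting on ';': locate `var_name =` and let
    raw_decode read exactly one JSON value from that position.
    """
    decoder = json.JSONDecoder()
    # `var_name =` not preceded by a word character, '.' or '$', and not `==`
    pattern = re.compile(r'(?<![\w.$])' + re.escape(var_name) + r'\s*=(?!=)\s*')
    for match in pattern.finditer(script_text):
        try:
            # raw_decode returns (value, end_index); trailing text is ignored
            value, _end = decoder.raw_decode(script_text, match.end())
            return value
        except json.JSONDecodeError:
            continue  # not JSON at this position; try the next match
    return None

script = 'y=8;x={"a":"1;2"};w="lorem";'
print(load_js_var(script, 'x'))
print(load_js_var(script, 'w'))
```

Since the value is read directly from the text after the `=`, semicolons inside strings or nested objects do not need special handling.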