<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
  </url>
</urlset>
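A sitemap like this does not have to be read by hand; advertools can load it into a DataFrame. A minimal sketch, assuming the sitemap lives at the hypothetical URL below:

import advertools as adv

# One row per <url> element; typical columns include loc and lastmod,
# plus the sitemap each URL came from.
sitemap_df = adv.sitemap_to_df('http://www.example.com/sitemap.xml')
print(sitemap_df.filter(['loc', 'lastmod']).head())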
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:64.0) Gecko/20100101 Firefox/64.0',
    'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0',
    'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
]
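One common use of such a list is rotating user agents across requests. A minimal sketch with the standard library and requests (the target URL is a placeholder):

import random
import requests

# Pick a random user agent from the list above for each request.
headers = {'User-Agent': random.choice(user_agents)}
response = requests.get('https://www.example.com/', headers=headers)
print(response.status_code, headers['User-Agent'])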
from urllib.parse import urlsplit, parse_qs
import re


def crawl_or_not(url,
                 exclude_url_params=None,
                 include_url_params=None,
                 include_url_pattern=None,
                 exclude_url_pattern=None):
    """Check if ``url`` will be crawled or not given the supplied conditions."""
import ipaddress
import requests
import pandas as pd


def bot_ip_addresses():
    bots_urls = {
        'google': 'https://developers.google.com/search/apis/ipranges/googlebot.json',
        'bing': 'https://www.bing.com/toolbox/bingbot.json'
    }
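The rest of bot_ip_addresses is not shown. An assumed sketch of the idea, using the imports above (the "prefixes" / "ipv4Prefix" / "ipv6Prefix" JSON layout is an assumption about those files):

def bot_ip_networks():
    # Illustrative sketch only, not the original implementation.
    bots_urls = {
        'google': 'https://developers.google.com/search/apis/ipranges/googlebot.json',
        'bing': 'https://www.bing.com/toolbox/bingbot.json',
    }
    networks = {}
    for bot, url in bots_urls.items():
        prefixes = requests.get(url).json().get('prefixes', [])
        networks[bot] = [ipaddress.ip_network(p.get('ipv4Prefix') or p.get('ipv6Prefix'))
                         for p in prefixes]
    return networks

# A log-file IP can then be checked against the networks, e.g.:
# any(ipaddress.ip_address('66.249.66.1') in net for net in bot_ip_networks()['google'])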
import networkx as nx
import pandas as pd


def score_links(links_file, domain):
    """Score a network of links based on their importance and centrality.

    links_file: Path to the file having the links (needs "Source" and
        "Destination" columns), e.g. ScreamingFrog's outlinks file.
    domain: Filter all links, making sure they all point to the domain you want.
    """
# !pip install --upgrade transformers plotly pandas
import plotly.graph_objects as go
import pandas as pd
pd.options.display.max_columns = None
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')
results = []
cars = ['mercedes', 'audi', 'bmw', 'volkswagen', 'ford', 'toyota']
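Each brand can then be dropped into a masked sentence and the model's top predictions collected; a minimal sketch, with the sentence template below as an assumption:

for car in cars:
    # bert-base-uncased uses [MASK] as its mask token.
    for pred in unmasker(f'{car} is a [MASK] car.'):
        results.append({'brand': car,
                        'prediction': pred['token_str'],
                        'score': pred['score']})

results_df = pd.DataFrame(results)
print(results_df.head())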
# pip install advertools==0.14.0a7
# get the robots.txt file, save to csv:
advertools robots --url https://www.economist.com/robots.txt econ_robots.csv
# find lines that start with sitemap, save to variable sitemap_url
sitemap_url=$(grep ^sitemap -i econ_robots.csv | cut -d , -f 2)
# get the sitemap index file without downloading the sub-sitemaps (not recursive)
advertools sitemaps $sitemap_url econ_sitemap.csv --recursive 0
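The resulting CSV can then be inspected in Python; a short sketch assuming the file name used above (the exact columns advertools writes are not shown here):

import pandas as pd

# One row per sub-sitemap listed in the sitemap index.
econ_sitemap = pd.read_csv('econ_sitemap.csv')
print(econ_sitemap.head())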
import datetime

import advertools as adv
import pandas as pd

stopwords = ['to', 'of', 'the', 'in', 'for', 'and', 'on', 'a', 'as', 'with',
             'from', 'over', 'is', 'at', '—', '-', 'be', '2022', '–', 'it', 'by',
             'we', 'why', 'but', 'my', 'how', 'not', 'an', 'are', 'no', 'go',
             'your', 'up', 'his']
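A list like this is typically removed from the text before counting word frequencies; a minimal standard-library sketch with hypothetical headlines standing in for the real dataset:

from collections import Counter

# Hypothetical headlines for illustration only.
titles = ['How to audit a site in 2022', 'Why crawling is not the same as indexing']
words = [word for title in titles
         for word in title.lower().split()
         if word not in stopwords]
print(Counter(words).most_common(10))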
from unicodedata import lookup


def flag(cc):
    l1 = lookup(f'REGIONAL INDICATOR SYMBOL LETTER {cc[0]}')
    l2 = lookup(f'REGIONAL INDICATOR SYMBOL LETTER {cc[1]}')
    return l1 + l2
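For example, since the Unicode character names use uppercase letters, the country code should be passed in uppercase:

print(flag('US'), flag('GB'), flag('DE'))  # 🇺🇸 🇬🇧 🇩🇪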
from urllib.parse import urlsplit

import advertools as adv

sites = [
    'https://www.who.int',
    'https://www.nytimes.com',
    'https://www.washingtonpost.com',
]
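The snippet stops at the site list. One plausible continuation, given the imports, is fetching each site's robots.txt into a single DataFrame labelled by hostname; this next step is an assumption and not shown in the original:

import pandas as pd

robots_dfs = []
for site in sites:
    # One DataFrame of robots.txt directives per site.
    df = adv.robotstxt_to_df(site + '/robots.txt')
    df['hostname'] = urlsplit(site).netloc
    robots_dfs.append(df)

all_robots = pd.concat(robots_dfs, ignore_index=True)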