import advertools as adv
import pandas as pd
pd.options.display.max_columns = None
homepage = 'https://example.com/'  # <--- change this
domain = 'example.com'  # <--- and this
adv.crawl(homepage, 'output_file.jl', follow_links=True,
          custom_settings={'LOG_FILE': 'output_file.log'})
import plotly.express as px
import pandas as pd
import requests
dflist = []
for i in range(1, 6):
    resp = requests.get(f'https://companiesmarketcap.com/page/{i}/')
    df = pd.read_html(resp.text)[0]
    dflist.append(df)
import plotly.express as px
def treemap(traffic_df, metric='Users', path=['Medium', 'Source']):
    """Make an interactive treemap for two data dimensions/levels.
    Parameters:
    -----------
    traffic_df : A DataFrame containing two dimensions, and one or more metrics
import advertools as adv
import pandas as pd
import plotly
import plotly.graph_objects as go
pd.options.display.max_columns = None
cx = 'YOUR_CSE_ID'
key = 'YOUR_GOOGLE_DEV_KEY'
import advertools as adv
import pandas as pd
pd.options.display.max_columns = None
# Copied from https://en.wikipedia.org/wiki/List_of_cancer_types
cancers = {
    "Chondrosarcoma": "Bone and muscle sarcoma",
    "Ewing's sarcoma": "Bone and muscle sarcoma",
import plotly.graph_objects as go
import pandas as pd
def serp_heatmap(df, num_domains=10, select_domain=None):
    df = df.rename(columns={'domain': 'displayLink',
                            'searchTerms': 'keyword'})
    top_domains = df['displayLink'].value_counts()[:num_domains].index.tolist()
    top_df = df[df['displayLink'].isin(top_domains) & df['displayLink'].ne('')]
from urllib.parse import urlsplit
import advertools as adv
sites = [
    'https://www.who.int',
    'https://www.nytimes.com',
    'https://www.washingtonpost.com',
]
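The preview above imports `urlsplit` next to the `sites` list but stops before showing how it is used. One plausible use, sketched here as an assumption rather than as the gist's actual code, is pulling the host name out of each URL. The helper name `site_domain` is hypothetical.

```python
from urllib.parse import urlsplit

sites = [
    'https://www.who.int',
    'https://www.nytimes.com',
    'https://www.washingtonpost.com',
]


def site_domain(url):
    """Return the network location (host) part of a URL.

    Hypothetical helper for illustration; not part of the original gist.
    """
    return urlsplit(url).netloc


# One host name per site, in the same order as the input list
domains = [site_domain(site) for site in sites]
print(domains)
```

Host names obtained this way could serve as file-name prefixes or labels when the same task is run once per site.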
from unicodedata import lookup
def flag(cc):
    l1 = lookup(f'REGIONAL INDICATOR SYMBOL LETTER {cc[0]}')
    l2 = lookup(f'REGIONAL INDICATOR SYMBOL LETTER {cc[1]}')
    return l1 + l2
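A short usage sketch for the `flag` function above. It repeats the function so the block runs on its own, then builds flags for a few two-letter country codes. The sample codes are chosen only for illustration. Whether the result displays as a flag or as two boxed letters depends on the font and platform.

```python
from unicodedata import lookup


def flag(cc):
    """Build a flag emoji from a two-letter country code (same logic as above)."""
    l1 = lookup(f'REGIONAL INDICATOR SYMBOL LETTER {cc[0]}')
    l2 = lookup(f'REGIONAL INDICATOR SYMBOL LETTER {cc[1]}')
    return l1 + l2


# Each flag is a pair of regional indicator symbols, one per letter
for code in ['US', 'JP', 'BR']:
    print(code, flag(code))
```

A code containing a character other than a letter A to Z makes `lookup` raise `KeyError`, so validating input first may be worthwhile when the codes come from external data.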
import datetime
import advertools as adv
import pandas as pd
stopwords = ['to', 'of', 'the', 'in', 'for', 'and', 'on', 'a', 'as', 'with',
             'from', 'over', 'is', 'at', '—', '-', 'be', '2022', '–', 'it', 'by',
             'we', 'why', 'but', 'my', 'how', 'not', 'an', 'are', 'no', 'go',
             'your', 'up', 'his']
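The preview ends before the `stopwords` list is used. As a rough illustration of what such a list is typically for, the sketch below counts words in a couple of made-up headlines while skipping stop words. It uses only the standard library and is not the gist's own method. The `count_words` helper, the shortened list, and the sample headlines are all assumptions.

```python
from collections import Counter

# Shortened copy of the list above, to keep the example small
stopwords = ['to', 'of', 'the', 'in', 'for', 'and', 'on', 'a', 'as', 'with']

# Made-up headlines, for illustration only
headlines = [
    'The future of search',
    'Search in the age of AI',
]


def count_words(texts, rm_words):
    """Count lower-cased, whitespace-separated words, skipping rm_words."""
    skip = set(rm_words)
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            if word not in skip:
                counts[word] += 1
    return counts


counts = count_words(headlines, stopwords)
print(counts.most_common(3))
```

Splitting on whitespace keeps punctuation attached to words, so real headline data would likely need a proper tokenizer.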
# pip install advertools==0.14.0a7
# get the robots.txt file, save to csv:
advertools robots --url https://www.economist.com/robots.txt econ_robots.csv
# find lines that start with sitemap, save to variable sitemap_url
sitemap_url=$(grep ^sitemap -i econ_robots.csv | cut -d , -f 2)
# get the sitemap index file without downloading the sub-sitemaps (not recursive)
advertools sitemaps $sitemap_url econ_sitemap.csv --recursive 0
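The `grep` step above picks out the sitemap entries from the saved robots data. The same idea, applied to raw robots.txt text instead of the CSV, might look like the Python sketch below. It is an assumption-laden illustration: the `sitemap_urls` helper is hypothetical, the sample text is made up, and it does not reproduce the layout of `econ_robots.csv`.

```python
def sitemap_urls(robots_text):
    """Return the values of all 'Sitemap:' lines in raw robots.txt text.

    Hypothetical helper; matches the field name case-insensitively,
    like `grep -i` in the shell version.
    """
    urls = []
    for line in robots_text.splitlines():
        # Split on the first colon only, so the URL's own "://" stays intact
        field, sep, value = line.partition(':')
        if sep and field.strip().lower() == 'sitemap':
            urls.append(value.strip())
    return urls


# Made-up robots.txt content, for illustration only
sample = """User-agent: *
Disallow: /search
Sitemap: https://example.com/sitemap.xml
"""

print(sitemap_urls(sample))
```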