"""
This script scrapes rudaw.net in two stages:
1. `python rudaw.py links` collects links for each category
2. `python rudaw.py content` collects the content of each link and writes it to rudaw.csv
"""
import sys
import os
import csv
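The two stages described in the docstring could be dispatched from the command line roughly as follows. This is a minimal sketch, not the script's actual code: `collect_links`, `collect_content`, `STAGES`, and `main` are hypothetical names, and the stage bodies are placeholders that only report which stage ran.

```python
def collect_links():
    # Hypothetical placeholder for stage 1 (collect links for each category).
    return "links"


def collect_content():
    # Hypothetical placeholder for stage 2 (collect content, write rudaw.csv).
    return "content"


# Map the command-line argument to the stage it selects.
STAGES = {"links": collect_links, "content": collect_content}


def main(argv):
    # Expect exactly one argument naming the stage, e.g. `python rudaw.py links`.
    if len(argv) != 2 or argv[1] not in STAGES:
        raise SystemExit("usage: python rudaw.py [links|content]")
    return STAGES[argv[1]]()
```

Calling `main(sys.argv)` at the bottom of the script would wire this up; an unknown or missing stage name exits with a usage message instead of failing silently.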
import re

def resolve_ae(text):
    """
    This function takes a text input in Central Kurdish (Sorani) script and performs a series of character replacements
    to standardize variations in the script. Specifically, it addresses cases where the character 'ە' (Arabic letter
    AE) may be used in different contexts.
    """
    # First replace all occurrences of 'ه' with 'ە'
    text = re.sub("ه", "ە", text)
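The first replacement step shown above can be checked in isolation. The sketch below is an illustration under stated assumptions, not the remainder of the author's function: `replace_heh_with_ae` is a hypothetical name, and it covers only the single substitution visible in the excerpt. It spells the two look-alike letters as explicit code points (U+0647 ARABIC LETTER HEH and U+06D5 ARABIC LETTER AE) so they cannot be confused in an editor.

```python
import re

HEH = "\u0647"  # ARABIC LETTER HEH
AE = "\u06d5"   # ARABIC LETTER AE


def replace_heh_with_ae(text):
    # Hypothetical helper: only the first replacement step shown in the excerpt.
    return re.sub(HEH, AE, text)


# A sample mixing both letters; after the call every HEH has become AE.
sample = HEH + AE + HEH
normalized = replace_heh_with_ae(sample)

# Print code points rather than glyphs, since the glyphs can look identical.
print([f"U+{ord(ch):04X}" for ch in normalized])
```

Because this step rewrites every heh, a complete normalizer would presumably need further steps to handle positions where heh is the intended letter; those steps are not visible in the excerpt and are not reproduced here.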
import requests
from bs4 import BeautifulSoup
import time

url = 'https://www.kurdfonts.com/browse/categories'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')