Andreas van Cranenburgh andreasvc

Python exercises

Define a function max() that takes two numbers as arguments and returns the larger of them. Use the if-then-else construct available in Python. (It is true that Python has the max() function built in, but writing it yourself is nevertheless a good exercise.)
Define a function max_of_three() that takes three numbers as arguments and returns the largest of them.
Define a function that computes the length of a given list or string. (It is true that Python has the len() function built in, but writing it yourself is nevertheless a good exercise.)
Write a function that takes a character (i.e. a string of length 1) and returns True if it is a vowel, False otherwise.
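One possible set of solutions to the four exercises above is sketched here. The names max_of_two(), length() and is_vowel() are assumptions (the first is chosen to avoid shadowing the built-in max()), and treating only a, e, i, o, u as vowels is also an assumption.

```python
def max_of_two(a, b):
    """Return the larger of two numbers using if/else."""
    if a > b:
        return a
    else:
        return b


def max_of_three(a, b, c):
    """Return the largest of three numbers by reusing max_of_two()."""
    return max_of_two(max_of_two(a, b), c)


def length(seq):
    """Count the items of a list or string without calling len()."""
    count = 0
    for _ in seq:
        count += 1
    return count


def is_vowel(char):
    """Return True if char is a single vowel (assumed: a, e, i, o, u)."""
    return len(char) == 1 and char.lower() in 'aeiou'
```

For instance, max_of_three(1, 9, 4) returns 9 and is_vowel('b') returns False.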

Write a function char_freq() that takes a string and builds a frequency listing of the characters contained in it. Represent the frequency listing as a Python dictionary. Try it with something like char_freq("abbabcbdbabdbdbabababcbcbab").
Write a function char_freq_table() that takes a file name as argument, builds a frequency listing of the characters contained in the file, and prints a sorted and nicely formatted character frequency table to the screen.
The third person singular verb form in English is distinguished by the suffix -s, which is added to the stem of the infinitive form: run -> runs. A simple set of rules can be given as follows:

a. If the verb ends in y, remove it and add ies.
b. If the verb ends in o, ch, s, sh, x or z, add es.
c. By default, just add s.
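The character frequency exercises and the suffix rules above could be approached along these lines. This is a minimal sketch: the function name third_person_singular() is an assumption, as are the UTF-8 encoding and the choice to sort the table by descending count.

```python
def char_freq(text):
    """Build a dictionary mapping each character to its count."""
    freq = {}
    for char in text:
        freq[char] = freq.get(char, 0) + 1
    return freq


def char_freq_table(filename):
    """Print a character frequency table for a file.

    Assumptions: the file is UTF-8 and the table is sorted by
    descending count, then by character."""
    with open(filename, encoding='utf8') as infile:
        freq = char_freq(infile.read())
    for char, count in sorted(
            freq.items(), key=lambda item: (-item[1], item[0])):
        # repr() keeps whitespace characters such as '\n' visible.
        print(f'{char!r:>6} {count:>8}')


def third_person_singular(verb):
    """Apply rules a, b and c literally (hypothetical function name)."""
    if verb.endswith('y'):
        return verb[:-1] + 'ies'  # rule a
    if verb.endswith(('o', 'ch', 's', 'sh', 'x', 'z')):
        return verb + 'es'  # rule b
    return verb + 's'  # rule c
```

Because the rules are applied literally, a verb such as "play" would come out as "plaies"; the rules as stated do not distinguish a vowel before the final y.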

	"""Convert XML output of Stanford CoreNLP to CoNLL 2012 format.

	$ ./corenlp.sh -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref \
	-output.printSingletonEntities true \
	-file /tmp/example.txt
	$ python3 corenlpxmltoconll2012.py example.txt.xml > example.conll
	"""
	import re
	import sys
	from lxml import etree

	"""Prepare https://benjaminvdb.github.io/110kDBRD/ for use with fastText.

	Divide train set into 90% train and 10% dev, balance positive and negative
	reviews, and shuffle. Write result in fastText format."""
	import os
	import re
	import random
	import glob
	from syntok.tokenizer import Tokenizer
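The steps named in the docstring above (balance, shuffle, 90/10 split, fastText output) might look roughly like the sketch below. The original script body is not shown, so the function names, the label names pos and neg, the downsampling strategy and the order of the steps are all assumptions; only the __label__ prefix is the standard fastText convention for supervised data.

```python
import random


def to_fasttext(label, text):
    """Format one example as a fastText supervised training line."""
    return f'__label__{label} {text}'


def balance_shuffle_split(pos, neg, dev_fraction=0.1, seed=1):
    """Return (train, dev) lists of (label, text) pairs.

    Assumed strategy: downsample the larger class so both classes
    have equal size, shuffle, then hold out dev_fraction as dev."""
    rng = random.Random(seed)
    n = min(len(pos), len(neg))
    examples = ([('pos', text) for text in rng.sample(pos, n)]
            + [('neg', text) for text in rng.sample(neg, n)])
    rng.shuffle(examples)
    ndev = int(len(examples) * dev_fraction)
    return examples[ndev:], examples[:ndev]
```

Each pair can then be written out with to_fasttext(label, text), one example per line.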

	"""A baseline Bag-of-Words text classification.

	Usage: python3 classify.py <train.txt> <test.txt> [--svm] [--tfidf] [--bigrams]
	train.txt and test.txt should contain one "document" per line,
	first token should be the label.
	The default is to use regularized Logistic Regression and relative frequencies.
	Pass --svm to use Linear SVM instead.
	Pass --tfidf to use tf-idf instead of relative frequencies.
	Pass --bigrams to use bigrams instead of unigrams.
	"""

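The input format described in the usage text above (one document per line, first token is the label) can be read with a few lines of standard-library Python. This is only a sketch of the parsing step, not of the classifier; the function name parse_documents() is an assumption, as is the choice to skip blank lines.

```python
def parse_documents(lines):
    """Split lines into parallel lists of labels and document texts.

    The first whitespace-delimited token of each line is the label;
    the remainder of the line is the document."""
    labels, texts = [], []
    for line in lines:
        parts = line.split(None, 1)
        if not parts:
            continue  # assumption: blank lines are skipped
        labels.append(parts[0])
        texts.append(parts[1].strip() if len(parts) > 1 else '')
    return labels, texts
```

An open file object can be passed directly, since iterating over it yields lines.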
	"""Apply polyglot language detection to all .txt files under current directory
	(searched recursively), write report in tab-separated file detectedlangs.tsv.
	"""
	import os
	from glob import glob
	from polyglot.detect import Detector
	from polyglot.detect.base import UnknownLanguage


	def main():

	import datetime


	def addseconds(timestamp, seconds):
		"""Take timestamp as string and add seconds to it.

		>>> addseconds('00:01:45,667', 1)
		'00:01:46,667'
		>>> addseconds('00:01:45,667', 0.5)
		'00:01:46,167'
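The body of addseconds() is cut off above. A minimal sketch that is consistent with the two doctests could look like this; the use of strptime() and timedelta is an assumption about the approach, not the original implementation.

```python
import datetime


def addseconds(timestamp, seconds):
    """Add seconds to a 'HH:MM:SS,mmm' timestamp string.

    Sketch only: hours above 23 are not supported, and the result
    wraps around at midnight."""
    fmt = '%H:%M:%S,%f'
    parsed = datetime.datetime.strptime(timestamp, fmt)
    shifted = parsed + datetime.timedelta(seconds=seconds)
    # %f formats microseconds (6 digits); drop the last 3 for milliseconds.
    return shifted.strftime(fmt)[:-3]
```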

	import random
	from timeit import timeit
	import re
	import re2

	re_ip = re.compile(br'\d+\.\d+\.\d+\.\d+')
	re2_ip = re2.compile(br'\d+\.\d+\.\d+\.\d+')

	lines = ['.'.join(str(random.randint(1, 255)) for _ in range(4)).encode('utf8')
			for _ in range(16000)]

	# This is a comment
	FROM ubuntu:20.04
	MAINTAINER Andreas van Cranenburgh <[email protected]>
	RUN ln -fs /usr/share/zoneinfo/Europe/Amsterdam /etc/localtime
	ENV DEBIAN_FRONTEND=noninteractive
	RUN apt-get update && apt-get install -y \
	build-essential \
	curl \
	git \
	python3 \