- Define a function `max()` that takes two numbers as arguments and returns the larger of them. Use the if-then-else construct available in Python. (It is true that Python has the `max()` function built in, but writing it yourself is nevertheless a good exercise.)
- Define a function `max_of_three()` that takes three numbers as arguments and returns the largest of them.
- Define a function that computes the length of a given list or string. (It is true that Python has the `len()` function built in, but writing it yourself is nevertheless a good exercise.)
- Write a function that takes a character (i.e. a string of length 1) and returns `True` if it is a vowel, `False` otherwise.
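
One possible way to approach these four exercises is sketched below. The names `max_of_two`, `length` and `is_vowel` are assumptions chosen for illustration; in particular, the first function is not called `max()` here so that the built-in is not shadowed. Treating only a, e, i, o and u as vowels is likewise an assumption.

```python
def max_of_two(a, b):
    """Return the larger of two numbers using if-else."""
    if a > b:
        return a
    else:
        return b


def max_of_three(a, b, c):
    """Return the largest of three numbers by reusing max_of_two."""
    return max_of_two(max_of_two(a, b), c)


def length(seq):
    """Count the items of a list or string without calling len()."""
    count = 0
    for _ in seq:
        count += 1
    return count


def is_vowel(char):
    """Return True if char is one of a, e, i, o, u (case-insensitive)."""
    # The length check rejects the empty string, which `in` would accept.
    return len(char) == 1 and char.lower() in "aeiou"
```

Building `max_of_three()` on top of the two-argument version keeps the comparison logic in one place.
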
| """Convert XML output of Stanford CoreNLP to CoNLL 2012 format. | |
| $ ./corenlp.sh -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref \ | |
| -output.printSingletonEntities true \ | |
| -file /tmp/example.txt | |
| $ python3 corenlpxmltoconll2012.py example.txt.xml > example.conll | |
| """ | |
| import re | |
| import sys | |
| from lxml import etree |
| """Prepare https://benjaminvdb.github.io/110kDBRD/ for use with fastText. | |
| Divide train set into 90% train and 10% dev, balance positive and negative | |
| reviews, and shuffle. Write result in fastText format.""" | |
| import os | |
| import re | |
| import random | |
| import glob | |
| from syntok.tokenizer import Tokenizer |
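
The preview above stops after the imports, so the following is only a minimal sketch of the steps named in the docstring (balance, shuffle, 90/10 split, fastText-style lines), not the script's actual code. The function name `balance_split`, the label names `pos` and `neg`, and the fixed seed are assumptions; tokenization is left out entirely.

```python
import random


def balance_split(pos, neg, dev_fraction=0.1, seed=1):
    """Balance two lists of reviews, shuffle, and split into train and dev.

    Returns (train, dev), each a list of lines with a __label__ prefix."""
    rng = random.Random(seed)
    n = min(len(pos), len(neg))
    # Downsample the larger class so both labels are equally frequent.
    lines = ['__label__pos ' + text for text in rng.sample(pos, n)]
    lines += ['__label__neg ' + text for text in rng.sample(neg, n)]
    rng.shuffle(lines)
    # The first dev_fraction of the shuffled lines becomes the dev set.
    n_dev = int(len(lines) * dev_fraction)
    return lines[n_dev:], lines[:n_dev]
```

Note that this sketch balances the classes before splitting, so the dev set is only approximately balanced.
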
| """A baseline Bag-of-Words text classification. | |
| Usage: python3 classify.py <train.txt> <test.txt> [--svm] [--tfidf] [--bigrams] | |
| train.txt and test.txt should contain one "document" per line, | |
| first token should be the label. | |
| The default is to use regularized Logistic Regression and relative frequencies. | |
| Pass --svm to use Linear SVM instead. | |
| Pass --tfidf to use tf-idf instead of relative frequencies. | |
| Pass --bigrams to use bigrams instead of unigrams. | |
| """ |
- Write a function `char_freq()` that takes a string and builds a frequency listing of the characters contained in it. Represent the frequency listing as a Python dictionary. Try it with something like `char_freq("abbabcbdbabdbdbabababcbcbab")`.
- Write a function `char_freq_table()` that takes a file name as argument, builds a frequency listing of the characters contained in the file, and prints a sorted and nicely formatted character frequency table to the screen.
- The third person singular verb form in English is distinguished by the suffix `-s`, which is added to the stem of the infinitive form: `run` -> `runs`. A simple set of rules can be given as follows:
  a. If the verb ends in `y`, remove it and add `ies`.
  b. If the verb ends in `o`, `ch`, `s`, `sh`, `x` or `z`, add `es`.
  c. By default just add `s`.
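
The character frequency exercise and rules a to c above could be sketched as follows. The function name `make_3sg_form` is an assumption, since the exercise text shown here does not name it, and using `collections.Counter` is just one convenient option.

```python
from collections import Counter


def char_freq(text):
    """Return a dictionary mapping each character in text to its count."""
    return dict(Counter(text))


def make_3sg_form(verb):
    """Apply rules a-c to an infinitive and return the third person form."""
    # Rule a: a final y is replaced by ies.
    if verb.endswith("y"):
        return verb[:-1] + "ies"
    # Rule b: these endings take es.
    if verb.endswith(("o", "ch", "s", "sh", "x", "z")):
        return verb + "es"
    # Rule c: default case.
    return verb + "s"
```

The rules are deliberately simple: applied literally, rule a turns `play` into `plaies`, so a more careful version would have to check the letter before the `y`.
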
| """Apply polyglot language detection to all .txt files under current directory | |
| (searched recursively), write report in tab-separated file detectedlangs.tsv. | |
| """ | |
| import os | |
| from glob import glob | |
| from polyglot.detect import Detector | |
| from polyglot.detect.base import UnknownLanguage | |
| def main(): |
| filename | lang | confidence | read_bytes |
|---|---|---|---|
| train/neg/3706_2.txt | en | 81.0 | 1268 |
| train/neg/9466_1.txt | en | 99.0 | 1066 |
| train/neg/6464_2.txt | en | 99.0 | 1248 |
| train/neg/14850_2.txt | en | 99.0 | 1128 |
| train/neg/4674_2.txt | en | 99.0 | 1306 |
| train/neg/7036_1.txt | fy | 68.0 | 997 |
| train/neg/7454_2.txt | en | 63.0 | 688 |
| train/neg/4856_2.txt | en | 99.0 | 1363 |
| train/neg/12096_2.txt | en | 99.0 | 1339 |
| import datetime | |
| def addseconds(timestamp, seconds): | |
| """Take timestamp as string and add seconds to it. | |
| >>> addseconds('00:01:45,667', 1) | |
| '00:01:46,667' | |
| >>> addseconds('00:01:45,667', 0.5) | |
| '00:01:46,167' |
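
The preview is cut off after the doctests, so the body of `addseconds` is not shown. A minimal sketch that satisfies the two doctests, assuming timestamps always have the form `HH:MM:SS,mmm` and stay within a single day, might look like this:

```python
import datetime


def addseconds(timestamp, seconds):
    """Add seconds (int or float) to an 'HH:MM:SS,mmm' timestamp string."""
    parsed = datetime.datetime.strptime(timestamp, '%H:%M:%S,%f')
    shifted = parsed + datetime.timedelta(seconds=seconds)
    # %f yields six digits (microseconds); keep the first three (milliseconds).
    return shifted.strftime('%H:%M:%S,%f')[:-3]
```

Because the hour field is parsed with `%H`, this sketch rejects timestamps of 24 hours or more, and a shift past midnight wraps around.
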
| import random | |
| from timeit import timeit | |
| import re | |
| import re2 | |
| re_ip = re.compile(br'\d+\.\d+\.\d+\.\d+') | |
| re2_ip = re2.compile(br'\d+\.\d+\.\d+\.\d+') | |
| lines = ['.'.join(str(random.randint(1, 255)) for _ in range(4)).encode('utf8') | |
| for _ in range(16000)] |
| # This is a comment | |
| FROM ubuntu:20.04 | |
| MAINTAINER Andreas van Cranenburgh <[email protected]> | |
| RUN ln -fs /usr/share/zoneinfo/Europe/Amsterdam /etc/localtime | |
| ENV DEBIAN_FRONTEND=noninteractive | |
| RUN apt-get update && apt-get install -y \ | |
| build-essential \ | |
| curl \ | |
| git \ | |
| python3 \ |