Olga Pustovalova olp-cs

System

git submodule update --init
pip install -r requirements.txt

wget https://archive.apache.org/dist/tika/2.3.0/tika-server-standard-2.3.0.jar
java -jar tika-server-standard-2.3.0.jar

The notebooks in this Gist compare the following operations and demonstrate their equivalent outputs:

For pandas 1.3.0:

input_data.groupby(group_by_column).mean()[[expression_column]]
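As a rough sketch of what such a comparison might look like: the frame below is made-up sample data, the column names are placeholders, and the second form (selecting the column before aggregating) is only an assumed counterpart, not necessarily the one used in the notebooks.

```python
import pandas as pd

# Hypothetical sample data; column names are placeholders, not taken from the notebooks.
input_data = pd.DataFrame({
    "gene": ["a", "a", "b", "b"],
    "expression": [1.0, 3.0, 2.0, 6.0],
    "other": [10.0, 20.0, 30.0, 40.0],
})
group_by_column = "gene"
expression_column = "expression"

# Form from the text: aggregate every column, then keep the one of interest.
mean_then_select = input_data.groupby(group_by_column).mean()[[expression_column]]

# Assumed counterpart: select the column first, then aggregate only that column.
select_then_mean = input_data.groupby(group_by_column)[[expression_column]].mean()

# Both results are DataFrames indexed by the grouping column.
assert mean_then_select.equals(select_then_mean)
print(mean_then_select)
```

Selecting the column before calling `mean()` avoids aggregating columns that are discarded afterwards, which may matter on wide frames.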

Probabilistic Data Structures for Web Analytics and Data Mining : A great overview of the space of probabilistic data structures and how they are used in approximation algorithm implementation.
Models and Issues in Data Stream Systems
Philippe Flajolet’s contribution to streaming algorithms : A presentation by Jérémie Lumbroso that revisits some of the historical perspectives and how it all began with Flajolet.
Approximate Frequency Counts over Data Streams by Gurmeet Singh Manku & Rajeev Motwani : One of the early papers on the subject.
[Methods for Finding Frequent Items in Data Streams](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.9800&rep

	library(igraph) # to work with graphs
	library(RColorBrewer) # to use a color palette
	library(plotrix) # to rescale variables


	# Read the data
	raw_data <- read.csv("network_data.csv")
	names(raw_data) <- c("Source", "Target", "Count", "Money")

	# reformat data for igraph library

	import os
	import sys

	import numpy as np
	import matplotlib.pyplot as plt

	from pandas import DataFrame
	from pdb import set_trace  # stdlib debugger; pandas.util.testing is not available in current pandas

	dirs = []

	from sklearn.metrics import confusion_matrix

	def print_cm(cm, labels, hide_zeroes=False, hide_diagonal=False, hide_threshold=None):
	    """Pretty-print a confusion matrix."""
	    # Column width: the longest label, but at least 5 characters for the values
	    columnwidth = max([len(x) for x in labels] + [5])
	    empty_cell = " " * columnwidth
	    # Print header (Python 3 print; end=" " keeps the labels on one line)
	    print(" " + empty_cell, end=" ")
	    for label in labels:
	        print("%{0}s".format(columnwidth) % label, end=" ")

	{
	 "metadata": {
	  "name": "exploring_a_single_data_file"
	 },
	 "nbformat": 3,
	 "nbformat_minor": 0,
	 "worksheets": [
	  {
	   "cells": [
	    {