Antonio Piccolboni piccolbo

(A dialog on statistical algorithms with mistral.ai. My questions are in italics; my comments are in bold.)

Do you know the definition of median in statistics?

Yes, in statistics, the median is a measure of central tendency that represents the middle value in a dataset when the data is arranged in ascending or descending order. If the dataset has an odd number of observations, the median is the middle value. If the dataset has an even number of observations, the median is the average of the two middle values. The median is a useful measure of central tendency when the data is skewed or has outliers, as it is not affected by extreme values like the mean is. For example, if you have a dataset of incomes for a group of people, the median income would be the income level that separates the higher half of incomes from the lower half.

So far so good. Deleted an additional question here that was too open-ended to be clearly wrong or right.
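
A minimal sketch of the definition above, for reference. The function name `median` and the sample incomes are illustrative assumptions, not taken from the dialog; the standard library's `statistics.median` is called only as a cross-check.

```python
import statistics


def median(values):
    """Return the median of a non-empty collection of numbers."""
    ordered = sorted(values)  # arrange the data in ascending order
    if not ordered:
        raise ValueError("median of an empty collection is undefined")
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd number of observations: the middle value.
        return ordered[mid]
    # Even number of observations: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical incomes with one extreme value.
incomes = [30_000, 45_000, 52_000, 61_000, 1_000_000]
print(median(incomes))             # 52000
print(statistics.median(incomes))  # cross-check against the standard library
print(statistics.mean(incomes))    # the mean is pulled up by the outlier
```

With these made-up incomes the median stays at the middle observation, while the mean is dominated by the single large value, which is the robustness property the answer describes.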

*Can you show me how to find the median of a list in Python? D

Putting wings on the Elephant

[operating-hadoop]

HBase is used widely at Facebook, and one of the biggest use cases is Facebook Messages. With a billion users, there are many reliability and performance challenges for both HBase and HDFS. HDFS was originally designed for batch processing systems like MapReduce/Hive. A real-time use case like Facebook Messages, where the p99 latency can't exceed a couple hundred milliseconds, poses a lot of challenges for HDFS. In this talk we will share the work the HDFS team at Facebook has done to support a real-time use case like Facebook Messages: (1) using system calls to tune performance; (2) inline checksums to reduce IOPS by 40%; (3) reducing the p99 for read and write latencies by about 10x; (4) tools used to determine the root cause of outliers. We will discuss the details of each technique, the challenges we faced, the lessons learned, and results showing the impact of each improvement.

speaker: Pritam Damania

Real-Time Market Basket Analysis for Retail with

Code execution

Hydrogen: lets you send any code selection to a kernel and see the results inline, plus watches and more
hydrogen-launcher: launch a terminal or IPython

The git bundle

Some of this may be superseded by native git integration:

git-blame: find who wrote that cryptic code

merge any development branch you need to merge
git checkout master
run test

make install-dev
make test

when tests pass, git push
Update HISTORY.rst
Check readthedocs to make sure docs are OK

Keybase proof

I hereby claim:

I am piccolbo on github.
I am piccolbo (https://keybase.io/piccolbo) on keybase.
I have a public key ASATO-Kj3cWENOHAPB5OgNFMlc4xEUtScX1L0-Er8tYX-Ao

To claim this, I am signing this object:

Criteria: most employees remote all the time. No office space allocated or available. No relocation required. Low travel requirement.

Company	Notes
Automattic	4 weeks/year travel all employees
Rocana
RStudio
Plex
Open Knowledge Foundation	Time zone restricted -3:1, different contracts UK/non UK

Dplyr is a well-known R package for working with structured data, either in memory or in a database and, more recently, on a cluster. The in-memory implementations generally have capabilities that are not found in the others, so the notion of backend is used with a bit of poetic license. Even the different database and cluster backends differ in subtle ways. But it sure is better than writing SQL directly! Here I provide a list of backends, with links to the packages that implement them where necessary. I've done my best to link to active projects, but I am not endorsing any of them. Do your own testing. Enjoy, and please contribute any corrections or additions in the comments.

Backend	Package
data.frame	builtin
data.table	builtin
arrays	builtin
SQLite	builtin
PostgreSQL/Redshift	builtin

merge into master
update version #
update date
update Rd help()
push master
Repeat until tests pass
- test local and debug
- test remote and debug
- test additional platforms
apply necessary fixes

	#on cluster
	thrift /spark/sbin/start-thriftserver.sh --master yarn-client
	#ssh tunnel, direct 10000 to unused 8157
	ssh -i ~/caserta-1.pem -N -L 8157:ec2-54-221-27-21.compute-1.amazonaws.com:10000 [email protected]
	#see this for JDBC config on client http://blogs.aws.amazon.com/bigdata/post/TxT7CJ0E7CRX88/Using-Amazon-EMR-with-SQL-Workbench-and-other-BI-Tools

	```{r}
	ff = function(){}
	names(ff) = "abc"
	# Error in names(ff) = "abc" : names() applied to a non-vector
	is.vector(mtcars)
	#[1] FALSE
	names(mtcars) = LETTERS[1:11]
	names(mtcars)
	# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"
	```