@piccolbo
piccolbo / mistral-median.md
Last active March 19, 2024 05:33
Mistral ai answers and hallucinates about stats

(A dialog on statistical algorithms with mistral.ai. My questions are in italics. My comments are in bold.)

Do you know the definition of median in statistics?

Yes, in statistics, the median is a measure of central tendency that represents the middle value in a dataset when the data is arranged in ascending or descending order. If the dataset has an odd number of observations, the median is the middle value. If the dataset has an even number of observations, the median is the average of the two middle values. The median is a useful measure of central tendency when the data is skewed or has outliers, as it is not affected by extreme values like the mean is. For example, if you have a dataset of incomes for a group of people, the median income would be the income level that separates the higher half of incomes from the lower half.

So far so good. Deleted an additional question here that was too open-ended to be clearly wrong or right.
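As a quick check of the definition quoted above, here is a minimal sketch in Python. The function name `median_of` is made up for illustration and is not part of the dialog; the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics


def median_of(values):
    """Return the median of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("median of an empty sequence is undefined")
    ordered = sorted(values)          # arrange the data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: the single middle value
        return ordered[mid]
    # even count: the average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


# Cross-check against the standard library on a skewed sample:
# the outlier 1000 moves the mean a lot but not the median.
sample = [3, 1, 1000, 2]
assert median_of(sample) == statistics.median(sample)
```

Sorting makes this O(n log n). Selection algorithms can find the median in linear time, but the sorted version follows the textbook definition most directly.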

*Can you show me how to find the median of a list in Python? D

Putting wings on the Elephant

[operating-hadoop]

HBase is used widely at Facebook, and one of the biggest use cases is Facebook Messages. With a billion users, there are a lot of reliability and performance challenges on both HBase and HDFS. HDFS was originally designed for a batch processing system like MapReduce/Hive. A realtime use case like Facebook Messages, where the p99 latency can't be more than a couple of hundred milliseconds, poses a lot of challenges for HDFS. In this talk we will share the work the HDFS team at Facebook has done to support a realtime use case like Facebook Messages: (1) Using system calls to tune performance; (2) Inline checksums to reduce iops by 40%; (3) Reducing the p99 for read and write latencies by about 10x; (4) Tools used to determine root cause of outliers. We will discuss the details of each technique, the challenges we faced, lessons learned and results showing the impact of each improvement.

speaker: Pritam Damania

Real-Time Market Basket Analysis for Retail with

@piccolbo
piccolbo / My atom data science bundle.md
Last active August 24, 2019 04:35
A list of atom extensions I use for data science.

Code execution

  • Hydrogen: lets you send any code selection to a kernel and visualize the results in-text, plus watches and more
  • hydrogen-launcher: launch a terminal or IPython

The git bundle

Some of this may be superseded by native git integration:

  • git-blame: find who wrote that cryptic code
@piccolbo
piccolbo / pypi-release-checklist2.md
Last active February 23, 2022 17:41 — forked from audreyfeldroy/pypi-release-checklist2.md
My PyPI Release Checklist 2 (now with bumpversion)
  • merge any development branch you need to merge
  • git checkout master
  • run tests:
    make install-dev
    make test
  • when tests pass, git push
  • Update HISTORY.rst
  • Check readthedocs to make sure docs are OK
@piccolbo
piccolbo / keybase.md
Created October 15, 2017 22:05
keybase identification

Keybase proof

I hereby claim:

  • I am piccolbo on github.
  • I am piccolbo (https://keybase.io/piccolbo) on keybase.
  • I have a public key ASATO-Kj3cWENOHAPB5OgNFMlc4xEUtScX1L0-Er8tYX-Ao

To claim this, I am signing this object:

@piccolbo
piccolbo / emr_spark_thrift_on_yarn
Created January 27, 2016 22:55 — forked from elliottcordo/emr_spark_thrift_on_yarn
EMR spark thrift server
#on cluster
thrift /spark/sbin/start-thriftserver.sh --master yarn-client
#ssh tunnel, direct 10000 to unused 8157
ssh -i ~/caserta-1.pem -N -L 8157:ec2-54-221-27-21.compute-1.amazonaws.com:10000 [email protected]
#see this for JDBC config on client http://blogs.aws.amazon.com/bigdata/post/TxT7CJ0E7CRX88/Using-Amazon-EMR-with-SQL-Workbench-and-other-BI-Tools
@piccolbo
piccolbo / Fully-tele-businesses.md
Last active May 2, 2016 16:01
Fully remote business

Criteria: most employees remote all the time. No office space allocated or available. No relocation required. Low travel requirement.

| Company | Notes |
|---|---|
| Automattic | 4 weeks/year travel all employees |
| Rocana | |
| RStudio | |
| Plex | |
| Open Knowledge Foundation | Time zone restricted -3:1, different contracts UK/non UK |
@piccolbo
piccolbo / dplyr-backends.md
Last active June 23, 2018 03:58
Dplyr backends: the ultimate collection

Dplyr is a well-known R package for working with structured data, whether in memory, in a database or, more recently, on a cluster. The in-memory implementations generally have capabilities that are not found in the others, so the notion of backend is used with a bit of poetic license. Even the different database and cluster backends differ in subtle ways. But it sure is better than writing SQL directly! Here I provide a list of backends with links to the packages that implement them when necessary. I've done my best to provide links to active projects, but I am not endorsing any of them. Do your own testing. Enjoy, and please contribute any corrections or additions in the comments.

| Backend | Package |
|---|---|
| data.frame | builtin |
| data.table | builtin |
| arrays | builtin |
| SQLite | builtin |
| PostgreSQL/Redshift | builtin |
```{r}
ff = function(){}
names(ff) = "abc"
# Error in names(ff) = "abc" : names() applied to a non-vector
is.vector(mtcars)
#[1] FALSE
names(mtcars) = LETTERS[1:11]
names(mtcars)
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"
```
  • merge into master
  • update version #
  • update date
  • update Rd help()
  • push master
  • Repeat until tests pass
    • test local and debug
    • test remote and debug
    • test additional platforms
  • apply necessary fixes