Charles Boutaud, Bureau of Investigative Journalism https://www.thebureauinvestigates.com/profile/charlesboutaud https://pydata.org/london2018/schedule/presentation/49/ Data-driven journalism, an overview of what they do. They use a Slack channel to kick off conversations, discuss ideas and ask questions, which then get translated into potential projects.
https://catboost.yandex/ https://github.com/catboost/catboost Anna Veronika Dorogush https://pydata.org/london2018/schedule/presentation/34/ CatBoost, basically gradient boosting with native support for categorical features on top of numerical ones. Supports GPU training with a several-fold speed-up wrt XGBoost. Parameters are very important, particularly learning rate and number of iterations, to find the right balance for error convergence.
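The categorical handling CatBoost is known for builds on "ordered target statistics". A minimal pure-Python sketch of that idea, with my own function name and smoothing values, not CatBoost's actual API:

```python
# Sketch of ordered target statistics: each row's category is encoded
# using only the target values of rows that came *before* it, which
# avoids target leakage. (prior/weight are illustrative choices.)
def ordered_target_encode(categories, targets, prior=0.5, weight=1.0):
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # smoothed mean of past targets for this category
        encoded.append((s + prior * weight) / (c + weight))
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded

print(ordered_target_encode(['a', 'a', 'b', 'a'], [1, 0, 1, 1]))
# → [0.5, 0.75, 0.5, 0.5]
```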
Dat Nguyen https://pydata.org/london2018/schedule/presentation/21/ Performance at Zopa, a lending platform for micro loans; NP-hard problems allocating money pots that meet investors' requirements. Mostly unreadable presentation due to tiny fonts, unfortunately. Anyway, they use https://numba.pydata.org/, an Anaconda-backed JIT compiler package that seems to give good results.
Ian Ozsvald (PyData organiser) https://pydata.org/london2018/schedule/presentation/32/ Have a look at pandas-profiling https://github.com/pandas-profiling/pandas-profiling. Testing the scikit-learn DummyClassifier on the Titanic dataset from Kaggle, then running a random forest and noticing it performs way better than the dummy baseline. Plotting a Yellowbrick confusion matrix; check that lib, quite a cool viz library for scikit-learn https://pythonhosted.org/yellowbrick/introduction.html. Also the ELI5 library, "explain like I'm 5" http://eli5.readthedocs.io/en/latest/overview.html
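The dummy-baseline comparison is easy to reproduce; here with synthetic data standing in for the Titanic CSV (which isn't bundled with scikit-learn):

```python
from sklearn.datasets import make_classification  # stand-in for Titanic
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# baseline that always predicts the most common class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(baseline.score(X_te, y_te), forest.score(X_te, y_te))
```

If the real model doesn't clearly beat the dummy, something is wrong with the features or the setup, which is the point of the check.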
Liam P. Kirwin https://pydata.org/london2018/schedule/presentation/37/ Big commercial insurance company; they're old school because it works well, so data science helps where traditional models can't reach. Interestingly enough, they introduced data science libraries to improve communication across levels. ELI5 / LIME (explain results from classifiers): https://github.com/marcotcr/lime / shap (probably; I might have misspelled this) as libraries for storytelling.
This is the guy from the live-coding presentation I saw at PyData 17... I'm sorry, I find live-coding sessions an incomprehensible, useless show-off; I'll avoid them in the future.
Started in 2006 from a monolithic repo shared across the world, with no dev/prod separation. Scaled enormously to 35M lines of code using Hydra, a proprietary object-oriented database with distributed optimistic writes. Source code is itself stored in Hydra, so running a command fetches the latest version directly from the db. Their viz tool, Perspective, was recently open-sourced on GitHub https://jpmorganchase.github.io/perspective/
Emmanuelle Gouillart https://pydata.org/london2018/schedule/presentation/51/ Core dev of scikit-image. Really interesting talk about empowering learning in her team: encouraging people to write documentation, adding an API gallery, and taking every action to onboard newcomers as quickly as possible. Check sphinx-gallery https://github.com/sphinx-gallery/sphinx-gallery. Also check Binder, which runs notebooks on k8s; it's also featured in sphinx-gallery https://mybinder.org/
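For reference, wiring sphinx-gallery into a project's Sphinx build is just a conf.py fragment like this (the directory names are an assumed layout, not from the talk):

```python
# conf.py (fragment): enable sphinx-gallery so example scripts under
# examples/ are executed and rendered as a browsable gallery.
extensions = [
    'sphinx_gallery.gen_gallery',
]

sphinx_gallery_conf = {
    'examples_dirs': 'examples',       # where the example scripts live
    'gallery_dirs': 'auto_examples',   # where generated pages are written
}
```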
Guillermo Christen https://pydata.org/london2018/schedule/presentation/38/ Blah blah on using neural networks as encoders/decoders to compress feature spaces. Meh, not convinced at all.
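For what it's worth, the idea is easy to sketch with scikit-learn: train a network to reconstruct its own input through a narrow hidden layer, then use the hidden activations as the compressed features (a toy sketch under my own assumptions, not the speaker's setup):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))  # 8-dimensional feature space

# autoencoder: learn to reconstruct X from X through a 3-unit bottleneck
ae = MLPRegressor(hidden_layer_sizes=(3,), activation='identity',
                  max_iter=500, random_state=0).fit(X, X)

# with an identity activation, the hidden layer is a linear projection;
# its activations are the compressed 3-d representation
Z = X @ ae.coefs_[0] + ae.intercepts_[0]
print(Z.shape)  # (200, 3)
```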
Adam Hill https://pydata.org/london2018/schedule/presentation/17/ This guy... interesting, seems a good guy, albeit being on the pompous side. Showed the dataset from Companies House loaded into Neo4j; a lot of things look really odd (e.g. some 4,000 children under 2 years of age are registered as company owners...). Check out datakind.org and his presentation http://bit.ly/pyDataLDN2018-Corporate-Ownership
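The kind of sanity check behind that finding is a one-liner once the ownership records are in a dataframe (the tiny table and column names below are made up for illustration, not the real Companies House schema):

```python
import pandas as pd

# hypothetical mini-extract of company-owner records
owners = pd.DataFrame({
    "name": ["Adult Owner", "Infant One", "Infant Two"],
    "date_of_birth": ["1975-03-01", "2017-06-01", "2018-01-15"],
})

snapshot = pd.Timestamp("2018-04-28")  # conference date used as "today"
dob = pd.to_datetime(owners["date_of_birth"])
age_years = (snapshot - dob).dt.days / 365.25

# owners under 2 years old: almost certainly data-quality or fraud signals
under_two = owners[age_years < 2]
print(len(under_two))  # 2
```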
Matti Lyra https://pydata.org/london2018/schedule/presentation/30/ A MinHash library to compare document similarity https://github.com/mattilyra/LSH. Kind of meh; focused on near-duplicate text only, e.g. multiple versions of the same article where a few things have changed: how do you quickly tell whether they're the same? The proposed solution only applies to text.
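The gist of MinHash is small enough to sketch in plain Python (seeded hashing scheme and signature length are my own choices here, not the library's internals): the fraction of matching signature slots estimates the Jaccard similarity of the two token sets.

```python
import hashlib

def minhash(tokens, num_hashes=64):
    # for each of num_hashes seeded hash functions, keep the smallest
    # hash value over the document's token set
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in set(tokens)
        ))
    return sig

def similarity(sig_a, sig_b):
    # matching slots / total slots ≈ Jaccard similarity of the token sets
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash("the quick brown fox".split())
b = minhash("the quick brown cat".split())  # near-duplicate
c = minhash("completely different tokens here".split())
print(similarity(a, b), similarity(a, c))
```

Near-duplicates score high, unrelated documents near zero; LSH then buckets signatures so you only compare candidates that are likely similar.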