- @benjaminbenben @pusher
- accessibility - don’t need specialized programs
- it’s the start of something, not the end
- you don’t print out a csv document, you do something further with it
- how do you get your data back from the cloud?
- runkeeper - GPS routes of running
- how do I get my data from it?
- attempt #1: download
- .zip file, containing .gpx coordinates and .csv with heart rate etc
- 👌
- but:
- no format control
- functionality might change or disappear
- need to go online to retrieve different times
- eg if you need a different set of dates
- can we gather together our data and cut it up ourselves offline?
- attempt #2: script
- bash script, `jq` for json processing - github gist benfoxall/runkeeper-export.sh
- 👌
- format choices
- sharable
- offline
- 👎
- inaccessible
- downloading a csv is easy, writing a bash script is hard
- attempt #3: web service
- runkeeper-to-csv.herokuapp.com
- connect to runkeeper api, convert to csv documents
- 👌
- accessible
- 👎
- non-trivial backend
- handling sensitive data
- online only
- attempt #4: serve from the browser
- 👌
- accessible
- data stored locally
- 👎
- …
- request -> process -> serve
- a small runkeeper API
- javascript `fetch()` API (it’s the new ajax, returns a promise) - dataForUser(..)
- process
- turning JSON into CSV
- serve to the user
- data URIs
- present csv as a data URI
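- putting request -> process -> serve together - a minimal sketch, not the talk’s actual code; the endpoint URL, response shape, and the naive CSV quoting are my assumptions:

```js
// hypothetical sketch of the in-browser pipeline described above
async function dataForUser(token) {
  // request: fetch() returns a promise (it's the new ajax)
  const res = await fetch('https://api.runkeeper.com/fitnessActivities', {
    headers: { Authorization: `Bearer ${token}` }, // endpoint and auth assumed
  });
  return res.json();
}

// process: turn an array of JSON records into CSV (naive quoting)
function toCSV(items) {
  const cols = Object.keys(items[0]);
  const rows = items.map(it => cols.map(c => JSON.stringify(it[c] ?? '')).join(','));
  return [cols.join(','), ...rows].join('\n');
}

// serve: present the CSV as a data URI the user can download
function toDataURI(csv) {
  // UTF-8-safe base64 encoding
  return 'data:text/csv;base64,' + btoa(unescape(encodeURIComponent(csv)));
}
```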
- 🚢
- …
- how can we make it better?
- support bigger files
- csv might be big
- data uri does base64 which makes it even bigger
- solution: `Blob()`
- supported by browsers (except IE9)
- creates an object outside the javascript stack
- it’s also immutable
- avoids churning through VM memory
- can generate URLs to download Blobs
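- a minimal sketch of the Blob approach, assuming a `csv` string from the previous step:

```js
// the bytes live outside the JS heap, and no base64 inflation is needed
const blob = new Blob([csv], { type: 'text/csv' });
const url = URL.createObjectURL(blob); // a downloadable URL for the Blob

const a = document.createElement('a');
a.href = url;
a.download = 'activities.csv'; // hint the browser to download rather than navigate
a.click();
```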
- no persistence
- IndexedDB (+ Dexie) - chrome devtools resources tab shows the IndexedDB contents
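- a sketch of persisting the Blob with Dexie (the database and table names are made up):

```js
import Dexie from 'dexie';

const db = new Dexie('runkeeper');
db.version(1).stores({ activities: 'id' }); // "id" is the primary key

// inside an async function: Blobs can be stored in IndexedDB directly
await db.activities.put({ id: 42, csv: blob });
const saved = await db.activities.get(42); // survives page reloads
```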
- no permanent URLs
- the Blob URL is only valid while the page is loaded
- a static script can’t pull from this URL
- Service Workers!
- a script which runs separately from your UI thread
- allows offline-first websites
- can serve cached content
- can serve synthesized responses
- proper URL, but no web request
- 👌
- we can cache the service worker response
- we can serve different views on our data
- geoJSON, csv, html
- frontend code can use Service Workers without knowing they exist
- all of this is now offline-capable
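- a hypothetical service worker sketch of a synthesized response - a proper URL served with no web request (the path and helper name are made up):

```js
// sw.js
self.addEventListener('fetch', event => {
  const url = new URL(event.request.url);
  if (url.pathname.endsWith('/activities.csv')) {
    event.respondWith(
      readCsvFromIndexedDB() // assumed helper wrapping the IndexedDB code above
        .then(blob => new Response(blob, { headers: { 'Content-Type': 'text/csv' } }))
    );
  }
  // anything else falls through to the network as normal
});
```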
- https://runkeeper-data.herokuapp.com
- when you visit, log in with OAuth
- service worker starts and downloads data into IndexedDB
- continues (even if you close the tab!)
- founder, Cottage Labs
- software dev agency, higher education work
- bespoke information systems, quickly and cost-effectively
- clients work with spreadsheets
- upload into a datastore
- people can then query the datastore and see interesting views
- humans love spreadsheets
- especially in the non-technical world
- tabular data is easy to work with
- the desktop toolchain is excellent (much as we might complain)
- we could never meet the needs that these tools meet (especially on our time and budget and skills)
- lots of information systems are basically the same
- most of the differences are the kind of data being worked on
- workflows exist, but they happen in the admin area
- admin areas are expensive and boring to build
- lots of web forms – create/edit/delete record
- I’d like it if people could manage their data outside of the admin system
- data visualisation, data science, data journalism are all in
- but also specialist domains and outside the reach of small organizations
- (I’m not a data specialist - no machine learning or stats – but I can help the client cut up their data)
- we find ourselves doing much the same thing over and over again
- they put blurb above their header rows
- the actual table of data is a few rows down
- a spreadsheet is a document, not a dataset
- they colour cells in, with the colour carrying meaning
- this disappears on export-to-csv
- the form-vs-function distinction isn’t clear when seeing a spreadsheet as a document
- sloppy with hard formats (like numbers)
- eg -£1,00,0000.0
- they break boundaries of acceptable use for typed fields
- eg cost column containing “$100 to about 200”
- data models are brittle, humans are flexible
- decode the bits
- welcome to encoding hell!
- excel might give you latin-1 (ISO-8859-1) or Windows-1252 (not the same!)
- excel/numbers might give you MacRoman on OSX
- Calc will hopefully give you UTF-8
- any of them could do any one of hundreds of encodings
- some encodings are interchangeable, but the newline character is not a common link
- we check we’ve actually got a rectangular dataset for confidence
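- a sketch of this kind of fallback decoding in JavaScript (the talk doesn’t specify a language or library):

```js
// try strict UTF-8 first; if the bytes don't validate, fall back to a legacy guess
function decodeBytes(buf) {
  try {
    return new TextDecoder('utf-8', { fatal: true }).decode(buf);
  } catch {
    // choosing between Windows-1252, MacRoman ('macintosh'), etc is where real
    // charset detectors earn their keep; we just pick the likeliest here
    return new TextDecoder('windows-1252').decode(buf);
  }
}
```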
- read the data
- ignore supporting documentation above the dataset
- translate the header rows
- trim content, ignore empty values, and “N/A” values
- coerce data into something cleaner (“£1,000” -> 1000)
- we’re not scrubbing the data, just allowing for the humanity in the book-keeping
- output: JSON
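- a hypothetical sketch of those coercions:

```js
function cleanCell(raw) {
  const s = String(raw).trim();
  if (s === '' || /^n\/?a$/i.test(s)) return null; // empty and "N/A" values
  const num = s.replace(/[£$,\s]/g, '');           // "£1,000" -> "1000"
  return /^-?\d+(\.\d+)?$/.test(num) ? Number(num) : s;
}

cleanCell(' £1,000 ');          // -> 1000
cleanCell('N/A');               // -> null
cleanCell('$100 to about 200'); // -> left as a string, not forced into a number
```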
- make it queryable
- Elasticsearch
- publish interactive interfaces
- javascript frontend on top of elasticsearch query engine
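- the kind of query such a frontend might send straight to Elasticsearch (the index and field names are made up):

```js
// inside an async function
const res = await fetch('/funding-data/_search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    query: { match: { country: 'Kenya' } },          // full-text match
    aggs: { by_year: { terms: { field: 'year' } } }, // facet counts per year
  }),
});
const { hits, aggregations } = await res.json();
```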
- open access spectrum
- lantern (CSV-only interface)
- some data is hard to represent in spreadsheets
- hierarchical or highly relational data
- don’t make people use a spreadsheet the way we’d use a database!
- consistent use of dictionary terms
- if the spreadsheet maintainers can use consistent names for things, like Countries, it can make things much easier
- we’re not trying to duplicate: open refine, trifacta, tableau
- things we do use:
- d3 + nvd3
- elasticsearch
- objectpath (xpath-like language for JSON)
- things we tried but aren’t currently using
- highcharts
- tablib
- what’s your largest elasticsearch dataset? largest index?
- 2.5 million records; 25Gb
- http://dat-data.com/
- open source project for sharing open data
- funded by Alfred P Sloan foundation
- meetings are open youtube hangouts
- 3 person team
- >800 modules on npm
- around half a percent of all npm modules!
- dat is a p2p file sharing network
- written in javascript
- works in browser
- move the data to the code (don’t move your code to your data)
- data is just files
- you don’t need all the files
- move just the files you need to the code
- similar to bittorrent
- install: `npm install dat`
- `dat link ~/big-file.csv` creates a content-addressable link: `dat://9620fb285...`
- can give the link to a friend, then they run `dat dat://9620fb285...` and automatically discover you and start downloading the dataset
- split file into chunks which are unlikely to change
- git does one-chunk-per-line
- if I change one line, I only have to sync that one line, even if the file is large
- only works for text files
- rabin fingerprinting (content-defined chunking)
- scans through the file and creates chunks based on actual file content
- if you insert something in the middle, a rabin fingerprint will create the same chunks on each side of the change
- `npm install rabin`
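- not the rabin module’s actual API - just a toy illustration of content-defined chunking, where the bytes themselves decide the cut points:

```js
function chunkBoundaries(bytes, mask = 0x1fff) { // ~8KB average chunks
  const boundaries = [];
  let hash = 0;
  for (let i = 0; i < bytes.length; i++) {
    hash = ((hash << 1) + bytes[i]) | 0;           // toy rolling hash, ~32-byte window
    if ((hash & mask) === 0) boundaries.push(i + 1); // content decides the cut
  }
  boundaries.push(bytes.length);
  return boundaries;
}
// insert bytes mid-file and only nearby boundaries move,
// so identical content on either side chunks identically
```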
- https://mafintosh.github.io/hyperdrive
- in-browser dat links
- video player with random access!
- fetch the file chunks needed right now
- @zararah
- open knowledge, school of data, engineroom
- bridging gaps between communities who don’t talk to each other, or people who do talk but in different ways
- https://responsibledata.io
- https://theengineroom.org
- privacy, security, legal challenges
- ask questions
- even if there aren’t any hard-and-fast answers
- this changes hugely in different contexts
- https://responsibledata.io/reflection-stories
- sometimes tech really DOES improve people’s lives
- use of Tor
- The Counted
- sometimes it doesn’t
- Google Photos identified two black people as ‘gorillas’
- Physicians for Human Rights
- programme on sexual violence in conflict zones
- lots of victims don’t come forward to report
- even when they do, challenges to accurately record
- Kenya and eastern Democratic Republic of the Congo
- MediCapt
- standardising data collection
- digitising data collection
- mobile network penetration is v high, but the data is sensitive
- iterating upon tool choice
- tried an off-the-shelf tool, piloted, found it too cumbersome
- developed a new tool, user research with people on the ground
- reality check
- evaluate at the end
- start all over again and iterate
- slow development
- Sharing reports of violence
- a non-profit wanted to support a community which faces a lot of violence
- they weren’t particularly experienced in technology
- started thinking of developing an app
- report a perpetrator of violence to anyone in the area
- legal, privacy issues
- can’t have PII because this is an allegation
- but without PII the report isn’t that useful
- need to tread a fine line
- future proofing
- data minimization
- don’t want to hold data which could in future put people at risk
- people were put off from using app if they had to give too much information
- collaboration
- launch
- Human Rights Data Analysis Group
- https://hrdag.org
- data on casualties in Syria
- listing different groups documenting
- “Numbers are only human”
- how do you categorise civilian vs military death?
- how do you categorise death due to conflict vs “natural causes”?
- should you use exact (but uncertain) figures to draw attention to causes?
- http://cis-india.org/papers/ebola-a-big-data-disaster
- in some countries there was a push to release Call Detail Records (CDRs) from mobile companies
- getting access to the data
- in Sierra Leone and Guinea, they released this data; in Liberia they didn’t
- decision-making
- the call was to have the data anonymised
- but: it’s hard to anonymise such detailed information
- and: in the Ebola response, the data is most useful when it can be linked to real personal identities
- privacy rights weren’t respected
- digital infrastructure
- what might an adversary do with your data?
- not necessarily your adversary
- what malicious things could they do with your data and how might they gain from that?
- what would happen then?
- what’s your holistic security plan?
- what does informed consent look like for your users?
- if you know that no one’s reading your Ts & Cs
- are you making things visible that your users should know about?
- what levels of technical literacy do your users have?
- in your team, whose job is it to think about the ethics?
- tech & data projects can have unintended consequences, even when well-intended
- do you have examples where they managed to embed context with the data?
- the MediCapt team found the context crucial
- the HRDAG work has lots of asides and nuanced explanations
- they’re very careful about what they say, though they are probably more sure about their findings than many other groups
- this reminds me of an app for reporting requests for bribes. how do organizations share anonymised data securely?
- https://responsibledata.io/
- let’s not reinvent the wheel
- Technical Director, ODI
- data on wikipedia about last local elections
- data table + map
- all of this is hardcoded behind the scenes in table rows
- if you want to get hold of the data, you need to parse the html
- election results are often entered on wikipedia really quickly
- it’d be really cool to be able to get them out quickly too
- it’s also not great to have the same data duplicated
- could we reference the CSV data directly?
- we can do it with images: `<img src="...">`; why not with tables of data? `<table src="uk-local-election-summary-2015.csv">`
- reference source for party-to-colour mapping
- could bring it into your maps and tables
- it would help people presenting data
- improve quality of data available for us
- motivate machine-readable data
- motivate fixing of errors
- visualisations of your tabular data demonstrate errors very quickly!
- motivate publishers to give accurate metadata
- CSV on the Web @ W3C completed 2016
- building on and learning from:
- OKFN’s data packages / Tabular Data Format
- Google’s Dataset Publishing language
- national archives validation
- existing CSV parsers
- broad set of documented use cases & requirements
- CSV needs metadata
- “these columns contain numbers”
- “this column should be displayed as +/-”
- “this column is a pointer to this other table”
- the metadata needs to be in a separate file
- CSVW metadata standard
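- a minimal CSVW metadata sketch (the columns are hypothetical), expressing “this column contains numbers”:

```js
// the contents of a *-metadata.json file, shown as a JS object
const metadata = {
  '@context': 'http://www.w3.org/ns/csvw',
  url: 'uk-local-election-summary-2015.csv',
  tableSchema: {
    columns: [
      { name: 'party', titles: 'Party' },
      { name: 'seats', titles: 'Seats', datatype: 'integer' },
    ],
  },
};
```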
- people want to download CSV, not a zipped-up package or the JSON metadata
- the JSON metadata has a link to the CSV so you could discover it (in principle)
- but: normal people won’t do this
- the link generally needs to be to the CSV file itself
- how do we find the metadata?
- RFC 5988 link: `Link: <metadata.json>; rel="describedBy"; type="application/csvm+json"`
- often can’t control Link: headers though
- default filenames
- just add `-metadata.json` to the end of the csv filename
- there’s some geeky stuff about `/.well-known` if you care about that
- hard links
- you can (if you want) think of CSVs as being like a relational database
- foreign key relationships in your metadata
- soft links
- give a URL template
- http://example.org/party/{party}
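- sketches of both linking styles in the metadata (the filenames and columns are made up):

```js
// soft link: a column whose values point out onto the web via a URL template
const softLink = { name: 'party', valueUrl: 'http://example.org/party/{party}' };

// hard link: a database-style foreign key declared on the table schema
const hardLink = {
  foreignKeys: [{
    columnReference: 'party',
    reference: { resource: 'parties.csv', columnReference: 'id' },
  }],
};
```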
- CSV is on the boundary between these two worlds
- human variability in CSV headers
- “country” vs “Country”
- “unemployment” vs “Unemployment rate”
- CSVW metadata standard allows you to give different options for titles and indicate they mean the same thing
- locale-specific variation: `{en: country, de: Land}`
- formats for dates and numbers
- use standard number & date formats
- Unicode Technical Standard #35
- minimal set that MUST be implemented
- nothing that requires actually knowing languages
- eg names of months, currency units
- Implementations can do more
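- a sketch of how a column can carry title variants and format patterns (the values are illustrative):

```js
const column = {
  name: 'unemployment',
  // several titles can mean the same thing, per language
  titles: { en: ['unemployment', 'Unemployment rate'], de: 'Arbeitslosenquote' },
  datatype: { base: 'number', format: { pattern: '#,##0.0' } }, // UTS #35 style
};
const dateColumn = { name: 'polled', datatype: { base: 'date', format: 'dd/MM/yyyy' } };
```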
- Implementations
- validation
- conversion
- into JSON and into RDF
- authoring metadata
- not yet for display
- tables, maps, etc
- it’d be really cool to have some web component type stuff: `<table src="...">`
- annotation?
- navigation?
- https://www.w3.org/TR/tabular-data-primer
- sometimes there’s a value in the header (eg “election results 2014”). how do you deal with that?
- there is a facility for “virtual columns” for static information
- principal engineer, FT
- 800,000 subscribers
- company licences
- page analytics
- education
- when do you remove something from the front page because it’s becoming stale?
- email communication with users
- focus on the user’s needs
- learnable
- ease of use (APIs to get stuff in and out)
- iterative
- I work for the Internet Archive, but I’m not here to talk about that
- @tripofmice
- grimoire.org
- a book of magic spells and invocations - OED
- scope for this talk: 16th and 17th century, european christian tradition
- in this time:
- no clear divide between magic, religion, science
- cunning folk prevalent in Europe
- “low” magic
- common people, often illiterate
- medicine, divination, folk magic
- ceremonial magic
- “high” magic
- summoning angels, demons, spirits, fairies
- piously christian (sometimes, at least)
- witchcraft
- capital offence
- nobody self-identifies as a witch
- what’s okay vs a capital offence? what’s for scholars vs common people? it’s a bit woolly
- England, 1580
- Queen Elizabeth I
- John Dee
- some of his magical items are now in the British Museum
- William Shakespeare
- Prospero from The Tempest (based on John Dee?)
- Oberon from A Midsummer Night’s Dream
- grimoires offered spells to summon Oberon
- Pseudomonarchia Daemonum (1577)
- Lesser Key of Solomon (1641)
- King Solomon’s Temple
- Solomon was able to summon and control and use demons to help build his temple, aided by archangel Gabriel
- examples:
- agares
- crocell
- buer
- every demon is given a sigil, which is a calling card used to summon them
- summoning a demon is really involved
- elaborate circles
- if you get it wrong, you might get eaten
- crocell’s powers:
- make it sound like it’s raining
- run you a warm bath
- teach you geometry
- …
- that’s it!
- what are grimoires for?
- how do they get used?
- it’s tough to model in a relational database
- lots of many-many relationships (eg demon <-> grimoire)
- join tables
- I used neo4j to model this as a graph problem (see the sketch below)
- eg: glue to fix a porcelain vase (?!)
- advantages:
- designed for relationships & connections
- flexible
- no migrations
- disadvantages
- no schema for consistency
- non-performant for simple tabular data
- common use cases
- social networks
- public transport systems
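- a hedged sketch with the neo4j JavaScript driver; the labels and properties (Demon, Grimoire, APPEARS_IN) are my guesses at the model, not the talk’s:

```js
const neo4j = require('neo4j-driver');
const driver = neo4j.driver('bolt://localhost:7687',
                            neo4j.auth.basic('neo4j', 'password'));
const session = driver.session();

// the many-to-many demon <-> grimoire relationship needs no join table
const result = await session.run( // inside an async function
  `MATCH (d:Demon)-[:APPEARS_IN]->(g:Grimoire)
   WHERE d.name = $name
   RETURN g.title`,
  { name: 'Crocell' }
);
result.records.forEach(r => console.log(r.get('g.title')));
await session.close();
```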
- do any of these demons appear in paintings?
- don’t know
- what did people use these grimoires for?
- hard to know
- do you have a way to tell how comprehensive your dataset is?
- the complete dataset is borderline infinite
- there’s a finite number of grimoires that have survived and been translated into english
- you mentioned neo4j for pictorial representation. anything else for this purpose?
- no
- I have tables of spells and a timeline, but not much else in terms of data visualisation
- could you use this dataset to perform unsupervised learning to generate new spells or demons?
- sure why not
- @sarahtgold
- @projectsbyif
- government, politics, civics, …
- GDS
- currently: IF
- a design studio
- we make things that change how people think about data
- we are multidisciplinary
- product development
- design
- security
- we understand technology and design as disciplines which inform each other
- everything we do is centred on people
- people who understand the things they use make better decisions about how to use them
- more things are becoming data conscious
- more data being collected
- more things being connected to the internet
- it’s never been so cheap to put a chip in it
- IoT
- Internet of Shit
- @InternetOfShit
- there’s a lot of nonsense
- we are producing a lot of personal data
- phones, laptops, fitbits, etc
- data maximalism
- Ts & Cs are our default consent model
- and they don’t work
- samsung smart TV privacy policy: “Don’t talk in front of the TV”
- objects are becoming informants
- and they will betray us
- smart bins
- collecting MAC addresses of passers-by
- hyde park visitors covertly tracked via mobile phone data
- we don’t know if something is working properly
- http://androidvulnerabilities.org/
- terrifying graph of devices running vulnerable versions of android
- software is politics – Richard Pope
- gherkin syntax
- makerversity
- design for minimum viable data
- know which data type you’re designing with
- https://projectsbyif.github.io/data-permissions-catalogue
- data licences
- how do I licence my data? what do I care about?
- the more informed people are about the implications of tracking, the more likely they are to say no; how do companies which provide free services deal with this?
- it’s very complicated
- ad blockers
- not enough time to do this justice
- with instances like royal parks, they could give their patrons information about how useful their data has been
- professor of statistics at UBC
- @JennyBryan @STAT545
- it’s nice to be allowed to talk about spreadsheets for once
- people like to moan about them
- slides (with references!) https://github.com/jennybc/2016-05_csvconf-spreadsheets
- inspiration: csv,conf,v1 talk Felienne Hermans “Spreadsheets are code”
- it’s okay to care about spreadsheets!
- how I pick people to work with:
- venn diagram overlap of (crazy technically competent ∩ intellectually generous, loves gifs)
- => Rich Fitzjohn
- https://github.com/richfitz/jiffy
- inequality is toxic in a whole lot of contexts
- in this case: ability to do what you want with data
- there’s this “data 1%”
- anything we want to do, we know how, or how to figure it out, or how to find someone who knows
- lots of people I teach at UBC are much less able to get these things done, feel paralysed
- down with software elitism
- up with the last mile of data munging
- I supported myself for ~4 years doing spreadsheets
- I was doing a management consulting gig
- during grad school I supported myself doing high-end excel work
- there’s a lot you can do with these consumer-level tools
- I’d like to create a more porous border between spreadsheets and R/python/etc
- https://twitter.com/tomaspetricek/status/687947134088392704
- “Ouch. “50 million accountants use monads in Excel. They just don’t go around explaining monads to everyone…” @Felienne #ndclondon”
- reactivity is one of the main things people love about spreadsheets
- spreadsheets have pushed computer science to deal with reactivity
- i was talking on a podcast about the future of spreadsheets and whether they will go away; i felt reactivity was key
- with R, I write a Makefile to rebuild everything from scratch
- but I still have to kick this thing
- spreadsheets also have less syntax bullshittery
- argument names, separators, etc
- you can just select things with your mouse and click “average”
- FACTS!
- about 1 billion people use MS Office
- about 650 million people use spreadsheets
- up to half use formulas
- …
- 250k - 1m use R
- 1-5m use Python
- you go into data analysis with the tools you know, not the tools you need
- what you think people are doing ≠ what you think people should be doing ≠ what people are actually doing
- most tools are designed for the middle thing (what you think people should be doing)
- The Enron Corpus
- “the Pompeii of spreadsheets”
- 600k emails
- 15k spreadsheets
- example:
- some cells are data
- some are formulas
- some are phone numbers
- visualizations
- spreadsheets within spreadsheets (ie a rectangular group of cells)
- Hermans, Murphy-Hill (research paper on the corpus)
- prevalence of formulas
- prevalence of unique formulas
- http://www.felienne.com/archives/3634
- lots of colour
- data and formatting blurred together
- font choice and colour of cell gives you a categorical variable
- inconsistency between rows and columns
- references to other spreadsheets, that you don’t have
- columns of intermediate computations are so boring, so they get hidden
- http://xkcd.com/1667/
- machine readable & human readable
- (see JeniT’s keynote further up)
- a spreadsheet is often neither machine nor human readable
- technically, yes you can open them and look at them
- but a machine cannot get useful data out in an unsupervised, scalable way
- and a human reading someone else’s spreadsheet is like reading another person’s codebase
- spreadsheets are (data ∩ formatting ∩ programming logic)
- but often we only care about one or two of these concerns
- (can we separate them after the fact?)
- what are the problems?
- which ones can we solve?
- with training?
- sometimes people use spreadsheets for inappropriate things and we can train them to stop it
- with tooling?
- (just a subset; not all problems can be solved with tooling)
- two angles:
- create new spreadsheet implementations that use, eg, R or python for computation and visualization
- anticipate version control, collaboration
- AlphaSheets
- stencila
- accept spreadsheets as they are
- create tools to get goodies out
- maybe write back into sheets?
- `googlesheets` R package
- (google sheets are much less common than excel, but they’re still reasonably common)
- goal: spreadsheet reading tools in R
- with no non-R dependency
- Book: Spreadsheet implementation technology
- what are the interesting differences between excel and google sheets (for ingesting data)?
- the excel spec is 6000 pages long; the google sheets spec is 0 pages long
- I wish there was something in between
- they’re both very verbose xml
- not really big differences in parsing
- google sheets has to chase excel and be super compatible with excel
-
- getting UK government to publish data on all their spending
- in CSV format
- with a spec
- defined columns
- but: problems
- 401 html document saved as csv :/
- friction
- containerization for data
- docker docker docker
- key principles
- simplicity
- web oriented
- existing tools
- open
- validation
- a success story from the previous csv,conf
- ONS produces thousands of spreadsheets each year on our website
- we’re getting more efficient at it
- the underlying structures no longer exist for us to get that data in a machine-readable way
- we’ve gotten so good at producing these spreadsheets but neglected the source data
- we have CSVs, but “we can’t publish that on the website”
- I can’t do my aggregation in there
- how do we get to a point where we publish CSVs?
- scraperwiki + ONS at csv,conf,v1
- Dragon Dave McKee’s talk on XYPath
- version 1
- python
- command-line
- not pretty but functional
- example
- spreadsheet with merged cells, multiple tabs, hidden columns, etc etc (see Jenny Bryan’s keynote above)
- we set up some recipes to instruct Data Baker:
- what files we want to look at
- where the data is
- what transformations we want to do
- run the command
- slurp in the .xls files
- generates some output .xls files
- one output: a colour-coded .xls file to show how the data was sliced up
- sanity check to make sure we’re doing it right
- code! https://github.com/scraperwiki/databaker
- the Janelia Research Campus (“the bell labs of neuroscience”)
- northern virginia
- research institute, non-profit funded
- there’s a lot we don’t know
- try talking to fifth graders!
- “how is it that I can hear a phone number and the next day I still remember that phone number?”
- “why do I always dream about robots and dinosaurs?”
- mice as a model
- two-photon imaging
- we often want to analyse data as quickly as possible to drive decisions about what experiment to do next
- random access two photon mesoscope
- rich data patterns of brain activity
- the 80/20 problem
- time spent doing incredible measurements
- time spent doing other stuff
- used to be 80% data gathering & experimental research; 20% analysis
- now, it’s all changed; only 20% doing actual science
- analysis isn’t a linear process
- lots of backtracking and dead ends
- lots of reinventing the wheel between different labs
- no sharing of infrastructure
- often no source control
- goal: lots of modules that solve well-defined small problems, that can be glued together
- eg thunder project & bolt-project
- thunder: a collection of modules for image and time series data analysis
- neurofinder.codeneuro.org
- analysing a picture and determining which groups of pixels correspond to neurons
- a really common neuroscience problem!
- but every lab has come up with their own independent way of doing it
- website to allow people to submit results from their algorithms (against training and testing datasets)
- (Question: why didn’t you use kaggle?
- this seemed like a simple enough problem to solve for ourselves rather than buying into the kaggle space
- we originally thought about having people submit code and run it in a container but running matlab in a container is somewhere between difficult and illegal)
- lightning-viz.org – modular visualization things
- https://github.com/mikolalysenko/regl
- webgl and 3d is a really important part of the future of scientific visualization
- the 1 to 2 problem:
- starting collaboration between two individuals
- jupyter notebooks
- https://github.com/sofroniewn/tactile-coding
- github is great for sharing code (and to some degree, data)
- it doesn’t solve the problem of making an environment usable on someone else’s machine
- can we use things like docker to take jupyter notebooks and data and code and bundle them all together?
- before binder, we had to repeat the complex setup process each time
- mybinder.org
- tell us a github repo
- has to have a certain set of contents
- code needed to run your notebooks
- some metadata
- (not required: a complete Dockerfile)
- builds a docker image
- then embed a button in your github repo
- the button launches into a running environment
- what’s the value in being able to reproduce someone else’s analysis?
- if someone can rerun this and, as a result, start a collaboration, that’s really cool
- buzzfeed made a binder to analyse refugee data
- data relevant for policy decisions: we should have access
- the analysis should be open too
- binder doesn’t address data sharing
- you can put it in a github repo
- but it’s not a wonderfully sustainable solution
- dat sounds really cool though! http://dat-data.com
- Question: nick had a live image render in a jupyter notebook – how do you do that?
- the data comes off the microscope
- goes directly to the machines in a cluster
- crunching happens
- then gets absorbed into html rendering in the notebook
- mouse VR
- data from neurons as a mouse’s whiskers get closer or further from a wall
- hexaworld
- what do you do about describing the data? where did it come from? when was it measured?
- almost no coordination of metadata right now in neuroscience
- I don’t know how to get two postdocs in the same lab to coordinate on data
- @CallMeAlien
- developer advocate, @CodeForAfrica
- a civic tech organization
- works to empower citizens by giving them access to information
- call for action: build more tools that directly impact the communities we live in
- access to proper healthcare is a basic human right; but the WHO estimates about a third of the world’s population has no access to the most basic medicines
- in Kenya, quack doctors are very common
- story: my boss (from south africa) had a business trip to kenya
- got really sick, sought medical advice, got treated, felt better, returned to SA
- then got even worse
- visited his regular family doctor
- SA requested medical records from kenyan treatment
- when the SA doctor’s office contacted the kenyan doctor’s office, it turned out the “doctor” was in fact a vet
- a lot of people in rural africa or south east asia struggle to access doctors
- how sure are they that they’re seeing a registered practitioner?
- Code For Africa collaborated with The Star, the largest blue-collar newspaper
- http://bit.ly/starHeatlh
- enter the name of the town you’re in
- get a list of medical practitioners you can see, what their speciality is, what clinics they are in
- story: a woman went to the police and reported she had been drugged and raped by an alleged gynaecologist
- it hit the news, then many more women came forward
- it turned out he was a quack doctor; he wasn’t even registered
- just put up a sign
- and people trusted him with their lives
- public outcry
- The Star started publicising the platform and people started using it
- Kenya Medical Practitioners and Dentists Board is the authority
- published the list across >300 web pages
- websites are not universally accessible
- a lot of people still have feature phones
- our service has an SMS interface
- text us a request and we can tell you details about specific doctors
- we don’t just take the data from the government; we also validate and report errors back to the government
- it’s now been replicated by a newsroom in Nigeria
- they’ve started adding medicine prices too
- is the data available too?
- yes it’s available, I can point you to the github
- re: sms delivery: how do people submit the names?
- people submit a name
- we have to do some normalization to allow variability “D” “Dr” “Doctor” etc
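- a hypothetical sketch of that normalization (the name is made up):

```js
function normalizeQuery(q) {
  return q
    .trim()
    .replace(/^(dr|d|doctor)\.?\s+/i, '') // strip "D", "Dr", "Doctor" prefixes
    .toLowerCase();
}

normalizeQuery('Dr. Mwangi');    // -> "mwangi"
normalizeQuery('Doctor Mwangi'); // -> "mwangi"
```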
- another issue: the database only has 11,000 doctors
- we have 44 million people in kenya!
- either we have only 1 doctor per 4000 people (far too low!)
- or there are many many unregistered doctors (also bad!)
- could you look at geographical variability? eg pockets of countries with low coverage
- yes, definitely
- how do you keep the data up to date?
- the scrapers are automated
- re-scrape on a weekly basis
- in January this year, we realised that the scrapers themselves needed updating
- it’s a constant gardening effort
- have you reached out to the organization to see if you could get a data dump?
- there’s a big trend in kenya (#dodgydoctors hashtag, and another swahili hashtag)
- people are calling for all government services to have SMS interfaces
- it’s a bit complicated to get the data from the government
- https://github.com/CodeForAfrica/theStarHealth