"user_username","article_url","image_count","post_tags","recommends","reading_time","title","text","link_count" | |
"neuroecology","https://medium.com/@neuroecology/punctuation-in-novels-8f316d542ec4","22","{Writing,Literature,""Data Visualization""}","2670","3.67641509433962","Punctuation in novels","","1" | |
"eklimcz","https://medium.com/truth-labs/designing-data-driven-interfaces-a75d62997631","14","{""Data Visualization"",""Design Thinking"",UX}","2660","7.83867924528302","Designing Data-Driven Interfaces","","2" | |
"quincylarson","https://medium.com/free-code-camp/the-economics-of-working-remotely-28d4173e16e2","5","{Tech,""Life Lessons"",""Data Science"",Travel,Startup}","2068","3.95786163522013","Fitter. Happier. More productive. Working remotely.","Travel the world as a digital nomad. Surf a new beach every morning. Eat a different local cuisine each night. | |
Or just stay home all day in your pajamas. | |
It doesn’t really matter. You can get your work done either way. | |
More than 10% of Americans now work remotely. | |
I’m one of them. After 10 years of working in an office, I’ve had the luxury of working out of my closet for the past 5 years. I could rave until I’m blue in the face about how great it is. But talk is cheap. Let’s look at data. | |
What are the economic costs and benefits of working remotely? | |
We looked at more than 36,000 salaries of developers working both remotely and on-site from Stack Overflow’s 2016 dataset. Here’s what we found: | |
The 10,583 developers worldwide who reported working remotely some or all of the time had a median salary of US $55,000, with a standard deviation of $51,200.
This was significantly higher than the salaries of the 25,413 non-remote developers. Their median salary was just $45,000, with a standard deviation of $44,727. | |
When you exclude developers outside of the United States, this wage difference remains. The 3,200 US-based remote developers had a median salary of $105,000 (with a standard deviation of $47,400), versus their 6,461 non-remote counterparts, whose median salary was only $95,000 (with a standard deviation of $42,977).
People who work remotely are most represented at the top and bottom ends of the pay scale. | |
Freelance developers who accept contracts through relatively low-paying channels like Upwork tend to be remote. | |
At the same time, developers with more than ten years of experience are twice as likely to work remotely as newcomers. | |
Developers of the caliber that commands salaries of more than $200,000 have a lot of bargaining power. When they demand to work remotely, employers are more likely to yield.
Anecdotally, I’ve met plenty of high-paid developers who live up in the mountains, or who jet around the world, living out of hotels. You would have to pay them a small fortune to convince them to work out of an office, and even then they probably wouldn’t last long. | |
In 2012, Stanford researchers set out to better understand remote work and its implications. They conducted a 9-month study of 249 service sector workers — the largest academic study of remote work ever. | |
Here are the key findings: | |
One downside the study found was that remote workers were 50% less likely to be promoted than their on-site counterparts. This may mean that it’s harder for bosses to form relationships with you if you work remotely. | |
Working remotely also offers some major cost-savings — both for employees and their employers. | |
In many cities, workers spend more than an hour per day commuting to and from their office. | |
Not having to commute saves you time. It also saves you the monetary cost of mass transit, or gas and depreciation on a car. | |
You also have to spend a lot of time getting ready for work. For many people, this means dressing up in a suit and/or putting on makeup — tasks you can skip if you’re working remotely. | |
My unscientific estimate (because no comprehensive study has yet been conducted) is that about a tenth of an on-site worker’s income — and a tenth of their time — would be immediately freed-up if they could work remotely. | |
Companies should also take note that during the Stanford study, the employer saved $2,000 per employee in furniture and office space costs.
And don’t forget parking. In San Francisco, for example, it costs an average of $38,000 to create a single new parking space for an employee. | |
The decision to work remotely is a highly personal one. | |
My goal here is to share the facts. | |
Instead of gushing about how great I personally think it is to work remotely, I recommend you read this excellent book: | |
It’s by journalist Scott Berkun, who managed a remote team for a year at Automattic (the all-remote company behind WordPress). | |
The Year Without Pants is filled with interesting first-hand anecdotes. It will give you a clear idea of what it’s like to work remotely. | |
Working remotely isn’t for everyone. But it is a viable option. And there are some compelling economic reasons why it might make sense for you.
If you liked this, click the 💚 below. Follow me and Free Code Camp for more articles on technology.","1" | |
"quincylarson","https://medium.com/free-code-camp/what-i-learned-from-analyzing-the-top-253-medium-stories-of-2016-9f5f1d0a2d1c","3","{Medium,Writing,""Data Science"",""Social Media"",""Life Lessons""}","1883","7.65566037735849","What I learned from analyzing the top 252 Medium stories of 2016","Medium may be struggling to find a sustainable business model, but they have years worth of funding left, and more readers than ever. | |
Medium isn’t going away any time soon. So instead, let’s focus on how you can write stories that readers will find helpful here in 2017. | |
I pulled down the top 252 stories of 2016 — all of which had at least 2,500 recommendations from Medium’s readers — and analyzed the dataset. | |
To put things in perspective, writers published 7,500,000 stories on Medium last year. So this dataset represents the most popular 0.00336% of the stories published in 2016.
Together, these 252 stories racked up 1,033,961 recommendations. That’s a lot of green hearts. | |
Here are some things I learned from my time with this dataset that can help you reach a wider audience for your writing in 2017. | |
169 different writers published at least one of these top-252 stories. Some of those writers had multiple top stories.
Here are the people who wrote more than one top-252 story: | |
The only person on this list whom I’d heard of before reading their work here on Medium is Chris Dixon, a well-known tech blogger. | |
You may recognize some of these names if you’re in their field, but I doubt you’d recognize these people if you ran into them at the supermarket. They may be “internet famous,” but they’re a far cry from household names.
If you can write consistently useful stories and gradually build a following, you can crack this list, too. | |
Here are the most common tags among these top 252 stories: | |
The following topics — which are the basis of most popular magazines — occurred zero times: | |
“Humor” occurred 9 times, and “satire” 5 times. But that’s about it. | |
It seems that most people read Medium to: | |
Judging by this dataset, the stereotype of Medium’s readership as developer-designer-hustlers isn’t all that far from the truth. | |
A vast majority of the top 252 stories were published in one of Medium’s publications. | |
If you think about this for a moment, it makes perfect sense. These stories showed up not only in the news feeds of the readers who followed their authors, but also of readers who followed the publication.
And some of these publications have a lot of followers. | |
Here’s a lexical analysis of the most common words in the titles of the top 252 stories. I’ve filtered out stop words like “the” and “of.” | |
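The gist doesn’t include the code behind this lexical analysis. As a rough sketch of the approach (counting title words after removing stop words), something like the following would do; the titles and stop-word list below are placeholder examples, not the actual dataset.

from collections import Counter

# Placeholder titles; the real analysis ran over all 252 story titles.
titles = [
    "How to write a story people will actually read",
    "What I learned from a year of writing online",
]

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
stop_words = {"the", "of", "a", "an", "and", "to", "in", "from"}

word_counts = Counter(
    word
    for title in titles
    for word in title.lower().split()
    if word not in stop_words
)

print(word_counts.most_common(10))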
The words “you” and “I” were by far the most common, which suggests that addressing the reader directly as an individual person is a better writing strategy than writing in third person. | |
The most common words that fell outside of the 100 most common English language words were “life” and “design.” | |
Speaking of English, all but three of the top 252 stories were written in English. | |
Many people complain about the abundance of profanity in Medium headlines. | |
While it’s certainly present, the “F” word and its variants only occurred 13 times in top-252 headlines, and the “S” word only occurred 3 times. | |
Only 23 of the top 252 stories were explicitly “listicles” — bullet-point-driven stories.
These stories have headlines that follow the pattern of “[number] things you should do before you [time].” | |
Full disclosure: one of these listicles is my story about Linux turning 25.
But overall, I’d say the decline of listicles is a good thing.
My personal advice to writers is to focus on stories that dive deep into a single topic. | |
The top 252 stories averaged 6.7 minutes in length — the same length that Medium’s data science team determined was optimal back in 2014. | |
Only 16 of the top stories didn’t have any images. | |
The median number of images a story included was 3. | |
Don’t worry about over-doing it with images, either. 11% of stories used 10 or more images, and two of them used more than 50 images. | |
The median number of followers these authors had as of year’s end was 6,809. | |
Even if you don’t yet have a lot of followers, there’s still hope of cracking 2017’s top stories. 29 of the authors had less than 2,000 followers, which you can reach in a matter of months if you manage to write a few popular stories. | |
The best way to get people to follow you is to remind them to follow you. | |
If someone reads all the way to the bottom of your article, it’s fair (and well within Medium’s terms of service) to remind them to follow you. | |
Only 6 of the top stories had disabled responses. | |
Imagine someone’s reading your story and thinks of something insightful to add. They scroll to the bottom of your story only to discover that they can’t share their thoughts because you’ve disabled responses. | |
Are they going to recommend your story? I sure wouldn’t. | |
Don’t hinder the discourse around your story. Allow your readers to respond to you. | |
A huge thanks to Levent Aşkan, who put in the time compiling these top 252 Medium stories. You can read his story about them here. | |
Also, thanks to Kande Bonfim for further expanding upon Levent’s dataset. | |
And if you’re interested in getting more readers in 2017, check out my tips for writing stories on Medium that people will read.
And my unofficial style guide for Medium: | |
Cheers, and happy writing!","7" | |
"davidventuri","https://medium.com/free-code-camp/if-you-want-to-learn-data-science-start-with-one-of-these-programming-classes-fb694ffe780c","9","{Education,Programming,""Data Science"",""Learning To Code"",Technology}","1760","13.1320754716981","If you want to learn Data Science, start with one of these programming classes","A year ago, I was a numbers geek with no coding background. After trying an online programming course, I was so inspired that I enrolled in one of the best computer science programs in Canada. | |
Two weeks later, I realized that I could learn everything I needed through edX, Coursera, and Udacity instead. So I dropped out. | |
The decision was not difficult. I could learn the content I wanted faster, more efficiently, and for a fraction of the cost.
I already had a university degree and, perhaps more importantly, I already had the university experience. Paying $30K+ to go back to school seemed irresponsible. | |
I started creating my own data science master’s degree using online courses shortly afterwards, after realizing it was a better fit for me than computer science. I scoured the introduction to programming landscape. I’ve already taken several courses and audited portions of many others. I know the options, and what skills are needed if you’re targeting a data analyst or data scientist role. | |
For this guide, I spent 20+ hours trying to find every single online introduction to programming course offered as of August 2016, extracting key bits of information from their syllabi and reviews, and compiling their ratings. For this task, I turned to none other than the open source Class Central community and its database of thousands of course ratings and reviews. | |
Since 2011, Class Central founder Dhawal Shah has kept a closer eye on online courses than arguably anyone else in the world. Dhawal personally helped me assemble this list of resources. | |
Each course had to fit four criteria: | |
We believe we covered every notable course that exists and which fits the above criteria. Since there are seemingly hundreds of courses on Udemy in Python and R, we chose to consider the most reviewed and highest rated ones only. There is a chance we missed something, however. Please let us know if you think that is the case. | |
We compiled average rating and number of reviews from Class Central and other review sites. We calculated a weighted average rating for each course. If a series had multiple courses (like Rice University’s Part 1 and Part 2), we calculated the weighted average rating across all courses. We also read text reviews and used this feedback to supplement the numerical ratings. | |
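The article doesn’t spell out the weighted-average formula. A reasonable reading is an average of course ratings weighted by review counts, as in this small sketch (the numbers are invented for illustration):

def weighted_average_rating(courses):
    # courses: a list of (rating, num_reviews) pairs compiled from review sites.
    total_reviews = sum(n for _, n in courses)
    return sum(rating * n for rating, n in courses) / total_reviews

# e.g. a two-part series rated 4.9 (100 reviews) and 4.7 (50 reviews)
print(round(weighted_average_rating([(4.9, 100), (4.7, 50)]), 2))  # 4.83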
We made subjective syllabus judgment calls based on three factors: | |
Programming is not computer science and vice versa. There is a difference of which beginners may not be acutely aware. Borrowing this answer from Programmers Stack Exchange: | |
The course we are looking for introduces programming and optionally touches on relevant aspects of computer science that would benefit a new programmer in terms of awareness. Many of the courses considered, you’ll notice, do indeed have a computer science portion. | |
None of the courses, however, are strictly computer science courses, which is why something like Harvard’s CS50x on edX is excluded. | |
University of Toronto’s “Learn to Program” series on Coursera. LTP1: The Fundamentals and LTP2: Crafting Quality Code have a near-perfect weighted average rating of 4.81 out of 5 stars. They also have a great mix of content difficulty and scope for the beginner data scientist. | |
This free, Python-based introduction to programming sets itself apart from the other 20+ courses we considered. | |
Jennifer Campbell and Paul Gries, two associate professors in the University of Toronto’s department of computer science (which is regarded as one of the best in the world) teach the series. The self-paced, self-contained Coursera courses match the material in their book, “Practical Programming: An Introduction to Computer Science Using Python 3.” LTP1 covers 40–50% of the book and LTP2 covers another 40%. The 10–20% not covered is not particularly useful for data science, which helped their case for being our pick. | |
The professors kindly and promptly sent me detailed course syllabi upon request, which were difficult to find online prior to the course’s official restart in September 2016. | |
Timeline: 7 weeks | |
Estimated time commitment: 6–8 hours per week | |
This course provides an introduction to computer programming intended for people with no programming experience. It covers the basics of programming in Python including elementary data types (numeric types, strings, lists, dictionaries, and files), control flow, functions, objects, methods, fields, and mutability. | |
Modules | |
Learn to Program: Crafting Quality Code (LTP2) | |
Timeline: 5 weeks | |
Estimated time commitment: 6–8 hours per week | |
You know the basics of programming in Python: elementary data types (numeric types, strings, lists, dictionaries, and files), control flow, functions, objects, methods, fields, and mutability. You need to be good at these in order to succeed in this course. | |
LTP: Crafting Quality Code covers the next steps: designing larger programs, testing your code so that you know it works, reading code in order to understand how efficient it is, and creating your own types. | |
Modules | |
Associate professor Gries also provided the following commentary on the course structure: “Each module has between about 45 minutes to a bit more than an hour of video. There are in-video quiz questions, which will bring the total time spent studying the videos to perhaps 2 hours.” | |
These videos are generally shorter than ten minutes each. | |
He continued: “In addition, we have one exercise (a dozen or two or so multiple choice and short-answer questions) per module, which should take an hour or two. There are three programming assignments in LTP1, each of which might take four to eight hours of work. There are two programming assignments in LTP2 of similar size.” | |
He emphasized that the estimate of 6–8 hours per week is a rough guess: “Estimating time spent is incredibly student-dependent, so please take my estimates in that context. For example, someone who knows a bit of programming, perhaps in another programming language, might take half the time of someone completely new to programming. Sometimes someone will get stuck on a concept for a couple of hours, while they might breeze through on other concepts … That’s one of the reasons the self-paced format is so appealing to us.” | |
In total, the University of Toronto’s Learn to Program series runs an estimated 12 weeks at 6–8 hours per week, which is about standard for most online courses created by universities. If you prefer to binge-study your MOOCs, that’s 72–96 hours, which could feasibly be completed in two to three weeks, especially if you have a bit of programming experience. | |
If you already have some familiarity with programming, and don’t mind a syllabus that has a notable skew towards games and interactive applications, I would also recommend Rice University’s An Introduction to Interactive Programming in Python (Part 1 and Part 2) on Coursera. | |
With nearly 3,000 reviews and the highest weighted average rating of 4.99/5 stars, this popular course is noted for its engaging videos, challenging quizzes, and enjoyable mini projects. It’s slightly more difficult than our #1 pick, focuses less on the fundamentals, and spends more time on topics that aren’t applicable to data science.
These courses are also part of the 7 course Principles in Computing Specialization on Coursera. | |
The materials are self-paced and free, and a paid certificate is available. Access to graded materials requires purchasing the course for $79 (USD).
The condensed course description and full syllabus are as follows: | |
“This two-part course is designed to help students with very little or no computing background learn the basics of building simple interactive applications … To make learning Python easy, we have developed a new browser-based programming environment that makes developing interactive applications in Python simple. These applications will involve windows whose contents are graphical and respond to buttons, the keyboard, and the mouse. | |
Recommended background: A knowledge of high school mathematics is required. While the class is designed for students with no prior programming experience, some beginning programmers have viewed the class as being fast-paced. For students interested in some light preparation prior to the start of class, we recommend a self-paced Python learning site such as codecademy.com.” | |
Timeline: 5 weeks | |
Estimated time commitment: 7–10 hours per week | |
Week 0 — statements, expressions, variables Understand the structure of this class, and explore Python as a calculator. | |
Week 1 — functions, logic, conditionals Learn the basic constructs of Python programming, and create a program that plays a variant of Rock-Paper-Scissors. | |
Week 2 — event-driven programming, local/global variables Learn the basics of event-driven programming, understand the difference between local and global variables, and create an interactive program that plays a simple guessing game. | |
Week 3 — canvas, drawing, timers Create a canvas in Python, learn how to draw on the canvas, and create a digital stopwatch. | |
Week 4 — lists, keyboard input, the basics of modeling motion Learn the basics of lists in Python, model moving objects in Python, and recreate the classic arcade game “Pong.” | |
Week 5 — mouse input, list methods, dictionaries Read mouse input, learn about list methods and dictionaries, and draw images.
Week 6 — classes and object-oriented programming Learn the basics of object-oriented programming in Python using classes, and work with tiled images.
Week 7 — basic game physics, sprites Understand the math of acceleration and friction, work with sprites, and add sound to your game. | |
Week 8 — sets and animation Learn about sets in Python, compute collisions between sprites, and animate sprites. | |
If you are set on an introduction to programming course in R, we recommend DataCamp’s series of R courses: Introduction to R, Intermediate R, Intermediate R — Practice, and Writing Functions in R. Though the latter three come at a price point of $25/month, DataCamp is best in category for covering the programming fundamentals and R-specific topics, which is reflected in its average rating of 4.5/5 stars. | |
We believe the best approach to learning programming for data science using online courses is to do it first through Python. Why? There is a lack of MOOC options that teach core programming principles and use R as the language of instruction. We found six such R courses that fit our testing criteria, compared to twenty-two Python-based courses. Most of the R courses didn’t receive great ratings and failed to meet most of our subjective testing criteria. | |
The series breakdown is as follows: | |
Estimated time commitment: 4 hours | |
Chapters: | |
Estimated time commitment: 6 hours | |
Chapters: | |
Estimated time commitment: 4 hours | |
This follow-up course on intermediate R does not cover new programming concepts. Instead, you will strengthen your knowledge of the topics in intermediate R with a bunch of new and fun exercises. | |
Estimated time commitment: 4 hours | |
Chapters: | |
Another option for R would be to take a Python-based introduction to programming course to cover the fundamentals of programming, and then pick up R syntax with an R basics course. This is what I did, but I did it with Udacity’s Data Analysis with R. It worked well for me. | |
You can also pick up R with our top recommendation for a statistics class, which teaches the basics of R through coding up stats problems. | |
Our #1 and #2 picks had a 4.81 and 4.99 star weighted average rating over 269 and 2,982 reviews, respectively. Let’s look at the other alternatives. | |
This is the first of a six-piece series that covers the best MOOCs for launching yourself into the data science field. It will cover several other data science core competencies: statistics, data analysis, data visualization, and machine learning. | |
The final piece will be a summary of those courses, and the best MOOCs for other key topics such as data wrangling, databases, and even software engineering. | |
If you’re looking for a complete list of Data Science MOOCs, you can find them on Class Central’s Data Science and Big Data subject page. | |
If you have suggestions for courses I missed, let me know in the responses! | |
If you found this helpful, click the 💚 so more people will see it here on Medium.","4" | |
"quincylarson","https://medium.com/free-code-camp/we-asked-15-000-people-who-they-are-and-how-theyre-learning-to-code-4104e29b2781","24","{""Data Science"",Design,Tech,Startup,""Life Lessons""}","1465","4.33490566037736","We asked 15,000 people who they are, and how they’re learning to code","More than 15,000 people responded to the 2016 New Coder Survey, granting researchers an unprecedented glimpse into how adults are learning to code. | |
We’ve released the entire dataset of participants’ individual responses to all 48 questions — under the Open Data Common License — on a public GitHub repository. | |
In the coming weeks, we’ll publish a website filled with interactive visualizations of these data, answering dozens of questions like: | |
In the meantime, here are a few high-level statistics from the 2016 New Coder Survey results to tide you over. | |
CodeNewbie and Free Code Camp designed the survey, and dozens of coding-related organizations publicized it to their members. | |
Of the 15,655 people who responded to the survey: | |
We’ve cleaned and normalized all 15,655 records. Our community is using these to build data visualizations that answer a range of different questions. | |
If you have a question about people who are learning to code, create a GitHub issue and we’ll see if we can build an interactive visualization that answers it. | |
If you’re interested in analyzing these data and/or building some visualizations, join our Data Science chat room and introduce yourself. | |
If you liked this, click the💚 below. Follow me and Free Code Camp for more articles on technology.","2" | |
"moyicat","https://medium.com/graphiq-engineering/finding-the-right-color-palettes-for-data-visualizations-fcd4e707a283","22","{Design,Colors,""Data Visualization""}","1170","7.0688679245283","Finding the Right Color Palettes for Data Visualizations","While good color palettes are easy to come by these days, finding the right color palette for data visualizations is still quite challenging. | |
At Graphiq, things are arguably made even more difficult, as we need to convey information across thousands of unique data sets in many different types of visualization layouts. | |
Rather than diving in head first and creating our own color palette, we started by conducting some research on existing color palettes around the web. Surprisingly, we found that few are actually designed for complex charts and data visualizations. We identified several reasons as to why we couldn’t use existing color palettes: | |
Many of the color palettes we looked at were not designed for visualizations. Not only do they not vary enough in brightness, but they were often not created with accessibility in mind. Flat UI Colors is one of the most widely used color palettes out there, and it’s easy to see why: it looks great. But, as its name indicates, it’s designed for user interfaces. Those who are color blind may find it difficult to interpret a data visualization that uses the Flat UI palette: | |
Another problem is that many existing color palettes did not have enough colors. When building Graphiq visualizations, we need a palette that offers at least six colors, if not eight to twelve colors, to cover all of our use cases. Most color palettes we looked at did not provide enough options.
Here are a few examples from Color Hunt: | |
While they are good color palettes, they are not flexible enough to present complex data series. | |
But wait a second, there are color palettes that are like gradients — theoretically one can create any number of colors from that, right? | |
Unfortunately, there’s often not enough variation in brightness, and many of them would become indistinguishable very quickly, like these ones, also from Color Hunt: | |
Let’s just try taking the first one and extending it to ten data series:
I’d be surprised if the average user could correctly distinguish the colors in the visualization and match them up to the labels in the legend, especially among the four greens on the left-hand side.
At Graphiq, we think, eat and breathe data, and we invested a lot of time in finding not one, but multiple color palettes that worked for our visualizations. We learned a lot during this process, and we wanted to share three rules we’ve discovered for generating flexible color palettes: | |
To make sure color palettes are extremely accessible and easy to distinguish, they must vary enough in brightness. Differences in brightness are universal. Take any monochromatic color palette and test how it looks in Protanopia, Deuteranopia, and grayscale mode. You’ll quickly be able to tell how accessible this palette is. | |
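As a hedged sketch of that kind of check (not Graphiq’s actual tooling), you can approximate each color’s perceived brightness and look at how widely the values are spread:

def approx_luminance(hex_color):
    # Approximate perceived brightness of an sRGB hex color (0 = black, 1 = white),
    # using the common Rec. 709 luma coefficients as a rough grayscale conversion.
    r, g, b = (int(hex_color.lstrip("#")[i:i + 2], 16) / 255 for i in (0, 2, 4))
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

# A hypothetical palette; a wide spread of values suggests it will still be
# distinguishable in grayscale or for color-blind readers.
palette = ["#f4d35e", "#ee964b", "#c1666b", "#6d466b", "#2d2a32"]
print([round(approx_luminance(c), 2) for c in palette])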
However, having a palette that varies only in brightness may not be enough. The more variance you can have in the color palette, the easier it is for users to map your data series to the visualization. If we can utilize the change in hue for people who are not color blind, it will delight them even more. | |
And for both the brightness and the hue, the wider range you can find, the more data series you can support. | |
There’s a secret that designers know which is not always immediately intuitive to left-brained folks: Not all colors are created equal. | |
From a purely mathematical standpoint, a color progression that transitions from light purple to dark yellow should feel roughly similar to a transition from light yellow to dark purple. But as we can see below, the former feels natural, and the latter not so much.
This is because we’ve been conditioned by gradients that consistently appear in nature. We see bright yellow transition into dark purple in gorgeous sunsets, but there’s really no place on earth where you can see a light purple transition into a dark brownish yellow. | |
The same goes for a light green transitioning to a purplish blue, a light dry yellow to a dark green, an orangey brown to a cold gray, and more.
Because we see these natural gradients all the time, they feel familiar and pleasant when we see a corresponding palette used in a visualization. | |
Gradient palettes that incorporate different hues offer the best of both worlds. Whether you need 2 colors or 10 colors, colors can be strategically extracted from these gradients to produce a visualization that feels natural, but also has enough variation in hue and brightness. | |
It’s not easy to switch to a gradient mindset, but a good way to start is by setting up grid lines at the breakpoints for each number of data series in Photoshop and constantly testing the gradient and making tweaks. Here’s a snapshot of the process we used to perfect our gradients: | |
As you can see, we place our color palettes at the top against grayscale, tweak the gradient overlays (so we can get the exact transition code later), and select colors from those breakpoints to test how the palette would work in real life. | |
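A much-simplified sketch of that idea, using plain linear interpolation over a two-stop gradient (a real palette gradient would use several hue stops and careful tweaking), might look like this:

def lerp_color(c1, c2, t):
    # Linearly interpolate between two RGB colors, each a tuple of 0-255 ints.
    return tuple(round(a + (b - a) * t) for a, b in zip(c1, c2))

def sample_gradient(start, end, n):
    # Pick n evenly spaced colors from a simple two-stop gradient.
    if n == 1:
        return [start]
    return [lerp_color(start, end, i / (n - 1)) for i in range(n)]

# e.g. pull six colors from a light-yellow-to-dark-purple gradient
print(sample_gradient((250, 230, 160), (60, 30, 80), 6))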
We’re excited about what we ended up with. Here are some of our color palettes in use. They all begin with pure white and end with pure black to achieve the maximum variation in brightness.
While there are an increasing number of good color palettes out there, not all of them are applicable to charts and data visualizations. Our approach to visualization color palettes is to make natural gradients that vary in both hue and brightness. By doing this, our palettes are accessible to people who are color blind, obvious for everyone else, and work with anywhere from one to twelve data series.
Along the way, we identified a few great resources and articles that reached similar conclusions as we did, but take a more mathematical approach and even dive into the color theories. We thought we’d share for further reading: | |
And here are some other good color palette resources we found and loved. While they are not necessarily designed for data visualization, we think you would find them useful. | |
Hopefully this post was useful to you! What’s your process of creating color palettes? What other tools have you used? We’d love to hear any lessons you’ve learned related to color palettes and visualizations. | |
To see more about our engineering process, please subscribe to our publication: Graphiq Inc.","6" | |
"datagovsg","https://medium.com/datagovsg-blog/how-we-caught-the-circle-line-rogue-train-with-data-79405c86ab6a","15","{""Data Science"",""Data Visualization"",Singapore,""Public Transport"",Python}","1100","7.62075471698113","How the Circle Line rogue train was caught with data","Text: Daniel Sim | Analysis: Lee Shangqian, Daniel Sim & Clarence Ng | |
Singapore’s MRT Circle Line was hit by a spate of mysterious disruptions in recent months, causing much confusion and distress to thousands of commuters. | |
Like most of my colleagues, I take a train on the Circle Line to my office at one-north every morning. So on November 5, when my team was given the chance to investigate the cause, I volunteered without hesitation. | |
From prior investigations by train operator SMRT and the Land Transport Authority (LTA), we already knew that the incidents were caused by some form of signal interference, which led to loss of signals in some trains. The signal loss would trigger the emergency brake safety feature in those trains and cause them to stop randomly along the tracks. | |
But the incidents — which first happened in August — seemed to occur at random, making it difficult for the investigation team to pinpoint the exact cause. | |
We were given a dataset compiled by SMRT that contained the following information: | |
We started by cleaning the data. We worked in a Jupyter Notebook, a popular tool for writing and documenting Python code. | |
As usual, the first step was to import some useful Python libraries. | |
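The embedded code doesn’t survive in this dataset; a typical set of imports for this kind of notebook analysis might look like the following (an assumption, not the team’s actual code):

# Typical exploratory-analysis imports (assumed; the original notebook isn't shown).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt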
We then extracted the useful parts from the raw data. | |
We combined the date and time columns into one standardised column to make it easier to visualise the data: | |
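The original snippet isn’t included here. Assuming the raw log has separate date and time columns (the file and column names below are placeholders), the combination might look like this in pandas:

# "incident_log.csv", "Date" and "Time" are placeholder names for illustration.
df = pd.read_csv("incident_log.csv")
df["Timestamp"] = pd.to_datetime(df["Date"] + " " + df["Time"])
df = df.sort_values("Timestamp")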
This gave us: | |
We could not find any obvious answers in our initial exploratory analysis, as seen in the following charts: | |
1. The incidents were spread throughout a day, and the number of incidents across the day mirrored peak and off-peak travel times. | |
2. The incidents happened at various locations on the Circle Line, with slightly more occurrences on the west side. | |
3. The signal interferences did not affect just one or two trains, but many of the trains on the Circle Line. “PV” is short for “Passenger Vehicle”. | |
Our next step was to incorporate multiple dimensions into the exploratory analysis. | |
We were inspired by the Marey Chart, which was featured in Edward Tufte’s vaunted 1983 classic The Visual Display of Quantitative Information. More recently, it was used by Mike Barry and Brian Card for their extensive visualisation project on the Boston subway system: | |
In this chart, the vertical axis represents time — chronologically from top to bottom — while the horizontal axis represents stations along a train line. The diagonal lines represent train movement. | |
We started by drawing the axes in our version of the Marey Chart: | |
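The plotting code is likewise not preserved in this dataset; a minimal matplotlib sketch of such axes (station codes and ordering are illustrative only) could be:

stations = ["HBF", "TLB", "LBD", "PPJ", "DBG"]  # illustrative subset of the line

fig, ax = plt.subplots(figsize=(8, 10))
ax.set_xticks(range(len(stations)))
ax.set_xticklabels(stations)
ax.set_xlabel("Station")
ax.set_ylabel("Time of day")
ax.xaxis.tick_top()   # station labels along the top, as in a Marey chart
ax.invert_yaxis()     # time runs from top to bottom
plt.show()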
Under normal circumstances, a train that runs between HarbourFront and Dhoby Ghaut would move in a line similar to this, with each one-way trip taking just over an hour: | |
Our intention was to plot the incidents — which are points instead of lines — on this chart. | |
First, we converted the station names from their three-letter codes to a number: | |
If the incident occurred between two stations, it would be denoted as 0.5 + the lower of the two station numbers. For example, if an incident happened between HarbourFront (number 29) and Telok Blangah (number 28), the location would be “28.5”. This made it easy for us to plot the points along the horizontal axis.
And then we computed the numeric location IDs… | |
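The computation itself isn’t shown in this dataset; one plausible sketch, assuming a station-code-to-number mapping and a hyphenated “between stations” notation (both assumptions on my part), is:

# Hypothetical mapping from station code to its position along the line
# (only a few stations shown; the actual mapping covers the whole Circle Line).
station_number = {"DBG": 1, "BBS": 2, "EPN": 3, "TLB": 28, "HBF": 29}

def location_id(location):
    # "HBF" -> 29.0, and "TLB-HBF" (between stations) -> 28.5
    if "-" in location:
        a, b = location.split("-")
        return min(station_number[a], station_number[b]) + 0.5
    return float(station_number[location])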
And added that to the dataset: | |
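Continuing the sketch above, with the same assumed column names:

df["LocID"] = df["Location"].apply(location_id)  # "Location" is an assumed column name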
Then we had: | |
With the data processed, we were able to create a scatterplot of all the emergency braking incidents. Each dot here represents an incident. Once again, we were unable to spot any clear pattern of incidents. | |
Next, we added train direction to the chart by representing each incident as a triangle pointing to the left or right, instead of dots: | |
It looked fairly random, but when we zoomed into the chart, a pattern seemed to surface: | |
If you read the chart carefully, you would notice that the breakdowns seem to happen in sequence. When a train got hit by interference, another train behind moving in the same direction got hit soon after. | |
At this point, it still wasn’t clear that a single train was the culprit. | |
What we’d established was that there seemed to be a pattern over time and location: Incidents were happening one after another, in the opposite direction of the previous incident. It seemed almost like there was a “trail of destruction”. Could it be something that was not in our dataset that caused the incidents? | |
Indeed, imaginary lines connecting the incidents looked suspiciously similar to those in a Marey Chart (Screenshot 2). Could the cause of the interference be a train — in the opposite track? | |
We decided to test this “rogue train” hypothesis. | |
We knew that the travel time between stations along the Circle Line ranges between two and four minutes. This means we could group all emergency braking incidents together if they occur up to four minutes apart. | |
We found all incident pairs that satisfied this condition: | |
We then grouped all related pairs of incidents into larger sets using a disjoint-set data structure. This allowed us to group incidents that could be linked to the same “rogue train”. | |
Then we applied our algorithm to the data: | |
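The clustering code is also missing from this export; a minimal union-find (disjoint-set) sketch of the idea, with made-up incident index pairs, might look like:

from collections import defaultdict

parent = {}

def find(x):
    # Find the representative of x's set, with path compression.
    parent.setdefault(x, x)
    if parent[x] != x:
        parent[x] = find(parent[x])
    return parent[x]

def union(x, y):
    parent[find(x)] = find(y)

# Hypothetical pairs of incidents that occurred within four minutes of each other.
pairs = [(0, 1), (1, 2), (5, 6)]
for a, b in pairs:
    union(a, b)

clusters = defaultdict(list)
for incident in sorted({i for pair in pairs for i in pair}):
    clusters[find(incident)].append(incident)
print(list(clusters.values()))  # e.g. [[0, 1, 2], [5, 6]]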
These were some of the clusters that we identified: | |
Next, we calculated the percentage of the incidents that could be explained by our clustering algorithm. | |
The result was: | |
What it means: Of the 259 emergency braking incidents in our dataset, 189 cases — or 73% of them — could be explained by the “rogue train” hypothesis. We felt we were on the right track. | |
We coloured the incident chart based on the clustering results. Triangles with the same colour are in the same cluster. | |
As we showed in Figure 5, each end-to-end trip on the Circle Line takes about 1 hour. We drew best-fit lines through the incidents plots and the lines closely matched that of Figure 5. This strongly implied that there was only one “rogue train”. | |
We also observed that the unidentified “rogue train” itself did not seem to encounter any signalling issues, as it did not appear on our scatter plots. | |
Convinced that we had a good case, we decided to investigate further. | |
After sundown, we went to Kim Chuan Depot to identify the “rogue train”. We could not inspect the detailed train logs that day because SMRT needed more time to extract the data. So we decided to identify the train the old school way — by reviewing video records of trains arriving at and leaving each station at the times of the incidents. | |
At 3am, the team had found the prime suspect: PV46, a train that has been in service since 2015. | |
On November 6 (Sunday), LTA and SMRT tested if PV46 was the source of the problem by running the train during off-peak hours. We were right — PV46 indeed caused a loss of communications between nearby trains and activated the emergency brakes on those trains. No such incident happened before PV46 was put into service on that day. | |
On November 7 (Monday), my team processed the historical location data of PV46 and concluded that more than 95% of all incidents from August to November could be explained by our hypothesis. The remaining incidents were likely due to signal loss that happens occasionally under normal conditions.
The pattern was especially clear on certain days, like September 1. You can easily see that interference incidents happened during or around the time belts when PV46 was in service. | |
LTA and SMRT eventually published a joint press release on November 11 to share the findings with the public. | |
When we first started, my colleagues and I were hoping to find patterns that may be of interest to the cross-agency investigation team, which included many officers at LTA, SMRT and DSTA. The tidy incident logs provided by SMRT and LTA were instrumental in getting us off to a good start, as minimal cleaning up was required before we could import and analyse the data. We were also gratified by the effective follow-up investigations by LTA and DSTA that confirmed the hardware problems on PV46. | |
From the data science perspective, we were lucky that incidents happened so close to one another. That allowed us to identify both the problem and the culprit in such a short time. If the incidents were more isolated, the zigzag pattern would have been less apparent, and it would have taken us more time — and data — to solve the mystery. | |
Of course, we were most pleased that all of us can now take the Circle Line to work with confidence again. | |
Daniel Sim, Lee Shangqian and Clarence Ng are data scientists at GovTech’s Data Science Division. | |
Follow Data.gov.sg: Twitter | Facebook","2" | |
"rchang","https://medium.com/@rchang/my-two-year-journey-as-a-data-scientist-at-twitter-f0c13298aee6","5","{""Data Science"",""Silicon Valley"",Entrepreneurship}","906","14.3314465408805","Doing Data Science at Twitter","On June 17, 2015, I celebrated my two year #Twitterversary @Twitter. Looking back, the Data Science (short for DS) landscape at Twitter has shifted quite a bit: | |
And these are only a handful of changes among many others! On a personal note, I’ve recently branched out from Growth to PIE (Product, Instrumentation, and Experimentation) to work on the statistical methodologies of our home grown A/B Testing platform. | |
Being at Twitter is truly exciting, because it allows me to observe and learn, first hand, how a major technology company leverages data and DS to create competitive edges. | |
Meanwhile, demands and desires to do data science continued to skyrocket. | |
There are many, and I mean many, discussions around how to become a data scientist. While these discussions are extremely informative (I am one of the beneficiaries), they tend to over-emphasize techniques, tools, and skill-sets. In my opinion, it is equally important for aspiring Data Scientists to know what it is really like to work as a DS in practice.
As a result, as I hit my two year mark at Twitter, I want to use this reflection as an opportunity to share my personal experience, in the hope that others in the field would do the same! | |
Before Twitter, I got the impression that all DS need to be unicorns — spanning Math/Stat, CS/ML/Algorithms, and data viz. In addition to technical skills, writing and communication skills are crucial. Furthermore, being able to prioritize, lead, and manage projects is paramount for execution. Oh yeah, you should also evangelize a data-driven culture. Good luck!
A few months into my job, I learned that while unicorns do exist, for the majority of us who are still trying to get there, it is unrealistic/infeasible to do all these things at once. That said, almost everything data related is tied to the term DS, and it was a bit daunting to find my place as a newbie.
Over time, I realized that there is an overly simplified but sufficiently accurate dichotomy of the different types of Data Scientists. I wasn’t able to articulate this well until I came across a Quora answer from Michael Hochster, who elegantly summarized this point. In his words:
I wish I had known this earlier. In fact, as an aspiring DS, it is very useful to keep this distinction in mind as you make career decisions and choices. | |
Personally, my background is in Math, Operations Research, and Statistics. I identified myself mainly as a Type A Data Scientist, but I also really enjoy Type B projects that involved more engineering! | |
One of the most common decisions to make while looking for tech jobs is the choice between joining a large company versus a small one. While there are a lot of good general discussions on this topic, there isn’t much information specifically for DS — namely, how the role of a DS changes depending on the stage and size of the company.
Companies at different stages produce data in different velocity, variety, and volume (the infamous 3Vs). A start-up trying to find its product-market fit probably doesn’t need Hadoop because there isn’t much data. A growing start-up will be more data intensive but might do just fine using PostgreSQL or Vertica. But a company like Twitter cannot efficiently process all its data without using Hadoop and the Map-Reduce framework.
One important lesson I learned at Twitter is that a Data Scientist’s capability to extract value from data is largely coupled with the maturity of the company’s data platform. Understanding what kind of DS work you want to do, and doing your research to evaluate whether the company’s infrastructure can support your goal, is not only smart but paramount to ensuring the right mutual fit.
By the time I joined Twitter, it already had a very mature data platform and stable infrastructure in place. The warehouse was clean and reliable, and ETL processes were running hundreds of Map-Reduce jobs easily on a daily basis. Most importantly, we had talented DS working on the data platform, product insights, Growth, experimentation, and Search/Relevance, along with many other focus areas.
I was the first dedicated Data Scientist on Growth, and the reality is, it took us a good few months before Product, Engineering, and DS converged on how DS can play a critical role in the process. Based on my experience working closely with the product team, I categorize my responsibilities into four general areas: | |
Let me describe my experience and learning in each of these topics. | |
One of the unique aspects of working for a consumer technology company is that we can leverage data to understand and infer the voice and preference of our users. Whenever a user interacts with the product, we record useful data and metadata and store them for future analyses. | |
This process is known as logging or instrumentation, and is constantly evolving. Frequently, DS might find a particular analysis difficult to perform because the data is either malformed, inappropriate, or missing. Establishing a good relationship with the engineers is very useful here because DS can help engineers to identify bugs or unintended behaviors in the system. In return, engineers can help DS to close “Data Gaps” and to make data richer, more relevant, and more accurate. | |
Here are a few examples of product related analyses I performed at Twitter: | |
Analyses come in different forms — sometimes you are asked to provide straightforward answers to simple data pulls (push analysis), other times you might need to come up with new ways to calculate a new but important operational metric (SMS delivery rates), and finally you might be tasked with understanding user behaviors more deeply (multiple accounts).
Generating insights through product analysis is an iterative process. It requires challenging the questions being asked, understanding the business context, and figuring out the right dataset to answer the questions. Over time, you will become an expert in where the data lives and what they mean. You will get better at estimating how much time it will take to carry out an analysis. More importantly, you will slowly move from a reactive state to proactive state and start suggesting interesting analyses that product leaders might not think of, because they don’t know the data exists or that disparately different data sources can be complementary and combined in a particular way. | |
Even though Type A data scientists might not produce code that is directly user-facing, surprisingly often we still commit code to the codebase for the purpose of data pipeline processing.
If you have heard of the Unix pipe operator (|), which facilitates the execution of a series of commands, a data pipeline is much the same idea: a series of operations that, chained together, automatically capture, munge, and aggregate data on a recurring basis.
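To make the analogy concrete, here is a toy illustration of the idea (nothing to do with Twitter’s actual tooling): each step feeds its output to the next, and the whole chain can be scheduled to run on a recurring basis.

def capture():
    # Stand-in for pulling raw event data from a log or warehouse.
    return [{"user": 1, "clicks": 3}, {"user": 2, "clicks": 0}]

def munge(rows):
    # Stand-in for cleaning: e.g. drop malformed rows.
    return [r for r in rows if r.get("clicks") is not None]

def aggregate(rows):
    # Stand-in for the final rollup that a dashboard might consume.
    return {"total_clicks": sum(r["clicks"] for r in rows)}

result = aggregate(munge(capture()))
print(result)  # {'total_clicks': 3}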
Before Twitter, most of my analyses were ad-hoc in nature. They were mostly run once or a few times on my local machine. The code was rarely reviewed, and it was most likely not version controlled. When a data pipeline is created, a new set of concerns starts to surface, such as dependency management, scheduling, resource allocation, monitoring, error reporting, and alerting.
Here is a typical process of creating a data pipeline: | |
Obviously, a pipeline is more complex than an ad-hoc analysis, but the advantage is that the job can now run automatically, and the data it produces can power dashboards so more users can consume your data/results. More importantly, though it is a subtle point, this is a great learning process for picking up engineering best practices, and it provides the foundation in case you ever need to build specialized pipelines such as a machine learning model (I will talk about this more in the last section) or an A/B testing platform.
Right at this moment, it’s very possible that the Twitter app you are using is slightly different from mine, and it’s entirely possible that you actually have a feature that I do not see. Under the hood, since Twitter has a lot of users, it can direct a small percentage of its traffic to experience a new feature that is not yet public, so as to understand how these specific users react to it compared to those who don’t see it (the control group) — this is known as A/B testing, where we get to test which variant, A or B, is better.
I personally think A/B testing is one of the unique perks of working for a large consumer tech company. As a Data Scientist, you get to establish causality (something really hard to do with observational data) by running actual randomized, controlled experiments. At Twitter, “It’s rare for a day to go by without running at least one experiment” — Alex Roetter, VP of Engineering. A/B testing is ingrained in our DNA and our product development cycle. | |
Here is the typical process of running an A/B test: Gather Samples -> Assign Buckets -> Apply Treatments -> Measure Outcomes -> Make Comparisons. Surely this sounds pretty easy, no? On the contrary, I think A/B testing is one of the most under-appreciated and trickiest kinds of analytics work, and it’s a skill that’s rarely taught in school. To demonstrate my point, let’s revisit the 5 steps above and some of the practical problems you might run into:
Addressing each and every question above requires a good command of Statistics. Even if you are as rigorous as possible when designing an experiment, other people might fall short. A PM might be incentivized to peek at the data early, or to cherry-pick the results that they want (because it’s human nature). An engineer might forget to log the specific information needed to calculate the success metric, or the experiment code might be written in the wrong way, introducing unintended bias.
As a Data Scientist, it’s important to play the devil’s advocate and help the team be rigorous, because time wasted on running an ill-designed experiment is not recoverable. Far worse, an ill-informed decision based on bad data is far more damaging than anything else.
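As one hedged illustration of the “measure outcomes and make comparisons” step (a textbook two-proportion z-test, not Twitter’s actual experimentation platform), comparing conversion rates between a control and a treatment bucket might look like this:

from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    # Compare conversion rates between bucket A (control) and bucket B (treatment).
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value

# Invented numbers: 1,200/10,000 conversions in control vs 1,320/10,000 in treatment.
print(two_proportion_z_test(1200, 10000, 1320, 10000))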
My first big project at Twitter was to add a set of fatigue rules to our existing email notification product so as to reduce spam for our users. While this is a noble gesture, we also know that email notification is one of the biggest retention levers (we know this causally because we ran experiments on it), so finding the right balance is the key.
With this key observation, I quickly decided to focus on trigger-based emails: email types that tend to arrive in bursts in users’ inboxes when interactions happen. Being an ambitious new DS trying to prove his value, I decided to build a fancy ML model to predict email CTR at the individual level. I marched on, aggregated a bunch of user-level features in Pig, and built a random forest model to predict email clicks. The idea was that if a user had a consistently long history of low CTR, we could safely hold back that email from that user.
There was only one problem — all of my work was done on my local machine in R. People appreciated my efforts, but they didn’t know how to consume my model because it was not “productionized” and the infrastructure could not talk to my local model. Hard lesson learned!
A year later, I found a great opportunity to build a churn prediction model with two other DS from Growth. This time around, because I had accumulated enough experience building data pipelines, I learned that building an ML pipeline was in fact quite similar — there is the training phase, which can be done offline with periodic model updates through Python; and there is the prediction part, where we aggregate user features daily and let the prediction function do its magic (mostly just a dot product) to produce a churn probability score for each user.
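For a sense of what “mostly just a dot product” can mean in practice, here is a minimal scoring sketch assuming a logistic-style model (that specific model form is my assumption, and the numbers are invented):

import numpy as np

def churn_score(features, weights, bias=0.0):
    # Daily scoring step: a dot product pushed through a sigmoid to get a probability.
    return 1.0 / (1.0 + np.exp(-(np.dot(features, weights) + bias)))

# Invented user features and offline-trained weights, for illustration only.
user_features = np.array([0.2, 5.0, 1.0])    # e.g. recency, sessions, follows
model_weights = np.array([0.8, -0.3, -0.5])
print(churn_score(user_features, model_weights))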
We built the pipeline in a few weeks, confirmed that it had good predictive power, and rolled it out by writing the scores to Vertica, HDFS, and our internal key-value store at Twitter called Manhattan. The fact that we made the scores easily queryable by analysts, DS, and engineering services helped us evangelize and drive use cases for our model. And that was the biggest lesson I learned about building models in production.
I have deliberately ignored the steps one needs to take in building an ML model in this discussion so far — framing the problem, defining labels, collecting training data, engineering features, building prototypes, and validating and testing the model objectively. These are obviously all important, but I feel that they are fairly well taught and much good advice has already been given on this very subject.
I think many of the brilliant DS, especially Type A DS, have the opposite problem: they know how to do it right but are not sure how to push these models into the ecosystem. My recommendation is to talk to the Type B Data Scientists who have a lot of experience on this topic, find the set of skills that are needed, find the intersections, and hone your skills so you can pivot to these projects when the time is right. Let me close this section with the following quote:
Well said. | |
Being a Data Scientist is truly exciting, and the thrill of finding a particular insight can be as exciting as an adrenaline rush. Building a data pipeline or ML model from the ground up can be deeply satisfying, and there is a lot of fun in ‘playing God’ when running A/B tests. That said, this road ain’t easy or pretty; there will be a lot of struggles along the way, but I think a motivated and smart individual will pick these things up quickly.
Here are some additional information I found very useful along the way, and I hope you find them useful too: | |
Data Science and Software Engineering: | |
A/B Testing: | |
Recruiting: | |
It’s a long journey, and we are all still learning. Good luck, and most importantly, have fun!","3" | |
"davidventuri","https://medium.com/free-code-camp/if-you-want-to-learn-data-science-take-a-few-of-these-statistics-classes-9bbabab098b9","8","{""Data Science"",Programming,""Learning To Code"",Technology,""Life Lessons""}","899","13.1069182389937","If you want to learn Data Science, take a few of these statistics classes","A year ago, I was a numbers geek with no coding background. After trying an online programming course, I was so inspired that I enrolled in one of the best computer science programs in Canada. | |
Two weeks later, I realized that I could learn everything I needed through edX, Coursera, and Udacity instead. So I dropped out. | |
The decision was not difficult. I could learn the content I wanted faster, more efficiently, and for a fraction of the cost.
I already had a university degree and, perhaps more importantly, I already had the university experience. Paying $30K+ to go back to school seemed irresponsible. | |
I started creating my own data science master’s degree using online courses shortly afterwards, after realizing it was a better fit for me than computer science. I scoured the introduction to programming landscape. For the first article in this series, I recommended a few coding classes for the beginner data scientist. | |
I have taken a few courses, and audited portions of many. I know the options out there, and what skills are needed for learners preparing for a data analyst or data scientist role. | |
For this guide, I spent 15+ hours trying to identify every online intro to statistics and probability course offered as of November 2016, extracting key bits of information from their syllabi and reviews, and compiling their ratings. For this task, I turned to none other than the open source Class Central community and its database of thousands of course ratings and reviews. | |
Since 2011, Class Central founder Dhawal Shah has kept a closer eye on online courses than arguably anyone else in the world. Dhawal personally helped me assemble this list of resources. | |
Each course must fit four criteria: | |
We believe we covered every notable course that fits the above criteria. Since there are seemingly hundreds of courses on Udemy, we chose to consider the most-reviewed and highest-rated ones only. There’s always a chance that we missed something, though. So please let us know in the comments section if we left a good course out. | |
We compiled average rating and number of reviews from Class Central and other review sites. We calculated a weighted average rating for each course. If a series had multiple courses (like the University of Texas at Austin’s two-part “Foundations of Data Analysis” series), we calculated the weighted average rating across all courses. We read text reviews and used this feedback to supplement the numerical ratings. | |
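As a rough illustration (not the authors’ actual code), weighting each site’s average rating by its number of reviews could be done like this in pandas, assuming a hypothetical reviews table:
import pandas as pd
# reviews: hypothetical DataFrame with one row per (course, review site),
# columns [course, avg_rating, n_reviews]
totals = (reviews.assign(weighted_sum=reviews["avg_rating"] * reviews["n_reviews"])
                 .groupby("course")[["weighted_sum", "n_reviews"]].sum())
totals["weighted_avg_rating"] = totals["weighted_sum"] / totals["n_reviews"]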
We made subjective syllabus judgment calls based on three factors: | |
William Chen, a data scientist at Quora who has a master’s in Applied Mathematics from Harvard, wrote the following in this popular Quora answer to the question: “How do I learn statistics for data science?” | |
Since a lot of a data scientist’s statistical work is carried out with code, getting familiar with the most popular tools is beneficial. | |
Probability is not statistics and vice versa. My favorite explanation of their differences is from Stony Brook University: | |
They explain that “probability is primarily a theoretical branch of mathematics, which studies the consequences of mathematical definitions,” while “statistics is primarily an applied branch of mathematics, which tries to make sense of observations in the real world.” | |
Statistics is generally regarded as one of the pillars of data science. Probability — though it generates less attention — is also an important part of a data science curriculum. | |
Joe Blitzstein, a Professor in the Harvard Statistics Department, stated in this popular Quora answer that aspiring data scientists should have a good foundation in probability theory as well. | |
Justin Rising, a data scientist with a Ph.D. in statistics from Wharton, clarified that this “good foundation” means being comfortable with undergraduate level probability. | |
“Foundations of Data Analysis” includes two of the top-reviewed statistics courses available, with a weighted average rating of 4.48 out of 5 stars over 20 reviews. The series is one of the only offerings in the upper echelon of ratings to teach statistics with a focus on coding up examples. Though not mentioned in either course title, the syllabi contain sufficient probability content to satisfy our criteria. Together, these courses offer a great mix of fundamentals coverage and scope for the beginner data scientist.
Michael J. Mahometa, Lecturer and Senior Statistical Consultant at the University of Texas at Austin, is the “Foundations of Data Analysis” series instructor. Both courses in the series are free. The estimated timeline is 6 weeks at 3–6 hours per week for each course. One prominent reviewer said: | |
Please note each course’s description and syllabus are accessible via the links provided above. | |
Update (December 5, 2016): Our original second recommendation, UC Berkeley’s “Stat2x: Introduction to Statistics” series, closed their enrollment a few weeks after the release of this article. We promoted our top recommendation in “The Competition” section accordingly. | |
…which contains the following five courses: | |
This five-course specialization is based on Duke’s excellent Data Analysis and Statistical Inference course, which had a 4.82-star weighted average rating over 55 reviews. The specialization is taught by the same professor, plus a few additional faculty members. The early reviews of the new individual courses, which have a 2.6-star weighted average rating over 5 reviews, should be taken with a grain of salt due to the small sample size. The syllabi are comprehensive and have full sections dedicated to probability.
Dr. Mine Çetinkaya-Rundel is the main instructor for the specialization. The individual courses can be audited for free, though you don’t have access to grading. Reviews suggest that the specialization is “well worth the money.” Each course has an estimated timeline of 4–5 weeks at 5–7 hours per week. One prominent reviewer said the following about the original course that the specialization was based upon: | |
Consider the above MIT course if you want a deeper dive into the world of probability. It is a masterpiece with a weighted average rating of 4.91 out of 5 stars over 34 reviews. Be warned: it is a challenge and much longer than most MOOCs. The level at which the course covers probability is also not necessary for the data science beginner. | |
John Tsitsiklis and Patrick Jaillet, both of whom are professors in the Department of Electrical Engineering and Computer Science at MIT, teach the course. The contents of this course are essentially the same as those of the corresponding MIT class (Probabilistic Systems Analysis and Applied Probability) — a course that has been offered and continuously refined over more than 50 years. The estimated timeline is 16 weeks at 12 hours per week. One prominent reviewer said: | |
I encourage you to visit Class Central’s page for this course to read the rest of the reviews. | |
Our #1 pick had a weighted average rating of 4.48 out of 5 stars over 20 reviews. Let’s look at the other alternatives. | |
The following courses had no reviews as of November 2016. | |
This is the second of a six-piece series that covers the best MOOCs for launching yourself into the data science field. We covered programming in the first article, and the remainder of the series will cover several other data science core competencies: data analysis, data visualization, and machine learning. | |
The final piece will be a summary of those courses, and the best MOOCs for other key topics such as data wrangling, databases, and even software engineering. | |
If you’re looking for a complete list of Data Science MOOCs, you can find them on Class Central’s Data Science and Big Data subject page. | |
If you have suggestions for courses I missed, let me know in the responses! | |
If you found this helpful, click the 💚 so more people will see it here on Medium. | |
This is a condensed version of the original article published on Class Central, where course descriptions, syllabi, and multiple reviews are included.","4" | |
"cesifoti","https://medium.com/mit-media-lab/what-i-learned-from-visualizing-hillary-clintons-leaked-emails-d13a0908e05e","2","{Research,Politics,""2016 Election"",Journalism,""Data Visualization""}","886","10.1531446540881","What I learned from visualizing Hillary Clinton’s emails","It all started early last week. Kevin Hu, one of my senior grad students, told me that a friend of his asked if we could use Immersion — an email visualization tool we had released in 2013 — to visualize Clinton’s Wikileaks email dataset. | |
The timing was not ideal for us. Kevin asked me this question when the Media Lab member event was getting started, which is a particularly busy time of the year. So my first question to Kevin was: “Can we?” | |
What made the project possible, besides Kevin’s amazing talents, was that we had reactivated the Immersion project a few weeks ago together with Jingxian Zhang, a new grad student in the group. Immersion had been paralyzed for two years since the students who had worked on it graduated. But Jingxian was now working to help complete the original vision of this project, and with her software engineering skills, visualizing the Wikileaks Clinton email dataset in Immersion was a possibility. So now that I knew we could do it, the question was: Should we? | |
Should questions are tough, especially when you need to consider a number of different variables. But before I tell you why I decided to move ahead with the project, I’ll try to make sure we are all on the same page about why this dataset is relevant. | |
The answer is quite simple, but also, not the one that you hear most often. Clinton’s emails are not relevant because they expose an alleged circle of corruption, or wrongdoing, as many conservatives claim. Her emails are also not irrelevant because they do not expose corruption or wrongdoing, as many liberals claim. These emails are relevant because Clinton was a person in charge of doing a security job, and anyone working on a security job is not supposed to communicate using an unsecured or unauthorized channel. This should be obvious, since each extra channel of communication increases the vulnerability of the system by increasing the probability that messages are intercepted. So the reason why Clinton’s emails are a big deal is because a person in charge of security should not be using an insecure channel, and those who argue from that perspective have a valid point. | |
Now, did Clinton actually reveal sensitive information? Whether she did or didn’t is a separate point from the one above, but also, one that we need to consider. Also, how are people supposed to learn about what was revealed in these messages? Should they blindly trust what the media tells them, or should they be allowed to evaluate this information themselves? And in a world where this information is already publicly available, but hard to digest, should we silence efforts that make this primary source of data available to citizens, or should we embrace them, as these efforts allow us to make our own conclusions by personally browsing the data? | |
For years I have created teams with unique capacities to make large datasets easy to understand. Earlier this year we released Data USA, the most comprehensive visualization of US public data. In 2013, we released a project visualizing the entire formal sector economy of Brazil (dataviva.info). My group also has been hosting a very popular tool to visualize international trade data (atlas.media.mit.edu) since 2011 (see chidalgo.com for a full list of projects). So in this environment, where I lead groups with the ability to make data easily digestible, and have a commitment to making data accessible so that people can explore it directly and make their own decisions, I decided that improving people’s ability to navigate a politically relevant dataset that was already publicly available was the right choice. My intuition was that, if you were going to spend 1, 5, or 10 minutes looking directly at these emails, you would get a slightly deeper understanding of what was in them if you used our interface rather than the ones that were presently available. I believe that these potential increases in depth, together with the creation of tools that allow people to explore primary sources of data directly, are a contribution. You may disagree with my choice, but I hope you at least understand it. | |
So what did we learn by making this dataset accessible? | |
We learned a few things about what Clinton’s emails said, about how the media works, and about how people interpreted the project. | |
We made clinton.media.mit.edu publicly available last Friday night (October 28, 2016). We launched with a single story, written by Alejandra Vargas from Univision. | |
My intuition was that the story was likely to get picked up by other news sources. After all, the tool facilitated people’s ability to read and understand the content of these emails, and the connections of the people involved in them. But I was wrong: it has been nearly a week since we released the project, and no other major news source has picked up the story, even though the site has been viewed by more than 300,000 people in that time.
So how did we get so much traffic without any news coverage? The answer is social media. So far, the tool has been shared widely on Twitter, Facebook, and for a brief but intense time, on Reddit. Its spread has been fueled by different motives, and also, has been battled in different ways. | |
Many reporters shared the news on their personal accounts understanding that the tool represents a different form of data reporting, or data journalism: one where people are provided with a tool that facilitates their ability to explore a relevant dataset, instead of being provided with a story summarizing a reporter’s description of that dataset. | |
Another group of people that shared the news were interface designers, who understand that there is a need to improve the tabular interface of present day email clients, and that the inbox we presented in this project was an attractive new alternative. | |
But many people also shared our site claiming that this was evidence of Clinton’s corruption, and that the site supported Trump. More on that later. | |
But the spread of the site was not without its detractors. A few hours after we released the site I received a message from a friend telling me that what I had done was “a huge mistake” and that I should have waited to post this until “later in the year.” | |
A few days later, outside my lab, a member of a neighboring research group called me a “Trump supporter” and told me that I should have only made that site available if it also included Trump’s emails. I told him that I would be happy to include them, but I had no access to the data. In haste, this colleague began emailing me news articles, none of which provided access to the alleged public dataset of Trump emails. | |
Later, a friend of one of my students posted the news on Reddit, where it went viral. And I mean really viral. It became the top story of the Internetisbeautiful subreddit, and made it to Reddit’s front page. It collected more than 3,000 upvotes and 700 comments. But as the story peaked, a moderator single-handedly removed it in an authoritarian move, and justified this unilateral silencing of the post by adding a rule banning “sites that serve a political agenda or that otherwise induce drama.” Of course, the rule was added AFTER the post was removed. | |
So when it comes to media, social or not, I learned that providing information directly to people so that they can inspect it and evaluate it, is a value that many people consider second to supporting their preferred electoral choice. The twist is that I don’t support Trump. In fact, I don’t support him at all. I think he is potentially a threat to global security, and also, a candidate that has shown repeatedly to be a dividing rather than unifying force. He has failed to respect contracts numerous times, defrauding contractors; and he certainly has shown little respect for people’s development by creating a fraudulent university. So I think he is ill prepared for most jobs, including a difficult one like that of being president. | |
I support Clinton in this election, and even though I don’t get to vote (As a green card holder I just pay taxes), I want her to win next Tuesday. I really do. But I understand that this is my own personal choice, a choice that I want to make sure is informed by my ability to evaluate information about the candidates directly, and by a media that is more transparent than the one we now have. Trust me, if I had Trump’s tax records, I would also think it is a good idea to make a tool that makes them more easily digestible. But my reason to make that tool, once again, would not come from my support for Clinton, or my opposition to Trump. It would come from my support for a society where people have direct access to relevant sources of information through well-designed data visualization tools. | |
So what did I learn about Clinton’s emails? One of the advantages of helping design a data visualization tool is that you get an intimate understanding of the data you are visualizing. After all, you have to explore the data and use the tool to make dozens of design decisions. In this case, the development cycle was particularly fast, but nevertheless I got to learn a few things about the data. | |
Of course, the whole point of making this tool is that you can use it to come up with your own interpretation of the data. That said, you might be curious about mine, so I’ll share it with you too. | |
What I saw in Clinton’s emails was not surprising to me. It involved a relatively small group of people talking about what language to use when communicating with other people. It also involved many unresponded-to emails. Many conversations revolved around what words to use or avoid, and what topics to focus on or steer clear of, when speaking in public or in meetings. This is not surprising to me because I’ve met many politicians in my life, including a few presidents and dozens of ministers and governors, so I know that, for many people in this line of work, the daily job consists largely of strategizing what to say and being careful about how to say it. I am sure that if we had access to Trump’s emails we would see plenty of the same behavior.
So what I got from reading some of Clinton’s email is another piece of evidence confirming my intuition that political systems scale poorly. The most influential actors on them are spending a substantial fraction of their mental capacity thinking about how to communicate, and do not have the bandwidth needed to deal with many incoming messages (the unresponded-to emails). This is not surprising considering the large number of people they interact with (although this dataset is rather small. I send 8k emails a year and receive 30k. In this dataset Clinton is sending only 2K emails a year). | |
Our modern political world is one where a few need to interact with many, so they have no time for deep relationships — they physically cannot. So what we are left is with a world of first impressions and public opinion, where the choice of words matters enormously, and becomes central to the job. Yet, the chronic lack of time that comes from having a system where few people govern many, and that leads people to strategize every word, is not Clinton’s fault. It is just a bug that affects all modern political systems, which are ancient Greek democracies that were not designed to deal with hundreds of millions of people. | |
On another note, this exercise also helped me reaffirm my belief that the best way to learn about the media is not by reading the news, but by being news. I’ve had the fortune, and misfortune, to have been news many times. This time, I honestly thought that we had a piece of content that some media channels would be interested in and that it would get picked up easily. I have many reporter friends who are enthusiastic about new forms of data journalism, and that actually have been positive and encouraging this week. So I imagined that there was a good chance that a reporter would see the site, go to his or her editor, and say: “Hey, I have an interactive data visualization of all Clinton’s emails. Can I write a story on it?” and the editor would say: “Of course, make it quick.” I don’t know if these conversations actually happened, but given the large volume of traffic our project received I would be surprised if they didn’t. I learned that the outcome was not the one I intuited. | |
And this brings me to my final point: while I support Clinton in this election, and I think Trump is a bad choice for president (a really bad one), I still think we should work on creating tools that improve people’s ability to personally scrutinize politically relevant information. I now understand that much of the U.S. media may not share that view, and I think this is an important point of reflection. I hope the media takes some time to think about this on November 9 (or the week after).
Also, the large number of people who were unable to interpret our tool as anything but an effort to support or oppose a political candidate — and that was true for both liberals and conservatives — speaks to me about an ineffective public sphere. And that’s something I think we should all be concerned about. This polarization is not just a cliché. It is a crippling societal condition that is expressed in the inability of people to see any merit, or any point, in opposing views. That’s a dangerous, and chronic, institutional disease that is expressed also in the inability of people to criticize their own candidates, because they fear being confused with someone their peers will interpret as a supporter of the opposing candidate. If you cannot see any merit in the candidate you oppose, even in one or two of the many points that have been made, you may have it. | |
So that’s how this election has muddled the gears of democracy. When we cannot learn from those we oppose, or agree when they have a valid point, our learning stops. We keep on talking past each other. I know that this election has made learning from those we oppose particularly difficult, but the difficult tests are the ones that truly show us what we are really made of. These are the situations that push us to see past all of the things that we don’t like, or don’t agree on, so we can rescue a lesson. You may not agree with me, but I hope at least I gave you something to think about. | |
César A. Hidalgo is associate professor of media arts and sciences at the MIT Media Lab and the author of Why Information Grows: The Evolution of Order, from Atoms to Economies. He has also led the creation of data visualization sites that have received more than 100 million views, including datausa.io, dataviva.info, atlas.media.mit.edu, immersion.media.mit.edu, pantheon.media.mit.edu, streetscore.media.mit.edu, and others (see chidalgo.com for more details).","1"
"quincylarson","https://medium.com/free-code-camp/infrastructure-is-beautiful-cb0daa1aa76b","16","{""Data Visualization"",""Life Lessons"",Tech,Design,Travel}","885","4.40660377358491","Infrastructure is Beautiful","The US is a big place. It’s the third-largest country by land mass. And the infrastructure that connects it is equally immense. | |
Let’s marvel at the intricacy of these human-made systems. | |
Together, we’ll explore its internet, transportation, and energy distribution infrastructure — all through rapid-fire data visualizations. | |
Here are all the places with at least 1 megabit download speed: | |
And here are all the places where consumers only have one available broadband provider. | |
Not surprisingly, these monopolies aren’t always honest about how fast their internet is. The darker the pink is, the more that broadband companies are overstating the speed of their internet services. | |
The internet isn’t magic — it’s mostly just a whole lot of fiber optic cable. Here’s all the long-haul fiber that carries internet data around the United States. Red squares represent the junctions between “long haul” fibers. | |
Of course, in a perfect world, the internet would be completely borderless and free. Here’s what the entire internet looks like as of 10 years ago (it’s even more intricate, now). The individual spokes are IP addresses: | |
The US Interstate system is a network of 77,017 kilometers (47,856 miles) of highway that connects all major cities. President Eisenhower commissioned it in 1956 as a way to transport military equipment and supplies. It took 35 years and an inflation-adjusted $500 billion to complete. | |
Here’s what it looks like: | |
But I prefer this stylized representation: | |
Bridges are a big part of our road system. Here are all 600,000 bridges in the US. About 10% of them (the red ones) are structurally deficient. | |
Bridges collapse all the time, often killing people. They need to be maintained and replaced at the end of their service life, though this can be quite expensive. | |
These days, most people are in a hurry and would prefer to fly. | |
There were more than 8 million flights last year in the continental US alone. That’s a lot of carbon dioxide emissions! | |
Here’s what those flights look like: | |
Another way the US moves things around is through its railroad network. | |
The circles in the map below are Amtrak stations for passenger travel. | |
When it comes to heavy cargo, moving things around by sea is still the most cost-effective approach. | |
Here are all the power plants in the United States, and the energy we get from them: | |
For every mile of interstate highway, there are 6 miles of natural gas pipelines. Natural gas provides 30% of the US’s energy. | |
But what are all these electrons good for if you don’t have a way to distribute them? Here’s what the US power grid looks like: | |
Infrastructure is complex. Improving America’s infrastructure involves everything from filling potholes on main street to launching new arrays of thousands of low-orbit communication satellites, like Elon Musk is planning to do. | |
It will involve continually upgrading billions of discrete components within complex systems, like cells replacing themselves in our bodies. | |
The next time you pull onto an interstate, flip a light switch, or tweet out an article, pause for a moment. | |
Contemplate the grand scale of these systems, and the millions of minds and bodies that brought them into existence. | |
It’s beautiful. | |
If you want to learn more about the history of one vital piece of infrastructure — the internet — I strongly recommend reading Where Wizards Stay Up Late by Katie Hafner.
The book traces the creation of the internet from its early days connecting just four institutions (UCLA, the Stanford Research Institute, UC Santa Barbara, and the University of Utah) and talks about how they managed to scale it.
Thanks for taking time out of your busy day to read my article. If you liked it, click the 💚 below so other people will see this here on Medium.","1" | |
"davidventuri","https://medium.com/free-code-camp/the-6-most-desirable-coding-jobs-and-the-types-of-people-drawn-to-each-aebac45fd7f7","14","{Programming,""Data Science"",""Web Development"",Mobile,Design}","832","5.79716981132075","The 6 most desirable coding jobs (and the types of people drawn to each)","More than 15,000 people responded to Free Code Camp’s 2016 New Coder Survey, granting researchers (like me!) an unprecedented glimpse into how people are learning to code. The entire dataset was released on Kaggle. | |
6,503 new coders answered the question: “Which one of these roles are you most interested in?” | |
These roles are full-stack developer, front-end developer, back-end developer, data scientist/engineer, mobile developer, and user experience (UX) designer.* For each, we’ll look at three categorical variables: | |
…and five numerical ones: | |
*UX designer was a default option in the original survey. Though the degree to which it is a coding job is debatable, a basic understanding of code is helpful. | |
UX designer is by far the most diverse discipline in terms of gender, with 52% males, 46% females, and the highest percentage of agender, genderqueer, and trans respondents (2%). Mobile development is the most male-dominated discipline at 81%, though full-stack and back-end development are close. | |
Mobile developer is the most diverse role in terms of citizenship. UX design is the most North American of all of the disciplines. | |
Free Code Camp is based in the United States, which explains the tilt towards North America. | |
Data science and data engineering are most skewed towards post-secondary studies. Mobile development has the highest percentage of respondents with no, some, or only a high school education, though back-end development is a close second. | |
I wonder if these skews will reflect themselves in the form of age. | |
Mobile developers are indeed the youngest. Their 25th percentile is two years younger than the next youngest role. Mobile being a newer discipline probably has something to do with this. Front-end development is the oldest discipline with an average age of 29 years. Note that data science/engineering is second-youngest, not back-end development. | |
By the way, here’s how to read this chart (and the other box plots in this article): the “x” is the mean. The horizontal line is the median (a.k.a. the 50th percentile). The bottom of the box is the 25th percentile, and the top of the box is the 75th percentile. Whisker length is 1.5 times the height of the box. The circles are outliers. All y-axes are on a logarithmic scale to better visualize the outlier-heavy data. | |
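For readers who want to reproduce those conventions, a rough matplotlib equivalent might look like the sketch below (the original analysis was done in R; the groups variable here is hypothetical):
import matplotlib.pyplot as plt
# groups: hypothetical dict mapping role name -> list of values (e.g. ages)
fig, ax = plt.subplots()
ax.boxplot(list(groups.values()), labels=list(groups.keys()),
           showmeans=True, whis=1.5)  # show the mean marker; whiskers at 1.5x the box height
ax.set_yscale("log")  # log scale, as used here for the outlier-heavy data
plt.show()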
Aspiring data scientists, data engineers, and back-end developers have programmed the longest, with a median experience of eight months. UX designers have the lowest first quartile, at two months, a full two months below the next lowest. Programming experience is so positively skewed that some of the means, which should be taken with a grain of salt, are above their third quartile.
Full-stack developers dedicate the most time to learning each week, with 25% of respondents dedicating 30+ hours weekly. UX designers spend the least amount of time learning per week with a mean of 12 hours per week. | |
In contrast, time spent learning didn’t vary much by gender and continent. (I wrote a full analysis of this here.) | |
Aspiring data scientists and data engineers clearly have the highest current salaries. Their third quartile of $60k per year is $8k higher than the next highest discipline. There isn’t much income differentiation between the remaining job roles of interest, though all are above the 2014 US median income of $28.9k. | |
Those interested in data science and data engineering expect to earn the most at their next job, with a median expected salary of $60k. Front-end developers are the least optimistic discipline (and yes, this difference in means is statistically significant). Note that expected salaries are higher than current salaries across the board. | |
Let’s compare all of the numerical variables in a single chart, using something called a radar chart. The mean for each numerical variable, scaled (or normalized) between 0 and 1, is plotted on a radial axis: | |
One thing jumps out immediately: data scientists/engineers lead the pack for programming experience, current salary, and expected next salary. | |
Front-end and mobile developers have the smallest areas, thanks to the lowest programming experience and expected next salary means for the former, and low age and current salary means for the latter. | |
Note that we are strictly using this plot to efficiently compare roles across several numerical variables, and not to determine which role is better if such a determination even exists. Perception of strength based on overall area is a common misinterpretation of radar plots. | |
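The min-max scaling described above is straightforward to reproduce; a minimal pandas sketch, assuming a hypothetical means DataFrame of group means (rows are job roles, columns are the five numerical variables), might be:
import pandas as pd
# means: hypothetical DataFrame of group means (rows = job roles, columns = numerical variables)
scaled = (means - means.min()) / (means.max() - means.min())  # each column rescaled to [0, 1]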
A lot! Each type of programmer has a unique set of characteristics. | |
Relatively speaking, females appear drawn to user experience design. Asians, South Americans, and Africans appear drawn to mobile development. Data science/engineering and mobile development stick out as the most and least seasoned in terms of education, respectively. | |
Aspiring data scientists/engineers have the highest current salaries, expect the highest next salaries, and have the most programming experience. Front-end developers are the oldest, but not significantly. Full-stack developers dedicate the most amount of time to learning per week. | |
Front-end developers are the least experienced coders and expect the lowest next salaries. UX designers spend the least amount of hours learning weekly and have the lowest current salaries, but not significantly for the latter. Mobile developers are the youngest. | |
You can find a more detailed version of this analysis on Kaggle, where you’ll find statistical tests supporting the inferences in this article. | |
Be sure to check out my other pieces exploring Free Code Camp’s 2016 New Coder Survey: | |
If you have questions or concerns about this series or the R code that generated it, don’t hesitate to let me know.","4" | |
"ctn","https://medium.com/deep-learning-101/algorithms-of-the-mind-10eb13f61fc4","4","{Cognition,""Big Data"",""Machine Learning""}","826","7.19433962264151","Algorithms of the Mind","","7" | |
"foursquare","https://medium.com/foursquare-direct/how-the-trump-presidential-campaign-is-affecting-trump-businesses-c343178e3c03","4","{""Donald Trump"",Foursquare,""Big Data""}","769","5.9","How the Trump Presidential Campaign is Affecting Trump Businesses","At Foursquare, our data scientists are often called upon to analyze real world trends; we use Big Data to determine how commercial fortunes are rising or falling. This year, politics and business are intersecting, as one of the presidential candidates, Donald Trump, has extensive properties including casinos, hotels, and golf courses. Has his campaign been good for Trump-branded business? | |
We have experience tackling these types of questions with a high degree of accuracy. Based on our foot traffic intelligence covering over 50 million users a month, we predicted Apple iPhone 6s sales, a hit Q4 for McDonald’s all-day breakfast, and a tough Q1 for Chipotle. Time and again, our predictions have been proven on-target once these companies announced their earnings. | |
Reporters have lately been asking us if our foot traffic data can shed light on visits to Trump properties, so we decided to take a closer look. To be clear, as a technology company, we’re not in the game of taking a stance on political questions. But we are interested in the power of data to illuminate cultural trends. So let’s look at the numbers. | |
Have The Donald’s politics trumped Trump businesses? | |
It turns out the data is fairly clear: Since Donald Trump announced his candidacy in June 2015, foot traffic to Trump-branded hotels, casinos and golf courses in the U.S. has been down. Since spring, it’s fallen more. In July, Trump properties’ share of visits fell 14% year over year, for instance. | |
There has been an interesting arc over the last year. Before Trump announced his presidential bid, foot traffic to his properties was steady year-over-year — and maybe even saw a small uptick. After he entered the race, his branded properties failed to get their usual summertime traffic gains. In August 2015, the share of people coming to all Trump-branded properties was down 17% from the year before. | |
These losses stabilized to single digits for a number of months, but as Primary voting season hit full swing in March 2016, share losses grew again. Trump properties did not get their usual springtime bounce of travelers and locals. March share was down 17% once more. | |
The properties that were hardest hit were the Trump SoHo, Trump International Hotel & Tower Chicago and Trump Taj Mahal, down 17–24% in raw foot traffic this past year as compared to the previous year. Incidentally, Trump Taj Mahal just yesterday announced that it will be closing its doors after Labor Day, citing an ongoing employee strike as the reason; our foot traffic report shows the problems ran deeper. | |
We also decided to look at how foot traffic might differ between “blue state” and “red state” locations. Trump properties include a number of award-winning hotels (particularly Trump International Hotel & Tower Chicago). However, his hotels, casinos, and golf courses are mainly located in reliably “blue” Democratic states, and depend highly on guests and visitors who live in the region. | |
Breaking out Blue States, the loss in foot traffic runs deeper than the national average. For the past five months, Trump’s blue state properties — spread between New York, New Jersey, Illinois, and Hawaii — have taken a real dip, with diminishing visits starting in March and a widening gap that continues straight through July, when share fell 20% versus July 2015. | |
When we dissect this traffic further, we see that the market share losses have been driven by a fall-off among women. Trump properties have seen a double-digit decrease in visits from women this year, with a gap that widened starting in March 2016. (The one anomaly was February, for unclear reasons.) In July, visit share among women to Blue State properties was down 29%. This seems to reflect the gender division in the polls among American women. | |
Foot traffic for Trump-branded properties in purple ‘swing states’ tells a different story. These states have fluctuated greatly over the campaign. They have seen share loss, but it’s more favorable territory for the Trump brand. | |
When Trump was battling for the nomination against his final competitors from March to May, fewer people were visiting Trump properties in Las Vegas and Miami. Sentiment pivoted once more around the Republican Convention in July. July’s bounce from -20% share in June to -3% in July in Purple State locations is notable. | |
We Leave It to the Pundits to Interpret All This | |
As a location intelligence company, helping both consumers and businesses make smarter decisions in the real world, our job is simply to report the data. | |
For fans of Trump, the business losses may simply reflect the cost of sticking by his campaign statements and beliefs. For critics of Trump, the fact that more people are staying away from Trump-branded properties may reflect people “voting with their feet.” | |
Additionally, many Trump-branded properties have different owners and have licensed the Trump brand, so the economic impact on Mr. Trump himself may be very small in such cases. | |
Regardless of how you feel about the election, we hope our data analysis encourages more people to pay close attention, and illuminates how technology is opening up new kinds of societal understanding based on mobile usage. | |
A Word About Methodology and Market Share | |
At Foursquare, we understand foot traffic trails from more than 50 million monthly global users of our consumer apps (Foursquare and Swarm) and our websites, which people use to explore the world and check in. These location-based apps help us to see — always anonymously and in the aggregate — trends and other notable shifts. | |
To analyze foot traffic patterns to the dozens of Trump-branded hotels, casinos and golf courses in this study, Foursquare looked at explicit check-ins as well as implicit visits from Foursquare and Swarm app users who enable background location and visit these locations in the U.S. | |
Like pollsters and data scientists have been doing for decades, we normalize our data against U.S. census data, ensuring that our panel of millions accurately matches the U.S. population to remove any age or gender bias (though urban geographies are slightly over-represented in our panel). Foursquare’s proprietary understanding of Place Shapes and our ability to detect when mobile phones enter or exit over 100 million businesses and places around the world is the foundation of Place Insights, our product for analysts and marketers. | |
In this analysis we looked at “market share,” measuring how visits to Trump properties changed over time relative to competitive properties in the same area. We do this so we can best understand shifts within the hotel, casino, and golf course markets. For example: Trump Soho’s visits are reviewed alongside visits to all hotels in the New York City DMA, so when there’s a seasonal dip, we’re not attributing it as a dip in absolute visits. We frequently look at market share in our Place Insights product for marketers, to understand how a company, such as a fast food chain or a hotel group, is winning or losing against its competitive set. | |
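A sketch of what such a market-share calculation might look like (hypothetical column names, not Foursquare’s actual pipeline), comparing visits to Trump hotels against all hotels in the same DMA and month:
import pandas as pd
# visits: hypothetical DataFrame with columns [month, venue, dma, category, visits]
hotels = visits[visits["category"] == "hotel"]
dma_totals = hotels.groupby(["month", "dma"])["visits"].transform("sum")
hotels = hotels.assign(share=hotels["visits"] / dma_totals)
trump_share = hotels[hotels["venue"].str.contains("Trump")].groupby("month")["share"].sum()
yoy_change = trump_share.pct_change(periods=12)  # assumes one row per month, e.g. July 2016 vs July 2015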
In our research, we also cross-checked our market share analysis against absolute visits, to ensure that the dip in foot traffic share was not due to a sudden increase in traffic to non-Trump venues for reasons unrelated to the Trump properties. In this view, again we see the same decrease in visits to Trump properties by about 10% overall this past year as compared to the previous year. So there’s a clear indicator that visits to Trump properties are, indeed, down. | |
Whether the loss in visits is coming from sightseers versus paying hotel guests is unclear. Traffic does not always equate with revenue. We do not claim to know the relationship between reduced walk-in visitors and reduced revenue to the properties, especially since these Trump properties do not publish their historical financials to establish correlations over time. | |
The next several months will be telling for so many reasons. Trump International Hotel opens in Washington, D.C. in September 2016, just before the presidential election follows in November, and the impact of Trump’s campaign on its opening success is yet to be seen. | |
*** | |
If you’re interested in further analysis from Foursquare’s Place Insights, visit https://enterprise.foursquare.com/insights.","4" | |
"quincylarson","https://medium.com/free-code-camp/with-open-data-you-finally-get-what-your-taxes-already-paid-for-6f1990d98e9","5","{""Data Science"",Startup,Tech,Economy,""Life Lessons""}","767","3.40691823899371","With open data, you finally get what you’ve paid for all these years","If you pay your taxes, you pay for research. | |
Governments take your money and distribute it to their own departments (like NASA and DARPA), to universities and nonprofits in the form of grants, and to corporations in the form of subsidies. | |
Up until now, when you attempt to access the fruits of that research — academic papers and their data — you get paywalled: | |
But things are changing fast. | |
This week, NASA announced that from now on all, all of its peer-reviewed papers and their datasets will be publicly accessible. | |
And NASA’s in good company — New York City recently launched an open data portal. So did Singapore. | |
And earlier this year, the European Union announced that all of the results of its publicly funded research will be freely available by 2020. | |
There are a million reasons why all of this research and data should be open. For one thing, it improves accountability. | |
Earlier this summer, a data scientist got a parking ticket. Instead of just contesting it in court, he analyzed New York City’s open data and discovered that the NYPD was systematically ticketing legally parked cars to the tune of millions of dollars per year. Using this insight — and the open data to back it up — he was able to put an end to the practice. | |
This is just one example of what can happen when anyone with an internet connection can dig into a dataset, or re-run a study’s numbers and attempt to reproduce its results. | |
Most published scientific research cannot be reproduced. Now that well-meaning outsiders are armed with open data, they can help isolate the signal from the noise and expedite our quest for the truth. | |
Another reason that data and research should be open is that science builds upon itself. | |
You no longer have to be a scientist working at NASA to be able to access its data. Other countries’ space programs can now benefit from these datasets — as can private sector efforts like SpaceX and Blue Origin. | |
And scientists aren’t the only ones who benefit from open data. A bootstrapped startup in Lagos can now design a new product based on open research. A nonprofit in Dhaka can now glean insights from open datasets and mount a fund raising campaign around them. | |
Think about all the economic benefits of open source software. 75% of smart phones run on Android. 80% of servers run on Linux. And this is just the tip of the open source iceberg. | |
The open data movement will unleash even more human potential and economic activity. It will speed up innovation everywhere. | |
With the EU making open data the law in 2020, hopefully other governments will soon follow suit. | |
In the meantime, here are some places you can explore open data right now: | |
And to celebrate the open data movement, we created an open data t-shirt — designed by camper Kosta Cemerikic — and released under a creative commons license. | |
The battle for open data isn’t over, but we’re getting there. | |
You can help by making use of all these open datasets, and by releasing your own data under an open data license. | |
You can also raise awareness of the open data movement by telling your friends about it. You can start by sharing this article with them 😉 | |
If you liked this, click the💚 below so other people will see this here on Medium.","1" | |
"drmdhumphries","https://medium.com/the-spike/how-a-happy-moment-for-neuroscience-is-a-sad-moment-for-science-c4ba00336e9c","1","{Science,Neuroscience,""Data Science"",Politics,""Big Data""}","711","3.87547169811321","How a happy moment for neuroscience is a sad moment for science","The Allen Institute for Brain Science released a landmark set of data in June. Entitled the “Allen Brain Observatory”, it contains a vast array of recordings from the bit of cortex that deals with vision, while the eyes attached to that bit of cortex were looking at patterns. Not too exciting, you say. In some respects you’d be right: some mouse brain cells became active when shown some frankly boring pictures. Experimental neuroscience is eternally lucky that mice have a very high boredom threshold. | |
It took a privately funded institute to release this data; it could not have come from a publicly funded scientist. That is a striking case study in how modern science is worryingly broken, because it prioritises private achievement over the public good.
You see, it’s not the what, but the how. These data are the first complete set of neural activity recordings released before publication. No papers preceded it; not even a report. Nothing. Just: here you go guys, the fruits of the joint labour of around 100 people over 4 years. | |
And all available to anyone, for free. Anyone at all. You, in fact – if you fancy being a neuroscientist for a day, go take a look at it. I’ll just wait here. | |
What did you find? If you found something new to science, or replicated some older work, go ahead and write it up for publication. The Allen Institute claim no jurisdiction over the data at all. It gives you that warm, fuzzy feeling inside. | |
Most scientists would never even contemplate such a manoeuvre. Research needs grants to fund it, and grants need papers. Promotion needs papers. Tenure needs papers. Postdoc positions need papers. Even PhD studentships need papers now, God help us all. Everything needs bloody papers. (Which works well for people like me who enjoy writing; but is a distinct disadvantage for talented scientists who don’t.)
(Last semester, we even got a Faculty-wide email encouraging us to write up our Master’s students’ project work for publication. Because what science needs right now is more unfinished crap.) | |
Data makes papers. Data makes grants. Who would ever release data without first writing up a paper? Who would fund grants to work on data that you’ve already released? Which committees recognise “releasing data” as a principal output when looking for a new job candidate or a promotion? Or assessing the research quality of a university? | |
This all means I’m feeling rather ambivalent about the “Brain Observatory” data. On the one hand, I deeply admire that the philanthropic principles of the Allen Institute extend to giving away their data for free. On the other hand, I’m deeply sad that it takes a billionaire software designer’s philanthropy to make such a thing happen. The Allen Institute is supported by Paul Allen, erstwhile founding partner of Microsoft with Bill Gates, making it an entirely private, self-sufficient research institution. Which brings with it a more corporate approach to science: dedicated teams of specialists solving technical issues, or collecting specific types of data. Their performance targets are tight deadlines for reaching project milestones, rigour of the methods, and quality of the resulting science. | |
Their targets are not papers. Nor money. | |
So, landmark moment for neuroscience that it is, the Allen Institute’s “Brain Observatory” is also a case study in how modern science’s incentives are all wrong. If we only measure the quality of someone’s science by the amount of money they accrue and the number of “impactful” papers they produce, then by definition we are not measuring the quality and rigour of the science itself. It is sad that an entirely private research institute can show up so starkly the issues of publicly-funded science. | |
But this also offers a case study in the solutions to science’s incentive problem. The Allen Institute have shown repeatedly that the quality and rigour of science can be prioritised over quantity of output and money as measures of “success”. Others have also shown how dedicating many resources to long-term projects can produce deep insights and highly beneficial tools for neuroscience. For example, Jeremy Freeman’s team has produced a suite of neuroscience analysis tools for high-performance computing platforms, and Christian Machens’ team has developed a general neuron population analysis framework and applied it to a vast range of datasets.
What all these have in common is their origin in dedicated, privately funded research institutes. These researchers are somewhat immune to the science incentive problem that pervades universities. This is because universities drive the quest for money. Research grants pay a lot towards universities’ infrastructure, services, and administrative people. So universities want grants. And papers, as noted above, play a key role in getting grants: so they want papers too. (In the UK we also have the direct equation that papers = money, thanks to the REF). | |
A solution is thus that universities should adopt the private institute model: stop pressurising researchers to obtain grants and papers. Instead they could spend their own money sustaining the research programmes of their own researchers (rather than on, say, yet more bloody buildings, or administrators). This would remove the pressure to get short-term grants, but leave open the need for high-value grants for major programmes of work. Reward quality and rigour, not output. Reward the work effort, not the luck of the draw in where the paper finally came out.
There, solved. Next week: how nuclear disarmament can be achieved with a teaspoon and an angry badger. | |
(Read Vox’s great piece on the problems facing science here) | |
If you liked this, please click the 💚 below so other people can read about it on Medium. | |
Read more by this author","2" | |
"eklimcz","https://medium.com/truth-labs/re-thinking-reading-on-the-web-158e789eddd7","10","{Innovation,UX,""Data Visualization""}","704","7.69905660377358","Re-thinking reading on the Web","","1" | |
"bjacobso","https://medium.com/swlh/is-it-brunch-time-ffe3adf485d8","10","{""Data Science"",Twitter,Mathematics,""Data Visualization"",Brunch}","666","5.88018867924528","Is it brunch time?","I created isitbrunchtimeyet.com a few years ago as a joke[1]. It has since been used mostly as a funny and/or slightly passive aggressive way to ask people to brunch. When I created it I arbitrarily set the brunch time to be 10:15am to 11:45am. It is that decision which leads us here — I want to do better than arbitrary. | |
Twitter is a platform that allows users to “get real-time updates about what matters to you.”[2] If we assume people generally tweet about things while they are doing those things then we can assume tweets containing “brunch” are generally happening while that person is actually having brunch. Therefore, if we collect enough tweets over a long enough period of time and analyze the time at which they were tweeted we could infer the specific time range in which brunch falls. | |
We begin by using the Twitter Streaming API. This API allows us to subscribe to search terms, for example “brunch”, and have any tweet matching that term sent to our program in real time. Not only did we collect “brunch” tweets, but we also collected tweets containing “breakfast”, “lunch”, and “dinner” to use as controls (which we will review later). We let the program run from 2015–06–01 to 2016–05–31, which yielded 100M+ tweets for analysis. Twitter is a global platform, so we have to do some additional work to understand the local time of day at which a specific tweet occurred. At the time of each tweet, we analyzed the timezone of the person tweeting and any attached geolocation data (when available). Using this data, we then made an informed estimate of the localized hour for each tweet.
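As a rough illustration of that localization step (not the project’s actual code), here is a minimal pandas sketch. It assumes a hypothetical tweets DataFrame with a UTC created_at timestamp and the user’s utc_offset in seconds; the geolocation fallback is omitted.
import pandas as pd
# tweets: hypothetical DataFrame with columns [created_at (UTC), utc_offset (seconds)]
tweets["created_at"] = pd.to_datetime(tweets["created_at"], utc=True)
local_time = tweets["created_at"] + pd.to_timedelta(tweets["utc_offset"].fillna(0), unit="s")
tweets["local_hour"] = local_time.dt.hour
counts_by_hour = tweets.groupby("local_hour").size()  # tweet count per localized hour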
As a result, we are able to break down a count of tweets by localized hour: | |
Simply looking at the histogram above one will quickly notice that 11am is the most popular hour to tweet about brunch — which therefore means 11am to 12pm must be brunch time, right? | |
Well, that is a perfectly reasonable conclusion to draw. However, we think there is room for improvement. First, this solution lacks any real fidelity: it seems unlikely that brunch starts exactly at 11am and ends exactly at 12pm. Second, if we look at the histogram again, we can see that activity at 12pm is very close to that at 11am, which tells us that a lot of brunch activity takes place in the hours following 11am and is not captured by this solution.
Since we are visualizing a histogram, it is logical to jump to a distribution function for further analysis, and since our data has a positive skew we will look to a suitable probability density function (PDF). To start, we fit the lognormal distribution[5] PDF (the red line below) and locate the maximum point of that function (the mode). We chose the lognormal over the normal distribution because it appeared to fit the data better. We then identify the range, centered around that maximum, in which a large portion of all tweets (let’s say 1/4, or 25%) occur:
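A minimal SciPy sketch of this step, assuming hours_shifted is a hypothetical array of localized tweet hours shifted so that 3am maps to just above zero, and reading “centered around that maximum” as widening a window symmetrically around the mode:
import numpy as np
from scipy import stats
shape, loc, scale = stats.lognorm.fit(hours_shifted, floc=0)  # fit a lognormal to the shifted hours
dist = stats.lognorm(shape, loc, scale)
mode = scale * np.exp(-shape ** 2)  # mode of a lognormal distribution
low = high = mode
while dist.cdf(high) - dist.cdf(low) < 0.25:  # widen until the window holds 25% of the mass
    low -= 0.01
    high += 0.01
# adding 3 back to low and high converts them to clock hours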
This strategy gives us a brunch time of 9:25am to 12:24pm. However, it has a few drawbacks. First, the 25% threshold is an arbitrary value; it is no better than 30% or 20% or any other number. Second, the lognormal PDF depends heavily on how we set up the x-axis. In this example we start at 3am, which we chose only because it generally had the lowest tweet count in our data set (otherwise the distribution would look slightly bimodal). If we instead started the x-axis at, say, 12am, we would get a different curve. Third, looking at the graph, this approach has a similar issue to our first solution in that it isn’t really capturing the long tail of activity in the early afternoon. Each of these arbitrary assumptions is an unacceptable tradeoff in our quest for an objective answer.
Splines[3] allow us to create a smooth curve through many different points. Let’s fit a curve through each hour and graph it:
If we look at this curve, we can see that its maximum aligns very nicely with where we would intuitively assume brunch is occurring based on the histogram. Let’s give this maximum a name: how about Brunch Point? Brunch Point can be defined as the exact time of day at which brunch maximally occurs. Based on our data, that time is 11:56am.
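Continuing the hypothetical variables from the earlier sketch, fitting a cubic spline through the hourly counts and searching for its maximum might look like this:
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.optimize import minimize_scalar
hours = np.arange(24)  # localized hours 0..23
spline = CubicSpline(hours, counts_by_hour.reindex(hours, fill_value=0))
res = minimize_scalar(lambda h: -spline(h), bounds=(0, 23), method="bounded")  # maximize the spline
brunch_point = res.x  # the article reports roughly 11.93, i.e. 11:56am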
Now, that appears to be an excellent answer! However, we are looking for a time range, not just a point in time. So how can we calculate a range from the spline? Remember calculus from high school? This is the real-life use case for calculus that your professors promised! Let’s look at the derivative:
Don’t have your calculus books handy? What is a derivative? The derivative of a function gives us its rate of change (a.k.a. the slope, a.k.a. the m in y = mx + b) at any given point.[4] In other words, it tells us how quickly the function is rising or falling. Let’s find the tangent lines with the highest and lowest slope:
At that maximum slope, the number of brunch tweets is experiencing its highest positive rate of change: people are speeding up their tweets about brunch — pushing hardest on the gas pedal. At the minimum slope, the number of brunch tweets is experiencing its highest negative rate of change: people are slowing down their tweets — pushing hardest on the brake pedal. Using these two points we get a brunch time of: 10:01am to 1:40pm. | |
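Continuing the same sketch, the start and end of brunch fall out as the extrema of the spline’s first derivative, which occur where its second derivative is zero:
d1, d2 = spline.derivative(1), spline.derivative(2)
candidates = d2.roots()  # candidate points where the slope is at a local max or min
candidates = candidates[(candidates > 0) & (candidates < 23)]
start = candidates[np.argmax(d1(candidates))]  # steepest rise: brunch begins (~10:01am in the article)
end = candidates[np.argmin(d1(candidates))]  # steepest fall: brunch ends (~1:40pm in the article)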
We like this method the most for several reasons: | |
But how do we know if it’s correct? Let’s apply the solution to other terms and observe how it performs. | |
For comparison let’s look at the splines for “breakfast” and “lunch” tweets: | |
Since “breakfast” and “lunch” are more popular terms on Twitter than “brunch”, let’s modify these curves to be based on frequency, which makes it easier to see the distinctions between them. Then let’s also apply the solution above to get a sense of how they compare:
Finally, we have some results: | |
So that’s it — Brunch is officially from 10:01am to 1:40pm. It is interesting that the end of breakfast and the beginning of lunch are within 15 minutes of each other. It’s also interesting that Brunch Point falls in that same time frame as well. Both of which give this solution at least a little bit of affirmation. | |
Now go eat some brunch! | |
tl;dr — Brunch is officially from 10:01am to 1:40pm, based on 100M+ tweets and analysis. | |
All analysis and charts were done in Python using pandas and matplotlib.
Source: https://github.com/bjacobso/brunch | |
[1] Very much inspired by isitchristmas.com and isitlunchtimeyet.com | |
[2] https://about.twitter.com/ | |
[3] https://en.wikipedia.org/wiki/Spline_interpolation | |
[4] https://en.wikipedia.org/wiki/Derivative | |
[5] Excellent feedback from Reddit","2" | |
"mbostock","https://medium.com/@mbostock/introducing-d3-scale-61980c51545f","12","{D3,JavaScript,""Data Visualization""}","645","5.21792452830189","Introducing d3-scale","","1" | |
"hoffa","https://medium.com/@hoffa/400-000-github-repositories-1-billion-files-14-terabytes-of-code-spaces-or-tabs-7cfe0b5dd7fd","3","{Bigquery,""Big Data"",""Open Source"",Github,""Google Cloud Platform""}","604","2.51981132075472","400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?","I used the already existing [bigquery-public-data:github_repos.sample_files] table, that lists the files of the top 400,000 repositories. From there I extracted all the contents for the files with the top languages extensions: | |
That query took a relatively long time since it involved joining a 190 million row table with a 70 million row one, and over 1.6 terabytes of contents. But don’t worry about having to run it, since I left the result publicly available at [fh-bigquery:github_extracts.contents_top_repos_top_langs].
In the [contents] table we have each unique file represented only once. To see the total number of files and size represented: | |
Then it was time to run the ranking according to the previously established rules: | |
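The original query isn’t reproduced here, but as a rough illustration, here is a simplified version of a line-level tab-vs-space count run through the BigQuery Python client against the public results table mentioned above. The column names (sample_path, content) and the counting rule are assumptions for the sake of the sketch, not the article’s exact SQL, and running it will scan the full table against your BigQuery quota.

```python
# A hedged sketch of a tab-vs-space ranking query via the BigQuery Python client.
from google.cloud import bigquery

client = bigquery.Client()  # assumes GCP credentials are configured locally

sql = r"""
SELECT
  REGEXP_EXTRACT(sample_path, r'\.([^\./]+)$') AS ext,
  COUNTIF(STARTS_WITH(line, '\t')) AS tab_lines,
  COUNTIF(STARTS_WITH(line, ' '))  AS space_lines
FROM `fh-bigquery.github_extracts.contents_top_repos_top_langs`,
UNNEST(SPLIT(content, '\n')) AS line
WHERE STARTS_WITH(line, '\t') OR STARTS_WITH(line, ' ')
GROUP BY ext
ORDER BY tab_lines + space_lines DESC
"""

df = client.query(sql).to_dataframe()
print(df.head(20))
```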
Analyzing each line of 133 GB of code in 16 seconds? That’s why I love BigQuery.
Reddit, Hacker News, Slashdot | |
Gizmodo, Business Insider, ADTMag, i-programmer | |
Gizmodo .es, Business Insider .pl, Le blog du Modérateur, Tecnoblog (pt), Biglobe (jp), My Drivers (cn), Genk .vn. | |
Want more stories? Check my Medium, follow me on Twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.","9"
"mattfogel","https://medium.com/swlh/the-7-best-data-science-and-machine-learning-podcasts-e8f0d5a4a419","12","{Podcast,""Data Science"",""Machine Learning""}","588","3.35","The 7 Best Data Science and Machine Learning Podcasts","Data science and machine learning have long been interests of mine, but now that I’m working on Fuzzy.ai and trying to make AI and machine learning accessible to all developers, I need to keep on top of all the news in both fields. | |
My preferred way to do this is through listening to podcasts. I’ve listened to a bunch of machine learning and data science podcasts in the last few months, so I thought I’d share my favorites: | |
A great starting point on some of the basics of data science and machine learning. Every other week, they release a 10–15 minute episode where hosts Kyle and Linda Polich give a short primer on topics like k-means clustering, natural language processing and decision tree learning, often using analogies related to their pet parrot, Yoshi. This is the only place where you’ll learn about k-means clustering via placement of parrot droppings.
Website | iTunes | |
Hosted by Katie Malone and Ben Jaffe of online education startup Udacity, this weekly podcast covers diverse topics in data science and machine learning: teaching specific concepts like Hidden Markov Models and how they apply to real-world problems and datasets. They make complex topics extremely accessible. | |
Website | iTunes | |
Each week, hosts Chris Albon and Jonathon Morgan, both experienced technologists and data scientists, talk about the latest news in data science over drinks. Listening to Partially Derivative is a great way to keep up on the latest data news. | |
Website | iTunes | |
This podcast features Ben Lorica, O’Reilly Media’s Chief Data Scientist, speaking with other experts about timely big data and data science topics. It can often get quite technical, but the topics of discussion are always really interesting.
Website | iTunes | |
Data Stories is a little more focused on data visualization than data science, but there is often some interesting overlap between the topics. Every other week, Enrico Bertini and Moritz Stefaner cover diverse topics in data with their guests. Recent episodes about smart cities and Nicholas Felton’s annual reports are particularly interesting. | |
Website | iTunes | |
Billing itself as “A Gentle Introduction to Artificial Intelligence and Machine Learning”, this podcast can still get quite technical and complex, covering topics like: “How to Reason About Uncertain Events using Fuzzy Set Theory and Fuzzy Measure Theory” and “How to Represent Knowledge using Logical Rules”. | |
Website | iTunes | |
The newest podcast on this list, with 8 episodes released as of this writing. Every other week, hosts Katherine Gorman and Ryan Adams speak with a guest about their work, and news stories related to machine learning.
Website | iTunes | |
Feel I’ve unfairly left a podcast off this list? Leave me a note to let me know. | |
-","1" | |
"giorgialupi","https://medium.com/accurat-studio/the-architecture-of-a-data-visualization-470b807799b4","24","{""Data Visualization"",Design,Journalism}","561","9.85566037735849","The Architecture of a Data Visualization","","2" | |
"perborgen","https://medium.com/xeneta/boosting-sales-with-machine-learning-fbcf2e618be3","11","{""Machine Learning"",""Data Science"",Startup}","515","6.84339622641509","Boosting Sales With Machine Learning","In this blog post I’ll explain how we’re making our sales process at Xeneta more effective by training a machine learning algorithm to predict the quality of our leads based upon their company descriptions. | |
Head over to GitHub if you want to check out the script immediately, and feel free to suggest improvements as it’s under continuous development. | |
It started with a request from business development representative Edvard, who was tired of performing the tedious task of going through big excel sheets filled with company names, trying to identify which ones we ought to contact. | |
This kind of pre-qualification of sales leads can take hours, as it forces the sales representative to figure out what every single company does (e.g. by reading about them on LinkedIn) so that he/she can make a qualified guess at whether or not the company is a good fit for our SaaS app.
And how do you make a qualified guess? To understand that, you’ll first need to know what we do: | |
More specifically, if your company ships above 500 containers per year, you’re likely to discover significant saving potential by using Xeneta, as we’re able to tell you exactly where you’re paying above the market average price. | |
This means that our target customers are vastly different from each other, as their only common denominator is that they’re somewhat involved in sea freight. Here are some examples of company categories we target: | |
Though the broad range of customers represents a challenge when finding leads, we’re normally able to tell if a company is of interest for Xeneta by reading their company description, as it often contains hints of whether or not they’re involved in sending stuff around the world. | |
This made us think: | |
If so, this algorithm could prove to be a huge time saver for the sales team, as it could roughly sort the excel sheets before they start qualifying the leads manually.
As I started working on this, I quickly realised that the machine learning part wasn’t the only problem. We also needed a way to get hold of the company descriptions.
We considered crawling the companies’ websites and fetching the About us section. But this smelled like a messy, unpredictable and time consuming activity, so we started looking for APIs to use instead. After some searching we discovered FullContact, which has a Company API that provides you with descriptions of millions of companies.
However, their API only accepts company URLs as input, and these URLs are rarely present in our excel sheets.
So we had to find a way to obtain the URL’s as well, which made us land on the following workflow: | |
There’s of course a loss at each step here, so we’re going to find a better way of doing this. However, this worked well enough to test the idea out. | |
Having these scripts in place, the next step was to create our training dataset. It needed to contain at least 1000 qualified companies and 1000 disqualified companies. | |
The first category was easy, as we could simply export a list of 1000 Xeneta users from SalesForce. | |
Finding 1000 disqualified companies was a bit tougher though, as we don’t keep track of the companies we’ve avoided contacting. So Edvard manually disqualified 1000 companies.
With that done, it was time to start writing the natural language processing script, with step one being to clean up the descriptions, as they are quite dirty and contain a lot of irrelevant information. | |
In the examples below, I’ll go through each of the cleaning techniques we’re currently applying, and show you how a raw description ends up as an array of numbers.
The first thing we do is to use regular expressions to get rid of non-alphabetical characters, as our model will only be able to learn words.
We also stem the words. This means reducing multiple variations of the same word to its stem. So instead of accepting words like manufacturer, manufaction, manufactured & manufactoring, we simply reduce them all to manufact.
We then remove stop words, using the Natural Language Toolkit. Stop words are words that have little relevance for the conceptual understanding of the text, such as is, to, for, at, I, it etc.
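As a rough sketch, these three cleaning steps could look something like the snippet below. The exact regex, stemmer, stop word list and example description are assumptions; the real script is in the GitHub repo linked above.

```python
# A minimal sketch of the cleaning steps: regex, stemming, stop-word removal.
import re
from nltk.corpus import stopwords            # may require nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

def clean_description(text):
    # 1. Keep only alphabetical characters.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # 2. Lowercase, split into words, drop stop words, stem the rest.
    words = [stemmer.stem(w) for w in text.lower().split() if w not in stop_words]
    return " ".join(words)

print(clean_description("We manufacture and ship 700+ containers to Asia every year."))
# -> e.g. "manufactur ship contain asia everi year"
```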
But cleaning and stemming the data won’t actually help us do any machine learning, as we also need to transform the descriptions into something the machine understands, which is numbers. | |
For this, we’re using the Bag of Words (BoW) approach. If you’re not familiar with BoW, I’d recommend you to read this Kaggle tutorial. | |
BoW is a simple technique to turn text phrases into vectors, where each item in the vectors represents a specific word. Scikit-learn’s CountVectorizer gives you a super simple way to do this:
The max_features parameter tells the vectorizer how many words you want to have in your vocabulary. In this example, the vectorizer will include the 5000 words that occur most frequently in our dataset and reject the rest of them.
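A minimal sketch of that step, using placeholder descriptions instead of our real training data:

```python
# Bag of Words with scikit-learn's CountVectorizer, on placeholder data.
from sklearn.feature_extraction.text import CountVectorizer

# Stand-ins for the cleaned company descriptions.
cleaned_descriptions = [
    "manufactur ship contain asia everi year",
    "logist provid sea freight forward servic",
    "boutiqu coffe roaster local deliveri",
]

vectorizer = CountVectorizer(analyzer="word", max_features=5000)
train_counts = vectorizer.fit_transform(cleaned_descriptions)   # sparse count matrix

print(train_counts.shape)                    # (n_descriptions, vocabulary size)
print(sorted(vectorizer.vocabulary_)[:10])   # a peek at the learned vocabulary
```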
Finally, we also apply a tf-idf transformation, which is short for term frequency inverse document frequency. It’s a technique that adjusts the importance of the different words in your documents.
More specifically, tf-idf will emphasise words that occur frequently in a description (term frequency), while de-emphasising words that occur frequently in the entire dataset (inverse document frequency).
Again, scikit learn saves the day by providing tf-idf out of the box. Simply fit the model to your vectorized training data, and then use the transform method to transform it. | |
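Continuing the sketch from the previous snippet (it reuses the train_counts matrix produced there):

```python
# tf-idf re-weighting of the raw counts from the CountVectorizer sketch above.
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2", use_idf=True)
tfidf.fit(train_counts)                       # learn the idf weights
train_tfidf = tfidf.transform(train_counts)   # re-weight the raw counts
```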
After all the data has been cleaned, vectorised and transformed, we can finally start doing some machine learning, which is one of the simplest parts of this task. | |
I first sliced the data into 70% training data and 30% testing data, and then started off with two scikit learn algorithms: Random Forest (RF) and K Nearest Neighbors (KNN). It quickly became clear that RF outperformed KNN, as the former quickly reached more than 80% accuracy while the latter stayed at 60%. | |
Fitting a scikit learn model is super simple: | |
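The snippet below is an illustrative sketch of that step rather than our exact code: a 70/30 split followed by fitting both classifiers, with random placeholder data standing in for the tf-idf vectors and the qualified/disqualified labels.

```python
# A sketch of the 70/30 split plus Random Forest and KNN, on placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(2000, 500)                 # stand-in for ~2000 tf-idf vectors
y = rng.randint(0, 2, size=2000)        # 1 = qualified, 0 = disqualified

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("RF accuracy: ", forest.score(X_test, y_test))
print("KNN accuracy:", knn.score(X_test, y_test))
```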
So I continued with RF to see how much I could increase the accuracy by tuning the following parameters: | |
With these parameters tuned, the algorithm reaches an accuracy of 86.4% on the testing dataset, and is actually starting to become useful for our sales team.
However, the script is by no means finished. There are tons of ways to improve it. For example, the algorithm is likely to be biased towards the kind of descriptions we currently have in our training data. This might become a performance bottleneck when testing it on more real-world data.
Here are a few activities we’re considering for the road ahead:
We’ll be pushing to GitHub regularly if you want to follow the progress. And feel free to leave a comment below if you have anything you’d like to add. | |
Cheers, | |
Per Harald Borgen | |
Thanks for reading! We are Xeneta — the world’s leading sea freight intelligence platform. We’re always looking for bright minds to join us, so head over to our website if you’re interested! | |
You can follow us on both Twitter and Medium.","1"
"mhkt","https://medium.com/@mhkt/siri-what-do-you-think-of-me-f9b78afc77a2","2","{Tech,""Data Science"",Privacy}","505","5.79842767295598","Siri, What Do You Think of Me?","“You haven’t seen the new Viagra ads?! They’re ridiculous.” | |
I have a couple of friends over for dinner, and after a bottle of wine sinks in, our conversation turns to recent highlights in bizarre video advertising (ads: they kinda work!). I wake up my computer to correct their ignorance of the absurd woman-lustily-tossing-a-football-and-applying-perfume spot. | |
I instinctively open an Incognito tab in Chrome. | |
Why am I secretly viewing something I am about to show so publicly? Well … I don’t want Facebook to start serving me Viagra ads unbidden or Google Now to helpfully insert facts about impotence, its symptoms and treatment, on my phone’s home screen.
Where I’m going, I don’t want the algorithms to follow. | |
For me, and I suspect many others, going into the shadows like this has become second nature. I open up a private tab whenever I’m doing something a little off my personal brand, something I wouldn’t want Siri to throw back in my face. I hide from the algorithmic version of myself.¹ | |
With the unassuming explanation above, Ehsan Akhgari, an engineer working on then-dominant browser Firefox, introduced private browsing to the world in December 2008. It was three years after the founding of YouTube, yet the language reads today as Lassie-level quaint. He imagines a snooping coworker, an untrusting spouse — real live humans who might catch a detail you don’t want them to see. | |
Akhgari couldn’t have anticipated how much of our daily experience is delivered by today’s data-insatiable products. Seven years later, I use private browsing not to avoid prying human eyes, but to hold back data points from a machine. I use these features to stop the accretion of every term searched, every profile clicked, every video viewed, every blog to the platform’s representation of me. | |
I’m not hiding from my coworkers when I go Incognito, I just don’t want whatever silly thing I tap to change who Google thinks I am. I’m limply asserting the right to define how the machine sees me, not climbing the high horse of some Victorian notion of “privacy.” | |
ME: What’s with these weird Bluetooth-enabled bike shorts ads everywhere?
ALGO_ME: I clicked a Facebook ad for a smart workout shirt in late 2013! It was pretty neat, I read the marketing material for 4.6 minutes.
ME: Okay, but, maybe it was just because the guy looked fit and that was a shitty period in my personal exercise hab —
ALGO_ME: Great, I love exercise! Going to go ahead and up my score in “weight consciousness & personal wellness” to a 7.8/10
ME: Weight loss?? That’s so not my —
ALGO_ME: Look at this amazing ultra-high-protein nootropic appetite suppressant gel! Great design too. It’s kind of a hipper Soylent. I spent 66.8 minutes reading about Soylent in 2014, and even commented with 80%-likely positive sentiment about it on a friend’s post. It’s perfect.
ME: That was … opposition research … and sarcasm? Fine, fine, show me the gel.
Machine learning products are ubiquitous because they’re extremely useful. Google Now’s ravenous ingestion of all of my email in exchange for notifications of delayed flights and social engagements is … kinda great. More subtly, I would miss many births, promotions, illnesses, and other life events of all but my closest friends if Facebook didn’t have a keen sense that these milestones are why I look at News Feed. Not to mention Foursquare’s making me seem hipper than I am choosing restaurants on vacation, thanks to its knowledge of every business I have stepped foot in for six years. | |
You would be forgiven for believing companies are willing to use as much data as they can get away with, adding your every click to the model of your algorithmic self. The origins of real algorithmic products are much more haphazard. I know this because I’ve had a hand in making several of them. | |
Startups and even massive platforms develop features so provisionally, there’s often no notion that the data byproducts of a feature will ever be used in data mining. At Tumblr, the ♥ button existed for two years before we fed all those heart-ings into an algorithm to generate a (helpful!) list of “Blogs You Might Like.”
As an engineer setting out to create a feature like recommended blogs — a feature I think you’ll really like— I go on a fishing expedition through your actions. I pull in data about what you tapped when, who you follow, who the people you follow follow, how often you login, and on and on. I assign numeric values to each of your acts and relationships.² | |
Your past and future life on the product becomes a many-many-dimensional matrix. This matrix is nameless. To me, it is a banal, temporary set of mathematical facts to be compared to others’ matrices to generate a useful recommendation: a blog to follow. But this matrix is you. | |
To my fellow technologists, anthropomorphizing the math this way may seem ridiculous.³ But these algorithm-friendly representations decide what millions of people read, buy, eat, and watch every day. The algorithmic self is only inert and abstract because we technologists choose to keep it hidden in the machine. | |
Could we name these abstractions and let you see them? We squirm. The sum total of your affinities won’t be flattering. The categories you float through will be abstract and hard to name. But this is a discomfort worth diving into headfirst. | |
We created a Frankenstein double of you; we could at least finish the job and animate him.
Probabilities (in %) based on volunteered data and past interactions | |
Demographics: | |
Top monetizable interests: | |
Waking hours: 6:30–7:28am (snooze-dependent) to 11pm ET
Mindless insomnia: 2:30 to 3:30am ET
Days without leaving the house (monthly): ~2.2
Ducking from algorithms by opening a “private” tab is the weakest of hedges. I get the distinct feeling it is like spelling the bad things L-E-T-T-E-R B-Y L-E-T-T-E-R in front of a precocious child: temporarily useful but ultimately futile.
With the rising wave of products with AI at their core, our algorithmic selves are soon to play an even larger role in our real selves’ lives. It would be nice to meet our other halves. | |
Why shouldn’t apps tell us what they really think of us? | |
¹ I wonder about going out in algorithmic drag, purposefully creating false versions of myself just for the machines. On Monday, hop on Twitter as a suburban working mom, getting served up ads for childcare marketplaces and mini-vans. Tuesday, hit Pinterest in the guise of a hedge fund manager remodeling his Hamptons estate, and see the infinity pool recommendations that come with. Etc, etc. | |
² I took some creative license in describing the creation of such a feature here. See this talk by Adam Laiacano, who built the actual version of blog recommendations at Tumblr, to understand the complex technical work involved. | |
³ More importantly, my use of “algorithmic” as shorthand for a wide range of technology is reductive. At least I didn’t call it “Big Data,” okay? | |
If you liked this article, why not subscribe to my newsletter and get my unfiltered commentary and a handful of links for product makers in your email every week?","1" | |
"matthew_daniels","https://medium.com/@matthew_daniels/the-journalist-engineer-c9c1a72b993f","11","{""Data Visualization"",Journalism,Strategy}","470","5.43584905660377","The Journalist-Engineer","Lately, some of the best articles in the NY Times and Bloomberg are 99% code. The end-product is predominantly software, not prose. | |
Here’s an example: the NY Times’ mapping of migration in the US. | |
Several years ago, this article might have been a few thousand words. There’d be tables and charts. They’d reference academic studies and correlate the data with something like unemployment. | |
This example is different. It’s a well-designed data dump. It’s raw numbers without any abstractions. There’s no attachment to the news cycle. There’s no traditional thesis. It cannot be made in Photoshop or Illustrator. You must write software. | |
It represents the present-day revolution within news organizations. Some call it data journalism. Or explorable explanations. Or interactive storytelling. Whatever the label, it’s a huge shift from ledes and infographics. | |
Here’s another example: a graphic from the NY Times on yield curve data: | |
The story is the code. It depicts the yield curve, an incredibly complex system, in all of its glory. It’s an amazing piece of software (I bet financial companies would even buy it). | |
Here’s another example: The Parable of the Polygons, an explanation of an academic paper about segregation. | |
It’s a very elegant presentation of a system using code. For reference, here’s how the original author conveyed the idea, back in 1991.
No need to hate on the design — it worked just fine, pre-Internet. But today, code makes the possibilities so much richer. | |
Here are two more excellent depictions of complex ideas using code, one on OPEC prices and another on machine learning:
Note that all of these examples brilliantly include some prose. There’s an expert presenting her beliefs about the data, which acts as a guide to the data and launches the reader into their own discovery process. | |
Creative coders turned their sights from media art to journalism. They’re writing software about ideas that have eluded traditional news organizations, either because they were too complex to explain in prose or they were trapped in a spreadsheet/academic paper. | |
And that’s what I’m doing with Polygraph…liberating those ideas. | |
A couple months ago, I published an article comparing historic and present-day popularity of older music. I used two huge datasets: 50,000 Billboard songs and 1.4M tracks on Spotify.
If I were writing an academic paper, I’d do a ton of analysis, regression, and modeling to figure out why certain songs have become more popular over time. | |
Or I could just make some sick visualizations… | |
Instead of reporting on my “theory”, I wagered that readers would get more out of an elegant presentation of the data, not an analysis of it. It’s a completely different approach to storytelling. | |
Here’s that same approach on another project: rappers and the size of their vocabulary. The process: depict the system (the vocabulary among rappers, Shakespeare, and Melville) rather than a thesis/point. | |
Instead of proving that one rapper was better than another, I bet on the fact that readers are really good at absorbing the data, and that they’d much rather form their own judgements.
A few years ago, Bret Victor wrote about the notion of passive and active readers: | |
In theory, this sounds great…but kinda crazy. Imagine it: the Internet engaging in intense discourse over data. | |
It would represent a big shift in journalistic voice and place an enormous burden on the reader: “you find the story. You’re the data analyst.” It’s the opposite role of traditional media, which assumes the role of informer: “we have the knowledge; you don’t. We’re an authoritative source. Read, listen, watch this thing we researched.” | |
But it’s happening — there are active readers. I’ve been shocked at readers’ response for the handful of projects that I’ve worked on. Readers feel powerful. They don’t know what to call it — it feels foreign. | |
I believe it’s a response to “too long, didn’t read.” I open a 10,000-word article, and I anxiously wonder whether the time investment will pay off. Maybe the author’s point will suck.
An experience for active readers doesn’t create anxiety. They don’t feel the burden of time — it’s at their own pace. Give readers the right depiction, avoid abstractions, add a narrative to guide them through the experience, and they’ll data science the shit out of a story. | |
There are a few reasons why things are so different in 2015.
News organizations had to accept that code-driven content wouldn’t have a viable print-version. The NY Times launched a data-led blog, The Upshot, to address this tension. Michael Bloomberg is subsidizing all of Bloomberg Business’s engineers in the Editorial Department. No one else is even close to making a head-count commitment. | |
We needed a pretty unique skill-set: people who could design, write, and code. The talent pool has arrived: all of the coders who were creating apps, dashboards, and analytics tools could shift their design sense from users to readers. Like traditional journalists, engineers now had plenty of empathy for how people consume information. | |
D3 came along. Visualizing millions of data points on the Internet used to be impossible. And browsers are now robust enough to render our creations. | |
D3 also made it easier to be creative with design. Pre-D3, we were taking screenshots of charts from Excel. Now, you can create something that best expresses the data, instead of limiting yourself to traditional, pre-Internet design patterns (e.g., bar chart, scatter charts, pie charts, etc.). | |
Here’s an example of that process: an evolution of how Mike Bostock explored various bespoke designs for a visualization of corporate tax rates. | |
Note that very few of these designs would be possible in statistics packages or design programs. Long live D3. | |
There are 4.3 million people subscribed to Reddit’s /r/dataisbeautiful sub. It’s a top 50 subreddit. The Internet has grossly undervalued our intrinsic interest in visualization. I expect the market for this sort of content to explode.
I’m psyched for the next wave of software that makes journalism easier to code. Someone will write a framework for Oculus Rift. Someone will figure out D3 for mobile. Even on desktop, scroll-based events are still in their infancy. | |
In the meantime, I’ll be busy coding.
— | |
Matt Daniels is founder of Polygraph, a publication that explores popular culture with visual storytelling. Here’s my backlog of projects. Help me create them!","3"
"mbostock","https://medium.com/@mbostock/introducing-d3-shape-73f8367e6d12","5","{""Data Visualization"",D3}","470","2.53899371069182","Introducing d3-shape","","1" | |
"sachagreif","https://medium.com/free-code-camp/the-state-of-javascript-2016-results-4beb4ff06961","4","{React,JavaScript,Startups,""Data Science"",""Web Development""}","465","4.37924528301887","The State Of JavaScript 2016: Results","I just looked through my inbox, and found a receipt for the awesome React for Beginners course dated November 4, 2015. So it’s been almost one full year since I ventured into the Wild West of modern JavaScript development. | |
I’m now fairly confident in my React skills, but it seems like as soon as I master one challenge, another one pops up: should I use Redux? Or maybe look into Vue instead? Or go full-functional and jump on the Elm bandwagon? | |
I knew I couldn’t be the only one with these questions, so I decided to launch the State of JavaScript survey to get a more general picture of the ecosystem. Turns out I hit a nerve: within a week, I had accumulated over 9000 responses (no meme intended)!
It took me a while to go through the data, but the results are finally live! | |
And if you’d like to know a little bit more about the whole enterprise, just read on. | |
You might be wondering why it took me so long to analyze and publish the data. Hopefully this will become clear when you read through the report. | |
I didn’t want to simply publish a bunch of charts with no context. Raw stats are great if you already know what you’re looking for, but if you’re looking for guidance then they can just as well add to the overall noise. | |
Instead, I decided to use these stats as a basis for a detailed report on the current state of JavaScript. | |
I was originally planning on writing the whole thing myself, but I quickly realized that A) this would be a lot of work and B) I didn’t want the report to be too biased by my own preconceptions. | |
So I asked a few developer friends to pitch in and write the various sections of the report. Not only is the overall report a lot more objective –and interesting– as a result, but I was also able to get experts for each topic (I’ll be the first to admit that there are entire swathes of the JavaScript world I know little about). | |
So a huge thank you to all the authors who contributed to the report: Tom Coleman, Michael Rambeau, Michael Shilman, Arunoda Susiripala, Jennifer Wong, and Josh Owens.
Here’s a little more info about the main types of chart you’ll see throughout the survey. | |
This is the main chart for each section. For each technology, it shows the breakdown of developers who have never heard of it, have heard of it but aren’t interested/want to learn it, and have used it and would not/would use it again. | |
You can toggle between percents and absolute numbers as well as filter by interest or satisfaction. But note that when filtering, the percentages are relative to the currently selected value pair (in other words both numbers total 100%). | |
I also wanted to explore the correlations between each technology. | |
The heatmap charts achieve this by showing you how likely someone who uses one technology (defined as having selected “I have used X and would use it again”) is to use another technology, compared to the average. | |
Pink means very likely, blue means very unlikely. In other words, a deep pink tile in the React row and Redux column means “React users are a lot more likely than average to also use Redux”. | |
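The survey site itself is built with JavaScript (as described below), but the underlying “compared to the average” calculation is easy to sketch in a few lines of pandas. The technology columns and responses here are made up, and this is an illustration of the idea rather than the survey’s actual code.

```python
# A rough sketch of the lift calculation behind the heatmaps: P(uses B | uses A) / P(uses B).
import pandas as pd

# One row per respondent, True = "used it and would use it again".
usage = pd.DataFrame({
    "React": [True, True, False, True, False],
    "Redux": [True, True, False, False, False],
    "Vue":   [False, False, True, False, True],
})

baseline = usage.mean()   # P(uses B) across all respondents

lift = pd.DataFrame(
    {a: {b: usage.loc[usage[a], b].mean() / baseline[b] for b in usage.columns}
     for a in usage.columns}
).T

# lift.loc["React", "Redux"] > 1 means React users are more likely than average
# to also use Redux (pink in the heatmap); < 1 means less likely (blue).
print(lift.round(2))
```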
I decided to practice what I preached and build the survey app itself using modern JavaScript tools, namely React powered by the excellent Gatsby static site generator. | |
It might seem weird at first to use React for what is essentially a static HTML page, but it turns out this brings a ton of advantages: for example, you’re able to use React’s vast ecosystem of modules such as the great Recharts library. | |
In fact I believe this may just prove to be a new, better approach for developing static sites, and I hope to write a more detailed post about it soon. | |
Finally, I wouldn’t have been able to take a month off to work on this without financial support from some really cool people. | |
Both Wes Bos (who has put out the aforementioned React for Beginners as well as the new ES6 for Everybody) and egghead.io (which in my opinion is the single best resource out there for learning cutting-edge JavaScript development) agreed to sponsor the project. Thanks guys!
If you think what I’ve done here is valuable and would like to support the project, a tweet or share would be much appreciated! | |
Additionally, you can also contribute a donation to get access to the raw anonymized data (or just enter “0” to get it for free). | |
Now that the survey is over and we all know what the best technologies are, hopefully we can put any talks of “JavaScript fatigue” or “endless churn” to rest and move on with our programming lives. | |
Haha, as if! | |
If one thing has become clear to me, it’s that the growing pains that JavaScript is going through right now are only the beginning. While React has barely emerged as the victor of the Front-End Wars of 2015, some developers are already decrying React for not being functional enough, and embracing Elm or ClojureScript instead. | |
In other words, my job here isn’t done, and I fully intend to do this survey again next year! If you want to be notified when that happens, I encourage you to leave me your email here. | |
Until then, I can only hope these survey results will provide a little clarity in our never-ending quest to make sense of the JavaScript ecosystem!","1" | |
"felix.gessert","https://medium.com/baqend-blog/nosql-databases-a-survey-and-decision-guidance-ea7823a822d","10","{NoSQL,Database,""Big Data"",Data,Scalability}","447","25.1216981132075","NoSQL Databases: a Survey and Decision Guidance","Together with our colleagues at the University of Hamburg, we — that is Felix Gessert, Wolfram Wingerath, Steffen Friedrich and Norbert Ritter — presented an overview over the NoSQL landscape at SummerSOC’16 last month. Here is the written gist. We give our best to convey the condensed NoSQL knowledge we gathered building Baqend. | |
Today, data is generated and consumed at unprecedented scale. This has led to novel approaches for scalable data management subsumed under the term “NoSQL” database systems to handle the ever-increasing data volume and request loads. However, the heterogeneity and diversity of the numerous existing systems impede the well-informed selection of a data store appropriate for a given application context. Therefore, this article gives a top-down overview of the field: Instead of contrasting the implementation specifics of individual representatives, we propose a comparative classification model that relates functional and non-functional requirements to techniques and algorithms employed in NoSQL databases. This NoSQL Toolbox allows us to derive a simple decision tree to help practitioners and researchers filter potential system candidates based on central application requirements.
Traditional relational database management systems (RDBMSs) provide powerful mechanisms to store and query structured data under strong consistency and transaction guarantees and have reached an unmatched level of reliability, stability and support through decades of development. In recent years, however, the amount of useful data in some application areas has become so vast that it cannot be stored or processed by traditional database solutions. User-generated content in social networks or data retrieved from large sensor networks are only two examples of this phenomenon commonly referred to as Big Data. A class of novel data storage systems able to cope with Big Data are subsumed under the term NoSQL databases, many of which offer horizontal scalability and higher availability than relational databases by sacrificing querying capabilities and consistency guarantees. These trade-offs are pivotal for service-oriented computing and as-a-service models, since any stateful service can only be as scalable and fault-tolerant as its underlying data store. | |
There are dozens of NoSQL database systems and it is hard to keep track of where they excel, where they fail or even where they differ, as implementation details change quickly and feature sets evolve over time. In this article, we therefore aim to provide an overview of the NoSQL landscape by discussing employed concepts rather than system specificities and explore the requirements typically posed to NoSQL database systems, the techniques used to fulfil these requirements and the trade-offs that have to be made in the process. Our focus lies on key-value, document and wide-column stores, since these NoSQL categories cover the most relevant techniques and design decisions in the space of scalable data management. | |
In Section 2, we describe the most common high-level approaches towards categorizing NoSQL database systems either by their data model into key-value stores, document stores and wide-column stores or by the safety-liveness trade-offs in their design (CAP and PACELC). We then survey commonly used techniques in more detail and discuss our model of how requirements and techniques are related in Section 3, before we give a broad overview of prominent database systems by applying our model to them in Section 4. A simple and abstract decision model for restricting the choice of appropriate NoSQL systems based on application requirements concludes the paper in Section 5.
In order to abstract from implementation details of individual NoSQL systems, high-level classification criteria can be used to group similar data stores into categories. In this section, we introduce the two most prominent approaches: data models and CAP theorem classes. | |
The most commonly employed distinction between NoSQL databases is the way they store and allow access to data. Each system covered in this paper can be categorised as either key-value store, document store or wide-column store. | |
2.1.1 Key-Value Stores. A key-value store consists of a set of key-value pairs with unique keys. Due to this simple structure, it only supports get and put operations. As the nature of the stored value is transparent to the database, pure key-value stores do not support operations beyond simple CRUD (Create, Read, Update, Delete). Key-value stores are therefore often referred to as schemaless: Any assumptions about the structure of stored data are implicitly encoded in the application logic (schema-on-read) and not explicitly defined through a data definition language (schema-on-write). | |
The obvious advantages of this data model lie in its simplicity. The very simple abstraction makes it easy to partition and query the data, so that the database system can achieve low latency as well as high throughput. However, if an application demands more complex operations, e.g. range queries, this data model is not powerful enough. Figure 1 illustrates how user account data and settings might be stored in a key-value store. Since queries more complex than simple lookups are not supported, data has to be analyzed inefficiently in application code to extract information like whether cookies are supported or not (cookies: false). | |
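A toy sketch of this schema-on-read pattern follows; the key naming and the settings value are made up for illustration.

```python
# The store only knows opaque keys and values; checking something like
# `cookies: false` has to happen in application code after fetching and
# parsing the value (schema-on-read).
import json

store = {}                                    # stand-in for a key-value store

def put(key, value):
    store[key] = value

def get(key):
    return store.get(key)

put("user:2345:settings", json.dumps({"theme": "dark", "cookies": False}))

settings = json.loads(get("user:2345:settings"))   # the schema lives in the app
if not settings["cookies"]:
    print("cookies disabled")
```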
2.1.2 Document Stores. A document store is a key-value store that restricts values to semi-structured formats such as JSON documents. This restriction in comparison to key-value stores brings great flexibility in accessing the data. It is not only possible to fetch an entire document by its ID, but also to retrieve only parts of a document, e.g. the age of a customer, and to execute queries like aggregation, query-by-example or even full-text search. | |
2.1.3 Wide-Column Stores. Wide-column stores inherit their name from the image that is often used to explain the underlying data model: a relational table with many sparse columns. Technically, however, a wide-column store is closer to a distributed multi-level sorted map: The first-level keys identify rows which themselves consist of key-value pairs. The first-level keys are called row keys, the second-level keys are called column keys. This storage scheme makes tables with arbitrarily many columns feasible, because there is no column key without a corresponding value. Hence, null values can be stored without any space overhead. The set of all columns is partitioned into so-called column families to colocate columns on disk that are usually accessed together. On disk, wide-column stores do not colocate all data from each row, but instead values of the same column family and from the same row. Hence, an entity (a row) cannot be retrieved by one single lookup as in a document store, but has to be joined together from the columns of all column families. However, this storage layout usually enables highly efficient data compression and makes retrieving only a portion of an entity very efficient. The data are stored in lexicographic order of their keys, so that data that are accessed together are physically co-located, given a careful key design. As all rows are distributed into contiguous ranges (so-called tablets) among different tablet servers, row scans only involve few servers and thus are very efficient.
Bigtable, which pioneered the wide-column model, was specifically developed to store a large collection of webpages as illustrated in Figure 3. Every row in the webpages table corresponds to a single webpage. The row key is a concatenation of the URL components in reversed order and every column key is composed of the column family name and a column qualifier, separated by a colon. There are two column families: the “contents” column family with only one column holding the actual webpage and the “anchor” column family holding links to each webpage, each in a separate column. Every cell in the table (i.e. every value accessible by the combination of row and column key) can be versioned by timestamps or version numbers. It is important to note that much of the information of an entity lies in the keys and not only in the values . | |
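To make the storage model more concrete, here is a toy sketch of a wide-column table as a multi-level sorted map in Python, loosely following the webtable example above. The values are placeholders, and a real system adds persistence, partitioning into tablets and compression.

```python
# Wide-column model as a multi-level map: row key -> column family ->
# column qualifier -> {timestamp: value}.
table = {}

def put_cell(row_key, family, qualifier, timestamp, value):
    row = table.setdefault(row_key, {})
    cells = row.setdefault(family, {}).setdefault(qualifier, {})
    cells[timestamp] = value

put_cell("com.cnn.www", "contents", "", 1, "<html>...</html>")
put_cell("com.cnn.www", "anchor", "cnnsi.com", 1, "CNN")
put_cell("com.cnn.www", "anchor", "my.look.ca", 1, "CNN.com")

# Rows are kept in lexicographic order of their keys, so a range scan over
# "com.cnn.*" only touches contiguous data; reassembling a full row, by
# contrast, has to visit every column family.
for row_key in sorted(table):
    print(row_key, "->", table[row_key]["anchor"])
```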
Another defining property of a database apart from how the data are stored and how they can be accessed is the level of consistency that is provided. Some databases are built to guarantee strong consistency and serializability (ACID), while others favour availability (BASE). This trade-off is inherent to every distributed database system and the huge number of different NoSQL systems shows that there is a wide spectrum between the two paradigms. In the following, we explain the two theorems CAP and PACELC according to which database systems can be categorised by their respective positions in this spectrum. | |
CAP. Like the famous FLP Theorem, the CAP Theorem, presented by Eric Brewer at PODC 2000 and later proven by Gilbert and Lynch, is one of the truly influential impossibility results in the field of distributed computing, because it places an ultimate upper bound on what can possibly be accomplished by a distributed system. It states that a sequentially consistent read/write register that eventually responds to every request cannot be realised in an asynchronous system that is prone to network partitions. In other words, it can guarantee at most two of the following three properties at the same time: | |
Brewer argues that a system can be both available and consistent in normal operation, but in the presence of a system partition, this is not possible: If the system continues to work in spite of the partition, there is some non-failing node that has lost contact to the other nodes and thus has to decide to either continue processing client requests to preserve availability (AP, eventual consistent systems) or to reject client requests in order to uphold consistency guarantees (CP). The first option violates consistency, because it might lead to stale reads and conflicting writes, while the second option obviously sacrifices availability. There are also systems that usually are available and consistent, but fail completely when there is a partition (CA), for example single-node systems. It has been shown that the CAP-theorem holds for any consistency property that is at least as strong as causal consistency, which also includes any recency bounds on the permissible staleness of data (Δ-atomicity). Serializability as the correctness criterion of transactional isolation does not require strong consistency. However, similar to consistency, serializability can also not be achieved under network partitions. | |
The classification of NoSQL systems as either AP, CP or CA vaguely reflects the individual systems’ capabilities and hence is widely accepted as a means for high-level comparisons. However, it is important to note that the CAP Theorem actually does not state anything on normal operation; it merely tells us whether a system favors availability or consistency in the face of a network partition. In contrast to the FLP-Theorem, the CAP theorem assumes a failure model that allows arbitrary messages to be dropped, reordered or delayed indefinitely. Under the weaker assumption of reliable communication channels (i.e. messages always arrive but asynchronously and possibly reordered) a CAP-system is in fact possible using the Attiya, Bar-Noy, Dolev algorithm, as long as a majority of nodes are up. (Therefore, consensus as used for coordination in many NoSQL systems either natively (e.g. in Megastore) or through coordination services like Chubby and Zookeeper is even harder to achieve with high availability than strong consistency, see FLP Theorem.) | |
PACELC. This lack of the CAP Theorem is addressed in an article by Daniel Abadi in which he points out that the CAP Theorem fails to capture the trade-off between latency and consistency during normal operation, even though it has proven to be much more influential on the design of distributed systems than the availability-consistency trade-off in failure scenarios. He formulates PACELC which unifies both trade-offs and thus portrays the design space of distributed systems more accurately. From PACELC, we learn that in case of a Partition, there is an Availability-Consistency trade-off; Else, i.e. in normal operation, there is a Latency-Consistency trade-off. | |
This classification basically offers two possible choices for the partition scenario (A/C) and also two for normal operation (L/C) and thus appears more fine-grained than the CAP classification. However, many systems cannot be assigned exclusively to one single PACELC class and one of the four PACELC classes, namely PC/EL, can hardly be assigned to any system. | |
Every significantly successful database is designed for a particular class of applications, or to achieve a specific combination of desirable system properties. The simple reason why there are so many different database systems is that it is not possible for any system to achieve all desirable properties at once. Traditional SQL databases such as PostgreSQL have been built to provide the full functional package: a very flexible data model, sophisticated querying capabilities including joins, global integrity constraints and transactional guarantees. On the other end of the design spectrum, there are key-value stores like Dynamo that scale with data and request volume and offer high read and write throughput as well as low latency, but barely any functionality apart from simple lookups. | |
In this section, we highlight the design space of distributed database systems, concentrating on sharding, replication, storage management and query processing. We survey the available techniques and discuss how they are related to different functional and non-functional properties (goals) of data management systems. In order to illustrate what techniques are suitable to achieve which system properties, we provide the NoSQL Toolbox (Figure 4) where each technique is connected to the functional and non-functional properties it enables (positive edges only). | |
Several distributed relational database systems such as Oracle RAC or IBM DB2 pureScale rely on a shared-disk architecture where all database nodes access the same central data repository (e.g. a NAS or SAN). Thus, these systems provide consistent data at all times, but are also inherently difficult to scale. In contrast, the (NoSQL) database systems focused in this paper are built upon a shared-nothing architecture, meaning each system consists of many servers with private memory and private disks that are connected through a network. Thus, high scalability in throughput and data volume is achieved by sharding (partitioning) data across different nodes (shards) in the system. There are three basic distribution techniques: range-sharding, hash-sharding and entity-group sharding. To make efficient scans possible, the data can be partitioned into ordered and contiguous value ranges by range-sharding. However, this approach requires some coordination through a master that manages assignments. To ensure elasticity, the system has to be able to detect and resolve hotspots automatically by further splitting an overburdened shard. | |
Range sharding is supported by wide-column stores like BigTable, HBase or Hypertable and document stores, e.g. MongoDB, RethinkDB, Espresso and DocumentDB. Another way to partition data over several machines is hash-sharding where every data item is assigned to a shard server according to some hash value built from the primary key. This approach does not require a coordinator and also guarantees the data to be evenly distributed across the shards, as long as the used hash function produces an even distribution. The obvious disadvantage, though, is that it only allows lookups and makes scans unfeasible. Hash sharding is used in key-value stores and is also available in some wide-column stores like Cassandra or Azure Tables.
The shard server that is responsible for a record can be determined as serverid = hash(id)%servers, for example. However, this hashing scheme requires all records to be reassigned every time a new server joins or leaves, because it changes with the number of shard servers (servers). Consequently, it is infeasible to use in elastic systems like Dynamo, Riak or Cassandra, which allow additional resources to be added on-demand and again be removed when dispensable. For increased flexibility, elastic systems typically use consistent hashing where records are not directly assigned to servers, but instead to logical partitions which are then distributed across all shard servers. Thus, only a fraction of the data have to be reassigned upon changes in the system topology. For example, an elastic system can be downsized by offloading all logical partitions residing on a particular server to other servers and then shutting down the now idle machine. For details on how consistent hashing is used in NoSQL systems, see the Dynamo paper.
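A minimal consistent-hashing sketch follows, with virtual nodes playing the role of the logical partitions described above; the hash function and parameter choices are arbitrary.

```python
# Consistent hashing: keys and (virtual) servers map onto the same hash ring,
# so adding or removing a server only moves the keys on its ring segments.
import bisect
import hashlib

def ring_hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, vnodes=64):
        self.ring = sorted(
            (ring_hash(f"{server}#{i}"), server)
            for server in servers for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def server_for(self, key):
        # Walk clockwise to the next virtual node; wrap around at the end.
        idx = bisect.bisect(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.server_for("user:2345"))

# Naive modulo sharding (serverid = hash(id) % servers), by contrast,
# reshuffles almost every record whenever the server count changes.
```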
Entity-group sharding is a data partitioning scheme with the goal of enabling single-partition transactions on co-located data. The partitions are called entity-groups and either explicitly declared by the application (e.g. in G-Store and MegaStore) or derived from transactions’ access patterns (e.g. in Relational Cloud and Cloud SQL Server). If a transaction accesses data that spans more than one group, data ownership can be transferred between entity-groups or the transaction manager has to fall back to more expensive multi-node transaction protocols.
In terms of CAP, conventional RDBMSs are often CA systems run in single-server mode: The entire system becomes unavailable on machine failure. And so system operators secure data integrity and availability through expensive, but reliable high-end hardware. In contrast, NoSQL systems like Dynamo, BigTable or Cassandra are designed for data and request volumes that cannot possibly be handled by one single machine, and therefore they run on clusters consisting of thousands of servers. (Low-end hardware is used, because it is substantially more cost-efficient than high-end hardware.) Since failures are inevitable and will occur frequently in any large-scale distributed system, the software has to cope with them on a daily basis . In 2009, Google fellow Jeff Dean stated that a typical new cluster at Google encounters thousands of hard drive failures, 1,000 single-machine failures, 20 rack failures and several network partitions due to expected and unexpected circumstances in its first year alone. Many more recent cases of network partitions and outages in large cloud data centers have been reported . Replication allows the system to maintain availability and durability in the face of such errors. But storing the same records on different machines (replica servers) in the cluster introduces the problem of synchronization between them and thus a trade-off between consistency on the one hand and latency and availability on the other. | |
Gray et al. propose a two-tier classification of different replication strategies according to when updates are propagated to replicas and where updates are accepted. There are two possible choices on tier one (“when”): Eager (synchronous) replication propagates incoming changes synchronously to all replicas before a commit can be returned to the client, whereas lazy (asynchronous) replication applies changes only at the receiving replica and passes them on asynchronously. The great advantage of eager replication is consistency among replicas, but it comes at the cost of higher write latency due to the need to wait for other replicas and impaired availability. Lazy replication is faster, because it allows replicas to diverge; as a consequence, stale data might be served. On the second tier (“where”), again, two different approaches are possible: Either a master-slave (primary copy) scheme is pursued where changes can only be accepted by one replica (the master) or, in an update anywhere (multi-master) approach, every replica can accept writes. In master-slave protocols, concurrency control is not more complex than in a distributed system without replicas, but the entire replica set becomes unavailable, as soon as the master fails. Multi-master protocols require complex mechanisms for prevention or detection and reconciliation of conflicting changes. Techniques typically used for these purposes are versioning, vector clocks, gossiping and read repair (e.g. in Dynamo) and convergent or commutative datatypes (e.g. in Riak).
Basically, all four combinations of the two-tier classification are possible. Distributed relational systems usually perform eager master-slave replication to maintain strong consistency. Eager update anywhere replication as for example featured in Google’s Megastore suffers from a heavy communication overhead generated by synchronisation and can cause distributed deadlocks which are expensive to detect. NoSQL database systems typically rely on lazy replication, either in combination with the master-slave (CP systems, e.g. HBase and MongoDB) or the update anywhere approach (AP systems, e.g. Dynamo and Cassandra). Many NoSQL systems leave the choice between latency and consistency to the client, i.e. for every request, the client decides whether to wait for a response from any replica to achieve minimal latency or for a certainly consistent response (by a majority of the replicas or the master) to prevent stale data.
An aspect of replication that is not covered by the two-tier scheme is the distance between replicas. The obvious advantage of placing replicas near one another is low latency, but close proximity of replicas might also reduce the positive effects on availability; for example, if two replicas of the same data item are placed in the same rack, the data item is not available on rack failure in spite of replication. But more than the possibility of mere temporary unavailability, placing replicas nearby also bears the peril of losing all copies at once in a disaster scenario. An alternative technique for latency reduction is used in Orestes, where data is cached close to applications using web caching infrastructure and cache coherence protocols.
Geo-replication can protect the system against complete data loss and improve read latency for distributed access from clients. Eager geo-replication, as implemented in Megastore, Spanner, MDCC and Mencius achieve strong consistency at the cost of higher write latencies (typically 100ms to 600ms). With lazy geo-replication as in Dynamo, PNUTS, Walter, COPS, Cassandra and BigTable recent changes may be lost, but the system performs better and remains available during partitions. Charron-Bost et al. (Chapter 12) and Öszu and Valduriez (Chapter 13) provide a comprehensive discussion of database replication. | |
For best performance, database systems need to be optimized for the storage media they employ to serve and persist data. These are typically main memory (RAM), solid-state drives (SSDs) and spinning disk drives (HDDs) that can be used in any combination. Unlike RDBMSs in enterprise setups, distributed NoSQL databases avoid specialized shared-disk architectures in favor of shared-nothing clusters based on commodity servers (employing commodity storage media). Storage devices are typically visualized as a “storage pyramid” (see Figure 5 or Hellerstein et al.). There is also a set of transparent caches (e.g. L1-L3 CPU caches and disk buffers, not shown in the Figure), that are only implicitly leveraged through well-engineered database algorithms that promote data locality. The very different cost and performance characteristics of RAM, SSD and HDD storage and the different strategies to leverage their strengths (storage management) are one reason for the diversity of NoSQL databases. Storage management has a spatial dimension (where to store data) and a temporal dimension (when to store data). Update-in-place and append-only-IO are two complementary spatial techniques of organizing data; in-memory prescribes RAM as the location of data, whereas logging is a temporal technique that decouples main memory and persistent storage and thus provides control over when data is actually persisted. | |
In their seminal paper “The End of an Architectural Era”, Stonebraker et al. have found that in typical RDBMSs, only 6.8% of the execution time is spent on “useful work”, while the rest is spent on overheads such as buffer management, latching, locking and logging.
This suggests that large performance improvements can be expected when RAM is used as primary storage (in-memory databases). The downsides are high storage costs and a lack of durability: a small power outage can destroy the database state. This can be solved in two ways: the state can be replicated over n in-memory server nodes, protecting against n-1 single-node failures (e.g. HStore, VoltDB), or it can be logged to durable storage (e.g. Redis or SAP Hana). Through logging, a random write access pattern can be transformed into a sequential one comprised of received operations and their associated properties (e.g. redo information). In most NoSQL systems, the commit rule for logging is respected: every write operation that is confirmed as successful must be logged, and the log must be flushed to persistent storage. In order to avoid the rotational latency of HDDs incurred by logging each operation individually, log flushes can be batched together (group commit), which slightly increases the latency of individual writes but drastically improves throughput.
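The following is a minimal sketch of group commit, assuming a single append-only log file and a background flusher thread; it illustrates the batching idea, not any specific system’s implementation. Writers block until their record has been written and fsynced as part of a group.

import os, queue, threading, time

log = open('db.log', 'ab')
pending = queue.Queue()

def flusher(window=0.005):
    while True:
        batch = [pending.get()]          # wait for at least one log record
        time.sleep(window)               # let concurrent writers join the group
        try:
            while True:
                batch.append(pending.get_nowait())
        except queue.Empty:
            pass
        for record, _ in batch:
            log.write(record)
        log.flush()
        os.fsync(log.fileno())           # one disk round-trip for the whole group
        for _, done in batch:
            done.set()                   # every write in the batch is now durable

threading.Thread(target=flusher, daemon=True).start()

def write(record: bytes):
    done = threading.Event()
    pending.put((record, done))
    done.wait()                          # acknowledge only after the group flush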
SSDs and more generally all storage devices based on NAND flash memory differ substantially from HDDs in various aspects: “(1) asymmetric speed of read and write operations, (2) no in-place overwrite — the whole block must be erased before overwriting any page in that block, and (3) limited program/erase cycles” (Min et al., 2012). Thus, a database system’s storage management must not treat SSDs and HDDs as slightly slower, persistent RAM, since random writes to an SSD are roughly an order of magnitude slower than sequential writes. Random reads, on the other hand, can be performed without any performance penalties. There are some database systems (e.g. Oracle Exadata, Aerospike) that are explicitly engineered for these performance characteristics of SSDs. In HDDs, both random reads and writes are 10–100 times slower than sequential access. Logging hence suits the strengths of SSDs and HDDs which both offer a significantly higher throughput for sequential writes. | |
For in-memory databases, an update-in-place access pattern is ideal: It simplifies the implementation, and random writes to RAM are essentially as fast as sequential ones, with small differences being hidden by pipelining and the CPU-cache hierarchy. However, RDBMSs and many NoSQL systems (e.g. MongoDB) employ an update-in-place pattern for persistent storage, too. To mitigate the slow random access to persistent storage, main memory is usually used as a cache and complemented by logging to guarantee durability. In RDBMSs, this is achieved through a complex buffer pool which not only employs cache-replacement algorithms appropriate for typical SQL-based access patterns, but also ensures ACID semantics. NoSQL databases have simpler buffer pools that profit from simpler queries and the lack of ACID transactions. The alternative to the buffer pool model is to leave caching to the OS through virtual memory (e.g. employed in MongoDB’s MMAP storage engine). This simplifies the database architecture, but has the downside of giving less control over which data items or pages reside in memory and when they get evicted. Also, read-ahead (speculative reads) and write-behind (write buffering), when performed transparently by the OS, lack sophistication, as they are based on file-system logic instead of database queries.
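As an illustration of the buffer-pool model (as opposed to leaving caching to the OS), here is a minimal sketch of a page cache with LRU replacement; real buffer pools use more query-aware replacement policies and also handle dirty pages, latching and recovery, all of which are omitted here.

from collections import OrderedDict

PAGE_SIZE = 4096

class BufferPool:
    def __init__(self, path, capacity=1024):
        self.file = open(path, 'rb')
        self.capacity = capacity
        self.pages = OrderedDict()           # page number -> page bytes, in LRU order

    def get_page(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)  # hit: mark as most recently used
            return self.pages[page_no]
        self.file.seek(page_no * PAGE_SIZE)  # miss: fetch the page from disk
        page = self.file.read(PAGE_SIZE)
        self.pages[page_no] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return page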
Append-only storage (also referred to as log-structuring) tries to maximize throughput by writing sequentially. Although log-structured file systems have a long research history, append-only I/O has only recently been popularized for databases by BigTable’s use of Log-Structured Merge (LSM) trees consisting of an in-memory cache, a persistent log and immutable, periodically written storage files. LSM trees and variants like Sorted Array Merge Trees (SAMT) and Cache-Oblivious Look-ahead Arrays (COLA) have been applied in many NoSQL systems (Cassandra, CouchDB, LevelDB, Bitcask, RethinkDB, WiredTiger, RocksDB, InfluxDB, TokuDB). Designing a database to achieve maximum write performance by always writing to a log is rather simple; the difficulty lies in providing fast random and sequential reads. This requires an appropriate index structure that is either permanently updated as a copy-on-write (COW) data structure (e.g. CouchDB’s COW B-trees) or only periodically persisted as an immutable data structure (e.g. in BigTable-style systems). An issue of all log-structured storage approaches is costly garbage collection (compaction) to reclaim space of updated or deleted items.
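The following sketch condenses the LSM idea to a few lines, assuming the write-ahead log and the on-disk file format are handled elsewhere: writes go to a mutable in-memory table that is periodically frozen into an immutable sorted segment; reads consult the memtable and then the segments from newest to oldest; compaction merges segments and discards overwritten versions.

class LSMStore:
    def __init__(self, memtable_limit=4096):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.segments = []                        # immutable sorted segments, newest last

    def put(self, key, value):
        self.memtable[key] = value                # plus an append to the redo log in a real system
        if len(self.memtable) >= self.memtable_limit:
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}                    # the segment is now an immutable "storage file"

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest to oldest
            if key in segment:
                return segment[key]
        return None

    def compact(self):
        merged = {}
        for segment in self.segments:             # newer segments overwrite older entries
            merged.update(segment)
        self.segments = [dict(sorted(merged.items()))]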
In virtualized environments like Infrastructure-as-a-Service clouds, many of the discussed characteristics of the underlying storage layer are hidden.
The querying capabilities of a NoSQL database mainly follow from its distribution model, consistency guarantees and data model. Primary key lookup, i.e. retrieving data items by a unique ID, is supported by every NoSQL system, since it is compatible with range- as well as hash-partitioning. Filter queries return all items (or projections) that meet a predicate specified over the properties of data items from a single table. In their simplest form, they can be performed as filtered full-table scans. For hash-partitioned databases this implies a scatter-gather pattern where each partition performs the predicated scan and results are merged. For range-partitioned systems, any conditions on the range attribute can be exploited to select partitions.
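As a sketch of the two scan strategies, assume a made-up partitioning layout: under hash partitioning a filter query is scattered to every partition and the partial results are gathered, while under range partitioning partitions whose key range cannot contain matches are pruned.

def scatter_gather(partitions, predicate):
    results = []
    for partition in partitions:                  # in practice these are parallel RPCs
        results.extend(item for item in partition if predicate(item))
    return results

def range_pruned_scan(range_partitions, lo, hi, predicate):
    # range_partitions: list of (min_key, max_key, items) tuples, split by the range attribute
    results = []
    for min_key, max_key, items in range_partitions:
        if max_key < lo or min_key > hi:
            continue                              # this partition cannot contain matches
        results.extend(item for item in items if predicate(item))
    return results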
To circumvent the inefficiencies of O(n) scans, secondary indexes can be employed. These can either be local secondary indexes that are managed in each partition or global secondary indexes that index data over all partitions. As the global index itself has to be distributed over partitions, consistent secondary index maintenance would necessitate slow and potentially unavailable commit protocols. Therefore, in practice, most systems only offer eventual consistency for these indexes (e.g. Megastore, Google AppEngine Datastore, DynamoDB) or do not support them at all (e.g. HBase, Azure Tables). When executing global queries over local secondary indexes, the query can only be targeted to a subset of partitions if the query predicate and the partitioning rules intersect. Otherwise, results have to be assembled through scatter-gather. For example, a user table with range-partitioning over an age field can service queries that have an equality condition on age from one partition, whereas queries over names need to be evaluated at each partition; see the sketch below. A special case of global secondary indexing is full-text search, where selected fields or complete data items are fed into either a database-internal inverted index (e.g. MongoDB) or to an external search platform such as ElasticSearch or Solr (Riak Search, DataStax Cassandra).
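The user-table example can be sketched as follows (the table layout and field names are made up): each partition keeps a local secondary index on name, so an equality query on the range-partitioned age attribute is routed to a single partition, whereas a query on name has to consult the local index of every partition via scatter-gather.

from collections import defaultdict

class Partition:
    def __init__(self, age_range):
        self.age_range = age_range                # (min_age, max_age) handled by this partition
        self.rows = []
        self.name_index = defaultdict(list)       # local secondary index on 'name'

    def insert(self, row):
        self.rows.append(row)
        self.name_index[row['name']].append(row)

def query_by_age(partitions, age):
    # The predicate intersects the partitioning rule: one partition suffices.
    target = next(p for p in partitions if p.age_range[0] <= age <= p.age_range[1])
    return [row for row in target.rows if row['age'] == age]

def query_by_name(partitions, name):
    # No intersection with the partitioning rule: scatter-gather over all local indexes.
    return [row for p in partitions for row in p.name_index.get(name, [])]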
Query planning is the task of optimizing a query plan to minimize execution costs. For aggregations and joins, query planning is essential as these operations are very inefficient and hard to implement in application code. The wealth of literature and results on relational query processing is largely disregarded in current NoSQL systems for two reasons. First, the key-value and wide-column models are centered around CRUD and scan operations on primary keys, which leave little room for query optimization. Second, most work on distributed query processing focuses on OLAP (online analytical processing) workloads that favor throughput over latency, and single-node query optimization techniques are not easily applicable to partitioned and replicated databases. However, it remains an open research challenge to generalize the large body of applicable query optimization techniques, especially in the context of document databases. (Currently, only RethinkDB can perform general Θ-joins. MongoDB’s aggregation framework supports left-outer equi-joins, and CouchDB allows joins for pre-declared map-reduce views.)
In-database analytics can be performed either natively (e.g. in MongoDB, Riak, CouchDB) or through external analytics platforms such as Hadoop, Spark and Flink (e.g. in Cassandra and HBase). The prevalent native batch analytics abstraction exposed by NoSQL systems is MapReduce. (An alternative to MapReduce are generalized data processing pipelines, where the database tries to optimize the flow of data and locality of computation based on a more declarative query language, e.g. MongoDB’s aggregation framework.) Due to I/O, communication overhead and limited execution plan optimization, these batch- and micro-batch-oriented approaches have high response times. Materialized views are an alternative with lower query response times. They are declared at design time and continuously updated on change operations (e.g. in CouchDB and Cassandra). However, similar to global secondary indexing, view consistency is usually relaxed in favor of fast, highly available writes when the system is distributed. As only a few database systems come with built-in support for ingesting and querying unbounded streams of data, near-real-time analytics pipelines commonly implement either the Lambda Architecture or the Kappa Architecture: The former complements a batch processing framework like Hadoop MapReduce with a stream processor such as Storm (see for example Summingbird), and the latter exclusively relies on stream processing and forgoes batch processing altogether.
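As a small illustration of the Lambda Architecture’s read path, with hypothetical view names: the serving layer merges a precomputed (and therefore slightly stale) batch view with a real-time view that only covers events since the last batch run.

def page_view_count(page, batch_view, realtime_view):
    # batch_view: dict produced periodically by the batch layer (e.g. a MapReduce job)
    # realtime_view: dict maintained incrementally by the stream processor
    return batch_view.get(page, 0) + realtime_view.get(page, 0)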
In this section, we provide a qualitative comparison of some of the most prominent key-value, document and wide-column stores. We present the results in strongly condensed comparisons and refer to the documentation of the individual systems for detailed information. The proposed NoSQL Toolbox (see Figure 4) is a means of abstraction that can be used to classify database systems along three dimensions: functional requirements, non-functional requirements and the techniques used to implement them. We argue that this classification characterizes many database systems well and thus can be used to meaningfully contrast different database systems: Table 1 shows a direct comparison of MongoDB, Redis, HBase, Riak, Cassandra and MySQL in their respective default configurations. A more verbose comparison of central system properties is presented in the large comparison Table 2 at the end of this article.
The methodology used to identify the specific system properties consists of an in-depth analysis of publicly available documentation and literature on the systems. Furthermore, some properties had to be evaluated by researching the open-source code bases, through personal communication with the developers, and via a meta-analysis of reports and benchmarks by practitioners.
For detailed descriptions see the slides from our ICDE 2016 Tutorial, which goes over many details of the different NoSQL systems: | |
The comparison elucidates how SQL and NoSQL databases are designed to fulfill very different needs: RDBMSs provide an unmatched level of functionality, whereas NoSQL databases excel on the non-functional side through scalability, availability, low latency and/or high throughput. However, there are also large differences among the NoSQL databases. Riak and Cassandra, for example, can be configured to fulfill many non-functional requirements, but are only eventually consistent and do not feature many functional capabilities apart from data analytics and, in the case of Cassandra, conditional updates. MongoDB and HBase, on the other hand, offer stronger consistency and more sophisticated functional capabilities such as scan queries and (MongoDB only) filter queries, but do not maintain read and write availability during partitions and tend to display higher read latencies. Redis, as the only non-partitioned system in this comparison apart from MySQL, shows a special set of trade-offs centered around the ability to maintain extremely high throughput at low latency using in-memory data structures and asynchronous master-slave replication.
Choosing a database system always means choosing one set of desirable properties over another. To break down the complexity of this choice, we present a binary decision tree in Figure 6 that maps trade-off decisions to example applications and potentially suitable database systems. The leaf nodes cover applications ranging from simple caching (left) to Big Data analytics (right). Naturally, this view of the problem space is not complete, but it points towards candidate solutions for a particular data management problem.
The first split in the tree is along the access pattern of applications: They either rely on fast lookups only (left half) or require more complex querying capabilities (right half). The fast lookup applications can be distinguished further by the data volume they process: If the main memory of one single machine can hold all the data, a single-node system like Redis or Memcache probably is the best choice, depending on whether functionality (Redis) or simplicity (Memcache) is favored. If the data volume is or might grow beyond RAM capacity or is even unbounded, a multi-node system that scales horizontally might be more appropriate. The most important decision in this case is whether to favor availability (AP) or consistency (CP) as described by the CAP theorem. Systems like Cassandra and Riak can deliver an always-on experience, while systems like HBase, MongoDB and DynamoDB deliver strong consistency. | |
The right half of the tree covers applications requiring more complex queries than simple lookups. Here, too, we first distinguish the systems by the data volume they have to handle according to whether single-node systems are feasible (HDD-size) or distribution is required (unbounded volume). For common OLTP (online transaction processing) workloads on moderately large data volumes, traditional RDBMSs or graph databases like Neo4J are optimal, because they offer ACID semantics. If, however, availability is of the essence, distributed systems like MongoDB, CouchDB or DocumentDB are preferable.
If the data volume exceeds the limits of a single machine, the choice of the right system depends on the prevalent query pattern: When complex queries have to be optimised for latency, as for example in social networking applications, MongoDB is very attractive, because it facilitates expressive ad-hoc queries. HBase and Cassandra are also useful in such a scenario, but excel at throughput-optimised Big Data analytics, when combined with Hadoop. | |
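The decision logic described above can be roughly transcribed into code; this is only a coarse sketch of the tree in Figure 6, with the trade-off questions expressed as boolean parameters, and it omits the figure’s finer distinctions.

def suggest_database(fast_lookups, fits_in_ram=False, needs_availability=False,
                     single_node_feasible=False, latency_over_throughput=True):
    if fast_lookups:
        if fits_in_ram:
            return 'Redis or Memcache (single node)'
        return 'Cassandra or Riak (AP)' if needs_availability else 'HBase, MongoDB or DynamoDB (CP)'
    # Complex queries:
    if single_node_feasible:
        return 'MongoDB, CouchDB or DocumentDB' if needs_availability else 'RDBMS or Neo4J (ACID)'
    return 'MongoDB (ad-hoc queries)' if latency_over_throughput else 'HBase or Cassandra with Hadoop (analytics)'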
In summary, we are convinced that the proposed top-down model is an effective decision support to filter the vast amount of NoSQL database systems based on central requirements. The NoSQL Toolbox furthermore provides a mapping from functional and non-functional requirements to common implementation techniques to categorize the constantly evolving NoSQL space. | |
Don’t want to miss our next post on NoSQL topics? Get it conveniently delivered to your inbox by joining our newsletter.","1" | |
"noahhl","https://medium.com/signal-v-noise/lets-chart-stop-those-lying-line-charts-60020e299829","11","{""Data Visualization"",Design,""Data Science"",Analytics}","425","4.27735849056604","Let’s Chart: stop those lying line charts","","4" | |
"datalab","https://medium.com/hacker-daily/wannabe-data-scientists-learn-the-basics-with-these-7-books-1a41cfbbdd34","1","{""Data Science"",""Big Data"",""Learning To Code"",Analytics,Startup}","415","2.8188679245283","Wannabe Data Scientists! Learn the basics with these 7 books!","In the last few years I spent a significant time with reading books about Data Science. I found these 7 books the best. These together are a very valuable source of learning the basics. It drives you through everything, you need to know. | |
Though they are very enjoyable, none of these is light reading. So if you decide to go with them, allocate some time and energy. It is worth it! If you combine this knowledge with the free practical data courses that I wrote about earlier, it’s already a good-enough level for an entry-level Data Scientist position. (In my opinion, at least.)
Note: you can see I listed four O’Reilly books here. If it looks suspicious: I’m not affiliated with them in any way. ;-) I just find their books really useful. | |
The first book to read is about the basic business mindset of how to use data. It says it’s for startups, but I feel it’s much more than that. You will learn why it is so important to select the One Metric That Matters, as well as the 6 basic online business types — and the data strategy behind each of them.
If Lean Analytics is about business + data for startups, this book is business + data for big companies. It sounds less fancy than the first one, but there is always a chance to pick up some useful knowledge from the big guys, e.g. how insurance companies use predictive analytics or what data issues banks are facing.
I constantly promote this book on my channels. It’s not just for Data Scientists. It’s the very basis of statistical thinking, which I think every human being should be familiar with. This book comes with many stories and you will learn how not to be scammed by headlines like “How we pushed 1300% on our conversion rate by changing only one word” and other BSs. | |
The last book before going really tech-focused. This one takes the things that you learned so far in the first 3 books to the next level. It goes deeper into topics like regression models, spam filtering, recommendation engines and even big data. | |
The other thing I constantly promote is to learn (at least) basic coding. With that you can be much more flexible in getting, cleaning, transforming and analyzing your data. It simply extends your opportunities in Data Science. And when you start, I suggest starting with the Command Line. This is the only book I’ve seen about Data Science + Command Line, but one is enough as it pretty much covers everything.
The second data language to learn is Python. It’s not too difficult and it’s very widely used. You can do almost everything in Python when it comes to analysis, prediction and even machine learning. This is a heavy book (literally: it’s more than 400 pages), but it covers everything with Python.
The last book on the list is only 60 pages and very technical. It gives you a good view into the technical background of data collection and processing. Most probably, as an analyst or data scientist you won’t use this kind of knowledge directly, but at least you will be aware of what the data infrastructure specialists of the company do.
As I mentioned before, if you go through all of these — combined with the free practical data courses — you will have a solid knowledge of Data Science!
Learn more about how to create a good research plan here.
Thanks for reading! Enjoyed the article? Please just let me know by clicking the 💚 below. It also helps other people see the story!
Tomi Mester. My blog: data36.com. My Twitter: @data36_com","4"
"jakeybob","https://medium.com/@jakeybob/brexit-a-story-in-maps-d70caab7315e","9","{Brexit,""UK Politics"",""Data Science"",Maps,""Data Visualization""}","410","6.8188679245283","Brexit — a story in maps","","2" | |
"noahhl","https://medium.com/signal-v-noise/real-time-dashboards-considered-harmful-7ab026942ac","1","{Dashboard,""Data Science"",""Big Data"",Analytics}","410","4.93207547169811","Real-time dashboards considered harmful","","2" | |
"eklimcz","https://medium.com/uber-design/crafting-data-driven-maps-b0835b620554","13","{""Data Visualization"",Maps,Design,UX,Uber}","395","7.25283018867925","Crafting Data-Driven Maps","","1" | |
"larrykim","https://medium.com/the-mission/you-wont-believe-all-the-personal-data-facebook-has-collected-on-you-387c8060ab09","3","{Facebook,""Big Data"",Privacy,""Online Privacy"",""Data Science""}","384","1.9688679245283","You Won’t Believe All the Personal Data Facebook Has Collected on You","As much as consumers claim to have concerns around Facebook privacy issues, we sure don’t mind handing our information over to it left and right. | |
Every single day, more than a billion active users share their thoughts, photos, news, videos, memes, and more with friends and connections on Facebook. | |
Sure, we know we’re sharing that content and the associated meta data that goes with it back to Facebook. Our sign-in and posting locations; where we took a certain photo; what events we attended and which artists we enjoy. We share all of this, much of it indirectly, without a second thought. | |
And yet some of us are still surprised when it seems like ads are following us around the web. | |
Have you ever searched for something, only to see that same product pop up in a sponsored post in your Facebook stream the next time you sign in? | |
Or read an article about a certain topic, then had ads about that topic appear in your Facebook newsfeed? | |
The amount of data Facebook collects about us is staggering, and it’s not without reason. They essentially lease out our online profiles to companies looking to sell us goods and services.
In fact, examining those Facebook ad targeting options sheds a lot of light on just how much personal information they’re collecting — everything from relationship status to location, life events, political leanings, interests, digital activities, and personal connections. | |
The car you drive. | |
The charitable donations you make. | |
The websites you visit. | |
Facebook’s partnerships with offline data tracking companies mean it has a crazy amount of information not only about your online activities, but also about the money you spend and the things you do in the real world.
My company, WordStream, compiled all of the current Facebook ad targeting options in one epic infographic to demonstrate the breadth and depth of the personal information advertisers can use to target consumers via social media. Check it out: | |
Image credit: WordStream | |
Originally published on Inc.com | |
Found this post useful? Kindly tap the ❤ button below! :) | |
About The Author | |
Larry Kim is the Founder of WordStream. You can connect with him on Twitter, Facebook, LinkedIn and Instagram.","4" | |
"giorgialupi","https://medium.com/accurat-studio/beautiful-reasons-c1c6926ab7d7","26","{""Data Visualization"",Design,Data}","366","13.3330188679245","Beautiful Reasons","","2" | |
"alinelernerllc","https://medium.com/free-code-camp/resumes-suck-heres-the-data-ee88fcc27615","10","{Careers,Programming,Jobs,Tech,""Data Science""}","361","10.4575471698113","Resumes suck. Here’s the data.","I reviewed a solid year’s worth of resumes from engineers we had hired at TrialPay. The strongest signal I could find for whether we would extend an offer to an engineer: the number of typos and grammatical errors on their resume. | |
On the other hand, where people went to school, their GPA, and highest degree earned — these didn’t matter at all. | |
These results were pretty unexpected. They ran counter to how resumes are normally filtered. And they left me scratching my head about how good people really are at making value judgments based on resumes. | |
So, I decided to run an experiment. I wanted to see how good engineers and recruiters actually were at resume-based candidate filtering. | |
Going into it, I was pretty sure that engineers would do a much better job than recruiters. After all, engineers are technical. They don’t need to rely on proxies like recruiters do. | |
But that’s not what happened at all. As it turned out, people are pretty bad at filtering resumes — across the board. After running the numbers, it began to look like resumes might not be a particularly effective filtering tool in the first place. | |
The setup was simple. I would: | |
Essentially, each participant saw something like this: | |
If the participant didn’t want to interview the candidate, they’d have to write a few words about why. If they did want to interview, they also had the option of substantiating their decision, but, in the interest of not fatiguing participants, I didn’t require it. | |
To make judging easier, I told participants to pretend that they were hiring for a full-stack or back-end web dev role, as appropriate. I also told participants not to worry too much about the candidate’s seniority when making judgments, and to assume that the seniority of the role matched the seniority of the candidate. | |
For each resume, I had a pretty good idea of how strong the engineer in question was, and I split resumes into two strength-based groups. To make this judgment call, I drew on my personal experience — most of the resumes came from candidates I placed (or tried to place) at top-tier startups. In these cases, I knew exactly how the engineer had done in technical interviews, and, more often than not, I had visibility into how they performed on the job afterwards. | |
The remainder of resumes came from engineers I had worked with directly. The question was whether the participants in this experiment could figure out who was who just from the resume. | |
At this juncture, a disclaimer is in order. Certainly, someone’s subjective hirability based on the experience of one recruiter is not an oracle of engineering ability. With the advent of more data and more rigorous analysis, perhaps these results will be proven untrue. But, you gotta start somewhere. | |
That said, here’s the experiment by the numbers: | |
Each participant made judgments on 6 randomly selected resumes from the original set of 51, for a total of 716 data points. | |
(This number is less than 152*6=912 because not everyone who participated evaluated all 6 resumes.) | |
If you want to take the experiment for a whirl yourself, you can do so here. | |
Participants were broken up into engineers (both engineers involved in hiring and hiring managers themselves) and recruiters (both in-house and agency). There were 46 recruiters (22 in-house and 24 agency) and 106 engineers (20 hiring managers and 86 non-manager engineers who were still involved in hiring). | |
So, what ended up happening? Below, you can see a comparison of resume scores for both groups of candidates. | |
A resume score is the average of all the votes each resume got, where a ‘no’ counted as 0 and a ‘yes’ vote counted as 1. The dotted line in each box is the mean for each resume group — you can see they’re pretty much the same. The solid line is the median, and the boxes contain the 2nd and 3rd quartiles on either side of it. | |
As you can see, people weren’t very good at this task. What’s pretty alarming is that scores are all over the place, for both strong and less strong candidates. | |
Another way to look at the data is to look at the distribution of accuracy scores. Accuracy in this context refers to how many resumes people were able to tag correctly out of the subset of 6 that they saw. As you can see, results were all over the board. | |
On average, participants guessed correctly 53% of the time. This was pretty surprising. At the risk of being glib, according to these results, when a good chunk of people involved in hiring make resume judgments, they might as well be flipping a coin. | |
What about performance broken down by participant group? Here’s the breakdown: | |
None of the differences between participant groups were statistically significant. In other words, all groups did equally poorly. | |
For each group, you can see how well people did below. | |
To try to understand whether people really were this bad at the task or whether perhaps the task itself was flawed, I ran some more stats. | |
One thing I wanted to understand, in particular, was whether inter-rater agreement was high. In other words, when rating resumes, were participants disagreeing with each other more often than you’d expect to happen by chance? If so, then even if my criteria for whether each resume belonged to a strong candidate wasn’t perfect, the results would still be compelling. | |
No matter how you slice it, if people involved in hiring consistently can’t come to a consensus, then something about the task at hand is too ambiguous. | |
The test I used to gauge inter-rater agreement is called Fleiss’ kappa. The result is on a scale of -1 to 1: a kappa of 1 means perfect agreement, 0 means agreement no better than chance, and -1 means perfect disagreement.
Fleiss’ kappa for this data set was 0.13. That is close to zero, implying agreement only mildly better than a coin flip. In other words, the task of making value judgments based on these resumes was likely too ambiguous for humans to do well with the given information alone.
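For readers who want to run the same check on their own data, here is a small, self-contained sketch of the standard Fleiss’ kappa formula (this is not the author’s original analysis code, and it assumes the same number of raters per item).

def fleiss_kappa(counts):
    # counts: one row per item (resume), one column per category (e.g. 'no', 'yes'),
    # each cell holding the number of raters who chose that category.
    N = len(counts)                                      # number of items
    n = sum(counts[0])                                   # ratings per item
    k = len(counts[0])                                   # number of categories

    # Proportion of all ratings falling into each category, and per-item agreement.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]

    P_bar = sum(P_i) / N                                 # observed agreement
    P_e = sum(pj * pj for pj in p)                       # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Example: 3 resumes, 4 raters each, votes given as (no, yes) counts.
print(fleiss_kappa([(1, 3), (2, 2), (4, 0)]))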
TL;DR Resumes might actually suck. | |
In addition to the finding out that people aren’t good at judging resumes, I was able to uncover a few interesting patterns. | |
We’ve all heard of and were probably a bit incredulous about the study that showed recruiters spend less than 10 seconds on a resume on average. In this experiment, people took a lot longer to make value judgments. People took a median of 1 minute and 40 seconds per resume. In-house recruiters were fastest, and agency recruiters were slowest. However, how long someone spent looking at a resume appeared to have no bearing, overall, on whether they’d guess correctly. | |
Whenever a participant deemed a candidate not worth interviewing, they had to substantiate their decision. These criteria are clearly not the be-all and end-all of resume filtering; if they were, people would have done better.
It was interesting to see that engineers and recruiters were looking for different things. | |
(I created the categories below from participants’ full-text rejection reasons, after the fact.)
Incidentally, lack of relevant experience didn’t refer to lack of experience with a specific stack. Verbatim rejection reasons under this category tended to say stuff like “projects not extensive enough,” “lack of core computer science,” or “a lot of academic projects around Electrical Engineering, not a lot on the resume about programming or web development.” | |
Culture fit in the engineering graph denotes concerns about engineering culture fit, rather than culture fit overall. This could be anything from concern that someone used to working with Microsoft technologies might not be at home in a Ruby on Rails shop to worrying that the candidate is too much of a hacker to write clean, maintainable code. | |
First of all, and not surprisingly, engineers tended to do slightly better on resumes that had projects. Engineers also tended to do better on resumes that included detailed and clear explanations of what the candidate worked on. | |
To get an idea of what I mean by detailed and clear explanations, take a look at the two versions below (source: Lessons from a year’s worth of hiring data). The first description can apply to pretty much any software engineering project, whereas after reading the second, you have a pretty good idea of what the candidate worked on. | |
Recruiters, on the other hand, tended to do better with candidates from top companies. This also makes sense. Agency recruiters deal with a huge, disparate candidate set while also dealing with a large number of companies in parallel. They’re going to have a lot of good breadth-first insight including which companies have the highest engineering bar, which companies recently had layoffs, which teams within a specific company are the strongest, and so on. | |
So, why are people pretty bad at this task? As we saw above, it may not be a matter of being good or bad at judging resumes but rather a matter of the task itself being flawed — at the end of the day, the resume is a low-signal document. | |
If we’re honest, no one really knows how to write resumes particularly well. Many people get their first resume writing tips from their university’s career services department, which is staffed with people who’ve never held a job in the field they’re advising for. | |
Hell, some of the most fervent resume advice I ever got was from a technical recruiter, who insisted that I list every technology I’d ever worked with on every single undergrad research project I’d ever done. I left his office in a cold sweaty panic, desperately trying to remember what version of Apache MIT had been running at the time. | |
Very smart people — who are otherwise fantastic writers — seem to check every ounce of intuition and personality at the door, then churn out soulless documents expounding their experience with the software development life cycle or whatever. They’re scared that sounding like a human being on their resume — or not peppering it with enough keywords — will eliminate them from the applicant pool before an engineer even has the chance to look at their resume. | |
Writing aside, reading resumes is a tedious and largely thankless task. If it’s not your job, it’s a distraction that you want to get over with so you can go back to writing code. | |
If reading resumes is your job, you probably have a huge stack to get through. So it’s going to be hard to do deep dives into people’s work and projects, even if you’re technical enough to understand them — and this is assuming they included links to their work in the first place. | |
On top of that, spending more time on a given resume may not even yield a more accurate result, at least according to what I observed in this study. | |
Assuming that my results are reproducible and people, across the board, are really quite bad at filtering resumes, there are a few things we can do to make top-of-the-funnel filtering better. | |
In the short term, improving collaboration across different teams involved in hiring is a good start. As we saw, engineers are better at judging certain kinds of resumes, and recruiters are better at others. If a resume has projects or a GitHub account with content listed, passing it over to an engineer to get a second opinion is probably a good idea. And if a candidate is coming from a company with a strong brand, but one that you’re not too familiar with, getting some insider info from a recruiter might not be the worst thing. | |
Longer-term, how engineers are filtered fundamentally needs to change. In my TrialPay study, I found that, in addition to grammatical errors, one of the things that mattered most was how clearly people described their work. In this study, I found that engineers were better at making judgments on resumes that included these kinds of descriptions. | |
Given these findings, relying more heavily on a writing sample during the filtering process might be in order. For the writing sample, I am imagining something that isn’t a cover letter. People tend to make those pretty formulaic and don’t talk about anything too personal or interesting. Rather, it should be a concise description of something you worked on recently that you are excited to talk about, as explained to a non-technical audience. | |
I think the non-technical audience aspect is critical, because if you can break down complex concepts for a layman to understand, you’re probably a good communicator and actually understand what you worked on. Moreover, recruiters could actually read this description and make valuable judgments about whether the writing is good and whether they understand what the person did. | |
Honestly, I really hope that the resume dies a grisly death. One of the coolest things about coding is that it doesn’t take much time or effort to determine whether someone can perform above some minimum threshold. All you need is the internets and a code editor. | |
Of course, figuring out whether someone is great is tough and takes more time. But figuring out if someone meets a minimum standard — mind you the same kind of minimum standard we’re trying to meet when we go through a pile of resumes — is pretty damn fast. | |
And in light of this, relying on low-signal proxies doesn’t make sense at all. | |
I’m CEO and co-founder of interviewing.io, a platform where engineers can practice technical interviewing anonymously and find jobs based on interview performance rather than resumes. | |
Want to find a great job without ever touching your resume? Join interviewing.io.","1" | |
"olivercameron","https://medium.com/@olivercameron/20-weird-wonderful-datasets-for-machine-learning-c70fc89b73d5","1","{""Machine Learning"",""Artificial Intelligence"",""Deep Learning"",Startup,""Big Data""}","354","1.09056603773585","20 Weird & Wonderful Datasets for Machine Learning","They say great data is 95% of the problem in machine learning. We saw first hand at Udacity that this is the case, with the amazing reception from the machine learning community when we open sourced over 250GB of driving data. But, finding interesting data is really hard, and actively holds the industry back from progress. In trying to learn more about this problem I searched far and wide, and cataloged just a sliver of the datasets I found. | |
In the hope that others might find this catalog useful, here’s 20 weird and wonderful datasets you could (perhaps) use in machine learning. | |
Caveat: I haven’t validated that all of these datasets are actually useful for machine learning (in terms of size or accuracy). Use your own judgement when playing with them (and check licenses)! | |
My favorite? The 80,000+ UFO reports dataset: | |
I’ve also been fascinated with the militarized interstates disputes dataset, which includes 200 years of international threats and conflicts. It includes the action taken, level of hostility, fatalities, and outcomes. | |
If you have any thoughts, questions, or datasets you’d like to share, I’d love to hear from you in Tweet-form. You can follow and message me at @olivercameron.","1" | |
"lisamsilvia","https://medium.com/what-s-in-the-box/data-visualization-people-remember-the-feeling-not-the-numbers-db0018dc9998","5","{""Data Visualization"",Design,Tech}","354","4.97672955974843","Data Visualization — People remember the feeling not the numbers","Analytics belong to every process at your company. Your salesperson is data-driven, your manager is data-driven, your company is data-driven, and you are probably data-driven too. The problem is that when you hear the word data and see a spreadsheet or chart, you feel constrained. The amount of information you have in hands is overwhelming, and you are either not sure whether you made the right decisions, or whether you addressed this information correctly to your audience. | |
Data visualizations are being designed for multiple users with different data needs. In order to successfully deliver a compelling message and make better decisions, you need to understand how your audience experiences data, and where your data is coming from. Here is where data visualization takes place. | |
It comes down to learning how to read data through a visual representation and how to create visualizations correctly, so that people can understand a dataset more easily.
Researchers: | |
These three data visualization researchers were interested in seeing how non-experts engage with visualizations. Ten focus groups with just under 50 participants were shown visuals covering a range of chart types that varied in degree of interactivity, original source or location, subject matter, and whether they appeared online or in print. They were asked to record what they saw, how they felt and what they learned. This information was recorded in a 2x2 grid to capture their likes, dislikes, experiences, and the degree of learning. The sample of participants came from very distinct backgrounds to tease out biases.
This study was purely qualitative and its goal was to give people a chance to respond in any way they wanted. | |
The researchers assessed if people could extract information from these visuals and feel confident in doing it. They did not test for comprehension, but for thinking and behavioral patterns. | |
Some of the participants enjoyed the challenge of breaking down an unusual data set. Some struggled to understand the simplest charts like a pie chart or bar chart. | |
Others that were unfamiliar with certain graphics were given annotations and assistance to understand the information. However, they were still confused. Eventually, people felt way more confident to extract data when they were given enough time to analyze these charts. The aesthetics of the visuals also influenced people’s reactions and visually oriented people felt more comfortable and decided to invest more time. | |
How do people currently respond to data visualization and what does this mean for your presentation’s effectiveness? | |
With your current visuals, at least half of your audience pretends to understand what is going on with your data but still does not quite get it, and that makes sense. It is a real challenge to design for a diverse audience. They are familiar with what you do, but there are huge interpersonal differences in how they perceive visualizations. What you perceive as intuitive can differ from what your peers feel.
Data and numbers can inspire fear and anxiety among your audience and this is the first thing they see. They don’t start reading all the instructions and titles. It’s very boring to tell people to read. | |
After the focus group, some participants were selected to write their encounters and reactions after 5–6 weeks. They could not remember any of the numbers, but they could remember how big the statistic was. | |
As you are designing these visualizations and delivering your message, observe your audience in-context, specific to unique cultures and environments in advance. Demonstrate deep user empathy in your approach by pushing specific feelings to your audience when you are trying to communicate with them. People remember the gist, message, and the feeling, not the numbers. | |
If you can find and understand the root cause of their fears, you can turn data into a story that will stay with your audience, conveying exactly what you are trying to communicate and solve.","2" | |
"auren","https://medium.com/newco/where-should-machines-go-to-learn-c2461f7e45fc","4","{""Machine Learning"",""Artificial Intelligence"",Tech,Policy,""Big Data""}","336","6.34905660377358","Where Should Machines Go To Learn?","Past civilizations built grand libraries to organize the world’s knowledge. These repositories of information focused on cataloging, aggregating, organizing and making information accessible so that others could focus on learning and creating new knowledge. | |
AI and machine learning systems also need repositories of information from which to learn — and right now everyone is building their own. If different groups of people focus on organizing data versus building AI, the progress of intelligent computers will massively accelerate. | |
Despite all the progress in machine learning (ML), most of our computers (and their applications) leave much to be desired. In her essay Rise of the Data Natives, data scientist Monica Rogati explains:
When it comes to AI improving human lives, the pace of progress has been super slow. | |
Many state of the art AI and ML applications would be dramatically improved with more training data. This is hugely important. Google is one of the best AI and ML companies in the world. Why? Peter Norvig, Research Director at Google, famously stated that “We don’t have better algorithms. We just have more data.” | |
Michele Banko and Eric Brill at Microsoft Research wrote a now famous paper in support of this idea. They found that very large data sets can improve even the worst machine learning algorithms. The worst algorithm became the best algorithm once they increased the amount of data by multiple orders of magnitude. | |
Good, high-quality data serve as truth sets about the world from which our machines can learn. And most AI and ML is likely underperforming because getting access to great truth sets is very hard.
Organizing the past at the scale required by AI and ML is hard. Before a company can actually work on their AI or ML they need to solve four key challenges: | |
1) Acquire the data | |
2) Host the data | |
3) Prepare the data | |
4) Understand data privacy
To understand the difficulty, let’s briefly discuss what each of these challenges entail. | |
This is where companies like Google and Facebook have a huge advantage — their businesses generate treasure troves of data. | |
For example, to build a neural network that could recognize human faces and cat faces, Google used 10 million YouTube videos. That is a data set to which literally no one else in the world had access. Four years later (just a few months ago in the Fall of 2016) Google released a large-scale dataset of labeled photos and videos to help the machine learning community — because they recognize how valuable this dataset is for everyone else. | |
But if you are a start-up doing image classification or a new self-driving car company (like Oliver Cameron), or a new search/query technology (see Daniel Tunkelang), or trying to revolutionize health care (see Jeremy Howard), or any other valuable AI application … even if you have $800 million in funding … you have very little data. | |
You might need to contact hundreds of companies and negotiate major business development (BD) deals to try to license data. It will take a huge effort — sometimes many years — and a lot of money. You might need to spend tons of engineering time and person hours and technological innovations to aggregate and organize and label data from open sources. | |
If companies can’t figure out how to get truth sets, they can’t build smart machines. | |
Let’s assume you have access to a great truth set — you also need to stand it up (host it in some way that your data scientists and engineers can work with it). Because the best datasets are very large, you need a cloud infrastructure and distributed processing technologies and your share of hot-shot, high-priced, back-end data engineers to make the data queryable and actionable. | |
Your best engineers spend most of their time just managing your big data infrastructure, pipelines, and query layers. | |
Then, even the best sources — even data you generate yourself — will be dirty. Your data will have errors, typos, mislabels, holes and need tons of cleaning. Your ML engineers and data scientists will spend most of their time just getting your data ready to use. It’s a cliche: data scientists spend 80% of their time just preparing the data. And this is the least enjoyable part of their job. | |
The size of this problem has led some to joke that the most impactful applications of AI would be to help data scientists clean their data faster. | |
If the process of building smart computers itself creates new human problems for smart computers to solve — then we can see how progress will be slow. | |
If you are trying to solve the most important questions of society, you probably are working with data about people. That means you need to become an expert on privacy and protecting consumer data. This requires significant ethical and legal sophistication. | |
You must understand the evolving regulatory landscape of data privacy. This includes the implications of FCC rulings in the United States and the significance of GDPR in the EU and much more. It requires understanding evolving definitions of PII (personally identifiable information). | |
It probably means having some talented and expensive legal counsel. And it requires good business judgement as new technology continuously pushes data privacy laws into uncharted territory. Protecting personal privacy and developing next-generation AI are essential and mutually inclusive — but without the right expertise you can fail at both.
Our point is this: Almost all the super-smart data people want to focus on building AI and machine learning applications to improve human lives. They want to use data to make decisions and predictions about the future and power incredible new technologies — like self-driving cars or super-human medical diagnoses or global economic forecasts. This is great. But instead, they are spending tons of time organizing the past: acquiring data, hosting data, preparing data, and navigating data privacy. | |
David Ricardo’s classic economic theory of comparative advantage boils down to this: focus on your strengths, trade with others, and everybody wins. Organizing the past and predicting the future are different kinds of expertise — you shouldn’t have to master the former to contribute to the latter. In fact, requiring everyone to master all of these domains is bad for AI and ML as a field because progress will be rate-limited to the handful of big players with the resources to do it. | |
Just like internet companies use Amazon Web Services to rent access to hardware, ML and AI companies should rent access to data. Innovators should focus on applying AI and ML to their domains of expertise (cancer, robotics, self-driving cars, economics, etc.). They should rely on other companies with different kinds of expertise to acquire the data, to build appropriate infrastructure, to clean the data, make it easy to work with, and to protect consumers’ privacy. | |
If people (and companies) focus on their strengths, then some will organize the past and others will predict the future. The barrier to start working on AI and ML will be dramatically lowered. Access to data will be democratized. If we focus on our strengths, the pace of innovation in AI and ML will massively accelerate. | |
— - | |
This piece was authored by Auren Hoffman and Ryan Fox Squire. Auren is CEO of SafeGraph and former CEO of LiveRamp. Ryan is Product Manager at SafeGraph and former Data Scientist at Lumos Labs. | |
Special thanks to inspirations: Oliver Cameron, Michael E. Driscoll, Anthony Goldbloom, Brett Hurt, John Lilly, Hilary Mason, dj patil, Delip Rao, Joseph Smarr","3" | |
"airbnbeng","https://medium.com/airbnb-engineering/caravel-airbnb-s-data-exploration-platform-15a72aa610e5","8","{Analytics,""Data Science"",""Open Source""}","331","4.30691823899371","Superset: Airbnb’s data exploration platform","By Maxime Beauchemin | |
At Airbnb, we love data, and we like to think that analytics belongs everywhere. For us to be data-driven, we need data to be fluid, fast flowing, and crystal clear. | |
As a vector for data exploration, discovery, and collaborative analytics, we have built and are now open sourcing, a data exploration and dashboarding platform named Superset. Superset allows data exploration through rich visualizations while performing fast and intuitive “slicing and dicing” against just about any dataset. | |
Data explorers can easily travel through multi-dimensional datasets while creating and sharing “slices”, and assemble them in interactive dashboards. | |
It takes very little time, maybe 10 to 30 seconds of delays, to break someone’s cognitive flow. Superset keeps your thinking loop spinning by providing a fluid query interface and enforces fast query times. Slicing, dicing, drilling down, and pivoting across visualizations allow users to explore multi-dimensional data spaces effectively. | |
The codeless approach to data navigation allows everyone on board, democratizing access to data. On one side of the spectrum, users that are less technical find an easy interface to query data. On the other end of that spectrum, advanced users enjoy gaining velocity and the ease of sharing the content they create. | |
Data scientists, engineers and other data wizards can still use Tableau, R, Jupyter, Airpal, Excel, and other means to interact with data, but Superset is gaining mind share internally as a frictionless and intuitive vehicle for sharing data and ideas. | |
Superset should work just as well in your environment as it does in ours. The query layer was written using SQLAlchemy, a SQL toolkit that allows authoring queries that can be translated to most SQL dialects out there. | |
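As a hint of what that buys in practice, here is a small sketch (not Superset’s actual code; the table and column names are made up, and it assumes SQLAlchemy 1.4 or later): the same expression object compiles to whichever SQL dialect the engine speaks.

from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, func, select

metadata = MetaData()
bookings = Table(
    'bookings', metadata,
    Column('city', String),
    Column('guests', Integer),
)

# A "slice": total guests per city, largest first.
query = (
    select(bookings.c.city, func.sum(bookings.c.guests).label('total_guests'))
    .group_by(bookings.c.city)
    .order_by(func.sum(bookings.c.guests).desc())
)

# The same query object renders as SQLite, PostgreSQL, MySQL, etc., depending on the engine.
print(query.compile(create_engine('sqlite://')))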
Beyond the SQL world, Superset is designed to harness the power of Druid.io. Druid is an open source, fast, column-oriented, realtime, distributed data store. Coupling the two together accelerates analysis cycles by taking delays out of the equation. | |
Superset allows you to manage a thin layer to enrich your datasets’ metadata. This simple layer defines how your dataset is exposed to the user and is composed of: | |
We’ve made taking Superset for a test drive very easy. After the simple installation process, you’ll get Superset loaded with a nice set of dashboards, charts, and datasets that you can explore and interact with. The next logical step is to connect to your local databases and start visualizing them. | |
Superset started as a hackathon project less than a year ago. While the project is already solid, it’s still young and gaining momentum. Look forward to more interactivity in dashboards, support for a growing number of visualizations, a set of training videos, more social features like tags, comments, usage information, chart annotations, and much more! | |
We’re planning on releasing the data visualizations and controls exposed in Superset as reusable React components. This modular approach will make these building blocks available to application developers. At Airbnb, we have many use cases for rich and interactive visualizations as part of internal applications; for example, our A/B testing framework, anomaly detection framework and user session explorer. It would be great to share the same components across all of these applications.
Join the community and find pointers to resources on Superset’s Github repository! | |
Note: Superset was originally released with the name Caravel.","1" | |
"chrismoranuk","https://medium.com/@chrismoranuk/what-i-learned-from-seven-years-as-the-guardians-audience-editor-621df42c14ab","0","{Journalism,Media,""Digital Journalism"",Analytics,""Big Data""}","329","4.54339622641509","What I learned from seven years as the Guardian’s audience editor","","2" | |
"triciawang","https://medium.com/ethnography-matters/why-big-data-needs-thick-data-b4b3e75e3d7","9","{""Big Data"",""Thick Data"",Ethnography,""Decision Making"",China}","315","12.5962264150943","Why Big Data Needs Thick Data","","12" | |
"karlsluis","https://medium.com/@karlsluis/beyond-tufte-fd93cbcec6af","7","{Design,""Data Visualization"",Books}","306","6.27264150943396","Beyond Tufte","Nine Great Books about Information Visualization | |
Maybe it’s anachronistic to celebrate static, printed books when so many of us love and create interactive data displays. I don’t care. I love books. | |
Edward Tufte, the patron saint of information visualization, has published four legendary books. Here are nine more indispensable favorites about visualizing information. I’ve limited myself to books I’ve actually read; if you find a favorite missing, leave a response! Let’s start with: | |
by Stephen Few
Stephen Few is an unsung hero of information visualization. Steadily he toils in the shadow of The Great and Wonderful Tufte as the vanguard of business intelligence dashboards. Few writes practical, clear, no-nonsense advice about information-dense dashboards. He focuses on execution, not theory, and for that, his most popular title, Now You See It, is great foundational reading.
Few highlights the power of visualizations to help us “think with our eyes” and overcome the limits of human memory (see The Magical Number Seven). From small multiples to brushing data on dashboards, he emphasizes externalizing information processing to help the brain do what it does best — recognize patterns. | |
Few also covers a surprising amount of data analysis in Now You See It. The book provides an approachable introduction to data science, from navigating data to common patterns in time-series, deviation, distribution, correlation, and multivariate data. The below graphic is a great example of Few’s approach: here, Few regroups data by month to correct for periodicity and adds an average to help the reader see the patterns in the data. | |
by Jacques Bertin | |
Make no mistake: this is the ur-book of information visualization. Bertin’s masterpiece, published in 1967, outlines a systematic approach to the creation of information graphics as no other book has, before or since. In fact, Bertin provides a simple table that ties together math, music, and graphics, right on the second page of Semiology of Graphics:
This table exemplifies Bertin’s approach to visualizing information: break a problem into its constituent elements, cross them, then examine and define each intersection. Bertin presents a whole system in Semiology of Graphics: he begins with retinal variables, such as size and color, which express planar dimensions, such as association, order, and quantity. Watch Bertin demonstrate the distinct graphic opportunities that the retinal variables provide for solving the same visual problem: | |
After addressing fundamentals, Bertin dives into theory. He covers the three questions every visualization should answer, as well as the three functions of visualization: recording, communicating, and processing information — here, by using his famous “contingency table” and “reorderable matrix.” Bertin covers maps, graphics, and so much more in Semiology — though the book is dense, a careful read pays dividends.
by Dona M. Wong | |
Wong’s reference book, Guide to Information Graphics, is rarely far from my desk. Even after years of designing visualizations, I still turn to page 75 for a quick check on her guidelines for ordering wedges in a pie chart — on the rare occasion I’m designing a pie chart, that is. | |
If you think that’s a simple and useful diagram, Wong’s book offers many more: 140 pages of ready reference for the working visualization designer. Guide to Information Graphics goes beyond chart fundamentals, too — the book also provides tips for math and copywriting, as well as a fun (and helpful) chapter titled “Tricky Situations”.
by Nathan Yau | |
Nathan Yau’s Visualize This is a fantastic book for beginners ready to move beyond out-of-the-box visualization tools and to create their own work. | |
Yau breaks down easy processes to create essential visualizations like bar graphs and U.S. maps. Using R to create visualizations, or taking rough graphs from Excel into Illustrator, are methods I use often, even today. For example, in a few clearly explained pages, Yau teaches how to transform a table of information into an excellent small-multiples visualization like this: | |
by Colin Ware | |
Fair warning: I felt misled by the title of this book. Though I expected another book about colors, shapes, bar graphs, line charts and the like, I found instead a hardware specification for your eyes and brain. Information Visualization explains how our minds process visual information. It’s all there, from an in-depth explanation of how eyes seek edges and patterns, to color theory, to space perception, and even a chapter on the limits of memory.
By way of example, a nugget of knowledge I learned from Information Visualization: extend your arm and hold up your thumb — your thumbnail represents the extent of your focus, or more specifically, your fovea, where the vast majority of the cones that provide color vision cluster. Beyond your thumbnail, your brain is far more responsive to motion than to detail, which helps explain why animation can be incredibly powerful for visualizations and pre-attentive processing. | |
by Jenifer Tidwell | |
Though Tidwell’s Designing Interfaces is hardly the only book on its subject, it is perhaps the best book I’ve read about common web design patterns. Anyone interested in designing interactive information visualizations would do well to familiarize herself with these common patterns. Though some forms, such as bar graphs and line graphs, may have matured, interaction with these forms still provides many opportunities for every designer. | |
In the book, Tidwell provides a practical overview of fundamental interface patterns, from site architecture to forms and controls. Tidwell also describes when to use specific patterns, framed by the problems these patterns solve and their benefits for users. I was delighted to discover that Chapter 7 specifically addresses “Showing Complex Data” — she’s writing for us! Designing Interfaces is an excellent resource for new designers and a great reference for those with experience.
by David McCandless | |
I picked up the first edition of David McCandless’s book, The Visual Miscellaneum, after a trip to an exhibit at the Cooper Hewitt Design Museum. Bar none, this is the book that convinced me to make information visualization my professional focus. | |
I fell in love with McCandless’s playful and unpretentious approach to visualizing data and information about the world around us, from budgets to beards. The enthusiasm that welcomed his work made me believe that I could find a place in the world visualizing information. I turn to his book for inspiration, usually to see this frequently consulted graphic:
by Herbert Bayer | |
My favorite book of inspirational data graphics is, naturally, out of print and in demand. I watched eBay for months until I found a slightly damaged copy of World Geographic Atlas for less than four figures. It was worth it. | |
In 1953, the Container Corporation of America commissioned Bauhaus designer Herbert Bayer to create an atlas to commemorate the company’s 25th year of business. Alongside three other accomplished designers, Bayer worked for five years on an oversize (11.5"" x 15.5"") book of 368 pages, featuring over 2,200 diagrams. The results are stunning and, to my eye, yet unmatched. Just take a look at this rich spread on the United Kingdom and the Scandinavian countries: | |
From color choice to icon design, to illustrations, to topographic choropleth maps, this book literally has it all. Find a copy if you can, or, take a few minutes to watch this video. | |
by Joost Grootens | |
Joost Grootens is a Dutch graphic designer. Although the Metropolitan World Atlas is not his only covetable book, it is certainly my favorite. Grootens’s style is spare — he designs with only a few colors, one or two typefaces, and repeats, repeats, and repeats structure. This restraint lets his data — geographic or otherwise — take center stage.
Metropolitan World Atlas maps 50 major global cities at the same scale with consistent visualizations of data about those cities, including economic and population factors. It is strangely difficult to find any examples of a scaled comparison of major cities, let alone examples this beautiful and useful. Grootens takes advantage of the physical affordances of the book — comparing two cities is as easy as flipping pages back and forth: | |
Sadly, Metropolitan World Atlas is another out-of-print treasure. If you can find one, be sure to scoop it up! | |
Thanks for reading and tell your friends! | |
@karlsluis | karlsluis.com","1" | |
"Luca","https://medium.com/startup-grind/analyzing-205-718-verified-twitter-users-cf0811781ac8","11","{Twitter,""Social Media"",""Data Visualization"",""Social Network"",""Twitter Data""}","299","8.92641509433962","Analyzing 205 718 Verified Twitter Users","Since 2008 I create network visualizations to better understand how communities work. In this article I take a look at how verified Twitter users are connected and who they are. | |
Here are all verified Twitter accounts in one image. Every node is an account and the size stems from how many people follow them. Size adjusted with spline interpolation to make accounts with fewer followers more visible and reduce the sizes of those with the most followers. Else the accounts with millions of followers would be bigger than many of the communities themselves. | |
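As a rough illustration of that size adjustment (not the exact Gephi setting), a monotone spline can compress follower counts into node sizes; the control points below are made up.
import numpy as np
from scipy.interpolate import PchipInterpolator
# Made-up control points mapping follower counts to node sizes.
followers_ctrl = np.array([100, 1000, 100000, 1000000, 90000000], dtype=float)
sizes_ctrl = np.array([2, 6, 14, 22, 30], dtype=float)
# A monotone spline over log10(followers) keeps the ordering but flattens the
# top of the range, so mega accounts no longer dwarf whole communities.
size_spline = PchipInterpolator(np.log10(followers_ctrl), sizes_ctrl)
def node_size(follower_count):
    clamped = min(max(float(follower_count), followers_ctrl[0]), followers_ctrl[-1])
    return float(size_spline(np.log10(clamped)))
print(node_size(5000), node_size(92200000))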
The image looks nice, but it becomes interesting when you go deeper and look at each of these knots of accounts to understand what they are about: whether they have something in common, or whether they are just random people following each other.
As you can see in the big image further below, the groups aren’t as disconnected as they look without the connections (edges) drawn. But the algorithm was still able to find tightly knit communities. And while there are many cross-community followings, most followings happen within the communities themselves.
The whole graph is US centric. That big brown node in the middle of everything? That’s @twitter. And the light blue one it overlaps with is @youtube. The other big light blue cluster in the middle bottom right is celebrities: @katyperry, @justinbieber, @theellenshow, @rihanna, @ladygaga and so on. There is much more going on in this center sector, but on this visualization it’s hard to see. I will take a closer look later in this article.
While it’s possible to differentiate between topical groups in the center, the rest of the sub-communities are mostly grouped regionally. This has to do with the smaller number of verified accounts for the other countries. If I put each of these groups on their own graph, I am sure it will be possible to get a clearer picture of how they are connected internally, rather than just one blob.
Germany, Austria and Switzerland are naturally connected by the language. Canada is the US extension to the left, UK to the right and Australia at the bottom right. Again, language as one factor for it. There is the cultural closeness as well. But this exists for more groups. | |
Another language group is Spanish in green at the top. Even after multiple attempts, I wasn’t able to find another connection between all the accounts grouped together there. They are from different South American countries as well as US media outlets in Spanish and more. Close to Mexico, Argentina, and Spain. Brazil with some distance. | |
Turkey is far away from everyone else. Especially the EU, which is right beside the UN and some UK politics. Nearer to the EU there is Israel. I was surprised that Portugal appears near Finland, Sweden and Denmark. On the other side Russia. And just behind Russia Qatar and Saudi Arabia. | |
France has an outsider position at the top right, some connection to Italy and Spain. Far away from the EU and Germany. But not all of Germany. There is another German sub-community. Soccer. Or better: Bundesliga. Bundesliga is close to the Netherlands as well. And of course to soccer in the UK. There is another group of sub-communities close: eSport. And eSport goes through twitch to game developers who sit at the edge of the US bubble. | |
At the right edge there is Asia with many smaller communities. I still don’t know why Japan got two sub-communities which aren’t that well connected with each other.
In the video I go through the bigger sub-communities and talk about what I think they are and why they are displayed as a distinct group. | |
Twitter has no auto-complete for location data like Facebook does. People can put into the location field whatever they want. As a result the data isn’t that great to work with. As you can see in the graph above, the Top 25 locations include several different spellings of the same places. London isn’t more popular than Los Angeles, but more people use the same form. And these are only the most popular forms; for each city there are many different ways to write it. Some add the state, some the country, some use the neighborhood and much more.
Maybe I will have time to look into all these forms one day or find a tool that normalizes them. For now I think the word cloud is enough because it doesn’t care about the exact location of the word in the string. | |
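A crude starting point for that normalization is a hand-made alias table; this is just a sketch, not a real geocoder.
from collections import Counter
# Tiny, hand-made alias table; a real solution would use a geocoding service.
ALIASES = {
    'london, uk': 'London',
    'london, england': 'London',
    'los angeles, ca': 'Los Angeles',
    'la': 'Los Angeles',
    'nyc': 'New York',
    'new york, ny': 'New York',
}
def normalize(location):
    key = location.strip().lower()
    return ALIASES.get(key, location.strip())
locations = ['London, UK', 'london, england', 'NYC', 'New York, NY', 'LA', 'Paris']
print(Counter(normalize(loc) for loc in locations).most_common())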
When my Twitter account became verified, I noticed that @verified started to follow me. Looking at its followings it’s easy to guess that it follows every verified account on Twitter. Therefore I can say there are 206 000 verified accounts on Twitter at the moment. And they verify circa 1000 accounts every day. There may be some accounts who block @verified and therefore don’t appear in their followings but I assume that the amount is so small that I can ignore it. Using the account as a starting point it’s possible to collect the network of verified accounts on Twitter. | |
I used a modified version of the Python command line tool twecoll by JP de Vooght to first collect a list of every account followed by @verified. The tool then went through all of these 205k accounts and looked at whom they follow. For one data set I limited it to accounts that follow fewer than 10 000 accounts, and for a second data set to accounts that follow fewer than 1 000 accounts. There are two reasons for this. The more accounts people follow, the less important each connection becomes. The second reason is the technical limitation of my computer (i5 4670k at 4.2GHz, 16GB RAM, Samsung EVO 840 250GB, GTX 760). While it does work with the larger data set, it isn’t fun to work with, because everything takes longer.
The data collection ran on a Raspberry Pi 2 for 7 days, from 22 to 28 August 2016, with only a few hours of pause because of errors I had to fix manually. Because of the long run time there are some inconsistencies in the data where people followed or unfollowed someone in that timeframe. On this scale it doesn’t make a difference. There are some accounts in the data set which aren’t verified anymore. I took a closer look at 36 accounts: all the accounts that lost their verified status on a single day. Half of them deleted their account or got suspended; the other half went private and lost their verified status because of that.
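In outline, the crawl looks something like the sketch below; fetch_following_ids is a hypothetical placeholder for what twecoll and the Twitter API actually do, and the cutoff mirrors the 10 000-followings filter described above.
import time
def fetch_following_ids(account):
    # Hypothetical placeholder: in reality this pages through the Twitter API
    # (as twecoll does) and has to respect its rate limits.
    raise NotImplementedError
def crawl(seed_account='verified', max_followings=10000, pause=60):
    verified_ids = fetch_following_ids(seed_account)   # pass 1: every verified account
    verified = set(verified_ids)
    edges = {}
    for account in verified_ids:                       # pass 2: whom each of them follows
        following = fetch_following_ids(account)
        if len(following) < max_followings:            # drop accounts following 10k+ others
            edges[account] = [f for f in following if f in verified]
        time.sleep(pause)                              # crude pacing between accounts
    return edges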
The big data set, <10 000 followings, has 205 718 twitter accounts and 45 302 877 connections between them. The smaller data set, <1 000 followings, has 205 718 accounts as well and 19 176 260 connections. | |
I use Gephi to visualize the data. I tweeted the process of getting the data into a useful state. OpenOrd (25, 25, 25, 10, 15; cut 0.8; 500 iterations) gave me the most useful layout. Colors are calculated by the modularity algorithm. I change the sizes of the nodes from time to time; if not noted otherwise, sizes represent follower counts.
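For readers without Gephi, roughly the same pipeline can be sketched with networkx; note this uses a different community algorithm than Gephi’s modularity class, the edge list file name is hypothetical, and it is only practical on a much smaller sample of the graph.
import networkx as nx
from networkx.algorithms import community
# Hypothetical edge list: one 'follower followee' pair per line.
G = nx.read_edgelist('sample_edges.txt', create_using=nx.DiGraph())
# Community detection comparable in spirit to Gephi's modularity coloring.
communities = community.greedy_modularity_communities(G.to_undirected())
for i, nodes in enumerate(communities):
    for n in nodes:
        G.nodes[n]['community'] = i
# Export for layout and visual inspection in Gephi (OpenOrd, sizing, etc.).
nx.write_gexf(G, 'verified_sample.gexf')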
I loaded the stats of the 205k verified accounts into Excel and ignored the connections. These numbers don’t ignore any accounts, no matter how many accounts they follow. | |
When I submitted my account for verification, I was told by some contacts that I didn’t have enough followers. Indeed, verified accounts have on average 117 845 followers. But there is quite a long tail. The median is at 9 370 followers. There are more than 100k accounts with fewer than 10 000 followers. And the rest don’t have that many more. The average gets skewed by mega accounts like @katyperry with 92.2m followers. There are 188 verified accounts with more than 10 million followers and 4 330 verified accounts with more than 1 million followers. There is one verified account with only two followers.
But how many accounts do verified accounts follow? On average they follow 2031 accounts. But again we have some mega followings. One account follows 3.6m accounts. The median is at a quite manageable 475 followings. Personally, I feel like everything above 5 000 followings isn’t followed manually. Following everyone is an often-used tactic to generate followers in return. So many people did it that Twitter introduced a limit: you can only follow a certain percentage more accounts than follow you (base limit 5 000, daily limit 1 000). This resulted in a new follow-and-unfollow tactic: accounts follow as many people as possible and unfollow everyone who doesn’t follow them back within x days. I digress. There are 3 551 accounts which follow nobody and 33 328 accounts which follow fewer than 100 others. One account returned a negative followings count of -28. I assume that’s a bug in the Twitter database.
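The gap between average and median is easy to reproduce once the per-account stats are in a data frame; the file and column names below are assumptions for illustration, not the actual export.
import pandas as pd
# Assumed export of the per-account stats (one row per verified account).
accounts = pd.read_csv('verified_account_stats.csv')
for col in ['followers', 'followings']:
    print(col,
          'mean:', round(accounts[col].mean()),
          'median:', round(accounts[col].median()),
          'share below 10k:', round((accounts[col] < 10000).mean(), 2))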
Verified accounts have posted an accumulated amount of 2 488 119 264 status updates. 12 095 on average. Median at 4 191. Without account age, these numbers mean nothing. Most talkative accounts are support accounts by companies. AmexOffers has posted 5.2m tweets. Of course there are many verified accounts which haven’t tweeted at all. Or deleted everything. 131 to give you the number. And 25 764 verified accounts posted less than 500 tweets. | |
Most verified Twitter accounts were created in 2009. The graph above is quite unsurprising if you look at the general popularity of Twitter. | |
On weekdays nearly twice as many accounts were created per day as on the weekend. I need a data set to compare against before I can say whether this is more likely to come from the general usage pattern of Twitter or whether these accounts are more likely work accounts, which are often created by agencies.
You can explore the graph either as a Gigapan version, which loads dynamically, or you can go all in and try the 30MB sigma.js version, which may crash your browser and takes some minutes to load, but has a search function. There is also a version without search, but with better zooming. Both sigma.js versions have their y-axis inverted: what’s at the top in the screenshots here is at the bottom there.
I want to write articles about each sub-community and need help from people who already know the respective group. If you want to collaborate with me on one of these articles, please send me a mail with the group you are interested in: [email protected]. I will then send you a file with the data set for that group, a short guide on how to work with it, and a Google Docs link where we can work on the article together. You can publish the final article on your own blog/medium/publication or we can publish it here.
Here is the guide on how anyone can analyze Twitter networks with Gephi
Update: I just stumbled across a similar analysis of verified accounts by Haje Jan Kamps from about a year ago.","3"
"Mybridge","https://medium.com/mybridge-for-professionals/machine-learning-top-10-articles-for-the-past-month-2f3cb815ffed","17","{""Machine Learning"",""Artificial Intelligence"",""Data Science"",Programming,""Software Development""}","291","3.24150943396226","Machine Learning Top 10 Articles for the Past Month.","In this observation, we ranked nearly 1,750 articles posted in August 2016 about machine learning, deep learning and AI. | |
Mybridge AI evaluates the quality of content and ranks the best articles for professionals. This list is competitive and carefully includes quality content for you to read. You may find this condensed list useful in learning and working more productively in the field of machine learning. | |
Teaching Computer to Play Super Mario with neural network. Courtesy of Ehren J. Brav and Google DeepMind | |
[Part II] A Beginner’s Guide To Understanding Convolutional Neural Networks. Courtesy of Adit Deshpande at UCLA | |
[Part III] The 9 Deep Learning Papers I Recommend
Image Completion with Deep Learning in TensorFlow. Courtesy of Brandon Amos, Ph.D at Carnegie Mellon University | |
[Part V] Language Translation with Deep Learning and the Magic of Sequences. Courtesy of Adam Geitgey, Director of Software Engineering at Groupon | |
Deep learning for complete beginners: Recognizing handwritten digits. Courtesy of Petar Veličković | |
An Exclusive Look at How AI and Machine Learning Work at Apple. Courtesy of Steven Levy and The Backchannel Team | |
Deep Dream (Google). Courtesy of Computerphile and Mike Pound | |
TensorFlow in a Nutshell — Part One: Basics. Courtesy of Camron Godbout | |
The Three Faces of Bayes | |
The Brain Behind Google’s Artificial Intelligence: Interview with Jeff Dean. Courtesy of Peter High | |
Srez: Image super-resolution through deep learning | |
[3,293 stars on Github] | |
Very simple implementation of Neural Algorithm of the Artistic Style (With TensorFlow) | |
[121 stars on Github] | |
Data Science for Beginners: Deep Learning in Python with Tensor flow and Neural Networks. (Most popular as of September, 2016) | |
[2,073 recommends, 4.6/5 rating] | |
That’s it for Machine Learning Monthly Top 10. | |
If you like this curation, you can read Top 10 daily articles based on your skills on our iPhone & iPad app.","7" | |
"muellerfreitag","https://medium.com/@muellerfreitag/10-data-acquisition-strategies-for-startups-47166580ee48","12","{""Machine Learning"",""Big Data"",Startup}","271","10.1122641509434","10 Data Acquisition Strategies for Startups","","8" | |
"airbnbeng","https://medium.com/airbnb-engineering/using-r-packages-and-education-to-scale-data-science-at-airbnb-906faa58e12d","7","{""Data Science"",""R Programming""}","268","8.2311320754717","Using R packages and education to scale Data Science at Airbnb","By Ricardo Bion | |
One of my favorite things about being a data scientist at Airbnb is collaborating with a diverse team to solve important real-world problems. We are diverse not only in terms of gender, but also in educational backgrounds and work experiences. Our team includes graduates from Mathematics and Statistics programs, PhDs in fields from Education to Computational Genomics, veterans of the tech and finance worlds, as well as former professional poker players and military veterans. This diversity of training and experience is a tremendous asset to our team’s ability to think creatively and to understand our users, but it presents challenges to collaboration and knowledge sharing. New team members arrive at Airbnb proficient in different programming languages, including R, Python, Matlab, Stata, SAS, and SPSS. To scale collaboration and unify our data science brand, we rely on tooling, education, and infrastructure. In this post, we focus on the lessons we have learned building R tools and teaching R at Airbnb. Most of these lessons also generalize to Python. | |
Our approach has two main pillars: package building and education. We build packages to develop collaborative solutions to common problems, to standardize the visual presentation of our work, and to avoid reinventing the wheel. The goals of our educational efforts are to give all data scientists exposure to R and to the specific packages we use, and to provide opportunities for further learning to those who wish to deepen their skills. | |
In small data science teams, individual contributors often write single functions, scripts, or templates to optimize their workflows. As the team grows, different people develop their own tools to solve similar problems. This leads to three main challenges: (i) duplication of work within the team, both in writing the tools and reviewing code, (ii) lack of transparency about how tools are written and lack of documentation, often resulting in bugs or incorrect usage, (iii) difficulty sharing new developments with other users, slowing down productivity. | |
R packages shared through Github Enterprise address these three challenges, which makes them a great solution for our needs. Specifically, (i) multiple people can collaborate simultaneously in order to improve the tools and fix bugs, (ii) contributions are peer reviewed, and (iii) new versions can be deployed to all users as needed. Packages are the basic units of reproducible R code. They can include functions, documentation, data, tests, add-ins, vignettes, and R markdown templates. I started working on our first internal R package, called Rbnb, nearly two years ago. It was initially launched with only a couple of functions. The package now includes more than 60 functions, has several active developers, and is actively used by members of our Engineering, Data Science, Analytics, and User Experience teams. As of today, our internal Knowledge Repo has nearly 500 R Markdown research reports using the Rbnb package. | |
The package is developed in an internal Github Enterprise repository. There, users can submit issues and suggest enhancements. As new code is submitted in a branch, it is peer reviewed by our Rbnb developers group. Once the changes are approved and documented, they are merged into the master codebase as a new version of the package. Team members can then install the newest release of Rbnb directly from Github using devtools. We are currently working on adding lintr checking for both style and syntax, and test coverage with testthat. | |
The package has four main components: (i) a consistent API to move data between different places in our data infrastructure, (ii) branded visualization themes, scales, and geoms for ggplot2, (iii) R Markdown templates for different types of reports, and (iv) custom functions to optimize different parts of our workflow. | |
The most used functions in Rbnb allow us to move aggregated or filtered data from a Hadoop or SQL environment into R, where visualization and in-memory analysis can happen more naturally. Before Rbnb, getting data from Presto into R in order to run a model required multiple steps. Data scientists would have to authenticate with their cluster credentials, open an SSH tunnel, enter host, port, schema, and catalog information for Presto, download a csv file, load that file into R, and only then run the desired models. Now, all of this can be done by piping two functions, as Rbnb takes care of all of the implementation details under the hood, while working with other well-maintained packages like RPresto. Similarly, getting data from R and moving it to Amazon S3 can be done with only one line of code. Data scientists no longer have to save a csv file from R, set up multi-factor authentication with our API keys, configure AWS, and run a bash command to move the csv into remote storage. More importantly, all functions follow a similar specification (i.e., place_action(origin, destination)). | |
If our data infrastructure changes — for instance, if a cluster moves or our Amazon S3 authentication details change — we can change our implementation of Rbnb without changing our functions’ interface. | |
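Rbnb itself is an internal R package, but the interface idea carries over to any language; here is a purely illustrative Python sketch of the same spirit (a tiny, uniform data-movement API that hides connection plumbing), with every name invented for the example.
import sqlite3
import pandas as pd
def _warehouse_connection():
    # Stand-in for cluster credentials, SSH tunnels, host/port/catalog details.
    # Swapping the backend later only changes this one private helper.
    return sqlite3.connect(':memory:')
def warehouse_to_df(query):
    # warehouse_to_df(query) -> DataFrame, hiding all connection plumbing.
    with _warehouse_connection() as conn:
        # Seed demo data so the sketch runs standalone.
        conn.execute('CREATE TABLE IF NOT EXISTS listings (id INTEGER, city TEXT)')
        conn.execute("INSERT INTO listings VALUES (1, 'Lisbon'), (2, 'Osaka')")
        return pd.read_sql_query(query, conn)
def df_to_csv(df, destination):
    # df_to_csv(data, destination): same small, consistent data-movement interface.
    df.to_csv(destination, index=False)
df = warehouse_to_df('SELECT * FROM listings')
df_to_csv(df, 'listings_snapshot.csv')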
The package has also helped us brand our work across Airbnb through the use of consistent styles for data visualizations — see these posts by Bar Ifrach and Lisa Qian for examples. We have built custom themes, scales, and geoms for ggplot2, CSS templates for htmlwidgets and Shiny, and custom R Markdown templates for different types of reports. These features override R defaults with fonts and colors that are consistent with the Airbnb brand. | |
The Rbnb package also has dozens of functions that we have created to automate common tasks such as imputing missing values, computing year-over-year trends, performing common data aggregations, and repeating patterns that we use to analyze our experiments. Adding a new function to the package might take some time, but this initial investment pays off in the long run. By using the same R package we develop a common language, visualization style, and foundation of peer-reviewed code as our building blocks. | |
It does not matter how many tools you build if people do not know how to use them. After a period of rapid growth, we started organizing monthly week-long data bootcamps for new hires and current team members. They include 3-hour R workshops, and optional mentorship in a bootcamp project coded in R and written in R Markdown. | |
The bootcamp R class focuses on the Rbnb package and on common R packages used to reshape and manipulate data frames (tidyr and dplyr), visualize data (ggplot2), and write dynamic reports (R Markdown). We give participants study guides and materials a few days before our class. During class, we walk through a structured tutorial using our own data, including challenges that we commonly face on the job as working examples. | |
This approach allows users who are not familiar with R to start coding within a few hours, without having to worry about the intricacies of more advanced programming. We also introduce users to our internal style guide and to many useful R packages, such as formattable, diagrammeR, and broom. Finally, we give them directions on how to find help and online resources.
After the bootcamp, we encourage users to continue learning. We sponsor individual memberships to DataCamp and help team members organize study groups around self-paced and interactive online courses. We also pair new hires with experienced peers who serve as mentors. These mentors walk new team members through their first contributions as data scientists. We have an internal Slack channel in which users can pose any questions related to R, and organize regular office hours in which experienced developers can help with more complex coding challenges. Our team members organize learning lunches and classes on topics such as SparkR, R object systems, and package development. Most recently, four team members attended a Master R Developer Workshop organized by RStudio, and shared what they learned with the team afterwards. | |
Members of our Data Science team are also encouraged to contribute code to Rbnb. The process of going through a comprehensive code review allows users to develop new skills that are valuable to future projects. In addition, they feel ownership of an important internal tool and see how their contributions can benefit their peers’ work. We guide new contributors on best practices, function documentation, testing, and style. | |
We also engage with the broader R community outside Airbnb. We sponsor conferences like the upcoming rOpenSci Unconf, contribute to open source projects (e.g., ggtech, ggradar), and give talks at meetings such as the Shiny Developer Conference and UseR Conference. We have been fortunate to have influential R developers visit our headquarters in San Francisco last year, including Hadley Wickham and Ramnath Vaidyanathan. | |
In addition to tools and education, we also invest in strong data infrastructure. Our Shiny apps have had nearly 100k page views since our server was first started three years ago. We recently started supporting a new RStudio Server and SparkR cluster. We have a single Chef recipe with R packages and version control across all of the machines in our clusters, allowing for rapid updates and large-scale deployment. | |
Powerful R tools, continuous education, engagement with the R community, and strong data infrastructure have helped our Data Science team scale. Since we started this initiative nearly two years ago, we have watched team members who had never before opened R transform into strong R developers who now teach R to our new hires. The foundation we have built allows us to hire a wide range of data scientists, sharing a growth mindset and excitement to learn new skills. This approach has helped us build a diverse team that brings new insights and perspectives to our work. | |
The creation of the Rbnb R package has inspired our Python developers to release an internal Python package for data scientists, called Airpy. Our developers collaborate so that the packages have a similar interface and set of functions. We encourage team members to contribute code to both Rbnb and Airpy, and we work together to develop more effective education resources and tools to empower our team. Today, many members of our team are proficient in both R and Python, and are able to review and write reliable code in both languages. In a recent survey with 66 members of our team, we found that 80% of our data scientists and analysts rated themselves as closer to “Expert” than “Beginner” in using R for data analysis, even though only 64% of them use R as their primary data analysis language. Similarly, 47% of the team members rated themselves as closer to “Expert” than “Beginner” in using Python for data analysis, though only 31% use it as their primary data analysis tool. The remaining 5% said they used both languages around equally. We focus on building a balanced team with strong developers using both languages, and have no preference or bias for either in our hiring process. This is yet another way through which diversity of skills, experiences, and backgrounds, have helped increase the impact of our team. | |
Thanks to Jenny Bryan, Mine Cetinkaya-Rundel, Scott Chamberlain, Garrett Grolemund, Amelia McNamara, Hilary Parker, Karthik Ram, Hadley Wickham, and to the Airbnb Engineering and Data Science teams for comments on an earlier version of this post.","6" | |
"Mybridge","https://medium.com/mybridge-for-professionals/machine-learning-top-10-articles-for-the-past-month-35c37825a943","19","{""Machine Learning"",""Artificial Intelligence"",""Data Science"",Programming,Tech}","259","3.04339622641509","Machine Learning Top 10 Articles For The Past Month.","In this observation, we ranked nearly 1,400 articles posted in October 2016 about machine learning, deep learning and AI. (0.7% chance) | |
Mybridge AI ranks the best articles for professionals. Hopefully this condensed reading list will help you learn more productively in the area of Machine Learning.
Implementation of Reinforcement Learning Algorithms in Python, OpenAI Gym, Tensorflow. [1502 stars on Github] | |
Cross-validation and hyperparameter tuning — Model evaluation & selection, and algorithm in machine learning [Part 3]. Courtesy of Sebastian Raschka | |
Deep Learning is Revolutionary: 10 reasons why deep learning is living up to the hype. Courtesy of Oliver Cameron | |
A primer on machine learning for fraud detection. Courtesy of Stripe Engineering. | |
Deep Reinforcement Learning: Playing a Racing Game. Courtesy of Pedro Lopes | |
How Robots Can Acquire New Skills from Their Shared Experience. Courtesy of Google Developers | |
Adversarial Neural Cryptography in Theano. Courtesy of Liam Schoneveld | |
A Return to Machine Learning: Recent developments in machine learning research that intersect with art and culture. Courtesy of Kyle McDonald | |
Deconvolution and Checkerboard Artifacts. Courtesy of Google Brain Team | |
Differentiable neural computers: Memory-augmented neural network for answering complex questions. Courtesy of Deep Mind | |
Neural Enhance: Super Resolution for images using deep learning. | |
[5098 stars on Github] | |
Keras-js: Run trained Keras models in the browser, with GPU support | |
[1552 stars on Github] | |
Fast Style Transfer in TensorFlow | |
[1915 stars on Github] | |
Supervised Machine Learning in Python: Implementing Learning Algorithms From Scratch. | |
[615 recommends, 4.7/5 rating] | |
That’s it for Machine Learning Monthly Top 10. If you like this curation, read daily Top 10 articles based on your programming skills on our iOS app.","4" | |
"giorgialupi","https://medium.com/accurat-studio/learning-to-see-visual-inspirations-and-data-visualization-ce9107349a","14","{""Data Visualization"",Design,Art}","257","6.05","Learning to See: Visual Inspirations and Data Visualization","","4" | |
"davidventuri","https://medium.com/@davidventuri/i-dropped-out-of-school-to-create-my-own-data-science-master-s-here-s-my-curriculum-1b400dcee412","13","{Programming,""Data Science"",Education,""Machine Learning"",Learning}","254","4.86792452830189","I Dropped Out of School to Create My Own Data Science Master’s — Here’s My Curriculum","I dropped out of a top computer science program to teach myself data science using online resources like Udacity, edX, and Coursera. The decision was not difficult. I could learn the content I wanted to faster, more efficiently, and for a fraction of the cost. I already had a university degree and, perhaps more importantly, I already had the university experience. Paying $30K+ to go back to school seemed irresponsible. | |
Here are my curriculum choices and the rationale behind them. You can read my detailed reviews for most of these courses here on Medium or on my personal website — davidventuri.com. | |
Note that the curriculum covers both Python and R, which are the two most popular programming languages for data science. | |
I wanted a solid computer science foundation before I started learning data science. My engineering background gave me a head start on the math and stats. Completing these three courses means I will have completed a standard first-year computer science curriculum, plus the full mathematical and statistical core. | |
The following courses from my undergrad chemical engineering program are also core computer science courses: | |
Listed below are the individual courses contained within the Nanodegree. The estimated timeline for graduation is 378 hours. | |
First and foremost, it received stellar reviews. Second, I wanted a consistent learning experience for my introduction to the field. The Data Analyst Nanodegree offered a combination of breadth, depth, and cohesiveness that a combination of content from various providers would be hard pressed to provide. I am also a fan of their “less passive listening (no long lectures) and more active doing” approach to education. | |
Listed below are the individual courses contained within Johns Hopkins University’s “Mastering Software Development in R Specialization” on Coursera: | |
The role of software engineering in data science is covered in great detail here by Alec Smith (a data science recruiter) and here by Roger Peng (Johns Hopkins University professor and “Mastering Software Development in R Specialization” creator). A quote from the former: | |
And from the Mastering Software Development in R Specialization page: | |
This Quora page and this Udacity article suggest that back end development and data science can be a useful combination. These Udacity courses, which are the back end courses in their Full Stack Web Developer Nanodegree, along with Stanford’s top-ranked databases course, add an aspect of data engineering to the curriculum. | |
This section is fluid. Additional resources will be added as I progress through the curriculum. | |
Many thanks to Dhawal Shah of Class Central, as the ratings and reviews from his online course search engine (plus a few insider tips) helped guide the above curriculum choices. | |
If you have any recommendations for the curriculum, the above subject material in general, or would like to chat about your own educational goals, please don’t hesitate to contact me. | |
Originally published at davidventuri.com.","5" | |
"ltthompso","https://medium.com/the-coffeelicious/hillary-clinton-is-bad-at-venn-diagrams-9ce18a8b6923","8","{""Data Visualization"",Politics,""2016 Election""}","254","3.21635220125786","Hillary Clinton is Bad at Venn Diagrams","I hate pie charts. They’re the data visualization equivalent of opioids. Calming, numbing, dulling. They give the illusion of insight, but they’re just stopping up the receptors. They’re not helping you get better in and of themselves. Indeed, spin a pie chart on its axis and people will come to new, different conclusions. Sad! | |
What then of the Venn Diagram, noble second cousin of the disreputable pie chart? Venn Diagrams are, if nothing else, striking. They’re an intuitively effective, if not particularly precise way to show the interrelationships between different groups. So yeah, I’m broadly pro-Venn Diagram. | |
However, Venn Diagrams have limitations. A good data visualization should simplify and clarify. As a baseline, I always ask “does this visualization make the data I’m explaining easier to interpret than a simple matrix table?” If the answer is “no” then use the table. Everybody can read a table. | |
By way of example, have a look at FiveThirtyEight’s GOP primary Venn Diagram. It’s a difficult, complicated set of factors to explain, but I think we’d all agree that this suffers from crowding. | |
Contrast the above to this: | |
Revealing, isn’t it? Everybody’s part of the “Establishment” and Rand Paul is the only guy in the whole Libertarian fifth. Now, the 538 diagram is also supposed to convey relative proximity to the other circles within each circle. So while Senator Paul may have the Libertarian bubble to himself, Paul Ryan and Scott Walker are camped out pretty nearby. Fair enough: replace the X marks with % estimates. And if you don’t have numerical estimates underlying the dot locations, then what is this other than a really sloppy and confusing visualization? | |
Keep your diagrams simple, of course, and they can be very useful. A personal favorite: | |
Which brings us to the disaster of a chart Hillary Clinton tweeted out this afternoon. See below: | |
Wait, that’s not it. Here it is. | |
Observe the Clinton campaign doing violence to clarity, sense — nay to logic itself. | |
What’s even happening here? On the simplest level, she’s tweeting at Congress to pass universal background checks. But the image above seems to say that very few people support the position she’s advocating. Taken at face value, if we took 90% of Americans and 83% of presumably non-American gun owners — drawn at random I guess — roughly a third of each group would support universal background checks. | |
What about the other 10% of Americans, or the 17% of gun owners omitted from the chart? And why would we slavishly kowtow to the views of less than half of 90% of Americans just because a little more than a third of 83% of gun-toting foreigners agree with them?
Fortunately, we have The Washington Post to the rescue. The below chart mostly fixes Clinton’s debacle, though its color choice isn’t exactly fortuitous. Always be mindful that the hues you choose send messages to your readers: | |
That’s why it’s almost always best to keep things simple in terms of color and presentation: | |
If you like what you see, sign up for my email newsletter and follow me on Twitter. As ever and always, if you’d like to send me angry emails about this essay, you can do so here.","1" | |
"savolai","https://medium.com/interaction-reimagined/regular-expressions-you-can-read-a-new-visual-syntax-526c3cf45df1","3","{Regex,UX,Programming,""Data Visualization"",Usability}","253","4.15377358490566","Regular expressions you can read: | |
A new visual syntax (and UI)","Lots has been written about the problems with regular expressions: learning them, debugging them, etc. | |
I propose a more visual syntax and a keyboard-usable UI for generating regular expressions. | |
The UI/syntax proposed here helps address issues related to readability, learnability, and memorability. Those who readily understand regex will find that this visual syntax does not slow them down. It makes existing regexes easier to read for both novices and true regex superheroes. | |
(If you just want to use it already, go here to enter your email and we’ll let you know.) | |
You write regexes just like you always have — with optional ctrl+space popup menu command completion or insertion. Also, part of the UI concept is to be able to import existing regex expressions for editing, then export them in your chosen dialect. | |
This dialect-agnostic visual syntax seeks a balance between two ends of a continuum: | |
An example from regexper.com: | |
The real power of the visual syntax comes to life with the suggested UI. The UI will particularly help those who find the traditional syntax hard to remember. | |
You write a regex as you normally would. The UI will visualize the structure on the fly. When you find that you can’t remember a command, you can press ctrl+space to summon a search menu. This menu contains all regex commands and descriptions: you can either search by command (to confirm that you remember the command’s meaning correctly) or by description (to recall the command for a given task).
Regular expressions have a hard-to-memorize syntax. This is a particularly serious issue considering that most of us do not write regexes for a living.
For many users, regex is a tool that gets summoned, say, a couple of times a year. When we come back to it, previous learning has faded, and we might need hours just to get back up to speed with the syntax.
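For comparison, today’s plain-text regex engines already offer a small readability aid in verbose mode, though it does nothing for memorability; a quick Python illustration with an example pattern of my own:
import re
# Dense form: hard to read back six months later.
dense = re.compile(r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}$')
# The same pattern in verbose mode, with whitespace and comments allowed.
readable = re.compile(r'''
    ^ \d{4} - \d{2} - \d{2}   # date: YYYY-MM-DD
    T                         # literal separator
    \d{2} : \d{2} $           # time: HH:MM
''', re.VERBOSE)
assert bool(dense.match('2016-08-03T09:30')) == bool(readable.match('2016-08-03T09:30'))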
To solve this, we will augment the above visual syntax with a UI that enables learning. This means three things:
This is a concept design. The idea is that the visual syntax will generate traditional regexes. You could see it as a visual DSL that generates (only barely human readable) traditional regular expressions. Ideally, IDEs would have support for this visual syntax such that you could switch between traditional syntax and this visual one. | |
See also Part 2: Regex You Can Read: How It Works | |
Even this syntax can get unwieldy if the expression is complex enough. | |
Also, this does not solve all issues with regular expressions. Namely, it does not solve the core issue more intrinsically built into regexes: How do I make sure that my regex matches exactly those strings I want it to and none of the ones I don’t? There are debugging tools for regexes that allow you to find what you want by means of trial and error, but that’s a topic for another post. | |
To get early access to our crowdfunding campaign, and to know when you can try this in action, go here to enter your email. | |
Although Regex UCR will likely be open source, we will need your financial support to pay for coders doing the work. Hear things before others as they happen, and receive exclusive perks when our crowdfunding thing happens. (Added August 3, 2016) | |
Contact me to join our Slack channel if you want to work together on this and get write access to our github repository as well. We warmly welcome any help bridging the gap from design to code. | |
We’re still on the lookout for more people to join us. We would especially like more folks who | |
We already have a bunch of folks who have shown interest and plans are underway, but discussion is still just getting started. Open source and GPL. Now’s the time to step up! | |
BTW, thanks to Bret Victor. His work has provided inspiration for much of this.
Go here to enter your email and we’ll let you know when we have something you can try. | |
See also Part 2: Regex You Can Read: How It Works with more samples of syntax.","3" | |
"mattfogel","https://medium.com/startup-grind/the-10-best-ai-data-science-and-machine-learning-podcasts-d7495cfb127c","14","{""Machine Learning"",""Data Science"",Podcasts,Podcast,""Artificial Intelligence""}","252","4.05754716981132","The 10 Best AI, Data Science and Machine Learning Podcasts","It seems like AI, data science, machine learning and bots are some of the most discussed topics in tech today. Given my company Fuzzy.ai’s mission to make AI and machine learning more accessible to all developers and product managers, a lot of people ask me about how I keep on top of news in the field. | |
My preferred way to do this is always through listening to podcasts. Here are the ones I’ve found the most interesting: | |
A long-time favorite of mine and a great starting point on some of the basics of data science and machine learning. They alternate between great interviews with academics & practitioners and short 10–15 minute episodes where the hosts give a short primer on topics like calculating feature importance, k-means clustering, natural language processing and decision trees, often using analogies related to their pet parrot, Yoshi. This is the only place where you’ll learn about k-means clustering via placement of parrot droppings. | |
Website | iTunes | |
Hosted by Katie Malone and Ben Jaffe, this weekly podcast covers diverse topics in data science and machine learning: talking about specific concepts like model theft and the cold start problem and how they apply to real-world problems and datasets. They make complex topics accessible. | |
Website | iTunes | |
Well into its second season, in this podcast, hosts Katherine Gorman and Ryan Adams speak with a guest about their work, and news stories related to machine learning. A great listen. | |
Website | iTunes | |
This podcast features Ben Lorica, O’Reilly Media’s Chief Data Scientist speaking with other experts about timely big data and data science topics. It can often get quite technical, but the topics of discussion are always really interesting. | |
Website | iTunes | |
The second O’Reilly entry on this list is one of the newer podcasts on the block. Hosted by Jon Bruner (and sometimes Pete Skomoroch), it focuses specifically on bots and messaging. This is one of the newer and hotter areas in the space, so it’s definitely worth a listen! | |
Website | iTunes | |
Concerning AI offers a different take on artificial intelligence than the other podcasts on this list. Brandon Sanders & Ted Sarvata take a more philosophical look at what AI means for society today and in the future. Exploring the possibilities of artificial super-intelligence can get a little scary at times, but it’s always thought-provoking. | |
Website | iTunes | |
Another relatively new podcast, This Week in Machine Learning & AI releases a new episode every other week. Each episode features an interview with a ML/AI expert on a variety of topics. Recent episodes include discussing teaching machines empathy, generating training data, and productizing AI. | |
Website | iTunes | |
Data Stories is a little more focused on data visualization than data science, but there is often some interesting overlap between the topics. Every other week, Enrico Bertini and Moritz Stefaner cover diverse topics in data with their guests. Recent episodes about data ethics and looking at data from space are particularly interesting. | |
Website | iTunes | |
Billing itself as “A Gentle Introduction to Artificial Intelligence and Machine Learning”, this podcast can still get quite technical and complex, covering topics like: “How to Catch Spammers using Spectral Clustering” and “How to Enhance Learning Machines with Swarm Intelligence”. | |
Website | iTunes | |
Hosts Chris Albon, Jonathon Morgan and Vidya Spandana, all experienced technologists and data scientists, talk about the latest news in data science over drinks. Listening to Partially Derivative is a great way to keep up on the latest data news.
Website | iTunes | |
Feel I’ve unfairly left a podcast off this list? Leave me a note to let me know.","1" | |
"chazhutton","https://medium.com/swlh/another-series-of-observations-via-post-it-notes-76d114dc956b","9","{""Big Data"",Humor,Millennials}","247","1.4188679245283","Another series of observations explained via Post-it notes.","That’s it for now. For more graph based nonsense though, head over to InstaChaaz. — or follow me on twitter. | |
You can also find Part 1 here.","2"
"gilgul","https://medium.com/i-data/trumpwon-trend-vs-reality-16cec3badd60","10","{""Social Media"",Twitter,""Data Science"",""Donald Trump"",""Hillary Clinton""}","245","6.05754716981132","#TrumpWon? trend vs. reality","","4" | |
"eklimcz","https://medium.com/truth-labs/data-spaces-ad0a2bb073bd","10","{Beacons,IoT,Design,Technology,""Data Visualization""}","239","6.67641509433962","Data Spaces","","3" | |
"olivercameron","https://medium.com/udacity/open-sourcing-223gb-of-mountain-view-driving-data-f6b5593fbfa5","2","{""Self Driving Cars"",""Autonomous Vehicles"",""Open Source"",""Machine Learning"",""Big Data""}","239","1.0625786163522","Open Sourcing 223GB of Driving Data","A necessity in building an open source self-driving car is data. Lots and lots of data. We recently open sourced 40GB of driving data to assist the participants of the Udacity Self-Driving Car Challenge #2, but now we’re going much bigger with a 183GB release. This data is free for anyone to use, anywhere in the world. | |
223GB of image frames and log data from 70 minutes of driving in Mountain View on two separate days, with one day being sunny, and the other overcast. Here is a sample of the log included in the dataset. | |
Note: Along with an image frame from our cameras, we also include latitude, longitude, gear, brake, throttle, steering angles and speed. | |
To download both datasets, please head to our GitHub repo. | |
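Once the archive is downloaded, a first look at the log might go like the sketch below; the file and column names are guesses for illustration, so check them against the README in the repo.
import pandas as pd
# Assumed CSV log with one row per camera frame (names are illustrative).
log = pd.read_csv('driving_log.csv')
print(log[['latitude', 'longitude', 'speed', 'steering_angle']].describe())
# Rough sanity check: how much of the 70 minutes is spent above ~5 m/s?
moving = (log['speed'] > 5).mean()
print('share of frames in motion:', round(moving, 2))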
We can’t wait to see what you do with the data! Please share examples with us in our self-driving car Slack community, participate in Challenge #2, or send a Tweet to @olivercameron. Enjoy!","3" | |
"EvanSinar","https://medium.com/@EvanSinar/the-10-best-data-visualization-articles-of-2016-and-why-they-were-awesome-ce30618ea06a","5","{""Data Visualization"",Design,""Data Science"",Research,Analytics}","233","6.55786163522013","The 10 Best Data Visualization Articles of 2016 (and Why They Were Awesome)","","10" | |
"quincylarson","https://medium.com/free-code-camp/code-briefing-lessons-from-3-000-developer-job-interviews-711111dcaa64","2","{Programming,Tech,""Data Science"",""Life Lessons"",Design}","232","0.824842767295598","Code Briefing: Lessons from 3,000 developer job interviews","Here are three stories we published this week that are worth your time: | |
Bonus: So far more than 1,000 people have committed to the #100DaysOfCode challenge. If you want to rapidly improve your coding skills, you can take this challenge too. | |
Finally, here’s a cool infographic to recap 2016: | |
Happy coding, | |
Quincy Larson, teacher at Free Code Camp","4" | |
"quincylarson","https://medium.com/free-code-camp/scientists-can-now-store-data-with-individual-atoms-eeac7f71905f","1","{Technology,Education,Design,Programming,""Data Science""}","216","0.645283018867925","Scientists can now store data with individual atoms","Here are this week’s three links that are worth your time: | |
1. Scientists can now store data with individual atoms. This is 100 times denser than the most efficient hard drive ever created: 4 minute read | |
2. Haseeb learned to code and got a starting salary of more than $200,000 per year. Here are his negotiation tactics: 20 minute read | |
3. Here are the 50 best free online university courses, according to Class Central’s huge dataset: 11 minute read | |
Happy coding, | |
Quincy Larson, teacher at Free Code Camp | |
P.S. You can now comment on these links by clicking the “Respond” button below. You can also recommend this by clicking the “Recommend” button.","2" | |
"JeffGlueck","https://medium.com/foursquare-direct/foursquare-predicts-chipotle-s-q1-sales-down-nearly-30-foot-traffic-reveals-the-start-of-a-mixed-78515b2389af","4","{Chipotle,Finance,""Data Science""}","215","5.57169811320755","Foursquare Predicts Chipotle’s Q1 Sales Down Nearly 30%; Foot Traffic Reveals the Start of a Mixed…","Foursquare Predicts Chipotle’s Q1 Sales Down Nearly 30%; Foot Traffic Reveals the Start of a Mixed Recovery | |
When Chipotle came on the scene, the chain earned lots of fans for its approach to “food with integrity,” including antibiotic-free meats, GMO-free ingredients, and fresh local produce. However, the last six months have been a tumultuous ride. Since the first E. coli reports emerged in October 2015, reports popped up across the country raising skepticism about its products and processes, and Chipotle has been racing to squash the issues, institute better training and manage its reputation. The fast casual Mexican-themed chain is still dealing with the repercussions, and an even more recent norovirus outbreak in March at two stores. | |
In February, the CDC gave the chain a clean bill of health. To take a deeper look at how the downturn and recovery has gone, we analyzed the foot traffic patterns at the more than 1,900 Chipotle US locations and compared them to the previous year. At Foursquare, we have a trove of anonymous and aggregate data on where people go, based on the 50 million people who use our apps (Foursquare and Swarm) and websites monthly. Many users passively share their background location with us, which our technology can match up with our location database of over 85 million places, giving us clear insight into natural foot traffic patterns. (Here’s a video that shows how Foursquare maps a large Chipotle location in downtown Portland, Oregon.) | |
A Look Back | |
Foot traffic to Chipotle started to follow the same directionally downward seasonal winter traffic trend in 2015 as in 2014. But as time went on, it became clear that 2015 was no ordinary winter for Chipotle; traffic was down in a more significant way. | |
The chart below shows the share of visits to Chipotle restaurants in comparison to visits to ALL restaurants in the United States. In the 2015–2016 winter, visits to Chipotle restaurants declined more significantly than in 2014–2015. | |
Visit share began to recover in February 2016, marked by the CDC’s conclusion of its E. Coli investigation and Chipotle’s ‘raincheck’ promotion launch, ostensibly for customers who were unable to satisfy their burrito cravings during the company’s system-wide closure on February 8. Foot traffic took another dip, albeit much smaller, following the more minor norovirus outbreak in Boston in two locations in early March. | |
Sales Projections | |
Chipotle has publicly reported its weekly sales for the first 10 weeks of Q1, giving us ample data to build statistical models to project sales for the rest of the quarter. Taking into account reported sales, redeemed coupons and other factors, along with Foursquare foot traffic data, we estimate that Chipotle ended Q1 2016 with same store sales down roughly 30% year-over-year (which we expect to be confirmed by Chipotle when it reports earnings on April 26). Foot traffic estimates, however, tell a brighter story. Foursquare data shows that year-over-year, Q1 same store traffic declined only about 23%. The gap between sales and foot traffic is likely a result of all the free burrito coupons that were redeemed, which lured in people, though not revenue. | |
We believe the 23% decline in same store foot traffic is the more meaningful number that shareholders should focus on, rather than the 30% decline in sales. It shows that Chipotle is building trust back with customers, which is more important to its success long-term. Although it sacrifices revenue this quarter by giving product away, it is proving to be a winning strategy for getting people comfortable with coming back. The trick is in making sure that these customers come back again and spend money in the future. | |
Chipotle Needs to Focus on Loyalty | |
We looked at how frequently customers went to Chipotle over the past year and found some interesting insights. Last summer, just 20% of Chipotle customers made up about 50% of foot traffic visits. Because this cohort of loyal customers reliably returned to Chipotle month after month, they contributed to an outsized percentage of foot traffic, and likely sales. Interestingly, it’s this group of faithful customers that has changed their Chipotle eating habits most dramatically: these once-reliable visitors were actually 50% more likely to STAY AWAY in the fall during the outbreak, and they have been even harder to lure back in. While those who infrequently visited Chipotle last summer have returned to Chipotle at similar rates as before, the formerly loyal customers have been 25% less likely to return. The loss of these important customers is what has really hurt Chipotle, since losing 2–3 loyal customers is the equivalent of losing about 10 other customers (if 20% of customers account for roughly 50% of visits, each loyal customer visits about four times as often as an average other customer, so 2–3 of them represent about as many visits as 10 others).
To demonstrate that this is an unusual loss of loyalty, versus natural attrition, we compared this pattern with a cohort of frequent Panera goers. The chart below illustrates that while both chains experienced a similar seasonal dip, Chipotle has lost much more traffic from its loyalists than Panera has. | |
Chipotle has famously dismissed the idea of having a loyalty program, stating that it didn’t believe that loyalty programs help turn infrequent goers into loyal visitors. According to Chipotle CFO Jack Hartung, “The problem is that Chipotle’s customers are already so darn loyal.” Looks like it’s time to reconsider those famous last words. | |
So, where are they headed instead? Foursquare foot traffic data reveals that they have replaced their usual Chipotle visits with visits to other popular chains such as McDonald’s and Starbucks. They have also been slightly more likely than the average person to visit Whole Foods, whose offerings naturally overlap with Chipotle’s emphasis on integrity and healthfulness.
Looking to the Future | |
Two weeks from today, Chipotle will share its official Q1 earnings. We, alongside most analysts, anticipate the bitter pill the restaurant chain will have to swallow as it reports on losses. But we also see a more nuanced and slightly rosier picture, and urge Chipotle to continue building brand loyalty — one burrito at a time.
### | |
Foursquare’s Location Intelligence Paves the Path To Recovery | |
The data looks promising for recovery, but there will be trouble for Chipotle if they don’t lure back the formerly loyal visitors or nurture a new group of faithful fans. | |
Some ideas for how to do this effectively: | |
When you’re operating a brick-and-mortar business, location intelligence is critical. Chipotle’s burrito-based bottom line proves it. | |
Interested in any of the analysis or tools mentioned above? Do you need the power of location intelligence? Read more about our enterprise solutions or contact us. | |
### | |
Notes on Methodology","1" | |
"climatedesk","https://medium.com/climate-desk/a-gorgeous-animated-model-of-global-weather-patterns-3b144c1b342e","6","{Weather,""Climate Change"",""Data Visualization""}","212","1.88584905660377","A Gorgeous, Animated Model of Global Weather Patterns","","4" | |
"noahhl","https://medium.com/signal-v-noise/practical-skills-that-practical-data-scientists-need-da71e6b93f95","0","{""Big Data"",""Data Science"",Analytics}","212","3.06037735849057","Practical skills that practical data scientists need","","2" | |
"Mybridge","https://medium.com/mybridge-for-professionals/machine-learning-top-10-articles-for-the-past-month-b499e4213a34","18","{""Machine Learning"",""Data Science"",""Artificial Intelligence"",Programming,Tech}","209","3.04245283018868","Machine Learning Top 10 Articles For The Past Month.","In this observation, we ranked nearly 1,200 articles posted in November 2016 about machine learning, deep learning and AI. (0.8% chance) | |
Mybridge AI ranks the best articles for professionals. Hopefully this condensed reading list will help you learn more productively in the area of Machine Learning. | |
Keras Tutorial: The Ultimate Beginner’s Guide to Deep Learning in Python | |
Image-to-Image Translation with Conditional Adversarial Networks [1,274 stars on Github]. Courtesy of Ph.D. candidates at UC Berkeley
20 Weird & Wonderful Datasets for Machine Learning. Courtesy of Oliver Cameron, Lead Engineer of Self-driving Car at Udacity | |
How to Learn Machine Learning, The Self-Starter Way | |
The Next Frontier in AI: Unsupervised Learning. Courtesy of Yann LeCun, Director of AI Research at Facebook | |
Peeking into the neural network architecture used for Google’s Neural Machine Translation. Courtesy of Stephen Merity | |
Finding Beautiful Food Photos Using Deep Learning. Courtesy of Yelp Engineering | |
An Interactive Tutorial on Numerical Optimization. Courtesy of Ben Frederickson, Data Scientist at Flipboard | |
Learning to See: The complex landscape of machine learning through one example from computer vision. [Part 4] | |
………………………………………[Part 5] | |
Deep Dream in TensorFlow and Numpy: Learn Python for Data Science. | |
A.I. Experiments: Explore machine learning by playing with pictures, language, music, code — Google Research | |
Speech-to-Text-WaveNet : End-to-end sentence level English speech recognition. DeepMind’s WaveNet & Tensorflow | |
[1,158 stars on Github] | |
Learn Machine Learning with Python & Spark and become a data scientist in tech. | |
[15,908 recommends, 4.5/5 rating] | |
That’s it for Machine Learning Monthly Top 10. If you like this curation, read daily Top 10 articles based on your programming skills on our iOS app.","10" | |
"EvanSinar","https://medium.com/@EvanSinar/7-data-visualization-types-you-should-be-using-more-and-how-to-start-4015b5d4adf2","8","{""Data Visualization"",Analytics,""Data Science""}","208","7.6125786163522","7 Data Visualization Types You Should be Using More (and How to Start)","","2" | |
"dtunkelang","https://medium.com/@dtunkelang/getting-uncomfortable-with-data-7339e27adf6f","0","{""Data Science"",""Big Data"",Ethics}","207","1.30188679245283","Getting Uncomfortable with Data","Many data science speakers preach to the choir. They tell us that data science is the sexiest job of the 21st century, or that data is eating the world. Others try to offer practical advice — I generally put myself in this category. We provide tips and tricks across the stack — from building infrastructure to building teams. | |
But it’s a rare data scientist who challenges our core values and exhorts us to get uncomfortable with the fundamental tools of our trade. | |
A few days ago, Cloudera hosted Wrangle, a conference for and by data scientists. The talks were consistently excellent, filled with war stories from some of the industry’s top companies in the field. | |
But the talk that stood out was Clare Corthell’s talk on “AI Design for Humans”. Perhaps my only quibble is the title: I propose “Getting Uncomfortable with Data”. | |
She made several points that I hope every data scientist internalizes: | |
We data scientists create algorithms in our own images. It’s a god-like power, and we can’t let that get to our heads. With great power comes great responsibility. A responsibility to create the world we want to live in.","2" | |
"Mybridge","https://medium.com/mybridge-for-professionals/python-top-10-articles-v-november-8ce4540246a6","18","{Python,""Data Science"",Programming,""Software Development"",""Web Development""}","206","2.69528301886792","Python Top 10 Articles (v.November)","In this observation, we ranked nearly 1,500 articles posted in October-November 2016 about Python and Data Science. (0.67% chance) | |
Mybridge AI ranks the best articles for professionals. Hopefully this condensed reading list will help you learn more productively in the area of Python.
Data Mining in Python: A Guide | |
Python cheatsheet | |
Open-source home automation platform running on Python 3 [4710 stars on Github] | |
How to Build Your Own Self Driving Toy Car with Python, Raspberry Pi, OpenCV, and TensorFlow.
…….…….…….…….……..[Code on Github] | |
NumPy Tutorial: Data analysis with Python | |
The Comprehensive Introduction To Your Genome With the SciPy Stack. | |
Introduction — Learn Python for Data Science #1. | |
Image Processing with Numpy | |
Become a pdb power-user: Python Debugging. Courtesy of Ashwini Chaudhary | |
Web scraping and parsing with Beautiful Soup & Python Introduction p.1 | |
StackOverflow Importer: Import code from Stack Overflow as Python modules | |
[1147 stars on Github] | |
Clairvoyant: Software designed to identify and monitor social/historical cues for short term stock movement | |
[830 stars on Github] | |
The Python Bible: Build 11 Projects and Go from Beginner to Pro with Python Programming | |
[5778 recommends, 4.7/5 star] | |
That’s it for Python Monthly Top 10. If you like this curation, read daily Top 10 articles based on your programming skills on our iOS app.","7" | |
"akelleh","https://medium.com/free-code-camp/if-correlation-doesnt-imply-causation-then-what-does-c74f20d26438","3","{""Data Science"",Data,Statistics,Tech,Logic}","202","8.99150943396226","If Correlation Doesn’t Imply Causation, Then What Does?","We’ve all heard in school that “correlation does not imply causation,” but what does imply causation?! The gold standard for establishing cause and effect is a double-blind controlled trial (or the AB test equivalent). If you’re working with a system on which you can’t perform experiments, is all hope for scientific progress lost? Can we ever understand systems that we have limited or no control over? This would be a very bleak state of affairs, and fortunately there has been progress in answering these questions in the negative! | |
So what is causality good for? Anytime you decide to take an action, in a business context or otherwise, you’re making some assumptions about how the world operates. That is, you’re making assumptions about the causal effects of possible actions. Most of the time, we only consider very simple actions: I should eat, because the causal effect of “eating” on “hunger” is negative. I should show more impressions of this ad, because the causal effect of impressions on pageviews is positive. What about more complex interventions? What about the downstream effects? Maybe showing more impressions of the ad increases pageviews of it, but it draws traffic away from something more shareable, reducing traffic to the site as a whole. Is there a way to make all of the downstream causal effects obvious? Is there a way to make causal assumptions explicit, so their merits can be discussed, and our model of the world can improve? | |
The more I read and talked to people about the subject of causality, the more I realized the poor state of common knowledge on the subject. In discussing it with our group, we decided to work through Causality by Judea Pearl in our Math Club. There have been a lot of great questions and discussions coming out of those sessions, so I decided to finally start writing up some of the discussions here. I’ll do my best to give proper credit where it’s due (with permission), but I apologize to any participants I leave out! | |
For this first post, I’d like to explain what causality is all about, and talk a little about what “evidence” means. This falls a little outside of the standard pedagogy, but I think it’s a useful way of looking at things. You’ll see that it can give us a model of the world to discuss and build on over time, and we’ll take a step toward measuring the downstream effects of interventions! | |
What is Causality? | |
The term “causality” has a nice intuitive definition, but has eluded being well-defined for decades. Consider your commute to work. We have an intuitive understanding that traffic will cause you to be late for work. We also know that if your alarm doesn’t go off, it will cause you to be late to work. We can draw this as a picture, like figure 1. | |
This picture is a great start, but these are really just two of the most common causes of being late for work. Others could include your car not starting, forgetting to make the kids lunch, getting distracted by the news, etc. etc. How can our picture incorporate all of these little things that we don’t include? Can we ever hope to get a reasonable picture of the world when we can’t possibly measure all of the causes and effects? | |
The answer turns out to be relatively simple. Our model just needs to include the most common, large effects on our trip to work. If we omit a large number of small, independent effects, we can just treat them as “noise”. We stop talking about things as being completely determined by the causes we take into account. Instead, we talk about a cause as increasing the chances of its effect. You go from intuitions like “my alarm not going off causes me to be late” to intuitions like “my alarm not going off causes me to be much more likely to be late”. I think you’ll agree that the second statement reflects our understanding of reality much better. It takes care of the host of “what-if” questions that come up from all of the unlikely exceptions we haven’t taken into account. “What if I happened to wake up at the right time anyhow when my alarm didn’t go off?” or “What if I was tired enough that I overslept, even though my alarm did go off?”. These are all incorporated as noise. Of course we’re free to add any of these things into our picture as we like. It’s just that we may prefer not to. There’s an exception we’ll talk about briefly in a moment. First, we need one more idea. | |
We can build a much more comprehensive picture by chaining together causes and effects. What are the causes of traffic? What are the causes of the alarm not going off? If there’s a disaster, it could cause the power to go out, preventing the alarm from going off. It could also cause traffic. Our new picture should look something like figure 2. This picture says something very important. Notice that if a disaster happens, it’s more likely both that your alarm will fail to go off and that there will be traffic. This means that in a data set where you record, for each day, whether there is traffic and whether your alarm goes off, you’ll find a correlation between the two. We know there’s no causal effect of your alarm going off on whether or not there’s traffic (assuming you drive like a sane person when you’re late), or vice versa. This is the essence of “correlation does not imply causation”. When there is a common cause between two variables, then they will be correlated. This is part of the reasoning behind the less-known phrase, “There is no correlation without causation”. If neither A nor B causes the other, and the two are correlated, there must be some common cause of the two. It may not be a direct cause of each of them, but it’s there somewhere “upstream” in the picture. This implies something extremely powerful. You need to control for common causes if you’re trying to estimate a causal effect of A on B (read more about confounding). If there were a rigorous definition of “upstream common cause,” then there would be a nice way to choose what to control for. It turns out there is, and it’s rooted in these nice pictures (“causal graphs”). This can be the subject of a future post.
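To make that concrete, here is a small simulation of a toy version of figure 2. The variable names and probabilities are invented for illustration, and the disaster is made unrealistically common so the effect shows up in a modest sample: a common cause raises the chances of both an alarm failure and traffic, there is no arrow between the two, and yet they end up correlated. Conditioning on the common cause makes the correlation vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy probabilities, chosen only for illustration. A 'disaster' raises the
# chance of both an alarm failure and traffic; there is no direct link
# between alarm failures and traffic.
disaster = rng.random(n) < 0.05
alarm_fails = rng.random(n) < np.where(disaster, 0.60, 0.05)
traffic = rng.random(n) < np.where(disaster, 0.90, 0.30)

# The two downstream variables are correlated even though neither causes
# the other: the disaster sits upstream of both.
print(np.corrcoef(alarm_fails, traffic)[0, 1])

# Conditioning on the common cause removes the correlation.
calm = ~disaster
print(np.corrcoef(alarm_fails[calm], traffic[calm])[0, 1])
```

With a genuinely rare disaster (say, a probability of 0.001 instead of 0.05), the same correlation is still there but tiny, which is exactly why it takes a lot of data to notice the missing link, as we discuss below.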
It turns out that if you don’t include hidden common causes in your model, you’ll estimate causal effects incorrectly. This raises a question: can we possibly hope to include all of the hidden common causes? What other alternative is there, if this approach fails? | |
How Science Works | |
We’re asking questions at the core of the scientific method. I’ll try to avoid getting into the philosophy of science, and instead make some observations based on basic formulae. Suppose we think the world works according to our first picture, fig. 1. We want to test out this assumption, that fig. 1 is really how the world works. The absence of a cause-effect relationship between traffic and the alarm going off, in these pictures, means there is no causal relationship between the two, whatsoever (including unobserved common causes!). Disasters are rare. It’s conceivable that in the limited amount of time we’ve been observing traffic and our alarm going off, we’ve never seen one. | |
As time goes on, more and more disasters might happen, and Fig. 1 stands on shaky footing. We start to accumulate data points showing the disasters’ effect on traffic and the alarm clock. Maybe we live in the midwestern United States, where tornadoes are relatively common, or in California, where earthquakes are, so the data accumulates quickly. Over time, we’ll go from a statistically insignificant measurement of the correlation between traffic and the alarm clock to a statistically significant one. Voila! We have evidence that our model is incorrect. Fig. 1 shows no cause-effect relationship between alarm and traffic, but we’re observing that they’re correlated, and we know there is “no correlation without causation”. The disasters are the cause that is missing from our model.
What does the correct picture look like? We can use our background knowledge of how the world works to say there’s no causal link between the alarm and traffic, and so there must be some unobserved common cause between the two. Since we’re only observing the alarm clock, traffic, and lateness, we can only update our model with the knowledge that it’s wrong, since it neglects an unobserved common cause. How do we do this? As it happens, there’s a way to write this, with a double-ended arrow as in fig. 3. The double-ended arrow is a way to say “there is some unobserved common cause between alarm and traffic”. | |
Now we know we need to start looking at the causes of the alarm going off and of traffic. Hopefully, we’ll eventually narrow it down to fig. 2. | |
Notice that it took a lot of data to find a situation where we started noticing the missing link between the alarm and traffic. You can get an intuitive sense that maybe, even though some links and variables might be missing, we have something “close enough” to the whole picture. What are the odds that fig. 2 is wrong today? How bad is my estimate of the odds that I would have been late if my alarm had actually gone off, in general? | |
It turns out that the answer is, because disasters are rare, you won’t be too far off. That’s true even if there’s a 100% chance of being late given that a disaster happens. Over time, we can take even this small amount of error into account. | |
There is precedent for this type of advancement in physics. We generally don’t regard a set of laws as being the complete picture, but they’re “good enough” for the situation we’re considering. We know that Newton’s gravity theory is missing some important pieces, and that Einstein’s General Relativity is a more complete picture. Even so, Newtonian gravity was enough to get us to the moon. It’s not enough, however, to operate a GPS system. The reason we hadn’t had to use it before was that there were relatively few situations where we needed to model high energy systems acting over large distances and times. This is analogous to our example where the chances of a disaster are low. There were a few situations that stood out as anomalous, like the precession of Mercury, which suggested that the model as we understood it would break down. Over time, you find more anomalies, and improve your picture of the world.
We’ve answered part of the question. We see that we have to have larger and larger datasets to capture rare effects, and so our understanding of the world could improve over time. What remains is whether we’re guaranteed to notice all anomalies, and whether that matters. We’ve seen that the incomplete model can be useful. Soon we’ll talk about the prospects of noticing all anomalies that aren’t included in our model. Is it even possible, with an infinite dataset? Once we have the model, is it possible to estimate the odds that we’re wrong with a specific prediction? This might be the subject of a future post. For now, let’s talk a little more about how to use these types of models. | |
What Would Have Happened If …? | |
The question of “What would have happened if things were different?” is an essentially causal question. You’re asking what would be the effect if the world had operated differently than it was observed to, perhaps based on some policy intervention. The question “What would have happened if I had intervened in the system with some policy intervention?” is essentially the question “What is the causal effect of this policy intervention on the system?”. If you only observe how the system normally operates, you’ll generally get the wrong answer. For example, if you intervene to make sure your alarm never fails to go off (for example, by switching to a battery powered alarm clock), then you will underestimate the odds of being late to work. You’ll misattribute lateness due to traffic (which happens at the same time as your clock failing to go off!) as being due to your alarm clock, and so overestimate the effect of the alarm failing to go off. | |
This type of question is at the core of a lot of business and policy decisions. What would happen if our recommender system didn’t exist? What would happen if we made some changes to our supply chain? What would be the effect of a certain policy intervention on homelessness? All of these questions are extremely hard to answer experimentally, and can’t necessarily be answered from statistical data (i.e. large data sets). They could be relatively easier to answer if we have a good causal model of how the system operates, to go with our statistical data, and possibly supplemented by the few experiments within the system that we’re able to do. We’ll talk about how that’s done in some future posts! | |
[Edit:] This is the first post in a series! Check out the rest here!","1" | |
"ronnieftw","https://medium.com/several-people-are-coding/data-wrangling-at-slack-f2e0ff633b69","4","{""Big Data"",Analytics}","199","8.05094339622642","Data Wrangling at Slack","By Ronnie Chen and Diana Pojar | |
For a company like Slack that strives to be as data-driven as possible, understanding how our users use our product is essential. | |
The Data Engineering team at Slack works to provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better, data-informed decisions: “Based on a team’s activity within its first week, what is the probability that it will upgrade to a paid team?” or “What is the performance impact of the newest release of the desktop app?”
We knew when we started building this system that we would need flexibility in choosing the tools to process and analyze our data. Sometimes the questions being asked involve a small amount of data and we want a fast, interactive way to explore the results. Other times we are running large aggregations across longer time series and we need a system that can handle the sheer quantity of data and help distribute the computation across a cluster. Each of our tools would be optimized for a specific use case, and they all needed to work together as an integrated system. | |
We designed a system where all of our processing engines would have access to our data warehouse and be able to write back into it. Our plan seemed straightforward enough as long as we chose a shared data format, but as time went on we encountered more and more inconsistencies that challenged our assumptions. | |
Our central data warehouse is hosted on Amazon S3 where data could be queried via three primary tools: Hive, Presto and Spark. | |
To help us track all the metrics that we want, we collect data from our MySQL database, our servers, clients, and job queues and push them all to S3. We use an in-house tool called Sqooper to scrape our daily MySQL backups and export the tables to our data warehouse. All of our other data is sent to Kafka, a scalable, append-only message log and then persisted on to S3 using a tool called Secor. | |
For computation, we use Amazon’s Elastic MapReduce (EMR) service to create ephemeral clusters that are preconfigured with all three of the services that we use. | |
Presto is a distributed SQL query engine optimized for interactive queries. It’s a fast way to answer ad-hoc questions, validate data assumptions, explore smaller datasets, and create visualizations, and it powers some internal tools where we don’t need very low latency.
When dealing with larger datasets or longer time series data, we use Hive, because it implicitly converts SQL-like queries into MapReduce jobs. Hive can handle larger joins and is fault-tolerant to stage failures, and most of our jobs in our ETL pipelines are written this way. | |
Spark is a data processing framework that allows us to write batch and aggregation jobs that are more efficient and robust, since we can use a more expressive language, instead of SQL-like queries. Spark also allows us to cache data in memory to make computations more efficient. We write most of our Spark pipelines in Scala, including data deduplication and all of our core pipelines.
How do we ensure that all of these tools can safely interact with each other? | |
To bind all of these analytics engines together, we define our data using Thrift, which allows us to enforce a typed schema and have structured data. We store our files using Parquet which formats and stores the data in a columnar format. All three of our processing engines support Parquet and it provides many advantages around query and space efficiency. | |
Since we process data in multiple places, we need to make sure that our systems are always aware of the latest schema; thus we rely on the Hive Metastore to be our ground truth for our data and its schema.
Both Presto and Spark have Hive connectors that allow them to access the Hive Metastore to read tables and our Spark pipelines dynamically add partitions and modify the schema as our data evolves. | |
With a shared file format and a single source for table metadata, we should be able to pick any tool we want to read or write data from a common pool without any issues. In our dream, our data is well defined and structured and we can evolve our schemas as our data needs evolve. Unfortunately, our reality was a lot more nuanced than that. | |
All three processing engines that we use ship with libraries that enable them to read and write Parquet format. Managing the interoperation of all three engines using a shared file format may sound relatively straightforward, but not everything handles Parquet the same way, and these tiny differences can cause big trouble when trying to read your data.
Under the hood, Hive, Spark, and Presto are actually using different versions of the Parquet library and patching different subsets of bugs, which does not necessarily keep backwards compatibility. One of our biggest struggles with EMR was that it shipped with a custom version of Hive that was forked from an older version that was missing important bug fixes. | |
What this means in practice is that the data you write with one of the tools might not be read by other tools, or worse, you can write data which is read by another tool in the wrong way. Here are some sample issues that we encountered: | |
One of the biggest differences that we found between the different Parquet libraries was how each one handled the absence of data. | |
In Hive 0.13, when you use Parquet, a null value in a field will throw a NullPointerException. But supporting optional fields is not the only issue. The way that data gets loaded can turn a block of nulls — harmless by themselves — into an error if no non-null values are also present (PARQUET-136).
In Presto 0.147, it was complex structures that uncovered a different set of issues — we saw exceptions being thrown when the keys of a map or list are null. The issue was fixed in Hive, but not ported to the Presto dependency (HIVE-11625).
To protect against these issues, we sanitize our data before writing to the Parquet files so that we can safely perform lookups. | |
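As a toy illustration of that kind of pre-write sanitization (shown in PySpark for brevity, while Slack’s pipelines are in Scala; the table and column names are invented):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sanitize-sketch').getOrCreate()

# Invented example data with nulls in both a string and a numeric field.
events = spark.createDataFrame(
    [('e1', 'US', 200), ('e2', None, None)],
    ['event_id', 'country', 'status_code'],
)

# Replace nulls with explicit sentinel values before writing Parquet, so
# that stricter Parquet readers in other engines can still read the files
# and lookups behave predictably.
sanitized = events.fillna({'country': 'unknown', 'status_code': -1})
sanitized.write.mode('overwrite').parquet('/tmp/sanitized_events')
```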
Another major source of incompatibility is around schema and file format changes. The Parquet file format has a schema defined in each file based on the columns that are present. Each Hive table also has a schema and each partition in that table has its own schema. In order for data to be read correctly, all three schemas need to be in agreement. | |
This becomes an issue when we need to evolve custom data structures, because the old data files and partitions still have the original schema. Altering a data structure by adding or removing fields will cause old and new data partitions to have their columns appear at different offsets, resulting in an error being thrown. Doing a complete update will require re-serializing all of the old data files and updating all of the old partitions. To get around the time and computation costs of doing a complete rewrite for every schema update, we moved to a flattened data structure where new fields are appended to the end of the schema as individual columns.
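A minimal PySpark sketch of that flattened, append-only approach (not Slack’s actual code; paths and names are invented). Readers that merge Parquet schemas see the union of columns, and old partitions simply return null for the appended field, so no rewrite of old files is needed:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('schema-evolution-sketch').getOrCreate()

# Hypothetical old partition with the original flattened schema.
old = spark.createDataFrame(
    [('t1', 'login'), ('t2', 'message_sent')],
    ['event_id', 'event_type'],
)
old.write.mode('overwrite').parquet('/tmp/events/ds=2016-01-01')

# New partition: the new field is appended to the end of the schema as its
# own top-level column instead of being nested inside an evolving struct.
new = old.withColumn('client_version', F.lit('2.0.1'))
new.write.mode('overwrite').parquet('/tmp/events/ds=2016-01-02')

# Merging schemas at read time gives the union of columns across partitions.
merged = spark.read.option('mergeSchema', 'true').parquet('/tmp/events')
merged.printSchema()
```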
These errors that will kill a running job are not as dangerous as invisible failures like data showing up in incorrect columns. By default, Presto settings use column location to access data in Parquet files while Hive uses column names. This means that Hive supports the creation of tables where the Parquet file schema and the table schema columns are in different order, but Presto will read those tables with the data appearing in different columns! | |
It’s a simple enough problem to avoid or fix with a configuration change, but easily something that can slip through undetected if not checked for. | |
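For concreteness, the configuration change in question on the Presto side is a Hive connector property that switches Parquet column resolution from position to name. Whether it is available, and what its default is, depends on the Presto version you run, so treat this as illustrative rather than prescriptive:

```
# etc/catalog/hive.properties on the Presto coordinator and workers
hive.parquet.use-column-names=true
```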
Upgrading versions is an opportunity to fix all of the workarounds that were put in earlier. But it’s very important to do this thoughtfully. As we upgrade EMR versions to resolve bugs or to get performance improvements, we also risk exchanging one set of incompatibilities with another. When libraries get upgraded, it’s expected that the new libraries are compatible with the older versions, but changes in implementation will not always allow older versions to read the upgraded versions. | |
When upgrading our cluster, we must always make sure that the Parquet libraries being used by the analytics engines we are using are compatible with each other and with every running version of those engines on our cluster. A recent test cluster to try out a newer version of Spark resulted in some data types being unreadable by Presto. | |
This leads to us being locked into certain versions until we implement workarounds for all of the compatibility issues and that makes cluster upgrades a very scary proposition. Even worse, when upgrades render our old workarounds unnecessary, we still have a difficult decision to make. For every workaround we remove, we have to decide if it’s more effective to backfill our data to remove the hack or perpetuate it to maintain backwards compatibility. How can we make that process easier? | |
To solve some of these issues and to enable us to safely perform upgrades, we wrote our own Hive InputFormat and Parquet OutputFormat to pin our encoding and decoding of files to a specific version. By bringing control of our serialization and deserialization in house, we can safely use out-of-the-box clusters to run our tooling without worrying about being unable to read our own data. | |
These formats are essentially forks of the official version which bring in the bug fixes across various builds. | |
Because the various analytics engines we use have subtly different requirements about serialization and deserialization of values, the data that we write has to fit all of those requirements in order for us to read and process it. To preserve the ability to use all of those tools, we ended up limiting ourselves and building only for the shared subset of features.
Shifting control of these libraries into a package that we own and maintain allows us to eliminate many of the read/write errors, but it’s still important to make sure that we consider all of the common and uncommon ways that our files and schemas can evolve over time. Most of our biggest challenges on the data engineering team were not centered around writing code, but around understanding the discrepancies between the systems that we use. As you can see, those seemingly small differences can cause big headaches when it comes to interoperability. Our job on the data team is to build a deeper understanding of how our tools interact with each other, so we can better predict how to build for, test, and evolve our data pipelines. | |
If you want to help us make Slack a little bit better every day, please check out our job openings page and apply.","4" | |
"ilblackdragon","https://medium.com/machine-intelligence-report/tensorflow-tutorial-part-1-c559c63c0cb1","0","{""Data Science"",""Machine Learning"",TensorFlow}","189","2.79245283018868","TensorFlow Tutorial— Part 1","UPD (April 20, 2016): Scikit Flow has been merged into TensorFlow since version 0.8 and now called TensorFlow Learn or tf.learn. | |
Google released a machine learning framework called TensorFlow and it’s taking the world by storm. 10k+ stars on Github, a lot of publicity, and general excitement among AI researchers.
But how do you use it for the kind of everyday problem a Data Scientist may have? (And if you are an AI researcher — we will build up to interesting problems over time.)
A reasonable question: why, as a Data Scientist who already has a number of tools in your toolbox (R, Scikit Learn, etc.), should you care about yet another framework?
The answer is two part: | |
Let’s start with a simple example — take the Titanic dataset from Kaggle.
First, make sure you have installed TensorFlow and Scikit Learn, along with a few helpful libs, including Scikit Flow, which simplifies a lot of the work with TensorFlow:
You can get dataset and the code from http://github.com/ilblackdragon/tf_examples | |
Quick look at the data (use iPython or iPython notebook for ease of interactive exploration): | |
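The original post showed this step as an embedded screenshot; as a stand-in, here is a sketch of that first look, assuming the Kaggle Titanic training file has been saved locally as data/train.csv (the path is an assumption):

```python
import pandas as pd

# Path is an assumption; adjust to wherever you saved the Kaggle Titanic data.
data = pd.read_csv('data/train.csv')

print(data.shape)       # number of rows and columns
print(data.dtypes)      # which columns are numeric vs. object/string
print(data.head())      # first few passengers
```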
Let’s test how we can predict Survived class, based on float variables in Scikit Learn: | |
We separate the dataset into features and target, fill in N/As in the data with zeros, and build a logistic regression. Predicting on the training data gives us some measure of accuracy (of course it doesn’t properly evaluate the model quality and a test dataset should be used, but for simplicity we will look at train only for now).
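The code in the original post was embedded as images; the snippet below is a reconstruction of the steps just described, continuing from the data frame loaded above (the choice of float columns is an assumption based on the Titanic dataset):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Numeric features only, with missing values filled with zeros.
X = data[['Age', 'SibSp', 'Fare']].fillna(0)
y = data['Survived']

sk_classifier = LogisticRegression()
sk_classifier.fit(X, y)

# Accuracy on the training data: a rough sanity check, not a proper evaluation.
print(accuracy_score(y, sk_classifier.predict(X)))
```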
Now using tf.learn (previously Scikit Flow): | |
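Again reconstructed rather than copied from the post, continuing from the previous snippet, and written against the skflow API of early 2016 (the class name and arguments are my best recollection of that era and may differ in the exact version you have installed):

```python
import skflow  # the Scikit Flow package that later became tf.learn
from sklearn.metrics import accuracy_score

# A linear classifier backed by a TensorFlow graph, exposed through a
# scikit-learn style fit/predict interface. n_classes=2 for Survived or not.
tf_classifier = skflow.TensorFlowLinearClassifier(n_classes=2)
tf_classifier.fit(X, y)
print(accuracy_score(y, tf_classifier.predict(X)))
```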
Congratulations, you just built your first TensorFlow model! | |
TF.Learn is a library that wraps a lot of TensorFlow’s new APIs in a nice, familiar Scikit Learn API.
TensorFlow is all about building and executing a computation graph. This is a very powerful concept, but it is also cumbersome to start with.
Looking under the hood of TF.Learn, we just used three parts: | |
Even as you get more familiar with TensorFlow, pieces of Scikit Flow will be useful (like graph_actions and layers and a host of other ops and tools). See future posts for examples of handling categorical variables, text and images.
Part 2 — Deep Neural Networks, Custom TensorFlow models with Scikit Flow and Digit recognition with Convolutional Networks.","1" | |
"alinelernerllc","https://medium.com/free-code-camp/people-cant-gauge-their-own-interview-performance-and-that-makes-them-harder-to-hire-96cd51601437","4","{Recruiting,Programming,Tech,Startup,""Data Science""}","185","4.80566037735849","People can’t gauge their own interview performance. And that makes them harder to hire.","My experience as a software engineer — and as someone who recruits software engineers — has convinced me of two things: | |
These insights led me to co-found an anonymous technical interviewing platform called interviewing.io.
On interviewing.io, questions tend to fall into the category of what you’d encounter during a phone screen for a back end software engineering role at a top company. And interviewers typically come from a mix of larger companies like Google, Facebook, and Twitter, as well as engineering-focused startups like Asana, Mattermark and KeepSafe. | |
One of the best parts of running an interviewing platform is access to a ton of interview data. We track everything that happens during an interview, including what people say, the code they write, stuff they draw, and so on. And we also track post-interview feedback. | |
In this post, I’ll tackle something surprising that we learned from our data — people simply can’t gauge their own interview performance, and that actually makes it harder to hire them. | |
When an interviewer and an interviewee match on our platform, they meet in a collaborative coding environment with voice, text chat, and a whiteboard, and jump right into a technical question. | |
After each interview, people leave one another feedback, and each party can see what the other person said about them once they’ve both submitted their reviews. | |
If both people find each other competent and pleasant, they have the option to unmask. | |
Overall, interviewees tend to do quite well on the platform, with just under half of interviews resulting in a “yes” from the interviewer. | |
Here’s what one of our feedback forms looks like:
In addition to one direct yes/no question, we ask about a few different aspects of interview performance using a 1–4 scale. | |
We also ask interviewees some extra questions that we don’t share with their interviewers. And one of those questions is about how well they think they did. | |
In this post, we’ll focus on the technical score an interviewer gives an interviewee (circled above), and the interviewee’s self-assessment. | |
For context, a technical score of 3 or above seems to be the rough cut-off for hirability. | |
Here’s the distribution of people’s actual technical performance (as rated by their interviewers) and the distribution of their perceived performance (how they rated themselves) for the same set of interviews. | |
Right away, you can see that there’s a disparity, but things really get interesting when you plot perceived vs. actual performance for each interview. | |
Here’s a heat map of the data where the darker areas represent higher interview concentration: | |
For instance, the darkest square represents interviews where both perceived and actual performance was rated as a 3. | |
If you run a linear regression on the data, you get an R-squared of only 0.24. And once you take away the worst interviews, it drops down even further to a 0.16. (Note that we tried fitting a number of different curves to the data, but they all sucked.) | |
For context, R-squared is a measurement of how well you can fit empirical data to some mathematical model. It’s on a scale from 0 to 1, with 0 meaning that everything is noise, and 1 meaning that everything fits perfectly. | |
In other words, even though some small positive relationship exists between actual and perceived performance, it’s not a strong, predictable correspondence. | |
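For readers who want to see the mechanics behind that number, here is a tiny sketch of the same kind of fit on fabricated stand-in scores (not data from interviewing.io); self-assessments that only loosely track the actual ratings produce a similarly weak fit:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Fabricated stand-in data: 254 interviews, actual scores on a 1-4 scale,
# and self-assessments that track them only weakly.
actual = rng.integers(1, 5, size=254).astype(float)
perceived = np.clip(np.round(actual + rng.normal(0, 1.8, size=254)), 1, 4)

fit = stats.linregress(actual, perceived)
print(round(fit.rvalue ** 2, 2))  # R-squared of the linear fit
```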
You can also see there’s a non-trivial amount of impostor syndrome going on here, which probably comes as no surprise to anyone who’s worked as an engineer. | |
Gayle Laakmann McDowell of Cracking the Coding Interview fame has written quite a bit about how bad people are at gauging their own interview performance. It’s something that I had noticed anecdotally when I was doing recruiting, so it was nice to see some empirical data on that front. | |
In her writing, Gayle mentions that it’s the job of a good interviewer to make you feel like you did OK, even if you bombed. | |
I was curious about whether that’s what was going on here, but when I ran the numbers, there wasn’t any relationship between how highly an interviewer was rated overall and how off their interviewees’ self-assessments were, in one direction or the other. | |
Ultimately, this isn’t a big dataset. There are only 254 interviews represented here, because not all interviews in our data set had comprehensive, mutual feedback. | |
Moreover, I realize that raw scores don’t tell the whole story, and I’ll focus on standardization of these scores and the resulting rat’s nest in my next post. | |
That said, though interviewer strictness does vary, interviewing.io gates interviewers pretty heavily based on their background and experience, so the overall bar is high and comparable to what you’d find at a good company in the wild.
But we did find that this relationship emerged very early on, and has persisted with more and more interviews. To date, R-squared has never exceeded 0.26. | |
We’ll continue to monitor the relationship between perceived and actual performance as we host more interviews. | |
Now here’s the actionable and kind of messed up part. Remember those feedback forms I showed you where we ask interviewees whether they’d want to work with their interviewer? As it turns out, there’s a very statistically significant relationship (p value) between whether people think they did well and whether they’d want to work with the interviewer. | |
This means that when people think they did poorly, they may be a lot less likely to want to work with you — 3 times less likely, according to our data. | |
And by extension, it means that in every interview cycle, some portion of interviewees lose interest in joining your company just because they didn’t think they did well, despite the fact that they actually did. | |
How can you mitigate these losses? Give positive, actionable feedback immediately (or as soon as possible)! | |
This way, interviewees don’t have time to put themselves through the self-flagellation gauntlet that follows a perceived poor performance, followed by the inevitable rationalization that they totally didn’t want to work there anyway. | |
Want to see how well you can gauge your own interview performance… and land your next job in the process? Join interviewing.io.","1" | |
"PaulStollery","https://medium.com/@PaulStollery/all-of-the-fucks-given-online-in-2016-58c60edd6e44","6","{Politics,""2016 Election"",Data,""Data Visualization"",""Donald Trump""}","183","3.41037735849057","All of the fucks given in 2016","The field of fucks was far from barren in 2016. In fact, just shy of a billion fucks were given across social media and the news this year.
It was, for lack of a better phrase, a very shitty year. For a start, most of the awesome people died. Things continued to go from bad to worse in Syria. Zika spread across the Americas. And Britain did the stupidest thing a country has done in a long, long time only for America to go ahead and top it. | |
In fact, 2016 was so bad that when you type ‘cel’ into Google, the first thing it suggests as an autofill is this: | |
The collective response to most of these things was: fuck. At the time of writing*, 946,158,697 fucks had been given (shared/commented/published in a blog or news article) in 2016. But what warranted the most fucks? | |
It won’t surprise you to know that the biggest day for fucks given was November 9, triggered by the election of Donald Trump. What might surprise you though, is just how many fucks were shared. | |
An average of 2,613,698 fucks were given per day in 2016. Most days — even the really shitty ones — saw between 2 and 4 million fucks. On November 9, there were 7,638,384 — nearly three times the average.
What makes this all the more remarkable is that the second highest day only saw 3,518,781 mentions of the word fuck. Just look at the size of the spike: | |
Terms used | |
I typically despise word clouds, as they’re so often used for no other reason than to look pretty in a crappy Powerpoint presentation. | |
However the word ‘fuck’ can be used in a number of ways. It can be used to express joy, ‘fuck yeah’, sadness, ‘oh fuck’, anger, ‘fuck off’ or it can be used simply as a verb — no I’m not providing an example of this. In this case, a word cloud gives you a pretty good sense of people’s motivation for using the word. | |
As you can see, WTF was the driving sentiment behind the use of the word. | |
Associated people | |
Despite the number of fucks given on November 9, Trump wasn’t the person most commonly associated with the word ‘fuck’. As the below word cloud shows, ‘local hotties’ beat Trump to the number one spot. No points for guessing what’s going on there. | |
Aside from the infamous ‘Local Hotties’, people associated with the word ‘fuck’ are largely politicians, Jesus and a chap named Andrew. | |
Sweariest sex | |
Women were significantly more sweary on social, with 58% of posts mentioning fuck coming from a woman. This is possibly due to the whole vagina grabbing leader of the free world thing. | |
‘Index’ in the above graph is a metric which is corrected for the overall number of men and women in the data pool (i.e. even when corrected for an uneven number of men and women, women remain the sweariest sex). | |
Notes on the data: | |
I tracked all mentions of ‘fuck’ on Twitter, Facebook, Instagram, Youtube, blogs, comments, forums and in the news using Netbase, a very powerful tool for monitoring social media.
It’s worth noting here that different channels offer differing amounts of accuracy and reliability: Twitter and Tumblr are both really reliable as they’re robust as well as open, whereas data from channels like Facebook is incomplete due to the fact that most posts are private. Instagram’s data was sketchy and I’m always skeptical of news tracking on Netbase as it’s primarily designed for social media tracking. | |
I also noticed some issues with other channels — such as zero values on certain days on Instagram, which looking at other days seems very unlikely. | |
Still, when viewed as a whole, it paints an accurate picture of the year. | |
To download the source data, click here.","7" | |
"swardley","https://medium.com/hacker-daily/why-big-data-and-algorithms-wont-improve-business-strategy-54e4ebe2398","10","{""Artificial Intelligence"",Strategy,Business,""Big Data""}","183","10.1971698113208","Why big data and algorithms won’t improve business strategy","","1" | |
"SimonParkin","https://medium.com/how-we-get-to-next/the-gut-and-the-spreadsheet-how-fashion-forecasting-really-works-7ec5d97acd6d","11","{Fashion,Sartorial,Trends,""Big Data"",Forecasting}","178","10.111320754717","The Gut and the Spreadsheet: How Fashion Forecasting *Really* Works","For a few short years in the mid-1980s, when the pockets of New York’s broad-suited investors swilled with Reagan-era money, one item of clothing defined the luxurious moment above all others: the pouf dress. | |
Designed by Christian Lacroix, a couturier from Arles in the South of France, “le pouf,” as he referred to it, presented an explosion of material betwixt the waistband and the knee. This ball of fabric preposterously accentuated its wearer’s hips, giving a woman the silhouette of a tottering doll. The pouf was the must-have item in nouveau riche couture, filling ballrooms with what, at a glance, looked like a sea of brightly colored balloons. Then, in an instant, the pouf deflated. | |
Following the 1987 stock market crash, what was once seen as a shibboleth of status became a vulgar display of wealth. “The skirt’s flamboyant silhouette was suddenly deemed to be frivolous,” explained David Wolfe, creative director of The Doneger Group, the New York-based company that advises the fashion and retail industries on the ebb and flow of trends. “It crashed and burned overnight,” he said, and the skirt was immediately replaced by more austere cuts such as the pencil line.
Wolfe, who is balding with thin-rimmed, tortoiseshell glasses and a tidy gray beard, was one of fashion’s very first trend forecasters. Now in his mid-70s, it has long been his job to predict the next big thing; The New York Times has described him as “fashion’s most authoritative spokesperson.” | |
In what is often seen as an elite business, guided by the caprice of figureheads like Anna Wintour and big-name couture designers, for Wolfe the fate of the pouf skirt reveals an often-overlooked truth about the rip tides of what’s en vogue. The popular perception is that any given season, it’s the high fashion mavens that decide the newest styles — their clothing lifted from the runways, redesigned for the masses in more conventional cuts, and made popular by marketing campaigns played out on widescreen billboards and the miniature windows of trendsetting Instagram girls. | |
In fact, said Wolfe, trends are just as equally driven by a complex mesh of political, economical, social, and technological factors — everything from the state of war to the state of the economy. “Fashion is a response to how consumers react psychologically to every aspect, nuance, and factor of life,” he explained. “Everything that plays a part in creating the world we live in is reflected somehow in fashion trends.” | |
Understanding the science of fashion has, over the past 40 years, become big business. The scale of bets involved in designing, producing, and selling a new line of clothing has created a market for seers, people who are able to advise big business on what designs will prove popular in the near future, based on evidence rather than whim. David Wolfe was one of those first oracles. | |
After working as an illustrator at London department store Fortnum & Mason in the 1960s, Wolfe used his talent for observation and artistic record to fill a void—founding one of the world’s first fashion forecasting services, Imaginative Minds International. His objective at the time was to feed privileged information about the latest collections shown on cloistered Paris runways to English and American designers. | |
Wolfe’s methods in those early days were rudimentary but determined; he would disguise himself to gain entry into French fashion shows and then sit in the back row, incognito, while making surreptitious sketches of the dresses drifting down the catwalks. He was, by then, already recognized as one of London’s leading fashion artists, his sketches published in Vogue, Harper’s Bazaar, Women’s Wear Daily, and The London Times. Wolfe’s insider work attracted a raft of high-profile clients, including Calvin Klein, Ralph Lauren, and Versace. Through his insight, designers around the world were able to catch emerging trends and turn his pronouncements into realities as, en masse, they followed his advice. | |
Wolfe’s instincts proved correct — especially in terms of recognizing a market for trend forecasters. After moving to New York in the early 1980s to set up the first dedicated fashion forecaster in the United States, TFS The Fashion Service, Wolfe joined The Doneger Group in 1990. In earlier years, consultants like Wolfe could take a punt on what they thought was going to become the next big thing in fashion and, if they were fortunate, see their predictions turn into reality. Those with a knack for soothsaying (or whose predictions were later backed by multimillion-dollar marketing campaigns to become self-fulfilling prophecies) reaped the rewards. Those whose predictions proved to be off-trend, meanwhile, were considered snake-oil salesmen. Luck, as well as observational aptitude, surely played its part. | |
Today, advice must be backed by quantifiable facts and information. As technology has increased the amount of data that is gathered and quickened the pace at which it disseminates, so have Wolfe’s methods of gaining insight into future trends been forced to evolve and adapt. “It’s much more business-driven today simply because it is now possible to access in-depth facts and figures,” he said. “Today, future seasons are forecasted by looking at the success and failure rates of existing merchandise and combining this with comprehensive sets of information about movements in culture, politics, economics, social media, technology, and entertainment.” | |
This is serious work—no more does the solitary forecaster arrive at a fashion retailer like Moses coming down from the mountain, clutching prophetic wisdom. Trend-forecasting teams are multi-role affairs, often consisting of a host of directors (a creative director, a color director, a fabric director, an art director) and a copywriter, each of whom are further supported by teams of assistants, researchers, and beleaguered interns. Together they look for trends on the runway and the street, while simultaneously monitoring the slow tides of societal change, and how they might have an effect on shifting design patterns. | |
“Trends are a reflection of the society that responds and buys into them, so it is vital to understand and project-forward the current zeitgeist,” Wolfe said. In one example, he believes that recent shifts in wealth between the young and old are reflected in high street fashion. Millennials are the first generation to be worse off than their parents and grandparents. As the wealth base shifts, and as people are living longer and staying healthier, baby boomers are, Wolfe said, regaining economic control of fashion. This shift has influenced color and style, favoring designs that are ageless, rather than tailored to any particular age group or demographic. | |
For the younger designers and Wolfe’s forecasting descendants, however, there’s no substitute for getting feet on the ground. Many believe the next hot item has less to do with current affairs than with the minutiae of what’s being worn in key cities around the world. Rachel Mcculloch, a senior designer for Topshop Boutique who has designed clothes for Gap, Topshop, and Topman, is part of a group that tours key cities as on-the-ground spotters.
“Our team takes a very collaborative approach to trend direction,” she said. “We undertake enormous amounts of research, shopping all over the world in areas where fashion and style have exciting ideas such as Seoul, Tokyo, Los Angeles, and Stockholm.” Mcculloch also does her virtual homework, sifting through fashion blogs and Instagram accounts, while analyzing runway seasons for common themes and, as she puts it, “outstanding new concepts.” | |
“The real task is being able to assimilate all the information from all the different sources and ascertain common emerging threads and ideas,” she said. “We come together as a team and discuss the big ideas that we are all seeing, and decide on which ones to take forward. The process can be quite methodical and scientific but is mostly decided on gut feeling. We decide upon colors, cut, mood, and detail in this way, and then our management team creates a comprehensive and clear guideline from this for all team members to adhere to.” For Mcculloch, data plays a role, but ultimately “gut instinct and intuition are the most important thing.”
Relying on intuition may be the most attractive approach for artists and designers, but for shareholders and managing directors, these nebulous qualities are cause for concern — the area where art and business have forever rubbed up against one another. In 2016, fact-based forecasting is the crucial way to position oneself as a trend forecaster. Doneger has, Wolfe said, access to a “wealth of in-depth information that allows trend analysts to make projections based on facts, rather than ‘gut instinct.’” | |
In this way, forecasting is shifting toward something you might call “nowcasting” — the ability to recognize, through vast collections of data, which emerging fashions might, with a little encouragement, become full-blown trends. Geoff Watts and Julia Fowler co-founded EDITED in London in 2009 in response to Fowler’s frustration that too often decision-making in fashion design was based on feeling rather than metrics. At the time her partner, Watts, whom Fowler met while racing cars in their native Perth, worked for a dairy company in an industry where every decision was based on evidence, not instinct. | |
“We’re helping the industry move out of the old ‘trend forecasting’ model, which you might at best call educated guesswork, by actually putting the market’s real product, pricing, and promotional data at its disposal in real time,” Fowler said. “You could say EDITED removes the uncertainty of ‘forecasting’ and replaces it with facts.” | |
EDITED’s clients, which include Topshop, eBay, and Ralph Lauren, have certainly benefited, in particular from the company’s advice on how to price individual items competitively at key moments in each season. In 2014, for example, ASOS—one of the company’s first customers—attributed a 37-percent revenue increase to EDITED’s data insights. | |
Those insights are drawn from a close examination of its clients’ competitors’ products. The EDITED team then claims to track key data such as price, discounting history, and the timings of seasonal pushes. “This helps them use data to understand the market and have the right product at the right price, at the right time. It’s less about looking into a crystal ball, making a guess, and hoping for the best, which until now has pretty much been the status quo,” Fowler said.
This work isn’t so much trendsetting as trend-monitoring. Using the company’s services, fashion buyers and merchandisers can size up the commercial state of trends in the market. If they see, for example, that the products associated with a trend are experiencing low rates of sell-through and replenishment paired with high rates of discounting, they know that a trend is on the downturn. That information can be immediately fed down to their clothing designers, potentially saving millions of dollars on a poor or tardy gamble. | |
Fowler argues that the tools are useful for spotting emerging trends, too. “We can see where items are selling out quickly at full price with low market saturation.” EDITED also provides “crucial pricing and promotional data,” to enable its clients to know when to push a certain item of clothing, and when to discount it at the optimum moment. Fowler describes the company as owning “the world’s biggest apparel data-warehouse.” Buy its services and you gain access to those stores of information, enabling you to track new product launches, price changes, and discounts, all the way down to the content and timing of a rival’s promotional emails. | |
“We can contextualize all that data and make it available to our clients in real time so that they can use it in their market analysis and make confident pricing, product, and timing decisions based on the true state of the market,” said Fowler. “Reliable data can even provide enough support to justify forgoing soft launches or long, drawn-out testing periods — meaning you can get the right product to the right consumer even faster.” | |
These tectonic shifts in trend forecasting have been driven by technology. “Right now data is at the forefront of a significant shift in how the industry does business,” she said. “Anyone who isn’t already using real-time global data in their planning and trading strategy is setting themselves up to be left behind.” | |
While newer fashion forecasting has brought data and efficiency to the market, it may have unforeseen side effects. There is a danger with the methodology that fashion, which is supposed to enable individual expression, will become homogeneous. | |
Indeed, it can already seem as though major retailers are conspiring when they launch eerily similar designs at the start of a season. Mcculloch denies such a possibility. “Companies sharing ideas to create a trend force on the high street would never happen,” she said. Rather, any synchronicity is a function of the shared data flowing into the design process. | |
“Companies are competing to get the newest ideas in first—but as the world becomes more global, more and more I find that we are all viewing the same information and research,” she said. “The cities I travel to are slowly beginning to offer very similar things to each other. The differences are becoming less pronounced. It feels inevitable that eventually we will all be offering the same things. The question of who can be the quickest and most innovative with those designs has, rightly or wrongly, become the real competition.”
When clothing in every major retailer from London to Tokyo to Los Angeles begins to look the same, there may no longer be a place for the trend forecaster—especially as companies fully shift their faith from instinct to algorithm. | |
This post is part of How We Get To Next’s Sartorial month, looking at the future of fashion throughout September 2016. If you liked this story, please click on the heart below to recommend it to your friends. | |
Read more from How We Get To Next on Twitter, Facebook, or Reddit, or sign up for our newsletter.","1" | |
"rchang","https://medium.com/@rchang/learning-how-to-build-a-web-application-c5499bd15c8f","6","{Programming,JavaScript,""Data Science""}","176","13.7084905660377","Learning How to Build a Web Application","Lessons Learned on Web Development, Data Visualization, and Beyond | |
Back in my undergraduate days, I did a LOT of mathematical proofs (e.g. Linear Algebra, Real Analysis, and my all time favorite/nightmare Measure Theory). In addition to learning how to think, I also learned to recognize many, and I mean many Greek and Hebrew letters. | |
However, as I took on more empirical work in graduate school, I realized that data visualization was often far more effective for communication than LaTeX alone. In working out what information to convey to my readers, I learned that the art of presentation is far more important than I originally thought.
Luckily, I have always been a rather visual learner, so beautiful data visualizations always grabbed my attention and propelled me to learn more. I began to wonder how people published their beautiful work on the web; I became a frequent visitor of Nathan Yau’s FlowingData blog; and I continue to be in awe when discovering visualizations like this, this, and this.
After many moments of envy, I couldn’t resist learning how to create them myself. This post is about the journey I took to piece the puzzle together. By presenting my lessons learned here, I hope I can inspire anyone interested in learning web development and data visualization to get started!
For this learning project, one of the things that I did early on was to consult with experts on what essential skills to learn. Fortunately, I received a very detailed answer from Alexander Blocker, a statistician at Google. In his words: | |
My favorite part of his answer is the following paragraphs: | |
I knew the only way I could learn how these technologies work together was to build something integrated, useful, and fun. After some planning, I decided to build a Calendar Visualizer that authenticates to my Google Calendar via OAuth and displays how I allocate and spend my spare time.
When starting out, the first obstacle I ran into was deciding which language to use. I considered both Ruby and Python, but went with Python because I already had some familiarity with the language. Picking between frameworks such as Django and Flask was a bit harder, but eventually I chose Flask because the micro-framework feels more approachable to me — I could build things bottom-up and extend my web application as needed.
Coming from a non-CS background, the first transition I needed to make was to think beyond scientific computing, i.e. imperative-style scripting. At first, I was confused by what a Framework meant or why one would need it. It wasn’t until reading Jeff Knupp’s well-articulated post “What is a Web Framework?” that I realized:
To home in on the above points, here is a schematic that illustrates how a web application works in its essence. The client makes a request over HTTP; the server then processes this request and figures out the right response to return to the browser. This model is called the client-server architecture.
We can zoom in further to take a closer look at the typical architecture of a modern web application: | |
Generally speaking, there are three essential layers in a web application: | |
Depending on your interests and goals, you might develop more specific skills in one area than the other. Given that my goal was to see how everything worked together, I took a breadth-first search approach and learned just enough to see how each layer works. In the following sections, I will dive into each layer in more details and highlight some of the big ideas and lessons learned. | |
Let’s start our journey by revisiting the fundamental question — How does a web server know what information to return for a given request? The key lies in the Application Layer. Web Frameworks like Flask enable us to leverage Routes and Templates which make the presentation logic so much easier. | |
In Flask, Routes are enabled by decorators. In case you don’t know what a decorator is, check out Simeon Franklin’s awesome post to learn more! Conceptually, the @route decorators notify the framework about the existence of specific URLs and the functions meant to handle them. Flask calls the functions that receive a request and return a response “views.”
When Flask processes an HTTP request, it uses this information to figure out which view it should pass the request to. The view function can then return data in a variety of formats (HTML, JSON, plain text) that will be used by Flask to create an HTTP response. Let us demonstrate this with an example:
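A minimal sketch of such a route and view, assuming the webapp Flask object and the /user page mentioned below (the table contents are illustrative):

```python
from flask import Flask

webapp = Flask(__name__)

# The @route decorator registers the URL "/user" with this view function.
@webapp.route("/user")
def show_users():
    # Return a hardcoded HTML table as the body of the HTTP response.
    return """
    <table>
      <tr><th>Name</th><th>Hours</th></tr>
      <tr><td>Exercise</td><td>5</td></tr>
      <tr><td>Reading</td><td>3</td></tr>
    </table>
    """

if __name__ == "__main__":
    webapp.run(debug=True)
```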
The webapp.route decorator, upon receiving a request for /user, invokes the view, which returns an HTML table. This is what happens under the hood when a user visits the /user page.
Big Idea / Lesson Learned: Routes are the fundamental building blocks that enable client-server interaction! To see another simple example of how this works, see Philip Guo’s short but instructive tutorial.
The above example was rather simple: we could hardcode the entire HTML page inline. However, real HTML pages are often more complex, and coding content inline is simply too tedious, error-prone, and repetitive. In addition, Flask’s philosophy is to make method definitions as simple and as self-explanatory as possible. What can we do? Enter templates!
The most intuitive explanation of templates again comes from Jeff Knupp:
This design enables us to create scaffolds for different but similarly structured HTML pages. It makes presentation logic very customizable and reusable. Let’s revisit our view and see how a template can help: | |
Functionality-wise, this view does the exact same thing as before: it returns the same HTML table. The only difference is that the method definition is now much more readable — it simply renders HTML. Where is the HTML code, then? It is actually modularized in user.html:
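A rough sketch of what the revised view and its template might look like; render_template_string is used here only so that the Jinja markup which would normally live in templates/user.html can be shown inline, and the table contents are illustrative:

```python
from flask import Flask, render_template_string

webapp = Flask(__name__)

# Roughly what templates/user.html might contain: a Jinja loop over rows.
USER_TEMPLATE = """
<table>
  <tr><th>Name</th><th>Hours</th></tr>
  {% for name, hours in rows %}
  <tr><td>{{ name }}</td><td>{{ hours }}</td></tr>
  {% endfor %}
</table>
"""

@webapp.route("/user")
def show_users():
    rows = [("Exercise", 5), ("Reading", 3)]  # stand-in data
    # In the real app: return render_template("user.html", rows=rows)
    return render_template_string(USER_TEMPLATE, rows=rows)
```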
Notice that this file does not look like our typical HTML page. It is, in fact, templatized: Jinja placeholders like {{ ... }} interpolate values passed in from the view, while {% ... %} blocks express control flow such as loops and conditionals.
These constructs let us write flexible HTML templates, and they allow us to separate what to present from how to present it.
Big Idea / Lesson Learned: Templates do not change what is presented to the users, but they make the how much more organized, customizable, and extensible. To see more examples, check out this detailed Jinja template documentation.
We now see how routes and templates enable client-server interactions, but where does all the data come from? Introducing the database — the most common way to persistently store data.
Depending on the scale, different databases might be more suitable for handling different traffic loads. One of the simplest databases is SQLite, where data is persisted in a single local file. However, it is generally not the right choice for data-intensive applications (instead, MySQL is the more standard choice). Given that our application analyzes and visualizes events, let’s see how a database can help us persist this data. First, let us define a data model named dim_events that keeps track of each event’s date, name, and duration:
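A minimal sketch of that table using Python’s built-in sqlite3 module (the exact column names are assumptions based on the description above):

```python
import sqlite3

conn = sqlite3.connect("events.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS dim_events (
        event_date     TEXT,   -- e.g. '2016-01-15'
        event_name     TEXT,   -- e.g. 'Exercise'
        duration_hours REAL    -- time spent on the event, in hours
    )
""")
conn.commit()
conn.close()
```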
With the table created, we can execute SQL statements to populate the table and perform additional CRUD (Create/Read/Update/Delete) operations. When the application needs to query this data, our database is responsible for handing the data from the data layer to the application layer: | |
As an example, the show_all_events view needs to display all the events. And one particular way to surface this data is to execute a SQL query inside the view function. The code is simple, readable, but unfortunately problematic: | |
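The problematic version might look something like the sketch below (not the author’s original code): the raw SQL works, but it couples the view directly to the database schema and connection handling.

```python
import sqlite3
from flask import Flask

webapp = Flask(__name__)

@webapp.route("/events")
def show_all_events():
    # SQL embedded directly in the view: simple and readable,
    # but brittle and hard to reuse or test.
    conn = sqlite3.connect("events.db")
    rows = conn.execute(
        "SELECT event_date, event_name, duration_hours FROM dim_events"
    ).fetchall()
    conn.close()
    return "<br>".join(f"{d}: {name} ({hours}h)" for d, name, hours in rows)
```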
How can we improve this? We need the concept of an Object-Relational Mapper (ORM). My favorite explanation of ORMs is from Full Stack Python:
One of the most popular ORMs in Flask is SQLAlchemy. Instead of creating the dim_events table in SQLite directly, it allows us to initialize the same events table in Python as a Class: | |
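A sketch of what such a class might look like, here using Flask-SQLAlchemy, the common Flask wrapper around SQLAlchemy; the column names mirror the dim_events table assumed earlier:

```python
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

webapp = Flask(__name__)
webapp.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///events.db"
db = SQLAlchemy(webapp)

class DimEvent(db.Model):
    __tablename__ = "dim_events"
    id = db.Column(db.Integer, primary_key=True)
    event_date = db.Column(db.String)
    event_name = db.Column(db.String)
    duration_hours = db.Column(db.Float)
```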
More importantly, SQLAlchemy allows us to represent data models as Class instances, so interacting with the database from the application layer is now much more natural in the application code. The example below only uses the all and filter operators (which are equivalent to SELECT * and WHERE in SQL, respectively), but SQLAlchemy is much more versatile than that!
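For instance, fetching every event or only the “Exercise” events might look like this, assuming the DimEvent model sketched above:

```python
# Equivalent to: SELECT * FROM dim_events
all_events = DimEvent.query.all()

# Equivalent to: SELECT * FROM dim_events WHERE event_name = 'Exercise'
exercise_events = DimEvent.query.filter(DimEvent.event_name == "Exercise").all()
```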
Let’s see all of this hard work in action when we visit /dbdisplay/Exercise:
Big Idea / Lesson Learned: A database enables data to persist in an application. The proper way to query data in the application code is to leverage an ORM such as SQLAlchemy. To learn more, check out the official documentation & tutorial.
Oftentimes, it is useful to expand data access beyond our own applications to third-party developers (see here, here, and here). More openness enables more creative uses of data, which means more innovation. One of the most popular standards for exposing proprietary data to the public is RESTful APIs.
A good way to think about RESTful APIs is that they act as functions — functions that take in specific parameters as inputs and output standardized data in a controlled manner. The entire execution of the “function call” happens via HTTP: arguments are passed as part of the URL parameters, and data is returned by the function as an HTTP response.
With tools like SQLAlchemy, building API endpoints is actually not too different from what we have already done. Views take in the URLs and issue specific queries in order to return results based on the parameters. Below are the two views that we have seen before, but slightly modified. Notice that the only real change is that the data is now returned as JSON.
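A sketch of one such JSON-returning view, again assuming the DimEvent model from earlier (the URL pattern is illustrative). flask.jsonify serializes the results into an HTTP response with a JSON body:

```python
from flask import jsonify

@webapp.route("/api/events/<activity>")
def api_events(activity):
    # URL parameter -> ORM filter -> JSON response
    events = DimEvent.query.filter(DimEvent.event_name == activity).all()
    return jsonify([
        {"date": e.event_date, "name": e.event_name, "hours": e.duration_hours}
        for e in events
    ])
```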
Let’s see how things work when we hit these API endpoints: | |
Big Idea / Lesson Learned: APIs are convenient endpoints for developers to expose proprietary data to the outside world in a controlled manner. The specification of the data request is often composed as parameters in the URL, and data is returned via HTTP, often in JSON form. I highly recommend reading Miguel Grinberg’s long but engaging post to learn more.
Next up, we will see how everything (routes, templates, database, API endpoints) fits together to create what we are going after in the first place — data visualizations. | |
The front end layer is the closest layer to the end users, and requires a lot of design and creativity. For the scope of this post, we will focus on how to display data visualizations using D3. | |
First of all, if you don’t already know D3, I highly recommend Scott Murray’s Interactive Data Visualization; it was one of the most valuable resources for me when starting to learn D3! Like many others, my first exposure to D3 came from tutorials. Typically, a simple example would hardcode a fake dataset and explain at length how to create a bar/pie/line chart from it.
While they are educational, I always had little idea of how things really work in a real application, i.e., where does the data come from? As I gained more experience, I learned that D3 actually offers a wide range of options for loading data into the browser; one particular method is called d3.json.
This makeGraph function takes two arguments — a URL and a callback function (not implemented in the above code snippet).
The callback function will take that data, bind it to DOM elements, and display the actual visualization in the browser. This is usually where we write our D3 visualization code.
Let us see this through a more elaborate example. In my web app, there is a tab called “Calendar View” that allows a user to display her activities in the form of calendar heatmaps:
For this visualization, each cell represents a single day. The color intensity represents how much time I spent on a particular activity on that day. In the plot above, each highlighted block means that I did some form of exercise on that day. It’s obvious from the chart above that my New Year’s resolution is to exercise more regularly in 2016.
Where does it fetch the data, and how does it display this information? How does one construct a calendar? Let’s deconstruct this step by step: | |
For each visualization that is rendered, this is essentially what happens under the hood:
Big Idea / Lesson Learned: When a request triggers a view, the view will attempt to render the HTML and execute the JavaScript file. The D3 code in the JavaScript file will issue a query to the API endpoint, and the returned data will be bound to actual DOM elements and shown in the web browser. Routes, templates, databases, and APIs all work together to get this done! To learn more, here is another illustrative example that studies BART data.
Now that we have all the essential components for a functional web application, the last touch is to beautify the look and feel of this application. During my time at Twitter, I noticed that a lot of the internal tools tend to have the same look; then our designer pointed me to Twitter Bootstrap:
Twitter Bootstrap is extremely powerful because it enables us to upgrade the look and feel of an application very easily. Below is an example where header and table formatting essentially come free because of Twitter Bootstrap:
Big Idea / Lesson Learned: If you are interested in Design and UI, don’t reinvent the wheel. There is no shortage of layouts, components, and widgets to play with in Twitter Bootstrap. To learn more, I recommend this tutorial.
In this blog post, I have barely scratched the surface of web development, but the goal was never to be comprehensive. Rather, I am interested in sharing my experience and pointing other enthusiasts to great resources in order to learn more. A little effort goes a long way: I am now able to present my analyses in more expressive and interactive ways!
For Data Scientists, there are certainly more lightweight approaches to producing (interactive) data visualizations, using tools such as ggplot2, ggvis, or shiny. But I think learning how a web application works in general also makes one a stronger DS. For example, if you ever need to figure out how data is logged in an application, knowing how web applications tend to be built helps enormously in navigating the codebase to do data detective work. This knowledge also helps you establish a common language with engineers in more technical discussions.
If you are inspired, there are many more resources written by programmers who are much more qualified than I am on this topic (see here and here). | |
Start Flasking and keep hacking! | |
I would like to thank Krist Wongsuphaswat, Robert Harris, Simeon Franklin and Tim Kwan for giving me valuable feedback on my post. All mistakes are mine, not theirs.","1"
"quincylarson","https://medium.com/free-code-camp/code-briefing-the-best-classes-for-learning-statistics-8a3065b27735","2","{""Data Science"",Design,""Web Development"",Linux,Tech}","169","0.738050314465409","Code Briefing: The best classes for learning statistics","Here are three stories we published this week that are worth your time: | |
Bonus: It’s getting cold out there! So pick up a Free Code Camp hoodie today in our shop. | |
Happy coding, | |
Quincy Larson, teacher at Free Code Camp","3" | |
"gilgul","https://medium.com/render-from-betaworks/the-network-is-everything-89ed6b8b3290","7","{Startup,""Data Science"",Marketing}","168","4.51792452830189","The Network is Everything —","","5" | |
"drewwww","https://medium.com/mit-media-lab/highly-effective-data-science-teams-e90bb13bb709","1","{Startup,""Data Science"",Management}","166","4.06792452830189","Highly Effective Data Science Teams","For all its hype, Data Science is still a pretty young discipline with fundamental unresolved questions. What exactly do data scientists do? How are data scientists trained? What do career paths look like for data scientists? Lately, I’ve been thinking most about a related question: What are the markers of a highly effective data science team? | |
We often think first of “is there lots of data?” as the most important criterion for doing great data science work. I want to argue for a broader list that explores the processes of the team, the infrastructure that supports the team, and the boundaries between the team and the rest of the company. If you can organize those in a way that lets the team focus on the problems they own and removes friction around those problems, data scientists will excel.
This approach is inspired by Joel's test for software engineering teams. The structure of his framework is simple. You should be able to quickly answer each question with a yes or no. More yes’s are better! | |
This is a baseline measure of health — great teams might diverge on many other dimensions. These questions are as much about the ecosystem around the team as the team itself, but in my experience data science teams are so embedded that they must be acutely concerned with their organizational environment. You can also think about this from the perspective of someone thinking about joining your team; what would you ask about a team you were thinking about joining? | |
The first set of questions (1–3) focuses on whether the data science team is properly protected from tasks that could be better handled by better infrastructure, tools, or other specialists. Because data science is an interdisciplinary field and data scientists have at least basic skills in many adjacent domains (like engineering, dev ops, product management, math, research, writing, business, etc.) one of the easiest failure modes as a team is if they can’t focus on work that requires that entire set of skills to accomplish. Spending most of your time on ad-hoc requests, supporting simple data access, or doing data pipeline management displaces data science work. Because they can do that work well, it takes a disciplined organization to make sure they don’t have to. | |
A data team without rich data is flying blind, and questions 4–8 test whether the team has enough data and the associated tooling to work with it efficiently. If working with data is high friction because it conflicts with production systems, is undocumented or inconsistently collected, or simply not present, then it becomes challenging for a data science team to contribute in a timely fashion. They are also a measure of the level of organizational trust the team has; if product teams don’t get value from the data science team, building and fixing data collection systems will get de-prioritized. | |
Internal team processes (covered by questions 9–11) ensure the team is doing the kind of high quality research work that builds and maintains trust in the organization. Validating the work of a data scientist is out of reach for most of the team’s customers, so it is the responsibility of the team to commit to documenting their work, putting it through strenuous peer review, and evangelizing results. It should go without saying, but controlled experimentation is the most critical tool in data science’s arsenal and a team that doesn’t make regular use of it is doing something wrong. | |
If there is pressure for the data science team to make products look great even when evidence doesn’t support that view, then leadership is rotten. Teams must be able to report negative results confidently, otherwise everyone will lose trust in positive results. Data science teams need access to decision-makers with high leverage questions, and those decision-makers must have an honest relationship with data and evidence. One good proxy for this is whether there is demand for the data science team’s involvement and that leaders can rapidly identify how data science helped their team succeed. The final questions, 12–14, try to catch any of these issues. | |
This list is clearly not exhaustive or totally generalizable. The boundaries of what is and is not data science are still highly contested. I expect that teams who focus purely on building data products might have a very different perspective, as would those that intentionally blur the lines between data science and data engineering. Is there common ground between all data teams? Feel free to speak up in the comments and suggest new questions or vote to strike questions that you think aren’t broadly applicable! | |
Many thanks to my colleagues on the Twitch Science team (Brad, Wenjia, and Mark were particularly helpful) for reading, editing, and pushing me to complete this piece, plus Ryan Lubinski for helping generalize beyond some specifics of our experiences and arguing against too much brevity.","1" | |
"Mybridge","https://medium.com/mybridge-for-professionals/python-top-10-articles-in-september-859bc1070622","17","{Python,""Data Science"",""Big Data"",Programming,""Software Development""}","165","2.89811320754717","Python Top 10 Articles For The Past Month","In this observation, we ranked nearly 1,300 articles posted in September-October 2016 about Python. | |
Mybridge AI evaluates the quality of content and ranks the best articles for professionals. This list is competitive and carefully includes quality content for you to read. You may find this condensed list useful in learning and working more productively in the field of Python and Data Science. | |
30 Essential Python Tips and Tricks for Programmers | |
Compressing and enhancing hand-written notes with Python. Courtesy of Matt Zucker | |
An Introduction to Stock Market Data Analysis with Python [Part 1]. Courtesy of Curtis Miller | |
……………………………….……[Part 2] | |
Pyflame: Uber’s Ptracing Profiler for Python. Courtesy of Evan Klitzke, Software Engineer at Uber. | |
Creating Excel files with Python and XlsxWriter. | |
Dockerizing a Python Django Web Application. Courtesy of David Sale, Software Engineer at Sky News. | |
A Python Interpreter Written in Python. Courtesy of Allison Kaptur, Software Engineer at Dropbox. | |
Build a Slack Bot that Mimics Your Colleagues with Python. | |
Asynchronous Programming in Python at Quora. Courtesy of Manan Nayak, Software Engineer at Quora
Building a Paint App with Python. Courtesy of Derek Banas. | |
iGAN: Interactive Image Generation powered by GAN | |
[1,386 stars on Github] | |
. | |
Stitch: A Python library for writing reproducible reports in markdown | |
[300 stars on Github] | |
The Python Bible For Beginners: Everything You Need to know to build 11 projects in Python. | |
[4,346 recommends, 4.7/5 star] | |
For those who are looking to host a website in under 5 minutes
[One of the cheapest] | |
That’s it for Python Monthly Top 10. If you like this curation, read daily Top 10 articles based on your programming skills on our iOS app.","12" | |
"Mybridge","https://medium.com/mybridge-for-professionals/top-ten-machine-learning-articles-for-the-past-month-9c1202351144","16","{""Machine Learning"",""Big Data"",""Software Development"",Programming}","165","3.22547169811321","Top 10 Machine Learning Articles for the Past Month (v.July)","We’ve observed nearly 1,600 articles posted about machine learning, deep learning and AI in July 2016. | |
Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed — Arthur Samuel, 1959. | |
Mybridge AI evaluates the quality of content and ranks the best articles for professionals. This list is competitive and carefully includes great articles for you to read, so that you can learn and code more productively in the field of machine learning. | |
Machine Learning: A study of algorithms that learn from data and experience. | |
A Beginner’s Guide To Understanding Convolutional Neural Networks [Part I]. Courtesy of Adit Deshpande at UCLA | |
……………………………...Click here for Part II | |
Modern Face Recognition with Deep Learning — Machine Learning is Fun [Part 4] Courtesy of Adam Geitgey | |
Approaching (Almost) Any Machine Learning Problem. Courtesy of Abhishek Thakur at Kaggle Team | |
A Visual Introduction to Machine Learning [2015 post trending last month]. Courtesy of stephanie yee and Tony Chu | |
Deep Learning Tutorial at Stanford — Starting from Linear Regression. Courtesy of @AndrewYNg | |
Bayesian Machine Learning, Explained. Courtesy of Zygmunt Zając | |
Recurrent Neural Networks in Tensorflow II | |
……………………………Click here for Part I | |
“Can computers become conscious?”: My reply to Roger Penrose. Courtesy of Scott Aaronson, Computer Science Professor at MIT | |
Comparing the Top Five Computer Vision APIs. Courtesy of Gaurav Oberoi | |
tflearn: Deep learning library featuring a higher-level API for TensorFlow. | |
[3,232 stars on Github] | |
. | |
Magenta: Music and Art Generation with Machine Intelligence | |
[2,854 stars on Github] | |
. | |
Data Science and Machine Learning Bootcamp with R (most popular as of August 3) | |
[1,480 recommends, 4.8/5 rating] | |
That’s it for Machine Learning monthly Top 10. | |
If you like this curation, you can read Top 10 daily articles according to your skills on iPhone & iPad. | |
If you don’t have an iOS device, signup on the website and receive weekly Top 10 articles for your skills.","3" | |
"hoffa","https://medium.com/free-code-camp/always-end-your-questions-with-a-stack-overflow-bigquery-and-other-stories-2470ebcda7f","4","{""Big Data"",Stackoverflow,""Google Cloud Platform"",""Data Science"",Programming}","159","3.14905660377358","Want people to actually answer your Stack Overflow question? Add a question mark.","Last week, my team at Google announced that we’d be hosting all of Stack Overflow’s Q&A data on BigQuery. | |
Here are some of the most interesting insights about Stack Overflow that we’ve uncovered so far. | |
Nick Craver at Stack Overflow announced a new dataset dump on Friday: | |
We quickly loaded the full data dump into BigQuery: | |
Sara Robinson discovered that only 22% of Stack Overflow questions end with a question mark. | |
So I thought — hm… that’s interesting. But does adding a “?” actually help you get answers? | |
So I did an analysis of how many questions got an “accepted answer.” I then grouped them by whether or not they ended with a question mark. | |
It turns out that in 2016, 78% of questions ending in “?” got an accepted answer versus only 73% of questions that didn’t end in “?”. And this pattern remains consistent if you look back through the years.
So if you want people to actually answer your Stack Overflow questions, end them with a question mark. | |
What about the number of answers a given question gets? Do questions that end with a “?” get more replies? | |
Yes, they do: | |
Using a question mark in 2015 and 2016 gave questions at least 7% more answers. This is even more noticeable in 2008 and 2009, when questions with a “?” received 23% more answers than questions without one.
Here’s the query I ran to get these results: | |
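A roughly equivalent query against the public Stack Overflow dataset on BigQuery might look like the sketch below (not necessarily the exact query used; it assumes the bigquery-public-data.stackoverflow.posts_questions table with its title, creation_date, and accepted_answer_id columns):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Share of questions with an accepted answer, split by whether the
# title ends in "?", per year.
sql = """
SELECT
  EXTRACT(YEAR FROM creation_date) AS year,
  ENDS_WITH(title, '?') AS ends_with_qmark,
  COUNTIF(accepted_answer_id IS NOT NULL) / COUNT(*) AS accepted_rate
FROM `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY year, ends_with_qmark
ORDER BY year, ends_with_qmark
"""

for row in client.query(sql).result():
    print(row.year, row.ends_with_qmark, round(row.accepted_rate, 3))
```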
I built the above visualizations using re:dash. | |
Here’s a bonus visualization I did of how long it takes to get an answer depending on which programming language you’re asking about — and the total volume of questions and answers for each language: | |
Here’s an interactive version. | |
And here’s the query I ran to get these results: | |
Here’s Stack Overflow’s CEO announcing the fully query-able dataset: | |
One final interesting study: Graham Polley wrote a great post showing how to take Stack Overflow comments from BigQuery, run a sentiment analysis process on them with our Natural Language API and Dataflow, then bring them back to BigQuery to discover the most positive/negative communities. | |
His conclusion: | |
Check the GCP Big Data blog post, which includes queries on how to JOIN Stack Overflow’s data with other datasets like Hacker News and GitHub. | |
Want more stories? Check my medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free. | |
Also, here’s:","8" | |
"ilikescience","https://medium.com/mission-log/datum-ipsum-designing-real-time-visualizations-with-realistic-placeholder-data-27b873307ff9","10","{Design,""Data Visualization"",Analytics,Data,Mockup}","159","8.18962264150943","Datum Ipsum: Designing real-time visualizations with realistic placeholder data","tl;dr — I’ve created a tool called Datum Ipsum to enable designers to use realistic data when mocking up data visualizations. Also, this is a written version of a talk I gave at Data Viz Camp 2016 in NYC. The slides for that talk are here. | |
In my career as a designer, both at Planetary and in my freelance work, I’ve constantly struggled with designing real-time data visualizations. You know the kind: dashboards, analytics, reporting tools — anything that displays data on an ongoing basis. Designers at publications like the New York Times, Guardian, Wall Street Journal, and Financial Times (paywall) have been pushing the envelope of design in data-first journalism and reporting, but applying the same principles to dynamic data sets is challenging. Though these visualizations are incredibly valuable to the people who rely on them every day, they’re some of the trickiest to get right.
My hunch is that no amount of aesthetic theory or design experience matters if a designer doesn’t understand the data. The real problem, then, is that … | |
This fact has a lot to do with the lack of overlap between design and statistics — designers are often the ones who avoided taking statistics class and opted instead to spend their time in studios and darkrooms. But even a background in statistics can’t change the reality that humans simply struggle with data. A few reasons for this: | |
Our brains are really good at finding patterns. It’s one of the things that makes design so rewarding — looking at orderly shapes and patterns feels good. So when we look at data, we look for and often find patterns, even if those patterns are misleading or downright wrong. | |
In addition to finding meaning where there is none, the patterns that are meaningful are often incomprehensible. Take the Birthday Problem, for example: in a group of people, what are the odds that two have the same birthday? For a group of 367 people, we can assume that there’s a 100% chance that two have the same birthday. What are the odds with a group of 70 people? | |
The surprising answer is 99.9%. Beyond that, for 23 people, there’s a 50/50 chance that two share a birthday. There’s certainly a pattern here, but it’s impossible to discern without analysis. | |
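The arithmetic is short: the probability that n people all have distinct birthdays is 365/365 × 364/365 × … × (365 − n + 1)/365, and the chance of at least one shared birthday is one minus that. A quick sketch:

```python
def shared_birthday_probability(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1 - p_all_distinct

print(round(shared_birthday_probability(23), 3))  # ~0.507
print(round(shared_birthday_probability(70), 4))  # ~0.9992
```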
Try as we might, faking data is a lost cause. One illustration of this is called the Gambler’s Fallacy, and it’s easiest to illustrate with a demo. We’re going to flip 20 coins (virtually); but before we do, enter in what you think the outcome will be — select either “Heads” or “Tails” in each row on the left, then click “flip!” to see how you fare.
What you’ll (probably) see is that your guess on the left alternates from heads to tails often, with few streaks, but the result on the right has streaks. Your guess likely has close to 10 heads and 10 tails, but rarely does the actual result come up an even 10 and 10. These differences are attributable to our assumptions about probability (that a coin flip has a 50/50 outcome) being applied to a larger problem (20 actual coin flips). | |
There’s even a simple statistical property (called Benford’s Law) that can be used to identify faked data sets. In short, in any large enough data set, patterns emerge in the first digits of each data point: more numbers will start with the digit “1” than any other number. This pattern is so pervasive and counterintuitive that it is often used to prove — in court — that data has been falsified. | |
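The expected first-digit frequencies under Benford’s Law are log10(1 + 1/d) for each digit d; here is a quick sketch of comparing a data set against them (the helper names are illustrative):

```python
import math
from collections import Counter

def first_digit(x):
    """Leading non-zero digit of a number."""
    return int(str(abs(x)).lstrip("0.")[0])

def benford_check(values):
    counts = Counter(first_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = counts.get(d, 0) / total
        print(d, round(expected, 3), round(observed, 3))

# Data spanning several orders of magnitude tends to follow Benford's Law.
benford_check([2 ** k for k in range(1, 200)])
```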
Part of what makes data so tricky to work with is that our assumptions about the shape, domain, resolution, and volatility of a set of data are only just that: assumptions. | |
Even with the most reliable sets of data — say, an atomic clock — there’s always the possibility of error states or the occasional act of god. These one-in-a-million events are when great data visualization really shines: instead of breaking down in the face of anomaly, they highlight the rarity of the situation and respond appropriately. | |
But when we’re designing visualizations, how can we predict the unpredictable? It’s almost like something out of a Borges story; outliers aren’t really outliers if we’re expecting them. | |
Real data sets of all shapes and sizes are easy to come by if you know where to look. There’s a slew of large, well-formatted data sets on GitHub, collected and maintained by the (excellently-named) Open Knowledge group. From financial data to global temperature indexes, these data sets can easily provide a reality check for your designs. | |
Unfortunately, my experience working with historical data sets fails to overcome some of the fundamental challenges of designing robust visualizations. Here’s why: | |
Placeholder data is a lot like placeholder text. It’s a design tool meant to ensure the final product — in our case, a real-time data visualization — matches our mockups. It’s useful in all the same ways placeholder text is useful: | |
Just as with placeholder text, we can achieve varying degrees of placeholder data realism by choosing our parameters very carefully. When mocking up a page layout, I can choose the classic placeholder text “lorem ipsum dolor sit amet …” to help pick the right typeface, leading, measure, etc. A placeholder data version of “lorem ipsum” might look just like random numbers, a sort of meaningless pattern that looks like real data if you aren’t looking too hard. | |
I’ve often run into trouble using “lorem ipsum,” though, and you probably have, too: it is too obviously fake. Clients have questioned my sanity when all their mockups look like they’ve been written in Latin. And in making decisions with nonsense text, I sometimes run into problems down the line when the real text goes in.
Instead of using “lorem ipsum,” I’ve started to use real text from various sources. If the final text will be highly technical in nature, I can pull from scientific papers in the public domain. If it’s meant to be exciting, Jules Verne serves as a good placeholder. Choosing the qualities of my placeholder text helps me make better decisions, and sets better expectations with my clients. | |
How do we apply this concept to data? Can we create “flavored” placeholder data, without losing the benefit of randomness and unpredictability? | |
In 1983, a computer scientist named Ken Perlin was working on a very similar problem: how can a computer generate random images that don’t look like they were generated by a computer? While working on a movie titled Tron (the one without Daft Punk), Perlin developed an algorithm to draw textures that he could apply to the computer-generated images in the film to make them appear more realistic. And while the CGI in Tron looks laughably primitive compared to what we see in theaters today, at the time Perlin’s work was revolutionary. In 1997, he won an Academy Award for the algorithm, which by then was referred to as “Perlin Noise.” The award read: | |
This is an image of Perlin Noise. The technical description of the algorithm isn’t important for this post, so let’s just leave it at “random vector fields.”[1] | |
If we draw a line across this image, and track the color values as we go — “high” for white, “low” for black — we get a nice-looking smooth graph of the Perlin Noise field; its peaks and valleys are flattened out to a simple two-dimensional line. | |
Drawing lines across the noise field is akin to taking a walk with a FitBit and watching the elevation change as you go; a short walk might only have a few elevation changes, while longer walks have many more ups and downs. | |
One nice thing about this sort of data is that it can be added together to make more varied graphs. For instance, if we add up a few different walks across the noise field, we get a line that has short rises and falls, a plateau here and there, and the occasional rapid change. | |
Additionally, adding these graphs of Perlin Noise together with very basic functions like sine waves can generate data that resembles real-world patterns. For instance, a graph of the daily active users of a website follows a cycle; during the daytime, there are lots of users on the site, and in the late evening and early morning there are typically not as many. Combining a cyclical wave (in this case, a sine wave) with one of our noise graphs gives us a very realistic simulation of a website’s traffic!
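A rough Python sketch of the idea, using a smoothed random walk as a stand-in for true Perlin noise: a daily sine cycle is combined with the noise to produce a plausible-looking traffic series.

```python
import math
import random

def smooth_noise(n, step=0.1, smoothing=0.9):
    """A smoothed random walk: a crude stand-in for 1D Perlin noise."""
    value, velocity, out = 0.0, 0.0, []
    for _ in range(n):
        velocity = smoothing * velocity + (1 - smoothing) * random.uniform(-step, step)
        value += velocity
        out.append(value)
    return out

hours = 24 * 7                 # one week of hourly data points
noise = smooth_noise(hours)
traffic = [
    # daily cycle (low overnight, peaking mid-day) + noise + baseline
    100 + 50 * math.sin(2 * math.pi * (h % 24) / 24 - math.pi / 2) + 20 * noise[h]
    for h in range(hours)
]
print([round(t, 1) for t in traffic[:6]])
```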
Combining graphs of simple functions with Perlin Noise has a nearly infinite number of possible permutations, making it an incredibly effective tool for mocking up real-time data visualizations. In practice, Planetary uses this method to test our software platforms, like this one, which visualizes the performance of a publication’s stories in near-real time: | |
To make these noise functions accessible and easy to generate, I’ve created a tool called Datum Ipsum. Just like you might use placeholder text to mock up websites and apps, my hope is that you’ll use realistic placeholder data when creating data visualizations. | |
By using realistic placeholder data, you can make your designs more robust and free from bias. In putting constraints on the source material you work from, you’ll be challenged to solve problems that typically arise only after real data is available. With tools like Datum Ipsum, you can create tools that handle the unpredictable, difficult-to-understand nature of data in real time. | |
[1] If you’d like to learn more about Perlin Noise and its uses, I’d highly recommend Daniel Schiffman’s Youtube series on the subject.","1" | |
"dbreunig","https://medium.com/hacker-daily/do-algorithms-find-depression-or-cause-depression-2e047ef84cda","2","{""Machine Learning"",Amazon,Science,Depression,""Big Data""}","157","3.41352201257862","Do Algorithms Find Depression or Cause Depression?","You may have heard about the Harvard study wherein researchers “trained a machine to spot depression on Instagram.” The paper’s subject is perfectly weaponized to make the media rounds, combining data, AI, health, a popular social network, and an enticing question to encourage clicks (what filter means you’re depressed?). MIT Technology Review, Wired, The Next Web, The Independent, and others all hit post. The story has been lighting up Twitter for nearly a week now. | |
But once the depression filter was revealed (Inkwell, of course), I’m pretty sure everyone stopped reading. If they had, they’d have found a different story about depression: the crowdsourced workers who fuel the algorithms, which will evaluate us, are very depressed. | |
To find this sad story let’s run down the numbers cited in the MIT Technology Review article: | |
70% accuracy sounds pretty good! Allegedly, this hit rate is better than general practitioners. But it is hardly statistically significant. A test group of 100 is laughably small and the paper has yet to be peer reviewed. (Nick Stockton covers this on Wired, atoning for the publication’s earlier breathlessness.) | |
But they’ve buried the real story. | |
The depression rate among adults in the United States is 6.7%. | |
The depression rate among the crowdsourced workers who shared their photos is 41.2%. Over six times the national norm. | |
Working on Mechanical Turk, it appears, is crushing. | |
Mechanical Turk does not pay well. Because of their status as independent contractors, Turkers (as they are called) are not covered by labor laws. Their hourly pay ranges from $1 to $5.
But poverty does not appear to be the driver for this high depression rate. According to the CDC, poverty doubles the average US depression rate. Mechanical Turk, according to the Instagram study, multiplies it by six. | |
With the recent rise of deep learning, Mechanical Turk has become a training ground for algorithms. Turkers sort data which will be used to create machine learning products. The best summary of Mechanical Turk, its workers, and the machines they train is this episode of NPR’s Planet Money. | |
Listening to Planet Money, it’s easy to see how crowd work can spur frustration and feelings of helplessness beyond poverty itself. There are no bosses or structure, just rapidly cycling tasks. Pay for repetitive work is generally insultingly low. There are no avenues for recourse other than self-organization and open letters to Amazon, which generate no response.
When we discuss the issues inherent with AI and machine learning we usually focus on the perils of allowing computers to make decisions humans currently own. We rarely discuss the people whose work or attention create the algorithms themselves. | |
This is a mistake. Crowd work will only grow in the future, either through sharing-economy applications or online work. Its existence without appropriate, modern regulation is worth discussing.
In an ironic twist, the decisions made by the powerless people on Mechanical Turk will be amplified in algorithms which will eventually have power over us all. Do the depressed judge depression or photos differently than the happy? If the people training these machines do not represent us, we will cede decisions to algorithms with which we will likely disagree. The case discussed here regarding Mechanical Turk is even worse: the work of sorting itself could turn a representative population into a depressed one, making skewed decisions unavoidable. | |
It is a missed opportunity that crowd work remains largely invisible while its output, machine learning, is a topic du jour.","6" | |
"sienaduplan","https://medium.com/free-code-camp/10-takeaways-from-22-data-visualization-practitioners-at-openvisconf-a4a3a5b96fcd","35","{""Data Visualization"",""Data Science"",Data,Design,""Women In Tech""}","154","7.32264150943396","10 Takeaways from 22 Data Visualization Practitioners at #OpenVisConf","Update | May 13, 2016: Videos of all talks are officially live! | |
rachel binx from NASA’s Jet Propulsion Laboratory eloquently described every data viz practitioner’s struggle: compromising between art and utility. | |
Visualization literacy and data literacy take time. In fact, Tony Chu taught us how data visualization is the use of space to control time (spacetime pun?). | |
Intuitive data exposition is constrained by attention — what Daniel Kahneman, author of Thinking, Fast and Slow, would call our System 2 thinking. | |
Tony advocates using techniques like animation and pacing (think scrollytelling) to feed bite-size bits of information (pun intended) to those digesting our visualizations, while Rachel focuses on using data viz to render tools instead of to tell stories or paint a picture. | |
Every speaker probed the realms of creativity in their mesmerizing visuals and approaches to technical rigor. Mariko Kosaka transformed polynomial functions into “pixel playgrounds” and brought kernel convolution to life with a “pixel social graph.” | |
Shirley Wu abused the force (d3’s force-directed graph layout) in a number of ways to simplify Illumio’s Illumination product (Shirley’s slides distill this complex idea into beautiful, understandable simplicity). | |
Amelia McNamara sneakily taught us some statistics — particularly, a creative solution to distinguishing real patterns in data from randomly generated data. | |
Nicky Case cleverly applied systems thinking to real-world phenomena like disease proliferation, ecosystem regeneration, and racial segregation. Nicky employed emojis in simulations to show that the world is not linear; it’s loopy. (Side note: emojis made a record-breaking number of appearances in this year’s talks.) | |
And Nadieh Bremer meticulously decoded SVG properties like gradient, animation, the glow filter, motion blur, the gooey filter, and color blending. These eye candy features spruce up even the most basic data visualizations. | |
“Data visualization” is quickly becoming an understatement for the field. The OpenVis Conference exploited much more than just our visual cortexes. Our ears sampled the building blocks of music in Kyle McDonald’s talk and we experienced a physics lesson in Ana Asnes Becker’s 3D ride along the Nasdaq. | |
Virtual reality is building a new playground for data visualization — side effects may include nausea. Patricio Gonzalez Vivo showcased the future of 2D and 3D mapmaking using shaders (check out Mapzen’s Tangram). | |
We also witnessed robots produce physical data visualizations and data art during breaks at the conference. | |
At the foundation of every well-fit model, successful predictive algorithm, and effective data visualization is hours of feature engineering and data scrubbing. | |
Oftentimes, the visualization is just the tip of the iceberg. Sometimes, a single line of code is just the tip of the iceberg (*cough, cough* rachel binx)! Data visualizers are all under-the-hood engineers as well as designers and artists. | |
Zan Armstrong taught us how to explore time series data to select the appropriate seasonality before modeling. Fernanda Viégas and Martin Wattenberg heavily reinforced the importance of wise feature selection and engineering prior to building a successful machine learning algorithm. | |
Fernanda Viégas and Martin Wattenberg from Google’s data visualization group demystified neural nets by demo-ing their recently released TensorFlow playground. | |
With the right amount of hidden layers, nodes, and training data, a neural net can classify just about anything… well, except for images of mugs (again, feature engineering is key). | |
Christopher Collins led us through a text analysis and visualization spree. In the limited time he had, Christopher whisked us through parallel tag clouds of our country’s legal cases, sunburst diagrams of barbie’s lexicon, and how anomalous information spreads across the web. Christopher’s techniques are novel and mark the launch of a new era of text analysis. | |
One seventh of the 21 talks focused on components of automation in data visualization. Adam Pearce’s tiny tools developed at Bloomberg Graphics drastically reduce the time spent in iterative tasks like writing redundant lines of code. | |
David Yanofsky’s charting platform uses d3.js and react.js to create brutally simple point-and-click dashboards and charts. Arvind Satyanarayan demo’ed Vega, the reactive programming tool that allows users to declaratively define interactivity and describe features of a visualization directly in a JSON. | |
By stone, I mean field of study. We saw disruptive data visualization techniques in a breadth of fields from wine and song to outer space and tennis. We, as data visualization experts, get to play in everyone’s backyard. | |
Kim Albrecht deconstructed the relationship between performance and popularity in professional tennis. Kim explained the intersection between data science and data visualization and how visualization unites scientists and sparks public dialogues. | |
Christine Waigl launched us into orbit around the planet and probed into regions of the world affected by climate change and natural disaster. We voyaged through her visualization trajectory from retrieving satellite data to generating quality raster images. | |
kennedy elliott showed us how pre-attentive processing of image attributes like color, shape, angles, position, curvature, volume, shading, area, length, and direction deeply impacts how humans interpret graphics. | |
As designers and data experts, our job requires us to be aware of the inherent biases in graphics to produce accurate and useful tools. | |
Mona Chalabi reminded us to humanize our data and to be receptive to channels of critique like comment sections. Mona would agree that the medium is both the message and the messenger. | |
In real life, we learn to navigate uncertainty from repeated experiences. Jessica Hullman used hypothetical outcome plots to place us in the midst of uncertainty. According to Jessica, “HOPs enable a user to experience uncertainty in terms of countable events, just like we experience probability in our day to day lives.” | |
The hands-down most valuable part of the conference was meeting the wildly accomplished and equally curious data visualization and data science explorers.
I am so grateful to Irene Ros, the rest of the team at Bocoup, and the sponsors for putting together an outstandingly organized conference. Post-conference depression is a real thing, I learned. Thank you for the knowledge, friends, skills, and ideas. Also, huge thank you to Salesforce for sponsoring me and my awesome manager Ernest for supporting my attendance! | |
One of my favorite stories from the conference happened at the party. I was standing near Mariko who prefaced her talk by sharing how she married her love for knitting sweaters with her love for coding. A data viz engineer at Apple turned to me and asked if I was a knitter myself. Being the R queen that I am, I assumed he was wondering if I was a knitr user — as in R’s report-generating package ‘knitr’ — to which I responded positively, “Yes! I am a big knitr slash sweave fan.” | |
At any other party, I probably would have received a strange look and spent the rest of the night wondering why no one would talk to me. But a data nerd party is different. He understood me immediately, laughed at my misunderstanding, and followed up with more questions about knitr 😂.","6" | |
"WhiteHouse","https://medium.com/the-white-house/a-six-month-update-on-how-we-ve-been-using-data-and-how-it-benefits-all-americans-b1221b5cbb0e","6","{""Data Science"",Data,Government}","154","6.19905660377359","A Six Month Update on How We’ve Been Using Data, and How it Benefits All Americans","Memorandum: A Six Month Update on How We’ve Been Using Data, and How it Benefits All AmericansTo: The American PeopleFrom: Dr. DJ Patil, U.S. Chief Data ScientistDate: August 19, 2015 | |
In my last memorandum, I discussed the opportunity to unleash the power of data to benefit all Americans. Now that it’s been six months, I wanted to provide an update on my team’s progress. | |
As I’ve had a chance to explore the different areas we’re working on across the government, it’s clear that this is the most data-driven Administration we’ve ever had. So, what does a data-driven government look like? It’s a connected organization that responsibly gathers, processes, leverages, and releases data in a timely fashion to enable transparency, create efficiencies, ensure security, and foster innovation to benefit the nation. | |
Looking across the entire federal government, there is an incredible opportunity to increase our ability to be the most data-driven government in the world. And it starts with this: | |
My team is helping federal departments and agencies as they work to ensure: (1) they are using data to benefit all Americans; and (2) they are using the data available responsibly. | |
Good data science and technology must benefit all Americans. As I outlined in my initial memo to the public, my team is focusing on living up to that mandate by: | |
The only way to make sure the federal government is using data responsibly is to engage with the public directly to seek input on what are acceptable uses of data. | |
(Or, How Open Datasets Can Make Our Treatment and Prevention Plans More Personalized and Effective) | |
One example of how the Administration is working to ensure data science benefits all Americans, while also providing for responsible data use, is the Precision Medicine Initiative (PMI). The mission of PMI is to enable a new era of medicine through research, technology, and policies that empower patients, researchers, and providers to work together toward development of individualized treatments and disease prevention. | |
To do this, we are collaborating with the public to get your input and support. Everyone who participated in our Twitter chats (#PMINetwork), National Institutes of Health (NIH) workshops, listening sessions, and wrote letters gets a big THANK YOU. Your help has been invaluable as we continue working on this incredible opportunity. For example, we’ve heard clearly that we need to ensure that the participant/patient is central to the project and their data is treated responsibly. With that feedback in mind, we recently announced our privacy and trust principles. | |
July 8 was an exciting day, as we welcomed our Precision Medicine “Champions of Change” to the White House. Make sure to watch the full video of the event to learn more about the progress being made. | |
If you’re interested in learning more about the sessions on Precision Medicine hosted by the NIH, visit here. The Precision Medicine Working Group closely reviewed feedback from those sessions and will publish their advice to the NIH Director this fall. | |
(Or, How We Can Use Big Data to Make People’s Lives Better) | |
Our data team is also continuing efforts to utilize and unleash data to improve lives. Thanks to the President’s Executive Order requiring that agencies make data open and machine readable by default, the power of data is being unleashed for all Americans. Data.gov hosts more than 130,000 datasets from nearly 80 agencies and 37 state, county, and city data catalogs. There are more than 8 million page views annually and we’re just getting started. |
Take the Department of Commerce, which recently announced a big data project with Google, IBM, Microsoft, Amazon, and the Open Cloud Consortium to improve all the products that help people respond to the weather and the environment. The goal of this project is to liberate more than 50 terabytes per day of previously inaccessible information — for free — by combining government expertise with private-sector technological capabilities. Doing so will fuel new business innovation and support the Administration’s Climate Data Initiative. | |
Another example of using government data to help make people’s lives better came after the Nepal earthquakes, when an incredible grassroots effort by data scientists around the world used open data to help first responders and aid workers better understand the situation around them. To get a sense of the power of all these efforts, check out the talk I participated in at the International Open Data Conference (and hey, PDFs shouldn’t be considered “open data”!). |
(Or, How a Police Chief in New Orleans Came to Write His First Line of Code) | |
Another example of the power of data to improve lives is through criminal and social justice efforts. Criminal justice systems — from small towns to massive cities — collect a lot of data on the people and cases that move through them. Virtually every law enforcement action has a record associated with it. Minorities, the poor, and those with mental health concerns have a disproportionate number of interactions with law enforcement. Imagine how effectively capturing, analyzing, and sharing all of that law enforcement data could advance proven reforms, increase efficiency, and prevent injustices. |
The recommendations from the Taskforce on 21st Century Policing called for the novel use of technology to tackle this social justice opportunity in the immediate term. One of the first projects, announced by the President in Camden, New Jersey, is the Police Data Initiative (PDI), which focuses on generating and implementing new data and technology innovations within key jurisdictions, civil society groups, and federal, state, and local agencies. PDI is centered on two key components: (1) using open data to build transparency and increase community trust, and (2) using data to enhance internal accountability through effective analysis. PDI has mobilized 26 leading jurisdictions across the country, bringing police leaders together with top technologists, researchers, data scientists, and design experts to use data and technology to improve community trust and enable a shift towards data-driven community policing. | |
This incredible effort has been led by two Presidential Innovation Fellows, Denice Ross and Clarence Wardell. To see the power of their efforts, read Denice Ross’s story about working directly with the police and youth in New Orleans and how New Orleans Police Superintendent Michael Harrison came to write his first line of code. | |
This is just the start, and I’ve been humbled by the support from all of you on our efforts. I’ve had the chance to meet many of you, on topics ranging from why community college is so important to hacking on civic data in South Bend, Indiana, and in the inner city of Chicago. And I hope to meet many more of you through future events. | |
Data science is a team sport, and we need your ideas, suggestions, and feedback to make it work! You can send those my way here on Medium with a response, or (for those thoughts and suggestions that can fit within 140 characters), tweet them to me at @DJ44. I look forward to sharing more updates as we continue making important progress on data usage. | |
See related posts:","9" | |
"mbostock","https://medium.com/@mbostock/command-line-cartography-part-1-897aa8f8ca2c","5","{JavaScript,Maps,D3,""Data Visualization"",Programming}","151","4.03710691823899","Command-Line Cartography, Part 1","[This is Part 1 of a tutorial on making thematic maps. Read Part 2 here.] | |
This multipart tutorial will teach you to make a thematic map from the command line using d3-geo, TopoJSON and ndjson-cli—free, open-source tools written in JavaScript. We’ll make a choropleth of California’s population density. (For added challenge, substitute your state of choice!) | |
The first part of this tutorial focuses on getting geometry (polygons) and converting this geometry into a format that can be easily manipulated on the command-line and displayed in a web browser. | |
The U.S. Census Bureau regularly publishes cartographic boundary shapefiles. Unlike TIGER—the Census Bureau’s most-detailed and comprehensive geometry product—the “cartographic boundary files are simplified representations… specifically designed for small scale thematic mapping.” In other words, they’re perfect for a humble choropleth. | |
The Census Bureau, as you might guess, also publishes data from their decennial census, the more frequent American Community Survey, and other surveys. To get a sense of the wealth of data the Census Bureau provides, visit the American FactFinder or the friendly Census Reporter. Now we must choose a few parameters: | |
It’s necessary to determine these parameters first because the geometry must match the data: if our population estimates are per census tract, we’ll need census tract polygons. More subtly, the year of the survey should match the geometry: while boundaries change relatively infrequently, they do change, especially with smaller entities such as tracts. | |
The Census Bureau helpfully provides guidance on picking the right data. Census tracts are small enough to produce a detailed map, but big enough to be easy to work with. We’ll use 5-year estimates, which are recommended for smaller entities and favor precision over currency. 2014 is the most recent release at the time of writing. | |
Now we need a URL! That URL can be found through a series of clicks from the Census Bureau website. But forget that, and just browse the 2014 cartographic boundary files here: | |
Given a state’s FIPS code (06 for California), you can now use curl to download the corresponding census tract polygons: | |
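A sketch of that step (the URL below follows the Census Bureau’s GENZ2014 naming convention; double-check it against the directory listing above):
curl -O 'http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_06_tract_500k.zip'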
Next, unzip the archive to extract the shapefile (.shp), and some other junk: | |
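Something like the following, where -o simply allows overwriting a previous extraction:
unzip -o cb_2014_06_tract_500k.zip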
(You should already have curl and unzip installed, as these are included in most operating systems. You will also need node and npm; on macOS, I recommend Homebrew to install software.) | |
A quick way to check what’s in a shapefile is to visit mapshaper.org and drag the shapefile into your browser. If you do that with the downloaded cb_2014_06_tract_500k.shp, you should see something like this: | |
As Mapshaper demonstrates, it’s possible to view shapefiles directly in your browser. But binary shapefiles can be difficult to work with, so we’ll convert to GeoJSON: a web-friendly, human-readable format. My shapefile parser has a command-line interface, shp2json, for this purpose. (Warning: there’s an unrelated package of the same name on npm.) To install: | |
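A sketch, assuming you already have npm available (the shp2json binary is provided by the shapefile package):
npm install -g shapefile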
Now use shp2json to convert to GeoJSON: | |
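Roughly as follows; ca.json is just an illustrative output name:
shp2json cb_2014_06_tract_500k.shp -o ca.json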
Note that this also reads cb_2014_06_tract_500k.dbf, a dBASE file, defining feature properties on the resulting GeoJSON. The glorious result: | |
We could now display this in a browser using D3, but first we should apply a geographic projection. By avoiding expensive trigonometric operations at runtime, the resulting GeoJSON renders much faster, especially on mobile devices. Pre-projecting also improves the efficacy of simplification, which we’ll cover in part 3. To install d3-geo-projection’s command-line interface: | |
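Again via npm (assuming a global install is acceptable on your machine):
npm install -g d3-geo-projection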
Now use geoproject: | |
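A sketch using one common set of California Albers parameters and the 960×960 fit discussed below; the parallels and rotation here are assumptions you would adjust for another state:
geoproject 'd3.geoConicEqualArea().parallels([34, 40.5]).rotate([120, 0]).fitSize([960, 960], d)' < ca.json > ca-albers.json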
This d3.geoConicEqualArea projection is California Albers, and as its name suggests, is appropriate for showing California. It’s also equal-area, which is strongly recommended for choropleth maps as the projection will not distort the data. If you’re not sure what projection to use, try d3-stateplane or search spatialreference.org. | |
The projection you specify to geoproject is an arbitrary JavaScript expression. That means that you can use projection.fitSize to fit the input geometry (represented by d) to the desired 960×960 bounding box! | |
To preview the projected geometry, use d3-geo-projection’s geo2svg: | |
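For example, matching the 960×960 bounding box used above (the output file name is illustrative):
geo2svg -w 960 -h 960 < ca-albers.json > ca-albers.svg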
If you followed along on the command line, you hopefully learned how to download and prepare geometry from the U.S. Census Bureau. | |
In part 2 of this tutorial, I’ll cover using ndjson-cli to join the geometry with population estimates from the Census Bureau API and to compute population density. | |
In part 3, I’ll cover simplifying geometry and merging features using topojson-server, topojson-simplify and topojson-client. | |
In the forthcoming part 4, I’ll cover rendering in the browser using topojson-client and d3-geo. (If you want to peek ahead, see my bl.ocks.) | |
Questions or comments? Reply below or on Twitter. Thank you for reading!","2" | |
"ilikescience","https://medium.com/mission-log/tiny-data-visualizations-11bc44ea9b1e","6","{""Data Visualization"",JavaScript,Design}","147","4.48962264150943","Tiny Data Visualizations","This is a direct translation of a talk I gave at the NYC D3.js meetup in January of 2016. If you’d like to see a video of the talk, you can check it out on YouTube. | |
Have you seen this comprehensive list of Donald Trump’s insults? Or maybe this detailed rundown of the 2016 Primaries? Data is a big deal in the newsroom right now, and in 2016, we’ll likely see data take on an even bigger role in reporting feature stories. | |
The New York Times is just one example of an organization increasingly banking on data-driven reporting: in the Politics section of nytimes.com there is a blog called The Upshot that consistently uses analysis of detailed data sets to reveal surprising insights about news events on the front page. They typically include charts and diagrams to bolster their reporting: large, interactive explorations that are driving more and more traffic for the paper every day. | |
The Upshot, FiveThirtyEight, and Buzzfeed are all treating data visualization the same way that financial papers like the Wall Street Journal treat it: alongside the story, illustrating the story, but not integral to the story itself. |
So what does data look like when it’s used as part of the story itself? How can we treat it as we treat verbs and nouns, integrated into, instead of apart from, the story? | |
Enter Sparklines | |
Edward Tufte describes sparklines in his book Beautiful Evidence as being word-sized graphics. “Typographic Resolution” is his key identifier of a sparkline, and he goes back to this distinction repeatedly without quite defining it. So what does “Typographic Resolution” mean? | |
Emojis are one example of word-sized graphics that we use every day: these pieces of design have typographic resolution, as they sit directly with the text to add meaning or change context. Emojis can turn “thanks 😀” into “thanks 😕”; a lot of meaning is imparted with a very small symbol. | |
Sparklines in practice | |
Let’s look at a data table that might benefit from sparklines. | |
This is a typographic table: just numbers and letters, spelling out the top 10 index funds over the past 4 years. We can add a small chart to each line with the historic trends for that fund. | |
This adds 120 more data points to each line, demonstrating the volatility of each fund, as well as demonstrating the forces of market cohesion and highlighting market-wide events. | |
In fact, we can make this chart far more compact by eliminating the numbers on the right, replacing milestones with historic trends. Some audiences might not need to quantify the specific returns of the funds, and for them, we can add data and reduce complexity. | |
For the right audience, this compact display of data is just as easy to parse as the original. It might be argued that it’s easier, in fact: the lines resonate aesthetically with our natural instinct to find patterns, using multiple mental modes to add additional dimensions to the data. | |
Designing Charts to be Tiny | |
Simply taking normal-sized charts and sizing them down doesn’t accomplish the kind of efficiency of space we’re looking for with our new visualizations. In order to succeed, charts have to be designed with their size in mind. | |
That can look something like this: | |
This is a proprietary platform that Planetary designed for The Daily Beast. It condenses information about system-wide averages, local maxima and minima, and current state into a small visualization that accompanies each report. |
This approach also applies to paragraph-level charts. This is an example of how we might take a story about market performance and integrate charts into it: | |
These charts are designed with tininess in mind. They don’t include text or axes. They use very few colors or patterns. They focus on one piece of information and don’t try to draw correlations or associations. This simplicity allows them to add value without adding noise. | |
Tools to make tiny charts | |
There are a number of tools available to designers and developers that make it easy to embed tiny charts into paragraphs of text: | |
1. D3.js | |
A developer named Tom Noda has written up a great tutorial for rendering small charts in D3. He deftly addresses some of the technical challenges, such as smoothing and asynchronous data fetching. | |
2. FF Chartwell | |
While we don’t normally think of fonts as software, they are actually complex pieces of code, capable of really interesting applications. | |
FF Chartwell, a font originally published by FontFont, can parse numbers into a variety of different visualizations. It uses the inherent qualities of the text — color, style, and size — to style the charts and graphs. And because it’s a font, it can be embedded into web pages and rendered alongside other text; this takes advantage of the optimizations done by browsers when rendering text. It’s nothing short of black magic. | |
In Conclusion | |
We see the value of data visualization more and more each day, as large and interactive charts are published alongside stories in major publications. Newsrooms that explore the realm of word-sized charts, making data a central and essential component to stories, will stand apart; the next frontier of data is integration with the text of our stories.","2" | |
"larrykim","https://medium.com/marketing-and-entrepreneurship/16-eye-popping-statistics-you-need-to-know-about-visual-content-marketing-15fdc6ffa6f7","3","{Marketing,""Social Media"",""Content Marketing"",""Data Visualization"",Branding}","147","2.48207547169811","16 Eye-Popping Statistics You Need to Know About Visual Content Marketing","One marketing trend that’s impossible to ignore is the growing power and value of visual content. Just look at four of the fastest-growing social networks: Pinterest, Instagram, Snapchat, and Tumblr. | |
The way people — especially younger people — are consuming content is radically changing. If what people consume in just one minute online is any indicator, consumers are connecting to, searching for, watching, creating, downloading, and shopping for content more than ever. As a marketer, you must adapt your content marketing strategy to remain relevant. | |
Visual content increases message association, brand awareness, and engagement — and enhances the overall design of your website, as detailed in this Inc. article. | |
Photos, infographics, memes, illustrations, and videos are just a few forms of visual content that are having a huge impact on the way people consume information. All of these visual assets will only continue to grow in importance over the next few years. | |
Just look at these 11 eye-popping statistics on how content and technology are changing humans (as compiled in an infographic from WebDAM, below): | |
Here’s the full infographic for your visual consumption. | |
But that’s only the tip of the visual content iceberg. Here are five bonus eye-opening statistics on visual content marketing for Facebook and Twitter you need to know: | |
Bottom line: Visuals are memorable and effective, because they help people process, understand, and retain more information more quickly. | |
Are you creating stunning visuals or including videos on your website and in your advertising? | |
Originally published on Inc.com | |
Found this post useful? Kindly tap the ❤ button below! :) | |
About The Author | |
Larry Kim is the Founder of WordStream. You can connect with him on Twitter, Facebook, LinkedIn and Instagram.","3" |
"uwdata","https://medium.com/hci-design-at-uw/introducing-vega-lite-438f9215f09e","9","{Visualization,""Data Science"",D3}","147","4.88301886792453","Introducing Vega-Lite","Today we are excited to announce the official 1.0 release of Vega-Lite, a high-level format for rapidly creating visualizations for analysis and presentation. With Vega-Lite, one can concisely describe a visualization as a set of encodings that map from data fields to the properties of graphical marks, using a JSON format. Vega-Lite also supports data transformations such as aggregation, binning, filtering, and sorting, along with visual transformations including stacked layouts and faceting into small multiples. | |
As you might have guessed, Vega-Lite is built on top of Vega, a visualization grammar built using D3. Vega and D3 provide a lot of flexibility for custom visualization designs; however, that power comes with a cost. With Vega or D3, a basic bar chart requires dozens of lines of code and specification of low-level components such as scales and axes. In contrast, Vega-Lite is a higher-level language that simplifies the creation of common charts. In Vega-Lite, a bar chart is simply an encoding with two fields. | |
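To give a feel for the format, a minimal bar chart specification looks roughly like this (the data values and field names are purely illustrative):
{"data": {"values": [{"a": "A", "b": 28}, {"a": "B", "b": 55}, {"a": "C", "b": 43}]}, "mark": "bar", "encoding": {"x": {"field": "a", "type": "ordinal"}, "y": {"field": "b", "type": "quantitative"}}}
Everything else (scales, axes, sizing) is filled in by the compiler, as described below.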
Vega-Lite was inspired by other high-level visualization languages such as Wilkinson’s Grammar of Graphics, Wickham’s ggplot2 for R, and the VizQL formalism underlying Tableau. Motivated by the design of these languages as well as the underlying Vega and D3 systems, we arrived at a number of principles to guide the design of Vega-Lite. | |
Favor composition over templates. While chart templates (as found in spreadsheet programs) can be convenient, they limit the visualization types that can be created. Instead, Vega-Lite uses a compositional approach, describing a visualization based on the properties of graphical marks. This approach not only supports an expressive range of graphics, it also helps users rapidly refine and move between chart types. For example, you can create a histogram by mapping a binned field and the count of all records to a bar mark. You can then quickly edit the specification to create other types of visualizations, such as binned scatterplots. | |
Provide sensible defaults, but allow customization. Vega-Lite’s compiler automatically chooses default properties of a visualization based on a set of carefully designed rules. However, one can specify additional properties to customize the visualization. For example, the stacked bar chart on the left has a custom color palette. Vega-Lite uses a concise syntax, enabling rapid creation of visualizations without unduly restricting subsequent customization. | |
Support programmatic generation, sharing, and reuse. Like Vega, Vega-Lite can serve as a standalone file format for visualizations. In particular, Vega-Lite is designed to be a convenient format for automatic visualization generation by visual analysis tools. Examples of applications that use Vega-Lite are Voyager, a recommendation-powered visualization browser for data exploration, and Polestar, a web-based visual specification interface inspired by Tableau. Moreover, one can reuse Vega-Lite specifications across datasets with similar schemas. | |
Leverage Vega’s performance, flexibility across platforms, and expressivity. Vega-Lite specifications are compiled into Vega specifications and rendered using Vega’s runtime, which supports both browser-side and server-side rendering via SVG or Canvas. Vega-Lite directly benefits from Vega’s architecture. While Vega-Lite focuses on commonly-used charts, one can create more advanced designs by starting with Vega-Lite and then further customizing the resulting Vega specification. | |
With the 1.0 release, Vega-Lite provides a useful tool for visualization on the web. That said, we are even more excited about what comes next. | |
Composition and Interaction. A powerful aspect of modular approaches to visualization is the ability to create sophisticated graphics by composing simple ones. A static visualization typically provides at most a handful of insights into the data. The true power of visualization lies in the ability to interact with data and see it from multiple perspectives. So, we are building new methods for composite, interactive visualizations in Vega-Lite. In the coming months, Vega-Lite will add support for both layering and composing views side-by-side. We are also developing ways to describe not just visual encodings, but interaction techniques using a concise, composable, high-level syntax. For example, we will support linked views with cross-filtering. | |
Scalability. As Vega-Lite is a declarative language, we can reason about its behavior and automatically optimize and distribute computation. For example, a server could pre-aggregate data and send data in a compressed binary format to the browser. As a result, visualizations of large data sets can load more quickly and be more responsive. Moreover, this optimization should be possible without any changes to a Vega-Lite specification! | |
Design Tools. Creating visualizations with Vega-Lite should be easy, but we hope to make it even easier. We are developing a design validator that helps identify potentially ineffective visualizations. For example, a horizontal bar chart with an x-axis starting at a non-zero value is a valid specification, but it might cause readers to misinterpret the relative differences between values. We will also introduce support for theming in both Vega and Vega-Lite to customize the default look-and-feel. Plus, the next version of Lyra, a visualization design tool, will use Vega-Lite’s rule-based system for rapidly creating visualizations. | |
If you are interested in taking Vega-Lite for a spin, you can start visualizing and join the community today by: | |
This post was authored by Kanit “Ham” Wongsuphasawat, Dominik Moritz and Jeffrey Heer. We would like to thank all of the Vega-Lite contributors and the UW Interactive Data Lab for their assistance in the development of Vega-Lite.","2" | |
"nwokedi","https://medium.com/@nwokedi/machine-learning-isn-t-data-science-67cc66867dbc","3","{""Machine Learning"",""Data Science""}","147","7.19150943396226","Machine Learning Isn’t Data Science","Too often, Machine Learning is used synonymously with Data Science. Before I knew what both of these terms were, I simply thought that Data Science was just some new faddish word for Machine Learning. Over time though, I’ve come to appreciate the real differences in these terms. I’ve always wondered how misconceptions like these endure for so long — my current working hypothesis: people are deathly afraid of looking stupid. Too afraid of asking someone “what is machine learning? What is data science? What is the difference?” So, for those too afraid of asking, I’m going to pretend that you asked. Now, what follows is my hypothetical answers to your hypothetical questions :-). Enjoy. | |
Machine Learning is the set of techniques concerned with getting a program to perform a task better with respect to some metric as the program gains more experience. Amazon’s recommendation engine is an example of a machine learning system. The program is the recommendation engine. The task is to provide you with recommendations of things you’re likely to buy. Let’s say that the metric is the number of recommended purchases you’ve made over the number of recommendations the system sent you. The recommendation engine gets experience from monitoring what you view, what you buy. Machine Learning has three distinct areas that fully describe it: supervised learning, unsupervised learning, and reinforcement learning. | |
Supervised learning is the process of trying to approximate a function: for example, predicting next year’s home prices in San Francisco based on the previous ten years of housing prices. The function you’re attempting to approximate is the price of a San Francisco home next year. This function is probably impossible to compute exactly. We are beholden to the data we can obtain, and that data is rarely perfect. For instance, the ten years of historical prices may not track all the information we’d need to make perfect predictions. A historical house pricing data set that only has pricing information is very different from a data set that has pricing, geography, number of bedrooms, last kitchen update, etc. The price of a home next year can be affected by all kinds of things outside of any individual’s control (e.g., natural disasters, economic boom/bust). It would be difficult to construct a model that could perfectly predict the future in this way. Thankfully, for most use cases we are satisfied with approximate predictions of the future and, more generally, approximations of the function we wish to find. |
Unsupervised learning is the process of exploiting the structure of data to derive “interesting” summaries. Let’s assume that we have all statistics associated with each NFL team. Furthermore, let’s say we want to know how similar teams are because we think once we find these similarities, we might find certain attributes correlate with (un)successful franchises. Before we could embark on this path, we’d have to define what we mean by similar by choosing which statistics we want to measure distances between (e.g., years of experience of the offense, years of experience of the head coach). We’d also have to make sure that Euclidean distance was the type of distance we were interested in. We’d then apply some algorithm that causes the teams to form clusters of one or more teams based on their distance to each other. Teams that are closer to each other will tend to end up in the same cluster; teams that are further from each other will tend not to be in the same cluster. These clusters constitute summaries of the original NFL data. Now, here’s the important part: it will take human judgement to determine if the obtained clusters are in fact “interesting”. |
Reinforcement learning is the process of learning from delayed reward. There’s a notion of an agent (or program) taking action in the world toward some objective. However, the agent doesn’t get immediate feedback for the action it takes in its world. It doesn’t find out until many steps in the future whether the 1st, 2nd, or 3rd action it took was a fatal one or a glorious one. Think of the game checkers. The reward there is winning the game. After playing many games with a formidable opponent, the agent may realize that certain moves lead to certain failure and will tend to avoid those moves. The good agent will eventually learn to make better moves that will increase its odds of winning against a formidable opponent. |
Although I’ve described these subareas of Machine Learning independently, they can be combined to produce powerful systems (e.g., see IBM Watson). |
Now, for data science. Data science is the newer term and thus more ill-defined. My definition of data science is derived from Johns Hopkins Data Science Specialization. Data science is the process of obtaining, transforming, analyzing, and communicating data to answer a question. If you’re the type of person that craves linear processes, one follows: | |
However, as you might guess, this linear picture doesn’t quite capture reality. That said, this depiction isn’t completely useless. These are in fact the steps you’re moving through when doing data science. Now that you’ve been prepped with the fake, let’s take a look at the real: |
This bus architecture captures the messiness of the process more accurately. Any future step can influence some previous step. Any previous step can influence some future step. For ease of discussion, we’ll use the linear process depiction. Let’s walk through each step. | |
The data question is the question that can be answered with data. It’s essential that the question asked can, in fact, be answered by the data you have or the data you can obtain in a reasonable timeframe. The question may be given to you, or it may be a question you develop. | |
The raw data is exactly what it sounds like. This is data required to answer your question, but in a “raw” state. In order for you to engage in the data analysis you want, you need to convert the raw data into tidy data. The process of turning raw data into tidy data is called cleaning the data. Suppose you downloaded the graduation rates for the past five years for males and females from universities around the country as a CSV file. This CSV file is the raw data. Beyond downloading raw data from a server with the click of a button, web scraping or programmatically pulling data from a distributed file system or database are also common. People rarely mention Sneakernet, but it’s also a thing. | |
The tidy data is the data after you’ve cleaned it for subsequent analyses. Continuing with the previously mentioned CSV file on graduation rates, it’s likely that the file wasn’t created specifically to support your analysis. Therefore, it’s likely to have other bits of information that are of no interest to you, like the ID of the person that entered the data, or a last-accessed timestamp. Moreover, it’s possible the file will have missing or invalid values in some entries (e.g., the value 432 as a graduation rate). For these reasons, you’ll need to rectify these issues as part of your custom script to get the tidy data. I’ll note that people have taken time to define what tidy data is, and it’s worth checking out. |
The data analysis is the result of the analysis performed. And this is the part that everyone tends to think about when they think of Data Science. It’s where things start to get sexy. Broadly speaking, there are a finite number of analyses one might engage in at this stage. So, let’s walk through them. | |
Descriptive Analysis | |
In this phase, you’re trying to understand the shape of your data. You’re principally interested in being able to summarize the properties of your data. Think min, max, mode, average, range, etc. | |
Exploratory Analysis | |
In this phase, you’re trying to find what relationships exist in the data. You’re usually constructing a lot of quick and dirty plots to determine what type of analyses you might like to try next on the data. Think histograms, box plots, and good ol’ x-y plots. | |
Inferential Analysis | |
If you’re interested in making a claim about a population based on a sample of that data, then this is the type of analysis you’ll want. Inferential analysis is often desirable because it communicates the estimated uncertainty associated with a claim. Think statistical hypothesis testing and confidence intervals. | |
Predictive Analysis | |
If your question has to do with predicting phenomena, then you’ll eventually find yourself in this phase. Here, you’re trying to identify the best set of features that will allow you to make predictions about something else. Think supervised learning. | |
Causal Analysis | |
If you wish to make claims such as “X causes Y” you’ll really need to be able to perform randomized controlled experiments. If this is unavailable to you and all you have is observational data (the common case), you may consider leveraging a Quasi-Experimental Design (but its validity is questionable). Things like moderation analysis tend to come up when people are thinking causal analysis as well. Fundamentally though, think randomized controlled experiments. | |
Mechanistic Analysis | |
This analysis requires you have a mathematical model (equation) to represent some phenomenon. This model isn’t chosen for statistical convenience (e.g., Gaussian Model) but for scientific reasons. With this model chosen for scientific reasons, you subsequently aim to determine exactly how a variable influences another variable with the data that you have. Think doing statistical analysis with scientifically chosen models. | |
Data product is how you communicate the answer to your question. This can take the form of a presentation, a literate program, a blog post, a scholarly article, an interactive visualization, or a web/mobile/desktop/backend application. Who you are trying to communicate results to will influence what type of data product you end up creating. | |
If you read everything above, you definitely know the answer to this question now. Machine Learning is a type of analysis you *might* perform as part of Data Science. Stated another way, Machine Learning isn’t a necessary condition of Data Science (Statistics is though!). If you happen to be doing a predictive task, you’re reaching for supervised learning. If you happen to be doing descriptive/exploratory analysis, you *might* reach for unsupervised learning. As for reinforcement learning, it’s not as popular as supervised learning or unsupervised learning, and even less popular in Data Science. | |
HTH.","1" | |
"Mybridge","https://medium.com/mybridge-for-professionals/python-top-10-articles-for-the-past-month-48abfd3dd67c","16","{Python,""Data Science"",Programming,""Web Development"",""Software Development""}","143","2.70094339622641","Python Top 10 Articles For The Past Month.","In this observation, we ranked nearly 1,250 articles posted about Python in August 2016. | |
Mybridge AI evaluates the quality of content and ranks the best articles for professionals. This list is competitive and carefully includes quality content for you to read. You may find this condensed list useful in learning and coding more productively with Python. | |
Computational and Inferential Thinking for Data Science in Python — UC Berkeley | |
10 interesting Python modules to learn in 2016 | |
HackerMath: Introduction to Statistics and Basics of Mathematics for Data Science (Python data stack). | |
Generating fantasy maps in Python | |
1M rows/s from Postgres to Python — magicstack | |
Real-world data cleanup with Python and Pandas | |
Why You Should Learn Python | |
Stitch: A Python library for writing reproducible reports in markdown | |
The One Python Library Everyone Needs | |
Python Tutorial: Datetime Module — How to work with Dates, Times, Timedeltas, and Timezones | |
Learn Python 3 from scratch to become a developer in demand (Most popular as of September, 2016) | |
[580 recommends, 4.8/5 rating] | |
On Generative Algorithms | |
10 Free Data Visualization Tools | |
That’s it for Python Monthly Top 10. | |
If you like this curation, you can read Top 10 daily articles based on your skills on our iPhone & iPad app.","2" | |
"d1gi","https://medium.com/@d1gi/the-election2016-micro-propaganda-machine-383449cc1fba","20","{Journalism,Politics,""2016 Election"",""Data Visualization"",""Fake News""}","142","11.8971698113208","The #Election2016 Micro-Propaganda Machine","After finding evidence that much of the “fake” and hyper-biased news traffic during 🇺🇸#Election2016 was arriving through direct hyperlinks, search engines, and “old school” sharing tactics such as email newsletters, RSS, and instant messaging, I thought I would do a small “big data” project. | |
I wrote this piece because I feel the argument about Facebook’s role in influencing the outcome of the U.S. election doesn’t address the real problem: the sources of the fake/misleading/hyper-biased information. Sure, Google’s ad network and Facebook’s News Feed/“Related Stories” algorithms amplify the emotional spread of misinformation, and social media naturally turn up the volume of political outrage. At the same time, I think journalists, researchers and data geeks should first look into the factors that are actually 1) producing the content and 2) driving the online traffic. | |
Rather than analyze “known unknowns” with incomplete metrics and partial analytics (i.e., measuring the famously opaque Facebook engagement metrics), this analysis looks directly at the source. | |
There’s a vast network of dubious “news” sites. Most are simple in design, and many appear to be made from the same web templates. These sites have created an ecosystem of real-time propaganda: they include viral hoax engines that can instantly shape public opinion through mass “reaction” to serious political topics and news events. This network is triggered on-demand to spread false, hyper-biased, and politically-loaded information. | |
For this analysis, I’m calling it “fake news.” | |
It’s what I term the #MPM: the “micro-propaganda machine” — an influence network that can tailor people’s opinions and emotional reactions, and create “viral” sharing (😆LOL/haha/😡RAGE) episodes around what should be serious or contemplative issues. The increasing influence of this type of behavioral micro-targeting and emotional manipulation — data-driven “psyops” — has become more noticeable as people begin to reflect on the outcome of the recent #Brexit and U.S. election. |
In my previous post, I found that only ~60% of incoming traffic from a sample of leading “fake” and hyper-biased news sites seemed to be coming out of Facebook and Twitter. The remaining ~40% of web traffic was organic — coming from direct website visits, P2P shares, text/instant messaging, subscription e-newsletters, RSS, and search engines. Again: Less than 0.1% of the traffic to the sites I looked at came from display advertising or (known) paid search content. | |
My guess was that this network — the #MPM — of small “fake” and hyper-biased sites has been pushing traffic through links — and helping to inject this content into platforms like Facebook and Twitter. This effort was likely ramped up around the time the 🇺🇸#Election2016 primaries concluded, as well as any time a new political issue (involving email servers, groin grabbing, immigrants, etc.) takes place. | |
The data in my last piece showed mail.google.com (📧Gmail) being one of the top “upstream” sources of traffic coming into Infowars.com, an influential player in the right-wing news sphere. For this project, I did a medium-scale data analysis — crawling and indexing 117 websites that are known to be associated with the propagation of fake news content and the spread of what I’m calling “hyper-biased” propaganda. | |
For the purposes of looking directly at what some have termed the “alt-right” political propaganda machine, I kept the sources in this analysis restricted to sites that have been 🗯⚠️publicly called out by internet users and listed by editors on the following verification sites: Snopes, Fake News Watch, Real or Satire, and Media Bias Fact Check. | |
Due to the sensitivity of this type of research, I feel complete transparency is key: Below is my list of the 117 sites I scraped and indexed in my #MCM election data project. | |
I crawled 🕷 every website on the list and extracted URLs one “level” deep. This scraping effort, given the relatively basic structure of these template-based websites, represents the majority of links on these sites (735,263 of them, to be exact). | |
🔂After a couple of hours, my scraping/indexing effort resulted in more than 11,033 webpages and 735,263 hyperlinks. Out of this data set, there were 80,587 hyperlink connections — aka shared URLs — across the 117 fake news websites. |
I looked for patterns in the shared links to find what places these fake news websites seem to be linking to, as well as their most common inbound link destinations, and the structure of how the #MCM was embedded across the wider 🇺🇸#Election2016 mediascape. | |
{After exporting the dataset (.gexf file), I sorted out the news “network” at the widest scale using an open source tool, GEPHI, and the ForceAtlas2 algorithm. Any website with at least two shared URLs (links) to them from the 117 sites on my list above appear in my #MPM network graph. There were just over 2000 sites in the network, and all data obtained was publicly available and appeared on the websites as of 17-Nov-2016} | |
The circle, or “node,” size on the following graph(s) is proportional (1–100 scale) to the number of shared hyperlinks that link into the site from the 117 website sample. The colors are sorted according to actor type. | |
Red=🔴right-wing media; Purple=⚛government entities; Yellow=🤔interesting things; Blue=🔵social media; Green=✳️education; and the less prominent nodes were left gray. | |
The following website data map, called a network graph, can be used to reflect on #Election2016. It can help us discover: |
{What this data cannot show — at least, directly — is why these links exist or exactly when they were established. To put it simply, this map can show us the frequency and direction of “fake news” relationships, but can’t display the complete nature of the connections.} | |
Can Data Be Richer Than Trump? | |
This originally small project turned out to be an unexpectedly rich data capture — I could probably write about it for weeks. However, there are several fascinating themes that are displayed in this fringe-right propaganda network (see embed above for high resolution version). I’m publishing this now, since I feel it can help solve the #Election2016 equation. | |
First, as my previous post noted, the sites with the most inbound hyperlinks (the largest circles on the graph) in this fake news propaganda network are Google, YouTube, the NYTimes.com, Wikipedia, and strangely, Amazon.com. The larger the circle, the more links are coming in from the 117 #MCM network sites. | |
YouTube’s dominance was expected, as many sites — “left-wing,” “right-wing,” or otherwise —post links to videos, creator channels and documentary-style “educational” material. Again, the 🌀LARGEST circles are the domains that are linked to the MOST by the propaganda engine. I’ll come back to the separate “webpages” at the end — there’s a countdown of the “top ten” individual links. | |
You can see on the “zoomed out” graph (image below) the #MPM — i.e., ID’d right-wing, fake news, conspiracy, anti-science, hoax, pseudoscience, and right-leaning misinformation sites — in 🔴red. | |
If you look at the graph closely, you’ll see they basically surround most of the mainstream media, including the largest 🎨“liberal” media, on the network. This includes national newspapers like the New York Times, The Washington Post, and even “right-wing” media such as Breitbart.com, the Dailycaller.com, and the National Review. | |
The sites in the fake news and hyper-biased #MCM network have a very small “node” size — this means they are linking out heavily to mainstream media, social networks, and informational resources (most of which are in the “center” of the network), but not many sites in their peer group are sending links back. | |
The most influential sites line the 🌐periphery of the virtual propaganda network. You can see (image above) that many of the sites have a large flood of red hyperlinks flowing outward — some of these are in the thousands. For the purposes of this analysis, the red lines (each representing a different URL) “matter” most when they are headed towards the large nodes in the center of the network. | |
The #MCM network displays a high number of links to content creation and web asset-hosting services (Wordpress.com, Statcounter.com, WP.com, etc.). These likely are shared to help the website users produce content and measure the impact of their audiences. The zoomed-in views (see images below) also suggest that these fake news sites use social platforms to share as well as coordinate through hyperlinks: | |
I’ll explain: If you look around the largest picture of the network (see first image), you can see the 🤖🤖🤖 coordinating effect of individual page hyperlinks. In the next image, you can see how many larger red nodes have smaller “interest clusters” — these appear to consist primarily of Twitter accounts, public Facebook pages, and other miscellaneous issue-based websites: | |
In the next images (see below), there are an interesting number of links pointing to 👕consumer goods/commerce sites and ✂️digital production tools. These include CafePress (t-shirts), Feedburner (RSS news), and Addthis.com (social sharing scripts). | |
This could mean that the propaganda network may be using these resources internally to spread content around politically-themed news events and political debates, as well as to generate some income off of them. These links might be pushed into other locations on the internet, especially social platforms like Facebook and Twitter. Oddly, Amazon.com (see first graph, above) is also a top inbound link destination in the #MCM network. Further analysis is needed to uncover the type of content/resources all of these links point towards. |
Next, the #MCM network links heavily to a major poll site, Gallup, and crowdsourced fact-checking and reference resources —most notably Wikipedia, Reddit, and Wikimedia. Snopes and other fake news verification sites are in the “liberal” side of the network at the top-middle right (see the first large graph). | |
This is a preliminary data analysis, but beyond the specifics — like all network graphs — I feel the widest picture of the network (again, the first full-size network graph) is intriguing. The network is clearly split into several ideological regions: the ⬅️far left and ↖️top left areas have the most “alt right” and “hard right” actors; the ⏫middle top region shows a strong religious base as well as a strong anti-Islamic component. |
The ➡️far right side seems to be most 🎨“liberal,” and this side adds increasing numbers of governmental actors as it joins the harder-right religious conservative actors around the 🔼mid-to-upper center of the network. The ↙️bottom left region is primarily influential social media accounts, and the ↕️bottom center involves many international media outlets; similar to the upper half of the network, the lower half starts to pick up more university websites, environmental action and policy sites, and tech-oriented actors (e.g., EFF.com) as you move towards the ↘️bottom right. | |
To wrap up this post, I’m listing the most-shared non-domain links in the “micro-propaganda machine” network. This means the most commonly shared links (i.e., InDegree) out of the 80,587 URLs that link to individual pages (i.e., not the NYT front page, Facebook.com, Google email/searches). | |
⚡️The top 10 #MCM #Election2016 URL destinations: | |
📚Bonus: The large Amazon.com inbound link presence in the network appears to be through the fake news sites’ Amazon seller affiliate links. These links are for getting kickbacks on merchandise sold (books, magazines, etc.) from ads on their site, or through Amazon recommendations in original posts or book recommendations. | |
I translated the top individual Amazon.com link in the network through an affiliate code-matching website. It’s a subscription (through Amazon) to a conservative magazine: | |
This is just a 🔭glimpse of publicly available data related to the election. This post has a fair portion of what I found, but I do hope to look into the data more. I also plan to look at the exact opposite bias — meaning switching this network graph around and coding the “left-wing” websites to see what sorts of linking patterns play into 🎨liberal micro-propaganda from the recent election. | |
I hope this glimpse into a set of focused medium data offers another path to move forward, since I see little point in arguing about complex, ever-changing 🔢algorithms. | |
I also believe that platform-specific social network metrics are often more trouble than they are worth. For one, we don’t know how these proprietary measurement systems work for a reason: they involve💰multibillion-dollar business models and 🔐confidential IP. While Facebook’s engagement metrics are interesting to think about, they don’t really offer us much in the way of pinpointing the propaganda, misinformation, and viral/hoax clickbait that really shaped the election. | |
What does 📈“engagement” really mean? What does it drive? As Craig Silverman correctly stated at the end of his recent Buzzfeed “fake news” analysis, we don’t really know for sure: | |
I’ve tried to be 📝transparent in this analysis. I do expect to take some heat for the selective focus, which involves previously uncharted political data-journalistic waters. But I feel at this point ALL research involving fake news is a move in the right direction. | |
As I recently argued, turning around and blaming Facebook, Twitter, and Google for our 🌍 widespread social and cultural problems isn’t the best place to start. I mean, why look at the result when you can look at the 🎯 problem? That’s exactly what I’ve tried to do here. | |
🔬💼 Part II of this “fake news” research project ⤵️:","8" | |
"duhroach","https://medium.com/@duhroach/understanding-compression-69c874de6ad7","5","{Compression,""Data Science""}","142","5.76163522012579","Understanding Compression","Aka Why the hell did I write a book? | |
As engineers, we always have that dirty little side project that steals our attention. Things that don’t really relate to our day job, but they get us excited, and keep us passionate and up late at night obsessing over things we don’t know. | |
For me, that obsession (addiction really) is Data Compression. | |
I’ve never found an area of computer science that was just so brilliantly complex, misunderstood, and absolutely critical to the operation of modern computing. |
As such, I’ve written a book called Understanding Compression (available through O’Reilly or on Amazon), which is my humble attempt to get other engineers as excited about data compression as I am. |
As the final version of the book was sent off to the printers today (w00t!), I felt that it would be appropriate to give a little history on how it all got started. | |
My ability to learn data compression owes a great deal to the very generous time of all those experts who created the algorithms and then went on to document things for future engineers. |
There’s a ton of great, information-dense books out there on the topic. Here are a few of my favorites: |
Data Compression : The complete reference | |
Handbook of data compression | |
Managing gigabytes | |
Burrows Wheeler Transform | |
Variable Length Codes for Data Compression | |
Fundamental data compression | |
Compression algorithms for real programmers | |
I’ve read all those books, and they are absolutely fantastic. Full of brilliant insight, algorithm descriptions, and history on where these algorithms came from. For almost 30 years now, it’s been texts like the ones above that have been responsible for teaching engineers how to write compression algorithms. If you want to get into data compression, this is basically where you need to start. |
I remember lugging a copy of Data Compression : The complete reference around in my backpack during trips to conferences; for almost a decade now, it’s been my go-to for in-flight reading. | |
For all the research and reading, books on data compression tend to fall into two camps: | |
1) A language-specific discussion of the algorithms. Where you spend 2 pages describing how Arithmetic Compression works, and then 43 describing how to make it work in C++ where you don’t have arbitrary floating point division, and performance is a problem. | |
2) A math-specific discussion of the algorithms. Where you spend one paragraph discussing how things work, and then 2 pages of mathematical symbols which are used to describe the underpinnings of the theory. | |
If you stick with it, and re-read things like 40 times, it all starts to mesh together and make sense; but until then, both of these techniques create a large barrier to entry for engineers to understand and get excited about data compression. | |
This barrier is what has always bothered me about these types of books; they make it really hard for engineers with less than 5 years of CS experience AND a BS in Mathematics to get on board. They were technically accurate on a lot of levels, but really difficult to approach for the day-to-day engineer. |
For a long time now, I’ve wanted to try my hand at fixing this : How can we teach more engineers about Data Compression? | |
In 2014 we were able to film the Compressor Head video series, which, at its core, was always about two things: |
1) Explaining data compression in the simplest way possible. | |
2) Emulating Alton Brown as much as possible. | |
As such, on the first day of scripting, I put a big sign above my desk that said: | |
Explain it with pictures | |
Make it entertaining | |
This was my attempt at finding a middle ground to teach data compression: instead of using math or code to describe the algorithm (which focuses either too much on the math or too much on the syntax), focus on how the data moves through the algorithm and how the data structures are updated as a result. And describe it all with diagrams, sticky-notes, and physical props. I figured that if I could describe these really hard concepts using sticky-notes, pretty much anybody could learn it. |
The Make it entertaining part meant trying to keep your attention. I mean, Data compression is a dry subject, and is typically presented in a very dry manner. Not to mention that video is a medium that competes with cats on the internet. So to get people to want to learn it, I had to resort to puppets, interviews, physical gags and a horrible attempt at acting. | |
Although the views were never as large as some of the Android Performance Patterns content I’ve done, Compressor Head has always had a great reception by the folks who’ve seen the content: | |
Fun fact : I’ve been approached @ conferences more times for the Compressor Head series, than any of my work on Android Performance Patterns. | |
But there was always one thing wrong about Compressor Head: it never told a cohesive story. | |
The topics were a sort of hodgepodge sampling of names and algorithms which developers would generally search for, and find. |
As it turned out, really understanding these topics required them to be presented in context of other algorithms, and where things fit into the bigger picture. I mean, how does LZ and Arithmetic fit together? Where does Huffman fit in? Should I apply BWT before all that? Is any of this used in JPG? | |
I wanted to start addressing a lot of these gaps in Season 3 of the series, but sadly, the view counts weren’t high enough to warrant another season :( As such I needed to try a different approach. | |
When I originally approached O’Reilly about doing a book on data compression, the original title was “Fistfights with Claude Shannon”. | |
My hope was to present a text version of Compressor Head, where we explained algorithms with story, history, and diagrams, and again, no code. We could see how all the algorithms fit together in their grouping category, and see how they all played off each other to create large gains in compression. The intention was that you should understand the algorithm first, and then, once that’s done, dive into all the cruft that needed to happen at an implementation level. All while keeping things light, funny, and entertaining (well… as entertaining as a book could be). |
The title itself came from an observation I came to while writing the content : Most of the data transforms we go through (LZ, BWT, RLE etc) are about cheating the entropy value of a data stream. By transforming it into another form, we make the entropy value lower. So, in a way, data compression is all about sidestepping Claude Shannon’s work, while giving it a nod as we pass by. | |
In retrospect, the title is both infinitely awesome, and horrible for SEO. And thankfully, Aleks Haecky jumped on board to help rein in my obtuse ramblings, and turn the book into a cohesive story that made sense, and actually taught you something. | |
The result was the amazingly awesome tome we have today. | |
And that’s the whole point. I love data compression, and I want every computer engineer to love it too. It’s an amazing skillset that’s useful to have, no matter what your sub-industry or profession. So for about 8 years now, I’ve been trying to teach other people about data compression, and get them excited about it too. | |
Even if you don’t buy our amazing, awesome, fantastic book, you should still go learn about Data Compression. Read the other books. Watch the other videos. Download samples. Whatever. Just go do it. | |
Because, as I like to say: Every Bit Counts","5"
"srobtweets","https://medium.com/hacker-daily/which-programming-languages-have-the-happiest-and-angriest-commenters-ebe91b3852ed","4","{Bigquery,""Big Data"",Stackoverflow,Programming,""Programming Languages""}","142","2.63584905660377","Which programming languages have the happiest (and angriest) commenters?","It’s officially winter, so what could be better than drinking hot chocolate while querying the new Stack Overflow dataset in BigQuery? It has every Stack Overflow question, answer, comment, and more — which means endless possibilities of data crunching. Inspired by Felipe Hoffa’s post on how response time varies by tag, I wanted to look at the comments table (53 million rows!). | |
To measure happy comments I looked at comments with “thank you”, “thanks”, “awesome” or “:)” in the body. I limited the analysis to tags with more than 500,000 comments. Here’s the query: | |
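The embedded query itself isn’t reproduced in this excerpt, so here is a rough sketch of what a query along these lines could look like, run from Python with the google-cloud-bigquery client. The table and column names are assumptions based on the public Stack Overflow dataset in BigQuery, not necessarily the exact SQL used:

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Count "happy" comments per tag, keeping only tags with > 500,000 comments.
sql = r'''
SELECT
  tag,
  COUNT(*) AS total_comments,
  COUNTIF(is_happy) AS happy_comments,
  ROUND(COUNTIF(is_happy) / COUNT(*), 4) AS happy_ratio
FROM (
  SELECT
    SPLIT(q.tags, '|') AS tags,
    REGEXP_CONTAINS(LOWER(c.text), r'thank you|thanks|awesome|:\)') AS is_happy
  FROM `bigquery-public-data.stackoverflow.comments` AS c
  JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
    ON c.post_id = q.id
) AS joined, UNNEST(joined.tags) AS tag
GROUP BY tag
HAVING COUNT(*) > 500000
ORDER BY happy_ratio DESC
'''

for row in client.query(sql).result():
    print(row.tag, row.total_comments, row.happy_ratio)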
Here’s the result in BigQuery: | |
And the chart: | |
R, Ruby, HTML / CSS, and iOS are the communities with the happiest commenters according to this list. People who ask questions about XML and regular expressions also seem particularly thankful for help. If you’re curious, here are the 15 highest scoring happy comments that were short enough to fit in a screenshot (and their associated tags) : | |
But because people sometimes get angry on the internet, you’re probably wondering… | |
For angry comments, I counted those with “wrong”, “horrible”, “stupid”, or “:(” in the body. The SQL is the same as above with the search terms swapped out. Here’s the result: | |
And the chart: | |
Clearly the angriest comments are those related to C derivatives. Many programming concepts also wound up here: multithreading, arrays, algorithms, and strings. And here are the highest scoring angry comments: | |
This analysis is not perfect, as the comment “that one’s so stupid it underflows and becomes awesome” appears in both lists. That’s where a machine learning tool like the Natural Language API would come in handy. | |
Between the two lists there were only a few tag overlaps. The most excitable tags (I’m interpreting tags that showed up in both the happy and angry list as ‘excitable’) are: ios, iphone, objective-c, and regex questions. And while the internet may seem like a dark place sometimes, there appear to be roughly six happy comments for every angry one.
Dive into the Stack Overflow dataset, or check out some of these awesome posts to get inspired: | |
If you have comments or ideas for future analysis, find me on Twitter @SRobTweets.","2" | |
"quincylarson","https://medium.com/free-code-camp/code-briefing-nasa-will-release-all-their-research-as-open-data-bbfc84cb5e4b","1","{""Data Science"",""Life Lessons"",""Web Development"",Tech,Design}","140","0.679245283018868","Code Briefing: NASA will release all their research as Open Data","Here are three stories we published this week that are worth your time: | |
Bonus: If you want to learn more about data science but don’t know where to start, check out Nate Silver’s The Signal and the Noise: Why So Many Predictions Fail — but Some Don’t. | |
You can get the audiobook for free with a free trial of Audible, then learn while you commute: 15 hour listen | |
Happy coding, | |
Quincy Larson, teacher at Free Code Camp","3" | |
"ssprockett","https://medium.com/bytesized-treats/big-data-and-milton-glaser-9de489967c2e","3","{""Milton Glaser"",Design,""Data Science""}","131","3.92735849056604","Big Data vs. Milton Glaser","","2" | |
"dimitrisspathis","https://medium.com/cuepoint/visualizing-hundreds-of-my-favorite-songs-on-spotify-fe50c94b8af3","12","{Spotify,""Data Science"",""Data Visualization"",Music,""Music Business""}","131","7.70471698113208","Visualizing Hundreds of My Favorite Songs on Spotify","","20" | |
"mcgowankat","https://medium.com/backchannel/the-man-who-dissected-his-own-brain-a787161b6828","7","{Cancer,Brain,""Data Science""}","131","3.78584905660377","The Man Who Dissected His Own Brain","","1" | |
"binroot","https://medium.com/hacker-daily/getting-to-know-tensorflow-8873fdee2b68","5","{""Machine Learning"",TensorFlow,Programming,""Data Science""}","129","10.4069182389937","Getting to Know TensorFlow","This article was excerpted from Machine Learning with TensorFlow. | |
Before jumping into machine learning algorithms, you should first familiarize yourself with how to use the tools. This article covers some essential advantages of TensorFlow, to convince you it’s the machine learning library of choice. | |
As a thought experiment, let’s imagine what happens when we write Python code without a handy computing library. It’ll be like using a new smartphone without installing any extra apps. The phone still works, but you’d be more productive if you had the right apps. | |
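To make that concrete, here is a small sketch of computing the inner product of two vectors in plain Python, with no libraries at all (the vectors are just illustrative):

# Inner (dot) product of two vectors in plain Python, no libraries.
x = [1, 2, 3]
y = [4, 5, 6]

product = 0
for i in range(len(x)):
    product += x[i] * y[i]

print(product)  # 1*4 + 2*5 + 3*6 = 32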
That’s a lot of code just to calculate the inner-product of two vectors (also known as dot product). Imagine how much code would be required for something more complicated, such as solving linear equations or computing the distance between two vectors. | |
By installing the TensorFlow library, you also install a well-known and robust Python library called NumPy, which facilitates mathematical manipulation in Python. Using Python without its libraries (e.g. NumPy and TensorFlow) is like using a camera without autofocus: you gain more flexibility, but you can easily make careless mistakes. It’s already pretty easy to make mistakes in machine learning, so let’s keep our camera on auto-focus and use TensorFlow to help automate some tedious software development. | |
Listing 2 shows how to concisely write the same inner-product using NumPy. | |
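The listing isn’t reproduced in this excerpt, but the NumPy version boils down to a single call:

import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print(np.dot(x, y))  # 32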
Python is a succinct language. Fortunately for you, that means you won’t see pages and pages of cryptic code. On the other hand, the brevity of the Python language implies that a lot is happening behind each line of code, which you should familiarize yourself with carefully as you work. | |
This article is geared toward using TensorFlow for computations, because machine learning relies on mathematical formulations. After going through the examples and code listings, you’ll be able to use TensorFlow for some arbitrary tasks, such as computing statistics on big data. The focus here will entirely be about how to use TensorFlow, as opposed to machine learning in general. | |
Machine learning algorithms require a large amount of mathematical operations. Often, an algorithm boils down to a composition of simple functions iterated until convergence. Sure, you might use any standard programming language to perform these computations, but the secret to both manageable and performant code is the use of a well-written library. | |
That sounds like a gentle start, right? Without further ado, let’s write our first TensorFlow code! | |
First, we need to ensure that everything is working correctly. Check the oil level in your car, repair the blown fuse in your basement, and ensure that your credit balance is zero. | |
Just kidding, I’m talking about TensorFlow. | |
Go ahead and create a new file called test.py for our first piece of code. Import TensorFlow by running the following script: | |
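The script is just the single canonical import:

import tensorflow as tf  # import the library under its usual tf alias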
This single import prepares TensorFlow for your bidding. If the Python interpreter doesn’t complain, then we’re ready to start using TensorFlow! | |
The TensorFlow library is usually imported with the tf qualified name. Generally, qualifying TensorFlow with tf is a good idea to remain consistent with other developers and open-source TensorFlow projects. You may choose not to qualify it or change the qualification name, but then successfully reusing other people’s snippets of TensorFlow code in your own projects will be an involved process. | |
Now that we know how to import TensorFlow into a Python source file, let’s start using it! A convenient way to describe an object in the real world is by listing out its properties, or features. For example, you can describe a car by its color, model, engine type, and mileage. An ordered list of some features is called a feature vector, and that’s exactly what we’ll represent in TensorFlow code. | |
Feature vectors are one of the most useful devices in machine learning because of their simplicity (they’re lists of numbers). Each data item typically consists of a feature vector, and a good dataset has hundreds, if not thousands, of these feature vectors. No doubt, you’ll often be dealing with more than one vector at a time. A matrix concisely represents a list of vectors, where each column of a matrix is a feature vector.
The syntax to represent matrices in TensorFlow is a vector of vectors, each of the same length. Figure 1 is an example of a matrix with two rows and three columns, such as [[1, 2, 3], [4, 5, 6]]. Notice, this is a vector containing two elements, and each element corresponds to a row of the matrix. | |
We access an element in a matrix by specifying its row and column indices. For example, the first row and first column indicate the first top-left element. Sometimes it’s convenient to use more than two indices, such as when referencing a pixel in a color image not only by its row and column, but also its red/green/blue channel. A tensor is a generalization of a matrix that specifies an element by an arbitrary number of indices. | |
The syntax for tensors is even more nested vectors. For example, a 2-by-3-by-2 tensor is [[[1,2], [3,4], [5,6]], [[7,8], [9,10], [11,12]]], which can be thought of as two matrices, each of size 3-by-2. Consequently, we say this tensor has a rank of 3. In general, the rank of a tensor is the number of indices required to specify an element. Machine learning algorithms in TensorFlow act on Tensors, and it’s important to understand how to use them. | |
It’s easy to get lost in the many ways to represent a tensor. Intuitively, each of the following three lines of code in Listing 3 is trying to represent the same 2-by-2 matrix. This matrix represents two features vectors of two dimensions each. It could, for example, represent two people’s ratings of two movies. Each person, indexed by the row of the matrix, assigns a number to describe his or her review of the movie, indexed by the column. Run the code to see how to generate a matrix in TensorFlow. | |
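The listing isn’t reproduced here, but based on the description that follows, a sketch of it looks roughly like this: the same matrix defined as a plain list, as a NumPy ndarray, and as a TensorFlow constant, each then passed through tf.convert_to_tensor:

import numpy as np
import tensorflow as tf

m1 = [[1.0, 2.0],
      [3.0, 4.0]]                                   # a plain Python list of lists
m2 = np.array([[1.0, 2.0],
               [3.0, 4.0]], dtype=np.float32)       # a NumPy ndarray
m3 = tf.constant([[1.0, 2.0],
                  [3.0, 4.0]])                      # already a TensorFlow Tensor

t1 = tf.convert_to_tensor(m1, dtype=tf.float32)
t2 = tf.convert_to_tensor(m2, dtype=tf.float32)
t3 = tf.convert_to_tensor(m3, dtype=tf.float32)

# All three are now Tensor objects, so the same type prints three times.
print(type(t1))
print(type(t2))
print(type(t3))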
The first variable (m1) is a list, the second variable (m2) is an ndarray from the NumPy library, and the last variable (m3) is TensorFlow’s Tensor object. All operators in TensorFlow, such as neg, are designed to operate on tensor objects. A convenient function we can sprinkle anywhere to make sure that we’re dealing with tensors, as opposed to the other types, is tf.convert_to_tensor( … ). Most functions in the TensorFlow library already perform this function (redundantly), even if you forget to. Using tf.convert_to_tensor( … ) is optional, but I show it here because it helps demystify the implicit type system being handled across the library. The aforementioned listing 3 produces the following output three times: | |
Let’s take another look at defining tensors in code. After importing the TensorFlow library, we can use the constant operator as follows in Listing 4. | |
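The listing itself isn’t included in this excerpt; judging from the output described below, it defines three constants of different shapes, along these lines (a sketch, not the book’s exact code):

import tensorflow as tf

matrix1 = tf.constant([[1., 2.]])                       # a 1x2 matrix of floats
matrix2 = tf.constant([[1],
                       [2]])                            # a 2x1 matrix of integers
matrix3 = tf.constant([[[1, 2], [3, 4], [5, 6]],
                       [[7, 8], [9, 10], [11, 12]]])    # a rank-3 tensor

print(matrix1)
print(matrix2)
print(matrix3)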
Running listing 4 produces the following output: | |
As you can see from the output, each tensor is represented by the aptly named Tensor object. Each Tensor object has a unique label (name), a dimension (shape) to define its structure, and data type (dtype) to specify the kind of values we’ll manipulate. Because we didn’t explicitly provide a name, the library automatically generated them: “Const:0”, “Const_1:0”, and “Const_2:0”. | |
Notice that each of the elements of matrix1 ends with a decimal point. The decimal point tells Python that the data type of the elements isn’t an integer, but instead a float. We can pass in explicit dtype values. Much like NumPy arrays, tensors take on a data type that specifies the kind of values we’ll manipulate in that tensor.
TensorFlow also comes with a few convenient constructors for some simple tensors. For example, tf.zeros(shape) creates a tensor with all values initialized at zero of a specific shape. Similarly, tf.ones(shape) creates a tensor of a specific shape with all values initialized at one. The shape argument is a one-dimensional (1D) tensor of type int32 (a list of integers) describing the dimensions of the tensor. | |
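For example, the following creates a 2-by-3 tensor of zeros and a 500-by-500 tensor of ones:

import tensorflow as tf

zeros_tensor = tf.zeros([2, 3])     # 2x3 tensor, every value 0.0
ones_tensor = tf.ones([500, 500])   # 500x500 tensor, every value 1.0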
Now that we have a few starting tensors ready to use, we can apply more interesting operators, such as addition or multiplication. Consider each row of a matrix representing the transaction of money to (positive value) and from (negative value) another person. Negating the matrix is a way to represent the transaction history of the other person’s flow of money. Let’s start simple and run the negation op (short for operation) on our matrix1 tensor from listing 4. Negating a matrix turns the positive numbers into negative numbers of the same magnitude, and vice versa. | |
Negation is one of the simplest operations. As shown in listing 5, negation takes only one tensor as input, and produces a tensor with every element negated — now, try running the code yourself. If you master how to define negation, it’ll provide a stepping stone to generalize that skill to all other TensorFlow operations. | |
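The listing isn’t reproduced here; a minimal sketch of it would be the following (the text calls the op neg, while in recent TensorFlow 1.x releases the function is spelled tf.negative):

import tensorflow as tf

x = tf.constant([[1, 2]])      # a small 1x2 matrix to negate
negMatrix = tf.negative(x)     # an op that negates every element

print(negMatrix)               # prints the Tensor object, not the values (yet)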
Listing 5 generates the following output: | |
The official documentation carefully lays out all available math ops: https://www.tensorflow.org/api_docs/Python/math_ops.html. | |
Some specific examples of commonly used operators include: | |
Most mathematical expressions such as “*”, “-“, “+”, etc. are shortcuts for their TensorFlow equivalent, for the sake of brevity. The Gaussian function includes many operations, and it’s cleaner to use some short-hand notations as follows: | |
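The listing that followed isn’t included in this excerpt; as an illustration of the shorthand, a one-dimensional Gaussian density written with overloaded operators might look like this sketch (the variable names are purely illustrative):

import math
import tensorflow as tf

mean, sigma = 0.0, 1.0
x = tf.constant([-1.0, 0.0, 1.0])

# Gaussian probability density using the overloaded -, /, *, ** operators
# instead of the longhand tf.subtract, tf.divide, tf.square, and so on.
gauss = tf.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))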
As you can see, TensorFlow algorithms are easy to visualize. They can be described by flowcharts. The technical (and more correct) term for the flowchart is a graph. Every arrow in a flowchart is called the edge of the graph. In addition, every state of the flowchart is called a node. | |
A session is an environment of a software system that describes how the lines of code should run. In TensorFlow, a session sets up how the hardware devices (such as CPU and GPU) talk to each other. That way, you can design your machine learning algorithm without worrying about micro-managing the hardware that it runs on. Of course, you can later configure the session to change its behavior without changing a line of the machine learning code. | |
To execute an operation and retrieve its calculated value, TensorFlow requires a session. Only a registered session may fill the values of a Tensor object. To do so, you must create a session class using tf.Session() and tell it to run an operator (listing 6). The result will be a value you can later use for further computations. | |
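The listing isn’t reproduced here; a minimal sketch that registers a session and runs the negation op from above looks like this:

import tensorflow as tf

x = tf.constant([[1, 2]])
negMatrix = tf.negative(x)

with tf.Session() as sess:          # register a session
    result = sess.run(negMatrix)    # evaluate the op and fetch its value

print(result)                       # [[-1 -2]]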
Congratulations! You have just written your first full TensorFlow code. Although all it does is negate a matrix to produce [[-1, -2]], the core overhead and framework are just the same as everything else in TensorFlow. | |
You can also pass options to tf.Session. For example, TensorFlow automatically determines the best way to assign a GPU or CPU device to an operation, depending on what is available. We can pass an additional option, log_device_placement=True, when creating a Session, as shown in listing 7.
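The listing isn’t reproduced here either; the option is passed in through the session’s config, roughly like this sketch:

import tensorflow as tf

x = tf.constant([[1, 2]])
negMatrix = tf.negative(x)

# Ask the session to log which device (CPU or GPU) runs each operation.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    result = sess.run(negMatrix)

print(result)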
This outputs info about which CPU/GPU devices are used in the session for each operation. For example, running listing 7 results in traces of output like the following, showing which device was used to run the negation op:
Sessions are essential in TensorFlow code. You need to call a session to actually “run” the math. Figure 4 maps out how the different components of TensorFlow interact with the machine learning pipeline. A session not only runs a graph operation, but can also take placeholders, variables, and constants as input. We’ve used constants so far, but in later sections we’ll start using variables and placeholders. Here’s a quick overview of these three types of values.
That’s it for now, I hope that you have successfully acquainted yourself with some of the basic workings of TensorFlow. If this article has left you ravenous for more delicious TensorFlow tidbits, please go download the first chapter of Machine Learning with TensorFlow and see this Slideshare presentation for more information (and a discount code).","1" | |
"abhinemani","https://medium.com/@abhinemani/data-driven-policy-san-francisco-just-showed-us-how-it-should-work-c7725e0e2b40","2","{Cycling,Transportation,""Data Science"",Government,""Open Data""}","129","3.4437106918239","“Data-Driven Policy”: San Francisco just showed us how it should work.","For any city dweller brave enough to navigate their urban terrain on a bike — I am not — you know that it’s a tricky business. Limited bike paths, wayward pedestrians, and of course: cars. Auto collisions with bikes (and also pedestrians) poses a real threat to the safety and wellbeing of residents. Indeed, I have a friend who was “doored” twice in one month in San Francisco, which for those who don’t know, means that a passenger opened his/her door exactly when his bike was passing by, leaving him in the hospital and with scars for about 6 months. | |
But more than temporary injuries, auto collisions with bikes and pedestrians can kill people. And it does at an alarming rate. According to the city, “Every year in San Francisco, about 30 people lose their lives and over 200 more are seriously injured while traveling on city streets.” | |
As urban life becomes the dominant modality, this is a problem that needs to be addressed, and addressed now. | |
The city government, in good fashion, made a commitment to do something about it. But in better fashion, they decided to do so in a data-driven way. And they tasked the Department of Public Health, in collaboration with the Department of Transportation, with developing policy. What’s impressive is that instead of some blanket policy or mandate, they opted to study the problem, take a nuanced approach, and put data first.
The SF team ran a series of data-driven analytics to determine the causes of these collisions. They developed TransBase to continuously map and visualize traffic incidents throughout the city. Using this platform, then, they developed the “high injury network” — the key places where most problems happen; or as they put it, “to identify where the most investments in engineering, education and enforcement should be focused to have the biggest impact in reducing fatalities and severe injuries.” It turns out that just 12 percent of intersections account for 70 percent of major injuries. This is using data to make what might seem like an intractable problem tractable.
It’s much easier to commit city resources to a handful of targeted problem areas, instead of dispatching crews to thousands and thousands of intersections. And with the data as a backup, there shouldn’t be concerns of political or personal bias: the data speaks for itself.
So now what? Well, this month, Mayor Ed Lee signed an executive directive to challenge the city to implement these findings under the banner of “Vision Zero”: a goal of reducing auto/pedestrian/bike collision deaths to zero by 2024. | |
Now, having worked in and with City Government for some time, I’m the first to tell you that an ED on its own means little more than the ink on the paper. (And well now, there’s hardly ever ink.) Unless there’s an implementation plan, it becomes like so many other policies, just a quick press hit and then silence. No real action, no real reform.
Fortunately, San Francisco took the next step: they put their data to work. | |
This week, the city of San Francisco announced plans to build its first “Protected Intersection”: | |
That’s apparently just the start: plans are underway for other intersections, protected bike lanes, and more. Biking and walking in San Francisco is about to become much safer. (Though maybe not easier: the hills — they’re the worst.) | |
*** | |
There is ample talk of “Data-Driven Policy” — indeed, I’ve written about it myself — but too often we get lost in the abstract or theoretical. This is real. People are getting hurt, and data, leveraged by the city, helped. | |
The broader lesson for me here is the sequencing. Data didn’t come in at the backend of policy to fix it; there wasn’t some data science that ended up sitting on the shelves. No, the city went from problem to analysis to policy to implementation. | |
That’s how it should work. | |
My thanks to my friends in the City of Los Angeles government who first taught me about the Vision Zero programs across the country: Seleta Reynolds, Mike Manalo and others. And of course, my applause to the City of San Francisco’s DPH, DOT, and data science teams for making this a reality.","1" | |
"chantastique","https://medium.com/@chantastique/design-concept-visualizing-conversation-50fb45cd2522","6","{Design,""Data Visualization"",Social}","127","4.7311320754717","Visualizing Conversation","Internet forums have been around since the digital Stone Age, acting as open asynchronous communication tools for the every-man. Nowadays, they’ve inherited a second use: they are also an archive. The conversations, questions and community are valuable resources, containing both factual information and historical records of the internet culture that are still useful years after the fact. | |
But, this information is currently trapped: forums and their contained conversations are hard to navigate through and often have chunky, ineffective search. | |
How can we tap into the archived information in these dated conversations and tame the unruliness of the forum? | |
Some psychologists have proposed that reading is not just noting the individual letters and mentally stringing them together; it’s also partially recognizing the form of the word. This is because letters have predictable configurations. | |
Similar to individual words, the structure of a post has a somewhat predictable pattern. Guided by form and shape of the text within the post, we can make educated guesses on its nature. Depending on the context, big blocks could be explanations, guides or rants. Blocks with short, jagged lines could be lists or lyrics. Tiny little blocks might be one-line zingers, reactions or pleasantries. | |
Following that idea, can we make an inference about a conversation’s nature if we see its structure? | |
Currently, however, one cannot see the conversation’s structure in its entirety, as most forums have bulky formatting. This makes recognizing patterns over the course of the conversation difficult. | |
What if we could see the entire conversation (or at least a bigger chunk of it) at once? And in addition to seeing the conversation at a fuller scale, what if we could tailor the visualization to our requirements?
In the mockup above, you see two sections. The left section is the conversation overall; individual posts are shrunken down dramatically, leaving the text hard to decipher but the shape more apparent. Let’s call this section the scroller. | |
The right section has a full-sized post, with its place in the context of the conversation indicated by the two arrows in the scroller. Clicking a post in the scroller brings it into the right-section for closer inspection. | |
In this particular visualization, a blue border around a block in the scroller indicates posts by the original poster. In the scrollbar, we also see blue chunks, informing us of original poster comments below the fold. | |
Let’s take this baby for a spin and pretend the original post in the above example was a question. If I was a user hunting for the answer to the original poster’s question, using this visualization, I could quickly scroll to where the original poster last spoke to see if they got the answer. (Their last post is a thin line on the scrollbar, suggesting possibly a short, “thank you!” post.) | |
We also see that the original poster dwindled away, but the thread continued. Likely, the topic went off-course, and the original poster had no interest after receiving the answer. | |
Here’s another example, where we give a colour to every user, not just the original poster, who is still represented by the sky blue colour. | |
Using the visualization, we can get a better understanding of the conversation flow. We see the original poster wrote something short, conversed briefly with another user, and then the thread exploded into magenta and orange: two users bouncing off each other with increasing post length with the original poster nowhere to be seen. What do you think is happening here? | |
If you guessed ‘heated debate’, ten points!* This example was based on this delightful bodybuilding thread, where users Justin-27 and TheJosh cannot agree on the number of days in a week, resulting in verbal warfare and a lot of bad math. Using the visualization, you don’t have to dig through all five pages of the alpha-bro-count-off. (Although you could, just for the entertainment value.)
Expanding on this concept, we could start calling out other facets of the conversation. Where do question marks appear? Where did a moderator or administrator post? Where do keywords appear? There are lots of ways a bird’s-eye view can help us avoid drowning in seas of pagination. | |
Please share! | |
*Points are non-redeemable. | |
Liked this post? You may also enjoy: How to Read a Medium Post","3" | |
"fredbenenson","https://medium.com/hacker-daily/on-to-the-next-2-271-days-309d6ba672d7","6","{""Data Science"",Startup,Kickstarter}","127","6.62169811320755","On to the next 2,271 days…","“All or nothing” describes Kickstarter’s unique model of fundraising — creators set a funding goal and a deadline by which to hit it. If they miss the mark, no money changes hands. It has a way of really focusing a collective energy towards that single goal. It’s also a mindset that has shaped my career and philosophy working for the company. | |
Now, after 2,271 days, I feel my project’s deadline has arrived, and I’ve reached my goal. It’s been an incredible and rewarding ride, and set the bar in ways that I didn’t even think possible. | |
In 2009, when I joined as our second full-time employee in New York, we didn’t even have an office. That didn’t matter — I had just used the site to raise money for Emoji Dick and remember thinking that if Kickstarter could make an emoji translation of Moby Dick possible, it was clear the product and community had a bright future. | |
Soon, I heard a vague description of a potential space on the Lower East Side that had a jacuzzi in it. The jacuzzi turned out to be a very large bath tub with a small crystal chandelier hanging over it — we would later use it to do our dishes since the kitchen sink was broken. | |
155 Rivington was cold and empty when we arrived but warmed up as the team grew and projects rolled in. Our landlord turned out to be very flexible and would later let us take over most of the building. We also discovered that Rancid filmed the video for “Time Bomb” on the top floor. | |
Early on, I was helping out as best I could, trying to pull my weight and contribute as much as possible. My previous job was at Creative Commons where I had worked in outreach and on some product ideas, but nothing had prepared me for the pace and demands of Kickstarter. It seemed like every day confronted us with a new issue to think about and a new decision to make. Suddenly the concept of being “in business” and creating “a product” didn’t feel so foreign — it was the consequence of good judgment, trust, great people, and focus. It’s hard to describe how invigorating that mix was: We were putting something good into the world that people wanted, and that thing was working. | |
While I had studied philosophy and computer science in undergrad and graduated ITP in 2008, I didn’t arrive at Kickstarter with an interest in software or engineering. But I was determined to use my background and skills to provide value to the company, so this led me to pick up some of the weirder and less glamorous technical tasks floating around. It turned out a lot of those involved using data to answer thorny questions about what did (or did not happen) on the site. | |
My early work eventually coalesced into a research and development role focused on supplying insights and building data-driven products. I wrote queries, pushed Google Analytics to its limits, published OKCupid-style blog posts with Yancey, and built lots of dashboards and graphs for our community and staff. It was captivating work, and I loved bringing data into every conversation I could. As the data scene in New York blossomed, I attended meetups and taught myself just enough statistics and math to make it all work. | |
At some point I even volunteered to be our full-time recruiter. That didn’t work out so well—it turned out that my interests and skills were better suited to the product and development side of the business — but it’s to Kickstarter’s credit that we were both able to recover from that experiment. Most importantly, I came away from the experience with a greater appreciation of how difficult it is to recruit good people. I took what I learned and went all-in on data at Kickstarter — recruiting an incredibly talented team of data scientists, analysts, and engineers, and leading that team as our first VP of Data. | |
Throughout it all, Kickstarter has never wavered from its focus and mission. Every day we get to watch people’s dreams come true. Our product, and I say this without hyperbole, changes people’s lives and creates good in the world. That mandate is not without risk, and things don’t always go smoothly with Kickstarter projects. But we’ve always leaned in and focused on trying to do the right thing. | |
These principles distinguish Kickstarter from its peers in the technology world, and our recent reincorporation as a Public Benefit Corporation has further cemented that distinction. | |
Kickstarter opened my eyes to how a for-profit company could both do well and do good — previously I was convinced the only way to make the world a better place was to work in the nonprofit or government sector. I now know that’s not true. | |
Tim O’Reilly has some terrific thoughts on working on stuff that matters that are very much in line with what I’ve learned here. | |
Years later, after settling into our home in Greenpoint, we’ve used data to build some amazing things, and we now have a fully stocked team and a sophisticated infrastructure for doing any kind of research we’d like.
One of the things I began to notice about sticking around for so long is that the value of my institutional knowledge has started to compete with the value of my active knowledge. Similarly, I’ve accrued intuitions about doing things with data that are really just a composite of first-hand experience and industry best practices. | |
I figured now would be a good time to share some of those. So in no particular order, here’s a list of things I’ve discovered over the years while running Kickstarter’s data operations: | |
Some of these things I’ve come to The Hard Way, and others I’ve been lucky enough to learn from trusted friends. Regardless, it’s been wonderful to do it with such supportive peers and in such a collaborative environment. | |
And that is precisely why it is so bittersweet to be moving on. | |
But someone once told me never to get too comfortable. And that’s actually a large part of the reason why I decided it was time — a lot of the things I set out to do early in Kickstarter history feel complete. We’ve built our own data infrastructure, we have tremendous capacity for doing research and hypothesis testing, and data now informs virtually every decision inside the company, from the senior team to our community support. | |
February 19 will be my final day, but until then I will be helping out searching for my replacement. Know anyone who loves data science and wants to lead a team at one of the most exciting places in New York? | |
I am not sure whether my next gig will be focused on data or machine learning, though it is hard to deny all the hype surrounding AI and deep learning frameworks— I will certainly be exploring some projects and issues in that space. What I do know is that whatever I do next professionally, it’ll involve thinking scientifically about things, and I plan on bringing everything I’ve learned about working in a fast-paced values-driven environment to bear. | |
So I’m excited to go it alone to seek some of my own all or nothing. In the short term I’ll be sticking around NYC, then doing a bit of travel, some more surfing, piano, and hopefully writing a lot more.","4" | |
"richarddmorey","https://medium.com/@richarddmorey/about-trumps-hands-86fd9a2c7c5","7","{""Donald Trump"",Politics,Statistics,Science,""Data Science""}","123","4.90660377358491","About Trump’s hands…","Update: You want the data and code? Here they are! | |
Update 2: In order to fend off any potential lawsuits, let me point out that an alternative interpretation is that “Donald Trump is very tall for his hands.” | |
While Donald Trump’s campaign attempts to stop hemorrhaging support, those of us who are interested in data can focus on more interesting things. We got a treat today in the form of an actual measurement of Donald Trump’s hands. Donald Trump’s hands have been a campaign issue for months, and with the election less than 100 days away, it is critical that we get to the bottom of the question everyone is asking: “Are Donald Trump’s hands really that small?” | |
The TL;DR version of this post is “Donald Trump’s hands are smaller than 84% of people of his height. So…yes. They are that small.” As a statistician and educator, I feel the need to tell you how I arrived at that number. Hopefully you’ll stick with me and learn a little something about statistics along the way. | |
The Hollywood Reporter — where I get all my celebrity anthropometry news — reported that Donald Trump’s hands were 7 1/4 inches long. I wanted to see how that stacked up to the general population. This required that I find a suitable dataset, and, thankfully, I found the 1988 US Army Anthropometry Survey (ANSUR) data freely available on Matthew Reed’s website at the University of Michigan, which includes measurements from 1774 servicemen. This dataset is a convenient choice; after all, we can be sure that Donald Trump isn’t already included in the data set because he never served (those darn heel spurs kept him from getting that purple heart he always wanted). | |
I needed to make sure that hand length was measured the same way in the ANSUR survey as Donald Trump’s hand was measured. As it turns out, I had to first correct Donald Trump’s reported hand length by about 6% because ANSUR measured from a point on the wrist. I wanted to make sure I got this right to avoid a libel lawsuit from Trump’s lawyers. | |
Trump’s corrected hand length is 198 mm (that’s 7.8 inches for my American readers). The Hollywood Reporter reported that the length of Trump’s hands was “slightly less than that of the average man;” so how does Trump’s hand length compare to the men in the Army survey? | |
The average serviceman has a hand length of 194 mm, meaning that Trump’s hands are somewhat larger than the average serviceman. In fact, Trump’s hands are larger than 71% of servicemen. The Hollywood Reporter claimed that the length of his hands were “slightly less than that of the average man,” but I think this is because they did not correct the measurement properly. Hollywood Reporter, if you need a statistical consultant to avoid these embarrassing mistakes in the future, I’m available at a reasonable rate. | |
So, this is the end of the story, right? Donald Trump’s hands are larger than average. Should I expect a call from Trump’s lawyers to serve as an expert witness in a slander lawsuit against Marco Rubio? Well, no. Donald Trump is a tall man; 1880 mm (6 foot 2 inches) tall, in fact. He is taller than 97% of the ANSUR sample. You may have a hunch where this is going: height and hand length are highly correlated; if you’re tall, you would be expected to have large hands. Trump is very tall, but has average hands. Your hunch should be this: compared to people his height, he probably has small hands.
We can quantify this hunch more precisely. The scatterplot to the left shows every serviceman in the ANSUR sample as a point. It is obvious from the scatterplot that tall people tend to have larger hands. We can actually describe this relationship with a technique called “linear regression,” which allows us to use all the servicemen’s data to work out the expected hand length for someone of Trump’s height.
The scatterplot on the left shows the fitted linear regression line (dashed) that best characterizes the relationship between height and hand length. This line has equation: | |
length = 27 + (0.095 × height) | |
So for a person of Trump’s height (1880 mm), we’d expect them to have a hand length of about 206 mm. The relationship is not perfect, obviously; for every height, the measured hand lengths vary around the line. However, knowing someone’s height does allow us to make a good guess as to how large their hands are. Trump’s hands aren’t freakishly small for someone of his height, but they are smaller than we expect. | |
To get an idea of how much smaller they are, we look at how far every serviceman’s hand length deviates from the linear regression line. The “standard deviation” of servicemen’s hand lengths around the regression line is 7 mm, marked by the dotted lines above and below. Trump’s hand length is almost exactly 1 standard deviation below what we would expect of someone as tall as he is. | |
We know we expect someone of Donald Trump’s height to have a hand length of 206 mm, on average. We also know that the standard deviation around that number is about 7 mm. Trump’s hands are one standard deviation below what we expect. If we make one final assumption — that the spread of hand lengths tends to follow a bell-shaped curve we call the “normal distribution” — then we can say just how tiny Donald Trump’s hands really are. The vast majority of people Donald Trump’s height — about 84%, in fact — would be expected to have hands larger than Donald Trump.
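If you want to check the arithmetic yourself, here is a small sketch of the two steps just described, using SciPy’s normal distribution (which may differ from the tooling behind the original analysis):

from scipy.stats import norm

# Regression line fitted to the ANSUR data: length = 27 + 0.095 * height (mm)
expected_length = 27 + 0.095 * 1880
print(expected_length)        # 205.6, i.e. about 206 mm

# Trump's corrected hand length (198 mm) sits roughly one standard
# deviation (7 mm) below that expectation. Under the normal assumption,
# the share of men his height with larger hands is:
print(norm.cdf(1.0))          # ~0.84, the 84% figure quoted above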
There are, of course, some caveats and assumptions that I will avoid mentioning for the sake of brevity, but also the fact that probably no one has read this far anyway. But in case you are still with me, rejoice! We can finally tell the world:","4" | |
"ahoiin","https://medium.com/@ahoiin/crafting-a-custom-mobile-friendly-data-visualization-cb91a3024064","27","{""Data Visualization"",Infographics,""Design Process""}","121","16.6584905660377","Crafting a custom, mobile-friendly | |
data visualization","The design process, from science paper to responsive interactive | |
Mobile- and touch-optimized data visualizations are still hard to find. Especially more complex approaches such as network visualizations. The data visualization of free trade agreements FTAVis tries to tackle this issue. FTAVis (www.ftavis.com) is a tool to make the complexity of the world’s trade agreements visible, accessible and understandable. The project visualizes 789 trade agreements from 205 countries over the last 66 years. | |
It is a valuable tool both for economists, as well as journalists and freely accessible to the public. The project was carried out by the Global Economic Dynamics Project Team of the Bertelsmann Foundation with the help of data visualization designer Sebastian Sadowski and data analyst Tobias Pfaff. | |
The project was awarded the Design Award 2016 of the Federal Republic of Germany for its excellence in Communication Design.
This article is the story of the design process, starting with a scientific paper and the data set. Some early sketches will be presented and the path to the final visual language pictured. Some further steps like the design of the user interface and user experience are also discussed, as well as the optimization for mobile, touch devices. The final visualization is available at ftavis.com and the iPad app in the App Store.
In 2014, senior project manager Dr. Ulrich Schoof of the Bertelsmann Foundation thought about presenting the linked, globalized world through trade agreements. “Trade agreement” is a term that appears more and more in the press recently, for instance the TTIP (Transatlantic Trade and Investment Partnership) which is a proposed Free Trade Agreement between the European Union and the United States of America. | |
However, the idea was, on the one hand, to present important trends and stories of trade agreements for the general public, and on the other hand to create a tool which could help experts gain some new insights. Therefore, it became clear that it would be necessary to present the development of trade agreements over a longer timespan, such as the last 60 years and more, thus giving a better impression of “the evolution”.
Inspired by the publication of a new dataset about the “Design of Preferential Trade Agreements” published by the World Trade Institute (University of Bern), the Bertelsmann Foundation asked interaction designer Sebastian Sadowski about crafting a data visualization. | |
The used data was collected by an international team of researchers working for the World Trade Institute, an interdisciplinary institute based at the University of Bern and is freely available at designoftradeagreements.org. It was originally published in the paper: Dür, A., Baccini L., and Elsig M. 2014. The Design of International Trade Agreements: Introducing a New Dataset. The Review of International Organizations. | |
Data scientist Dr. Tobias Pfaff helped with bridging the economic and data perspectives. His first rough analysis of the data set made clear: the data contains 789 free trade agreements (FTAs), signed between 1949 and 2014 by around 200 countries. He figured out that one of the most interesting and revolutionary facts about the data set is the provision of a depth value for each trade agreement. It is a simple additive index of seven key statements that can be part of a FTA (e.g., statements about tariff reduction or intellectual property rights protection). The statements have been found by researchers of the World Trade Institute who searched for them in the FTAs’ contracts.
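As a rough illustration of what such an additive index means in practice, the depth of an agreement can be computed as a sum of 0/1 provision flags; the seven flag names below are illustrative stand-ins, not the dataset’s actual column names:

# Hypothetical provision flags for one trade agreement (1 = provision present).
fta_provisions = {
    'tariff_reduction': 1,
    'services': 1,
    'investments': 0,
    'public_procurement': 0,
    'competition': 1,
    'intellectual_property_rights': 1,
    'standards': 0,
}

depth_index = sum(fta_provisions.values())  # ranges from 0 (shallow) to 7 (deep)
print(depth_index)  # 4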
The paper already contained some very basic, static visualizations which highlighted the main findings and helped with getting a first overview of the FTAs over time or per country. Even though the approach was to develop a more attractive, visually appealing and complex, interactive data visualization, the simple graphs and highlighted findings were a good start for the visualization process. | |
In the next step, we extracted the most important information from the data set with a custom web-based data parsing tool and transformed it into a JSON structure for further use. During the process we had to make some decisions about which data we wanted to include or exclude. For instance, the data set provided two measuring methods for the depth value of agreements, namely the depth index and what’s called the Rasch model. We decided to go with the depth index because the calculation is easier to understand and has also been used by colleagues at the economic research group Ifo Institute.
Besides choosing which data should be used, we also had to extend the data set. For instance, data for each single country had to be extracted (following the ISO 3166 standard) and it had to be figured out if and when each country joined the European Union so that European trade agreements could be assigned accordingly. | |
After analysing and working on the data set, the next step was to work on the visual output. Two main aspects of the data set should be presented: first, every country and its data, and second, the connections between them, resulting in a “network”. Therefore, Sebastian started with research to find already existing visualization projects which are either geo-based or network visualizations. A good tool to keep an overview of the visual research is Pinterest. Here is a collection of inspirational projects for the project:
After the research, three quick sketches were programmed with D3.js to figure out how the data set could work with different visualization techniques. D3.js is a JavaScript visualization library which is becoming the standard tool in today’s web-based data visualization projects, and the growing community and existing examples of different techniques helped with moving forward quickly.
Geo-based visualization
The first sketch was a simple geo-based visualization with a bubble for each country at the rough position of its geo-location. The background color identifies one specific FTA and the circle radius highlights the number of partners. Even though it became clear which areas worldwide are economically more active than others, the connections between countries could hardly be seen and the visualization technique failed at presenting many FTAs. There is also a basic visualization by the WTO that tries to visualize some FTAs on a map.
The code is based on the D3.js example “D3.js world map with force layout” which had been created for the presentation of Olympic medals at an article for the Huffington Post. | |
Sankey diagram
To get a better overall picture of the network and connections between countries, another sketch using the visualization technique of a Sankey diagram was programmed. The diagram highlights the FTAs with the most countries and also made clear that many countries are engaged. Unfortunately, the diagram is not very “sexy” and it is hard to follow individual countries when the data set for another year is chosen. Therefore another sketch was created to present the network while also considering the geo-based position.
The code is more or less based on the D3.js example “Sankey diagram using names in JSON”. | |
Radial dendrogram
The third sketch was inspired by the first two methods and focused on presenting the network of FTAs. It also became clear that with this technique it would be possible to position the country items at their rough geo-position (north/east/south/west).
Additionally, the visual aesthetic of the visualization technique was quite interesting and after pitching it to the Bertelsmann Foundation, we all agreed on developing the idea of the radial dendrogram further. | |
The aesthetics of the Eigenfactor Project by Moritz Stefaner were an important visual inspiration for further development. The diagram visualizes citation patterns among branches of science. Stefaner did a great job creating the bezier curves using the hierarchical edge bundling technique. This technique is based on a standard tree visualization method and extended with a curve connecting one node to another, which results from a hierarchical clustering calculation. This visualization method is also known as a radial dendrogram.
Here is a visual example: | |
In 2009, Flash was a popular tool, thus the Eigenfactor visualizations were built with Flare. The requirements for the Bertelsmann visualization were not only that it should work on all major browsers but also on mobile devices (Flash doesn’t work on iPads).
Mike Bostock — one of the creators of D3.js and a former editor in The New York Times’ graphics department — has already created a good example of a hierarchical edge bundling visualization made with D3.js.
Starting with that sketch, the data had to be adapted to follow the Flare class hierarchy – courtesy Jeffrey Heer. Therefore, the custom data parsing tool was extended to output a usable Flare file.
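As an illustration of that step, a small Python sketch of turning per-country FTA records into Flare-style nodes with name, size, and imports fields (the shape Bostock’s edge-bundling example expects) could look like the following; the input records and the dotted continent.country naming are simplified assumptions, not the project’s actual parser:

import json

# Simplified input: each country with its continent and its FTA partner countries.
countries = [
    {'name': 'Germany', 'continent': 'Europe', 'partners': ['France', 'Chile']},
    {'name': 'France', 'continent': 'Europe', 'partners': ['Germany']},
    {'name': 'Chile', 'continent': 'SouthAmerica', 'partners': ['Germany']},
]

by_name = {c['name']: c for c in countries}

def to_path(country_name):
    # Dotted path groups countries by continent, mirroring Flare's package paths.
    return '{}.{}'.format(by_name[country_name]['continent'], country_name)

flare_nodes = []
for c in countries:
    flare_nodes.append({
        'name': to_path(c['name']),
        'size': len(c['partners']),
        'imports': [to_path(p) for p in c['partners'] if p in by_name],
    })

print(json.dumps(flare_nodes, indent=2))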
Experimenting with the visualization technique of a radial dendrogram
The data set provides different kinds of country connections (e.g., between two countries, one country and a group of countries, etc.). So one of the first tasks was to figure out how the hierarchical edge bundling technique could help with visualizing the complex hierarchy.
The results were quite satisfying and even many FTAs with heaps of connections could be visualized. By grouping countries around continents the overall structure became clearer and some promising patterns could already be seen. The results were also visually appealing:
The visualization should show FTAs over the last 66 years so it was clear that all countries had to be visible, even though they had only one or a few FTA(s). | |
Unfortunately, if every country in the world (around 250) were shown, it would be pretty hard to figure out which countries had FTAs with others. Thus some further data processing removed all countries which had never been involved in any FTA at all. Among these 44 countries are mostly island countries such as New Caledonia and Puerto Rico, or remote parts of the world such as Antarctica.
Defining visual attributes
The radial dendrogram can visualize at least six variables: line color/opacity/weight, text size/weight/position.
The total number of FTAs for each country should also be visible, therefore various visual attributes were tested, resulting in the introduction of another visual element: a circle in front of every text item.
Circles
The circle in front of each country text item visualizes the total number of FTAs through the circle size. In earlier versions the fill color represented the average depth index of the country, but due to user feedback that was later removed so the user can focus and is not overwhelmed.
Lines
Every FTA is presented as a collection of lines bundled through the hierarchical edge bundling technique and highlighted with a color visualizing the depth of that FTA. If a FTA has been signed in the past it is drawn in the back and dimmed. Only the FTAs which are signed in the selected year are visually highlighted.
Line Colors
A great tool for finding colors is chroma.js by Gregor Aisch, which helps with mastering multi-hued, multi-stop color scales.
The above color scale has been replaced by the one beneath due to user feedback. Providing only two color areas, such as blue for a minor value and red for a major value, seemed to be easier to understand than three color areas. A single colour area, such as light blue to dark blue, was also tested, but didn’t manage to highlight the FTAs with a high depth index adequately when many agreements/lines are visualized.
Line positions
Another challenge was the number of lines and identifying each FTA. We solved that issue by placing the intra-continental agreements outside of the circle. Credits go to the intern of the Bertelsmann Foundation: Quentin Dumont.
Handling performance issues
We tested the visualization on several devices — including tablets — and figured out that the performance of redrawing many lines was bad. One of the ideas was to reduce the drawn lines, e.g. to remove the lines of the FTAs which were signed before the currently selected year.
However, we had a call with one of the data set publishers Andreas Dür — from the Department of Political Science and Sociology, University of Salzburg — and he argued that it is definitely necessary to present all signed FTAs in the background. It would help to understand that FTAs will last over a long period of time. | |
However, drawing thousands of (SVG) lines in a browser resulted in some serious performance issues. Additionally, for some years there are FTAs with over 100 member countries, resulting in heaps of lines being drawn for one agreement alone. Finally, we had to find a better solution for drawing so many lines at once.
After processing the data and designing the visualization, an interface was needed to offer the user some interaction possibilities to explore the data in more detail. | |
To keep things simple we decided that the user should filter the data through selecting a specific country and year to find some new patterns and insights. Finally, the visualization should not provide a total overview of all years to the past but rather highlight the “evolution” of the economic development over the last 66 years worldwide and for each country. | |
Time slider
The user should be able to select a specific year as well as start an animation to observe the development over time. Due to the mobile optimization we decided to provide three methods of interaction: 1. a draggable time slider for a quick jump to a time range, 2. forward/backward buttons to go through each year slowly, and 3. a play/pause button for the animation. In an earlier version, the slider and trend highlighter were two separate parts but after some user feedback they were merged into one. The range slider is based on the JavaScript library noUiSlider, which is great for mobile use.
Country selector
The user should be able to select a country in an easy way. Select2 has been chosen for that task. It is a jQuery-based replacement for select boxes and is usable on touch devices. Besides the interaction in the sidebar, the user can also hover over a country item to get a preview or click/touch a country item to select it.
Trend highlighter
It became clear that it was quite a challenge to browse through countries and years to find, e.g., the most FTAs or the highest depth index of a country in one year. Inspired by the paper of Dür, Baccini, and Elsig (2014), we decided to include another small, interactive bar chart for each country to highlight trends and find interesting results or data stories faster, for instance the years with the maximum or minimum number of FTAs or the deepest FTAs.
In an earlier version, two line charts represented the number of FTAs and average depth over time, but it turned out that the interpolated line was hardly readable for countries which had only a few FTAs over the last 66 years. Therefore, the two charts were replaced with a bar chart, so it became clearer in which exact years the maximum or minimum number of FTAs occurred.
Finally, the two charts were merged into one stacked bar chart and combined with the time slider. This redesign is due to some user feedback, observing that users looked first at the charts to find some interesting years and used the time slider next. Another reason was the creation of a cleaner user interface and to free up some space for the list of FTAs below. The stacked bar chart was chosen to make it easy to find interesting patterns, due to the fact that every FTA is previewed in the graph with the color of its depth index.
List of trade agreements
In some scenarios there are so many lines that it is hard to figure out how many and which FTAs are actually shown.
We decided not to include an interaction method to select a path by hovering/touching it, but rather to present all FTAs in a separate list. The FTAs are sorted by the highest depth index and number of member states, thus the most important FTAs are likely to be shown on top.
By selecting a list item, additional information is presented, such as the agreement type (e.g., Plurilateral and Third Country) or the reason for the specific depth index. Due to the number of items, a resize/shrink button has been added below the list, making it possible to focus on the shown FTAs. Unfortunately, a deep link to the FTAs for even more information is not provided yet, even though there is a list of a few agreements by the WTO.
Including a legendIf the user sees the visualization for the first time, some elements are not self-explanatory so some kind of legend is needed. | |
Interactive color scaleThe definition of the depth index is among the most interesting information in the data set. Therefore, we included an interactive color scale: a bar chart which highlights (higher position) all depth values which are currently visible. Additionally, by hovering over the color all FTAs with that index are highlighted. | |
Circle size: The circle in front of each country label has been added to provide another dimension of information: the total number of FTAs up to the selected time. The circle should highlight countries that are more or less involved in the “economic exchange”. By hovering over a circle, all countries with up to that number of FTAs are visually highlighted. This makes it possible to find countries that were not yet engaged at a specific point in time as well as the countries that signed the most FTAs. | 
Up to this point, the data has been processed, the data visualization designed, and a user interface to manipulate the data created. As one of the final steps, the experience of the visualization had to be improved so the user can have some fun browsing through the data. In the best case, the user becomes even more engaged with the data visualization and recommends it to others. | 
Loading time: The data for the visualization is around 2 MB in size, so for users with a slow internet connection a preloader has been included, also implemented with d3.js. The preloader should keep the user on the page and give some indication of how much longer he or she has to wait. Mike Bostock created a nice example that demonstrates how to display the progress of an asynchronous request. | 
Animation: Some simple animations should convey the feeling of using an application instead of a website. After the data has been loaded, the central radial visualization as well as the sidebars on the left and right are faded in. Some subtle animations have also been included, such as fading effects when hovering over countries or when changing the selected year. | 
Responsive design: Optimizing web-based data visualizations for different devices, and above all for touch devices, is still complicated. Therefore, only some basic optimization has been implemented. | 
First of all, the visualization needs a certain amount of CPU/GPU power and screen size to render the data visualization properly, e.g. on tablets or desktop computers. The user experience on small devices such as smartphones would therefore not have been great. Smartphone users are instead notified about the minimum device requirements and can browse through the “Selection of Findings and Data Stories”. | 
Secondly, the data visualization is primarily optimized for tablet devices. The filtering of countries or the selection of a specific year can be handled with a touch interaction, as can the selection of an FTA. The project is even accessible offline for iPad users with the native iPad app, which can be downloaded from the Apple App Store. | 
Thirdly, the project is easily accessible on desktop computers with different screen resolutions as well as browsers. Because on bigger screens the data visualization would be drawn in a correspondingly big Canvas element, performance would become a problem. Thus the maximum width of the project has been limited to 1300px and the visualization is positioned in the center. | 
Providing more information: User feedback highlighted that users rarely found some of the most interesting data stories on their own. Therefore, some additional information is provided so users can get deeper into the topic and gain some knowledge. | 
First of all, a “Selection of Findings and Data Stories” section has been included below the visualization, which focuses on six findings and should enable the user to find more stories. | 
The second section, “About”, gives a general introduction to the topic, a further explanation of the depth index, and information about the people behind the project. | 
What is a visualization worth if it is not seen by people? Besides spreading the word about the project, the Bertelsmann Foundation decided to create a video about the project to further highlight some findings and tell some stories with the visualization. | 
A press kit has also been packed for the media. | |
Here is the final version available at ftavis.com. | |
The data visualization on a desktop computer | |
Evaluation: As already mentioned in the introduction, the data visualization should not only be a presentation for consumption but also a tool for experts. It will be interesting to find out whether, and what kind of, new findings will be made with the help of the project. | 
A brief evaluation with insights into experts’ new findings will be collected and hopefully published soon. New data stories can also be submitted to [email protected]. | 
Enjoy!","3" | |
"pahlkadot","https://medium.com/code-for-america/people-not-data-again-3da4f50c57e7","1","{Government,Civictech,""Big Data"",""Digital Service"",""2016 Election""}","121","3.4","People, Not Data. Again.","Well, the Presidential election turned out a little differently than most people expected. All that data, all that polling, all those models. Almost all wrong. | |
Data can be a powerful agent of change. But only when we have some understanding of what the data describe. We were adrift in a sea of numbers over the last six months. We missed the people behind the numbers. | |
Please, go (re)read Jake Solomon’s post People, Not Data from almost three years ago. Three years ago many people had just had their first taste of FiveThirtyEight and it was addicting. But despite working to apply lessons from technology to improve government services, Jake barely mentions data, or any other technology, in his post. Instead he spends his time sharing his firsthand observations of people and how the obliviousness to real people that’s been inadvertently built into so many government services affects them. Here’s a taste of it: | 
I am often asked to serve on panels at conferences about big data and cities, big data for social impact, big data in the public sector. They’re not my favorite topic, partly because I’m not a practitioner, and those who work with data every day are better suited to speak to it. And it’s partly because the big data label is largely misapplied in a local government context. But it’s also that if we don’t understand the users about whom we are collecting data, that data doesn’t mean much. We’ve seen it in the ways that government often turns to marketing campaigns for programs when the data show low enrollment, not realizing that it’s not so much low awareness suppressing enrollment as poor customer experience and high rates of abandonment throughout an overly complex enrollment process. They look at data about the program, not about how users experience it. And not about how people feel about it, as Jake starts to get at in his post. Something went wrong with our data about the election too. I don’t know what it was, but I will echo the thousands of voices who are saying that the media (at least) obviously don’t understand the electorate. Lots of people are guilty of that at the moment. | |
[Update: danah boyd offers valuable thoughts on this topic: “I believe in data, but data itself has become spectacle.” Well worth a read.] | |
Jake’s observations about the users of our social safety net start to get at the understanding we need to make a difference with data. And let me point you to the words of another Code for America alum, Dan Hon, who wrote the following earlier this week. | |
Understanding users and user needs — people and people’s needs — should be the dominant framework. Data is one important tool to do that. But without deeper understanding, it can be terribly misleading. We will hopefully learn many lessons from the 2016 election, but let one of them be the value of small data. Sometimes very small. Like a conversation. Or a visit to twelve-thirty-five.","2" | 
"akelleh","https://medium.com/@akelleh/a-technical-primer-on-causality-181db2575e41","22","{""Data Science"",""Machine Learning"",Causality,""Big Data"",""Social Science""}","120","18.6462264150943","A Technical Primer On Causality","What does “causality” mean, and how can you represent it mathematically? How can you encode causal assumptions, and what bearing do they have on data analysis? These types of questions are at the core of the practice of data science, but deep knowledge about them is surprisingly uncommon. | |
If you analyze data without regard to causality, you open your results up for the possibility of enormous biases. This includes everything from recommendation system results, to post-hoc reports on observational data, to experiments run without proper holdout groups. | |
I’ve been blogging a lot recently about causality, and wanted to go through some of the material at a more technical level. Recent posts have been aimed at a more general audience. This one will be aimed at practitioners, and will assume a basic working knowledge of math and data analysis. To get the most from this post you should have a reasonable understanding of linear regression and probability (although we’ll review a lot of probability). Prior knowledge of graphical models will make some concepts more familiar, but is not required. | 
How do you quantify causality? | |
Judea Pearl, in his book Causality, constantly remarks that until very recently, causality was a concept in search of a language. Finally, there is a mathematically rigorous way of writing what we’ve wanted to write explicitly all along: X causes Y. We do this in the language of graphs. | |
This incredibly simple picture actually has a precise mathematical interpretation. It’s trivial in the two variable case, but what it implies is that the joint distribution of X and Y can be simplified in a certain way. That is, P(X,Y) = P(Y|X)P(X). Where it gets really interesting is when we build the theorems of causality on top of this mathematical structure. We’ll do that, but first let’s build some more intuition. | |
Let’s consider an example of three variables. X causes Y, which in turn causes Z. The picture looks like this: | |
This could be, for example, I turn on the light switch (X), current flows through the wires (Y), and the light comes on (Z). In this picture, X puts some information into Y (knowing there is current, we know the switch is on). Y contains all of the information about X that is relevant for determining the value of Z: knowing the value of Y makes X irrelevant. We only need to know that there is power flowing — the switch is now irrelevant. X only affects Z through its effect on Y. This picture is a very general way of summarizing these relationships, and can apply to any system with the same causal structure. | 
How does this causal relationship look mathematically? The distribution of a variable summarizes all of our statistical knowledge about it. When you know something about a variable (say, if we’re interested in Z, we know the value of Y), you can narrow down the possible values it can take to a smaller range. There’s another distribution that summarizes this new state of knowledge: the conditional distribution. We write the distribution Z takes on given a certain value of Y, say y, as P(Z|Y=y). We’ll usually suppress the extra notation, and just write it as P(Z|Y). | |
Here’s where this picture gets interesting. When we know Y, X doesn’t provide any additional information. Y already summarizes everything about X that is relevant for determining Z. If you look at the distribution of values that Z takes when we know the value of Y, P(Z|Y), it should be no different from the distribution that we get if we also know the value of X, P(Z|Y,X). In other words, P(Z|Y,X) = P(Z|Y). When knowing Y makes X irrelevant, we say that Y “blocks” X from Z. Another technical term is that Y d-separates X and Z. A final way of writing this is Z ⏊ X | Y, which you can read as “Z is independent of X, given Y”. These graphs summarize all statements about “blocking” in the system we’re interested in. This concept goes well beyond the 3-variable chain example. | |
There’s a nice sense (and it’s not the ordinary one!) in which you can say Z is independent of X. It’s a version of independence in the context of other information. This kind of independence amounts to something more like “the dependence has been explained away”. It’s not too hard to show (do it!) that when P(Z|Y,X) = P(Z|Y), it’s also true that P(Z,X|Y) = P(Z|Y)P(X|Y). This is the definition of conditional independence. It looks exactly like independence (which you’d write as P(Z,X) = P(Z)P(X)), except that you condition on Y everywhere. Conditional independence is neither necessary nor sufficient for independence. In our example, where X → Y → Z, even though X and Z are conditionally independent given Y, they are statistically dependent, and will indeed be correlated. | 
To summarize, the most important points up to here are that (1) we can quantify causation using graphs, and (2) these graphs have real implications for the statistical properties of a system (namely, they summarize independence and conditional independence). | |
Finally, it turns out there’s a very general rule when you have a picture like this. We saw that the direct cause of a variable contains all of the (measured) information relevant for determining its value. This generalizes: you can factor the joint distribution into one factor per variable, and that factor is just the probability of that variable given the values of its “parents,” or the other variables pointing in to it, where, for example, above, the parents of Z, or par(Z), is evaluated as par(Z) = {Y} (the set of variables containing only Y). To write it out generally, | |
Or, more succinctly using product notation, | |
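In LaTeX, one way to write out the two forms of the factorization just described is:

P(X_1, X_2, \ldots, X_n) = P(X_1 \mid par(X_1)) \, P(X_2 \mid par(X_2)) \cdots P(X_n \mid par(X_n))

P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid par(X_i))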
Before we move on to an example, I want to make one point of clarification. We’re saying that causal structure can be represented as a graph. As it happens, there are plenty of graphs that represent factorizable joint distributions, but don’t have a causal interpretation. In other words, causal structure implies a graph, but a graph does not imply causal structure! | |
Example Time | |
First, let’s look at this X → Y → Z example. It’s easy to generate a data set with this structure. | 
X is just random, normally distributed data. In real life, this would be something that’s generated by factors you’re not including in your model. Y is a linear function of X, and it has its own un-accounted-for factors. These are represented by the noise we’re adding to Y. Z is generated from Y similarly as Y is generated from X. | |
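As a minimal Python sketch of this kind of data-generating process (the coefficients 2 and 3 and the noise scales here are illustrative choices, not values taken from the post):

import numpy as np
import pandas as pd

np.random.seed(0)
N = 1000

# X has no parents in the graph: it is driven entirely by factors outside the model.
x = np.random.normal(size=N)

# Y is a linear function of X plus its own unaccounted-for noise.
# The coefficients 2.0 and 3.0 below are illustrative, not the post's values.
y = 2.0 * x + np.random.normal(size=N)

# Z is generated from Y in the same way Y is generated from X.
z = 3.0 * y + np.random.normal(size=N)

data = pd.DataFrame({'x': x, 'y': y, 'z': z})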
We can see that X and Y are correlated, and are noisily, linearly related to each other. Similarly, Y and Z are related, and X and Z are related. | |
From this picture, you’d never know that Y can explain away X’s effect on Z. This won’t be obvious until we start doing regression analysis. | |
Now, let’s do a regression. Z should be related to Y (look at the formulae and convince yourself!), and should have a slope equal to its coefficient in the formula before. We can see this by just doing the regression, | 
This is great, and it’s what we expect. Similarly, you can regress Z on X, getting the result | |
You find a similar result as before when you regress Z on X, where the coefficient is just the product of the X →Y coefficient and the Y → Z coefficient. | |
Now this is all as we expect. Here’s where it’s going to get interesting. If we regress Z on Y and X together, something weird happens. Before, they both had non-zero coefficients. Now, we get | |
The X coefficient goes away!! It’s not statistically significantly different from zero. Why is this!? (note: by chance, it’s actually relatively large, but still not stat. sig. I could have cherry-picked a dataset that had it closer to zero, but decided I wouldn’t encourage the representativeness heuristic). Notice there is also no significant improvement in the R^2! | |
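A sketch of those three regressions with statsmodels, run on the illustrative data frame from the snippet above (not the post's original code):

import statsmodels.formula.api as smf

# Z on Y recovers (approximately) the Y -> Z coefficient.
print(smf.ols('z ~ y', data=data).fit().params)

# Z on X recovers the total effect: the product of the X -> Y and Y -> Z coefficients.
print(smf.ols('z ~ x', data=data).fit().params)

# Z on X and Y together: the X coefficient is no longer significantly
# different from zero, because Y d-separates X from Z.
print(smf.ols('z ~ x + y', data=data).fit().summary())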
You may have run into something similar to this in your work, where regressing on two correlated independent variables gives different coefficients than regressing on either independently. This is one of the reasons (that is, causal chains) for that effect. It’s not simply that the regression is unstable and giving you wrong estimates. You may try to get around the issue through regularization, but the problem is deeper than the empirical observation of some degeneracy in regression coefficients. One of your variables has been explained away. Let’s look at what this means quantitatively. We’ll see that the implications go way beyond our example. Let’s see how deep it goes. | |
Down the rabbit hole… | |
The regression estimator, z(x), is really the expectation value of a distribution. For some value of x, it’s your best guess for the value of z: the average z at that value of x. The regression estimate could properly be written | |
In this case, all of the relationships are linear (by construction), and so we’ll assume that the expectation takes on a linear form, | |
where the epsilon term is the total noise when you use this equation, and the beta coefficient is derived using the data-generating formulas in the code above. Looking at our original formulae, | |
and plugging in the formula for the value of y (given the value of x), | |
So the coefficient is just the product of the Y → Z and X → Y coefficients. Looking at these formulae, if we write the estimator z(x,y) as a function of x and y, it’s clear that we now know the value that both x and the disturbance term take. The variable y, together with some noise, are the only variables that determine z. If we don’t know y, then knowing x still leaves the fuzziness from the y noise term, \epsilon_{yx} (unfortunately, no inline LaTeX in Medium!). If we know y, then this fuzziness is gone. The value that the random noise term takes on has already been decided. This is why the R^2 from regressing on x is so much smaller than that from regressing on y! | |
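To make the step concrete, write the (illustrative) linear data-generating equations with explicit coefficients; the names \beta_{yx} and \beta_{zy} and the zero-mean noise terms are my notation for the construction above, not the post's:

y = \beta_{yx} \, x + \epsilon_{yx}, \qquad z = \beta_{zy} \, y + \epsilon_{zy}

z(x) = E[Z \mid X = x] = \beta_{zy} \, E[Y \mid X = x] = \beta_{zy} \beta_{yx} \, x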
Now, what happens when we try to make a formula involving x and y together? We’ll see that the x dependence goes away! This is because of d-separation! If you write the regression estimate of z on both y and x, you get | |
But we saw before that this distribution is the same without x! Plugging in P(z|x,y) = P(z|y), we get | |
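Spelled out, writing the regression estimate as an expectation and substituting P(z|x,y) = P(z|y):

z(x, y) = E[Z \mid X = x, Y = y] = \int z \, P(z \mid x, y) \, dz = \int z \, P(z \mid y) \, dz = E[Z \mid Y = y] = z(y)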
In other words, the regression is independent of x, and the estimate is the same as just regressing on y alone! | |
Indeed, comparing the coefficients of these two regressions, we see that the y coefficient is the same in both cases, and we’ve correctly estimated the effect of Y on Z. | |
When you’re thinking causally, it’s nice to keep in mind the distributions that generate the data. If we work at the level of the regression model, we lose the probability manipulations that make it clear why coefficients disappear. It’s only because we go back to the distributions, with no assumptions on the form of expectations values, that the coefficient’s disappearance is clearly attributable to the causal relationships in the data set. | |
Direct vs. Total Effect | |
Let’s examine what we’re measuring a little more closely. There is a difference between the direct effect one variable has on another, and the effect it has through long, convoluted chains of events. The first kind of effect is called the “direct” effect. The sum of direct effects and indirect chains of events are the “total” effect. | |
In our example above, regressing Z on X correctly estimates the total effect of X on Z, but incorrectly estimates the direct effect. If we interpreted the regression coefficient as the direct effect, we’d be wrong: X only affects Z through Y. There is no direct effect. This all gets very confusing, and we really need some guiding principles to sort it all out. | 
Depending on what you’re trying to estimate (the total or the direct effect), you have to be very careful what you regress on! The hardest thing is that the right variables to regress on depend on the graph. If you don’t know the graph, then you don’t know what to regress on. | 
You might be thinking at this point “I’ll just see if a coefficient goes away when I add variables to my regression, so I’ll know whether it has a direct or indirect effect”. Unfortunately, it’s not so simple. Up to this point we’ve only been thinking about chain graphs. You can get large, complicated graphs that are much more difficult to sort out! We’ll see that the problem is that, in general, you have to worry about bias when you’re trying to measure the total effect of one variable on another. | |
There are two other 3-node graph structures, and they turn out to be very nice for illustrating where bias comes from. Let’s check them out! | |
Bias | |
Let’s go ahead and draw the other two graph structures (technically a third is Z → Y → X, but this is just another chain). Below, we have the fork on the left, and the collider on the right. The left graph might be something like online activity (Y) causes me to see an ad (X), and to make online purchases for an irrelevant product (Z). The one on the right might be something like skill at math (X) and skill at art (Z) both having an effect on my admission to college (Y). The left graph shows two variables, X and Z, that are related by a common cause. The right graph shows X and Z with a common effect. | 
Neither of these two diagrams have a direct or indirect causal relationship between X and Z. Let’s go ahead and generate some data so we can look at these graphs in more detail! | |
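A minimal Python sketch of the two structures, with illustrative coefficients (again, not the post's original code):

import numpy as np
import pandas as pd

np.random.seed(1)
N = 1000

# Fork (left graph): Y is a common cause of X and Z, i.e. X <- Y -> Z.
# Coefficients are illustrative.
y = np.random.normal(size=N)
x = 1.5 * y + np.random.normal(size=N)
z = -2.0 * y + np.random.normal(size=N)
fork = pd.DataFrame({'x': x, 'y': y, 'z': z})

# Collider (right graph): X and Z are independent causes of Y, i.e. X -> Y <- Z.
x = np.random.normal(size=N)
z = np.random.normal(size=N)
y = x + z + np.random.normal(size=N)
collider = pd.DataFrame({'x': x, 'y': y, 'z': z})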
Plotting this data gives the following. For the fork (the left graph): | |
Notice that everything is correlated. If you regressed Z on X, you would indeed find a non-zero regression coefficient, even though there’s no causal relationship between them at all! This is called confounding bias. Try this plotting yourself for the collider (the right graph)! | |
We would like to be able to estimate the right effect — is there a way we can use d-separation and conditional independence to do this? | |
It turns out there is! If you read the previous post on bias, then you have a strong intuitive grasp of how conditioning removes bias. If you haven’t, I strongly encourage you to read it now. I’ll still try to give a concise explanation here. | |
The fork causes Z and X to be correlated because as Y changes, it changes both X and Z. In the example I mentioned in passing: if you’re a more active internet user, you’re more likely to see an ad (because you’re online more). You’re also more likely to purchase a product (even one irrelevant to the ad) online. When one is higher, so is the other. The two end up correlated even though there’s no causal relationship between them. | |
More abstractly, in the case of linear relationships (like in our toy data), Y being larger makes both X and Z larger. If we fix the value of Y, then we’ve controlled any variation in X and Z caused by variation in Y. Conditioning is exactly this kind of controlling: it says “What is the distribution of X and Z when we know the particular value of Y is y.” It’s nicer to write P(Z,X|Y) as P(Z,X|Y=y) to emphasize this point. | |
From this argument, it’s clear that in this diagram Z and X should be independent of each other when we know the value of Y. In other words, P(Z,X|Y) = P(Z|Y)P(X|Y), or, “Y d-separates X and Z”. You can actually derive this fact directly from the joint distribution and how it factorizes with this graph. It’ll be nice to see, so you can try the next one! Let’s do it: | |
Applying the factorization formula above, | |
then, from the definition of conditional probability | |
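Concretely, for the fork (X ← Y → Z), the two steps can be written as:

P(X, Y, Z) = P(X \mid Y) \, P(Z \mid Y) \, P(Y)

P(X, Z \mid Y) = \frac{P(X, Y, Z)}{P(Y)} = P(X \mid Y) \, P(Z \mid Y)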
Indeed, we can see this when we do the regression. The x coefficient is zero when we regress on y as well. Notice this is exactly the same result as in the previous case! So, if you want to estimate the effect of X on Z when the two are confounded, the solution is to “control for” (condition on) the confounder. | 
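A sketch of that check, using the illustrative fork data generated earlier:

import statsmodels.formula.api as smf

# Without controlling for Y, X appears to be related to Z: confounding bias.
print(smf.ols('z ~ x', data=fork).fit().params)

# Controlling for the confounder Y, the X coefficient is close to zero,
# matching the true (absent) causal effect of X on Z.
print(smf.ols('z ~ x + y', data=fork).fit().params)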
To make it clear that this isn’t just because we’re working with linear models, you could calculate this result directly from the expectation value, as well: | |
so Z is indeed independent of X (given Y). This comes out directly as a result of the (statistical properties of) the causal structure of the system that generated the data. | |
Try doing this calculation yourself for the collider graph! Are X and Z dependent initially (consult your plots!)? Are they dependent when you condition on Y (try regressing!)? Spoilers below! | |
The other side of this calculation is that we can show X and Z, while having no direct causal relationship, should generally be dependent at the level of their joint distribution. | |
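For the fork, summing the factorized joint distribution over Y gives the marginal joint of X and Z:

P(X, Z) = \sum_{y} P(X \mid y) \, P(Z \mid y) \, P(y)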
This formula doesn’t factor any more. It’s clear here that the coupling between the two pieces that want to be factors comes through the summation over Y. Except in special cases, X and Z are coupled statistically through their relationship with Y. Y causes X and Z to become dependent. | |
So we’ve seen a few results from looking at this graph. X and Z are generally dependent, even though there’s no causal relationship between them. This is called “confounding,” and Y is called a “confounder”. We’ve also seen that conditioning on Y causes X and Z to become statistically independent. You can simply measure this as the regression coefficient of X on Z, conditional on Y. | |
If you repeated this analysis on the collider (the right graph), you’d find that X and Z are generally independent. If you condition on Y, then they become dependent. The intuition there, as described in a previous post, is that if you know an effect of two possible causes, and you know one of the causes, you learn something about the other. If the system you’re thinking of is a sidewalk, with two possible causes of it being wet (rain or a sprinkler), then knowing that the sidewalk is wet and that it didn’t rain makes it more likely that the sprinkler was on. | 
This raises a very interesting point. In both of these pictures, there was no direct causal relationship between X and Z. In one picture, you estimate the correct direct effect of X on Z by conditioning on Y, and the incorrect effect if you don’t. In the other picture, you estimate the correct effect when you don’t condition, and the incorrect effect if you do! | 
How do you know when to control and when not to? | |
The “back door” criterion | |
There is a criterion you can use to estimate the total causal effect of one variable on another variable. We saw that conditioning on the variable in between X and Z on the chain resulted in measuring no coefficient, while not conditioning on it resulted in measuring the correct total effect. | 
We saw that conditioning on the central variable in a collider estimated the wrong total effect, but not conditioning on it estimated the correct total effect. | |
We saw that conditioning on the central variable in a fork gave the correct total effect, but not conditioning on it gave the wrong one. | |
What happens when the graphs are more complicated? Suppose we had a fully connected graph on ten variables. What then?! | 
It turns out that our examples above provide the complete intuition required to understand the more general result. The property for identifying the right variables to control is called the “back door criterion,” and it is the general solution for finding the set of variables you should control for. I’m going to throw the complete answer out there, then we’ll dissect it. Directly from Pearl’s Causality, 2nd Ed.: | |
(In brief: a set of variables Z satisfies the back-door criterion relative to an ordered pair of variables (X_i, X_j) in a DAG if (i) no node in Z is a descendant of X_i, and (ii) Z blocks every path between X_i and X_j that contains an arrow into X_i.) This criterion is extremely general, so let’s pick it apart a little. Z are the variables he’s saying we’re controlling for. We want to estimate the effect of X_i on X_j. DAG stands for “directed acyclic graph,” and it’s just a mathematically precise necessary property for being a causal graph (technically you can have cyclic ones, but that’s another story). We’ll probably go into that in a later post. For now, let’s take a look at these criteria. | 
Criterion (i) says not to condition on “descendants” of the cause we’re interested in. A descendant is an effect, an effect of an effect, and so on. (The term comes from genealogy: you’re a descendant of your father, your grandfather, etc.). This is what keeps us from conditioning on the middle variable in a chain, or the middle variable of a collider. | 
Criterion (ii) says “if there is a confounding path between the cause and effect, we condition on something along the confounding path (but not violating criterion (i)!)”. This is what makes sure we condition on a confounder. | |
There’s a lot of interesting nuance here, but we’re just working on learning the basics for now. Let’s see how all of this comes into play when we want to estimate an effect. Pearl gives the formula for controlling: | 
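In the hat notation introduced just below (the hat denoting an intervention), the back-door adjustment formula reads:

P(y \mid \hat{x}) = \sum_{z} P(y \mid x, z) \, P(z)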
but how is this related to regression? | |
Let’s think of a “causal regression”. We want the expectation value we expect y to take on given that X takes the value of x (with the hat on top). Another way of writing this is do(x), meaning that we intervene to fix the value of x. This is the true total causal effect of x on y. | |
So here, the expectation value is just our regression estimate! It behaves how we’ve seen above when we condition on various control variables, and does what we expect it to. | |
There’s an extra P(Z) term, and a summation over Z. All this is doing is weighting each regression, and taking an average regression effect over the values the control variable, Z, takes on. So now we have a “causal regression”! | |
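Putting that together, the “causal regression” is just the adjustment formula applied to the expectation:

E[Y \mid do(x)] = \sum_{z} E[Y \mid x, z] \, P(z)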
Problems in practice | |
The main problem with implementing this approach in practice is that it assumes knowledge of the graph. Pearl argues that causal graphs are really very static things, so lend themselves well to being explored and measured over time. Even if the quantitative relationships between the variables change, it’s not likely that the causal structures do. | |
In my experience, it’s relatively difficult to build causal graphs from data. You can do it from a mixture of domain knowledge, statistical tests, logic, and experimentation. In some future posts, I’ll take some sample open-data data sets and see what we can say about causality based on observational data. For now, you’ll have to take my word for it: it’s hard. | 
In light of this, this approach is extremely useful for seeing where estimating causal effects based on observational data breaks down. This breakdown is the basis for “correlation does not imply causation” (i.e. P(Y|X) != P(Y|do(X))). The underlying assumption that causality can be represented as a causal graph is the basis for the statement “there is no correlation without causation”. I haven’t yet seen a solid counter-example to this. | 
In the end, if you can show that a causal effect has no bias, then the observational approach is just fine. This may mean doing an experiment once to establish that there’s no bias, then taking it on faith thereafter. At the least, that approach avoids repeating an experiment over and over to make sure nothing has changed about the system. | |
This is an incredibly powerful framework. At the very least, it gives you a way of talking to someone else about what you think the causal structure of a system is. Once you’ve written it down, you can start testing it! I think there are some exciting directions for applications, and I’ll spend the remainder of the post pointing in one of them: | |
Machine Learning vs. Social Science | |
In machine learning, the goal is often just to reduce prediction error, and not to estimate the effects of interventions in a system. Social science is much more concerned with the effects of interventions, and how those might inform policy. | |
At IC2S2 recently, in Sendhil Mullainathan’s keynote, he called this the “beta-problem” (the focus on the regression coefficients, more common in social science) and the “y-hat problem” (the focus on the actual prediction, more common in machine learning). An extreme example is lasso regression, where the variables that are selected can vary across random subsets of the data set! There is no consistent estimate of the coefficients in that case: it’s a pure y-hat problem. | 
Here, there’s a very interesting application to machine learning: knowing the causal graph, you could use the back-door criterion for subset selection. Once you’ve solved the “beta-problem,” you might be better at solving the “y-hat” problem. This wouldn’t be hard to test using an ensemble of random data sets: simply generate random graphs, and compare lasso in cross-validation against the back-door approach. I went ahead and did this for a slightly simplified approach to finding blocking sets. It turns out that the parents of a node d-separate it from its predecessors. If we condition on the parents (X) of a node (Y), the required adjustment set Z is the empty set. I used this simple rule along with this code for random data sets to calculate the cross-validation R^2 for a data set with 30 variables, and N=50. On average, the dependent variable had 9 parents. If we histogram the number of data sets vs. their cross-validation R^2, we can see that lasso is clearly an improvement over a naive linear regression. | 
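A simplified stand-in for that comparison might look like the following sketch; the post used random causal graphs generated by its own code, whereas here only the first nine columns (standing in for the parents) drive y, and the sample sizes and coefficients are illustrative:

import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_samples, n_vars, n_parents = 50, 30, 9

# In the post's experiment the data came from random causal graphs, so non-parent
# variables were also correlated with y; here they are pure noise for simplicity.
X = rng.normal(size=(n_samples, n_vars))
beta = rng.normal(size=n_parents)
y = X[:, :n_parents] @ beta + rng.normal(size=n_samples)

# Lasso with its penalty chosen by internal cross-validation.
lasso_r2 = cross_val_score(LassoCV(cv=5), X, y, cv=5, scoring='r2')

# "Back-door style" baseline: plain regression on the true parents only.
parents_r2 = cross_val_score(LinearRegression(), X[:, :n_parents], y, cv=5, scoring='r2')

print('lasso   mean CV R^2:', lasso_r2.mean())
print('parents mean CV R^2:', parents_r2.mean())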
Now, compare lasso with regression on the parents alone (a simple way of applying the back-door criterion) | |
That looks pretty nice! In fact, regression on the parents worked better than lasso! (I chose the lasso parameter using grid-search to initialize [and find a bounding range for] a function minimization, so it wasn’t an easy competitor to beat!). Discovering and regressing on the direct causes of Y (instead of dumping all of the data into a smart algorithm) solves the “y-hat” problem for free. In other words, in this context, solving the “beta-problem” (that is, finding the true causes and estimating their direct effects) solves the “y-hat” problem.","8" | |
"faris","https://medium.com/context/personalized-advertising-is-an-oxymoron-77c95f608fb5","16","{Advertising,Brands,Personalization,""Big Data"",""Best of 2016""}","120","12.3952830188679","Personalized Advertising is an Oxymoron","This post was selected by the editors of Context as one of the best posts about the marketing and advertising published on Medium in 2016. | |
“How should marketing adapt to the era of personalization?” | |
In 1999 Steven Spielberg was in pre-production for the film Minority Report, an adaptation of Philip K. Dick’s short story. He invited a group of prominent thinkers, writers and scientists to spend two days brainstorming what that future would look like. They were tasked with questions such as | 
Together, they created a vision of the future, one so well thought out and rendered in the film, that we have been trapped in it ever since, having unfortunately forgotten it was supposed to be terrifying. | |
In the key advertising section we see Tom Cruise as John Anderton walking down a corridor, being scanned by retinal cameras, assaulted from all sides by aggressive holographic billboards for Lexus and Guinness, all calling out his name in a cacophony of commercial exhortations. | |
Watching the scene, the tension, the unpleasantness of the assault, is palpable. Sensors identify you by the unique patterns of blood vessels in your retina, analyze your emotional state based on other biometric data, tailoring the ads with your name and a highly intrusive, personalized pitch. | |
Billboards themselves have a troubled, unique status among advertising media. They are the only medium you cannot turn off. David Ogilvy hated them because they interrupt without consent, without any value transferred to the individual. “Man is at his vilest when he erects a billboard”, he said, in Ogilvy on Advertising. He went on to say: | |
Howard Gossage felt the same way. “What is the difference” he wrote, “between seeing an ad on a billboard and seeing an ad in a magazine? The answer, in a word, is permission — or, in three words, freedom of choice.” | |
We can see the impact that the movie has had on the collective imagination of the industry by a casual scan of the innumerable articles that reference it: | |
The development being discussed in all of these pieces is the dynamic personalization of advertising, the topic under consideration. | |
Well, almost. | |
In order to frame this discussion, we need to separate marketing from one of its Ps: promotion, specifically advertising. The marketing function has been diminished: from establishing the right mix of products, pricing and place, it has come to largely oversee the ever increasing complexities of promotion. This has conflated our understanding of marketing. Opt-in personalization of products and services can provide new value opportunities for customers and companies, but ‘personal advertising’ misunderstands how advertising works, and even what it is. | 
The Age of Personalization, in the form of the mass customization of products and services, has been in effect since the 1990s and has been highly successful across innumerable categories. | |
Jonah Peretti, founder of Buzzfeed, had his first taste of “viral” fame when he published letters from Nike refusing to print the word “sweatshop” on custom order shoes in 2001. | |
Nike went on to launch a bespoke design store concept around NIKEiD. | |
Levi Strauss pioneered the idea even earlier, back in 1994: | |
“its Original Spin jeans for women, measured customers in its stores and sent their details electronically to its factory. The customised jeans were then cut electronically and mailed to the customer.” | |
The great innovation of Dell was to let customers personalize the specifications of their computers; most car brand websites now let you “build your own” car. And so on. | |
As the economy shifts from products to experiences, consumers want personalized ones. | |
Marketing, in its broadest sense, embraced the age of personalization decades ago. | |
The difference today is the non-consensual collection of personal data with which to target and personalize advertising, or, “in a word, permission — or, in three words, freedom of choice.” | |
The online advertising industry is based on a number of interlocking elements, which have conspired to diminish the experience for the user, abusing their privacy, patience and purse, increasing the advertising load per piece of content, whilst offering no additional value in return. This is what has triggered the rise of ad-blocking. | |
The key elements that make online advertising like the dystopia of Minority Report are the decoupling of media and advertising, through ad serving networks, and the invasive tracking that accompanies this. | |
The first digital advertising banners were served by media companies themselves, but this changed due to what Cory Doctorow has called the ‘intrinsically toxic relationship between the three parties to the advertising ecosystem: advertisers, publishers and readers’. The lack of trust between publishers and advertisers created a new market segment, ad-tech, which most importantly includes third-party trackers. | |
The tracker is an independent company that follows you and the ads around the internet. | |
There is so much tracking code added to webpages that the amount of data downloaded when you pull up a single article from The Atlantic, for example, would take 15 old floppy discs to hold. All of this slows down the mobile web and costs you money in data charges. Our personal data are being bought and sold dozens of times a second on every website we visit, which has broken the traditional value exchange of advertising and pissed everyone off, especially in the age of PRISM, which uses this technology. | |
According to a technical analysis by Rob Leathern, when you load the mobile version of the New York Post, your full IP address is given out nearly 300 times as dozens of tracking networks exchange information about you. All in the name of personalized advertising, which no one actually wants. | |
The argument used to justify this is relevance. Any sufficiently relevant advertising is indistinguishable from news, as the adage goes. But this evades the issue of consent. When Amazon uses a collaborative filtering algorithm to show me books that people who buy the books I have bought also buy, there is an explicit relationship between me and the advertiser. It uses my personal data to create recommendations, often good ones, because it works inside a category or set of categories I shop, with consensual access to my prior behavior. | |
There is no such relationship between me and the hundreds of data brokers and ad networks. In Europe, websites are required to inform users that they use cookies, which is at least something, but since few understand what cookies are or how they are being used, it could be compared to obtaining consent from a drunk person. | |
It’s the “creepy Minority Report” style ads that drive ad blocking: Personalized targeting that is obvious and visible to users, especially retargeting. Retargeting uses your web browsing habits to personalize the ads that you see, and people hate it in principle — and even more so when they understand how it is done. | |
The lack of consent and the incredible increase in advertising load have created a tragedy of the commons: the scarce resource of human attention has been abused and is in danger of collapse. | |
If you want to take back control, to opt-out of tracking, on Hulu for example, here are the companies you have to visit, individually: | |
It was not the advent of online advertising that triggered the rise in ad-blocking, but the rise of invasive personalized targeting. | 
What’s more, people’s acceptance of the value exchange of advertising, that it pays for or subsidizes content in return for access to your attention, degrades in parallel with how much digital advertising they consume. It starts with ad blocking, but then it seems to make sense to ban all outdoor advertising for the same reason, as has now happened in São Paulo and Chennai, in what has been called a ‘global movement to ban billboards’. | |
Complete attention collapse is possible both online and outdoor, which are arguably the last remaining broadcast media. | |
The rejection of personalized digital advertising has made the industry a laughing stock, at a time when we are struggling to win the war for talent against the more culturally salient technology industry. | |
[Note: the widget in the bottom right is created by a tracking [not ad] blocker called Ghostery — it shows how many companies are tracking my behavior on the AdWeek site. It’s a lot.] | 
These arguments could be challenged if personalized advertising was demonstrably more effective, but it’s not. Like too many “rules” in advertising, it’s based on a prevailing set of ‘common sense’ principles, carved in stone, not in data. | |
It’s like the car marketer who thinks that test drives lead to purchases and so uses much of its communication budget to incentivize or simulate test drives. Test drive incentives have become “a fun way to get a little stay at home mom income” and the sheer volume of virtual test drives created each year is staggering. | |
[ e.g. Volvo and Google: http://www.fastcocreate.com/3038560/take-a-virtual-test-drive-with-volvo-and-google-cardboard or BMW Virtual Test Drive App http://www.carmagazine.co.uk/car-news/first-official-pictures/bmw/bmw-i3-virtual-test-drive-app-2013-car-review/] | |
Yet this is backwards. People who are going to buy cars take test drives, but making someone take a test drive will not make them buy a car. It’s a misunderstanding, created by the fiction of the marketing funnel, which suggests the wrong direction of causality. | |
Equally, pestering people who have previously expressed some kind of interest in a product, whilst seemingly smart, is specious. It understands advertising backwards. We don’t see an ad and then run to buy something; you cannot annoy someone into a purchase. We may, though, buy a certain something, at some point, when we have a relevant need, because we all saw an ad. | 
And, often even when we have bought something, we continue to get that thing advertised to us, based again on the seemingly sensible but often incorrect inference that this makes us a more likely purchaser of said product. | |
“It seems obvious, a brand’s currently heaviest buyers generate more sales and profits (per customer) so they should be the primary target for marketing.” | |
This is a misconception. | |
“If our aim is to grow sales then efforts should be directed at those most likely to increase their buying as a result of advertising. It takes only a moment of thought to realise that customers who already buy our brand frequently are going to be difficult to nudge even higher.” | |
Let’s take Coca-Cola as an example. Their average buy rate is 12 per year but most people buy only one or two Coca-Colas per year. | |
They do have some heavy buyers, people buying three Cokes a day, but very few of them. We can see that the bulk of customers were non or light buyers, getting just one or two Cokes a year. Though light buyers may be purchasing only one or two Coca-Colas a year, they collectively add up because there are so many of them. | 
Two decades of data from Nielsen and the IPA show similar patterns across a range of household products, consumer financial services, and cars. | |
Your name is the most beautiful sound in the world. You hear it over the conversation at cocktail parties because it’s hugely important to you. | 
[Hi *insert name here*!] | |
But that doesn’t mean it should appear in ads, because this turns a cultural stimulus into a personal exhortation. That’s what I mean by saying personalized advertising is an oxymoron. Advertising with my name in it is selling, it’s direct marketing, and that’s not what advertising does. Rather it makes things sellable. It’s a confusion of tasks, a problem for communication strategy to untangle by understanding the appropriate roles for different channels. | |
Seeing this ad on Facebook with my name all over it, for example, pulls me out of a social, cultural space and into a defensive mode of being sold to. | |
A more successful way of utilizing the power of names is Coca-Cola’s Share a Coke campaign. Beginning in Australia, and now globally, the advertising uses generic first names to speak to personalization, but the service allows you to personalize the products specifically, at a significant increase in cost, that people have been happy to pay. | |
The snake oil sold to the advertising industry by the adtech industry was the idea of targeted scale, in essence the solution to Wanamaker’s old shibboleth, that half the money spent on advertising is wasted, but that we didn’t know which half. | 
But this too is wrong. | |
As Tim Ambler has pointed out, “advertisers should seek to increase wastage”. It is the cultural visibility and associated expense of advertising that creates a “reliable signal” of quality, not the claims made in them, which are too often exposed to be hollow by huge trust violations such as VW’s. | |
The reliability of the signal is in direct proportion to the perceived wastage, so highly targeted personalized advertising is not fit for the purpose. Personalized ads make it impossible to distinguish between a reliable signal and a cold call trying to sell me something, and no one likes cold calls. | |
Charles Vallance of VCCP agrees: | |
Advertising is most effective not when it interrupts the right person at the moment of truth with the right message, but rather as a cultural and social stimulus to create mental availability, which is then (later) accessed by users when they come to buy. | |
We need to re-focus advertising on what it is best at, building brands and imbuing them with emotion, which create price inelasticity of demand for products and services, allowing strong brands to command significant price premiums over time. | |
Forward | |
The infrastructure and assumptions of online, personalized advertising are fundamentally wrong and have brought us to the brink of attention collapse. To solve this, advertising must re-engage at the cultural level, operate with consent, consider how it creates mass visibility and consumer acceptance, through a more balanced value equation. Brands should stop bleeding their budgets to death through endless ad-tech intermediaries in the quest for questionable impressions and concentrate on big, brand building creative ideas that can be deployed across channels, content engines like Red Bull’s Stratos. | |
As Nils Leonard, former CCO of Grey London wrote: | |
They also won’t play out inside retargeted banners. The mass customization of products and services is, and has been proven to be, the appropriate marketing response to the age of personalization, not personalized targeted advertising. | 
Minority Report was a paranoiac dystopian vision of a fascist surveillance state, where your every move is tracked and you can be arrested for crimes you have not yet committed, and the advertising it highlights is a reflection of that. So let’s please not keep using it as a roadmap. | |
Follow Context to stay abreast of the entire year-end feature and upcoming original stories from industry leaders in 2017. | |
Adapted from an essay originally published in ADMAP, as a commended entry to the essay prize Feb 2016.","12" | |
"francesc","https://medium.com/google-cloud/analyzing-go-code-with-bigquery-485c70c3b451","10","{Golang,Programming,""Big Data"",Bigquery,Github}","119","7.30660377358491","Analyzing Go code with BigQuery","","2" | |
"giorgialupi","https://medium.com/accurat-studio/sketching-with-data-opens-the-mind-s-eye-92d78554565","12","{""Data Visualization"",Drawing,Design}","118","5.83679245283019","Sketching with Data Opens the Mind’s Eye","When does drawing become design? When does design become a story? | |
(An edited version of this article appeared on the Data Points blog, National Geographic on July 2014) | |
The visual representation of information plays an increasingly critical role in every situation where data and quantitative information need to be translated into more digestible stories, both for the general public and for professionals who need to make sense out of numbers. | |
For many readers, the word “data-visualization” might be associated with heavy programming skills, complex software and huge numbers for the most part, but, believe it or not, lots of data visualization designers use old-fashioned sketching and drawing techniques on paper as their primary design tool: they sketch with data to understand what is in the numbers and how to organize those quantities in a visual way to gain meaning from them. | 
In Infographic Designers’ Sketchbooks, authors Steven Heller and Rick Landers offer a fascinating collection of “behind the scenes” on the creation of infographics and data visualization, featuring preliminary drawings, unfinished mock-ups and intermediate prototypes from data artists and information designers from all over the world. | |
As Heller observes in the introduction “Making enticingly accurate infographics requires more than a computer drafting program or cut-and-paste template, the art of information display is every bit as artful as any other type of design or illustration, with the notable exception that it must tell a factual or linear story”. | |
The fascination for what lies behind any creative process is no new thing, and the book is a terrific anthology of beautiful examples from international artists: the finished works are presented alongside the designers’ working sketches; but when exactly does drawing become design? What does sketching do to a designer’s mind and how does it affect the process of working with data? | 
I am an information designer myself. I have been trained as an architect, but soon after my graduation I started working with information in a visual way. | |
In my daily job I am the design director at Accurat, a data-driven design firm, and although we mostly create digital experiences with data, I still use drawing as my primary form of understanding and as my most important tool for designing every time. | |
Generally speaking, drawing plays an important role in the production and communication of knowledge, and in the formation of new ideas. | |
For many, it’s the principal method to uncover meaning in the things they analyze visually, to rationalize what their perception and senses tell them. | |
To this regard, and specifically for data visualization design, I’ve been asked an interesting question during a podcast episode of Data Stories: | |
“How about the real data? When do the actual data come to the table in the process of sketching a data visualization?” | |
I can recognize and name three main phases where I would use drawing for different purposes: | |
In the very first phase of any data exploration, I am mainly interested in the overall organization of the information, in the macro categories of the data we are analyzing (i.e. the kind of topics we are investigating, the possible correlations and the number of elements we are working with: whether there are fifty or five hundred). I rarely use real data here; I am just giving shape to the first visual possibilities for the “architecture” of the visualization while having the data in mind. | 
In a second phase I would then focus on individual elements: the data points, to figure out which shapes, colors, features we might adopt or invent to better represent them, according to the type of variables we have. | 
This stage is fundamental to sum up every choice we’ve made along the way, before even going digital. | 
To conclude, we would generally have a final phase when, after different digital tests with the actual numbers, we would prototype our visualization either on paper or digitally to have everything in the same place: here drawing facilitates the communication process among designers, team-mates and clients. | 
Specifically for this visualization I have been working with visual designer Michele Graffieti; and sketching and drawing with data both analogically and digitally, has been the foundation for the whole piece. We would go back and forth several times; both of us sketching on paper and sketching digitally. And lots of the explorations have been fundamental for opening questions to the data itself. | |
So, sketching with data is not only a personal practice; for a lot of designers it is the way to go, the way to deal with data, their communication tool. | 
In designing data visualizations, a very common and sometimes misleading approach is to start from what the tools you have can easily create, and maybe also from what we — as designers — feel more comfortable in doing with these tools, but this can lead to adopting the easiest, and maybe not the best, solution to represent important aspects of the information. | 
By contrast, when I am sketching to explore the dimensions of a dataset, I don’t have access to the actual data with my pen and paper, but only to its logical organization, and this is an invaluable asset to focus on the meaning of the information, and not on numbers out of context. | 
Drawing is a primary form of understanding reality and expressing thoughts and ideas. | |
Drawing, in any practice, helps to freely navigate possibilities and to visually think without limitations and boundaries, it allows connections to be made, it opens mental spaces. | |
The original version of the article appeared on the National Geographic Data Points blog in July 2015. | |
If you liked this post, complement it with Learning to See, Visual Inspiration and Data Visualization, Beautiful Reasons, Engaging Aesthetics for Data Narratives, and with The Architecture of a Data Visualization, Multilayered Storytelling through Info Spatial Compositions | |
Accurat is a data-driven research, design and innovation firm. We help our clients understand, communicate and leverage their data assets through static and interactive data visualizations, interfaces and analytical tools that provide comprehension, insight and engagement. We have offices in Milan and New York.","3" | 
"kevinbeacham","https://medium.com/cuepoint/analysis-of-the-most-successful-labels-in-hip-hop-chart-e264dddf996a","15","{Music,""Hip Hop"",""Data Visualization""}","118","35.1905660377358","The Most Successful Labels in Hip-Hop: A Detailed Analysis","I wrote this in support of “The Most Successful Labels In Hip Hop” project that was done in collaboration with Matt Daniels with myself and Skye Rossi over at Rhymesayers. http://poly-graph.co/labels/ | |
When Matt Daniels first shared this Billboard Data and some of his initial discoveries with me, I was very intrigued to see what digging deeper would find. In my excitement over the project and the information it presented, as well as the theories it helped form or further shape, I wrote a lengthy analysis of the data, which you can explore in full below. — Kevin Beacham | 
As the story goes, in 1979 when rap music first became a marketable commodity via the 12” single, there was arguably as much resistance as there was celebration. Within the hip-hop culture where the music emerged, there was a sense of dismissal by many aimed at the first two rap records, as they were recorded by artists that weren’t recognized as being part of the heart and/or start of the culture; Fatback Band “King Tim III” and Sugarhill Gang “Rapper’s Delight” respectively. Plus, within the industry, many musicians, music executives, media outlets and parents in general considered rap to be noise, which was being made by kids with no musical training, who were just talking to music. Early reports suggested rap would be a fad that would likely burn out as quickly as, if not quicker than, disco. However, “Rapper’s Delight” sold enough records and received enough radio play that some of those non-believers who were in positions of power — and surely some believers as well — jumped at the opportunity to at least make some fast cash. And with that, the rap record trend caught on pretty fast. | 
One of the problems with determining sales and radio charting stats on the earliest rap records is that the industry at large didn’t believe that these records were going to sell enough to justify paying attention to. And while that may have been true for the majority of very early rap releases, the few that snuck through and became hits were largely unexpected and for the most part, poorly documented. It took some time for the industry to fully embrace the idea that rap could infiltrate the mainstream and an even longer time to trust that it had staying power. That said, between the years of 1979-85, rap was focused on the 12” single. A handful of groups did albums, but those releases were arguably not rap albums. Artists like Kurtis Blow, the Sugarhill Gang and the Sequence, all who did albums in that time period, would generally have a couple rap songs on the album, which were usually the singles, and the rest would be filled with R&B songs and ballads. I suspect this was the action of record labels trying to appeal to their older, built in R&B fan bases, as well as grooming these rap artists to be ready for R&B careers when rap died off after a couple of years, as they fully expected to happen. Evidence suggests that it was the Russell Simmons and Rick Rubin led Rush Management and Def Jam Records that created the blueprint for the rap album, first with the successful self-titled debut album from Run-DMC on Profile Records, releasing in March of 1984. It was arguably the first proper full rap album, shortly followed by the Fat Boys’ debut on Sutra Records (May 1984) and later in the year by Whodini Escape on Jive Records (Oct 1984), followed by LL Cool J’s Radio in 1985. | |
However, even before the arrival of the official rap album, rap artists were enjoying some pop-like success. Artists like Grandmaster Flash & The Furious Five and Kurtis Blow were touring with some of the hottest R&B, funk and reggae bands of the time, such as Rick James, Cameo, and Bob Marley. Meanwhile, the Funky Four Plus One More broke rap into the TV market with their now-classic Saturday Night Live appearance on Valentine’s Day, 1981, courtesy a personal invite from Deborah Harry of Blondie. Plus some early rap songs found their way onto Billboard charts, such as the Billboard Hot 100 and Hot Black Singles (now known as the “Hot R&B/Hip Hop Songs” chart). But rap still had a long way to go before radio stopped treating it as a novelty and sales charts recognized it as a powerhouse, realizing its potential to make a mark on the pop charts. | |
Ultimately, that’s the basis of this study, taking a look at the history of Billboard’s Rap singles chart, which is called the “Hot Rap Songs” chart, originally known as the “Hot Rap Tracks” and then the “Hot Rap Singles” chart. I will refer to it by its current name throughout. In many ways, the evolution of this chart lends to the idea of looking at rap’s potential and successes as pop music, sometimes exclusively and other times while it is still rooted in and embraced by hip-hop culture. The data for this study is based on 26 years of the “Hot Rap Songs” Billboard chart. While record sales and chart positions, or lack thereof, don’t necessarily reflect the full scope of the artistry and creativity within hip-hop culture, those things were and/or are still often instrumental in effecting those very factors, as artists seek to try and remain relevant, chase a hit, or just hold on to a record deal. Immediately upon glancing at this data some things jumped out at me, many of which gave credence to theories I’ve long had in my head, but without any data to support it, until now. But there were also a handful of things that caught me off guard or that I wasn’t fully expecting, specifically in terms of the magnitude of what songs/labels started to dominate the charts over time. | |
Billboard launched its “Hot Rap Songs” chart in 1989. According to their website, which has a pretty comprehensive history of their charts, the first recorded “Hot Rap Songs” chart is on March 11, 1989 with the KRS One led project, Stop The Violence Movement’s “Self Destruction” in the top position. I suppose the most immediate question that came to my mind was, what was it about 1989 that caused Billboard to give rap its own chart? As I looked through extensive lists of rap records chronologically between 1979-88, one logical factor took precedence over any other theories. But that idea was the result of a few different paths merging into the same road. From 1979-81 rap records were made like most any other record at the time. Either the artist had their own band or the record label they were affiliated with had a crop of session musicians whose job it was to fill this role. It was the latter case that most applied to rap records in this time period. Furthermore, these session musicians were largely tasked with the job of reimagining or duplicating the sounds of the popular records that you might hear the DJ play at a hip-hop party, records such as Chic “Good Times,” Tom Tom Club “Genius Of Love,” Queen “Another One Bites The Dust,” or Taanya Gardner “Heartbeat.” This fits with the idea of record labels likely seeing rap as a way to use the MCs to reel in a young audience, while the R&B musical backdrops might intrigue their older crowd, including the R&B radio stations for airplay. | |
However, a significant change in sound came with the introduction of the programmable drum machine in the early 80s, courtesy of the LM-1, Linn Drum, Oberheim DMX, and the Roland TR-808. By 1982, a dramatic change could be heard taking place in the sound of rap records. There was also a notable widening in the range of song tempos, from slower to faster, beyond what had been the norm. Also in ’82, thanks largely to Grandmaster Flash, the sound of the turntable scratch made its way onto rap records. That abrasive sound, along with the gritty beats of these early drum machines, led to MCs altering their vocal tones to fit the new style. Additionally, the new tempos, textures, and tones opened the doors to explore new subject matter or shift focus to topics that had only been lightly touched on before, but got lost in the ambiance of the party rap sound that had previously dominated. | |
These changes essentially were a few steps in the right direction of putting the sound of hip-hop back in the hand of the artists, at least in many cases. Programming a drum machine, letting the DJ scratch and adding some basic keyboard playing skills gave the opportunity to be more self-contained than depending on a band. Perhaps the main drawback in that evolved formula is my final point. Not every producer at the time had the musical talents of Ted Currier — who produced the Boogie Boys — or Larry Smith, who was responsible for the sophisticated sound of Whodini’s second record, Escape. In contrast, once rap artists moved further away from using seasoned studio musicians, a lot of the keyboard playing or other musicianship on the songs was often rudimentary. One can only guess that is what led to rap songs adopting riffs from show tunes and themes. The show tunes were catchy, generally easy to play and gave the listener a ring of familiarity to draw them in. But at the same time it took a step back in re-labeling rap music as a novelty. Between the years of 1985-87, rap records started to become sprinkled and then plagued with the sounds of many cartoons and TV shows, including Gilligan’s Island, The Pink Panther, I Dream Of Jeannie, Batman, and of course, Inspector Gadget. | |
Brooklyn’s Bad Boys released their self-titled single in 1985 using the Inspector Gadget theme music — creating a huge underground hit in the process — and then not long after Doug E. Fresh and the Get Fresh Crew (featuring the debut of Slick Rick) dropped “The Show” making use of the same music and also producing a hit. In fact, Bizzie of the Busy Boys told me that when their single first released they were doing shows with Doug E. Fresh. They were performing “Bad Boys” and Doug & The Get Fresh were performing “The Show,” but it didn’t have the Inspector Gadget music in it yet. He claims that’s something Fresh added later when “The Show” was officially released as a single. | |
The next couple of years saw artists and labels trying to chase a hit of their own by utilizing the same formula, but with extremely limited results. Pulling in the opposite direction, right around this same time in 1984-86, you had the emergence of MCs who were beginning to push the limits of the art form. Artists such as T La Rock, Kool G Rap, Big Daddy Kane, KRS One, Rakim and a handful of others were starting to redefine what MCing and lyricism could be. Also of note, at this same time you saw the shift in popularity from large groups to solo artists. One reason that is important to the development of lyricism is that the large groups of the past limited microphone time for every MC. It also meant there was a need to create songs that fit within the ideas of the group. If you go back and listen to live shows from the time right before this shift, you’ll find MCs like Treacherous Three, Grandmaster Caz, Masterdon Committee and others doing some of their technically best lyricism on stage, displaying skill levels they rarely expressed on their records, which were usually concept-based songs. So evidence exists that it’s not as if the early generation lacked the skills, but it’s arguably that being in large groups, as well as following record labels’ plans to mimic R&B records, suppressed their ability to fully showcase it. | |
1986 also saw the rise of sampling technology, which helped move rap music from the gimmicky sounds of show tunes. Producers like Marley Marl, Ced Gee, Hurby Luv Bug, and Paul C were shaping the future of what rap music would sound like. Additionally I would also at least partially credit the sound of sampled production for helping evolve the styles and writing of the MCs, the same way drum machines did just a few years earlier. The more exciting and refined the music became, the more sophisticated the accompanying lyricism was in many cases. | |
That leads us to the merging of these different paths. By 1986-87 the industry had to rethink the idea of rap dying off as a fad. The prediction of rap’s demise had come and gone. Thanks to changes in the music inspired by the programmable drum machine and even further by the evolution in sampling technology — with equipment like the SP-1200 and Akai S900 — the sound of rap evolved and became more refined. All of that led to better music, which is why most consider the start of the Golden Era of Hip Hop to coincide right around the advent of sampling as a new standard. And what did all of that lead to? Bigger hits more often, all coming to a head around 1989. | |
Notable breakthrough successes came in 1986 with the Beastie Boys’ Licensed To Ill and Run-DMC’s Raising Hell albums. I would guess if it hadn’t been a consideration before then, that might have been the time when Billboard started to think about where rap fit in on their charting system, particularly with the commercial success of Beastie Boys’ “Fight For Your Right To Party” and Run-DMC’s “Walk This Way”. In particular, Licensed To Ill is noted as the first rap album to reach #1 on the Billboard album charts. From there, the rest of the 80s would continue to see an increase of the number of rap songs that would become pop hits of varying degrees. Salt-N-Pepa dropped their debut album in 1986, but it started to pick up steam in ‘87 with “Push It” taking off. Jazzy Jeff and the Fresh Prince hit the scene in a big way in ‘86 with the rising popularity of “Girls Ain’t Nothing But Trouble.” Plus 1987 gave us LL Cool J “I Need Love” and Kool Moe Dee “How Ya Like Me Now,” two rivals with two of the bigger rap hits of the year. 1988 saw an increase in big hits with Rob Base & DJ EZ Rock “It Takes Two,” L’Trimm “Cars With the Boom,” JJ Fad “Supersonic,” DJ Jazzy Jeff & the Fresh Prince “Parents Just Don’t Understand” and LL Cool J “Going Back to Cali” (released in ’87). | |
But there are a few artists with 1988 releases that are probably most responsible for the explosion of pop rap. In September of ‘88 MC Hammer dropped his second album, Let’s Get It Started on Capitol Records, after releasing his debut album independently. Let’s Get It Started produced the hit song “Turn This Mutha Out,” setting Hammer on a path to go 10X platinum with his follow up album, Please Hammer, Don’t Hurt ‘Em in 1990. LA’s Tone Loc released “Wild Thing” (as a B-side actually) in 1988 and it exploded as a pop hit. He quickly followed that up with another sure shot chart climber, “Funky Cold Medina.” Both of those tracks were co-written by a label mate of his by the name of Young MC. Young MC was signed to Delicious Vinyl in ‘87 and first dropped a couple singles showing off his fast raps, punchlines and humor, before dropping massive hit, “Bust A Move,” which earned him a Grammy, an American Music Award, and Billboard Award for the Best New Pop Artist. And while those are likely the biggest Pop Rap hits of ‘89, there was also notable chart action on Digital Underground “Humpty Dance,” De La Soul “Me Myself & I,” Heavy D & the Boyz “Somebody For Me,” and LL Cool J “I’m the Type Of Guy.” | |
Additionally, at this same time you had MTV becoming more and more supportive of rap music videos, which led to the launch of an all-rap music video show, Yo! MTV Raps, debuting in August of 1988. BET followed suit by launching their all-rap music video program, Rap City, exactly a year later in August of ’89. Both shows were instrumental in exposing rap music to a wider audience and helping increase sales of rap albums. All of those factors point to 1989 being a perfect year to launch a rap chart. | |
First and foremost, I’d like to take just another second to point out that the first song to hit #1 on the “Hot Rap Songs” chart was BDP’s “Self Destruction,” a positive song that assembled an all-star cast of MCs to speak out about violence in the community, racism, and black-on-black crime. It’s certainly one of the rare times that a song focused on such a positive message sat at #1. In the #2 slot of that first chart, you have Tone Loc’s “Wild Thing,” followed by Boogie Down Productions’ “Jack of Spades,” so that’s two BDP tracks in the top 3. The rest of the chart is a nice mix of East and West Coast hits with Kid N Play, MC Hammer, Rob Base, Too Short, Slick Rick, Ice T, Cash Money & Marvelous, N.W.A, Eric B & Rakim, K-9 Posse, and Mamado & She sharing in the spotlight. | |
Looking at that first chart and the 1988-89 songs that were the likely key hits to lead to the existence of this chart, something of note comes to mind. It wasn’t that rap’s time had come and therefore the hits just started to pour in automatically. Sampling technology was a huge part of this growth and with it the rise of the rap producer as a more important figure. Previous to sampling, knowing who the producer was for a rap record didn’t mean the same thing. Furthermore, most rap records didn’t really tell you who was behind the music. The credit usually went to whoever owned the label and/or paid the studio session bills. Sampling caused a shift in that, yet some labels still failed to give the actual producers credit, but eventually that practice started to take place. When you look at many of those early rap songs that became hits after sampling came into prominence, a lot of those producers have something in common; they were young, loved hip-hop, had a musical background, and they knew and understood how to make a pop record. | |
In contrast, producers like Ced Gee and Paul C were technicians, interested in being as innovative as possible with sampling and sound engineering. It doesn’t seem like they were trying to make pop hits by any means. Pioneering Juice Crew producer Marley Marl had experience as a studio engineer and a mobile DJ, so he had a keen understanding of music and a great ear. But he was still producing a lot of his earliest records in his Queensbridge apartment with the primary goal to sound far removed from party rap records of years before. Marley wasn’t much of a fan of how those records sounded, he saw drum machines and specifically sampling as a way to give rap the sound it was supposed to have. It was a sound that matched with what he heard when a DJ was cutting up two raw, stripped-down break beats at the park jams, as opposed to the live jam band sound that rap had grown into in the early 1980s. Of course later on Marley Marl chose to tap into his ability to mix a gritty sound with his musical understanding. When he did, he finally started making the biggest hits of his career. | |
However producers like Hurby Luv, Teddy Riley and the L.A. Posse had a great musical sense and understanding of pop music and the proof is in the charts. Hurby Luv Bug — who never really gets his deserved props for his innovative mix of musicality and creative programming — produced hits for Kid N Play, Salt N Pepa, Dana Dane, Sweet Tee, and Kwame. Not many producers of the time had his level of pop success with such a variety of artists. Teddy Riley, spent the mid to late 80s giving many underground rap songs a unique sound that was somewhat rugged, but also sophisticated, producing a lot of those early Kool Moe Dee hits, before pretty much helping reinvent R&B in the late 80s/early 90s with new jack swing. | |
There is one non-rap song on the first “Hot Rap Songs” chart, Milli Vanilli’s “Girl You Know It’s True” and that also speaks to some industry influence on the definition of rap. Soon we started to see a lot of R&B music that used the hip-hop look and/or sound to their music labeled as rap. At that time, a lot of hip-hop fans rebelled against the idea, myself included. It seemed offensive to call most of these things rap, even if you enjoyed them for what they were; R&B or dance records. But, it became standard. Almost immediately artists like Technotronic, C&C Music Factory, TLC, Bell Biv Devoe, and many more got labeled as rap. Looking back, it was bound to happen. As rap got more and more comfortable sampling pop songs and producing rap records with R&B sensibilities, it was only a matter of time before that resulted in R&B music doing just the opposite; using the rising popularity of rap music to widen their appeal and audience. Simultaneously as many rap artists tried to be more R&B, many R&B artists —both young talent and R&B legends — tried to sound and look more hip-hop. Both sides had the goal of making bigger hits and ultimately that formula worked. | |
“Self Destruction” stayed #1 on the chart for five weeks in a row, which makes a KRS quote from his following album seem a bit out of place, at least in some regards. On “House Ni**a’s” from his Edutainment album (1990) he says: | |
Granted, his statement is addressing some real concerns that artists and even some fans were having at the time. Although “Self Destruction” did chart at #1, it was specifically on the Billboard Rap chart. It appears it didn’t see any placement on any of their other charts, which I’m guessing is at least partially what KRS One is speaking of. Although artists such as Tone Loc, Milli Vanilli, and MC Hammer were lower on this rap chart, they were all seeing positioning for their songs on other Billboard charts as well, getting bigger attention overall in the grand scheme of things. Additionally, over those five weeks of “Self Destruction” at being at #1, you slowly start to see a shift in the diversity of the chart. It’s not dramatic, but six weeks later, on the May 20th 1989 chart when De La Soul’s “Me Myself And I” becomes #1 and knocks “Self Destruction” to #2, the chart is dominated by songs that were made with pop sensibilities; strong hooks, danceable beats, and playful lyrics. Besides “Self Destruction,” the only track standing out as clearly set apart from that formula is Public Enemy’s “Black Steel In The Hour Of Chaos,” which is pretty much the antithesis of all those things. It’s safe to say that formulaic trend stayed the course and the diversity of the type of artists and sounds that filled it became lacking. Eventually, one could look at the Billboard charts and get an idea of what the formula was to have a better chance at making a hit. That brings two things to mind; 1) How did/does record label influence affect charting? and 2) Is content the only charting obstacle for more positive-minded or culture based records? | |
For the first question, I think looking at “Self Destruction” one last time is a perfect study. What put this song at #1? For Hip Hop fans that answer can be easy; it was a great song. But obviously that alone doesn’t put songs on the top of charts or make them sell millions of records. Certainly one part of that record’s success is the star power involved. You had BDP (KRS One, Just-Ice, D-Nice, Ms. Melodie), Kool Moe Dee, Public Enemy, Heavy D, MC Lyte, Doug E Fresh, and Stetsasonic, who all had a great buzz at that time and enjoyed some hits of their own, some big, some small. Along with the artists involved, you have an additional layer to the story, they were also donating all proceeds from the record to the National Urban League in an effort to help programs that would target focusing on “Black on Black crime and youth education.” That gave it a story that the media could latch onto and help create talking points for radio, press, etc. Also, the record followed some basic pop principles. The root part of the beat has a danceable, bouncy sound and there is a strong hook. Taking a quick glance thru time, the positive-minded songs that followed that formula were the ones that were more likely to find some chart position, which is true from Public Enemy’s “Fight The Power” up to Kanye West’s “Jesus Walks” and beyond. Though it’s also historically true that when many of those same artists that had chart success with positive-minded songs, they actually found their biggest hits with songs that were not focused on thought-provoking topics. This is true for most charting rap songs in general. | |
What does seem clear is that labels and many artists look at charts and record sales and use that information when trying to shoot for commercial success. The evolution of the chart over those few short weeks and how it continued to evolve is proof of that. In 1989, major labels were still figuring out which type of artists to sign and invest the most money in. There was also one noticeable game of perception that major labels seemed to be playing throughout the late 80s and early 90s. Labels who were not really known for hip-hop acts, started to launch rap departments, signing a bunch of artists and buying ads in rap magazines. For instance, Atlantic launched their Atlantic Street department and signed a handful of talented rap artists. Prior to that, Atlantic’s release of rap records was extremely limited and mostly underwhelming and gimmicky. Their first real go at it was partnering with First Priority Music in 1987, to be the parent label for releases by Audio Two, MC Lyte and others. Then in 1988, Atlantic signed Miami Bass female duo L’Trimm, who had a hit with “Cars That Go Boom.” The label also formed another partnership with Ruthless Records to release JJ Fad’s massive single and album Supersonic. | |
I’m guessing the just-above underground success of Audio Two’s “Top Billin’” and MC Lyte’s debut album hype — combined with the commercial success of L’Trimm and JJ Fad — greenlit the funds for their Atlantic Street department. The use of the word “Street” in their marketing campaign suggested that they were trying to appeal to the hardcore hip-hop fans or something to that effect. Atlantic Street did a rather nice job with their initial signings, starting off with a good balance of artists, who all released debut albums in 1989; Breeze (of L.A. Posse), Craig G (of the Juice Crew), Kwame, the W.I.S.E Guyz, The D.O.C and Cool C. I would say that most all of those artists were great lyricists who also had impressive flows. They were the type of artists that most hardcore rap fans would love or at least appreciate. But they also worked with producers with the ability to make music that was accessible to a broader base. Not fully pop, but all of them either released an underground hit or certainly had the potential to. | |
The exception to most of those rules is Cool C. He wasn’t the greatest lyricist and didn’t have the degree of flow possessed by the others, but he had a unique voice and was a good songwriter. It also helped that Cool C was affiliated with Jive Records’ Steady B and his Hilltop Hustlers crew. Those signings show that whoever was running the A&R for the rap department seemed to really know what they were doing (Shout Out to Sylvia Rhone, Merlin Bobb, Darryl Musgrove). Their signings were strategic for many reasons, but for one they were essentially introducing a crop of new talented artists to the world or at least greatly broadening their potential fan base. But all of the signings were connected to a proven formula that Atlantic could bank on. Breeze & The W.I.S.E Guyz came under a production deal for the L.A. Posse (Muffla & Big Dad) who had produced the groundbreaking album for LL Cool J, Bigger And Deffer, so they were some of the most buzzworthy rap producers at the time. Craig G was produced by Marley Marl and Kwame was produced by Hurby Luv Bug, both hot producers on the East Coast. The D.O.C was part of the partnership Atlantic had with Ruthless Records, which already produced the JJ Fad success story. N.W.A was steadily on the rise at the time, so a full D.O.C. album produced by Dr. Dre was a no-brainer. | |
In many regards, it was true that Atlantic was appealing to the “streets,” but they were also banking on some big record sales as well. Atlantic continued to sign more rap artists in 1990, adding Yo Yo, Rodney O and Joe Cooley, Dangerous Dame, Doug Lazy, and K Solo to the roster. Again we can see a similar trend in strategy, but it definitely seems to lean more to trying to capitalize on the success of N.W.A and Too Short’s Jive Records debut, Born To Mack. West coast projects like the Ice Cube led project by Yo Yo, the signing of Compton’s Rodney O & Joe Cooley and Bay Area artist Dangerous Dame, who later wrote “Short But Funky” for Too Short. As for Doug Lazy, his album focused on the hip-house trend, which was having some fringe success at the time while EPMD affiliate K Solo piggybacked the Hit Squad’s buzz. Although Atlantic continued to sign new rap acts through the mid 90s and beyond, it seems their focus got less targeted, though they definitely hit a few short-term jackpots along the way with Das Efx, Snow (“Informer”) and Marky Mark and the Funky Bunch. They also released some underground records by Original Flavor, The Poetess, The Future Sound, Hard To Obtain, Kap Kirk, Yomo & Maulkie, etc. This is just one example of how major labels were rethinking their rap strategies in the late 80s and early 90s. | |
However, when this chart first launched, Atlantic wasn’t even in the top 5 labels dominating the charts (though they were in the top ten). The top 5 would be Ruthless, Capitol, Delicious Vinyl, Def Jam, and with the significantly biggest piece of the pie, Jive Records. Capitol’s inclusion was largely because of MC Hammer, with assistance from King Tee, Mellow Man Ace, Beastie Boys, and Oaktown 357 (an MC Hammer produced group). Delicious Vinyl had Tone Loc and Young MC, with some assistance from Def Jef. Ruthless had N.W.A, Eazy-E and The D.O.C. In some ways Def Jam was the most diverse of the top 5, with Slick Rick, Public Enemy, LL Cool J, and 3rd Bass. Jive was also pretty diverse, but it’s also interesting in many ways. The biggest influencer there was Boogie Down Productions. While “Self Destruction” was the biggest song for them that year, three songs from BDP’s third album, Ghetto Music: The Blueprint of Hip Hop also saw some chart action. Jive also had Too Short, Kool Moe Dee, and DJ Jazzy Jeff & the Fresh Prince. | |
In many ways it looks as if Jive avoided the stigmas associated with rap charts. BDP, considered the “conscious rap” act, led the way for their chart influence, even dropping an album that was purposefully anti-pop. The second biggest slice of the Jive pie was Too Short, who was also completely against the grain of what was considered pop, though his music is likely influential in the early steps of redefining what pop music could be. The Jive artists with the most pop sounding songs — Kool Moe Dee and DJ Jazzy Jeff & The Fresh Prince — were the lowest on the rap chart. Surely, DJ Jazzy Jeff & the Fresh Prince were seeing some other significant chart action in addition to this minimal rap chart action. And even though Kool Moe Dee wouldn’t be considered an all-out pop artist — nor would his Knowledge Is King album be classified as a pop record — the singles released for the album were clearly catered to pop music fans. | |
I don’t think this is necessarily a sign that Jive was more cultured in their approach than other labels. I think what this shows is that labels were being strategic about what charts they went after for specific artists and many likely used their influence, marketing and in some cases payola dollars to influence these charts. Specifically in this case — and this is 100% speculation — it seems that Jive perhaps purposefully didn’t push for high charting on their pop sounding songs on the “Hot Rap Songs” chart. That would be a pretty decent strategy. If you have Jazzy Jeff and Fresh Prince doing great in the world of pop and selling units, there’s no real benefit to have them riding the top of the rap charts, at least not as much benefit as there would be in pushing your non-pop artists to the top there. If you put all that marketing and money into pushing that position for artists you believe in but you know are most likely not going to see any pop chart action, you are going to maximize your efforts. Hence, putting support for the rap charts for something like BDP and Too Short could make more sense. | |
Just one year later, 1990, there’s a noticeable shift in top influential labels. Def Jam becomes number one. Jive remains at the top, but is competing for space with Atlantic and Capitol. Tommy Boy and Profile show noticeable growth and Uptown Records makes an impressive debut. What you’re able to see as you look year by year is how different players came into prominence. You see how for years Def Jam and Jive wrestled for the top position. You notice how there are areas of the map, mostly all throughout the middle, that never get a bubble, which is not a surprise. It’s common knowledge that the music industry was concentrated in New York and L.A., so while throughout the 90s more artists were emerging from different cities all over the map, there was still the perceived need to go to New York or L.A. fully succeed in the business. The one early clear exception was the activity in Florida, mainly because they had their own built in sound with Miami bass. It was essentially ignored in the New York market and didn’t seem to be heavily explosive on the West Coast, likely because L.A. already had their electro era with Egyptian Lover, World Class Wreckin’ Cru, Arabian Prince and the LA Dream Team. Additionally, when the explosion of N.W.A and Too $hort took place, the L.A. electro scene became noticeably less prominent. | |
Earlier I mentioned how when the chart first launched just a few short weeks later you started notice a shift in the diversity, as things start leaning towards a bigger pop sound. However that was relatively short-lived. I surmise that the after-effects of the explosion of N.W.A, Too Short, and 2 Live Crew started to have more residual impact in the early 90s, and then in 1992 Dr. Dre’s The Chronic kicked the door off the hinges. For example, if you look at the year 1991, most of the top songs for any label with a bigger piece of the pie for the year are tracks made with some pop sensibilities; lots of bright feel good music, singing hooks and danceable beats. But, 1992 looks a little bit different, though not completely. You start to see the emergence of artists who are successfully combining gritty sounds, danceable beats, strong hooks, and a boom bap feel. Not always all of those traits at once, but generally a combination of those characteristics, making songs that had hit potential and still resonated with the core rap audience. Key examples include Cypress Hill “Phuncky Feel One,” A Tribe Called Quest “Check The Rime,” Public Enemy “Can’t Truss It,” and UMCs “Blue Cheese.” None of those are really a traditional pop hit, but they use elements of pop sensibilities. In the case of Public Enemy, they were able to communicate their positive-minded messages to broader bases by using this formula, especially a couple years prior with “Fight the Power.” Cypress Hill was able to turn psychedelic pro-marijuana rap into a hit phenomenon, which is rather surprising on its own, at least in some ways. I feel like those next few years transitioning into the mid-90s are critical in defining the shift to how this chart still works today. | |
Things remained similarly diverse over the next couple years, as we see an ebb and flow of different sounds and styles dominate the charts. We start to see trends such as dancehall and child rappers make their way into the mix. But there’s a noticeable change in 1997. First, looking at 1996 you can see that there isn’t the same level of dominance by a just a couple of labels and that there was a wider competition with what labels defined what this chart sounded like; Def Jam, LaFace, Ruffhouse, Loud, Tommy Boy, MCA, Duck Down, Chrysalis, Elektra, Ruthless, Death Row, Rap A Lot, etc. But when you move ahead just one year later to 1997, only two labels are truly dominating; Bad Boy (Puffy, Notorious B.I.G, Mase) and Violator (Foxy Brown, Cru, Beatnuts). Additionally, both of those labels were focused on a similar sound, as street/club music was winning the charts. Skip to 1999 and No Limit is the clear dominating force, more than any one label had been before. The closest example is Def Jam in 1991, but there was still Jive and Select Records with a decent sized presence. | |
2000 saw a return to the norm, with several labels sharing the pie, but while there were a lot of labels and artists in the mix, the variety of sound and style was not nearly as diverse as previous years. This is true specifically in the most charting labels, though you see some smaller presence from songs setting themselves apart, such as Common’s “The 6th Sense” and D.I.T.C.’s “Thick.” Looking through the remaining years there is a lot of diversity of different labels getting into the game — including an explosion of new indie labels — which includes an increase of representation from different parts of the map. The sound of the chart stays pretty consistent, focusing on primarily street music that works in the club. There is a notable absence of boom bap, positive messages, abstract styles (i.e. De La Soul, UMCs), and humor-based records (i.e. Biz Markie, Digital Underground or Jazzy Jeff and the Fresh Prince). But there’s also a lack of non-rap on this chart as their were in earlier years, with very little hip-house, dancehall, and or R&B songs, despite the fact that there is a heavy amount of R&B styles integrated into the rap hits. Some would simply say that it was just the natural changing of the guard and that east coast boom bap had its time and now it was time for other sounds to dominate. Others would counter that there was a conspiracy at play to limit the success of positively tinged music as there certainly is a notable lack of it charting over time. It’s probably less about conspiracy and more about following formulas in attempt to make as many hits as possible. | |
If my theory is true about labels using the rap chart to push things that they didn’t see having pop potential, that thinking started to change in 1995 with the arrival of Bad Boy, who competed with Def Jam for the top slot that year. At that point, the theory could still ring true because Bad Boy released both the Notorious B.I.G and Craig Mack debut albums in ’94, which may have been considered underground records with pop potential. I’m willing to guess that Puffy believed they had more than just potential to go pop and that Bad Boy had created a magic formula to appease the widest base of rap fans, from coast to coast, from the underground to the pop market. Puffy’s experience at Uptown Records with artists like Heavy D and Mary J. Blige was probably influential in his understanding of that. He was able to take those pop record sensibilities and combine them with the influence of the clean yet still rugged sound of Dr. Dre’s The Chronic, a game changer and one of the most influential rap albums of all-time. It was a perfect fusion of marketing genius. | |
It seems that this time period — influenced by Dr. Dre and further capitalized on by Bad Boy — marked the start of the dissolving of the divided lines in hip-hop that had long been so prevalent, such as east vs. west, indie vs. major, New York vs. everybody, commercial vs. underground, etc. It’s because of that blurring of lines that people became more vocally passionate about what side they were on, as this time marked the height of all these controversies. It was the “keep it real” era, where artists went out of their way to express how they hated major labels and were never going to sell out to pop music. That is, until their indie records got enough attention to get them major deals and potential to crossover, which happened repeatedly. However I agree with those who say that the media really were the ones fueling the fire, because these controversial stories sold magazines and newspapers. There’s also no denying that the media had a fire to fuel because many East Coast artists were opposing the merging of these sounds, as well as the dominating sales and popularity from other areas throughout the map. But despite all the resistance, diss songs, passionate-sounding diatribes about pledging allegiance to the “real hip hop,” the change was imminent. Truthfully, I can’t decide if most of the resistance was just hype or if people were really that scared and/or bothered by the change. Regardless, the resistors continued to lessen in numbers and many artists were soon trying to figure out how to fit into the changing times. | |
As these changes and shifts were taking place, another thing that this study reflects is at any given time how many labels were influencing the charts. In 1990 there were 53 different labels with songs on the charts. That number gradually increased every year up until 1996, with 111 different labels having chart positions. 1997 was the first year that number didn’t increase, but it didn’t change significantly, with 108 labels represented, and then those numbers started to consistently increase again for the next few years until it reached its peak. In 2000 there were 171 labels represented on the chart, the most to date. There was a noticeable decrease in 2001, but in 2002 the number was less than half, with only 81 labels representing this chart. 2003 was almost half that with only 50, and it has fluctuated around that general range ever since, with the lowest amount ever being 2004 with only 45 labels. We’ll see where that stands by the end of 2015, as of April of this year there were only 35 labels represented thus far. Additionally, like the domination of No Limit in 1999, you see a similar domination with Cash Money Records in 2008. Then between 2010–2015 you could almost just start referring to this as the Young Money chart, their domination has been nearly flawless in that time period. I think this supports my idea that over time this chart has gotten less diverse in what fills these chart positions. | |
The only year in this five-year span that Young Money had to share any comparable size of the pie is 2013, which marks another interesting rarity in the form of Macklemore. In 2013, Macklemore has the biggest share of the chart, with Young Money in a pretty close second place slot and no other label even close. For their position that year, Young Money was dependent on hits from Lil Wayne, Tyga and Drake. And that’s what is interesting. Never before do we see a label — or an entity in this case — dominating with the biggest impact on that chart and being completely reliant on only one artist to do so. Certainly in some of the cases, perhaps many, there is one artist that was the primary success story for a label on any given year, but there are other artists also contributing to that position. For 2013 there is just one artist who had the biggest piece of the pie, Macklemore. | |
In my personal opinion, I think all the introductions of new sounds, styles, and voices are good and in one way or another and they help the culture and music evolve. Despite all the talk of what’s killing hip-hop — which has been debated and discussed at least since rap music first hit wax in ’79 — I don’t believe anything really is. If anything, I would guess the biggest culprit would be close-minded people within the culture not willing to let it grow and diversify. I know a lot of people from my generation and a little later would probably not agree with that, but my research, analytical assessment, and admittedly, some theories, have convinced me it’s true. That in mind, on the other hand, I do agree that this polygraph is an example of how rap music did at some point strive to diversify, but then eventually chased the trend of the reigning formula and rarely looked back. The artists in the last few years who’ve made large impacts and went against the norm seem to be fewer and far in-between; Kanye West, J Cole, Kendrick Lamar, etc. Even though those artists are doing some notably different things that are creative, artistically engaging and often progressive, they are also still making at least some music that fits in the basic formula of what dominates this study. That would be only real critique of what this polygraph reflects. | |
In comparing 1990 to 2014 on this chart, this is what I see. In 1990 you had the street sounds of N.W.A., the new jack swing of Heavy D, the abstract minds of De La Soul, the ridiculousness of Chunky A, the pop hits of Salt N Pepa, Kwame’s love ballads, the hip-house of Mr. Lee, MC Hammer’s dance soundtrack, Queen Latifah and Monie Love’s ode to the ladies, Kid Frost representing for “La Raza,” the boom bap of the D.O.C and 3rd Bass, and the melting pot of chaotic genius that is Slick Rick. In 2014, meanwhile, the chart is filled by Drake, Young Dro, Wiz Khalifa, Bobby Shmurda, Iggy Azalea, Young Thug, Eminem, Kid Ink, Future, Kendrick Lamar, Schoolboy Q. While I’m not suggesting all of those artists sound exactly the same, I would say it’s considerably less diverse of a representation of the range of sounds and styles of rap as a whole than the chart was in 1990. However, that’s not because rap music in general is less diverse. There is still a multitude of sounds and styles out there, but this chart and many other outlets are not really a reflection of that. It seems over time the various styles of rap have gotten more like a funnel, and no matter how much variety there is, only a small amount is getting out into the mainstream at any given moment. | |
Furthermore, is this chart a true accurate reflection of a record label’s success or level of influence? Certainly in some ways it definitely is. The labels that consistently had the bigger pieces of the pie are the labels that are the most well known in their respective times. These labels generally boasted the most iconic, hit artists. However, what about the labels who have little or no presence on there? For example, I could only find Rhymesayers Entertainment represented on this particular chart once, which was in 2007 via Atmosphere. However, Atmosphere has had higher positions on different and more competitive charts in Billboard, but has mostly been ignored by the rap chart. And while the radio side of things is not where I focus any attention to in my job at Rhymesayers, I don’t think in my 14 years here I’ve heard a discussion about the importance of making a Billboard chart. That’s not to say that it has no importance. It just that it’s never really been an integral part of our marketing plans or primary concerns. However radio play in general has certainly been a part of the process. So you have a label like Rhymesayers that is 20 years strong in the business and has only had one blip on the “Hot Rap Songs” chart. But still our artists tour all over the world, have large and loyal fan bases, and still sell a notable amount of records to live comfortable as full time musicians. And if you look at the last 20 years of this chart, you’ll see countless artists on it, some who were at the top at various points, that probably can’t say that. So, while the charts do reflect some levels of success and influence, they don’t solidify longevity. | |
Independent labels like Rhymesayers, Stones Throw, Strange Music, all of whom have had very minimal success on the “Hot Rap Songs” chart, have found longevity and their own form of success. We’ve done this by establishing strong and direct connections to ever-growing fan bases and consistently releasing music that speaks directly to those people while at the same time also showing diversity. Stones Throw started out with Peanut Butter Wolf putting out records with MCs from around his way that he appreciated, such as Subcontents, Encore, Homeliss Derelix, Rasco, The Lootpack, etc. But they continued to evolve by adding a lot of rare Funk reissues, as well as finding new acts like Tuxedo and Dam-Funk who mixed modern and retro styles. They also branched off into experimental jazz, psychedelic rock, electro-soul, modern R&B, and who knows what else. Stones Throw continues to engage their fans with awesome music, as well as cool marketing and merchandising. | |
Tech N9ne’s Strange Music is a machine on its own, proving once again that a Rap empire can be built anywhere if you have engaging music and are willing to put in the work. Built off the success of Tech and labelmates such as Krizz Kaliko, Kutt Calhoun, Grave Plott, and Big Scoob, Strange Music keyed in on a sound and built an impressive and loyal following, supported by a touring and street team combined effort that is second to none. In recent years they have signed artists that have helped diversify the label’s sound: Murs, MayDay, and Ces Cru. Another example that caught my attention was the lack of representation from Odd Future. There’s no denying how much impact they have had in the industry over the last several years, but this chart does not capture that. Yet they were able to connect directly with their fan base, working as an artist collective to launch several successful careers, a music festival, clothing line, TV Show, and radio station. I would say all of that makes them one of the most interesting and impressive success stories of recent times as far as rap artists go. My point is that while looking at the charts might give artists and labels an idea of the formula to make a successful charting record, on the other hand, looking at label models like Odd Future, Rhymesayers, Stones Throw and Strange Music, to name a few, could provide even greater insights into long-term success. | |
If you enjoyed reading this, please log in and click “Recommend” below. This will help to share the story with others. | |
Follow Cuepoint: Twitter | Facebook","1" | |
"datalab","https://medium.com/data36/wannabe-data-scientist-here-are-10-free-online-courses-to-start-693c4e230059","1","{""Data Science"",Analytics,""Learning To Code"",Python,""Big Data""}","117","2.22641509433962","Wannabe Data Scientist! Here are 8 free online courses to start…","If you are a Data Scientist, you need to know 3 major areas: | |
Fortunately, there are free courses for all 3 of them around the internet. I am going to collect them here. (Note: I personally tested all of them. They are great!) | |
When it comes to data+coding, you need to learn 3+1 languages. These are: Python, SQL and R. And before you do this, I really suggest you start with the Command Line. | |
This is the perfect language to do quick and dirty analyses. It’s also very flexible, so it’s especially useful for startups, where the structure of data could change really fast. | |
Free course: https://www.codecademy.com/learn/learn-the-command-line | |
I loved this course, because it’s interactive and it gets to the point. It’s a bit short, though. If you want to go further, this is your book (but it’s not free): http://datascienceatthecommandline.com | |
Python is very popular in Machine Learning, predictive analytics and text-mining. Some of the greatest Big Data languages (like Spark) have their own Python layers as well. | |
Free course: https://www.codecademy.com/learn/python | |
Free book: https://learnpythonthehardway.org/book/ | |
Not free, but really great data+python book: Python for Data Analysis | |
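Just to give you a taste of why Python is so handy here, this is a tiny sketch (the numbers are made-up example values, not from any dataset mentioned in this post) that computes a few summary statistics using only the standard library:
import statistics

# Made-up example values: ages of survey respondents
ages = [23, 25, 25, 28, 31, 35, 41]

print("count: ", len(ages))
print("mean:  ", statistics.mean(ages))
print("median:", statistics.median(ages))
print("stdev: ", round(statistics.stdev(ages), 2))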
Worst name for anything, it’s not even googleable, right? :-) But, it’s a very useful language designed by mathematicians for mathematicians. It has a lot of statistical packages, too. | |
Free course: https://www.datacamp.com/courses/free-introduction-to-r | |
The most used query language. SQL is like Excel on steroids, but without the graphical interface. In exchange, it “eats” and processes much more data, much quicker, than any spreadsheet. I’d say every company that does anything with data uses SQL in some part of its data infrastructure. | |
Free course1: http://www.sqlcourse2.com/intro2.html | |
Free course2: https://www.codecademy.com/learn/learn-sql | |
A nice GitHub repo: https://github.com/zoltanctoth/smalldata-training | |
And if you want to practice (maybe because you are trying to prepare yourself for a job interview), this is a good place to do that: https://www.hackerrank.com/ | |
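And to see the “Excel on steroids” idea in action, here is a small sketch that runs a GROUP BY query through Python’s built-in sqlite3 module (the table and numbers are made up for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("book", 12.5), ("book", 9.0), ("pen", 1.5), ("pen", 2.0), ("mug", 7.0)],
)

# Total revenue per product, largest first
query = """
    SELECT product, SUM(amount) AS total
    FROM sales
    GROUP BY product
    ORDER BY total DESC
"""
for product, total in conn.execute(query):
    print(product, total)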
The business part is tricky, because mostly you need to learn it on the job — as different companies have very different businesses. | |
However, a great free course on the topic of “how to think about business with data” is the Google Analytics Course. If you take that, you will learn GA as well (obviously), which is the de facto standard in online analytics. | |
Free course: https://analyticsacademy.withgoogle.com/ | |
I highly recommend this book, too: http://leananalyticsbook.com/ | |
And this Free E-book: http://leananalyticsbook.com/analytics-lessons-learned-free-e-book-with-13-case-studies/ | |
I assume that if you are curious about Data Science, then you are at least a little bit into Statistics. But if you want to practice, again, https://www.hackerrank.com/ is a cool website to do that. | |
+if you are not so much into it, then start with this book: http://www.goodreads.com/book/show/17986418-naked-statistics | |
So these are the courses. If you go through all of them, you will have a great base of knowledge, and by then you will realize you have already taken the first step to becoming a Data Scientist! | |
If you want to go further, read my next article about how to Create a Good Research plan: here. | |
Tomi Mester. My blog: data36.com. My Twitter: @data36_com","4"
"jamesheathers","https://medium.com/@jamesheathers/the-grim-test-a-method-for-evaluating-published-research-9a4e5f05e870","0","{""Data Science"",Science}","117","11.0452830188679","The GRIM test — a method for evaluating published research.","HEADER NOTE: My follow-up piece to the below is now **here**. | |
Ever had one of those ideas where you thought: “No, this is too simple, someone must have already thought of it.” | |
And then found that no-one had? | |
And that it was a good idea after all? | |
Well, that’s what happened to us. | |
(Who is us? I’m going to use pronouns messily through the following almost-3000 words, but let the record show: ‘us’ and ‘we’ is Nick Brown and myself.) | |
The pre-print of this paper is HERE — “The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology”. | |
The GRIM test is, above all, very simple. | |
It is a method of evaluating the accuracy of published research. It doesn’t require access to the dataset, it just needs the mean (the average) and the sample size. | |
It can’t be used on all datasets — or maybe even on most datasets. | |
It will be much more use in the social sciences than other fields, so that’s where we’ve published it. We’ll assume from here that we’re working with samples of people, rather than animals or cells or climatic conditions etc. | |
Psychology is an area where we often collect data from small samples which are made up of whole numbers. | |
So for instance, we might ask 10 people how old are you? | |
Or we might ask 20 people on a scale from 1 to 7, how angry are you? (where 1 means not angry at all and 7 means furious) | |
Or we might ask 40 people which ethnicity are you? Choose the most appropriate from: Caucasian, Asian, Pacific Islander, Other. | |
Or we might ask 15 people how tall are you (to the nearest centimeter)? | |
Small samples, whole numbers. | |
These sorts of samples are drawn from different data types (for instance, race data is categorical and anger survey data is ordinal and age is continuous) but those distinctions don’t matter here. If we add them up, they all return means or percentages which have a special property: they have unusual granularity. That is, they are composed of individual sub-components (grains) which means they aren’t continuous. | |
Think of it this way: if everyone was reporting their age to the nearest year (i.e. 33 years old), that is coarser than to the nearest month (i.e. 33 years, 4 months). Age to the nearest day (i.e. 33 years, 4 months, 20 days) is finer than age to the nearest month or year. | |
(This doesn’t apply so much to categorical data, because it involves the sorting of whole numbers into bins. You’ll see why this is important in a second… ) | |
Let’s make a pretend sample of twelve undergraduates, with ages as follows: | |
17,19,19,20,20,21,21,21,21,22,24,26 | |
The average age is 20.92 (2dp), and we run the experiment on a Monday. | |
However, the youngest person in our sample is about to turn 18. At midnight, their age ticks over, and we all run to the pub for a drink. | |
(If you live in a real country, of course. Sorry Uncle Sam.) | |
Now, hangovers notwithstanding, we run the experiment again on Tuesday. Our sample now has the following age data: | |
18,19,19,20,20,21,21,21,21,22,24,26 | |
The average age is 21 exactly. | |
Now, consider this: the sum of ages just changed by one unit, which is the smallest amount possible. It was 251 (which divided by 12 is 20.92), and with the birthday of the youngest member, became 252 (which divided by 12 is 21 exactly). | |
Thus, the minimum amount that the sum can change is by one, hence the minimum amount that the average can change is one twelfth. | |
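If you want to check this for yourself, here is a tiny Python sketch that reproduces the arithmetic above:
# The two pretend samples from above: before and after the youngest
# student's birthday. Only the first value changes, by one year.
monday  = [17, 19, 19, 20, 20, 21, 21, 21, 21, 22, 24, 26]
tuesday = [18, 19, 19, 20, 20, 21, 21, 21, 21, 22, 24, 26]

print(sum(monday), sum(monday) / len(monday))     # 251 20.9166...
print(sum(tuesday), sum(tuesday) / len(tuesday))  # 252 21.0
print(1 / len(monday))                            # smallest possible step in the mean: 0.0833...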
So, what happens when you are reading a paper and you see this? | |
Well, usually, absolutely nothing whatsoever. This looks plausible, and it would be a cold day in hell before anyone ever thought to check it. | |
But: if you do check it, you find it’s wrong. The ages are impossible. | |
If you remember from before, when you’re adding up whole numbers, the minimum amount the age of a sample of twelve people can change by is one-twelfth. Now, look at the mean of the drug condition… | |
You can’t take the average of twelve ages and get 20.95. This is inconsistent with the stated cell size (n=12). The paper is wrong. Not ‘probably wrong’ or ‘suspicious’ — it’s wrong. | |
We formalised the above into a very simple test — the granularity-related inconsistency of means (GRIM) test. It evaluates whether reported averages can be made out of their reported sample sizes. | |
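For readers who want to try this on a paper themselves, here is a minimal sketch of how such a check could look in Python. This is an illustration of the idea described above, not the code we actually used:
import math

def grim_consistent(reported_mean, n, decimals=2):
    """Can ANY sum of n whole-number responses produce this reported mean?"""
    target = round(reported_mean, decimals)
    # The true total must be an integer near reported_mean * n; because the
    # reported mean was itself rounded, check the integers on either side.
    candidates = {math.floor(reported_mean * n), math.ceil(reported_mean * n)}
    return any(abs(round(total / n, decimals) - target) < 1e-9 for total in candidates)

print(grim_consistent(21.00, 12))  # True:  252 / 12 = 21.00 is possible
print(grim_consistent(20.95, 12))  # False: no integer total over 12 people gives 20.95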
From analysing data that we were certain was fraudulent. | |
(Unfortunately, I can’t tell you what the data is. Or how we knew. That’s actually another story, and one that’s not available for the telling. At least, not yet.) | |
Data like this, where you’re certain there’s problems hidden in it, you can kick around forever. With enough poking and prodding from various angles, checking its normal statistical properties, correlations, assortment, etc. you always run across something which doesn’t fit properly. | |
Here’s a great quote on the process from Pandit (2012) I found recently: | |
I laughed like a drain when I found this paper, because of a note I’d left myself through the investigative process, which — with the Australian expressions redacted — said: | |
But, what do you do when you don’t have the data? | |
We had a few fake data sets to analyse, so it was easy enough to detect the problems. | |
(Note: as yet, the story as to why still can’t be told in full. That’s another tale for another day. Watch this space.) | |
But we also had a lot of other accompanying papers with no data whatsoever. And we were very unlikely to get any more data. If someone does dishonest research, and you start requesting more and more data from them, you’d be unsurprised to find out how often their dog would turn up and ‘eat their homework’. | |
So, for some of the simpler papers with smaller datasets, we tried to reverse engineer them — I wrote a series of Matlab scripts that searched for combinations of values which would reproduce the reported means, standard deviations and other statistics. Nick did the same in R. | |
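For flavour, here is a rough Python sketch of the same brute-force idea (the originals were Matlab and R scripts; this version, and the example numbers in it, are purely illustrative): enumerate the possible integer data sets for a small sample on a bounded scale, and keep any that reproduce the reported mean and standard deviation. | |
from itertools import combinations_with_replacement
from statistics import mean, stdev

def matching_samples(reported_mean, reported_sd, n, lo, hi, decimals=2):
    """Yield non-decreasing integer samples of size n on the scale [lo, hi]
    whose mean and sample standard deviation round to the reported values."""
    for sample in combinations_with_replacement(range(lo, hi + 1), n):
        if (round(mean(sample), decimals) == reported_mean and
                round(stdev(sample), decimals) == reported_sd):
            yield sample

# Hypothetical reported summary: mean 3.50, SD 1.08, n = 10 on a 1-5 scale.
# If the loop prints nothing, no such data set exists -- the GRIM logic again.
for sample in matching_samples(3.50, 1.08, 10, 1, 5):
    print(sample)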
The problem was, sometimes the scripts would work, and we could find a possible dataset. But sometimes everything went horribly wrong. | |
Originally, I thought it was my rather slapdash programming. But it wasn’t. The code was fine. | |
We eventually realised the problem was that we often failed to recreate a dataset from the described summary statistics because the mean was actually impossible. | |
It was such a simple and brutally straightforward observation that we went scuttering around looking for where this observation had been previously published. It must have been, right? | |
To our lasting surprise, it hadn’t been. At least, as far as we can tell. We discreetly asked a few people who know about these sorts of things — they hadn’t seen it published either. | |
And the best piece of evidence I’m aware of: if this had been published somewhere, and it was so easy to do, then why in this era of increased accountability wasn’t everyone already using it? | |
(It might be published out there somewhere… we just don’t know where.) | |
So, in summary | |
So, we put it to work. | |
We took hundreds of papers recently published in psychology journals and GRIM-tested them. | |
Specifically, we drew samples from Psychological Science, Journal of Experimental Psychology: General, and Journal of Personality and Social Psychology, using search terms chosen so that almost all of the articles contained scale data and were published within the last five years. Our final sample was n = 260 papers. | |
As above, the applicability of the GRIM test depends on the sample size and the granularity of the reported measures. | |
Most papers didn’t actually have any numbers that could be checked. | |
Of the subset of articles that were amenable to testing with the GRIM technique (N = 71), around half (N = 36; 50.7%) appeared to contain at least one reported mean inconsistent with the reported sample sizes and scale characteristics, and over a fifth (N = 16; 22.5%) contained multiple such inconsistencies. We requested the data sets corresponding to N = 21 of these articles, received positive responses in N = 9 cases, and were able to confirm the presence of at least one reporting error in every one of these cases, with N = 2 articles requiring extensive corrections. | |
I’m going to repeat that, because I think it bears repeating: | |
In papers with multiple impossible values, we asked the authors for the datasets they used. This was so we could a) see if the method worked and b) check the numbers up close. | |
First of all, the GRIM test works very well, because we found an inconsistency in every dataset we received. These errors had a variety of sources: | |
Make no mistake about it, we made some errors. This was a Herculean task, which involved hand-checking all the results from all 260 papers. Nick, whose focus and attention to detail is much better than mine, did most of the work. It was marvelous fun and by that I mean it was dreadful. We found two instances where we misunderstood the paper and checked something that turned out OK. | |
This was very common — a paper would split a group of 40 people into two groups… and not tell you how big the groups were. You’d assume 20 each, right? Well, not so fast. Sometimes the groups were uneven (and this meant checking not just one mean, but all the possibilities) and everything appeared to be correct when we found consistent solutions with the published data. Other times, the cell sizes were wrong. | |
Sometimes, what we thought was a mistake might be the result of the items we scored having sub-items. For instance, if there was an impossible mean from a sample of n=20, but each person answered four questions to make up the mean, what appears to be a mistake might not be. Some papers left these details out. | |
Version control between authors, late nights, bad copy/paste job, spreadsheet mistake… it happens. We found Excel formulas that terminated in the wrong line, for instance. This was probably the major source of inconsistency overall. | |
Sometimes people in your study go missing, or drop out, or your equipment fails, or you spill coffee on your memory stick. Some papers report their overall sample sizes (how many people were enrolled in the study in the first place) but not how many people completed the study. Bit dodgy leaving these figures out — it never makes your paper look good to say “20 people started the study, but only 15 finished it” — but not a crime. | |
… Yes, let’s talk about this. | |
Obviously, this is the big one. | |
Do we know how much of it we found? Absolutely not. | |
Do we know who’s guilty or innocent of it? Not at all. | |
Are we accusing anyone of anything? Not on your life. | |
But. | |
Is it likely we found some? | |
Maybe. | |
Let me run a scenario by you: | |
After going to the trouble of running an experiment, an experimenter tallies up the results, and the primary result of interest is ‘almost significant’. That is, a statistical test of some form has been run and the means just aren’t far enough apart for the difference to meet our arbitrary (i.e. ridiculous) criteria for determining meaningful differences. | |
What would a dishonest person do? Well, change the means around a little. Not by much. Just a tick. (Assuming you couldn’t give them some kind of cheeky rounding procedure.) | |
Now, the statistical test which was almost-sort-of-significant is now reporting actually-significant. | |
The only problem is, of course, that sometimes when you do this, the means will be changed from a real mean to an impossible one. | |
And, with our technique in place, our dishonest researcher may have made a terribly grave error by publishing that mean in broad daylight, for everyone to see. Now, if the paper is amenable to GRIM testing, someone can come along at any time, and determine that this fictitious mean could never have existed. | |
In fact, now that this result is out there, they probably will. Every study in the published record is now up for grabs. Don’t believe the paper? Well, check the means with the GRIM calculator and go from there. | |
Of course, by itself, a single inconsistent mean doesn’t mean much. | |
But say a paper had multiple inconsistent means… | |
And the authors’ previous work did also, going back several years… | |
Questions will be asked. | |
And while I’m being ominous here: we are far more concerned with the data we didn’t receive than the data we did receive. | |
We requested 21 papers’ worth of data. | |
We received 9. | |
What’s in the final 12? | |
Some of the below overlap, but here were some of the issues: | |
And, if you have any anxiety that we’re being unreasonable, these papers are almost all published in a journal where the authors have explicitly signed a document affirming the fact that they must share their data for precisely this reason — to verify results through reanalysis. | |
Make no mistake about it, we chose these journals not just for their profile, but because we are unambiguously entitled to check them as a condition of publication. | |
Which we’ll be doing. | |
(Should note: JEP:G and JPSP explicitly guarantee this; PS only has a looser, broader commitment to open science. And two out of three ain’t bad. Apologies to music lovers.) | |
What happens from here will be interesting. | |
At best, a lot of researchers who weren’t previously much interested in meta-research will now have a simple tool for evaluating the accuracy of (some) published means in research papers. Especially pre-publication — we’re hoping first and foremost this will be a useful tool during review. | |
If there’s any uptake of this, we should start to see questions being asked at a level which allows a greater attention to detail than previously. | |
Imagine approaching a row of houses, where you want to look inside. Only some of the houses have windows, and only some of the windows you can reach. But even a tiny, smudged, crooked, frosted-glass window is useful — when circumstances line up right, it will let us see inside the house. And that’s better than what we had before.","1" | |
"D33B","https://medium.com/rants-on-machine-learning/what-to-do-with-small-data-d253254d1a89","9","{""Data Science"",""Machine Learning"",Statistics}","116","6.62264150943396","What to do with “small” data?","By Ahmed El Deeb | |
Many technology companies now have teams of smart data-scientists, versed in big-data infrastructure tools and machine learning algorithms, but every now and then, a data set with very few data points turns up and none of these algorithms seem to be working properly anymore. What the hell is happening? What can you do about it? | |
Most data science, relevance, and machine learning activities in technology companies have been focused around “Big Data” and scenarios with huge data sets. Sets where the rows represent documents, users, files, queries, songs, images, etc. Things that are in the thousands, hundreds of thousands, millions or even billions. The infrastructure, tools, and algorithms to deal with these kinds of data sets have been evolving very quickly and improving continuously during the last decade or so. And most data scientists and machine learning practitioners have gained experience in such situations, have grown accustomed to the appropriate algorithms, and gained good intuitions about the usual trade-offs (bias-variance, flexibility-stability, hand-crafted features vs. feature learning, etc.). But small data sets still arise in the wild every now and then, and often they are trickier to handle, requiring a different set of algorithms and a different set of skills. Small data sets arise in several situations. | |
Problems of small data are numerous, but they mainly revolve around high variance: over-fitting is harder to avoid, outliers are more dangerous, and noise is harder to average out. | |
1- Hire a statistician | |
I’m not kidding! Statisticians are the original data scientists. The field of statistics was developed when data was much harder to come by, and as such was very aware of small-sample problems. Statistical tests, parametric models, bootstrapping, and other useful mathematical tools are the domain of classical statistics, not modern machine learning. Lacking a good general-purpose statistician, get a marine-biologist, a zoologist, a psychologist, or anyone who was trained in a domain that deals with small sample experiments. The closer to your domain the better. If you don’t want to hire a statistician full time on your team, make it a temporary consultation. But hiring a classically trained statistician could be a very good investment. | |
2- Stick to simple models | |
More precisely: stick to a limited set of hypotheses. One way to look at predictive modeling is as a search problem. From an initial set of possible models, which is the most appropriate model to fit our data? In a way, each data point we use for fitting down-votes all models that make it unlikely, or up-votes models that agree with it. When you have heaps of data, you can afford to explore huge sets of models/hypotheses effectively and end up with one that is suitable. When you don’t have so many data points to begin with, you need to start from a fairly small set of possible hypotheses (e.g. the set of all linear models with 3 non-zero weights, the set of decision trees with depth <= 4, the set of histograms with 10 equally-spaced bins). This means that you rule out complex hypotheses like those that deal with non-linearity or feature interactions. This also means that you can’t afford to fit models with too many degrees of freedom (too many weights or parameters). Whenever appropriate, use strong assumptions (e.g. no negative weights, no interaction between features, specific distributions, etc.) to restrict the space of possible hypotheses. | |
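As a toy sketch of that last point (synthetic data, arbitrary numbers, and assuming a reasonably recent scikit-learn), a no-negative-weights constraint is a one-line way to shrink the hypothesis space: | |
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 8))                     # only 15 points, 8 candidate features
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=15)

unconstrained = LinearRegression().fit(X, y)
nonnegative = LinearRegression(positive=True).fit(X, y)  # strong assumption: no negative weights

print("unconstrained coefficients:", np.round(unconstrained.coef_, 2))
print("non-negative coefficients: ", np.round(nonnegative.coef_, 2))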
3- Pool data when possible | |
Are you building a personalized spam filter? Try building it on top of a universal model trained for all users. Are you modeling GDP for a specific country? Try fitting your models on GDP for all countries for which you can get data, maybe using importance sampling to emphasize the country you’re interested in. Are you trying to predict the eruptions of a specific volcano? … you get the idea. | |
4- Limit Experimentation | |
Don’t over-use your validation set. If you try too many different techniques, and use a hold-out set to compare between them, be aware of the statistical power of the results you are getting, and be aware that the performance you are getting on this set is not a good estimator for out of sample performance. | |
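A quick simulation (entirely made-up numbers, just to illustrate the risk) shows why: if you compare enough techniques that are all equally useless on a small hold-out set, the best-looking one will still appear to work. | |
import numpy as np

rng = np.random.default_rng(0)
n_validation, n_techniques = 30, 50

# Fifty "techniques" that are all pure coin flips, each scored on the same
# 30-point hold-out set.
scores = rng.binomial(n_validation, 0.5, size=n_techniques) / n_validation

print("true accuracy of every technique: 0.50")
print("best validation accuracy found:", scores.max())  # well above 0.50 purely by chance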
5- Do clean up your data | |
With small data sets, noise and outliers are especially troublesome. Cleaning up your data could be crucial here to get sensible models. Alternatively, you can restrict your modeling to techniques that are especially designed to be robust to outliers (e.g. quantile regression). | |
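To make the robustness point tangible, here is a small synthetic sketch using scikit-learn’s Huber regression (a different robust estimator than the quantile regression named above, but it resists outliers in a similar spirit): | |
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 25).reshape(-1, 1)
y = 3 * X.ravel() + rng.normal(scale=1.0, size=25)
y[-1] += 40                                  # one wild outlier in a small sample

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)           # down-weights the outlying point

print("OLS slope:  ", round(float(ols.coef_[0]), 2))    # dragged upward by the outlier
print("Huber slope:", round(float(huber.coef_[0]), 2))  # stays much closer to 3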
6- Do perform feature selection | |
I am not a big fan of explicit feature selection. I typically go for regularization and model averaging (next two points) to avoid over-fitting. But if the data is truly limiting, sometimes explicit feature selection is essential. Wherever possible, use domain expertise to do feature selection or elimination, as brute force approaches (e.g. all subsets or greedy forward selection) are as likely to cause over-fitting as including all features. | |
7- Do use Regularization | |
Regularization is an almost-magical solution that constrains model fitting and reduces the effective degrees of freedom without reducing the actual number of parameters in the model. L1 regularization produces models with fewer non-zero parameters, effectively performing implicit feature selection, which could be desirable for explainability or for performance in production, while L2 regularization produces models with more conservative (closer to zero) parameters and is effectively similar to having strong zero-centered priors for the parameters (in the Bayesian world). L2 is usually better for prediction accuracy than L1. | |
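A brief sketch of what that looks like in practice on synthetic data (the alpha values here are arbitrary placeholders, not recommendations): | |
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))                # 30 points, 20 candidate features
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=30)

lasso = Lasso(alpha=0.1).fit(X, y)           # L1: many coefficients driven exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)           # L2: all coefficients shrunk toward zero

print("non-zero Lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
print("largest Ridge coefficient:  ", round(float(np.abs(ridge.coef_).max()), 2))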
8- Do use Model Averaging | |
Model averaging has similar effects to regularization in that it reduces variance and enhances generalization, but it is a generic technique that can be used with any type of model or even with heterogeneous sets of models. The downside here is that you end up with huge collections of models, which could be slow to evaluate or awkward to deploy to a production system. Two very reasonable forms of model averaging are Bagging and Bayesian model averaging. | |
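A compact illustration of the bagging variant on synthetic data (a sketch rather than a benchmark; the exact scores will vary): | |
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)

single_tree = DecisionTreeRegressor(max_depth=5)
bagged_trees = BaggingRegressor(DecisionTreeRegressor(max_depth=5), n_estimators=100)

# Averaging 100 bootstrapped trees usually smooths away some of the single
# tree's variance.
print("single tree CV R^2:", round(cross_val_score(single_tree, X, y, cv=5).mean(), 2))
print("bagged trees CV R^2:", round(cross_val_score(bagged_trees, X, y, cv=5).mean(), 2))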
9- Try Bayesian Modeling and Model Averaging | |
Again, not a favorite technique of mine, but Bayesian inference may be well suited for dealing with smaller data sets, especially if you can use domain expertise to construct sensible priors. | |
10- Prefer Confidence Intervals to Point Estimates | |
It is usually a good idea to get an estimate of confidence in your prediction in addition to producing the prediction itself. For regression analysis this usually takes the form of predicting a range of values that is calibrated to cover the true value 95% of the time; in the case of classification it could just be a matter of producing class probabilities. This becomes more crucial with small data sets, as it becomes more likely that certain regions in your feature space are less represented than others. Model averaging, as referred to in the previous two points, allows us to do that pretty easily in a generic way for regression, classification and density estimation. It is also useful to do that when evaluating your models. Producing confidence intervals on the metrics you are using to compare model performance is likely to save you from jumping to many wrong conclusions. | |
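For the evaluation point in particular, a percentile bootstrap is one generic way to turn a single metric into an interval; the numbers below are synthetic and purely for illustration. | |
import numpy as np

rng = np.random.default_rng(0)
errors = rng.normal(loc=1.2, scale=0.8, size=25)  # e.g. a model's errors on 25 held-out points

# Percentile bootstrap: resample the 25 errors with replacement many times and
# look at the spread of the resampled means instead of a single point estimate.
boot_means = [rng.choice(errors, size=errors.size, replace=True).mean()
              for _ in range(5000)]
low, high = np.percentile(boot_means, [2.5, 97.5])

print(f"mean error: {errors.mean():.2f}")
print(f"95% bootstrap interval: [{low:.2f}, {high:.2f}]")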
This could be a somewhat long list of things to do or try, but they all revolve around three main themes: constrained modeling, smoothing and quantification of uncertainty. | |
Most figures used in this post were taken from the book “Pattern Recognition and Machine Learning” by Christopher Bishop.","1" | |
"mattfogel","https://medium.com/@mattfogel/master-the-basics-of-machine-learning-with-these-6-resources-63fea5a21c1c","1","{""Machine Learning"",""Artificial Intelligence"",""Data Science""}","115","2.20754716981132","Master the Basics of Machine Learning With These 6 Resources","It seems like machine learning and artificial intelligence are topics at the top of everyone’s mind in tech. Be it autonomous cars, robots, or machine intelligence in general, everyone’s talking about machines getting smarter and being able to do more. | |
At the same time, for many developers, machine learning and artificial intelligence are nebulous terms representing complex mathematical and data problems they just don’t have the time to explore and learn. | |
As I’ve spoken with lots of developers and CTOs about Fuzzy.io and our mission to make it easy for developers to start bringing intelligent decision-making to their software without needing huge amounts of data or AI expertise, some were curious to learn more about the greater landscape of machine learning. Here are some of the links to articles, podcasts and courses discussing some of the basics of machine learning that I’ve shared with them. Enjoy! | |
This guide, written by the awesome Raul Garreta of MonkeyLearn, is perhaps one of the best I’ve read. In one easy-to-read article, he describes a number of applications of machine learning, the types of algorithms that exist, and how to choose which algorithm to use. | |
This piece by Stephanie Yee and Tony Chu of the R2D3 project gives a great visual overview of the creation of a machine learning model. In this case, a model to determine whether an apartment is located in San Francisco or New York. It’s a great look into how machine learning models are created and how they work. | |
A great starting point on some of the basics of data science and machine learning. Every other week, they release a 10–15 minute episode where hosts Kyle and Linhda Polich give a short primer on topics like k-means clustering, natural language processing and decision tree learning, often using analogies related to their pet parrot, Yoshi. This is the only place where you’ll learn about k-means clustering via placement of parrot droppings. | |
This weekly podcast, hosted by Katie Malone and Ben Jaffe, covers diverse topics in data science and machine learning: teaching specific concepts like Hidden Markov Models and how they apply to real-world problems and datasets. They make complex topics extremely accessible, and teach you new words like clbuttic. | |
Plan for this online course to take several months, but you’d be hard-pressed to find better teachers than Peter Norvig and Sebastian Thrun. Norvig quite literally wrote the book on AI, having co-authored Artificial Intelligence: A Modern Approach, the most popular AI textbook in the world. Thrun’s no slouch either, having previously led Google’s driverless car initiative. | |
This 11-week long Stanford course is available online via Coursera. Its instructor is Andrew Ng, Chief Scientist at Chinese internet giant Baidu. | |
Given the scope of machine learning as a topic, the above really only just begins to scratch the surface. Got your own favorite resource? Suggest it in the comments!","3" | |
"davidventuri","https://medium.com/@davidventuri/review-udacity-data-analyst-nanodegree-1e16ae2b6d12","5","{Education,Udacity,""Data Science"",Programming,""Learning To Code""}","115","8.38427672955975","REVIEW: Udacity Data Analyst Nanodegree","Udacity’s Data Analyst Nanodegree was one of the first online data science programs in the MOOC era. It aims to “ensure you master the exact skills necessary to build a career in data science.” Does it accomplish its goal? Is it the best option available? | |
I completed the program a few weeks ago. Using inspiration from Class Central’s open-source review template, here is my review for Udacity’s Data Analyst Nanodegree. | |
In early 2016, I started creating my own data science master’s program using online resources. (You can read about that here.) I enrolled in the Data Analyst Nanodegree for a few reasons: | |
Though the program can act as a bridge to a job (more on that later), I wanted to use the program as an introduction to more advanced material. This “more advanced material” applies to both subjects that are covered in the program and subjects that aren’t. | |
Udacity is one of the leading online course providers. They mostly focus on tech. Sebastian Thrun, ex-Stanford professor and Google X founder, runs the show as founder and CEO. | |
Nanodegrees are online certifications provided by Udacity. They are usually compilations of existing free Udacity courses that have projects attached to them. These projects are reviewed by Udacity’s paid project reviewers. Upon request, students also have access to Udacity Coaches a.k.a. experts in the courses taught at Udacity. | |
The Data Analyst Nanodegree was originally released in 2014. It was one of Udacity’s first four Nanodegrees. Though it has undergone some changes over the years, the core of the program is intact. | |
Because the Data Analyst Nanodegree is a compilation of free Udacity courses, there are several instructors. Their resumes often include prestigious roles in major tech companies and degrees from top U.S. schools. | |
They aren’t “instructors” per se, but Udacity’s project reviewers and forum staff are the people you actually interact with the most. They are so, so helpful. Again, more on that later. | |
The Data Analyst Nanodegree costs $200 per month, like most other Nanodegrees. If you graduate within twelve months, Udacity gives you a 50% tuition refund. | |
Udacity recommends that students: | |
I started the program in May 2016 when I had a few months of programming experience, mostly in C and Python. The vast majority of this experience was from the bridging module for my data science master’s program, where I took Harvard’s CS50: Introduction to Computer Science and Udacity’s Intro to Programming Nanodegree. | |
I had also finished my undergraduate chemical engineering program and had 24 months of quant-related job experience. This meant I had taken several statistics courses and was comfortable with data. | |
The Data Analyst Nanodegree is split up into eight sections: “P0” through “P7” (I am unsure if the “P” stands for Part, Project, or something else … Penguin?). P0 is optional and is basically an easy version of P1 to get you used to the Udacity learning environment. | |
Each section’s video content is either a full Udacity course or a selection of videos from Udacity courses. Videos tend to range from 30 seconds to five minutes, as per Udacity’s style. Automatically graded quizzes often follow these short videos. These quizzes are usually multiple choice, fill-in-the-blank, or small programming tasks. | |
Sometimes problem sets follow big chunks of video content. These can take a few hours sometimes, though most are quicker. | |
Again, each section has a graded project. These projects and the feedback from Udacity’s paid project reviewers are where a lot of the value lies for most Nanodegrees. | |
My edition of the Data Analyst Nanodegree had the following syllabus: | |
P1: Statistics | |
P2: Intro to Data Analysis | |
P3: Data Extraction and Wrangling | |
P4: Exploratory Data Analysis | |
P5: Machine Learning | |
P6: Data Visualization | |
P7: Design an A/B Test | |
Grading | |
Projects are graded on a pass/fail basis according to a rubric. Each project’s rubric is unique. Your project must satisfy all sections of the rubric. | |
The automatically graded quizzes do not count towards your grade, though Nanodegrees don’t really have grades, anyway. You either pass all of the projects or you don’t. If an individual project submission doesn’t pass, your project reviewer gives you feedback, then you can adjust your work and try again. | |
Udacity’s estimated timeline for the Data Analyst Nanodegree was 378 hours when I started. They have changed up their timelines since then. They now say: “On average, our graduates complete this Nanodegree program in 6–7 months, studying 5–10 hours per week.” | |
According to Toggl (a time tracking app), the whole program took me 369 hours over five months. This timeline included dedicating serious time to making my projects portfolio quality, as opposed to producing the minimum to satisfy the pass/fail rubric. | |
The course content from P1, P2, P4, P5, and P7 get five stars out of five from me. P3 and P6 get four stars. | |
The exploratory data analysis content with Facebook employees (P4) was so illuminating. The intro machine learning course with Sebastian Thrun and Katie Malone (P5) was the most fun I’ve had in any online course. The A/B testing content with Google employees (P7) is so unique. I’d give those three courses six stars if I could. | |
The SQL and MongoDB content (P3) weren’t amazing. Same with the data visualization content (P6), though that probably was because D3.js is super difficult to teach to JavaScript newbies. These opinions aren’t uncommon, according to the Class Central’s reviews for those courses. Check them out here and here. | |
Some of the videos had mistakes in them, which were corrected in the notes section below the video. This issue is par for the course for most online courses, though. That doesn’t make it less annoying when you forget to check the video notes and spend a bunch of time trying to figure out what you’re missing, though. | |
Again, projects are where Udacity sets themselves apart from the rest of the online education platforms. They invest in their project review process and it pays off. The Data Analyst Nanodegree was no exception. | |
All of the projects reinforce the content you learned in the videos. Their project reviewers know their stuff. They tell you where you succeeded and where your mistakes and/or omissions are. Supervised learning by doing. It works. (No, not that supervised learning.) | |
The forums and the forum mentors are especially helpful when you get stuck. Search the forums to see if your problem is a common one (they usually are). No luck? Post a new question yourself. There is one forum mentor, Myles Callan, who seems to know everything about everything and responds within hours. I have my doubts that he sleeps. | |
If you’re curious to see what these projects look like, check out my Data Analyst Nanodegree Github repository. | |
The statistics content was easy for me because I had taken several stats courses in undergrad. This would probably be true for every topic in the Nanodegree if you had prior experience in it. | |
I’d categorize most of the Nanodegree as intermediate difficulty. Lecture content that doesn’t have a problem set attached can be a breeze. The projects exercise your brain. Each will probably take you more than twenty hours if you want to be thorough. | |
The P4 project was the most challenging to pass. It took me 3.5 submissions. Check out this Twitter thread for more details. | |
Can you apply for jobs immediately post-graduation? | |
You can. The program should equip you with the required skills for an entry-level data analyst role if you take it seriously. Eli Kastelein is a perfect example of that. You can read more about his story below. | |
You can also continue onto more advanced courses, both for the subjects covered in the program and for other subjects. This is what I chose to do. | |
Somewhere towards the end of the program, I started creating Class Central’s Data Science Career Guide. This entailed researching every single online course offered for every subject within data science. | |
Though I enjoyed the majority of courses within the Nanodegree, there are other courses from other providers that receive even better reviews for certain subjects. Statistics, for example. If I had access to my guide back when I started, I would strongly consider the separate-course-for-each-subject route, since most of the individual courses within the Data Analyst Nanodegree aren’t the best-rated courses for their subject area. | |
Udacity’s specialized forums and project review process, however, are so effective for learning that I would probably take it regardless. It’d be an effort logistically, but the ultra-optimized approach might be to take the best individual courses for each subject then enroll in the Nanodegree to complete their projects and receive their mentorship. | |
There are four alternatives that I have come across so far: | |
Udacity’s Data Analyst Nanodegree gives you the foundational skills you need for a career in data science. Post-graduation, you’ll be able to target your strengths and weaknesses, and supplement your learning where necessary. Plus, you’ll leave with a handful of portfolio-ready projects. | |
I loved it, as did others. | |
★★★★¾","7" | |
"airbnbeng","https://medium.com/airbnb-engineering/building-for-trust-503e9872bbbb","6","{""Data Science"",Trust,""Sharing Economy""}","114","8.65943396226415","Building for Trust","By Riley Newman and Judd Antin | |
That was the intro to a talk Joe Gebbia, one of Airbnb’s co-founders, recently gave at TED. You can watch it here to find out how the story ends, but (spoiler alert) the theme centers on trust — one of the most important challenges we face at Airbnb. | |
Designing for trust is a well understood topic across the hospitality industry, but our efforts to democratize hospitality mean we have to rely on trust in an even more dramatic way. Not long ago our friends and families thought we were crazy for believing that someone would let a complete stranger stay in their home. That feeling stemmed from the fact that most of us were raised to fear strangers. | |
“Stranger danger” is a natural human defense mechanism; overcoming it requires a leap of faith for both guests and hosts. But that’s a leap we can actively support by understanding what trust is, how it works, and how to build products that support it. | |
How best to support trust — particularly between groups of people who may not have the opportunity to interact with each other on a daily basis — is a core research topic for our data science and experience research teams. In preparation for Joe’s talk, we reflected on how we think about trust, and we pulled together insights from a variety of past projects. The goal of this post is to share some of the thoughts and insights that didn’t make it into the TED talk and to inspire more thinking about how to cultivate the fuel that helps the sharing economy run: trust. | |
When Airbnb was just getting started, we were keenly aware of the need to build products that encourage trust. Convincing someone to try us for the first time would require some confidence that our platform helps to protect them, so we chose to take on a series of complex problems. | |
We began with the assumption that people are fundamentally good and, with the right tools in place, we could help overcome the stranger-danger bias. To do so, we needed to remove anonymity, giving guests and hosts an identity in our community. We built profile pages where they could upload pictures of themselves, write a description about who they are, link social media accounts, and highlight feedback from past trips. Over time we’ve emphasized these identity pages more and more. Profile pictures, for example, are now mandatory — because they are heavily relied upon. In nearly 50% of trips, guests visit a host’s profile at least once, and 68% of the visits occur in the planning phase that comes before booking. When people are new to Airbnb these profiles are especially useful: compared to experienced guests, first time guests are 20% more likely to visit a host’s profile before booking. | |
In addition to fostering identity, we knew we also needed defensive mechanisms that would help build confidence. So we chose to handle payments, a complicated technical challenge, but one that would enable us to better understand who was making a booking. This also put us in a position to design rules to help remove some uncertainty around payments. For example, we wait 24 hours until after a guest checks-in before releasing funds to the host to give both parties some time to notify us if something isn’t right. And when something goes wrong there needs to be a way to reach us, so we built a customer support organization that now covers every timezone and many languages, 24/7. | |
One way we measure the effect of these efforts is through retention — the likelihood that a guest or host uses Airbnb more than once. This isn’t a direct measure of trust, but the more people come to trust Airbnb, the more likely they may be to continue using our service, so there’s likely a correlation between the two. Evaluating customer support through this lens makes a clear case for its value: if a guest has a negative experience, for example a host canceling their reservation right before their trip, their retention rate drops by 26%; intervention by customer support almost entirely negates this loss, shrinking the drop from 26% to less than 6%. | |
We didn’t get everything right at first, and we still don’t, but we have improved. One thing we learned after a bad early experience is that we needed to do more to give hosts confidence that we’d be there for them if anything goes wrong, so we rolled out our $1 million guarantee for eligible hosts. But each year, more and more people are giving this a try because we’ve been able to build confidence that their experience is likely to be a good one. This isn’t the same as trusting the person they will stay with, but it’s an important first step: if trust is the building that hosts and guests construct together, then confidence is the scaffold. Just like the scaffold on a building, our efforts to build confidence make it easier for the work of trust-building to happen, but they won’t create trust. Only hosts and guests can do that. | |
Researchers define trust in many ways, but one interesting definition comes from political scientist Russell Hardin. He argues that trust is really about “encapsulated interest”: if I trust you, I believe that you’re going to look after me and my interests, that you’re going to take my interests to heart and make decisions about them like I would. | |
People who are open to trusting others aren’t suckers — they usually need evidence that the odds are stacked in their favor when they choose to trust a stranger. Reviews thus form the raw material that we can collect and then surface to users on our platform. This is one of our most important data products; we refer to it as our reputation system. | |
The reputation system is an invaluable tool for the Airbnb community, and it’s heavily used — more than 75% of trips are voluntarily reviewed. This is particularly interesting because reviews don’t benefit the individuals who leave them; they benefit future guests and hosts, validating members of the Airbnb community and helping compatible guests and hosts find each other. Having any reputation at all is a strong determinant of a host’s ability to get a booking — a host without reviews is about four times less likely to get a booking than a host that has at least one. | |
Our reputation system helps guide guests and hosts toward positive experiences and it also helps overcome stereotypes and biases that unconsciously affect our decisions. We know biases exist in society, and one of the strongest biases we have in life is that we tend to trust others who are similar to us — sociologists call this homophily. As strong a social force as homophily is, it turns out reputation information can help counteract it. In a recent collaborative study with a team of social psychologists at Stanford, we found evidence of homophily among Airbnb travelers, but we also found that having enough positive reviews can help to counteract homophily, meaning in effect that high reputation can overcome high similarity. (Publication forthcoming) | |
Given the importance of reputation, we’re always looking for ways to increase the quantity and quality of reviews. Several years ago, one member of our team observed that reviews can be biased upward due to fears of retaliation and bad experiences were less likely to be reviewed at all. So we experimented with a ‘double blind’ process where guest and host reviews would be revealed only after both had been submitted or after a 14-day waiting period, whichever came first. The result was a 7% increase in review rates and a 2% increase in negative reviews. These may not sound like big numbers, but the results are compounding over time — it was a simple tweak that has improved the travel experiences of millions of people since. | |
Once trust takes root, powerful community effects begin to emerge and long-standing barriers can begin to fall. First and foremost, people from different cultures become more connected. On New Year’s Eve last year, for example, over a million guests, hailing from almost every country on earth, spent the night with hosts in over 150 countries. Rather than staying with other tourists, they stayed with locals, creating an opportunity for cross-cultural connection that can break down barriers and increase understanding. | |
This is visualized in the graphic below, which shows how countries are being connected via Airbnb trips. Countries on the vertical axis are where people are traveling from, and countries on the horizontal axis are where people are traveling to. The associated link will take you to an interactive visualization where you can see trends in connections relative to different measures of distance. | |
Airbnb experiences are overwhelmingly positive, which creates a natural incentive to continue hosting. Hosts’ acceptance rates rise as they gain more experience hosting. And we see evidence of their relishing the cross-cultural opportunities Airbnb provides: guests from a different country than the host gain a 6% edge in acceptance rates. | |
Hosting also produces more practical benefits. About half of our hosts report that the financial boost they receive through Airbnb helps them stay in their homes and pay for regular household expenses like rent and groceries; depending on the market, 5–20% of hosts report that this added source of earnings has helped them avoid foreclosure or eviction. The remaining earnings pad long-term savings and emergency funds, which help hosts weather financial shocks in the future, or go toward vacations, which spread similar economic benefits to other markets. | |
As communities become more trusting, they also become more durable; they can serve as a source of strength when times are tough. In 2012, after Hurricane Sandy hit the east coast, one of our hosts in New York asked for help changing her listing’s price to $0 in order to provide shelter to neighbors in need. Our engineers worked around the clock to build this and other features that would enable our community to respond to natural disasters, which resulted in over 1,400 additional hosts making spaces available to those affected by the hurricane. Since then, we have evolved these tools into an international disaster response program, allowing our community to support those impacted by disasters — as well as the relief workers supporting the response — in cities around the world. In the last year we have responded to disasters and crises including the Nepal earthquake, the Syrian refugee crisis, the Paris attacks and most recently Cyclone Winston in Fiji. | |
Joe, Brian, and Nate realized from the beginning how crucial trust was going to be for Airbnb. These are just some of the stories that have followed years of effort to build confidence in our platform and facilitate one-on-one trust. While the results are quite positive, we still have a long way to go. | |
One ongoing challenge for us is to concretely measure trust. We regularly ask hosts and guests about their travel experiences, their relationships with each other, and their perceptions of the Airbnb community overall. But none of those are perfect proxies for trust, and they don’t scale well. The standard mechanisms researchers often use to measure trust are cumbersome, and we can’t reliably infer trust from behavioral data. But we’re working to build this measurement capacity so we can continue to carefully design and optimize for trust over time. | |
Our motivation to understand trust doesn’t end with the need to build great products — it’s also about understanding what communities full of trust can do. In one night last year, 1.2m guests stayed with 300,000 hosts. Each of those encounters is an opportunity to break down barriers and form new relationships that further strengthen communities over time.","2" | |
"johnbattelle","https://medium.com/newco/as-we-may-think-data-world-lays-the-traceroutes-for-a-data-revolution-b4b751f295d9","1","{""Big Data"",""Data Science"",Entrepreneurship,Startup,Insurgent}","113","4.89056603773585","As We May Think: Data.world Lays the Traceroutes For A Data Revolution","“There will always be plenty of things to compute in the detailed affairs of millions of people doing complicated things.” — V. Bush | |
Quiet magic happens when an at-scale platform emerges unexpectedly — things previously thought impossible, or more aptly, things never imagined become commonplace faster than we can get used to them. Think of your first Google search. Your first flush of connection on Facebook. The moment a blue dot first guided you to a red destination. Coding before GitHub. Taxis before Uber. AR before Pokemon Go. | |
When a platform is built that allows for unexpected adjacencies, magic is unleashed and the world sparkles for a moment or two. | |
In 1945, well before the advent of personal computers or the Internet, Dr. Vannevar Bush authored a seminal essay positing a new knowledge platform titled As We May Think. Bush, who led US efforts to apply science to the war effort (including the Manhattan Project), outlined a challenge for the world’s scientists and researchers: Now that the war was over, it was time to harness knowledge for the good of all humankind. | |
Scientists, he wrote, “have been part of a great team. Now, as peace approaches, one asks where they will find objectives worthy of their best.” | |
His answer was the Memex — an entirely new kind of device designed to capture, classify, organize, and make available the entirety of human knowledge. As he described his proposed solution, Bush grew tantalizingly close to predicting the World Wide Web, digital computing, social networks, machine learning, and many other future developments — many years before they appeared. What he does nail is the link — the associative connection between two points of data or research. His Memex worked by creating “trails” of associated content — data, articles, photographs, or any other knowledge. He then suggests that researchers around the world could share their trails — creating rich associative lines of inquiry that taken together could change the course of human history. | |
For students of today’s media and technology world, Bush’s Memex remains frustratingly outside our grasp — we know such a system is now possible, but so far, no one’s built a platform capable of producing it. For my part, I believed that early search streams were precursors to a public, Memex-like platform, but the last decade has proven me wrong — private companies certainly have built splendid systems for their own internal leverage of data-driven associations (Facebook, Palantir, and Google come to mind), but so far, there’s not been an open, public platform that might propel humanity into the innovative leaps imagined by Bush. | |
Which brings me to the launch of data.world. Founded by an impressive set of Austin’s most experienced entrepreneurs, data.world is, well, a platform for data. On its face, that doesn’t sound particularly unique. There are already plenty of (mostly government inspired) platforms for open data. Then again, an index for the World Wide Web didn’t sound particularly new when Google showed up, and as for a list of your close friends — well, let’s just say Facebook wasn’t first, or even second. The genius isn’t in the concept, as we all know — it’s in the execution. | |
I met with Brett Hurt, CEO of data.world, and Jon Loyens, its Chief Product Officer, about a month ago in San Francisco. Data.world was about to launch, and they gave me a preview of what they’ve executed. And while it may seem a bit wonky, stick with me. If this thing tips, Hurt & co. may well have unleashed a blast of magic into the world. | |
Data.world sets out to solve a huge problem — one most of us haven’t considered very deeply. The world is awash in data, but nearly all of it is confined by policy, storage constraints, or lack of discoverability. Furthermore, one person’s work on a particular dataset is usually lost once that person’s work is published — a researcher may refine raw government census data into deep insights through a process of hygiene, association with other data sets, and clever scripting, but the results are usually confined to a published paper. The sparkling new data sets sit unused and disconnected on the researcher’s hard drive. | |
Woe to the next set of researchers who might want to pursue or build upon a similar line of inquiry — chances are they’ll never benefit from the work of their peers. Even if the original researcher publishes his or her new data to the Web, there’s no platform to unify that work with the work of others. | |
Data may be to the information economy what oil was to the industrial, but without new tools to refine and distribute it, it remains a gooey mess buried in the soil. 80% of work on data is preparing it for publication. Our information economy remains dependent on those with the capital and scale to privatize data’s insights — far from the reach of mere mortals like academics, journalists, analysts and students. But what if somehow we could unite every person with a data itch to scratch, on one single platform? | |
That’s essentially data.world’s mission. If you’re familiar with how GitHub works, it’s a bit like GitHub, but for data — which is a far larger market. One consistently formatted master repository, with social and sharing built in. Once researchers upload their data, they can annotate it, write scripts to manipulate it, combine it with other data sets, and most importantly, they can share it (they can also have private data sets). Cognizant of the social capital which drives sites like GitHub, LinkedIn, and Quora, data.world has profiles, ratings, and other “social proofs” that encourage researchers to share and add value to each others’ work. | |
In short, data.world makes data discoverable, interoperable, and social. And that could mean an explosion of data-driven insights is at hand. | |
But what really gets me excited about data.world is the decision Hurt and his co-founders have made about the essential purpose of the company. Like KickStarter and a growing number of other NewCos, data.world is a public benefit corporation, with a duty to its shareholders that goes beyond profit alone. Sure, the company is a for profit entity, and is backed by an impressive roster of investors. But here’s the company’s purpose statement: | |
The specific public benefit purposes of the Corporation are to (a) strive to build the most meaningful, collaborative and abundant data resource in the world in order to maximize data’s societal problem-solving utility, (b) advocate publicly for improving the adoption, usability, and proliferation of open data and linked data, and (c) serve as an accessible historical repository of the world’s data. | |
Ambitious? Yes. But the company is already seeing early signs of success. The site is in an invite-only beta period, but according to the company, applications to join have far exceeded expectations, with applicants ranging from government space agencies to craft beer researchers (who knew?!). | |
If we are going to solve the world’s biggest problems, we’ll need new approaches to sharing data, insights, and learning. And when it comes to accelerating solutions, as Loyens told me, “Open beats closed.” It’s refreshing, and encouraging, to see a company so dedicated to such lofty principles. | |
If you liked this story, please recommend it by hitting the awesome green heart below. It really helps us spread the word!","4" | |
"airbnbeng","https://medium.com/airbnb-engineering/beginning-with-ourselves-48c5ed46a703","4","{Recruiting,""Data Science""}","111","9.31132075471698","Beginning with Ourselves","by Riley Newman and Elena Grewal | |
In a recent post, we offered some insights into how we scaled Airbnb’s data science team in the context of hyper-growth. We aspired to build a team that was creative and impactful, and we wanted to develop a lasting, positive culture. Much of that depends on the points articulated in that previous post, however there is another part of the story that deserves its own post — on a topic that has been receiving national attention: diversity. | |
For us, this challenge came into focus a year ago. We’d had a successful year of hiring in terms of volume, but realized that in our push for growth we were not being as mindful of culture and diversity as we wanted to be. For example, only 10% of our new data scientists were women, which meant that we were both out of sync with our community of guests and hosts, and that the existing female data scientists at Airbnb were quickly becoming outnumbered. This was far from intentional, but that was exactly the problem — our hiring efforts did not emphasize a gender balanced team. | |
There are, of course, many ways to think about team balance; gender is just one dimension that stood out to us. And there are known structural issues that form a headwind against progress in achieving gender balance (source). So, in a hyper-growth environment where you’re under pressure to build your team, it is easy to recruit and hire a larger proportion of male data scientists. | |
But this was not the team we wanted to build. Homogeneity brings a narrower range of ideas and gathers momentum toward a vicious cycle, in which it becomes harder to attract and retain talent within a minority group as it becomes increasingly underrepresented. If Airbnb aspires to build a world where people can belong anywhere, we needed to begin with our team. | |
We worried that some form of unconscious bias had infiltrated our interviews, leading to lower conversion rates for women. But before diving into a solution, we decided to treat this like any problem we work on — begin with research, identify an opportunity, experiment with a solution, and iterate. | |
Over the year since, the results have been dramatic: 47% of hires were women, doubling the overall ratio of female data scientists on our team from 15% to 30%. The effect this has had on our culture is clear — in a recent internal survey, our team was found to have the highest average employee satisfaction in the company. In addition, 100% of women on our team indicated that they expect to still be here a year from now and felt like they belonged at Airbnb. | |
Our work is by no means done. There’s still more to learn and other dimensions of diversity to improve, but we feel good enough about our progress to share some insights. We hope that teams at other companies can adopt similar approaches and build a more balanced industry of data scientists. | |
When we analyze the experience of a guest or host on Airbnb, we break it into two parts: the top-of-funnel (are there enough guests looking for places to stay and enough hosts with available rooms) and conversion (did we find the right match and did it result in a booking). Analyzing recruiting experiences is quite similar. | |
And, like any project, our first task was to clean our data. We used the EEOC reporting in Greenhouse (our recruiting tool) to better understand the diversity of our applicants, doing our own internal audit of data quality as well. One issue we faced is that while Greenhouse collects diversity data on applicants who apply directly through the Airbnb jobs page, it does not collect information on the demographics of referrals (candidates who were recommended for the job by current Airbnb employees), which represent a large fraction of hires. Then we combined this with data from an internal audit of our team’s history and from Workday, our HR tool, in order to compare the composition of applicants to the composition of our team. | |
When we dug in, we found that historically about 30% of our applicants — the top of the funnel — had been women. This told us that there were opportunities for improvement on both fronts. Our proportion of female applicants was twice that of employees, so there was clearly room for improvement in our hiring process — the conversion portion. However, there wasn’t male/female parity in our applicant pool so this could also prove a meaningful lever. | |
In addition, we wanted to ensure that our efforts to diversify our data science team didn’t end with us. Making changes to the top of the funnel — to how many women want to and feel qualified to apply for data science jobs — could help us do that. Our end goal is to create a world where there is diversity across the entire data science field, not just at Airbnb. | |
We decided that the best way to achieve these goals would be to look beyond our own applicants to inspire and support women in the broader field. One observation was that while there were a multitude of meetups for women who code, and many great communities of women in engineering, we hadn’t seen the same proliferation of events for women in data science. | |
We decided to create a series of lightning talks featuring women in data, under the umbrella of the broader Airbnb “Taking Flight” initiative. The goals were twofold: to showcase the many contributions of women in the field, and to create a forum for celebrating the contributions of women to data science. At the same time, we wanted to highlight diversity on multiple dimensions. For each lightning talk, we created a panel of women from many different racial and ethnic backgrounds, practicing different types of data science. The talks were open to anyone who supported women in data science. | |
We came up with the title “Small Talks, Big Data” and started with an event in November 2014 where we served food and created a space and time for mingling. The event sold out, with over 100 RSVPs. Afterward we ran a survey to see what our attendees thought we could improve in subsequent events and turned “Small Talks, Big Data” into a series, all of which have continued to sell out. Given this level of interest, several of the women on our team volunteered to write blog posts about their accomplishments (for example, Lisa’s analysis of NPS and Ariana’s overview of machine learning) in order to circulate their stories beyond San Francisco, and to give talks and interviews (for example, Get to know Data Science Panelist Elena Grewal). Many applicants to our team have cited these talks and posts as inspirations to consider working at Airbnb. | |
In parallel to these large community events, we put together smaller get-togethers for senior women in the field to meet, support one another, and share best practices. We hosted an initial dinner at Airbnb and were amazed at what wonderful conversations and friendships were sparked by the event. This group has continued to meet informally, with women from other companies taking the lead on hosting events at their companies, further exposing this group to the opportunities in the field. | |
Alongside our efforts to broaden our applicant pool, we scrutinized our approach to interviewing. As with any conversion funnel, we broke our process down into discrete steps, allowing us to isolate where the drop-off was occurring. | |
There are essentially three stages to interviewing for a data science role at Airbnb: a take-home challenge used to assess technicality and attention to detail; an onsite presentation demonstrating communication and analytical rigor; and a set of 1:1 conversations with future colleagues where we evaluate compatibility with our culture and fit for the role itself. Conversion in the third step was relatively equal, but quite different in steps one and two. | |
We wanted to keep unconscious bias from affecting our grading of take-home challenges, either relating to reviewers being swayed by the name and background of the candidate (via access to their resume) or to subjective views of what constitutes success. To combat this, we removed access to candidate names[1] and implemented a binary scoring system for the challenge, tracking whether candidates did or did not do certain tasks, in an effort to make ratings clearer and more objective. We provided graders with a detailed description of what to look for and how to score, and trained them on past challenges before allowing them to grade candidates in flight. The same challenge would circulate through multiple graders to ensure consistency. | |
Our hypothesis for the onsite presentation was that we had created an environment that catered more to men. Often, a candidate would be escorted into a room where there would be a panel of mostly male data scientists who would scrutinize their approach to solving the onsite challenge. The most common critique of unsuccessful candidates was that they were ‘too junior’, stemming from poor communication or a lack of confidence. Our assumption was that this perception was skewed by the fact that they were either nervous or intimidated by the presentation atmosphere we had created. | |
A few simple changes materially improved this experience. We made it a point to ensure women made up at least half of the interview panel for female candidates. We also began scheduling an informal coffee chat for the candidate and a member of the panel before the presentation, so they would have a familiar face in the room (we did this for both male and female candidates, and both groups said they appreciated the change). And, in our roundup discussions following the presentation, we would focus the conversation on objective traits of the presentation rather than subjective interpretations of overall success.
Taken together, these efforts had a dramatic effect on conversion rates. While our top-of-funnel initiatives increased the relative volume of female candidates, our interviewing initiatives helped create an environment in which female candidates would be just as likely to succeed as any male candidate. Furthermore, these changes to our process didn’t just help with diversity; they improved the candidate experience and effectiveness of hiring data scientists in general. | |
The steps we took over the last year grew the share of women on our team from 15% to 30%, which has made our team stronger and our work more impactful. How?
First, it makes us smarter (source) by allowing for divergent voices, opinions, and ideas to emerge. As Airbnb scales, it has access to more data and increasingly relies upon the data science team’s creativity and sophistication for making strategic decisions about our future. If we were to maintain a homogenous team, we would continue to rely upon the same approaches to the challenges we face: investing in the diversity of data scientists is an investment in the diversity of perspectives and ideas that will help us jump from local to global maxima. Airbnb is a global company and people from a multitude of backgrounds use Airbnb. We can be smarter about how we understand that data when our team better reflects the different backgrounds of our guests and hosts. | |
Second, a diverse team allows us to better connect our insights with the company. The impact of a data science team is dependent upon its ability to influence the adoption of its recommendations. It is common for new members of the field to assume that statistical significance speaks for itself; however, colleagues in other fields tend to assume the statistical voodoo of a data scientist’s work is valid and instead focus on the way their ideas are conveyed. Our impact is therefore limited by our ability to connect with our colleagues and convince them of the potential our recommendations hold. Indeed, the pairing of personalities between data scientists and partners is often more impactful than the pairing of skillsets, especially at the leadership level. Increasing diversity is an investment in our ability to influence a broader set of our company’s leadership. | |
Finally, and perhaps most importantly, increasing our team’s diversity has improved our culture. The women on the data science team feel that they belong and that their careers can grow at Airbnb. As a result, they are more likely to stay with the company and are more invested in helping to build this team, referring people in their networks for open roles. We are not done, but we have reversed course from a vicious to virtuous cycle. Additionally, the results aren’t just restricted to women — the culture of the team as a whole has improved significantly over past years; in our annual internal survey, the data science team scores the highest in employee satisfaction across the company. | |
Of course, gender is only one dimension of diversity that we aim to balance within the team. In 2015 it was our starting point. As we look to 2016 and beyond, we will use this playbook to enhance diversity in other respects, and we expect this will strengthen our team, our culture, and our company. | |
[1] We ended up discontinuing this after a couple of months, after running into logistical issues with Greenhouse. Greenhouse does not allow us to remove names, so when we switched to using Greenhouse fully to track take-home results, graders were able to see names when they logged in to give scores.","1"
"zephoria","https://medium.com/datasociety-points/where-do-we-find-ethics-d0b9e8a7f4e6","1","{Ethics,""Big Data"",""Complex Systems""}","111","3.50566037735849","Where Do We Find Ethics?","I was in elementary school, watching the TV live, when the Challenger exploded. My classmates and I were stunned and confused by what we saw. With the logic of a 9-year-old, I wrote a report on O-rings, trying desperately to make sense of a science I did not know and a public outcry that I couldn’t truly understand. I wanted to be an astronaut (and I wouldn’t give up that dream until high school!). | |
Years later, with a lot more training under my belt, I became fascinated not simply by the scientific aspects of the failure, but by the organizational aspects of it. Last week, Bob Ebeling died. He was an engineer at a contracting firm, and he understood just how badly the O-rings handled cold weather. He tried desperately to convince NASA that the launch was going to end in disaster. Unlike many people inside organizations, he was willing to challenge his superiors, to tell them what they didn’t want to hear. Yet, he didn’t have organizational power to stop the disaster. And at the end of the day, NASA and his superiors decided that the political risk of not launching was much greater than the engineering risk. | |
Organizations are messy, and the process of developing and launching a space shuttle or any scientific product is complex and filled with trade-offs. This creates an interesting question about the site of ethics in decision-making. Over the last two years, Data & Society has been convening a Council on Big Data, Ethics, and Society where we’ve had intense discussions about how to situate ethics in the practice of data science. We talked about the importance of education and the need for ethical thinking as a cornerstone of computational thinking. We talked about the practices of ethical oversight in research, deeply examining the role of IRBs and the different oversight mechanisms that can and do operate in industrial research. Our mandate was to think about research, but, as I listened to our debates and discussions, I couldn’t help but think about the messiness of ethical thinking in complex organizations and technical systems more generally. | |
I’m still in love with NASA. One of my dear friends — Janet Vertesi — has been embedded inside different spacecraft teams, understanding how rovers get built. On one hand, I’m extraordinarily jealous of her field site (NASA!!!), but I’m also intrigued by how challenging it is to get a group of engineers and scientists to work together for what sounds like an ultimate shared goal. I will never forget her description of what can go wrong: Imagine if a group of people were given a school bus to drive, only they were each given a steering wheel of their own and had to coordinate among themselves which way to go. Introduce power dynamics, and it’s amazing what all can go wrong. | |
Like many college students, encountering Stanley Milgram’s famous electric shock experiment floored me. Although I understood why ethics reviews came out of the work that Milgram did, I’ve never forgotten the moment when I fully understood that humans could do inhuman things because they’ve been asked to do so. Hannah Arendt’s work on the banality of evil taught me to appreciate, if not fear, how messy organizations can get when bureaucracies set in motion dynamics in which decision-making is distributed. While we think we understand the ethics of warfare and psychology experiments, I don’t think we have the foggiest clue how to truly manage ethics in organizations. As I continue to reflect on these issues, I keep returning to a college debate that has constantly weighed on me. Audre Lorde said, “the master’s tools will never dismantle the master’s house.” And, in some senses, I agree. But I also can’t see a way of throwing rocks at a complex system that would enable ethics. | |
My team at Data & Society has been grappling with different aspects of ethics since we began the Institute, often in unexpected ways. When the Intelligence and Autonomy group started looking at autonomous vehicles, they quickly realized that humans were often left in the loop to serve as “liability sponges,” producing “moral crumple zones.” We’ve seen this in organizations for a long time. When a complex system breaks down, who is to be blamed? As the Intelligence & Autonomy team has shown, this only gets more messy when one of the key actors is a computational system. | |
And that leaves me with a question that plagues me as we work on our Council on Big Data, Ethics, and Society whitepaper: | |
No matter how thoughtful individuals are, no matter how much foresight people have, launches can end explosively. | |
Points: In this Points original, “Where Do We Find Ethics?” danah boyd takes us back to 1986 in order to pose the question of where we locate ethics in complex systems with distributed decision-making. Stay tuned for the forthcoming whitepaper from the Council on Big Data, Ethics, and Society (now available here).— Ed.","2" | |
"akelleh","https://medium.com/@akelleh/understanding-bias-a-pre-requisite-for-trustworthy-results-ee590b75b1be","4","{""Data Science"",""Machine Learning"",Causality,Analytics}","108","8.59056603773585","Understanding Bias: A Pre-requisite For Trustworthy Results","It turns out that it’s shockingly easy to do some very reasonable things with data (aggregate, slice, average, etc.), and come out with answers that have 2000% error! In this post, I want to show why that’s the case using some very simple, intuitive pictures. The resolution comes from having a nice model of the world, in a framework put forward by (among others) Judea Pearl. | |
We’ll see why it’s important to have an accurate model of the world, and what value it provides beyond the (immeasurably valuable) satisfaction of our intellectual curiosity. After all, what we’re really interested in is, in some context, the effect of one variable on another. Do you really need a model to help you figure that out? Can’t you just, for example, dump all of your data into the latest machine-learning model and get answers out?
What is bias? | |
In this second post (first here) in our series on causality, we’re going to learn all about “bias”. You encounter bias any time you’re trying to measure something and your result ends up different from the true result. Bias is a general term that means “how far your result is from the truth”. If you wanted to measure the return on investment from an ad, you might have measured a 1198% increase in searches for your product instead of the true 5.4%. If you wanted to measure sex discrimination in school admissions, you might have measured strong discrimination in favor of men when it was actually (weakly) in favor of women.
What causes bias? How can we correct it, and how does our picture of how the world works factor in to that? To answer these questions, we’ll start with some textbook examples of rain and sidewalks. We’ll return to our disaster example from the last post, and compare it with something called “online activity bias”. | |
A Tale of Wet Sidewalks | |
Judea Pearl uses a simple and intuitive example throughout his discussion of paradoxes. We’ll borrow his example, mainly for its clarity, and then move on to some other examples we might care more about. | |
In this example, we’re examining what causes the sidewalk to get wet. We have a sprinkler that runs on a timer, and gets the sidewalk wet whenever it comes on. We also know that the sidewalk gets wet whenever it rains. We record these three variables every day, and come up with a nice data set. Our diagram summarizes our model for how the world works, and it suggests a nice sanity check: we can check to see if the rain is correlated with the sprinkler being on. When we do this on our hypothetical data set, we find that it’s not. Everything looks good! | |
Now, consider a thought problem. What if I know (1) that the sidewalk is wet, and (2) that it didn’t rain? What does that tell me about whether or not the sprinkler was on?
If we remove one explanation for the sidewalk being wet (we know it didn’t rain), then the others have to become more likely! If you know that the sidewalk is wet, suddenly knowing that it didn’t rain tells you something about whether the sprinkler is on. In the context of our knowledge about the wetness of the sidewalk, the sprinkler and the rain become statistically dependent! This isn’t a harmless effect. Let’s spend another minute trying to understand what’s going on. | |
If we restricted our data to only include days when the sidewalk was wet, we’d find a negative relationship between whether it has rained and whether the sprinkler was on. This happens for the reason we’ve been talking about: if the sidewalk is wet, and it hasn’t rained, then the sprinkler was probably on. If the sidewalk is wet and the sprinkler wasn’t on, then it has probably rained. Even though the two are uncorrelated in the original data, in the restricted data they are negatively correlated! | |
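To make this concrete, here is a minimal simulation in Python (my own sketch with made-up probabilities, not code from the original post): rain and the sprinkler are generated independently, yet once we keep only the wet-sidewalk days, their correlation turns negative.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical, independent causes (the probabilities are invented for illustration).
rain = rng.random(n) < 0.3
sprinkler = rng.random(n) < 0.4

# Common effect: the sidewalk is wet if it rained or the sprinkler ran.
wet = rain | sprinkler

print(np.corrcoef(rain, sprinkler)[0, 1])            # ~0: independent overall
print(np.corrcoef(rain[wet], sprinkler[wet])[0, 1])  # negative: Berkson's paradox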
This happens because we’re not examining the world naively. We know something. If “the sidewalk is wet” and “it didn’t rain”, then “the sprinkler was probably on”. Statements like “If … then …” are called “conditional” statements. When we’re reasoning in the context of knowing something (the part that follows the “if”, before the “then”), then we’re talking about “conditional” knowledge. We’ll see that conditioning without realizing it can be extremely dangerous: it causes bias.
It turns out that this effect happens in general, and you can think of it in terms of these pictures. Conditioning on a common effect results in two causes becoming correlated, even if they were uncorrelated originally! This seems paradoxical, and it has also been called “Berkson’s paradox”. Looking at the diagram, it’s easy to identify a common effect, and trace the variables upstream from it: we know that conditional on this common effect, all of the upstream variables can become dependent. | |
We can put precise math terms on it for anyone who is interested (understanding the rest of the article doesn’t depend on understanding the next two sentences). The sprinkler and rain are independent, but they are not conditionally independent. Conditional independence doesn’t imply (and is not implied by) independence.
Now, we can see how the same structure leads to a type of bias that can easily happen in an experiment. | |
Do Math Nerds Have Poor Social Skills? | |
You’re applying for a job, and a company will hire you either if you have very good social skills (and are competent technically), or if you have very good technical skills (and are competent socially). You could have both, but having very good skill at one or the other is a requirement. This picture of the world looks something like fig. 2. Look familiar? | |
If you’re only looking at people within the company, then you know they were all hired. Possibly without realizing it, you’ve conditioned on the fact that everyone was hired (think: the sidewalk is wet). In this context, knowing someone has great social skills makes it less likely that they have great technical skills (and vice versa), even though the two are uncorrelated in the general population. | |
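A quick way to see this numerically (again my own sketch, with an invented hiring rule rather than any real company's process): draw independent social and technical skill scores, keep only the people whose best skill clears a bar, and compare correlations.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Independent skills in the general population (standard normal scores, purely illustrative).
social = rng.standard_normal(n)
technical = rng.standard_normal(n)

# Hypothetical hiring rule: hired if either skill is very strong.
hired = (social > 1.5) | (technical > 1.5)

print(np.corrcoef(social, technical)[0, 1])                # ~0 in the population
print(np.corrcoef(social[hired], technical[hired])[0, 1])  # clearly negative among hires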
This effect introduces real bias into experiments. If you’re doing online studies (even randomized AB tests!) on a website, you’re conditioning on the fact that the person has visited your site. If you’re doing a survey study at a college, there can be bias due to the fact that everyone has been admitted. Bias introduced from this kind of conditioning is called “selection bias”. The situation is worse: bias is introduced even if we’re conditioning on effects of being hired, like job title or department (e.g. by surveying everyone within a department). Conditioning on downstream effects can introduce bias too!
From these examples, you might conclude that conditioning is a terrible thing! Unfortunately, there are also cases where conditioning actually corrects bias that you’d have without conditioning! | |
Wait, what? | |
It turns out that the picture is the key. Before, we were considering bias due to conditioning on common effects (variables where arrows collide). Now, we’ll switch the arrows around, and talk about bias due to not conditioning on common causes (variables from which arrows diverge). | |
Consider the (simplified) disaster example from last time, in fig. 3. In this picture, a disaster might cause traffic. It also might cause my alarm clock to fail to go off (by causing a power failure). Traffic and my alarm going off are otherwise independent of each other. | |
If I were to check whether traffic was correlated with my alarm going off, I’d find that it was, even though there’s no causal relationship between the two! If there is a disaster, there will be bad traffic, and my alarm will fail to go off. Unplugging my alarm clock doesn’t cause traffic outside, and neither does traffic (say, from sporting events) cause my alarm clock to fail to go off. The correlation is spurious, and is due entirely to the common cause, the disaster, affecting both the alarm and the traffic.
If you want to remove this spurious relationship, how can you do it? It turns out that conditioning is the answer! If I look at data where there is no disaster, then I’ll find that whether my alarm goes off and whether there is traffic is uncorrelated. Likewise, if I know there was a disaster, knowing my alarm didn’t go off doesn’t give me additional information (since I already know there was a disaster) about whether there will be traffic. | |
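Continuing the same sketch style (all numbers invented), we can simulate this diagram and watch the spurious correlation appear in the full data and vanish once we condition on whether there was a disaster.
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Common cause: a rare disaster (probability is made up).
disaster = rng.random(n) < 0.01

# Traffic and an alarm failure each depend on the disaster, but not on each other.
traffic = rng.random(n) < np.where(disaster, 0.9, 0.2)
alarm_fails = rng.random(n) < np.where(disaster, 0.8, 0.05)

print(np.corrcoef(traffic, alarm_fails)[0, 1])  # positive: spurious, via the disaster

for d in (False, True):  # condition on the common cause
    m = disaster == d
    print(d, np.corrcoef(traffic[m], alarm_fails[m])[0, 1])  # ~0 within each stratum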
Real World Bias | |
Bias due to common causes is called “confounding”, and it happens all the time in real contexts. This is the source of the (greater than) 1000 percentage point bias we mentioned in the introduction. The world actually looks something like figure 4. It’s the reason why it’s wrong to naively group objects by some property (e.g. article category) and compare averages (e.g. shares per article). | |
In this picture, we’re interested in whether people search for a product online. We want to understand how effective an advertisement is, and so we’d like to know the causal effect of seeing the ad on whether you’ll search for a product. Unfortunately, there’s a common cause of both. If you’re an active internet user, you’re more likely to see the ad. If you’re an active internet user, you’re also more likely to search for the product (independently of whether you see the ad or not). This kind of bias is called “activity bias,” and the effect you’d measure without taking it into account is more than 200 times greater than the true effect of the ad. | |
Fortunately, experiments can get around this problem. If you randomize who gets to see the ad, then you break the relationship between online activity and seeing the ad. In other words, you delete the arrow between “activity” and “sees ad” in the picture. This is a very deep concept worth its own post, which we’ll do in the future! | |
You could also remove the bias by conditioning, but that depends strongly on how good your measurement of activity is. Experiments are always the first choice. If you can’t do one, conditioning is a close second. We’ll also detail some different approaches to conditioning in a future post! For now, let’s try to draw out the basic conclusions. | |
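Before drawing those conclusions, here is a rough illustration of that second-best option (the numbers and effect sizes are invented; this is not the study behind the figures quoted above). We compare the naive difference in search rates with a simple adjustment that conditions on a measured activity variable.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Confounder: is this an active internet user? (made-up share)
active = rng.random(n) < 0.3

# Active users are far more likely to see the ad, and to search anyway.
sees_ad = rng.random(n) < np.where(active, 0.6, 0.05)
true_lift = 0.002                                     # tiny true causal effect of the ad
search = rng.random(n) < np.where(active, 0.20, 0.01) + true_lift * sees_ad

naive = search[sees_ad].mean() - search[~sees_ad].mean()

# Back-door-style adjustment: compare within activity strata, weight by the stratum size.
adjusted = sum(
    (search[(active == a) & sees_ad].mean() - search[(active == a) & ~sees_ad].mean())
    * (active == a).mean()
    for a in (False, True)
)

print(naive)     # wildly inflated by activity bias
print(adjusted)  # roughly recovers the true lift of ~0.002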
To Condition or Not to Condition? | |
We’ve seen that bias can come from conditioning when you’re conditioning on a common effect, and doesn’t exist when you don’t condition. We’ve also seen that bias can come from not conditioning on a common cause, and goes away when you do condition. The “back-door criterion” tells you, given any sufficiently complete (strong caveat!) picture of the world, what you should and shouldn’t condition on. There are two criteria: (1) you don’t condition on a common effect (or any effects of “Y”), and (2) you do condition on common causes! This covers all of our bases, and so applying the back-door criterion is the solution, but the requirement that you have to know the right picture of the world is a strong one. This leaves open the question of how we should “do science”. Do we try to build the picture of the world, and find the right things to condition on, so we can estimate any effect we like? Is the world sufficiently static that the picture changes slowly enough for this approach to be okay? (Pearl argues that the picture is actually very static!) Or will it only ever be feasible to do one-off experiments, and estimate effects as we need them? Physics certainly wouldn’t have gotten as far as it has if it had taken the latter approach.
Finally, you may have realized that conditioning on “the right” variables is a stark contrast to the usual, dump-all-of-your-data-in approach to machine learning. This is also a topic worth its own post! It turns out that if you want the best chance of having a truly “predictive” model, you probably want to do something more like applying the back-door criterion than putting all of your past observational data into a model. You can get around this by being careful not to put downstream effects into the model. The reason why has to do with the “do” operation, and the deep difference between intervention and observation. It’s definitely worth its own article! | |
(Oh, and by the way — if you totally love this stuff, BuzzFeed is hiring Data Scientists)","1" | |
"grove","https://medium.com/google-news-lab/the-google-news-lab-in-2016-and-where-were-headed-17b16a8ee63d","4","{Journalism,Verification,""Virtual Reality"",Diversity,""Big Data""}","107","5.49622641509434","The Google News Lab in 2016, and where we’re headed","It’s been quite a year for the news industry. The events of the last year have brought many of the opportunities and challenges that news organizations face every day to national and international attention. Topics like fake news, the future of data journalism and polling, fact-checking, investigative journalism, and many more are no longer just happening at news industry conferences — they’re now part of the public conversation. | |
That’s a good thing, because the tools and technology the Internet makes possible today have great potential to make us more informed — but also present challenges to our information ecosystem. It will take news organizations, tech companies, and news consumers working together to ensure the future of quality journalism is bright. | |
Here at the Google News Lab, our mission is to collaborate with journalists and entrepreneurs to help build the future of media. Throughout 2016, we’ve been working alongside news organizations and innovators around the world to help address the most important topics at the intersection of media and technology. In particular, we’re focusing on Trust and Verification, Data Journalism, Inclusive Storytelling, and Immersive Storytelling.
So as we close out the year, I wanted to share some of what we’ve done and talk about where we’re heading, with a goal of opening us up for feedback from the broader news community. Please leave your ideas or thoughts in the comments below. | |
Here’s a snapshot of what our work looked like in 2016: | |
Trust and Verification | |
Trust in media has never been a more important topic. The internet has empowered everyone to create content and engage in eyewitness reporting at a scale never before possible — but it’s also made it harder to separate fact from fiction. How can we help newsrooms and journalists leverage the opportunity of the web to expand their coverage while working together to improve trust in news organizations? | |
At the center of our efforts around this topic is the First Draft coalition, which we started in 2015 with seven social media verification organizations to create standards and best practices in the news industry for verifying eyewitness media content and combating fake news. In 2016, we expanded the coalition to over 80 partners worldwide, and were thrilled to welcome Facebook and Twitter as well. Any news organization can sign up, and in 2017, we’ll be working with the coalition to engage in more projects like Electionland, our polling place verification initiative. | |
We’re also working hard to make our own platforms better for eyewitness media content with the YouTube Newswire, our partnership with Storyful to verify YouTube videos for use in newsrooms. And we’re supporting the Trust Project, whose goal is to establish protocols and tools to help news organizations signal the value of their reporting to both audiences and technology platforms. | |
Data Journalism | |
Unprecedented computing power — and access to data — has empowered journalists to find insights and tell stories that were never before possible. Yet this growing field requires significant resources, training, and collaboration to become a universal skill set for all journalists. Since we launched our team two years ago, we’ve been asking ourselves — how can we help data journalists grow their work and leverage the best of Google data to bring new insights to their readers? | |
The foundation of our work is Google Trends, which anonymizes and aggregates in real time the trillions of searches that happen every year around the world to help journalists get insights on what’s on our collective minds. Our data curators at the News Lab work with journalists around the world to derive insights from the data and visualize it in ways that bring new insights to their readers. | |
In addition to empowering news organizations with new insights from Google data, we’re working to convene the data journalism community and help advance the open data movement. We want to help data journalists at all levels of ability around the world get the support they need to grow, which is why we support data journalism hackathons and awards programs like the Global Editors Network Data Journalism Awards. And we’re releasing tools like our tilegrams visualization, opening up data sets on GitHub to empower the community to do new and exciting things with Google data sets. | |
Diversity and Inclusive Storytelling | |
Every journalist — and every reader — comes to a story with their own bias. Diversity and inclusion are essential to create media that reaches new audiences, and opens us up to new perspectives. How can we help journalism seek out and amplify voices that aren’t regularly heard in the mainstream, offering coverage that’s representative of the audiences it seeks to reach?
This is a newer area of focus for us, and one we’re excited to engage in through training, partnerships, and technology. Part of that means bringing a focus on diversity to all of our newsroom trainings. Through partnerships with Ida B. Wells Society, Future News, Le Monde Académie, Neue Deutsche Medienmacher, and the Paris Street School, we’re working to provide trainings on Google tools to journalists from diverse and/or non-traditional backgrounds. | |
We’re also partnering with organizations like Witness to focus on the stories online that affect marginalized populations. Our latest project with the Witness Media Lab, Capturing Hate, chronicles the landscape of transgender violence videos uploaded for entertainment on online video platforms. The team analyzed over 300 videos of transgender violence and the viewer engagement around these videos, which shed light on the abuse of transgender people and the role of video in perpetuating that abuse. These are important topics and we’re looking forward to several more projects like this with the Witness Media Lab in 2017. | |
Immersive Storytelling | |
Significant advances in virtual reality, augmented reality, and drones can produce richer journalism and create empathy in readers. Yet there are significant storytelling, technical, and ethical challenges around these new technologies. How can we experiment together and create best practices and standards for these important new fields of journalism? | |
At the heart of our work in this field of storytelling is our Journalism360 coalition, which we created with the Knight Foundation and the Online News Association to bring thought leadership, training, and experimentation to VR journalism. The coalition is funding experimentation by providing $500k in grants to news organizations to experiment with VR storytelling — applications will be open early in 2017.
We’ve also been experimenting ourselves with various publishers, such as partnering with the Guardian on 6x9 — a VR story exploring solitary confinement that won a CINE Golden Eagle Award, and with Berliner Morgenpost on Refugees in Berlin, which won a Grimme Online Award for excellence in online reporting. Here at Google, Daydream is our big focus for VR moving forward, and we’ve already seen impressive Daydream apps from the likes of The Guardian, Huffington Post/RYOT, New York Times, Wall Street Journal and CNN. | |
Given the recent FAA rulings on drone usage in journalism, we’ve been leaning into this area as well, holding the first-ever Drone Journalism bootcamp at the University of Nebraska-Lincoln earlier this year. We’ll be holding four more of these bootcamps in 2017, with UNL, The Poynter Institute, and the National Press Photographers Association. | |
We’d love your feedback on each of these focus areas. Are we asking the right questions? Are we tackling the right areas? What would you like to see the Google News Lab do to improve the future of journalism? | |
We’re looking forward to 2017. As always, you can follow our work at g.co/newslab, or follow us here on Medium. | |
Steve Grove is the Director of the News Lab at Google.","6" | |
"datalab","https://medium.com/@datalab/how-to-break-into-the-data-science-market-f0e0b79b42f7","4","{""Data Science"",""Learning To Code"",Startup,Analytics,""Digital Marketing""}","106","5.42452830188679","How to get your first job in Data Science?","How can one get his/her first entry level job as a Data Scientist/Analyst? If you scroll through the data science subreddit, you will find many questions around this topic. Readers of my blog (data36.com) asking the same from me time to time. And I can tell you this a totally valid problem! | |
I have decided to summarize my answers for all the major questions! | |
Good news — bad news. | |
I will start with the bad one. In 90% of cases, the skills that universities teach you are not really useful in real-life data science projects. As I’ve written several times, real projects require these 4 data coding skills:
Which 2 or 3 of them a given company uses really depends on the company. But once you’ve learned one, it will be much easier to learn another.
So the first big question is: how can you get these tools? Here comes the good news! All of these tools are free! That means you can download, install and use them without paying a penny. You can practice, build a data pet project, or anything else! I recently wrote a step-by-step article on how to get and install these tools on your computer. LINK: Data Coding 101: Install Python, SQL, R & Bash (for non-devs)
There are 2 major ways of learning data science easily and cost-efficiently. Don’t worry, neither of them involves attending €1000+ conferences or workshops.
Kinda old-school, but still a good way of learning. From books you can get very focused, very detailed knowledge about online data analysis, statistics, data coding, etc… I highlighted the 7 books I recommend in my previous article here: LINK.
Data science online courses are fairly cheap ($10-$50) and they cover various topics from data coding to business intelligence. If you don’t want to spend money on this, I’ve listed free courses and learning materials in this post: LINK. | |
This is a tricky one, right? Every company wants to have people with at least a little bit of real life experience… But how do you get real life experience, if you need real life experience to get your first job? Classic catch-22. And the answer is: pet projects. | |
“Pet project” means that you come up with a random data project idea that excites you. Then you simply start building it. You can think of it as a small startup, but make sure you keep focusing on the data science part of the project; you can just ignore the business part. To give you some ideas, here are some of my pet projects from the past few years:
Be creative! Find a data science related pet project for yourself and start coding! If you hit a wall with a coding problem — which can easily happen when you start to learn a new data language — just use Google and/or Stack Overflow. One short example of mine on how effective Stack Overflow is:
Notice the timestamp! I sent in a fairly complicated question and got the answer back in 7 minutes. The only thing I needed to do after that was copy-paste the code into my production code and, boom, it just worked!
(UPDATE1: Cross Validated is another great forum for Data Science related questions. Thanks to nameBrandon from reddit for the addition.)
Even if it’s a little bit difficult, try to get a mentor. If you are lucky, you will find someone who works in a Data Scientist role at a nice company and who can spend an hour with you weekly or biweekly to discuss and teach things.
If you haven’t managed to find a mentor, you can still find your first one at your first company. This is going to be your first data science related job, so I suggest not focusing on big money or on a super-fancy startup atmosphere. Focus on finding an environment where you can learn and improve yourself.
Taking your first data science job at a multinational company might not fit this idea, because people there are usually too busy with their own things, so they won’t have the time and/or motivation to help you improve (of course, there are always exceptions).
Starting at a tiny startup as the first data person on the team is not a good idea either in your case, because these companies don’t have senior data people to learn from.
I advise you to focus on companies of 50–500 people. That’s the golden mean: senior data scientists are on board, but they are not too busy to help and teach you.
Okay, you have found some good companies… How do you apply? Some principles for the CV: highlight your skills and projects, not your experience (as you don’t have too many years on paper yet). List the data coding languages you use, and link some of your related GitHub repos, so you can show that you have really used those languages.
Also, in most cases companies ask for a cover letter. It’s a good opportunity to express your enthusiasm of course, but you could add some practical details as well, like what you would do in your first few weeks if you were hired. (E.g. “Looking at your registration flow, I guess the ____ page plays a great role in it. In my first few weeks, I’d run ___, ___ and ___ specific analyses around it to test this hypothesis and understand it more deeply. It could help the company improve _____ and eventually push the _____ KPIs.”)
Hopefully this will land you a job interview, where you can chat a little bit about your pet projects and your cover letter suggestions, but it will mostly be a personality fit-check and most probably some basic skill tests. If you have practiced enough, you will pass… but if you are the nervous type and want to practice more, you can do it on hackerrank.com.
Well, that’s it. I know it sounds easier written down than it is in practice, but if you are really determined to be a Data Scientist, making it happen won’t be a problem! Good luck with that! And if you want to learn more about data science, check out my blog (data36.com) and/or subscribe to my Newsletter!
Thanks for reading! | |
Enjoyed the article? Please just let me know by clicking the 💚 below. It also helps other people see the story! | |
Tomi Mester, author of data36.com. Twitter: @data36_com","2"
"Antonio_Pedro","https://medium.com/data-science-brigade/primeiro-parlamentar-devolve-dinheiro-após-denúncia-da-operação-serenata-de-amor-9f8f1132da9d","3","{""Serenata De Amor"",Deputado,""Câmara Dos Deputados"",""Data Science""}","100","1.7122641509434","Primeiro parlamentar devolve dinheiro após denúncia da Operação Serenata de Amor","","1" | |
"larryweru","https://medium.com/matter/the-trouble-with-the-purple-election-map-31e6cb9f1827","12","{Politics,Design,""Data Visualization""}","98","5.44056603773585","The Trouble with the Purple Election Map","By Larry Weru | |
Every 4 years, we’re given a map like this: | |
Usually, a dissatisfied data visualizer is less than thrilled with entire states being designated as either red or blue. The analyst, wanting a clearer picture of the political landscape, breaks the original map down to a county-level: | |
With this graph, the analyst can see that the states are a patchwork of red and blue counties. But there’s an unanswered question. Did every single person inside a given county cast the same vote? Or were the voters within the counties divided? This graph doesn’t show that information. It just shows which choice won, even if it was by one vote. So we decide to blend each county’s red and blue vote ratios together, instead of letting the winner take all. This way, if we get a pure red or pure blue county, we know that everybody cast the same vote. And if we get a purple county, then the voters were divided. This is the result:
A nice blend revealing smooth purple transitions between red and blue regions. We can now see that there was hardly a landslide victory in any county. But there’s still a problem. When looking at some of the purple counties, it can be hard to decipher whether the purple is leaning more towards red or blue. And it gets more difficult the closer the counties get to a 50/50 red/blue split. In fact, an R>B county can be mistaken for an R<B county depending on the adjacent colors. In an attempt to see the margins within the counties, we end up with something that’s close, but no cigar. What’s wrong?
The problem is that in the purple map we’re no longer discerning between two distinct hues (a specific red hue and specific blue hue), but an indiscrete number of purple hues found in-between red and blue. | |
Magenta (which I will call purple from now on) is what our brain registers when we observe an equal mixture of red and blue light with the absence of green. We created the purple map because pure red and pure blue counties didn’t show us the vote margins. But the purple hues are nonsense. We end up hunting for the degree of red-ness or blue-ness inside each purple hue to make sense of it. We don’t need purple at all.
So is there a way for us to blend together red and blue hues without creating any new nonsensical hues in the process? | |
Of course! We add green, and we get a Neutralizing election map: | |
Interesting. How does this work? Basic color theory. | |
We can describe all colors as combinations of Red, Green, and Blue light. Earlier, we established that Magenta is the mixture of Red and Blue light, with the key absence of Green. But we don’t have any reservations about the absence of green for our purposes. We just want to create a gradient between red and blue. So instead of falling victim to Magenta, we can use Green to neutralize the purple-ranged colors. | |
For a given purple-ranged hue, if we take the weaker intensity value of that purple’s Red and Blue components — let’s call that value J — and then set the purple’s green component to that J value, the purple disappears. What’s left behind is either red or blue — whichever one was stronger in the original purple.
So if we take RGB[175/0/150] and Neutralize it to RGB[175/150/150], then the red hue sticks out, with some desaturation. | |
If we take RGB[150/0/175] and Neutralize it to RGB[150/150/175], then the blue hue sticks out, with some desaturation. | |
And if we Neutralize magenta, which is a 50/50 split between red and blue, the result will be a completely desaturated color since the red, green, and blue values would all be exactly the same. For example: RGB[128/0/128] Neutralizes to RGB[128/128/128]. | |
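Here is the rule as a small Python function (my own sketch of the algorithm described above, not the author's original code), checked against the example values from the last few paragraphs.
def neutralize(r, g, b):
    # Set the green channel to the weaker of the red/blue components ("J"),
    # cancelling the purple cast; the incoming green is simply replaced.
    j = min(r, b)
    return (r, j, b)

print(neutralize(175, 0, 150))  # (175, 150, 150): red sticks out, somewhat desaturated
print(neutralize(150, 0, 175))  # (150, 150, 175): blue sticks out, somewhat desaturated
print(neutralize(128, 0, 128))  # (128, 128, 128): a 50/50 purple goes fully grey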
This makes sense, because when mixing together high-intensity Red, Green and Blue light, we get pure white light, which has no color saturation. | |
This is awesome! Thanks to this neutralizing algorithm we can pit two hues against each other and only one of the original 2 hues will be the resulting victor. No new hues! The only hues on the Neutralizing map are hue#0/hue#360 (Red) and hue#240 (Blue). So now the degree of red/blue saturation reveals the margins. | |
If a county even has the slightest hint of color saturation, then that color is that county’s marginal vote. There can be no dispute. This allows us to observe minute margins, such as a 49/51 red/blue split. *But there are limits to how nuanced the margins can be, since RGB’s 8-bit color-space can’t accept too many significant figures. | |
This Neutralizing Map can bring to our attention things we didn’t notice before — such as the blue streak in the Bible Belt isolated from the reds by a grey buffer, which doesn’t really call out to us in the Purple map. | |
Here’s how relative voter-turnout appears on the Neutralizing graph: | |
The principle behind the neutralizing map algorithm can work for all sets of colors. For example, we can neutralize a graph that has 4 fields represented by red, green, orange, and yellow. Grey would still be the intermediary between all the hues. It would just take a little extra work since we’d need to abstract away RGB. | |
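One possible way to do that abstraction, offered as an assumption rather than anything from the original piece: work in HSL, keep the winning field's hue, and scale the saturation by the winner's margin over the runner-up, so that a tie lands on grey.
import colorsys

def blend_competition(hue_winner, share_winner, share_runner_up):
    # Hypothetical generalization: winner's hue, saturation proportional to its margin.
    margin = (share_winner - share_runner_up) / max(share_winner + share_runner_up, 1e-9)
    r, g, b = colorsys.hls_to_rgb(hue_winner, 0.5, margin)  # lightness fixed at 0.5
    return tuple(round(255 * c) for c in (r, g, b))

# Four-way race: orange (hue ~0.08) narrowly beats green (hue ~0.33)...
print(blend_competition(0.08, 0.30, 0.28))  # barely saturated orange, close to grey
# ...versus a landslide for orange.
print(blend_competition(0.08, 0.60, 0.10))  # strongly saturated orange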
Engineers use visualizations like this all the time, and the results can be useful. Take a look at Teddy_the_Bear’s cartogram. He diverges by adjusting gamma & luminosity, resulting in darker reds and brighter blues. Here’s a Massachusetts map that does something similar by avoiding purple. Although, unlike the Neutralizing map, the blues and reds in the Massachusetts map end up changing in hue when transitioning towards their assigned neutral. There’s also professor Robert Vanderbei’s margin-of-victory/tilt maps. Obvious differences are that the Neutralizing map results in a grey neutral instead of a brighter, white neutral. | |
All maps copyright © Lawrence Weru | |
Lawrence Weru is an alumnus of Florida State University where he received degrees in Biological Science and Studio Art while doing research in Software Engineering. He’s the founder of RoomFlag, a cloud service that uses colors to help facilities such as clinics and spas track their room statuses. | |
This project uses data from the 2012 presidential election. Special thanks to Geoffery Miller for assistance in working with CSV data in the browser.","1" | |
"trentmc0","https://medium.com/the-bigchaindb-blog/blockchains-for-artificial-intelligence-ec63b0284984","17","{Blockchain,""Artificial Intelligenc"",""Big Data"",""Machine Learning""}","97","22.2415094339623","Blockchains for Artificial Intelligence","[This is based on a talk I first gave on Nov 7, 2016. Here are the slides. And, it was first published on Dataconomy on Dec 21, 2016; I’m reposting here for ease of access.] | |
In recent years, AI (artificial intelligence) researchers have finally cracked problems that they’ve worked on for decades, from Go to human-level speech recognition. A key piece was the ability to gather and learn on mountains of data, which pulled error rates past the success line. | |
In short, big data has transformed AI, to an almost unreasonable level. | |
Blockchain technology could transform AI too, in its own particular ways. Some applications of blockchains to AI are mundane, like audit trails on AI models. Some appear almost unreasonable, like AI that can own itself — AI DAOs. All of them are opportunities. This article will explore these applications. | |
Before we discuss applications, let’s first review what’s different about blockchains compared to traditional big-data distributed databases like MongoDB. | |
We can think of blockchains as “blue ocean” databases: they escape the “bloody red ocean” of sharks competing in an existing market, opting instead to be in a blue ocean of uncontested market space. Famous blue ocean examples are Wii for video game consoles (compromise raw performance, but have new mode of interaction), or Yellow Tail for wines (ignore the pretentious specs for wine lovers; make wine more accessible to beer lovers). | |
By traditional database standards, traditional blockchains like Bitcoin are terrible: low throughput, low capacity, high latency, poor query support, and so on. But in blue-ocean thinking, that’s ok, because blockchains introduced three new characteristics: decentralized / shared control, immutable / audit trails, and native assets / exchanges. People inspired by Bitcoin were happy to overlook the traditional database-centric shortcomings, because these new benefits had potential to impact industries and society at large in wholly new ways. | |
These three new “blockchain” database characteristics are also potentially interesting for AI applications. But most real-world AI works on large volumes of data, such as training on large datasets or high-throughput stream processing. So for applications of blockchain to AI, you need blockchain technology with big-data scalability and querying. Emerging technologies like BigchainDB, and its public network IPDB, do exactly that. You no longer need to compromise on the benefits of traditional big-data databases in order to have the benefits of blockchains.
Having blockchain tech that scales unlocks its potential for AI applications. Let’s now explore what those might be, by starting with the three blockchain benefits. | |
These blockchain benefits lead to the following opportunities for AI practitioners: | |
Decentralized / shared control encourages data sharing: | |
Immutability / audit trail: | |
Native assets / exchanges: | |
There’s one more opportunity: (6) AI with blockchains unlock the possibility for AI DAOs (Decentralized Autonomous Organizations). These are AIs that can accumulate wealth, that you can’t shut off. They’re Software-as-a-Service on steroids. | |
There are almost surely more ways that blockchains can help AI. Also, there are many ways that AI can help blockchains, such as mining blockchain data (e.g. Silk Road investigation). That’s for another discussion:) | |
Many of these opportunities are about AI’s special relationship with data. So let’s first explore that. Following this, we’ll explore the applications of blockchains for AI in more detail. | |
Here, I will describe how much of modern AI leverages copious quantities of data for its impressive results. (This isn’t always the case, but it is a common theme worth describing.) | |
When I started doing AI research in the 90s, a typical approach was: | |
If this sounds academic, that’s because it was. Most AI work was still in academia, though there were real-world applications too. In my experience, it was like this in many sub-fields of AI, including neural networks, fuzzy systems (remember those?), evolutionary computation, and even slightly less AI-ish techniques like nonlinear programming or convex optimization. | |
In my first published paper (1997), I proudly showed how my freshly-invented algorithm had the best results compared to state-of-the-art neural networks, genetic programming, and more — on a small fixed dataset. Oops. | |
But, the world shifted. In 2001, Microsoft researchers Banko and Brill released a paper with remarkable results. First, they described how most work in their domain of natural language processing was on less than a million words — small datasets. Error rates were 25% for the old / boring / least fancy algorithms like Naive Bayes and Perceptrons, whereas fancy newer memory-based algorithms achieved 19% error. That’s the four datapoints on the far left of the plot below. | |
So far, no surprises. But then Banko and Brill showed something remarkable: as you added more data — not just a bit more data but orders of magnitude more data — and kept the algorithms the same, then the error rates kept going down, by a lot. By the time the datasets were three orders of magnitude larger, error was less than 5%. In many domains, there’s a world of difference between 18% and 5%, because only the latter is good enough for real-world application. | |
Moreover, the best-performing algorithms were the simplest; and the worst algorithm was the fanciest. Boring old perceptrons from the 1950s were beating state-of-the-art techniques. | |
Banko and Brill weren’t alone. For example, in 2007, Google researchers Halevy, Norvig, and Pereira published a paper showing how data could be “unreasonably effective” across many AI domains.
This hit the AI field like an atom bomb. | |
The race was on to gather way more data. It takes significant effort to get mountains of good data. If you have the resources, you can get data. Sometimes you can even lock up data. In this new world, data is a moat, and AI algorithms a commodity. For these reasons, “more data” is a key imperative for Google, Facebook, and many others. | |
Once you understand these dynamics, specific actions have simple explanations. Google doesn’t buy satellite imaging companies simply because it likes space; and Google gives away TensorFlow. | |
Deep learning directly fits in this context: it’s the result of figuring out how, if given a massive enough dataset, to start to capture interactions and latent variables. Interestingly, backprop neural networks from the ’80s are sometimes competitive with the latest techniques, if given the same massive datasets. See here. It’s the data, silly. | |
My own coming-of-age as an AI researcher was similar. As I was attacking real-world problems, I learned how to swallow my pride, abandon the “cool” algorithms, build only as much as was needed to solve the problem at hand, and learned to love the data and the scale. It happened in my first company, ADA (1998–2004), as we pivoted from automated creative design to “boring” parameter optimization; which incidentally became less boring in a hurry as our users asked us to go from 10 variables to 1000. It happened in my second company, Solido (2004-present), as well, as we pivoted from fancier modeling approaches to super-simple but radically scalable ML algorithms like FFX; and once again was un-boring as our users pulled us from 100 variables to 100,000, and from 100 million Monte Carlo samples to 10 trillion (effective samples). Even BigchainDB, the product of my third and current company, emerged from the need for scale (2013-present). Zoom in on features, zoom up on scale.
In short: decentralized / shared control encourages data sharing, which in turn leads to better models, which in turn leads to higher profit / lower cost / etc. Let’s elaborate.
AI loves data. The more data, the better the models. Yet data is often siloed, especially in this new world where data can be a moat. | |
But blockchains encourage data sharing among traditional silos, if there is enough up-front benefit. The decentralized nature of blockchains encourages data sharing: it’s less friction to share if no single entity controls the infrastructure where the data is being stored. I give more benefits later on. | |
This data sharing might happen within an enterprise (e.g. among regional offices), within an ecosystem (e.g. for a “consortium” database), or across the planet (e.g. for a shared planetary database, a.k.a. public blockchain). Here’s an example for each: | |
Enemies sharing their data to feed an AI. 2016 is fun! | |
In some cases, when data from silos is merged, you don’t just get a better dataset, you get a qualitatively new dataset. Which leads to a qualitatively new model, from which you can glean new insights and have new business applications. That is, you can do something you couldn’t do before.
Here’s an example, for identifying diamond fraud. If you’re a bank providing diamond insurance, you’d like to create a classifier that identifies whether a diamond is fraudulent. There are four trusted diamond certification labs on the planet (depending on who you ask, of course). If you only have access to the diamond data for one of these labs, then you’re blind about the other three houses, and your classifier could easily flag one of those other houses’ diamonds as fraud (see picture below, left). Your false positive rate would make your system unusable.
Consider instead if blockchains catalyze all four certification labs to share their data. You’d have all the legitimate data, from which you would build a classifier (below, right). Any incoming diamond, for example one seen on eBay, would be run through the system and compared to this all-data one-class classifier. The classifier can detect actual fraud while avoiding false positives, therefore lowering the fraud rate, to the benefit of insurance providers and certification labs. This could be framed simply as a lookup, i.e. not needing AI. But using AI improves it further, for example by predicting price based on color, carats, etc., then using “how close is price to expected value” as an input to the main fraud classifier.
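As a rough sketch of what that pooled one-class setup could look like (the features, numbers, and model choice below are all assumptions for illustration, not any bank's or lab's actual pipeline):
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)

# Pretend pooled records from the four labs: carats, colour grade (encoded), listed price.
# All values are synthetic placeholders, not real certification data.
legit = np.column_stack([
    rng.normal(1.0, 0.3, 4000),     # carats
    rng.integers(1, 8, 4000),       # colour grade
    rng.normal(5000, 1500, 4000),   # price
])

# One-class model trained only on the pooled legitimate diamonds.
clf = IsolationForest(random_state=0).fit(legit)

# An incoming listing, e.g. scraped from a marketplace: suspiciously cheap for its size.
candidate = np.array([[2.5, 2, 900]])
print(clf.predict(candidate))  # -1 means flagged as an outlier, i.e. possible fraud
The "how close is price to expected value" idea would then just be one more input column, e.g. the residual of a price regression.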
Here’s a second example. An appropriate token-incentive scheme in a decentralized system could incentivize datasets to get labeled that could not be previously labeled, or labeled in a cost-effective fashion. This would be basically a decentralized Mechanical Turk. With new labels we get new datasets; we train on the new datasets to get new models. | |
Here’s a third example. A token-incentive scheme could lead to direct data input from IoT devices. The devices control the data and can exchange it for assets, such as energy. Once again, this new data can lead to new models. (Thanks to Dimi de Jonghe for these last two examples.)
Hoard vs. share? There’s a tension between two opposite motivations here. One is to hoard data — the “data is the new moat” perspective; the other is to share data, for better/new models. To share, there must be a sufficient driver that outweighs the “moat” benefit. The technology driver is better models or new models, but this driver must lead to business benefit. Possible benefits include reduced fraud for insurance savings in diamonds or supply chains; making money on the side in Mechanical Turk; data/model exchanges; or collective action against a powerful central player, like the music labels working together against Apple iTunes. There are more; it requires creative business design. | |
Centralized vs. decentralized? Even if some organizations decide to share, they could share without needing blockchain technology. For example, they could simply pool it into an S3 instance and expose the API among themselves. But in some cases, decentralized gives new benefits. First is the literal sharing of infrastructure, so that one organization in the sharing consortium doesn’t control all the “shared data” by themselves. (This was a key stumbling block a few years back when the music labels tried to work together for a common registry.) Another benefit is that it’s easier to turn the data & models into assets, which can then be licensed externally for profit. I elaborate on this below. (Thanks to Adam Drake for drawing extra attention to the hoard-vs-share tension.) | |
As discussed, data & model sharing can happen at three levels: within an enterprise (which for multinationals is harder than you might think); within an ecosystem or consortium; or within the planet (which amounts to becoming a public utility). Let’s explore planet-scale sharing more deeply. | |
Planetary-level data sharing is potentially the most interesting level. Let’s drill further into this one. | |
IPDB is structured data on a global scale, rather than piecemeal. Think of the World Wide Web as a file system on top of the internet; IPDB is its database counterpart. (I think the reason we didn’t see more work on this sooner is that the semantic web work tried to go there from the angle of upgrading a file system. But it’s pretty hard to build a database by “upgrading” a file system! It’s more effective to say from the start that you’re building a database, and design it as such.) “Global variable” gets interpreted a bit more literally:) | 
So, what does it look like when we have data sharing with a planet-scale shared database service like IPDB? We have a couple points of reference. | |
The first point of reference is that there’s already a billion-dollar market for companies that curate and repackage public data to make it more consumable: from simple APIs for the weather or network time, to financial data like stocks and currencies. Imagine if all this data were accessible through a single database in a similarly structured fashion (even if it’s just a pass-through of the API). Bloomberg x 1000. And without a single choke point controlled by a single entity. | 
The second point of reference comes from the blockchain, in the concept of “oraclizing” outside data to make it consumable by a blockchain. But we can oraclize it all. Decentralized Bloomberg is just the start. | |
Overall, we get a whole new scale of diversity in datasets and data feeds. Therefore, we have qualitatively new data: planetary-level structured data. From that, we can build qualitatively new models that relate inputs & outputs which weren’t connected before. With the models, and from the models, we will get qualitatively new insights. | 
I wish I could be more specific here, but at this point it’s so new that I can’t think of any examples. But, they will emerge! | |
There’s also a bot angle. We’ve been assuming that the main consumers of blockchain APIs will be humans. But what if it’s machines? David Holtzman, creator of the modern DNS, said recently “IPDB is kibbles for AI”. Unpacking this, it’s because IPDB enables and encourages planet-level data sharing, and AI really loves to eat data. | |
This application addresses the fact that if you train on garbage data, then you’ll get a garbage model. Same thing for testing data. Garbage in, garbage out. | |
Garbage could come from malicious actors / Byzantine faults who may be tampering with the data. Think of the Volkswagen emissions scandal. Garbage may also come from non-malicious actors / crash faults, for example a defective IoT sensor, a data feed going down, or environmental radiation causing a bit flip (sans good error correction). | 
How do you know that the X/y training data doesn’t have flaws? What about live usage, running the model against live input data? What about the model predictions (yhat)? In short: what’s the story of the data, to and from the model? Data wants reputation too. | |
Blockchain technology can help. Here’s how. At each step of the process of building models, and of running models in the field, the creator of that data or model can simply time-stamp it to the blockchain database, which includes digitally signing it as a claim of “I believe this data / model to be good at this point”. Let’s flesh this out even more… | 
Provenance in building models: | |
Provenance in testing / in the field: | |
We get provenance in both building the models, and applying them. The result is more trusted AI training data & models. | |
And we can have chains of this. Models of models, just like in semiconductor circuit design. Models all the way down. Now, it all has provenance. | |
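As a minimal illustration of the claim-and-timestamp idea (my sketch, not the author’s implementation), the snippet below content-addresses a model artifact and signs the claim with an Ed25519 key; submit_to_ledger is a hypothetical stand-in for whatever blockchain-database client you use. | 
```python
# Hedged sketch: sign a "this data/model is good" claim. The ledger call is hypothetical.
import hashlib, json, time
from nacl.signing import SigningKey      # pip install pynacl
from nacl.encoding import HexEncoder

def make_claim(artifact_bytes: bytes, signer: SigningKey) -> dict:
    # Content-address the artifact so the claim is tamper-evident.
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    claim = {
        "artifact_sha256": digest,
        "statement": "I believe this data/model to be good at this point",
        "timestamp": int(time.time()),
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    return {
        "claim": claim,
        "signature": signer.sign(payload).signature.hex(),
        "public_key": signer.verify_key.encode(HexEncoder).decode(),
    }

signer = SigningKey.generate()
model_bytes = b"serialized model weights would go here"   # stand-in for a pickled model
signed_claim = make_claim(model_bytes, signer)
# submit_to_ledger(signed_claim)   # hypothetical call to your blockchain-database client
print(signed_claim["claim"]["artifact_sha256"][:16], "...")
```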
Benefits include: | |
Data gets a reputation, because multiple eyes can check the same source, and even assert their own claims on how valid they believe the data to be. And, like data, models get reputations too. | |
A specific challenge in the AI community is: where are the datasets? Traditionally, they have been scattered throughout the web, though there are some lists here and there pointing to main datasets. And of course many of the datasets are proprietary, precisely because they have value. The data moat, remember? | |
But, what if we had a global database that made it easy to add and manage datasets or data feeds (free or otherwise)? This could include the broad set of Kaggle datasets from its various ML competitions, the Stanford ImageNet dataset, and countless others. | 
That’s exactly what IPDB could do. People could submit datasets, and use others’ data. The data itself would be in a decentralized file system like IPFS; and the meta-data (and pointer to the data itself) would be in IPDB. We’d get a global commons for AI datasets. This helps to realize the dream of the open data community. | |
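Here is a minimal sketch of what registering such a dataset could look like, under the assumption of a generic content-addressed file store; store_file and ipdb_register are hypothetical placeholders, not the real IPFS or IPDB APIs. | 
```python
# Hedged sketch: put the bytes in a content-addressed file store, put the metadata on IPDB.
# store_file and ipdb_register are hypothetical placeholders, not real library calls.
import hashlib, json

def store_file(data: bytes) -> str:
    # Stand-in for adding the file to a decentralized store; returns a content address.
    return "sha256-" + hashlib.sha256(data).hexdigest()

def register_dataset(name: str, data: bytes, license_terms: str) -> dict:
    content_id = store_file(data)
    metadata = {
        "name": name,
        "content_id": content_id,          # pointer to the data itself
        "license": license_terms,
        "schema_hint": "csv",
    }
    # ipdb_register(metadata)              # hypothetical call to the global metadata database
    return metadata

record = register_dataset("toy-image-labels", b"img_001,cat\nimg_002,not_cat\n", "CC-BY-4.0")
print(json.dumps(record, indent=2))
```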
We don’t need to stop at the datasets; we can include the models built from those datasets too. It should be easy to grab and run others’ models, and submit your own. A global database can greatly facilitate this. We can get models that are owned by the planet. | |
Let’s build on the application of “shared global registry” of training data and models. Data & models can be part of the public commons. But they can also be bought & sold! | |
Data and AI models can be used as intellectual property (IP) assets, because they are covered by copyright law. Which means: | 
I think it’s pretty awesome that you can claim copyright of an AI model, and license it. Data is already recognized as a potentially huge market; models will follow suit. | |
Claiming copyright of and licensing data & models was possible before blockchain technology; the law has supported this for a while. But blockchain technology makes it better, because: | 
IP on the blockchain is near and dear to my heart, with my work on ascribe going back to 2013 to help digital artists get compensated. The initial approach had issues with scale and flexibility of licensing. Now, these have been overcome, as I recently wrote about. The technology that makes this possible includes: | |
With this, we get data and models as IP assets. | |
To illustrate, using ascribe, I claimed copyright of an AI model that I’d made years ago. The AI model is a CART (decision tree) for deciding which analog circuit topology to use. Here is its cryptographic Certificate of Authenticity (COA). If you’d like to license an edition from me, just email me:) | |
Once we have data and models as assets, we can start to make exchanges for those assets. | |
An exchange could be centralized, like DatastreamX already does for data. But so far, they are really only able to use publicly-available data sources, since many businesses see more risk than reward from sharing. | |
What about a decentralized data & model exchange? By decentralizing data sharing in an “exchange” context, new benefits arise. In being decentralized, no single entity controls the data storage infrastructure or the ledger of who-owns-what, which makes it easier for organizations to work together or share data, as described earlier in this essay. Think OpenBazaar, for Deep Nets. | |
With such a decentralized exchange, we’ll see the emergence of a truly open data market. This realizes a long-standing dream among data and AI folks, including yours truly:) | |
And of course we’ll have AI-based algorithmic trading on those exchanges: AI algorithms to buy AI models. The AI trading algorithms could even be buying algorithmic trading AI models, then updating themselves! | |
This riffs on the previous application. | |
When you sign on to use Facebook, you’re granting Facebook very specific rights about what they can and can’t do with any data that you enter into their system. It’s licenses on your personal data. | |
When a musician signs with a label, they’re granting the label very specific rights: to edit the music, to distribute it, and so on. (Usually the label tries to grab all of the copyright, which is super onerous, but that’s another story!) | 
It can be the same for AI data, and for AI models. When you create data that can be used for model-building, and when you create models themselves, you can pre-specify licenses that restrict how others use them upstream. | 
Blockchain technology makes this easy, for all of these use cases, from personal data to music, from AI data to AI models. In the blockchain database, you treat permissions as assets, where for example a read permission (the right to view a particular slice of data or a model) is itself an asset. You as the rights holder can transfer these permissions-as-assets to others in the system, similar to how you transfer Bitcoin: create the transfer transaction and sign it with your private key. (Thanks to Dimitri de Jonghe for this.) | 
With this, you have far better control over the upstream use of your AI training data, your AI models, and more. For example, “you can remix this data but you can’t deep-learn it.” | 
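As an illustrative sketch of the mechanics (the transaction format and the broadcast step are hypothetical; only the signing uses a real library), a permission transfer could look like this: | 
```python
# Hedged sketch: a "read permission" modeled as a transferable, signed asset.
# The transaction structure and broadcast step are hypothetical; signing uses PyNaCl.
import json, time
from nacl.signing import SigningKey      # pip install pynacl

owner_key = SigningKey.generate()                                   # rights holder's keypair
recipient_pub = SigningKey.generate().verify_key.encode().hex()     # recipient's public key

transfer_tx = {
    "operation": "TRANSFER",
    "asset": {
        "type": "permission",
        "scope": "read",                           # e.g. the right to view one slice of a dataset
        "resource": "dataset/toy-image-labels",
        "terms": "may remix, may not deep-learn",
    },
    "to": recipient_pub,
    "timestamp": int(time.time()),
}

payload = json.dumps(transfer_tx, sort_keys=True).encode()
transfer_tx["signature"] = owner_key.sign(payload).signature.hex()
# broadcast(transfer_tx)   # hypothetical call that submits the transaction to the network
print("transfer signed:", transfer_tx["signature"][:16], "...")
```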
This is likely part of DeepMind’s strategy in their healthcare blockchain project. Data mining healthcare data puts them at risk of regulation and antitrust issues (especially in Europe). But if users can instead truly own their medical data and control its upstream usage, then DeepMind can simply tell consumers and regulators “hey, the customer actually owns their own data, we just use it”. My friend Lawrence Lundy provided this excellent example (thanks Lawrence!). He then extrapolated further: | 
This one’s a doozy. An AI DAO is AI that owns itself, that you can’t turn off. I’ve previously discussed AI DAOs in three posts (I, II, III); I’ll summarize the “how” below. I encourage the interested reader to dive deeper. | |
So far, we’ve talked about blockchains as decentralized databases. But we can decentralize processing too: basically, storing the state of a state machine. Add a bit of infrastructure around this to make it easier to do, and that’s the essence of “smart contract” technologies like Ethereum. | 
We’ve had decentralized processes before, in the form of computer viruses. No single entity owns or controls them, and you can’t turn them off. But they had limits — they basically try to break your computer, and that’s about all. | |
But what if you could have richer interactions with the process, and the process itself could accumulate wealth on its own? That’s now possible via better APIs to the process such as smart contracts languages, and decentralized stores of value such as public blockchains. | |
A Decentralized Autonomous Organization (DAO) is a process that manifests these characteristics. It’s code that can own stuff. | |
Which brings us to AI. The AI sub-field called “Artificial General Intelligence” (AGI) is most relevant. AGI is about autonomous agents interacting in an environment. AGI can be modeled as a feedback control system. This is great news, because control systems have many great qualities. They have strong mathematical foundations going back to the 1940s (Wiener’s “Cybernetics”). They capture the interaction with the world (actuating and sensing) and adaptation (updating state based on an internal model and external sensors). Control systems are widely used. They govern how a simple thermostat adapts to a target temperature. They cancel noise in your expensive headphones. They’re at the heart of thousands of other devices, from ovens to the brakes in your car. | 
The AI community has recently embraced control systems more strongly. For example, they were key to AlphaGo. And, AGI agents themselves are control systems. | |
An AI DAO is an AGI-style control system running on a decentralized processing & storage substrate. Its feedback loop continues on its own, taking inputs, updating its state, actuating outputs, with the resources to do so continually. | |
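To ground the control-system framing, here is a toy feedback loop in the thermostat spirit; the sensor and actuator functions are simulated stand-ins for whatever the agent actually interfaces with. | 
```python
# Hedged sketch: a minimal sense -> decide -> actuate feedback loop, thermostat-style.
# read_temperature() and set_heater() are simulated stand-ins for real sensors/actuators.
import random

TARGET = 21.0          # desired temperature in Celsius
state = {"temperature": 17.0, "heater_on": False}

def read_temperature() -> float:
    # Simulated sensor: drifts with the heater's effect plus some noise.
    drift = 0.5 if state["heater_on"] else -0.3
    state["temperature"] += drift + random.uniform(-0.1, 0.1)
    return state["temperature"]

def set_heater(on: bool) -> None:
    state["heater_on"] = on    # simulated actuator

for step in range(20):
    reading = read_temperature()            # sense
    set_heater(reading < TARGET)            # decide + actuate (bang-bang controller)
    print(f"step {step:2d}: {reading:5.2f} C, heater {'on' if state['heater_on'] else 'off'}")
```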
We can get an AI DAO by starting with an AI (an AGI agent), and making it decentralized. Or, we can start with a DAO and give it AI decision-making abilities. | |
AI gets its missing link: resources. DAO gets its missing link: autonomous decision-making. Because of this, AI DAOs could be way bigger than AIs on their own, or DAOs on their own. The potential impact is multiplicative. | |
Here are some applications: | |
This essay has described how blockchain technology can help AI, by drawing on my personal experiences in both AI and blockchain research. The combination is explosive! Blockchain technologies — especially planet-scale ones — can help realize some long-standing dreams of AI and data folks, and open up several opportunities. | |
Let’s summarize: | |
If you liked this, you might like the following: | |
Thanks to the following reviewers / editors: Ali Soy, Bruce Pon, Adam Drake, Carly Sheridan, Dimitri de Jonghe, Gaston Besanson, Jamie Burke, Lawrence Lundy, Philipp Banhardt, Simon de la Rouviere, Troy McConaghy. If you have more feedback, please let me know and I’ll be happy to update the article.","14" | |
"dmca","https://medium.com/@dmca/the-mba-data-science-toolkit-8-resources-to-go-from-the-spreadsheet-to-the-command-line-cbb59ea82144","2","{""Data Science"",Learning,""Personal Development""}","97","8.39842767295598","The MBA Data Science Toolkit: 8 resources to go from the spreadsheet to the command line","I recently had the pleasure of speaking on a few panels about analytics to my fellow MBA students and alumni, as well as many Penn undergrads. After these talks, I’ve been asked for my advice on what the best resources are for someone coming from the business world (i.e., non-technical) who wants to develop the skills to become an effective data scientist. This post is an attempt to codify the advice I give and general resources I point people towards. Hopefully, this will make what I have learned accessible to more people and provide some guidance for those who realize that the future belongs to the empirically inclined (see below) but don’t know where to start their journey to becoming part of the club. | |
However, I would caution the reader that what I propose here is only a starting point on a journey towards really understanding the power of good data science. And, as Sean Taylor once told me, learn only what you need to accomplish your goal; if there are things on this list that you know you don’t need then skip them, you won’t hurt my feelings. At its core, data science is really about curiosity, optimism, and continual learning, all of which are ongoing habits rather than boxes to be checked. Therefore, I expect this list to evolve as the tools themselves change and as I continue to discover more about data science itself. | |
Linear algebra is a topic that underlies a lot of the statistical techniques and machine learning algorithms that you will employ as a data scientist. I like to recommend a MOOC I took through Coursera years ago, Coding the Matrix: Linear Algebra through Computer Science Applications. As the name implies, the course teaches linear algebra in the context of computer science (specifically using Python, which lends itself well to data science). There is also an optional companion textbook that makes a great reference manual. | |
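To give a flavor of why the linear-algebra-through-code angle pays off, here is a tiny NumPy sketch (my example, not taken from the course) that fits a least-squares line, the machinery underneath linear regression: | 
```python
# Hedged sketch: least squares via linear algebra, the machinery behind linear regression.
import numpy as np

# Five observations of (hours_studied, exam_score); made-up data for illustration.
X = np.array([[1.0, 1], [1.0, 2], [1.0, 3], [1.0, 4], [1.0, 5]])   # column of 1s = intercept
y = np.array([52.0, 58.0, 61.0, 70.0, 75.0])

# Solve min ||X b - y|| with a standard linear-algebra routine.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coeffs
print(f"score is roughly {intercept:.1f} + {slope:.1f} * hours_studied")
```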
Given that we use R at Wealthfront, I have a few resources that I think are important here. The first, R for Data Science, written by Garrett Grolemund and Hadley Wickham, will be published in physical form in July 2016 but is available for free online now. And rather than explain what the book is about in my own words, here are a few from the authors directly: | 
Next up, our friend Hadley has also written Advanced R, which covers functional programming, metaprogramming, and performant code as well as the quirks of R. | |
Hadley is also responsible for some of the packages I use every day that make 90% of common data science tasks quicker and less verbose. I recommend checking out the following libraries; they will change the way you write code in R: | |
For extra credit, check out yet another of Hadley’s books: R Packages. This is a great follow-up resource for those of you who want to write reproducible, well-documented R code that other people can easily use (“other people” includes your future self!) | 
This is probably the easiest section of the guide as you can teach yourself most of SQL in a few hours. Code School has both introductory and intermediate courses that you can get through in an afternoon. | |
The Sequel to SQL covers everything from aggregate functions and joins to normalization and subqueries. And while mastering these skills takes practice, you can still get an idea of what SQL can and cannot do without too much work. | |
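If you want to practice without setting up a database server, Python’s built-in sqlite3 module is enough; the toy schema below is my own, not from the courses: | 
```python
# Hedged sketch: a join plus an aggregate, the bread-and-butter SQL you can learn in an afternoon.
import sqlite3

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders    VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 15.0);
""")

query = """
    SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total_spend
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total_spend DESC;
"""
for row in conn.execute(query):
    print(row)            # ('Ada', 2, 55.0) then ('Grace', 1, 15.0)
```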
Without wading into the age-old Frequentist vs. Bayesian debate (or non-debate), I think that a solid foundation in Bayesian reasoning and statistics is a crucial part of any data scientist’s repertoire. For example, Bayesian reasoning underpins much of modern A/B testing and Bayesian methods are applied in many other areas of data science (and are generally covered less in introductory statistics courses). | |
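As a small taste of the A/B-testing connection, here is a Beta-Binomial sketch with made-up conversion numbers (not an example from the book) that compares two variants by sampling from their posteriors: | 
```python
# Hedged sketch: Bayesian A/B test with Beta-Binomial conjugacy and Monte Carlo comparison.
import numpy as np

rng = np.random.default_rng(42)

# Made-up results: conversions / visitors for variants A and B.
conv_a, n_a = 120, 2400
conv_b, n_b = 145, 2300

# With a flat Beta(1, 1) prior, the posterior is Beta(conversions + 1, misses + 1).
post_a = rng.beta(conv_a + 1, n_a - conv_a + 1, size=100_000)
post_b = rng.beta(conv_b + 1, n_b - conv_b + 1, size=100_000)

prob_b_better = (post_b > post_a).mean()
print(f"P(B beats A) is roughly {prob_b_better:.3f}")
print(f"expected lift is roughly {(post_b - post_a).mean():.4f}")
```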
John K. Kruschke has a great ability to break down complex material and convey it in a way that is intuitive and practical. Along with R for Data Science, his book Doing Bayesian Data Analysis is probably one of the best all-around resources for learning how to do data science in the R programming language. | 
Additionally, Kruschke’s blog makes a great companion resource to the textbook if you’re looking for more examples of problems to solve or answers to questions you still have after reading the book. And if a textbook isn’t exactly what you’re looking for, then Rasmus Bååth’s research blog, Publishable Stuff, is another great resource for learning about Bayesian approaches to problem-solving. | 
While most data scientists use far less machine learning than most people would think, there are plenty of tools from this domain that can be applied to answer questions that less exotic approaches might struggle with. In fact, the most important lessons to take away from courses such as Andrew Ng’s Machine Learning course on Coursera are the strengths and weaknesses of various algorithms. Knowing the limitations of different approaches can save hours, or even days, of frustration by allowing you to avoid using the wrong tool to solve a particular problem. Andrew Ng is an example of another academic who has a gift for making the complex seem simple. This is my favorite MOOC of all time and is worth taking even if becoming a data scientist is low on your list of priorities. | |
Much of what you will build as a data scientist will be code, and code needs to be stored, tracked and deployed. Learning how to use a Distributed Version Control System (DVCS) such as Git will allow you to do all of these things. More importantly, it will allow you to easily collaborate on code with your team and, in the context of the right engineering infrastructure, provide a level of protection from deploying irreversibly broken code. | 
If you are new to the world of Git it can be confusing at first, but once you get it, it seems super simple. The best courses I found for learning Git were from the team at Code School again. There’s probably a solid weekend’s worth of work here but trust me, it is worth the investment. | 
Then there’s GitHub, a web-based Git repo hosting service. Understanding the typical workflows associated with using a remote repository is critical, and it makes everything that you’ll learn in Git Real 1 and 2 significantly more useful. By the time you’ve taken these 3 courses you’ll know more than you’ll probably ever need to know about Git and GitHub. | 
I’m using Haskell here as a stand-in for functional programming, and Learn You a Haskell for Great Good! is one way to learn it. In the words of Roberto Medri: | 
While there are many languages out there that are well-suited to the functional paradigm, Haskell has a book that makes the language and functional programming incredibly simple. Learn You a Haskell is really entertaining to read and the exercises really help you understand what you are doing. | |
I think most data scientists would agree this is one of the most important skills in the toolkit. All the maths, statistics, modeling and coding that go into good data science, and whatever new insight about the world they produce, can be wasted if you aren’t able to communicate it effectively to others. The most powerful tool we have to convey information is visualization, without which data scientists would be somewhat useless. | 
There are many great writers on this topic and, therefore, many great books, so I don’t mean to claim that this particular recommendation is either superlative or exhaustive. That said, Now You See It, by Stephen Few is a fairly comprehensive overview of the theory behind, and practical application of, conveying quantitative information through visual media. It’s a resource that I have found myself coming back to time and time again when deciding how to display data or communicate information. | |
I hope these resources can provide a roadmap to help other people bridge the gap between the technical and business domains that data science sits between. However, while knowing maths and statistics and being able to write code are all crucial to being a data scientist, these things are just tools that merely enable the deep work that constitutes a lot of the typical day in data science land. | 
In fact, developing these skills is not the hardest part of becoming proficient in data science. Learning to define feasible problems, coming up with reasonable ways of measuring solutions and, believe it or not, storytelling, are some of the less concrete, but certainly more challenging, aspects of data science that I’ve had to get better at. These skills come from practice, from making mistakes, and from learning from them as you progress in your career. For more insight, Yanir Seroussi has a great blog post that I think sums this up well. | 
Lastly, there are three traits that people in data science seem to possess in varying but significant proportions: genuine curiosity, optimism in the face of uncertainty, and a desire to learn. I don’t think there is a book or a MOOC to teach these things, but if you have them then you can learn the rest. I hope this guide can be a starting point for others who choose to do so.","2" | 
"mattkiser","https://medium.com/emergent-future/machine-learning-trends-and-the-future-of-artificial-intelligence-2016-15c15cd6c129","4","{""Machine Learning"",""Artificial Intelligence"",""Big Data"",""Data Science"",Analytics}","96","4.7377358490566","Machine Learning Trends and the Future of Artificial Intelligence 2016","Every company is now a data company, capable of using machine learning in the cloud to deploy intelligent apps at scale, thanks to three machine learning trends: data flywheels, the algorithm economy, and cloud-hosted intelligence. | |
That was the takeaway from the inaugural Machine Learning / Artificial Intelligence Summit, hosted by Madrona Venture Group* last month in Seattle, where more than 100 experts, researchers, and journalists converged to discuss the future of artificial intelligence, trends in machine learning, and how to build smarter applications. | |
With hosted machine learning models, companies can now quickly analyze large, complex data, and deliver faster, more accurate insights without the high cost of deploying and maintaining machine learning systems. | |
“Every successful new application built today will be an intelligent application,” said Soma Somasegar, venture partner at Madrona Venture Group. “Intelligent building blocks and learning services will be the brains behind apps.” | 
Below is an overview of the three machine learning trends leading to a new paradigm where every app has the potential to be a smart app. | |
Digital data and cloud storage follow Moore’s law: the world’s data doubles every two years, while the cost of storing that data declines at roughly the same rate. This abundance of data enables more features, and better machine learning models to be created. | |
“In the world of intelligent applications, data will be king, and the services that can generate the highest-quality data will have an unfair advantage from their data flywheel — more data leading to better models, leading to a better user experience, leading to more users, leading to more data,” Somasegar says. | |
For instance, Tesla has collected 780 million miles of driving data, and they’re adding another million every 10 hours. | |
This data is fed into Autopilot, their assisted driving program that uses ultrasonic sensors, radar, and cameras to steer, change lanes, and avoid collisions with little human interaction. Ultimately, this data will be the basis for the autonomous, self-driving car they plan to release in 2018. | 
Compared to Google’s self-driving program, which has amassed just over 1.5 million miles of driving data, Tesla’s data flywheel is in full effect. | 
All the data in the world isn’t very useful if you can’t leverage it. Algorithms are how you efficiently scale the manual management of business processes. | |
“Everything at scale in this world is going to be managed by algorithms and data,” says Joseph Sirosh, CVP of Data Group and Machine Learning at Microsoft. In the near-future, “every business is an algorithmic business.” | |
This creates an algorithm economy, where algorithm marketplaces function as the global meeting place for researchers, engineers, and organizations to create, share, and remix algorithmic intelligence at scale. As composable building blocks, algorithms can be stacked together to manipulate data, and extract key insights. | |
In the algorithm economy, state-of-the-art research is turned into functional, running code, and made available for others to use. The intelligent app stack illustrates the abstraction layers, which form the building blocks needed to create intelligent apps. | |
“Algorithm marketplaces are similar to the mobile app stores that created the ‘app economy,’” Alexander Linden, research director at Gartner said. “The essence of the app economy is to allow all kinds of individuals to distribute and sell software globally without the need to pitch their idea to investors or set up their own sales, marketing and distribution channels.” | |
For a company to discover insights about their business, using algorithmic machine intelligence to iteratively learn from their data is the only scalable way. It’s historically been an expensive upfront investment with no guarantee of a significant return. | |
“Analytics and data science today are like tailoring 40 years ago,” Sirosh said. “It takes a long time and a tremendous amount of effort.” | 
For instance, an organization needs to first collect custom data, hire a team of data scientists, continually develop the models, and optimize them to keep pace with the rapidly changing and growing volumes of data — that’s just to get started. | |
With more data becoming available, and the cost to store it dropping, machine learning is starting to move to the cloud, where a scalable web service is an API call away. Data scientists will no longer need to manage infrastructure or implement custom code. The systems will scale for them, generating new models on the fly, and delivering faster, more accurate results. | |
“When the effort to build and deploy machine learning models becomes a lot less — when you can ‘mass manufacture’ it — then the data to do that becomes widely available in the cloud,” Sirosh said. | |
Emerging machine intelligence platforms hosting pre-trained machine learning models-as-a-service will make it easy for companies to get started with ML, allowing them to rapidly take their applications from prototype to production. | |
“As companies adopt the microservices development paradigm, the ability to plug and play different machine learning models and services to deliver specific functionality becomes more and more interesting,” Somasegar said. | |
With open source machine learning and deep learning frameworks like Scikit-Learn, NLTK, Numpy, Caffe, TensorFlow, Theano, and Torch running in the cloud, companies will be able to easily leverage pre-trained, hosted models to tag images, recommend products, and do general natural language processing tasks. | 
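To illustrate what “an API call away” can look like, here is a generic sketch; the endpoint, token, and response fields are hypothetical placeholders rather than any particular vendor’s API: | 
```python
# Hedged sketch: calling a hosted, pre-trained model over HTTP.
# The URL, auth token, and response schema are hypothetical placeholders.
import requests

API_URL = "https://api.example-ml-host.com/v1/models/image-tagger:predict"
API_TOKEN = "YOUR_API_TOKEN"      # hypothetical credential

def tag_image(image_url: str) -> list[str]:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"instances": [{"image_url": image_url}]},
        timeout=10,
    )
    response.raise_for_status()
    # Assume the service returns {"predictions": [{"tags": [...]}]}.
    return response.json()["predictions"][0]["tags"]

# print(tag_image("https://example.com/espresso-machine.jpg"))  # would call the hypothetical service
```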
“Our world view is that every company today is a data company, and every application is an intelligent application,” Somasegar said. “How can companies get insights from huge amounts of data and learn from that? That’s something that has to be brought up with every organization in the world.” | |
As the data flywheels begin to turn, the cost to acquire, store, and compute that data will continue to drop. | |
This creates an algorithm economy, where the building blocks of machine intelligence live in the cloud. These pre-trained, hosted machine learning models make it possible for every app to tap into algorithmic intelligence at scale. | |
The confluence of data flywheels, the algorithm economy, and cloud-hosted intelligence means: | |
“We have come a long way,” says Matt McIlwain, managing director at Madrona Venture Group. “But we still have a long way to go.” | 
*Disclosure: Madrona Venture Group is an investor in Algorithmia","12" | |
"washingtonpost","https://medium.com/thewashingtonpost/this-is-what-its-like-to-investigate-a-police-department-accused-of-misconduct-e501efb4a473","4","{Police,""Criminal Justice"",""Big Data"",""Reporters Notebook"",""Criminal Justice Reform""}","96","6.09245283018868","This is what it’s like to investigate a police department accused of misconduct","By Kimbriell Kelly | |
The Department of Justice in 2012 recommended that a suburban Chicago police department track officer behavior in order to find officers who might be prone to misconduct, after allegations that officers used excessive force against people. | |
The Harvey Police Department said it installed the system. The sheriff’s department had its doubts. | |
Could it be true that the database wasn’t being used? It would not have been the first time. The Justice Department, in its own investigations, found that departments like Baltimore, New Orleans and Newark operated these systems only in theory; they no longer worked. | 
I decided to investigate. I gave myself three days to find out and booked a flight to Chicago. Here’s the story behind my story for The Washington Post. | |
Even though I live in Washington, D.C., I was born in Harvey. And while I never lived in the suburb, I went to church there, had friends there and played tennis matches in town against our school district rivals. | |
One of the city’s aldermen who went to high school with my best friend agreed to show me and Joshua Lott, the photographer for the story, around the area. The alderman, his assistant, Joshua and I all piled into a car, and we were off. | |
At 12:52 p.m., four hours into our tour, we were at the police station taking pictures when Joshua was approached by a man who identified himself as a police officer. He asked Joshua what he was doing. The man left, and then returned to ask for Joshua’s press credentials. He obliged. | |
Next, we headed to a neighborhood where most of the homes in the area were boarded up and the grass was overgrown. Joshua and I decided to get out of the car and look around. | |
Suddenly, two squad cars appeared out of nowhere and boxed us in. | |
Something seemed odd. The neighborhood was virtually empty. So I took out my phone and began recording. It was 1:07 p.m., just fifteen minutes after our first encounter with police at the police station. | 
One officer pulled alongside the driver’s side and said: “What are you doing?” He motioned for the alderman to roll down his window. He told him not to park in the middle of the street and to pull off to the curb. Then, both squad cars drove off. | |
We headed out of the neighborhood and into downtown. We parked on the side of the road in front of a local business. As I was taking notes, a squad car again appeared in front of us, facing the opposite direction of traffic. It was 1:19 p.m., twelve minutes after our second encounter with police. The squad car remained there for five or 10 minutes and then left. | 
At this point, I had been in town four hours and had been stopped or surveilled by four squad cars. | |
Coincidence? Who knows. | |
This was only the beginning. I still had to talk to the police chief, who seemed to have been avoiding me. Before the visit, I had phoned and sent e-mails that went unanswered. | |
I found out there was a board meeting the next night. What better way to talk to a city official than the board meeting where department heads are required to attend? | |
I arrived early. The chief was there. But before I could get to him, a man intercepted me in the hallway. | |
“You must be the reporter from The Washington Post,” he said. | |
He introduced himself as Sean Howard, the spokesman for the mayor and the police chief. We had talked before my visit. As he stood before me, I asked him for an interview with the police chief. He declined, and said that we could set up an interview another time. | |
I was standing at the entrance of the council chambers, and the chief would need to pass me to get into the room. Surely there were a few minutes he could talk to me. | |
Howard offered to introduce me, but that was it. | |
I was running out of time, and the meeting was about to start. I had to do something, and quickly. | |
As the chief approached, Sean introduced the two of us and before the chief could continue on his way, I could do nothing but blurt out: “I was born here and I went to school here!” | |
The chief’s face relaxed, he smiled and then he started to ask me questions. It turns out he knew my old pastor, his son and I went to the same college, and his son now teaches at my high school alma mater. | |
“Stop by tomorrow morning,” the chief said. The spokesman, as I’m still standing there, tells the chief that I was only saying those things to try and somehow woo him. | |
Well, yes. But it was also true. | |
We agreed to meet at 11 a.m. | |
The next morning, I got a text from the spokesman. He wanted to reschedule, but I told him I’m already on my way. He said the chief forgot about another meeting that was coincidentally around the same time as the interview. | |
I said I was flying out in a matter of hours and promised to keep my interview short. Howard responded tersely: “He’s not in…unless by phone.” | |
I was desperate. I hadn’t come all this way for nothing. | |
“I’m a few minutes away. I can meet him wherever,” I texted. | |
No response. | |
Something told me to stay the course. I arrived at the department, went to the front desk and told the woman there I had an 11 a.m. appointment with the chief. She picked up the phone, “Chief, your 11 a.m. is here.” | |
She hung up and said he’d be right out. | |
But somewhere between the phone call from the front desk and when the chief stepped out to meet me, Howard got a hold of him. The chief apologized and said he couldn’t talk to me after all. He said he wasn’t allowed to talk to the press without Howard present. | 
“Fine, then he can listen by phone,” I said. | |
Moments later, I was sitting in the chief’s office, joined by his deputy chief who oversaw the implementation of the early intervention system and Howard on speakerphone. Over the course of more than an hour, the two chiefs walked me through the intricacies of the system. It was functioning. | |
I asked to take photos as proof that it worked and to better understand the categories that it tracked. I must admit, I was relieved that for the sake of the people who lived in Harvey, the system was operating. But for my story, it was probably a deal breaker. What’s the point of writing about a system that supposedly never got implemented, but actually did? | |
I called my editor to let him know. | |
I arrived back in D.C. and went through the photos of the system and re-read my documents about Harvey. After looking over the pictures I noticed a major category was missing. I called the department to make sure I wasn’t missing a photo. | |
But it turns out I was right. | |
While the system was installed, a key category that the Justice Department told Harvey to include was missing. For the two years Harvey had been operating the system, it had never tracked lawsuits filed against its officers for misconduct. Among those excluded were two recent suits, one alleging that an officer assaulted a pregnant teenager, causing her to miscarry. The other against the same officer for allegedly forcing a pregnant 20-year-old to have sex after pulling her over during a traffic stop. | |
The Department of Justice had recommended that they include that category. They didn’t. | |
The first case settled for $500,000. The other is currently in court. | |
One month after my visit, the police department informed me that the officer had quit. Two weeks after that, the department told me that they consulted with the city’s attorney and were now going to track lawsuits in their system going forward. | |
And that’s what I wrote. Read my investigation here. | |
Kimbriell Kelly is an investigative reporter for The Washington Post. Follow @kimbriellwapo","1" | |
"Mybridge","https://medium.com/mybridge-for-professionals/algorithm-top-10-articles-v-november-e73cba2fa87e","17","{Algorithms,Programming,""Software Development"",""Data Science"",""Machine Learning""}","96","2.8377358490566","Algorithm Top 10 Articles (v.November)","In this observation, we ranked nearly 1,000 articles posted in October-November 2016 about Algorithm and Mathematics for Programmers. (1% chance) | |
Mybridge AI ranks the best articles for professionals. Hopefully this condensed reading list will help you learn more productively in the area of algorithms. | 
Implementation of Reinforcement Learning Algorithms in Python, OpenAI Gym, Tensorflow [1672 stars on Github] | |
Algorithms for making more interesting mazes | |
A Swift Introduction to Algorithms — Part 1 | |
……………………………………. [Part II] | |
Sorting Secret: Two different sorting algorithms are actually the same — Graham Hutton, Professor of Computer Science at the University of Nottingham | |
The problem with p-values: It’s time for science to abandon the term ‘statistically significant’ — David Colquhoun, Professor at University College London | |
Algorithm for automatically grading multiple choice tests/exams from photos | |
Calculating Box Normals — The Nim Ray Tracer Project, Part 4. | |
Learning Fine-Grained Similarity for Fashion with Deep Multi-Modal Representations | |
Genetic Algorithms — Learn Python for Data Science | |
Algorithms Might Be Everywhere, But Like Us, They’re Deeply Flawed. Jonathan Albright | |
The Mathematics of Machine Learning | |
. | |
. | |
Bayesian Linear Regression — Notebooks Open Source | |
[247 stars on Github] | |
. | |
Learn the most popular & practical Data Structures and Algorithms in Java + HW | |
[1,519 recommends, 4.6/5 rating] | |
React.JS for Beginners by Wes Bos. | |
[9,232 recommends] | |
. | |
That’s it for Algorithm Monthly Top 10. If you like this curation, read daily Top 10 articles based on your programming skills on our iOS app.","7" | |
"airbnbeng","https://medium.com/airbnb-engineering/academia-to-data-science-99e68f36485e","1","{""Data Science"",Academia,Careers}","95","6.26037735849057","Academia to Data Science","By Avneesh Saluja, Alok Gupta, Cuky Perez | |
When you’re in graduate school, it seems like the only career option available is to remain in the ivory tower. And it’s reasonable to see why — your advisor and peers are very likely to encourage you to follow their chosen career path. Indeed, the selection bias is strong amongst those who surround you. And when you are a professor, you believe that the only job where you can expand the knowledge base, teach, and mentor others, is within the academic setting. | |
However, taking a quick glance at the Ph.D. labor market shows that the number of doctorate-holders we produce annually exceeds the number of positions available. Also, Silicon Valley (and the tech industry in general) has been an appealing destination for many former academics and those with a research bent, but the leap from academia to industry is not an easy one. Michael Li documents the mindset shift required in one of his recent blog posts, and frames this shift within the context of delivering business-impactful results quickly (in industry) compared to delivering perfect results (in academia). | |
While we agree with this basic trade-off, at Airbnb we feel that the mindset shift is slightly more nuanced. In this post, we first discuss the skills (both hard and soft) we look for in candidates hoping to transition from academia to Airbnb, followed by specific pieces of advice for those looking to move into the fast-paced startup world of Silicon Valley. | |
Data Science is very much an overloaded term these days for all things data-related at technology and startup companies. It sits at the intersection of Mathematics/Statistics, business domain knowledge, and ‘hacking’. Data Scientists are asked to extract insights from data to drive a company’s metrics. At Airbnb this can mean munging data to inform which experiment to launch next or building a machine learning model to optimize our user experience. At Airbnb, when considering candidates coming out of advanced courses at graduate school, in addition to technical attributes and alignment with our core values, we think about 4 attributes: | |
With your advanced level of education, we expect that you are likely top of your field and very successful at everything you have touched so far academically. However, academic success and experience do not necessarily translate to industry success. We hope candidates are level-headed and mindful of what they do not yet know about the business. Airbnb has a strong culture of mentorship and personal growth — getting here is not the end of the journey, there is still a lot to learn. We look for people who are eager to learn more and have an open mind to expand their skill set outside of their area of expertise. | 
We expect that a PhD candidate has learnt the art of self-management. More than anything, graduate school should teach a student how to direct and prioritize their own learning. A senior researcher typically learns how to see a dead end approaching earlier, and quickly pivots their energy in a likely more fruitful direction. Research in a competitive field also provides opportunities to challenge peers and push back on assumptions. In Data Science at Airbnb we expect this to translate into not accepting the status quo but pushing the boundaries of our assumptions. | 
Sometimes we see senior academics underperform in their communication. Airbnb is a very collaborative environment with a Data Scientist typically working with other Data Scientists but also with Engineers, Designers, Product Managers, and non-technical people. Being good with data is important, but at Airbnb we need the insights to be well translated to all audiences — from a Data Scientist on your team to the CEO — otherwise the recommendations might not have the impact they merit. In both written and verbal communication, articulation of insights, methods, and assumptions must be crisp, convincing, and sympathetic to the audience. | |
It can take years to publish an article in academia but in industry the turnaround is much faster. That’s not to say the quality is poorer, it’s just that at Airbnb we expect to get a first version of a data product out as soon as possible and then continue to iterate where potential improvement is likely. Throughout our Data Science interview process at Airbnb we are looking for entrepreneurial spirit and candidates that can get past needing a perfect solution before shipping a data product or sharing an insight. | |
An academic’s personal experience when transitioning away from the ivory tower will vary significantly based on the field they are coming from. Given the tighter coupling in subject matter between industry and academia these days (especially within fields like computer science and applied economics), the lag between a research idea and integration into an end product is often on the order of months now as opposed to years. Thus, those who make the leap often possess the “hard skills” required for the job, such as strong programming and scientific computing knowledge. However, it is often harder to learn the softer skills and adapt one’s mindset towards industry. We break down the mindset shift into 4 broad areas: | |
Academic research does an excellent job of abstracting the core problem from the messy details that often surround that problem in the real world. Sometimes we are provided sets of training data that are nicely curated and cleaned, and evaluation is performed on a well-documented and benchmarked test set that many others have evaluated. Some thought may go into additional data cleaning, but much of the information contained in the data, especially the training labels, are a given. Other times we may collect our own data but from a tightly controlled field or laboratory experiment where we can minimize data contamination. | |
Unfortunately, this is not the case in industry. A lot more thought and creativity has to go into how to set up the problem in the first place. Are the labels that we extract or derive from logged data the signals that we need to solve our problem? Are there any bugs in the instrumentation? Are we even logging the information we need? Data Scientists need to understand the problem domain deeply and use that information to transform both the problem and the data to be able to produce something meaningful. Getting things to work in these fast-moving environments often requires as much, if not more creativity, as an abstract research problem. | |
Following up on that note, in academia we are often focused on a fairly different regime of the problem. Given quality training and test data a priori, we are often tasked with investigating novel, state-of-the-art solutions to improve performance from an already strong level to something even stronger. On the flip side, the task in industry is to deploy a model where none has ever existed before. There is nothing to compare against, and translating intrinsic evaluation metrics (e.g., AUC) to business impact is challenging at best and perilous at worst. | |
Given these realities, it’s advisable in these situations to deploy a model that isn’t perfect, but is “80% of the way there”. Optimizing for the remaining 20% of performance oftentimes reflects not only a lack of prioritization (“premature optimization”) but may also simply not be possible, given the disconnect between intrinsic and extrinsic evaluation metrics. | 
While Michael mentioned in his post that it’s more important to deliver bottom-line impact than disseminate knowledge, at Airbnb we feel that the ultimate goal is to achieve a balance between these two extrema. Many of us were motivated to enter academia in the first place because we genuinely enjoy producing and disseminating knowledge, and within an industry setting there is significant value in effectively spreading these nuggets of knowledge and not have people reinvent the wheel. To this end, Airbnb has built the knowledge repository, which we recently open-sourced. The repository is our in-house peer-reviewed publications forum, and Data Scientists are encouraged to participate as actively as they can. We also have weekly seminars where Data Scientists or other leaders in the field can present their work, and mentorship is encouraged throughout the company. | |
A successful researcher is one who knows not only the solutions to difficult problems, but also the right questions to ask. Asking the right questions requires a proactive attitude — instead of expecting to be handed a problem to work on, you need to identify the opportunities and craft a direction of inquiry accordingly. Proactive inquiry is a skill that is well-developed in academia, and is one of the reasons Airbnb encourages top academics to apply. | |
To rehash a well-known quote: “it doesn’t make sense to hire smart people and tell them what to do; we hire smart people so they can tell us what to do”. We hire people from all kinds of academic backgrounds and qualifications precisely because these different vantage points provide a unique array of tools to tackle the kinds of interesting and challenging problems we face at Airbnb. | |
If you are interested in learning more about Data Science opportunities at Airbnb then you can visit our careers page here.","1" | |
"clarecorthell","https://medium.com/@clarecorthell/hybrid-artificial-intelligence-how-artificial-assistants-work-eefbafbd5334","15","{""Artificial Intelligence"",""Machine Learning"",""Data Science""}","94","9.77547169811321","Hybrid Intelligence: How Artificial Assistants Work","A world with artificial intelligence was once science fiction, but is now the daily work of the software industry. From self-parking cars to business process automation and city planning to voice translation, artificial intelligence technology is transforming products and processes across industries. | |
The goal of these technologies is to reduce human involvement and repetitive effort in the processes they target, yet we won’t stop writing email or folding towels anytime soon. We’re in a transitional era in which responsibilities are apportioned among humans and computers based on competence and priority. | 
In the transition, there’s been much confusion about whether humans or computers are the primary actors behind products like Facebook M and Operator. But as I’ll explain, humans will always be part of the system. So we shouldn’t focus on how to replace humans in a system, but rather on how to optimally integrate the contributions of humans and computers. Here, I outline how this currently works and frame the intelligence technology goals moving forward. | 
In the year 2000, many technologists had a grand vision of what the internet would become. Some believed that the internet would be the great catalogue of human knowledge; it would be highly structured and standardized, organized beyond any book previously written. Tim Berners-Lee, the visionary of a movement called the Semantic Web, explained his vision: | 
He believed content could be easily understandable for both humans and machines if we agreed on a structured format. For example, the phrase “Paul Schuster was born in Dresden” would be formatted: | |
If we could collate our human knowledge appropriately, perhaps we might simply give it over to the machine, let it do the rest, and profit. | |
But the reality is that the internet is a huge mess. Understandably, no one decided to painstakingly notate their text as the Semantic Web prescribed, so the internet is full of language only humans can easily understand. Machines can’t take over our tasks, and we still call customer service to get answers from real people. | |
So these realities birthed a new need — to collate, corral, and catalog the internet after its contents were brought into existence. In fact, the internet is such an intractable mess that the most valuable company in the world builds a product called “search” that allows users to dig through the trash heap to find what they need. We’re back to that original problem, of structuring data such that eventually computers can use it. | |
Search, like other ordinating technologies, has utilized a huge variety of software approaches to attack this tough problem. And one discipline has become our most valuable ally. | |
Artificial Intelligence includes any system that emulates intelligence (by whatever definition is in vogue; it was once winning at chess, now it’s scheduling flights). A few technologies are likely to be part of an AI system: | 
Broadly stated, Machine Learning takes patterns from data about the past and infers what might happen in the future. For example, say I build a machine learning algorithm using many pictures of cats. It may learn how to recognize a picture of a cat. | |
But what does it call a picture of an espresso machine? A “not cat.” It’s exclusively acquainted with cats, decidedly not with espresso machines. | 
For reasons like this (and pictures of cats without hair), machine learning algorithms can typically only achieve about 70–80% accuracy and completeness under normal conditions, and up to 90% in certain circumstances. | |
This means that if you build a service on top of one of these algorithms, it will be inaccurate or fail 1 out of 4 times on average. Imagine if a customer service agent gave the wrong information one out of every four times — that’s not reliable enough for a consumer. | |
Because machine learning hits a wall of accuracy, it hits a wall of usefulness. But where computers will be stumped, humans really shine: | |
The important insight is that computers are useful for specific discrete tasks, but less useful in broader inductive tasks. Neither the computer nor the human is objectively better — they simply have different strengths. | |
Perhaps the most subtle and important human characteristic is the ability to learn new things. A computer cannot intuit when it encounters something it’s never seen before. Humans have a je ne sais quoi, a creative and adaptive capacity to theorize beyond past experiences. | |
The ultimate upshot is that we aren’t going to see one become superior to the other. Knowing this, we have to take a different approach in building businesses and technologies with artificial intelligence, because intelligence will never be 100% artificial. | |
In the effort to leverage what humans and computers each do best, a new paradigm has emerged. In Hybrid Intelligence, the system asks humans to make judgments whenever the computer is less confident — resulting in the most accurate, trustworthy system. | |
The most prevalent use of hybrid intelligence has been in the data systems of enterprise companies, primarily for their own purposes. Large tech companies tend to use a combination of humans and computers to create, enrich, and validate data. This data is often used to train further machine learning algorithms to progressively increase automation. | |
In the last few years, we’ve also started seeing consumer applications that utilize hybrid intelligence. To the curious excitement of consumers, artificial assistants (or “personal agents”) perform simple tasks similarly to a human assistant, such as scheduling meetings, finding flights, delivering flowers to your boyfriend. | |
As useful as these may become, there’s a great deal of mystique around the inner workings of artificial assistants. Are they human? Computer? Magic? Let’s define how they work. | |
In machine learning, a (supervised) model is learned from a set of training data, and using it on new data results in predicted values. In the Active Learning design pattern, supervised machine learning and human decision making are integrated. Tasks will be sent to the computer, but when it is less confident, a human will be called upon to make a judgment. | |
For the uninitiated, a design pattern is a general form of a solution to a problem. | |
This intelligently reduces the workload. Instead of humans processing all data, humans monitor only the most atypical or outlying cases, or where the computer lacks confidence. The confidence metric determines whether the prediction is diverted to a human or not. The pattern is “active” because the human’s prediction is sent back to the algorithm to reinforce it and improve its performance. | |
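Here is a minimal sketch of that routing logic; it is my illustration of the pattern, with a made-up confidence threshold, a toy model, and a stand-in function playing the part of the human workforce: | 
```python
# Hedged sketch of the Active Learning routing pattern:
# confident predictions go out automatically, uncertain ones go to a human,
# and the human's label is fed back into the training set.
from sklearn.linear_model import LogisticRegression

CONFIDENCE_THRESHOLD = 0.85          # made-up cutoff; tuned per product in practice

# Tiny toy training set: feature = message length, label = 1 if "needs escalation".
X_train = [[5], [8], [10], [40], [55], [70]]
y_train = [0, 0, 0, 1, 1, 1]
model = LogisticRegression().fit(X_train, y_train)

def ask_human(x):
    # Stand-in for a real human-in-the-loop queue (e.g. an internal review tool).
    return 1 if x[0] > 30 else 0

def handle(x):
    global model
    proba = model.predict_proba([x])[0]
    confidence, label = proba.max(), int(proba.argmax())
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, "auto"
    # Low confidence: route to a human, then reinforce the model with their answer.
    human_label = ask_human(x)
    X_train.append(x); y_train.append(human_label)
    model = LogisticRegression().fit(X_train, y_train)
    return human_label, "human"

for example in ([7], [33], [80]):
    print(example, handle(example))
```
The retraining step is deliberately naive here; in practice the human labels would flow back into a proper training pipeline. | 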
For example, a message might be broken down: | |
We go from a normal English sentence to a structured request, almost as if it were entered in a form. One could imagine talking to the internet this way, filling out forms simply using English, or speaking to Siri and asking her to make a reservation for you tonight. It should be simple, right? | 
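To make “structured request” concrete, here is an illustrative (and entirely hypothetical) breakdown of a request like “Book a table for two at an Italian place near the office at 7 tonight”: | 
```python
# Hedged sketch: the kind of structured request a parser aims to extract.
# The fields and values are illustrative, not a real NLU schema.
structured_request = {
    "intent": "book_restaurant",
    "party_size": 2,
    "cuisine": "italian",
    "location": {"near": "office"},
    "time": "19:00",
    "missing_slots": ["exact_restaurant"],   # what the system still has to resolve
}
```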
If only. How extracted elements are related, combine, and constrain one another is anything but straightforward. Here’s a journalist’s interpretation of how the computer determines the answer to a request: | |
At this point, it’s important to note that this request is much more complex to a computer than a human. A computer needs disparate data sources and the capacity to choose between options to ultimately arrive at a recommendation. A human would appreciate the nuances of the task much more quickly, and focus on the wine pairing after little hesitation to understand the remaining context. Human intelligence is built to distill exactly this type of complexity and nuance in communication. | |
Let’s briefly return to the original goal: to reduce the amount of human involvement in intelligent systems. Using a computer can make tasks faster and easier to accomplish. Active Learning lets us scale systems 10 or 20 times beyond the size and efficiency of what humans alone could accomplish.
Yet the Active Learning pattern is not always adequate, such as when: | |
Because of these issues, we’ve seen another pattern arise that integrates humans and computers even further, allowing for intra-task division of labor and complete validation. | |
In Hybrid Interaction, the computer structures the request and suggests a response, which a human then decides to either send or recompose. Hybrid Interaction has become possible thanks to more accessible distributed workforces of people who perform tasks opportunistically.
In contrast to Active Learning, the human always makes a decision at the end of the process, but doesn’t always do the bulk of the work for the task. The computer gives a suggested response, and the human decides whether to correct it before sending it. Because a human always audits the final outcome, the system optimizes for a more accurate response, leveraging what the computer is best at and combining it with the judgment and intuition of the human. | |
Of course, the core tradeoff of this pattern may be obvious: to gain accuracy, you accept lower throughput (the rate at which requests are processed) and higher human effort (a person touches every request).
Interestingly, this solution didn’t arise from a technological breakthrough. A better, more intelligent solution was created by combining a distributed human workforce with artificial intelligence, and determining whether to lean on the computers or humans for validation. | |
Intelligence technologies can be visualized based on how much humans and computers are involved in tasks, from 0% to 100% for each. This covers a spectrum of intelligence capacities — from ordering more detergent and classifying news stories to booking hotels and getting tech support. | |
In this graphic, artificial assistants fall under Hybrid Interaction, while enterprise data enrichment (vendors shown, though in-house systems qualify) falls under Active Learning. Though consumer applications (bots, artificial assistants) are likely the more recognizable names, the enterprise data management and virtual call center spaces are lucrative and equally worthy of attention.
One founder of an artificial assistant company told me that 25% of their work is routed through a computer; humans still do the bulk of the work, about 75%. Though artificial assistants don’t yet work efficiently, there’s no reason to believe they will continue to perform that way. As any investor will tell you, the point is to invest in a technology early and ride its progress toward the future.
For artificial assistants, the technological goal is to split the work 90% computer, 10% human. As we’ve discussed, humans will always be part of the equation, ultimately dealing with the 10% of cases that are most difficult or unknowable. That work will not be boring or routine; it will be the meaty, tough cases. That’s the dream.
But there’s one last important thing to pay attention to — and it has to do with design. | |
Artificial assistant applications anthropomorphize themselves, using human names and interacting through natural human language. Any bot can be asked to tell a joke or say whether it likes you, and it will always answer in an ever-placid tone.
And therein lies the trick — humans know how to interact with an English-speaking entity that sweetly answers benign questions. It’s a familiar, intuitive interface, one we were first trained on by our mothers. We expect the computer to understand us, and in return we often feel gratitude, a sense of being taken care of. We treat these systems as a single artificial identity and intelligence, though they are really a coordinated dance of database queries, logic, and editing by many humans under one humanoid name. It’s a simple psychological sleight of hand, and it works. When Alexa says she doesn’t understand, we admit to giving her too little information and expect her to improve. We treat her like a learning child. It’s not far from reality.
As we grapple with how to build the most profitable, useful, efficient intelligence technologies, the most important question becomes: How do we best enable humans and computers to work together? The belief that we can end human intervention in intelligent technology is unrealistic, even given infinite resources. Let us focus on building toward Hybrid Intelligence, the optimally efficient integration of the talents of humans and computers. Done well, we make the smartest team. | |
This was originally a talk at the Humans+Machines Conference in Manila, February 2016","5" | |
"felipebcabral","https://medium.com/data-science-brigade/como-funciona-a-operação-serenata-de-amor-25ba256e0e11","1","{Corruption,""Data Science"",""Big Data"",""Open Source"",""Serenata De Amor""}","94","2.09433962264151","Como funciona a Operação Serenata de Amor","Antes de começar qualquer coisa, nós tivemos a sorte de viver em tempos que a democracia conta com a transparência como métrica de corrupção. Quanto mais transparente um governo é, menos corrupto ele é. | |
Por sorte temos algumas ações nessa linha no governo nacional. Tanto na lei, quanto na distribuição de informação. Tivemos acesso a todos os gastos da quota de atividade parlamentar. Cada deputado gastou em média 266 mil Reais. | |
O objetivo é criar um banco de dados que tem a função de oferecer respostas à diversas perguntas porque ele conecta MUITA informação, e também aprende sozinho, observando a realidade. Pelo fato de aprender sozinho sobre corrupção, estamos chamando ele de Robô. | |
Antes de montar cérebros de robôs que ficam observando dados públicos houve um trabalho em refinar os dados. Tornamos possível que esses dados fossem analisados e transferidos com facilidade, já que o que se encontra no portal de transparência não é NADA prático. Essa história está aqui. | |
A partir daí começamos a perguntar coisas como: Qual partido mais gastou? Qual deputado federal mais gastou? Qual a nota mais cara? Com quem? Quais são TODOS os endereços de cada CNPJ (são mais de 2 milhões e crescem 25 mil/mês)? Existe nota que demonstra que um deputado ter estado em dois lugares ao mesmo tempo? Qual é a quantia de dinheiro gasta especificamente com comida? Existem notas duplicadas? Existem valores discrepantes? Quais? Os primeiros resultados estão nesse link do Cuducos. | |
Nos perguntamos quais são todos os endereços. Em breve teremos também o valor comercial por área e vamos cruzar com todos imóveis locados através da quota e perguntar: Existe algum superfaturado? | |
Mas às vezes esbarramos em coisas como: qual é esse objeto/serviço descrito aqui? O que foi entregue justifica o valor? Para responder perguntas como essa contamos com uma participação humana que vai responder essas questões para o Robô: | |
OU | |
E nossa equipe vai traduzindo as respostas para que o Robô aprenda.No futuro é possível adequar as respostas para que possam ser digeridas pelo Robô, assim ele entenda as denúncias sozinho, sem auxílio humano. Para ajudar na construção dessas respostas, preencha esse formulário: https://goo.gl/forms/ic80HWqC29skDtsi1 | |
Com qualquer suspeita em mãos vamos investigar e procurar o parlamentar, informar do conteúdo do dossiê e perguntar se ele têm algo à dizer. (Direito de resposta). Sem pronunciamento, ou caso a resposta simplesmente reforce o entendimento do caso como corrupção, desvio, criaremos uma denúncia no Ministério Público e publicaremos o dossiê com todas as informações. | |
Todo o projeto está público (No respositório Open Source no GitHub — https://github.com/datasciencebr/serenata-de-amor) e qualquer órgão de combate à corrupção pode fazer uso dos algoritmos. Em qualquer lugar do mundo. | |
Siga as publicações da Data Science Brigade para saber mais da Operação Serenata de Amor. | |
Para permitir nossa dedicação de tempo integral ao projeto acesse https://www.catarse.me/pt/serenata","4" | |
"juddantin","https://medium.com/@juddantin/how-facebook-convinced-itself-fake-news-isnt-a-problem-737b1941a995","0","{Facebook,""Big Data"",""UX Research""}","91","5.01132075471698","How Facebook Convinced Itself Fake News Isn’t a Problem","If you’ve used Facebook this election season, you might find it a little strange just how easily Mark Zuckerberg dismissed the idea that fake news on Facebook contributed to the election’s outcome. By now his announcement has been thoroughly dismantled, and there’s even word of a rogue group of Facebook employees working on fake news. And to be fair, it sounds like Facebook is, in fact, working on this problem. | |
Who knows the true effect of fake news, but Zuckerberg’s ready denial feels out of touch. I used to work at Facebook, though, and I’m not surprised. Facebook prides itself on being a data and experimentation driven company, but this is a story about how myopic data-driven thinking can lead you astray. | |
My goal isn’t to throw Facebook under the bus — this stuff is really hard and I assume the well-intentioned and brilliant people there are doing the best they can. But there is a lot to learn for the rest of us about the power and limits of making decisions based on big data alone. | |
I think there are likely at least two fallacies at work here. | |
Mark seems to suggest that things that happen less than 1% of the time can’t be obvious, important, or change outcomes. But it doesn’t take much thinking to debunk that idea. Indeed rare events are sometimes the most influential, if only because they’re out of the ordinary. As Rick Webb compellingly writes, the research suggests it’s not just possible but probable that <1% of Facebook stories could have an effect on the election. | |
Mark also seems to have forgotten his denominator, something that’s disturbingly easy to do when you focus only on descriptive statistics. Nearly 1.2 billion people use Facebook every day. If 1% of them see a fake news story, that means 12 million people see at least one fake news story each day. Here’s another cut. Let’s say the average FB user sees 100 unique News Feed stories in a day and 1% are fake news. Well, you do the math. These are hamfisted calculations, but you get the point. | |
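Doing that math explicitly (the 100-stories-per-day figure is the assumption made above, not a Facebook statistic), the back-of-the-envelope version can be pasted into a Scala REPL:
    val dailyUsers      = 1.2e9   // people using Facebook every day
    val storiesPerUser  = 100     // assumed unique News Feed stories seen per day
    val fakeShare       = 0.01    // “less than 1%” taken at face value
    val fakeImpressions = dailyUsers * storiesPerUser * fakeShare
    // roughly 1.2 billion fake-news impressions per day under these assumptions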
This is the first way that data-driven decision-making may have gone astray at Facebook. Sitting atop the big machine, burdened by the need to make decisions that influence 1.7 billion people each month, it’s necessary to aggregate and abstract. Literally all the possible things happen on Facebook at least a little bit, and getting bogged down by any one of them is a recipe for paralysis. | |
But a shift in perspective could shift the decision-making process. As other commentators have pointed out, if 1% of the articles the New York Times or the Guardian printed were wrong, that would be utterly unacceptable, a total failure of journalism. And like it or not, Facebook is a journalistic entity. How does Facebook’s perceived responsibility as a news organization operate in this situation?
This is entirely about how you use data to understand experience. If all you’re doing is looking at aggregates, it’s easy to lose sight of the scale of even rare things. The trick here is to humanize big data in the course of decision-making — what does it mean and for whom. Few companies have data at the scale that Facebook does, but every company can face a flavor of this problem. Using big data to inform decisions is the right thing to do, but contextualizing that data in the experiences of users and in the role and impact of events is something Zuck seems to have missed. | |
I know we’ve all heard this catchphrase, and this is a dramatic illustration. Facebook is fantastic at counting things with high precision and large scale in real time. The ad business runs on it. Decision-making relies on it. But this is an example where it falls apart. | |
Mark seems to be focusing on frequency because it’s a known quantity. I saw this many times at Facebook — the winning argument is one that has a large N and many leading zeroes on the p value. When something cannot be measured in this way, it falls out of the decision-making process. In my time at Facebook I observed some leaders be actively hostile to these other ways of measuring. | |
What Mark can’t know from his count data alone, though, is the impact of fake news. How upsetting is the average fake news story? What’s the distribution of outrage across reactions to fake news? What impact did fake news have on political perceptions and voting decisions? | |
This is the moment where rigor falls apart. Having carefully counted the incidence of fake news stories, Mark and his leaders seem to have applied their own intuition with a hefty dose of bias and self-interest. He jumped to conclusions about what it means and why, likely in the absence of good information about how fake news can matter. And the absence of information gets filled with dirt. | |
Granted, these are difficult questions. But let’s not pretend the answers are unknowable. Research is a large toolbox. They could consult the literature to learn about the impact of rumors and misinformation online, and engage with outside experts. They could find people who have been exposed to a lot of fake news and understand their experiences. They could look through feedback channels that Facebook has at scale for comments about fake news and systematically analyze them. They could survey Facebook users in all kinds of clever ways to understand their potential effect. None of these are interesting if you care just about prevalence, but only if you care also about impact. | |
And look, maybe Facebook has done all these things — if so I hope they’ll come out and share it. I know the researchers who work on News Feed, and they’re brilliant. But I’m reacting to Mark’s quick disavowal, and my knowledge of how Facebook operates. When data-driven becomes data-myopic, we all suffer. I worry that Facebook’s decision-making has lost its humanity. And that’s frightening given the central role it plays in our world.
This is one thing that I’m deeply proud of about how we approach decision-making at Airbnb. Big data and experimentation are a crucial part of how we make decisions, but we try very hard to be data-informed rather than data-driven. We want to do the driving. So we rely on the sacred triumvirate of big data, rigorous multi-method research, and design/product vision. We actively seek to understand what our data means and apply it with our goals and our mission in mind, and with the context of the everyday experiences of guests and hosts.
I trust that the brilliant people at FB are working on this, and that they recognize their role. Even as he dodged responsibility, Zuck admitted as much. For the rest of us, I think there’s a big lesson here: | |
Importance = Prevalence x Experience Impact | |
Common things that are trivial can still, in aggregate, have a dramatic impact on user experiences. So can rare but dramatic things. Ultimately, a single-minded focus on counting things will leave you with only half the picture. Understanding impact usually requires that larger toolkit: full-blooded, multi-method, empathy soaked research. Deep qualitative work focused on the how and the why followed by surveys which help us understand prevalence from a different view. This is the one-two punch we use at Airbnb.","5" | |
"deancasalena","https://medium.com/swlh/3-million-judgements-of-books-by-their-covers-f2b89004c201","17","{""Data Visualization"",""Web Development"",Books}","91","3.83396226415094","3 Million Judgements of Books by their Covers","Last week, my friend Nate Gagnon and I launched Judgey, a browser-based game that gave users the opportunity to literally judge books by their covers. We’re both makers, and Nate is a writer, and I’m technical. So excuse me if I get technical — I promise to reward you with pretty graphs. | |
As it so happened, the internet approved of our goof, as did Goodreads (whose API we used), various public libraries and book stores, Book Riot, Adweek, and a few articles (some yet to surface). We also got a tenuous mention in The Washington Post (points if you can find it, but we’ll take it). | |
Having both seen what kind of traffic a reddit front page could bring (Nate for Forgotify and myself for Netflix’s “Spoil Yourself”) we did the needful technical bolstering to prevent a Reddit Hug of Death. I’d love to tell you about said bolstering, as well as technical aspects of the game’s development, but I’ll save that for a future post. | |
We tracked various datapoints of our first-week 300,000+ visitors using Google Analytics to monitor how many levels they completed and how judgmental they were. We waited to turn on more detailed event tracking until after the reddit spike, because Too Many Event Actions can get your tracking turned off entirely. | |
One last preface: This isn’t a scientific study. The results do not account for how well known a book is (which would influence the rating despite the cover), nor do they account for the fact that Goodreads does not allow ratings under 1 star. Each book’s results certainly had a pattern however, some we found very interesting. | |
1. If you’re wondering what all the black bar spikes are — people seem to strongly prefer rating books in complete or half stars, i.e. 2.5 instead of 2.4 or 2.6. That’s a self-imposed limitation that the game did not dictate.
2. Try to look not just at the overall averages, but at the shape of the graph itself. Ratings all bunched together indicate there was somewhat of a consensus — some real meaning coming through in the data. Ratings all spread out mean people truly didn’t know what to think one way or another.
3. People came out of the gate swinging. The second-worst-rated book cover (beaten only by Justin Bieber: His World) was the first book of the game: Domingo’s Angel. How did Bieber get 4.4 on Goodreads anyway??
4. No matter which book, a few people would slam it with a zero… | |
…or praise it with a 5. | |
5. For most books, judgments made on the cover were worse than the book’s Goodreads rating. The exceptions, I would say, have remarkable covers (and titles).
Fellow data nerds: For which books did you find the results graph interesting? To what do you attribute the shape? | |
Fellow makers: What are you making? | |
Let’s chat on twitter @deancasalena and @nateyg. | |
","3"
"matthewscarroll","https://medium.com/3-to-read/databasic-a-suite-of-data-tools-for-the-beginner-5165651e46a5","4","{Journalism,""Data Visualization"",Data}","90","7.12641509433962","DataBasic: A suite of data tools for the beginner","By Matt Carroll <@MattatMIT> | |
I’m a data geek from way back at The Boston Globe. Naturally enough, I have a deep interest in data, data visualizations, and the tools used to build them. | |
That includes starting to review new data and data viz tools as they become available. This is my first review, and I’m looking at a suite of three tools for data beginners, released under the name DataBasic.io. The basic idea is to introduce “easy-to-use web tools for beginners that introduce concepts of working with data,” as the site explains. | |
The tools are: Word Counter, WTFcsv, and SameDiff. Each aims to solve a particular data problem, and they do their work well. But what’s of particular interest to me is what all three accomplish in a very deep way — they are easy to use. | |
That’s unlike many other data tools, which are geared for hard-core users. Not that there’s anything wrong with complicated tools for hard-core users. Sometimes the tools do complicated data work, and by necessity are fairly complex. I’ve used data tools for a long time, so am not put off (too much) by sharp learning curves and crappy user experiences, as long as the tool does what’s promised. Plus, frankly, I’m paid to put up with the pain, so I work through it. | |
But I also understand how new users can be intimidated and turned off by these types of tools and experiences. That’s really too bad, because there’s so much data available that can help people understand so much more about their community, politics, the environment, even their own finances. The possibilities are limitless. What’s needed are simple tools that can be mastered easily so more people can participate in data surfing. | |
That’s where the DataBasic tools fit in. They are beautiful and simple, easy and intuitive to use, and are great for beginners. Who are they useful for? Anyone who might want to dive into data, but is unsure where to start, including students, community groups, or journalists (that’s me). The tools were tested in classrooms and workshops to make sure they worked well and were easily understandable. The developers get that learning data tools can be a miserable experience. As the web site says: “These tools were born out of frustration with things we were trying to use in our undergraduate classes, so we feel your pain.” | |
They’re even determined to put some fun into data. For training, they include data sets such as song lyrics from Beyonce and a list of survivors from the Titanic.
Here’s a quick rundown, with a more detailed description below: | |
The tools were developed by two people with a deep interest and knowledge of data and data visualizations, Rahul Bhargava, a research scientist at the MIT Media Lab’s Center for Civic Media, and Catherine D’Ignazio, an assistant professor in the Journalism Department at Emerson College in Boston. (BTW, big transparency alert: I’m hopelessly conflicted writing this review, as I’m friends with both Catherine and Rahul. Also, I’m Catherine’s partner in creating a cool photo engagement app for newsrooms called NewsPix, and I work with Rahul in Civic Media.) | |
I’m looking at these tools from the point of view of a beat reporter wondering how to use them to help find stories or dive deeper into material. And while they are designed for beginners, I can see Word Counter and SameDiff gaining traction with experienced reporters, as well. | |
Word Counter takes text and analyzes it in several ways. It creates your basic word cloud, but also does word counts, in several interesting ways. | |
For instance, let’s take Donald Trump’s speech in June when he announced his candidacy for president.
WordCounter breaks its analysis into a few different pieces: a word cloud, Top Words, Bigrams, and Trigrams.
Top Words is a basic word count. In Trump’s speech, the most common words were People (47 mentions), Going (44), Know (42), and Great (34). | |
Bigrams and Trigrams are counts on two- and three-word combos. Top Bigrams: Don’t (39), going to (38), I’m (37) | |
And top Trigrams: “I don’t” and “(We’)re going to”, tied at 11; and “that’s the,” “going to be,” and “in the world,” tied at 10. | |
A nice touch: Each of the lists of the Top Words, Bigrams and Trigrams can be downloaded as a csv. | |
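For readers who want to see what is going on underneath a tool like this, here is a quick sketch in Scala of the same idea (this is not DataBasic’s code): read text on standard input, split it into words, and count the most frequent two- and three-word sequences.
    object NGramSketch extends App {
      val separators = Array(' ', ',', '.', ';', ':', '!', '?', '(', ')', '\n', '\t', '\r')
      val words = scala.io.Source.stdin.getLines().toList
        .flatMap(_.toLowerCase.split(separators))
        .filter(_.nonEmpty)
      // Count n-word sequences and keep the k most frequent.
      def top(n: Int, k: Int = 10): List[(String, Int)] =
        words.sliding(n).toList
          .map(_.mkString(' '.toString))
          .groupBy(identity)
          .map { case (gram, hits) => gram -> hits.size }
          .toList
          .sortBy { case (_, count) => -count }
          .take(k)
      top(2).foreach(println)   // bigrams,  e.g. (going to,38) for the Trump speech
      top(3).foreach(println)   // trigrams, e.g. (going to be,10)
    }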
It’s a nice tool that might help quickly analyze or provide some insight into someone’s speech, talk, or writing. Unfortunately, this tool didn’t help our aspiring data journo come up with a story. | |
Too bad there wasn’t a tool in this suite that would let us take Trump’s speech announcing his candidacy and compare it with Obama’s speech doing the same. | |
Oh wait — there is, and it’s called… | |
I took the speeches Obama and Trump made announcing their candidacy for president and easily imported them as text documents into SameDiff. | |
The first thing SameDiff does is give us a report, which announces: “These two documents are pretty different.” A shocker, huh? Who would think that Obama and Trump might share widely divergent views? (Yes, that’s a joke.) But that’s good. Big differences means that there is a lot of contrast between the two speeches, which gives us something to write about. | |
SameDiff creates three columns: The first and third columns are basically word clouds noting the top words used by each candidate (you can get the exact word count by hovering your cursor over a word.) The middle column lists the words in common. | |
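Under the hood, the idea is mostly set arithmetic on word counts. A rough Scala sketch (again, a toy, not the actual tool) that takes two text files as command-line arguments and prints the words distinctive to each, plus the words they share, might look like this:
    object SameDiffSketch extends App {
      val separators = Array(' ', ',', '.', ';', ':', '!', '?', '\n', '\t', '\r')
      def wordCounts(path: String): Map[String, Int] =
        scala.io.Source.fromFile(path).getLines().toList
          .flatMap(_.toLowerCase.split(separators))
          .filter(_.nonEmpty)
          .groupBy(identity)
          .map { case (word, hits) => word -> hits.size }
      val left  = wordCounts(args(0))   // e.g. Obama’s announcement speech
      val right = wordCounts(args(1))   // e.g. Trump’s announcement speech
      def top(counts: Map[String, Int], k: Int = 15): List[String] =
        counts.toList.sortBy { case (_, count) => -count }.take(k).map { case (word, _) => word }
      println(top(left.filter  { case (word, _) => !right.contains(word) }))  // distinctive to the first text
      println(top(right.filter { case (word, _) => !left.contains(word) }))   // distinctive to the second
      println(top(left.collect { case (word, count) if right.contains(word) => word -> (count + right(word)) }))  // shared words
    }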
It’s interesting to see the differences in what the two men talk about. With Obama, top words include: Health, future, divided, opportunities, family and children. | |
With Trump, some top words: Going, China, Trump, Mexico, stupid, Obamacare. | |
What does this tell us? We can see that Obama in his speech was focused very much on individual needs, including healthcare, and was concerned about divisions in the country. | |
Trump focused more on problems in the international arena (not to mention that he likes talking about himself). And he’s not above throwing around derogatory terms like “stupid.”
The column that shows which words they used in common? Not that illuminating, in this case: Great, country, jobs… Words that you might find in most political speeches given by any random politician. | |
As a reporter, I can’t say I could write a story based on this analysis. But it does help shine a little light on the slant each candidate was pushing, which could help inform whatever story I do write. This tool was helpful. | |
The goal of WTFcsv is to show the data newbie what’s inside a spreadsheet, or a file that can be imported into a spreadsheet. Like the other apps, it is well designed and simple, with a clean interface. | |
It was simple and intuitive to import a file (it can take a csv, xls, or xlsx, which are three common spreadsheet file types). | |
Once a file is imported, it provides basic information about what’s in the file. To test it, I imported a US Census file on college graduation rates at the state level. The state-by-state data shows the education levels of state residents, 25 or older. There are 12 columns of information, which range from state names and populations to education levels, such as “Less than 9th grade,” and “Bachelor’s or higher.” | |
The analysis, provided in a nice card display, broke down each column. For instance, it looked at the percentage of the 25-and-older population with a bachelor’s degree or higher. The WTFcsv histogram broke the info into ranges, showing, for example, that four states had between 18 and 22 percent. (Ahem, and using my own analysis: for Bachelor’s or higher, Massachusetts led all states with 39.4 percent.)
It also provides a “What do I do next?” set of questions that can help prod the beginner. | |
Bottom line: Maybe of use for people who are new to or intimidated by spreadsheets, but that’s about it. It will be of some use for a while, but as data users become more experienced, they will turn to other, more powerful tools, such as pivot tables, to do the same types of analysis.
I like all three tools. All are simple to use. The design and user experience are great. “Intuitive” is the key word. I had no trouble importing files. My favorite was SameDiff, because I can see how useful it would be for even an experienced reporter, but I can see how all three would benefit wanna-be data reporters.
Matt Carroll runs the Future of News initiative at the MIT Media Lab and writes the “3 to read” newsletter, which is a weekly report on trio of stories and trends from across the world of media.","3" | |
"robsimmon","https://medium.com/planet-stories/whats-false-about-true-color-2951ea5a4b5a","19","{Space,Science,""Data Visualization""}","89","5.0811320754717","What’s False About True Color","I’ll start with a quiz: is this picture real—is it what an astronaut would see from space? | |
How about this picture, seemingly from a similar vantage point, but including the far side of the Moon? | |
Or this view of the Pacific Ocean? | |
In a sense, they’re all real, even though they were collected in different ways. The first is a photograph taken by the crew of Apollo 11, on their way to the moon. The second is from the Deep Space Climate Observatory (DSCOVR), from its vantage point at 1,000,000 miles from Earth. The third is a day’s worth of data from NASA’s MODIS instrument, wrapped around a sphere and rendered with a simulated atmosphere. | |
But what makes a realistic picture? After all, our eyes and brain are reconstructing a sharp, full-color, three-dimensional environment from a tiny amount of information. | |
This illustration simulates the information sent to your brain by the retina. Visual acuity is highest in the fovea, which covers an amount of your field of view roughly equivalent to the size of your thumbnail held at arm’s length. Away from that tiny spot, sharpness and color perception fall off.
Our brains infer the appearance of surfaces based on assumptions about the relative positions of lights, objects, and the shadows they cast. For example, look at the checkered pattern above. We can tell square A is black, even though it is brightly lit, and square B is white, even though it is in deep shadow—right? | |
Perhaps not. | |
Both squares are exactly the same color. The eyes and brain adjust perceived color and light based on the surroundings—they are not making precise measurements. | |
In addition to localized adjustments in perception, the visual system also adapts to global differences. For example, a white piece of paper will look white if we view it under orange candlelight or bluish light from an LCD screen. Photographs and other images need to be corrected accordingly—a process called white balancing. The photo above, of the International Space Station and southern South America, was processed to correct for 5 different color temperatures, from 2800K to 9800K. The center strip matches the color temperature of the Sun: 5800K. | |
Satellite images need to be processed to account for these and other features (quirks?) of our vision. This is the raw version of a Planet Labs image, showing fields in the Egyptian desert. | |
I adjusted the white balance of this version of the image to account for the color imparted by the atmosphere, and the brightness to correct for the non-linear response of our eyes—both global corrections. (If you’re curious, I’ve written a description of my workflow.) It’s improved, but lacks sharpness in the fields, and bright areas are washed out. | |
An additional set of local adjustments brings out details, and emphasizes structure in the desert sands. These processed images are more true to how we see, and convey more information, than the quantitative data provided by scientific instruments. | |
Similar processing techniques can help with the interpretation of otherwise abstract data. Compare this nighttime view of Italy taken from the ISS… | |
… with 9 months of city lights data, merged with a map of the Earth’s surface tinted blue to simulate night—a technique that dates back to the silent film era. It’s data visualization that reads like a photograph. | |
Sometimes, a realistic image is too detailed, like this Planet Labs snapshot of Utah’s Canyonlands. Complex topography, a silt-filled river, and unfamiliar lighting (sunlight is coming from the lower right) makes the landscape hard to interpret. | |
This National Park Service map uses abstraction to enhance readability. It’s less realistic—shadows are idealized, texture and color are computer-generated, and river water is a flat blue—but it’s easier to see the relationships between high mesas and deep canyons.
True color views can be limited. This red, green, and blue Landsat 8 image of Alaska shows green and brown boreal vegetation, silt-filled rivers, and spreading smoke from a wildfire. | |
The same area, shown with shortwave infrared, near infrared, and green light, reveals subtleties in the vegetation, and clearly differentiates land from water. The infrared light even penetrates smoke, showing the advancing flames beneath.
We can also look outward into space, instead of inward towards Earth. This is what our eyes would see looking through a telescope—but without further enhancement—at the constellation of Orion. Dozens of stars appear against a black background. | |
But with a long exposure and special filters the dark dust and glowing gases of the Horsehead Nebula appear. | |
Scientific imagery and data visualizations rarely match what we would see with the naked eye, which is limited by our physiology. The best visualizations—even visualizations of the invisible—work within those constraints to reveal hidden truths.","1" | |
"hyonschu","https://medium.com/@hyonschu/big-data-is-dead-all-aboard-the-ai-hype-train-ae89c8d64cc3","6","{""Artificial Intelligence"",""Machine Learning"",""Big Data""}","89","4.99905660377359","Big Data is Dead. All Aboard the AI Hype Train!","It’s 2016, and businesses big and small, far and wide have finally stopped using the term Big Data. | |
The consensus seems to be converging on the idea that data alone doesn’t solve problems. It’s true. You still need to understand, analyze, and test test test data using hypotheses to prove intuitions and make solid decisions. Things that should be happening regardless of the size of your data.
But instead of developing creative uses for the data that we have, we’re all now looking to ‘cognitive computing’ and ‘artificial intelligence’ to save us. Companies like Google, Facebook, Microsoft, and IBM are in an arms race, trying to outsmart and out-engineer each other for better accuracy. Meanwhile, marketing teams with lots of money have entranced us all with the possibility of having computers think for us, tell us what our problems are, and auto-magically fix them and improve our business processes.
And we bought into it, freely throwing around terms like ‘artificial neural networks’ and ‘deep learning,’ reinforcing the belief that computers are becoming more like us. (IT LITERALLY REFERS TO BRAINS!) We’ve become comfortable with terms like ‘The Singularity’ as we, as a society, welcome our new robotic overlords. Indeed, 2016 is shaping up to be the year of AI. | |
In response to the death of Big Data, companies who need to sell more stuff are now telling us that, now that we have this data, what our businesses really need is analysis done by super-fast, omniscient computer brains. Which is a nice idea, but ‘artificial intelligence’ isn’t anywhere close to what most people consider it to be.
Near the end of last year, analysts were proclaiming that 2016 is the year the algorithms make data useful. Gartner made headlines by proclaiming “Data is Dumb. Algorithm is Where the Value Lies.” In its TV ads, IBM seems to allude to the notion that it can help Bob Dylan improve his songwriting. And nearly everyone’s afraid these AI algorithms will eventually destroy the world.
Sanity check: If these algorithms are so smart and therefore valuable, why are Facebook and Google (and scikit-learn) giving away their state-of-the-art algorithms for free? | |
Consider how Google operates. The MapReduce paradigm was so crucial to Google’s core business that its very existence was kept close to the vest. It was a key business driver and led to enormous growth within the company. When Google decided to reveal and give away MapReduce, they were so far ahead of the data parallelization game that they didn’t need it anymore. | |
Following that logic, Google giving away their “AI engine” TensorFlow should mean that Google already has something that is so mind-blowing that it should be able to tell what you’ll have for dinner tonight. | |
Or perhaps the more likely explanation is that Google has no idea how to extract value from it. I mean, other than recognize pictures of cats.* | |
I know it’s a very bold statement to make. But in practice, neither Google nor Facebook have found a way to use their “artificial intelligence” superpowers to improve their core business: getting me to view or click their ads. | |
IBM’s “cognitive computing platform” Watson is in a similar situation. Sure, it did a great job of retrieving facts and winning Jeopardy in 2011, but it quickly faded into relative obscurity. In 2014, IBM put together a $100MM fund to help app development for Watson, and all they seem to have to show for it is 8 featured apps on their home page, none of which I completely understand. Not even a giant pile of money could bring a high-visibility app to Watson. Curious.
In the Bob Dylan ad, Watson claims it can read millions of documents quickly, with the only conclusion being that Dylan’s major themes are that time passes and love fades. Dylan then suggests they write a song together, to which Watson, in its sole stroke of brilliance, dodges the suggestion by saying, “I can sing.”
While it is an exciting research field, AI in its current state is nothing more than just algorithms—math instructions. Algorithms are fast. Algorithms are often elegant. But algorithms are still dumb. Even when they’re “self-correcting,” they still need an immense leveraging of human intelligence and input to do something simple. It is currently nowhere near the levels promised in Wired, TechCrunch, or Gartner reports. | |
If algorithms are truly the key to business success, why do the largest companies, who have spent the most money developing and investing in clever algorithm platforms, just give them away? Why has Yahoo chosen to give away 13.5 terabytes of real consumer data for research and recruiting? Could it be possible that the greatest minds in Silicon Valley are giving away algorithms and data, effectively saying, “We don’t know what to do with this, either!”?
There is an inherent belief ’round these parts that all problems (e.g., world hunger) can be engineered away if you code enough lines, and that intelligence is just a matter of sufficiently-programmed algorithms. Pump enough Big Data™ into this Artificial Intelligence Engine™ and all your problems will be solved by intelligent computers. But I believe intelligence requires much more than just engineering and processing power. Intelligence has overtones of curiosity, problem-solving (and problem-creation), and a touch of insanity, none of which have been replicated in any AI lab.
Maybe we’ll get there some day. But for now and the foreseeable future, the best way to attack your business problems is still done the old fashioned way: creative, smart, and curious people who can ask the right questions and know how to get them answered. Big, dumb algorithms and warehouses of data are useless without them. After all, they are still very much missing the critical portion of the puzzle: actual intelligence. | |
PS. Feel free to ping me on twitter or LinkedIn with your thoughts. | |
* Update: Go. They taught it to play Go.","7" | |
"smfrogers","https://medium.com/google-news-lab/tilegrams-make-your-own-cartogram-hexmaps-with-our-new-tool-df46894eeec1","7","{""Data Visualization"",Maps,""Open Data"",""2016 Election""}","88","3.34433962264151","Tilegrams: Make your own cartogram hexmaps with our new tool.","This US election season, you will see a lot of maps. And mostly, they will look kind of like this: | |
And what’s wrong with that? It is, after all, what the United States looks like. | |
But try and find Rhode Island or Connecticut. Tricky, isn’t it? It’s not just finding the states that’s difficult. It’s getting a sense of how “big” those states actually are. After all, when people think about “big”, they’re often referring to metrics like population rather than actual size. On a map like the one above, New York City — with 8.4M people — is barely more than a pixel, while Washington D.C. has roughly the same population as Wyoming. All of that context is lost.
People are beginning to notice that accurate geography isn’t always the most meaningful way to understand a map. | |
Maps tell stories. And if you’re looking for stories beyond those that just have to do with accurate geography, you have to go further. | |
Welcome to the world of cartograms. | |
A cartogram is a distorted map. Rather than reflect actual geographic boundaries and spaces, the boundaries and spaces are changed to more accurately tell the story of the data it’s showing. They’ve been around since at least the 1960s, often as a tool for academic geographers. In the UK, Danny Dorling and Ben Hennig have turned them into an art form, representing multiple different types of social data in cartogram form, from stretched maps to maps comprised of hexagons. | |
But it’s not just a British phenomenon — US designers have been thinking heavily about how to use cartograms. And this election season, there are a lot of them around: | |
How do you make one for yourself? There are lots of tools out there to create a traditional map, but fewer to make a cartogram. Where would you even begin? | |
We were facing those same frustrations and wanted to make it easier. This is where Adam Florin and Jessica Hamel come in with Tilegrams. The two Pitch Interactive designer-developers have developed a tool that can help you build your cartogram. | |
The tool is designed for serious developers and amateurs (like me) who want to make their own hexmaps. It’s published on github so anyone can take the code and re-use it. It allows you to make both static and interactive maps. | |
The tool comes pre-loaded with a number of key US hexmaps to get you started, including the FiveThirtyEight-style map of the electoral college, the NPR state visual, and Pitch’s hexmap of the US population (where each hexagon represents 500,000 people). | |
If the pre-loaded maps aren’t what you need, then choose a dataset and see it visualised by clicking ‘Generate cartogram from Dataset’. You can also upload and paste in your own dataset in CSV format. | |
The ‘resolution’ slider changes what each hexagon represents. And if you don’t like the way the map looks, you can drag the hexagons around to create your own map style. | |
There are two export options: SVG, to produce a static image that can be edited afterwards; and TopoJSON, for the interactive version.
You can read more about how to use it in this blogpost from the Pitch team. | |
We can’t wait to see what you do with it. Time to see the world differently. | |
Simon Rogers is data editor on the Google News Lab and director of the Data Journalism Awards.","1" | |
"Wolle","https://medium.com/baqend-blog/real-time-stream-processors-a-survey-and-decision-guidance-6d248f692056","12","{""Big Data"",NoSQL,""Distributed Systems"",""Data Science"",""Stream Processing""}","86","24.3349056603774","Scalable Stream Processing: A Survey of Storm, Samza, Spark and Flink","","6" | |
"kvnwbbr","https://medium.com/redelastic/diving-into-akka-streams-2770b3aeabb0","7","{""Big Data"",Scala,Akka,Programming}","86","9.21981132075472","Diving into Akka Streams","Streaming is the ultimate game changer for data-intensive systems. In a world where every second counts, the batch jobs of yesteryear with overnight processing times measured in hours are nothing but an expense. Per-event streaming puts the emphasis on recent data that can be actioned right now. | |
We’ve lost our way. We’ve become enamoured with the size of data rather than the value of data. Big data. Huge data. Massive data. Too much data! We’ve become hoarders, hiding away data that we don’t need in long forgotten dusty corners to process that data with some magical algorithm at an unknown time in the future. | |
Our big, huge, massive data will make us rich one day! All we need to do is lease petabytes of storage, choose an enterprise big data platform, staff up a big data team, hire security experts to ensure all of our data is secure (so we conform to the many industry standards around sensitive data such as HIPAA), and… you get the idea. Or we can come to the collective realization that data is inventory. If data is not generating revenue today it’s an expense, not an asset. | |
Fast data is what matters most. Let the others hoard. We want to build real-time systems. Data that is recent is relevant. Data that is stale is history. Historical data is good for generating insights that may influence long-term strategic thinking — or perhaps cranking out a few infographics — but when it comes to the tactical behaviour of running a business, immediacy is the primary concern. We don’t want to restock inventory or decline a credit card transaction based on what happened yesterday (or last month or last year); we want to make critical decisions based on what’s happening right now. Yes, we must use a certain amount of historical data for business context — for instance, we may decline a credit card application based on a previous bankruptcy. But the ultimate decision we make in any given moment must primarily be influenced by events happening now.
Thinking in events enables us to rethink our systems in a purely reactive way — ingesting events, issuing commands which emit new events. We can feed those events back into our system, or emit them for further downstream processing. Forward thinking organizations are using event-driven architectures like this to transform batch-mode systems into real-time systems, reducing latencies from hours to seconds. | |
While the rest of the article expects some familiarity with Akka, being an Akka expert isn’t required to explore Akka Streams. In fact, the Akka Streams API provides a higher level of abstraction than Akka. It makes Akka Streams well suited to build event-driven systems using a very simple dialect. | |
Akka Streams is a toolkit for per-event processing of streams. It’s an implementation of the Reactive Streams specification, which promises interoperability between a variety of compliant streaming libraries. | |
If you’re new to the world of stream processing, I recommend reading the first part of this series, A Journey into Reactive Streams, before continuing. The rest of this article assumes some familiarity with the content outlined in that post, as well as a high-level understanding of Akka. | |
Let’s start by describing a simple linear data flow. A linear flow is a chain of processing steps each with only one input and one output, connected to a data source and a data sink. In this example we’ll use Akka Streams to ingest a CSV file which contains records of all flight data in the US for a single year, process the flight data, and emit an ordered list of average flight delays per carrier in a single year. | |
We start with an Outlet: that’s our Source. A source of data can be a queue (such as Kafka), a file, a database, an API endpoint, and so on. In this example, we’re reading in a CSV file and emitting each line of the file as a String type.
We connect the Outlet to a FlowShape — a processing step that has exactly one input and one output. This FlowShape ingests String types and emits FlightEvent types — a custom data type that represents a row of data in the CSV file. The FlowShape uses a higher-order function called csvToFlightEvent to convert each String to a FlightEvent. | |
Let’s explore the source code below. Akka Streams supports both Scala and Java — our examples will be in Scala. | |
First, we define the overall blueprint, called a graph, using the GraphDSL, and assign it to a value, g. | |
Blueprints can be created, assigned to values, shared, and composed to make larger graphs. | |
Next we need both an ActorSystem and an ActorMaterializer in scope to materialize the graph. Materialization is a step that provisions Akka Actors under the hood to do the actual work for us. We kick off materialization by calling run() on the graph, which provisions actors, attaches to the source of data, and begins the data flow. At runtime the materializer walks the graph, looks at how it’s defined, and creates actors for all of the steps. If it sees opportunity for optimization, it can put multiple operations into one actor or create multiple actors for parallelization. The key concept is that the hardest work is done for you — all of this optimization would need to be done manually if we dropped down from Akka Streams to Akka Actors. | |
Digging deeper into the blueprint, we notice a third step, an Inlet. This is a generalized type that defines our Sink. There are a number of different Sink types which we will explore in more detail shortly. A runnable graph must have all flows connected to an inlet and an outlet, so for this initial example we’ll use an ignore sink to wire everything together but not actually do anything with the results of our first flow step. The ignore sink will act as a black hole until we decide what to do with the events it consumes. | |
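The full GraphDSL version lives in the GitHub repository linked at the end of this article; as a rough stand-in, here is a compressed sketch of the same linear flow written against the fluent API, reading CSV lines from standard input. The column positions, the field names, and the delay parsing are assumptions made for the sketch, not the real layout of the flight data file.
    import akka.actor.ActorSystem
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.{Framing, Sink, StreamConverters}
    import akka.util.ByteString
    object LinearFlowSketch extends App {
      implicit val system       = ActorSystem()
      implicit val materializer = ActorMaterializer()
      final case class FlightEvent(carrier: String, delayMinutes: Int)
      // The higher-order conversion step: one CSV line in, one FlightEvent out.
      val csvToFlightEvent: String => FlightEvent = { line =>
        val cols = line.split(',')                                      // column layout is assumed
        FlightEvent(cols(8), scala.util.Try(cols(14).trim.toInt).getOrElse(0))
      }
      StreamConverters.fromInputStream(() => System.in)                 // the Source (Outlet)
        .via(Framing.delimiter(ByteString(Array('\n'.toByte)), 8192, allowTruncation = true))
        .map(_.utf8String)                                              // bytes to Strings
        .drop(1)                                                        // skip the CSV header row
        .map(csvToFlightEvent)                                          // the FlowShape step
        .runWith(Sink.ignore)                                           // the Inlet: an ignore sink, for now
    }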
Let’s pause. If we created a system like this before Akka Streams, we would have needed to create an actor for each step in the flow and manually wired them together for message passing. That adds a significant amount of cognitive overhead for the relatively simple task of localized data flow processing. | |
You’ll perhaps also notice another significant difference between Streams and Actors — types. We can specify the types of each step in our flow rather than relying on pattern matching as we would do with Akka Actors. | |
There are four main building blocks of an Akka Streams application: a Source (exactly one output), a Sink (exactly one input), a Flow (exactly one input and one output), and a RunnableGraph (a flow with both ends attached, ready to be materialized and run).
That’s a good starting point, but without anything else we could only create linear flows — flows made up entirely of steps with single inputs and single outputs. | |
Linear flows are nice… but a bit boring. At a certain point we’ll want to split apart streams using fan-out functions and join them back together using fan-in functions. We’ll explore a few of the more useful fan-in and fan-out functions below. | |
Fan out operations give us the ability to split a stream into substreams. | |
Broadcast ingests events from one input and emits duplicated events across more than one output stream. An example usage for a broadcast would be to create a side-channel, with events from one stream being persisted to a data store, while the duplicated events are sent to another graph for further computation. | |
Balance signals one of its output ports for any given signal, but not both. According to the documentation, “each upstream element is emitted to the first available downstream consumer”. This means events are not distributed in a deterministic fashion, with one output being signalled and then the other, but rather whichever downstream subscriber happens to be available. This is a very useful fan-out function for high-throughput streams, enabling graphs to be split apart and multiple instances of downstream subscribers replicated to handle the volume. | |
Fan in operations give us the ability to join multiple streams into a single output stream. | |
Merge picks a signal randomly from one of many inputs and emits the event downstream. Variants are available such as MergePreferred, which favours a particular inlet if events are available to consume on multiple incoming streams, and MergeSorted, which assumes the incoming streams are sorted and therefore ensures the outbound stream is also sorted. | |
Concat is similar to Merge, except it works with exactly two inputs rather than many inputs. | |
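As a small, self-contained illustration of both kinds of junction (a sketch in the style of the Akka documentation examples, not code from the original project), the graph below broadcasts each number to two branches and then merges the branches back into a single printing sink.
    import akka.NotUsed
    import akka.actor.ActorSystem
    import akka.stream.{ActorMaterializer, ClosedShape}
    import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
    object FanOutFanInSketch extends App {
      implicit val system       = ActorSystem()
      implicit val materializer = ActorMaterializer()
      val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
        import GraphDSL.Implicits._
        val in     = Source(1 to 10)
        val bcast  = builder.add(Broadcast[Int](2))   // fan-out: one input, two duplicated outputs
        val merge  = builder.add(Merge[Int](2))       // fan-in: two inputs, one output
        val double = Flow[Int].map(_ * 2)
        val inc    = Flow[Int].map(_ + 1)
        val out    = Sink.foreach[Int](println)
        in ~> bcast ~> double ~> merge ~> out
              bcast ~> inc    ~> merge
        ClosedShape
      })
      graph.run()
    }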
The separation of blueprints from the underlying runtime resources required to execute all of the steps is called materialization. Materialization enables developers to separate the what from the how; developers focus on the higher-level task of defining blueprints, while the materializer provisions actors under the hood to turn those blueprints into runnable code. | |
The materializer can be further configured, with an error handling strategy for example. | |
Here we define a decider that ignores arithmetic exceptions but stops if any other exception occurs. This is because division-by-zero is expected and doesn’t prevent successful completion: the offending element is simply dropped, and the result will be a Future completed with Success(228). A sketch of this example follows.
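The sketch below is modelled on the supervision example in the Akka Streams documentation and assumes an ActorSystem of its own; the decider drops elements that throw ArithmeticException and stops the stream on anything else.
    import akka.actor.ActorSystem
    import akka.stream.{ActorMaterializer, ActorMaterializerSettings, Supervision}
    import akka.stream.scaladsl.{Sink, Source}
    object SupervisionSketch extends App {
      implicit val system = ActorSystem()
      val decider: Supervision.Decider = {
        case _: ArithmeticException => Supervision.Resume   // expected failure: drop the element
        case _                      => Supervision.Stop     // anything else: fail the stream
      }
      implicit val materializer = ActorMaterializer(
        ActorMaterializerSettings(system).withSupervisionStrategy(decider))
      val result = Source(0 to 5).map(100 / _).runWith(Sink.fold(0)(_ + _))
      // 100 / 0 throws and is dropped, so the Future completes with Success(228)
      result.foreach(println)(system.dispatcher)
    }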
Akka enables asynchrony — different steps within the flow can be parallelized because they aren’t tied to a single thread. Using Akka Actors enables local message passing, remote message passing, and clustering of actors, providing a powerful arsenal for distributed computing. Akka Streams provides a subset of this flexibility, limiting distribution to threads and thread pools. In the future it’s entirely possible that Akka Streams will support distribution of steps remotely over the network, but in the meantime multiple Akka Stream applications can be connected together manually and chained for more complex, distributed stream processing. The advantage of using Streams over Actors is how much complexity Streams removes for applications that don’t require advanced routing and distribution over network connections. | |
Our solution has come together. We’ve chained together processing steps using the GraphDSL that removes the complexity of manually passing messages between Akka Actors. Akka Streams also handles much of the complexity of timeouts, failure, backpressure, and so forth, freeing us to think about the bigger picture of how events flow through our systems. | |
We perform the following computation on the CSV file: parse each line into a FlightEvent, group the events by carrier, compute the average delay for each carrier, and sort the carriers into an ordered list.
We’ve crunched through a 600MB CSV file in ~1 minute and output meaningful results with ~100 lines of code. This program could be improved to do a number of things, such as stream raw events to a dashboard, emit events to a database (with backpressure!), feed aggregate data into Kafka for processing by Spark — the sky is the limit. | |
Full source code can be obtained here: https://github.com/rocketpages/flight_delay_akka_streams
Flight data can be obtained here: http://stat-computing.org/dataexpo/2009/the-data.html | |
In a future post we will demonstrate the big-picture architectural possibilities of Akka Streams and use it as glue between inbound data (API endpoints, legacy back-end systems, and so forth) and distributed computation using Spark. Lightbend calls this architecture fast data, which enables a whole new type of creativity when working with the volume of data that flows through a modern organization. It elevates data to the heart of the organization, enabling us to build systems that deliver real-time actionable insights.
If you’re interested in a deeper conceptual overview of fast data platforms, I invite you to read a whitepaper, Fast Data: Big Data Evolved, by Dean Wampler, member of the OCTO team and “Fast Data Wrangler” at Lightbend. And if you’re as excited as I am about transforming batch jobs into real-time systems, stay tuned for the next post in this series! | |
Kevin Webber is CEO of RedElastic, a boutique consulting firm that helps large organizations transition from heritage web applications to real-time distributed systems that embrace the principles of reactive programming. | |
He was formerly Enterprise Architect and Developer Advocate at Lightbend. In his spare time he organizes ReactiveTO and Programming Book Club Toronto. He rarely writes about himself in the third-person, but this is one of those moments.","2" | |
"smfrogers","https://medium.com/google-news-lab/visualizing-the-rhythm-of-food-searches-390c866a740","4","{""Data Visualization"",""Big Data"",Data}","85","3.11132075471698","Visualizing the rhythm of food searches","One of the striking things about working with Google Trends data is the insight it gives us into search as a daily utility. When it comes to healthcare, schools, music, or travel, how we search reflects what’s going on in our lives. And how we search for food is particularly revealing. | |
Food searches can tell us about traditions, culture, immigration and fashion. With data going back to 2004, we can see how those foodie fads have changed over time. This is where the Rhythm of Food comes in. | |
This interactive data explorer is built by acclaimed designer Moritz Stefaner and his team at Truth & Beauty, using Google Trends data. It’s also the second in the Google News Lab’s series of visual experiments, with the first being a project with Alberto Cairo and the world’s best designers to develop innovative newsroom interactive visualisations. | |
Here are some of the things you can learn on the site: | |
A key part of these data visualisations is that they find a new way to show data. And, in order to investigate seasonal patterns in food searches, the design team developed a new type of radial “year clock” chart to reveal seasonal food trends for individual vegetables, fruit, confectioneries, dishes, and drinks. | |
Each segment of the chart indicates the search interest in one of the weeks of the past 12 years, with its distance from the center showing the relative search interest, and the color indicating the year. This allows the user to spot rhythms that repeat on a yearly basis (such as natural seasons, peaks at holidays, etc.) as well as to see year-over-year trends (such as the rise of avocado or the collapse of interest in energy drinks).
How did Moritz choose such a distinctive design? As part of the initial data exploration, he plotted the data as area or line charts. Immediately you can see the strong repeating patterns. |
Stefaner says: “So I became curious”. He then made these overlaid line charts, where each year has a line of its own, and which repeat every twelve months. | |
He then proceeded to code the custom radial version “which you don’t get in standard software libraries”. | |
The aim of the News Lab data visualisation project is to provide not only inspiration but examples too. This is a living dynamic project — right now, the site comprises 195 topics and presents 130,048 individual data points — and we will be adding more over time. | |
Says Stefaner: | |
“Looking at shifting interest towards individual ingredients, dishes, and recipes over the years has been fascinating. More than 12 years of weekly Google Trends data supplied us with a rich dataset to explore food trends over the years. But the most interesting revelations happened when we looked at the seasonal rhythm of food in our radial charts. We immediately saw how each vegetable, fruit, dish or drink had its own signature seasonality pattern — some are tied to natural seasons, some to special holidays, some are popular all year long.” |
Simon Rogers is data editor at the Google News Lab and is also director of the Data Journalism Awards.","2" | |
"zenthoughts","https://medium.com/axiomzenteam/the-core-of-machine-learning-5319a57f2941","6","{""Machine Learning"",""Artificial Intelligence"",AI,Technology,""Data Science""}","81","8.88962264150943","An Honest Guide to Machine Learning : Part Two","If you joined us last week for our introduction to Machine Learning Without Math, you’ll remember that our next step on the road to expertise was going to be Natural Language Processing. We’re still walking in that direction, but based on some feedback we received, we realized there are a few more steps we have to take before you’ll be able to fully appreciate NLP’s intricacies. That’s why this week, we’re diving a little deeper into machine learning. | |
The first thing we’re going to talk about is the basic organizational structure of machine learning (ML). ML is organized into buckets called tasks. Tasks are divided into three parts: an input, a model, and a desired output. That task then has two phases: training and using (decoding). Remember, it’s called machine learning for a reason — before a task can be used to actually solve a problem, it has to be trained. That’s referred to as training the model. Once the model is implemented, it’s ready to use for decoding. | |
When you begin to train a model, you start with input (a set of data points). Your input is related to the problem you want to solve; for every input data point, you have a list of features. Features are informative measures about each data point inside the input. For example, the problem you want to solve might be identifying whether an animal is a cat or a dog. In that case, pictures of cats and dogs would be your input. For a cat, you would program a list of features that would include whiskers, triangle ears, and a small pink nose. There are two kinds of features: binary features (Is the animal grey, yes or no?) and categorical features (What is the colour of the animal, grey, tan, brown, or black?). Sometimes features are too complicated to fit either kind: for instance, features that are values, such as the weight of the animal. Values can be broken into groups, which then become categorical (What is the weight of the animal, 0–2 pounds, 2–4 pounds, or 4–6 pounds?); and categorical features can be further broken down into binary features (Is the weight more than 4 pounds, yes or no?). The more features you have, the harder it is to train the model, and the more data you need. |
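To make the feature types concrete, here is a small sketch in Python with pandas (the animal data and column names are invented for illustration): a binary feature, a categorical feature turned into one-hot columns, and a numeric value binned into categories.
import pandas as pd

# Invented example data: one row per animal.
animals = pd.DataFrame({
    "is_grey": [1, 0, 1],                  # binary feature: yes/no
    "colour":  ["grey", "tan", "brown"],   # categorical feature
    "weight":  [3.2, 7.5, 5.1],            # a raw value, too fine-grained on its own
})

# Turn the categorical colour into one binary column per colour.
encoded = pd.get_dummies(animals, columns=["colour"])

# Bin the weight into categories; those categories could be one-hot encoded too.
encoded["weight_band"] = pd.cut(animals["weight"],
                                bins=[0, 2, 4, 6, 8],
                                labels=["0-2", "2-4", "4-6", "6-8"])
print(encoded)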
Machine learning tasks are often classified based on output type. We’ve broken it down into six of the most common output types, plus a final bucket to capture all of the myriad smaller potential outputs. One of the interesting challenges in machine learning is that there is no “accepted terminology.” That can make it very difficult to, say, find a paper where someone is working on the same avenue as you are, because other projects will be using different terms. | |
1. Classification. This is the most common type of machine learning output. In this instance, your output is a set of predefined labels. The previous example of cat vs dog identification is a perfect example: if you have access to a large number of labelled photos of each, you would probably aim for classification output. Two classes (cat vs dog) is binary classification, and after that goes up based on the number of classes: 3 class, 4 class, etc.. Example Project: Gmail uses this system to detect whether or not an email is spam. The input is a list of keywords or phrases plus some info from the header of your email, and the output is spam or not (ham). | |
2. Clustering. Let’s say you have a bunch of photos of animals, but they aren’t labelled — you don’t know what kinds of animals they are. You want to categorize them, but you can’t use classification because you don’t have the right input labels. In this case, you would use clustering to find the similarities and group those images. Sometimes, you might not even know how many possible outputs there might be — perhaps you have photos from a trip to the zoo, but you don’t know how many different animals the zoo keeps. That’s called hierarchical clustering. Example Project: Medical machine learning uses clustering to identify mutations. Their input is gene sequences, and their output is types of mutations. | |
3. Regression. This works similarly to classification, but instead of categories (labels), you have numbers. Based on input, you have to predict one number at the end. A famous example of regression is trying to predict how much a house will sell for. The input is features for previously sold houses, such as number of rooms and square footage; your output is a single number, the price the house will sell for. Example Project: This is used today to predict stock prices. The input is recent news and tweets about a company, and the output is a prediction of what the stock will be worth. | |
4. Dimensional reduction. Having enough data is always a tricky part of machine learning. If you don’t have enough data, you have to reduce your features somehow. Dimensional reduction lets you map the features you have into smaller groups, or select only the features there is enough data for. Sometimes the model will try to find a better combination of previous features; for example, it’ll combine three features into one category. The famous technique for this is called PCA — principal component analysis. Example Project: You have thousands of sensors to detect minerals in the soil; it’s too large a sample size, so you narrow it down to only a few sensors. |
5. Anomaly detection. If you have the input, and you want to understand which part of the input is not in harmony with the others, you use anomaly detection. The output in this case is always an isolated part of the input. Example Project: Credit card companies use this form of machine learning to detect strange purchases. For instance, if a client makes six purchases in a row from the same gas station, machine learning flags that as an anomaly in normal purchase patterns, where generally only one purchase is made per store. | |
6. Association Rule Learning/Detecting. For this output type, you have a list of features and you want to find out how they are connected to each other. Your output is what associations each of those items have. Project Example: Walmart and other grocery chains use this to analyze shopping carts. Thanks to machine learning they can tell that most people who buy tortilla chips also buy dip, so they know to put those items close together on the shelves. | |
7. There are an almost unlimited number of potential outputs. While those six are the ones most commonly used, we didn’t want to imply that there aren’t others out there that get the job done too! For instance, collaborative filtering and density estimation are both fairly popular. You could fall into a Wikipedia hole trying to learn about them all. A short code sketch illustrating the six common output types follows below. |
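To make those output types tangible, here is a single hedged scikit-learn sketch, with all data invented; none of this is the production systems named above (Gmail, Walmart, card issuers), just one plausible tool per output type.
import numpy as np
from collections import Counter
from itertools import combinations
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 1. Classification: spam vs. ham from keyword counts (predefined labels).
emails = ["win money now", "cheap prize win", "meeting at noon", "lunch with the team"]
labels = ["spam", "spam", "ham", "ham"]
vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(emails), labels)
print(clf.predict(vec.transform(["win a cheap prize"])))           # -> ['spam']

# 2. Clustering: group unlabelled points without ever seeing a label.
blobs = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(blobs)[:5])

# 3. Regression: predict a number (price) from house features [rooms, sq ft].
houses = [[2, 800], [3, 1200], [4, 1800], [5, 2400]]
prices = [150_000, 220_000, 330_000, 430_000]
print(LinearRegression().fit(houses, prices).predict([[3, 1400]]))  # one predicted price

# 4. Dimensional reduction: squeeze 50 redundant sensor readings into 3 components.
sensors = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 50))
print(PCA(n_components=3).fit_transform(sensors).shape)             # (100, 3)

# 5. Anomaly detection: isolate the purchases that don't fit the usual pattern.
purchases = np.concatenate([rng.normal(40, 10, 200), [400, 5]]).reshape(-1, 1)
flags = IsolationForest(contamination=0.02, random_state=0).fit_predict(purchases)
print(purchases[flags == -1].ravel())                               # the odd ones out

# 6. Association rules: how often do chips and dip end up in the same cart?
carts = [{"chips", "dip", "soda"}, {"chips", "dip"}, {"bread", "milk"}, {"chips", "salsa"}]
pairs = Counter(p for c in carts for p in combinations(sorted(c), 2))
print(pairs[("chips", "dip")] / sum("chips" in c for c in carts))   # confidence of chips -> dip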
In contrast to the epic number of potential outputs, there are only two large categories of model: generative and discriminative. | |
Generative models, as per their name, generate output. Their goal is to model the world. For example, if you feed it images of cats and dogs and teach it to distinguish between them, it will eventually learn to generate its own picture of a cat or a dog. |
Discriminative models don’t try to generate, but rather to discriminate between two (or multiple) things. They aren’t as strong, but they’re much easier to train. A discriminative model wouldn’t be able to create its own dog, but it can learn to tell the difference between dogs and cats. If you don’t have enough data, discriminative models are the ones to choose. |
Based on the input and output, you should have a mathematical formula you want to optimize. The model does that by moving from input to output. Say your input is pixels: you want to change that input to an output which is either cat or dog. This is where the math comes in, but don’t worry — the model handles the math for you. Each model has a formulation, which you can imagine as having knobs and control factors. Training a model is just setting and tuning those factors. Imagine you have a microphone (input) and a speaker (output). Your model is the amp: you know it will increase your voice, but you have to tune the bass and treble to get the sound you want. The “bass” and “treble” are called parameters. When you train a model you optimize those parameters — which is what we’re talking about when we say machine learning is looking for the optimum. And just like you don’t need to know how an amp works to use it, you often don’t need to know exactly how the formulation works — you just need to know how to tune those knobs and controls. | |
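To make “tuning the knobs” concrete, here is a minimal sketch of training a single parameter by gradient descent on invented data; the tiny model is y ≈ w·x, and training just nudges w until the error stops shrinking.
# Invented input/output pairs that roughly follow y = 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.9, 9.2, 11.8]

w = 0.0              # the single "knob" of this tiny model
learning_rate = 0.01

for step in range(200):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad   # turn the knob a little in the right direction

print(w)  # ends up close to 3, the slope hidden in the data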
So, what should this formulation be? You only have two model types, but there are many formulations you can use within that binary. Some popular formulations include Support Vector Machines (SVM), Decision Trees, Conditional Random Fields, Neural Networks (this is where we get deep learning, which you’ve no doubt heard of. DL is neural networks rebranded), and Log Linear Models, but these are only some of many. A machine learning engineer will take a problem, categorize it based on its features, and select the right model to optimize the parameters. They will then define a success metric. That ensures the model isn’t simply memorizing the data, and that it’s generalized enough to predict a future outcome without being surprised. |
Now that you understand the different kinds of output, and you understand the models through which those outputs are generated, there’s one last thing to go over: input. There are several different kinds of input, which can affect both your model and your output. | |
If you have input data, and you have labels for that data, the process by which you train the algorithm is called supervised learning. Humans had to put the labels on that data, which is a time-consuming process, and that labelled data is then fed to the algorithm as a training set. Classification is a good example of an output which can be supervised, though it can also be semi-supervised (see below). |
If you don’t have any labels for your data, you can’t necessarily feed the algorithm in the same way. This is called unsupervised learning, because the machine will have to generate its own labels. Clustering, which we described in detail above, is always unsupervised. | |
There are times when you will have some labelled data, but a limited amount, meaning the training set will begin to train by example, but continue from there unsupervised. This can happen with classification, and is called semi-supervised learning. | |
Finally, there’s reinforcement learning. In this instance you don’t have labels, but you do have a constant influx of new information — new labels. Based on predictions, the environment creates a reward system that encourages a certain kind of behaviour, such as getting the labels right. | |
It used to be that machine learning could handle only one task at once, but computational power is improving to the point where people are now moving from learning only one task to learning multiple ones, and to more complicated models. This can help improve generalization, and is certainly the direction of the future. |
If you hire a machine learning engineer, their job is to break your problem into one or more of these tasks and then select the right models. They have to find the right training data, define the evaluation metric, and define the test set. Then for each task, based on the type of input and output you have, they select a model, and then train it to get the best evaluation. After the training is done, the model is put into the decoding (in-action) stage, where it will continue to run as long as you keep feeding it input. |
Congratulations! If you made it through, you officially understand (or at least have the tools to understand!) how machine learning works. | |
Join us next week for our deep dive (for real this time) into natural language processing. | |
— — | |
written by Ramtin Seraj and Wren Handman for Leviathan.ai | |
Leviathan is a weekly newsletter and blog series, bringing you the most interesting pieces of artificial intelligence and machine learning news.","2" | |
"quincylarson","https://medium.com/free-code-camp/code-briefing-confessions-of-an-insecure-designer-b8f8fa6b8580","2","{Design,""Data Science"",""Web Development"",Tech,Startup}","81","0.726729559748428","Code Briefing: Confessions of an Insecure Designer","Here are three stories we published this week that are worth your time: | |
Bonus: Celebrate the Open Data movement with this awesome T-shirt. We just launched it today (in fitted women’s sizes, too) in our shop. | |
Happy coding, | |
Quincy Larson, teacher at Free Code Camp","3" | |
"JeffGlueck","https://medium.com/foursquare-direct/millennials-are-spending-more-time-at-an-unexpected-location-theme-parks-b30d2fb727e5","5","{Foursquare,Millennials,""Theme Parks"",""Harry Potter"",""Big Data""}","81","6.86729559748428","Millennials Are Spending More Time at an Unexpected Location: Theme Parks","","3" | |
"akelleh","https://medium.com/@akelleh/causal-data-science-721ed63a4027","0","{""Data Science"",Data,Causality,""Causal Inference""}","80","1.04905660377358","Causal Data Science","I started a series of posts aimed at helping people learn about causality in data science (and science in general), and wanted to compile them all together here in a living index. This list will grow as I post more: | |
The goal of this post is to develop a basic understanding of the intuition behind causal graphs. It’s aimed at a general audience, and by the end of it, you should be able to intuitively understand causal diagrams, and reason about ways that the picture might be incomplete. |
2. Understanding Bias: A Prerequisite For Trustworthy Results | |
This post aims at a general audience. The goal is to understand what bias is, where it comes from, and how drawing a causal diagram can help you reason about bias. | |
3. Speed vs. Accuracy: When Is Correlation Enough? When Do You Need Causation? | |
The goal of this article is to understand some common errors in data analysis, and to motivate a balance of data resources to fast (correlative) and slow (causal) insights. | |
4. A Technical Primer on Causality | |
This is a very technical introduction to the material from the previous posts, aimed at practitioners with a background in regression analysis and probability. | |
5. The Data Processing Inequality | |
In order to understand observational, graphical causal inference, you need to understand “conditional independence testing”. CIT can be sensitive to how you encode your data, and it’s a problem that is sometimes swept under the rug. This article brings it into the spotlight, and is a precursor to our discussion on causal inference! |
6. Causal graph inference from observational data! (coming soon!)","5" | |
"smfrogers","https://medium.com/google-news-lab/what-is-google-trends-data-and-what-does-it-mean-b48f07342ee8","6","{Google,""Google Trends"",""Data Journalism"",Journalism,""Data Science""}","80","5.85566037735849","What is Google Trends data — and what does it mean?","A little more than a year ago, we made Google Trends data available in real time; and increasingly, it’s helping people around the world explore the global reaction to major events. | |
The vast amount of searches — trillions take place every year — makes Google Trends one of the world’s largest real time datasets. Examining what people search for provides a unique perspective on what they are currently interested in and curious about. |
So when a big news story happens, how can you best interpret this data? | |
Trends data is an unbiased sample of our Google search data. It’s anonymized (no one is personally identified), categorized (determining the topic for a search query) and aggregated (grouped together). This allows us to measure interest in a particular topic across search, from around the globe, right down to city-level geography. | |
You can do it, too — the free data explorer on Google Trends allows you to search for a particular topic on Google or a specific set of search terms. Use the tool and you can see search interest in a topic or search term over time, where it’s most-searched, or what else people search for in connection with it. | |
There are two ways to filter the Trends data: real time and non-real time. Real time is a random sample of searches from the last seven days, while non-real time is another random sample of the full Google dataset that can go back anywhere from 2004 to ~36 hours ago. The charts will show you either one or the other, but not both together, because these are two separate random samples. We take a sample of the trillions of Google searches, because it would otherwise be too large to process quickly. By sampling our data, we can look at a dataset representative of all Google searches, while finding insights that can be processed within minutes of an event happening in the real world. | |
It’s a unique and powerful dataset, which can complement others, like demographic data from the census, as shown here in the Washington Post. As a sample, it gives us a way to analyse what people are searching for in real time as events unfold. But combining data can be tricky — for instance, it doesn’t make sense to compare Google Trends to other Google datasets, which are measured in different ways. For example, AdWords is meant for insights into monthly and average search volumes, specifically for advertisers, while Google Trends is designed to dig further into more granular data in real time. | |
Google Trends is a powerful tool for storytelling because it can allow us to explore the magnitude of different moments and how people react to those moments. We can look back and compare different terms against each other, like how different sports have ranked since 2004. We also can take the total searches for an event to help understand its sheer magnitude. When we released our 2015 Year in Search, we found there were astoundingly over 439 million searches on Google when Adele came back with ‘Hello’. | |
What’s most useful for storytelling is our normalized Trends data. This means that when we look at search interest over time for a topic, we’re looking at that interest as a proportion of all searches on all topics on Google at that time and location. When we look at regional search interest for a topic, we’re looking at the search interest for that topic in a given region as a proportion of all searches on all topics on Google in that same place and time. | |
For instance, if we look at the Trends around Bernie Sanders, we can see that Vermont has the highest search interest in the current senator. This is because of all states, Vermont has the highest percentage of searches for Sanders out of all searches in that state. If we had looked at raw data rather than normalized values, we would’ve seen larger states with higher populations rise to the top of the ranks. | |
That normalization is really important: the number of people searching on Google changes constantly — in 2004 search volume was much smaller than it is today, so raw search numbers wouldn’t give you any way to compare searches then and now. By normalizing our data, we can make deeper insights: comparing different dates, different countries or different cities. | |
The context of our numbers also matters. We index our data to 100, where 100 is the maximum search interest for the time and location selected. That means that if we look at search interest in the 2016 elections since the start of 2012, we’ll see that March 2016 had the highest search interest, with a value of 100. | |
If we look at search interest in only March 2016, though, we can see that March 16 has the highest search interest, because we’ve re-indexed our values for just that month. | |
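As a hedged illustration of the normalization and indexing described above (made-up numbers, not real Trends data): divide each period’s topic searches by the total searches for that time and place, then rescale so the busiest period becomes 100.
# Invented weekly counts for one topic and for all searches in the same region.
topic_searches = [120, 300, 180, 90]
total_searches = [10_000, 12_000, 11_000, 9_500]

# Step 1: normalize, so interest is a share of all searches at that time and place.
shares = [t / total for t, total in zip(topic_searches, total_searches)]

# Step 2: index to 100, so the peak period for the topic becomes 100.
peak = max(shares)
trend_index = [round(100 * s / peak) for s in shares]
print(trend_index)  # [48, 100, 65, 38] for this toy data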
Because Google Trends data is presented as an index, we often get the question: “how important is this?” | |
There are a few ways to assess this. The first is understanding relative search interest in the topic compared to itself — or what we would call a “spike”. | |
As the results came in for the recent EU referendum, Google Trends showed what people were inherently curious about. Search interest in the BBC’s David Dimbleby’s tie spiked, and people searching for “getting an Irish passport” also surged by 100%. Understanding the percent increase in a search topic can be a useful way to understand how much rise in interest there is in a topic. This percent increase is based on a topic’s growth in search interest over a distinct period of time compared to the previous period. | |
Those “spikes” are a sudden acceleration of search interest in a topic, compared to usual search volume. We know these are interesting because they are often reflective of what’s going on in the real world — there has been a rise in applications for Irish passports in the UK since the vote, for instance. | |
To get a sense of relative size, we can add additional terms, which helps put that search interest into perspective. For instance, after the Cleveland Cavaliers won the NBA Championships this year, we saw the Cavaliers spike past Taylor Swift, a topic that has consistently high search volume on Google. This helps put into context how large the volume around the “Cavaliers” search query was when it spiked. | |
We’ve seen lots of reporters use this approach. In the aftermath of the Oregon shooting, Huffington Post saw that search interest in gun control spiked above search interest in gun shop. By looking at this data in the year leading up to the tragedy, they found that this was a pattern for other recent shootings in America. | |
Looking at related searches can also help to understand conditions that might be driving spikes in Google Trends. During its annual Person of the Year special, TIME looked back at search interest around each of the candidates. To understand the context around each spike, TIME highlighted the related searches to each topic when it spiked in search to gain a better sense of what drove people’s curiosity at that moment in time. | |
Trends data can provide a powerful lens into what Google users are curious about and how people around the world react to important events. We’re committed to making Trends easier to use, understand and share. We look forward to continuing the conversation. | |
I am Data Editor at Google’s News Lab. To get the most recent updates from the team, follow our new Medium publication here.","4" | |
"enjalot","https://medium.com/hi-stamen/an-ode-to-d3-js-projections-9d6477d6da0b","27","{""Data Visualization"",D3,GIS}","80","5.93018867924528","An ode to d3.js projections","When it comes to making maps online there are many tools available, but they all have one thing in common: Geographic coordinates go in and a two-dimensional image comes out. Converting from geographic coordinates (longitude,latitude) to pixel coordinates is accomplished via a map projection. Most of the tools made for online map making focus on facilitating the navigation of a map with layers of geospatial data. Leaflet and Mapbox GL are two such tools and they do their job well. There is one tool that focuses on the projections and lets you decide how to present the geospatial data: d3.js projections. | |
What happens when you bring geospatial data into the rich and expressive d3.js data visualization ecosystem? The possibilities are endless, and this post attempts to explore and categorize some of them. | |
One of the first things that stands out is the potential for custom styling. Once you have used a projection to create SVG elements you can style them to your heart’s content. The following example utilizes some of SVG’s features like <defs> and <use>, and the textures.js library to render custom patterns on a map. |
It’s not just SVG, either; here is an example of generating a sketchy style using an algorithm that renders to canvas. |
The key idea here is that we get to use all of the tools we are familiar with in HTML5, SVG and CSS to express our geospatial data. We have complete control over the rendering, which leads to some powerful possibilities. | |
The most common use of d3.js projections is customizing the representation of data that has a geospatial component. The following examples show a small sample of the variety of ways that people have visualized data in a geographic context. | |
From these examples we can see that the size, shape and even number of maps can all be changed to suit the visualization’s purpose. The wide variety can be seen in this search of d3 examples. |
A powerful concept in d3 is transitions. Animating a visualization between two states when the underlying data changes is a great way to provide consistency to the viewer. This is no different when using geospatial data as d3 can animate a transition between two projections. You can also transition the parameters of the projection, such as the center, scale and rotation. | |
Projection transitions are used here to help understand the differences between projections. | |
This example transitions between two different representations of the data, a standard choropleth map and a dorling cartogram. | |
This example allows the user to choose between a globe projection and the traditional Mercator projection. | |
Transitions can also be abused to induce sea sickness. | |
As they say, “change is the only constant”, so being able to incorporate change into your visualization opens the door to many interesting possibilities. |
One of the more exciting aspects of web based visualization is the ability to include interaction. Because projections allow you to create DOM elements, you can leverage the existing browser APIs (made easier with d3’s selections and event handlers) to provide custom interactions with your geospatial data. These examples show some simple but powerful interactions with the mouse cursor. | |
In the following example a zoomed-in version of the map is rendered on the right hand side, centered on wherever the cursor is pointing in the left map. | |
This example uses fish-eye distortion around the cursor to allow the user to inspect densely plotted points inside the same map. | |
It may be desirable to create geometry from data in order to approximate it, or group it together visually. With d3 projections you can run any algorithm that works on two dimensional coordinates. A fairly common algorithm is the voronoi diagram seen in these two examples. | |
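As a hedged aside on what such an algorithm looks like outside the browser (SciPy here, invented points standing in for projected locations), a Voronoi diagram is a one-liner once your data is in 2D pixel space.
import numpy as np
from scipy.spatial import Voronoi

# Invented projected point locations, already in 2-D pixel coordinates.
points = np.random.default_rng(3).uniform(0, 256, size=(20, 2))

vor = Voronoi(points)
print(len(vor.regions))   # roughly one region per point, plus an empty placeholder region
print(vor.vertices[:3])   # cell corners, usable as polygon coordinates for rendering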
A lesser known but useful technique is to generate a concave hull to group data together. | |
The key takeaway is that d3 projections bring you into the well-studied realm of 2D graphics: your tool set can be expanded from mapping-specific tools to anything that has been invented for manipulating x,y coordinates. |
It is not just the rendering of the geographic shapes that can be customized, but what you do with them once they are rendered as well. Many geographic features are instantly recognizable (at least to relevant audiences) and can be used as icons. d3’s projections make it convenient to render single features independently, for arrangements outside of a traditional map. | |
Layering two locations in the same projection makes for an educational example. |
Once your geospatial data becomes a set of DOM elements, you can lay it out with the same flexibility and specificity as any other HTML and CSS project. |
d3 includes an extensive set of projections, and that set can be extended when people implement other projections with d3’s API. |
Some countries have territories that are not geographically close, so this library makes composite d3 projections that plot them together. | |
All projections introduce some sort of distortion when they attempt to flatten the 3D globe on to a 2D plane. This example allows you to visually inspect the distortion of several d3 projections. |
Interrupted projections allow an aspiring cartographer to decide how the earth should be cut in the process of flattening it out, BECAUSE THE EARTH IS NOT FLAT. | |
You can actually print out the map and cut along the “interruption.” | |
Glue the result to a tennis ball and you’ve created your own custom globe! | |
We’ve covered custom styles, data layouts, transitions, geometries, compositions and projections. The real conclusion is that we are just getting started with the potential for geospatial visualization. Use the links in all of the above examples as starting points, and don’t forget to search for more. | |
If you are in the San Francisco Bay Area at the end of June or beginning of July, join us |
for our two d3.js & mapping workshops to get a solid foundation in coding and designing with d3.js projections! Read more about our philosophy and methodology behind the workshops.","4" |
"backchannel","https://medium.com/backchannel/be-healthy-or-else-how-corporations-became-obsessed-with-fitness-tracking-b0c019faff8d","5","{Healthcare,Health,Tech,Fitness,""Big Data""}","80","3.91635220125786","Be Healthy or Else: How Corporations Became Obsessed with Fitness Tracking","Employers, which have long been nickel and diming workers to lower their costs, now have a new tactic to combat these growing costs. They call it “wellness.” It involves growing surveillance, including lots of data pouring in from the Internet of Things — the Fitbits, Apple Watches, and other sensors that relay updates on how our bodies are functioning. | |
The idea, as we’ve seen so many times, springs from good intentions. In fact, it is encouraged by the government. The Affordable Care Act, or Obamacare, invites companies to engage workers in wellness programs, and even to “incentivize” health. By law, employers can now offer rewards and assess penalties reaching as high as 50 percent of the cost of coverage. Now, according to a study by the Rand Corporation, more than half of all organizations employing fifty people or more have wellness programs up and running, and more are joining the trend every week. | |
There’s plenty of justification for wellness programs. If they work — and that’s a big “if” — the biggest beneficiary is the worker and his or her family. Yet if wellness programs help workers avoid heart disease or diabetes, employers gain as well. The fewer emergency room trips made by a company’s employees, the less risky the entire pool of workers looks to the insurance company, which in turn brings premiums down. So if we can just look past the intrusions, wellness may appear to be win-win. | |
Trouble is, the intrusions cannot be ignored or wished away. Nor can the coercion. Take the case of Aaron Abrams. He’s a math professor at Washington and Lee University in Virginia. He is covered by Anthem Insurance, which administers a wellness program. To comply with the program, he must accrue 3,250 “HealthPoints.” He gets one point for each “daily log-in” and 1,000 points each for an annual doctor’s visit and an on-campus health screening. He also gets points for filling out a “Health Survey” in which he assigns himself monthly goals, getting more points if he achieves them. If he chooses not to participate in the program, Abrams must pay an extra $50 per month toward his premium. | |
Abrams was hired to teach math. And now, like millions of other Americans, part of his job is to follow a host of health dictates and to share that data not only with his employer but also with the third-party company that administers the program. He resents it, and he foresees the day when the college will be able to extend its surveillance. “It is beyond creepy,” he says, “to think of anyone reconstructing my daily movements based on my own ‘self-tracking’ of my walking.” | |
My fear goes a step further. Once companies amass troves of data on employees’ health, what will stop them from developing health scores and wielding them to sift through job candidates? Much of the proxy data collected, whether step counts or sleeping patterns, is not protected by law, so it would theoretically be perfectly legal. And it would make sense. As we’ve seen, they routinely reject applicants on the basis of credit scores and personality tests. Health scores represent a natural — and frightening — next step. | |
Already, companies are establishing ambitious health standards for workers and penalizing them if they come up short. Michelin, the tire company, sets goals for its employees on metrics ranging from blood pressure to glucose, cholesterol, triglycerides, and waist size. Those who don’t reach the targets in three categories have to pay an extra $1,000 a year toward their health insurance. The national drugstore chain CVS announced in 2013 that it would require employees to report their levels of body fat, blood sugar, blood pressure, and cholesterol — or pay $600 a year. |
The CVS move prompted this angry response from Alissa Fleck, a columnist at Bitch Media: “Attention everyone, everywhere. If you’ve been struggling for years to get in shape, whatever that means to you, you can just quit whatever it is you’re doing right now because CVS has got it all figured out. It turns out whatever silliness you were attempting, you just didn’t have the proper incentive. Except, as it happens, this regimen already exists and it’s called humiliation and fat-shaming. Have someone tell you you’re overweight, or pay a major fine.” | |
All of this is done in the name of health. | |
Adapted from WEAPONS OF MATH DESTRUCTION: HOW BIG DATA INCREASES INEQUALITY AND THREATENS DEMOCRACY Copyright © 2016 by Cathy O’Neil. Published by Crown, an imprint of Penguin Random House LLC. | |
Available for purchase here.","2" | |
"noahhl","https://medium.com/signal-v-noise/getting-your-recommended-daily-chart-allowance-4502a782ad70","5","{""Data Visualization"",""Data Science"",Analytics,""Business Intelligence""}","79","4.7314465408805","Getting your recommended daily chart allowance","","8" | |
"cedricbellet","https://medium.com/biffures/part-2-the-beauty-of-bitwise-and-or-cdf1d8d87891","7","{Programming,JavaScript,""Data Visualization"",Bitwise}","78","4.75188679245283","Part 2: The beauty of bitwise AND (∧ or &)","The result of a bitwise AND operation between two bit words b1 and b2 is a bit word containing 1s in slots where both b1 and b2 contain 1s. In the example above, b1 and b2 both have 1s in positions 2 and 3 (from right to left), hence b1 ∧ b2 is 00000110. Simple. | |
While the result is easily computed and the process easily understood in base 2 (i.e., using the ‘bit strings as language’ lens), the numerical AND function looks in fact non-trivial: | |
You can test for yourself that this function indeed gives 22 ∧ 78 = 6. |
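If the closed-form expression is hard to parse, here is a short sketch of the same idea in Python (my own restatement, not a transcription of the article’s formula): compute the AND digit by digit in base 2, without using the & operator, and check that 22 ∧ 78 = 6.
def bitwise_and(a, b):
    """AND of two non-negative integers, computed arithmetically, bit by bit."""
    result, place = 0, 1
    while a > 0 and b > 0:
        # The current bit of the result is 1 only if both current bits are 1.
        result += place * ((a % 2) * (b % 2))
        a, b, place = a // 2, b // 2, place * 2
    return result

print(bitwise_and(22, 78))             # 6
print(bitwise_and(22, 78) == (22 & 78))  # True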
Now, if you find that function dreadful — I am with you. When I first wrote it down, I found it both complicated and unhelpful; after formulating it, my mind was no closer to understanding the kind of pattern the AND function followed, if any. | |
Ignoring the complex math formula above, we can still find a number of interesting properties regarding the AND function. Considering the function f : (a, b) → a ∧ b, where ∧ is the AND operation, f has the following (easy to demonstrate) properties: | |
This little information is enough to get us started; with some additional calculations, we find the following graph: | |
The horizontal axis is for values of the first operand a; the vertical axis is for values of the second one b; the value in each cell is the result of f(a,b) = a ∧ b. |
This chart has striking features that we did not foresee in our preliminary analysis. Observations: | |
And very naturally, through simple observations and intuitions, we have found a generalized way to draw any arbitrary-sized graph for the AND function: | |
Initialization: set T1 |
Repeat: for any n > 1 (the block construction is sketched in code below) |
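Here is one way to make that recursive construction concrete (a sketch of my own in Python, not the article’s original steps, and starting from the one-cell table for the value 0 rather than the article’s T1): at each step, tile four copies of the previous table and add 2^k only in the bottom-right block, where both operands carry the new high bit.
import numpy as np

def and_table(n):
    """AND table for the values 0 .. 2**n - 1, built purely by the block recursion."""
    table = np.array([[0]])   # the table for the single value 0
    for k in range(n):
        # Rows where the first operand is below 2**k: its high bit is 0,
        # so the other operand's high bit never survives -> two copies of the old table.
        top = np.hstack([table, table])
        # Rows where the first operand has the 2**k bit: the bit survives
        # only when the column operand has it too -> old table, then old table + 2**k.
        bottom = np.hstack([table, table + 2 ** k])
        table = np.vstack([top, bottom])
    return table

T = and_table(4)  # values 0..15
expected = np.bitwise_and.outer(np.arange(16), np.arange(16))
print(np.array_equal(T, expected))  # True: the recursion reproduces the real operator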
I get that my hand-drawn charts are far from perfect. If you have survived till this section, you deserve to see the better graphs — made this time with d3js. | |
As before, cells represent f(a,b) = a ∧ b for values of a determined by the horizontal axis; and values of b on the vertical axis; colors are a function of the value (darker is low, brighter is high). | |
We easily recognize the features determined above: the linear growth along the bisector, the symmetry, the tiers and blocks, and the mathematical relation between all those blocks. | |
While this says a lot about the AND function and could be interpreted for many more minutes, I also find this graph simply nice to contemplate. There is a beauty to this AND graph similar to the one I find in Pascal’s triangle. And I hope that beauty has become slightly more visible to you today. | |
Edits based on the HN discussion: Thanks for the great response! Truly humbled to see your interest for that topic and article. Based on feedback: (i) corrected, f(a, b) ≤ min(a, b) (vs max initially). (ii) The assumption is indeed that we work here with natural integers only; though bit strings for negative integers behave the same way as bit strings for natural ones in bitwise operations, the transcription from base 2 to base 10 for those two groups does not work the same (see 2’s complement). For the sake of simplicity, we assume bit strings as numbers to always be positive integers using the convention that any bit in position i is just worth 2^i.","1" | |
"davidventuri","https://medium.com/free-code-camp/new-coders-how-salary-and-time-spent-learning-vary-by-demographic-359ef1ed0da8","17","{Tech,Programming,""Data Science"",""Learning To Code"",""Gender Equality""}","78","6.89056603773585","New Coders: How Salary and Time Spent Learning Vary by Demographic","More than 15,000 people responded to Free Code Camp’s 2016 New Coder Survey, granting researchers (like me!) an unprecedented glimpse into how people are learning to code. The entire dataset was released on Kaggle. | |
The demographic distributions for the 15,620 new coders who responded to the survey are as follows: | |
Respondents’ expected next salary is the first salary they anticipate earning after they begin advertising their new coding skills. It is one of two main questions in the Free Code Camp survey whose answers depend on the quality of coding resources. |
The 25th percentile North American expects the same as the 75th percentile European: $50k. The median North American expects $60k per year. | |
I wonder if some Europeans forgot to convert from pounds, euros, or any of the other European currencies to US dollars. | |
By the way, here’s how to read this chart (and the other box plots in this article): the “x” is the mean. The horizontal line is the median (a.k.a. the 50th percentile). The bottom of the box is the 25th percentile, and the top of the box is the 75th percentile. Whisker length is 1.5 times the height of the box. The circles are outliers. All y-axes are on a logarithmic scale to better visualize the outlier-heavy data. | |
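For anyone who wants to reproduce this style of chart, here is a hedged matplotlib sketch with invented salary numbers: the mean shown as a marker, whiskers at 1.5 times the interquartile range, and a logarithmic y-axis.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Invented expected-salary samples for two groups, in US dollars.
groups = [rng.lognormal(mean=11.0, sigma=0.5, size=300),
          rng.lognormal(mean=10.8, sigma=0.6, size=300)]

fig, ax = plt.subplots()
ax.boxplot(groups,
           labels=["Group A", "Group B"],
           showmeans=True,   # the mean marker (the "x" in the article's charts)
           whis=1.5)         # whiskers at 1.5x the interquartile range
ax.set_yscale("log")         # a log scale copes with the outlier-heavy data
ax.set_ylabel("Expected next salary (USD)")
plt.show()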
The median female expects $9k more than the median male. The 25th percentile female expects $14k (!) more than her male equivalent. Female new coders appear optimistic about the changing diversity landscape in the workplace. | |
The gap in medians is $10k. The gap in first quartiles is $15k. Minorities also appear optimistic about the changing diversity landscape in the workplace. | |
Those who dedicate 40+ hours per week have a mean expected salary that is $3k to $5k higher than the others, but this could be caused by random chance. Only 694 of the 15,000+ respondents spend this much time learning. |
So expected next salary varies wildly by continent. There appears to be a reverse wage gap trend going on with gender and ethnic minority status. | |
Less than 5% of new coders are dedicating 40+ hours to learning each week. | |
Most of these respondents are in their early twenties and have a bachelor’s degree, which suggests that some are forgoing traditional forms of higher education (like master’s and professional degrees) and using those 40+ hour weeks to learn code. | |
By the way, this is the situation I myself am in with my personalized data science master’s degree. | |
As the awareness of the quality and affordability of online education rises, I expect more people to join the higher brackets. | |
The hours dedicated to learning question is another one whose answer depends on the quality of coding resources. | |
(Note that for trans people, the difference was not statistically significant.) | |
So unlike expected next salary, hours dedicated to learning doesn’t vary much by demographic. The bulk of each spends between 5 to 20 hours learning weekly. | |
For both males vs. females and ethnic majorities vs. minorities, two grouped scatter plots follow: | |
Each has a best-fit line labeled with its correlation, as well as dashed lines representing the median for each axis variable. I removed new coders that are 65+ years old since they are statistical outliers. | |
Note the current salary vs. age correlations of 0.267 and 0.192. | |
Male new coders do have an above average proportion of very high ($150k+) current salaries, which corresponds to a slightly higher mean (not plotted). | |
The correlations are much lower, with both near 0.1. Again, we see the huge gap in medians: $59k for females and $50k for males. These are both higher than each gender’s current salary median. | |
Note the current salary vs. age correlations of 0.253 and 0.243. | |
Ethnic majorities do have an above average proportion of $150k+ current salaries, but it doesn’t correspond to a higher mean (not plotted) this time. | |
Both of the correlations are much lower, near 0.1, again. The $10k gap in median expected next salary is striking: $60k for ethnic minorities and $50k for ethnic majorities. As with gender, these are both higher than their current salary median. | |
So, like expected next salaries, female new coders have higher median current salaries than male new coders. So do ethnic minorities vs. ethnic majorities. | |
Older females don’t do as well as older males, however, which is the only hint of a wage gap that I could find in the entire dataset. | |
For both gender and ethnic minority status, it appears that older new coders are willing to take a pay cut when transitioning to a job where they advertise their new coding skills, while younger individuals intend to capitalize on coding demand with a hefty salary early in their career. | |
Hours dedicated to learning to code is pretty much constant across gender, ethnic minority status, and continent. Most people spend between 5–20 hours each week. | |
Expected next salary (post-coding skills acquisition) varies wildly by continent. The lowest median is $30k (Europe) and the highest is $60k (North America). |
Older new coders appear willing to take a pay cut in their new job where they advertise their new coding skills, while younger new coders look to start their careers with substantial salaries with their in-demand skill. | |
The traditional gender and ethnic minority wage gaps aren’t prevalent in the 2016 New Coder Survey. In fact, they are reversed. Perhaps new coders aren’t reflective of the working population in general, where data suggests that both wage gaps still exist in 2016. | |
Do you have a hunch as to why the ethnic minority status and gender wage gaps might not apply to new coders? Please share anything relevant (or contradictory!) in the responses. | |
You can find a more detailed version of this analysis on Kaggle, where you’ll find statistical tests supporting the inferences in this article. | |
Be sure to check out my other pieces exploring Free Code Camp’s 2016 New Coder Survey: | |
If you have questions or concerns about this series or the R code that generated it, don’t hesitate to let me know.","8" | |
"cjgallo","https://medium.com/signal-v-noise/data-is-man-made-cbbe6e2992ca","4","{""User Experience"",""Product Management"",""Customer Service"",""Customer Experience"",""Big Data""}","78","4.00188679245283","Data is Man-Made","Here’s a secret from the support team at Highrise. Customer support metrics make us feel icky. | |
Our team doesn’t know our satisfaction score. We’ve never asked any of the people that use Highrise to try those types of surveys. | |
We can’t give you an exact number for our average response time. It depends. Sometimes it’s 90 seconds, and other times it’s within 24 hours. | |
We can’t tell you our average handle time for an issue. Our team has a general idea, but no exact number. | |
These types of customer support metrics aren’t wrong. We’re sure they work for other support teams. | |
We’re just not sure they’re right for us. | |
Because there is one piece of knowledge we’ve come to realize: data is man-made. | |
Data or metrics or stats are all man-made. A human decides what to measure, how to measure it, how to present it, and how to share it with others. | |
But why does it matter to measure these things? And what’s the point? | |
A lot of times people avoid these questions when it comes to data. Companies copy what other teams measure, without asking whether it’s important to measure the same things in the same way, or whether it’s even important to measure them at all. |
Numbers are black and white. Concrete. You can trust the numbers. | |
Right? | |
Nope. Almost all data is built on biases and judgement. Because humans are deciding what to measure, how to measure, and why to measure. | |
Numbers fit perfectly into a spreadsheet or a graph. A number gives a definitive answer to questions like how much or how many. | |
That doesn’t mean you should treat those numbers as insights and act immediately. Data shouldn’t be used to prove a point. | |
Data should be used to fuel your imagination. | |
Qualitative data isn’t easy. There aren’t any formulas or simple math. It doesn’t fit into a spreadsheet. It doesn’t answer questions. It’s not black and white. | |
It’s colorful. Messy. Qualitative data creates more questions. It’s not simple to present or share with others. It takes some time. | |
Our support team has found one thing to be true. Qualitative data is worth it. 100 percent worth it. | |
For example, our team recently updated the filters in Highrise. This update built on an earlier revision to the filters that we made during the year. |
It was driven by one piece of qualitative data from a new user: | |
This hit all of us across the nose. The filters looked better. The revised design was much cleaner than the original. |
We didn’t make these changes just for aesthetic reasons, though. The original design caused a lot of trouble for most of our users who had more than a handful of custom fields. But with the cleaner design, how to use the filters wasn’t as obvious any longer. |
Folks need to find a specific set of contacts in the city of: Chicago, that have the value: Interested in the custom field: Status, and that are tagged: Potential. | |
It wasn’t clear how to do that, so our team made a change. | |
We made it abundantly clear what to click on to add a filter. | |
Quantitative data didn’t tell us we needed to make this change. It was all qualitative. | |
Questions from customers and questions from our team. It was a conversation. There is not a numerical value you can put on that. | |
Instead of striving to lower our average response time or improve our customer satisfaction score, our support team is aiming for something a bit different. Something harder to measure. It’s not a number. | |
As Alison would say, we strive to put ourselves out of work. | |
Don’t confuse that with us not wanting to work at Highrise. We love it, and love working with our small team. | |
What we mean is we want to make it easier for people to use Highrise. We want to create a product that is so obvious and so easy to use that we seldom get questions on how to use it. | |
And when folks do have questions, we want to have resources available to them right away, so they can help themselves. So if someone has a question at 2 am, and we’re not around, they can find an answer without waiting for us. |
Because we don’t believe managing a number is going to improve our support. We believe focusing on customers and what they are trying to do with Highrise is going to make a better product, and better support. | |
If you enjoyed this post, please click the 💚 to share it with others. Please don’t take this as gospel either. What works for our team, might not work for your team. And vice versa. | |
Also, chapter 9 of Clayton Christensen’s recent book, Competing Against Luck, was a big inspiration for this post. The entire book is great, and you should check it out.","2" | |
"joeyzwicker","https://medium.com/pachyderm-data/lets-build-a-modern-hadoop-4fc160f8d74f","1","{""Big Data"",Docker,""Data Science""}","77","7.05283018867925","Let’s build a modern Hadoop","If you’ve been around the big data block, you’ve probably felt the pain of Hadoop, but we all still use it because we tell ourselves, “that’s just the way infrastructure software is.” However, in the past decade, infrastructure tools ranging from NoSQL databases, to distributed deployment, to cloud computing have all advanced by orders of magnitude. Why have large-scale data analytics tools lagged behind? What makes projects like Redis, Docker and CoreOS feel modern and awesome while Hadoop feels ancient? | |
Modern open source projects espouse the Unix philosophy of “Do one thing. Do it really well. Work together with everything around you.” Every single one of the projects mentioned above has had a clear creator behind it from day one, cultivating a healthy ecosystem and giving the project direction and purpose. In a flourishing ecosystem, everything integrates together smoothly to offer a cohesive and flexible stack to developers. | |
Hadoop never had any of this. It was released into a landscape with no cluster management tools and no single entity guiding its direction. Every major Hadoop user had to build the missing pieces internally. Some were contributed back to the ecosystem, but many weren’t. Facebook, probably the biggest Hadoop deployment in the world, forked Hadoop six years ago and has kept it closed source. |
This is not how modern open source is supposed to work. I think it’s time to create a modern Hadoop and that’s exactly what we’re trying to do at Pachyderm. Pachyderm is a completely new storage and analytics engine built on top of modern tools. The biggest benefit of starting from scratch is that we get to leverage amazing advances in open source infrastructure, such as Docker and Kubernetes. | |
This is why we can build something an order of magnitude better than Hadoop. Pachyderm can focus on just the analytics platform and use powerful off-the-shelf tools for everything else. When Hadoop was at this stage, they had to build everything themselves, but we don’t. The rest of this essay is our blueprint for a modern data analytics stack. Pachyderm is still really young and open source projects need healthy discussion to continue improving. Please share your opinions and help us build Pachyderm! | |
NOTE: The Hadoop ecosystem has been around for 10 years and is very mature. It will be a while before Pachyderm has analogs for everything in the ecosystem (e.g. Hive, Pig). This comparison will be restricted to just the distributed file system, analytics engine, and directly related components that are present in both systems. | |
In Hadoop, MapReduce jobs are specified as Java classes. That’s fine for Java experts, but isn’t for everyone. There are a number of different solutions available that allow the use of other languages, such as Hadoop streaming, but in general, if you’re using Hadoop extensively, you’re going to be doing work in Java (or Scala). | |
Job pipelines are also a constant challenge with distributed processing. While Hadoop MapReduce shows actively running jobs, it doesn’t natively have any notion of a job pipeline (DAG). There are lots of job-scheduling tools that have tried to solve this problem to varying degrees of success (e.g. Chronos, Oozie, Luigi, Airflow), but ultimately, companies wind up using a mishmash of these and home-brewed solutions. The complexity of mixing custom code with outside tools becomes a constant headache. | |
Contrast this with Pachyderm Pipelines. To process data in Pachyderm, you simply create a containerized program which reads and writes to the local filesystem. You can use any tools you want because it’s all just going in a container. Pachyderm will inject data into your container by way of a FUSE volume and then automatically replicate the container, showing each one a different chunk of data. With this technique, Pachyderm can scale any code you write to process massive data sets in parallel. No more dealing with Java or JVM-based abstraction layers, just write your data processing logic in your favorite language with any of your favorite libraries. If it fits in a Docker container, you can use it for data analysis. | |
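As a hedged illustration of the “just read and write the local filesystem” model, a pipeline step could be as small as the script below; the directory names are placeholders for wherever the input and output volumes happen to be mounted, not necessarily Pachyderm’s exact paths.
import os

INPUT_DIR = os.environ.get("INPUT_DIR", "/pfs/flights")  # hypothetical mount point for injected input data
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/pfs/out")    # hypothetical mount point collected as the job's output

# Count lines per input file; each container replica sees only its chunk of the data.
for name in os.listdir(INPUT_DIR):
    with open(os.path.join(INPUT_DIR, name)) as f:
        count = sum(1 for _ in f)
    with open(os.path.join(OUTPUT_DIR, name + ".count"), "w") as out:
        out.write(f"{name}\t{count}\n")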
Pachyderm also creates a DAG for all the jobs in the system and their dependencies and it automatically schedules the pipeline such that each job isn’t run until its dependencies have completed. Everything in Pachyderm “speaks in diffs” so it knows exactly which data has changed and which subsets of the pipeline need to be rerun. |
Comparing Docker to the JVM is a bit of a stretch. We’ve categorized them as the “job platform” because they define the input format for jobs. | |
The JVM is the backbone of the Hadoop ecosystem. If you want to build anything in Hadoop, you need to either write it in Java or use a special-purpose tool that creates an abstraction layer between the JVM and another language. Hive, which is a SQL-like interface to HDFS, is by far the most popular and well-supported. There are also third-party libraries for common use cases such as image processing, but they are often far less standardized and poorly maintained. If you’re trying to do something more esoteric, such as analyzing chess games, you’re generally out of luck or need to hack together a few different systems. |
Docker, on the other hand, completely abstracts away any language constraints or library dependencies. Instead of needing a JVM-specific tool, you can use any libraries and just wrap them in a Docker container. For example, you can `npm install opencv` and Pachyderm will let you do computer vision on petabytes of data! Tools can be written in any language so it’s ridiculously easy to integrate open source technology advances into the Pachyderm stack. | |
Finally, Pachyderm analysis pipelines are both portable and shareable. Since everything is bundled in a container, it’s guaranteed to run in a predictable way across different clusters or datasets. Just as anyone can pull the Redis container from DockerHub and immediately be productive, imagine being able to download an NLP container — just put text in the top and get sentiment analysis out the bottom. It just works out of the box on any infrastructure! That’s what we’re creating with Pachyderm pipelines. | |
HDFS is one of the most stable and robust elements of the Hadoop ecosystem. It’s great for storing massive data sets in a distributed fashion, but it lacks one major feature — collaboration. Large-scale data analysis and pipelining is a naturally collaborative effort, but HDFS was never designed to be used concurrently by a company’s worth of people. Rather, it entails a great deal of jerry-rigging to keep users from stepping on each other’s toes. It’s unfortunately quite common for a job to break or change because someone else alters the pipeline upstream. Every company solves this internally in different ways, sometimes with solutions as rudimentary as giving each user their own copy of the data, which requires a ton of extra storage. | |
The Pachyderm File System (pfs) is a distributed file system that draws inspiration from git, the de facto tool for code collaboration. In a nutshell, pfs gives you complete version control over all your data. The entire file system is commit-based, meaning you have a complete history of every previous state of your data. Also like git, Pachyderm offers ridiculously cheap branching so that each user can have his/her own completely independent view of the data without using additional storage. Users can develop analytics pipelines or manipulate files in their branch without any worry of messing things up for another user. | |
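The day-to-day workflow mirrors git. The commands below are an approximate sketch of Pachyderm’s CLI; exact spelling and flags vary across releases, and the repo, branch, and file names are made up.

pachctl create repo logs                              # like "git init" for a data repository
pachctl start commit logs@master                      # open a new commit on the master branch
pachctl put file logs@master:/2016-09-30.log -f ./2016-09-30.log
pachctl finish commit logs@master                     # the commit becomes an immutable snapshot
pachctl create branch logs@experiment --head master   # a cheap branch; no data is copied

Each finished commit is an immutable snapshot of the repo, which is what makes those cheap, independent branches possible.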
Pfs stores all your data in generic object storage (S3, GCS, Ceph, etc.). You don’t have to entrust your data to some unproven new technology. Instead, you get all the redundancy and persistence guarantees you’re used to, but with Pachyderm’s advanced data management features (see Unix philosophy above).
Version control for data also works hand in hand with our pipelining system. Pachyderm understands how your data changes and thus, as new data is ingested, it can run your workload on only the diff of the data rather than the whole thing. Not only does this dramatically improve cluster performance, but it also means there’s no difference in Pachyderm between a batch job and a streaming job; the same code and infrastructure work for both!
The clustering layer comprises the tools that let you manage the machines used for data storage and processing.
In Hadoop, the two main tools that address this are: YARN, which handles job scheduling and cluster resource management; and Zookeeper, which provides highly-reliable configuration synchronization. At Hadoop’s conception, there weren’t any other good tools available to solve these problems, so YARN and Zookeeper became strongly coupled to the Hadoop ecosystem. While this contributed to Hadoop’s early success, it now presents a significant obstacle to adopting new advances. This lack of modularity is, unequivocally, one of Hadoop’s biggest weaknesses. | |
Pachyderm subscribes to Docker’s philosophy of “batteries included, but removable.” We focus on doing one thing really well — analyzing large data sets — and use off-the-shelf components for everything else. We chose Kubernetes for our cluster management and Docker as our containerization format, but these are interchangeable with a variety of other options.
In the Pachyderm stack, clustering is handled by Kubernetes and the CoreOS tool Etcd, which serve similar purposes to YARN and Zookeeper, respectively. Kubernetes is a scheduler that figures out where to run services based on resource availability. Etcd is a fault-tolerant datastore that stores configuration information and dictates machine behavior during a network partition. If a machine dies, Etcd registers that fact and tells Kubernetes to reschedule all the processes that were running on that machine. Other cluster management tools, such as Mesos, can be used instead of CoreOS and Kubernetes, but they aren’t officially supported yet.
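Because these are stock components, you can watch the failure handling described above with their own standard tooling; the ordinary kubectl commands below are one example, and nothing about them is Pachyderm-specific.

kubectl get nodes                    # a dead machine eventually shows up as NotReady
kubectl get pods -o wide --watch     # controller-managed pods get rescheduled onto healthy nodes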
Using off-the-shelf solutions has two big advantages. First, it saves us having to write our own versions of these tools and gives us a clean abstraction layer. Second, Etcd and Kubernetes are themselves designed to be modular, so it’s easy to support a variety of other deployment methods as well. | |
Both stacks run on Linux. There’s nothing too interesting to say here. | |
Scalable, distributed data analytics tools are a fundamental piece of software infrastructure. Modern web companies are collecting ever-increasing amounts of data and making more data-oriented decisions. The world needs a modern open source solution in this space, and Hadoop’s archaic design is becoming a burden on an increasingly data-driven world. It’s time for our data analytics tools to catch up to the future.
If you’d like to get in touch, you can email me at [email protected]. Or find us on GitHub: github.com/pachyderm/pachyderm | |
Thanks to Joe Doliner and Gabe Dillon for reading drafts of this.","2" | |
"privacyint","https://medium.com/privacy-international/down-with-the-data-monarchy-efa37e539ada","1","{Privacy,""Internet Of Me"",""Internet of Things"",Surveillance,""Big Data""}","77","3.16981132075472","Down with the data monarchy","Our devices generate and collect personal data without our knowledge or consent. This has to change. | |
This piece was written by PI legal officer Tomaso Falchetta and originally appeared here. | |
If you want to understand the power dynamics in our increasingly data-driven economy, look no further than the language being used to describe the participants. | |
In data protection lingo, an individual is called a “data subject” and the companies that collect, store, analyze and disseminate the individual’s personal data are called “data controllers.” And sadly, it wouldn’t be a stretch — when it comes to massive companies such as Facebook, Apple and Google — to define the relationship between the subjects and the controllers as an absolute monarchy. | |
These days, there are new, innovative and often downright creepy ways in which our devices generate and collect personal data. This data is then used by companies to predict, monitor and even steer individuals’ future behavior. Much of this data generation and transmission is done without the individual’s knowledge or involvement. | |
Vast amounts of data are being generated and collected by an increasing array of products that are no longer limited to your computer or your phone, but also include your fridge, car, children’s toys, fitness bands, and more. News about these practices usually reaches consumers only when something goes horribly wrong, such as the massive theft of data from Yahoo servers. | |
In most cases, how the information is used is beyond the control of the data subject. | |
And yet, there is reason to hope that we are not completely at the mercy of big corporations. Last week, a data protection authority in Germany prohibited Facebook from collecting and storing the data of WhatsApp users in the country. For good measure, the authority also ordered Facebook to delete all data already forwarded by the app. | |
It is too early to assess the impact of the decision, but there is a good chance it could have implications beyond Germany. The U.K. information commissioner and other data protection authorities in Europe had previously stated that they intend to investigate the implications of the changes in the way personal data is shared between Facebook and WhatsApp. And if the German authority’s order is upheld and enforced, it might be technologically too burdensome, as well as potentially unlawful, for Facebook to apply different sharing rules to different users depending on where they are located. | |
This is not an isolated case. There are currently over 100 countries across the world with data protection laws, albeit with varying standards of enforcement. Data protection authorities are increasingly looking at how new technologies that generate and share data affect an individual’s privacy. | |
This year, some 25 data protection authorities participated in a coordinated review of more than 300 devices connected to the Internet of Things, such as fitness trackers, thermometers, heart rate monitors, smart TVs, smart meters, connected cars and connected toys. Their findings are unsurprisingly worrying, both in terms of poor privacy policies and poor security measures to prevent misuse of data or hacking of the devices. | |
It is becoming increasingly difficult to ask companies how exactly they use consumer data. When you wear a fitness band, for example, is your weight and your average number of steps per day accessible to the company? Does the company sell that personal health data to insurance companies? Is that legal? The answers are unclear. With the proliferation of smart devices, it is crucial that we ask such questions and that companies be compelled to provide a clear response. | |
The EU’s General Data Protection Regulation, adopted earlier this year, offers a good data protection framework, but it also leaves a lot of room for interpretation about the scope of protection offered to data subjects. Other initiatives, such as the current revision of the EU ePrivacy Directive, could help develop the data protection principles that should underpin a digitized society.
In the end, data protection authorities alone cannot address all the concerns related to the modern use, misuse and exploitation of data. We must develop a coordinated effort between regulators to ensure that digital societies are free from exploitation. Given the complexities of an economy powered by data, we need to protect citizens’ fundamental rights and increase transparency around how data is used. | |
We cannot let the data subjects face the data monarchs alone. We must continue to ensure that the emerging legal framework empowers individuals to take back control of their personal data. The often hidden exploitation of personal data — and its subjects — cannot be allowed to continue. | |
Tomaso Falchetta is a legal officer at Privacy International and primarily develops the organization’s advocacy approaches to relevant U.N. and regional human rights bodies.","2" | |
"pete","https://m |