@sbward
Last active December 6, 2016 21:52
Comparing Google Speech to Pocketsphinx

Summary

Rough estimate:

POCKETSPHINX:  33%
GOOGLE SPEECH: 66%

speechtest:

POCKETSPHINX PRECISION: 81%
POCKETSPHINX RECALL: 68%

GOOGLE PRECISION: 82%
GOOGLE RECALL: 72%

Speechtest

$ speechtest sample2_actual.txt sample2_sphinx.txt 
sample2_actual.txt = 234 words
sample2_sphinx.txt = 280 words
True  Positives: 190
False Positives: 44
False Negatives: 90

(sphinx) Precision: 0.811965811965812
(sphinx) Recall   : 0.6785714285714286

$ speechtest sample2_actual.txt sample2_google.txt 
sample2_actual.txt = 234 words
sample2_google.txt = 264 words
True  Positives: 191
False Positives: 43
False Negatives: 73

(google) Precision: 0.8162393162393162
(google) Recall   : 0.7234848484848485
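
The precision and recall figures above follow from the reported counts as TP / (TP + FP) and TP / (TP + FN). Below is a minimal Python sketch that reproduces them from those counts. The function name `precision_recall` is hypothetical and not part of speechtest; how speechtest itself derives the counts is not shown here.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Counts copied from the speechtest output above.
sphinx = precision_recall(tp=190, fp=44, fn=90)
google = precision_recall(tp=191, fp=43, fn=73)

print("(sphinx) Precision: %.3f  Recall: %.3f" % sphinx)
print("(google) Precision: %.3f  Recall: %.3f" % google)
```

Rounded to whole percentages, these values give the 81% / 68% (Pocketsphinx) and 82% / 72% (Google) quoted in the summary.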

Sample 1

@ 3:00 of Content Mining Taxonomies, Ontologies, and Semantics, Metadata Madness Luncheon

HUMAN

The biggest thing that you would do to kind of do that across multiple data sets starts with normalization, right, so if you want to make sure that data is searchable and you're looking at it across multiple data aggregators, data providers, normalizing that content across those providers is sort of the first thing you can do to make sure that garage is clean. The cleaner that garage becomes, the easier it is to find that content and we can talk a little bit more I think throughout the course of this panel, how that how that goes and how that works.

We um, I think we spend a lot of time thinking about how to clean the garage but maybe not enough about how we got all the stuff in there in the first place. You know, semantics is really, obviously important, especially when you're talking about standardization, um, but, if you don't have the data in the first place, if you can't think about all of this stuff in the context of billions of files, because really, we're not talking about a few boxes in the garage, we're talking about billions of them. And you need to be able to um account for the fact that whatever semantics you do come up with, it, there's gonna be, it's gonna be reliant on humans entering metadata, it's going to be reliant on remembering what the organization is, and also, it not changing, and not to mention, it not failing to account for, to extend the metaphor, a box of new sprinklers or something that doesn't fit the mold, or doesn't fit the organization that you agreed upon.

GOOGLE SPEECH

you would do to kind of do that across multiple datasets start with normalization right so if you want to make sure that data is is searchable and youre looking across multiple

aggregator State provider normalizing that content across those providers to sort of the first thing you can do to make sure that that garage is clean and then to clean that garage becomes the easier it is to find

I think we spend a lot of time thinking about how to

in the garage but maybe not enough time about how we got all the stuff in there in the first place you know as soon as it is really you know obviously in Port

especially when youre talking about standardization but if you dont have the data in the first place if you cant if you cant think about all of this stuff in the

text of billions of files because really were not no were not talking about you know if you boxes in the garage were talking about billions of them and you need to be able to

account for the fact that whatever semantics you do come up with it theres going to be its going to be reliant on humans entering metadata its going to be reliant on the form

during what the organization is and also if not changing and not to mention it is failing to account for you know that I extend the metaphor a box of

sprinklers are something that doesnt fit the mold or doesnt fit the organization that you agreed upon

POCKETSPHINX

you would do the time to get across multiple dataset started normalization right so [NOISE] if you wanna make sure that day is this article in you're looking across multiple ah but the aggregated skipper wires [NOISE] normalizing that thompson across those providers us were the first thing you can do to make sure that that crisis clean [NOISE] on and then the clinic arise because the losers i find that constantly talk a little more thing for the course of this town how about how it goes and that's how it works

we young spreading we spend a lot time thinking about how to clean it raj but [NOISE] maybe not of time about how we got all the stuff in there in the first place [NOISE] tom you know this is the id is semantics is it is really a note on this important especially to talk about standardization [NOISE] off by youth if you don't have that in the first place if you can if he can't think about all this stuff in the context of billions of files because really or not you know we're not talk about you know few boxes in the garage are talking about billions of them [NOISE] and you need to be able to bomb account for the fact that they were never semantics you to come up with

it does get it's it's gonna be relied on humans entering meditate and it's gonna be relied on remembering what the organization is and also did not changing [NOISE] and not to mention it failing to account for you know that it ought to extend the metaphor of box of [NOISE] new sprinklers or something [NOISE] that that it doesn't fit the mold or does that organization that you agreed upon
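
One rough way to compare a machine transcript against the human one is to count the words they share, ignoring order. The sketch below does this with a multiset intersection on two short excerpts from the transcripts above. It is an illustration only: speechtest may count matches differently, and the helper names `tokenize` and `shared_words` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shared_words(reference, hypothesis):
    """Count words common to both texts (multiset intersection, order ignored)."""
    ref = Counter(tokenize(reference))
    hyp = Counter(tokenize(hypothesis))
    return sum((ref & hyp).values())


# Short excerpts from the Sample 1 transcripts above.
human = "to make sure that garage is clean"
sphinx = "to make sure that that crisis clean"

matched = shared_words(human, sphinx)
print(matched, "of", len(tokenize(human)), "reference words matched")
```

Because word order is ignored, this kind of count can overstate agreement on longer passages; an alignment-based measure would be stricter.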

Sample 2

HUMAN

GOOGLE SPEECH

POCKETSPHINX

Sample 3

HUMAN

GOOGLE SPEECH

POCKETSPHINX
