@itay-grudev
Last active June 23, 2017 12:44
A simple script that calculates letter frequency for any language by analysing a given set of articles on the internet. It ships with a list of English Wikipedia articles.
# The MIT License (MIT)
# Copyright (c) 2014 Itay Grudev
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
# This example illustrates how to estimate which letters are used most often
# in a language from a sample of articles. It collects the text found in all
# meaningful tags of each article.
#
# This example is preloaded with 77 English articles on different subjects
# so that the sample covers a broad range of vocabulary.
#
# All the URLs are stored in the article_list variable.
#
# Ruby strings are UTF-8 by default, so in theory this should work with
# any language.
require 'open-uri'
require 'nokogiri'
# List of different articles
article_list = [
# Generic language and country information
'http://en.wikipedia.org/wiki/United_States',
'http://en.wikipedia.org/wiki/Canada',
'http://en.wikipedia.org/wiki/United_Kingdom_of_Great_Britain_and_Ireland',
'http://en.wikipedia.org/wiki/Scotland',
'http://en.wikipedia.org/wiki/Ireland',
'http://en.wikipedia.org/wiki/American_English',
'http://en.wikipedia.org/wiki/Scottish_English',
'http://en.wikipedia.org/wiki/British_English',
'http://en.wikipedia.org/wiki/Hiberno-English',
'http://en.wikipedia.org/wiki/Canadian_English',
# History Articles
'http://en.wikipedia.org/wiki/History_of_the_United_States',
'http://en.wikipedia.org/wiki/History_of_the_United_Kingdom',
'http://en.wikipedia.org/wiki/History_of_Canada',
'http://en.wikipedia.org/wiki/History_of_Scotland',
# Mathematics articles
'http://en.wikipedia.org/wiki/Gamma_function',
'http://en.wikipedia.org/wiki/Ellipse',
'http://en.wikipedia.org/wiki/Differentiation_(mathematics)',
'http://en.wikipedia.org/wiki/Mathematical_induction',
'http://en.wikipedia.org/wiki/Statistics',
'http://en.wikipedia.org/wiki/Geometry',
'http://en.wikipedia.org/wiki/Fractal',
'http://en.wikipedia.org/wiki/Game_theory',
# Biology articles
'http://en.wikipedia.org/wiki/Blood',
'http://en.wikipedia.org/wiki/Cell_(biology)',
'http://en.wikipedia.org/wiki/Evolution',
'http://en.wikipedia.org/wiki/Mitochondria',
'http://en.wikipedia.org/wiki/Virus',
'http://en.wikipedia.org/wiki/Botulinum_toxin',
'http://en.wikipedia.org/wiki/Dna',
'http://en.wikipedia.org/wiki/Rna',
# Physics articles
'http://en.wikipedia.org/wiki/Pendulum',
'http://en.wikipedia.org/wiki/Quantum_mechanics',
'http://en.wikipedia.org/wiki/Standard_Model',
'http://en.wikipedia.org/wiki/String_theory',
'http://en.wikipedia.org/wiki/Newton%27s_laws_of_motion',
'http://en.wikipedia.org/wiki/Relativism',
'http://en.wikipedia.org/wiki/Thermal_conduction',
'http://en.wikipedia.org/wiki/Electromagnetic_radiation',
# Chemistry articles
'http://en.wikipedia.org/wiki/Chemistry',
'http://en.wikipedia.org/wiki/Photochemical_reaction',
'http://en.wikipedia.org/wiki/Molecule',
'http://en.wikipedia.org/wiki/Chemical_reaction',
'http://en.wikipedia.org/wiki/Chemical_equation',
'http://en.wikipedia.org/wiki/Chemical_substance',
'http://en.wikipedia.org/wiki/Pharmaceutical',
'http://en.wikipedia.org/wiki/Helium',
'http://en.wikipedia.org/wiki/Ionic_compounds',
'http://en.wikipedia.org/wiki/Avogadro_constant',
'http://en.wikipedia.org/wiki/Chemical_bond',
# Geology
'http://en.wikipedia.org/wiki/Geology',
# Philosophy articles
'http://en.wikipedia.org/wiki/Ancient_Greek_philosophy',
'http://en.wikipedia.org/wiki/Aristotle',
'http://en.wikipedia.org/wiki/Socrates',
'http://en.wikipedia.org/wiki/Plato',
'http://en.wikipedia.org/wiki/Modern_philosophy',
'http://en.wikipedia.org/wiki/Shopenhauer',
'http://en.wikipedia.org/wiki/Friedrich_Nietzsche',
'http://en.wikipedia.org/wiki/Freud',
'http://en.wikipedia.org/wiki/Immanuel_Kant',
# Computer science
'http://en.wikipedia.org/wiki/Computer_science',
'http://en.wikipedia.org/wiki/Database',
'http://en.wikipedia.org/wiki/Programming_language_theory',
'http://en.wikipedia.org/wiki/Programming_language',
'http://en.wikipedia.org/wiki/Theory_of_computation',
'http://en.wikipedia.org/wiki/Artificial_intelligence',
'http://en.wikipedia.org/wiki/Computer_engineering',
'http://en.wikipedia.org/wiki/Computer_network',
'http://en.wikipedia.org/wiki/Concurrency_(computer_science)',
'http://en.wikipedia.org/wiki/Turing_machine',
# Generic
'http://en.wikipedia.org/wiki/Bill_Gates',
'http://en.wikipedia.org/wiki/Steve_Jobs',
'http://en.wikipedia.org/wiki/Kama_Sutra',
'http://en.wikipedia.org/wiki/Harry_Potter',
'http://en.wikipedia.org/wiki/Miss_World',
'http://en.wikipedia.org/wiki/Commonwealth_Saga',
'http://en.wikipedia.org/wiki/The_Hunger_Games',
'http://en.wikipedia.org/wiki/Paris_Hilton'
]
letters = Hash.new(0) # character => number of occurrences
counter = 0
total_letters = 0
article_list.each do |article|
  # Show progress on a single, continuously updated line
  print "\rProcessing article #{counter += 1} of #{article_list.length}"
  # Open the article (URI.open is required on Ruby 3.0+, where Kernel#open
  # no longer accepts URLs)
  doc = Nokogiri::HTML(URI.open(article))
  # Gather text from meaningful tags. Nested matches (e.g. an <a> inside a <p>)
  # are selected separately, so their text is counted more than once.
  text = doc.css('p,h1,h2,h3,h4,h5,h6,a,td,b,i,em,strong,span').map(&:content).join
  # Every character is counted, including spaces, digits and punctuation
  total_letters += text.length
  text.each_char do |char|
    letters[char.downcase.to_sym] += 1
  end
end
puts
# Print the counts in ascending order of frequency
letters.keys.sort_by { |key| letters[key] }.each do |key|
  label = key.to_s == "\n" ? '\n' : key.to_s
  puts "\"#{label}\": #{letters[key]}"
end
puts "Total characters analyzed: #{total_letters}"
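
The script above tallies every character it finds, including whitespace, digits and punctuation. A minimal sketch of a letters-only variant, working on a plain string rather than on downloaded articles, could look like the following. The method name letter_frequencies and the sample sentence are assumptions made for illustration and are not part of the original gist.

```ruby
# Hypothetical helper (not in the original script): returns each letter's
# share of all letters in the text, as a percentage.
def letter_frequencies(text)
  counts = Hash.new(0)
  # \p{L} matches any Unicode letter, so digits, spaces and punctuation
  # are skipped.
  text.downcase.scan(/\p{L}/) { |letter| counts[letter] += 1 }
  total = counts.values.sum
  # Most frequent letters first; the order of ties is not guaranteed.
  counts.sort_by { |_, n| -n }
        .map { |letter, n| [letter, (100.0 * n / total).round(2)] }
        .to_h
end

# Sample input chosen for illustration only.
frequencies = letter_frequencies('Hello, World! 123')
frequencies.each { |letter, pct| puts "#{letter}: #{pct}%" }
```

Working with percentages instead of raw counts makes results from samples of different sizes directly comparable.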