Florin Badita-Nistor baditaflorin

@baditaflorin
baditaflorin / 30_under_30_europe_linkedin_info_final.csv
Created February 3, 2018 18:01
Forbes 30 Under 30 Europe 2018 list containing 384 people. For 308 of them, the list also includes the LinkedIn profile, where they have one.
Category were the 30 under 30 person was selected,Forbes Name,Forbes Age,Forbes Country,Role in the Company,Company,Description,LinkedIn Profile,linkedin_name,Exact match between the name ?,Did I had o perform a manual searh to find the linkedin profile ?
30 Under 30 - Europe - Social Entrepreneurs,Margherita Pagani,29,Argentina,Founder, Impacton.org,"Pagani aims to create an encyclopedia of blueprints for of purpose-driven projects for impact investing. The Italian-born and Argentina-based entrepreneur believes works with universities, governments and private companies to co-design programs based on proven models.",https://www.linkedin.com/in/magheritapagani,Margherita Pagani - CEO and Founder - Impacton.org ,FALSE,NO
30 Under 30 - Europe - Media & Marketing,Mohamed Khairat,25,Australia,Cofounders, Egyptian Streets,"Amin and Khairat founded Egyptian Streets back in 2012, less than two years after the Arab Spring. The digital publication that strives to address challenging issues--such as sexual harassment an
import glob
import os
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
# TODO: iterate over a specific folder
# TODO: increase the value by 100000 for each new file
input_folder = './../../cygnus_output/output/'
output_folder = './../../step_11_ready2merge_output/'
start_number = 1
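The two TODO notes above could be handled along these lines. This is a minimal sketch, not the author's implementation: the `*.osm` pattern, the function name `plan_offsets`, and the constant `ID_STEP` are assumptions made for illustration. Only the folder path, `start_number`, and the step of 100000 come from the snippet.

```python
import glob
import os

ID_STEP = 100000  # taken from the TODO: raise the start value by 100000 per file


def plan_offsets(folder, pattern="*.osm", start_number=1, step=ID_STEP):
    """Return (path, start) pairs, one per matching file, in sorted order.

    The file pattern is a hypothetical default; adjust it to the real input.
    """
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    return [(path, start_number + i * step) for i, path in enumerate(paths)]


if __name__ == "__main__":
    # Prints nothing if the folder does not exist or holds no matching files.
    for path, start in plan_offsets('./../../cygnus_output/output/'):
        print(start, path)
```

Sorting the paths keeps the assignment of start values stable between runs, which matters if the output files are merged later.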
"user_username","article_url","image_count","post_tags","recommends","reading_time","title","text","link_count"
"neuroecology","https://medium.com/@neuroecology/punctuation-in-novels-8f316d542ec4","22","{Writing,Literature,""Data Visualization""}","2670","3.67641509433962","Punctuation in novels","","1"
"eklimcz","https://medium.com/truth-labs/designing-data-driven-interfaces-a75d62997631","14","{""Data Visualization"",""Design Thinking"",UX}","2660","7.83867924528302","Designing Data-Driven Interfaces","","2"
"quincylarson","https://medium.com/free-code-camp/the-economics-of-working-remotely-28d4173e16e2","5","{Tech,""Life Lessons"",""Data Science"",Travel,Startup}","2068","3.95786163522013","Fitter. Happier. More productive. Working remotely.","Travel the world as a digital nomad. Surf a new beach every morning. Eat a different local cuisine each night.
Or just stay home all day in your pajamas.
It doesn’t really matter. You can get your work done either way.
More than 10% of Americans now work remotely.
I’
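The `post_tags` column in the preview above holds values such as `{Writing,Literature,"Data Visualization"}` once the CSV quoting is removed. A minimal sketch for turning such a value into a Python list follows. The helper name `parse_post_tags` is hypothetical, and the sketch assumes tags contain no embedded quotes or backslashes.

```python
import csv


def parse_post_tags(field):
    """Turn '{Tech,"Life Lessons"}' into ['Tech', 'Life Lessons'].

    Assumes the simple brace-delimited form seen in the preview; tags with
    embedded quotes or backslashes are not handled.
    """
    inner = field.strip()
    if inner.startswith("{") and inner.endswith("}"):
        inner = inner[1:-1]
    if not inner:
        return []
    # csv.reader handles the comma separation and the double-quoted tags.
    return next(csv.reader([inner]))


print(parse_post_tags('{Writing,Literature,"Data Visualization"}'))
# ['Writing', 'Literature', 'Data Visualization']
```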
---------------------------SELECT --------------------------------------------
select mps.user_username, -- 1st column
mps.article_url, -- 2nd column
mps.image_count, -- 3rd column
mps.post_tags, -- 4th column
mps.recommends, -- 5th column
mps.reading_time, -- 6th column
mps.title, -- 7th column
mpl.link_count, -- 8th column - this value comes from the left join, where we use a subquery
'' full_text, -- dummy 9th column
#!/bin/bash
echo 'Based on the work of Frederik Ramm https://lists.openstreetmap.org/pipermail/osmosis-dev/2013-October/001613.html'
CMDLINE=`
echo "--read-xml $1"
echo "--sort"
shift
while [[ $# -gt 0 ]]
do
echo "--read-xml $1"
echo "--sort"
select s.*,tag_name,title,mps.user_username,post_tags,article_url from (
SELECT post_id,ts_headline(text, keywords, 'MaxFragments=35,MaxWords=50,MinWords=6') as result
-- tweak the settings to suit your needs; text is the column that holds the post text
FROM medium_posts_text mptxt, plainto_tsquery('pg_catalog.english','training') as keywords
-- replace 'training' with the word you are searching for
WHERE to_tsvector(text) @@ keywords
) s
inner join medium_posts_tags mpt on mpt.post_id = s.post_id
inner join medium_posts_stats mps on mps.post_id = s.post_id
select * from (
select
regexp_split_to_table(lower(post_text), '\s+') as word
, count(1) as word_count
from
(select post_text from
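The visible part of the query above splits lower-cased post text on whitespace and counts each word. The same idea can be sketched in Python for texts already loaded in memory. The function name `word_counts` is hypothetical, and unlike `regexp_split_to_table` this sketch drops the empty strings that leading or trailing whitespace would produce.

```python
import re
from collections import Counter


def word_counts(texts):
    """Count whitespace-separated, lower-cased words across all texts."""
    counts = Counter()
    for text in texts:
        # Mirrors regexp_split_to_table(lower(post_text), '\s+'),
        # except that empty fragments are skipped.
        counts.update(w for w in re.split(r"\s+", text.lower()) if w)
    return counts


counts = word_counts(["The cat the", "cat  sat"])
print(counts.most_common(2))
```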
"user_username","article_url","image_count","post_tags","recommends","reading_time","title","link_count"
"ageitgey","https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471","12","{""Machine Learning""}","4132","14.0216981132075","Machine Learning is Fun!","12"
"tonyaub","https://medium.com/swlh/no-ui-is-the-new-ui-ab3f7ecec6b3","12","{Design,""Artificial Intelligence"",UI}","3666","7.7877358490566","No UI is the New UI","9"
"cdixon","https://medium.com/@cdixon/eleven-reasons-to-be-excited-about-the-future-of-technology-ef5f9b939cb2","32","{Technology,""Artificial Intelligence"",Future,Robotics,Space}","3658","11.1047169811321","Eleven Reasons To Be Excited About The Future of Technology","17"
"2noame","https://medium.com/basic-income/deep-learning-is-going-to-teach-us-all-the-lesson-of-our-lives-jobs-are-for-machines-7c6442e37a49","5","{""Artificial Intelligence"",""Machine Learning"",""Basic Income""}","3101","13.7276729559748","Deep Learning Is Going to Teach Us All the Lesson of Our Lives: Jobs
User Username Recommends
stevenlevy 12,636
ageitgey 9,605
cdixon 4,519
perborgen 4,215
tonyaub 3,666
olivercameron 3,552
2noame 3,151
GilFewster 2,608
intercom 2,270
@baditaflorin
baditaflorin / medium_top_1000_tags.csv
Created June 4, 2017 11:14
This is based on a scraping project that I did, where I downloaded the list of all the posts from medium.com https://medium.com/@baditaflorin
"tag_name","count_tag_name","avg_reading_time","avg_recommends","avg_image_count","distinct_users","avg_post_data"
"Startup","134323","2.71552486545965","13.9311510314689219","1.7488739828622053","61152","2016-04-14 18:21:12.174416+03"
"Life","104197","1.78386182016248","9.0952714569517357","0.96809888960334750521","50767","2016-05-15 04:42:14.355121+03"
"Politics","99301","3.14315696061383","7.8097400831814383","1.2824543559480771","44825","2016-06-28 05:55:29.552608+03"
"Entrepreneurship","94911","2.79053337648529","14.2673241247063038","1.5481977852935908","43454","2016-04-19 22:28:35.956523+03"
"Life Lessons","94414","2.40382926434131","13.9626114771114453","1.1250450145105599","45045","2016-06-02 19:01:39.876276+03"
"Travel","80332","2.97209768940031","3.3994672110740427","4.3289224717422696","35644","2016-04-19 07:09:28.578+03"
"Design","75555","2.80802184122601","24.1416848653298921","3.5368142412811859","36471","2016-04-08 01:06:35.247556+03"
"Education","68855","2.74605567601954","6.1865369254229903"
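The header of this file declares seven columns, while the last row shown in the preview has only four fields. A minimal sketch for separating complete rows from short ones when loading such a file follows. The function name `split_rows` is hypothetical, and the sketch only compares field counts; it does not try to repair the short rows.

```python
import csv


def split_rows(lines):
    """Return (header, good_rows, bad_rows).

    bad_rows holds (line_number, row) pairs for rows whose field count
    differs from the header's.
    """
    reader = csv.reader(lines)
    header = next(reader)
    good, bad = [], []
    for row in reader:
        if len(row) == len(header):
            good.append(row)
        else:
            bad.append((reader.line_num, row))
    return header, good, bad
```

With a real file, the call would look like `split_rows(open(path, newline=""))`, where `path` is wherever the CSV was saved.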