Created
December 10, 2019 04:26
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# twut walkthrough\n",
"\n",
"How to get here?\n",
"\n",
"We'll assume you have the [Anaconda distribution](https://www.anaconda.com/) installed, or at least Python 3.7+ and Jupyter Notebooks.\n",
"\n",
"\n",
"```\n",
"$ git clone https://github.com/archivesunleashed/twut.git\n",
"$ cd twut\n",
"$ mvn clean install\n",
"\n",
"$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook /path/to/spark-3.0.0-preview-bin-hadoop2.7/bin/pyspark --py-files /path/to/twut/target/twut.zip --driver-class-path /path/to/twut/target/twut-0.0.1-SNAPSHOT-fatjar.jar --jars /path/to/twut/target/twut-0.0.1-SNAPSHOT-fatjar.jar\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's import `twut`:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from twut import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, to use `twut` we need to load some line-oriented JSON Twitter data as a DataFrame. The repo includes three example resources collected from the Twitter Sample API using the [`sample`](https://github.com/docnow/twarc#sample) command in [`twarc`](https://github.com/docnow/twarc)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = \"/home/nruest/Projects/au/twut/src/test/resources/500-sample.jsonl\"\n",
"df = spark.read.json(path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've loaded 500 tweets from the Sample API to work with, and we can access them through the DataFrame `df`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the hashtags, and assign the hashtags DataFrame that `twut` creates for us to the variable `hashtags`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"hashtags = SelectTweet.hashtags(df)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------------+\n",
"|hashtags |\n",
"+------------------------+\n",
"|安元江口と夜あそび |\n",
"|DavidoDidntCum |\n",
"|DEMoniocratas |\n",
"|もっとホットなクリスマス|\n",
"|Tenerife |\n",
"|مساءالخير |\n",
"|FakeNews |\n",
"|CyberMonday |\n",
"|INEC |\n",
"|killarney |\n",
"+------------------------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"hashtags.show(10, False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How about images?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"image_urls = SelectTweet.imageUrls(df)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------------------------------------------+\n",
"|image_url |\n",
"+-----------------------------------------------+\n",
"|https://pbs.twimg.com/media/EKjNNRFXsAANHyQ.jpg|\n",
"|https://pbs.twimg.com/media/EKvWq8LXsAE_HhV.jpg|\n",
"|https://pbs.twimg.com/media/EKx9va5XUAEKcry.jpg|\n",
"|https://pbs.twimg.com/media/EKyNK0-WoAMDou3.jpg|\n",
"|https://pbs.twimg.com/media/EKyHOyZVUAE3GX6.jpg|\n",
"|https://pbs.twimg.com/media/EKwsNH-UYAAJuxZ.jpg|\n",
"|https://pbs.twimg.com/media/EKyZ3k2VUAEMltk.jpg|\n",
"|https://pbs.twimg.com/media/EKxI3nPVUAEkhee.jpg|\n",
"|https://pbs.twimg.com/media/EKyaQk0WsAAvsyP.jpg|\n",
"|https://pbs.twimg.com/media/EKyat3IWoAAJgq2.jpg|\n",
"+-----------------------------------------------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"image_urls.show(10, False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How about filtering out retweets?"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"230"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"no_retweets = FilterTweet.removeRetweets(df)\n",
"no_retweets.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What do the users look like?"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----------------+---------------+-------------+-------------------+----------------------+--------------------------------+--------------+--------------+--------+\n",
"|favourites_count|followers_count|friends_count|id_str |location |name |screen_name |statuses_count|verified|\n",
"+----------------+---------------+-------------+-------------------+----------------------+--------------------------------+--------------+--------------+--------+\n",
"|8302 |101 |133 |1027887558032732161|nct🌱 |车美 |M_chemei |3720 |false |\n",
"|2552 |73 |218 |2548066344 |null |ひーこ☆禿げても愛せ |heeko_gr_029 |15830 |false |\n",
"|4305 |1715 |98 |715850628 |0179.Kuwait♡دار جابر |Danahdenou |aldanah_94 |74967 |false |\n",
"|1870 |337 |53 |1081163420748046337|null |翔 |yoyoyopisannn |10702 |false |\n",
"|1544 |246 |240 |703120446 |Rio de Janeiro, Brasil|vilixo |vinismachadoo |16273 |false |\n",
"|2331 |91 |83 |973424490934714368 |日本 山口 |イサオ(^^)最近ディスクにハマル🎵|isao777sp2 |2137 |false |\n",
"|34258 |366 |562 |716598636247777281 |Johore, Malaysia |kimî |kimeowmy |46484 |false |\n",
"|0 |24 |7 |2587221716 |液晶の裏側 |貞子ちゃんbot |sadako_okadas |67549 |false |\n",
"|115 |123 |149 |1221632856 |TANJUNGPINANG |Dian Ramadita |dian05ramadita|631 |false |\n",
"|28 |5 |141 |1051802857610072064|Bayern, Deutschland |matias |matyas_0385 |32 |false |\n",
"+----------------+---------------+-------------+-------------------+----------------------+--------------------------------+--------------+--------------+--------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"users = SelectTweet.userInfo(no_retweets)\n",
"users.show(10, False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}