
pattern

  • pattern (git@github.com:clips/pattern.git): Pattern is a web mining module for Python. It has tools for:

    • Data Mining: web services (Google, Twitter, Wikipedia), a web crawler, an HTML DOM parser
    • Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet
    • Machine Learning: vector space model, clustering, classification (KNN, SVM, Perceptron)
    • Network Analysis: graph centrality and visualization

    It is well documented, thoroughly tested with 350+ unit tests, and comes bundled with 50+ examples. The source code is BSD-licensed.
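A minimal sketch of Pattern's NLP calls (assuming pattern 3.6 on Python 3; the sample sentences are placeholders):

from pattern.en import parse, sentiment

# Part-of-speech tagging: returns a slash-tagged string.
print(parse("The quick brown fox jumped over the lazy dog."))

# Sentiment analysis: returns a (polarity, subjectivity) pair.
print(sentiment("Pattern is a well-documented and genuinely useful library."))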

tools

  • BeautifulSoup (bs4) official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc
  • Scrapy official documentation: http://doc.scrapy.org/en/latest
  • mechanize official documentation: http://wwwsearch.sourceforge.net/mechanize
  • Scrapy commands: https://doc.scrapy.org/en/latest/topics/commands.html
  • Comparison between Portia and ParseHub: https://www.parsehub.com/blog/portia-vs-parsehub-comparison-which-alternative-is-the-best-option-for-web-scraping/
  • Twint: https://github.com/twintproject/twint
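A minimal sketch of the usual requests + BeautifulSoup workflow (the URL is a placeholder; requests is assumed to be installed alongside bs4):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

# Print the target and text of every hyperlink on the page.
for a in soup.find_all("a", href=True):
    print(a["href"], a.get_text(strip=True))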

  • awesome-osint

  • spiderfoot

  • SpiderFoot is an open source intelligence (OSINT) automation tool. It integrates with just about every data source available and utilises a range of methods for data analysis, making that data easy to navigate.

  • SpiderFoot has an embedded web server that provides a clean and intuitive web-based interface, but it can also be used entirely from the command line. It's written in Python 3 and GPL-licensed.
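Per the SpiderFoot README, the embedded web server is started from the command line and the interface is then opened in a browser (address and port are your choice):

python3 ./sf.py -l 127.0.0.1:5001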

  • trape

  • Trape is an OSINT analysis and research tool that lets people track and execute intelligent social-engineering attacks in real time. It was created to show how large Internet companies can obtain confidential information, such as the session status of their websites or services, and control users through the browser without their knowledge. It has since evolved to help government organizations, companies, and researchers track cybercriminals.

twitter

  • socialbearing: enter a keyword to generate an analytics dashboard
  • twint
  • twint source-code walkthrough
  • twintproject
  • Twint-Distributed [IN PROGRESS]
  • twint_kibana: Twint is an advanced Twitter scraping tool written in Python that allows for scraping Tweets from Twitter profiles without using Twitter's API.
    • hashtags: topics
    • cashtags: finance-related topics, e.g. stocks

Related packages (pip list from the author's environment):

aiodns 2.0.0
aiohttp 3.5.4
aiohttp-socks 0.2.2
async-timeout 3.0.1
attrs 19.1.0
beautifulsoup4 4.7.1
cchardet 2.1.4
certifi 2019.3.9
cffi 1.12.3
chardet 3.0.4
elasticsearch 7.0.0
fake-useragent 0.1.11
geographiclib 1.49
geopy 1.19.0
idna 2.8
idna-ssl 1.1.0
multidict 4.5.2
numpy 1.16.3
oauthlib 3.0.1
pandas 0.24.2
pip 19.1
pycares 3.0.0
pycparser 2.19
PySocks 1.6.8
python-dateutil 2.8.0
pytz 2019.1
requests 2.22.0
requests-oauthlib 1.2.0
schedule 0.6.0
setuptools 41.0.1
six 1.12.0
soupsieve 1.9.1
tweepy 3.7.0
twint 1.2.3 (/home/james/app/twitter_crawler/src/twint)
typing 3.6.6
typing-extensions 3.7.2
urllib3 1.25.2
wheel 0.33.1
yarl 1.3.0
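Given that elasticsearch appears in the package list above, the twint_kibana setup evidently indexes Twint output into Elasticsearch for visualization in Kibana; Twint's -es flag does this (assuming Elasticsearch is listening on localhost:9200):

twint -u username -es localhost:9200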

Twint utilizes Twitter's search operators to let you scrape Tweets from specific users, scrape Tweets relating to certain topics, hashtags & trends, or sort out sensitive information from Tweets like e-mail and phone numbers. I find this very useful, and you can get really creative with it too.

Twint also makes special queries to Twitter allowing you to also scrape a Twitter user's followers, Tweets a user has liked, and who they follow without any authentication, API, Selenium, or browser emulation.
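The same scrapes can be driven from Python; a minimal sketch of Twint's Config/run API (the username, search term, and output file are placeholders):

import twint

c = twint.Config()
c.Username = "username"        # profile to scrape
c.Search = "Trevor Noah"       # optional keyword filter
c.Limit = 100                  # stop after roughly 100 Tweets
c.Store_csv = True             # write results to CSV
c.Output = "tweets.csv"

twint.run.Search(c)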

Display Tweets by verified users that Tweeted about Trevor Noah.

twint -s "Trevor Noah" --verified

Scrape Tweets within a 1 km radius of the Hofburg in Vienna and export them to a CSV file.

twint -g="48.2045507,16.3577661,1km" -o file.csv --csv

Collect Tweets published since 2019-10-11 21:30:15.

twint -u username --since "2019-10-11 21:30:15"

Resume a search, starting from the last saved Tweet in the provided file.

twint -u username --resume file.csv

other

  • Python modules for scraping: