@emraher
Forked from anonymous/findsimilar.py
Created October 27, 2015 17:50

findsimilar

Using the whoosh Python library, this script indexes text documents and then lets you find documents similar to a given one.

Requirements

Requires python and whoosh. To install whoosh use:

$ pip install whoosh

Preliminary structures

Notes directory structure

The script accepts the following note directory structure:

$ tree -a -L 1
.
├── .index
├── archive
├── findsimilar.py

All of the notes are contained within the archive directory. Note that the directory .index is created by the script if it does not exist.

Notes form

My notes are structured in the following form:

A reinforcing feedback loop creates more input to a stock the more that is already within it. It enhances whatever direction of change is imposed on it.

For example population growth, company profits, pollution etc..

They exist when a system element has the ability to reproduce itself or to grow at a constant fraction of itself.

----

{Meadows2008t}
{1502141701}

@systems-theory
@feedback-loops

The content of the note is all text above the Markdown horizontal rule ----. Below it is the metadata: links to references, links to the UIDs of other notes, and tags prefixed with @. The script currently ignores everything below the horizontal rule. It also skips any line that begins with !.
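The note layout described above can be sketched as a small parser. This is only an illustration, not part of the script: the function name split_note and the returned keys are hypothetical, and the sketch assumes links are written in braces and tags start a line with @, as in the sample note.

```python
import re


def split_note(text):
    """Split a note into its content and the metadata below the '----' break."""
    content_lines, meta_lines = [], []
    target = content_lines
    for line in text.splitlines():
        # Like the script, treat the first line containing '----' as the break.
        if target is content_lines and "----" in line:
            target = meta_lines
            continue
        target.append(line)
    meta = "\n".join(meta_lines)
    return {
        "content": "\n".join(content_lines).strip(),
        # Assumption: references and note UIDs are both written as {…}.
        "links": re.findall(r"\{([^{}]+)\}", meta),
        # Assumption: a tag is an @ at the start of a line followed by non-space text.
        "tags": re.findall(r"^@(\S+)", meta, flags=re.MULTILINE),
    }


note = """A reinforcing feedback loop creates more input to a stock.

----

{Meadows2008t}
{1502141701}

@systems-theory
@feedback-loops
"""
parsed = split_note(note)
print(parsed["links"])  # ['Meadows2008t', '1502141701']
print(parsed["tags"])   # ['systems-theory', 'feedback-loops']
```

Only the "content" part corresponds to what the script indexes; the links and tags are extracted here purely to show the structure.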

Usage

Note: The script has only been tested using a notes directory structure similar to that shown in the section Notes directory structure.

Three variables at the top of the script need to be set:

  • notesDir - Set this to the directory that contains your notes. Default is archive.
  • notesFileExtension - Set this to the file extension of your notes. Default is md.
  • indexDir - The folder that holds the whoosh index of your notes. It is created if it does not exist. Default is .index.
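To check which files these settings would pick up before building an index, something like the following sketch may help. The helper name list_notes is hypothetical; it builds the same kind of pattern the script passes to glob, so it matches only files directly inside the notes directory, not files in subfolders.

```python
import glob
import os


def list_notes(notes_dir="archive", extension="md"):
    """Return the note paths matched by '<notes_dir>/*.<extension>'."""
    pattern = os.path.join(notes_dir, "*." + extension)
    # glob does not descend into subdirectories for a plain '*' pattern.
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    for path in list_notes():
        print(path)
```

If the list comes back empty, the notes directory or the file extension is probably set incorrectly.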

Once these are set, change into the directory that contains the script and build the index:

$ python findsimilar.py createindex

You can then find the notes that share rare words with a given note:

$ python findsimilar.py search <note path>

For example:

$ python findsimilar.py search "archive/1502141701 Balancing feedback loops in systems.md"
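
The idea of matching notes on shared rare words can be illustrated without whoosh. The sketch below is a simplified stand-in and not whoosh's own scoring: it weights each word by its inverse document frequency and ranks the other notes by the summed weight of the words they share with the target. The function names and the toy notes are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rank_similar(notes, target):
    """Rank the other notes by the summed rarity (IDF) of words shared with target."""
    words = {name: tokenize(text) for name, text in notes.items()}
    doc_freq = Counter(w for ws in words.values() for w in ws)
    n = len(notes)
    # A word that appears in every note gets weight 0; rarer words weigh more.
    idf = {w: math.log(n / df) for w, df in doc_freq.items()}
    scores = {
        name: sum(idf[w] for w in ws & words[target])
        for name, ws in words.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


notes = {
    "reinforcing": "a reinforcing feedback loop amplifies change in a stock",
    "balancing": "a balancing feedback loop resists change in a stock",
    "recipe": "a recipe for bread needs flour and a warm oven",
}
ranked = rank_similar(notes, "reinforcing")
print(ranked[0][0])  # balancing
```

The recipe note shares only the word "a" with the target, which appears in every note and therefore contributes nothing, so it ranks last.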

# findsimilar.py (written for Python 3)
import glob
import os
import sys

from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in, open_dir

# USER SET PARAMETERS ############
notesDir = "archive"  # set this to the (relative to the script) folder that contains your notes.
notesFileExtension = "md"  # set this to the file extension of the notes you want to query (txt, md, markdown etc.)
indexDir = ".index"  # whoosh index folder
##################################


def createIndex():
    """ Create index for whoosh to be able to query """
    if not os.path.exists(indexDir):
        os.makedirs(indexDir)
    schema = Schema(title=TEXT(stored=True),
                    path=ID(stored=True),
                    content=TEXT(stored=True))
    ix = create_in(indexDir, schema)
    writer = ix.writer()
    for filename in glob.glob(notesDir + '/*.' + notesFileExtension):
        noteContent = ""
        with open(filename, 'r', encoding='utf-8') as myfile:
            for line in myfile:
                # Stop at the '----' break: everything below it is metadata.
                if '----' in line:
                    break
                # Skip lines that start with '!'.
                if not line.startswith('!'):
                    noteContent += line
        writer.add_document(title=os.path.basename(filename),
                            path=filename,
                            content=noteContent)
    writer.commit()
    print("index created")


def searchSimilar(fullfilename):
    """ Search for similar documents using a document pathname that
    has already been indexed.
    """
    ix = open_dir(indexDir)
    with ix.searcher() as searcher:
        filename = os.path.basename(fullfilename)
        # The path must match the one stored at index time exactly.
        docnum = searcher.document_number(path=fullfilename)
        if docnum is None:
            print("This document has not been indexed")
        else:
            r = searcher.more_like(docnum, 'content', numterms=20)
            if len(r) > 1:
                header = "Similar files to '" + filename.replace(".md", "") + "'"
                print("\n" + header + "\n" + "-" * len(header) + "\n")
                for hit in r:
                    print(hit['title'].replace(".md", ""))
                    print(" score: " + str(hit.score) + "\n")
                # key_terms() yields (term, score) pairs; print only the terms.
                print("keywords: " + ", ".join(term for term, _ in r.key_terms('content')))


def printUsage():
    print("usage:")
    print(" python findsimilar.py createindex")
    print(" python findsimilar.py search <filepath>")


def main():
    args = sys.argv[1:]
    if args and args[0] == 'createindex':
        createIndex()
    elif len(args) >= 2 and args[0] == 'search':
        searchSimilar(args[1])
    else:
        # Missing or unknown arguments: show usage and exit with an error.
        printUsage()
        sys.exit(1)


if __name__ == "__main__":
    main()