// Contents from the last week; `page` is currently unused (see the pagination notes below).
def findWeeklyContents(page: Int = 1, amount: Int = 8): List[LightContent] =
  toLightContents(coll.find("date" $gt get_last_week()).limit(amount))

def findMonthlyContents(page: Int = 1, amount: Int = 8): List[LightContent] =
  toLightContents(coll.find("date" $gt get_last_month()).limit(amount))

// Most-viewed contents of the last week, capped at `amount` results.
def findWeeklyPopularContents(page: Int = 1, amount: Int = 8): List[LightContent] =
  toLightContents(coll.find("date" $gt get_last_week()).sort(MongoDBObject("views" -> -1)).limit(amount))
kyu999 / mongo_query
Created November 21, 2014 06:17
Problem: when pulling 12 items per page out of a category holding a lot of data, issuing a fresh find query on every page load is far too slow, and the site will die quickly once traffic grows.
Possible approaches:
1. If the iterator is fast enough to pull even ~1000 documents, fetch one iterator and hand it from page to page.
   => But what if a user visits the first page and then jumps straight to the last one? => the iterator is out?
2. Slice-index by page (see the sketch below): with 12 items per page, page p slices from (p - 1) * 12 to p * 12, so page 3 slices 24 to 36.
   page 1 -> 0 * 12 to 1 * 12 = items 1 to 12
   page 2 -> 1 * 12 to 2 * 12 = items 13 to 24
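A minimal sketch of approach 2 with pymongo, assuming a collection coll; the function name find_page and the page size of 12 are illustrative, not from the gists:

from pymongo import MongoClient, DESCENDING

PAGE_SIZE = 12  # items per page, as in the notes above

def find_page(coll, page, page_size=PAGE_SIZE):
    # page is 1-based: page 1 covers items 1..12, page 2 covers 13..24, etc.
    start = (page - 1) * page_size
    # Slicing a pymongo cursor applies skip/limit on the server side,
    # so only one page of documents crosses the wire.
    return list(coll.find().sort("views", DESCENDING)[start:start + page_size])

# Illustrative usage (connection details are assumptions):
# coll = MongoClient()["manga"]["contents"]
# page_three = find_page(coll, 3)  # items 25..36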
kyu999 / background_color
Created November 20, 2014 18:20
background
#e9e6de
kyu999 / backup
Created November 20, 2014 04:06
backup manga mining
/*
// Backup: rebuild the weekly-popular list from cache entries keyed by rank.
// The parameter is renamed to `amount` to match its use in the range, and the
// map variable to `n` so it no longer shadows the parameter.
def constructContents(amount: Int): List[Content] =
  (1 to amount).map { n =>
    val id       = cache.get("week_pop_id_" + n)
    val title    = cache.get("week_pop_title_" + n)
    val views    = cache.get("week_pop_views_" + n)
    val category = cache.get("week_pop_category_" + n)
    val date     = cache.get("week_pop_date_" + n)
    // `content` comes from an enclosing scope; push each storage link onto a cache list.
    val storages = content.storages.map(link => cache.rpush("week_pop_storages_" + n, link))
    // ... (truncated in the original backup before the Content value is built)
  }.toList
*/
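The crawler side of the same cache layout can be sketched in Python with redis-py; the function name cache_weekly_popular and the dict shape of each content are assumptions, only the week_pop_* key names come from the backup above:

import redis

cache = redis.Redis()  # connection details are an assumption

def cache_weekly_popular(contents):
    # One set of keys per rank n, mirroring the week_pop_* reads above.
    for n, content in enumerate(contents, start=1):
        cache.set("week_pop_id_" + str(n), content["id"])
        cache.set("week_pop_title_" + str(n), content["title"])
        cache.set("week_pop_views_" + str(n), content["views"])
        cache.set("week_pop_category_" + str(n), content["category"])
        cache.set("week_pop_date_" + str(n), str(content["date"]))
        for link in content["storages"]:
            cache.rpush("week_pop_storages_" + str(n), link)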
kyu999 / backup
Created November 20, 2014 03:32
"""
def swap_iterate(self, cursor):
if(not cursor.alive):
return "Finish"
data = cursor.next()
swapped_date = self.swap_date(data.get("date"))
data["date"] = swapped_date
self.coll.save(data)
kyu999 / mangarian_design
Created November 17, 2014 12:08
mangarian design
The background should feature a big splash image from one of the works; the container should use calm, white-based colors.
kyu999 / pymongo
Created November 17, 2014 11:04
found = coll.find()[0:x]  # slice the cursor: server-side skip/limit down to the first x documents
found.alive   # True if the cursor could potentially return more data, else False
found.next()  # get the next document; raises StopIteration when no more data is in the coll
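Because next() raises once the data runs out, iterating the cursor with a for loop is the safer idiom; a short sketch (database and collection names are assumptions):

from pymongo import MongoClient

coll = MongoClient()["manga"]["contents"]  # assumed names

# The for loop stops cleanly when the cursor is exhausted, with no alive/next() bookkeeping:
for doc in coll.find()[0:12]:
    print(doc.get("title"))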
kyu999 / save_content
Created November 17, 2014 05:00
save content
'''
for each_content in whole_dom.cssselect("div.base"):
    # If the div contains extra h1 elements, it is very likely not a target content block.
    if len(each_content.cssselect("h1")) > 1:
        continue
    for each in each_content.iter():
        self.get_title(each)
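For context, whole_dom here would be an lxml document tree; a minimal sketch of building one (the URL is a placeholder, and cssselect() needs the cssselect package installed):

import lxml.html
import requests

html = requests.get("http://example.com/some-manga-page").text  # placeholder URL
whole_dom = lxml.html.fromstring(html)

for block in whole_dom.cssselect("div.base"):
    print(block.tag, len(block.cssselect("h1")))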
kyu999 / scrapy
Created November 16, 2014 13:48
scrapy genspider example example.com
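That command writes a spider skeleton roughly like the following (the exact template varies by Scrapy version):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Extraction logic goes here.
        pass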
kyu999 / agenda
Created November 15, 2014 08:50
mangarina agenda
Development steps:
 Work on excnn only first => then apply the same steps to the other sites
 Scrape each manga page => store the extracted data => design the site-wide crawling step => implement it

Tech stack:
 Python, MySQL (Cloud SQL served by Google), MongoDB (MongoLab), Scrapy, Scala, Play, Slick

The crawler is basically built in Python, and the web server with Play.
MongoDB is there to record every DOM fetched while crawling.
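A minimal sketch of that DOM-recording step with pymongo; the database/collection names and document shape are assumptions, not from the agenda:

from datetime import datetime, timezone
from pymongo import MongoClient

raw_pages = MongoClient()["crawler"]["raw_dom"]  # assumed names

def archive_dom(url, html):
    # Store one fetched page so it can be re-parsed later without re-crawling.
    raw_pages.insert_one({
        "url": url,
        "html": html,
        "fetched_at": datetime.now(timezone.utc),
    })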