@allanruin
Created February 15, 2016 11:26
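# -*- coding: UTF-8 -*-
"""
February 15, 2016
Read a liblinear output file and turn it into a Kaggle-format submission.
When liblinear is told to output probabilities, the result looks like this:
head liresult.txt
labels 0 1
0 0.506272 0.493728
0 0.506272 0.493728
0 0.506272 0.493728
0 0.506272 0.493728
0 0.506272 0.493728
0 0.506272 0.493728
0 0.506272 0.493728
1 0.5 0.5
0 0.506272 0.493728
test_fn is the test file name; its first field (id) is extracted.
li_fn is the liblinear prediction file name; its third column, the positive-class probability, is extracted.
res_fn is the name of the final output file, which follows the Kaggle submission rules.
"""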
try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy
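"""
This implementation walks the test file with a for loop and advances the
matching liblinear result file by hand with next().
As first written it produced wrong output, with an extra blank line after every row,
because the probability field still ended with its line's own newline.
The field is stripped here before it is written.
"""
def toKaggleSubmission(test_fn, li_fn, res_fn):
    # Text mode ("r") keeps the lines as str on Python 3 as well.
    with open(test_fn, "r") as tf, open(li_fn, "r") as lf, open(res_fn, "w") as rf:
        ti = iter(tf)
        li = iter(lf)
        next(li)  # skip the liblinear "labels 0 1" header
        next(ti)  # skip the CSV header of the test file
        rf.write("id,click\n")  # header
        for line in ti:
            sid = line.split(",")[0]
            tline = next(li)
            # The last field keeps the trailing newline, so strip it before adding our own.
            p = tline.split(" ")[2].strip()
            rf.write("{0},{1}\n".format(sid, p))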
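"""
This implementation uses izip to combine the two iterators and walk them
together, which looks more elegant.
"""
def toKaggleSubmission(test_fn, li_fn, res_fn):
    # Text mode ("r") keeps the lines as str on Python 3 as well.
    with open(test_fn, "r") as tf, open(li_fn, "r") as lf, open(res_fn, "w") as rf:
        ti = iter(tf)
        li = iter(lf)
        next(li)  # skip the liblinear "labels 0 1" header
        next(ti)  # skip the CSV header of the test file
        rf.write("id,click\n")  # header
        x = 0
        for tline, line in izip(ti, li):
            sid = tline.split(",")[0]
            p = line.split(" ")[2]
            if x < 10:
                x += 1
                print("p={0}".format(p))  # debug: show the first 10 probabilities
            # No "\n" is needed in this write: p is the last field of its line
            # and still carries that line's trailing newline.
            rf.write("{0},{1}".format(sid, p))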
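"""
This one walks each of the two files separately to pull out the field it needs,
then makes a third pass to combine them into the desired result and write it out.
"""
def toKaggleSubmission(test_fn, li_fn, res_fn):
    # Text mode ("r") keeps the lines as str on Python 3 as well.
    with open(test_fn, "r") as tf, open(li_fn, "r") as lf, open(res_fn, "w") as rf:
        rf.write("id,click\n")  # header
        ids = []
        for line in tf:
            sid = line.split(",")[0]
            ids.append(sid)
        ps = []
        for line in lf:
            p = line.split(" ")[2]
            ps.append(p)
        # Start at 1 to skip the header line of each file.
        for i in range(1, len(ids)):
            # ps[i] still ends with "\n", so no newline is added here.
            rf.write("{0},{1}".format(ids[i], ps[i]))

if __name__ == "__main__":
    toKaggleSubmission("test.csv", "liresult.txt", "submi.csv")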
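"""
What puzzled me was why the write for the CSV header needs a newline while the
writes inside the loop do not. Thinking it over, the last element returned by
split(",") or split(" ") still carries the line's trailing newline.
That also explains why the p=xxx debug output printed along the way came with an extra line break.
Strings have a splitlines([keepends]) method that splits a string into a list of lines:
"Line breaks are not included in the resulting list unless keepends is given and true."
split with an explicit separator, by contrast, leaves the newline on the last field.
"""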
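
The note above about split and splitlines is easy to check on a single line. This is only a small sketch: the sample line is copied from the liblinear output shown at the top, and the variable names are made up for illustration.

```python
# One line in the liblinear probability-output format shown earlier.
line = "0 0.506272 0.493728\n"

# split with an explicit separator leaves the trailing newline on the last field.
fields_sep = line.split(" ")

# split with no argument splits on runs of whitespace, so the newline is dropped.
fields_ws = line.split()

# splitlines drops line breaks unless keepends is given and true.
lines_plain = "a\nb\n".splitlines()
lines_keep = "a\nb\n".splitlines(True)

print(repr(fields_sep[2]))
print(repr(fields_ws[2]))
print(lines_plain)
print(lines_keep)
```

So either calling split() without a separator or stripping the field before writing would let every write use an explicit "\n", instead of relying on the newline left over from the input line.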