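I have some HTML pages, but only the HTML files, without the images. I want to download the images, save them locally, and then change the image references in the HTML so that they point to the local files. My rough plan is to parse the pages with BeautifulSoup, download the images, and write the modified elements back. Is there an existing Python library for this? Or, better still, is there some other tool that does the job more simply?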
This post was last edited by dakkor on 2012-8-24 20:35
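Anyway, I spent a little time writing a prototype. The code is ugly, has no exception handling, and probably has plenty of bugs, but it will do for now. It calls wget to do the downloading; I couldn't be bothered to write the file-download code with urllib2, and wget is robust anyway.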
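[mw_shl_code=python,true]import os
import re
import subprocess
import sys
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

if len(sys.argv) < 2 or not os.path.isfile(sys.argv[1]):
    sys.exit('usage: pass the path of an existing HTML file')
Fullpath = sys.argv[1]

# Images go into a folder next to the HTML file, named after it.
filePath = os.path.dirname(Fullpath)
fileName = os.path.splitext(os.path.basename(Fullpath))[0]
downIMGpath = os.path.join(filePath, fileName)
print downIMGpath
if not os.path.exists(downIMGpath):
    os.mkdir(downIMGpath)

fp = open(Fullpath, 'r')
soup = BeautifulSoup(fp.read())
fp.close()

# Only <img> tags that actually carry a src attribute.
for i, imgLink in enumerate(soup.findAll('img', src=True)):
    src = imgLink['src']
    # Keep only the extension, e.g. http://host/a/b.jpg -> .jpg
    # (relative URLs are not handled).
    imgsubName = re.sub(r'https?://.+/.+\.', '.', src)
    localName = str(i) + imgsubName
    # Passing an argument list avoids shell quoting problems with odd URLs.
    subprocess.call(['wget', src, '-nv', '-t', '10', '-c', '-Y', 'on',
                     '-O', os.path.join(downIMGpath, localName)])
    # Point src at the local copy, relative to the HTML file.
    imgLink['src'] = fileName + '/' + localName

# Overwrite the original HTML with the rewritten document.
fp = open(Fullpath, 'w')
fp.write(soup.prettify())
fp.close()
[/mw_shl_code]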
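If you would rather not shell out to wget, the download step could probably be done in Python itself. Below is a minimal sketch, assuming Python 3 and only the standard library (`urllib.request` takes the place of urllib2 there). The helper names `local_image_name` and `download_image` and the 30-second timeout are my own choices for illustration, not part of the prototype above, and there are no retries or resumed downloads as with wget's `-t` and `-c`.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def local_image_name(index, url):
    """Build a local file name such as '3.jpg' from a counter and the URL's extension.

    Hypothetical helper: only the path part of the URL is inspected,
    so a query string like '?x=1' does not end up in the file name.
    """
    ext = os.path.splitext(urlparse(url).path)[1]
    return str(index) + ext


def download_image(url, dest_dir, index, timeout=30):
    """Download url into dest_dir and return the path of the saved file.

    Hypothetical helper: no retry or resume logic, errors propagate to the caller.
    """
    dest = os.path.join(dest_dir, local_image_name(index, url))
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)  # stream the body to disk in chunks
    return dest
```

In the loop you would call `download_image(src, downIMGpath, i)` in place of the wget command and use the base name of the returned path for the new `src` value.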