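How to Use urllib in Python 3

Using urllib in Python 3 comes down to four steps: import the modules, send a request, process the response, and handle exceptions. Sending the request is the core step. It is done with the urllib.request module, which provides functions and classes for fetching data from URLs.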

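I. Importing urllib

Before using urllib, import the submodules you need. In Python 3, urllib is a package split into several submodules: urllib.request, urllib.parse, urllib.error, and urllib.robotparser.

import urllib.request      # opening URLs
import urllib.parse        # parsing, building, and encoding URLs
import urllib.error        # exceptions raised by urllib.request
import urllib.robotparser  # parsing robots.txt files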

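II. Sending requests

Sending a request with urllib.request is the most common operation, and the urllib.request.urlopen function is the simplest way to do it.

1. Sending a GET request

GET is the most common HTTP method and is used to retrieve data from a server. A GET request looks like this:

import urllib.request
url = 'http://www.example.com'
response = urllib.request.urlopen(url)
html = response.read()  # the body is returned as bytes
print(html.decode('utf-8'))  # decode to str before printing

urlopen returns a response object (an http.client.HTTPResponse for HTTP and HTTPS URLs). Its read method returns the response body as bytes.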

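2. Sending a POST request

POST is used to send data to the server. Build the request body, wrap it in a Request object, and pass that object to urlopen.

import urllib.request
import urllib.parse
url = 'http://www.example.com'
# URL-encode the form fields, then convert the result to bytes
data = urllib.parse.urlencode({'key': 'value'}).encode('utf-8')
req = urllib.request.Request(url, data=data)  # a non-None data makes this a POST
response = urllib.request.urlopen(req)
html = response.read()
print(html)

For a POST request, form data must be encoded with urllib.parse.urlencode and then converted to bytes, because the data argument does not accept a str.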
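How urllib picks the HTTP method can be checked without touching the network. The sketch below only builds Request objects and never sends them. The URL is the placeholder used above, and the PUT request is an added illustration, not something the text above covers.

```python
import urllib.parse
import urllib.request

url = 'http://www.example.com'
form = urllib.parse.urlencode({'key': 'value'}).encode('utf-8')

get_req = urllib.request.Request(url)                            # no body
post_req = urllib.request.Request(url, data=form)                # body present
put_req = urllib.request.Request(url, data=form, method='PUT')   # explicit method

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(put_req.get_method())   # PUT
print(post_req.data)          # b'key=value'
```

The method parameter overrides the default choice, which is otherwise GET without a body and POST with one.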

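III. Processing the response

Processing the response means extracting useful information from what the server returns. Besides the body, the response object exposes the status code and the headers.

1. Getting the status code

The status code tells you whether the request succeeded. Note that urlopen raises HTTPError for 4xx and 5xx responses, so the status seen here is normally a 2xx code.

import urllib.request
url = 'http://www.example.com'
response = urllib.request.urlopen(url)
print(response.status)  # HTTP status code, e.g. 200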

2. Getting the headers

The headers carry metadata about the response.

import urllib.request
url = 'http://www.example.com'
response = urllib.request.urlopen(url)
print(response.getheaders())  # all headers as a list of (name, value) tuples
print(response.getheader('Content-Type'))  # value of a single header
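One practical use of the headers is choosing the right encoding when turning the body into text. The following is a minimal sketch, assuming the server declares a charset in Content-Type. The helper name decode_body is hypothetical, and an email.message.Message stands in for response.headers (which is a subclass of it) so that the example runs offline.

```python
from email.message import Message

def decode_body(raw, headers, default='utf-8'):
    """Decode response bytes using the charset from Content-Type, if present."""
    charset = headers.get_content_charset() or default
    return raw.decode(charset, errors='replace')

# Stand-in for response.headers so the sketch runs without a network
headers = Message()
headers['Content-Type'] = 'text/html; charset=gbk'
raw = '你好'.encode('gbk')

print(decode_body(raw, headers))  # 你好
```

With a real response, the call would be decode_body(response.read(), response.headers).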

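IV. Handling exceptions

Network requests can fail in many ways, and those failures need to be handled. The two most common exceptions are URLError and HTTPError, both defined in urllib.error.

1. Handling URLError

URLError is raised when the request cannot be completed, for example because of a DNS failure, a refused connection, or an unsupported URL scheme.

import urllib.request
import urllib.error
url = 'http://www.example.com'
try:
    response = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    print(e.reason)  # reason for the failure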

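2. Handling HTTPError

HTTPError is a subclass of URLError that is raised when the server answers with an HTTP error status such as 404 or 500. Because it is a subclass, an except clause for HTTPError must come before one for URLError when both are caught.

import urllib.request
import urllib.error
url = 'http://www.example.com'
try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    print(e.code)  # HTTP status code
    print(e.reason)  # reason phrase
    print(e.headers)  # response headers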
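The two exceptions are usually handled together. The sketch below shows one way to order the except clauses. The function name fetch and the 10-second timeout are illustrative choices, and the unsupported foo:// scheme is used only because it fails before any network access.

```python
import urllib.error
import urllib.request

def fetch(url, timeout=10):
    """Return (status, body) on success, or (status_or_None, reason) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as e:
        # The server replied, but with an error status (4xx or 5xx)
        return e.code, e.reason
    except urllib.error.URLError as e:
        # No usable reply: DNS failure, refused connection, bad scheme, ...
        return None, e.reason

# An unsupported scheme is rejected before any connection is attempted
print(fetch('foo://example'))  # (None, 'unknown url type: foo')
```

Swapping the two except clauses would send every HTTPError into the URLError branch, and the status code would be lost.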

V. Using a proxy

Sometimes requests have to go through a proxy server. ProxyHandler takes care of this.

import urllib.request
# Map each URL scheme to the proxy that should handle it
proxy = urllib.request.ProxyHandler({'http': 'http://proxy.example.com:8080'})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)  # urlopen now uses this opener globally
url = 'http://www.example.com'
response = urllib.request.urlopen(url)
html = response.read()
print(html)
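install_opener changes the behavior of every later urlopen call in the process. When only some requests should use the proxy, the opener can be kept local instead. This is a sketch under that assumption; the proxy address is the same placeholder as above, and the https entry is an added illustration.

```python
import urllib.request

# Hypothetical proxy address, reused from the example above
proxy = urllib.request.ProxyHandler({
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
})
opener = urllib.request.build_opener(proxy)

# Nothing is installed globally, so plain urlopen() calls are unaffected.
# A request goes through the proxy only when made with opener.open(url).
print(proxy.proxies['http'])  # http://proxy.example.com:8080
```

Passing an empty dictionary, ProxyHandler({}), has the opposite effect: it disables any proxy settings picked up from the environment.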

VI. Handling cookies

Cookies are handled with the http.cookiejar module together with the HTTPCookieProcessor class.

import http.cookiejar
import urllib.request
cookie_jar = http.cookiejar.CookieJar()  # stores cookies in memory
handler = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(handler)
url = 'http://www.example.com'
response = opener.open(url)  # cookies set by the server end up in cookie_jar
html = response.read()
print(html)
# Print the cookies
for cookie in cookie_jar:
    print(cookie)
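A CookieJar lives only in memory. To keep cookies between runs, http.cookiejar also offers MozillaCookieJar, which reads and writes a cookies.txt file. The sketch below is one possible arrangement; the helper name make_opener and the file location are assumptions made for the example.

```python
import http.cookiejar
import os
import tempfile
import urllib.request

def make_opener(cookie_file):
    """Build an opener whose cookies can be saved to and loaded from disk."""
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    try:
        # Reuse cookies from a previous run, including session cookies
        jar.load(ignore_discard=True, ignore_expires=True)
    except FileNotFoundError:
        pass  # first run: start with an empty jar
    handler = urllib.request.HTTPCookieProcessor(jar)
    return urllib.request.build_opener(handler), jar

# Hypothetical location; a temporary directory keeps the demo self-contained
cookie_file = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
opener, jar = make_opener(cookie_file)
# ... call opener.open(url) here, then persist whatever the server set:
jar.save(ignore_discard=True, ignore_expires=True)
print(os.path.exists(cookie_file))  # True
```

The ignore_discard flag matters because session cookies are marked for discard and would otherwise be skipped on both save and load.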

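VII. Parsing and building URLs

The urllib.parse module provides functions for parsing and building URLs.

1. Parsing a URL

The urlparse function splits a URL into six components: scheme, netloc, path, params, query, and fragment.

import urllib.parse
url = 'http://www.example.com/path;params?query=arg#frag'
parsed_url = urllib.parse.urlparse(url)
print(parsed_url)
# ParseResult(scheme='http', netloc='www.example.com', path='/path', params='params', query='query=arg', fragment='frag')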
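The parse result is a named tuple, so each component can be read by name, and the query string can be broken down further with parse_qs. A short sketch follows; the port number and the repeated tag parameter are added here for illustration and are not part of the URL used above.

```python
from urllib.parse import parse_qs, urlparse

parsed = urlparse('http://www.example.com:8080/path;params?query=arg&tag=a&tag=b#frag')

print(parsed.scheme)    # http
print(parsed.hostname)  # www.example.com
print(parsed.port)      # 8080
print(parsed.path)      # /path
print(parsed.params)    # params
print(parsed.fragment)  # frag

# parse_qs maps each key to a list, since a key may appear more than once
print(parse_qs(parsed.query))  # {'query': ['arg'], 'tag': ['a', 'b']}
```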

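2. Building a URL

The urlunparse function does the reverse: it assembles a URL from the six components.

import urllib.parse
scheme = 'http'
netloc = 'www.example.com'
path = '/path'
params = 'params'
query = 'query=arg'
fragment = 'frag'
# The tuple must contain exactly six items, in this order
constructed_url = urllib.parse.urlunparse((scheme, netloc, path, params, query, fragment))
print(constructed_url)  # http://www.example.com/path;params?query=arg#frag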
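In practice a URL is often derived from an existing one rather than assembled from scratch. The sketch below shows two common cases, replacing one component of a parsed URL and resolving a relative link with urljoin. The paths and query values are made up for the example.

```python
from urllib.parse import urlencode, urljoin, urlparse

base = urlparse('http://www.example.com/search')
# ParseResult is a named tuple, so _replace returns a modified copy
url = base._replace(query=urlencode({'q': 'python urllib', 'page': 2})).geturl()
print(url)  # http://www.example.com/search?q=python+urllib&page=2

# urljoin resolves a relative reference against a base URL
print(urljoin('http://www.example.com/docs/index.html', 'guide.html'))
# http://www.example.com/docs/guide.html
print(urljoin('http://www.example.com/docs/index.html', '/about'))
# http://www.example.com/about
```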

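VIII. URL encoding and decoding

The urllib.parse module also provides functions for URL encoding and decoding.

1. URL encoding

The quote function percent-encodes special characters in a string so that it can be used in a URL.

import urllib.parse
string = 'Hello World!'
encoded_string = urllib.parse.quote(string)
print(encoded_string)  # prints: Hello%20World%21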

2. URL decoding

The unquote function turns a percent-encoded string back into the original text.

import urllib.parse
encoded_string = 'Hello%20World%21'
decoded_string = urllib.parse.unquote(encoded_string)
print(decoded_string)  # prints: Hello World!
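quote and unquote have close relatives that behave differently around spaces and slashes, which is a frequent source of confusion. The comparison below is a small sketch using made-up input strings.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

print(quote('a b/c'))           # a%20b/c    ('/' is treated as safe by default)
print(quote('a b/c', safe=''))  # a%20b%2Fc  (nothing is treated as safe)
print(quote_plus('a b/c'))      # a+b%2Fc    (space becomes '+', '/' is encoded)
print(quote('中文'))             # %E4%B8%AD%E6%96%87 (UTF-8 bytes, percent-encoded)

print(unquote('a+b%2Fc'))       # a+b/c      ('+' is left alone)
print(unquote_plus('a+b%2Fc'))  # a b/c      ('+' becomes a space)
```

As a rule of thumb, quote fits path segments, while quote_plus matches the form encoding used in query strings.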

IX. Handling form data

The urlencode function in urllib.parse converts a dictionary of form fields into a URL-encoded string.

import urllib.parse
data = {'key1': 'value1', 'key2': 'value2'}
encoded_data = urllib.parse.urlencode(data)
print(encoded_data)  # prints: key1=value1&key2=value2
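The same encoded string serves both GET and POST; only where it is placed differs. The sketch below also shows the doseq flag for fields with several values. The field names and the /search path are invented for the example.

```python
from urllib.parse import urlencode

params = {'q': 'hello world', 'tag': ['a', 'b']}
query = urlencode(params, doseq=True)  # doseq expands the list into repeated keys
print(query)  # q=hello+world&tag=a&tag=b

# GET: the encoded string goes into the URL after '?'
get_url = 'http://www.example.com/search?' + query
# POST: the same string, converted to bytes, becomes the request body
post_body = query.encode('utf-8')

print(get_url)    # http://www.example.com/search?q=hello+world&tag=a&tag=b
print(post_body)  # b'q=hello+world&tag=a&tag=b'
```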

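X. Parsing HTML and XML

urllib itself cannot parse HTML or XML, but it combines well with standard-library modules such as html.parser and xml.etree.ElementTree.

1. Parsing HTML

The html.parser module parses HTML by calling handler methods as it encounters tags and text.

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    # Called for every opening tag, e.g. <h1>
    def handle_starttag(self, tag, attrs):
        print('Start tag:', tag)
    # Called for every closing tag, e.g. </h1>
    def handle_endtag(self, tag):
        print('End tag:', tag)
    # Called for the text between tags
    def handle_data(self, data):
        print('Data:', data)
parser = MyHTMLParser()
html_string = '<html><head><title>Test</title></head><body><h1>Hello World!</h1></body></html>'
parser.feed(html_string)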
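A common reason to parse a fetched page is to collect its links. The following sketch builds on the same HTMLParser approach; the class name LinkExtractor, the sample markup, and the base URL are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))

page = '<a href="/about">About</a> <a href="news.html">News</a> <a>no href</a>'
extractor = LinkExtractor('http://www.example.com/docs/')
extractor.feed(page)
print(extractor.links)
# ['http://www.example.com/about', 'http://www.example.com/docs/news.html']
```

With a real page, the argument to feed would be the decoded body returned by urlopen.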

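2. Parsing XML

The xml.etree.ElementTree module parses XML into a tree of elements.

import xml.etree.ElementTree as ET
xml_string = '<root><child>data</child></root>'
root = ET.fromstring(xml_string)
# Iterating over an element yields its direct children
for child in root:
    print(child.tag, child.text)  # prints: child data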

XI. Handling JSON data

urllib can be combined with the json module to work with JSON APIs.

1. Parsing JSON

The json.loads function parses JSON text into Python objects.

import json
import urllib.request
url = 'https://jsonplaceholder.typicode.com/posts/1'
response = urllib.request.urlopen(url)
data = response.read().decode('utf-8')  # bytes to str
json_data = json.loads(data)  # str to dict
print(json_data)

2. Generating JSON

The json.dumps function serializes Python objects into a JSON string.

import json
data = {'key': 'value'}
json_data = json.dumps(data)
print(json_data)  # prints: {"key": "value"}
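Putting the two halves together, a JSON body can be sent by serializing it, converting it to bytes, and declaring its type in the headers. This is a sketch only: the helper name build_json_request, the /posts endpoint, and the payload fields are assumptions, and the request is built but not sent.

```python
import json
import urllib.request

def build_json_request(url, payload):
    """Build a POST request whose body is the JSON form of payload."""
    body = json.dumps(payload).encode('utf-8')
    return urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )

# Hypothetical endpoint and payload
req = build_json_request('https://jsonplaceholder.typicode.com/posts',
                         {'title': 'hello', 'userId': 1})
print(req.get_method())  # POST
print(req.data)          # b'{"title": "hello", "userId": 1}'
```

Sending it would look like `with urllib.request.urlopen(req) as resp: reply = json.load(resp)`, since json.load accepts the binary response object directly.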

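XII. Downloading files

urllib can also download files. The urlretrieve function saves the content of a URL straight to a local path. It is part of the legacy interface inherited from Python 2, which the documentation says may be deprecated in the future.

import urllib.request
url = 'http://www.example.com/sample.pdf'
file_path = 'sample.pdf'
urllib.request.urlretrieve(url, file_path)  # fetch url and write it to file_path
print('File downloaded successfully.')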
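An alternative that avoids urlretrieve is to stream the response to disk with urlopen and shutil.copyfileobj. The sketch below assumes that approach; the function name download and the 30-second timeout are illustrative, and a file:// URL in a temporary directory stands in for a real server so the example runs offline.

```python
import pathlib
import shutil
import tempfile
import urllib.request

def download(url, file_path, timeout=30):
    """Stream a URL to disk in chunks instead of loading it all into memory."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(file_path, 'wb') as out:
        shutil.copyfileobj(response, out)
    return file_path

# Offline demo: copy a local file through a file:// URL
tmp = pathlib.Path(tempfile.mkdtemp())
source = tmp / 'source.bin'
source.write_bytes(b'x' * 100_000)
target = download(source.as_uri(), tmp / 'copy.bin')
print(target.stat().st_size)  # 100000
```

Because the body is copied in chunks, memory use stays flat even for large files.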

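XIII. Custom request headers

Request headers are customized by passing a headers dictionary when constructing the Request object.

import urllib.request
url = 'http://www.example.com'
# Some servers reject urllib's default User-Agent, so a browser-like one is sent
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
html = response.read()
print(html)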
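Headers can also be added after a Request is created, or set once on an opener so that every request made through it carries them. A brief sketch of both options follows, with example header values chosen arbitrarily.

```python
import urllib.request

# Per request: add_header sets or replaces one header on a Request
req = urllib.request.Request('http://www.example.com')
req.add_header('User-Agent', 'Mozilla/5.0')
req.add_header('Accept-Language', 'en-US')
print(req.header_items())

# Per opener: addheaders applies to every request made with opener.open()
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
```

Request stores header names in capitalized form, so 'User-Agent' is looked up as req.get_header('User-agent').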

XIV. Handling redirects

By default, urlopen follows HTTP redirects automatically. To customize this behavior, subclass HTTPRedirectHandler.

import urllib.request
class MyHTTPRedirectHandler(urllib.request.HTTPRedirectHandler):
    # Called for 301 (Moved Permanently) responses only
    def http_error_301(self, req, fp, code, msg, headers):
        print('Redirected to:', headers['Location'])
        # Let the default implementation follow the redirect
        return super().http_error_301(req, fp, code, msg, headers)
opener = urllib.request.build_opener(MyHTTPRedirectHandler)
urllib.request.install_opener(opener)
url = 'http://www.example.com'
response = urllib.request.urlopen(url)
html = response.read()
print(html)
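Another customization is to stop following redirects altogether, for instance to inspect the Location header yourself. One way to sketch this is to override redirect_request so that it declines every redirect; the class name NoRedirectHandler is made up for the example.

```python
import urllib.request

class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to build a follow-up request
        return None

# build_opener uses this subclass in place of the default redirect handler
opener = urllib.request.build_opener(NoRedirectHandler)
print(any(isinstance(h, NoRedirectHandler) for h in opener.handlers))  # True
```

With this opener, a 3xx response is raised as an HTTPError, and the target can be read from e.headers['Location'] in the except block.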

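XV. Using a context manager

The object returned by urlopen works as a context manager, so a with statement closes the response automatically, even if an exception occurs inside the block.

import urllib.request
url = 'http://www.example.com'
with urllib.request.urlopen(url) as response:
    html = response.read()
    print(html)
# The response is closed here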

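XVI. Caching

urllib has no built-in cache. Third-party libraries such as requests_cache patch the requests library, not urllib, so installing one has no effect on urllib.request.urlopen. A simple alternative is to cache response bodies yourself, for example with functools.lru_cache from the standard library.

import functools
import urllib.request
@functools.lru_cache(maxsize=128)
def fetch(url):
    # The response body is cached in memory, keyed by URL
    with urllib.request.urlopen(url) as response:
        return response.read()
url = 'http://www.example.com'
html = fetch(url)  # first call goes to the network
html = fetch(url)  # second call is served from the cache
print(html)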

XVII. Multithreaded downloads

Downloads spend most of their time waiting on the network, so several can run in parallel with the threading module.

import threading
import urllib.request
def download_file(url, file_path):
    urllib.request.urlretrieve(url, file_path)
    print(f'File {file_path} downloaded successfully.')
# Each entry pairs a URL with the local path to save it to
urls = [
    ('http://www.example.com/sample1.pdf', 'sample1.pdf'),
    ('http://www.example.com/sample2.pdf', 'sample2.pdf'),
]
threads = []
# Start one thread per download
for url, file_path in urls:
    thread = threading.Thread(target=download_file, args=(url, file_path))
    threads.append(thread)
    thread.start()
# Wait for all downloads to finish
for thread in threads:
    thread.join()
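With one thread per URL, a long list starts a long list of threads, and an exception inside a thread is easy to miss. A thread pool from concurrent.futures addresses both points. The sketch below is one possible version; the function names and the limit of four workers are illustrative choices.

```python
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def download(url, file_path):
    """Stream one URL to disk and return the path it was saved to."""
    with urllib.request.urlopen(url, timeout=30) as response, \
            open(file_path, 'wb') as out:
        shutil.copyfileobj(response, out)
    return file_path

def download_all(jobs, max_workers=4):
    """Download (url, file_path) pairs concurrently; return (done, failed)."""
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download, url, path): url for url, path in jobs}
        for future in as_completed(futures):
            try:
                done.append(future.result())
            except OSError as exc:  # URLError and HTTPError are OSError subclasses
                failed.append((futures[future], exc))
    return done, failed
```

A call such as download_all(urls), using the list of pairs from the example above, returns the saved paths along with any URLs that failed and the reason for each failure.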

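XVIII. Uploading files

Files can be uploaded with a POST request. URL-encoding a file's bytes with urlencode is not suitable for uploads: it inflates binary data and is not what upload endpoints normally expect. The simplest option is to send the raw bytes as the request body. Endpoints that expect an HTML form upload need a multipart/form-data body instead, which urllib does not build for you.

import urllib.request
url = 'http://www.example.com/upload'
file_path = 'sample.pdf'
with open(file_path, 'rb') as f:
    file_data = f.read()
# Send the raw bytes as the body and declare them as binary data
headers = {'Content-Type': 'application/octet-stream'}
req = urllib.request.Request(url, data=file_data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
html = response.read()
print(html)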
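For servers that expect a browser-style form upload, the body has to be in multipart/form-data format, which can be assembled by hand. The following is a minimal sketch for a single file, not a general multipart encoder; the function name, the form field name 'file', and the sample bytes are assumptions, and the request is built but not sent.

```python
import uuid
import urllib.request

def build_multipart_request(url, field_name, filename, file_bytes,
                            content_type='application/octet-stream'):
    """Build a multipart/form-data POST request carrying a single file."""
    boundary = uuid.uuid4().hex  # random separator that will not occur in the data
    head = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f'Content-Type: {content_type}\r\n'
        '\r\n'
    ).encode('utf-8')
    tail = f'\r\n--{boundary}--\r\n'.encode('utf-8')
    body = head + file_bytes + tail
    headers = {
        'Content-Type': f'multipart/form-data; boundary={boundary}',
        'Content-Length': str(len(body)),
    }
    return urllib.request.Request(url, data=body, headers=headers, method='POST')

# Hypothetical field name and file content
req = build_multipart_request('http://www.example.com/upload',
                              'file', 'sample.pdf', b'%PDF-1.4 demo bytes')
print(req.get_header('Content-type'))
```

The boundary announced in the Content-Type header must match the one used inside the body, which is why both are derived from the same variable.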

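XIX. Using SSL

urllib supports HTTPS, and the ssl module can be used to configure the SSL context for a request.

import ssl
import urllib.request
url = 'https://www.example.com'
# The default context verifies the server certificate and host name
context = ssl.create_default_context()
response = urllib.request.urlopen(url, context=context)
html = response.read()
print(html)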
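It can help to see what the default context actually enforces before changing anything. The sketch below only inspects and tightens a context without opening a connection; the minimum TLS version and the CA file path in the comment are illustrative assumptions.

```python
import ssl

context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True: certificates are verified
print(context.check_hostname)                    # True: the host name must match

# Optionally refuse protocol versions older than TLS 1.2
context.minimum_version = ssl.TLSVersion.TLSv1_2

# To trust a private certificate authority, load its bundle instead.
# 'internal-ca.pem' is a hypothetical path:
# context = ssl.create_default_context(cafile='internal-ca.pem')
```

Turning verification off would make the connection open to interception, so loading the right CA bundle is usually the better fix for certificate errors.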

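XX. Parsing robots.txt

The urllib.robotparser module parses robots.txt files and answers whether a given user agent may fetch a given URL.

import urllib.robotparser
url = 'http://www.example.com/robots.txt'
rp = urllib.robotparser.RobotFileParser()
rp.set_url(url)
rp.read()  # download and parse the robots.txt file
user_agent = 'Mozilla/5.0'
print(rp.can_fetch(user_agent, 'http://www.example.com'))  # True if fetching is allowed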
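The parser can also be fed rules directly through its parse method, which makes it easy to see how can_fetch decides without any network access. The rules and the crawler name below are invented for the example.

```python
import urllib.robotparser

# Made-up rules: everything is allowed except paths under /private/
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)  # parse() takes an iterable of lines, so no download is needed

print(rp.can_fetch('MyCrawler', 'http://www.example.com/index.html'))  # True
print(rp.can_fetch('MyCrawler', 'http://www.example.com/private/a'))   # False
```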