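Background
Target site: https://unipass.customs.go.kr/csp/index.do
Site description: a Korean website (access it directly, without a VPN or proxy) that provides the national customs information service.
Requirements
Enter an M B/L - H B/L number and query the cargo progress information.
The content inside the red box is the main data to collect.
Site analysis
Copy the following two requests as cURL (from the browser's network panel).
Search endpoint 1 (list query, retrieveImpCargPrgsInfoLst.do):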
curl 'https://unipass.customs.go.kr/csp/myc/bsopspptinfo/cscllgstinfo/ImpCargPrgsInfoMtCtr/retrieveImpCargPrgsInfoLst.do?savedToken=MYC0405101Q_F2_savedToken&MYC0405101Q_F2_savedToken=NZSDTADT85O5XK1ZGSAGZHSPY03LH8UC' \
-H 'Accept: application/json, text/javascript, */*; q=0.01' \
-H 'Accept-Language: zh-CN,zh;q=0.9' \
-H 'Cache-Control: no-cache' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Cookie: JSESSIONID=00019jYqep6r3TRibGJQ14qji7nZ9I_eMZJIkVQoDOxj0WZHORHwSc2YiCYST17KtkXhKBMrq8g1_69scZUVSLa3_At2_zYmZjPzJJsEwaOpI8Kaqkw8N3jKmi94TFp8vMSN:csp11; WMONID=QuZCNjsieM1; MagicLineSession=qOkfpLVhMcFBkdSSXM1q' \
-H 'Origin: https://unipass.customs.go.kr' \
-H 'Pragma: no-cache' \
-H 'Referer: https://unipass.customs.go.kr/csp/index.do' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0' \
-H 'X-Requested-With: XMLHttpRequest' \
-H 'isAjax: true' \
-H 'sec-ch-ua: "Microsoft Edge";v="123", "Not:A-Brand";v="8", "Chromium";v="123"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Windows"' \
--data-raw 'firstIndex=0&page=1&pageIndex=1&pageSize=10&pageUnit=10&recordCountPerPage=10&qryTp=2&cargMtNo=&mblNo=HASLC02231200435&hblNo=&blYy=2024'
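To see at a glance which form fields the captured request actually fills in, the --data-raw body above can be decoded with the Python standard library. This is only an inspection aid, and the variable names are illustrative rather than part of the original crawler.

```python
from urllib.parse import parse_qsl

# Form body copied from the captured request for search endpoint 1.
raw_body = (
    "firstIndex=0&page=1&pageIndex=1&pageSize=10&pageUnit=10"
    "&recordCountPerPage=10&qryTp=2&cargMtNo=&mblNo=HASLC02231200435"
    "&hblNo=&blYy=2024"
)

# keep_blank_values=True keeps empty fields such as cargMtNo and hblNo.
fields = dict(parse_qsl(raw_body, keep_blank_values=True))

filled = {name: value for name, value in fields.items() if value}
empty = [name for name, value in fields.items() if not value]

print(filled["mblNo"], filled["blYy"])
print(empty)
```

Apart from the paging fields and qryTp, only mblNo and blYy carry values; cargMtNo and hblNo are sent empty.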
Search endpoint 2 (detail query, retrieveImpCargPrgsInfoDtl.do):
curl 'https://unipass.customs.go.kr/csp/myc/bsopspptinfo/cscllgstinfo/ImpCargPrgsInfoMtCtr/retrieveImpCargPrgsInfoDtl.do' \
-H 'Accept: application/json, text/javascript, */*; q=0.01' \
-H 'Accept-Language: zh-CN,zh;q=0.9' \
-H 'Cache-Control: no-cache' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H $'Cookie: MYC_RCNT_MENU=%3Cli%3E%3Ca%20href%3D%22javascript%3Amyc_f_goRecentMenu(\'MYC_MNU_00000450\')%3B%22%3E%EC%88%98%EC%9E%85%ED%99%94%EB%AC%BC%20%EC%A7%84%ED%96%89%EC%A0%95%EB%B3%B4%3C%2Fa%3E%3C%2Fli%3E; JSESSIONID=00019jYqep6r3TRibGJQ14qji7nZ9I_eMZJIkVQoDOxj0WZHORHwSc2YiCYST17KtkXhKBMrq8g1_69scZUVSLa3_At2_zYmZjPzJJsEwaOpI8Kaqkw8N3jKmi94TFp8vMSN:csp11; WMONID=QuZCNjsieM1; MagicLineSession=qOkfpLVhMcFBkdSSXM1q' \
-H 'Origin: https://unipass.customs.go.kr' \
-H 'Pragma: no-cache' \
-H 'Referer: https://unipass.customs.go.kr/csp/index.do' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0' \
-H 'X-Requested-With: XMLHttpRequest' \
-H 'isAjax: true' \
-H 'sec-ch-ua: "Microsoft Edge";v="123", "Not:A-Brand";v="8", "Chromium";v="123"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Windows"' \
--data-raw 'firstIndex=0&recordCountPerPage=10&page=1&pageIndex=1&pageSize=10&pageUnit=10&cargMtNo=24EASKS000i20260001'
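The responses of both endpoints contain the data we need to collect.
Search endpoint 2 (retrieveImpCargPrgsInfoDtl.do) requires the parameter cargMtNo (for example 24EASKS000i20260001), and we do not have that value in advance.
We therefore take search endpoint 1 (retrieveImpCargPrgsInfoLst.do), which accepts the M B/L number directly, as the main object of study.
Replaying the search endpoint 1 request in Postman returns:
{"message":"Session 내에 유효한 토큰이 없습니다. 정상적인 요청이 아닙니다.","errortype":"SavedTokenInvalidException","error":"true"}
Translation: "There is no valid token in the session. This is not a normal request."
Repeated tests show that the token carried by search endpoint 1 is valid for one use only. It has already been consumed once in the browser, so the same request cannot succeed in Postman.
We therefore need a way to obtain a valid token.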
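A crawler has to tell this error apart from a normal result. The sketch below checks a response body for the errortype shown above; the function name is_token_error is hypothetical, and the only field it relies on is the one visible in the captured error message.

```python
import json


def is_token_error(body: str) -> bool:
    """Return True if the body is the SavedTokenInvalidException error."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all (for example an HTML error page).
        return False
    if not isinstance(payload, dict):
        return False
    return payload.get("errortype") == "SavedTokenInvalidException"


# Error body copied from the Postman response above.
sample = (
    '{"message":"Session 내에 유효한 토큰이 없습니다. 정상적인 요청이 아닙니다.",'
    '"errortype":"SavedTokenInvalidException","error":"true"}'
)
print(is_token_error(sample))
```

When the check is true, the token has been used up or was never issued for the current session, so a new token has to be requested before retrying.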
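Key information related to the token
Search the captured traffic for the token value NZSDTADT85O5XK1ZGSAGZHSPY03LH8UC.
Repeated searches show that the browser issues a "get token" request before each search request.
So, to call search endpoint 1 (retrieveImpCargPrgsInfoLst.do), we first have to send a "get token" request.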
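It helps to look at how the token travels in the captured URL of search endpoint 1: the savedToken parameter does not hold the token itself, it holds the name of a second parameter, and that second parameter holds the value. A small standard-library sketch of this two-step lookup (variable names are illustrative):

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the captured request for search endpoint 1.
captured_url = (
    "https://unipass.customs.go.kr/csp/myc/bsopspptinfo/cscllgstinfo/"
    "ImpCargPrgsInfoMtCtr/retrieveImpCargPrgsInfoLst.do"
    "?savedToken=MYC0405101Q_F2_savedToken"
    "&MYC0405101Q_F2_savedToken=NZSDTADT85O5XK1ZGSAGZHSPY03LH8UC"
)

query = parse_qs(urlsplit(captured_url).query)

# Step 1: savedToken names the parameter that carries the token.
token_name = query["savedToken"][0]
# Step 2: the parameter with that name carries the one-time token value.
token_value = query[token_name][0]

print(token_name, token_value)
```

Any replacement token therefore has to be sent as both parts: the name under savedToken and the value under that name.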
The token endpoint:
curl 'https://unipass.customs.go.kr/csp/myc/mainmt/MainMtCtr/menuExec.do' \
-H 'Accept: text/html, */*; q=0.01' \
-H 'Accept-Language: zh-CN,zh;q=0.9' \
-H 'Cache-Control: no-cache' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H $'Cookie: MYC_RCNT_MENU=%3Cli%3E%3Ca%20href%3D%22javascript%3Amyc_f_goRecentMenu(\'MYC_MNU_00000450\')%3B%22%3E%EC%88%98%EC%9E%85%ED%99%94%EB%AC%BC%20%EC%A7%84%ED%96%89%EC%A0%95%EB%B3%B4%3C%2Fa%3E%3C%2Fli%3E; WMONID=QuZCNjsieM1; MagicLineSession=qOkfpLVhMcFBkdSSXM1q; JSESSIONID=00109jYqep6r3TRibGJQ14qji7nZ9I_eMZJIkVQoDOxj0WZHORHwSc2YiCYST17KtkXhKBMrq8g1_69scZUVSLa3_At2_zYmZjPzJJsEwaOpI8Kaqkw8N3jKmi94TFp8vMSN:csp11' \
-H 'Origin: https://unipass.customs.go.kr' \
-H 'Pragma: no-cache' \
-H 'Referer: https://unipass.customs.go.kr/csp/index.do' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0' \
-H 'X-Requested-With: XMLHttpRequest' \
-H 'sec-ch-ua: "Microsoft Edge";v="123", "Not:A-Brand";v="8", "Chromium";v="123"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Windows"' \
--data-raw 'selectedId=MYC_MNU_00000450&mblNo=HASLC02231200435&hblNo=&blYy=2024'
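Test:
1. Send the "get token" request from Postman.
2. Put the returned token into the search endpoint 1 request (retrieveImpCargPrgsInfoLst.do).
3. Does the request now return correct data?
The experiment confirms that it does.
Next, we only need to understand the parameters of the token endpoint:
selectedId: MYC_MNU_00000450
mblNo: HASLC02231200435
hblNo:
blYy: 2024
Searching with different M B/L - H B/L numbers leaves selectedId unchanged, so it can be read as the search type (most likely the menu ID).
mblNo is the M B/L - H B/L number that was entered.
hblNo is empty.
blYy is the year selected in the search form.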
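The four parameters can be wrapped in a small helper so that only the B/L number and the year vary between requests. This is a sketch: build_token_payload is a hypothetical name, and treating selectedId as a constant is an assumption based on the observation above.

```python
from urllib.parse import urlencode


def build_token_payload(mbl_no: str, year: str, hbl_no: str = "") -> dict:
    """Build the form fields for the token endpoint (menuExec.do)."""
    return {
        "selectedId": "MYC_MNU_00000450",  # unchanged across searches
        "mblNo": mbl_no,
        "hblNo": hbl_no,
        "blYy": year,
    }


payload = build_token_payload("HASLC02231200435", "2024")
encoded = urlencode(payload)
print(encoded)
print(len(encoded))
```

For the sample number the encoded body is 67 bytes long, which matches the Content-Length header of the captured token request.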
POST /csp/myc/mainmt/MainMtCtr/menuExec.do HTTP/1.1
Accept: text/html, */*; q=0.01
Accept-Encoding: gzip, deflate, br, zstd
Accept-Language: zh-CN,zh;q=0.9
Cache-Control: no-cache
Connection: keep-alive
Content-Length: 67
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: MYC_RCNT_MENU=%3Cli%3E%3Ca%20href%3D%22javascript%3Amyc_f_goRecentMenu('MYC_MNU_00000450')%3B%22%3E%EC%88%98%EC%9E%85%ED%99%94%EB%AC%BC%20%EC%A7%84%ED%96%89%EC%A0%95%EB%B3%B4%3C%2Fa%3E%3C%2Fli%3E; WMONID=QuZCNjsieM1; MagicLineSession=qOkfpLVhMcFBkdSSXM1q; JSESSIONID=00109jYqep6r3TRibGJQ14qji7nZ9I_eMZJIkVQoDOxj0WZHORHwSc2YiCYST17KtkXhKBMrq8g1_69scZUVSLa3_At2_zYmZjPzJJsEwaOpI8Kaqkw8N3jKmi94TFp8vMSN:csp11
Host: unipass.customs.go.kr
Origin: https://unipass.customs.go.kr
Pragma: no-cache
Referer: https://unipass.customs.go.kr/csp/index.do
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0
X-Requested-With: XMLHttpRequest
sec-ch-ua: "Microsoft Edge";v="123", "Not:A-Brand";v="8", "Chromium";v="123"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
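The token endpoint returns a usable token even when no cookie is sent.
At this point the whole flow is clear:
1. Request the token endpoint to obtain a token.
2. Call search endpoint 1 (retrieveImpCargPrgsInfoLst.do) with the token to fetch the data.
3. Parse the JSON to extract the target fields.
4. Store the result in MySQL.
Crawler test
Note: create session = requests.Session() and use it for both requests. The error message above says the token lives in the session, so the search request presumably has to send the session cookie that the token request received.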
import json

import requests
from pyquery import PyQuery


def get_saved_token(mblNo, session):
    """Request the menu page and return (token field name, token value)."""
    url = 'https://unipass.customs.go.kr/csp/myc/mainmt/MainMtCtr/menuExec.do'
    headers = {
        "Accept": "text/html, */*; q=0.01",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Origin": "https://unipass.customs.go.kr",
        "Pragma": "no-cache",
        "Referer": "https://unipass.customs.go.kr/csp/index.do",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
        "sec-ch-ua-mobile": "?0"
    }
    data = {
        "selectedId": "MYC_MNU_00000450",
        "mblNo": mblNo,
        "hblNo": "",
        "blYy": "2024"
    }
    response = session.post(url, headers=headers, data=data)
    # print(response.text)
    # The input named "savedToken" holds the name of the token field;
    # the input with that name holds the token value.
    doc = PyQuery(response.text)
    results = doc('form#MYC0405101Q_form_tab1')
    saved_token = results('input[name="savedToken"]').attr('value')
    saved_token_value = results(f'input[name="{saved_token}"]').attr('value')
    print(saved_token, saved_token_value)
    return saved_token, saved_token_value
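The two-step lookup in get_saved_token can be tried offline. The sketch below uses only the standard library instead of PyQuery, and the HTML fragment is a made-up stand-in for the real response: the form id and the savedToken input name come from the code above, while the token value and the extra input are hypothetical.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name -> value for every <input> inside one form."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.inputs = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.in_form = True
        elif tag == "input" and self.in_form and "name" in attrs:
            self.inputs[attrs["name"]] = attrs.get("value", "")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Hypothetical fragment standing in for the menuExec.do response.
sample_html = """
<form id="MYC0405101Q_form_tab1">
  <input type="hidden" name="savedToken" value="MYC0405101Q_F2_savedToken">
  <input type="hidden" name="MYC0405101Q_F2_savedToken" value="EXAMPLETOKEN123">
  <input type="text" name="mblNo" value="">
</form>
"""

collector = InputCollector("MYC0405101Q_form_tab1")
collector.feed(sample_html)

token_name = collector.inputs["savedToken"]   # name of the token field
token_value = collector.inputs[token_name]    # the token itself
print(token_name, token_value)
```

If the real page changes its form id or input names, both this sketch and the PyQuery selectors would need to be adjusted.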
def get_unipass_info(mblNo):
    """Fetch a fresh token, query search endpoint 1 and store the result."""
    try:
        # One session for both requests, so the token request and the
        # search request share the same cookies.
        session = requests.Session()
        saved_token, saved_token_value = get_saved_token(mblNo, session)
        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'no-cache',
            'Connection': 'keep-alive',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Origin': 'https://unipass.customs.go.kr',
            'Pragma': 'no-cache',
            'Referer': 'https://unipass.customs.go.kr/csp/index.do',
            'Sec-Fetch-Dest': 'empty',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Site': 'same-origin',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
            'X-Requested-With': 'XMLHttpRequest',
            'isAjax': 'true',
            'sec-ch-ua': '"Google Chrome";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
            'sec-ch-ua-mobile': '?0',
            'sec-ch-ua-platform': '"Windows"',
        }
        # The token goes into the query string: savedToken carries the
        # field name, and that field carries the one-time value.
        params = {
            'savedToken': saved_token,
            saved_token: saved_token_value,
        }
        data = {
            'firstIndex': '0',
            'page': '1',
            'pageIndex': '1',
            'pageSize': '10',
            'pageUnit': '10',
            'recordCountPerPage': '10',
            'qryTp': '2',
            'cargMtNo': '',
            'mblNo': mblNo,
            'hblNo': '',
            'blYy': '2024',
        }
        response = session.post(
            'https://unipass.customs.go.kr/csp/myc/bsopspptinfo/cscllgstinfo/ImpCargPrgsInfoMtCtr/retrieveImpCargPrgsInfoLst.do',
            params=params,
            headers=headers,
            data=data,
        )
        print(response.text)
        parse_json(json_str=response.text, mblNo=mblNo)
    except Exception as e:
        print(e)
def parse_json(json_str, mblNo):
    """Extract the fields of interest from the response and store them."""
    has_data = ''
    cargMtNo = ''
    prnm = ''
    ttwg = ''
    kg = ''
    data = json.loads(json_str)
    try:
        has_data = data['count']
        first = data['resultList'][0]
        cargMtNo = first['cargMtNo']
        prnm = first['prnm']
        ttwg = first['ttwg']
        kg = first['kg']
    except (KeyError, IndexError, TypeError):
        # No usable result for this number: keep the empty defaults.
        pass
    item = {
        'mblNo': mblNo,
        'has_data': has_data,
        'cargMtNo': cargMtNo,
        'prnm': prnm,
        'ttwg': ttwg,
        'kg': kg,
        'json_data': json_str
    }
    # update() is the author's own database helper; it is not shown in this post.
    update(dt=item, dt_condition={'mblNo': mblNo}, tb='unipass_search')
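The extraction step can be exercised without the network or the database. The sketch below is a variant that uses dict.get instead of try/except. The field names (count, resultList, cargMtNo, prnm, ttwg, kg) come from the code above, but the sample payloads are hypothetical and the real response may differ in types and extra fields.

```python
import json


def first_result(json_str):
    """Return the fields of the first result, or empty strings if none."""
    data = json.loads(json_str)
    results = data.get("resultList") or []
    first = results[0] if results else {}
    return {
        "has_data": data.get("count", ""),
        "cargMtNo": first.get("cargMtNo", ""),
        "prnm": first.get("prnm", ""),
        "ttwg": first.get("ttwg", ""),
        "kg": first.get("kg", ""),
    }


# Hypothetical payloads shaped like the fields the crawler reads.
hit = json.dumps({
    "count": 1,
    "resultList": [
        {"cargMtNo": "24EASKS000i20260001", "prnm": "SAMPLE GOODS",
         "ttwg": "100", "kg": "KG"}
    ],
})
miss = json.dumps({"count": 0, "resultList": []})

print(first_result(hit))
print(first_result(miss))
```

Keeping the parsing separate from the database write makes it easier to test against saved responses.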
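Improving throughput
Insert the 260,000 M B/L - H B/L numbers to be queried into the MySQL database.
def get_all():
    # cursor is an open MySQL cursor created elsewhere (not shown in this post).
    query = "SELECT * from unipass_search where has_data IS NULL"
    cursor.execute(query)
    datas = cursor.fetchall()
    total = len(datas)
    for index, data in enumerate(datas):
        print(f'index: {index} total:{total}')
        mblNo = data[0]  # mblNo is the first column of the table
        get_unipass_info(mblNo)
        # time.sleep(3)
This sequential version is rather slow!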
from multiprocessing import Pool


def get_all_multi1():
    pool = Pool(processes=10)
    mbl_nos = []
    query = "SELECT * from unipass_search where has_data IS NULL"
    cursor.execute(query)
    datas = cursor.fetchall()
    for data in datas:
        mbl_nos.append(data[0])
    # Pass every B/L number in the list to get_unipass_info.
    pool.map(get_unipass_info, mbl_nos)
    pool.close()
    pool.join()  # wait for the worker processes to exit
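With Python multiprocessing the throughput improves to roughly 10,000 records per hour. Progress log (time, number of processed rows):
-- 06:40 66929
-- 06:44 67824
-- 07:00 71273 (about 10,000/hour)
-- 09:48 114458
-- 14:26 172545
-- 18:24 222270
-- 20:58 255680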
SELECT count(1) from unipass_search where has_data IS NOT NULL
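The hourly rate can be checked directly from the progress log. The short calculation below assumes all timestamps fall on the same day; it is an arithmetic aid, not part of the crawler.

```python
from datetime import datetime

# (time, processed rows) pairs copied from the progress log.
log = [
    ("06:40", 66929),
    ("06:44", 67824),
    ("07:00", 71273),
    ("09:48", 114458),
    ("14:26", 172545),
    ("18:24", 222270),
    ("20:58", 255680),
]


def rate_per_hour(start, end):
    """Rows processed per hour between two log entries."""
    t0 = datetime.strptime(start[0], "%H:%M")
    t1 = datetime.strptime(end[0], "%H:%M")
    hours = (t1 - t0).total_seconds() / 3600
    return (end[1] - start[1]) / hours


overall = rate_per_hour(log[0], log[-1])
print(round(overall))
```

Over the whole run (06:40 to 20:58, 188,751 rows) this comes to about 13,000 rows per hour, the same order of magnitude as the rough estimate of 10,000 per hour.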
Table structure
DROP TABLE IF EXISTS `unipass_search`;
CREATE TABLE `unipass_search` (
  `mblNo` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL DEFAULT NULL COMMENT 'M B/L - H B/L',
  `has_data` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL DEFAULT NULL COMMENT 'whether the query returned data',
  `cargMtNo` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL DEFAULT NULL COMMENT 'cargo management number',
  `prnm` varchar(2000) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL DEFAULT NULL COMMENT 'product name',
  `ttwg` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL DEFAULT NULL COMMENT 'gross weight',
  `kg` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL DEFAULT NULL COMMENT 'gross weight unit',
  `json_data` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL COMMENT 'raw JSON data'
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci ROW_FORMAT = Dynamic;
SET FOREIGN_KEY_CHECKS = 1;
M B/L - H B/L numbers for testing
HASLC02231200435
SOFBJTG2352608
KMTCXGG2559300
SNKO022231202438
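Summary
1. A token is valid for one use only.
2. The data can be fetched without sending a cookie captured from the browser.
3. Multiprocessing improves throughput.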
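Since each lookup spends most of its time waiting on the network, a thread pool is one possible alternative to the process pool. This is not what the post used and is shown only as a sketch: fetch_stub is a hypothetical placeholder for the real lookup, so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_stub(mbl_no):
    """Placeholder for the real lookup; returns the number and its length."""
    return mbl_no, len(mbl_no)


# Test numbers listed in this post.
mbl_numbers = [
    "HASLC02231200435",
    "SOFBJTG2352608",
    "KMTCXGG2559300",
    "SNKO022231202438",
]

# executor.map returns results in the same order as the inputs.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch_stub, mbl_numbers))

for mbl_no, length in results:
    print(mbl_no, length)
```

In a real run each worker would still need its own requests.Session and a fresh token per lookup, as in the crawler above.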
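Contact me
The code above is not the complete version. If you need the full version, contact me by email (wkssmile@163.com).
If this post helped you, please give it a like.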