0、楔子
1)什么是数据提取?
简单的来说,数据提取就是从响应中获取我们想要的数据的过程
2)数据分类
- 非结构化的数据:html,文本等
处理方法:正则表达式、xpath、beautiful soup - 结构化数据:json,xml等
3)什么是JSON?
JSON(JavaScript Object Notation) 是一种轻量级的数据交换格式,它使得人们很容易的进行阅读和编写。同时也方便了机器进行解析和生成。
适用于进行数据交互的场景,比如网站前台与后台之间的数据交互。
4)如何找到返回json的url呢?
- 使用浏览器/抓包工具进行分析 wireshark(windows/linux),tcpdump(linux)
- 抓包手机app的软件
1、json.dumps()
json.dumps()用于将dict类型的数据转成str
,因为如果直接将dict类型的数据写入json文件中会发生报错,因此在将数据写入时需要用到该函数。
先来看一段代码:
import json
# 学生信息
myinfo_dict = {"name":"李易阳", "age":23, "sex":"男"}
# 类型
print(type(myinfo_dict))
# 转换成json字符串数据
json_obj_str = json.dumps(myinfo_dict)
# 类型
print(type(json_obj_str))
输出结果如下:
<class 'dict'>
<class 'str'>
再看官方文档参数说明如下:
def dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True,
allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw):
Serialize ``obj`` to a JSON formatted ``str``.
If ``skipkeys`` is true then ``dict`` keys that are not basic types
(``str``, ``int``, ``float``, ``bool``, ``None``) will be skipped
instead of raising a ``TypeError``.
If ``ensure_ascii`` is false, then the return value can contain non-ASCII
characters if they appear in strings contained in ``obj``. Otherwise, all
such characters are escaped in JSON strings.
If ``check_circular`` is false, then the circular reference check
for container types will be skipped and a circular reference will
result in an ``OverflowError`` (or worse).
If ``allow_nan`` is false, then it will be a ``ValueError`` to
serialize out of range ``float`` values (``nan``, ``inf``, ``-inf``) in
strict compliance of the JSON specification, instead of using the
JavaScript equivalents (``NaN``, ``Infinity``, ``-Infinity``).
If ``indent`` is a non-negative integer, then JSON array elements and
object members will be pretty-printed with that indent level. An indent
level of 0 will only insert newlines. ``None`` is the most compact
representation.
If specified, ``separators`` should be an ``(item_separator, key_separator)``
tuple. The default is ``(', ', ': ')`` if *indent* is ``None`` and
``(',', ': ')`` otherwise. To get the most compact JSON representation,
you should specify ``(',', ':')`` to eliminate whitespace.
``default(obj)`` is a function that should return a serializable version
of obj or raise TypeError. The default simply raises TypeError.
If *sort_keys* is true (default: ``False``), then the output of
dictionaries will be sorted by key.
To use a custom ``JSONEncoder`` subclass (e.g. one that overrides the
``.default()`` method to serialize additional types), specify it with
the ``cls`` kwarg; otherwise ``JSONEncoder`` is used.
2、json.loads()
json.loads()用于将str类型的数据转成dict。
import json
name_emb = {'a':'1111','b':'2222','c':'3333','d':'4444'}
jsDumps = json.dumps(name_emb)
# jsDumps是json字符串格式
jsLoads = json.loads(jsDumps)
print(name_emb)
print(jsDumps)
print(jsLoads)
print(type(name_emb))
print(type(jsDumps))
print(type(jsLoads))
官方文档参数说明如下:
def loads(s, *, encoding=None, cls=None, object_hook=None, parse_float=None,
parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
Deserialize ``s`` (a ``str``, ``bytes`` or ``bytearray`` instance
containing a JSON document) to a Python object.
``object_hook`` is an optional function that will be called with the
result of any object literal decode (a ``dict``). The return value of
``object_hook`` will be used instead of the ``dict``. This feature
can be used to implement custom decoders (e.g. JSON-RPC class hinting).
``object_pairs_hook`` is an optional function that will be called with the
result of any object literal decoded with an ordered list of pairs. The
return value of ``object_pairs_hook`` will be used instead of the ``dict``.
This feature can be used to implement custom decoders. If ``object_hook``
is also defined, the ``object_pairs_hook`` takes priority.
``parse_float``, if specified, will be called with the string
of every JSON float to be decoded. By default this is equivalent to
float(num_str). This can be used to use another datatype or parser
for JSON floats (e.g. decimal.Decimal).
``parse_int``, if specified, will be called with the string
of every JSON int to be decoded. By default this is equivalent to
int(num_str). This can be used to use another datatype or parser
for JSON integers (e.g. float).
``parse_constant``, if specified, will be called with one of the
following strings: -Infinity, Infinity, NaN.
This can be used to raise an exception if invalid JSON numbers
are encountered.
To use a custom ``JSONDecoder`` subclass, specify it with the ``cls``
kwarg; otherwise ``JSONDecoder`` is used.
The ``encoding`` argument is ignored and deprecated.
3、json.dump()
json.dump()用于将dict类型的数据转成str
,并写入到json文件中。下面两种方法都可以将数据写入json文件.
import json
name_emb = {'a':'1111','b':'2222','c':'3333','d':'4444'}
emb_filename = ('/home/cqh/faceData/emb_json.json')
json.dump(name_emb, open(emb_filename, "w"))
4、json.load()
json.load()用于从json文件中读取数据。
import json
emb_filename = ('/home/cqh/faceData/emb_json.json')
jsObj = json.load(open(emb_filename))
print(jsObj)
print(type(jsObj))
for key in jsObj.keys():
print('key: %s value: %s' % (key,jsObj.get(key)))
运行结果如下:
{u'a': u'1111', u'c': u'3333', u'b': u'2222', u'd': u'4444'}
<type 'dict'>
key: a value: 1111
key: c value: 3333
key: b value: 2222
key: d value: 4444
总结:
json.dumps : dict转成str ,一个是将字典转换为字符串
json.loads: str转成dict ,一个是将字符串转换为字典
json.dump 是将python数据保存成json文件
json.load 是读取json数据(文件)
具有read()或者write()方法的对象就是类文件对象
f = open(“a.txt”,”r”),其中f就是类文件对象
@墨雨出品 必属精品 如有雷同 纯属巧合
`非学无以广才,非志无以成学!`