https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
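Structured data: a schema is defined over the data, for example a relational database.
Unstructured data: free-form data with no constraints, for example newspaper articles.
Semi-structured data: there is no global schema, but every record carries its own schema definition, for example a document database.
Python applications often need to define structured data to manage business data. This post summarizes several ways to store structured data in Python.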
Structured data
Structured data sources define a schema on the data. With this extra bit of information about the underlying data, structured data sources provide efficient storage and performance. For example, columnar formats such as Parquet and ORC make it much easier to extract values from a subset of columns. Reading each record row by row first, then extracting the values from the specific columns of interest can read much more data than what is necessary when a query is only interested in a small fraction of the columns. A row-based storage format such as Avro efficiently serializes and stores data providing storage benefits. However, these advantages often come at the cost of flexibility. For example, because of rigidity in structure, evolving a schema can be challenging.
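The difference between row-based and columnar access can be sketched in plain Python. This is only an in-memory illustration with made-up records, not how Parquet, ORC, or Avro are implemented; the names `rows`, `columns`, and both helper functions are hypothetical.

```python
# Row-oriented layout: one dict per record (illustrative data).
rows = [
    {"id": 1, "name": "ann", "age": 31},
    {"id": 2, "name": "bob", "age": 45},
    {"id": 3, "name": "cy", "age": 27},
]

# Column-oriented layout: one list per field, same data.
columns = {
    "id": [1, 2, 3],
    "name": ["ann", "bob", "cy"],
    "age": [31, 45, 27],
}


def mean_age_from_rows(records):
    # Walks every record; in an on-disk row format the other
    # fields of each record would be read along the way.
    return sum(r["age"] for r in records) / len(records)


def mean_age_from_columns(cols):
    # Touches only the "age" column.
    ages = cols["age"]
    return sum(ages) / len(ages)


row_result = mean_age_from_rows(rows)
col_result = mean_age_from_columns(columns)
```

Both functions return the same value; the point is that the columnar version never looks at `id` or `name`.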
Unstructured data
By contrast, unstructured data sources are generally free-form text or binary objects that contain no markup, or metadata (e.g., commas in CSV files), to define the organization of data. Newspaper articles, medical records, image blobs, application logs are often treated as unstructured data. These sorts of sources generally require context around the data to be parseable. That is, you need to know that the file is an image or is a newspaper article. Most sources of data are unstructured. The cost of having unstructured formats is that it becomes cumbersome to extract value out of these data sources as many transformations and feature extraction techniques are required to interpret these datasets.
Semi-structured data
Semi-structured data sources are structured per record but don’t necessarily have a well-defined global schema spanning all records. As a result, each data record is augmented with its schema information. JSON and XML are popular examples. The benefits of semi-structured data formats are that they provide the most flexibility in expressing your data as each record is self-describing. These formats are very common across many applications as many lightweight parsers exist for dealing with these records, and they also have the benefit of being human readable. However, the main drawback for these formats is that they incur extra parsing overheads, and are not particularly built for ad-hoc querying.
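A small sketch of what "self-describing" means in practice, using the standard `json` module. The three records are invented for illustration; note that the full set of fields is only known after every record has been parsed.

```python
import json

# Three JSON records with different fields (illustrative data).
raw_records = [
    '{"id": 1, "name": "ann"}',
    '{"id": 2, "name": "bob", "email": "bob@example.com"}',
    '{"id": 3, "tags": ["x", "y"]}',
]

records = [json.loads(line) for line in raw_records]

# Each record carries its own field names alongside its values.
fields_per_record = [sorted(record) for record in records]

# A global view of the fields exists only after scanning all records.
all_fields = sorted(set().union(*records))
```

Every record has to be parsed to discover its fields, which is the parsing overhead mentioned above.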
https://docs.python.org/3/tutorial/datastructures.html#dictionaries
A dictionary has no schema definition at all; the developer has to list the fields as needed at the point of use.
>>> tel = {'jack': 4098, 'sape': 4139}
>>> tel['guido'] = 4127
>>> tel
{'jack': 4098, 'sape': 4139, 'guido': 4127}
>>> tel['jack']
4098
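Because a dict has no schema, nothing catches a misspelled or missing field. The sketch below shows the consequence; `missing_fields` is a hypothetical helper written for this example, not part of the standard library.

```python
tel = {"jack": 4098, "sape": 4139}

# A misspelled key is silently stored as a new entry.
tel["jakc"] = 4100

# tel["guido"] would raise KeyError; .get() returns None instead.
missing = tel.get("guido")


def missing_fields(record, required):
    # Hypothetical helper: list the required keys that are absent.
    return [name for name in required if name not in record]


absent = missing_fields(tel, ["jack", "guido"])
```

Any check of this kind has to be written and called by hand, which is what the approaches below try to improve on.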
https://medium.com/swlh/structures-in-python-ed199411b3e1
A named tuple gives a name to each position of a tuple, so elements can be accessed by name as well as by index.
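from collections import namedtuple

# Basic definition: both fields are required
Point = namedtuple('Point', ['x', 'y'])
# Same definition with default values (Python 3.7+)
Point = namedtuple('Point', ['x', 'y'], defaults=[0, 0])

ntpt = Point(3, y=6)
ntpt.x + ntpt.y    # access by field name
ntpt[0] + ntpt[1]  # access by position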
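A few more properties of named tuples that matter when using them as records, shown as a minimal sketch: instances are immutable, `_replace` returns a modified copy, and `_asdict` converts to a dictionary.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"], defaults=[0, 0])

origin = Point()   # both fields fall back to their defaults
p = Point(3, y=6)

# Named tuples are immutable: assigning to a field raises AttributeError.
try:
    p.x = 10
    mutated = True
except AttributeError:
    mutated = False

moved = p._replace(x=10)  # new instance; p itself is unchanged
as_dict = p._asdict()     # mapping of field name to value
```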
https://docs.python.org/3/tutorial/classes.html#class-objects
Use a class to manage a group of related data attributes.
>>> class Complex:
...     def __init__(self, realpart, imagpart):
...         self.r = realpart
...         self.i = imagpart
...
>>> x = Complex(3.0, -4.5)
>>> x.r, x.i
(3.0, -4.5)
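One limitation of a plain class used as a data container, sketched with the same `Complex` class: unless you write `__eq__` and `__repr__` yourself, two instances with identical values do not compare equal and the printed form is not informative.

```python
class Complex:
    def __init__(self, realpart, imagpart):
        self.r = realpart
        self.i = imagpart


a = Complex(3.0, -4.5)
b = Complex(3.0, -4.5)

# The stored values are identical...
same_values = (a.r, a.i) == (b.r, b.i)

# ...but == falls back to identity comparison for a plain class.
equal_objects = a == b

# The default repr shows only the type name and a memory address.
default_repr = repr(a)
```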
https://www.geeksforgeeks.org/understanding-python-dataclasses/
A dataclass builds on a regular class and is aimed specifically at storing data: it generates initialization, printing, and comparison for you.
Dataclasses were added in Python 3.7 as a utility for storing data. The dataclasses module provides a decorator and functions for automatically adding generated special methods such as __init__(), __repr__() and __eq__() to user-defined classes.
# default field example
from dataclasses import dataclass, field

# A class for holding an employee's data
@dataclass
class employee:
    # Attribute declarations using type hints
    name: str
    emp_id: str
    age: int
    # Default value set through field();
    # equivalent to: city: str = "patna"
    city: str = field(default="patna")

emp = employee("Satyam", "ksatyam858", 21)
print(emp)
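The generated methods can be checked directly. The sketch below reuses the employee example (renamed `Employee` here to follow the usual class naming convention) and also shows that the type hints are not enforced at runtime.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Employee:
    name: str
    emp_id: str
    age: int
    city: str = field(default="patna")


a = Employee("Satyam", "ksatyam858", 21)
b = Employee("Satyam", "ksatyam858", 21)

text = repr(a)    # generated __repr__ lists every field
equal = a == b    # generated __eq__ compares field by field
record = asdict(a)  # plain dict, convenient for serialization

# Type hints are not checked: a str is accepted for the int field.
unchecked = Employee("Satyam", "ksatyam858", "twenty-one")
```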
https://pydantic-docs.helpmanual.io/
On top of defining a data schema, pydantic adds some extra features:
Data validation
Type error reporting at runtime
Data validation and settings management using python type annotations.
pydantic enforces type hints at runtime, and provides user friendly errors when data is invalid.
Define how data should be in pure, canonical python; validate it with pydantic.
from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel


class User(BaseModel):
    id: int
    # pydantic v1 infers str from the default value;
    # pydantic v2 requires an annotation: name: str = 'John Doe'
    name = 'John Doe'
    signup_ts: Optional[datetime] = None
    friends: List[int] = []


external_data = {
    'id': '123',
    'signup_ts': '2019-06-01 12:22',
    'friends': [1, 2, '3'],
}
user = User(**external_data)
print(user.id)
#> 123
print(repr(user.signup_ts))
#> datetime.datetime(2019, 6, 1, 12, 22)
print(user.friends)
#> [1, 2, 3]
# .dict() is the v1 API; pydantic v2 renames it to model_dump()
print(user.dict())
"""
{
    'id': 123,
    'signup_ts': datetime.datetime(2019, 6, 1, 12, 22),
    'friends': [1, 2, 3],
    'name': 'John Doe',
}
"""
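The example above only shows valid input. A minimal sketch of the failure path follows, assuming pydantic is installed; the trimmed-down `User` model and the helper `validation_problems` are hypothetical names used only for this illustration.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int
    friends: List[int] = []


def validation_problems(data):
    # Hypothetical helper: return the location of each invalid field,
    # or an empty list when the data validates.
    try:
        User(**data)
    except ValidationError as exc:
        return [error["loc"] for error in exc.errors()]
    return []


# "not a number" cannot become an int, and neither can "two".
problems = validation_problems({"id": "not a number", "friends": [1, "two"]})

# A numeric string such as "123" is coerced to int, so this input is valid.
no_problems = validation_problems({"id": "123"})
```

Each entry in `problems` is a tuple that points at the offending field, including the index inside the `friends` list.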