What I have been learning recently is rather scattered: I pick up whatever the task at hand requires. This is a brief record of it so I can review and consolidate it later.
Using PySpark to run some simple queries against the database
from pyspark.sql import SparkSession, Row, SQLContext
from pyspark.sql.functions import udf, col, explode, collect_set, get_json_object, concat_ws, split
from pyspark.sql.types import StringType, IntegerType, StructType, StructField, ArrayType, MapType
# from offline_verification_func import *

# Build a local SparkSession with Hive support so Hive tables can be queried via spark.sql().
spark = SparkSession \
    .builder.master("local[50]") \
    .config("spark.executor.memory", "10g") \
    .config("spark.driver.memory", "20g") \
    .config("spark.driver.maxResultSize", "4g") \
    .appName("test") \
    .enableHiveSupport() \
    .getOrCreate()
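The queries below assume a students table already exists in Hive. If it does not, a minimal sketch like the following (reusing the spark session and the Row import from above, with made-up rows) registers an equivalent temp view so the same SQL can be tried locally:

# Hypothetical sample data standing in for the Hive table "students".
rows = [
    Row(id=1, name="Alice", age=15),
    Row(id=2, name="Bob", age=13),
    Row(id=3, name="Carol", age=17),
]
spark.createDataFrame(rows).createOrReplaceTempView("students")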
spark.sql(""" select id, name, age from students where age > 14 order by age """).show()
df = spark.sql("""
    select id, name, age
    from students
    where age > 14
    order by age
""")
# df.repartition(1).write.mode("overwrite").format('csv').save("dfr.csv")
df.toPandas().to_csv("df.csv")
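A side note on the two export paths above: writing through Spark (the commented-out line) produces a directory of part-*.csv files, while toPandas() pulls the whole result into driver memory and writes a single local file. A sketch of both, with an illustrative output path and assuming the result is small enough for the pandas route:

# Distributed write: creates a directory "df_out/" containing part-*.csv files.
df.repartition(1).write.mode("overwrite").option("header", True).format("csv").save("df_out")

# Driver-side export: collect everything into pandas, then write one local file.
df.toPandas().to_csv("df.csv", index=False)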
The SQL string passed to spark.sql() supports the usual query and filtering clauses such as select, from, where, group by, having, order by, and limit, and they behave the same way as in ordinary SQL.
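For example, a query combining several of these clauses against the same assumed students table might look like this (illustrative only):

# Count students per age, keep only ages with more than one student,
# and return the two largest groups.
spark.sql("""
    select age, count(*) as cnt
    from students
    group by age
    having count(*) > 1
    order by cnt desc
    limit 2
""").show()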