scala> val employee = sqlparquet.read.json("employee.json")
employee: org.apache.spark.sql.DataFrame = [_corrupt_record: string, age: string ... 2 more fields]
Converting a txt file to Parquet at this step should work as well.

scala> employee.write.parquet("employee.parquet")

scala> val sqlpar = new org.apache.spark.sql.SQLContext(sc)
warning: one deprecation (since 2.0.0); for details, enable `:setting -deprecation' or `:replay -deprecation'
sqlpar: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4bdf398a

scala> val parread = sqlpar.read.parquet("employee.parquet")
parread: org.apache.spark.sql.DataFrame = [_corrupt_record: string, age: string ... 2 more fields]

scala> parread.show()
This prints the data, but the data is not in a table yet; this step only reads the Parquet file directly.
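The deprecation warning above appears because SQLContext has been superseded by SparkSession since Spark 2.0.0. The following is a minimal sketch of the same JSON to Parquet round trip using SparkSession, not the exact session from these notes. The scratch directory, the app name, and the three sample rows (taken from the query output in these notes) are assumptions made so the block is self-contained. The `_corrupt_record` column in the transcript is typical of input lines that Spark could not parse as standalone JSON objects, for example the `[` and `]` lines of a pretty-printed array; that is a guess about employee.json, which is not shown here.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder()
  .appName("parquet-round-trip") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Hypothetical scratch directory so the sketch does not touch real data.
val dir = Files.createTempDirectory("employee-demo")
val jsonFile = dir.resolve("employee.json")
val parquetPath = dir.resolve("employee.parquet").toString

// One JSON object per line (JSON Lines), which spark.read.json expects by default.
val rows = Seq(
  """{"id":"1201","name":"satish","age":"25"}""",
  """{"id":"1202","name":"krishna","age":"28"}""",
  """{"id":"1203","name":"amith","age":"39"}"""
)
Files.write(jsonFile, rows.mkString("\n").getBytes("UTF-8"))

// JSON -> DataFrame -> Parquet directory.
val employee = spark.read.json(jsonFile.toString)
employee.write.mode("overwrite").parquet(parquetPath)

// Read the Parquet directory back; the schema is stored with the data.
val parread = spark.read.parquet(parquetPath)
parread.printSchema()
parread.show()
```

If the real employee.json is a single multi-line JSON document, reading it with `spark.read.option("multiLine", true).json(...)` (available since Spark 2.2) may avoid the `_corrupt_record` column.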
scala> val allcol = sqlpar.sql("SELECT * FROM Demo")
allcol: org.apache.spark.sql.DataFrame = [_corrupt_record: string, age: string ... 2 more fields]

scala> val allcol = sqlpar.sql("SELECT id,age,name FROM Demo")
allcol: org.apache.spark.sql.DataFrame = [id: string, age: string ... 1 more field]

scala> allcol.show()
+----+----+-------+
|  id| age|   name|
+----+----+-------+
|null|null|   null|
|1201|  25| satish|
|1202|  28|krishna|
|1203|  39|  amith|
|1204|  23|  javed|
|1205|  23| prudvi|
|null|null|   null|
+----+----+-------+
Here the data is stored in a temporary table (Demo) and read back with SQL. The step that registers Demo is not shown in this transcript.
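The SQL queries above only work once a DataFrame has been registered under the name Demo. A minimal sketch of that step follows, assuming a SparkSession and a small in-memory DataFrame that stands in for `parread`. The sample rows are copied from the output above, and the all-null row imitates the null rows seen there. `cleaned` is a hypothetical name introduced for illustration.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder()
  .appName("temp-view-sketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Stand-in for the DataFrame read from employee.parquet.
val parread = Seq[(String, String, String)](
  (null, null, null),
  ("1201", "25", "satish"),
  ("1202", "28", "krishna"),
  ("1203", "39", "amith")
).toDF("id", "age", "name")

// Register the DataFrame as a session-scoped temporary view named Demo.
parread.createOrReplaceTempView("Demo")

// Query the view with SQL, selecting only the real data columns.
val allcol = spark.sql("SELECT id, age, name FROM Demo")
allcol.show()

// Drop rows in which every selected column is null.
val cleaned = allcol.na.drop("all")
cleaned.show()
```

A temporary view lives only as long as the session that created it, so it has to be registered again after restarting spark-shell.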
To be added later: the pros and cons of the three data sources JSON, Hive, and Parquet.