Problem description
I want to create a DynamicFrame in my Glue job from an Aurora RDS MySQL table. Can I create a DynamicFrame from my RDS table using a custom query with a WHERE clause? I don't want to read the entire table into my DynamicFrame every time and filter afterwards. I looked at this page but didn't find any such option there or elsewhere: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html
# Construct JDBC connection options
connection_mysql5_options = {"url": "jdbc:mysql://:3306/db", "dbtable": "test", "user": "admin", "password": "pwd"}
# Read DynamicFrame from MySQL 5
df_mysql5 = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql5_options)
Is there any way to supply a WHERE clause so that only the first 100 rows are selected from the test table? Say it has a column named "id", and I want to fetch the rows using this query:
select * from test where id<100;
Any help is appreciated. Thank you!
Recommended answer
Apologies, I would have posted this as a comment, but I do not have sufficient reputation. I was able to make the solution that Guillermo AMS provided work within AWS Glue, but it required two changes:
- The "jdbc" format was not recognized (the error was: "py4j.protocol.Py4JJavaError: An error occurred while calling o79.load. : java.lang.ClassNotFoundException: Failed to find data source: jbdc. Please find packages at http://spark.apache.org/third-party-projects.html") -- I had to use the full name: "org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider"
- The "query" option did not work for me (the error was: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.load. : java.sql.SQLSyntaxErrorException: ORA-00911: invalid character"), but fortunately the "dbtable" option accepts either a table name or a subquery, that is, a query wrapped in parentheses.
In my solution below I have also added a bit of context around the needed objects and imports.
My solution ended up looking like:
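from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# jdbc_url, username and password must be defined before this point.
# "dbtable" takes a parenthesised subquery in place of a table name,
# so the WHERE clause is evaluated by the database rather than in Spark.
tmp_data_frame = (
    glue_context.spark_session.read
    .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider")
    .option("url", jdbc_url)
    .option("user", username)
    .option("password", password)
    .option("dbtable", "(select * from test where id<100)")
    .load()
)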
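The read above returns a Spark DataFrame, while the question asks for a DynamicFrame. The sketch below is one possible way to close that gap, not part of the original answer. It wraps the query into a "dbtable" subquery and converts the result with DynamicFrame.fromDF. The helper names as_dbtable_subquery and read_filtered_dynamic_frame, the alias "t", and the frame name "filtered_test" are made up for illustration. The alias is added on the assumption that the target is MySQL, which, as far as I know, rejects a derived table without one; the ORA- error quoted above suggests the answer itself was tested against Oracle.

```python
def as_dbtable_subquery(query: str, alias: str = "t") -> str:
    """Wrap a SELECT statement for use as the JDBC "dbtable" option."""
    # Drop surrounding whitespace and any trailing semicolon, which is
    # not valid inside a parenthesised subquery.
    cleaned = query.strip().rstrip(";").strip()
    # Assumption: MySQL needs an alias on a derived table. No "AS"
    # keyword is used, so the same string should also suit Oracle.
    return f"({cleaned}) {alias}"


def read_filtered_dynamic_frame(glue_context, jdbc_url, username, password, query):
    """Run `query` in the database and return the rows as a DynamicFrame."""
    # Imported lazily so the helper above works outside the Glue runtime.
    from awsglue.dynamicframe import DynamicFrame

    data_frame = (
        glue_context.spark_session.read
        .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider")
        .option("url", jdbc_url)
        .option("user", username)
        .option("password", password)
        .option("dbtable", as_dbtable_subquery(query))
        .load()
    )
    # Convert the Spark DataFrame into a Glue DynamicFrame.
    return DynamicFrame.fromDF(data_frame, glue_context, "filtered_test")


print(as_dbtable_subquery("select * from test where id<100;"))
```

Inside a Glue job, calling read_filtered_dynamic_frame(glue_context, jdbc_url, username, password, "select * from test where id<100") should yield a DynamicFrame holding only the filtered rows, so the rest of the job can keep using DynamicFrame transforms.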