
Author: 林子係 | Source: IT165 | Published: 2016-12-14 20:32:52

Apache Sqoop - Overview


Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems, or accessing it from map reduce applications running on large clusters, can be a challenging task. Users must consider details like ensuring consistency of the data, the consumption of production system resources, and data preparation for provisioning downstream pipelines. Transferring data using scripts is inefficient and time consuming. Directly accessing data residing on external systems from within map reduce applications complicates the applications and exposes the production system to the risk of excessive load originating from cluster nodes.

This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at the Apache Software Foundation. More information on the project can be found at http://incubator.apache.org/sqoop.

Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision data from an external system onto HDFS and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks. Sqoop uses a connector-based architecture which supports plugins that provide connectivity to new external systems.

What happens underneath the covers when you run Sqoop is very straightforward. The dataset being transferred is sliced up into different partitions, and a map-only job is launched with individual mappers responsible for transferring a slice of the dataset. Each record is handled in a type-safe manner, since Sqoop uses the database metadata to infer the data types.
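
Because each mapper handles one slice, the degree of parallelism and the column used to slice the table can be tuned from the command line with the standard -m/--num-mappers and --split-by import options. A minimal sketch (the ORDER_ID column and the mapper count of 8 are illustrative choices, not values from the article):

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --split-by ORDER_ID \
  -m 8   # slice ORDERS on ORDER_ID and run 8 parallel map tasks
----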

In the rest of this post we will walk through an example that shows the various ways you can use Sqoop. The goal of this post is to give an overview of Sqoop operation without going into much detail about its advanced functionality.

Importing Data

The following command is used to import all data from a table called ORDERS in a MySQL database into the cluster:
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password ****
----

In this command the various options specified are as follows:

  • import: This is the sub-command that instructs Sqoop to initiate an import.
  • --connect <connect string>, --username <user name>, --password <password>: These are the connection parameters used to connect with the database. They are no different from the connection parameters you would use when connecting to the database via JDBC.
  • --table <table name>: This parameter specifies the table to be imported.

The import is done in the two steps depicted in Figure 1 below. In the first step, Sqoop introspects the database to gather the necessary metadata for the data being imported. The second step is a map-only job that Sqoop submits to the Hadoop cluster; it is this job that does the actual data transfer, using the metadata captured in the previous step.

Figure 1: Sqoop Import Overview

The imported data is saved in a directory on HDFS based on the table being imported. As is the case with most aspects of Sqoop operation, the user can specify an alternative directory where the files should be written.
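
For example, the destination can be overridden with the --target-dir import option; a minimal sketch (the path /user/arvind/ORDERS_RAW is just an illustrative choice):

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --target-dir /user/arvind/ORDERS_RAW   # write the files here instead of the default ORDERS directory
----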

By default these files contain comma-delimited fields, with new lines separating different records. You can easily override the format in which the data is copied over by explicitly specifying the field separator and record terminator characters.
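
As a sketch, the separators are set with the --fields-terminated-by and --lines-terminated-by options; tab-separated fields are chosen here purely as an example:

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --fields-terminated-by '\t' \
  --lines-terminated-by '\n'   # tab between fields, newline between records
----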

Sqoop also supports importing data in different data formats. For example, you can easily import data in the Avro data format by simply specifying the --as-avrodatafile option with the import command.
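
Added to the running example, that looks roughly like this (a sketch; everything except the extra option is unchanged from the earlier import):

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --as-avrodatafile   # store the imported records as Avro data files instead of text
----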

There are many other options that Sqoop provides which can be used to further tune the import operation to suit your specific requirements.

Importing Data into Hive

In most cases, importing data into Hive amounts to running the import and then using Hive to create and load a certain table or partition. Doing this manually requires that you know the correct type mapping between the data and other details like the serialization format and delimiters. Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition as the case may be. All of this is done by simply specifying the --hive-import option with the import command.
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** --hive-import
----

When you run a Hive import, Sqoop converts the data from the native data types of the external datastore into the corresponding types within Hive, and it automatically chooses the native delimiter set used by Hive. If the data being imported contains new lines or other Hive delimiter characters, Sqoop allows you to remove such characters and get the data correctly populated for consumption in Hive.
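
A minimal sketch of that clean-up, using the --hive-drop-import-delims option (its sibling --hive-delims-replacement substitutes a replacement string instead of dropping the characters):

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --hive-import \
  --hive-drop-import-delims   # strip \n, \r and \01 from string fields before they reach Hive
----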

Once the import is complete, you can see and operate on the table just like any other table in Hive.
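
For instance, assuming the import created a table named ORDERS in the default Hive database, a quick sanity check from the shell might look like this:

----
$ hive -e 'SELECT COUNT(*) FROM ORDERS;'   # row count should match the source table
----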

Importing Data into HBase

You can use Sqoop to populate data in a particular column family within an HBase table. Much like the Hive import, this is done by specifying additional options that identify the HBase table and column family being populated. All data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --hbase-create-table --hbase-table ORDERS --column-family mysql
----
     

In this command the various options specified are as follows:

  • --hbase-create-table: This option instructs Sqoop to create the HBase table.
  • --hbase-table: This option specifies the name of the HBase table to use.
  • --column-family: This option specifies the name of the column family to use.

The rest of the options are the same as for a regular import operation.
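
A related option worth knowing about is --hbase-row-key, which selects the input column whose value becomes the HBase row key; a sketch (ORDER_ID is an assumed column of ORDERS, not something shown in the article):

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --hbase-create-table --hbase-table ORDERS --column-family mysql \
  --hbase-row-key ORDER_ID   # use the ORDER_ID value as the row key
----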

Exporting Data

In some cases, data processed by Hadoop pipelines may be needed in production systems to help run additional critical business functions. Sqoop can be used to export such data into external datastores as necessary. Continuing the example from above: if the data generated by the pipeline on Hadoop corresponds to the ORDERS table in a database somewhere, you could populate that table using the following command:


----
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/arvind/ORDERS
----
     

In this command the various options specified are as follows:

  • export: This is the sub-command that instructs Sqoop to initiate an export.
  • --connect <connect string>, --username <user name>, --password <password>: These are the connection parameters used to connect with the database, no different from those you would use when connecting via JDBC.
  • --table <table name>: This parameter specifies the table to be populated.
  • --export-dir <directory path>: This is the HDFS directory from which data will be exported.

The export is done in the two steps depicted in Figure 2. The first step is to introspect the database for metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.

     

Figure 2: Sqoop Export Overview
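
As a sketch of tuning that transfer, the number of export map tasks is set with -m, and --batch enables batched JDBC statements where the driver supports them (both are documented export options; the value 4 is arbitrary):

----
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/arvind/ORDERS \
  -m 4 --batch   # four parallel map tasks, each using batched JDBC inserts
----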

Some connectors support staging tables that help isolate the production tables from possible corruption in case of job failures. The staging table is first populated by the map tasks, and its contents are merged into the target table only once all of the data has been delivered.
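
A minimal sketch of a staged export, using the --staging-table and --clear-staging-table export options (ORDERS_STAGING is an assumed, pre-created table with the same schema as ORDERS):

----
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/arvind/ORDERS \
  --staging-table ORDERS_STAGING \
  --clear-staging-table   # empty the staging table before the export begins
----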

Sqoop Connectors

Using specialized connectors, Sqoop can connect with external systems that have optimized import and export facilities, or that do not support native JDBC. Connectors are plugin components based on Sqoop's extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the connector.

By default, Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2. It also includes fast-path connectors for MySQL and PostgreSQL. Fast-path connectors are specialized connectors that use database-specific batch tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be used to connect to any database that is accessible via JDBC.
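
The fast-path connectors are selected with the --direct option, which hands the bulk transfer to the database's own tooling (for MySQL, utilities such as mysqldump) instead of going through generic JDBC; a sketch:

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --direct   # use the MySQL fast-path connector for the transfer
----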

Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop, ranging from specialized connectors for enterprise data warehouses to NoSQL datastores.

Wrapping Up

In this post you saw how easy it is to transfer large datasets between Hadoop and external datastores such as relational databases. Beyond this, Sqoop offers many advanced features such as different data formats, compression, and working with queries instead of tables. We encourage you to try out Sqoop and give us your feedback.
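
As one last sketch of those advanced options, a free-form query import with compressed output might look like the following. The --query, --split-by, --target-dir and --compress options are documented Sqoop options; the column names and target path are illustrative, and the literal $CONDITIONS token must appear in the WHERE clause so Sqoop can inject its split predicates:

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --username test --password **** \
  --query 'SELECT ORDER_ID, ORDER_DATE FROM ORDERS WHERE $CONDITIONS' \
  --split-by ORDER_ID \
  --target-dir /user/arvind/ORDERS_QUERY \
  --compress   # gzip the output files with the default codec
----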
     

More information regarding Sqoop can be found at:

Project Website: http://incubator.apache.org/sqoop

Wiki: https://cwiki.apache.org/confluence/display/SQOOP

Project Status: http://incubator.apache.org/projects/sqoop.html

Mailing Lists: https://cwiki.apache.org/confluence/display/SQOOP/Mailing+Lists

