Archival and Analytics


May 16, 2014

by severalnines

We won’t bore you with buzzwords like volume, velocity and variety. This post is for MySQL users who want to get their hands dirty with Hadoop, so roll up your sleeves and prepare for work. Why would you ever want to move MySQL data into Hadoop? One good reason is archival and analytics. You might not want to delete old data, but rather move it into Hadoop and make it available for further analysis at a later stage. 

In this post, we are going to deploy a Hadoop Cluster and export data in bulk from a Galera Cluster using Apache Sqoop. Sqoop is a well-proven approach for bulk data loading from a relational database into the Hadoop File System. There is also Hadoop Applier, available from MySQL Labs, which works by retrieving INSERT queries from the MySQL master binlog and writing them into a file in HDFS in real time (yes, it applies INSERTs only).

We will use Apache Ambari to deploy Hadoop (HDP 2.1) on three servers. We have a clustered WordPress site running on Galera, and for the purpose of this blog, we will export some user data to Hadoop for archiving. The database name is wordpress; we will use Sqoop to import the data into a Hive table running on HDFS. The following diagram illustrates our setup:

[Diagram: WordPress/Galera Cluster, ClusterControl with HAProxy, and the three-node Hadoop cluster]

An HAProxy instance has been installed on the ClusterControl node to load balance Galera connections; it listens on port 33306.
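If you want to verify that the load balancer is accepting connections, a quick check from any node with the MySQL client looks like this (a simple sketch; it assumes the wordpress credentials that we use for the Sqoop import later in this post):

$ mysql -h 192.168.0.100 -P 33306 -u wordpress -ppassword -e 'SELECT @@hostname'
# repeated runs may return different Galera hostnames, depending on the balancing algorithm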

Prerequisites

All hosts are running CentOS 6.5 with the firewall and SELinux turned off. All servers keep their clocks synced with each other via an NTP server. Hostnames must either be FQDNs, or you must define your hosts on all nodes in the /etc/hosts file. Each host has been configured with the following host definitions:

192.168.0.100    clustercontrol haproxy mysql
192.168.0.101    mysql1 galera1
192.168.0.102    mysql2 galera2
192.168.0.103    mysql3 galera3
192.168.0.111    hadoop1 hadoop1.cluster.com
192.168.0.112    hadoop2 hadoop2.cluster.com
192.168.0.113    hadoop3 hadoop3.cluster.com
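It is worth confirming that every node resolves these names and reports its FQDN correctly; a minimal check, run on each host:

$ hostname -f   # should return the FQDN, e.g. hadoop1.cluster.com
$ for h in hadoop1 hadoop2 hadoop3; do ping -c 1 $h.cluster.com | head -1; done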

Create an SSH key and configure passwordless SSH from hadoop1 to the other Hadoop nodes so that Ambari Server can automate the deployment. On hadoop1, run the following commands as root:

$ ssh-keygen -t rsa # press Enter for all prompts
$ ssh-copy-id -i ~/.ssh/id_rsa hadoop1.cluster.com
$ ssh-copy-id -i ~/.ssh/id_rsa hadoop2.cluster.com
$ ssh-copy-id -i ~/.ssh/id_rsa hadoop3.cluster.com
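Ambari bootstraps the other nodes over SSH, so it is worth checking that key-based login actually works before proceeding; a quick sketch from hadoop1 (BatchMode makes ssh fail instead of prompting for a password):

$ for h in hadoop1.cluster.com hadoop2.cluster.com hadoop3.cluster.com; do ssh -o BatchMode=yes root@$h hostname; done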

On all Hadoop hosts, install and configure NTP:

$ yum install ntp -y
$ chkconfig ntpd on
$ service ntpd start
$ ntpdate -u se.pool.ntp.org

Deploying Hadoop

1. Install Ambari Server on one of the Hadoop nodes (we chose hadoop1.cluster.com); it will help us deploy the Hadoop cluster. Configure the Ambari repository for CentOS 6 and start the installation:

$ cd /etc/yum.repos.d
$ wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo
$ yum -y install ambari-server

2. Set up and start ambari-server:

$ ambari-server setup # accept all default values on prompt
$ ambari-server start

Give Ambari a few minutes to bootstrap before accessing the web interface at port 8080.
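If you are unsure whether it has finished bootstrapping, you can check the server status and the listener from the shell (a simple sketch):

$ ambari-server status
$ curl -s -o /dev/null -w "%{http_code}\n" http://hadoop1.cluster.com:8080 # expect 200 once the web UI is up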

3. Open a web browser and navigate to http://hadoop1.cluster.com:8080. Log in with username and password ‘admin’. This is the Ambari dashboard; it will guide us through the deployment. Assign a cluster name and click Next.

4. At the Select Stack step, choose HDP 2.1:

[Screenshot: Select Stack, HDP 2.1]

5. Specify all Hadoop hosts in the Target Hosts field. Upload the SSH key that we generated in the Prerequisites section during the passwordless SSH setup and click Register and Confirm:

[Screenshot: Install Options, target hosts and SSH key]

6. This page confirms that Ambari has located the correct hosts for your Hadoop cluster. Ambari will check those hosts to make sure they have the correct directories, packages, and processes to continue the install. Click Next to proceed.

7. If you have enough resources, just go ahead and install all services:

[Screenshot: Choose Services]

8. On the Assign Masters page, we let Ambari choose the configuration for us before clicking Next:


[Screenshot: Assign Masters]

9. On the Assign Slaves and Clients page, we’ll enable all clients and slaves on each of our Hadoop hosts:

[Screenshot: Assign Slaves and Clients]

10. Hive, Oozie and Nagios might require further input, such as a database password and an administrator email. Specify the needed information accordingly and click Next.

11. You will be able to review your configuration selections before clicking Deploy to start the deployment:

[Screenshot: Review and Deploy]

When “Successfully installed and started the services” appears, choose Next. On the summary page, choose Complete. Hadoop installation and deployment is now complete. Verify that all services are running correctly:

[Screenshot: Ambari dashboard with all services running]
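You can also verify HDFS from the command line on any Hadoop node (a quick check, run as the hdfs user):

$ su - hdfs
$ hdfs dfsadmin -report | head -20   # live datanodes and capacity
$ hdfs dfs -ls /                     # browse the root of HDFS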

We can now proceed to import some data from our Galera cluster as described in the next section.

Importing MySQL Data into Hive using Sqoop

Before importing any MySQL data, we need to create a target table in Hive. This table will have a definition similar to the source table in MySQL, since we are importing all columns at once. Here is the MySQL CREATE TABLE statement:

CREATE TABLE `wp_users` (
  `ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `user_login` varchar(60) NOT NULL DEFAULT '',
  `user_pass` varchar(64) NOT NULL DEFAULT '',
  `user_nicename` varchar(50) NOT NULL DEFAULT '',
  `user_email` varchar(100) NOT NULL DEFAULT '',
  `user_url` varchar(100) NOT NULL DEFAULT '',
  `user_registered` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `user_activation_key` varchar(60) NOT NULL DEFAULT '',
  `user_status` int(11) NOT NULL DEFAULT '0',
  `display_name` varchar(250) NOT NULL DEFAULT '',
  PRIMARY KEY (`ID`),
  KEY `user_login_key` (`user_login`),
  KEY `user_nicename` (`user_nicename`)
) ENGINE=InnoDB AUTO_INCREMENT=5864 DEFAULT CHARSET=utf8

SSH into any Hadoop node (since we installed Hadoop clients on all nodes) and switch to the hdfs user:

$ su - hdfs
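Before creating the Hive table, you can optionally confirm that Sqoop reaches MySQL through HAProxy; a quick sketch using the same connection details as the import further down:

$ sqoop list-tables \
    --connect jdbc:mysql://192.168.0.100:33306/wordpress \
    --username=wordpress \
    --password=password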

Enter the Hive console:

$ hive

Create a Hive database and table similar to our MySQL table (Hive does not support the DATETIME data type, so we are going to replace it with TIMESTAMP):

hive> CREATE SCHEMA wordpress;
hive> SHOW DATABASES;
OK
default
wordpress
hive> USE wordpress;
hive> CREATE EXTERNAL TABLE IF NOT EXISTS users (
        ID BIGINT,
        user_login VARCHAR(60),
        user_pass VARCHAR(64),
        user_nicename VARCHAR(50),
        user_email VARCHAR(100),
        user_url VARCHAR(100),
        user_registered TIMESTAMP,
        user_activation_key VARCHAR(60),
        user_status INT,
        display_name VARCHAR(250));
hive> exit;

Now we can start to import the wp_users MySQL table into Hive’s users table, connecting to the MySQL nodes through HAProxy (port 33306):

$ sqoop import \
    --connect jdbc:mysql://192.168.0.100:33306/wordpress \
    --username=wordpress \
    --password=password \
    --table=wp_users \
    --hive-import \
    --hive-table=wordpress.users \
    --target-dir=wp_users_import \
    --direct

We can track the import progress from the Sqoop output:

..
INFO mapreduce.ImportJobBase: Beginning import of wp_users
..
INFO mapreduce.Job: Job job_1400142750135_0020 completed successfully
..
OK
Time taken: 10.035 seconds
Loading data to table wordpress.users
Table wordpress.users stats: [numFiles=5, numRows=0, totalSize=240814, rawDataSize=0]
OK
Time taken: 3.666 seconds

You should see that an HDFS directory wp_users_import has been created (as specified in --target-dir in the Sqoop command) and we can browse its files using the following commands:

$ hdfs dfs -ls
$ hdfs dfs -ls wp_users_import
$ hdfs dfs -cat wp_users_import/part-m-00000 | more

Now let’s check our imported data inside Hive:

$ hive -e 'SELECT * FROM wordpress.users LIMIT 10'
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
OK
2	admin	$P$BzaV8cFzeGpBODLqCmWp3uOtc5dVRb.	admin	my@email.com		2014-05-15 12:53:12		0	admin
5	SteveJones	$P$BciftXXIPbAhaWuO4bFb4LVUN24qay0	SteveJones	demouser2@54.254.93.50		2014-05-15 12:57:59		0	Steve
8	JanetGarrett	$P$BEp8IY1zvvrIdtPzDiU9D/br.FtzFa1	JanetGarrett	demouser3@54.254.93.50		2014-05-15 12:57:59		0	Janet
11	AnnWalker	$P$B1wix5Xn/15o06BWyHa.r/cZ0rwUWQ/	AnnWalker	demouser4@54.254.93.50		2014-05-15 12:57:59		0	Ann
14	DeborahFields	$P$B5PouJkJdfAucdz9p8NaKtS9WoKJu01	DeborahFields	demouser5@54.254.93.50		2014-05-15 12:57:59		0	Deborah
17	ChristopherMitchell	$P$Bi/VWI1W4iP7h9mC0SXd4f.kKWnilH/	ChristopherMitchell	demouser6@54.254.93.50		2014-05-15 12:57:59		0	Christopher
20	HenryHolmes	$P$BrPHv/ZHb7IBYzFpKgauBl/2WPZAC81	HenryHolmes	demouser7@54.254.93.50		2014-05-15 12:58:00		0	Henry
23	DavidWard	$P$BVYg0SFTihdXwDhushveet4n2Eitxp1	DavidWard	demouser8@54.254.93.50		2014-05-15 12:58:00		0	David
26	WilliamMurray	$P$Bc8FmkMadsQZCsW4L5Vo8Xax2ex8we.	WilliamMurray	demouser9@54.254.93.50		2014-05-15 12:58:00		0	William
29	KellyHarris	$P$Bc85yvlxvWQ4XxkeAgJRugOqm6S6au.	KellyHarris	demouser10@54.254.93.50		2014-05-15 12:58:00		0	Kelly
Time taken: 16.282 seconds, Fetched: 10 row(s)

Nice! Now we can see that our data exists both in Galera and Hadoop. You can also use the --query option in Sqoop to filter the data that you want to export to Hadoop using an SQL query (a sketch follows below). This is a basic example of how we can start to leverage Hadoop for archival and analytics. Welcome to big data!
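Such a free-form import might look like the following sketch; Sqoop requires the literal $CONDITIONS token in the WHERE clause and a --split-by column (or -m 1) when --query is used. The registration cut-off date and the users_archive Hive table are hypothetical examples:

# wordpress.users_archive is a hypothetical archive table (add --create-hive-table if it does not exist yet)
$ sqoop import \
    --connect jdbc:mysql://192.168.0.100:33306/wordpress \
    --username=wordpress \
    --password=password \
    --query 'SELECT * FROM wp_users WHERE user_registered < "2013-01-01" AND $CONDITIONS' \
    --split-by ID \
    --hive-import \
    --hive-table=wordpress.users_archive \
    --target-dir=wp_users_archive_import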

References

  • Sqoop User Guide (v1.4.2) http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
  • Hortonworks Data Platform Documentation http://docs.hortonworks.com/HDPDocuments/Ambari-1.5.1.0/bk_using_Ambari_book/content/index.html
