Data Lake - from theory to practice. Methods for integrating Hadoop data and corporate DWH / Tinkoff.ru blog / Sudo Null IT News
In this article I want to discuss an important task that you need to keep in mind and be able to solve if such a serious component as Hadoop appears in your analytical data platform - the task of integrating Hadoop data and corporate DWH data. In the Data Lake at Tinkoff Bank we learned how to solve this problem effectively, and in this article I will tell you how we did it.
This article continues a series of articles about the Data Lake at Tinkoff Bank (previous article: Data Lake - from theory to practice. A tale about how we built ETL on Hadoop).
Task
A lyrical digression. The picture above shows a picturesque lake, or rather a system of lakes - one smaller, the other larger. The smaller one, beautiful, well-kept, with yachts, is the corporate DWH. The one visible on the horizon, which did not fit into the picture because of its size, is Hadoop. The digression is over, to the point.
Our task was quite trivial in terms of requirements and non-trivial in terms of technology choice and implementation. We had to dig a channel between these two lakes: to establish a simple and efficient way to publish data from Hadoop to the DWH and back as part of the regular processes that run in the Data Lake.
Technology selection
The task seems very simple: figure out how to quickly transfer data from a Hive table to a Greenplum table and vice versa. Such problems are usually solved with ETL. But, considering the size of the tables (tens of millions of rows, gigabytes of data), we first conducted a study. In the study we compared four approaches:
- Sqoop - a tool from the Hadoop ecosystem for transferring data between structured storage and HDFS;
- Informatica Big Data Edition - used as an ETL platform for batch data processing in Hadoop;
- SAS Data Integration Studio - we use it as an ETL platform for processing data in the corporate DWH (Greenplum);
- gphdfs - a utility that is part of the Greenplum DBMS for working with HDFS (reading / writing data).
Next, I will discuss the advantages and disadvantages of each of them.
Sqoop
Sqoop is a tool designed to transfer data between Hadoop clusters and relational databases. With it, you can import data from a relational DBMS, for example SQL Server, MySQL or Oracle, into the Hadoop distributed file system (HDFS), transform the data in Hadoop using MapReduce or Hive, and then export the data back to a relational DBMS.
Since the task initially did not involve any transformations, Sqoop would seem ideally suited for solving it. As soon as a table needs to be published (either to Hadoop or to Greenplum), you write a Sqoop job and figure out how to call this job from one of the schedulers (SAS or Informatica), depending on the schedule.
Everything is fine, except that Sqoop works with Greenplum over JDBC, and here we ran into extremely low performance: a 30 GB test table took about an hour to load into Greenplum. The result is extremely unsatisfactory, so Sqoop was abandoned. In general, though, it is a very convenient tool for, say, a one-off upload to Hadoop of some not very large table from a relational database. But to build regular processes on Sqoop, you need to clearly understand the performance requirements of those processes and make a decision based on them.
Informatica Big Data Edition
We use Informatica Big Data Edition as an ELT data processing engine in Hadoop. That is, it is with Informatica BDE that we build in Hadoop the data marts that need to be published to Greenplum, where they become available to the bank's other application systems. It seems logical: once the ELT processes on the Hadoop cluster have built a data mart, push that mart into Greenplum. To work with the Greenplum DBMS, Informatica BDE has PWX for Greenplum, which can operate both in Native mode and in Hive mode. That is, as soon as a table needs to be published from Hadoop to Greenplum, you write a mapping in Informatica BDE and call it from the Informatica scheduler.
Everything is fine, but there is a nuance. PWX for Greenplum in Native mode works like classic ETL: it reads the data from Hive to the ETL server, and on the ETL server it starts a gpload session and loads the data into Greenplum. It turns out that the entire data stream passes through the ETL server.
So we ran experiments in Hive mode. PWX for Greenplum in Hive mode works without involving the ETL server: the ETL server only orchestrates the process, and all data is processed on the Hadoop cluster (Informatica BDE components are also installed on the Hadoop cluster). In this case, gpload sessions start on the nodes of the Hadoop cluster and load the data into Greenplum. Here there is no bottleneck in the form of an ETL server, and the performance of this approach turned out to be quite good: a 30 GB test table was loaded into Greenplum in about 15 minutes. But PWX for Greenplum in Hive mode was unstable at the time of our research. And there is another important point: if you want to publish data in the opposite direction (from Greenplum to Hadoop), PWX for Greenplum works through ODBC.
It was decided not to use Informatica BDE for this problem.
SAS Data Integration Studio
We use SAS Data Integration Studio as an ELT data processing engine in Greenplum. Here the picture is different: Informatica BDE builds the required data mart in Hadoop, then SAS DIS pulls this mart into Greenplum. Or the other way around: SAS DIS builds a data mart in Greenplum, then pushes it to Hadoop. It seems beautiful. To work with Hadoop, SAS DIS has the dedicated SAS Access Interface to Hadoop component. Drawing a parallel with PWX for Greenplum, SAS Access Interface to Hadoop has no Hive mode of operation, so all data flows through the ETL server. We got unsatisfactory process performance.
gphdfs
gphdfs is a utility that is part of the Greenplum DBMS and allows organizing parallel data exchange between the Greenplum segment servers and the Hadoop data nodes. We ran experiments publishing data both from Hadoop to Greenplum and vice versa, and the performance was simply amazing: a 30 GB test table was loaded into Greenplum in about 2 minutes.
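To give a feel for how gphdfs is used, here is a minimal sketch of the Hadoop -> Greenplum direction. The namenode host `hadoop-nn`, the port, the HDFS path and the table definitions are illustrative assumptions, not taken from the article:

```sql
-- Readable external table over a directory of text files in HDFS
-- (hypothetical host, path and columns); each Greenplum segment reads
-- its share of the files directly from the Hadoop data nodes.
CREATE EXTERNAL TABLE ext_customer_events (
    customer_id bigint,
    event_dt    date,
    amount      numeric(18,2)
)
LOCATION ('gphdfs://hadoop-nn:8020/data/work/customer_events')
FORMAT 'TEXT' (DELIMITER E'\t');

-- The load itself is a plain INSERT ... SELECT, executed in parallel.
INSERT INTO dwh.customer_events
SELECT * FROM ext_customer_events;
```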
Analysis of the results
For clarity, the table below shows the research results.
| Technology | Complexity of integration into regular processes | Complexity of process development | Process performance (Hadoop -> Greenplum) | Process performance (Greenplum -> Hadoop) |
|---|---|---|---|---|
| Sqoop | Difficult | Low | Poor (JDBC) | Poor (JDBC) |
| Informatica Big Data Edition (PWX for Greenplum in Native mode) | Easy | Low | Poor (gpload on ETL server) | Poor (ODBC) |
| Informatica Big Data Edition (PWX for Greenplum in Hive mode) | Easy | Low | Good (gpload on Hadoop cluster nodes) | Poor (ODBC) |
| SAS Data Integration Studio (SAS Access Interface to Hadoop) | Easy | Low | Unsatisfactory | Unsatisfactory |
| gphdfs | Difficult | High | Very high (gphdfs) | Very high (gphdfs) |
The conclusion was ambiguous: the utility with the fewest performance problems is completely unsuitable for ETL process development as is. We thought about it... The ELT platform SAS Data Integration Studio lets us develop our own components (transforms), so we decided, in order to reduce the complexity of developing ETL processes and the complexity of integration into regular processes, to develop two transforms that would make it easy to work with gphdfs without losing process performance. Next, I'll talk about the implementation details.
Transform Implementation
These two transforms have a fairly simple task: to execute a series of operations around Hive and gphdfs.
An example of the job design with the transform for publishing data from Hadoop to Greenplum:
- Hive Table - a table in Hive registered in SAS DI metadata;
- Transform - the transform whose steps I will describe below;
- Greenplum Table - the target table or work table in Greenplum.
What does the transform do (see the SQL sketch after this list):
- Creates an external table in the work database in Hive. The external table is created with a serializer that gphdfs understands (i.e., either CSV or TEXT);
- Reloads the data from the source Hive table we need into the Hive work table (created in the previous step). We do this to convert the data into a format that gphdfs understands. Because this step runs on the cluster, we do not lose performance. In addition, we become independent of the data format used in the source Hive table (PARQUET, ORC, etc.);
- Creates an external gphdfs table in the job's work schema in Greenplum, which points at the files in HDFS written as a result of the previous step;
- Performs a select from the external table (created in the previous step) - profit! The data flows from the Hadoop cluster data nodes to the Greenplum cluster segment servers.
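In HiveQL, the first two steps of the transform could look roughly like this; the database, path, table and column names are hypothetical, and the code actually generated by the transform will differ:

```sql
-- 1. Work external table in Hive, stored as plain text (a format gphdfs
--    understands) at an HDFS location that Greenplum will later read from.
CREATE EXTERNAL TABLE work.customer_events_txt (
    customer_id BIGINT,
    event_dt    STRING,
    amount      DECIMAL(18,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/work/customer_events';

-- 2. Re-materialize the source table (PARQUET, ORC, ...) into the work
--    table; the conversion runs as a job on the Hadoop cluster itself.
INSERT OVERWRITE TABLE work.customer_events_txt
SELECT customer_id, event_dt, amount
FROM dm.customer_events;
```

Steps 3 and 4 then repeat the pattern from the earlier gphdfs sketch: an external gphdfs table in the Greenplum work schema over the same /data/work/customer_events location, followed by an INSERT ... SELECT into the target table.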
All that remains for the developer is to add this transform to the job and specify the names of the input and output tables.
Developing such a process takes about 15 minutes.
By analogy, a transform was implemented for publishing data from Greenplum to Hadoop.
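The reverse direction rests on Greenplum writable external tables; a minimal sketch under the same hypothetical names (host, path, schemas and columns are assumptions, not the article's actual code):

```sql
-- Greenplum side: writable external table; the segment servers write
-- text files to HDFS in parallel.
CREATE WRITABLE EXTERNAL TABLE work.wext_customer_scores (
    customer_id bigint,
    score       numeric(9,6)
)
LOCATION ('gphdfs://hadoop-nn:8020/data/offload/customer_scores')
FORMAT 'TEXT' (DELIMITER E'\t')
DISTRIBUTED BY (customer_id);

INSERT INTO work.wext_customer_scores
SELECT customer_id, score
FROM dwh.customer_scores;

-- Hive side: external table over the files just written, so the data
-- becomes queryable in Hadoop right away.
CREATE EXTERNAL TABLE work.customer_scores_txt (
    customer_id BIGINT,
    score       DECIMAL(9,6)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/offload/customer_scores';
```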
IMPORTANT. Another benefit we gained by solving this problem: we are now potentially ready to organize the offloading of data from the corporate DWH to cheaper Hadoop storage.
Conclusion
What did I want to say with all this? There are two main points:
1. When you work with large volumes of data, be very careful when choosing a technology. Carefully study the task you are about to solve from all sides. Pay attention to the strengths and weaknesses of the technology. Try to avoid bottlenecks. The wrong choice of technology can greatly affect, if not immediately, the performance of the system and, as a result, the business process in which your system is involved;
2. Do not be afraid of, but rather welcome, extending your data integration platform with self-written components. This reduces the cost and time of further development and support by orders of magnitude.