-
Notifications
You must be signed in to change notification settings - Fork 12
Data Management + Feed Processing Platform over Hadoop
License
sriksun/Ivory
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Ivory Overview Ivory is a feed processing and feed management system aimed at making it easier for end consumers to onboard their feed processing and feed management on hadoop clusters. Why? * Dependencies across various data processing pipelines are not easy to establish. Gaps here typically leads to either incorrect/partial processing or expensive reprocessing. Repeated duplicate definition of a single feed multiple times can lead to inconsistencies / issues. * Input data may not arrive always on time and it is required to kick off the processing without waiting for all data to arrive and accommodate late data separately * Feed management services such as feed retention, replications across clusters, archival etc are tasks that are burdensome on individual pipeline owners and better offered as a service for all customers. * It should be easy to onboard new workflows/pipelines * Smoother integration with metastore/catalog * Provide notification to end customer based on availability of feed groups (logical group of related feeds, which are likely to be used together) Usage a. Setup cluster definition $IVORY_HOME/bin/ivory entity -submit -type cluster -file /cluster/definition.xml -url http://ivory-server:ivory-port b. Setup feed definition $IVORY_HOME/bin/ivory entity -submit -type feed -file /feed1/definition.xml -url http://ivory-server:ivory-port $IVORY_HOME/bin/ivory entity -submit -type feed -file /feed2/definition.xml -url http://ivory-server:ivory-port c. Setup process definition $IVORY_HOME/bin/ivory entity -submit -type process -file /process/definition.xml -url http://ivory-server:ivory-port d. Once submitted, entity definition, status and dependency can be queried. $IVORY_HOME/bin/ivory entity -type [cluster|feed|process] -name <<name>> [-definition|-status|-dependency] -url http://ivory-server:ivory-port or entities for a particular type can be listed through $IVORY_HOME/bin/ivory entity -type [cluster|feed|process] -list e. Schedule process $IVORY_HOME/bin/ivory entity -type process -name process -schedule -url http://ivory-server:ivory-port f. Once scheduled entities can be suspended, resumed or deleted (post submit) $IVORY_HOME/bin/ivory entity -type [cluster|feed|process] -name <<name>> [-suspend|-delete|-resume] -url http://ivory-server:ivory-port g. Once scheduled process instances can be managed through irovy CLI $IVORY_HOME/bin/ivory instance -processName <<name>> [-kill|-suspend|-resume|-re-run] -start "yyyy-MM-dd'T'HH:mm'Z'" -url http://ivory-server:ivory-port Example configurations Cluster: <?xml version="1.0"?> <!-- Production cluster configuration --> <cluster colo="ua2" description="" name="staging-red" xmlns="uri:ivory:cluster:0.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <interfaces> <interface type="readonly" endpoint="hftp://gsgw1001.red.ua2.inmobi.com:50070" version="0.20.2-cdh3u0" /> <interface type="write" endpoint="hdfs://gsgw1001.red.ua2.inmobi.com:54310" version="0.20.2-cdh3u0" /> <interface type="execute" endpoint="gsgw1001.red.ua2.inmobi.com:54311" version="0.20.2-cdh3u0" /> <interface type="workflow" endpoint="http://gs1134.blue.ua2.inmobi.com:11000/oozie/" version="3.1.4" /> <interface type="messaging" endpoint="tcp://gs1134.blue.ua2.inmobi.com:61616?daemon=true" version="5.1.6" /> </interfaces> <locations> <location name="staging" path="/projects/ivory/staging" /> <location name="temp" path="/tmp" /> <location name="working" path="/projects/ivory/working" /> </locations> <properties/> </cluster> Feed: <?xml version="1.0" encoding="UTF-8"?> <!-- Hourly ad carrier summary. Generated by hourly processing of rr logs --> <feed description="RRHourlyAdCarrierSummary" name="RRHourlyAdCarrierSummary" xmlns="uri:ivory:feed:0.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <partitions/> <groups>rmchourly</groups> <frequency>hours</frequency> <periodicity>1</periodicity> <late-arrival cut-off="hours(6)" /> <clusters> <cluster name="staging-red" type="source"> <validity start="2009-01-01T00:00Z" end="2099-12-31T00:00Z" timezone="UTC" /> <retention limit="months(24)" action="delete" /> </cluster> </clusters> <locations> <location type="data" path="/projects/bi/rmc/rr/${YEAR}-${MONTH}-${DAY}-${HOUR}.concat/HourlyAdCarrierSummary" /> <location type="stats" path="/none" /> <location type="meta" path="/none" /> </locations> <ACL owner="rmcuser" group="users" permission="0755" /> <schema location="/none" provider="none" /> <properties/> </feed> Process: <?xml version="1.0" encoding="UTF-8"?> <!-- RMC Daily process, produces 34 new feeds --> <process name="rmc-daily"> <cluster name="staging-red" /> <frequency>days(1)</frequency> <validity start="2012-04-03T06:00Z" end="2022-12-30T00:00Z" timezone="UTC" /> <inputs> <input name="WapAd" feed="WapAd" start="today(0,0)" end="today(0,0)" /> </inputs> <outputs> <output name="TrafficDailyAdSiteSummary" feed="TrafficDailyAdSiteSummary" instance="yesterday(0,0)" /> </outputs> <properties> <property name="lastday" value="${formatTime(yesterday(0,0), 'yyyy-MM-dd')}" /> </properties> <workflow engine="oozie" path="/projects/bi/rmc/pipelines/workflow/rmcdaily" /> <retry policy="backoff" delay="5" delayUnit="minutes" attempts="3" /> </process>
About
Data Management + Feed Processing Platform over Hadoop
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published