_posts/2018-11-27-announcing-the-release-of-samza.html (158 lines of code) (raw):

--- layout: post status: PUBLISHED published: true title: Announcing the release of Samza 1.0 id: aa488579-6e6d-424d-bda5-140d085dd6f2 date: '2018-11-27 09:26:32 -0500' categories: samza tags: [] permalink: samza/entry/announcing-the-release-of-samza --- <p>We&rsquo;re thrilled to announce to the release of Apache Samza 1.0.</p> <p>Today Samza forms the backbone of hundreds of real-time production<br /> applications across a multitude of companies, such as LinkedIn, VMWare,<br /> Slack, Redfin among many others. This release of Samza adds a variety of<br /> features and capabilities to Samza&rsquo;s existing arsenal, coupled with new<br /> and improved <em>documentation</em>, <em>code</em> snippets, <em>examples</em>, and a<br /> brand-new <em>website design</em>! Here are a few selected highlights:</p> <ul> <li> <p><strong>Stable</strong> high level APIs that allow creating complex processing<br /> pipelines with ease.</p> </li> <li> <p><strong>Beam Samza Runner</strong> now marries Beam&rsquo;s best in class support for<br /> EventTime based windowed processing and sophisticated triggering<br /> with Samza&rsquo;s stable and scalable stateful processing model.</p> </li> <li> <p><strong>Table API</strong> that provides a common abstraction for accessing<br /> remote or local databases. Developers are now able to &ldquo;join&rdquo; an<br /> input event stream with such a Table.</p> </li> <li> <p>Integration <strong>Test Framework</strong> to enable effortless testing of Samza<br /> jobs without deploying a Kafka, Yarn, or Zookeeper cluster.</p> </li> <li> <p>Support for <strong>Apache Log4j2</strong> allowing improved logging performance,<br /> customization, and efficiency.</p> </li> <li> <p><strong>Upgraded Kafka</strong> client and consumer.</p> </li> <li> <p>An interactive <strong>shell for Samza SQL</strong> for seamless formulation,<br /> development, and testing of SamzaSQL queries.</p> </li> <li> <p><strong>Side-input</strong> support that allows using log-compacted data sources<br /> to populate KV state for Samza applications.</p> </li> <li> <p><strong>An improved website</strong> with detailed documentation and lots of code<br /> samples!</p> </li> </ul> <p>In addition, Samza 1.0 brings numerous bug-fixes, upgrades, and<br /> improvements listed <a href="#jiralist">below</a>.</p> <h2 id="new-features">New features</h2> <p>Samza 1.0 brings full-feature support for the following:</p> <h3 id="improved-stable-high-level-apis">Improved Stable High Level APIs</h3> <p>Samza 1.0 brings <em>Descriptor APIs</em> that allows applications to specify<br /> their input and output <em>systems</em> and <em>streams</em> in code. Samza&rsquo;s new<br /> <em>Context APIs</em> provide applications unified access to job-level,<br /> container-level, task-level, and application-level context and<br /> capabilities. This also simplifies Samza&rsquo;s <em>ApplicationRunner</em><br /> interface.</p> <p>This API evolution requires a few simple modifications to application<br /> code, which we describe in detail in our <a href="#upgradesteps">upgrade steps</a></p> <h4 id="beam-runner-support">Beam Runner Support</h4> <p>Samza&rsquo;s Beam Runner enables executing Beam pipelines over Samza. This<br /> enables Samza applications to create complex processing pipelines that<br /> require event-time based processing, varying types of event-time based<br /> windowing, and more. This feature is supported in both the YARN and<br /> standalone deployment models.</p> <h4 id="joining-streams-and-tables">Joining Streams and Tables</h4> <p>Samza&rsquo;s Table API provides developers with unified access to local<br /> and remote data sources such as <em>Key-Value stores</em> or <em>web services</em>,<br /> while providing features such as <em>rate-limiting, throttling,</em> and<br /> <em>caching</em> capabilities. This provides first-class API primitives for<br /> building Stream-Table join jobs. Learn more about the use, semantics,<br /> and examples for Table API <a href="http://TODO">here</a>.</p> <h4 id="test-samza-without-zk-yarn-or-kafka">Test Samza without <em>ZK, Yarn</em> or <em>Kafka</em></h4> <p>Samza 1.0 brings a test framework that allows testing Samza applications<br /> using <em>in-memory</em> input and output. Users can now setup test and<br /> testing pipelines for their applications without needing to setup any<br /> other services, such as Kafka, YARN, or Zookeeper.</p> <h4 id="log4j2-support">Log4J2 support</h4> <p>Samza now supports Apache Log4j 2 for system and application logging.<br /> Log4j 2 is an upgrade to Log4j that provides significant improvements<br /> over its predecessor, Log4j 1.x, such as better throughput and latency,<br /> custom log levels, and a pluggable logging architecture.</p> <h4 id="kafka-upgrade">Kafka upgrade</h4> <p>This release upgrades Samza to use Kafka&rsquo;s high-level consumer (Kafka<br /> v0.11.1.62). This brings latency and throughput benefits for Samza<br /> applications that consume from Kafka, in addition to bug-fixes. This<br /> also means Samza applications can now better their utilization of the<br /> underlying Kafka cluster.</p> <h4 id="samzasql-shell">SamzaSQL Shell</h4> <p>SamzaSQL now provides a shell for users to type-in their SQL queries,<br /> while Samza does the heavy-lifting of wiring the inputs and outputs, and<br /> sizing the application in the background. This is great for testing and<br /> experimenting with queries while formulating your application-logic,<br /> specially suited for data-scientists and tinkerers.</p> <h4 id="side-inputs">Side-inputs</h4> <p>Samza 1.0 brings the ability to leverage existing log-compacted data<br /> sources (e.g., Kafka topics) to populate KV state for Samza<br /> applications. If your data processing pipeline involves Hadoop-to-Kafka<br /> push, this feature alleviates the need for your Samza job to create<br /> separate Kafka-topics to back KV state.</p> <h4 id="improved-website-documentation-and-samples">Improved website, documentation, and samples</h4> <p>We&rsquo;ve re-designed the Samza website making it easier to find details on<br /> key Samza concepts and patterns. All documentation has been revised and<br /> rewritten, keeping in mind the feedback we got from our customers. We&rsquo;ve<br /> revised and added sample application code to showcase Samza 1.0 and the<br /> use of its new APIs.</p> <p><!--------------------------------------------------------------------------></p> <h2 id="enhancements-and-upgrades"><a name="jiralist"></a> Enhancements and Upgrades</h2> <p>This release brings the following enhancements, upgrades, and<br /> capabilities:</p> <h3 id="api-enhancements-and-simplifications">API enhancements and simplifications</h3> <p>SAMZA-1789: unify ApplicationDescriptor and ApplicationRunner for high-<br /> and low-level APIs in YARN and standalone environment</p> <p>SAMZA-1804: System and stream descriptors</p> <p>SAMZA-1858: Public APIs for shared context</p> <p>SAMZA-1763: Add async methods to Table API</p> <p>SAMZA-1786: Introduce the metadata store abstraction</p> <p>SAMZA-1859: Zookeeper implementation of MetadataStore</p> <p>SAMZA-1788: Add the LocationIdProvider abstraction</p> <h3 id="upgrades-and-bug-fixes">Upgrades and Bug-fixes</h3> <p>SAMZA-1768: Handle corrupted OFFSET file</p> <p>SAMZA-1817: Long classpath support for non-split deployments<br /> SAMZA-1719: Add caching support to table-API</p> <p>SAMZA-1783: Add Log4j2 functionality in Samza</p> <p>SAMZA-1868: Refactor KafkaSystemAdmin from using SimpleConsumer</p> <p>SAMZA-1776: Refactor KafkaSystemConsumer to remove the usage of<br /> deprecated SimpleConsumer client</p> <p>SAMZA-1730: Adding state validation in StreamProcessor before any<br /> lifecycle operation and group coordination</p> <p>SAMZA-1695: Clear events in ScheduleAfterDebounceTime on session<br /> expiration</p> <p>SAMZA-1647: Fix race conditions in StreamProcessor</p> <p>SAMZA-1371: Some Samza Containers get stuck at \&ldquo;Starting BrokerProxy\&rdquo;</p> <p>SAMZA-1648: Integration Test Framework &amp; Collection Stream Impl</p> <p>SAMZA-1748: Failure tests in the standalone deployment</p> <p>A source download of Samza 1.0 is available <a href="https://dist.apache.org/repos/dist/release/samza/1.0.0/">here</a>, and in Apache&rsquo;s Maven repository. </p> <h2 id="community"><a name="jiralist"></a> Community Developments</h2> <p>A <a href="https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/251481797">symposium</a><br /> on Stream processing with Apache Samza and Apache Kafka was held on July<br /> 19th and on October 23rd. Both were attended by more than 350<br /> participants from across the industry. It featured in-depth talks on<br /> Samza&rsquo;s Beam integration, its use at LinkedIn for real-time<br /> notifications, a talk on Kafka-replication at Uber, and Kafka cruise<br /> control, and many others.</p> <p>Samza was also the focus of a talk at <a href="https://www.youtube.com/watch?v=2y8QImf-RpI">Strange Loop&#39;18</a>,<br /> focussing in depth on its scalability, performance, extensibility, and<br /> programmability.</p>