Lemonade (Live Exploration and Mining Of a Non-trivial Amount of Data from Everywhere) is an analytics platform that supports intuitive definition of tasks for knowledge discovery, mining, and learning from large amounts of data that come from a wide spectrum of scenarios. The platform interface is a web application in which users define analytics workflows visually by dragging and dropping operations and data sources and connecting them. Lemonade is being developed by UFMG as part of the EUBra-BIGSEA project and targets users who need to develop analytics workflows but do not want to learn a programming language. It supports creating processing workflows; importing, exporting, and managing datasets; executing and managing workflows; and data visualisation.
Lemonade provides a rich web interface that is both accessible to learners and powerful for experts. Its planned scope comprises more than 30 different operations for data mining, machine learning, and extraction, transformation and loading of data. The platform is also capable of processing massive amounts of data (“Big Data”), since it is being built on top of three scalable processing and storage technologies: Apache Spark, CMCC Ophidia and BSC COMPSs. The last two are developed by partners of the EUBra-BIGSEA project.
Users will be able to upload data sets using a service provided by Lemonade. Data are kept in a redundant file system designed to provide high availability and high throughput. Data storage requirements will depend on the use cases and the installation. Users may process terabytes of data, and the data volume directly affects storage and processing costs.
Lemonade can be scaled to support hundreds of users by increasing cluster capacity. A modest cluster of commodity computers can support a large number of users and the data volumes found in most organizations.
Lemonade has 7 micro-components:
Limonero: stores metadata about data sources and provides it as a service. For each data source, it has information about its location, access permissions, storage details (such as name, data type, size, precision, data format) and data characteristics such as distribution, missing values, mean and maximum values.
Tahiti: maintains metadata about individual operations and dataflows created by users and provides it as a service. Operations are the smallest units in Lemonade, and they are divided into five categories: execution, privacy/security, monitoring, appearance, and quality of service requirements (QoS).
Citron: the web interface that users use to create, execute, and monitor their dataflows. With it, users can choose predefined operations, drag them onto the workspace, and connect them through their ports to compose a dataflow.
Juicer: the module that actually runs the data flows and supports the monitoring of their execution. Upon receiving a data flow, it generates the equivalent Spark source code, acting as a transpiler (source-to-source compiler), where each operation becomes a method. The Spark code is then instantiated in the cloud execution environment, observing the user-defined QoS parameters to make sure operations execute with sufficient resources to meet user demands.
Stand: coordinates the communication between Citron and Juicer, ensuring independence between the two components. Execution starts when a user requests to run a dataflow through the Citron interface. Citron then invokes Stand, which connects back to Citron to provide feedback to the user.
Thorn: responsible for security, privacy and access control (AAA) in Lemonade. Some of its tasks are challenging, such as determining who will be able to access the results from applying an operation to a database that contains sensitive attributes.
Caipirinha: provides visualizations through different visual metaphors.
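The source-to-source step described for Juicer can be illustrated with a minimal sketch in Python. It is not Juicer's actual implementation: the workflow dictionary, the task ids, the TEMPLATES table and the function names are all hypothetical, and plain functions stand in for the methods mentioned above. The sketch orders the tasks by their dependencies and emits one function per operation, plus a driver that chains them. It only generates source text, so Spark does not need to be installed to run it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow description: tasks (operation + parameters) and
# flows (edges from a source task to a target task). The real format used
# by Lemonade is not shown in this document.
workflow = {
    "tasks": {
        "t1": {"operation": "read_data", "params": {"path": "data.csv"}},
        "t2": {"operation": "filter", "params": {"condition": "age > 30"}},
        "t3": {"operation": "save", "params": {"path": "out.parquet"}},
    },
    "flows": [("t1", "t2"), ("t2", "t3")],
}

# Hypothetical code templates, one per operation. {inp} is the name of the
# upstream DataFrame variable; the other fields come from the task params.
TEMPLATES = {
    "read_data": "spark.read.csv({path!r}, header=True)",
    "filter": "{inp}.filter({condition!r})",
    "save": "{inp}.write.parquet({path!r})",
}


def execution_order(wf):
    """Return task ids so that every task comes after the tasks it reads from."""
    sorter = TopologicalSorter({task_id: set() for task_id in wf["tasks"]})
    for source, target in wf["flows"]:
        sorter.add(target, source)  # target depends on source
    return list(sorter.static_order())


def transpile(wf):
    """Generate Python source text in which each task becomes one function."""
    inputs = {task_id: [] for task_id in wf["tasks"]}
    for source, target in wf["flows"]:
        inputs[target].append(source)

    order = execution_order(wf)
    lines = []
    for task_id in order:
        task = wf["tasks"][task_id]
        # Tasks without an upstream task receive the Spark session instead.
        arg = "df" if inputs[task_id] else "spark"
        body = TEMPLATES[task["operation"]].format(inp="df", **task["params"])
        lines += [f"def {task_id}_{task['operation']}({arg}):",
                  f"    return {body}",
                  ""]

    # Driver that calls the functions in dependency order.
    # This sketch handles tasks with at most one input.
    lines.append("def main(spark):")
    for task_id in order:
        name = f"{task_id}_{wf['tasks'][task_id]['operation']}"
        call_arg = f"out_{inputs[task_id][0]}" if inputs[task_id] else "spark"
        lines.append(f"    out_{task_id} = {name}({call_arg})")
    return "\n".join(lines) + "\n"


generated = transpile(workflow)
print(generated)
```

The generated text defines t1_read_data, t2_filter and t3_save and a main(spark) function that calls them in order. A real transpiler would also have to handle operations with several input ports and carry the QoS parameters through to the execution environment, which this sketch leaves out.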
Essential information for potential users
Lemonade is an open-source solution. All dependencies (operating system, processing frameworks, infrastructure technologies) are also open source, so there are no licensing costs. The license scheme is under discussion and it will be finalised for the first release.
To be kept up and running, Lemonade requires a cluster of processing computers and data storage. The size and capacity of the cluster depend on the number of users, the data volume and the complexity of the workflows and tasks.
Lemonade depends on Apache Mesos (standalone mode) or a distributed processing technology (Apache Spark, BSC COMPSs or CMCC Ophidia), an Oracle MySQL database server, and a Linux operating system distribution.
Lemonade requires a reliable infrastructure, which may be provided by platform-as-a-service (PaaS) providers such as Google, Amazon or Microsoft, or by the organization using Lemonade.
Three different user roles are supported in Lemonade: system administrator, data scientist and data explorer. The system administrator is responsible for keeping Lemonade running, adding new users, setting permissions and security, and managing data sets. Data scientists must understand Lemonade operations in order to create processing workflows, and must know the data being processed, its characteristics and how the results can be applied in a real scenario. Data explorers are the users of existing models.
Lemonade targets users from areas such as Mathematics, Statistics and Business Administration, as well as Data Science practitioners from any knowledge area.