Entity Matching-as-a-Service (EMaaS) targets the problem of identifying records that refer to the same entity in the real world.
This task is challenging because of its pair-wise comparison nature, especially when the datasets involved in the matching process are large (Big Data). Because EM is critical for data cleaning and integration, e.g., to find duplicate points of interest in different databases, there is growing interest in how EM can benefit from modern parallel computing programming models such as Apache Spark.
For this reason, the EMaaS service, to be provided through the main API of EUBra-BIGSEA, consists of a set of tools and functions that process Entity Matching tasks (e.g., geo/spatial matching) in parallel using Apache Spark.
The EMaaS service will serve requests from applications and systems that want to submit Entity Matching tasks to the cluster environment. To this end, the service will connect to the Hadoop ecosystem to perform the necessary operations, such as uploading artifacts (e.g., datasets) to HDFS or starting the execution of Spark jobs.
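The two cluster operations mentioned above (copying a dataset to HDFS and launching a Spark job) could be driven from a service process roughly as follows. This is a minimal sketch, not the EMaaS implementation: the helper names `hdfs_put_command`, `spark_submit_command`, and `run_on_cluster`, as well as the file and directory names in the usage note, are hypothetical. It assumes the standard `hdfs` and `spark-submit` command-line tools are installed on the host and that the cluster is managed by YARN.

```python
import subprocess
from typing import List


def hdfs_put_command(local_path: str, hdfs_dir: str) -> List[str]:
    """Build the `hdfs dfs -put` command that copies a local file into HDFS.

    The -f flag overwrites the destination if it already exists.
    """
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir]


def spark_submit_command(app_path: str, app_args: List[str],
                         master: str = "yarn") -> List[str]:
    """Build the `spark-submit` command that starts a Spark job on the cluster."""
    return ["spark-submit", "--master", master,
            "--deploy-mode", "cluster", app_path, *app_args]


def run_on_cluster(local_dataset: str, hdfs_dir: str, app_path: str) -> None:
    """Upload the dataset, then start the matching job (hypothetical workflow)."""
    dataset_name = local_dataset.rsplit("/", 1)[-1]
    # check=True raises CalledProcessError if either command fails.
    subprocess.run(hdfs_put_command(local_dataset, hdfs_dir), check=True)
    subprocess.run(
        spark_submit_command(app_path, [f"{hdfs_dir}/{dataset_name}"]),
        check=True,
    )
```

A call such as `run_on_cluster("streets.csv", "/emaas/input", "match_job.py")` would first upload the file and then submit the job with the HDFS path as its only argument. Building the commands as lists rather than shell strings avoids quoting problems with paths that contain spaces.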
In the context of EUBra-BIGSEA, the service provides a set of Spark-based approaches for large-scale matching of geographic entities (e.g., trajectories, streets, city facilities). The Entity Matching (EM) process therefore deals with large geographic data sources describing cities around the world (one of them covers Curitiba, Brazil). Each data source stores sets of points (coordinates) that define the geographic entities of a city, e.g., streets, parks, gardens, cemeteries. Real-world entities can be modified (e.g., to improve urban mobility, or when roads or facilities are built or extended), and such modifications change their dimensions (geolocation). As a result, the data sources may receive conflicting data about the entities, e.g., the same set of points representing distinct entities. The Spark-based EM approaches identify duplicated or correlated entities (sets of points) within a single data source or across multiple large data sources, in parallel.
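To make the idea of matching entities as sets of points concrete, the sketch below compares entities in plain Python. It is an illustration under stated assumptions, not the project's algorithm: the grid-based blocking, the Jaccard similarity, the cell size, the threshold, and all names (`to_cells`, `jaccard`, `match_entities`, `sample`) are choices made for this example. In a Spark job, the blocking step would typically correspond to grouping entities by cell key so that each group can be compared in parallel.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]   # (latitude, longitude)
Cell = Tuple[int, int]        # grid cell index


def to_cells(points: List[Point], cell_size: float = 0.001) -> Set[Cell]:
    """Snap each point to a grid cell so that nearby coordinates compare as equal."""
    return {(math.floor(lat / cell_size), math.floor(lon / cell_size))
            for lat, lon in points}


def jaccard(a: Set[Cell], b: Set[Cell]) -> float:
    """Share of grid cells that two entities have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_entities(entities: Dict[str, List[Point]],
                   threshold: float = 0.5,
                   cell_size: float = 0.001) -> List[Tuple[str, str, float]]:
    """Return (id_a, id_b, score) for entity pairs whose similarity >= threshold."""
    cells = {eid: to_cells(pts, cell_size) for eid, pts in entities.items()}

    # Blocking: index entities by cell, so only entities that share at least
    # one cell are compared instead of all n * (n - 1) / 2 possible pairs.
    blocks: Dict[Cell, Set[str]] = defaultdict(set)
    for eid, entity_cells in cells.items():
        for cell in entity_cells:
            blocks[cell].add(eid)

    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))

    matches = []
    for a, b in sorted(candidates):
        score = jaccard(cells[a], cells[b])
        if score >= threshold:
            matches.append((a, b, score))
    return matches


# Made-up coordinates near Curitiba, for illustration only.
sample = {
    "street_A_v1": [(-25.4284, -49.2733), (-25.4296, -49.2747), (-25.4308, -49.2761)],
    "street_A_v2": [(-25.42845, -49.27335), (-25.42965, -49.27475), (-25.4321, -49.2779)],
    "park_B": [(-25.4105, -49.2672), (-25.4117, -49.2684)],
}
```

Calling `match_entities(sample)` pairs the two versions of the street, which share two of their four distinct grid cells (score 0.5), and never compares either of them with `park_B`, because the park shares no cell with them.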
EMaaS targets every scientific or industrial sector that needs a service for large-scale geo/spatial data matching.