Big Data Suitability - Evaluating the Stream Processor

Beyer and Laney [1] define Big Data in terms of High Volume, High Velocity, and High Variety. Volume refers to large amounts of data; Velocity to how much information is handled in real time; Variety to data diversity. The implemented Luzzu framework currently scales for both Volume and Variety. With regard to Volume, the processor runtime grows linearly with the number of triples. We also cater for Variety, since in Luzzu the results are not affected by data diversity. In particular, because we support the analysis of any data represented as RDF, any data schema and even various data models are supported, as long as they can be mapped to or encoded in RDF (e.g. relational data with R2RML mappings). Velocity completes the Big Data definition. So far we have employed Luzzu for quality assessment at well-defined checkpoints rather than in real time. However, due to its streaming nature, Luzzu can also assess the quality of data streams, thus catering for Velocity.
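The streaming argument can be illustrated with a minimal sketch. It is not Luzzu's actual API: the `compute`/`metric_value` interface, the `ShortUris` definition (share of subjects whose URI is at most `max_len` characters) and the `assess` function are assumptions made for illustration. Each triple is handed to every metric exactly once and then discarded, so the work grows linearly with the number of triples and the input may just as well be an unbounded stream.

```python
from typing import Iterable, List, Protocol, Tuple

# Simplified stand-in for an RDF triple: (subject, predicate, object).
Triple = Tuple[str, str, str]


class QualityMetric(Protocol):
    """Hypothetical metric interface; the names are illustrative."""

    def compute(self, triple: Triple) -> None: ...

    def metric_value(self) -> float: ...


class ShortUris:
    """Assumed definition: share of triples whose subject URI is short."""

    def __init__(self, max_len: int = 80) -> None:
        self.max_len = max_len
        self.seen = 0
        self.short = 0

    def compute(self, triple: Triple) -> None:
        # Constant work and constant memory per triple.
        self.seen += 1
        if len(triple[0]) <= self.max_len:
            self.short += 1

    def metric_value(self) -> float:
        return self.short / self.seen if self.seen else 0.0


def assess(triples: Iterable[Triple], metrics: List[QualityMetric]) -> List[float]:
    """Single pass over the stream; every metric sees every triple once."""
    for triple in triples:
        for metric in metrics:
            metric.compute(triple)
    return [metric.metric_value() for metric in metrics]
```

Because `assess` only iterates, it accepts a generator that reads triples from a file or a socket; a checkpoint-style assessment then corresponds to calling `metric_value()` at chosen points of the stream.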

We regularly evaluate our stream processors (Jena Stream Processor and Spark Stream Processor) and the framework with regard to scalability and Big Data suitability. The following test parameters were used:

  • Setup: Google Cloud Platform (with 3 Clusters for the Spark Stream Processor)
  • Datasets: generated with the BSBM (Berlin SPARQL Benchmark) generator at different scale factors (24, 56, 128, 199, 256, 666, 1369, 2089, 2785, 28453, 70812, 284826)
  • Metrics (Available here): starting from no initialised metrics and adding metrics up to Extensional Conciseness:
    1. Dereferenceability of Forward Links (≈ 5.221s)
    2. Detection of a Human Readable License (≈ 5.334s)
    3. Detection of a Machine Readable License (≈ 5.228s)
    4. Dereferenceability of Backward Links (≈ 14.364s)
    5. Linkage Degree of Linked External Data Providers (≈ 25.415s)
    6. Detection of Human Readable Labels (≈ 6.283s)
    7. Short URIs (≈ 5.069s)
    8. Identification of Literals with Malformed Datatypes (≈ 5.346s)
    9. Extensional Conciseness (≈ 5.376s)
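The evaluation procedure described by these parameters can be sketched as a small timing harness. The sketch assumes that metrics are initialised cumulatively in the listed order (zero metrics, then the first one, then the first two, and so on); `run_assessment` is a hypothetical callable standing in for one stream-processor run and is not part of Luzzu's API.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple

# Metric names taken from the list above, in the listed order.
METRICS = [
    "Dereferenceability of Forward Links",
    "Detection of a Human Readable License",
    "Detection of a Machine Readable License",
    "Dereferenceability of Backward Links",
    "Linkage Degree of Linked External Data Providers",
    "Detection of Human Readable Labels",
    "Short URIs",
    "Identification of Literals with Malformed Datatypes",
    "Extensional Conciseness",
]


def benchmark(
    run_assessment: Callable[[str, List[str]], None],
    datasets: Sequence[str],
    metrics: Sequence[str] = METRICS,
) -> Dict[Tuple[str, int], float]:
    """Time one run per (dataset, number of initialised metrics) pair."""
    timings: Dict[Tuple[str, int], float] = {}
    for dataset in datasets:
        # k = 0 is the baseline run with no initialised metrics.
        for k in range(len(metrics) + 1):
            start = time.perf_counter()
            run_assessment(dataset, list(metrics[:k]))
            timings[(dataset, k)] = time.perf_counter() - start
    return timings
```

Plotting `timings` against the triple count of each dataset, with one curve per value of `k`, yields a time-versus-triples chart per processor.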

Jena Stream Processor vs Spark Stream Processor
[Figure: Time vs Dataset Triples with different Metric Initialisations]

[1] Beyer, M. A., Laney, D. The Importance of ‘Big Data’: A Definition. 21st June 2012. http://www.gartner.com/resId=2057415.