Hi everybody,

I have a data set and a set of queries, one of which performs a cross product:

    String queryString1 =
        "PREFIX ml: <http://example.org/movies#>\n"
        + "SELECT ?user ?movie ?user2 ?movie2\n"
        + "WHERE {\n"
        + "?user ml:rates ?personalRating .\n"
        + "?personalRating ml:ratedMovie ?movie .\n"
        + "?user2 ml:rates ?personalRating2 .\n"
        + "?personalRating2 ml:ratedMovie ?movie2 .\n"
        + "} ";

I store the results like this:

    FileOutputStream queryOutput = new FileOutputStream(
        "./results/sparql/movies/query1.srx");
    SPARQLResultsTSVWriter sparqlWriter =
        new SPARQLResultsTSVWriter(queryOutput);

Since I have 100,000 ratings, the result of that query should have 100,000 x 100,000 solutions, which is a lot. Obviously this ends in an OutOfMemoryError.

I know how evaluation works in Sesame, and I think the iterators are not the problem, since they deal with one solution mapping at a time. The problem might be the writer: I suppose the results are only written out all together (am I right?).

Is there any solution for this problem? Or do I have to implement my own writer, based for example on a BufferedWriter? Any other scalable solutions?

Thanks a lot for your help in advance.

Anthony Arrascue
------------------------------------------------------------------------------
_______________________________________________
Sesame-general mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/sesame-general
On 30/03/14 11:10, Anthony Arrascue wrote:
> Hi everybody,
>
> I have a data set and a set of queries, one of which performs a cross
> product:

[snip]

> I store the results like this:
>
>     FileOutputStream queryOutput = new FileOutputStream(
>         "./results/sparql/movies/query1.srx");
>     SPARQLResultsTSVWriter sparqlWriter =
>         new SPARQLResultsTSVWriter(queryOutput);
>
> Since I have 100.000 ratings the result of that query should have
> 100.000x100.000 results, which is a lot.
>
> Obviously this ends up in an OutOfMemory error. I know how the
> evaluation works in Sesame, and I think that the iterators are not the
> problem, since they deal with one solution mapping at a time.

How exactly do you pass the query result on to the writer? Do you do
something like this:

    SPARQLResultsTSVWriter sparqlWriter =
        new SPARQLResultsTSVWriter(queryOutput);
    TupleQuery query = conn.prepareTupleQuery(SPARQL, queryString1);
    query.evaluate(sparqlWriter);

If that is the case, I would not really expect an OutOfMemoryError to
occur even on such a large result set (though of course it does depend a
bit on how much memory you have allocated to begin with).

> The problem might be the writer, since I suppose that the results are
> only written all together (am I right?).

No, the writer itself streams individual results using a buffered
OutputStreamWriter, so in normal operation it should not put significant
pressure on the memory heap.

> Is there any solution for this problem? Or do I have to implement my own
> writer, based for example on a BufferedWriter? Any other scalable solutions?

The solution as-is _should_ be scalable.
It's a little hard to figure out what is going wrong for you, since I
haven't seen your code or an error stacktrace. Can you tell us a few
details:

1. how much heap space does your Java process have?
2. what kind of repository are you querying (in-memory, native, http)?
3. what is the stacktrace you get with the OutOfMemoryError?
4. which version of Sesame are you using?

With those details in place, we should be able to get to the bottom of
this.

Cheers,

Jeen

PS As an aside: the file extension .srx is typically used for query
results in the SPARQL Results XML format only. For TSV I'd just use .tsv.
Thank you for your answer.

I realized that the writer was not filling the file with results after each iteration. It is because the query above actually had an ORDER BY clause at the end, which I omitted because I thought it was not relevant:
    String queryString1 =
        "PREFIX ml: <http://example.org/movies#>\n"
        + "SELECT ?user ?movie ?user2 ?movie2\n"
        + "WHERE {\n"
        + "?user ml:rates ?personalRating .\n"
        + "?personalRating ml:ratedMovie ?movie .\n"
        + "?user2 ml:rates ?personalRating2 .\n"
        + "?personalRating2 ml:ratedMovie ?movie2 .\n"
        + "} "
        + "ORDER BY ...";

Of course, for sorting you need to compute the whole set of solution mappings before applying the operator. So basically, for (very) big data sets, if one uses ORDER BY, the memory will at some point be filled with the intermediate results, and this might produce an OutOfMemoryError.
Is my intuition correct? Is there a more scalable solution? Thank you in advance. Best Regards,
P.S.:

1. how much heap space does your java process have?

   -Xmx32768m

2. what kind of repository are you querying (in-memory, native, http)?

   in-memory (my own SAIL)

4. which version of Sesame are you using?

   Sesame 2.7.7

Anthony Arrascue
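The intuition about ORDER BY can be checked with a small self-contained sketch (plain Java, no Sesame classes; the class and method names here are made up for illustration). Streaming evaluation can push each solution to the writer as it is produced, while a sort must first materialize the full n x n cross product:

```java
import java.io.*;
import java.util.*;

public class OrderByMemoryDemo {

    /** Streams each pair straight to the sink as it is produced;
     *  the full cross product is never held in memory at once. */
    public static void streamCrossProduct(List<String> users, Writer out)
            throws IOException {
        for (String a : users)
            for (String b : users)
                out.write(a + "\t" + b + "\n");
    }

    /** Materializes all n*n pairs before sorting - this buffer is
     *  what grows with the result size when ORDER BY is used. */
    public static List<String> sortedCrossProduct(List<String> users) {
        List<String> rows = new ArrayList<>();
        for (String a : users)
            for (String b : users)
                rows.add(a + "\t" + b);
        Collections.sort(rows);  // needs all rows in memory first
        return rows;
    }
}
```

With 100,000 users the second method must hold all 10 billion rows before the first line can be written, which matches the observed OutOfMemoryError.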
On Sat, Mar 29, 2014 at 11:52 PM, Jeen Broekstra <[hidden email]> wrote:
On 2/04/14 0:45, Anthony Arrascue wrote:
> Thank you for your answer.
> I realized that the writer was not filling the file with the results
> after each iteration.
> It is because the query above did have an ORDER BY statement in the end,
> which I omitted because I thought it was not relevant.

Ah. Well, yes, it _is_ relevant.

[snip]

>     + "ORDER BY ...";
>
> Of course for sorting you need to compute the whole set of solution
> mappings before applying the operator.
> So basically for (very) big datasets if one uses ORDER BY, the memory
> will at some point be filled with the intermediate results and this
> might produce an OutOfMemoryError.
>
> Is my intuition correct?

Yes.

> Is there a more scalable solution?

The only scalable solution I can think of is to not impose ordering and
just write to file as-is. Ordering as implemented in Sesame at the
moment happens in memory.

Jeen
And for GROUP BY, one would have the same kind of behaviour, since an ordering or partial ordering is applied, right? Regards Anthony Arrascue
On Tue, Apr 1, 2014 at 9:10 PM, Jeen Broekstra <[hidden email]> wrote:
On 2/04/14 20:00, Anthony Arrascue wrote:
> And for GROUP BY, one would have the same kind of behaviour, since an
> ordering or partial ordering is applied, right?

Yes. I _should_ point out, by the way, that this is just the default
behavior of the query engine. Any SAIL implementation that wishes to do
so can override this strategy - for example, some stores may be natively
set up to be able to return results in a certain ordering.

Jeen