Writing a big number of results using a writer


Writing a big number of results using a writer

Anthony Arrascue
Hi everybody,

I have a data set and a set of queries, one of which performs a cross product:

    String queryString1 =
        "PREFIX ml: <http://example.org/movies#>\n"
            + "SELECT ?user ?movie ?user2 ?movie2\n"
            + "WHERE {\n"
            + "?user ml:rates ?personalRating .\n"
            + "?personalRating ml:ratedMovie ?movie .\n"
            + "?user2 ml:rates ?personalRating2 .\n"
            + "?personalRating2 ml:ratedMovie ?movie2 .\n"
            + "} ";

I store the results like this:

    FileOutputStream queryOutput = new FileOutputStream(
            "./results/sparql/movies/query1.srx");
    SPARQLResultsTSVWriter sparqlWriter =
            new SPARQLResultsTSVWriter(queryOutput);

Since I have 100,000 ratings, the result of that query should contain 100,000 x 100,000 = 10 billion solutions, which is a lot.

Unsurprisingly, this ends in an OutOfMemoryError. I know how evaluation works in Sesame, and I don't think the iterators are the problem, since they deal with one solution mapping at a time.

The problem might be the writer: I suppose the results are only written out all together at the end (am I right?).

Is there a solution to this problem? Or do I have to implement my own writer, based for example on a BufferedWriter? Are there any other scalable solutions?

Thanks a lot for your help in advance.


Anthony Arrascue


Re: Writing a big number of results using a writer

Jeen Broekstra
On 30/03/14 11:10, Anthony Arrascue wrote:

> Hi everybody,
>
> I have a data set and a set of queries, one of which performs a cross
> product:
>
>     String queryString1 =
>         "PREFIX ml: <http://example.org/movies#>\n"
>             + "SELECT ?user ?movie ?user2 ?movie2\n"
>             + "WHERE {\n"
>             + "?user ml:rates ?personalRating .\n"
>             + "?personalRating ml:ratedMovie ?movie .\n"
>             + "?user2 ml:rates ?personalRating2 .\n"
>             + "?personalRating2 ml:ratedMovie ?movie2 .\n"
>             + "} ";
>
> I store the results like this:
>
>     FileOutputStream queryOutput = new FileOutputStream(
>             "./results/sparql/movies/query1.srx");
>     SPARQLResultsTSVWriter sparqlWriter =
>             new SPARQLResultsTSVWriter(queryOutput);
>
> Since I have 100,000 ratings, the result of that query should contain
> 100,000 x 100,000 = 10 billion solutions, which is a lot.
>
> Unsurprisingly, this ends in an OutOfMemoryError. I know how evaluation
> works in Sesame, and I don't think the iterators are the problem, since
> they deal with one solution mapping at a time.

How exactly do you pass the query result on to the writer? Do you do
something like this:

    SPARQLResultsTSVWriter sparqlWriter =
            new SPARQLResultsTSVWriter(queryOutput);
    TupleQuery query = conn.prepareTupleQuery(QueryLanguage.SPARQL, queryString1);
    query.evaluate(sparqlWriter);

If that is the case, I would not really expect an OutOfMemoryError to
occur even on such a large result set (though of course it does depend a
bit on how much memory you have allocated to begin with).

> The problem might be the writer: I suppose the results are only written
> out all together at the end (am I right?).

No, the writer itself streams individual results using a buffered
OutputStreamWriter. So in normal operation it should not put significant
pressure on the memory heap.

> Is there a solution to this problem? Or do I have to implement my own
> writer, based for example on a BufferedWriter? Are there any other
> scalable solutions?

The solution as-is _should_ be scalable. It's a little hard to figure
out what is going wrong for you since I haven't seen your code or an
error stacktrace.

Can you tell us a few details:

  1. how much heap space does your java process have?
  2. what kind of repository are you querying (in-memory, native, http)?
  3. what is the stacktrace you get with the OutOfMemoryError?
  4. which version of Sesame are you using?

With those details in place, we should be able to get to the bottom of this.

Cheers,

Jeen

PS, as an aside: the file extension .srx is typically used for query
results in the SPARQL Results XML format only. For TSV I'd just use .tsv.


Re: Writing a big number of results using a writer

Anthony Arrascue
Thank you for your answer.
I realized that the writer was not filling the file with the results after each iteration.
That is because the query above actually did have an ORDER BY clause at the end, which I omitted because I thought it was not relevant:

    String queryString1 =
        "PREFIX ml: <http://example.org/movies#>\n"
            + "SELECT ?user ?movie ?user2 ?movie2\n"
            + "WHERE {\n"
            + "?user ml:rates ?personalRating .\n"
            + "?personalRating ml:ratedMovie ?movie .\n"
            + "?user2 ml:rates ?personalRating2 .\n"
            + "?personalRating2 ml:ratedMovie ?movie2 .\n"
            + "} "
            + "ORDER BY ...";


Of course, for sorting you need to compute the whole set of solution mappings before applying the operator.
So basically, for (very) big datasets, if one uses ORDER BY, memory will at some point fill up with intermediate results, and this might produce an OutOfMemoryError.

Is my intuition correct?
Is there a more scalable solution?

Thank you in advance.

Best Regards,

P.S.:

  1. how much heap space does your java process have?
     -Xmx32768m
  2. what kind of repository are you querying (in-memory, native, http)?
     In-memory (my own SAIL).
  4. which version of Sesame are you using?
     Sesame 2.7.7.




Anthony Arrascue


On Sat, Mar 29, 2014 at 11:52 PM, Jeen Broekstra <[hidden email]> wrote:
[snip]

Re: Writing a big number of results using a writer

Jeen Broekstra
On 2/04/14 0:45, Anthony Arrascue wrote:
> Thank you for your answer.
> I realized that the writer was not filling the file with the results
> after each iteration.
> That is because the query above actually did have an ORDER BY clause
> at the end, which I omitted because I thought it was not relevant:

Ah. Well, yes, it _is_ relevant.

[snip]

>             + "ORDER BY ...";

> Of course, for sorting you need to compute the whole set of solution
> mappings before applying the operator.
> So basically, for (very) big datasets, if one uses ORDER BY, memory
> will at some point fill up with intermediate results, and this might
> produce an OutOfMemoryError.
>
> Is my intuition correct?

Yes.

> Is there a more scalable solution?

The only scalable solution I can think of is to not impose ordering and
just write to file as-is. Ordering as implemented in Sesame at the
moment happens in memory.
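
For example, roughly (an untested sketch, reusing the variable names from
your earlier snippet and writing to .tsv, as per my earlier remark):

    // Same query, but with the ORDER BY clause removed; each solution
    // streams straight to disk instead of piling up in memory.
    FileOutputStream queryOutput = new FileOutputStream(
            "./results/sparql/movies/query1.tsv");
    SPARQLResultsTSVWriter sparqlWriter = new SPARQLResultsTSVWriter(queryOutput);
    TupleQuery query = conn.prepareTupleQuery(QueryLanguage.SPARQL, queryString1);
    query.evaluate(sparqlWriter);
    queryOutput.close();

If you still need the ordering afterwards, you can sort the TSV file with
an external, disk-based tool such as Unix sort.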


Jeen



Re: Writing a big number of results using a writer

Anthony Arrascue
And for GROUP BY, one would have the same kind of behaviour, since an ordering or partial ordering is applied, right?

Regards

Anthony Arrascue


On Tue, Apr 1, 2014 at 9:10 PM, Jeen Broekstra <[hidden email]> wrote:
[snip]

Re: Writing a big number of results using a writer

Jeen Broekstra
On 2/04/14 20:00, Anthony Arrascue wrote:
> And for GROUP BY, one would have the same kind of behaviour, since an
> ordering or partial ordering is applied, right?

Yes.

I _should_ point out, by the way, that this is just the default behavior
of the query engine. Any SAIL implementation that wishes to do so can
override this strategy: for example, some stores may natively be able to
return results in a certain ordering.
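
To make that concrete, here is a hypothetical, untested skeleton of such
an override (names and signatures based on Sesame 2.7's
EvaluationStrategyImpl; double-check them against your version):

    // Hypothetical sketch: a custom strategy whose ORDER BY handling could
    // spill to disk, or be skipped when the store already sorts natively.
    public class DiskSortEvaluationStrategy extends EvaluationStrategyImpl {

        public DiskSortEvaluationStrategy(TripleSource tripleSource, Dataset dataset) {
            super(tripleSource, dataset);
        }

        @Override
        public CloseableIteration<BindingSet, QueryEvaluationException> evaluate(
                Order node, BindingSet bindings) throws QueryEvaluationException {
            // The default implementation collects and sorts all solutions in
            // memory; a disk-backed sort (or a no-op for a pre-sorted store)
            // would go here. This placeholder just delegates to the default.
            return super.evaluate(node, bindings);
        }
    }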

Jeen

