Parsing very large quad files - memory problems

Parsing very large quad files - memory problems

Felix Obenauer-2
Hello everyone,

I am writing an application which parses very large quad files (up to 200 GB / 800 million quads).
I use the NQuadsParser with a custom RDFHandler. The parser is configured like this:

        parser.setPreserveBNodeIDs(true);
        parser.getParserConfig().set(BasicParserSettings.VERIFY_DATATYPE_VALUES, false);
        parser.getParserConfig().set(BasicParserSettings.NORMALIZE_DATATYPE_VALUES, false);
        parser.getParserConfig().set(BasicParserSettings.FAIL_ON_UNKNOWN_DATATYPES, false);
        parser.getParserConfig().set(BasicParserSettings.VERIFY_LANGUAGE_TAGS, false);
        parser.getParserConfig().set(BasicParserSettings.NORMALIZE_LANGUAGE_TAGS, false);
        parser.getParserConfig().set(BasicParserSettings.FAIL_ON_UNKNOWN_LANGUAGES, false);
        parser.getParserConfig().set(BasicParserSettings.VERIFY_RELATIVE_URIS, false);
        parser.getParserConfig().set(BasicParserSettings.PRESERVE_BNODE_IDS, true);
        parser.getParserConfig().addNonFatalError(BasicParserSettings.VERIFY_DATATYPE_VALUES);
        parser.getParserConfig().addNonFatalError(BasicParserSettings.FAIL_ON_UNKNOWN_DATATYPES);
        parser.getParserConfig().addNonFatalError(NTriplesParserSettings.FAIL_ON_NTRIPLES_INVALID_LINES);


Unfortunately, even with -Xmx7000m and -Xms7000m I sooner or later (usually between 200 and 400 million quads)
run into memory problems. The following exception is thrown:

java.lang.OutOfMemoryError: GC overhead limit exceeded

I have analyzed the heap dump and found the problem to be the very large size of the Map<String, BNode> bNodeIDMap,
which is maintained in RDFParserBase. The files do contain quite a few blank nodes, but not abnormally many.

Is there anything I can do to still be able to parse these files? Does setting parser.setPreserveBNodeIDs(false); make any difference?
I looked into the source, but as far as I can tell, this would only store a different name in the Map and not really reduce its size.
It would be preferable if the BNode IDs were preserved.

I would be grateful for any comments or suggestions on this :)

Cheers
Felix



Re: Parsing very large quad files - memory problems

Barry Norton

Why not preserve bnode IDs and split the file into chunks?

The heap problem is generally due to the overhead of preserving transactionality - if you commit in stages (generally about 1M quads per commit, though with a ~7G heap you might manage more), the problem should go away.
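For illustration, a staged load along these lines might look like the sketch below, assuming the quads are being pushed into a RepositoryConnection; the handler class name and chunk size are just placeholders to tune for your heap:

import org.openrdf.model.Statement;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.RepositoryException;
import org.openrdf.rio.RDFHandlerException;
import org.openrdf.rio.helpers.RDFHandlerBase;

/**
 * RDFHandler that adds statements to a connection and commits every
 * CHUNK_SIZE statements, so no single transaction has to hold hundreds
 * of millions of quads in memory.
 */
public class ChunkedLoader extends RDFHandlerBase {

    private static final long CHUNK_SIZE = 1000000L; // ~1M quads per commit

    private final RepositoryConnection con;

    private long count = 0;

    public ChunkedLoader(RepositoryConnection con) {
        this.con = con;
    }

    @Override
    public void startRDF() throws RDFHandlerException {
        try {
            con.begin(); // open the first transaction
        } catch (RepositoryException e) {
            throw new RDFHandlerException(e);
        }
    }

    @Override
    public void handleStatement(Statement st) throws RDFHandlerException {
        try {
            con.add(st);
            if (++count % CHUNK_SIZE == 0) {
                con.commit(); // flush this chunk
                con.begin();  // and start the next one
            }
        } catch (RepositoryException e) {
            throw new RDFHandlerException(e);
        }
    }

    @Override
    public void endRDF() throws RDFHandlerException {
        try {
            con.commit(); // commit the final, partial chunk
        } catch (RepositoryException e) {
            throw new RDFHandlerException(e);
        }
    }
}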

Barry



Re: Parsing very large quad files - memory problems

Peter Ansell-2
In reply to this post by Felix Obenauer-2
On 25 July 2013 06:59, Felix Obenauer <[hidden email]> wrote:
> Hello everyone,
>
> I am writing an application which parses very large quad files (up to 200 GB / 800 million quads).
> I use the NQuadsParser with a custom RDFHandler. The parser is configured like this:
>
>         parser.setPreserveBNodeIDs(true);

This call is superseded; it maps directly through to parser.getParserConfig().set(BasicParserSettings.PRESERVE_BNODE_IDS, true).
 
>         parser.getParserConfig().set(BasicParserSettings.VERIFY_DATATYPE_VALUES, false);
>         parser.getParserConfig().set(BasicParserSettings.NORMALIZE_DATATYPE_VALUES, false);
>         parser.getParserConfig().set(BasicParserSettings.FAIL_ON_UNKNOWN_DATATYPES, false);
>         parser.getParserConfig().set(BasicParserSettings.VERIFY_LANGUAGE_TAGS, false);
>         parser.getParserConfig().set(BasicParserSettings.NORMALIZE_LANGUAGE_TAGS, false);
>         parser.getParserConfig().set(BasicParserSettings.FAIL_ON_UNKNOWN_LANGUAGES, false);
>         parser.getParserConfig().set(BasicParserSettings.VERIFY_RELATIVE_URIS, false);

I am fairly sure that the last of these, VERIFY_RELATIVE_URIS, is never used by the N-Quads parser, so you don't need to set it in this case.
 
>         parser.getParserConfig().set(BasicParserSettings.PRESERVE_BNODE_IDS, true);

This is the recommended way to do this now.
 
>         parser.getParserConfig().addNonFatalError(BasicParserSettings.VERIFY_DATATYPE_VALUES);
>         parser.getParserConfig().addNonFatalError(BasicParserSettings.FAIL_ON_UNKNOWN_DATATYPES);

You only need to add these as non-fatal errors if the corresponding settings are set to true; since you are setting them to false, these two calls should have no effect.
 
>         parser.getParserConfig().addNonFatalError(NTriplesParserSettings.FAIL_ON_NTRIPLES_INVALID_LINES);


Marking this one as a non-fatal error is definitely recommended for large loads!
 

> Unfortunately, even with -Xmx7000m and -Xms7000m I sooner or later (usually between 200 and 400 million quads)
> run into memory problems. The following exception is thrown:
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> I have analyzed the heap dump and found the problem to be the very large size of the Map<String, BNode> bNodeIDMap,
> which is maintained in RDFParserBase. The files do contain quite a few blank nodes, but not abnormally many.
>
> Is there anything I can do to still be able to parse these files? Does setting parser.setPreserveBNodeIDs(false); make any difference?

Blank nodes are difficult to support in all use cases, but the API and implementations try to support as many cases as possible in as simple a manner as possible.

The sole arbiter of whether a nodeID from a document is translated to an equivalent BNode in-memory, in a perfect world, would be the ValueFactory. This should be the case for NQuadsParser as N-Quads does not support anonymous blank nodes, so successive calls to ValueFactory.createBNode(String) should return an equivalent BNode, although not necessarily the same Java object. In the case of anonymous blank nodes the nodeID must be created by the ValueFactory using createBNode(), and not createBNode(String), so it should not affect N-Quads in either case.
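In code, the equivalence that the ValueFactory is expected to guarantee looks roughly like this (a minimal sketch; the class name is made up):

import org.openrdf.model.BNode;
import org.openrdf.model.ValueFactory;
import org.openrdf.model.impl.ValueFactoryImpl;

public class BNodeEquivalenceSketch {

    public static void main(String[] args) {
        ValueFactory vf = ValueFactoryImpl.getInstance();

        BNode a = vf.createBNode("b1"); // explicit nodeID, as a parser would pass it through
        BNode b = vf.createBNode("b1"); // the same label again
        BNode c = vf.createBNode();     // anonymous: the factory invents a fresh, unique ID

        System.out.println(a.equals(b)); // true  - equal by ID, though not necessarily the same object
        System.out.println(a.equals(c)); // false
    }
}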

The reasoning for adding an extra mapping layer to RDFParserBase.createBNode(nodeID) is that it makes it possible to use the ValueFactoryImpl.getInstance() singleton (if preserveBNodeIDs is turned off), for example, and not have naive collisions between blank nodes from different documents. However, ValueFactory is also designed to be able to be linked to some permanent mapping source, where the collisions would be protected through an outside transactional structure, such as RepositoryConnection.begin()/commit()/rollback(), for example.

> I looked into the source, but as far as I can tell, this would only store a different name in the Map and not really reduce its size.
> It would be preferable if the BNode IDs were preserved.


One immediate solution you could implement yourself is to create a subclass of NQuadsParser and override the createBNode(String) method so that it doesn't cache blank nodes and simply wraps valueFactory.createBNode(nodeID) in all cases, since that cache is clearly the bottleneck here. The memory usage would then be tied to your ValueFactory implementation instead of RDFParserBase.bNodeIDMap, which would never be used in that case.

Another solution would be to regularly call RDFParserBase.clearBNodeIDMap() from your subclass after every N calls to RDFParserBase.createStatement, which you would have to override in order to track the numbers. If the blank nodes are clustered in your documents, that may reduce the peak memory usage without having to stop using bNodeIDMap.
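The first variant might look roughly like the sketch below (the subclass name is made up, and it assumes the 2.7-era NQuadsParser and the valueFactory field mentioned above being accessible to subclasses); the second variant would instead override createStatement, count calls, and invoke clearBNodeIDMap() every so often:

import org.openrdf.model.BNode;
import org.openrdf.rio.nquads.NQuadsParser;

/**
 * NQuadsParser variant that bypasses RDFParserBase.bNodeIDMap entirely and
 * lets the ValueFactory map blank node labels to BNodes directly, so nothing
 * accumulates in the parser over an 800M-quad run.
 */
public class UncachedBNodeNQuadsParser extends NQuadsParser {

    @Override
    protected BNode createBNode(String nodeID) {
        // No per-parser cache: equal labels still yield equivalent BNodes,
        // because the ValueFactory derives them from the same nodeID.
        return valueFactory.createBNode(nodeID);
    }
}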

The RDFParserBase.createBNode(String) method could also easily be rewritten so that it only caches mappings for blank nodes if preserveBNodeIDs is false, without affecting the parser, assuming that the setting for preserveBNodeIDs is not changed during a parse run.

The current RDFParserBase.createBNode(String) actually seems to be written to support changing preserveBNodeIDs during a parse run without affecting consistency (as long as you recognise that the blank node IDs would then not be consistently preserved, although in all cases they would still have a consistent mapping from source to BNode objects, and as long as you only access the parser instance from a single thread, since it isn't synchronized right now).

I don't mind either adding a new setting for disabling caching, or rewriting the createBNode(String) method so that it only caches if preserveBNodeIDs is false, and in other cases relies solely on the ValueFactory implementation for caching, per the overall API design.
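As a rough illustration (not the actual patch), the body of RDFParserBase.createBNode(String) might then become something along these lines, reading the preserve flag from the parser config; treat the field and method names as a sketch:

protected BNode createBNode(String nodeID) throws RDFParseException {
    if (getParserConfig().get(BasicParserSettings.PRESERVE_BNODE_IDS)) {
        // IDs are being preserved, so no per-parser map is needed at all:
        // the ValueFactory alone decides what BNode a given label maps to.
        return valueFactory.createBNode(nodeID);
    }

    BNode result = bNodeIDMap.get(nodeID);
    if (result == null) {
        // First occurrence of this label: let the factory pick a fresh ID
        // and remember the mapping so later occurrences stay consistent.
        result = valueFactory.createBNode();
        bNodeIDMap.put(nodeID, result);
    }
    return result;
}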

> I would be grateful for any comments or suggestions on this :)



Hope that helps,

Peter



Re: Parsing very large quad files - memory problems

Felix Obenauer-2
Thank you _very_ much for your quick and thorough answer, Peter!


> One immediate solution you could implement yourself is to create a subclass of NQuadsParser and override the createBNode(String) method so that it doesn't cache blank nodes and simply wraps valueFactory.createBNode(nodeID) in all cases, since that cache is clearly the bottleneck here. The memory usage would then be tied to your ValueFactory implementation instead of RDFParserBase.bNodeIDMap, which would never be used in that case.


I have tried this and it seems to be working quite well. I have not had the time to run the program on my largest file yet, but since it crashed at around 200 million quads before, I think/hope
400 or 800 million will not make a difference.
Am I correct in assuming that when I turn off preserving BNode IDs, the BNode object created from a blank node label will still have the same string value? E.g. if I parse _:1234 and then serialize the resulting BNode, is it assured that _:1234 is written? As far as I understand, this should be the case.
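One quick way to convince yourself would be a small round trip through Rio, swapping in the subclassed parser you are actually using; the quad, URIs and class name below are made up, and it assumes the N-Quads parser and writer are on the classpath:

import java.io.StringReader;
import java.io.StringWriter;

import org.openrdf.model.Statement;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFParser;
import org.openrdf.rio.RDFWriter;
import org.openrdf.rio.Rio;
import org.openrdf.rio.helpers.BasicParserSettings;
import org.openrdf.rio.helpers.StatementCollector;

public class BNodeRoundTripCheck {

    public static void main(String[] args) throws Exception {
        String quad = "_:node1234 <http://example.org/p> \"o\" <http://example.org/g> .\n";

        // Replace with the subclassed parser to test its behaviour.
        RDFParser parser = Rio.createParser(RDFFormat.NQUADS);
        parser.getParserConfig().set(BasicParserSettings.PRESERVE_BNODE_IDS, true);

        StatementCollector statements = new StatementCollector();
        parser.setRDFHandler(statements);
        parser.parse(new StringReader(quad), "http://example.org/");

        StringWriter out = new StringWriter();
        RDFWriter writer = Rio.createWriter(RDFFormat.NQUADS, out);
        writer.startRDF();
        for (Statement st : statements.getStatements()) {
            writer.handleStatement(st);
        }
        writer.endRDF();

        // If the label survives, the output starts with the same _:node1234.
        System.out.println(out.toString().startsWith("_:node1234"));
    }
}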

Thanks again, this really helped me!
Felix



Re: Parsing very large quad files - memory problems

Peter Ansell-2
In reply to this post by Peter Ansell-2
On 25 July 2013 15:42, Peter Ansell <[hidden email]> wrote:

> On 25 July 2013 06:59, Felix Obenauer <[hidden email]> wrote:
>>
>> Hello everyone,
>>
>> I am writing an application which parses very large (up to 200 GByte / 800
>> Million quads) quad files.
>> I use the NQuadsParser with a custom RDFHandler . The parser is configured
>> like this:
>>
>>         parser.setPreserveBNodeIDs(true);
>

There is a Pull Request open for peer review to fix this issue:

https://bitbucket.org/openrdf/sesame/pull-request/206/ses-1941-do-not-cache-blank-node

Any and all reviewers are welcome,

Peter
