Topic: MODx limits on the number of documents

#1: 29-Sep-2007, 10:14 AM

andytwiz
Posts: 13

Does MODx 0.9.6 still have the 5000 document limit?

I need to have about 250,000 documents - if so how can I modify the core to get around this limit? Where should I start looking?

Thanks

#2: 29-Sep-2007, 10:40 AM

Foundation

OpenGeek
MODx Co-Founder
Posts: 6,938

damn accurate caricatures...

Andy, I think you're going to need to purchase or develop something custom to support 250,000 documents with all the features of MODx.  It simply is not designed to handle that volume of document metadata, and likely never will be the right choice for doing so.

That said, I'm curious what you are doing with 250,000 documents.  Are these really individual documents, or could the majority of these documents be stored in a custom database table and presented through a dynamic script (i.e. via a single MODx document, with additional scripts for searching and managing the items)?  Again, without knowing details as to what these 250,000 individual documents represent, it's hard to say what the best solution is for your challenge...
Jason Coward
MODx Co-Founder
xPDO Founder
CTO @ Collabpad
work productively.
work intelligently.
work together.
Light is just a vibration of a note too. Everything is. You've got to keep that in mind.
  Frank Zappa

#3: 29-Sep-2007, 11:54 AM

Testers

ZAP
Posts: 1,619

I was going to ask the same questions. But I was also wondering if turning off caching entirely might be a good idea for a site with an excessive number of documents. Around what size does the current cache become self-defeating?
"Things are not what they appear to be; nor are they otherwise." - Buddha

"Well, gee, Buddha - that wasn't very helpful..." - ZAP

#4: 29-Sep-2007, 02:00 PM

andytwiz
Posts: 13

Thanks for your reply.

Yes, I'm using FeedX to import an XML feed, then DocManager to create a new document for each element of the feed. Ideally I'd like each feed element (about 250,000 of them) to be an individual document so that they can be searched, cached, etc.

I could use a dynamic script I guess - are there any snippets/plugins that make this easy for me to get started?

If I were to use documents, can you suggest where I need to start looking to modify the core? (I'm familiar with PHP but not MODx yet.)

While I'm on the topic, I'd also like to generate an XML Google Base (Froogle) feed from these documents. I couldn't find a snippet/plugin for this, so I assume using Ditto in a similar way to http://webbake.com/tutorials/modx-cms/google-sitemap-with-ditto would be the best way (sorry for the slightly off-topic question - I only installed MODx last weekend for our website rewrite).

#5: 30-Sep-2007, 12:06 PM

Foundation

OpenGeek
MODx Co-Founder
Posts: 6,938

@andytwiz:
First, why are you importing content that already exists and can be presented via XML feeds?  Isn't that a) duplicating content and b) against copyright of the original content owners/publishers?  And then republish as XML???  Why not cache it as XML data locally and simply present it through your templates?

With that out of the way, my answer was that the core, using documents for each article, is not going to be possible.  Period.  No amount of core hacking is going to fix that IMHO.  You would need to develop or find another tool altogether for that.

Ditto would be useless with 250,000 documents, as would the core.


@ZAP:
You can not turn off this part of MODx caching; period. This is not partial page caching we are talking about. MODx would not work without the siteCache.idx.php and in the current architecture, all 250,000 documents would need several lines each in this file, plus all of the PHP source code in the site definition (snippets, plugins, modules) would also be in there.  You would have a file so large that it would likely never execute on any PHP installation, and certainly not with any reasonable performance.

#6: 30-Sep-2007, 12:56 PM

andytwiz
Posts: 13

The XML is provided by third parties with the intention of being published on other websites - it is not a violation of copyright.

I could cache it locally. The XML file is about 100MB and contains the 250,000 items. The contents of these items need to be modified before being displayed to the user, and the same modified information needs to be output in the Google Base XML feed. This is why using a Document for each item would be ideal: I could parse the XML, modify the items as needed, save each one in a Document (giving me searching, caching, etc.) and then use Ditto to output the Google Base feed.

If you could point me in the direction of how I could cache the XML data locally and display it as "virtual documents" that would be much appreciated.

Thanks for your advice.

#7: 30-Sep-2007, 01:59 PM

Testers

ZAP
Posts: 1,619

100MB is a huge file, no matter how you handle it. I think that I'd split up the XML and store it in a custom MySQL table (properly indexed) and use a snippet to display the data as desired. I would think that would work with the current MODx core, and depending upon how well you optimize your table I imagine it will behave well enough. As long as you're not outputting huge files I would think you could cache them, but you may or may not want to do that anyway. As I understand it, this should keep the MODx sitecache file from containing your XML data.

What you lose is the ability to search through this data using a standard MODx snippet, since those aren't designed to index dynamic content or data in non-standard tables. But writing your own search snippet shouldn't be all that tough, given that your data is standardized and MySQL will pretty much do the work for you. And of course you don't get any of the other features of using the MODx system, so for example you may need to create a separate module to import or edit your XML data (since you won't be able to do this via the Manager).
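
A minimal sketch of the custom-table-plus-search-snippet idea described above. It is not code from this thread: the table name feed_items, its columns and the sample rows are all hypothetical, it assumes PHP 7 or later with the PDO SQLite driver, and it uses an in-memory SQLite database only so that it runs on its own. On a real site the table would live in MySQL next to the MODx tables.

```php
<?php
// Sketch only: feed_items, its columns and the sample rows are hypothetical.
// In-memory SQLite keeps the example self-contained; a real site would use MySQL.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec('CREATE TABLE feed_items (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT NOT NULL
)');

// Import step: one row per feed element instead of one MODx document each.
$insert = $pdo->prepare('INSERT INTO feed_items (title, body) VALUES (?, ?)');
$rows = [
    ['Blue widget', 'A small blue widget.'],
    ['Red widget', 'A large red widget.'],
    ['Green gadget', 'A gadget.'],
];
foreach ($rows as $row) {
    $insert->execute($row);
}

// What a home-made search snippet might do: a parameterised, paginated query.
// A leading-wildcard LIKE scans the table; a full-text index would suit real search better.
function searchItems(PDO $pdo, string $term, int $limit = 10, int $offset = 0): array
{
    $stmt = $pdo->prepare(
        'SELECT id, title FROM feed_items
         WHERE title LIKE :term
         ORDER BY id
         LIMIT :limit OFFSET :offset'
    );
    $stmt->bindValue(':term', '%' . $term . '%', PDO::PARAM_STR);
    $stmt->bindValue(':limit', $limit, PDO::PARAM_INT);
    $stmt->bindValue(':offset', $offset, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

$results = searchItems($pdo, 'widget');
foreach ($results as $item) {
    echo $item['id'], ': ', htmlspecialchars($item['title']), "\n";
}
```

The limit and offset parameters are there because a single page should never try to list all 250,000 rows; the snippet would render one page of results at a time.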

You'd still benefit from MODx's templating system, API, etc., so I guess I'd give it a shot in MODx and see what happens.

#8: 5-Oct-2007, 10:57 PM


Adam Wintle
Posts: 70

A quarter of a million files into MODx! If you don't mind me asking, what're you trying to make exactly, Amazon?!

#9: 5-Oct-2007, 11:21 PM

Coding Team

sottwell
Posts: 10,503

I've dealt with ~4MB XML files; the client refreshed the file every 15 minutes and I had a snippet that parsed the XML file and returned the desired data on demand. I was surprised at how snappy SimpleXML + XPath is at parsing such a large file! It requires PHP 5, but there are some nice-looking libraries for PHP 4 that worked almost as well. In my opinion, if you're going to be working with XML files, it's worth upgrading or even changing service providers to get PHP 5 and SimpleXML.
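
For anyone who hasn't used it, here is a small sketch of the SimpleXML + XPath approach mentioned above. The feed structure and element names are invented for illustration, and the XML is inlined so the example runs on its own; a snippet reading a real file would call simplexml_load_file() instead.

```php
<?php
// Sketch only: the feed layout and element names are made up for illustration.
$xml = <<<XML
<feed>
  <item id="1"><title>Blue widget</title><price>9.99</price></item>
  <item id="2"><title>Red widget</title><price>19.50</price></item>
  <item id="3"><title>Green gadget</title><price>4.25</price></item>
</feed>
XML;

$feed = simplexml_load_string($xml);
if ($feed === false) {
    exit("Could not parse the feed\n");
}

// XPath 1.0 query: every item whose price is below 10.
$cheap = $feed->xpath('/feed/item[price < 10]');

$titles = [];
foreach ($cheap as $item) {
    $titles[] = (string) $item->title;   // cast the element to a plain string
    echo $item['id'], ': ', $item->title, "\n";
}
```

Note that SimpleXML builds the whole document in memory, so this suits files of a few megabytes better than a 100MB feed; for something that large, a streaming parser such as XMLReader is probably the safer choice.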

One of the things I really like about the MODx forums! I do a lot of searching to research answers to posts, and find all sorts of neat stuff! I just now found this resource, and it looks like a really good one.

http://hudzilla.org/phpwiki/index.php?title=Main_Page
sottwell.com has moved to a lovely Solaris 10 server!
Log in username guest, password guestuser.
Templates are now becoming available at http://sottwell.com/templates.html

#10: 6-Oct-2007, 08:05 PM

andytwiz
Posts: 13

Hi,

I've hacked my MODx core so that it no longer caches the document map, aliases, document listing and content types in siteCache.idx. These are currently stored in:

Code:
$a = &$this->aliasListing;
$d = &$this->documentListing;
$m = &$this->documentMap;
$c = &$this->contentTypes;

My core now populates these arrays for the requested document only: it fetches that document's document map, aliases, document listing and content types from the database, and fetches the same data for its parent and all of its children.

I haven't yet come across a need to get any more document data than the document's parent and children.

This massively speeds up loading with 400,000 documents!

I'm afraid my hacked code is a real mess and I can't remember what I've changed, so I can't post it here, but may I suggest that something similar is done to MODx in a future release to make it more scalable?

Thanks
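
Since the actual patch was not posted, the following is only a rough sketch of the idea described in this post: load the listing data for the requested document, its parent and its children on demand, instead of reading every document from one cache file. It is not the poster's code. The documents table, its columns and the shape of the returned arrays are simplified, hypothetical stand-ins, and it assumes PHP 7 or later with the PDO SQLite driver.

```php
<?php
// Sketch only: NOT the patch described above. Table, columns and array shapes
// are simplified, hypothetical stand-ins for the real MODx structures.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    parent INTEGER NOT NULL,
    alias TEXT NOT NULL
)');
// Children are looked up by parent id, so index that column.
$pdo->exec('CREATE INDEX idx_documents_parent ON documents (parent)');
$pdo->exec("INSERT INTO documents (id, parent, alias) VALUES
    (1, 0, 'home'),
    (2, 1, 'products'),
    (3, 2, 'widget-a'),
    (4, 2, 'widget-b'),
    (5, 1, 'about')");

/**
 * Load listing data for one request only: the requested document,
 * its parent, and its direct children.
 */
function loadNeighbourhood(PDO $pdo, int $docId): array
{
    $stmt = $pdo->prepare(
        'SELECT id, parent, alias FROM documents
         WHERE id = ?
            OR parent = ?
            OR id = (SELECT parent FROM documents WHERE id = ?)
         ORDER BY id'
    );
    $stmt->execute([$docId, $docId, $docId]);

    $aliasListing = [];   // document id => alias
    $documentMap = [];    // parent id => list of child ids
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $id = (int) $row['id'];
        $aliasListing[$id] = $row['alias'];
        $documentMap[(int) $row['parent']][] = $id;
    }
    return ['aliasListing' => $aliasListing, 'documentMap' => $documentMap];
}

$data = loadNeighbourhood($pdo, 2);
print_r($data['aliasListing']);
```

Requesting document 2 loads documents 1 to 4 and leaves document 5 untouched, so the work per request depends on the size of one branch rather than on the size of the whole site.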

#11: 10-Oct-2007, 08:49 AM


MasterzDee
Posts: 32

Sooner or later MODx must support large document trees and directories (30,000 plus).  These days 5,000 docs is nothing.

#12: 10-Oct-2007, 09:16 AM

Foundation

OpenGeek
MODx Co-Founder
Posts: 6,938

Because everyone wants a site with 30,000 pages?  I've never even come close to building one with a hundred pages, at least where more than a dozen or so were actually ever worth reading.  But seriously, IMHO, anything with that large a quantity of articles needs special attention, regardless of what product you are using.

In any case, MODx will support what it can based on a number of factors, mostly related to the limitations of PHP and the software/hardware environment it is running within, as well as the architecture.  Several changes are coming in the new releases that should help you better organize documents into distinct sections that can live by themselves, but large numbers of articles in the same sections are still going to be an issue.  This is why some content needs to be managed in ways other than the standard MODx web content management paradigm (i.e. documents in the tree).

Finally, feel free to make suggestions and feature requests for MODx all you want, but keep in mind just telling us that MODx must do something is likely to accomplish nothing.
« Last Edit: 10-Oct-2007, 09:21 AM by OpenGeek »

#13: 10-Oct-2007, 09:43 AM

Foundation

rthrash
Posts: 11,348

I've gotta somehow be guessing this is very much Adsense related... no?
MODx is a content management framework that allows web professionals to turn over sites to end-users for daily maintenance without worrying. Please help us help you when asking for assistance and read the wiki. Searching the forums from the top level helps, too.
Ryan Thrash
MODx Co-Founder
Principal @ Collabpad
work productively.
work intelligently.
work together.

#14: 10-Oct-2007, 09:48 AM

Marketing & Design Team

davidm
MODx evangelist
Posts: 7,073

The best way to predict the future is to invent it

My biggest MODx website has around 1,200 documents, and it's a BIG corporate website with two languages and LOTS of data... Most of the websites I run are between 80 and 500 docs. Most corporate websites won't go past the 5,000 doc limit (unless they have thousands of products... but then they're unlikely to run something based on PHP and MySQL; you're talking enterprise level there...).

I don't see the point of trying to manage this much data without building a custom database. No matter what CMS you're using, you'll face the limitations of its environment...


.: nodeo.net : For a free, modern and open web! :: david-molliere.net : Follow my experiments and posts on CMSs and other web applications "live" :.

*** modxcms.fr forums: take part in building the French-language MODx site! ***

! New !  Live: don't miss the modxcms.fr news on Twitter   ! New !

MODx is the ideal tool for developers and web designers looking for a highly flexible, high-performance content management framework that is still easy for end users to pick up.

Config: Apache 2.2.8 - MySQL 5.0.67 - PHP 5.2.8 | Debian 4.0 (Etch)

Sites built with MODx: | pargade-notaires.fr | soleil.info | gican.asso.fr | michelez-notaires.com | amadom.gerondicap.com | jocelyne-violet.net

#15: 11-Oct-2007, 05:53 PM

Testers

ZAP
Posts: 1,619

I'm wondering at what size the siteCache.idx.php file tends to become a problem. We have one site now that is experiencing occasional 500 server errors, and its siteCache.idx.php is about 550k. I can probably reduce this quite a bit if I move the snippets and plugins into include files, but is this file size significant enough that I should try to reduce it?

#16: 11-Oct-2007, 06:15 PM

Foundation

splittingred
Posts: 1,510

i am alt-country rock

Sooner or later MODx must support large document trees and directories (30,000 plus).  These days 5,000 docs is nothing.

If you're trying to do an ecommerce or db-driven site incorrectly, yes, 5,000 is nothing.

If you're not trying to make singular pages for every single thing, you'll do fine. I work at UT, and our College of Ed. is converting their site to MODx 0.9.6.1, and we've had little-to-no problems. Our siteCache is at around 378k right now (we've got in about 560 pages, 1/5th or so of the total content), and runs smoothly on a test server that's only a G4 running OS X. I can't imagine how fast it will run when it hits the real servers.

All that said, 0.9.7 puts forth some real improvements in this area, both PHP-side and UI-side. Also, with the introduction of Contexts, it will prove quite easy to isolate sections of your sites into simply manageable parts.

If you're running at 250,000 docs (or honestly anything more than 2,500), I do have to ask what you're doing. It sounds to me like you should be writing some real app dev code to handle things, rather than trying to fit it all into a CMS. MODx is great - but it doesn't replace true application development, yet.
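
As a back-of-envelope check using only the figures in this post (about 378k of siteCache for about 560 pages), here is a rough linear extrapolation to 250,000 documents. It assumes "k" means kibibytes and that the file grows in proportion to page count. The cache also holds snippet and plugin source, so the per-page figure is an overestimate and the result is only an order-of-magnitude guide.

```php
<?php
// Rough estimate only: assumes linear growth and treats "378k" as 378 KiB.
$cacheBytes = 378 * 1024;   // siteCache size quoted above
$pages = 560;               // pages on the site so far
$targetDocs = 250000;       // the document count asked about in this thread

$bytesPerPage = $cacheBytes / $pages;              // upper bound: includes snippet/plugin code
$estimatedBytes = $bytesPerPage * $targetDocs;
$estimatedMb = $estimatedBytes / (1024 * 1024);

printf(
    "About %.0f bytes per page, so roughly %.0f MB of cache for %d documents\n",
    $bytesPerPage,
    $estimatedMb,
    $targetDocs
);
// prints: About 691 bytes per page, so roughly 165 MB of cache for 250000 documents
```

Even if the true per-document cost were a fraction of that, the file would still have to be loaded and executed by PHP on every request, which fits the earlier warning in this thread about a single monolithic cache file.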
shaun mccormick | modx foundation

#17: 11-Oct-2007, 07:57 PM

Foundation

OpenGeek
MODx Co-Founder
Posts: 6,938

Just an additional note on some of the caching improvements coming in MODx 0.9.7 to address performance and scalability across the range of content volumes a site's repository might hold:
  • More modular cache files, loaded on demand; config is separate from the document map, which is divided into contexts, scripts and content are separate, etc.
  • More control over what parts get cached and which parts don't; even how/where it gets cached.  You can even disable caching entirely; however, the database load at that point would be tremendous, unless you had...
  • Database result set caching to address database load in general, with support for memcached or your own custom cache implementations.  Individual result sets can even be cached indefinitely, for a specified number of seconds, or excluded from the cache via the new xPDO-powered API.
  • More granular ability to control caching on any kind of content element (i.e. chunk, snippet, TV, plugin, module, template, etc.) via tags or API, i.e. force TVs not to be cached.

But, that's the challenge in developing any web application with an eye for performance and scalability; it's a constant struggle to balance the two because they are, and will always be, at odds.
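
To make the result-set caching bullet above more concrete, here is a generic sketch of a cache whose entries can be kept indefinitely, kept for a set number of seconds, or bypassed. It is not the xPDO or MODx 0.9.7 API: the ResultCache class, its remember() method and the cache keys are hypothetical, and it assumes PHP 7 or later. It stores entries in a plain array, so it only lives for one request; sharing results between requests would need a backend such as memcached or files.

```php
<?php
// Sketch only: a generic time-limited result cache, NOT the xPDO or MODx API.
final class ResultCache
{
    /** @var array cache key => ['expires' => float, 'value' => mixed] */
    private $entries = [];

    /**
     * Return the cached value for $key, or call $loader and cache its result.
     * $ttl > 0: keep for that many seconds; $ttl = 0: keep indefinitely;
     * $ttl < 0: bypass the cache entirely.
     */
    public function remember(string $key, int $ttl, callable $loader)
    {
        if ($ttl < 0) {
            return $loader();
        }
        $now = microtime(true);
        if (isset($this->entries[$key])) {
            $entry = $this->entries[$key];
            // An expiry of 0.0 marks an entry that never expires.
            if ($entry['expires'] === 0.0 || $entry['expires'] > $now) {
                return $entry['value'];
            }
        }
        $value = $loader();
        $this->entries[$key] = [
            'expires' => $ttl === 0 ? 0.0 : $now + $ttl,
            'value' => $value,
        ];
        return $value;
    }
}

$cache = new ResultCache();
$queries = 0;
$loader = function () use (&$queries) {
    $queries++;                      // stands in for an expensive database query
    return ['row 1', 'row 2'];
};

$cache->remember('items:page:1', 60, $loader);   // miss: runs the loader
$cache->remember('items:page:1', 60, $loader);   // hit: served from the cache
$cache->remember('items:page:1', -1, $loader);   // bypass: always runs the loader

echo "Loader ran {$queries} times\n";   // prints: Loader ran 2 times
```

The trade-off is the usual one: longer lifetimes reduce database load but serve staler data, which is why per-result control over the lifetime matters.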

#18: 12-Oct-2007, 05:40 AM


MasterzDee
Posts: 32

I'm running an entertainment site (celebrities, movies, reviews, lyrics, etc.). The database holds 24,000 names, and each name has a lot of data (news, films, bio, pictures, movies, etc.); the lyrics section covers 30,000 artists and over 600k lyrics, and the database is 3 GB. So I was thinking it would be good to convert the site to MODx, because at the moment I do all the programming myself, which is a lot of work and time-consuming.

#19: 12-Oct-2007, 10:45 AM

Foundation

rthrash
Posts: 11,348

MODx would be good for the basic parts of the site and the more static, marketing-oriented information. I'd really recommend a custom database implementation for all those records, though. Snippets and Modules would be good for managing those items (I certainly would not make them into individual pages). For example, from an ecommerce perspective, I consider anything over 250-500 records the threshold for doing a purpose-built application/implementation inside MODx, keeping the marketing bits and "view" pages in MODx and feeding the data in from the custom development work.

#20: 21-Feb-2008, 11:32 AM

bakalek
Posts: 135

Do unpublished pages count toward the 'maximum' page limit?
Are there any tricks to reduce the cache size?
If we don't use the internal search, does that help with performance?

thanks!
MODx 0.9.6 | Apache 2.0.52 | PHP 5.2 | MySQL 5 | MAC OS 10.5.5 (intel) | FF 3 | stels dg