Migrating Tons of Articles to WordPress Without Killing the Database
Migrating a massive amount of content to WordPress seems simple at first: export your data, loop over it, and call wp_insert_post().
In this article, I’ll walk you through my journey: what didn’t work, what did, and the lessons learned.
Starting Point
We had a custom CMS with around 500,000 articles that needed to be migrated to WordPress. The challenge was to do it in a reasonable amount of time, without breaking the database or taking the site down. The part of our database responsible for articles consisted of several tables, each holding something different: metadata, the content itself (one record per paragraph), tags, labels, and so on. The goal was to move all of it into the WordPress ecosystem while keeping the data consistent from a business point of view. The first step, the preparation phase, was to figure out which tables and columns map to which WordPress counterparts. Some mappings were clear, such as our labels to WP tags or our tags to WP categories, but others were not that obvious. The hardest part was the content itself.
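To illustrate why the content was the hard part, here is a minimal sketch of turning "one record per paragraph" into a single post body. It is written in Python with an in-memory SQLite database purely so it can run on its own. The table name article_paragraphs and its columns are hypothetical, and it assumes the paragraphs are stored as plain text, which may not match a real CMS.

```python
import html
import sqlite3

def build_post_content(conn, article_id):
    """Join an article's paragraph rows, in order, into one HTML body."""
    rows = conn.execute(
        "SELECT body FROM article_paragraphs "
        "WHERE article_id = ? ORDER BY position",
        (article_id,),
    ).fetchall()
    # One <p> per source row; escape so a stray '<' or '&' cannot break the markup.
    return "\n\n".join(f"<p>{html.escape(body)}</p>" for (body,) in rows)

# Tiny in-memory stand-in for the source database (hypothetical schema).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE article_paragraphs "
    "(article_id INTEGER, position INTEGER, body TEXT)"
)
source.executemany(
    "INSERT INTO article_paragraphs VALUES (?, ?, ?)",
    [
        (1, 2, "Second paragraph."),
        (1, 1, "First & foremost."),
        (2, 1, "Other article."),
    ],
)

print(build_post_content(source, 1))
```

The explicit ORDER BY matters: without it the database is free to return paragraphs in any order.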
Ways to Migrate
When you face a problem like this, there are several options at first glance, and the internet will try to convince you that all of them are good ideas. Let's go through them one by one.
1. Manual SQL Migration
You directly map your CMS tables to WordPress tables (wp_posts, wp_postmeta, wp_terms, wp_term_taxonomy, wp_term_relationships).
Pros:
- Full control over data mapping
- Can handle complex relationships
Cons:
- High risk of breaking WordPress database integrity
- Must manually handle taxonomy, slugs, and metadata
- Not user-friendly
- Requires deep understanding of WordPress database schema
- No retry mechanism
It looks tempting and is the fastest option in theory, but in practice it's a nightmare. First, you need a deep understanding of the WordPress database schema, including everything the plugins you use depend on. The taxonomy system is complex, and it's easy to corrupt relationships. One mistake, such as a broken category, tag, or postmeta row, and you need to start over.
2. CSV/XML Import Using Plugins
Popular options:
- WP All Import – accepts CSV or XML, supports custom post types, taxonomies, and custom fields
- WordPress Importer – supports WXR XML format
Pros:
- Easier than manual SQL
- Handles meta fields and taxonomies
- Less risk of database corruption
Cons:
- Requires preparing a clean CSV/XML
- May need multiple steps for tags, labels, or images
It's a friendlier option than the previous one, and it looks appealing: no coding (at first glance). But you need to prepare a fully denormalized dataset, where one row equals one article, and that means a literal JOIN HELL.
Data can easily become corrupted, and complex meta fields don't fit nicely into columns. And we haven't touched performance yet: this approach results in really slow queries on a database this big. The file size becomes a monster, PHP may have trouble loading it, and import tools may choke, so the file has to be split, which adds more complexity.
3. Export/Import via RSS or JSON Feed
Export CMS content as RSS or JSON, then write an importer in WordPress or use existing plugins like WP All Import. Can also map custom fields and taxonomies.
Pros:
- Works if your CMS has API or export functionality
- Minimal database access required
Cons:
- Limited control over complex metadata
- Images and attachments may need extra handling
In our case, the API was not designed well enough to support such a migration. One problem is that you lose all data that is not exposed in the feed. Another is that the import is HTTP-based, so expect massive overhead. And after all that, you still need to think about mapping.
4. Use Existing CMS-to-WordPress Plugins
Some CMSs have specialized import plugins (e.g., FG plugins: FG Joomla, FG Drupal, etc.) that sometimes support generic databases with some customization.
Pros:
- Less coding
- Handles post metadata and taxonomies automatically
Cons:
- May not fit completely custom CMS
- Often requires paid version for advanced features
As our CMS was completely custom, we couldn't use this approach, but if your source is a well-known CMS, this might be a good option.
5. Custom Migration Script or Plugin
You can write a custom WordPress plugin to migrate content programmatically.
Pros:
- Fully automated migration
- Handles complex logic like merging multiple tables, custom metadata, images, etc.
- Works with a completely custom source CMS
- Can run safely within WordPress environment using its own functions
Cons:
- Requires PHP and WordPress knowledge
- Development time needed
It gives full control over the entire process. While it requires more development time, it pays off in the long run.
We used the last approach for our migration. We built a plugin with separate migration types (articles, tags, categories, etc.). We exposed a copy of our CMS database externally and had the migrator query it.
Don't process articles one by one; process them in batches of 500-1000. Because the plugin uses native WordPress functions, WordPress handles the complexity for you, and you don't need to worry about the database structure. We also made it resumable: we save each migrated post ID in our source database and fetch only articles that have not been migrated yet. And we made it parallel: we ran multiple batches at the same time, which made the migration much faster, but more on that later.
The mapping was implemented carefully, with each postmeta value and term set separately. The only problem was that this resulted in 20-30 queries against the WordPress database per migrated post. Overall, I believe it's the best approach for migrating huge datasets from a completely custom CMS.
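The batching and resume logic can be sketched roughly as follows. This is not our plugin code: it is a Python illustration with an in-memory SQLite database standing in for the source database, the articles table is hypothetical, and fake_insert_post is a placeholder for the real wp_insert_post() call.

```python
import sqlite3
from itertools import count

def fetch_batch(source, batch_size):
    """Return the next batch of articles that have no wordpress_id yet."""
    return source.execute(
        "SELECT id, title FROM articles WHERE wordpress_id IS NULL "
        "ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()

def run_migration(source, insert_post, batch_size=500):
    """Migrate in batches; can be stopped and started again at any point."""
    migrated = 0
    while True:
        batch = fetch_batch(source, batch_size)
        if not batch:
            return migrated
        for article_id, title in batch:
            wp_id = insert_post(title)  # placeholder for wp_insert_post()
            source.execute(
                "UPDATE articles SET wordpress_id = ? WHERE id = ?",
                (wp_id, article_id),
            )
            source.commit()  # record progress per post
            migrated += 1

# Hypothetical source table with 1,200 articles that are not migrated yet.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE articles "
    "(id INTEGER PRIMARY KEY, title TEXT, wordpress_id INTEGER)"
)
source.executemany(
    "INSERT INTO articles (id, title) VALUES (?, ?)",
    [(i, f"Article {i}") for i in range(1, 1201)],
)

_next_wp_id = count(1000)

def fake_insert_post(title):
    """Pretend to create a post and return its new ID."""
    return next(_next_wp_id)

first_run = run_migration(source, fake_insert_post)
second_run = run_migration(source, fake_insert_post)  # nothing left to do
print(first_run, second_run)
```

Note that a crash between creating the post and saving its ID would still leave a duplicate on the next run, which is exactly the idempotency problem discussed in the Key Migration Challenges section.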
Key Migration Challenges
At this scale, the problem is no longer “how to insert data into WordPress.”
It becomes a distributed systems problem: consistency, concurrency, and failure handling.
Idempotency
In practice this means: if a job runs twice, it should not create a second post.
In a migration scenario, this is absolutely critical. Workers can crash, jobs can be retried, and messages can be delivered more than once. Without idempotency, every retry risks corrupting your data.
If your migration is not idempotent, retries will create duplicates. And trust me, retries will happen.
Example problem:
- Worker inserts a post
- Process crashes before saving metadata
- Job is retried
- A second post is inserted → duplicate content
Now you have two posts with the same content, metadata that doesn’t match, and relationships that are just wrong.
How to fix it:
- Introduce a stable wordpress_id for every article in your source database
- On every post insert, also store the resulting wordpress_id in the source database
- Use unique constraints at the database level
- Prefer “upsert” logic over blind inserts
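A minimal sketch of the last two points, a unique constraint combined with an upsert. It uses Python and SQLite so it runs standalone, and the posts table with its source_id column is hypothetical. WordPress itself has no such column on wp_posts, so in a real migration this key would have to live somewhere like a custom mapping table, and on MySQL the equivalent statement would be INSERT ... ON DUPLICATE KEY UPDATE.

```python
import sqlite3

target = sqlite3.connect(":memory:")
target.execute(
    "CREATE TABLE posts ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " source_id INTEGER NOT NULL UNIQUE,"  # the database enforces one post per article
    " title TEXT,"
    " body TEXT)"
)

def upsert_post(conn, source_id, title, body):
    """Insert the post, or update it in place if this article already exists."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO posts (source_id, title, body) VALUES (?, ?, ?) "
            "ON CONFLICT(source_id) DO UPDATE SET "
            "title = excluded.title, body = excluded.body",
            (source_id, title, body),
        )
        row = conn.execute(
            "SELECT id FROM posts WHERE source_id = ?", (source_id,)
        ).fetchone()
        return row[0]

first = upsert_post(target, 42, "Hello", "draft body")
retry = upsert_post(target, 42, "Hello", "final body")  # simulated job retry
print(first, retry)
```

Running the same job twice touches the same row both times, so a retry repairs the post instead of duplicating it.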
Concurrency
A single WordPress process running the migration would take weeks on a database of this scale. To speed it up, you’ll naturally want multiple workers running in parallel.
This is where it started to break.
With multiple workers:
- two workers might process the same article
- multiple workers update the same taxonomy
- order of execution becomes unpredictable
Classic race condition:
- Worker A checks: “does post exist?” → no
- Worker B checks: “does post exist?” → no
- Both insert → duplicate posts
Where it gets worse:
- taxonomy assignment (terms created multiple times - and it happened to us!)
- metadata updates overwriting each other
- partial writes
How to handle it:
- Never rely on “check then insert”, it is inherently race-condition prone
- Use:
  - database-level constraints (like unique index on wordpress_id)
  - atomic operations (like upserts instead of separate read/write)
  - job-level locking (only when necessary, as it limits scalability)
- Partition work carefully:
  - by ID ranges (static sharding)
  - or via queue guarantees (each job processed exactly once logically)
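Two of these ideas can be sketched in a few lines: static sharding by ID ranges, and a get-or-create for terms that lets a unique index decide the winner instead of a "check then insert". This is a Python and SQLite illustration only. The terms table is hypothetical and is not the WordPress wp_terms schema, and id_ranges and get_or_create_term are made-up helper names.

```python
import sqlite3

def id_ranges(min_id, max_id, workers):
    """Split [min_id, max_id] into at most `workers` disjoint, contiguous ranges."""
    total = max_id - min_id + 1
    if total <= 0:
        return []
    size = -(-total // workers)  # ceiling division
    return [
        (start, min(start + size - 1, max_id))
        for start in range(min_id, max_id + 1, size)
    ]

def get_or_create_term(conn, name):
    """Atomic get-or-create: the unique index, not a prior SELECT, prevents duplicates."""
    with conn:
        # If another worker inserted the same name first, this is a no-op.
        conn.execute("INSERT OR IGNORE INTO terms (name) VALUES (?)", (name,))
        row = conn.execute(
            "SELECT id FROM terms WHERE name = ?", (name,)
        ).fetchone()
        return row[0]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE terms (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")

ranges = id_ranges(1, 500_000, 48)   # one range per worker
term_a = get_or_create_term(db, "politics")
term_b = get_or_create_term(db, "politics")  # second worker, same term
print(len(ranges), ranges[0], ranges[-1], term_a, term_b)
```

With ranges assigned up front, two workers can never pick the same article, so only the shared taxonomy tables still need the atomic treatment.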
Database Bottleneck
At some point I realized that the problem wasn’t really my code or even WordPress itself - it was the amount of pressure I was putting on the database.
And it’s worth noting: in this setup you’re not dealing with just one database, but two - the source one and the WordPress one. But in practice, the real bottleneck was always on the WordPress side.
The important detail here is how WordPress works under the hood.
I wasn’t inserting data directly into the database. I was using WordPress APIs like:
- wp_insert_post()
- update_post_meta()
- wp_set_object_terms()
Which is the “correct” way… but also a very expensive one.
In my case, inserting a single post resulted in roughly 20–30 database queries.
That doesn’t sound like much - until you add concurrency.
- 1 worker → roughly 10,000–30,000 queries per batch (500–1,000 posts at 20–30 queries each)
- 50 workers → hundreds of thousands of queries competing for the same database over the same period
That’s where things started to break down.
What actually went wrong:
- The same tables (wp_posts, wp_postmeta, wp_terms) were being hammered from multiple workers
- Query latency started increasing under load
- Deadlocks began to appear occasionally
- Adding more workers stopped making things faster - and eventually made it worse
One thing that was initially counterintuitive:
Batching didn’t reduce the number of queries at all.
Each post still triggered the same 20–30 queries, because that’s how WordPress works.
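To put numbers on this, here is a small arithmetic sketch in Python using the figures from this article (500,000 articles, 20 to 30 queries per post). The function name migration_queries is made up for the illustration.

```python
def migration_queries(posts, queries_per_post, batch_size):
    """Total queries when `posts` are migrated in batches of `batch_size`."""
    full_batches, remainder = divmod(posts, batch_size)
    per_full_batch = batch_size * queries_per_post
    return full_batches * per_full_batch + remainder * queries_per_post

POSTS = 500_000

for batch_size in (1, 500, 1000):
    low = migration_queries(POSTS, 20, batch_size)
    high = migration_queries(POSTS, 30, batch_size)
    print(f"batch size {batch_size}: {low:,} to {high:,} queries in total")
```

The total is the same for every batch size, somewhere between 10 and 15 million queries. The batch size only changes how that load is sliced up.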
Batching helped in other ways:
- it limited memory usage
- it gave me clear retry boundaries
- and most importantly - it let me control how much load I was putting on the database
What would have made a difference:
- Limiting the number of concurrent workers (this was the biggest lever)
- Caching things inside a batch (like taxonomy IDs) to avoid repeated lookups
- Being careful with anything that touches shared tables (terms, meta, etc.)
- Watching the database metrics instead of guessing (CPU, slow queries, locks)
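The caching idea from this list could look roughly like this. It is a Python sketch, not WordPress code: BatchTermCache is a hypothetical helper, and slow_lookup stands in for whatever query resolves a term name to its ID.

```python
class BatchTermCache:
    """Remembers term IDs for the lifetime of one batch."""

    def __init__(self, lookup):
        self._lookup = lookup  # function: term name -> term ID (hits the database)
        self._ids = {}
        self.misses = 0        # how many times we actually went to the database

    def get(self, name):
        if name not in self._ids:
            self.misses += 1
            self._ids[name] = self._lookup(name)
        return self._ids[name]

# Stand-in for the taxonomy tables.
TERM_TABLE = {"news": 11, "sport": 12, "tech": 13}

def slow_lookup(name):
    """Pretend this is a database query."""
    return TERM_TABLE[name]

# Tags of three posts in one batch; "news" and "tech" repeat.
batch = [["news", "tech"], ["news"], ["sport", "news", "tech"]]

cache = BatchTermCache(slow_lookup)  # create a fresh cache for every batch
term_ids = [[cache.get(name) for name in tags] for tags in batch]
print(term_ids, cache.misses)
```

Six tag assignments cost three lookups here. Throwing the cache away after each batch keeps memory bounded and avoids holding stale IDs for long.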
The real bottleneck wasn’t the insert itself - it was the number of queries generated per post combined with concurrency.
Once I accepted that, the solution wasn’t “make it faster” - it was “control the pressure on the database”.
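One way to picture "controlling the pressure" is a gate that lets only a fixed number of batches write at once, no matter how many workers exist. The sketch below uses Python threads and a semaphore, and LoadGate and fake_batch are hypothetical names. In our setup the workers were separate pods, so the real equivalent would be capping the number of pods or using a shared counter rather than an in-process semaphore.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class LoadGate:
    """Allows at most `limit` batches to write to the database at once."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0  # highest number of batches seen running together

    def run(self, batch_fn, *args):
        with self._slots:  # blocks while `limit` batches are already running
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return batch_fn(*args)
            finally:
                with self._lock:
                    self.active -= 1

def fake_batch(batch_no):
    """Stand-in for migrating one batch."""
    time.sleep(0.005)
    return batch_no

gate = LoadGate(limit=4)
with ThreadPoolExecutor(max_workers=16) as pool:  # 16 workers, but only 4 may write
    results = list(pool.map(lambda b: gate.run(fake_batch, b), range(40)))

print(len(results), gate.peak)
```

The limit becomes a single knob you can turn down when database latency climbs, without touching the migration logic.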
At this point, it becomes clear that migrating data at this scale is not just a WordPress problem - it’s a distributed systems problem.
Once you accept that, the architecture decisions become much more obvious.
Our Setup With Kubernetes
After deciding on the approach, we needed to set up our infrastructure. We ran WordPress in a Kubernetes cluster on a powerful machine and scaled it horizontally to 48 pods.
The custom migration plugin lived inside WordPress and was responsible for reading from the source database and inserting content through native WordPress APIs. This gave us safety and compatibility with the rest of the WordPress ecosystem.
The architecture itself was simple:
- source database with article data
- WordPress running in Kubernetes
- custom migration plugin inside WordPress
- multiple pods processing migration batches in parallel
- target WordPress database receiving writes
This setup gave us the control we needed, but it also made it obvious where the real limit was: not PHP, not Kubernetes, but the amount of write pressure we could safely push through WordPress into MySQL.
Why Not WP-Cron?
WP-Cron is great for simple, periodic tasks, but it's not designed for heavy, long-running operations like a data migration of this scope.
First, it’s not a real cron. It depends on incoming HTTP traffic to trigger execution, which already makes it unreliable in non-production or controlled environments. You can work around this with a system cron hitting wp-cron.php, but at that point you’re already patching around its limitations.
More importantly, WP-Cron is fundamentally not built for parallel, high-throughput workloads. Even with tools like Action Scheduler, you quickly run into limited concurrency and coordination issues. At the scale of hundreds of thousands of records, this becomes a bottleneck rather than a solution.
Then there are the typical PHP constraints. Long-running processes hit execution time limits, memory limits, or both. You end up artificially splitting work into smaller chunks, which adds complexity and still doesn’t guarantee stability.
Failure handling is another problem. If a job crashes halfway through a batch, you need to carefully track what was already processed and what wasn’t. Without strong idempotency guarantees, retries can easily create duplicates or inconsistent state. Recovery becomes messy, especially when multiple jobs overlap.
Finally, pushing large volumes of writes through WP-Cron often leads to database pressure. Since execution is not well controlled, it’s easy to end up with overlapping jobs hitting the same tables, causing lock contention and, in some cases, deadlocks.
WP-Cron works well when tasks are small, independent, and infrequent. A large-scale migration is none of those things.
How I Would Do This Next Time
I would keep the Kubernetes layer and the custom WordPress plugin, but I would also add Redis - not really as a cache layer, but as a coordination layer for the migration itself.
The main idea would be simple: WordPress would still be responsible for actually creating posts, meta fields, and taxonomy relations through its native APIs, but the work distribution would no longer live inside WordPress. Instead, I would split the migration into small batch jobs and push them into Redis Streams. Multiple worker pods running in Kubernetes would consume those jobs in parallel, fetch the required data from the source database, run the migration through the plugin, and acknowledge the job only after the batch was fully processed.
That would give me much better control over concurrency, retries, and recovery when a worker crashes. Combined with a proper mapping table, this would make the whole process much more resilient, easier to resume, and much safer to scale without creating duplicates or putting uncontrolled pressure on the database.
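The "acknowledge only after the batch is done" idea can be sketched without a Redis server. The class below is an in-memory Python stand-in, not a Redis client: InMemoryStream, work, and the pod names are hypothetical, and the comments only point at the Redis Streams commands that would play a similar role in a real setup.

```python
from collections import deque

class InMemoryStream:
    """Tiny stand-in for a Redis Streams consumer group (not a Redis client)."""

    def __init__(self):
        self._new = deque()   # jobs never delivered to any worker
        self._pending = {}    # job_id -> (worker, job): delivered, not acknowledged
        self._next_id = 0

    def add(self, job):       # similar role to XADD
        self._next_id += 1
        self._new.append((self._next_id, job))
        return self._next_id

    def read(self, worker):   # similar role to XREADGROUP
        if not self._new:
            return None
        job_id, job = self._new.popleft()
        self._pending[job_id] = (worker, job)
        return job_id, job

    def ack(self, job_id):    # similar role to XACK
        self._pending.pop(job_id, None)

    def reclaim(self, worker):  # similar role to XAUTOCLAIM (without the idle timeout)
        claimed = []
        for job_id, (_, job) in list(self._pending.items()):
            self._pending[job_id] = (worker, job)
            claimed.append((job_id, job))
        return claimed

def work(stream, worker, migrate_batch):
    """Process new jobs until none are left; acknowledge only after success."""
    while True:
        item = stream.read(worker)
        if item is None:
            return
        job_id, job = item
        migrate_batch(job)    # if this raises, the job stays pending
        stream.ack(job_id)

stream = InMemoryStream()
for id_range in [(1, 500), (501, 1000), (1001, 1500)]:
    stream.add(id_range)

done = set()  # a set, so processing a batch twice is harmless (idempotent)

def flaky_migrate(job):
    if job == (501, 1000):
        raise RuntimeError("worker crashed mid-batch")
    done.add(job)

try:
    work(stream, "pod-a", flaky_migrate)
except RuntimeError:
    pass  # pod-a dies; its current job was never acknowledged

for job_id, job in stream.reclaim("pod-b"):
    done.add(job)        # pod-b retries the unfinished batch
    stream.ack(job_id)
work(stream, "pod-b", done.add)  # then continues with the remaining jobs

print(sorted(done))
```

Because a crashed batch is simply delivered again, this only works safely when the batch itself is idempotent, which is why the mapping table matters as much as the queue.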
Lessons Learned
- WordPress is not designed for bulk writes, so expect query amplification
- WP-Cron is not a job queue, and it breaks down at scale
- Concurrency without control will hurt more than it helps
- Idempotency is not optional, because retries will happen
- The real bottleneck is almost always the database, not your code