Migrating Tons of Articles to WordPress Without Killing the Database

Migrating a massive amount of content to WordPress seems simple at first: export your data, loop over it, and call wp_insert_post().

In reality, WordPress isn’t built for this scale, and naive solutions can break everything - from PHP timeouts to database deadlocks.

In this article, I’ll walk you through my journey: what didn’t work, what did, and the lessons learned.

Starting Point

We had a custom CMS with around 500,000 articles that needed to be migrated to WordPress. The challenge was to do it without breaking the database or taking the site down, and in a reasonable amount of time. The part of our database responsible for articles consisted of several tables, each holding something different: metadata, the content itself (one record per paragraph), tags, labels, and so on. The goal was to migrate them into the WordPress ecosystem while keeping the business data consistent.

The first step, the preparation phase, was to figure out which tables and columns needed to be mapped to which WordPress counterparts. Some mappings were clear, like our labels to WP tags or our tags to WP categories, but others were not that obvious. The hardest was the content itself.

Ways to Migrate

When facing a similar problem, at first glance you have several options, and the internet will try to convince you that all of them are good ideas. Let's go through them one by one.

1. Manual SQL Migration

You directly map your CMS tables to WordPress tables (wp_posts, wp_postmeta, wp_terms, wp_term_taxonomy, wp_term_relationships).

Pros:

Full control over data mapping
Can handle complex relationships

Cons:

High risk of breaking WordPress database integrity
Must manually handle taxonomy, slugs, and metadata
Not user-friendly
Requires deep understanding of WordPress database schema
No retry mechanism

It looks tempting and is the fastest in theory, but in practice it's a nightmare. First, you need a DEEP understanding of the WordPress database schema, including everything the plugins you use depend on. The taxonomy system is complex, and it's easy to corrupt relationships. One mistake, such as a broken category, tag, or postmeta, and you need to start over.

2. CSV/XML Import Using Plugins

Popular options:

WP All Import – accepts CSV or XML, supports custom post types, taxonomies, and custom fields
WordPress Importer – supports WXR XML format

Pros:

Easier than manual SQL
Handles meta fields and taxonomies
Less risk of database corruption

Cons:

Requires preparing a clean CSV/XML
May need multiple steps for tags, labels, or images

It's a friendlier option than the previous one. It looks appealing: no coding (at first glance). But you need to prepare a fully denormalized dataset, where one row = one article, which means a literal JOIN HELL.

Data can easily become corrupted, and complex meta fields don't fit nicely into columns. And we haven't touched performance yet: building such an export results in really slow queries on a db this big. The file becomes a monster, PHP may have trouble loading it, and import tools may choke, so the file has to be split, which adds more complexity.

3. Export/Import via RSS or JSON Feed

Export CMS content as RSS or JSON, then write an importer in WordPress or use an existing plugin like WP All Import. Such an importer can also map custom fields and taxonomies.

Pros:

Works if your CMS has API or export functionality
Minimal database access required

Cons:

Limited control over complex metadata
Images and attachments may need extra handling

In our case, our API was not designed well enough to support such a migration. One problem here is that you lose all data that is not exposed in the feed. Another is that the import is HTTP-based, so expect massive overhead. And after all that, you still need to think about mapping.

4. Use Existing CMS-to-WordPress Plugins

Some CMSs have specialized import plugins (e.g., FG plugins: FG Joomla, FG Drupal, etc.) that sometimes support generic databases with some customization.

Pros:

Less coding
Handles post metadata and taxonomies automatically

Cons:

May not fit completely custom CMS
Often requires paid version for advanced features

As our CMS was completely custom, we couldn't use this approach, but if you use a well-known CMS, this might be a good option.

5. Custom Migration Script or Plugin

You can write a custom WordPress plugin to migrate content programmatically.

Pros:

Fully automated migration
Handles complex logic like merging multiple tables, custom metadata, images, etc.
Works with a completely custom source CMS
Can run safely within WordPress environment using its own functions

Cons:

Requires PHP and WordPress knowledge
Development time needed

It gives full control over the entire process. While it requires more development time, it pays off in the long run.

We used the last approach for our migration. We built a plugin with separate migration types (articles, tags, categories, etc.). We made a copy of our CMS database reachable from outside and had the migrator query it. The key rule: don't process articles one by one; process them in batches of 500-1000. Because the plugin uses native WordPress functions, WordPress handles the complexity of its own database structure for you.

We also made it resumable: we saved each migrated post ID in our source db and fetched only articles that had not been migrated yet. And we made it parallel: we ran multiple batches at the same time, which made the migration much faster, but more on that later. Mapping was implemented carefully, with each postmeta and term set separately. The only problem was that this resulted in 20-30 queries against the WordPress db per migrated post. Overall, I believe it's the best approach for migrating huge datasets from a completely custom CMS.
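The resumable batch loop can be sketched as follows. This is a minimal illustration in Python with an in-memory SQLite database, not the plugin's actual PHP code: the articles table, its columns, and the insert_post callback (a stand-in for wp_insert_post()) are all assumptions made for the example.

```python
import sqlite3

BATCH_SIZE = 500  # the article mentions 500-1000 articles per batch


def fetch_pending_batch(conn, batch_size=BATCH_SIZE):
    """Return the next batch of source articles that have no wordpress_id yet."""
    return conn.execute(
        "SELECT id, title FROM articles "
        "WHERE wordpress_id IS NULL ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()


def mark_migrated(conn, article_id, wordpress_id):
    """Record the WordPress post ID so the article is skipped on the next run."""
    conn.execute(
        "UPDATE articles SET wordpress_id = ? WHERE id = ?",
        (wordpress_id, article_id),
    )


def migrate_all(conn, insert_post, batch_size=BATCH_SIZE):
    """Loop over pending batches until nothing is left; return the migrated count."""
    migrated = 0
    while True:
        batch = fetch_pending_batch(conn, batch_size)
        if not batch:
            return migrated
        for article_id, title in batch:
            wordpress_id = insert_post(title)  # stand-in for wp_insert_post()
            mark_migrated(conn, article_id, wordpress_id)
            migrated += 1
        conn.commit()  # one commit per batch gives a clear retry boundary


def demo():
    # Hypothetical, tiny source table: the real schema is not shown in the article.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles "
        "(id INTEGER PRIMARY KEY, title TEXT, wordpress_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO articles (id, title) VALUES (?, ?)",
        [(i, f"Article {i}") for i in range(1, 8)],
    )
    next_id = iter(range(1000, 2000))
    first = migrate_all(conn, lambda title: next(next_id), batch_size=3)
    second = migrate_all(conn, lambda title: next(next_id), batch_size=3)
    return first, second
```

Calling demo() migrates all seven articles on the first run and nothing on the second, because the second run only sees rows whose wordpress_id is still empty. Note that a crash between the insert and the marker update can still leave a post without a marker.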

Key Migration Challenges

At this scale, the problem is no longer “how to insert data into WordPress.”
It becomes a distributed systems problem: consistency, concurrency, and failure handling.

Idempotency

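An operation is idempotent when running it more than once leaves the same result as running it once. In practice this means: if a job runs twice, it should not create a second post.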

In a migration scenario, this is absolutely critical. Workers can crash, jobs can be retried, and messages can be delivered more than once. Without idempotency, every retry risks corrupting your data.

If your migration is not idempotent, retries will create duplicates. And trust me, retries will happen.

Example problem:

Worker inserts a post
Process crashes before saving metadata
Job is retried
A second post is inserted → duplicate content

Now you have two posts with the same content, metadata that doesn’t match, and relationships that are just wrong.

How to fix it:

Introduce a stable wordpress_id for every article in your source db
On every post insert, also set wordpress_id in the source db
Use unique constraints at the database level
Prefer “upsert” logic over blind inserts
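To make the unique constraint and upsert points concrete, here is a minimal sketch in Python with SQLite. The posts table and its source_id column are hypothetical stand-ins, not the real wp_posts schema; in a real setup the same idea would more likely live in a separate mapping table keyed by the source article ID.

```python
import sqlite3


def open_target():
    """Create an in-memory stand-in for the target side (not the real wp_posts schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " source_id INTEGER NOT NULL UNIQUE,"  # the unique constraint blocks duplicates
        " title TEXT NOT NULL)"
    )
    return conn


def upsert_post(conn, source_id, title):
    """Insert the post, or update it in place if this source_id was already migrated."""
    # ON CONFLICT ... DO UPDATE needs SQLite 3.24 or newer.
    conn.execute(
        "INSERT INTO posts (source_id, title) VALUES (?, ?) "
        "ON CONFLICT(source_id) DO UPDATE SET title = excluded.title",
        (source_id, title),
    )
    conn.commit()
    row = conn.execute(
        "SELECT id FROM posts WHERE source_id = ?", (source_id,)
    ).fetchone()
    return row[0]


def demo():
    conn = open_target()
    first = upsert_post(conn, 42, "Hello")
    retry = upsert_post(conn, 42, "Hello")  # simulated retry of the same job
    count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    return first == retry, count
```

In demo(), the retried job gets back the same post ID and the table still holds a single row. The database, not the application code, is what guarantees that a retry cannot create a second post.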

Concurrency

A single WordPress process running the migration would take weeks on a database of this scale. To speed up the migration, you’ll naturally want multiple workers running in parallel.

This is where it started to break.

With multiple workers:

two workers might process the same article
multiple workers update the same taxonomy
order of execution becomes unpredictable

Classic race condition:

Worker A checks: “does post exist?” → no
Worker B checks: “does post exist?” → no
Both insert → duplicate posts

Where it gets worse:

taxonomy assignment (terms created multiple times - and it happened to us!)
metadata updates overwriting each other
partial writes

How to handle it:

Never rely on “check then insert”, it is inherently race-condition prone
Use:
- database-level constraints (like unique index on wordpress_id)
- atomic operations (like upserts instead of separate read/write)
- job-level locking (only when necessary, as it limits scalability)
Partition work carefully:
- by ID ranges (static sharding)
- or via queue guarantees (each job processed exactly once logically)
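Static sharding by ID ranges can be sketched like this. It is an illustration only, assuming the source articles have roughly contiguous integer IDs; the function name and the example numbers are hypothetical, and large gaps in the IDs would make the shards uneven.

```python
def id_ranges(min_id, max_id, workers):
    """Split the inclusive range [min_id, max_id] into at most `workers`
    contiguous, non-overlapping (start, end) ranges of near-equal size."""
    total = max_id - min_id + 1
    if total <= 0 or workers <= 0:
        return []
    size, extra = divmod(total, workers)
    ranges = []
    start = min_id
    for i in range(workers):
        # The first `extra` shards take one additional ID each.
        length = size + (1 if i < extra else 0)
        if length == 0:
            break  # more workers than IDs: stop handing out empty shards
        ranges.append((start, start + length - 1))
        start += length
    return ranges
```

For example, id_ranges(1, 10, 3) returns [(1, 4), (5, 7), (8, 10)]. Each worker then only ever touches articles inside its own range, so two workers cannot pick up the same article. Shared data such as taxonomy terms still needs separate protection.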

Concurrency must be designed, not added as an afterthought.

Database Bottleneck

At some point I realized that the problem wasn’t really my code or even WordPress itself - it was the amount of pressure I was putting on the database.

And it’s worth noting: in this setup you’re not dealing with just one database, but two - the source one and the WordPress one. But in practice, the real bottleneck was always on the WordPress side.

The important detail here is how WordPress works under the hood.

I wasn’t inserting data directly into the database. I was using WordPress APIs like:

wp_insert_post()
update_post_meta()
wp_set_object_terms()

Which is the “correct” way… but also a very expensive one.

In my case, inserting a single post resulted in roughly 20–30 database queries.

That doesn’t sound like much - until you add concurrency.

1 worker → ~1,000+ queries per batch
50 workers → tens of thousands of queries hitting the database at the same time

That’s where things started to break down.

What actually went wrong:

The same tables (wp_posts, wp_postmeta, wp_terms) were being hammered from multiple workers
Query latency started increasing under load
Deadlocks began to appear occasionally
Adding more workers stopped making things faster - and eventually made it worse

One thing that was initially counterintuitive:

Batching didn’t reduce the number of queries at all.

Each post still triggered the same 20–30 queries, because that’s how WordPress works.
Batching helped in other ways:

it limited memory usage
it gave me clear retry boundaries
and most importantly - it let me control how much load I was putting on the database

What would have made a difference:

Limiting the number of concurrent workers (this was the biggest lever)
Caching things inside a batch (like taxonomy IDs) to avoid repeated lookups
Being careful with anything that touches shared tables (terms, meta, etc.)
Watching the database metrics instead of guessing (CPU, slow queries, locks)
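Caching taxonomy IDs inside a batch could look roughly like the sketch below. The TermCache class and the lookup_or_create callback are hypothetical; the callback stands in for whatever expensive call resolves or creates a term in the database.

```python
class TermCache:
    """Remember term IDs for the duration of one batch to avoid repeated lookups."""

    def __init__(self, lookup_or_create):
        # lookup_or_create(taxonomy, name) is a stand-in for the expensive
        # call that actually hits the database.
        self._lookup_or_create = lookup_or_create
        self._ids = {}
        self.misses = 0  # how many times the database had to be asked

    def get(self, taxonomy, name):
        key = (taxonomy, name)
        if key not in self._ids:
            self.misses += 1
            self._ids[key] = self._lookup_or_create(taxonomy, name)
        return self._ids[key]


def demo():
    fake_db = {}

    def lookup_or_create(taxonomy, name):
        # Fake database: hand out sequential IDs for unseen terms.
        return fake_db.setdefault((taxonomy, name), len(fake_db) + 1)

    cache = TermCache(lookup_or_create)
    batch = [["News", "Sport"], ["News", "Politics"], ["Sport", "News"]]
    ids = [[cache.get("category", name) for name in names] for names in batch]
    return ids, cache.misses
```

In demo(), six term references across three articles cause only three database lookups. Creating a fresh cache per batch keeps memory bounded and avoids holding on to stale IDs for long.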

The real bottleneck wasn’t the insert itself - it was the number of queries generated per post combined with concurrency.

Once I accepted that, the solution wasn’t “make it faster” - it was “control the pressure on the database”.
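A hard cap on concurrency is the simplest way to control that pressure. The sketch below uses Python threads as a stand-in for worker pods; MAX_WORKERS and process_batch are hypothetical, and the right cap is something to find by watching the database metrics rather than a value taken from the migration itself.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # hypothetical cap; tune it against real database metrics


def run_batches(batches, process_batch, max_workers=MAX_WORKERS):
    """Process every batch, never running more than max_workers at once.

    Returns the per-batch results (in input order) and the peak number of
    batches that were in flight at the same time.
    """
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def guarded(batch):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        try:
            return process_batch(batch)
        finally:
            with lock:
                state["active"] -= 1

    # The pool size is the cap: extra batches wait until a worker is free.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(guarded, batches))
    return results, state["peak"]
```

However many batches are queued, the peak value never exceeds max_workers, so the load on the database stays bounded and can be raised step by step while observing latency and lock waits.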

At this point, it becomes clear that migrating data at this scale is not just a WordPress problem - it’s a distributed systems problem.

Once you accept that, the architecture decisions become much more obvious.

Our Setup With Kubernetes

After deciding on the approach, we needed to set up our infrastructure. We ran WordPress in a Kubernetes cluster on a powerful machine and scaled it horizontally to 48 pods.

[Diagram: Source (source database, custom CMS, 500k articles) → Processing (Kubernetes cluster, Pod 1 … Pod N, WordPress in each pod) → Target (target database, WordPress MySQL, receives writes)]

The custom migration plugin lived inside WordPress and was responsible for reading from the source database and inserting content through native WordPress APIs. This gave us safety and compatibility with the rest of the WordPress ecosystem.

The architecture itself was simple:

source database with article data
WordPress running in Kubernetes
custom migration plugin inside WordPress
multiple pods processing migration batches in parallel
target WordPress database receiving writes

This setup gave us the control we needed, but it also made it obvious where the real limit was: not PHP, not Kubernetes, but the amount of write pressure we could safely push through WordPress into MySQL.

Why Not WP-Cron?

WP-Cron is great for simple, periodic tasks, but it's not designed for heavy, long-running operations like a data migration of this scope.

First, it’s not a real cron. It depends on incoming HTTP traffic to trigger execution, which already makes it unreliable in non-production or controlled environments. You can work around this with a system cron hitting wp-cron.php, but at that point you’re already patching around its limitations.

More importantly, WP-Cron is fundamentally not built for parallel, high-throughput workloads. Even with tools like Action Scheduler, you quickly run into limited concurrency and coordination issues. At the scale of hundreds of thousands of records, this becomes a bottleneck rather than a solution.

Then there are the typical PHP constraints. Long-running processes hit execution time limits, memory limits, or both. You end up artificially splitting work into smaller chunks, which adds complexity and still doesn’t guarantee stability.

Failure handling is another problem. If a job crashes halfway through a batch, you need to carefully track what was already processed and what wasn’t. Without strong idempotency guarantees, retries can easily create duplicates or inconsistent state. Recovery becomes messy, especially when multiple jobs overlap.

Finally, pushing large volumes of writes through WP-Cron often leads to database pressure. Since execution is not well controlled, it’s easy to end up with overlapping jobs hitting the same tables, causing lock contention and, in some cases, deadlocks.

WP-Cron works well when tasks are small, independent, and infrequent. A large-scale migration is none of those things.

How I Would Do This Next Time

I would keep the Kubernetes layer and the custom WordPress plugin, but I would also add Redis - not really as a cache layer, but as a coordination layer for the migration itself.

The main idea would be simple: WordPress would still be responsible for actually creating posts, meta fields, and taxonomy relations through its native APIs, but the work distribution would no longer live inside WordPress. Instead, I would split the migration into small batch jobs and push them into Redis Streams. Multiple worker pods running in Kubernetes would consume those jobs in parallel, fetch the required data from the source database, run the migration through the plugin, and acknowledge the job only after the batch was fully processed.

That would give me much better control over concurrency, retries, and recovery when a worker crashes. Combined with a proper mapping table, this would make the whole process much more resilient, easier to resume, and much safer to scale without creating duplicates or putting uncontrolled pressure on the database.
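The acknowledge-after-processing idea can be illustrated without a Redis server. The sketch below is an in-memory stand-in written in Python, not Redis code: its add, read, ack and reclaim_pending methods only mimic what the Redis commands XADD, XREADGROUP, XACK and XAUTOCLAIM provide, and the job shape (from_id, to_id) is an assumption made for the example.

```python
from collections import deque


class InMemoryStream:
    """Tiny stand-in for a stream with a consumer group:
    a delivered job stays pending until it is acknowledged."""

    def __init__(self):
        self._queue = deque()
        self._pending = {}
        self._next_id = 1

    def add(self, job):
        job_id = self._next_id
        self._next_id += 1
        self._queue.append((job_id, job))
        return job_id

    def read(self):
        """Hand out the next job and remember it as delivered but not acked."""
        if not self._queue:
            return None
        job_id, job = self._queue.popleft()
        self._pending[job_id] = job
        return job_id, job

    def ack(self, job_id):
        self._pending.pop(job_id, None)

    def reclaim_pending(self):
        """Put unacknowledged jobs back in the queue, e.g. after a worker crash."""
        for job_id, job in sorted(self._pending.items()):
            self._queue.append((job_id, job))
        self._pending.clear()


def run_worker(stream, migrate_batch):
    """Consume jobs; acknowledge each one only after migrate_batch succeeded."""
    done = []
    while True:
        item = stream.read()
        if item is None:
            return done
        job_id, job = item
        try:
            migrate_batch(job)
        except Exception:
            return done  # simulated crash: the job is left pending, not acked
        stream.ack(job_id)
        done.append(job_id)


def demo():
    stream = InMemoryStream()
    for start in (1, 501, 1001):
        stream.add({"from_id": start, "to_id": start + 499})

    crashed_once = []

    def flaky(job):
        # Simulate a crash on the first attempt at the second batch.
        if job["from_id"] == 501 and not crashed_once:
            crashed_once.append(True)
            raise RuntimeError("worker crashed")

    first_run = run_worker(stream, flaky)
    stream.reclaim_pending()
    second_run = run_worker(stream, flaky)
    return first_run, second_run
```

In demo(), the first worker finishes job 1 and crashes on job 2; after reclaiming, a second worker processes job 3 and then the retried job 2. Because a retried job replays the whole batch, this only stays safe if the batch migration itself is idempotent.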

Lessons Learned

WordPress is not designed for bulk writes, so expect query amplification
WP-Cron is not a job queue; it breaks down at scale
Concurrency without control will hurt more than it helps
Idempotency is not optional, because retries will happen
The real bottleneck is almost always the database, not your code