About Me

Freelance software developer / architect, mobile geek. Topics: software technologies (Java, Groovy, agile), politics, web, science, green, mobile, Android.
Homepage: www.tutaona.com

"The absence of limitations is the enemy of art." - Orson Welles

Monday, September 3, 2012

Migrate Data from MySQL to MongoDB

A Grails project needed to migrate from MySQL to MongoDB. The data included persons with an address, but the new data model should leave room for more than one address per person. Moreover, the addresses carry geo coordinates, so the lat/lon values must be moved into a nested structure, as MongoDB requires if you want to query a geospatial index.

As there is no reasonable tool for that, I ended up writing the migration code myself. I also think that there can't be a good tool for this, because the paradigm shift from schema-full to schema-less is too big. IMHO it's worth refining your data structures, because with MongoDB it is sometimes recommendable to have related data embedded instead of linked.

Anyway, I tried to migrate the person table with 400,000+ rows. I decided on embedded addresses here, as they are usually not shared among several persons. The data should look like this:

{
    "_id" : ObjectId("5044ab202b5e1ae2ffcd44a1"),
    "firstname" : "John",
    "lastname" : "Doe",
    "address" : [
        {
            "city" : "Großerlach",
            "street" : "Sulzbacher Str. 50",
            "zip" : "71577",
            "loc" : {
                "lon" : 9.57669,
                "lat" : 49.03721
            }
        }
    ]
}


My first stupid attempt was to solve this completely with GORM. I installed the Grails MongoDB plugin and tried something like this:

// load every person via GORM and save a copy to MongoDB through the plugin
Person.list().each {
    it.mongo.save()
}

GORM seemed to be horribly slow when handling huge datasets at once. After some research I learned that I should do it in chunks and flush the Hibernate session in between chunks. I tried a chunk size of 1000 rows.

def sessionFactory

def copyToMongo() {
    int chunk = 1000
    long offset = 0
    long max = Person.count()
    while (offset < max) {
        // load one chunk of persons via GORM and save each one to MongoDB
        Person.list(max: chunk, offset: offset).each {
            it.mongo.save()
        }
        // flush pending Hibernate work before moving on to the next chunk
        sessionFactory.currentSession.flush()
        offset += chunk
    }
}
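
Note that flushing only writes pending changes to the database; it does not evict the objects that have already been loaded into the Hibernate session, so the first-level cache keeps growing. The standard Hibernate batch-processing recipe therefore also clears the session after each flush. A minimal sketch of that variant, using the same injected sessionFactory:

def session = sessionFactory.currentSession
session.flush()   // write pending changes to the database
session.clear()   // evict loaded objects so the first-level cache stays small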

But this chunked approach was again very slow. After inserting about 60,000 rows the performance started to degrade, and after 180,000 rows I ran out of heap memory. Finally I found out that it's best not to use GORM and Hibernate at all when handling a massive amount of data. No object mapping, just raw data, and it worked quite well. My solution uses bare Groovy SQL and the low-level MongoDB API with its bulk insert feature:

import groovy.sql.Sql

def mongo
def dataSource

def copyToMongo() {
    int chunk = 1000
    long offset = 0

    def db = mongo.getDB("test")
    db.person.drop()

    Sql sql = new Sql(dataSource)

    long max = Person.count()
    while (offset < max) {
        def bulk = []
        // read one chunk of raw rows straight from MySQL
        sql.eachRow("SELECT * FROM person LIMIT $chunk OFFSET $offset", {
            bulk << [
                firstname: it.firstname,
                lastname: it.lastname,
                address: [[
                    city: it.city, street: it.street, zip: it.zip,
                    loc: [lon: it.longitude, lat: it.latitude]
                ]]
            ]
        })
        // bulk insert the whole chunk into MongoDB at once
        db.person.insert(bulk)
        offset += chunk
    }
    db.person.ensureIndex(['address.loc': '2d'])
    render "Done. "
    render "MySQL: $max "
    render "Mongo: ${db.person.count()}"
}
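
With the 2d index on address.loc in place, the migrated data can be queried geospatially. A minimal sketch of such a query, assuming the same mongo bean and map-based syntax as above; the coordinates and the $maxDistance value (given in degrees, as legacy 2d indexes expect) are just example values:

def db = mongo.getDB("test")

// find up to 10 persons whose address is near the given point;
// a legacy 2d index takes [lon, lat] pairs and $maxDistance in degrees
def nearQuery = ['address.loc': ['$near': [9.57669, 49.03721], '$maxDistance': 0.1]]
db.person.find(nearQuery).limit(10).each {
    println "${it.firstname} ${it.lastname} - ${it.address[0].city}"
}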
