Search Index Jobs

When you handle updates and upgrades of the Dataverse application or roll out custom metadata schemas (blocks), you will need to take care of your Solr-based search index.

Inplace Re-Indexing

There are two main reasons why you might need to rebuild your search index:

  1. Sometimes when upgrading to a new Dataverse version, the Solr configuration has been changed upstream. In these cases, the release notes will advise you to do an “inplace reindex”.
  2. You changed your metadata schema: renamed fields, changed field types, etc. A data migration is not possible for the index; instead, it needs to be rebuilt.

For your convenience, a batch job has been added that contains all the actions mentioned in the docs. Simply deploy it during off-hours (or fork it and create a CronJob):

kubectl create -f k8s/dataverse/jobs/inplace-reindex.yaml
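The deployment step above can be followed to completion from the command line. This is a sketch: the Job name `inplace-reindex` is an assumption — check `metadata.name` in the manifest for your version, and the guard makes the snippet a no-op without a cluster.

```shell
# Manifest shipped with this project (see command above).
MANIFEST="k8s/dataverse/jobs/inplace-reindex.yaml"

if command -v kubectl >/dev/null 2>&1; then
    # Deploy the Job...
    kubectl create -f "${MANIFEST}"
    # ...block until it succeeds (or the timeout is hit)...
    kubectl wait --for=condition=complete job/inplace-reindex --timeout=30m
    # ...and review its output to verify all index steps ran.
    kubectl logs job/inplace-reindex
fi
```

The `kubectl wait` step is useful when the reindex is part of a larger maintenance script that must not continue before the index is rebuilt.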

Hint

Beware: this type of re-index does not guarantee a clean index. See the upstream index guide.

Update Solr schema with custom metadata fields

The Solr container ships with a default index configuration supporting the upstream metadata schemas. This configuration resides in ${COLLECTION_DIR}/conf (see also the important directories of the image).

Dataverse provides an API endpoint to retrieve a Solr schema configuration that fits the metadata schemas present in your Dataverse installation. We use a forked version of the upstream script at $SCRIPT_DIR/schema/update.sh to generate an updated configuration and reload Solr.
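You can inspect what that endpoint returns yourself. A minimal sketch, assuming the Dataverse service is reachable as `dataverse:8080` inside the cluster (adjust the URL to your deployment):

```shell
# In-cluster service URL of Dataverse -- an assumption, override as needed.
DATAVERSE_URL="${DATAVERSE_URL:-http://dataverse:8080}"

# List the Solr schema fields required by the currently loaded metadata
# blocks; this is the data update.sh builds the configuration from.
curl -s "${DATAVERSE_URL}/api/admin/index/solr/schema" \
  || echo "Dataverse not reachable at ${DATAVERSE_URL}"
```

Note that this is an unblocked-by-default admin API endpoint; if you have restricted the admin API, run the request from a location that is allowed to reach it.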

Important

Most likely you will need to do an inplace re-index after deploying new schemas: many, if not all, schema changes also require a rebuild of your index.

… gracefully when starting Solr

As the Solr index configuration is not persisted but loaded from Dataverse, we need to ask Dataverse for it when Solr starts. This is done via an init container.

This happens gracefully, with a fallback to the default upstream metadata schemas. Unless you change those, the worst case is losing searchability of custom metadata when the configuration is not available during startup.

@startuml
!includeurl "https://raw.githubusercontent.com/michiel/plantuml-kubernetes-sprites/master/resource/k8s-sprites-unlabeled-25pct.iuml"
hide footbox

participant "<color:#royalblue><$job></color>\nMetadata Update Job" as MDJ
box "Solr Pod"
  participant "<color:#royalblue><$pod></color>\nSchema Init" as SI
  participant "<color:#royalblue><$pod></color>\nSchema Sidecar" as SS
  participant "<color:#royalblue><$pod></color>\nSolr" as Solr
end box
participant "<color:#royalblue><$pod></color>\nDataverse" as DV

== Startup ==

activate SI
SI -> SI : Call //update.sh//
activate SI
SI -> DV ++ : Request metadata fields
DV --> SI -- : Send fields
SI -> SI : Write Solr configuration to ///schema//
SI --> SI : Trigger //RELOAD// (will fail on purpose)
deactivate SI

SI --> SI : Fail gracefully
destroy SI
create SS
SI --> SS : //init done//
create Solr
SI --> Solr : //init done//

@enduml
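The graceful-fallback behaviour shown above can be sketched as a small shell routine. This is illustrative only: the endpoint path matches the admin API mentioned earlier, but the file name and the exact logic of the forked update.sh are assumptions.

```shell
# /schema is the emptyDir volume shared with the Solr container;
# both defaults can be overridden for local experimentation.
SCHEMA_DIR="${SCHEMA_DIR:-/schema}"
DATAVERSE_URL="${DATAVERSE_URL:-http://dataverse:8080}"

update_schema() {
    # -f: treat HTTP errors as failures; -s: no progress output.
    if curl -sf "${DATAVERSE_URL}/api/admin/index/solr/schema" \
            -o "${SCHEMA_DIR}/schema.json"; then
        echo "updated"
    else
        # Fail gracefully: leave the default upstream configuration
        # untouched; only custom metadata fields lose searchability
        # until the next successful reload.
        echo "fallback"
    fi
}

update_schema
```

The key point is that a failed request never blocks Solr from starting — the init container reports "init done" either way, exactly as in the diagram.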

Hint

To understand the above, please keep in mind that init, sidecar and main Solr container share /schema via emptyDir volume.

… when updating metadata schemas

Schema updates at runtime are handled by a sidecar container of the Solr Pod, triggered via a webhook. This webhook is fired for you by the metadata update Job once metadata blocks have been uploaded.
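Firing the webhook manually looks roughly like the following. The hook id `solr-schema-update` and the service name `solr` are assumptions; port 9000 and the `/hooks/<id>` URL layout are the adnanh/webhook defaults.

```shell
# In-cluster service name of the Solr Pod -- an assumption.
SOLR_HOST="${SOLR_HOST:-solr}"
# Hypothetical hook id; check the sidecar's hook definition for the real one.
HOOK_ID="solr-schema-update"
WEBHOOK_URL="http://${SOLR_HOST}:9000/hooks/${HOOK_ID}"

# The HTTP response carries the status code and script output of update.sh.
curl -s "${WEBHOOK_URL}" || echo "webhook not reachable: ${WEBHOOK_URL}"
```

This is the same request the metadata update Job sends, so manual invocation is mainly useful for debugging a failed schema reload.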

@startuml
!includeurl "https://raw.githubusercontent.com/michiel/plantuml-kubernetes-sprites/master/resource/k8s-sprites-unlabeled-25pct.iuml"
hide footbox

participant "<color:#royalblue><$job></color>\nMetadata Update Job" as MDJ
box "Solr Pod"
  participant "<color:#royalblue><$pod></color>\nSchema Sidecar" as SS
  participant "<color:#royalblue><$pod></color>\nSolr" as Solr
end box
participant "<color:#royalblue><$pod></color>\nDataverse" as DV

MDJ -> SS : Fire webhook
activate SS

SS -> SS : Check request,\nTranslate parameters,\nCall //update.sh//
activate SS

SS -> DV ++ : Request metadata fields
DV --> SS -- : Send fields

SS -> SS : Write Solr configuration to ///schema//
SS -> Solr : Trigger //RELOAD//
activate Solr

Solr -> Solr : Restart core,\nLoad configuration\nfrom ///schema// now
Solr --> SS
deactivate Solr

SS --> SS
deactivate SS

SS --> MDJ : Send status code and script output (to be logged)
deactivate SS
@enduml

Hint

To understand the above, please keep in mind that init, sidecar and main Solr container share /schema via emptyDir volume.

See also

Webhooks are implemented using https://github.com/adnanh/webhook and are extensible if necessary.
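For orientation, an adnanh/webhook hook definition has roughly the shape below. This is a hypothetical sketch, not the configuration shipped with the image — the hook id and script path are assumptions.

```shell
# Write a minimal hook definition: fire "/hooks/solr-schema-update" to run
# the (assumed) update script and return its output in the HTTP response.
cat > hooks.json <<'EOF'
[
  {
    "id": "solr-schema-update",
    "execute-command": "/scripts/update.sh",
    "include-command-output-in-response": true
  }
]
EOF

# The sidecar would then serve this definition, e.g.:
#   webhook -hooks hooks.json -port 9000
```

Additional hooks (for example, one that triggers an inplace reindex) can be appended to the same file if you need to extend the sidecar.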